MIT AI Conference 2022
Updated: Aug 7, 2022
Prof Antonio Torralba
AI research has long looked for a way to learn from data without labels, just as humans do. So far, self-supervised learning has been used for context prediction, colorization, audio prediction, solving puzzles, and more. Self-supervised systems create a pretext task for themselves, and solving that task drives the learning, so the system doesn’t require any training labels. Another way is to learn visual representations.
Self-supervised methods generally involve a pretext task that is solved to learn a good representation and a loss function to learn with.
The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings.
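As a rough illustration of "similar pairs close, dissimilar pairs far apart", here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. The talk did not name a specific loss at this point, so the choice of InfoNCE, the function names, the temperature value, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy embeddings (made up): "near" points the same way as the anchor, "far" does not.
anchor = [1.0, 0.0]
near = [0.9, 0.1]
far = [-1.0, 0.2]

loss_good = info_nce(anchor, near, [far])  # positive is the similar sample
loss_bad = info_nce(anchor, far, [near])   # positive is the dissimilar sample
print(loss_good, loss_bad)
```

The loss is small when the positive sits close to the anchor and the negative sits far away, and large when the roles are swapped, which is the pressure that shapes the embedding space.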
How do you learn to see if there is no data?
ImageNet-100 classification task: we take 105K samples and train an AlexNet-based encoder with the alignment and uniformity loss, and we get 50% performance with no training data. The model learns to classify ImageNet-100 on its own.
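The loss mentioned above reads like the alignment and uniformity objective for contrastive representations; that reading is an assumption, since the notes do not define it. Under that assumption, a minimal sketch on unit-length 2-D embeddings looks like this: alignment measures how close the two views of the same sample are, and uniformity measures how evenly the embeddings spread over the unit circle. The exponents `alpha=2` and `t=2` are common defaults, not values given in the talk.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_loss(xs, ys, alpha=2):
    """Mean distance**alpha between paired embeddings (lower = better aligned)."""
    return sum(sq_dist(x, y) ** (alpha / 2) for x, y in zip(xs, ys)) / len(xs)

def uniform_loss(xs, t=2):
    """Log of the mean Gaussian potential over all pairs (lower = more uniform)."""
    pairs = list(combinations(xs, 2))
    return math.log(sum(math.exp(-t * sq_dist(u, v)) for u, v in pairs) / len(pairs))

def on_circle(degrees):
    """Unit vectors at the given angles, standing in for normalized embeddings."""
    return [[math.cos(math.radians(d)), math.sin(math.radians(d))] for d in degrees]

spread = on_circle([0, 90, 180, 270])    # evenly spread embeddings
clustered = on_circle([0, 5, 10, 15])    # collapsed into one region
shifted = on_circle([2, 92, 182, 272])   # slightly perturbed "views" of spread

print(align_loss(spread, shifted), uniform_loss(spread), uniform_loss(clustered))
```

An encoder trained this way is pushed to keep the two views of a sample together while avoiding the trivial solution of mapping everything to the same point.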
What if we don’t want to use computer-rendered images or real images?
Biological systems learn about the world even before they open their eyes. Similarly, AI can be trained to anticipate motions. Using the dead leaves model, with uniformly sampled positions and colors, a system can learn to recognize natural images from nothing more than basic color grids containing geometric figures. GAN models already work very well with real-world and synthetic images. Putting random weights into a StyleGAN yields natural-looking images. The OpenGL-based Shaders21k collection can also generate random natural-looking images.
The future may be training with no data at all, but starting that journey from very little training data is a good first step for this kind of research.
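To make the dead leaves idea concrete, here is a minimal sketch that paints randomly placed, randomly colored discs onto a small grid, with later discs occluding earlier ones. It is a simplification under stated assumptions: positions and colors are sampled uniformly as described above, but the radius is also sampled uniformly, which is a choice made here for brevity and not a detail from the talk. The function name and default sizes are hypothetical.

```python
import random

def dead_leaves(size=32, n_shapes=50, max_radius=8, seed=0):
    """Return a size x size grid of RGB tuples covered by overlapping random discs."""
    rng = random.Random(seed)
    image = [[(0, 0, 0) for _ in range(size)] for _ in range(size)]
    for _ in range(n_shapes):
        cx, cy = rng.uniform(0, size), rng.uniform(0, size)  # uniform position
        r = rng.uniform(1, max_radius)                        # assumed radius law
        color = tuple(rng.randrange(256) for _ in range(3))   # uniform color
        for y in range(max(0, int(cy - r)), min(size, int(cy + r) + 1)):
            for x in range(max(0, int(cx - r)), min(size, int(cx + r) + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                    image[y][x] = color  # later discs occlude earlier ones
    return image

img = dead_leaves()
print(len(img), len(img[0]), img[16][16])
```

Images like this contain edges, occlusions, and regions of constant color, which is why they can serve as label-free training data despite containing no real objects.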
Prof Michael Carbin
Usual pruning method (Lottery Ticket Hypothesis):
Train the network. Apply your standard training regime to train the network up to the accuracy that you expect for the application at hand.
Remove superfluous structure, which can be weights, neurons, channels, filters, etc. There are pruning techniques that eliminate essentially all of these structures and more, and they try to determine which structures can actually be removed.
Fine-tune the network. When you remove structure, the accuracy of the network typically degrades, so you do a little more training on the original data set to recover the accuracy that was lost. It is typically possible to restore a significant amount, if not all, of that accuracy.
Keep alternating between pruning and training, over and over again, to get a high-quality compact model.
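The prune-and-fine-tune loop above can be sketched on a flat list of weights. This is a minimal illustration, not the speaker's implementation: it prunes individual weights by magnitude (one of several structures mentioned), and the `fine_tune` argument is a hypothetical placeholder that would retrain the surviving weights in a real pipeline.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of nonzero weights, smallest magnitude first."""
    alive = [i for i, w in enumerate(weights) if w != 0.0]
    n_prune = int(len(alive) * fraction)
    drop = set(sorted(alive, key=lambda i: abs(weights[i]))[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def iterative_prune(weights, fraction, rounds, fine_tune=lambda w: w):
    """Alternate pruning and fine-tuning for a fixed number of rounds."""
    for _ in range(rounds):
        weights = magnitude_prune(weights, fraction)
        weights = fine_tune(weights)  # placeholder: retrain the surviving weights
    return weights

# Made-up weights for illustration.
weights = [0.05, -1.2, 0.3, 0.9, -0.01, 2.0, -0.4, 0.7, 0.15, -0.6]
pruned = iterative_prune(weights, fraction=0.2, rounds=3)
print(pruned)
```

Pruning a fixed fraction of the survivors each round removes fewer weights per round as the model shrinks, which is why the process is repeated many times before the model becomes very sparse.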
If we can prune models after training, can we actually train smaller models from the start?
What's fundamentally different about these two models?
When pruning after training, we should think about the capacity to represent some function that has already been learned. When pruning during training, or even before training, we consider the capacity for learning. The hypothesis is that weights pruned after training could have been pruned before training: it is possible to find a subnetwork of the originally initialized network that achieves essentially the same accuracy.
We use iterative magnitude pruning: we randomly initialize the full network, train it, and prune the superfluous structure. We find some substructure at the end of training via traditional pruning, then take that substructure and port it back to time zero, before any training had started. We find that such subnetworks can train from initialization to full accuracy at non-trivial sparsities, so this technique can find significantly smaller networks. By non-trivial, we mean sparsities beyond the residual overcapacity that these networks often have, which is sometimes called overparameterization.
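A toy version of this procedure, train, prune by magnitude, then reset the survivors to their time-zero values, can be sketched on a linear model. Everything here is an assumption made for illustration: the 8-weight linear regressor, the synthetic data in which only features 0 and 3 matter, the learning rate, and the 50% pruning rate per round. It shows the mechanics only and says nothing about results on real networks.

```python
import random

def train(init_weights, mask, data, lr=0.1, steps=500):
    """Full-batch gradient descent on mean squared error; masked weights stay at zero."""
    w = [wi * mi for wi, mi in zip(init_weights, mask)]
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / n
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

def mse(w, data):
    """Mean squared error of the linear model on the data."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data) / len(data)

def prune_smallest(w, mask, fraction):
    """Remove the given fraction of surviving weights with the smallest magnitude."""
    alive = [j for j, m in enumerate(mask) if m]
    drop = set(sorted(alive, key=lambda j: abs(w[j]))[:int(len(alive) * fraction)])
    return [0 if j in drop else m for j, m in enumerate(mask)]

rng = random.Random(0)
data = []
for _ in range(64):
    x = [rng.uniform(-1, 1) for _ in range(8)]
    data.append((x, 3 * x[0] - 2 * x[3]))  # only features 0 and 3 matter

init = [rng.uniform(-0.5, 0.5) for _ in range(8)]  # the "time zero" weights
mask = [1] * 8
for _ in range(2):  # 8 -> 4 -> 2 surviving weights
    trained = train(init, mask, data)  # every round restarts from the same init
    mask = prune_smallest(trained, mask, fraction=0.5)

ticket = train(init, mask, data)  # sparse subnetwork trained from its original init
print(mask, mse(ticket, data))
```

The key difference from ordinary prune-and-fine-tune is the reset: each round, and the final sparse model, starts again from the original initialization instead of continuing from the trained weights.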
Prof Dina Katabi
Ambient intelligence - It is the convergence of radio signals with machine learning, particularly neural networks.
Can we design systems that use machine learning and neural networks to analyze the radio signals that bounce off people's bodies, in order to understand how they are moving, their respiration, and their vital signs, and maybe even to diagnose diseases before doctors can? A wireless device in the adjacent office tries to see a person through the wall, track his motion, and localize him at any point in time, without the person wearing any sensors or wearables, based only on how his movement changes electromagnetic waves.
When you go to sleep, your brain waves change and enter different stages: awake, light sleep, deep sleep, and REM. REM is rapid eye movement, the stage in which you dream. These sleep stages are very important for sleep disorders, and we know that one in every three Americans has problems sleeping. But not just that. Sleep is a platform for almost all diseases. In depression, for example, REM tends to happen too early in the sleep. In Alzheimer's, the slow waves in the brain during deep sleep are correlated with the disease. So if we can understand these sleep stages, we understand not only sleep but also a lot about mood disorders and neurological diseases. Their device transmits a very low-power wireless signal, 1,000 times lower power than Wi-Fi, analyzes the reflection with machine learning, and generates a hypnogram. For every 30 seconds of sleep, it tells you whether the person is in REM, light sleep, deep sleep, or awake.
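Once a hypnogram exists as one stage label per 30-second epoch, simple summaries follow directly. The sketch below computes minutes per stage and REM latency (time from falling asleep to the first REM epoch), which relates to the remark about REM happening too early. The stage sequence is invented for illustration and is not output from the device; the function name is hypothetical.

```python
from collections import Counter

EPOCH_SECONDS = 30  # one stage label per 30 seconds, as in the hypnogram described above

def summarize(hypnogram):
    """Return (minutes per stage, REM latency in minutes or None)."""
    counts = Counter(hypnogram)
    minutes = {stage: n * EPOCH_SECONDS / 60 for stage, n in counts.items()}
    sleep_onset = next((i for i, s in enumerate(hypnogram) if s != "awake"), None)
    first_rem = next((i for i, s in enumerate(hypnogram) if s == "REM"), None)
    if sleep_onset is None or first_rem is None:
        return minutes, None
    return minutes, (first_rem - sleep_onset) * EPOCH_SECONDS / 60

# Made-up night fragment: 2 min awake, then light, deep, light, REM.
hypnogram = ["awake"] * 4 + ["light"] * 20 + ["deep"] * 30 + ["light"] * 10 + ["REM"] * 16
minutes, rem_latency = summarize(hypnogram)
print(minutes, rem_latency)
```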
Prof Jesus A. Del Alamo
More computation requires more power.
Cerebras Wafer Scale Engine 2: an AI accelerator built as one chip per silicon wafer, which makes it really huge in size.
Chips and accelerators are all fabricated using modern CMOS (Complementary Metal Oxide Semiconductor) technology. How could we increase the performance per watt ratio by orders of magnitude, which is what is going to be required to do justice to all the great work that is being done in algorithmic development and in data sets? We are limited by the memory bottleneck. The silicon technology used to implement processors, where you do the computation, is actually very different from the one used to implement memory, where you hold the data. It is very difficult to have one silicon chip that contains high-performance logic and, at the same time, high-density, inexpensive, fast, energy-efficient memory. There is always a certain amount of latency in the time it takes the data to flow from memory to the logic chip and back again, and driving these lines also consumes quite a bit of energy.
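One common way to reason about the memory bottleneck is a roofline-style bound: achievable throughput is limited either by peak compute or by how fast memory can feed data. The talk did not present this model, so it is offered here only as an illustrative sketch, and the peak and bandwidth figures are hypothetical round numbers that do not describe any real chip.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline-style bound: the lower of the compute limit and the memory limit."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

PEAK = 100e12     # hypothetical peak compute: 100 TFLOP/s
BANDWIDTH = 1e12  # hypothetical memory bandwidth: 1 TB/s

# A workload doing 10 operations per byte fetched is held back by memory.
low = attainable_flops(PEAK, BANDWIDTH, 10)
# A workload doing 200 operations per byte fetched can reach peak compute.
high = attainable_flops(PEAK, BANDWIDTH, 200)
print(low, high)
```

Under these made-up numbers, the low-intensity workload uses only a tenth of the available compute, which is the kind of gap that faster, closer memory is meant to shrink.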
Heterogeneous Integration refers to the integration of separately manufactured components into a higher level assembly (SiP) that, in the aggregate, provides enhanced functionality and improved operating characteristics.
Using a packaging process, we can put two chips together so that they sit right on top of each other, as the graphic on the right indicates, connected by a very high density of very short vias that move data at very high throughput while burning very little energy as the data goes back and forth between memory and compute. One approach to doing this is heterogeneous integration of chiplets, which can go beyond just AI engines. This really applies to all kinds of complex systems.
Prof Song Han (OMNIML)
AI model sizes are huge. How do we deploy them on small, low-memory edge devices? Can we optimize models and compress neural nets to better fit specific hardware? Pruning: redundant connections can be removed without affecting accuracy. But rather than compressing a large and bulky model, can we design small and compact models to begin with? We propose automatic design, using hardware-aware AutoML techniques to synthesize neural networks automatically, given latency, accuracy, and memory constraints. This is like an EDA tool for neural network design. We train a once-for-all network that is sparsely activated, and then we can select a subnetwork from it without any retraining to get the accuracy and the latency, which is much cheaper. We repeat this process of sampling from the once-for-all network and select the best fit, so that we can drastically reduce the design cost, to something similar to training a neural network only once.
Using a new hardware-aware neural architecture search, we can reduce the latency by 4.9 times, with even better accuracy. We can deploy the model on a phone in real time to recognize people's gestures and understand their activity. We don't want to send the data to the cloud, in order to better preserve the user's privacy; however, edge devices have limited memory. OmniML uses a tiny transfer learning technique that can reduce the training memory from 300 megabytes to only 16 megabytes.
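The selection step described above, sampling candidate subnetworks and keeping the best one that fits a hardware budget, can be sketched as a simple constrained random search. This is not the OmniML tooling: the configuration fields, the latency formula, and the accuracy formula are all hypothetical stand-ins for a measured latency table and a trained accuracy predictor, and no actual once-for-all network is involved.

```python
import random

def predicted_latency_ms(cfg):
    """Hypothetical stand-in for a per-device latency lookup."""
    return 0.5 * cfg["depth"] * cfg["width"] * cfg["kernel"] ** 2 / 9

def predicted_accuracy(cfg):
    """Hypothetical stand-in for an accuracy predictor."""
    return 0.5 + 0.04 * cfg["depth"] + 0.05 * cfg["width"] + 0.01 * cfg["kernel"]

def search(latency_budget_ms, n_samples=200, seed=0):
    """Sample subnetwork configs and keep the most accurate one within the budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        cfg = {
            "depth": rng.choice([2, 3, 4]),
            "width": rng.choice([1, 2, 3]),
            "kernel": rng.choice([3, 5, 7]),
        }
        if predicted_latency_ms(cfg) > latency_budget_ms:
            continue  # violates the hardware constraint
        if best is None or predicted_accuracy(cfg) > predicted_accuracy(best):
            best = cfg
    return best

loose = search(latency_budget_ms=5.0)
tight = search(latency_budget_ms=2.0)
print(loose, tight)
```

Because each candidate is scored by predictors and no candidate is trained, evaluating hundreds of configurations is cheap, which is the point of selecting subnetworks without retraining.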