# Deep Learning Session ICML 2022

Updated: Aug 6

#### How Momentum improves generalization in deep learning

Deep learning builds artificial neural networks that learn representations from data. This talk asks why training with momentum improves generalization. Models are usually trained with data augmentation and batch normalization, so the first step was to remove data augmentation and check whether momentum still improved the results. The effect depends on the batch size: the larger the batch size, the greater the improvement in accuracy.

The main observation is that the benefit of momentum does not depend on the noise in the data but on the structure of the data and of the learning problem. Accuracy is also higher on large-margin data than on small-margin data, and the historical gradients accumulated by momentum help the model learn the small-margin data.
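As a rough illustration of what "historical gradients" means here, the sketch below implements a plain heavy-ball momentum update on a toy one-dimensional quadratic. The function name `sgd_momentum`, the toy loss, and the hyperparameter values are assumptions chosen for illustration; they are not taken from the talk.

```python
def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=300):
    """Heavy-ball momentum: the velocity is a decaying sum of past gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # accumulate historical gradients
        w = w - lr * v          # step along the accumulated direction
    return w


# Toy loss f(w) = (w - 3)^2 (an assumption for this sketch), gradient 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_final = sgd_momentum(toy_grad, w0=0.0)
print(w_final)  # close to the minimizer w = 3
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two updates on the same problem.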

#### What can linear interpolation on neural network loss landscapes indicate?

The loss landscape of a neural network is very high-dimensional and hard to understand, so a cheap way to probe it is valuable. Linear interpolation is a simple and lightweight method for probing neural network loss landscapes: it evaluates the loss along the straight line between the initial and the final model parameters. The main observation had been the absence of barriers along this linear path, which was read as a sign that the task was easy: optimization could in effect go straight to the solution without encountering obstacles.

For modern neural network architectures, the interpolation instead shows a loss plateau: the error remains at random chance until the path is near the optimum.
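The interpolation probe itself is easy to write down. The following is a minimal sketch on a toy two-parameter loss; `interpolate`, `toy_loss`, and the parameter values are hypothetical stand-ins for a real model and its training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def toy_loss(theta):
    # Hypothetical loss: a quadratic bowl with its minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


theta_init = [0.0, 0.0]    # stands in for the parameters at initialization
theta_final = [1.0, -2.0]  # stands in for the parameters after training

alphas = [i / 10 for i in range(11)]
path_losses = [toy_loss(interpolate(theta_init, theta_final, a)) for a in alphas]

# No barrier along the path means the loss never increases as alpha grows.
no_barrier = all(x >= y for x, y in zip(path_losses, path_losses[1:]))
print(path_losses, no_barrier)
```

On a real network, `toy_loss` would be replaced by an evaluation of the training or test loss at the interpolated weights; a plateau would show up as a long flat stretch in `path_losses`.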

#### Deep equilibrium networks are sensitive to initialization statistics

A deep equilibrium network (DEQ) applies the same layer, with the same weights, over and over until the output stops changing, that is, until it reaches an equilibrium. The input is fed back in at every iteration. Because the parameters are reused, the result behaves like a dynamical system. The iteration stays stable while the scale of the weights is below one; once it exceeds one, stability is lost. For higher-dimensional weight matrices the equilibria are very sensitive to this, which becomes visible when the results are plotted.

The theory compares the stability of orthogonal and Gaussian initializations. Orthogonal initialization outperforms random initialization and allows a larger initial weight scale to be used for better performance.
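The stability threshold at one can be seen in a stripped-down example. The sketch below iterates a linear layer `z <- W z + x` where `W` is a 2-D rotation (an orthogonal matrix) multiplied by a scalar `scale`; real DEQ layers also include a nonlinearity, so this is only an assumption-laden toy, and all names in it are made up for illustration.

```python
import math


def layer(scale, x, z, angle=0.7):
    """One application of z -> W z + x with W = scale * rotation(angle)."""
    c, s = scale * math.cos(angle), scale * math.sin(angle)
    return [c * z[0] - s * z[1] + x[0], s * z[0] + c * z[1] + x[1]]


def iterate_layer(scale, x, iters=500):
    """Reuse the same weights at every step, re-injecting the input x each time."""
    z = [0.0, 0.0]
    for _ in range(iters):
        z = layer(scale, x, z)
    return z


x = [1.0, -0.5]

z_stable = iterate_layer(0.9, x)  # weight scale below one
fz = layer(0.9, x, z_stable)
stable_residual = math.hypot(fz[0] - z_stable[0], fz[1] - z_stable[1])

z_unstable = iterate_layer(1.1, x)  # weight scale above one
unstable_norm = math.hypot(z_unstable[0], z_unstable[1])

print(stable_residual, unstable_norm)
```

With `scale=0.9` the iterate settles at a fixed point (the residual is essentially zero), while with `scale=1.1` its norm keeps growing.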

#### Scaling up Diverse Orthogonal Convolutional Network by Paraunitary framework.

Orthogonal convolutional networks are attractive because of their adversarial robustness. They are also known for gradient stability, since the optimization is well conditioned. The challenge is to show that the matrix is flat and to maintain the orthogonality constraint during training. The solution is to parameterize the convolutional layer as an orthogonal filter bank. A standard convolutional layer implements a transfer matrix, and the required properties are imposed on that matrix to obtain the solution.

The main aim is to make the layer orthogonal, which requires its transfer matrix to be paraunitary. To maintain the orthogonality constraint, a skew-symmetric matrix is passed through the matrix exponential, which yields an orthogonal matrix. The resulting design achieves better accuracy than earlier approaches.
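The role of the skew-symmetric matrix and the matrix exponential can be checked numerically: for any square `P`, the matrix `A = P - P^T` is skew-symmetric and `exp(A)` is orthogonal. The sketch below verifies this for one 3x3 example using a truncated Taylor series. It only illustrates this single building block under that assumption, not the paraunitary parameterization of a full convolutional layer, and the helper names are made up.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm(A, terms=30):
    """Matrix exponential via a truncated Taylor series (fine for small matrices)."""
    result, term = identity(len(A)), identity(len(A))
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, A)]  # A^k / k!
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


# Unconstrained parameters P (arbitrary values chosen for this sketch).
P = [[0.0, 0.3, -0.5], [0.1, 0.0, 0.8], [0.4, -0.2, 0.0]]
Pt = transpose(P)
A = [[P[i][j] - Pt[i][j] for j in range(3)] for i in range(3)]  # skew-symmetric

Q = expm(A)
QtQ = matmul(transpose(Q), Q)
max_err = max(abs(QtQ[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_err)  # Q^T Q matches the identity up to rounding error
```

The practical appeal of this kind of parameterization is that `P` can be updated freely by a standard optimizer while `Q` stays orthogonal by construction.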

#### Stability-based Generalization Bound for Exponential Family Langevin Dynamics

Generalization bounds for noisy stochastic algorithms, especially stochastic gradient Langevin dynamics (SGLD), have advanced considerably. This work aims to unify and substantially extend stability-based generalization bounds. First, the generalization error is bounded in terms of expected stability, which can yield quantitatively sharper bounds.

Second, and as the main contribution, the work introduces exponential family Langevin dynamics (EFLD), a substantial generalization of SGLD that includes quantized SGD and noisy versions of sign-SGD as special cases. It develops data-dependent, expected-stability-based generalization bounds for any EFLD algorithm. These bounds depend on the sample size and on the gradient discrepancy rather than the gradient norm, which makes them much sharper. Third, it establishes optimization guarantees for special cases of EFLD. Finally, empirical results on benchmarks show that the bounds are non-vacuous, quantitatively sharper than existing bounds, and behave appropriately.
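For orientation, the two textbook update rules that this summary mentions can be written as follows. The notation is an assumption made for this note, not the paper's own: $\theta_t$ are the parameters, $\eta_t$ the step size, $\hat{L}_{B_t}$ the loss on mini-batch $B_t$, and $\beta$ an inverse temperature.

```latex
% SGLD: a mini-batch gradient step plus isotropic Gaussian noise
\theta_{t+1} = \theta_t - \eta_t \,\nabla \hat{L}_{B_t}(\theta_t)
             + \sqrt{\tfrac{2\eta_t}{\beta}}\;\xi_t,
\qquad \xi_t \sim \mathcal{N}(0, I)

% sign-SGD: only the sign of each gradient coordinate is used
\theta_{t+1} = \theta_t - \eta_t \,\operatorname{sign}\!\big(\nabla \hat{L}_{B_t}(\theta_t)\big)
```

The exact form of the EFLD update, and how it covers the noisy sign-SGD and quantized SGD cases, is given in the paper and is not reproduced here.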

#### Local Augmentation for graph neural networks

Graph Neural Networks (GNNs) have shown remarkable performance on graph-based tasks. The main idea of GNNs is to learn useful representations by aggregating information from local neighbourhoods. Whether the neighbourhood information is sufficient for learning representations of nodes with few neighbours is still an open question.

The proposed solution is a simple and effective data augmentation technique called local augmentation. It learns the distribution of the neighbours' node representations conditioned on the central node's representation, and uses the generated features to improve the expressiveness of GNNs. As a general framework, local augmentation applies to any GNN model. In each training iteration of the backbone model, it samples the feature vectors associated with each node from the learned conditional distribution. Extensive experiments and analyses show that local augmentation consistently improves performance for different GNN architectures across a set of benchmarks. For instance, plugging local augmentation into GCN and GAT improves test accuracy on Cora, Citeseer, and Pubmed by an average of 3.4 per cent and 1.6 per cent, respectively. Experimental results on large graphs (OGB) also show that the model consistently outperforms the backbones.

#### Non-Local convergence analysis on deep linear networks

This paper studies the non-local convergence properties of deep linear networks. Specifically, under the quadratic loss, it considers deep linear networks in which at least one layer contains just one neuron. Under gradient flow, it describes the convergence point of trajectories that start from an arbitrary balanced initialization, including the paths that converge to one of the saddle points. It also gives precise rates at which the various types of trajectory converge, in stages, to the global minimizer. According to the findings, the rates range from polynomial to linear. In contrast to the lazy training regime that has dominated the literature on deep neural networks, these results are the first to provide a non-local analysis of such networks with arbitrary balanced initialization.
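A tiny example makes the two kinds of trajectory concrete. The sketch below runs gradient descent with a small step, used here as a crude stand-in for gradient flow, on the scalar two-layer linear network `x -> w2 * w1 * x` with quadratic loss. The target value, step size, and starting points are assumptions for illustration. Both starting points are balanced in the sense that `w1**2 == w2**2`.

```python
def train(w1, w2, target=2.0, lr=0.01, steps=5000):
    """Gradient descent on 0.5 * (w1 * w2 - target)**2."""
    for _ in range(steps):
        err = w1 * w2 - target
        # Simultaneous update: both gradients use the old values of w1 and w2.
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
    return w1, w2


# Balanced start with equal signs: reaches a global minimizer (w1 * w2 = target).
min_w1, min_w2 = train(0.5, 0.5)

# Balanced start with opposite signs: slides into the saddle point at the origin.
sad_w1, sad_w2 = train(0.5, -0.5)

print(min_w1 * min_w2, (sad_w1, sad_w2))
```

The second run stays on the line `w1 = -w2`, along which the only stationary point is the saddle at zero, so it never reaches the global minimizer.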

#### Adaptive Inertia – Disentangling the effects of adaptive learning rate and momentum

Adaptive Moment Estimation (Adam), which combines an adaptive learning rate with momentum, is the most popular stochastic optimizer for accelerating deep neural network training. However, research shows that Adam often generalizes worse than Stochastic Gradient Descent (SGD). This study aims to explain that behaviour within the framework of diffusion theory. The goal is to separate the effects of the adaptive learning rate and of momentum in the Adam dynamics on saddle-point escape and flat minima selection. It demonstrates that the adaptive learning rate escapes saddle points efficiently but cannot select flat minima the way SGD does.

Momentum, on the other hand, has almost no effect on the selection of flat minima, but it provides a drift effect that helps training pass through saddle points. This helps to explain why Adam converges quickly but generalizes worse, whereas SGD with momentum generalizes better. Motivated by this analysis, the authors design a novel adaptive optimization framework called Adaptive Inertia, which uses parameter-wise adaptive inertia to speed up training and provably favours flat minima as well as SGD does. Extensive experiments show that the proposed adaptive inertia approach can generalize better than SGD and conventional adaptive gradient methods across a wide range of settings.
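To make the two ingredients of Adam explicit, here is a minimal scalar sketch of the standard Adam update, with the momentum part and the adaptive learning rate part marked in comments. The toy loss and hyperparameter values are assumptions for illustration. The Adaptive Inertia method itself is not reproduced here.

```python
import math


def adam(grad, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Standard Adam update for a single scalar parameter."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Adaptive learning rate: the step is rescaled by 1 / sqrt(v_hat).
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss f(w) = (w - 3)^2 (an assumption for this sketch).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_adam = adam(toy_grad, w0=0.0)
print(w_adam)  # close to the minimizer w = 3
```

Dropping the division by `sqrt(v_hat)` leaves a momentum-only update, and setting `beta1=0` leaves an adaptive-learning-rate-only update, which is the kind of separation the talk analyses.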

#### Diversified adversarial attacks based on Conjugate Gradient Method

Deep learning models are vulnerable to adversarial examples, and attacks that generate such examples have been an active research area. Existing methods based on steepest descent achieve high attack success rates, but they occasionally perform poorly on ill-conditioned problems. The conjugate gradient (CG) method is known to be effective for this kind of problem. To overcome the limitation, the authors propose the Auto Conjugate Gradient (ACG) attack, a novel attack strategy based on the CG method. In large-scale evaluation experiments on the most recent robust models, ACG found more adversarial examples with fewer iterations than the current SOTA method Auto-PGD (APGD) for a majority of those models.

To investigate the difference in search performance between ACG and APGD in terms of diversification and intensification, the authors define a measure called the Diversity Index (DI) that quantifies the degree of diversity. The analysis using this index shows that the more diverse search of the proposed method remarkably improves its attack success rate.
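For readers who have not met the conjugate gradient method, the sketch below shows its classical form for solving a symmetric positive definite linear system `A x = b`. This is the textbook algorithm, not the ACG attack: ACG adapts CG-style search directions to a constrained attack objective, and those details are not reproduced here. The matrix and right-hand side are arbitrary example values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-20):
    """Classical CG for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)  # first search direction is the steepest descent direction
    rs = dot(r, r)
    for _ in range(len(b)):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        # New direction: current residual plus a multiple of the previous direction.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)  # approximately [1/11, 7/11]
```

The key difference from steepest descent is the last update of `p`: each new direction keeps a component of the previous one, which is what helps on ill-conditioned problems.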

#### Optimization landscape of neural collapse under MSE loss – Global Optimality with Unconstrained features

When deep neural networks are trained for classification tasks, an intriguing empirical phenomenon occurs in the last-layer classifiers and features. The class means all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and the cross-example within-class variability of the last-layer activations collapses to zero. This phenomenon is called Neural Collapse (NC), and it appears to occur regardless of the loss function selected. This study justifies Neural Collapse under the mean squared error (MSE) loss, for which recent empirical evidence shows performance on par with, or better than, the de facto cross-entropy loss.

This is the first global landscape analysis for the vanilla nonconvex MSE loss under a simplified unconstrained feature model. It demonstrates that the (only!) global minimizers are neural collapse solutions, whereas all other critical points are strict saddles whose Hessians exhibit negative curvature directions. By investigating the optimization landscape around the NC solutions, the study further justifies the use of a rescaled MSE loss, showing how the landscape can be improved by adjusting the rescaling hyperparameters. Finally, the theoretical results are validated empirically on real network designs.
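The Simplex ETF geometry is concrete enough to construct directly. One standard construction for `K` classes takes the rows of `sqrt(K / (K - 1)) * (I - (1/K) * ones)`: every row has unit norm, and every pair of distinct rows has cosine `-1 / (K - 1)`. The sketch below builds it and checks both properties; the function names are made up for illustration.

```python
import math


def simplex_etf(K):
    """Rows are the K vertices of a Simplex ETF embedded in R^K."""
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


K = 4
M = simplex_etf(K)
norms = [math.sqrt(dot(row, row)) for row in M]
cosines = [dot(M[i], M[j]) for i in range(K) for j in range(K) if i != j]
print(norms, cosines)  # unit norms, and all pairwise cosines equal to -1/(K-1)
```

Equal norms and equal, maximally negative pairwise angles are what "collapsing to the vertices of a Simplex ETF" refers to for the class means.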

#### Equivalence between the temporal and static Equivariant graph representations

This work studies the associational task of predicting how node attributes evolve in temporal graphs, from the viewpoint of learning equivariant representations. It shows that node representations in temporal graphs fall into two frameworks. The first, and most common, is labelled time-and-graph: equivariant graph and sequence models are combined to describe the temporal dynamics of node attributes in the observed graphs. The second is labelled time-then-graph: the sequences that characterise the node and edge dynamics are represented first and then passed to a static equivariant graph representation.

Surprisingly, time-then-graph representations are shown to have an expressiveness advantage over time-and-graph representations when both use component GNNs that are not the most expressive. Furthermore, although obtaining state-of-the-art results is not necessarily the goal, the experiments show that time-then-graph methods can achieve better performance and efficiency than state-of-the-art time-and-graph methods in some real-world tasks, demonstrating that the time-then-graph framework is a valuable addition to the graph ML toolbox.

#### Robust Training under label noise by over parameterization

Modern machine learning performance has recently been dominated by over-parameterized networks, which have more and more network parameters relative to the number of training samples. However, over-parameterized networks frequently overfit and fail to generalize when the training data is corrupted. This paper proposes a principled approach to the robust training of over-parameterized networks in classification tasks where a proportion of the training labels are corrupted. The main idea is fairly simple: label noise is sparse and incoherent with the network learned from clean data.

As a result, the authors model the noise and train the network to separate it from the data. They represent the label noise explicitly through another sparse over-parameterization term and exploit implicit algorithmic regularization to recover the underlying corruptions. Surprisingly, training with such a simple approach gives state-of-the-art test accuracy against label noise on a range of real datasets. Moreover, theory on simplified linear models supports the experimental findings, showing that exact separation between sparse noise and low-rank data can be achieved under incoherence conditions. The study opens a wide range of intriguing directions for improving over-parameterized models through sparse over-parameterization and implicit regularisation.

#### Implicit bias of the step size in linear diagonal neural networks

Focusing on diagonal linear networks as a model for understanding implicit bias in underdetermined models, this work shows that the gradient descent step size can have a large qualitative effect on the implicit bias and therefore on generalization ability. In particular, using a large step size for non-centred data can change the implicit bias from a kernel-type behaviour to a sparsity-inducing regime, even when gradient flow, as studied in prior work, would not leave the kernel regime.

The analysis relies on dynamical stability: it shows that convergence to dynamically stable global minima entails a bound on some weighted norm of the linear predictor, which corresponds to the so-called rich regime. In a sparse regression setting, this is shown to lead to good generalization.
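The link between step size and stability can be seen in the simplest possible setting. For gradient descent on a one-dimensional quadratic with curvature `a`, the update is `w <- (1 - step_size * a) * w`, so a minimum is stable only if `a < 2 / step_size`. The sketch below checks this numerically. It illustrates the general notion of dynamical stability under these toy assumptions, not the paper's analysis of diagonal networks.

```python
def gd_distance_from_minimum(curvature, step_size, w0=1e-3, steps=200):
    """Run gradient descent on f(w) = 0.5 * curvature * w**2, starting near the minimum at 0."""
    w = w0
    for _ in range(steps):
        w -= step_size * curvature * w  # gradient of f is curvature * w
    return abs(w)


step_size = 0.5  # stability threshold on the curvature is 2 / 0.5 = 4

flat_minimum = gd_distance_from_minimum(curvature=3.0, step_size=step_size)
sharp_minimum = gd_distance_from_minimum(curvature=5.0, step_size=step_size)

print(flat_minimum, sharp_minimum)  # the first shrinks towards 0, the second blows up
```

A larger step size lowers the threshold `2 / step_size`, so fewer minima remain stable; this is the mechanism by which the step size can constrain which solutions gradient descent is able to converge to.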

#### Extended unconstrained features model for exploring the deep neural collapse

The current way of training deep networks for classification keeps optimizing the weights after the training error has vanished, in order to push the training loss further toward zero. In this training procedure, a phenomenon known as Neural Collapse (NC) has recently been observed empirically. In particular, it has been shown that the learned features of within-class samples (the output of the penultimate layer) converge to their mean, and that the means of the different classes exhibit a particular tight frame structure that is also aligned with the weights of the final layer. Recent studies have shown that minimizers with this structure appear when a simplified unconstrained features model (UFM) is optimized with a regularised cross-entropy loss.

This work further analyses and extends the UFM. First, the analysis of the UFM for the regularised MSE loss shows that the features of the minimizers can have a more delicate structure than in the cross-entropy case, which affects the structure of the weights as well. Then, the UFM is extended with an additional layer of weights and a ReLU nonlinearity, and the earlier results are generalized to this setting. Finally, the experimental results show that the nonlinear extended UFM is useful for modelling the NC phenomena that occur in real networks.

#### Score-guided intermediate layer optimisation – Fast Langevin mixing for inverse problems

The main result establishes fast mixing and characterizes the stationary distribution of the Langevin Algorithm for inverting random weighted DNN generators. This extends the work of Hand and Voroninski from efficient inversion to efficient posterior sampling.

In practice, to allow for greater expressive power, posterior sampling is carried out in the latent space of a pre-trained generative model. To that end, a score-based model is trained in the latent space of StyleGAN-2 and applied to inverse problems. The resulting system, Score-Guided Intermediate Layer Optimization (SGILO), builds on earlier work by replacing the earlier regularisation with a generative prior in the intermediate layer. In experiments, it significantly outperformed the previous state of the art, particularly in the low measurement regime.
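The Langevin Algorithm referred to here follows the score (the gradient of the log-density) and adds Gaussian noise at each step. The sketch below runs it on a one-dimensional Gaussian target whose score is known in closed form. This is a toy assumption for illustration: in SGILO the score comes from a trained model in a generator's latent space, which is not reproduced here, and the target, step size, and sample counts are arbitrary choices.

```python
import math
import random


def langevin_samples(score, x0, step=0.1, n_samples=50000, burn_in=1000, seed=0):
    """Unadjusted Langevin algorithm: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x, samples = x0, []
    for i in range(burn_in + n_samples):
        x = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if i >= burn_in:
            samples.append(x)
    return samples


MU = 2.0  # mean of the toy target N(MU, 1)


def gaussian_score(x):
    return -(x - MU)  # d/dx of log N(x; MU, 1)


samples = langevin_samples(gaussian_score, x0=0.0)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

Because the step size is finite and no correction step is used, the chain's stationary variance is slightly larger than the target's; for this target it is `1 / (1 - step / 2)`.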

### References

How Momentum improves generalization in deep learning - https://icml.cc/virtual/2022/spotlight/16112

What can linear interpolation on neural network loss landscapes indicate? - https://icml.cc/virtual/2022/spotlight/16392

Deep equilibrium networks are sensitive to initialization statistics - https://icml.cc/virtual/2022/spotlight/16466

Scaling up Diverse Orthogonal Convolutional Network by Paraunitary framework. - https://icml.cc/virtual/2022/spotlight/16576

Stability-based Generalization Bound for Exponential Family Langevin Dynamics - https://icml.cc/virtual/2022/spotlight/16622

Local Augmentation for graph neural networks - https://icml.cc/virtual/2022/spotlight/16800

Non-Local convergence analysis on deep linear networks - https://icml.cc/virtual/2022/spotlight/16812

Adaptive Inertia – Disentangling the effects of adaptive learning rate and momentum - https://icml.cc/virtual/2022/oral/17066

Diversified adversarial attacks based on Conjugate Gradient Method - https://icml.cc/virtual/2022/spotlight/16956

Optimization landscape of neural collapse under MSE loss – Global Optimality with Unconstrained features - https://icml.cc/virtual/2022/spotlight/16984

Equivalence between the temporal and static Equivariant graph representations - https://icml.cc/virtual/2022/spotlight/16898

Robust Training under label noise by over parameterization - https://icml.cc/virtual/2022/spotlight/17128

Implicit bias of the step size in linear diagonal neural networks - https://icml.cc/virtual/2022/spotlight/17798

Extended unconstrained features model for exploring the deep neural collapse - https://icml.cc/virtual/2022/spotlight/18158

Score-guided intermediate layer optimisation – Fast Langevin mixing for inverse problems - https://icml.cc/virtual/2022/spotlight/18402