# ICML 2022 Highlights

I had the pleasure of attending the ICML conference held in Baltimore. Here are a few highlights from the sessions I attended.

#### Learning Mixtures of Linear Dynamical Systems

This paper studies the problem of learning a mixture of several linear dynamical systems (LDSs) from short, unlabeled sample trajectories, each generated by one of the LDS models. Although mixture models for time-series data have a wide range of applications, there is little research on learning algorithms that come with end-to-end performance guarantees. The technical challenges have several sources, including the presence of latent variables (the unknown labels of the trajectories), the possibility that sample trajectories are much shorter than the dimension of the LDS models, and the complicated temporal dependence inherent in time-series data.

To address these challenges, the authors develop a two-stage meta-algorithm that is guaranteed to efficiently recover each ground-truth LDS model up to an error that depends on the total sample size. Numerical experiments support the theoretical results and confirm the effectiveness of the proposed approach.
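
To make the problem setting concrete, the sketch below simulates the kind of data described above: many short trajectories, each generated by one of several LDS models, with the labels hidden from the learner. It only illustrates the data model, not the two-stage algorithm. The fully observed state update `x[t+1] = A x[t] + w[t]`, the Gaussian noise, the zero initial state, and the names `sample_mixture_trajectories`, `A_list`, and `noise_std` are all assumptions made for this example.

```python
import numpy as np

def sample_mixture_trajectories(A_list, num_traj, length, noise_std=0.1, seed=0):
    """Draw short trajectories, each from one randomly chosen LDS x[t+1] = A x[t] + w[t]."""
    rng = np.random.default_rng(seed)
    d = A_list[0].shape[0]
    # Latent labels: which LDS produced each trajectory (unknown to the learner).
    labels = rng.integers(len(A_list), size=num_traj)
    trajs = np.zeros((num_traj, length + 1, d))  # every trajectory starts at the origin
    for i, k in enumerate(labels):
        for t in range(length):
            w = noise_std * rng.standard_normal(d)
            trajs[i, t + 1] = A_list[k] @ trajs[i, t] + w
    return trajs, labels

d = 8
A_list = [0.5 * np.eye(d), -0.5 * np.eye(d)]  # two hypothetical ground-truth models
# Only 4 transitions per trajectory, fewer than the state dimension d = 8.
trajs, labels = sample_mixture_trajectories(A_list, num_traj=100, length=4)
```

A learner in this setting would receive `trajs` only; `labels` is returned here purely so that a recovery method could be checked against the ground truth.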

#### G-Mixup: Graph Data Augmentation for Graph Classification

This work develops Mixup for graph data. By interpolating features and labels between two randomly selected samples, Mixup has been shown to improve the robustness and generalization of neural networks. Mixup has traditionally been applied to regular, grid-like, Euclidean data, such as tabular or image data. It is hard to apply Mixup directly to graph data, however, because different graphs typically have different numbers of nodes, are not easily aligned, and have unique topologies in non-Euclidean space. To overcome this, the authors propose G-Mixup, which augments graphs for graph classification by interpolating the generators (graphons) of different classes of graphs. To estimate the graphon of a class, it uses graphs from that same class.
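
The idea of interpolating graph generators rather than graphs can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each class graphon has already been estimated as a small step-function matrix, and the names `mix_graphons` and `sample_graph`, the mixing weight `lam`, and the example matrices are all hypothetical.

```python
import numpy as np

def mix_graphons(W_a, W_b, y_a, y_b, lam):
    """Interpolate two step-function graphons and their class labels."""
    W = lam * W_a + (1 - lam) * W_b
    y = lam * y_a + (1 - lam) * y_b
    return W, y

def sample_graph(W, num_nodes, rng):
    """Sample an undirected graph whose edge probabilities come from graphon W."""
    K = W.shape[0]
    blocks = rng.integers(K, size=num_nodes)      # latent block of each node
    probs = W[np.ix_(blocks, blocks)]             # pairwise edge probabilities
    # Draw each edge once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((num_nodes, num_nodes)) < probs, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
W_a = np.array([[0.9, 0.1], [0.1, 0.9]])  # assumed graphon estimate for class A
W_b = np.array([[0.2, 0.7], [0.7, 0.2]])  # assumed graphon estimate for class B
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

W_mix, y_mix = mix_graphons(W_a, W_b, y_a, y_b, lam=0.5)
adj = sample_graph(W_mix, num_nodes=12, rng=rng)
```

Because the two graphons live on the same grid, they can be averaged even though the original graphs have different sizes and no node alignment, which is the obstacle the summary above describes.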

#### Bayesian Model Selection, the Marginal Likelihood, and Generalization

The main question is: how can we compare hypotheses that are all consistent with the data? The marginal likelihood, also known as the Bayesian evidence, represents the probability of generating the observations from a prior. It offers a distinctive answer to this fundamental question by automatically encoding Occam's razor. The marginal likelihood is known to overfit and to be sensitive to prior assumptions, but its limitations for hyperparameter learning and discrete model comparison have not been well studied. The authors first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. They then highlight the conceptual and practical problems with using the marginal likelihood as a proxy for generalization.

Specifically, they show that the marginal likelihood can be negatively correlated with generalization, which has implications for neural architecture search, and that it can lead to both underfitting and overfitting in hyperparameter learning. As a partial remedy they propose a conditional marginal likelihood, which they show is better aligned with generalization and is practically useful for large-scale hyperparameter learning, such as in deep kernel learning.
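
For reference, the marginal likelihood and its conditioned variant can be written out as below. The notation is assumed for this sketch and does not appear in the summary above: data $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_n\}$, model $\mathcal{M}$, parameters $w$, and a split index $m$. The conditioned form shown is one natural way to write it and may differ in detail from the authors' definition.

$$
p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid w, \mathcal{M})\, p(w \mid \mathcal{M})\, dw,
\qquad
\log p(\mathcal{D} \mid \mathcal{M}) = \sum_{i=1}^{n} \log p(\mathcal{D}_i \mid \mathcal{D}_{<i}, \mathcal{M})
$$

$$
\log p(\mathcal{D}_{\ge m} \mid \mathcal{D}_{<m}, \mathcal{M}) = \sum_{i=m}^{n} \log p(\mathcal{D}_i \mid \mathcal{D}_{<i}, \mathcal{M})
$$

The second line simply drops the first $m-1$ terms of the chain-rule sum. Those early terms measure how well the prior alone predicts the first few data points, so removing them makes the score depend less on the prior and more on how the model predicts after it has seen some data.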

#### Do Differentiable Simulators give Better Policy Gradients?

Differentiable simulators promise faster computation for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with estimates based on first-order gradients. Despite the importance of this question for the usefulness of differentiable simulators, it is still unclear what factors determine the performance of the two estimators on complex landscapes that involve long-horizon planning and control of physical systems. The authors show that properties of some physical systems, such as stiffness or discontinuities, may compromise the first-order estimator, and they analyze this phenomenon through the lens of bias and variance. To combine the efficiency of first-order estimates with the robustness of zeroth-order methods, they also propose an α-order gradient estimator, with α ∈ [0, 1], which correctly makes use of exact gradients.
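
The contrast between the two estimators, and a blend indexed by α ∈ [0, 1], can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the authors' code: the objective is a Gaussian-smoothed function E[f(θ + w)], the blend is a plain convex combination of the two estimates, and `alpha_order_gradient`, `step`, and `dstep` are hypothetical names.

```python
import numpy as np

def alpha_order_gradient(f, df, theta, alpha, sigma=0.1, n=20000, seed=0):
    """Estimate d/dtheta E[f(theta + w)], w ~ N(0, sigma^2), by mixing two estimators."""
    rng = np.random.default_rng(seed)
    w = sigma * rng.standard_normal(n)
    # First-order estimate: average the exact derivative at the perturbed points.
    first = np.mean(df(theta + w))
    # Zeroth-order estimate: uses function values only, with f(theta) as a baseline.
    zeroth = np.mean((f(theta + w) - f(theta)) * w) / sigma**2
    return alpha * first + (1 - alpha) * zeroth, first, zeroth

# A discontinuous toy objective: a unit step at zero, with derivative zero almost everywhere.
step = lambda x: np.where(x > 0, 1.0, 0.0)
dstep = lambda x: np.zeros_like(x)

mixed, first, zeroth = alpha_order_gradient(step, dstep, theta=0.0, alpha=0.5)
```

Here the first-order estimate is exactly zero, even though the smoothed objective has a positive slope at θ = 0. That is the kind of bias attributed to discontinuities above. The zeroth-order term recovers the slope at the cost of higher variance, and α trades one against the other.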

#### Privacy for free: How does dataset condensation help privacy?

To prevent unintended data leakage, the research community has turned to data generators that can produce differentially private data for model training. However, for the sake of data privacy, existing solutions suffer from either high training cost or poor generalization performance. The authors therefore ask whether training efficiency and privacy can be achieved at the same time. This work shows, for the first time, that dataset condensation (DC), which was originally designed to improve training efficiency, is also a better replacement for conventional data generators for private data generation, giving privacy for free. To show the privacy benefit of DC, the authors build a connection between DC and differential privacy.

They prove theoretically, first for linear feature extractors and then extended to non-linear feature extractors, that the existence of one sample has a limited impact (O(m/n)) on the parameter distribution of networks trained on m samples synthesized from n (n >> m) raw samples via DC. They also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both loss-based and state-of-the-art likelihood-based membership inference attacks. The authors see this work as a milestone for machine learning that is both data-efficient and privacy-preserving.

#### Stable Conformal Prediction Sets

Given a sequence of observations $(x_1, y_1), \ldots, (x_n, y_n)$, Conformal Prediction (CP) is a methodology for estimating a confidence set for $y_{n+1}$ given $x_{n+1}$, assuming only that the data distribution is exchangeable. CP sets have guaranteed coverage for any finite sample size $n$. Appealing as this is, computing such a set is infeasible in general, for instance when the unknown variable $y_{n+1}$ is continuous. The bottleneck is that CP relies on a procedure that refits a prediction model on the data with the unknown target replaced by each of its possible values, in order to select the most probable ones. This is often intractable because it requires computing an infinite number of models.

In this study, the authors combine CP techniques with classical algorithmic stability bounds to derive a prediction set that can be computed with a single model fit. They show that the proposed confidence set does not lose any coverage guarantee while avoiding the data splitting currently used in the literature. Numerical experiments on both synthetic and real datasets illustrate the tightness of the estimate when the sample size is sufficiently large.
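
The refitting bottleneck of full conformal prediction is easy to see in code. The sketch below is the standard full CP procedure restricted to a finite grid of candidate targets, not the stability-based method of the paper. Ridge regression as the model, absolute residuals as the conformity score, and the names `ridge_fit_predict`, `full_conformal_set`, and `candidates` are assumptions made for this example.

```python
import numpy as np

def ridge_fit_predict(X, y, lam=1.0):
    """Fit ridge regression on (X, y) and return its predictions on X."""
    d = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return X @ beta

def full_conformal_set(X, y, x_new, candidates, alpha=0.1, lam=1.0):
    """Keep every candidate target whose conformal p-value exceeds alpha."""
    X_aug = np.vstack([X, x_new])
    kept = []
    for z in candidates:                      # one model refit per candidate value
        y_aug = np.append(y, z)
        scores = np.abs(y_aug - ridge_fit_predict(X_aug, y_aug, lam))
        # Fraction of points that conform no better than the new point.
        p_value = np.mean(scores >= scores[-1])
        if p_value > alpha:
            kept.append(z)
    return np.array(kept)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])        # hypothetical ground truth
X = rng.standard_normal((50, 3))
y = X @ beta_true + 0.1 * rng.standard_normal(50)
x_new = np.array([0.5, 0.5, 0.5])
candidates = np.linspace(-3.0, 3.0, 121)
cp_set = full_conformal_set(X, y, x_new, candidates)
```

The loop fits 121 models for a single test point, and a continuous target would need infinitely many. Avoiding that loop without giving up the coverage guarantee or splitting the data is what a single-fit prediction set is meant to achieve.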

#### Causal Conceptions of Fairness and their Consequences

Recent research highlights the role of causality in designing equitable decision-making algorithms. However, it is not immediately clear how the existing causal conceptions of fairness relate to one another, or what the consequences are of using them as design principles. The authors first assemble and categorize popular causal definitions of algorithmic fairness into two families: those that constrain the effects of decisions on disparities, and those that constrain the effects of legally protected characteristics, such as race and gender, on decisions. They then show, analytically and empirically, that both families of definitions almost always result in strongly Pareto dominated decision policies, meaning that there is an alternative, unconstrained policy favoured by every stakeholder with preferences drawn from a large, natural class.

For example, in the case of university admissions decisions, policies constrained to satisfy causal fairness definitions would be disfavoured by every stakeholder with neutral or positive preferences for both academic preparedness and diversity. Indeed, under a prominent definition of causal fairness, the authors prove that the resulting policies require admitting all students with the same probability, regardless of academic qualifications or group membership. The findings draw attention to the formal limitations and potentially adverse consequences of common mathematical notions of causal fairness.

#### The Importance of Non-Markovianity in Maximum State Entropy Exploration

In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it induces. The class of Markovian stochastic policies is known to be sufficient for the maximum state entropy objective, so exploiting non-Markovianity is generally considered pointless in this setting. In this study, the authors argue that non-Markovianity is instead crucial for maximum state entropy exploration in a finite-sample regime. In particular, they recast the objective to target the expected entropy of the state visitations induced in a single trial. They then show that the class of non-Markovian deterministic policies is sufficient for this objective, while Markovian policies suffer non-zero regret in general.

However, they also prove that finding an optimal non-Markovian policy is NP-hard. Despite this negative result, they discuss ways to address the problem tractably and how non-Markovian exploration could improve the sample efficiency of online reinforcement learning in future work.
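
The difference between the entropy of the expected state visits and the expected entropy of the visits in a single trial can be shown with a tiny example. This is a toy sketch with made-up trajectories, and `visit_entropy` is a hypothetical helper, not something from the paper.

```python
import math
from collections import Counter

def visit_entropy(states):
    """Shannon entropy (in nats) of the empirical state-visit distribution."""
    counts = Counter(states)
    n = len(states)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# Two equally likely trials of a policy that commits to a single state each time.
trial_a = [0, 0, 0, 0]
trial_b = [1, 1, 1, 1]

# Entropy of the expected visitation: pool the trials (they have equal length).
entropy_of_expected = visit_entropy(trial_a + trial_b)
# Expected entropy of a single trial: average the per-trial entropies.
expected_entropy = (visit_entropy(trial_a) + visit_entropy(trial_b)) / 2
```

On average the policy covers both states evenly, so the first quantity equals log 2, yet every individual trial explores nothing and the second quantity is zero. The recast objective described above targets the second quantity.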

#### Solving Stackelberg Prediction Game with Least Squares Loss via Spherically Constrained Least Squares

The Stackelberg prediction game (SPG) is a popular tool for characterizing strategic interactions between a learner and an attacker. As an important special case, the SPG with least squares loss (SPG-LS) has recently attracted a lot of research interest. Although it was initially formulated as a difficult bi-level optimization problem, SPG-LS admits tractable reformulations that can be solved globally in polynomial time by semidefinite programming or second-order cone programming. However, none of the existing methods is well suited to large datasets, especially those with a large number of features. This study explores an alternative reformulation of the SPG-LS: using a novel nonlinear change of variables, the authors recast the SPG-LS as a spherically constrained least squares (SCLS) problem.

Theoretically, they show that an $\epsilon$-optimal solution to the SCLS (and the SPG-LS) can be obtained in $\tilde{O}(N/\sqrt{\epsilon})$ floating-point operations, where $N$ is the number of nonzero entries in the data matrix. In practice, they solve the new reformulation with two well-known methods, the Krylov subspace method and the Riemannian trust region method. Both algorithms are factorization-free, which makes them suitable for large-scale problems. Numerical results on synthetic and real-world datasets indicate that the SPG-LS with the SCLS reformulation can be solved orders of magnitude faster than the state of the art.

#### Monarch: Expressive Structured matrices for efficient and accurate training

Large neural networks excel in many fields, but they are expensive to train and fine-tune. A common approach to lowering their compute or memory requirements is to replace dense weight matrices with structured ones. These methods have not been widely adopted, because of unfavourable efficiency-quality tradeoffs in end-to-end training and the lack of tractable algorithms to approximate a given dense weight matrix. To address these problems, the authors propose a class of matrices, Monarch, that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Interestingly, although it is nonconvex, the problem of approximating a dense weight matrix with a Monarch matrix has an analytical optimal solution.

These properties of Monarch matrices open up new ways to train and fine-tune sparse and dense models. The authors empirically validate that Monarch achieves favourable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: it speeds up ViT and GPT-2 training on ImageNet classification and Wikitext-103 language modelling, respectively, by 2x with comparable model quality, and it reduces the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, a simple technique called "reverse sparsification" lets Monarch matrices speed up GPT-2 pretraining on OpenWebText by 2x without a drop in quality. The same technique speeds up BERT pretraining by 23%, outperforming even the highly optimized Nvidia implementation that set the MLPerf 1.1 record. The Monarch approximation is also demonstrated, as a proof of concept, in dense-to-sparse fine-tuning.
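
The parameterization as a product of two block-diagonal matrices can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the matrix size is taken to be n = m², each factor has m blocks of size m x m, and a fixed "transpose" permutation P is interleaved with the factors so that the product mixes all coordinates. The function names are hypothetical.

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Apply M = P L P R to x using only small block-wise products."""
    m = L_blocks.shape[0]
    v = x.reshape(m, m)                        # split x into m chunks of length m
    v = np.einsum("bij,bj->bi", R_blocks, v)   # block-diagonal factor R
    v = v.T                                    # permutation P
    v = np.einsum("bij,bj->bi", L_blocks, v)   # block-diagonal factor L
    return v.T.reshape(-1)                     # permutation P again

def block_diag(blocks):
    """Dense block-diagonal matrix, used here only to check the fast path."""
    m = blocks.shape[0]
    out = np.zeros((m * m, m * m))
    for b in range(m):
        out[b * m:(b + 1) * m, b * m:(b + 1) * m] = blocks[b]
    return out

def transpose_permutation(m):
    """Permutation that sends entry (i, j) of the m x m reshape to (j, i)."""
    P = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            P[j * m + i, i * m + j] = 1.0
    return P

rng = np.random.default_rng(0)
m = 4
L_blocks = rng.standard_normal((m, m, m))
R_blocks = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)

P = transpose_permutation(m)
M_dense = P @ block_diag(L_blocks) @ P @ block_diag(R_blocks)
y_fast = monarch_matvec(L_blocks, R_blocks, x)
```

With m = 4 the two factors hold 128 parameters where the dense 16 x 16 matrix holds 256, and the gap widens as n grows because the structured form needs 2·n^1.5 parameters instead of n². The block-wise products are small batched matrix multiplies, a pattern that tends to map well onto accelerators.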

### References

Learning Mixtures of Linear Dynamical Systems - __https://icml.cc/virtual/2022/oral/15994__

G-Mixup: Graph Data Augmentation for Graph Classification - __https://icml.cc/virtual/2022/poster/16665__

Bayesian Model Selection, the Marginal Likelihood, and Generalization - __https://icml.cc/virtual/2022/oral/17992__

Do Differentiable Simulators give Better Policy Gradients? - __https://icml.cc/virtual/2022/oral/16770__

Privacy for free: How does dataset condensation help privacy? - __https://icml.cc/virtual/2022/oral/18236__

Stable Conformal Prediction Sets - __https://icml.cc/virtual/2022/oral/16842__

Causal Conceptions of Fairness and their Consequences - __https://icml.cc/virtual/2022/oral/17122__

The Importance of Non-Markovianity in Maximum State Entropy Exploration - __https://icml.cc/virtual/2022/poster/16289__

Solving Stackelberg Prediction Game with Least Squares Loss via Spherically Constrained Least Squares - __https://icml.cc/virtual/2022/poster/17691__

Monarch: Expressive Structured matrices for efficient and accurate training - __https://icml.cc/virtual/2022/poster/17899__