“The ability to generalize to unseen data is at the core of machine learning.”

At the heart of much of current AI research is the question of how to make algorithms **generalize to unseen data**.

In the context of machine learning, most models are trained and evaluated under an **i.i.d.** (independent and identically distributed) assumption, which implies that training data and test data for a given task are sampled from the same distribution. Generalization then means extracting this shared underlying distribution from the training data alone.

The i.i.d. assumption, however, frequently fails in the real world, where environments constantly shift and o.o.d. (out-of-distribution) learning is crucial for survival.

Humans are currently much better at generalization than machines: we can quickly recognize distribution shifts in the environment and are “few-shot learners”, inferring rules from few examples. We can more flexibly adjust our inference models to data that is different from what we have seen before. For many classical ML models, this isn’t true: **catastrophic forgetting** is a common problem, referring to the phenomenon of neural network models abruptly forgetting everything they learned when trained on new, unseen data.

Generalization is closely related to the problem of **overfitting vs. underfitting** training data, where overfitting refers to overexplaining the data by fitting too much **noise** relative to **signal**. Standard ways of tackling overfitting are lower-parameter models, pruned models, and regularisation techniques (dropout, L2 norms, etc.). Some of these intuitions have, however, been questioned by phenomena like **double descent** (this Twitter feed explains double descent and how it relates to regularisation with a simple example): higher-capacity models generalize worse than lower-capacity models because they overfit, but going to even higher-capacity models results in them generalizing even better than the lower-capacity models.
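The effect of L2 regularisation on overfitting can be made concrete with a minimal sketch: fitting a high-degree polynomial to a few noisy samples by ordinary least squares versus ridge regression. The underlying signal (a sine wave) and all parameter values here are illustrative choices, not from the original article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying signal: y = sin(x) + noise.
x = np.linspace(0, 3, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

# Degree-9 polynomial features: enough capacity to fit the noise, too.
X = np.vander(x, 10)

# Ordinary least squares: free to chase both signal *and* noise.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge (L2) regression: the penalty shrinks the weight norm,
# discouraging the wiggly solutions needed to fit the noise.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The ridge solution always has a smaller weight norm than the unregularised fit, which is exactly the mechanism by which it trades a little training error for smoother, better-generalizing predictions.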

Intuitions on overfitting have also been put into question by large-scale transformer-based models’ generalization performance, which began with GPT-3’s uncanny ability to solve tasks it hadn’t been trained on.

DeepMind’s new Flamingo goes further by connecting language and vision models into a single system that generalizes across a wide range of combined vision-and-language tasks:

The ability to **represent knowledge in a way that generalizes across tasks** seems intuitively much more intelligent than a neural network classifying dogs and cats after seeing millions of labeled examples of dogs and cats.

And so the surprising success of these models poses interesting questions about what generalization means, and how it can be achieved: what exactly is learned by these models? This question is not becoming easier to answer with ever-increasing model sizes, with parameter counts close to the number of neurons in the human brain. Given their enormous capacity, do these models just remember all of the training data in a clever way, or is there something more?

Generalization interacts in important ways with memory: the idea is that if we extract **understanding** from data, we have access to a much more flexible, condensed representation of knowledge than if we just remember it. This is an essential task in many unsupervised learning settings, e.g. disentangled representation learning. The ability to generalize to unseen data is thus not only at the core of machine learning but at the heart of many **definitions of intelligence**.

According to Marcus Hutter, intelligence bears many similarities to **lossless compression**, and appropriately, the Hutter Prize is awarded for advances in the compression of a text file containing the first 1,000,000,000 characters of a specific version of English Wikipedia. Together with his colleague Shane Legg, he condensed a wide range of definitions of intelligence from psychology, machine learning, and philosophy into this formula:

To put it less formally, intelligence is the ability of an **agent** to extract **value** from the space of all **environments**, weighted by the **complexity** of the respective environments. The Kolmogorov complexity function is used as the measure of complexity: it is an information-theoretic measure of how complex an object is, corresponding to the length of the shortest program that produces it. Connecting to the idea of intelligence as compression, this is the optimally compressed, **memory-efficient representation** of the object (I went into similar ideas in more detail in my article on chaos theory and computational irreducibility). When **overfitting noise**, we have to **remember** it, because in an information-theoretic sense noise is uncorrelated and has no meaningful explanation, and thus contains no relevant information about the past or future.
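Kolmogorov complexity itself is uncomputable, but any off-the-shelf lossless compressor gives an upper bound on it, which makes the noise-vs-structure point easy to demonstrate. A minimal sketch using only the Python standard library:

```python
import os
import zlib

# 10,000 bytes of structured data: a short pattern repeated over and over.
structured = b"the quick brown fox " * 500

# 10,000 bytes of pure noise: incompressible by any lossless method.
noise = os.urandom(10_000)

c_struct = len(zlib.compress(structured, 9))
c_noise = len(zlib.compress(noise, 9))

print(c_struct, c_noise)
```

The structured data shrinks to a tiny fraction of its size because a short "program" (pattern plus repeat count) explains it, while the noise compresses to essentially its original length: the only way to reproduce it is to remember it byte for byte.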

However, while everyone seems to agree that generalization is important for machine learning, and in some way connected to complexity, it remains difficult to measure, with this Google paper compiling over 40 measures that aim at characterizing complexity and generalization, with wildly different results.

The question of how well neural networks generalize relates to how much they remember and how much they **learn to forget**. A recent paper by Pedro Domingos titled *“Every Model Learned by Gradient Descent Is Approximately a Kernel Machine”* brings an interesting new angle to this discussion:

“Deep networks…are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples.” — Pedro Domingos

According to Domingos, learning in neural networks bears many mathematical similarities to kernel-based methods, such as Support Vector Machines.

Simply speaking, in kernel-based methods, the training data is first embedded in a new space, the so-called feature vector space, via a nonlinear transformation. Features (the dimensions of the embedding space) can have properties that make intuitive sense to us (e.g. how happy or how scary a movie is, or how furry a cat is), but in a more general sense, the metric of the embedding space captures similarities between data points (e.g. how close to each other two movies are on the happiness dimension). Once the data is embedded, the features can be linearly separated, or used, for example, for a k-nearest-neighbor classification, where a test point is compared to the k neighboring data points in the feature space and classified based on, say, the most common label among these neighbors (you can, for instance, figure out how happy a movie is by looking at how happy similar movies are).
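The movie example above can be sketched in a few lines: toy points in a hypothetical two-dimensional feature space (happiness, scariness), an RBF kernel as the similarity function, and a majority vote over the k most similar training points. Labels, coordinates, and the kernel width are all made-up illustrations.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Similarity decays with squared distance in the embedding space.
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Toy "movies" embedded in a 2-D feature space: (happiness, scariness).
train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = ["comedy", "comedy", "horror", "horror"]

def classify(x, k=3):
    sims = [rbf_kernel(x, t) for t in train]
    top = np.argsort(sims)[-k:]            # indices of the k most similar points
    votes = [labels[i] for i in top]
    return max(set(votes), key=votes.count)  # majority label among neighbors

print(classify(np.array([0.85, 0.15])))  # → comedy
```

Prediction here never consults a learned rule: it consults the remembered training points directly, which is exactly the sense in which kernel machines "memorize the data and use it for prediction via a similarity function".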

The field of deep metric learning tackles similar questions: it aims to find embedding spaces of data in which similarity between samples can be easily measured (such as the similarity between unseen images of faces for the task of face recognition). And on the flip side, the neural tangent kernel has been used to derive a kernel function that corresponds to an infinite-width neural network, which in turn has shown to be a useful kernel function, and provided new theoretical insights into how neural networks learn.

Domingos’s paper reveals an interesting parallel between models learned with gradient descent and kernel-based techniques: during training, the **training data is implicitly remembered** in the network weights. During inference, the “remembered” training data and the nonlinear transformation represented by the neural network work together to compare a test point to previously seen data, and to classify it in analogy to kernel methods.
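The prediction form Domingos shows deep networks approximate is the generic kernel-machine form, a weighted comparison of the test point against every stored training point. The sketch below is that generic form, not his path kernel specifically; the training points, coefficients, and kernel are hypothetical stand-ins for the result of training.

```python
import numpy as np

def kernel(x, xi, gamma=0.5):
    # RBF similarity between a test point x and a training point xi.
    return np.exp(-gamma * (x - xi) ** 2)

# "Remembered" training points and learned coefficients a_i, bias b
# (hypothetical values standing in for the outcome of training).
x_train = np.array([-2.0, 0.0, 2.0])
a = np.array([0.5, -1.0, 0.5])
b = 0.1

def predict(x):
    # y(x) = sum_i a_i * K(x, x_i) + b : the prediction is a
    # superposition of the training examples, in Domingos's phrase.
    return np.sum(a * kernel(x, x_train)) + b

print(predict(0.0))
```

Every prediction is literally a sum over the training set, which is why this equivalence "elucidates that the weights are effectively a superposition of the training examples".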

While the implications of this are not yet fully understood, they could shed some light on why neural networks trained with gradient descent have struggled with o.o.d. learning: if they indeed rely on remembering the training data, then, following the earlier discussion, they should be worse at generalizing unless also taught to sometimes forget (i.e. regularized). Hence, this perspective could also shed some light on how to better regularize models for generalization.

Memory is concerned with the storing and retrieval of information across time, so questions of memory have been similarly important in the domain of time series analysis. Recurrent Neural Networks (RNNs) and Long-Short-Term-Memory Networks (LSTMs) are two of the most popular models for modeling time series data.

A classical benchmark for memory in sequence models is the addition problem: a model is tasked with adding up two numbers that are “shown” to it at time points **t1** and **t2**, and with **outputting the correct sum** at a later time point T. The model therefore needs to retain information across long stretches of time, which becomes increasingly difficult to train with gradient-based methods as the time lag between t1 and t2 is increased. This is related to the vanishing-and-exploding gradient problem, caused by the repeated application of the same layer across t time steps when backpropagating through sequence models (for time series from chaotic systems, this is even bound to occur). The gradients frequently explode or die out, making recurrent models either **costly or even impossible to train** on certain tasks.
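The standard construction of an addition-problem sample can be sketched directly: a sequence of random values plus a marker channel that flags the two positions to be summed. The sequence length and random seed are arbitrary choices for illustration.

```python
import numpy as np

def make_addition_example(T, rng):
    """One sample of the addition problem with sequence length T."""
    values = rng.random(T)       # a random number "shown" at each step
    markers = np.zeros(T)        # marker channel: 1.0 at t1 and t2, else 0
    t1, t2 = rng.choice(T, size=2, replace=False)
    markers[t1] = markers[t2] = 1.0
    x = np.stack([values, markers], axis=1)  # input sequence, shape (T, 2)
    y = values[t1] + values[t2]              # target: sum of the two marked values
    return x, y

rng = np.random.default_rng(0)
x, y = make_addition_example(T=100, rng=rng)
print(x.shape, y)
```

To solve this, the model has to carry the two marked values across up to T steps unchanged, which is precisely what makes it a memory benchmark: increasing T makes gradient-based training progressively harder.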

The difficulty of retaining memory is related to the difficulty of learning **slow time scales**: it has been shown that the addition problem can be solved by instantiating slow dynamics in a subspace of an RNN (a so-called line attractor), in which information can be stably retained without being affected by the dynamics of the rest of the network.
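The line-attractor idea reduces, in its simplest one-dimensional form, to a linear recurrent unit whose self-connection weight is exactly 1: along that direction the dynamics are neutrally stable, so an input pulse is held forever, while any weight below 1 makes the memory decay. A minimal sketch (the pulse timing and weights are illustrative):

```python
def run(recurrent_weight, pulse_at=5, steps=100):
    # A single linear recurrent unit: h_{t+1} = w * h_t + input_t
    h = 0.0
    for t in range(steps):
        inp = 1.0 if t == pulse_at else 0.0
        h = recurrent_weight * h + inp
    return h

print(run(1.0))  # → 1.0  (line attractor: the pulse is retained indefinitely)
print(run(0.9))  # near zero: contracting dynamics forget the pulse
```

A trained RNN solving the addition problem effectively carves out such a direction of (near-)unit eigenvalue in its high-dimensional state space, so the stored values coexist with faster dynamics elsewhere in the network.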

LSTMs, which have become the most cited neural network architecture of the 20th century, address the memory problem by explicitly adding a cell state, which retains information over arbitrary amounts of time, and an input, output and forget gate that regulate the information flow into the cell. Therefore, LSTMs are better than vanilla RNNs at “remembering” information across thousands of time steps, and at solving tasks like the addition problem.
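The gating mechanism described above fits in a few lines of NumPy. The sketch below is one step of a standard LSTM cell; the input and hidden sizes and the random weights are hypothetical, chosen only to make the example run.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    c = f * c + i * np.tanh(g)  # cell state: forget old content, write new
    h = o * np.tanh(c)          # hidden state: gated read-out of the cell
    return h, c

# Hypothetical sizes: 2-D input, 3-D hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 2, 3
W = rng.standard_normal((4 * d_h, d_in + d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.standard_normal(d_in), h, c, W, b)
print(h.shape, c.shape)
```

The key line is the cell-state update: when the forget gate saturates near 1 and the input gate near 0, the cell state passes through unchanged, which is what lets information survive across thousands of steps without the gradient vanishing.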

But as discussed earlier, in this context memory can also have its downsides: it makes it easier to “overfit” information by **remembering** it instead of **compressing** it by **understanding** it.

The language of dynamical systems is the physicist’s way of talking about temporal phenomena. A dynamical description of the world lies at the core of most physical theories, from Newton’s theorems to Schrödinger’s equation:

What characterizes these descriptions of reality via differential equations is that they are **memoryless**. Given an initial state and complete description of the time evolution operator of the system (i.e. its Hamiltonian), the time evolution of the system is known until infinity (and is even time reversal symmetric, so no information is lost). Hence, there is no need for memory: if the description is indeed complete, it is perfectly compressed in the sense of Kolmogorov complexity.
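The memoryless property has a simple operational meaning: integrating a differential equation forward needs only the current state, never the history. A sketch with Euler integration of the toy system dx/dt = −x (the system and step sizes are illustrative):

```python
def euler(x, steps, dt=0.001):
    # dx/dt = -x : each update depends only on the *current* state x.
    for _ in range(steps):
        x = x + dt * (-x)
    return x

# Integrating straight to t = 2 matches stopping at t = 1 and restarting
# from the intermediate state alone: no history is required.
direct = euler(1.0, 2000)
resumed = euler(euler(1.0, 1000), 1000)
print(direct == resumed)  # → True
```

The intermediate state is a complete, perfectly compressed summary of everything that came before, which is the sense in which a full dynamical description needs no memory.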

In dynamical systems reconstruction, the field of machine learning concerned with recovering a dynamical system from time series, models with memory can actually be detrimental: they run the risk of failing to generalize to the underlying system, i.e. to find the optimal, **memoryless** description, and instead overfit it by remembering spurious patterns in the training data. This is an ongoing challenge for learning models of complex (dynamical) systems, such as the brain or the climate, where generalizing to the proper description of the system that captures its long-term behavior has many important practical implications, e.g. for predicting dynamics after tipping points. These can play a big role in forecasting extreme weather events or the long-term effects of climate change. However, most real-world systems are noisy, chaotic, and only partially observed, so separating signal from noise remains a big challenge.

In many practical applications, we don’t have a complete description and full knowledge of the world we observe around us. Making use of memory, especially when no more compressed description of reality is available or feasible, remains a crucial ingredient to the mix of building practically intelligent systems, and is a defining feature of our own intelligence. Nevertheless, I think it’s a fruitful angle to think about how generalization and memory interact, and how it can help us design algorithms that generalize better.

*Memory and Generalization in Artificial Intelligence*, republished from Towards Data Science: https://towardsdatascience.com/memory-and-generalization-in-artificial-intelligence-35e006ca0a9a
