Attention Mechanism in NLP

The attention mechanism in deep learning is like a spotlight that helps a model focus on specific parts of input data when making predictions or decisions. It's inspired by how humans pay attention to different things when processing information.

Imagine you're reading a long paragraph, and there's a word you don't understand. Your attention mechanism would zoom in on that word, ignoring the rest of the text for a moment. This allows you to better understand the context around that word and make sense of it.

In deep learning, the attention mechanism does something similar. When a model processes a sequence of data, like words in a sentence or elements in a time series, it assigns different levels of importance or "attention" to each part of the sequence. It learns to focus more on the relevant parts and less on the irrelevant ones.

How Attention Mechanism was Introduced

Early neural machine translation was based on an encoder-decoder architecture built from RNNs/LSTMs. Both the encoder and decoder are stacks of RNN/LSTM units, and the model works in two steps:

  1. The encoder LSTM processes the entire input sentence and encodes it into a context vector, which is the last hidden state of the LSTM/RNN. This vector is expected to be a good summary of the input sentence. All intermediate states of the encoder are discarded, and the final state serves as the initial hidden state of the decoder.

  2. The decoder LSTM/RNN then produces the words of the output sentence one after another.
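The two steps above can be sketched with a toy NumPy implementation. This is an illustrative sketch, not a real translation model: the weights are random, the "word embeddings" are random vectors, and the function and variable names (`rnn_step`, `context`) are made up for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h, x, W_h, W_x):
    # One vanilla RNN step: new hidden state from previous state and input.
    return np.tanh(W_h @ h + W_x @ x)

d = 4                                    # hidden/embedding size (toy)
W_h = rng.normal(size=(d, d)) * 0.5
W_x = rng.normal(size=(d, d)) * 0.5

# Step 1 - encoder: process the whole input sequence, keep ONLY the final
# hidden state as the fixed-length context vector.
inputs = [rng.normal(size=d) for _ in range(5)]  # stand-ins for word embeddings
h = np.zeros(d)
for x in inputs:
    h = rnn_step(h, x, W_h, W_x)
context = h                              # everything the decoder will see

# Step 2 - decoder: start from the context vector and emit one hidden state
# per output word (here we just feed the state back in as a toy input).
s = context
decoder_states = []
for _ in range(3):
    s = rnn_step(s, s, W_h, W_x)
    decoder_states.append(s)

print(context.shape)  # (4,)
```

Note that no matter how long the input sentence is, `context` stays a single vector of size `d`, which is exactly the bottleneck the attention mechanism was introduced to fix.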

The main drawback of this approach is the encoder's potential to create poor summaries, leading to low-quality translations. This problem is particularly noticeable when dealing with longer sentences, referred to as the long-range dependency problem in RNN/LSTMs. RNNs struggle to remember lengthy sequences due to the vanishing/exploding gradient issue, leading to performance degradation with longer input sentences.

Even LSTM, which is expected to handle long-range dependencies better than RNNs, can suffer from forgetfulness in specific cases. Additionally, there is no mechanism to assign varying levels of importance to different input words during the translation process. Bahdanau et al. (2015) proposed the idea of considering all input words in a context vector while also assigning varying levels of importance to each of them. The proposed model searches for specific positions in the encoder's hidden states, where the most relevant information is found when generating a sentence. This concept is referred to as "Attention."


  • The figure provides an overview of the new mechanism used in the decoder. When generating an output word "yt" at time "t," the decoder uses its last hidden state (which can be viewed as a representation of the words generated so far) together with a dynamically computed context vector based on the input sequence.

  • The authors introduced a replacement for the fixed-length context vector, called "ci," which is obtained by summing the hidden states of the input sequence, each weighted by alignment scores.

  • It's important to note that now the probability of generating each output word is influenced by a distinct context vector "ci" corresponding to each target word "yi."

  • The updated decoder is defined as follows:

The probability of "yi" given "y1,…,yi−1" and the input is written as "p(yi|y1,…,yi−1,x)," where "si" is the decoder hidden state at time "i," calculated as follows:

si = f(si−1, yi−1, ci)

  • In this equation, the new hidden state at time "i" depends on the previous hidden state, the representation of the word generated in the previous step ("yi−1"), and the context vector for position "i" ("ci"). This dynamic computation of context vectors and hidden states enables the model to generate output words with contextually relevant information from the input sequence.
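The context vector "ci" described above can be sketched in a few lines of NumPy. This is a minimal illustration of Bahdanau-style additive alignment, assuming random toy weights; the names `W_a`, `U_a`, and `v` follow the usual convention for the alignment model's parameters but are assumptions here, not the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

d = 4                                # hidden size (toy)
H = rng.normal(size=(6, d))          # encoder hidden states h_1 .. h_6
s_prev = rng.normal(size=d)          # previous decoder state s_{i-1}

# Additive (Bahdanau-style) alignment model:
#   e_ij = v^T tanh(W_a s_{i-1} + U_a h_j)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v = rng.normal(size=d)

e = np.array([v @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
alpha = softmax(e)                   # alignment weights, sum to 1
c_i = alpha @ H                      # context vector: weighted sum of h_j

print(c_i.shape)  # (4,)
```

Because `alpha` is recomputed from the current decoder state at every step, each target word gets its own context vector, which is the key difference from the fixed-context encoder-decoder above.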

Assignment of different levels of importance to words

The attention mechanism assigns different levels of importance to words or elements in a sequence through a process that can be visualized as a set of weights or scores. Here is a simplified explanation of how this is done:

  1. Score Calculation: For each word or element in the input sequence, the attention mechanism calculates a score. These scores are like measures of how relevant each element is to the task at hand. The score is determined by comparing the current state of the model with the representation of the element being considered. This comparison can be done using various methods, but one common approach is to use a neural network layer, such as a feedforward layer, to calculate these scores.

  2. Softmax: After calculating the scores for all elements in the sequence, the softmax function is often applied to these scores. The softmax function converts the scores into probabilities, ensuring that they add up to 1. These probabilities represent the attention weights for each element.

  3. Weighted Sum: With the attention weights obtained from the softmax, the model computes a weighted sum of the elements in the sequence. The elements that have higher attention weights contribute more to this sum, while those with lower weights contribute less. This weighted sum is then used in further processing by the model.
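The three steps above can be condensed into a short NumPy sketch. This example uses dot-product scoring for simplicity (the comparison method mentioned in step 1 can just as well be a small feedforward network); all arrays here are random stand-ins for real representations.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
seq = rng.normal(size=(5, 8))    # 5 element representations, dimension 8
query = rng.normal(size=8)       # current state of the model

scores = seq @ query             # 1. score each element against the state
weights = softmax(scores)        # 2. softmax -> attention weights (sum to 1)
attended = weights @ seq         # 3. weighted sum of the elements

print(attended.shape)  # (8,)
```

Elements whose representation aligns with the current state receive large scores, and after the softmax they dominate the weighted sum, which is exactly the "spotlight" behavior described at the start of the article.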

Let's say you're translating a sentence from English to French. The attention mechanism calculates attention scores for each word in the English sentence, determining how much focus should be given to each word when generating the corresponding word in the French translation. Words that are more relevant to the translation context will get higher attention scores.

By dynamically adjusting these attention weights during the model's processing of the sequence, it effectively "attends" to different parts of the input sequence at different times, giving more importance to the words that are most relevant to the current step in the task. This allows the model to capture dependencies and relationships between elements in the sequence and generate more contextually accurate outputs.
