Tensorflow Keras Attention source code line-by-line explained

Jia Chen
11 min read · May 12, 2020

Recently (at least in a pre-COVID sense), Tensorflow’s Keras implementation added Attention layers. There are two types of attention layers included in the package:

  1. Luong’s style attention layer
  2. Bahdanau’s style attention layer

The two types of attention layers function nearly identically except for how they calculate the score.

Interestingly, Tensorflow’s own tutorial does not use these two layers. Instead, it writes a separate Attention layer. The difficulty for folks who have only read papers 1) and 2) is that the source code introduces concepts such as “mask”, “query”, “key”, and “value”, which are absent from those papers. That is because the source code follows some of the concepts and conventions of a different paper, “Attention is all you need”.

In this blog, I will go through the source code, try to explain what it is doing, and also provide a few examples of how to use these layers.

Before we begin, there are a few things we need to have a very clear understanding of:

What is the Attention mechanism, and what are the different versions of Attention?

When dealing with sequence to sequence (seq2seq) problems, we used to rely on the encoder-decoder architecture.

Let’s use classic machine translation as an example: an English sentence is first encoded into a “context vector” and then passed to a decoder. The decoder keeps this context vector and generates the Spanish sentence word by word (or character by character, depending on how you implement it). This works OK when the original sentence is short, but issues occur when the sentence is long. Obviously, compressing a large volume of information into a short vector inevitably loses information. The problem is made worse by the nature of human languages, which often have long-range dependencies.

The idea of Attention is to let the decoder focus on the relevant part of the input sentence when generating new words. In the original encoder-decoder model, the new word (or character) is typically generated by an RNN, which requires not only an input at each timestep but also previous timestep’s hidden state and previous timestep’s output.

The paper defines each output yi of the decoder as a function of the previous output yi-1, the hidden state si for timestep i, and a context vector ci for each target word yi.
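In the paper’s notation, the conditional probability of each target word is:

p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)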

Earlier I mentioned that the older encoder-decoder model also relies on a context vector, so what’s the difference? The difference is that the Attention model uses a separate context vector for each target word, while the older encoder-decoder model compresses the entire sentence into a single context vector.

For example, if the target sequence is 10 words long, then there are 10 context vectors. Don’t worry, I will walk through a concrete example in a minute. Before we do that, how is the context vector ci calculated?
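The paper defines ci as a weighted sum of the encoder’s hidden-state annotations hj (this is the formula the next line refers to):

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j

where Tx is the length of the input (English) sentence.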

αij here is a weight for each original input word.

Let’s walk through an example. Let’s say we have an input sentence:

I will not inject disinfectant

and the target output is:

No inyectaré desinfectante

There are five English words and three target Spanish words. Imagine that the max length of your Spanish translation is three.

So there should be three context vectors: C1, C2 and C3.

C1 is defined as the sum, over encoder timesteps 1 to 5, of the alpha weights multiplied by the hidden state of each of the five English timesteps. α in the equation means how much attention each word in Spanish should pay to each of the original English words.

αij is more specifically computed as:
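In the paper’s notation, it is a Softmax over the scores eij:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}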

Effectively, the αij weight is a Softmax function across eij, which is called “score” in Tensorflow’s source code or “energy” in the paper (which I will explain in a minute). In the original paper, the authors describe it as “an alignment model, which scores how well the inputs around position j and the output at position i match”. More specifically, because it is a Softmax function, it is a probability distribution over the j timesteps of the original English words. A different way of understanding it: the weight αij is how likely the target Spanish word yi is aligned to each original English word xj. This is effectively where “attention” got its name.
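The score/energy itself comes from the alignment model a, applied to the decoder’s previous hidden state and each encoder annotation:

e_{ij} = a(s_{i-1}, h_j)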

a here denotes a small neural network trained jointly with the other components. The simplest way of imagining this is that you have a dense neural network that takes in the decoder RNN’s previous hidden state and the word annotations of the original English sentence.

Please make sure you notice that the weight is called α while the function/neural network is called a. It got me when I read the paper.

Some quick notes:

  1. The “energy” or score can be calculated in different ways.
  2. One way, seen in Luong’s 2015 paper, is the “general” score (shown after this list).
  3. The same paper also proposed another way of calculating the score, called dot-product, which effectively removes the trainable weight (also shown below).
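In the notation used above (with the decoder hidden state playing the role of the query and the encoder annotation hj the key), the two Luong-style scores are roughly:

\text{general:}\quad e_{ij} = s_i^\top W_a h_j \qquad \text{dot:}\quad e_{ij} = s_i^\top h_j

where Wa is the trainable weight matrix that the dot-product version drops. (Luong’s paper uses the current decoder hidden state and slightly different notation, so take the indexing with a grain of salt.)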

Hopefully, this clarifies the mechanism behind Attention. If not, Jay Alammar has an excellent illustration of how Attention works.

Reading the Bahdanau paper alone is not enough to understand what is going on inside the source code. You will also need to understand some of the ideas in “Attention is all you need”, because the source code implements a lot of concepts from that paper. Let’s quickly go over some of the big-ticket items without getting into the Transformer model itself, the core idea of “Attention is all you need”.

As a starter, the paper introduced a group of concepts called:

  1. Query
  2. Key
  3. Value

These concepts originated from information retrieval. Think of them in the context of an online shopping website: the query is the search term you type in, the keys are the descriptions and titles of the individual products the search engine has indexed, and the values are the product page URLs.

Let’s put these concepts in the context of sequence generation. When using an encoder-decoder model, at each timestep of generating a Spanish word, the decoder asks: “given my current timestep’s hidden state (query), which English word annotations (keys/values) should I pay attention to?” Again, in the context of Natural Language Processing, key and value are often the same.

Another important concept is called Self-Attention, which is one of the options you can turn on when using the Tensorflow layer. What is it?

We talked about how Attention is effectively a set of weights denoting how much influence the decoder should receive from the original input word annotations (how much each of the English words should influence each of the output Spanish words). Self-Attention, as the name suggests, is how much each word in either the input or the output influences the other words of the same sequence. In our earlier example:

I will not inject disinfectant

and the target output is:

No inyectaré desinfectante

Self-attention denotes how much the word “will” is associated with the word “disinfectant” (and vice versa), or how much “inyectaré” is associated with “desinfectante”.

In the transformer model, Self-Attention is used in both the encoder and the decoder. In Tensorflow’s implementation, it seems only decoder self-attention is catered for:

Args: causal: Boolean. Set to `True` for decoder self-attention. Adds a mask such that position `i` cannot attend to positions `j > i`. This prevents the flow of information from the future towards the past.

That’s pretty much all there is to it.

Tensorflow Keras’ Attention Source Code

As mentioned earlier, there are two attention score calculations Tensorflow implemented:

  1. Luong’s style attention layer (the Attention class)
  2. Bahdanau’s style attention layer (the AdditiveAttention class)

They both inherit from a base class called BaseDenseAttention, so let’s start from there. The base class leaves the build() function for each of the two child classes to implement, because the two styles of attention require different trainable weights. So for the base class, we only need to focus on the call() function as the entry point.

There are two arguments that are really required:

  1. inputs: a list of query, value and key tensors
  2. mask: a list of query and value boolean tensors.

Notice that both call() arguments are lists. That’s why at line 72, it did a validation:

self._validate_call_args(inputs=inputs, mask=mask)

Between lines 111 and 131, it checks not only whether the arguments are lists, but also the number of tensors in the lists, to make sure a sufficient number of tensors is provided.

Lines 73 to 77 effectively take apart the inputs and assign them to the query, value and key variables, as well as the masks:

q = inputs[0]
v = inputs[1]
k = inputs[2] if len(inputs) > 2 else v
q_mask = mask[0] if mask else None
v_mask = mask[1] if mask else None

Notice that there is an if statement: if the key is missing, it simply uses the value tensor instead. That’s because in a lot of settings, value and key are the same.

Just to add some important notes:

The respective tensor shapes of these variables are defined as:

Query: [batch_size, query timesteps, query dimension]
Value: [batch_size, value timesteps, value dimension]
Key: [batch_size, key timesteps, key dimension]
query_mask: [batch_size, query timesteps]
value_mask: [batch_size, value timesteps]
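To make these shapes concrete, here is a minimal usage sketch with the built-in Luong-style layer (covered in more detail below); the batch size and timestep numbers are arbitrary, and the key is left out so it defaults to the value:

import tensorflow as tf

batch_size, Tq, Tv, dim = 2, 3, 5, 8
query = tf.random.normal((batch_size, Tq, dim))
value = tf.random.normal((batch_size, Tv, dim))
attention = tf.keras.layers.Attention()  # Luong/dot-product style
context = attention([query, value])      # key defaults to value
print(context.shape)                     # (2, 3, 8) -> [batch_size, Tq, dim]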

scores = self._calculate_scores(query=q, key=k)

Line 78 calls a function to calculate the energy/score. This function is not implemented in the base attention layer; because the two styles of Attention calculate the score differently, the base class leaves the implementation to the child classes.

Let’s assume the simplest case, where we do not provide a mask, and move on. At line 82, it checks whether a variable called “causal” is turned on.

The argument essentially ensures that future information does not flow towards the past, by applying a lower-triangular mask over the last two dimensions of the scores. Here is the official explanation:

causal: Boolean. Set to `True` for decoder self-attention. Adds a mask such that position `i` cannot attend to positions `j > i`. This prevents the flow of information from the future towards the past.
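To see what such a mask looks like, here is a quick sketch (assuming Tq = Tv = 3): position i may only attend to positions j <= i, i.e. the lower triangle.

import tensorflow as tf

# Lower-triangular causal mask for a 3x3 score matrix.
causal_mask = tf.linalg.band_part(tf.ones((3, 3)), -1, 0)
print(causal_mask.numpy())
# [[1. 0. 0.]
#  [1. 1. 0.]
#  [1. 1. 1.]]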

Let’s assume this is turned off and we are not using masks; we can now move to line 95.

Subsequently, it calls a function called _apply_scores to apply the calculated scores to the value:

result = self._apply_scores(scores=scores, value=v, scores_mask=scores_mask)

This leads us to line 44. Because we assume that we are not using any masks, we can skip line 63 to line 66.

Recall that once you have the scores, the first thing you do is apply a Softmax to them to create a distribution of attention weights.

This is what line 67 is doing:

attention_distribution = nn.softmax(scores)

The next step is to create a context vector C.

Line 68 is doing exactly that:

return math_ops.matmul(attention_distribution, value)

The value corresponds to hj in the equation, while αij corresponds to the attention_distribution variable that was just calculated.

This context tensor is what ends up being returned at line 100. Please note that earlier I talked about a context vector, while here I say context tensor. That’s because you have multiple target words; therefore, the 1D vector becomes a 2D tensor (per example in the batch).
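Putting lines 67 and 68 together, here is a minimal sketch with dummy tensors (the shapes are arbitrary) of what _apply_scores boils down to when no masks are used:

import tensorflow as tf

scores = tf.random.normal((2, 3, 5))                 # [batch_size, Tq, Tv]
value = tf.random.normal((2, 5, 8))                  # [batch_size, Tv, dim]
attention_distribution = tf.nn.softmax(scores)       # softmax over the last (Tv) axis
context = tf.matmul(attention_distribution, value)   # [batch_size, Tq, dim]
print(context.shape)                                 # (2, 3, 8)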

So that is basically the gist of the attention base class. Let’s look at the individual attention implementations.

Bahdanau-style attention

Earlier, you saw the score in Bahdanau’s paper was calculated as:
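As a reminder:

e_{ij} = a(s_{i-1}, h_j)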

a here is a latent variable. More specifically, the paper defined the score as:
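In the paper’s appendix, the alignment model a is parameterized as a single-hidden-layer feedforward network:

a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)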

In Tensorflow’s tutorial, the score is calculated as:

score = self.V(tf.nn.tanh(
    self.W1(query_with_time_axis) + self.W2(values)))

W1, W2 and V are all trainable latent variables.

The calculation matches the paper’s computation perfectly. However, in Tensorflow’s source code, as you can see at line 118, it is done very differently.

First, they reshape both the query and key tensors by expanding their respective dimensions (lines 109–113):

# Reshape tensors to enable broadcasting.
# Reshape into [batch_size, Tq, 1, dim].
q_reshaped = array_ops.expand_dims(query, axis=-2)
# Reshape into [batch_size, 1, Tv, dim].
k_reshaped = array_ops.expand_dims(key, axis=-3)

A separate trainable “scale” weight was added at line 421.

Then, what I am not sure about is how they calculate the score at lines 449–450:

return math_ops.reduce_sum(
    scale * math_ops.tanh(q_reshaped + k_reshaped), axis=-1)

To me, this looks different from the paper’s approach. Instead, they simply reshaped the two tensors and added them together.
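Shape-wise, here is a minimal sketch of that broadcasting with dummy tensors (the scale weight is left out, i.e. treated as 1):

import tensorflow as tf

query = tf.random.normal((2, 3, 8))          # [batch_size, Tq, dim]
key = tf.random.normal((2, 5, 8))            # [batch_size, Tv, dim]
q_reshaped = tf.expand_dims(query, axis=-2)  # [batch_size, Tq, 1, dim]
k_reshaped = tf.expand_dims(key, axis=-3)    # [batch_size, 1, Tv, dim]
# Broadcasting the addition yields [batch_size, Tq, Tv, dim];
# summing over the last axis gives scores of shape [batch_size, Tq, Tv].
scores = tf.reduce_sum(tf.tanh(q_reshaped + k_reshaped), axis=-1)
print(scores.shape)                          # (2, 3, 5)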

The documentation’s explanation:

1. Reshape `query` and `value` into shapes `[batch_size, Tq, 1, dim]` and `[batch_size, 1, Tv, dim]` respectively.

2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear sum: `scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)`

3. Use scores to calculate a distribution with shape `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`.

4. Use `distribution` to create a linear combination of `value` with shape `[batch_size, Tq, dim]`: `return tf.matmul(distribution, value)`.

I filed an issue on Github and hope to get an answer soon.

Anyway, once the score is calculated, the rest follows the same process in the call() function of the parent class.

Luong-style Attention

Earlier, we mentioned that the score can be calculated in different ways. One way is called dot-product:
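That is, in the earlier notation, the score is simply the dot product of the query (decoder state) and the key (encoder annotation):

e_{ij} = s_i^\top h_j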

That’s effectively how it was implemented in the Attention class.

Just like the other attention layer, it has a scale variable set up in build() at line 85:

if self.use_scale:
    self.scale = self.add_weight(
        name='scale',
        shape=(),
        initializer=init_ops.ones_initializer(),
        dtype=self.dtype,
        trainable=True)
else:
    self.scale = None

The score calculation is very straightforward, at lines 103–105:

scores = math_ops.matmul(query, key, transpose_b=True)
if self.scale is not None:
    scores *= self.scale
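A quick shape check of the dot-product score with dummy tensors:

import tensorflow as tf

query = tf.random.normal((2, 3, 8))               # [batch_size, Tq, dim]
key = tf.random.normal((2, 5, 8))                 # [batch_size, Tv, dim]
scores = tf.matmul(query, key, transpose_b=True)  # [batch_size, Tq, Tv]
print(scores.shape)                               # (2, 3, 5)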

Problem with the Keras Attention layers

It’s great that Tensorflow has a built-in attention layer. But per the official documentation:

This class is suitable for Dense or CNN networks, and not for RNN networks.

The problem is, Attention is widely used in NLP along with RNN.

The second biggest problem I have is here:

query: Query `Tensor` of shape `[batch_size, Tq, dim]`.

value: Value `Tensor` of shape `[batch_size, Tv, dim]`.

It requires the query and value (as well as key) tensors to have the same dimension. Think of the example we had of translating from English to Spanish: the embedding dimensions of these two languages could be very different. In theory, we could have the encoder output go through another dense layer to reshape the dimension, but that is just an unnecessary extra step.

So what’s the alternative then?

Tensorflow’s tutorial actually implements a much better Attention layer, which allows you to specify the dimension you want to map the query and key/value to. Although the score calculation uses Bahdanau’s style, it is not that difficult to change it to Luong’s style.
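For reference, here is a sketch along the lines of the tutorial’s layer (the W1/W2/V names follow the tutorial snippet quoted earlier; units is the dimension both the query and the values get projected to, so their original dimensions do not need to match):

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the query (decoder hidden state)
        self.W2 = tf.keras.layers.Dense(units)  # projects the values (encoder annotations)
        self.V = tf.keras.layers.Dense(1)       # maps the tanh output to a scalar score

    def call(self, query, values):
        # query: [batch_size, hidden size] -> [batch_size, 1, hidden size]
        query_with_time_axis = tf.expand_dims(query, 1)
        # score: [batch_size, Tv, 1]
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # attention weights over the encoder timesteps
        attention_weights = tf.nn.softmax(score, axis=1)
        # context vector: [batch_size, hidden size]
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights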

So with all that, good luck with your Attention journey. It is one of the most important topics to explore if you are interested in Seq2Seq.


Jia Chen

Partner at Softmax Data, a Machine Learning Advisory (NLP, Data Science, BI and Advanced Analytics); jia [atsign] softmaxdata.com