Transformer… Transformer…

Neural Machine Translation (NMT) is a recently proposed machine learning task in which a single, large neural network is built and trained to read a sentence and output a correct translation. Previous state-of-the-art methods use Recurrent Neural Networks and LSTM architectures to model long sequences. However, the recurrent nature of these methods prevents parallelization within training examples, which in turn leads to longer training times. Vaswani et al. 2017 propose a novel technique, the Transformer, that relies entirely on the Attention Mechanism to model long sequences, so it can be parallelized and trained more quickly.

This article introduces and explains the Transformer architecture, published in Vaswani et al. If you have not yet read the paper, I strongly advise it, as it is very interesting and influential. Its contents are explained here though, so read on! I also add the code for the different layers; you can check the GitHub repo for the complete code. Another great source is the Annotated Transformer, which explains the code and architecture together.

Introduction: The great idea behind Transformer

The Transformer architecture is a Seq2Seq model proposed for translation. Ideally, we input a sentence in a source language and output the translation in the target language. What does such a model need to do? It needs to be able to map individual tokens from the source language to the target language, it needs to know the order of the input tokens, and it needs to remember dependencies between tokens.

Let’s see one example for each of these conditions:

  • The first one is trivial to understand: if our translation model cannot translate individual tokens, it will not work with longer input sentences.
  • Let’s assume that the model is able to find the right mapping from each source language token to the target language. Would this be good enough?

The cat drank the milk.
The milk drank the cat.

The order of the words is essential for a good translation model, both relative and absolute: we wish to differentiate between the first and the last word of a sentence, but we equally want to know whether the 5th word is the 5th word of a 10-token sentence or of a 5-token sentence!

Furthermore, when the model sees things like “la femme jolie est rentrée” (“the woman beautiful has entered” in mirror translation), it should understand that the order of the words might not be the same in the source and target language (as in French we place the adjective after the noun but not in English). For this, the model should understand which word from the source language is important for the translation of the next word in the target language.

  • When translating from or to a language that uses gender-specific nouns, it is important that the model understands the relationships between the words. Consider for instance the sentence:

                                   She sat down in the chair, it was very comfortable.

Now if we translate this sentence into French, the noun “chair” is feminine (“la chaise”) so when translating “it”, the model should understand that it is referring to the chair and therefore translate it to “elle”.

This is just one example and there are a lot more, but I think you get the idea! Such a translation model has to be aware of the mappings of words from the source language to the target language, the order of the words, and the dependencies between them, even long-term dependencies!

Seq2Seq models are good at taking a variable-length input and projecting it to a hidden representation, a vector of fixed size. This is extremely useful: by taking an arbitrary-length sentence, we are able to create a fixed-dimensional representation of the whole sentence. This is the Encoder part. Next, our decoder will “decode” the hidden representation of our sentence into the sentence in the target language.


The really cool thing about Seq2Seq models is their ability to take variable-length inputs and project them to a lower-dimensional vector. This will be the input to the Decoder, and the Decoder will learn everything from this lower-dimensional vector. Almost? Wait, let’s take a look. The encoder reads the source sentence word by word: the first word, the second, etc. The output of each layer (represented by the rectangles; this could be an RNN/LSTM or any other defined structure) is a hidden vector of that input. This is fed into the next layer, together with the next word. These two inputs create the next hidden state, which is in turn fed into the next layer, etc. Then the decoder takes the hidden representation of the sentence (the last hidden state) and the End-Of-Sequence (EOS) token as input, and starts to decode the hidden representation. It returns the first translated word and creates a new hidden state that is fed into the next layer, and so forth.

We use all the information from the input sentence, right? Well, yes, but there is something we could reuse in our architecture. You see, the hidden representation of the first word is fed into the next layer in order to create the next hidden representation (together with the second word). After this step, however, we don’t reuse the hidden representation of the first word. We could save it, maybe transform it, and feed it into the model when we translate the first word. It would actually make sense! But wait, what happens if the first translated word should not be the translation of the first source language word, because the two languages are constructed differently and their word orders are not aligned? Well, in this case we could reuse the hidden representations of all words from the source sentence, maybe transform them, tell the model which hidden representation is the most relevant, and feed that to each layer when outputting the translated words!


In this case the first decoder layer could have the hidden representation of the whole sentence but also the hidden representation of the words! And this is how the idea of attention mechanism was born. 🙂

Before the Transformer model, dependencies between sequence elements were encoded by forms of RNN and LSTM. These models managed to achieve reasonably good results, however, there is a catch. They are not parallelizable due to their recurrent structure, and they are slow to train. LSTMs handle long-term dependencies better than RNNs, but they are more complex and difficult to optimise. It’s funny how sometimes it is exactly the foundation of a model, a seemingly brilliant idea that previously achieved SOTA results, that becomes a source of inefficiency. And in these cases all we should do is… focus our attention elsewhere… and find a structure that captures similar dependencies without the painfully complicated recurrent architecture.

Attention, attention… but for the right thing

Attention is one of the most influential ideas in today’s DL. It was introduced by Bahdanau et al., 2015, and although it was initially designed in the context of Neural Machine Translation, its current applications range from computer vision and captioning to NLP tasks. Vaswani et al. propose a non-recurrent model that can create dependencies between words. How? By building heavily on the Attention Mechanism in their Seq2Seq model.

So what is the Attention Mechanism? Intuitively, think about the attention mechanism as importance weights. Let’s say you want to predict the word “Teddy” in the sentence: “I like my Teddy bear”. How could we start this?

Imagine that you already have the lower-dimensional representations of “I”, “like”, “my” and “bear”, and imagine that somehow you know how these words correlate with the word you wish to predict. “bear” is very important and correlates a lot with the predicted word. “my” is important too, while “I” is less important but correlates strongly with the word “like”, and so on. One way to predict the word “Teddy” would then be to take the representations of all other tokens and compute their weighted average, weighting them by their importance towards the predicted token. I think this is an intuitive and easy way to think about the attention mechanism.
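To make the “weighted average” idea concrete, here is a tiny sketch in plain Python. The word vectors and the importance weights are made-up numbers chosen only for illustration (nothing here is learned), and the helper name weighted_average is my own.

```python
# Toy 3-dimensional word representations (made-up numbers, not learned).
context = {
    "I":    [0.1, 0.0, 0.2],
    "like": [0.0, 0.3, 0.1],
    "my":   [0.4, 0.1, 0.0],
    "bear": [0.9, 0.8, 0.7],
}

# Hypothetical importance weights towards the word to predict; they sum to 1.
weights = {"I": 0.05, "like": 0.05, "my": 0.2, "bear": 0.7}


def weighted_average(vectors, weights):
    """Return sum_w weights[w] * vectors[w], computed dimension by dimension."""
    dim = len(next(iter(vectors.values())))
    result = [0.0] * dim
    for word, vec in vectors.items():
        for d in range(dim):
            result[d] += weights[word] * vec[d]
    return result


summary = weighted_average(context, weights)
print([round(v, 3) for v in summary])
```

Because “bear” carries most of the weight, the resulting vector stays close to the representation of “bear”. Attention does the same thing, except that the weights are computed by the model instead of being written by hand.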

Another way to think about the Attention mechanism is with the translation task. We have the input sequence of words in the source language and we translate it to the target language. Let’s assume that the order of the words is not the same in both languages: for instance, our French input sentence contains “une fille gentille” (“one girl kind” in mirror translation), which should be translated to “a kind girl”. Since the order of the corresponding tokens is not aligned, the translated word “girl” should attend/listen to the word “fille”, “a” should attend to “une” and “kind” should consider the word “gentille” the most. How can we achieve that? With the Attention Mechanism. The Attention Mechanism is a way to learn these importance weights and tell our model which element it should “listen” or “attend” to. Let’s do a little math as well, because it helps the intuition.

Attention, with the maths

First, let’s define our input as x = [x_1, x_2, .. x_n] and the output as y = [y_1, y_2, .. y_m] (where n ∈ ℕ, m ∈ ℕ). The Seq2Seq model reads the x values sequentially and creates a hidden representation for each word, h = [h_1, h_2, .. h_n]. This is the encoder part. This hidden representation is then fed into the decoder, which generates the y values.

Now I introduce the original idea of attention. Suppose we wish to translate the t-th word, y_t. We will reuse the hidden representation of each word in the source language (that is, h_i where i ∈ [1, n]). The idea is to create a context vector, computed as the weighted average of the hidden representations (h_i) of each word, weighted by α, which expresses the attention, or degree of alignment, between the source words and the predicted word. How can we compute these alignment weights? We use a score between the decoder’s hidden state (s_t) and the hidden representation of each source word. Intuitively, the decoder creates a hidden state, and we wish to know which input words are aligned with (match) this hidden state.

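$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i, \quad \alpha_{t,i} = \frac{\exp(\mathrm{score}(s_t, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(s_t, h_j))}$$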

Notice that the relations between a source word and a target word are now no longer defined by their relative distance, since each target word receives a context calculated from all source words’ hidden representations.
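The context-vector computation described above can be sketched in a few lines of plain Python. Take it as an illustration under assumptions: I use a plain dot product as the score (other score functions exist), a softmax to turn scores into weights, and toy values for the encoder states h_i and the decoder state s_t. The function names are mine.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_vector(s_t, hidden_states):
    """Return (alpha, c_t): alignment weights and the weighted average of h_i."""
    # Assumption: a plain dot product as the score; other score functions exist.
    scores = [dot(s_t, h_i) for h_i in hidden_states]
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    c_t = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    return alpha, c_t


# Toy encoder states h_1..h_3 and a decoder state s_t (made-up values).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [2.0, 0.0]
alpha, c_t = context_vector(s_t, h)
print([round(a, 3) for a in alpha], [round(c, 3) for c in c_t])
```

The first and third source states match the decoder state best, so they receive the largest weights, no matter where they sit in the sentence.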

There is now a whole family of attention mechanisms; check this great blog for more info on them. I will focus here on the attention mechanisms used in Vaswani et al. 2017.

The Transformer model is built entirely on the self-attention mechanism. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence.

The Transformer model defines the attention mechanism as a mapping from a query (Q) and a set of key (K), value (V) pairs to an output. All three, the query, the key and the value, are vectors. The Scaled Dot-Product Attention is then defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

This means that we take the dot product of the query and key vectors, scale it by the square root of the dimension of the key vector, compute its softmax and then multiply the result with the value.
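Why scale by the square root of the key dimension? The paper’s argument is that for large d_k the dot products grow large and push the softmax into a region where it is nearly one-hot, which gives tiny gradients. Here is a small experiment, assuming random toy vectors with unit-variance components (the seed and sizes are arbitrary), that compares the attention weights with and without scaling.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d_k = 64
# One query and five keys with unit-variance random components (toy data).
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(5)]

raw_scores = [dot(q, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print(max(unscaled_weights), max(scaled_weights))
```

Dividing all scores by the same positive constant larger than one can only flatten the softmax, so the largest scaled weight is never bigger than the largest unscaled one.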

The paper also introduces the Multi-Head attention mechanism, which I describe below. This section only serves to give a short intuition of what attention really is. So in summary, we can think of attention as an importance weighting mechanism. We identify the elements which the currently predicted token should “attend to”, and consider them with a higher weight than other elements. There are different formulations of the attention mechanism, depending on which alignment and compatibility score we use, on which elements can be attended to (self-attention) and on how exactly we compute it (weighted or not, etc).

Transformer Architecture


As already explained, the Transformer follows a Sequence to Sequence (Seq2Seq) architecture (Encoder-Decoder framework). So first I explain the encoder-decoder framework, then each layer used in the encoder and decoder units.

Encoder-decoder framework

Sequence to sequence learning takes a source sequence (a sentence for instance) as input and maps this sequence to a vector of fixed dimensionality. This is the encoder part. Then another model, jointly trained with the encoder, decodes the vector and outputs the translated sequence. More precisely, the encoder maps the input sequence (x_1, x_2, … , x_n) to a continuous hidden representation, h = (h_1, h_2, … h_n), and the decoder generates the output sequence (y_1, y_2, …, y_m) one element at a time in an auto-regressive fashion, taking the previously generated elements as an additional input to generate a new one. The Transformer employs this architecture with six identical layers forming the Encoder unit, and six identical Decoder layers forming the Decoder unit. Both the Encoder and Decoder layers are equipped with stacked self-attention and fully connected layers.
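The auto-regressive part can be sketched as a simple greedy loop. This is only an illustration: next_token_scores stands for a trained decoder and is hypothetical, and toy_scores is a stand-in that just replays a fixed translation so that the loop can run on its own.

```python
def greedy_decode(next_token_scores, encoded_source, bos, eos, max_len=10):
    """Generate tokens one at a time, feeding back everything generated so far."""
    output = [bos]
    for _ in range(max_len):
        scores = next_token_scores(encoded_source, output)  # dict: token -> score
        best = max(scores, key=scores.get)
        output.append(best)
        if best == eos:
            break
    return output[1:]


# Hypothetical stand-in for a trained decoder: it just replays a fixed translation.
def toy_scores(encoded_source, generated):
    target = ["the", "cat", "drank", "the", "milk", "<eos>"]
    step = len(generated) - 1  # number of real tokens produced so far
    vocab = set(target)
    return {tok: (1.0 if tok == target[step] else 0.0) for tok in vocab}


translation = greedy_decode(toy_scores, encoded_source=None, bos="<bos>", eos="<eos>")
print(translation)
```

Each step sees the whole prefix generated so far, which is exactly why generation has to proceed one token at a time.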

[Figure: the complete Transformer architecture]

The figure shows the complete Transformer architecture. The input sequence is fed into the embedding layer, then the tokens’ embeddings are passed to the encoder. These embeddings have a fixed dimension of 512. The encoder is composed of six identical layers, and each encoder layer consists of two sublayers, a multi-head attention layer and a feed forward layer. Both of these sublayers benefit from residual connections and normalization. The embedding and the sublayers all have a dimension of 512. Then, after N=6 stacked encoding layers, the encoder’s output is fed into the decoder. The decoder also consists of six identical decoder layers, each containing three sublayers. In addition to the multi-head attention and feed forward layers, there is another multi-head attention layer in the decoder, which performs multi-head attention over the encoder’s output. As in the encoder, each sublayer employs a residual connection followed by a normalization layer. Finally, the self-attention layer is modified in the decoder: the subsequent positions are masked, ensuring that predictions for position i only depend on the known outputs for positions [0, i − 1]. The output of the decoder is fed into a linear layer with a softmax activation function, and the output is the predicted probabilities of the target language’s words. In what follows I explain each layer in more detail.
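The masking of subsequent positions is easy to picture as a lower-triangular matrix. Below is a minimal sketch in plain Python with helper names of my own. Since the decoder input is the target sequence shifted right by one position, letting row i look at columns up to i of that shifted input means it only sees outputs that were produced before position i. Filling masked scores with -1e9 before the softmax is the same trick the attention code in this post uses.

```python
def subsequent_mask(size):
    """mask[i][j] is 1 if position i may attend to position j, else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


def apply_mask(scores, mask, neg=-1e9):
    """Replace masked-out scores by a large negative number before the softmax."""
    return [
        [s if keep else neg for s, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


mask = subsequent_mask(4)
for row in mask:
    print(row)
```

After the softmax, the masked positions get a weight that is practically zero, so no information flows from later words to earlier ones.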

Embedding layer with positional encoding

Word embedding is a low-dimensional word-representation technique that allows words with similar meanings to have similar embeddings. The Transformer model uses learned word embeddings to convert the input tokens to 512-dimensional vectors. As the model does not contain any recurrence, positional encodings are added to the embeddings, containing information about the order of the sequence. “d_model” is the dimension of the model (512), “pos” is the position of the token in the sequence, while “i” indexes the embedding dimensions (dimensions 2i and 2i+1 of the 512-dimensional vector).

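$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$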

Then the embeddings and the positional encodings are summed, as the following figure shows.


Why do we need positional encodings in the Transformer architecture? Notice that previous Neural Machine Translation models did not use positional encodings because the words were fed sequentially into the model. The hidden representation of the first word was combined with the embedding of the second word in order to form the second word’s hidden representation. Therefore, there was an order indicated by the sequence of words, i.e. the 𝑛-th word is fed at step 𝑛, which helps the model incorporate the order of words.

Here, however, there is no notion of word order (1st word, 2nd word, ..) at the embedding stage. All words of the input sequence are fed into the model at the same time, with no special ordering. Once they are embedded, the attention mechanism and the feed-forward computations (explained below) are applied to them in parallel. Thus, the model has no idea how the words are ordered. Consequently, a position-dependent signal needs to be added to each word embedding so that the model can incorporate the order of words. This addition does not destroy the embedding information, while adding the vital position information.
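Here is a small experiment that shows the problem, under simplifying assumptions: a bare self-attention with Q = K = V equal to the embeddings (no learned projections, a single head) and made-up 2-dimensional embeddings. Without a positional signal, “the cat drank the milk” and “the milk drank the cat” give every word exactly the same representation.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Bare self-attention: Q = K = V = x, no learned projections (a simplification)."""
    d_k = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, x)) for d in range(d_k)])
    return out


# Toy embeddings for "the cat drank the milk" without any positional signal.
emb = {"the": [0.1, 0.2], "cat": [0.9, 0.1], "drank": [0.3, 0.8], "milk": [0.5, 0.5]}
s1 = ["the", "cat", "drank", "the", "milk"]
s2 = ["the", "milk", "drank", "the", "cat"]

out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))
# Each word gets (numerically) the same representation in both sentences.
same = all(
    all(abs(a - b) < 1e-9 for a, b in zip(out1[w], out2[w])) for w in emb
)
print(same)
```

So for this bare attention the sentence is just a bag of words, and who drank what is lost. The positional encoding is what breaks this symmetry.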

I found it hard to understand why sine and cosine functions are used to embed the position of each word, so I detail an example below that may provide some intuition.

Let’s take the example of a word x at position pos ∈ [0, L-1], where L is the sequence length. Let’s also assume that the dimension of the embeddings is 512 (as in the original paper) and that the embedding vector of word x is e_w. Then, the positional encoding of the word x is also a 512-dimensional vector, which starts as:

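$$PE(pos) = \left[\sin\left(\frac{pos}{10000^{0/512}}\right), \cos\left(\frac{pos}{10000^{0/512}}\right), \sin\left(\frac{pos}{10000^{2/512}}\right), \cos\left(\frac{pos}{10000^{2/512}}\right), \ldots\right]$$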

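Let’s say that the position of word x is 1; in this case the positional encoding becomes:

$$PE(1) = \left[\sin\left(\frac{1}{10000^{0/512}}\right), \cos\left(\frac{1}{10000^{0/512}}\right), \sin\left(\frac{1}{10000^{2/512}}\right), \cos\left(\frac{1}{10000^{2/512}}\right), \ldots\right]$$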

Once the positional encoding is computed, it is added to the embedding vector of word x.

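$$e_w + PE(1) = e_w + \left[\sin\left(\frac{1}{10000^{0/512}}\right), \cos\left(\frac{1}{10000^{0/512}}\right), \sin\left(\frac{1}{10000^{2/512}}\right), \cos\left(\frac{1}{10000^{2/512}}\right), \ldots\right]$$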

Notice also that as the dimension of the model is 512, the value of i is actually less than 256, that is : i ∈ [0, 255].
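If you want to play with the numbers, here is a small plain-Python sketch of the sinusoidal encoding for a single position. I use d_model = 8 instead of 512 only to keep the output readable; the function name is mine.

```python
import math


def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of position pos."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimensions use the sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use the cosine
    return pe


pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
print([round(v, 4) for v in pe1])
```

Position 0 always gives the pattern 0, 1, 0, 1, … since sin(0) = 0 and cos(0) = 1, and every value stays between -1 and 1, whatever the position.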

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # map token ids to d_model-dimensional vectors
        return self.embed(x)


class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len=200, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)
        # create constant 'pe' matrix with values dependent on
        # pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            # i is the even dimension index (the 2i of the formula),
            # so the exponent is i / d_model for both sine and cosine
            for i in range(0, d_model, 2):
                angle = pos / (10000 ** (i / d_model))
                pe[pos, i] = math.sin(angle)
                pe[pos, i + 1] = math.cos(angle)
        pe = pe.unsqueeze(0)
        # a buffer is saved with the module but is not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # make embeddings relatively larger
        x = x * math.sqrt(self.d_model)
        # add constant to embedding; the buffer lives on the module's device
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return self.dropout(x)

Multi-Head attention layer

We can think of the attention mechanism as a mapping, where we map a query (Q) and a set of key-value pairs (K, V) to an output. The query, key and value are all vectors. The output is then computed as the weighted sum of the values, weighted by a compatibility function applied to the key and query. The compatibility function shows the relevance of the key to the given query.

As already mentioned, Vaswani et al. 2017 propose the Scaled Dot-Product Attention. The query, key and value are all vectors, with queries and keys of dimension d_k and values of dimension d_v:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

This means that we take the dot product of the query and key vectors, scale it by the square root of the dimension of the key vector, compute its softmax and then multiply the result with the value. In practice the model adds a dropout layer as well. The next image shows how the scaled dot-product attention is computed:



def dot_scaled_product_attention(q, k, v, d_k, mask=None, dropout=None):
    scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)
    if mask is not None:
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, -1e9)
    scores = F.softmax(scores, dim=-1)
    if dropout is not None:
        scores = dropout(scores)
    output = torch.matmul(scores, v)
    return output

The Multi-Head Self-Attention mechanism runs through this Scaled Dot-Product Attention multiple times in parallel. Why would this be a good idea? We can easily imagine that a single Scaled Dot-Product Attention captures one aspect of a sentence. This might not cover all the aspects we are interested in. Hence, by applying the Scaled Dot-Product Attention to several linearly projected versions of the queries, keys and values, then concatenating the results and weighting them by a matrix W, we might capture information from different representation subspaces. Another intuitive explanation of why it helps to compute the attention several times is that ensembling often helps. The figure below shows the Multi-Head Self-Attention mechanism, while the following equation presents the mathematical formulation.

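$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_h] W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

with $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ all matrices to be learned, with dimensions $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. [ ; ] stands for concatenating the elements.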

[Figure: the Multi-Head Self-Attention mechanism]
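In code, the heads are usually not computed with h separate small matrices. A common trick, and the one the implementation in this post relies on with its view(bs, -1, h, d_k) call, is to project once to d_model dimensions and then cut the result into h chunks of size d_k = d_model / h. Here is that bookkeeping alone, sketched in plain Python with the paper’s sizes (d_model = 512, h = 8); the helper names are mine and the “projected query” is just a list of numbers.

```python
def split_heads(vector, heads):
    """Cut one d_model-dimensional vector into `heads` chunks of size d_k."""
    d_model = len(vector)
    assert d_model % heads == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // heads
    return [vector[i * d_k:(i + 1) * d_k] for i in range(heads)]


def concat_heads(chunks):
    """Concatenate per-head outputs back into one d_model-dimensional vector."""
    return [value for chunk in chunks for value in chunk]


d_model, heads = 512, 8
projected = [float(i) for i in range(d_model)]  # stand-in for a projected query
chunks = split_heads(projected, heads)
print(len(chunks), len(chunks[0]))
restored = concat_heads(chunks)
```

Each chunk of size 64 goes through its own Scaled Dot-Product Attention, and the concatenated result has the original 512 dimensions again before the final linear layer.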

class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        bs = q.size(0)
        # perform linear operation and split into h heads
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        # transpose to get dimensions bs * h * sl * d_k
        k = k.transpose(1, 2)
        q = q.transpose(1, 2)
        v = v.transpose(1, 2)

        # calculate attention using the function defined above
        scores = dot_scaled_product_attention(q, k, v, self.d_k, mask, self.dropout)
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1, 2).contiguous().view(bs, -1, self.d_model)
        output = self.out(concat)
        return output
I already mentioned that Q, K, V are all vectors, however, I did not mention which vectors. Indeed, this is the difference between the encoder’s and decoder’s attention layers.
  • In the encoder, all Q, K, V vectors are derived from the previous encoder hidden state, thus Q, K and V come from the same source. This self-attention mechanism therefore creates token representations that are informed by all other tokens. This can be done in parallel.
  • Similarly, the Masked Multi-Head layer in the decoder stack takes the previous decoder states as the Q, K and V vectors. This time, however, only the previously predicted words’ representations can be part of these vectors, hence the rest is masked. When generating a translation, this can no longer be computed in parallel, since each word depends on the previously predicted ones.
  • In the decoder layers, the second Multi-Head Attention Layer takes the previous decoder hidden state as the Q, while its K and V vectors are the states outputted by the encoder.

Normalization layer

Once the data has been embedded and run through the Multi-Head Attention Layers, it is fed into the Normalization Layers, which are wrapped in Residual Connections. The Normalization Layer applies a normalization method that normalizes the activations in a network across features, preventing large changes in the values of the neurons. It computes the mean (μ) and variance (σ²) from all of the summed inputs, that is, it computes the normalization statistics over all hidden units of the layer for each input, then applies the same normalization to all neurons in the layer.

As a consequence, the normalization statistics are independent of other examples and each input has a different normalization. The advantages of Layer normalization are that it allows the use of arbitrary batch sizes and that the normalization used during training is the same as at evaluation time. The Normalization layer is wrapped in Residual Connections. They allow for a connection between the output of the previous layer and the Normalization layer, making the optimization of the network easier by preventing exploding or vanishing gradients. The code of the Normalization layer is shown below.


class Norm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.size = d_model
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps

    def forward(self, x):
        # normalise over the last (feature) dimension, then rescale and shift
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) \
            / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm
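To see what this normalization does to a single vector, here is a small plain-Python sketch, together with a toy residual step. A few assumptions: I leave out the learnable alpha and bias, I use the population variance (dividing by the number of features), whereas torch’s std divides by n - 1 by default, so the numbers differ slightly from the Norm class above, and the “sublayer” is a toy function that just halves its input.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (roughly) unit standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # population variance (assumption)
    return [(v - mean) / (math.sqrt(var) + eps) for v in x]


def residual(x, sublayer):
    """Residual connection: add the sublayer's output to its own input."""
    return [a + b for a, b in zip(x, sublayer(x))]


x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
print([round(v, 3) for v in normed])

# Pre-norm residual step in the style of the layers built in this post:
# x + sublayer(norm(x)), with a toy sublayer that halves its input.
y = residual(x, lambda v: [0.5 * t for t in layer_norm(v)])
```

Note that the statistics come from the vector itself, so the result does not depend on any other example in the batch.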

Feed forward layer

Once the input has passed through the Normalization layer, it goes through a simple FeedForward layer. This layer consists of two linear operations, with a ReLU and a dropout operation between them. The ReLU function is a non-linear transformation of the input, while the dropout [28] is added in order to prevent overfitting.

$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$$

The decoder’s Feed Forward layer is followed by another Linear layer with a softmax activation function. The output of this layer is the probability of each word in the target language.

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # expand to d_ff, apply ReLU and dropout, then project back to d_model
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x
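For the curious, here is the feed forward computation worked out by hand on a tiny example in plain Python. The sizes (d_model = 2, d_ff = 3) and the weights are made up so that you can check the arithmetic yourself, and I leave out the dropout.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a weight given as rows."""
    cols = len(bias)
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(cols)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to a single position."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Toy sizes: d_model = 2, d_ff = 3 (the paper uses 512 and 2048). Made-up weights.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

out = feed_forward([1.0, 1.0], w1, b1, w2, b2)
print(out)
```

The first linear layer gives [1, 1, -0.5], the ReLU zeroes the negative entry, and the second layer maps the result back to two dimensions. The same weights are applied to every position of the sequence independently.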

These layers conclude the Transformer. We can now construct the encoder and decoder layers as:

class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.attn = MultiHeadAttention(heads, d_model, dropout=dropout)
        self.ff = FeedForward(d_model, dropout=dropout)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        # sublayer 1: self-attention with a residual connection
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))
        # sublayer 2: feed forward with a residual connection
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x


class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        self.attn_1 = MultiHeadAttention(heads, d_model, dropout=dropout)
        self.attn_2 = MultiHeadAttention(heads, d_model, dropout=dropout)
        self.ff = FeedForward(d_model, dropout=dropout)

    def forward(self, x, e_outputs, src_mask, trg_mask):
        # sublayer 1: masked self-attention over the target sequence
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        # sublayer 2: attention over the encoder outputs (Q from the decoder)
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))
        # sublayer 3: feed forward
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x

And continue by defining the complete encoder and decoder unit as

import copy


def get_clones(module, N):
    # helper not shown in the post (assumed definition): N independent copies of a layer
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N  # we will have the encoder layer N times
        self.embed = Embedder(vocab_size, d_model)  # embedding layer
        # positional encoder layer (attribute name assumed, it was lost in the listing)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(EncoderLayer(d_model, heads, dropout), N)  # N encoder layers
        self.norm = Norm(d_model)  # normalization

    def forward(self, src, mask):
        x = self.embed(src)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, mask)
        return self.norm(x)


class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(DecoderLayer(d_model, heads, dropout), N)
        self.norm = Norm(d_model)

    def forward(self, trg, e_outputs, src_mask, trg_mask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, e_outputs, src_mask, trg_mask)
        return self.norm(x)

And finally by defining the model:

class Transformer(nn.Module):
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads, dropout):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)
        self.decoder = Decoder(trg_vocab, d_model, N, heads, dropout)
        # final linear layer: one score per word of the target vocabulary
        self.out = nn.Linear(d_model, trg_vocab)

    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder(src, src_mask)
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        output = self.out(d_output)
        return output

Voilà voilà, our Transformer in code!! Great job reading the post, any comments are welcome! I trained a French-to-English and an English-to-French model for 20 and 100 epochs respectively; you can see the complete code on my GitHub!

A Bientot 🙂


