Transformer (machine learning model)


The Transformer is a deep learning model introduced in 2017, used primarily in the field of natural language processing (NLP).
Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Because of this, the Transformer allows for much more parallelization than RNNs and therefore shorter training times.
Since their introduction, Transformers have become the model of choice for many problems in NLP, replacing older recurrent neural network models such as long short-term memory (LSTM). Because the Transformer facilitates more parallelization during training, it has enabled training on larger datasets than was previously possible. This has led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which are trained on very large general-language datasets and can be fine-tuned for specific language tasks.

Background

Before the introduction of Transformers, most state-of-the-art NLP systems relied on gated recurrent neural networks, such as LSTMs and gated recurrent units (GRUs), with added attention mechanisms. The Transformer built upon these attention mechanisms without using an RNN structure, showing that attention alone, without recurrent sequential processing, is powerful enough to achieve the performance of RNNs with attention.
Gated RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen after every token. To process the $n$th token, the model combines the state representing the sentence up to token $n-1$ with the information of the new token to create a new state, representing the sentence up to token $n$. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode information about the token. But in practice this mechanism is imperfect: due in part to the vanishing gradient problem, the model's state at the end of a long sentence often does not contain precise, extractable information about early tokens.
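The sequential state update described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses a plain tanh recurrence rather than an LSTM or GRU, and the names `rnn_step`, `run_rnn`, and the toy weights are hypothetical.

```python
import math

def rnn_step(state, token_vec, w_state, w_input):
    """One update: new_state = tanh(W_state @ state + W_input @ token_vec)."""
    new_state = []
    for row_s, row_x in zip(w_state, w_input):
        total = sum(ws * s for ws, s in zip(row_s, state))
        total += sum(wx * x for wx, x in zip(row_x, token_vec))
        new_state.append(math.tanh(total))
    return new_state

def run_rnn(tokens, w_state, w_input, state_size):
    """Process tokens strictly in order; each state needs the previous one."""
    state = [0.0] * state_size
    states = []
    for token_vec in tokens:
        state = rnn_step(state, token_vec, w_state, w_input)
        states.append(state)
    return states

# Hypothetical toy weights and 2-dimensional token vectors.
W_STATE = [[0.5, 0.0], [0.0, 0.5]]
W_INPUT = [[1.0, 0.0], [0.0, 1.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
STATES = run_rnn(TOKENS, W_STATE, W_INPUT, state_size=2)
```

In this toy setup the first component of the state, which only the first token writes to, shrinks at every later step. The loop also cannot be parallelized across tokens, because step $n$ needs the state from step $n-1$.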
This problem was addressed by the introduction of attention mechanisms. Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens. A clear example of the utility of attention is in translation. In an English-to-French translation system, the first word of the French output most probably depends heavily on the beginning of the English input. However, in a classic encoder-decoder LSTM model, in order to produce the first word of the French output the model is only given the state vector of the last English word. Theoretically, this vector can encode information about the whole English sentence, giving the model all necessary knowledge, but in practice this information is often not well preserved. If an attention mechanism is introduced, the model can instead learn to attend to the states of early English tokens when producing the beginning of the French output, giving it a much better concept of what it is translating.
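As a rough sketch of the idea, the snippet below lets a single query vector attend over a list of stored encoder states. It assumes a plain dot product as the relevance score, whereas the text only says the measure is learned; `attend`, `ENCODER_STATES`, and `QUERY` are hypothetical names and values.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return (weights, context) for one query over all stored states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    size = len(states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, states))
        for d in range(size)
    ]
    return weights, context

# Hypothetical states for three source tokens; the query resembles the first.
ENCODER_STATES = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.5]]
QUERY = [1.0, 0.0]
WEIGHTS, CONTEXT = attend(QUERY, ENCODER_STATES)
```

Here most of the weight falls on the first state, so the context vector is drawn mainly from the first source token regardless of how far back it lies.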
When added to RNNs, attention mechanisms led to large gains in performance. The introduction of the Transformer brought to light the fact that attention mechanisms were powerful in themselves, and that sequential recurrent processing of data was not necessary for achieving the performance gains of RNNs with attention. The Transformer uses an attention mechanism without being an RNN, processing all tokens at the same time and calculating attention weights between them. The fact that Transformers do not rely on sequential processing, and lend themselves very easily to parallelization, allows Transformers to be trained more efficiently on larger datasets.

Architecture

Like earlier sequence-to-sequence models, the Transformer uses an encoder-decoder architecture. The encoder consists of a stack of encoding layers that process the input iteratively, one layer after another, and the decoder consists of a stack of decoding layers that do the same to the output of the encoder.
The function of each encoder layer is to process its input to generate encodings that contain information about which parts of the input are relevant to each other. It passes its set of encodings to the next encoder layer as inputs. Each decoder layer does the opposite, taking all the encodings and processing them, using their incorporated contextual information to generate an output sequence. To achieve this, each encoder and decoder layer makes use of an attention mechanism which, for each input, weighs the relevance of every other input and draws information from them accordingly to produce the output. Each decoder layer also has an additional attention mechanism which draws information from the outputs of previous decoder layers, before the decoder layer draws information from the encodings. Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs, and contain residual connections and layer normalization steps.

Scaled dot-product attention

The basic building blocks of the Transformer are scaled dot-product attention units. When a sentence is passed into a Transformer model, attention weights are calculated between every pair of tokens simultaneously. The attention unit produces an embedding for every token in context that contains information not only about the token itself, but also about other relevant tokens, combined in proportion to their attention weights.
Concretely, for each attention unit the Transformer model learns three weight matrices: the query weights $W_Q$, the key weights $W_K$, and the value weights $W_V$. For each token $i$, the input word embedding $x_i$ is multiplied with each of the three weight matrices to produce a query vector $q_i = x_i W_Q$, a key vector $k_i = x_i W_K$, and a value vector $v_i = x_i W_V$. Attention weights are calculated using the query and key vectors: the attention weight $a_{ij}$ from token $i$ to token $j$ is the dot product between $q_i$ and $k_j$. The attention weights are divided by the square root of the dimension of the key vectors, $\sqrt{d_k}$, which stabilizes gradients during training, and passed through a softmax which normalizes the weights to sum to $1$. The fact that $W_Q$ and $W_K$ are different matrices allows attention to be non-symmetric: if token $i$ attends to token $j$ (that is, $q_i \cdot k_j$ is large), this does not necessarily mean that token $j$ will attend to token $i$ (that is, $q_j \cdot k_i$ is large). The output of the attention unit for token $i$ is the weighted sum of the value vectors of all tokens, weighted by $a_{ij}$, the attention from $i$ to each token.
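The attention calculation for all tokens can be expressed as one large matrix calculation, which is useful for training because optimized matrix operations are fast to compute. The matrices $Q$, $K$ and $V$ are defined as the matrices whose $i$th rows are the query, key, and value vectors $q_i$, $k_i$, and $v_i$ respectively. With $d_k$ the dimension of the key vectors, the calculation is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$, where the softmax is applied to each row.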
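The matrix form of this calculation can be sketched in plain Python as follows. The helper names and the toy weight matrices are hypothetical, chosen only so that the asymmetry between queries and keys is visible. A real implementation would use an optimized tensor library rather than nested lists.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Return (attention weights, outputs) for token embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])  # dimension of the key vectors
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return weights, matmul(weights, v)

# Hypothetical embeddings for three tokens and toy weight matrices.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[2.0, 0.0], [0.0, 0.0]]
W_K = [[0.0, 0.0], [1.0, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]
ATTN_WEIGHTS, ATTN_OUT = scaled_dot_product_attention(X, W_Q, W_K, W_V)
```

Because `W_Q` and `W_K` differ, the weight from token 0 to token 1 is not the same as the weight from token 1 to token 0, which is the asymmetry described above.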

Multi-head attention

One set of query, key, and value weight matrices is called an attention head, and each layer in a Transformer model has multiple attention heads. While one attention head attends to the tokens that are relevant to each token, with multiple attention heads the model can learn to do this for different definitions of "relevance". Research has shown that many attention heads in Transformers encode relevance relations that are transparent to humans. For example, there are attention heads that, for every token, attend mostly to the next word, or attention heads that mainly attend from verbs to their direct objects. Since Transformer models have multiple attention heads, they can capture many levels and types of relevance relations, from surface-level to semantic. The outputs of the individual heads in a multi-head attention layer are concatenated and passed into the feed-forward neural network layers.
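A minimal sketch of running several heads on the same input and concatenating their outputs is shown below. The function names and the two toy heads are hypothetical. The sketch stops at the concatenation described in the text and omits the extra learned output projection that implementations commonly apply afterwards.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v)

def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate outputs per token."""
    per_head = [attention_head(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    return [
        [value for head_out in per_head for value in head_out[i]]
        for i in range(len(x))
    ]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
# Two hypothetical heads with different query weights and value projections.
HEADS = [
    (IDENTITY, IDENTITY, [[1.0], [0.0]]),  # head 0 reads feature 0
    (SWAP, IDENTITY, [[0.0], [1.0]]),      # head 1 reads feature 1
]
MULTI_HEAD_OUT = multi_head_attention(X, HEADS)
```

Each head produces a one-dimensional output per token here, so the concatenated result has two values per token, one from each head's notion of relevance.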

Encoder

Each encoder consists of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism takes in a set of input encodings from the previous encoder and weighs their relevance to each other to generate a set of output encodings. The feed-forward neural network then further processes each output encoding individually. These output encodings are finally passed to the next encoder as its input, as well as to the decoders.
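The position-wise nature of the feed-forward step can be illustrated with the sketch below, which applies the same small network to each encoding separately and wraps it in a residual connection and layer normalization. The ReLU activation, the omission of biases and of the learned scale and shift in the normalization, and all names and weights are assumptions made for brevity.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def feed_forward(vec, w1, w2):
    """Two linear maps with a ReLU in between, applied to one encoding."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def ff_sublayer(encodings, w1, w2):
    """Residual connection plus layer norm around the feed-forward network."""
    out = []
    for vec in encodings:
        ff = feed_forward(vec, w1, w2)
        out.append(layer_norm([x + f for x, f in zip(vec, ff)]))
    return out

# Hypothetical weights: model dimension 3, hidden dimension 2.
W1 = [[1.0, -1.0], [0.5, 1.0], [0.0, 2.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
ENCODINGS = [[1.0, 2.0, 0.0], [3.0, -1.0, 0.5]]
FF_OUT = ff_sublayer(ENCODINGS, W1, W2)
```

Changing one token's encoding leaves the other tokens' outputs untouched in this step. Only the attention mechanism mixes information across positions.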
The first encoder takes positional information and embeddings of the input sequence as its input, rather than encodings. The positional information is necessary for the Transformer to make use of the order of the sequence, because no other part of the Transformer makes use of this.
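The text does not say how the positional information is computed. One common choice, assumed here, is the fixed sinusoidal encoding, which is added element-wise to the token embeddings. The function names below are hypothetical.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Build a table where even columns use sine and odd columns use cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional table to a list of token embeddings."""
    table = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, table)]

# The same (hypothetical) embedding placed at two different positions.
EMBEDDINGS = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
POSITIONED = add_positions(EMBEDDINGS)
```

Although both tokens have identical embeddings, their inputs to the first encoder differ after the positions are added, which is what lets the attention layers distinguish word order.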

Decoder

Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders.
Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. Because the Transformer must not use the current or future output to predict an output, the output sequence is partially masked to prevent this reverse information flow. The last decoder is followed by a final linear transformation and softmax layer, to produce the output probabilities over the vocabulary.
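The masking can be sketched as follows. The sketch assumes the decoder input is the output sequence shifted right by one position, so that letting position $i$ attend to positions up to and including $i$ never exposes the token being predicted. The names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Softmax over each row, with disallowed positions forced to weight 0."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # the diagonal is always allowed, so peak is finite
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical uniform attention scores for a three-token decoder input.
SCORES = [[0.0, 0.0, 0.0] for _ in range(3)]
CAUSAL_WEIGHTS = masked_softmax(SCORES, causal_mask(3))
```

With uniform scores, the first position attends only to itself, the second splits its weight over the first two positions, and only the last position sees the whole sequence.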

Training

Transformers typically undergo semi-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a much larger dataset than fine-tuning, due to the restricted availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:
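The Transformer model has been implemented in major deep learning frameworks such as TensorFlow and PyTorch. Below is pseudocode for an implementation of the Transformer variant known as the "vanilla" transformer. The helper functions and the hyperparameters d_m, n_enc_layers and n_dec_layers are assumed to be defined elsewhere, and the argument lists are a reconstruction, with multi_head_attention assumed to take (query, key, value, mask):

def vanilla_transformer(enc_inp, dec_inp):
    """Transformer variant known as the "vanilla" transformer."""
    # Encoder input: scaled embeddings plus positional information.
    x = embedding(enc_inp) * sqrt(d_m)
    x = x + pos_encoding(x)
    x = dropout(x)
    for _ in range(n_enc_layers):
        # Self-attention with a residual connection and layer normalization.
        attn = multi_head_attention(x, x, x, None)
        attn = dropout(attn)
        attn = layer_normalization(x + attn)

        # Position-wise feed-forward network, again with a residual connection.
        x = point_wise_ff(attn)
        x = layer_normalization(x + attn)

    # x is at this point the output of the encoder
    enc_out = x

    # Decoder input: scaled embeddings of the output sequence plus positions.
    x = embedding(dec_inp) * sqrt(d_m)
    x = x + pos_encoding(x)
    x = dropout(x)
    mask = causal_mask(x)
    for _ in range(n_dec_layers):
        # Masked self-attention over the output sequence.
        attn1 = multi_head_attention(x, x, x, mask)
        attn1 = layer_normalization(attn1 + x)

        # Attention over the encodings produced by the encoder.
        attn2 = multi_head_attention(attn1, enc_out, enc_out, None)
        attn2 = dropout(attn2)
        attn2 = layer_normalization(attn1 + attn2)

        x = point_wise_ff(attn2)
        x = layer_normalization(attn2 + x)
    # Final linear layer that produces scores over the vocabulary.
    return dense(x)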

Applications

The Transformer finds most of its applications in the field of natural language processing, for example in machine translation, and has also been applied to other sequence tasks such as time series prediction. Many pretrained models such as GPT-3, GPT-2, BERT, XLNet, and RoBERTa demonstrate the ability of Transformers to perform a wide variety of NLP tasks, and have the potential to find real-world applications. These may include: