
Everything you need to know about Transformers!

Writer: Yeshwanth G

[Figure: Architecture of the Transformer]

Table of contents


-> Introduction

-> Why transformers?

-> What transformers?

-> How transformers?

-> Conclusion


Introduction


Well well well, I understand the thumbnail could be a little overwhelming, but do not worry: by the end of this blog we will be able to tell exactly what the heck is going on. Transformers are a type of neural network architecture that has been gaining popularity. One of the main things I love about deep learning and neural networks is that the intuitions behind their concepts make perfect sense.


We will be going through intuitive concepts like the multi-head self-attention mechanism and the encoder-decoder structure of transformers. Overall, we will take a deep dive into the technology that has been deployed in systems like ChatGPT, AlphaStar and so on. Without further ado, let's get right to it!


Why transformers?


Transformers were built to deal with sequence transduction. Many machine learning tasks can be seen as a transformation from an input sequence to an output sequence (which may or may not be of the same type). This includes speech recognition, text-to-speech, text-to-image and so on. In this process there is a chance of distortion in the output sequence, i.e. the output may not be what we expect or may contain a few errors.


For example, a language translation model can sometimes generate a sentence that isn't syntactically and semantically right, often because the model has not learned the context properly. To be able to discuss "Why transformers?", let's look at other potential substitutes and their drawbacks.


-> First, we have


Recurrent Neural Networks (RNN)


[Figure: An unrolled RNN]

These are similar to ANNs (Artificial Neural Networks), but the key difference is that this model learns from the present input along with previous output instances (the hidden state), hence learning the context before making a prediction. Note that RNNs process the input token by token (word by word), and the output is generated token by token as well. So there are two problems that arise with this:


1) Since it processes data sequentially (token by token), it takes a lot of time to train the model.


2) Say our model has to predict the next word, and the sentence is: "I am from Karnataka. I am proficient in ......". We understand that the missing word is a language, but to be able to predict the word "Kannada" we need to look at the previous sentence, since none of the words around it give us that context. RNNs lag behind in performance when the number of words between the context and the to-be-predicted word is high, since they capture the context of previous outputs as a whole and don't pay special ATTENTION to the information that is relevant for predicting the next word.


Generally, it works something like this:


[Figure: A generalised RNN encoder-decoder model]

The encoder (where the model learns the input) and the decoder (where the prediction is made based on what the encoder has learnt) each usually contain many layers, with each layer generating a hidden state that represents the information learnt.
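
To make the "current input plus previous hidden state" idea concrete, here is a minimal NumPy sketch of a single RNN step (the weight names and toy dimensions are my own, purely for illustration):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Mix the current input token with the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(8, 16))   # input (8-dim) -> hidden (16-dim)
W_hh = 0.1 * rng.normal(size=(16, 16))  # hidden -> hidden
b_h = np.zeros(16)

h = np.zeros(16)                        # initial hidden state
for x_t in rng.normal(size=(5, 8)):     # a toy sequence of 5 token embeddings
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
# Note: the loop is inherently sequential -- step t needs the hidden state from step t-1.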

If you'd like to read more about RNNs, I suggest you read:


Another alternative is the LSTM (Long Short-Term Memory)


This is a little more advanced than RNNs. The key distinction of LSTMs is that they have two lines running across all the cells (a long-term memory line and a short-term memory line). This enables the model to remember the context for a longer time (i.e. even when the context is a little far away).

They make small modifications to the information through multiplications and additions. With LSTMs, the information flows through a mechanism known as the cell state. In this way, LSTMs can selectively remember the things that are important and forget the things that are not.


Here's what it looks like:


[Figure: Architecture of an LSTM cell]

Each cell takes as inputs Xt (a word, in the case of sentence-to-sentence translation), the previous cell state and the output of the previous cell. It manipulates these inputs and, based on them, generates a new cell state and an output. I have already discussed this in my stock-market prediction blog; you can refer to it to dig deeper.
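
For the curious, here is a rough NumPy sketch of one LSTM step, following the standard gate equations (stacking the gate parameters into single matrices is an assumption I've made for compactness):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: (input_dim, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,)
    # hold the stacked parameters of the forget, input, candidate and output gates.
    z = x_t @ W + h_prev @ U + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates in (0, 1)
    g = np.tanh(g)                                # candidate values
    c_t = f * c_prev + i * g       # long-term memory line (cell state)
    h_t = o * np.tanh(c_t)         # short-term memory line (hidden state / output)
    return h_t, c_t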


Again, there are certain problems associated with it:


-> This architecture also processes data sequentially and is hard to parallelize, so it takes more time to train.

-> LSTMs don't do well with very long sentences.

-> The distance between the context and the to-be-predicted word can still affect the model's performance.

-> The LSTM architecture does not have a specialized mechanism designed explicitly to handle long-range and short-range dependencies in the input data.


So this calls for an attention mechanism that focuses on relevant information by attending to all parts of the input sequence before predicting or generating an output, rather than relying on a single context vector representing the whole input sequence, as in the case of RNNs and LSTMs.


Attention Mechanism


The attention mechanism is a fundamental concept in deep learning, and it plays a crucial role in various tasks, especially in the context of natural language processing (NLP) and sequence-to-sequence models. The attention mechanism allows a model to focus on relevant parts of the input data when making predictions or generating outputs, enabling the model to handle long-range dependencies and improve performance.


The attention mechanism addresses the fixed-context-vector limitation by allowing the model to selectively "attend" to different parts of the input sequence when processing a specific element (e.g., a word or token) in the output sequence. Instead of relying solely on a fixed context vector, attention computes dynamic weighted sums of the input elements to form a context vector specific to the current element being processed.


The main idea behind this is that each word processed before can hold significance for the prediction of the current word. So, in order for the decoding to be precise, the model takes into account every word of the input, using attention.


[Figure: Representation of the attention mechanism]

In RNNs with attention, the attention mechanism improves how the model uses the input, but the computation is still not fully parallel. During decoding, the model needs to process the output sequence one token at a time, attending to the input sequence for each output token. While this is more effective than compressing the entire input sequence into a single context vector, it still incurs sequential computational overhead.


Combining parallel processing with the attention mechanism, we get transformers. Let's learn more about how they work in the following sections.


What transformers?


We solve the problem by using an encoder-decoder architecture together with attention. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive: it consumes the previously generated symbols as additional input when generating the next, while attention assigns scores over all parts of the input sequence to decide which information is relevant for predicting the next token.
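
As a rough sketch of what "auto-regressive" means in practice (the encode and decode_step callables below are hypothetical placeholders, not a real API), greedy generation looks something like this:

def generate(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    # encode: maps the source tokens to z = (z1, ..., zn)
    # decode_step: predicts the next token id from z and the tokens generated so far
    z = encode(src_tokens)
    out = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(z, out)   # attends to z and to the previous outputs y_<t
        out.append(next_id)
        if next_id == eos_id:
            break
    return out[1:]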


[Figure: The transformer's encoder-decoder (attention model) architecture]


Encoder: It is composed of a stack of many identical layers (generally 6; more complex models like the ones behind ChatGPT use on the order of 100 layers).

Each layer has two sub-layers:


1) Multi-head self-attention mechanism

2) Feed-forward neural network


Around each of these sub-layers there is a residual connection followed by layer normalization. We add this to keep the model stable during training, preventing the gradients of the model parameters (weights and biases) from vanishing or exploding.
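
A minimal sketch of this wrapping (assuming the original "post-norm" arrangement and leaving out the learned scale and shift parameters for brevity):

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def apply_sublayer(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalisation.
    return layer_norm(x + sublayer(x))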


Decoder: It is also composed of a stack of many identical layers (usually 6). In addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.


We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. We will discuss this in a little more depth later on.



[Figure: Encoder and decoder layers of a transformer]

This architecture allows parallelization in training, i.e. it takes in all the input tokens at once and then lets each token establish relationships with the others.


How transformers?


The first step is to encode the input sequence of words in a way the computer can understand. We convert words to word embeddings (high-dimensional numerical vectors in which similar words are mapped closer to each other), and the model then learns these vectors.


Do read https://www.insightsinbits.com/post/word2vec-model-made-easy to know more about word embeddings!


Since we take in all the words of a sequence at once, we need some mechanism to remember position, so that the model understands the syntactic context and the positional information of the words. This is done using positional encoding. The key idea is to add position information to the input embeddings so the model can differentiate between words at different positions.


The most common way to compute positional encoding is by using trigonometric functions. Specifically, the positional encodings are computed as follows:


For each word at position "pos" in the sequence and each dimension i of the word embedding:

positional_encoding[pos, 2i]   = sin(pos / (10000^(2i / d_model)))
positional_encoding[pos, 2i+1] = cos(pos / (10000^(2i / d_model)))

where d_model is the dimension of the word embeddings. The above formula ensures that even and odd dimensions of the embeddings are assigned different positional encodings to preserve the positional information effectively.


The positional encoding is then added to the original word embeddings before feeding them to the Transformer encoder or decoder. This addition of positional encoding occurs element-wise, effectively modifying the embeddings to carry positional information.
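
Here is a small NumPy sketch of the formula above (assuming d_model is even, as is standard; the names are my own, for illustration):

import numpy as np

def positional_encoding(seq_len, d_model):
    # sin on the even dimensions, cos on the odd dimensions
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000, (2 * i) / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Added element-wise to the word embeddings before the first encoder/decoder layer:
# embeddings = embeddings + positional_encoding(seq_len, d_model)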


Multi-Head Self-Attention Mechanism


Self-Attention in a Single Head: In self-attention, each word in the input sequence is associated with three vectors: the query vector, the key vector, and the value vector. These vectors are obtained by linear projections of the original word embeddings. The three projections are parameterized by three separate weight matrices, typically learned during training.


Query vector (Q) (dimension = dk): The query vector represents the word we are computing the attention weights for (the word that is doing the attending). It is obtained by multiplying the original word embedding by the query projection weight matrix. (We talk about matrices because attention is computed for many queries at once, so all the query vectors are stacked into one matrix and operated on together.)


Key vector (K) (dimension = dk): The key vector represents the words in the input sequence with respect to which the attention weights are computed. It is obtained by multiplying the original word embedding by the key projection weight matrix.


Value vector (V) (dimension = dv): The value vector contains the information the model should pay attention to when computing the output representation. It is obtained by multiplying the original word embedding by the value projection weight matrix. The attention weights are multiplied with this vector.


We compute the dot products of all the query vectors with all the key vectors. A higher dot product means the two vectors are closer to each other (implying that the corresponding words are more relevant to each other), and attention weights are assigned accordingly by dividing the scores by √dk and applying the softmax function.

Attention(Q, K, V) = softmax(Q K^T / √dk) · V

Q = query vector matrix

K = key vector matrix (K^T denotes its transpose)

V = value vector matrix

√dk = square root of the dimension of the key vectors
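
A minimal NumPy sketch of this scaled dot-product attention formula (the function names are mine, for illustration):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(dk)) V
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)       # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V                   # weighted sum of the value vectors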


[Figure: Single-head self-attention vs. multi-head self-attention]

Multi-head self-attention: It can be thought of as a collection of multiple single-head attention mechanisms. Like I said earlier, each word in the input sequence is associated with three vectors: the query vector, the key vector and the value vector. These vectors are obtained through linear projections of the original word embeddings, using separate learned weight matrices for each head. The speciality is that each head attends over a different aspect of the context, capturing a bigger picture. Note that all the heads have their own set of query, key and value vectors. This allows the heads to learn distinct features in parallel, since they all run simultaneously.


For example, consider a sentence like "The cat sat on the mat." In a translation task, one attention head might focus on the relationship between "The" and "cat" to understand the subject, while another attention head could concentrate on the relationship between "cat" and "sat" to grasp the action. At the same time, another head might emphasize the relationship between "on" and "the mat" to identify the prepositional phrase.

The final output of the multi-head attention mechanism is a representation that incorporates the contributions from all attention heads. This combined output is further processed through a feed-forward neural network layer to produce the final representation of the input sequence. In simple words, the output at each encoder stage is a sequence of modified value vectors that represents the input sequence.
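
Reusing the scaled_dot_product_attention sketch from the formula above, multi-head attention can be sketched roughly like this (splitting one d_model-wide projection into per-head slices is an implementation shortcut I've chosen for brevity):

import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project once for all heads
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # each head attends over its own slice of the projections
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ Wo    # combine the heads' outputs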


Similarly, in the case of a decoder layer, the output is a representation of the output sequence (say, the translated sentence) after its self-attention has attended to the tokens generated so far and its encoder-decoder attention has attended to the final output of the encoder. Note that a decoder layer masks out all tokens after position i when generating the i-th word of the output sequence.


The masking mechanism ensures that the model attends only to the positions before the current word being generated and prevents the decoder from "cheating" by attending to future positions that it should not have access to during the generation.


This is usually done using a causal attention mask, which is applied to the attention scores before the softmax that produces the attention weights. The scores of all tokens that come after the current position are set to negative infinity, so their softmax output becomes 0 and the decoder does not attend to those future positions.
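
Using the softmax and attention sketch from earlier, a causal mask can be added roughly like this:

import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal: position i may only attend to positions <= i.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_self_attention(Q, K, V):
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk) + causal_mask(Q.shape[0])
    weights = softmax(scores, axis=-1)   # masked (future) positions end up with weight 0
    return weights @ V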


Conclusion


In this blog we have covered all the concepts involved in the working of a transformer. If y'all wish to read more, I have left behind the links which I referred to, for your understanding. In my next blog we shall get our hands dirty and deploy a simple transformer model of our own... So stay tuned :)


References: