The Transformer is a breakthrough NLP model that relies entirely on the attention mechanism, eliminating convolutional and recurrent neural networks. It is the backbone of many state-of-the-art models such as BERT and XLNet, and it exploits parallelism to speed up training.
The key features of the Transformer:
- Self-Attention Layer (see the first sketch after this list).
- Cross-Attention Layer (Encoder-Decoder Attention Layer).
- Positional Embedding (see the second sketch after this list).
- Layer Normalization.
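
The attention layers above are all built on scaled dot-product attention. Here is a minimal sketch in NumPy; the function name and toy shapes are illustrative, not taken from any reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy usage: 4 tokens, dimension 8. In self-attention, Q, K, and V
# are all projections of the same input sequence; in cross-attention,
# Q comes from the decoder while K and V come from the encoder output.
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```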

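For positional embedding, the original Transformer paper uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch of that scheme (function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2) even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

# These encodings are added to the token embeddings before the first layer,
# giving the model information about token order.
pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```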