## Attention Is All You Need

Source: @vaswani2017

### Highlights 💡

*Added on 2022-10-07*

■ convolutional neural networks (p. 1)

■ best performing models (p. 1)

■ connect the encoder and decoder through an attention mechanism (p. 1)

■ Transformer (p. 1)

■ dispensing with recurrence and convolutions (p. 1)

■ BLEU score of 41.8 (p. 1)

■ 3.5 days on eight GPUs (p. 1)

■ generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. (p. 2)
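This recurrence is why RNNs compute sequentially: each state needs the previous one. A minimal NumPy sketch (shapes and the tanh nonlinearity are illustrative assumptions, not the paper's spec):

```python
import numpy as np

def rnn_hidden_states(x, W, U, h0):
    # h_t = tanh(W @ h_{t-1} + U @ x_t): each state depends on the
    # previous one, so positions must be computed one after another --
    # this is the sequential bottleneck that precludes parallelization.
    h, states = h0, []
    for x_t in x:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return states
```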

■ precludes parallelization (p. 2)

■ memory constraints limit batching (p. 2)

■ Attention mechanisms have become an integral part (p. 2)

■ modeling of dependencies without regard to their distance in the input or output sequences (p. 2)

■ Self-attention (p. 2)

■ relating different positions of a single sequence in order to compute a representation of the sequence (p. 2)

■ encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn) (p. 2)

■ Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time (p. 2)

■ auto-regressive (p. 2)

■ consuming the previously generated symbols as additional input when generating the next (p. 2)
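The auto-regressive loop these highlights describe can be sketched as follows; `decode_step` is a hypothetical toy stand-in for one full decoder pass, not the paper's model:

```python
def decode_step(z, prefix):
    # Hypothetical stand-in for one decoder pass over a small vocabulary.
    # Toy rule: emit symbols 0, 1, 2, ... until the end-of-sequence symbol.
    return len(prefix)

def generate(z, eos=3, max_len=10):
    # Auto-regressive decoding: each step consumes the previously
    # generated symbols as additional input when producing the next one.
    out = []
    while len(out) < max_len:
        sym = decode_step(z, out)
        out.append(sym)
        if sym == eos:
            break
    return out
```

The structure (feed the growing prefix back in, stop at EOS or a length cap) is the part that matches the paper; everything inside `decode_step` is placeholder.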

■ stack of N = 6 identical layers (p. 3)

■ two sub-layers (p. 3)

■ N = 6 identical layers (p. 3)

■ In addition to the two sub-layers (p. 3)

■ decoder inserts a third sub-layer, which performs multi-head attention over the output (p. 3)

■ An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. (p. 3)

■ output is computed as a weighted sum (p. 3)

■ weight assigned to each value is computed by a compatibility function of the query with the corresponding key (p. 3)
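A minimal NumPy sketch of this attention function, using the paper's scaled dot product as the compatibility function (function names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Compatibility of each query with each key: dot product, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Output: weighted sum of the values.
    return weights @ V, weights
```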

■ queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder (p. 5)

■ allows every position in the decoder to attend over all positions in the input sequence (p. 5)

■ learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities (p. 5)

■ Positional Encoding (p. 5)
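The paper's sinusoidal positional encoding, sketched in NumPy (assumes an even `d_model`):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```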

■ Why Self-Attention (p. 6)

■ computational complexity per layer (p. 6)

■ amount of computation that can be parallelized (p. 6)

■ path length between long-range dependencies (p. 6)

■ more interpretable models (p. 7)

■ WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs (p. 7)

■ encoded using byte-pair encoding (p. 7)

■ shared source-target vocabulary of about 37,000 tokens (p. 7)

■ 8 NVIDIA P100 GPUs (p. 7)

■ each training step took about 0.4 seconds (p. 7)

■ trained the base models for a total of 100,000 steps or 12 hours (p. 7)

■ big models were trained for 300,000 steps (p. 7)

■ Adam optimizer (p. 7)

■ Regularization (p. 7)

■ Residual Dropout (p. 7)

■ EN-DE BLEU: 27.3 (base), 28.4 (big) (p. 8)

■ Label Smoothing (p. 8)
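The paper uses label smoothing with ε_ls = 0.1. Spreading ε uniformly over the whole vocabulary is one common formulation, assumed here:

```python
import numpy as np

def smooth_labels(target, vocab_size, eps=0.1):
    # Replace the one-hot target with (1 - eps) mass on the true token
    # plus eps spread uniformly over the vocabulary (eps_ls = 0.1 in the paper).
    dist = np.full(vocab_size, eps / vocab_size)
    dist[target] += 1.0 - eps
    return dist
```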

■ base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals (p. 8)

■ big models, we averaged the last 20 checkpoints (p. 8)
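Checkpoint averaging is presumably an element-wise mean of each parameter across the saved checkpoints; a sketch with checkpoints as name-to-array dicts (the dict representation is my assumption):

```python
import numpy as np

def average_checkpoints(checkpoints):
    # Element-wise mean of every parameter across checkpoints
    # (last 5 for base models, last 20 for big models in the paper).
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return avg
```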

■ beam search with a beam size of 4 and length penalty α = 0.6 (p. 8)
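The α = 0.6 length penalty follows Wu et al.'s GNMT formulation; assuming that form, lp(Y) = ((5 + |Y|)/6)^α, it is a one-liner (longer hypotheses get their log-probability divided by a larger penalty, offsetting beam search's bias toward short outputs):

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style length penalty (assumed form; the paper sets alpha = 0.6).
    return ((5 + length) / 6) ** alpha
```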

■ estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (p. 8)
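The arithmetic behind that estimate, using the step time, step count, and GPU count quoted above for the base model; the per-GPU throughput figure below is a placeholder assumption, not the paper's number:

```python
# Estimated training FLOPs = wall-clock seconds x number of GPUs
#                            x sustained single-precision FLOP/s per GPU.
seconds = 100_000 * 0.4   # 100k steps at ~0.4 s/step (base model)
gpus = 8                  # 8 NVIDIA P100 GPUs
flops_per_gpu = 2.8e12    # hypothetical sustained throughput; substitute the real figure
total_flops = seconds * gpus * flops_per_gpu
```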

■ code available at (p. 10)

■ Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017. (p. 10)

■ Attention Visualizations: Input-Input Layer (p. 13)
