Attention Is All You Need
Source: @vaswani2017
Highlights 💡
Added on 2022-10-07
■ convolutional neural networks (p. 1)
■ best performing models (p. 1)
■ connect the encoder and decoder through an attention mechanism (p. 1)
■ Transformer (p. 1)
■ dispensing with recurrence and convolutions (p. 1)
■ BLEU score of 41.8 (p. 1)
■ 3.5 days on eight GPUs (p. 1)
■ generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. (p. 2; recurrence sketch below)
■ precludes parallelization (p. 2)
■ memory constraints limit batching (p. 2)
■ Attention mechanisms have become an integral part (p. 2)
■ modeling of dependencies without regard to their distance in the input or output sequences (p. 2)
■ Self-attention, (p. 2)
■ relating different positions of a single sequence in order to compute a representation of the sequence (p. 2)
■ encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn) (p. 2)
■ Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time (p. 2)
■ auto-regressive (p. 2)
■ consuming the previously generated symbols as additional input when generating the next (p. 2; decoding sketch below)
■ stack of N = 6 identical layers (p. 3)
■ two sub-layers (p. 3)
■ N = 6 identical layers (p. 3)
■ In addition to the two sub-layers (p. 3)
■ decoder inserts a third sub-layer, which performs multi-head attention over the output (p. 3)
■ An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. (p. 3)
■ output is computed as a weighted sum (p. 3)
■ weight assigned to each value is computed by a compatibility function of the query with the corresponding key (p. 3; attention sketch below)
■ queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder (p. 5)
■ allows every position in the decoder to attend over all positions in the input sequence (p. 5)
■ learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities (p. 5)
■ Positional Encoding (p. 5)
■ Why Self-Attention (p. 6)
■ computational complexity per layer (p. 6)
■ amount of computation that can be parallelized (p. 6)
■ path length between long-range dependencies (p. 6)
■ more interpretable models (p. 7)
■ WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs (p. 7)
■ encoded using byte-pair encoding (p. 7)
■ shared source-target vocabulary of about 37000 tokens (p. 7)
■ 8 NVIDIA P100 GPUs (p. 7)
■ each training step took about 0.4 seconds (p. 7)
■ trained the base models for a total of 100,000 steps or 12 hours (p. 7)
■ big models were trained for 300,000 steps (p. 7)
■ Adam optimizer (p. 7)
■ Regularization (p. 7)
■ Residual Dropout (p. 7)
■ EN-DE (p. 8)
■ 27.3 (p. 8)
■ 28.4 (p. 8)
■ Label Smoothing (p. 8)
■ base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals (p. 8)
■ big models, we averaged the last 20 checkpoints (p. 8; checkpoint-averaging sketch below)
■ beam search with a beam size of 4 and length penalty α = 0.6 (p. 8)
■ estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (p. 8; worked example below)
■ code (p. 10)
■ available at (p. 10)
■ Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017. (p. 10)
■ Attention Visualizations: Input-Input Layer (p. 13)
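Sketches for some of the highlights above (Python, illustrative only). First, the p. 2 highlights on recurrent models: the hidden state ht is a function of ht−1 and the input at position t, which is why training precludes parallelization within a sequence. A toy sketch with a made-up tanh transition, not the paper's model:

```python
import numpy as np

def rnn_states(inputs, W_h, W_x):
    """Toy recurrence: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t).
    Each step needs the previous state, so positions t = 1..n
    must be processed one after another."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                     # inherently sequential loop over positions
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)
```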
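The p. 2 highlights on the encoder-decoder setup and the p. 5 highlight on the learned linear transformation plus softmax fit together as an auto-regressive generation loop. A greedy-decoding sketch; `encode` and `decode_step` are hypothetical placeholders, and the paper itself decodes with beam search (beam size 4, length penalty α = 0.6):

```python
import numpy as np

def generate(encode, decode_step, x_tokens, bos_id, eos_id, max_len=64):
    """Greedy auto-regressive decoding sketch.
    Assumed placeholder signatures (not from the paper's code):
      encode(x_tokens)         -> memory z, the encoder's continuous representations
      decode_step(z, y_so_far) -> next-token probabilities, i.e. the decoder output
                                  passed through a learned linear map and a softmax
    Each generated symbol is fed back in as additional input for the next step."""
    z = encode(x_tokens)                   # (x_1, ..., x_n) -> z = (z_1, ..., z_n)
    y = [bos_id]
    for _ in range(max_len):
        probs = decode_step(z, y)          # decoder attends over all encoder positions
        next_id = int(np.argmax(probs))    # greedy pick for illustration
        y.append(next_id)
        if next_id == eos_id:
            break
    return y
```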
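The p. 3 highlights describe the attention function: a query and a set of key-value pairs map to an output that is a weighted sum of the values, with each weight given by a compatibility function of the query with the corresponding key. A minimal NumPy sketch of the paper's scaled dot-product form; the example shapes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Output = softmax(Q K^T / sqrt(d_k)) V: a weighted sum of the values,
    with weights from the scaled dot-product compatibility of queries and keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of the values

# Arbitrary example: 4 queries, 6 key-value pairs, d_k = d_v = 8.
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 8)))   # shape (4, 8)
```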
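The p. 8 highlights on checkpoint averaging (last 5 checkpoints for the base models, last 20 for the big ones) amount to an element-wise mean of the saved parameters. A sketch that assumes each checkpoint is a plain dict of NumPy arrays; the storage format is an assumption, not the paper's actual tooling:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of parameters across checkpoints.
    Assumes each checkpoint is a dict: parameter name -> np.ndarray."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}

# Base models: average_checkpoints(last_5_checkpoints)
# Big models:  average_checkpoints(last_20_checkpoints)
```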
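The p. 8 highlight spells out how the training-cost estimate works: training time times number of GPUs times an estimate of each GPU's sustained single-precision throughput. A back-of-the-envelope check using the p. 7 base-model numbers; the 9.5 TFLOPS sustained figure for a P100 is my assumption, not something quoted in these highlights:

```python
# FLOPs estimate = training time * number of GPUs * sustained FLOPS per GPU
steps           = 100_000     # base model training steps (p. 7)
sec_per_step    = 0.4         # ~0.4 s per step (p. 7)
num_gpus        = 8           # NVIDIA P100 GPUs (p. 7)
sustained_flops = 9.5e12      # assumed sustained single-precision throughput of a P100

train_seconds = steps * sec_per_step                   # 40,000 s, about 11 hours
total_flops = train_seconds * num_gpus * sustained_flops
print(f"~{total_flops:.1e} FLOPs")                     # ~3.0e+18 under these assumptions
```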