Attention Is All You Need

Source: @vaswani2017

Highlights 💡

Added on 2022-10-07

convolutional neural networks (p. 1)

best performing models (p. 1)

connect the encoder and decoder through an attention mechanism (p. 1)

Transformer (p. 1)

dispensing with recurrence and convolutions (p. 1)

BLEU score of 41.8 (p. 1)

3.5 days on eight GPUs (p. 1)

generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. (p. 2)

precludes parallelization (p. 2)
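
This sequential dependence is easy to see in code; below is a minimal sketch of a toy RNN step (hypothetical, not the paper's model), where every h_t must wait for h_{t−1}:

```python
import numpy as np

# Toy RNN (illustrative only, not the paper's code): each hidden state depends
# on the previous one, so the loop over positions t cannot be parallelized.
def rnn_forward(x, W_h, W_x, h0):
    h, states = h0, []
    for x_t in x:                         # strictly sequential in t
        h = np.tanh(W_h @ h + W_x @ x_t)  # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d, n = 4, 6
hs = rnn_forward(rng.normal(size=(n, d)), rng.normal(size=(d, d)),
                 rng.normal(size=(d, d)), np.zeros(d))
print(hs.shape)  # (6, 4): one hidden state per input position
```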

memory constraints limit batching (p. 2)

Attention mechanisms have become an integral part (p. 2)

modeling of dependencies without regard to their distance in the input or output sequences (p. 2)

Self-attention, (p. 2)

relating different positions of a single sequence in order to compute a representation of the sequence (p. 2)
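
In self-attention the queries, keys and values are all projections of the same sequence; a minimal sketch, where the projection names W_q, W_k, W_v are mine and the weighted-sum computation itself is sketched under the attention-function highlights further below:

```python
import numpy as np

# Self-attention input: Q, K and V are all derived from the SAME sequence X.
# (W_q / W_k / W_v are my names for the learned projections.)
rng = np.random.default_rng(0)
n, d_model = 5, 8
X = rng.normal(size=(n, d_model))                 # one sequence of n positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)                  # (5, 8) each
```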

encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn) (p. 2)

Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time (p. 2)

auto-regressive, consuming the previously generated symbols as additional input when generating the next (p. 2)
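
A minimal sketch of that auto-regressive loop; `decoder_step` is a hypothetical stand-in for a trained decoder that maps the encoder output z and the symbols generated so far to next-token probabilities:

```python
import numpy as np

BOS, EOS, VOCAB = 0, 1, 10   # hypothetical token ids and vocabulary size

def decoder_step(z, prefix):
    """Hypothetical stand-in for a trained decoder: returns a distribution
    over the next symbol given z and the previously generated symbols."""
    rng = np.random.default_rng(len(prefix))       # deterministic dummy output
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def greedy_decode(z, max_len=20):
    ys = [BOS]
    for _ in range(max_len):
        probs = decoder_step(z, ys)   # consumes previously generated symbols
        ys.append(int(np.argmax(probs)))
        if ys[-1] == EOS:
            break
    return ys[1:]

print(greedy_decode(z=np.zeros((6, 8))))  # emits one symbol at a time
```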

stack of N = 6 identical layers (p. 3)

two sub-layers (p. 3)

N = 6 identical layers (p. 3)

In addition to the two sub-layers (p. 3)

decoder inserts a third sub-layer, which performs multi-head attention over the output (p. 3)
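
A minimal structural sketch of how those sub-layers compose; every sub-layer function here is a trivial placeholder, not the real computation:

```python
N = 6  # both stacks use N = 6 identical layers

# Placeholders only, to show the composition (not the real sub-layer math):
add_norm = lambda x, s: x + s                    # residual add; LayerNorm omitted
self_attention = lambda x: x
masked_self_attention = lambda y: y
cross_attention = lambda y, memory: y            # attention over encoder output
feed_forward = lambda x: x

def encoder_layer(x):
    x = add_norm(x, self_attention(x))           # sub-layer 1: multi-head self-attention
    return add_norm(x, feed_forward(x))          # sub-layer 2: position-wise feed-forward

def decoder_layer(y, memory):
    y = add_norm(y, masked_self_attention(y))    # sub-layer 1
    y = add_norm(y, cross_attention(y, memory))  # inserted third sub-layer (over encoder output)
    return add_norm(y, feed_forward(y))          # sub-layer 2

x = 1.0
for _ in range(N):
    x = encoder_layer(x)
print(x)  # placeholder arithmetic; the point is the layer/sub-layer structure
```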

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. (p. 3)

output is computed as a weighted sum (p. 3)

weight assigned to each value is computed by a compatibility function of the query with the corresponding key (p. 3)
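
Concretely, the paper's compatibility function is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V; a minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each output row is a weighted sum of the
    value rows, weighted by query-key compatibility."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # compatibility function
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 6))   # 5 values of dimension d_v = 6
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 6): one output per query
```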

queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder (p. 5)

allows every position in the decoder to attend over all positions in the input sequence (p. 5)

learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities (p. 5)
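
A minimal sketch of that final step; `W_out` is my name for the learned projection from d_model to vocabulary size:

```python
import numpy as np

def next_token_probs(decoder_output, W_out):
    """Learned linear transformation followed by a softmax over the vocabulary
    (W_out is a hypothetical name for the learned projection matrix)."""
    logits = decoder_output @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
probs = next_token_probs(rng.normal(size=16), rng.normal(size=(16, 100)))
print(probs.argmax(), probs.sum())  # most likely next token id; probabilities sum to 1
```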

Positional Encoding (p. 5)
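
The paper's fixed sinusoidal positional encodings are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a minimal sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(positional_encoding(max_len=50, d_model=8).shape)  # (50, 8); added to the embeddings
```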

Why Self-Attention (p. 6)

computational complexity per layer (p. 6)

amount of computation that can be parallelized (p. 6)

path length between long-range dependencies (p. 6)
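
These are the three criteria of the paper's Table 1; as I recall it (n = sequence length, d = representation dimension, k = convolution kernel width):

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| --- | --- | --- | --- |
| Self-attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k n) |

For example, with hypothetical sizes n = 100 and d = 512, a self-attention layer costs on the order of n²·d ≈ 5×10⁶ operations against n·d² ≈ 2.6×10⁷ for a recurrent layer, while connecting any two positions with a path of constant length.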

more interpretable models (p. 7)

WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs (p. 7)

encoded using byte-pair encoding (p. 7)

shared source-target vocabulary of about 37000 tokens (p. 7)

8 NVIDIA P100 GPUs (p. 7)

each training step took about 0.4 seconds (p. 7)

trained the base models for a total of 100,000 steps or 12 hours (p. 7)

big models were trained for 300,000 steps (p. 7)

Adam optimizer (p. 7)
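
The paper pairs Adam (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹) with a warmup-then-inverse-square-root learning-rate schedule; a minimal sketch of that schedule:

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lrate(100), transformer_lrate(4000), transformer_lrate(100_000))
```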

Regularization (p. 7)

Residual Dropout (p. 7)
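
Residual dropout means dropout (P_drop = 0.1 for the base model) applied to each sub-layer's output before it is added to the residual and normalized; a minimal sketch with `sublayer` and `layer_norm` passed in as hypothetical stand-ins:

```python
import numpy as np

P_DROP = 0.1  # base-model rate from the paper

def residual_sublayer(x, sublayer, layer_norm, rng, train=True):
    """LayerNorm(x + Dropout(Sublayer(x))): dropout hits the sub-layer output
    before the residual addition (sublayer / layer_norm are stand-ins)."""
    y = sublayer(x)
    if train:
        keep = rng.random(y.shape) >= P_DROP
        y = y * keep / (1.0 - P_DROP)   # inverted dropout
    return layer_norm(x + y)

rng = np.random.default_rng(0)
ln = lambda v: (v - v.mean()) / (v.std() + 1e-6)
print(residual_sublayer(np.ones(8), lambda v: 2.0 * v, ln, rng))
```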

EN-DE BLEU: 27.3 (base model), 28.4 (big model) (p. 8)

Label Smoothing (p. 8)
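
The paper uses label smoothing with ε_ls = 0.1; a minimal sketch of one common formulation (the exact mixing convention may differ slightly from the paper's reference):

```python
import numpy as np

def smooth_labels(target_id, vocab_size, eps=0.1):
    """Replace the one-hot target with a softened distribution: 1 - eps on the
    true token, eps spread uniformly over the remaining vocabulary (one common
    convention; eps = 0.1 is the paper's value)."""
    q = np.full(vocab_size, eps / (vocab_size - 1))
    q[target_id] = 1.0 - eps
    return q

q = smooth_labels(target_id=3, vocab_size=5)
print(q, q.sum())  # still sums to 1, but the model is never pushed to full confidence
```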

base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals (p. 8)

big models, we averaged the last 20 checkpoints (p. 8)

beam search with a beam size of 4 and length penalty α = 0.6 (p. 8)
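
A minimal sketch of the checkpoint averaging mentioned above: the parameters (not the predictions) of the last k saved models are averaged element-wise; checkpoints are represented here as plain dicts of arrays, which is an assumption, not the paper's tooling:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of parameter tensors across checkpoints (each
    checkpoint is assumed to be a dict mapping parameter name -> array)."""
    return {name: np.mean([c[name] for c in checkpoints], axis=0)
            for name in checkpoints[0]}

last_5 = [{"w": np.full(3, float(i))} for i in range(5)]  # e.g. the last 5 checkpoints
print(average_checkpoints(last_5)["w"])  # [2. 2. 2.]
```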

estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (p. 8)
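
A worked instance of that estimate for the base model, using the 12 hours on 8 P100s quoted above and roughly 9.5 sustained single-precision TFLOPS per P100 (my recollection of the paper's footnote):

```python
# FLOPs estimate = training time x number of GPUs x sustained FLOPS per GPU.
hours, gpus, tflops_per_gpu = 12, 8, 9.5   # 9.5 TFLOPS: assumed P100 figure
total_flops = hours * 3600 * gpus * tflops_per_gpu * 1e12
print(f"{total_flops:.1e}")  # ~3.3e18, in line with the paper's base-model training cost
```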

code available at (p. 10)

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017. (p. 10)

Attention Visualizations: Input-Input Layer (p. 13)
