📇 Index
Highlights
Highlight
Word2vec and fastText are both trained on very shallow language modeling tasks, so there is a limitation to what the word embeddings can capture.
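A quick way to see the "one vector per word" limitation is to train a tiny word2vec model and ask for the embedding of an ambiguous word; a minimal sketch with gensim (the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; real word2vec training uses billions of tokens.
sentences = [
    ["she", "sat", "by", "the", "river", "bank"],
    ["he", "deposited", "cash", "at", "the", "bank"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# word2vec stores exactly one vector for "bank", no matter whether the
# surrounding sentence was about rivers or about money.
print(model.wv["bank"].shape)  # (50,)
```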
Highlight
Instead of just training a model to map a single vector for each word, these methods train a complex, deep neural network to map a vector to each word based on the entire sentence/surrounding context. → difference between simple context-less embeddings and representations from BERT etc
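In contrast, a contextual encoder returns a different vector for the same word depending on the sentence it appears in. A minimal sketch using the Hugging Face transformers library (the model name, example sentences, and the single-token lookup are assumptions for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual hidden state of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                   # assumes `word` is a single token

v_river = embedding_of("she sat by the river bank", "bank")
v_money = embedding_of("he deposited cash at the bank", "bank")

# The two "bank" vectors differ because the whole sentence is fed through
# the deep network, unlike the single static word2vec vector above.
print(torch.cosine_similarity(v_river, v_money, dim=0))
```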
Highlight
given a context, a language model predicts the probability of a word occurring in that context → useful definition of “language model”
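That definition maps directly onto the fill-mask interface of a masked language model; a short sketch (the model choice and example sentence are illustrative):

```python
from transformers import pipeline

# A language model assigns a probability to each candidate word in a context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("The cat sat on the [MASK]."):
    # Each prediction carries the proposed token and its probability.
    print(candidate["token_str"], round(candidate["score"], 3))
```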
Highlight
BERT only swaps 10% of the 15% of tokens selected for masking (in total 1.5% of all tokens) and leaves 10% of the tokens intact → noise is deliberately introduced by swapping some words at random instead of masking them; this prevents the model from overfitting to the [MASK] token
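A sketch of that 80/10/10 corruption rule applied to a list of token ids (the function name, the `-100` ignore label, and the uniform random replacement are illustrative choices, not taken from the BERT codebase):

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, select_prob=0.15):
    """BERT-style corruption: of the ~15% selected positions, 80% become
    [MASK], 10% are swapped for a random token, and 10% stay intact."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)        # -100: position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue                        # position not selected for prediction
        labels[i] = tok                     # the model must recover the original token
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = mask_id          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # 10%: random swap
        # remaining 10%: leave the token unchanged
    return corrupted, labels
```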
Highlight
BERT is fed two sentences and 50% of the time the second sentence comes after the first one and 50% of the time it is a randomly sampled sentence. BERT is then required to predict whether the second sentence is random or not.
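Building that 50/50 next-sentence-prediction data is a simple sampling loop; a sketch (the names and document format are assumptions):

```python
import random

def make_nsp_pairs(documents):
    """Yield (sentence_a, sentence_b, is_next) examples from `documents`,
    where each document is a list of sentences in order."""
    all_sentences = [s for doc in documents for s in doc]
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                yield doc[i], doc[i + 1], 1                    # true next sentence
            else:
                yield doc[i], random.choice(all_sentences), 0  # random sentence
```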
Highlight
Left-to-right language modeling does converge faster, but masked language modeling achieves a much higher accuracy with the same number of steps.
Highlight
concatenation of the hidden activations from the last four layers provides very strong performance, only 0.3 F1 behind fine-tuning the entire model
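A sketch of that feature-based setup with transformers: request all hidden states and concatenate the last four per token (the model name and input sentence are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("BERT as a frozen feature extractor", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states contains the embedding layer plus every transformer layer;
# keep the last four layers and concatenate them along the feature axis.
last_four = outputs.hidden_states[-4:]      # 4 tensors of shape (1, seq_len, 768)
features = torch.cat(last_four, dim=-1)     # (1, seq_len, 3072)

# `features` can now feed a small task-specific head (e.g. a BiLSTM for NER)
# while BERT itself stays frozen.
print(features.shape)
```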