Highlights

Highlight

Word2vec and fastText are both trained on very shallow language modeling tasks, so there are limits to what the resulting word embeddings can capture.

Highlight

Instead of training a model to assign a single vector to each word, these methods train a large, deep neural network that computes a vector for each word based on the entire sentence/surrounding context. → difference between simple context-free embeddings and the representations from BERT etc.
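A minimal sketch of that difference (not from the article; assumes the Hugging Face transformers library and the bert-base-uncased checkpoint): the same surface word gets different vectors depending on its sentence, unlike a word2vec-style lookup table.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the hidden state of `word` as it appears in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

# "bank" gets two different vectors; a static embedding table would give one.
v1 = contextual_vector("She sat by the river bank.", "bank")
v2 = contextual_vector("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```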

Highlight

given a context, a language model predicts the probability of a word occurring in that context → useful definition of “language model”
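A hedged code illustration of that definition (names and the GPT-2 checkpoint are my own choices, not from the article): given a context, score candidate next words by their probability under a pretrained model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def word_probability(context: str, word: str) -> float:
    """P(word | context): probability of `word` as the next token after `context`."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(context_ids).logits[0, -1]   # scores for the next position
    probs = torch.softmax(logits, dim=-1)           # normalize into a distribution
    word_id = tokenizer(" " + word).input_ids[0]    # leading space: GPT-2 BPE convention
    return probs[word_id].item()

print(word_probability("The cat sat on the", "mat"))        # relatively high
print(word_probability("The cat sat on the", "democracy"))  # much lower
```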

Highlight

BERT only swaps 10% of the 15% of tokens selected for masking (in total 1.5% of all tokens) and leaves another 10% of them intact → noise is deliberately introduced by randomly swapping words without masking them; this prevents overfitting to the [MASK] token
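A minimal sketch of this 80/10/10 corruption scheme (my own illustration; the MASK_ID and VOCAB_SIZE constants are assumptions matching bert-base-uncased):

```python
import random

MASK_ID = 103          # id of [MASK] in bert-base-uncased (assumption)
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size (assumption)

def mask_tokens(token_ids, select_prob=0.15):
    """Corrupt a token sequence BERT-style and return (corrupted, labels)."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < select_prob:   # token selected for prediction (15%)
            labels.append(tok)              # model must recover the original
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                        # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))   # 10%: swap with a random token
            else:
                corrupted.append(tok)                            # 10%: leave intact
        else:
            corrupted.append(tok)
            labels.append(-100)             # ignored by the loss
    return corrupted, labels
```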

Highlight

BERT is fed two sentences; 50% of the time the second sentence actually follows the first, and 50% of the time it is a randomly sampled sentence. BERT must then predict whether the second sentence is random or not.
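A sketch of how such next-sentence-prediction pairs could be built from a corpus of documents (each a list of sentences); the function name and structure are illustrative, not from the article.

```python
import random

def make_nsp_pairs(documents):
    """Yield (sentence_a, sentence_b, is_next) triples, ~50% true / ~50% random."""
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                # positive example: sentence_b really follows sentence_a
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # negative example: sentence_b comes from a randomly chosen document
                other = random.choice(documents)
                pairs.append((doc[i], random.choice(other), False))
    return pairs
```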

Highlight

Left-to-right language modeling does converge faster, but masked language modeling achieves a much higher accuracy with the same number of steps.

Highlight

concatenating the hidden activations from the last four layers provides very strong performance, only 0.3 F1 behind fine-tuning the entire model
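A sketch of this feature-based approach (assuming the Hugging Face transformers library; not code from the article): expose all hidden states and concatenate the last four layers per token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("BERT features for a downstream tagging model.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + 12 layers

# Concatenate the last four layers along the feature dimension:
# (batch, seq_len, 4 * hidden_size) -> (1, seq_len, 3072) for bert-base.
features = torch.cat(hidden_states[-4:], dim=-1)
print(features.shape)
```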