- Situations/Opportunities are only lucky when you act on them
- 2023-09-19
- Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
- BERT trained on two tasks
- masked language modeling (MLM; BERT uses context from both previous and next tokens for its predictions → `[MASK]` special tokens)
- BERT randomly masks words in the sentence and learns to predict them
- next sentence prediction (NSP; sentence pairs are separated by a `[SEP]` token and fed into the model, with the second sentence replaced by a random one fifty percent of the time → model has to predict whether the pair is genuine or random)
- traditional word embeddings too shallow to capture all the semantic value in complex structures (negations, collocations)
- context-less
- ELMo, BERT etc. train a deep LM to map words to vectors in context, based on the sentence/surrounding tokens
- no singular vectors for each word, always context representations
- can then be used on downstream tasks through transfer learning/fine-tuning
- previously: bidirectional LSTMs, trained L-to-R and R-to-L and then concatenated; used e.g. in ELMo → BERT instead uses left and right context jointly in every layer rather than concatenating two one-directional models
- BERT uses WordPiece tokenization: `playing -> play + ##ing` (see the tokenizer sketch at the end of these notes)
- so essentially the input combines different types of information: the token embeddings themselves + their positional encoding + segment information (e.g. to distinguish sentence pairs)
- for classification tasks, the sequence of hidden states “needs to be reduced to a single vector”
- using either a max/min pooling strategy OR taking only the hidden state corresponding to the first token (which attends over the whole sequence)
- BERT prepends a `[CLS]` token to the start of each sentence
- see also: Open Sourcing BERT: State of the Art Pre-training for NLP | Google Research Blog
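- A minimal sketch of the WordPiece tokenization and special tokens described above, assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint (neither is named in the notes):
```python
# Sketch: what BERT-style inputs look like after tokenization.
# Assumes Hugging Face `transformers` and the `bert-base-uncased` checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair becomes: [CLS] sentence A [SEP] sentence B [SEP]
encoded = tokenizer("The cat sat on the mat.", "It was purring quietly.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# token_type_ids carries the segment information (0 = first sentence, 1 = second)
print(encoded["token_type_ids"])

# WordPiece splits words outside the vocabulary into sub-word pieces prefixed with ##
print(tokenizer.tokenize("snowboarding"))
```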
- 2023-09-20
- “The process of converting a sequence of embeddings into a sentence embedding is called pooling.” | ML6 Blog
→ done by compression, so lower level of granularity
- CLS Pooling
- Prepend a `[CLS]` token to the start of every sequence and extract that token’s embedding
- Only useful if original model has been pre-trained to predict next sentences/tuned to identify sequences
- Mean Pooling
- Extracts arithmetic mean of all token-level embeddings in a sequence
- Min/Max Pooling
- Extract the element-wise max/min values across all token embeddings
- Mean Sqrt Len Pooling
- Sum of token embeddings divided by the square root of the number of tokens in the sequence (rather than by the full sequence length)
- Why?
- think of the quote: “the whole is greater than the sum of its parts”
- token-by-token representations don’t necessarily capture the same information as the sequence embedding
- Cosine Similarity
- cosine of the angle between vector embeddings; it increases towards 1 as the vectors point in more similar directions and decreases as they diverge (see the pooling/similarity sketch after this list)
- alternatives: Euclidean distance, dot product, Manhattan distance, Minkowski distance or Chebyshev distance
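- A minimal sketch of mean pooling and cosine similarity (plain PyTorch; the random tensors stand in for encoder outputs, so shapes and values are illustrative only):
```python
# Sketch: mean-pool token embeddings into a sentence embedding, then compare
# two pooled vectors with cosine similarity. Random tensors stand in for
# real encoder outputs.
import torch
import torch.nn.functional as F


def mean_pool(hidden_states, attention_mask):
    """Average the token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts


hidden_states = torch.randn(2, 4, 8)              # batch of 2, 4 tokens, hidden size 8
attention_mask = torch.tensor([[1, 1, 1, 0],      # last token is padding
                               [1, 1, 1, 1]])
sentence_embeddings = mean_pool(hidden_states, attention_mask)

similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(similarity)
```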
- Sentence encoders | ML6 Blog by the same dude as above
- sentence similarity acts on multiple levels; we can compare not only the topic but also the sentiment sentences express
- pre-trained models can iron some of this out by seeing large amounts of data
- pooled sentence embeddings can be used for classification but might lead to worse results in semantic similarity tasks like clustering or semantic search
- cross-encoders
- Two sentences fed into the model with `[SEP]` in between → classification head predicts their similarity; trained on sentence pairs along with a ground-truth label
- DON’T create sentence embeddings
- bi-encoders
- SBERT paper → bi-encoder is the architecture, `sentence-transformers` is the implemented package (usage sketch after this list)
- Siamese networks
- training procedure for sentence transformers
- they use three data points: an anchor, a positive and a negative example → minimize triplet loss
- minimize distance between anchor and positive, maximize distance between anchor and negative
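- A minimal sketch of using the `sentence-transformers` bi-encoder (and, for contrast, a cross-encoder); the model names are assumptions, not from the notes:
```python
# Sketch: bi-encoder vs cross-encoder with the `sentence-transformers` package.
# The checkpoints below are assumptions; any compatible models would do.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

sentences = ["The movie was fantastic.", "I really enjoyed the film.", "It rained all day."]

# Bi-encoder: encodes each sentence independently into a pooled sentence embedding,
# so embeddings can be pre-computed and compared with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(sentences)
print(util.cos_sim(embeddings, embeddings))       # pairwise similarity matrix

# Cross-encoder: scores a sentence pair jointly and returns a score directly;
# it never produces standalone sentence embeddings.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
print(cross_encoder.predict([(sentences[0], sentences[1])]))

# Training with triplets would minimize max(d(anchor, positive) - d(anchor, negative) + margin, 0),
# e.g. via sentence_transformers.losses.TripletLoss.
```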
- GLUE Benchmark, original paper
- Types of linguistic phenomena that similar sentences were grouped under (p. 4)
- p. 5
- CLS pooling as implemented in BERT’s `BertPooler` module (it simply takes the first token’s hidden state):
```python
import torch.nn as nn


class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
```
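- note: this dense + tanh layer is trained with the NSP objective during pre-training, which ties back to the point above that CLS pooling is only useful if the model was pre-trained to predict next sentences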