• Situations/Opportunities are only lucky when you act on them

  • 2023-09-19

    • Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
      • BERT trained on two tasks
        • masked language modeling (MLM; selected input tokens are replaced with the special [MASK] token, and the model predicts them from both left and right context)
          • BERT randomly masks ~15% of the tokens in a sentence and learns to predict them
        • next sentence prediction (NSP; two sentences separated by a [SEP] token are fed into the model; fifty-fifty, the second sentence is either the actual next sentence or a random one → model predicts whether the pair is consecutive)
      • traditional word embeddings too shallow to capture all the semantic value in complex structures (negations, collocations)
        • context-less
      • ELMo, BERT etc train a deep LM to model words to vectors in context, based on the sentence/surrounding tokens
        • no singular vectors for each word, always context representations
        • can then be used on downstream tasks through transfer learning/fine-tuning
        • previously: bidirectional LSTMs, trained L-to-R and R-to-L and then concatenated; used e.g. in ELMo → BERT instead conditions on left and right context jointly within a single Transformer stack
        • BERT uses wordpiece tokenization: playing -> play + ##ing
      • so essentially it stores different types of information: token embeddings themselves + their positional encoding + segment information (e.g. to distinguish sentence pairs)
      • for classification tasks, the sequence of hidden states “needs to be reduced to a single vector”
        • using either a max/min pooling strategy OR taking only the hidden state corresponding to the first ([CLS]) token
        • BERT prepends [CLS] token to start of each sentence
      • see also: Open Sourcing BERT: State of the Art Pre-training for NLP | Google Research Blog
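The MLM objective above can be sketched in plain Python. This is a minimal illustration, not BERT's actual implementation: the 80/10/10 split over ~15% of tokens follows the paper, but the `mask_tokens` helper and its signature are my own.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style MLM masking: select ~15% of positions;
    of those, replace 80% with [MASK], 10% with a random vocab token,
    and leave 10% unchanged. Returns (masked_tokens, target_positions)."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token (10% of selected positions)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, vocab=tokens, mask_prob=0.3, seed=42)
```

The 10% random / 10% unchanged cases keep the model from relying on [MASK] being present at test time, when no tokens are masked.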
  • 2023-09-20

    • “The process of converting a sequence of embeddings into a sentence embedding is called pooling.” | ML6 Blog → achieved by compressing the sequence, i.e. moving to a lower level of granularity
      • CLS Pooling
        • Prepend a [CLS] token to every sequence and extract that token’s embedding
        • Only useful if the original model has been pre-trained with a sentence-level objective (e.g. next sentence prediction) or tuned to represent whole sequences
      • Mean Pooling
        • Extracts arithmetic mean of all token-level embeddings in a sequence
      • Min/Max Pooling
        • Extract the element-wise max/min values across all token embeddings
      • Mean Sqrt Len Pooling
        • Sum of token embeddings divided by the square root of the number of tokens in the sequence
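The pooling strategies above can be sketched in plain Python, with token embeddings represented as lists of floats. Helper names are illustrative; the mean-sqrt-len variant follows the sentence-transformers convention of dividing the summed embeddings by √n.

```python
import math

def cls_pooling(embs):
    """Take only the embedding of the first ([CLS]) token."""
    return embs[0]

def mean_pooling(embs):
    """Arithmetic mean of all token embeddings, dimension-wise."""
    n = len(embs)
    return [sum(e[d] for e in embs) / n for d in range(len(embs[0]))]

def max_pooling(embs):
    """Element-wise maximum across all tokens."""
    return [max(e[d] for e in embs) for d in range(len(embs[0]))]

def min_pooling(embs):
    """Element-wise minimum across all tokens."""
    return [min(e[d] for e in embs) for d in range(len(embs[0]))]

def mean_sqrt_len_pooling(embs):
    """Sum of token embeddings divided by sqrt(sequence length)."""
    n = len(embs)
    return [sum(e[d] for e in embs) / math.sqrt(n) for d in range(len(embs[0]))]
```

Each strategy maps a (tokens × dimensions) sequence to a single vector of the embedding dimension.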
      • Why?
        • think of the quote: “the whole is greater than the sum of its parts”
        • token-by-token representations don’t necessarily capture the same information as the sequence embedding
      • Cosine Similarity
        • (figure: cosine similarity graph)
        • measures the angle between vector embeddings; cosine similarity increases as the vectors point in more similar directions and decreases as the angle between them grows
      • Euclidean distance & Dot product, Manhattan distance, Minkowski distance or Chebyshev distance
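A quick sketch of these similarity/distance measures in plain Python (function names are my own):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = u·v / (|u| |v|): 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def minkowski_distance(u, v, p=3):
    """Generalizes Manhattan (p=1) and Euclidean (p=2)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def chebyshev_distance(u, v):
    """Minkowski as p → infinity: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))
```

Note that cosine similarity ignores vector magnitude, whereas the distance measures do not; the dot product is the unnormalized counterpart of cosine similarity.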
    • Sentence encoders | ML6 Blog, by the same author as above
      • sentence similarity acts on multiple levels; we can compare not only the topic but also the sentiment sentences express
      • pre-trained models can iron some of this out by seeing large amounts of data
      • pooled sentence embeddings can be used for classification but might lead to worse results in semantic similarity tasks like clustering or semantic search
      • cross-encoders
        • Two sentences fed into the model with a [SEP] token in between → classification head predicts their similarity; trained on sentence pairs along with a ground-truth label
        • DON’T create sentence embeddings
      • bi-encoders
        • SBERT paper → bi-encoder is the architecture, sentence-transformers is the implemented package
      • Siamese networks
        • training procedure for sentence transformers
        • they use three data points per example: an anchor, a positive, and a negative → minimize the triplet loss
          • minimize the distance between anchor and positive while maximizing the distance between anchor and negative
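The triplet loss described above can be sketched as follows (a minimal version using Euclidean distance; the helper names and the default margin are illustrative):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin):
    the loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The margin keeps the network from collapsing all embeddings to a single point: it must separate positives from negatives by a fixed amount, not just make both distances small.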

    • GLUE Benchmark, original paper
      • Types of linguistic phenomena that similar sentences were grouped under (p. 4)
      • p. 5
import torch.nn as nn

class BertPooler(nn.Module):
  def __init__(self, config):
    super(BertPooler, self).__init__()
    self.dense = nn.Linear(config.hidden_size, config.hidden_size)
    self.activation = nn.Tanh()

  def forward(self, hidden_states):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first ([CLS]) token.
    first_token_tensor = hidden_states[:, 0]
    pooled_output = self.dense(first_token_tensor)
    pooled_output = self.activation(pooled_output)
    return pooled_output