Text can be modelled in various ways. The bag-of-words model simplifies the representation by tokenizing the text and counting how many times each word occurs, discarding word order and grammar. It’s a very basic form of feature extraction.
John likes to watch movies. Mary likes movies too.
becomes
"John","likes","to","watch","movies","Mary","likes","movies","too"
or rather
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
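As a minimal sketch of the two steps described above (tokenizing, then counting), the same dictionary can be built with only the Python standard library. The function name `bag_of_words` and the regular expression used for tokenizing are illustrative assumptions, not part of any library; real tokenizers treat case, punctuation and contractions in different ways.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> dict[str, int]:
    """Count how often each word occurs in `text`, ignoring word order."""
    # Tokenize: keep runs of letters, digits and apostrophes; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Count occurrences; keys stay in the order the words were first seen.
    return dict(Counter(tokens))


bow = bag_of_words("John likes to watch movies. Mary likes movies too.")
print(bow)
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}
```

This sketch keeps the original casing, so "John" and "john" would be counted as two different words.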
The example sentence and its BoW1 representation are taken from Wikipedia.
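# Tokenizer is a legacy (deprecated) Keras utility; this import path assumes a
# TensorFlow 2.x release that still ships keras.preprocessing.text.
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["John likes to watch movies. Mary likes movies too."]


def print_bow(sentences: list[str]) -> None:
    """Print the bag of words for the first text in `sentences`."""
    # By default the Tokenizer lowercases the text and strips punctuation.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    # Each text becomes a list of integer word ids.
    sequences = tokenizer.texts_to_sequences(sentences)
    word_index = tokenizer.word_index  # maps word -> integer id

    # Count how often each word id occurs in the first text.
    bow = {word: sequences[0].count(idx) for word, idx in word_index.items()}
    print(f"Bag of words for sentence 1:\n{bow}")
    print(f"We found {len(word_index)} unique tokens.")


print_bow(sentences)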
- Keywords: nlp
- Source: A Gentle Introduction to the Bag-of-Words Model - MachineLearningMastery.com
- Related: