Tokenization, Stemming, Lemmatization, and Stop-word Removal: Foundational Techniques of NLP
Tokenization is the process of dividing strings or sentences into their smallest meaningful units, called "tokens." It is a key preprocessing step in NLP that directly affects vocabulary size and overall computational cost. There are three main approaches, and modern NLP models primarily use sub-word tokenization:
Word Tokenization:
“hello, world” → [“hello”, “world”]
This method requires an excessively large vocabulary size and cannot handle misspellings or out-of-vocabulary words.
Character Tokenization:
“hi” → [“h”, “i”]
This method uses an extremely small vocabulary and eliminates the need to handle out-of-vocabulary words. However, it produces very long sequences that are computationally expensive to process, and individual characters carry little meaning on their own.
Sub-Word Tokenization:
“unhappiness” → [“un”, “happiness”]
It serves as a middle ground between the above methods and handles out-of-vocabulary words by breaking them into sub-word units.
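To make the contrast concrete, here is a minimal Python sketch of the three styles. The sub-word split uses a hand-made toy vocabulary with a greedy longest-match rule purely for illustration; real tokenizers learn their sub-word vocabulary from data, as the BPE section below describes.

```python
def word_tokenize(text: str) -> list[str]:
    # Naive word tokenization: drop commas and split on whitespace.
    return text.replace(",", " ").split()

def char_tokenize(text: str) -> list[str]:
    # Character tokenization: every character is a token.
    return list(text)

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    # Toy sub-word tokenization: greedy longest match against a known
    # sub-word vocabulary, falling back to single characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

print(word_tokenize("hello, world"))                                 # ['hello', 'world']
print(char_tokenize("hi"))                                           # ['h', 'i']
print(subword_tokenize("unhappiness", {"un", "happiness", "ness"}))  # ['un', 'happiness']
```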
Understanding the Mechanism of Sub-word Tokenization through Byte Pair Encoding
BPE (Byte Pair Encoding) was originally invented as a data compression algorithm, but its primary use today is in NLP. Modern LLMs such as GPT use BPE to encode text input, and the resulting tokens serve as the basic unit for measuring performance and training cost.
BPE:
“low, lower, newest” → [“low”, “low”, “##er”, “new”, “##est”]
The prefix ## indicates that the current token continues the previous one, i.e., it attaches to the preceding sub-word within the same word. This scheme reduces vocabulary size, handles out-of-vocabulary words, and preserves more meaning. In BPE, the vocabulary is built by repeatedly merging the most frequent symbol pairs, while WordPiece and Unigram LM choose their vocabulary based on likelihood instead.
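To see how BPE builds its vocabulary from pair frequencies, here is a minimal sketch of the classic merge loop (in the style of Sennrich et al.). The toy word counts are made up for illustration; real tokenizers run this over a large corpus and store the learned merges for later encoding.

```python
from collections import Counter

def pair_counts(vocab: dict[str, int]) -> Counter:
    # Count each adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    # Replace every occurrence of the chosen pair with a single merged symbol.
    a, b = pair
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[" ".join(out)] = freq
    return merged_vocab

# Toy corpus: each word is a space-separated sequence of symbols with a frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for step in range(5):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair gets merged next
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
print(vocab)  # sub-word segmentation of each word after 5 merges
```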
LLMs like GPT and BERT include multiple languages in their vocabulary. Roughly 10–20% of GPT's training data is non-English content, giving it a more diverse corpus. However, there is a noticeable performance gap between English and non-English text.
Stemming, Lemmatization, Stop-word Removal
These preprocessing techniques in NLP help standardize text data. They are commonly used together to clean and normalize text before further processing:
Stemming:
"running"
→"run"
,"fishes"
→"fish"
A rule-based process that removes word endings to obtain the root form
While fast and simple, it can sometimes produce non-existent words
Lemmatization:
"better"
→"good"
,"was"
→"be"
A more sophisticated approach that converts words to their dictionary base form
More accurate than stemming but computationally more expensive
Stop-word Removal: The process of filtering out common words that add little meaning
Reduces noise in text analysis and decreases computational complexity
The list of stop-words varies depending on the application and language
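Here is a minimal sketch of all three steps using NLTK (this assumes NLTK is installed and the 'wordnet' and 'stopwords' data have been fetched with nltk.download). The printed results shown in the comments are what these calls are expected to return.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Stemming: rule-based suffix stripping; the result is not always a real word.
stemmer = PorterStemmer()
print(stemmer.stem("running"), stemmer.stem("fishes"), stemmer.stem("studies"))
# run fish studi

# Lemmatization: dictionary lookup to the base form; a POS hint improves accuracy.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"), lemmatizer.lemmatize("was", pos="v"))
# good be

# Stop-word removal: drop common words that carry little meaning on their own.
stop_words = set(stopwords.words("english"))
tokens = ["this", "is", "a", "simple", "example"]
print([t for t in tokens if t not in stop_words])
# ['simple', 'example']
```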
A Foundation of NLP: BOW Modeling
Bag of Words (BoW) modeling converts text into count vectors: each word in the vocabulary acts as a key, and its value is the word's frequency in the document.
BoW:
"John likes to watch movies and also likes coffee"
→["John": 1, "like": 2, "to": 1, "watch": 1, "movie": 1, “and”: 1, “also”: 1, “coffee”: 1]
A hashing function can be used to map each word to its index in the vector, avoiding the need to store an explicit vocabulary (the "hashing trick").
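Below is a quick sketch of the example above using Python's collections.Counter, plus a hand-rolled version of the hashing idea. The small lemma map is hard-coded only so the counts match the example; in practice the lemmatization step from the previous section would handle this.

```python
from collections import Counter

text = "John likes to watch movies and also likes coffee"
# Toy lemma map so "likes" -> "like" and "movies" -> "movie", matching the example.
lemma = {"likes": "like", "movies": "movie"}
tokens = [lemma.get(w, w) for w in text.split()]

# Explicit-vocabulary BoW: count each word's frequency.
print(Counter(tokens))
# Counter({'like': 2, 'John': 1, 'to': 1, 'watch': 1, 'movie': 1,
#          'and': 1, 'also': 1, 'coffee': 1})

# Hashing trick: map each word to a fixed-size index instead of storing a vocabulary.
# (Python's built-in hash() for strings is randomized per run unless PYTHONHASHSEED is set.)
dim = 16
vec = [0] * dim
for w in tokens:
    vec[hash(w) % dim] += 1
print(vec)
```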
Although transformers, the predominant architecture among LLMs, now represent text with learned token embeddings rather than count vectors, BoW remains a fundamental concept in NLP.