Tokenization and Stemming, Lemmatization, Stop-word Removal: Foundational Works of NLP

Tokenization is the process of dividing strings or sentences into their smallest meaningful units, called "tokens." It is a key preprocessing step in NLP that keeps downstream computation manageable. There are three main approaches (compared in the short sketch after this list), and modern NLP models primarily use sub-word tokenization:

  • Word Tokenization: “hello, world” → [“hello”, “world”]

    • This method requires an excessively large vocabulary size and cannot handle misspellings or out-of-vocabulary words.

  • Character Tokenization: “hi” → [“h”, “i”]

    • This method uses an extremely small vocabulary size and eliminates the need to handle out-of-vocabulary words. However, it requires computationally expensive processing and often fails to capture word meanings accurately.

  • Sub-Word Tokenization: “unhappiness” → [“un”, “happiness”]

    • It serves as a middle ground between the above methods and handles out-of-vocabulary words by breaking them into sub-word units.
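
A quick comparison of the three granularities, as a minimal sketch in plain Python. The sub-word split is hard-coded here for illustration only; real sub-word tokenizers (BPE, WordPiece, Unigram LM) learn their splits from a corpus.

```python
text = "hello, world"

# Word tokenization: split on whitespace, then strip punctuation.
word_tokens = [w.strip(",.!?") for w in text.split()]
print(word_tokens)      # ['hello', 'world']

# Character tokenization: every character becomes a token.
char_tokens = list("hi")
print(char_tokens)      # ['h', 'i']

# Sub-word tokenization: frequent pieces stay whole, rare words are split.
# Hard-coded for illustration; a learned tokenizer produces these splits.
subword_tokens = ["un", "happiness"]    # "unhappiness" -> ["un", "happiness"]
print(subword_tokens)   # ['un', 'happiness']
```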

Understanding the Mechanism of Sub-word Tokenization through Byte Pair Encoding

BPE (Byte Pair Encoding) was originally invented as a data compression algorithm, but its primary use today is in NLP. Modern LLMs such as GPT use BPE to encode text input, and the resulting tokens serve as the unit for measuring model performance and training cost.

  • BPE: “low, lower, newest” → [“low”, “low”, “##er”, “new”, “##est”]

    • The ## prefix marks a token that continues the previous token rather than starting a new word. This scheme reduces vocabulary size, handles out-of-vocabulary words, and captures more meaning.

  • In BPE, the vocabulary is built by repeatedly merging the most frequent symbol pairs (see the sketch below), while WordPiece and the Unigram LM choose the vocabulary based on likelihood instead.
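
A minimal sketch of how BPE learns its merges, following the classic frequency-based procedure (Sennrich et al., 2016). The toy corpus and the `</w>` end-of-word marker are assumptions for illustration; note that the ## continuation prefix shown above is the convention popularized by WordPiece, while classic BPE marks word boundaries with an end-of-word symbol instead.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given symbol pair with a single merged symbol."""
    # Lookarounds keep the match aligned to whole symbols, not substrings of them.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word is pre-split into characters plus an end-of-word marker.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

for step in range(10):                       # number of merges sets the vocabulary budget
    pair_counts = get_pair_counts(vocab)
    if not pair_counts:
        break
    best = pair_counts.most_common(1)[0][0]  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```

Running this on the toy corpus merges frequent endings like “es” and “est” first, which is exactly how sub-words such as “##est” emerge from raw character sequences.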

LLMs like GPT and BERT include multiple languages in their vocabulary. Roughly 10–20% of GPT's training data is non-English content, which makes for a diverse corpus; however, there is still a noticeable performance gap between English and non-English text.


Stemming, Lemmatization, Stop-word Removal

These NLP preprocessing techniques help standardize text data. They are commonly used together to clean and normalize text before further processing (a short usage sketch follows this list):

  • Stemming: "running" → "run", "fishes" → "fish"

    • A rule-based process that removes word endings to obtain the root form

    • While fast and simple, it can sometimes produce non-existent words

  • Lemmatization: "better" → "good", "was" → "be"

    • A more sophisticated approach that converts words to their dictionary base form

    • More accurate than stemming but computationally more expensive

  • Stop-word Removal: The process of filtering out common words that add little meaning

    • Reduces noise in text analysis and decreases computational complexity

    • The list of stop-words varies depending on the application and language
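
A short usage sketch with NLTK; the library choice is my assumption, since the notes above do not name one. The downloads fetch the WordNet and stop-word data on first run.

```python
import nltk
nltk.download("wordnet", quiet=True)     # needed by WordNetLemmatizer
nltk.download("omw-1.4", quiet=True)     # required by some NLTK versions
nltk.download("stopwords", quiet=True)   # needed for the stop-word list

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# Stemming: fast, rule-based suffix stripping.
print(stemmer.stem("running"), stemmer.stem("fishes"))     # run fish

# Lemmatization: dictionary lookup; the part-of-speech tag matters.
print(lemmatizer.lemmatize("better", pos="a"))             # good
print(lemmatizer.lemmatize("was", pos="v"))                # be

# Stop-word removal: drop common words that add little meaning.
tokens = "the cat was sitting on the mat".split()
print([t for t in tokens if t not in stop_words])          # ['cat', 'sitting', 'mat']
```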


A Foundation of NLP: BOW Modeling

Bag of Words (BoW) modeling converts text into vectors by treating each word in the vocabulary as a key and counting how often it occurs; a purely binary variant records only whether a word appears. A minimal implementation follows the example below.

  • BoW: "John likes to watch movies and also likes coffee" → {"John": 1, "like": 2, "to": 1, "watch": 1, "movie": 1, "and": 1, "also": 1, "coffee": 1}

  • A hashing function can be used to map each word to its index in the vector (the "hashing trick").
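
A minimal implementation of the count-based BoW representation sketched above. The crude trailing-"s" normalization is my stand-in for the lemmatization that maps "likes" to "like" and "movies" to "movie" in the example.

```python
from collections import Counter

def bag_of_words(text):
    """Build a bag-of-words count dictionary for one document."""
    # Crude normalization: strip a trailing 's' so that "likes" -> "like"
    # and "movies" -> "movie". A real pipeline would lemmatize instead.
    tokens = [word.rstrip("s") for word in text.split()]
    return Counter(tokens)

sentence = "John likes to watch movies and also likes coffee"
print(bag_of_words(sentence))
# Counter({'like': 2, 'John': 1, 'to': 1, 'watch': 1, 'movie': 1,
#          'and': 1, 'also': 1, 'coffee': 1})
```

To compare documents, these counts are laid out as a fixed-length vector over a shared vocabulary, or over hash buckets when the hashing trick mentioned above is used.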

Although transformers, which are the predominant architecture among LLMs, now use different representation methods, BoW remains a fundamental concept in NLP.

Check out the code I made!
