Tokenization, Stemming, Lemmatization, and Stop-word Removal: Foundations of NLP
Tokenization is the process of splitting a string into smaller units called tokens. It is a core preprocessing step that addresses the out-of-vocabulary (OOV) problem caused by a model's fixed vocabulary and reduces computational cost. Tokenization methods fall into three broad categories, and most modern natural language processing models (BERT, GPT, T5, Llama, and others) adopt sub-word-based tokenization.
Word-based tokenization splits "hello, world" into ["hello", "world"]. Its drawback is that it needs an excessively large vocabulary, and it is highly vulnerable to OOV words and to spelling errors, particularly in languages with long word forms such as German or Turkish.
Character-based tokenization splits "hi" into ["h", "i"]. It uses an extremely small vocabulary and never has to handle OOV. However, it requires far more computation, because sequences become much longer, and it tends to capture the meaning of words poorly.
Sub-word-based tokenization splits "unhappiness" into ["un", "happiness"]. It sits between the character-based and word-based approaches, and it resolves the OOV problem by splitting rarely used words into sub-words.
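The three approaches above can be compared side by side with a minimal sketch. The function names and `TOY_VOCAB` are hypothetical and exist only for illustration; in particular, the greedy longest-match split shown here is a simplified stand-in for a trained sub-word tokenizer, not the method any specific model uses.

```python
import re

def word_tokenize(text):
    # Word-based: keep runs of word characters; punctuation and spaces separate tokens.
    return re.findall(r"\w+", text)

def char_tokenize(text):
    # Character-based: every character becomes its own token.
    return list(text)

def subword_tokenize(word, vocab):
    # Sub-word-based (toy version): greedy longest-match against a fixed vocabulary.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical sub-word vocabulary, chosen by hand for this example.
TOY_VOCAB = {"un", "happiness", "happy", "ness"}

print(word_tokenize("hello, world"))               # ['hello', 'world']
print(char_tokenize("hi"))                         # ['h', 'i']
print(subword_tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happiness']
```

Because the sub-word splitter falls back to single characters, a word that is entirely unknown still produces tokens instead of an OOV failure.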
Understanding the Mechanism of Sub-word Tokenization through Byte Pair Encoding
BPE (Byte Pair Encoding) was originally developed as a compression algorithm, but it is now widely used in natural language processing.
First, tokenize the text at the character level.
Then, repeatedly merge the most frequent pair of adjacent tokens.
When the configured vocabulary size is reached, stop merging.
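The three steps above can be sketched as a small training loop. This is a minimal illustration under simplifying assumptions: the names `merge_pair` and `train_bpe` are hypothetical, the corpus is a plain list of words, ties between equally frequent pairs are broken by first occurrence, and details that production tokenizers add (byte-level handling, end-of-word markers) are left out.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, vocab_size):
    # Step 1: character-level tokenization; each word becomes a tuple of symbols.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    # Step 3: stop once the target vocabulary size is reached.
    while len(vocab) < vocab_size:
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 2: merge the most frequent adjacent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges, vocab

merges, vocab = train_bpe(["hug", "hug", "pug", "hugs"], vocab_size=7)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

In this toy corpus the pair ("u", "g") occurs four times, so it is merged first; ("h", "ug") then occurs three times and is merged second, at which point the vocabulary holds seven symbols and training stops.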
Understanding Stemming/Lemmatization Through BOW Modeling: Legacy Methods of Natural Language Processing
Stemming and lemmatization are processes that convert words to their base form in order to optimize the vocabulary.
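To make the difference concrete, here is a deliberately naive sketch. `toy_stem` strips a few hard-coded suffixes and `toy_lemmatize` looks words up in a small hand-written table; both names, the suffix list, and the table are assumptions made for illustration, and real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    # Stemming: chop off the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table; a real lemmatizer relies on a full dictionary.
TOY_LEMMAS = {
    "watching": "watch",
    "watched": "watch",
    "watches": "watch",
    "movies": "movie",
    "went": "go",
}

def toy_lemmatize(word):
    # Lemmatization: map a word to its dictionary form, or leave it unchanged.
    return TOY_LEMMAS.get(word, word)

words = ["watching", "watched", "watches", "movies", "went"]
print(sorted({toy_stem(w) for w in words}))       # ['movi', 'watch', 'went']
print(sorted({toy_lemmatize(w) for w in words}))  # ['go', 'movie', 'watch']
```

Both approaches shrink five surface forms to three vocabulary entries, but the stemmer produces "movi", which is not a real word, while the lookup-based lemmatizer returns dictionary forms and can handle irregular cases such as "went".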
Encoder: The encoder tokenizes the given input and then converts it into a computational/mathematical form.
Decoder: The decoder converts the given output back into natural language.
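One simple way to picture the encoder and decoder roles is a mapping between tokens and integer ids. The sketch below assumes whitespace tokenization and a vocabulary built from the text itself; `build_vocab`, `encode`, and `decode` are hypothetical helper names, not part of any particular library.

```python
def build_vocab(tokens):
    # Assign each distinct token the next free integer id, in order of first appearance.
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    # Encoder: tokenize the input, then convert each token to its integer id.
    return [vocab[tok] for tok in text.split()]

def decode(ids, vocab):
    # Decoder: convert integer ids back into natural-language text.
    id_to_token = {i: tok for tok, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

vocab = build_vocab("Mary likes movies".split())
ids = encode("movies likes Mary", vocab)
print(ids)                 # [2, 1, 0]
print(decode(ids, vocab))  # movies likes Mary
```

A token that is missing from `vocab` raises a `KeyError` in this sketch, which is exactly the OOV situation that sub-word tokenization is meant to avoid.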
Bag-of-Words Modeling
BOW (bag-of-words) modeling is a class of natural language processing models that vectorizes each word. It represents a text as a bag that contains all of its words. To define and understand this kind of modeling, consider the following.
Suppose the following two sentences are given. The model will convert each of them into a BOW object.
"John likes to watch movies. Mary likes movies too"
"Mary also likes to watch football games"
These sentences can be split into the following collections of strings.
"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"
Next, each token is treated as a unique key, and the information is re-expressed as the number of times each token occurs. In this way, each of the two sentences is converted into a BOW.
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
{"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}
Note that the words in a BOW have no order and carry no contextual meaning. Because of this, a BOW model can perform poorly on problems where word order matters.
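The worked example above can be reproduced with Python's `collections.Counter`. This is a minimal sketch: the helper name `bag_of_words` and the regex-based tokenization are assumptions for illustration, and the fixed-length count vectors at the end are one common way to vectorize a BOW, not the only one.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    # Tokenize on word characters, then count how often each token occurs.
    return Counter(re.findall(r"\w+", sentence))

s1 = "John likes to watch movies. Mary likes movies too"
s2 = "Mary also likes to watch football games"

bow1 = bag_of_words(s1)
bow2 = bag_of_words(s2)
print(dict(bow1))
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}

# A shared, sorted vocabulary turns each bag into a fixed-length count vector.
vocab = sorted(set(bow1) | set(bow2))
vec1 = [bow1[w] for w in vocab]
vec2 = [bow2[w] for w in vocab]
print(vocab)
print(vec1)  # [1, 1, 0, 0, 0, 2, 2, 1, 1, 1]
print(vec2)  # [0, 1, 1, 1, 1, 1, 0, 1, 0, 1]

# Word order is lost: two sentences with opposite meanings share one bag.
print(bag_of_words("John likes Mary") == bag_of_words("Mary likes John"))  # True
```

The final comparison illustrates the limitation described above: once a sentence is reduced to counts, nothing records who likes whom.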