Tokenization, Stemming, Lemmatization, and Stop-word Removal: Foundational Works of NLP

Tokenization is the process of splitting a string into smaller units called tokens. It is a core preprocessing step that addresses the out-of-vocabulary (OOV) problem arising from a model's fixed vocabulary and reduces computational cost. Tokenization strategies fall broadly into three categories, and most modern NLP models (BERT, GPT, T5, Llama, etc.) adopt subword-based tokenization. A toy comparison of all three follows the list below.

  • 단어 기반 토큰화(word-based tokenization)λŠ” "hello, world"λ₯Ό ["hello", "world"]둜 λΆ„λ¦¬ν•©λ‹ˆλ‹€.

    • μ§€λ‚˜μΉ˜κ²Œ 큰 단어μž₯ 곡간을 μ°¨μ§€ν•˜λŠ” 단점이 있으며, λ…μΌμ–΄λ‚˜ ν„°ν‚€μ–΄μ²˜λŸΌ κΈ΄ ν˜•νƒœμ˜ 언어와 μ‚¬μš©μžμ˜ λ§žμΆ€λ²• 였λ₯˜ 및 OOV에 맀우 μ·¨μ•½ν•©λ‹ˆλ‹€.

  • 캐릭터 기반 토큰화(character-based tokenization)λŠ” "hi"λ₯Ό ["h", "i"]둜 λΆ„λ¦¬ν•©λ‹ˆλ‹€.

    • κ·Ήλ„λ‘œ μž‘μ€ 단어μž₯ 곡간을 μ‚¬μš©ν•˜κ³  OOVλ₯Ό μ²˜λ¦¬ν•  ν•„μš”κ°€ μ—†μŠ΅λ‹ˆλ‹€. ν•˜μ§€λ§Œ κ³Όλ„ν•œ 계산λ ₯이 ν•„μš”ν•˜λ©° λ‹¨μ–΄μ˜ 의미λ₯Ό μ œλŒ€λ‘œ μ²˜λ¦¬ν•˜μ§€ λͺ»ν•˜λŠ” κ²½ν–₯이 μžˆμŠ΅λ‹ˆλ‹€.

  • 보쑰사 기반 토큰화(sub-word-based tokenization)λŠ” "unhappiness"λ₯Ό ["un", "happiness"]둜 λΆ„λ¦¬ν•©λ‹ˆλ‹€.

    • 캐릭터 기반과 단어 기반의 쀑간적 νŠΉμ„±μ„ κ°€μ§€λ©°, 자주 μ‚¬μš©ν•˜μ§€ μ•ŠλŠ” 단어λ₯Ό λ³΄μ‘°μ‚¬λ‘œ λ‚˜λˆ  OOV 문제λ₯Ό ν•΄κ²°ν•©λ‹ˆλ‹€.

[",", "?", "!", "\", "~"] λ“±μ˜ 기호λ₯Ό μ •μ§€ 단어(stop word)라 ν•˜λ©° μ˜λ―Έμƒ μ€‘μš”λ„κ°€ λ–¨μ–΄μ Έ 이λ₯Ό μ œκ±°ν•˜λŠ” 과정을 stop word removal라 ν•©λ‹ˆλ‹€.

Various Subword Tokenization Algorithms and Adoption Examples

보쑰사 기반 ν† ν°ν™”λŠ” 크게 λ‹€μŒμ˜ λ„€ κ°€μ§€ μ•Œκ³ λ¦¬μ¦˜μœΌλ‘œ λŒ€ν‘œλ  수 μžˆμŠ΅λ‹ˆλ‹€:

  • BPE, Byte Pair Encoding

    • μ˜ˆμ‹œ: GPT, RoBERTa

  • WordPiece

    • μ˜ˆμ‹œ: BERT, DistilBERT

  • Unigram LM

    • μ˜ˆμ‹œ: XLNet, ALBERT

  • SentencePiece

    • μ˜ˆμ‹œ: T5, Llama

Understanding the Mechanism of Subword Tokenization through Byte Pair Encoding

BPE (Byte Pair Encoding) was originally developed as a compression algorithm, but it is now widely used in natural language processing.

// example of BPE tokenization
corpus: "low, lower, newest"
after BPE: ["low", "low", "##er", "new", "##est"]

  1. First, tokenize the corpus at the character level.

  2. Iteratively merge the most frequent pair of adjacent tokens.

  3. Stop the merging process once the configured vocabulary size is reached (a minimal sketch of this loop follows).
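
A minimal sketch of the merge loop, following the classic formulation (the </w> end-of-word marker and the toy corpus are illustrative assumptions):

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Step 1: character-level tokens; </w> marks the end of a word.
corpus = Counter(["low", "low", "lower", "newest", "newest", "newest"])
vocab = {" ".join(word) + " </w>": freq for word, freq in corpus.items()}

# Steps 2-3: merge the most frequent adjacent pair until the budget runs out.
num_merges = 10
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)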

The Subword Continuation Prefix ##

In the BPE and WordPiece algorithms, the ## prefix marks a subword continuation: the subword must attach to the token immediately before it. This prefix makes the relationship between a word's stem and its split-off subwords explicit while preserving the word's original meaning, and it lets the model recover word meanings accurately while keeping the vocabulary size optimized.
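
To make the convention concrete, a tiny sketch of how ##-prefixed tokens are glued back together (the token sequence is a made-up example):

def detokenize(tokens):
    # '##' marks a continuation: append to the previous token with no space.
    text = ""
    for tok in tokens:
        if tok.startswith("##"):
            text += tok[2:]
        else:
            text += (" " if text else "") + tok
    return text

print(detokenize(["un", "##happi", "##ness", "is", "heavy"]))
# -> "unhappiness is heavy"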

  • WordPiece μ—­μ‹œ μ„€λͺ…λœ BPE와 같은 아이디어λ₯Ό μ‚¬μš©ν•˜μ§€λ§Œ, λΉˆλ„μˆ˜κ°€ μ•„λ‹Œ μš°λ„(likelihood) ν™•λ₯ μ„ κΈ°μ€€μœΌλ‘œ ν•©λ‹ˆλ‹€.

  • Unigram LM은 각 단어에 κ°œλ³„μ μΈ μš°λ„ 점수(likelihood score)λ₯Ό λΆ€μ—¬ν•˜κ³ , 이에 κΈ°λ°˜ν•˜μ—¬ λ‚΄λ¦Όμ°¨μˆœμœΌλ‘œ 단어μž₯을 κ΅¬μ„±ν•©λ‹ˆλ‹€.

μžμ—°μ–΄ 처리 λͺ¨λΈμ—μ„œμ˜ λ‹€μ–Έμ–΄

Models such as GPT and BERT handle non-English and morphologically complex languages directly, splitting them into smaller units and working on the original text without any separate translation step. GPT-3/4 are known to draw roughly 10-20% of their training data from non-English sources, often described as diverse corpora. Even with this composition, a performance gap on non-English languages remains unavoidable.


Understanding Stemming/Lemmatization through BOW Modeling: The Legacy Methods of Natural Language Processing

μŠ€ν…Œλ°(stemming) λ˜λŠ” λ¦¬λ©˜νƒ€μ œμ΄μ…˜(lemmatization)은 단어λ₯Ό μ›ν˜•μœΌλ‘œ λ³€ν™˜ν•΄ 단어μž₯(vocabulary)을 μ΅œμ ν™”ν•˜κΈ° μœ„ν•œ κ³Όμ •μž…λ‹ˆλ‹€.

  • 엔코더(encoder): μ—”μ½”λ”λŠ” μ£Όμ–΄μ§„ μž…λ ₯을 ν† ν°ν™”ν•œ ν›„, 이λ₯Ό 전산적/μˆ˜ν•™μ  λŒ€ν‘œλ‘œ λ³€ν™˜ν•©λ‹ˆλ‹€.

  • 디코더(decoder): λ””μ½”λ”λŠ” μ£Όμ–΄μ§„ 좜λ ₯을 μžμ—°μ–΄λ‘œ λ³€ν™˜ν•©λ‹ˆλ‹€.

ν•˜μ§€λ§Œ μ΄λŸ¬ν•œ 처리 방식은 각 λ‹¨μ–΄μ˜ λ―Έλ¬˜ν•œ 의미 차이λ₯Ό 포착해야 ν•˜λŠ” ν˜„λŒ€μ˜ μžμ—°μ–΄ 처리 λͺ¨λΈμ—μ„œλŠ” 거의 μ‚¬μš©λ˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄ GPTλŠ” μŠ€ν…Œλ°/λ¦¬λ©˜νƒ€μ œμ΄μ…˜ λŒ€μ‹  보쑰사 ν† ν°ν™”λ§Œμ„ ν™œμš©ν•©λ‹ˆλ‹€.

Word Frequency

Word frequency measures how often each word occurs and is used to weigh the content of a sentence. For example, a word like "is" appears very frequently, but it contributes little to the actual content, so its high frequency carries little meaning.

  • 이λ₯Ό 톡해 μ£ΌνŒŒμˆ˜κ°€ 높은 단어λ₯Ό μ²˜λ¦¬ν•¨μœΌλ‘œμ¨ μ£Όμ–΄μ§„ λ¬Έμž₯의 λ‚΄μš©μ„ 평균화할 수 μžˆμŠ΅λ‹ˆλ‹€.

  • 의미 μ—†λŠ” 단어λ₯Ό λΆ„μ„μ—μ„œ μ œμ™Έν•˜κΈ° μœ„ν•΄ 이진 가쀑(binary weight)이 μ‚¬μš©λ  수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŠ” λΆˆν•„μš”ν•œ 단어에 0을, κ·Έ μ™Έμ˜ 단어에 1을 κ³±ν•˜λŠ” λ°©μ‹μœΌλ‘œ κ΅¬ν˜„λ©λ‹ˆλ‹€.

Embeddings and the Embedding Space

An embedding converts words or phrases of text into numeric vectors for natural language processing, mapping each word's meaning to a position in an embedding space. Within a given encoder, similar words end up close to one another according to their contextual similarity.
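
A minimal sketch of that "closeness", with made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

import numpy as np

emb = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.75, 0.20]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["king"], emb["queen"]))  # near 1.0: close in the space
print(cosine(emb["king"], emb["apple"]))  # much smaller: far apart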

Bag-of-Words Modeling

BOW λͺ¨λΈλ§(Bag Of Words modeling)λŠ” 각 단어λ₯Ό 벑터화 ν•˜λŠ” μžμ—°μ–΄ 처리 λͺ¨λΈμ˜ ν•œ κ³„κΈ‰μž…λ‹ˆλ‹€β€”λͺ¨λ“  단어λ₯Ό ν¬ν•¨ν•œ 가방을 톡해 κΈ€μžλ₯Ό λŒ€ν‘œν•˜λ € ν•©λ‹ˆλ‹€β€”μ΄λŸ° λͺ¨λΈλ§μ„ μ •μ˜ν•˜κ³  μ΄ν•΄ν•˜κΈ° μœ„ν•΄ λ‹€μŒμ„ μ‚΄νŽ΄λ΄…λ‹ˆλ‹€.

  • λ‹€μŒ 두 λ¬Έμž₯이 μ£Όμ–΄μ§‘λ‹ˆλ‹€. 이 λͺ¨λΈμ€ 이λ₯Ό BOW 객체둜 λ³€ν™˜ν•˜λ € ν•©λ‹ˆλ‹€.

    • "John likes to watch movies. Mary likes movies too"

    • "Mary also likes to watch football games"

  • 이 λ¬Έμž₯듀은 λ‹€μŒκ³Ό 같은 λ¬Έμžμ—΄μ˜ μ§‘ν•©μœΌλ‘œ λ‚˜λˆŒ 수 μžˆμŠ΅λ‹ˆλ‹€.

    • "John","likes","to","watch","movies","Mary","likes","movies","too"

    • "Mary","also","likes","to","watch","football","games"

  • κ·Έ λ‹€μŒ, 각 토큰(token)을 κ³ μœ ν•œ ν‚€λ‘œ κ°„μ£Όν•˜κ³  λ“±μž₯ νšŸμˆ˜μ— 따라 정보λ₯Ό μž¬ν‘œν˜„ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λ ‡κ²Œ 두 λ¬Έμž₯은 각각 BOW둜 λ³€ν™˜λ©λ‹ˆλ‹€.

    • {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}

    • {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

    • μ΄λ•Œ, BOW에 ν¬ν•¨λœ 단어듀은 μˆœμ„œκ°€ μ—†κ³  λ¬Έλ§₯적 의미λ₯Ό κ°–μ§€ μ•ŠμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ νŠΉμ„± λ•Œλ¬Έμ— 단어 μˆœμ„œκ°€ μ€‘μš”ν•œ λ¬Έμ œμ—μ„œλŠ” BOW λͺ¨λΈμ˜ μ„±λŠ₯이 λ–¨μ–΄μ§ˆ 수 μžˆμŠ΅λ‹ˆλ‹€.

Instead of a plain vocabulary, a hashing function can be used: hashing each word directly to an index achieves scalability and simplicity.
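
A sketch of that hashing trick (the bucket count is arbitrary; a real system would use a stable hash such as hashlib rather than Python's per-process hash()):

def hashed_bow(tokens, n_buckets=16):
    vec = [0] * n_buckets
    for tok in tokens:
        vec[hash(tok) % n_buckets] += 1  # hash replaces the vocabulary lookup
    return vec

print(hashed_bow(["Mary", "also", "likes", "to", "watch", "football", "games"]))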

BOWλŠ” μ›λž˜ 비지도 ν•™μŠ΅μœΌλ‘œ κ°œλ°œλ˜μ—ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŸ¬λ‚˜ 지도 ν•™μŠ΅μ˜ λΆ„λ₯˜ λ¬Έμ œμ—λ„ ν™œμš©λ©λ‹ˆλ‹€. 이 경우, μ£Όμ–΄μ§„ λ¬Έμž₯에 λŒ€ν•΄ νŠΉμ • λ²”μ£Όμ˜ 라벨을 좜λ ₯ν•˜λ„λ‘ μ„€μ •ν•©λ‹ˆλ‹€.

μžμ—°μ–΄ 처리 뿐 μ•„λ‹ˆλΌ 정보 회수(information retrieval)μ—μ„œλ„ 자주 μ‚¬μš©λ©λ‹ˆλ‹€.



Check out the implementation code I wrote myself!