Lif31up's Blog

Foundational Work of ML: Linear/Logistic Regression


Last updated 1 day ago

Logistic regression and linear regression are core concepts underlying deep learning, providing the basic principles of classification and regression problems, respectively. Gradient descent and the weight-optimization techniques of linear regression were extended into the backbone of neural-network training, and the activation-function concept of logistic regression shaped how deep learning implements non-linearity. The following are the main concepts that make up regression:

  • A model defines the predicted relationship between inputs and outputs.

  • A learning rule defines how the weights are adjusted to minimize error.

These simple models are reinterpreted as building blocks of complex deep-learning architectures and are still used, especially in early and output layers.

This post introduces linear regression together with its learning rules, and logistic regression together with activation functions.

What is a Mathematical Model?

A mathematical model is an attempt to express a real-world phenomenon mathematically. It helps us understand the phenomenon, predict the future, and make reasonable decisions.

A mathematical model does not aim to imitate the real world perfectly. Rather, it can be called an idealized representation of the phenomenon. If the predictions obtained from a model provide meaningful results, the model is valuable even when it is not perfectly accurate.

A function can also be thought of as a machine model: when $x$ is in the function's domain, $x$ can be the input fed into the machine. Naturally, the output is $f(x)$, determined according to the machine's rule.
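The machine view above can be sketched in a few lines of Python; the rule $2x + 1$ here is an arbitrary assumption chosen only for illustration.

```python
# A function viewed as a machine: an input x from the domain goes in,
# and the machine's rule determines the output f(x).
def f(x):
    return 2 * x + 1  # this particular machine's rule (assumed)

print(f(3))  # 7
```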

Linear Regression

Linear regression models the relationship between one or more explanatory variables and a scalar response, and refers specifically to approaches that treat this relationship as linear.

  • A linear predictor function, built from parameters estimated from the data, approximates the input–output relationship—this is called a linear model. In most cases, for the parameter-estimation algorithms to work, the conditional mean of the response given the explanatory variables must be expressible as an affine function of those variables.

  • Least squares is generally used for the fitting. A state in which the fit fails to work properly is called lack of fit (LOF)—a variety of approaches derived from least squares exist to avoid it.

์—๋‹ˆ๋ฉ”์ด์…˜์„ ํ†ตํ•œ ์ดํ•ด

์„ ํ˜•ํšŒ๊ท€์˜ ๋‹ค์–‘ํ•œ ๋ณ€ํ˜•๋“ค

  • ๋งŒ์•ฝ ๋ชฉ์ ์ด ์„ค๋ช…๋ณ€์ธ ๊ฐ„์˜ ๋ฐ˜์‘, ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•จ์ด๋ผ๋ฉด ์ด๋Š” ์„ ํ˜•ํšŒ๊ท€๋ถ„์„(regression analysis)๋กœ ๋ฒ”์ฃผํ•ฉ๋‹ˆ๋‹ค.

  • ๋งŒ์•ฝ ๋ชฉ์ ์ด ์˜ˆ์ธก๊ณผ ์ถ”๊ณ„์ด๋ฉฐ ๋”ฐ๋ผ์„œ ์˜ค๋ฅ˜๋ฅผ ์ค„์—ฌ์•ผ ํ•œ๋‹ค๋ฉด ํ•ด๋‹น ๋ชจ๋ธ์€ ์˜ˆ์ธก๋ชจ๋ธ(predictive model)๋กœ ๋ฒ”์ฃผํ•ฉ๋‹ˆ๋‹ค.

ํŠน์ง•์— ๋”ฐ๋ฅธ ์„ ํ˜•ํšŒ๊ท€์˜ ๋ถ„๋ฅ˜

  • ํ•˜๋‚˜์˜ ์„ค๋ช…๋ณ€์ธ๋งŒ์„ ๊ฐ€์ง„ ๊ฒฝ์šฐ๋ฅผ ๋‹จ์ˆœ ์„ ํ˜• ํšŒ๊ท€๋ถ„์„(simple linear regression)๋ผ ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

  • ํ•˜๋‚˜ ๋ณด๋‹ค ๋งŽ์€ ์„ค๋ช…๋ณ€์ธ์„ ๊ฐ€์ง„ ๊ฒฝ์šฐ๋ฅผ ๋‹ค์ค‘ ์„ ํ˜• ํšŒ๊ท€๋ถ„์„(multiple linear regression)๋ผ ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

  • ์ข…์†๋ณ€์ธ์ด ํ•˜๋‚˜ ๋ณด๋‹ค ๋งŽ์€ ๊ฒฝ์šฐ๋ฅผ ๋‹ค๋ณ€ ์„ ํ˜• ํšŒ๊ท€๋ถ„์„(multivariate linear regression)๋ผ ํ•ฉ๋‹ˆ๋‹ค.

์„ ํ˜•ํšŒ๊ท€๋Š” ๊ฒฐํ•ฉ๋ถ€ ํ™•๋ฅ  ๋ถ„ํฌ(joint probability distribution)๋ณด๋‹ค๋Š” ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  ๋ถ„ํฌ(conditional probability distribution)์— ์ดˆ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Formulation

Given a dataset $\{ y_i, x_{i1}, \dots, x_{ip} \}_{i=1}^n$, the linear model estimates the parameter vector $\beta$ relating the explanatory variables to the dependent variable $y$.

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i$$

  • $y$ is the vector of observed values.

  • $X$ is the matrix of column vectors $x_i$, or equivalently of multi-dimensional row vectors $x_j$.

  • $\beta$ is a $(p+1)$-dimensional parameter vector.

  • $\epsilon$ is the vector of error terms $\epsilon_i$.

$\epsilon$ is, in the end, random noise unrelated to the observations—the noise in the relationship between the regressors and the dependent variable.
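The formulation above can be simulated directly. Everything in this sketch—the dimensions, the parameter values, the noise scale—is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

X = rng.normal(size=(n, p))            # explanatory variables x_i1 .. x_ip
beta0 = 1.0                            # intercept β_0
beta = np.array([0.5, -1.0, 2.0])      # parameter vector β_1 .. β_p
eps = rng.normal(scale=0.1, size=n)    # error terms ε_i, unrelated to X

# y_i = β_0 + β_1·x_i1 + ... + β_p·x_ip + ε_i, computed for all i at once
y = beta0 + X @ beta + eps
print(y.shape)  # (100,)
```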

์„ ํ˜•ํšŒ๊ท€์˜ ์ฃผ์š” ๊ฐœ๋…๊ณผ ํ•œ๊ณ„

์„ ํ˜•ํšŒ๊ท€๋Š” ํ˜„๋Œ€์— ์™€์„œ ๋งŽ์€ ํ•œ๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๊ณ  ์ƒ๊ฐ๋˜์ง€๋งŒ ๋ช‡ ๋ฌธ์ œ์—๋Š” ์ข€ ๋” ํšจ์œจ์ ์ธ ํ•ด๊ฒฐ์„ ์ œ๊ณตํ•˜๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹คโ€”๋‹ค์Œ์€ ์ฃผ์š”ํ•œ ๊ฐœ๋…๋…๊ณผ ํ•œ๊ณ„์ ์ž…๋‹ˆ๋‹ค.

  • ์™ธ์ƒ์„ฑ(exogeneity)๋ž€ ๋ชจ๋ธ์ด ์˜ค๋ฅ˜์™€ ์—ฐ๊ด€๋˜์ง€ ์•Š์Œ, ๋˜๋Š” ๊ทธ๋Ÿฐ ์„ฑ์งˆ์— ๋Œ€ํ•œ ์ฒ™๋„๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ˆ˜ํ•™์ ์œผ๋ก  E[ฯตโˆฃX]=0\mathbb{E}[ \epsilon | X ] = 0E[ฯตโˆฃX]=0์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹คโ€”์„ ํ˜•ํšŒ๊ท€๋Š” ์•ฝํ•œ ์™ธ์ƒ์„ฑ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

    • ์—„๊ฒฉํ•œ ์™ธ์ƒ์„ฑ(strict exogeneity)์€ ๋ชจ๋“  ๊ธฐ๊ฐ„์— ๊ฑธ์ณ ์™ธ์ƒ์„ฑ์„ ๊ฐ€์ง์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

    • ์•ฝํ•œ ์™ธ์ƒ์„ฑ(weak exogeneity)์€ ํ˜„ ๊ธฐ๊ฐ„์— ๊ฑธ์ณ ์™ธ์ƒ์„ฑ์„ ๊ฐ€์ง์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

    • ๊ธฐ์ •์‚ฌ์‹ค์„ฑ(deterministic)์€ ๊ตฌ ๊ธฐ๊ฐ„์— ๋Œ€ํ•ด ์™ธ์ƒ์„ฑ์„ ๊ฐ€์ง€์ง€๋งŒ ํ˜„์žฌ์™€ ๋ฏธ๋ž˜์—” ๊ทธ๋ ‡์ง€ ๋ชปํ•จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

  • ์„ ํ˜•์„ฑ(linearity)์€ ๋ฐ˜์‘๋œ ๋ณ€์ˆ˜์˜ ํ‰๊ท ์ด ๋งค๊ฐœ๋ณ€์ˆ˜, ์˜ˆ์ธก์ž ๋ณ€์ˆ˜์˜ ์„ ํ˜•์กฐํ•ฉ์œผ๋กœ ์ธก์ •๋จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

  • ๊ณ ์ •์ ์ธ ๋ณ€ํ™”(constant variance, homoscedasticity)๋Š” ์˜ค๋ฅ˜์˜ ๋ณ€ํ™”๊ฐ€ ์˜ˆ์ธก์ž ๋ณ€์ˆ˜์˜ ๊ฐ’์— ์˜์กดํ•˜์ง€ ์•Š์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด ์ˆ˜์ž…์ด 1000์œผ๋กœ ์˜ˆ์ธก๋œ ๊ฐœ์ธ์€ ์‹ค์งˆ์ ์œผ๋กœ 800 ~ 1200์˜ ์ˆ˜์ต์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

    • ์•ž์„  ์˜ˆ์‹œ์—์„œ |200|๋กœ ๋‚˜ํƒ€๋‚œ ์ด ๊ฐ’์„ ๊ณ ์ •์  ๋ณ€ํ™”๋ผ ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

  • ์˜ค๋ฅ˜์˜ ๋…๋ฆฝ์„ฑ(Independence of errors)๋Š” ์˜ค๋ฅ˜๊ฐ€ ์ผ๊ด€๋œ ์—ฐ๊ด€์„ ์™„์ „ํžˆ ๋ฒ—์–ด๋‚จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์„ ํ˜•ํšŒ๊ท€๋Š” ์˜ค๋ฅ˜์˜ ๋…๋ฆฝ์„ฑ์— ํŠนํžˆ ์ทจ์•ฝํ•ฉ๋‹ˆ๋‹ค.

    • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ž๋ฃŒ์˜ ์ •ํ˜•ํ™”(data regularization), ๋ฒ ์ด์‹œ์•ˆ ์„ ํ˜•ํšŒ๊ท€(Bayesโ€™ linear regression)๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

Learning Rules

A learning rule is an algorithm that optimizes a linear-regression model for the problem at hand. The main learning rules for linear regression are Newton's method and the gradient descent rule (GDR), as follows:

Newtonโ€™s Method

Given a function $f$, its derivative $f'$, and a starting point $x_0$, the method assumes that $f$ satisfies the prediction below and that repeatedly adjusting the starting point accordingly will reach the optimal point. At each step, the tangent line at the current point is computed from the derivative, and the intersection of that tangent with the $x$ axis is found—this intersection replaces the current point, and the process repeats.

  • If the tangent to the curve $f(x)$ at $x = x_n$ crosses the $x$ axis at $x_{n+1}$, it is obtained as follows.

  • $$\text{slope of the tangent: } f'(x_n) = \frac{f(x_n) - 0}{x_n - x_{n+1}}$$

    • $$\therefore x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

  • By repeatedly evaluating $f'(x_n)$, the iteration moves $x_{n+1}$ step by step toward the solution.
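The iteration $x_{n+1} = x_n - f(x_n)/f'(x_n)$ is short to implement. The square-root example below is my own choice: finding where $f(x) = x^2 - 2$ crosses zero recovers $\sqrt{2}$.

```python
def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Repeatedly replace the current point with the x-intercept of
    the tangent line: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return x

# Root of f(x) = x² - 2, i.e. √2
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.4142135623730951
```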

์—๋‹ˆ๋ฉ”์ด์…˜์„ ํ†ตํ•œ ์ดํ•ด

Newton's Method๋Š” ํฐ ๊ณ„์‚ฐ ๋น„์šฉ๊ณผ ์•ˆ์žฅ์ ์— ๋Œ€ํ•œ ์ทจ์•ฝ์„ฑ ๋•Œ๋ฌธ์— ์ตœ์‹  ์ธ๊ณต์ง€๋Šฅ์„ ์œ„ํ•œ ํ•™์Šต์œผ๋ก  ์ž˜ ์„ ํƒ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

ํ•ด๋‹น ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฏธ๋ถ„๊ณผ ์ ‘์„ ๊ณผ ๊ทธ ๊ธฐ์šธ๊ธฐ์˜ ๊ด€๊ณ„๋ฅผ ๊ทผ๊ฑฐ๋กœ ์ตœ์  ์ง€์ ์„ ๊ตฌํ•˜๋Š” ์ œ 1 ๋ฐ˜๋ณต ์ตœ์ ํ™”(first-order iterative optimization)์ž… ๋ถ„๋ฅ˜๋ฉ๋‹ˆ๋‹ค.

GDR, Gradient Descent Rule

The gradient descent rule is an optimization algorithm that iteratively adjusts the weights to minimize a loss function. The weights are adjusted in the direction of steepest descent, and this forms the basis of training modern machine-learning models.

  • Initialize the weights $w$ to zero or to random values.

  • Compute the discrepancy between the observations and the model's predictions.

    • The function that measures this is called the loss function $J(w)$.

  • Repeat $w_{\text{new}} = w_{\text{old}} - \alpha \cdot \nabla J(w)$ for a set number of iterations, or until the error has shrunk sufficiently.

    • $\alpha$ is the learning rate, which governs how much the adjustment changes per iteration.
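The three steps above can be sketched for a one-parameter model; the toy data, the learning rate, and the iteration count are all assumptions for illustration.

```python
import numpy as np

# Toy data lying near the line y = 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 3.1, 5.9, 9.0])

w = 0.0       # 1) initialize the weight to zero
alpha = 0.01  # learning rate α

for _ in range(1000):
    # 2) loss J(w) = mean((w·x − y)²); its gradient with respect to w
    grad = 2 * np.mean((w * x - y) * x)
    # 3) w_new = w_old − α · ∇J(w)
    w -= alpha * grad

print(w)  # ≈ 2.99, the least-squares slope
```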

์—๋‹ˆ๋ฉ”์ด์…˜์„ ํ†ตํ•œ ์ดํ•ด

Logistic Regression

Understanding binary logistic regression

In binary logistic regression, an indicator variable serves as the dependent variable—the indicator takes the values 0 and 1, while the model's output is, in the end, a continuous variable ranging over 0 to 1.

  • When the outcome involves two or more binary variables—equivalently, a dependent variable that is categorical with more than two levels—the model is called multinomial logistic regression; the multiple binary variables can be re-encoded as a single categorical variable. When those categories are ordered and their ordering carries meaning, it is called ordinal logistic regression.

  • Logistic regression itself only provides probabilistic predictions, but it can also be used as a classifier; in this form it is called a statistical classifier.

The log-odds used as the unit of measurement is called the logit—short for logistic unit.

A binary variable serves as the space of classes or events—for example, the probability of a two-valued outcome such as a team's victory, or a patient's health status.

The difference between a binary classifier and a predictive model

Logistic regression, as a model of inputs and outputs, does not itself perform classification. Only when a cut-off value is defined, and outputs above and below it are assigned to different classes, can it be called a binary classifier.

  • A regression model for a binary variable passes its linear predictor through the sigmoid—that is, the logistic—function.

  • Applying a specific cut to the output, turning the result into an indicator variable for a class, is the standard way to build a binary classifier.
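The cut-off idea can be sketched directly; the weights `w`, `b` below are hypothetical values chosen only to make the example concrete.

```python
import math

def sigmoid(z):
    # Squashes the linear predictor into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, w, b, cutoff=0.5):
    """Logistic regression outputs a probability; thresholding that
    probability at a cut-off value turns the model into a binary classifier."""
    p = sigmoid(w * x + b)  # probabilistic prediction
    return 1 if p >= cutoff else 0

w, b = 2.0, -1.0  # hypothetical fitted weights
print(classify(2.0, w, b))   # σ(3) ≈ 0.95 -> class 1
print(classify(-1.0, w, b))  # σ(-3) ≈ 0.05 -> class 0
```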

A logistic model, or logit model, predicts the log-odds of an event through a linear combination of variables. Logistic regression is the algorithm—the mathematical modeling—for estimating its parameters.

Logistic regression is most commonly fit by maximum-likelihood estimation (MLE). Unlike linear least squares, this admits no closed-form expression. Logistic regression via MLE plays the same role for categorical/binary responses that linear regression via ordinary least squares plays for continuous ones.

MLE, the Learning Rule of Logistic Regression

์ตœ๋Œ€์šฐ๋„์ถ”์ •(Maximum Likelihood Estimation)์€ ๋ผ์ดํด๋ฆฌํ›„๋“œ์˜ ฮธ\thetaฮธ๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐ ํ•จ์ˆ˜๋ฅผ ์ง€์นญํ•ฉ๋‹ˆ๋‹ค.

  1. ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค. MLE๋Š” ๊ด€์ฐฐ์ž๋ฃŒ์™€ ์˜ˆ์ธก๊ฐ’์˜ ๊ฒฐํ•ฉํ™•๋ฅ ์ด ์ตœ๋Œ€๊ฐ€ ๋˜๋Š” $\theta$๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค: fn(y;ฮธ)=โˆk=1nfkunivar(yk;ฮธ) \mathcal{f}{n}(y;\theta) = \prod{k=1}^{n}\mathcal{f}_k^{univar}(y_k;\theta) fn(y;ฮธ)=โˆk=1nfkunivarโ€‹(ykโ€‹;ฮธ)

  2. ฮ˜\Thetaฮ˜์—์„œ ์šฐ๋„ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ๋‹ค์Œ์œผ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค: ฮธ^=argโ€‰maxโกฮธโˆˆฮ˜Ln(ฮธ;y) \hat{\theta} = \argmax_{\theta \in \Theta}\mathcal{L}_n(\theta;y) ฮธ^=argmaxฮธโˆˆฮ˜โ€‹Lnโ€‹(ฮธ;y)

    • ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ, ์šฐ๋„ํ•จ์ˆ˜์— ์ž์—ฐ๋กœ๊ทธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜์‹์„ ๊ตฌ์„ฑํ•˜๋Š” ๊ฒƒ์ด ์œ ์šฉํ•ฉ๋‹ˆ๋‹คโ€”์ด๋ฅผ ๋กœ๊ทธ๋ผ์ดํด๋ฆฌ ํ•จ์ˆ˜(log-likelihood function)๋ผ ํ•ฉ๋‹ˆ๋‹ค: l(ฮธ;y)=lnโกL(ฮธ;y) l(\theta;y)=\ln\mathcal{L}(\theta;y) l(ฮธ;y)=lnL(ฮธ;y)

  3. ๋กœ๊ทธํ•จ์ˆ˜๋Š” ๋‹จ์กฐํ•จ์ˆ˜์ด๊ธฐ ๋•Œ๋ฌธ์— l(ฮธ;y)\mathcal{l}(\theta;y)l(ฮธ;y)๋Š” Ln\mathcal{L}_{n}Lnโ€‹์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ฮธ\thetaฮธ๋ผ๊ณ  ์ƒ๊ฐ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ, ฮธ\thetaฮธ๋Š” ๋‹ค์Œ ์กฐ๊ฑด์„ ํ•„์ˆ˜์ ์œผ๋กœ ๋งŒ์กฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค: dldฮธk=0 \frac{dl}{d\theta_{k}} = 0 dฮธkโ€‹dlโ€‹=0

  4. ์šฐ๋„ํ•จ์ˆ˜์— ๋Œ€ํ•œ ฮธ\thetaฮธ์˜ ๋ฏธ๋ถ„์€ ๊ฒฐ๊ณผ์ ์œผ๋กœ 000์ด ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Notation summary
  • $\Theta$ is the parameter space.

  • $\mathcal{L}$ is the likelihood function.

  • $\hat{\theta}$ is the estimate of $\theta$.
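The steps above can be checked numerically for the simplest case, a Bernoulli likelihood, where the maximizer of $\ell(\theta; y)$ is known to be the sample mean. The data and the grid search over $\Theta = (0, 1)$ are assumptions for illustration.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # observed outcomes y_k (assumed)

def log_likelihood(theta, y):
    # l(θ; y) = Σ_k ln f(y_k; θ) for the Bernoulli(θ) density
    return sum(yk * math.log(theta) + (1 - yk) * math.log(1 - theta)
               for yk in y)

# Scan the parameter space Θ = (0, 1) for the maximizer θ̂
thetas = [i / 1000 for i in range(1, 1000)]
theta_hat = max(thetas, key=lambda t: log_likelihood(t, data))

print(theta_hat)  # 0.75 = mean(data), where dl/dθ = 0
```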

Activation Function

An activation function is a mathematical function applied to a neuron's output. Its main purpose is to add non-linearity to a model or neural network so that it can make better decisions on complex problems. Without activation functions, tasks such as natural language processing or image recognition would be out of reach.

  • The non-linearity of activation functions lets a model learn and predict complex patterns.

  • Activation functions are differentiable (almost everywhere), which is essential for learning by backpropagation.

  • An activation function constrains the model's output by squashing values into a limited range.

The following are various activation functions used in machine learning.

Optimization landscapes of the various activation functions, illustrated

The optimization landscape associated with an activation function is called an activation map.

Sigmoid/Logistic Function

$$\sigma(x) = \frac{1}{1+e^{-x}}$$
  • The sigmoid function outputs values between $0$ and $1$.

  • It is mainly used for binary classification problems.

    • It is vulnerable to vanishing gradients.

    • Because it is not centered at $0$, learning is slow.

Tanh Function (Hyperbolic Tangent)

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
  • The tanh function outputs values between $-1$ and $1$.

  • It is often used for hidden layers.

    • It, too, is vulnerable to vanishing gradients.

    • Because it is centered at $0$, it learns faster than the sigmoid.

ReLU, Rectified Linear Unit

$$\text{ReLU}(x) = \max(0, x)$$
  • The ReLU function outputs values from $0$ to $\infty$.

  • It is used in most hidden layers.

    • It avoids the vanishing-gradient problem, but has a distinctive failure mode called dying ReLU, in which neurons that only ever receive negative inputs get stuck outputting zero.

    • It is fast to train.

Leaky ReLU

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha \cdot x & \text{otherwise} \end{cases}$$
  • The Leaky ReLU function outputs values from $-\infty$ to $\infty$.

  • It was devised as a fix for the dying-ReLU problem.

  • A Leaky ReLU whose $\alpha$ is adjusted during training is called PReLU (Parametric ReLU).

ELU, Exponential Linear Unit

$$\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{otherwise} \end{cases}$$
  • It outputs values from $-\alpha$ to $\infty$.

  • It delivers slightly better performance than ReLU.

Swish, Google Brain

$$\text{Swish}(x) = x \cdot \sigma(\beta \cdot x), \quad \text{where } \sigma \text{ is the sigmoid and } \beta \text{ is a learnable parameter}$$
  • It outputs values from $-\infty$ to $\infty$.

  • It outperforms ReLU in deep neural networks.

Sharp, spike-like regions in an activation function's graph produce overly sensitive changes during training. Swish has such a region, but in smoothed form, preserving the original behavior while preventing excessive responses.
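The activation functions covered above can be written out in a few lines each; this is a minimal scalar sketch with the default $\alpha$ and $\beta$ values assumed.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))             # output in (0, 1)

def tanh(x):
    return math.tanh(x)                        # output in (-1, 1)

def relu(x):
    return max(0.0, x)                         # output in [0, ∞)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x           # output in (-∞, ∞)

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1)  # output in (-α, ∞)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)               # output in (-∞, ∞)

# Compare how each function treats a negative, zero, and positive input
for f in (sigmoid, tanh, relu, leaky_relu, elu, swish):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.0, 2.0)])
```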