
8. Classification

by Jaeguk 2024. 4. 13.

Classification is the task of predicting the class label of a given piece of data

 

Classification vs Regression


How do the two differ?

  • Classification
    • Predicts a categorical class label for the given data
    • A classifier is first trained on training data; new data is then fed to the model to predict the result
    • Ex) A model that decides whether the weather is cold or not
  • Regression
    • Trains a model that outputs a continuous value
    • The model is used to predict unknown or missing values
    • In short, the goal is a model that predicts continuous values
    • Ex) A model that predicts the temperature
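
The contrast between the two output types can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the hourly temperature readings, the 10-degree threshold, and the function names. The hand-written rule merely stands in for a trained classifier.

```python
from statistics import mean

def classify_cold(temperature_c, threshold_c=10.0):
    """Classification: map an input to a categorical label."""
    return "cold" if temperature_c < threshold_c else "not cold"

def fit_line(xs, ys):
    """Regression: fit y = a * x + b by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    a = numerator / denominator
    b = y_bar - a * x_bar
    return a, b

# Hypothetical data: hour of day -> temperature in Celsius.
hours = [6, 9, 12, 15]
temps = [4.0, 7.0, 10.0, 13.0]

a, b = fit_line(hours, temps)
predicted = a * 18 + b            # regression output: a continuous number
label = classify_cold(predicted)  # classification output: one of two categories
```

The regression result is a number on a continuous scale, while the classifier can only ever return one of its predefined labels.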

 

Classification


How do we build a model that performs classification?

  • Using training data whose class values are already known, build the model that best explains that training data
  • Training Data
    • The data used to build the model
    • Has the form <Feature 1, Feature 2, ..., Feature N, Label>
    • Each data point is assumed to belong to a single class
  • Model
    • Given data of the form <Feature 1, Feature 2, ..., Feature N>, the model predicts the label of that data
    • Many kinds of classifier models exist
      • Classification Rules
      • Decision Trees
      • Networks
      • Mathematical Formula

In the end, classification means using a trained model to predict the class value of new data
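
As a rough sketch of the train-then-predict flow, the snippet below uses a 1-nearest-neighbor classifier. That technique is not one of the model types listed above; it is chosen here only because it fits in a few lines. The rows, labels, and function names are made up for illustration.

```python
import math

def train(training_data):
    """A 1-nearest-neighbor 'model' simply memorizes the labeled rows."""
    return list(training_data)

def predict(model, features):
    """Return the label of the stored row closest to the given features."""
    nearest = min(model, key=lambda row: math.dist(row[:-1], features))
    return nearest[-1]

# Each row is <Feature 1, Feature 2, Label>; the values are hypothetical.
training_data = [
    (1.0, 1.0, "A"),
    (1.5, 2.0, "A"),
    (8.0, 8.0, "B"),
    (9.0, 7.5, "B"),
]

model = train(training_data)
first = predict(model, (2.0, 1.5))   # closest stored row is labeled "A"
second = predict(model, (8.5, 8.0))  # closest stored row is labeled "B"
```

Note that the input to `predict` has the same features as a training row but no label; producing that label is the model's job.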

 

Example


An example of building and then using a model that classifies spam mail

Spam Classifier
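
One possible way to realize such a spam classifier is a small naive Bayes model with Laplace smoothing, sketched below. The post does not say which algorithm the figure uses, so this choice is an assumption, as are the four training mails and the function names.

```python
import math
from collections import Counter

def train_spam_filter(labeled_mails):
    """Count words per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    mail_counts = Counter()
    for text, label in labeled_mails:
        mail_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, mail_counts, vocabulary

def classify_mail(model, text):
    """Pick the class with the higher Laplace-smoothed log-probability."""
    word_counts, mail_counts, vocabulary = model
    total_mails = sum(mail_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the class prior, then add one term per word in the mail.
        score = math.log(mail_counts[label] / total_mails)
        for word in text.lower().split():
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled training mails ("ham" means not spam).
training_mails = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

model = train_spam_filter(training_mails)
result_a = classify_mail(model, "free money")      # words seen mostly in spam
result_b = classify_mail(model, "monday meeting")  # words seen mostly in ham
```

The sketch assumes both classes appear at least once in the training mails; otherwise the prior would be zero and its logarithm undefined.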


Supervised vs Unsupervised


There are broadly two ways to train a model

  • Supervised Learning (Classification)
    • The training data comes with class labels
    • When training the spam mail classifier above, each training example was labeled as spam or not spam
    • When new data arrives, its class is predicted based on the training data
  • Unsupervised Learning (Clustering)
    • The training data carries no class label information
    • The goal is not to predict a class but to group similar data points into clusters
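
To illustrate the unsupervised case, here is a minimal sketch of k-means on one-dimensional data. k-means is one common clustering method; the post itself does not name a specific algorithm. The values and the initial centers are assumptions. Notice that no labels appear anywhere in the input.

```python
def k_means_1d(values, centers, iterations=10):
    """Group unlabeled numbers around the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centers

# Hypothetical unlabeled measurements.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
clusters, centers = k_means_1d(values, centers=[0.0, 5.0])
```

The result is a grouping of similar values, not a prediction of a predefined class; what each group means is left for a person to interpret.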

 

Issues in Classification


Things to consider when doing classification

 

Data Preparation


The data has to be prepared first

  • Data Cleaning
    • Preprocessing the data to remove issues such as noise and errors
  • Relevance Analysis (Feature Selection)
    • Removing unneeded features so that training and classification become faster and more accurate
  • Data Transformation
    • Converting the data into a form suited to learning, for example by normalizing it
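
The three preparation steps might look like the following sketch. The concrete choices are assumptions made for illustration: missing values are represented as `None` and handled by dropping the row, the kept columns are chosen by hand, and normalization is min-max scaling to [0, 1]. Real pipelines often make different choices at each step.

```python
def clean(rows):
    """Data cleaning: drop rows that contain a missing (None) value."""
    return [row for row in rows if None not in row]

def select_features(rows, keep):
    """Relevance analysis: keep only the feature columns listed in `keep`."""
    return [tuple(row[i] for i in keep) for row in rows]

def min_max_normalize(rows):
    """Data transformation: rescale each column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical raw rows: (height_cm, weight_kg, irrelevant_id).
raw = [
    (170.0, 65.0, 7),
    (180.0, None, 3),
    (160.0, 55.0, 9),
    (190.0, 85.0, 1),
]

prepared = min_max_normalize(select_features(clean(raw), keep=[0, 1]))
```

After these steps every remaining feature lies on the same scale, so no single feature dominates merely because of its units.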

 

Evaluation Points


One model has to be chosen from many, so we need criteria for evaluating models

  • Accuracy
  • Speed
    • Training time
    • Classification time
  • Robustness
    • How well the model handles noise, errors, outliers, and the like mixed into the data
  • Scalability
    • Does it still work well as the amount of data grows?
  • Interpretability
    • Can we analyze why the model produced a given result?
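
Two of these criteria, accuracy and classification time, are easy to measure directly. The sketch below assumes a hypothetical threshold classifier and a tiny hand-made test set; the helper names are illustrative only.

```python
import time

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def threshold_classifier(xs):
    # Hypothetical stand-in model: label values of 5 or more as "pos".
    return ["pos" if x >= 5 else "neg" for x in xs]

# Hypothetical test set with known labels.
test_inputs = [1, 4, 6, 8, 5]
test_labels = ["neg", "neg", "pos", "neg", "pos"]

predictions, seconds = timed(threshold_classifier, test_inputs)
score = accuracy(predictions, test_labels)  # 4 of the 5 predictions match
```

Robustness, scalability, and interpretability are harder to reduce to a single number and usually call for experiments on noisy or larger data, or for inspecting the model itself.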

 
