๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
HYU/๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค

10. Overfitting

by Jaeguk 2024. 4. 14.

Training Data๋ฅผ ์‚ฌ์šฉํ•ด์„œ Decision Tree๋ฅผ Top-down ๋ฐฉ์‹์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ดค๋‹ค

ํ•™์Šต์„ ์–ธ์ œ ์ข…๋ฃŒํ•˜๋Š” ๊ฒƒ์ด ์ข‹์„๊นŒ?
๋…ธ๋“œ์— ์žˆ๋Š” ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋“ค์ด ๊ฐ™์€ Class๋ฅผ ๊ฐ€์งˆ ๋•Œ๊นŒ์ง€ ๋ถ„๋ฅ˜ํ•œ๋‹ค
๊ทธ๋Ÿฌ๋ฉด ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ 100% ๋ถ„๋ฅ˜ ์ •ํ™•๋„๋ฅผ ๋ณด์ด๋Š” Decision Tree๊ฐ€ ๋งŒ๋“ค์–ด์ง€๊ฒŒ ๋œ๋‹ค

 

  • ์œ„์ฒ˜๋Ÿผ ํ•™์Šต์ด ๋˜๋ฉด ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด 100% ์ •ํ™•ํžˆ ๋ถ„๋ฅ˜ํ•˜๋Š” Decision Tree๊ฐ€ ํ•™์Šต๋˜์—ˆ๋‹ค
  • ์ด๊ฒŒ ๊ณผ์—ฐ ์ข‹์€ ๊ฒƒ์ผ๊นŒ? ์•„๋‹ˆ๋‹ค. Overfitting์ด ๋œ ๊ฒƒ ์ด๋‹ค

 

Overfitting of Decision Tree Models


Decision Tree๋Š” Overfitting์ด ๋ฐœ์ƒํ•œ๋‹ค๋Š” ๋ฌธ์ œ ๊ฐ€ ์žˆ๋‹ค

  • Overfitting์ด ๋œ๋‹ค๋Š” ๊ฒƒ์€ ์™œ ๋ฌธ์ œ์ผ๊นŒ
    • ๋„ˆ๋ฌด ๋งŽ์€ Branch๊ฐ€ ์ƒ๊ธฐ๊ฒŒ ๋˜๊ณ , Noise๋‚˜ Outlier์™€ ๊ฐ™์€ ์ด์ƒํ•œ ๋ฐ์ดํ„ฐ๋“ค๋„ ๋ชจ๋‘ ๋ฐ˜์˜์ด ๋œ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค
    • ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์ •ํ™•ํžˆ ๋ถ„๋ฅ˜ํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ๋‹ค
  • ๊ทธ๋Ÿผ ์–ด๋–ป๊ฒŒ?
    • Overfitting์ด ๋˜๊ธฐ ์ „์— ์ค‘๋‹จ์„ ํ•ด์•ผ ํ•œ๋‹ค

 

Tree Pruning


๋‚˜๋ฌด ๊ฐ€์ง€์น˜๊ธฐ

  • Overfitting์ด ๋œ๋‹ค๋Š” ๊ฒƒ์€ Decistion Tree์— ๋„ˆ๋ฌด ๋งŽ์€ Branch๊ฐ€ ์ƒ๊ธด๋‹ค๊ณ  ํ–ˆ๋‹ค
  • ๊ทธ๋Ÿผ ๊ฐ€์ง€์น˜๊ธฐ ๋ฅผ ํ•ด์ฃผ๋ฉด Overfitting์„ ๋ง‰์•„์ค„ ๊ฒƒ์ด๋‹ค
  • Overfitting์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ธฐ๋ฒ•์—๋Š” 2๊ฐ€์ง€ ์ •๋„๊ฐ€ ์žˆ๋‹ค
    • Pre-pruning: Tree์˜ ์ƒ์„ฑ์„ ์ผ์ฐ ์ค‘๋‹จํ•˜๋Š” ๊ฒƒ ์ด๋‹ค
      • ํŠน์ • ์ž„๊ณ„์น˜ ์ด์ƒ์œผ๋กœ ๋ถ„๋ฅ˜๊ฐ€ ๋˜์—ˆ๋‹ค๋ฉด, ๋” ์ด์ƒ Recursiveํ•˜๊ฒŒ ๋“ค์–ด๊ฐ€์ง€ ์•Š๊ณ  Split์„ ์ค‘๋‹จํ•œ๋‹ค
      • ์ ์ ˆํ•œ ์ž„๊ณ„์น˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค
      • Minimum samples split, Maximum tree depth, Minimum gain,,,
      • ์ด ์ค‘์— ์–ด๋–ค ๊ฒƒ์„ Pruning์˜ ๊ธฐ์ค€์œผ๋กœ ์žก์•„์•ผ ํ• ์ง€ ๊ฒฐ์ •ํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต๋‹ค
    • Post-pruning: ์ผ๋‹จ Tree๋ฅผ ์ƒ์„ฑํ•˜๊ณ  Branch๋“ค์„ ์ œ๊ฑฐ ํ•˜๋Š” ๋ฐฉ๋ฒ•
      • Part of the held-out data is set aside as a Validation set to evaluate how well the Pruning works
      • Validation set์— ๋Œ€ํ•ด์„œ Pruning ํ›„์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹์œผ๋ฉด Pruning์„ ์ž˜ํ•œ ๊ฒƒ์ด๋‹ค
      • ์ด์ œ๋Š” Pruning์„ ํ•˜๋Š” ๊ฒƒ์ด ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์„ ์ €ํ•˜์‹œํ‚จ๋‹ค๊ณ  ํŒ๋‹จ๋  ๋•Œ๊นŒ์ง€ ๊ฐ€์ง€์น˜๊ธฐ๋ฅผ ํ•œ๋‹ค

 

Random Forest


Overfitting์„ ํ”ผํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ์†Œ๊ฐœ๋˜๊ณ  ์žˆ๋‹ค

  • Random Forest is an ensemble version of the Decision Tree
    • An ensemble builds several weak classifiers and combines their individual predictions, yielding a final prediction more reliable than that of any single classifier
  • Why a Forest?
    • Build many Decision Trees
    • Aggregate the Decisions made by the individual Decision Trees into the final result

 

  • Randomly draw data points from the Original Data, one at a time
  • Repeating this N times produces a Sample containing N data points
  • A sample generated this way is called a Bootstrap Sample

 

  • ์ด๋•Œ Sub Training Dataset(Bootstrap Sample)์€ Original Dataset๊ณผ ๊ฐ™์€ ํฌ๊ธฐ๋ฅผ ๊ฐ–๋„๋ก ์ƒ˜ํ”Œ๋ง ํ•ด์•ผํ•œ๋‹ค
    • ๋‹จ, ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒ˜ํ”Œ๋ง๋˜๋Š” ๊ฒƒ์„ ํ—ˆ์šฉ ํ•œ๋‹ค
  • ํ•™์Šต ๋ฐ์ดํ„ฐ์— Randomness๋ฅผ ๋”ํ•ด์ฃผ๋Š” ๋ฐฉ์‹

 

  • ์›๋ž˜๋Š” Decision Tree๋ฅผ ํ˜•์„ฑํ•  ๋•Œ, ์ „์ฒด Feature์— ๋Œ€ํ•ด์„œ ์ตœ์ ์˜ Feature๋ฅผ ์ฐพ์•„์„œ Split์„ ํ–ˆ๋‹ค
  • ๊ทธ๋Ÿฐ๋ฐ Tree๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ณผ์ •์—๋„ Randomness๋ฅผ ๋”ํ•ด์ฃผ์–ด์„œ, Overfitting์„ ๋ฐฉ์ง€ํ•œ๋‹ค
    • ๊ฐ Recursion๋งˆ๋‹ค ํ›„๋ณด๊ตฐ Feature๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋ฝ‘๋Š”๋‹ค
    • ๋žœ๋คํ•˜๊ฒŒ ๋ฝ‘ํžŒ ํ›„๋ณด๊ตฐ ์ค‘์—์„œ ์ตœ์ ์˜ Feature๋ฅผ ์ฐพ๊ณ , ํ•ด๋‹น Feature๋กœ Split์„ ํ•œ๋‹ค

 

Aggregation


๊ฐ๊ฐ์˜ ํŠธ๋ฆฌ๊ฐ€ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒฐ์ •์„ ์–ด๋–ป๊ฒŒ ์ข…ํ•ฉํ•ด์„œ ์ตœ์ข…์ ์œผ๋กœ ํ•˜๋‚˜์˜ ๊ฒฐ์ •์„ ๋‚ด๋ฆด ๊ฒƒ์ธ๊ฐ€

 

Majority Voting

  • ๊ฐ๊ฐ์˜ ํŠธ๋ฆฌ๊ฐ€ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๊ณ , ๋” ๋งŽ์ด vote๋œ ์ชฝ์œผ๋กœ ๊ฒฐ์ • (๋‹ค์ˆ˜๊ฒฐ)
  • ๊ฐ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์€ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ , ์˜ค์ง ๊ฒฐ๊ณผ๋งŒ ๋ณด๊ณ  ํŒ๋‹จ์„ ํ•œ๋‹ค
    • ๋ชจ๋“  ํ‘œ๊ฐ€ ๋™์ผํ•œ ํž˜์„ ๊ฐ€์ง„๋‹ค

 

Weighted Voting

  • ๊ฐ๊ฐ์˜ ํŠธ๋ฆฌ๊ฐ€ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒฐ๊ณผ์—, ๊ฐ ํŠธ๋ฆฌ์˜ Training Accuracy๋ฅผ ํ•จ๊ป˜ ๊ณ ๋ คํ•ด์„œ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๋‚ธ๋‹ค
  • ๊ฐ€์ค‘์น˜๋ฅผ ์„ค์ •ํ•˜์—ฌ ์ตœ์ข… Label์„ ๊ฒฐ์ •
    • Training Accuracy๊ฐ€ ๋†’์€ ๋ชจ๋ธ์˜ ํ‘œ๊ฐ€ ๋” ํฐ ํž˜์„ ๊ฐ€์ง„๋‹ค
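A minimal sketch of weighted voting, using each tree's training accuracy as its weight (the numbers are illustrative); note that it can overturn a plain majority:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Add each tree's weight to the bucket of the label it predicted,
    then return the label with the largest total."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Two strong trees outvote three weaker ones: "A" gets 1.85, "B" gets 1.65.
print(weighted_vote(["A", "B", "A", "B", "B"], [0.95, 0.60, 0.90, 0.55, 0.50]))
```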

 

์ •๋ฆฌ


  • Decision Tree Model
    • A greedy method that Recursively selects the optimal Feature to split on at each step
    • Criteria for choosing the optimal Feature include Information Gain, Gain Ratio, Entropy, and the Gini Index
  • Methods for preventing Overfitting
    • Tree Pruning
    • Random Forest
  • Random Forest
    • Ensembles roughly 500 to 10,000 Trees to decide the label
    • Each Tree is trained on its own Bootstrapped sample
    • Each Tree is built using randomly selected Features
    • Adding randomness to both the Sampling and the Tree-building process prevents Overfitting (a combined sketch follows)

 
