1. Introduction

by Jaeguk 2024. 4. 13.

What is Data Mining?


What exactly is data mining?

  • The process of automatically extracting interesting and important information from large volumes of data
    • Which information counts as interesting and important?
    • Information that is non-trivial, implicit, previously unknown, and potentially useful
  • We live in an era of massive data, and because data keeps accumulating, we need to find what is meaningful within it

 

Knowledge Discovery Process


The process of discovering meaningful information in large volumes of data

  • Data Cleaning
    • Removes noise, errors, and similar problems mixed into the data
  • Data Warehouse
    • A repository that stores large volumes of data
  • Task-relevant Data
    • Only the data relevant to the current task is extracted from the warehouse
  • Data Mining
    • Extracts meaningful information from the data
  • Pattern Evaluation
    • Analyzes and evaluates the results of data mining
    • Results judged meaningful here become Knowledge and are integrated into the Data Warehouse
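
The stages above can be sketched as a chain of small functions. This is only a rough illustration: the record layout, the region filter, the "count items" mining step, and the `min_count` threshold are all assumptions made up for the example, not part of the lecture material.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical raw records; one of them is noisy (missing item).
RAW_RECORDS: List[Record] = [
    {"item": "Diaper", "region": "Seoul"},
    {"item": "Beer", "region": "Seoul"},
    {"item": "Diaper", "region": "Seoul"},
    {"item": None, "region": "Seoul"},
    {"item": "Beer", "region": "Busan"},
    {"item": "Milk", "region": "Seoul"},
]

def clean(records: List[Record]) -> List[Record]:
    """Data Cleaning: drop records with a missing item or region."""
    return [r for r in records if r.get("item") and r.get("region")]

def select_relevant(warehouse: List[Record], region: str) -> List[Record]:
    """Task-relevant Data: keep only records for the region under study."""
    return [r for r in warehouse if r["region"] == region]

def mine(records: List[Record]) -> Counter:
    """Data Mining: here, simply count how often each item occurs."""
    return Counter(r["item"] for r in records)

def evaluate(counts: Counter, min_count: int) -> Dict[str, int]:
    """Pattern Evaluation: keep only items seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

warehouse = clean(RAW_RECORDS)  # cleaned data goes into the warehouse
knowledge = evaluate(mine(select_relevant(warehouse, "Seoul")), min_count=2)
print(knowledge)
```

Each function stands in for one stage, so a real project would swap in a much richer step (especially for `mine`) without changing the overall flow.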

 

Data Mining: Confluence of Multiple Disciplines


Data mining draws on a wide range of techniques

  • It is related to many different fields, such as database systems, statistics, and machine learning

 

Functionalities for Data Mining


The functionalities of data mining

  • These can also be thought of as the goals of data mining

 

Frequent Patterns, Association Rules


Identify Frequent Patterns, then extract Association Rules from them

  • Ex) Diaper -> Beer
    • { Diaper, Beer } tends to appear together frequently in the data
    • Among those transactions, people who buy Diaper also tend to buy Beer
    • The reason is that dads who go out to buy diapers often pick up beer as well
    • The reverse case, going out for beer and ending up buying diapers (Beer -> Diaper), is probably relatively rare
    • That is why finding the Rule within a Pattern also matters
  • What is this information good for?
    • Knowing this, a store that places diapers and beer together can expect sales to increase
    • So it is meaningful information!
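
To make the difference between Diaper -> Beer and Beer -> Diaper concrete, here is a minimal sketch using the two standard measures for association rules, support and confidence. Those measure names are not introduced in this post, and the transactions below are invented purely for illustration.

```python
from typing import FrozenSet, List

# Hypothetical toy transactions, made up for this example.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Diaper", "Beer", "Milk"}),
    frozenset({"Diaper", "Beer"}),
    frozenset({"Diaper", "Beer", "Bread"}),
    frozenset({"Beer", "Chips"}),
    frozenset({"Beer"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Milk", "Bread"}),
    frozenset({"Diaper", "Milk"}),
]

def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

diaper = frozenset({"Diaper"})
beer = frozenset({"Beer"})

print(support(diaper | beer, TRANSACTIONS))     # how frequent the pattern is
print(confidence(diaper, beer, TRANSACTIONS))   # Diaper -> Beer
print(confidence(beer, diaper, TRANSACTIONS))   # Beer -> Diaper
```

In this toy data the pattern { Diaper, Beer } has one support value, but the two rules built from it have different confidence: Diaper -> Beer comes out higher than Beer -> Diaper, which matches the intuition in the bullets above.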

 

Classification and Regression


Classifying data into classes, or performing Regression

  • Classification: when new data arrives, look at it and assign it to a class
  • Regression: when new data arrives, look at it and predict a specific value
  • Both are usually done with Machine Learning
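
A small sketch of the difference between the two tasks. The choice of a nearest-neighbor rule for classification and a least-squares line for regression is an assumption made for illustration (the post does not name specific methods), and the data is hypothetical.

```python
from typing import List, Tuple

def nearest_neighbor_class(train: List[Tuple[float, str]], x: float) -> str:
    """Classification: return the label of the training point closest to x."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labeled data: body temperature -> class label.
labeled = [(36.5, "normal"), (36.8, "normal"), (38.2, "fever"), (39.0, "fever")]
print(nearest_neighbor_class(labeled, 38.5))  # output is a class

# Hypothetical numeric data: hours studied -> exam score.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [52.0, 54.0, 56.0, 58.0])
print(slope * 5.0 + intercept)  # output is a number
```

The point is the type of the answer: the classifier returns one of a fixed set of labels, while the regressor returns a continuous value.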

 

Cluster Analysis


Analyzing clusters

  • Unlike Classification, the goal is not to assign data to predefined classes; the interest is in the data itself
  • Similar data points are grouped together
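
As a rough illustration of grouping similar data without labels, here is a tiny one-dimensional k-means sketch. k-means is just one possible clustering method and is not named in this post; the values and the starting centers are assumptions for the example.

```python
from typing import List

def kmeans_1d(values: List[float], centers: List[float], iterations: int = 10) -> List[int]:
    """Return, for each value, the index of the cluster it ends up in."""
    centers = list(centers)  # copy so the caller's list is not modified
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment

# Hypothetical monthly spending of six customers; no labels are given.
spending = [10.0, 12.0, 11.0, 95.0, 102.0, 98.0]
groups = kmeans_1d(spending, centers=[min(spending), max(spending)])
print(groups)
```

No class labels are supplied anywhere; the two groups emerge only from how close the values are to each other.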

 

Outlier Analysis


Find Outliers, data points that lie far from the rest of the data

  • Such Outliers may be Noise or Errors, so finding them is worthwhile
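
One simple way to flag such points is a z-score test: measure how many standard deviations each value lies from the mean. This is a minimal sketch; the z-score rule and the threshold of 2.0 are assumptions chosen for the example, not something this post prescribes.

```python
from statistics import mean, pstdev
from typing import List

def zscore_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [50, 52, 49, 51, 48, 50, 53, 47, 50, 120]
print(zscore_outliers(readings))
```

Whether a flagged value is really noise or an error still needs a human or a follow-up check; the code only points at candidates.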

 

Trend and Evolution Analysis


Identify trends and changes from the data

  • Sequential Pattern Mining
    • Analyzes patterns that occur in sequence
    • A Frequent Pattern is about items occurring together, whereas a Sequential Pattern is about items occurring one after another
    • Ex) Digital Camera -> Large SD Memory
    • People who buy a digital camera tend to buy an SD card some time later
    • They do not buy one along with the camera, but after using it for a while they run out of storage and end up buying one (sequential)
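
The ordering aspect can be sketched by counting how many customers bought one item and then, later, another. The customer IDs and purchase histories below are hypothetical, and `sequence_support` is an illustrative helper name, not a function from any library.

```python
from typing import Dict, List

def sequence_support(sequences: Dict[str, List[str]], first: str, later: str) -> float:
    """Fraction of customers who bought first and, at some later point, later."""
    hits = 0
    for purchases in sequences.values():
        if first in purchases:
            start = purchases.index(first)
            # Only purchases made after the first item count.
            if later in purchases[start + 1:]:
                hits += 1
    return hits / len(sequences)

# Hypothetical purchase histories, each listed in time order.
HISTORIES: Dict[str, List[str]] = {
    "c1": ["Digital Camera", "Tripod", "Large SD Memory"],
    "c2": ["Digital Camera", "Large SD Memory"],
    "c3": ["Large SD Memory", "Digital Camera"],
    "c4": ["Laptop", "Mouse"],
}

print(sequence_support(HISTORIES, "Digital Camera", "Large SD Memory"))
print(sequence_support(HISTORIES, "Large SD Memory", "Digital Camera"))
```

Customer c3 bought both items, but in the opposite order, so it counts only toward the second rule. A plain Frequent Pattern count would treat c1, c2, and c3 the same.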