๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
HYU/๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค

2. Frequent Patterns

by Jaeguk 2024. 4. 13.

Mining Frequent Patterns, Association and Correlatons

 

Frequent Pattern Mining


๋ฐ์ดํ„ฐ ์†์—์„œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ํŒจํ„ด์„ ๋ถ„์„ํ•˜๋Š” ๊ธฐ์ˆ 

Frequent Pattern?
: ๋ฐ์ดํ„ฐ์…‹ ๋‚ด์—์„œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ํŒจํ„ด

์˜ˆ๋ฅผ ๋“ค๋ฉด, ์ž์ฃผ ํ•จ๊ป˜ ๊ตฌ๋งค๋˜๋Š” ์ƒํ’ˆ๋“ค

Motivation?
๋ฐ์ดํ„ฐ ์†์— ๋‚ด์žฌ๋œ ํŒจํ„ด๋“ค ์ฐพ๊ธฐ ์œ„ํ•จ

  • ์–ด๋–ค ์ƒํ’ˆ๋“ค์ด ํ•จ๊ป˜ ๊ตฌ๋งค๊ฐ€ ๋˜๋Š”๊ฐ€? (์ด๊ฒŒ ์•ž์œผ๋กœ ์ฃผ๋กœ ๋‹ค๋ค„์งˆ ์˜ˆ์‹œ)
    • Beers and Diapers
    • ๊ธฐ์ €๊ท€์™€ ๋งฅ์ฃผ๋Š” ํ•จ๊ป˜ ๊ตฌ๋งค๊ฐ€ ๋˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค
    • ์ด ์ •๋ณด๋ฅผ ์•Œ๋ฉด ๊ธฐ์ €๊ท€์™€ ๋งฅ์ฃผ๋ฅผ ํ•จ๊ป˜ ๋น„์น˜ํ•˜๋ฉด ํŒ๋งค์œจ์ด ์˜ฌ๋ผ๊ฐˆ ๊ฒƒ
  • ํŠน์ • ์ƒํ’ˆ์„ ๊ตฌ๋งคํ•œ ๋‹ค์Œ ์ˆœ์ฐจ์ ์œผ๋กœ ์–ด๋–ค ๊ฒƒ์„ ๊ตฌ๋งคํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋Š”๊ฐ€?
    • ๋””์ง€ํ„ธ ์นด๋ฉ”๋ผ๋ฅผ ๊ตฌ๋งคํ•œ ํ›„์— ์–ผ๋งˆ์žˆ๋‹ค๊ฐ€ SD์นด๋“œ(๋ฉ”๋ชจ๋ฆฌ)๋ฅผ ๊ตฌ๋งคํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค
  • ์–ด๋–ค DNA๊ฐ€ ์‹ ์•ฝ์— ๋ฏผ๊ฐํ•œ ๋ฐ˜์‘์„ ๋ณด์ด๋Š”๊ฐ€?
    • ์‹ ์•ฝ์— ๋Œ€ํ•ด์„œ ํŠน์ • ์ฆ์ƒ์„ ๋ณด์ธ ์ง‘๋‹จ์—์„œ ๊ณตํ†ต์ ์œผ๋กœ ํŠน์ • DNA๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋ฉด, ํ•ด๋‹น DNA๊ฐ€ ์•ฝ์— ๋ฏผ๊ฐํ•œ ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค
  • ์•„๋ฌดํŠผ ์ด๋Ÿฐ ์ •๋ณด๋“ค์„ ์–ป๊ธฐ ์œ„ํ•ด์„œ Frequent Pattern Minining์„ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค

 

Basic Concepts


Frequent Pattern Mining์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๊ธฐ๋ณธ์ ์œผ๋กœ ์•Œ์•„์•ผ ํ•˜๋Š” ์šฉ์–ด๋“ค

  • Itemset X = {x1, x2, ..., Xk}
    • Frequent Pattern์€ ์ด๋Ÿฌํ•œ Itemset ์ค‘์—์„œ ์ •์˜๋œ๋‹ค
    • {A, B, C, D, E} ๋ผ๋Š” ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ์žˆ์„ ๋•Œ ์ด ๋ฐ์ดํ„ฐ๋“ค๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋Š” Itemset?
      • 2^5 - 1๊ฐœ์˜ Itemset์ด ๋งŒ๋“ค์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค
      • ๋นˆ ์ง‘ํ•ฉ({})์€ Itemset์œผ๋กœ ์น˜์ง€ ์•Š๋Š”๋‹ค
  • Association Rules (X->Y)
    • Association Rules๋Š” 2๊ฐœ์˜ Itemset X, Y๋กœ๋ถ€ํ„ฐ ์ •์˜๋œ๋‹ค
    • ์ด๋•Œ X U Y(Union)์ด Frequent Pattern ์ด์–ด์•ผ ํ•œ๋‹ค
    • Ex) {Beers, Diapers}๋ผ๋Š” Frequent Pattern์ด ์žˆ์„ ๋•Œ
      • Beers -> Diapers
      • Diapers -> Beers
      • 2๊ฐœ์˜ Association Rules๋ฅผ ์ƒ๊ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค
      • ์ด ์ค‘์— ์˜๋ฏธ์žˆ๋Š” Rule์„ ์ฐพ๋Š” ๊ฑด ๋’ค์—์„œ
  • Support and Confidence
    • Support: Transaction์ด Itemset X๋ฅผ ํฌํ•จํ•  ํ™•๋ฅ  ๋˜๋Š” ๋นˆ๋„
      • Minimum Support: Itemset X๊ฐ€ Frequent Pattern์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ์ž„๊ณ„์น˜(Threshold)
        • Itemset X์˜ Support๊ฐ€ Minimum Support๋ณด๋‹ค ํฌ๊ฑฐ๋‚˜ ๊ฐ™์œผ๋ฉด X๋Š” Frequent Pattern
    • Confidence: Itemset X๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๋Š” Transaction์ด Itemset Y๋„ ํฌํ•จํ•˜๊ณ  ์žˆ์„ ํ™•๋ฅ  (์กฐ๊ฑด๋ถ€ ํ™•๋ฅ )
      • Minimum Confidence: ์ด๊ฒƒ๋„ ์—ญ์‹œ ์ž„๊ณ„์น˜๋กœ ์‚ฌ์šฉ

 

Let min_sup = 50%, min_conf = 50%:

Q. Find all frequent patterns
A. {A:3, B:3, D:4, E:3, AD:3}

Q. Find all association rules
A. A -> D (sup: 60%, conf: 100%), D -> A (sup: 60%, conf: 75%)

A๋ฅผ ์‚ฐ ์‚ฌ๋žŒ์ด D๋ฅผ ์‚ด ํ™•๋ฅ ์€ 100%
D๋ฅผ ์‚ฐ ์‚ฌ๋žŒ์ด A๋ฅผ ์‚ด ํ™•๋ฅ ์€ 75%

 

Closed Patterns and Max-Patterns


Frequent Pattern์˜ ํ•˜์œ„ ๊ฐœ๋…์ธ Closed Pattern๊ณผ Max-Pattern์„ ์•Œ์•„๋ณด์ž

  • ์‚ฌ์‹ค Pattern์ด๋ผ๋Š” ๊ฒŒ ์—„์ฒญ๋‚˜๊ฒŒ ๋งŽ์ด ๋งŒ๋“ค์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค
  • {A1, A2, A3, ..., A100} ์ด๋Ÿฌํ•œ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ์žˆ์„ ๋•Œ, 2^100 - 1๊ฐœ์˜ ์ง‘ํ•ฉ์ด ๋งŒ๋“ค์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค
  • Subpattern์ด ๋„ˆ๋ฌด ๋งŽ์Œ,,,
  • ๊ทธ๋ž˜์„œ Subpattern๋“ค์„ ๋Œ€ํ‘œํ•˜๋Š” Closed Pattern๊ณผ Max Pattern์„ ๋งˆ์ด๋‹์„ ํ•œ๋‹ค
  • Closed Pattern?
    • Itemset X๊ฐ€ Frequent Pattern์ด๋ฉด์„œ, X์™€ ๊ฐ™์€ support๋ฅผ ๊ฐ–๋Š” X์˜ Superset์ด ์กด์žฌํ•˜์ง€ ์•Š์œผ๋ฉด, X๋Š” Closed ํ•˜๋‹ค๊ณ  ํ•œ๋‹ค
  • Max Pattern?
    • Itemset X๊ฐ€ Frequent Pattern์ด๋ฉด์„œ, X์˜ Superset์ธ ๊ทธ๋Ÿฌํ•œ Itemset Y๊ฐ€ ์กด์žฌํ•˜์ง€ ์•Š์œผ๋ฉด, X๋ฅผ Max Pattern์ด๋ผ๊ณ  ํ•œ๋‹ค
    • ์ฆ‰, Frequent Pattern ์ค‘์— ๋ณธ์ธ์˜ Superset์ด ์กด์žฌํ•˜์ง€ ์•Š์œผ๋ฉด ํ•ด๋‹น ํŒจํ„ด์„ Max Pattern์ด๋ผ๊ณ  ํ•œ๋‹ค
  • Closed Pattern์€ Frequent Pattern๋“ค์„ ์••์ถ•์‹œ์ผœ์ฃผ๋Š” ์—ญํ•  ์„ ํ•œ๋‹ค
    • ๋ถˆํ•„์š”ํ•œ Pattern, Rule๋“ค์„ ์ค„์—ฌ์ฃผ๋Š” ์—ญํ• ์„ ํ•œ๋‹ค

 

๋ง๋กœ๋Š” ๋ญ”๊ฐ€ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ค์šธ ์ˆ˜๋„ ์žˆ์œผ๋ฏ€๋กœ, ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด๋ณด์ž

  • Y,Z๋Š” Closed Pattern์ด์ง€๋งŒ, X๋Š” Y๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Closed Pattern์ด ๋  ์ˆ˜ ์—†๋‹ค
    • ์‚ฌ์‹ค Y๊ฐ€ ์žˆ๋Š” ์ด์ƒ, X๋Š” ํฌ๊ฒŒ ์˜๋ฏธ์—†๋Š” Pattern์ด๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— Closed Pattern์ด ์ค‘์š”ํ•œ ๊ฑฐ!
  • Z๋Š” Max Pattern์ด์ง€๋งŒ, Y๋Š” Z๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Max Pattern์€ ๋  ์ˆ˜ ์—†๋‹ค

 

Quiz

Q1. How many frequent patterns are there?
A1. 2^100 - 1๊ฐœ์˜ Frequent Pattern ์กด์žฌ
min_sup์ด 1์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋“  ์กฐํ•ฉ์˜ ํŒจํ„ด์ด Frequent Pattern์ด ๋œ๋‹ค

Q2. What is the set of closed patterns?
A2.
<a1, a2, ..., a100>: 1
<a1, a2, ..., a50>: 2

๋‘ ํŒจํ„ด์€ Frequency๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ๋‘˜๋‹ค Closed Pattern์ด ๋œ๋‹ค

Q3. What is the set of max patterns?
A3. <a1, a2, ..., a100>: 1
Frequent Pattern๋“ค ์ค‘์—์„œ ๊ฐ€์žฅ Superset์ด Max Pattern์ด ๋œ๋‹ค

 

Scalable Methods for Mining Frequent Patterns


Database์— 100๊ฐœ์˜ Item์ด ์žˆ๋‹ค๊ณ  ํ•˜๋ฉด, 2^100 - 1๊ฐœ์˜ ํŒจํ„ด์ด ์ƒ์„ฑ๋  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๊ฑด ๋„ˆ๋ฌด ๋งŽ๋‹ค

  • Downward closure propery of frequent patterns
    • Frequent Pattern์˜ Subset์€ ๋ฐ˜๋“œ์‹œ Frequentํ•˜๋‹ค
    • ์ด๊ฒƒ์„ ์ด์šฉํ•œ ๋ฐฉ๋ฒ•์ด Downward closure
    • Ex) {beer, diaper, nuts}๊ฐ€ Frequent Pattern์ด๋ฉด, {beer, diaper}์€ ๋‹น์—ฐํžˆ Frequent Pattern
      • ์ด๋Ÿฐ ๊ฒฝ์šฐ์— {beer, diaper}๋Š” ๊ตณ์ด ์—†์–ด๋„ ๋˜๋Š” Redundantํ•œ Pattern์ด๋‹ค
  • ์ด๊ฑธ ์ด์šฉํ•œ ๋งˆ์ด๋‹ ๋ฐฉ๋ฒ•๋“ค์—๋Š” ์–ด๋–ค Method ๋“ค์ด ์žˆ์„๊นŒ?
    • Apriori
    • FP-growth (Frequent Pattern Growth)
    • Charm (Vertical Data Format)
    • ์ด Method๋“ค์— ๋Œ€ํ•ด์„œ๋Š” ๋’ค์—์„œ ํ•˜๋‚˜์”ฉ ๋‹ค๋ฃฐ ๊ฒƒ
728x90

'HYU > ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

6. Miner Improvements  (0) 2024.04.13
5. FP-growth  (0) 2024.04.13
4. Improving Apriori  (0) 2024.04.13
3. Apriori  (0) 2024.04.13
1. Introduction  (0) 2024.04.13