Introductory material on data mining
Data Mining Tutorial
D. A. Dickey
April 2012

Data Mining - What is it?
• Large datasets
• Fast methods
• Not significance testing
• Topics
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

Trees
• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile fatalities

Recursive Splitting
[Figure: the plane of X1 = Debt To Income Ratio and X2 = Age, split recursively into regions labeled Pr{default} = 0.007, 0.012, 0.006, 0.0001 and 0.003; points marked “Default” and “No default”]

Some Actual Data
• Framingham Heart Study
• First Stage Coronary Heart Disease
  – P{CHD} = Function of:
    • Age - no drug yet!
    • Cholesterol
    • Systolic BP
[Diagram label: Import]

Example of a “tree”
[Figure: tree starting from all 1615 patients; split #1 on Age, then a split on Systolic BP; one leaf marked “terminal node”]

How to make splits?
• Which variable to use?
• Where to split?
  – Cholesterol > _
  – Systolic BP > _
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has problems, nobody with BP...

...ChiSq ...
...D      2/5   2/3
C=>A      2/5   2/4
B&C=>D    1/5   1/3
ABC   ACD   BCD   ADE   BCE

Don’t be Fooled!
• Lift = Confidence / Expected Confidence if Independent

                 Checking
Saving       No (1500)   Yes (8500)   (10000)
No              500         3500        4000
Yes            1000         5000        6000

SVG=>CHKG
Expect 8500/10000 = 85% if independent
Observed Confidence is 5000/6000 = 83%
Lift = 83/85 < 1
Savings account holders actually LESS likely than others to have checking account !!!

Summary
• Data mining – a set of fast stat methods for large data sets
• Some new ideas, many old or extensions of old
• Some methods:
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.
D.A.D

... small dataset, need all observations to estimate parameters of interest
• Data mining – loads of data, can afford “holdout sample”
• Variation: n-fold cross validation
  – Randomly divide data into...
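The lift arithmetic from the checking/savings example above can be sketched in a few lines of Python. This is only an illustration: the function name `lift` and its parameter names are hypothetical, and the counts are the ones given in the table (6000 savers, 5000 of them with checking, 8500 checking holders among 10000 customers).

```python
def lift(n_both, n_antecedent, n_consequent, n_total):
    """Lift of the rule antecedent => consequent, computed from raw counts.

    Names are illustrative only; they do not come from the slides.
    """
    # Observed confidence: share of antecedent cases that also have the consequent.
    confidence = n_both / n_antecedent
    # Expected confidence if the two items were independent.
    expected = n_consequent / n_total
    return confidence / expected


# Rule SVG => CHKG, using the counts from the table.
value = lift(n_both=5000, n_antecedent=6000, n_consequent=8500, n_total=10000)
print(round(value, 3))  # 0.98, i.e. below 1
```

A lift below 1 means the rule's confidence is lower than what independence alone would give, which is the point of the slide: 83% looks high but is slightly worse than the 85% baseline.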
Multiple testing
• 50 different BPs in data, m=49 ways to split
• Multiply p-value by 49
• Bonferroni – original idea
• Kass – apply to data mining (trees)
• Stop splitting if minimum p-value...
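The Bonferroni adjustment described under "Multiple testing" can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bonferroni_adjust` and the raw p-value `raw_p` are hypothetical, and capping the result at 1 is a common convention rather than something the slides state. The stopping cutoff is truncated in the source, so it is not implemented here.

```python
def bonferroni_adjust(p_value, n_splits):
    """Multiply a raw p-value by the number of candidate splits.

    Capped at 1.0 so the result remains a valid probability (assumed convention).
    """
    return min(1.0, p_value * n_splits)


# 50 distinct BP values give m = 49 possible split points.
n_distinct_values = 50
m = n_distinct_values - 1

raw_p = 0.0004  # hypothetical p-value of the best split, for illustration only
adjusted = bonferroni_adjust(raw_p, m)
print(m, adjusted)
```

The adjustment penalizes variables that offer many candidate split points, since the best of 49 tests will look significant more often by chance than a single test would.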