Data Mining Tutorial



An introductory document on data mining

Data Mining Tutorial
D A Dickey
April 2012

Data Mining - What is it?
• Large datasets
• Fast methods
• Not significance testing
• Topics
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

Trees
• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile fatalities

Recursive Splitting
[Figure: plane split recursively on X1 = Debt To Income Ratio and X2 = Age; regions labeled Pr{default} = 0.007, 0.012, 0.006, 0.0001, 0.003; points marked No default / Default]

Some Actual Data
• Framingham Heart Study
• First Stage Coronary Heart Disease
  – P{CHD} = Function of:
    • Age - no drug yet!
    • Cholesterol
    • Systolic BP

Example of a “tree”
[Figure: tree diagram. All 1615 patients; Split # 1: Age; Systolic BP; “terminal node”]

How to make splits?
• Which variable to use?
• Where to split?
  – Cholesterol > ____
  – Systolic BP > ____
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has problems, nobody with BP<x ...

... ChiSq ...

...=>D     2/5   2/3
C=>A       2/5   2/4
B&C=>D     1/5   1/3

ABC   ACD   BCD   ADE   BCE

Don’t be Fooled!
• Lift = Confidence / Expected Confidence if Independent

                  Checking
Saving      No (1500)   Yes (8500)   (10000)
No              500        3500        4000
Yes            1000        5000        6000

SVG=>CHKG
Expect 8500/10000 = 85% if independent
Observed Confidence is 5000/6000 = 83%
Lift = 83/85 < 1
Savings account holders actually LESS likely than others to have checking account !!!

Summary
• Data mining – a set of fast stat methods for large data sets
• Some new ideas, many old or extensions of old
• Some methods:
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

D.A.D ...
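The association-rule numbers on the slides can be checked with a few lines of Python. This is a minimal sketch, not the author's code. It assumes that the five item sets ABC, ACD, BCD, ADE, BCE are the transactions, that the two fractions listed beside each rule are its support and confidence (in the usual sense), and that lift is confidence divided by the confidence expected under independence, as the "Don’t be Fooled!" slide states. The function names `support`, `confidence`, and `lift_from_counts` are illustrative.

```python
from fractions import Fraction

# Assumed transactions: the five item sets listed on the association slide.
transactions = [set("ABC"), set("ACD"), set("BCD"), set("ADE"), set("BCE")]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """Among transactions containing `lhs`, the fraction also containing `rhs`."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


def lift_from_counts(n_both, n_lhs, n_rhs, n_total):
    """Lift = observed confidence / confidence expected if independent."""
    observed = Fraction(n_both, n_lhs)   # e.g. 5000/6000 for SVG=>CHKG
    expected = Fraction(n_rhs, n_total)  # e.g. 8500/10000
    return observed / expected


# Rules whose left-hand side survives in the slide text.
rule_c_a = (support("CA", transactions), confidence("C", "A", transactions))
rule_bc_d = (support("BCD", transactions), confidence("BC", "D", transactions))

# Checking/savings table: 6000 savers, 5000 of them with checking,
# 8500 checking holders among 10000 customers.
lift_svg_chkg = lift_from_counts(5000, 6000, 8500, 10000)

print("C=>A   support, confidence:", rule_c_a)
print("B&C=>D support, confidence:", rule_bc_d)
print("SVG=>CHKG lift:", float(lift_svg_chkg))
```

`Fraction` keeps the arithmetic exact, so 2/4 is reported in lowest terms as 1/2. The lift comes out below 1, which matches the slide's point that a rule with 83% confidence can still describe a negative association.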
... small dataset, need all observations to estimate parameters of interest
• Data mining – loads of data, can afford “holdout sample”
• Variation: n-fold cross validation
  – Randomly divide data into...

Multiple testing
• 50 different BPs in data, m=49 ways to split
• Multiply p-value by 49
• Bonferroni – original idea
• Kass – apply to data mining (trees)
• Stop splitting if minimum p-value...
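The multiple-testing adjustment above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tutorial's own code: the function name `bonferroni_adjust` is hypothetical, the raw p-value of 0.001 is an invented example, and capping the result at 1 is a common convention that the slide does not mention. The stopping threshold is cut off in the slide text, so none is assumed here.

```python
def bonferroni_adjust(min_p_value: float, n_distinct_values: int) -> float:
    """Bonferroni-adjust the smallest p-value found over all candidate splits.

    With n distinct values of a variable there are m = n - 1 places to split,
    so the best (minimum) p-value is multiplied by m.
    """
    if n_distinct_values < 2:
        raise ValueError("need at least two distinct values to split")
    m = n_distinct_values - 1
    # Capping at 1.0 is an added convention, since a p-value cannot exceed 1.
    return min(1.0, m * min_p_value)


# Slide example: 50 different BPs give m = 49 candidate splits.
raw_p = 0.001  # hypothetical minimum raw p-value, for illustration only
adjusted = bonferroni_adjust(raw_p, 50)
print(f"m = 49, raw p = {raw_p}, adjusted p = {adjusted:.3f}")
```

A raw p-value that looks very strong on its own becomes much less impressive once it is treated as the best of 49 tries, which is why the adjusted value is the one to compare against whatever stopping threshold is in use.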

Posted: 04/03/2013, 14:32
