Statistical Machine Learning for High Dimensional Data Lecture 3

Lecture 3: Nonparametric Methods
Statistical models with weak assumptions

Topics
• Nonparametric regression
• Sparse additive models
• Constrained rank additive models
• Nonparametric graphical models

Nonparametric Regression
Given (X_1, Y_1), ..., (X_n, Y_n), predict Y from X. Assume only that
  Y_i = m(X_i) + \epsilon_i,
where m(x) is a smooth function of x.

The most popular methods are kernel methods. However, there are two types of kernels:
• Smoothing kernels, which involve local averaging
• Mercer kernels, which involve regularization

Smoothing Kernels
• Smoothing kernel estimator:
  \hat{m}_h(x) = \frac{\sum_{i=1}^n Y_i K_h(X_i, x)}{\sum_{i=1}^n K_h(X_i, x)},
  where K_h(x, z) is a kernel such as
  K_h(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2h^2} \right)
  and h > 0 is called the bandwidth.
• \hat{m}_h(x) is just a local average of the Y_i's near x.
• The bandwidth h controls the bias-variance tradeoff: small h gives large variance, while large h gives large bias.

[Figure: scatterplot of the data, Y_i versus X_i.]
[Figure: the same data with the regression function m(x) overlaid; m(x) is a local average.]
[Figure: fitted curves with a very small, small, medium, and large bandwidth, showing the effect of h.]

Smoothing Kernels: Risk
  Risk = E(Y - \hat{m}_h(X))^2 = bias^2 + variance + \sigma^2,
with
  bias^2 \approx h^4, \qquad variance \approx \frac{1}{n h^p},
where p is the dimension of X and \sigma^2 = E(Y - m(X))^2 is the unavoidable prediction error.
• Small h: low bias, high variance (undersmoothing)
• Large h: high bias, low variance (oversmoothing)

[Figure: risk, squared bias, and variance plotted against the bandwidth h, with the optimal h at the minimum of the risk curve.]

[Intermediate slides are omitted from this preview.]

Graph-Valued Regression
• Multivariate regression (supervised): \mu(x) = E(Y \mid x), with Y \in \mathbb{R}^p and x \in \mathbb{R}^q.
• Graphical model (unsupervised): Graph(Y) = (V, E), where (j, k) \in E if and only if Y_j and Y_k are conditionally dependent given the remaining variables.
• Combining the two gives graph-valued regression: Graph(Y \mid x).
Examples:
• Gene associations from phenotype (or vice versa)
• Voting patterns from covariates on bills
• Stock interactions given market conditions, news items

Method I: Parametric
• Assume that Z = (X, Y) is jointly multivariate Gaussian with
  \Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{pmatrix}.
• Estimate \Sigma_X, \Sigma_Y, and \Sigma_{XY}.
• Get \Omega_X by the glasso.
• \Sigma_{Y|X} = \Sigma_Y - \Sigma_{YX} \Omega_X \Sigma_{XY}.
• But the estimated graph does not vary with different values of X.

Method II: Kernel Smoothing
• Assume Y \mid X = x \sim N(\mu(x), \Sigma(x)), with
  \mu(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)},
  \qquad
  \Sigma(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) (y_i - \mu(x))(y_i - \mu(x))^T}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)}.
• Apply the glasso to \Sigma(x).
• Easy to do, but recovering a partition X_1, ..., X_k requires difficult post-processing.
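The kernel smoothing step is easy to sketch in code. Below is a minimal numpy illustration (not from the slides) of the estimates \mu(x) and \Sigma(x) at a single query point, using a Gaussian kernel; the weighted mean is just the smoothing-kernel (local average) estimator from the regression slides applied to each coordinate of Y. The function name, signature, and the glasso call in the closing comment are assumptions made for illustration.

```python
import numpy as np

def kernel_smoothed_gaussian(X, Y, x0, h):
    """Sketch of Method II at a single query point x0.

    X : (n, q) covariates, Y : (n, p) responses, x0 : (q,) query point,
    h : bandwidth.  Names and signature are illustrative, not from the slides.
    """
    # Gaussian smoothing-kernel weights K((x0 - x_i) / h), normalized to sum to 1
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))
    w /= w.sum()

    # Local average mu(x0): the smoothing-kernel estimator applied to each Y coordinate
    mu = w @ Y

    # Kernel-weighted covariance Sigma(x0)
    R = Y - mu
    Sigma = (w[:, None] * R).T @ R
    return mu, Sigma

# The graph at x0 would then come from applying the glasso to Sigma, e.g.
#   from sklearn.covariance import graphical_lasso
#   _, Omega = graphical_lasso(Sigma, alpha=0.1)   # nonzero pattern of Omega gives the edges
```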
Method III: Partition Estimator
• Run CART, but use the Gaussian log-likelihood (on held-out data) to determine the splits.
• This yields a partition X_1, ..., X_k (and a corresponding tree).
• Run the glasso within each partition element.

[Figure: simulated data. (a) the true partition of the covariate space; (b) the estimated tree of splits on X_1 and X_2; (c) held-out risk versus splitting sequence number.]

[Figure: climate data. Estimated graphs over variables such as CO2, CH4, CO, H2, UV, DIR, ETR, ETRN, GLO, WET, CLD, VAP, TMX, TMN, TMP, DTR, FRS, and PRE within different partition elements.]

Tradeoff
• Nonparanormal: unrestricted graphs, semiparametric.
• We'll now trade off structural flexibility for greater nonparametricity.

Forest Densities (Gupta, Lafferty, Liu, Wasserman, Xu, 2011)
A distribution is supported by a forest F with edge set E(F) if
  p(x) = \prod_{(i,j) \in E(F)} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k \in V} p(x_k).
• For known marginal densities p(x_i, x_j), the best tree is obtained by minimum weight spanning tree algorithms.
• In high dimensions, a spanning tree will overfit.
• We prune back to a forest.

Step 1: Constructing a Full Tree
• Compute kernel density estimates
  \hat{f}_{n_1}(x_i, x_j) = \frac{1}{n_1} \sum_{s \in D_1} \frac{1}{h_2^2} K\!\left( \frac{X_i^{(s)} - x_i}{h_2} \right) K\!\left( \frac{X_j^{(s)} - x_j}{h_2} \right).
• Estimate the mutual informations on an m-point grid per coordinate:
  \hat{I}_{n_1}(X_i, X_j) = \frac{1}{m^2} \sum_{k=1}^m \sum_{\ell=1}^m \hat{f}_{n_1}(x_{ki}, x_{\ell j}) \log \frac{\hat{f}_{n_1}(x_{ki}, x_{\ell j})}{\hat{f}_{n_1}(x_{ki})\, \hat{f}_{n_1}(x_{\ell j})}.
• Run Kruskal's algorithm (Chow-Liu) on these edge weights (a sketch follows below).

Step 2: Pruning the Tree
• Held-out risk:
  R_{n_2}(\hat{f}_F) = - \sum_{(i,j) \in E} \int \hat{f}_{n_2}(x_i, x_j) \log \frac{\hat{f}(x_i, x_j)}{\hat{f}(x_i)\, \hat{f}(x_j)} \, dx_i \, dx_j.
• The selected forest is given by
  \hat{k} = \arg\min_{k \in \{0, \ldots, p-1\}} R_{n_2}\big( \hat{f}_{\hat{T}^{(k)}_{n_1}} \big),
  where \hat{T}^{(k)}_{n_1} is the forest obtained after k steps of Kruskal's algorithm.
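As a rough illustration of Steps 1 and 2 (again, not taken from the slides), the sketch below estimates pairwise mutual informations with simple 2-D histograms standing in for the kernel density estimates, and runs Kruskal's algorithm with a small union-find, which is equivalent to a minimum weight spanning tree on the negated weights. Because edges are returned in the order they are added, the forest after k steps is just the first k edges, which is what the held-out pruning step selects over. All names and the binning choice are illustrative.

```python
import numpy as np

def pairwise_mutual_information(X, bins=10):
    """Plug-in estimates of I(X_i; X_j) from 2-D histograms.

    A histogram stand-in for the kernel density estimates on the slides.
    """
    n, p = X.shape
    I = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            pij, _, _ = np.histogram2d(X[:, i], X[:, j], bins=bins)
            pij /= n                              # joint cell probabilities
            pi = pij.sum(axis=1, keepdims=True)   # marginal probabilities
            pj = pij.sum(axis=0, keepdims=True)
            nz = pij > 0
            I[i, j] = I[j, i] = np.sum(pij[nz] * np.log(pij[nz] / (pi * pj)[nz]))
    return I

def kruskal_forest(I):
    """Kruskal's algorithm (Chow-Liu) on mutual-information edge weights.

    Returns edges in the order they are added; truncating to the first k
    edges gives the forest after k steps, as used in the pruning step.
    """
    p = I.shape[0]
    parent = list(range(p))

    def find(a):                                  # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted(((I[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)                  # largest mutual information first
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                              # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```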
S&P Data: Forest Graph—Oops!
[Figure: graph estimated on the S&P stock data.]

S&P Data: Forest Graph
[Figure: forest graph estimated on the S&P stock data.]

S&P Data: Forest vs Nonparanormal
[Figure: two panels comparing the forest graph with the nonparanormal graph: edges common to both, and edges where they differ.]
Summary
• Smoothing kernels, Mercer kernels
• Sparse additive models
• Constrained rank additive models
• Nonparametric graphical models: nonparanormal and forest-structured densities
• A little nonparametricity goes a long way

Summary
• Thresholded backfitting algorithms derived from subdifferential calculus
• RKHS formulations are problematic
• Theory for infinite dimensional optimizations is still incomplete
• Many extensions possible: nonparanormal component analysis, etc.
• Variations on additive models enjoy most of the good statistical and computational properties of sparse linear models, with relaxed assumptions
• We're building a toolbox for large scale, high dimensional nonparametric inference

[Trailing preview fragments, truncated: a scatterplot involving choline, phosphorus, and weight; "Statistical Scaling for Prediction: Let F be class of matrices of ..."; and a fragment noting that canonical correlation analysis (Hotelling, 1936) is a classical method for finding correlations between components of two random vectors X \in \mathbb{R}^p and Y \in \mathbb{R}^q, with sparse versions proposed for high dimensional data (Witten ...).]
