Deep learning theory

Deep Learning Theory
Yoshua Bengio
April 15, 2015, London & Paris ML Meetup

Breakthrough
• Deep Learning: machine learning algorithms based on learning multiple levels of representation / abstraction
• Amazing improvements in error rate in object recognition, object detection, speech recognition and, more recently, some in machine translation

Ongoing Progress: Natural Language Understanding
• Recurrent nets generating credible sentences, even better when conditioned on an input:
   • Machine translation
   • Image-to-text (Xu et al., to appear in ICML'2015)

Why is Deep Learning Working so Well?

Machine Learning, AI & No Free Lunch
• Three key ingredients for ML towards AI:
   1. Lots & lots of data
   2. Very flexible models
   3. Powerful priors that can defeat the curse of dimensionality

Ultimate Goals
• AI
   • Needs knowledge
   • Needs learning (involves priors + optimization/search)
   • Needs generalization (guessing where probability mass concentrates)
   • Needs ways to fight the curse of dimensionality (exponentially many configurations of the variables to consider)
   • Needs disentangling of the underlying explanatory factors (making sense of the data)

ML 101. What We Are Fighting Against: The Curse of Dimensionality
• To generalize locally, we need representative examples for all relevant variations!
• Classical solution: hope for a smooth enough target function, or make it smooth by handcrafting good features / kernels

Not Dimensionality so much as Number of Variations (Bengio, Delalleau & Le Roux 2007)
• Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line
• Theorem: for a Gaussian kernel machine to learn some maximally varying functions over d inputs requires O(2^d) examples

Putting Probability Mass where Structure is Plausible
• Empirical distribution: mass at training examples
• Smoothness: spread mass around
• Insufficient
• Guess some 'structure' and generalize accordingly

Bypassing the Curse of Dimensionality
• We need to build compositionality into our ML models
• Just as human languages exploit compositionality to give representations and meanings to complex ideas
• Exploiting compositionality gives an exponential gain in representational power
• Distributed representations / embeddings: feature learning
• Deep architecture: multiple levels of feature learning
• Prior: compositionality is useful to describe the world around us efficiently

Manifold Learning = Representation Learning
[Figure: data concentrated on a curved manifold, with tangent directions spanning the tangent plane at each point]

Non-Parametric Manifold Learning: hopeless without powerful enough priors
• Manifolds estimated from the neighborhood graph: node = example, arc = near neighbor
• AI-related data manifolds have too many twists and turns, and not enough examples to cover all the ups & downs & twists

Auto-Encoders Learn Salient Variations, like a non-linear PCA
• Minimizing reconstruction error forces the model to keep the variations along the manifold
• The regularizer wants to throw away all variations
• With both: keep ONLY the sensitivity to variations ON the manifold

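To make the last slide concrete, here is a minimal sketch of a contractive auto-encoder in PyTorch (the layer sizes, the sigmoid encoder and the random batch standing in for real data are illustration-only assumptions, not details from the talk). The reconstruction term keeps the directions of variation that are actually present in the data, i.e. along the manifold, while the penalty on the encoder Jacobian shrinks sensitivity to every other direction.

```python
import torch
import torch.nn as nn

class ContractiveAE(nn.Module):
    """Auto-encoder with a penalty on the encoder Jacobian (sketch, hypothetical sizes)."""
    def __init__(self, n_in=784, n_hid=128):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hid)
        self.dec = nn.Linear(n_hid, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))       # code: coordinates along the learned manifold
        x_hat = torch.sigmoid(self.dec(h))   # reconstruction back in input space
        return x_hat, h

def loss_fn(model, x, lam=1e-3):
    x_hat, h = model(x)
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()          # keeps on-manifold variations
    # Analytic Frobenius norm of the encoder Jacobian for a sigmoid layer:
    # dh_j/dx_i = h_j (1 - h_j) W_ji, so ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    w_sq = (model.enc.weight ** 2).sum(dim=1)             # one value per hidden unit
    contractive = ((h * (1 - h)) ** 2 * w_sq).sum(dim=1).mean()
    return recon + lam * contractive                      # regularizer throws away the rest

model = ContractiveAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # hypothetical batch of flattened images in [0, 1]
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model, x)
    loss.backward()
    opt.step()
```
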
Denoising Auto-Encoder
• Learns a vector field pointing towards higher-probability directions (Alain & Bengio 2013):
   reconstruction(x) − x ∝ σ² ∂ log p(x) / ∂x
• Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noise → 0]
• Prior: examples concentrate near a lower-dimensional "manifold"

Regularized Auto-Encoders Learn a Vector Field that Estimates a Gradient Field (Alain & Bengio, ICLR 2013)

Denoising Auto-Encoder Markov Chain
• Alternate corruption and denoising: X_t → (corrupt) → X~_t → (denoise) → X_{t+1} → ...
  (a minimal code sketch of this chain appears below, after the 'Extracting Structure' slide)

Denoising Auto-Encoders Learn a Markov Chain Transition Distribution (Bengio et al., NIPS 2013)

Generative Stochastic Networks (GSN) (Bengio et al., ICML 2014; Alain et al., arXiv 2015)
• Recurrent parametrized stochastic computational graph that defines a transition operator for a Markov chain whose asymptotic distribution is implicitly estimated by the model
• Noise injected in input and hidden layers
• Trained to maximize the reconstruction probability of the example at each step
• Example structure inspired from the DBM Gibbs chain
[Figure: GSN chain unrolled over a few steps, alternating the weights W1, W2, W3 and their transposes between layers h1-h3, with noise injected and a reconstruction target at each sampled x_t]

Space-Filling in Representation-Space
• Deeper representations → abstractions → disentangling
• Manifolds are expanded and flattened
[Figure: the 9's manifold and the 3's manifold in pixel space (X-space) vs in representation space (H-space); linear interpolation between digits looks best at layer 2, worse at layer 1, and worst in pixel space]

Extracting Structure by Gradual Disentangling and Manifold Unfolding (Bengio 2014, arXiv 1407.7906)
• Each level transforms the data into a representation in which it is easier to model, unfolding it more, contracting the noise dimensions and mapping the signal dimensions to a factorized (uniform-like) distribution
• Objective: min KL(Q(x, h) || P(x, h)) for each intermediate level h
[Figure: a stack of encoders f1, ..., fL defining Q(h1|x), Q(h2|h1), ..., Q(hL) and decoders g1, ..., gL defining P(x|h1), P(h1|h2), ..., P(hL), separating signal from noise at the top level]

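The corrupt/denoise chain from the 'Denoising Auto-Encoder Markov Chain' slides can be written in a few lines. This is a hedged sketch rather than the exact model of Bengio et al. (2013): it assumes Gaussian corruption and a squared-error loss, and for simplicity it uses the deterministic reconstruction as the next chain state, whereas the theory samples X_{t+1} from the learned conditional P(X | X~_t).

```python
import torch
import torch.nn as nn

sigma = 0.3                                   # assumed corruption noise level

denoiser = nn.Sequential(                     # r(x~): pushes a corrupted point back toward the manifold
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def train_step(x):
    """One DAE update: corrupt the input, reconstruct the *clean* input."""
    x_tilde = x + sigma * torch.randn_like(x)      # corrupt
    loss = ((denoiser(x_tilde) - x) ** 2).mean()   # denoise toward the original
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample_chain(x0, steps=50):
    """Markov chain X_t -> corrupt -> X~_t -> denoise -> X_{t+1} (simplified: deterministic denoise)."""
    x = x0
    for _ in range(steps):
        x = denoiser(x + sigma * torch.randn_like(x))
    return x

# As noise -> 0, denoiser(x) - x estimates sigma^2 * d log p(x) / dx (Alain & Bengio 2013):
# a vector field pointing toward higher-probability regions.
```
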
DRAW: the latest variant of the Variational Auto-Encoder
(Gregor et al., Google DeepMind, "DRAW: A Recurrent Neural Network For Image Generation", arXiv 1502.04623, 2015)
• Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer, and use an attention mechanism to choose where to do so
• Unlike attention models trained with reinforcement learning, DRAW's selective read and write operations are fully differentiable, so the whole system is trained with standard backpropagation (in this respect they resemble the Neural Turing Machine of Graves et al., 2014)
• (a minimal sketch of the plain variational auto-encoder that DRAW extends appears at the end of this document)
[Figure: a conventional variational auto-encoder, with feed-forward encoder Q(z|x) and decoder P(x|z), next to the DRAW architecture, where a recurrent encoder and decoder read from the input and write to a canvas over T time steps; during generation a sample z is drawn from the prior P(z) and decoded through P(x|z)]
[Figure: a trained DRAW network generating MNIST digits; each row shows successive stages in the generation of a single digit, the lines appearing to be "drawn" by the network, with a red rectangle delimiting the area attended to at each step]

Samples of SVHN Images: the DRAW drawing process
Model settings reported in the DRAW paper:
   Task                             #glimpses   LSTM #h   #z
   100 x 100 MNIST Classification                 256
   MNIST Model                          64        256     100
   SVHN Model                           32        800     100
   CIFAR Model                          64        400     200

DRAW Samples of SVHN Images: generated samples vs training nearest neighbor
• Nearest training example shown for the last column of samples
[Figure: generated SVHN images; the rightmost column shows the nearest training example, checking that the samples are not copies of the training data]

Conclusions
• Distributed representations: a prior that can buy an exponential gain in generalization
• Deep composition of non-linearities: a prior that can buy an exponential gain in generalization
• Both yield non-local generalization
• Strong evidence that local minima are not an issue; saddle points are
• Auto-encoders capture the data-generating distribution:
   • Gradient of the energy
   • Markov chain generating an estimator of the data-generating distribution
   • Can be generalized to deep generative models

MILA: Montreal Institute for Learning Algorithms

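As a closing illustration, here is a minimal sketch of the plain variational auto-encoder that DRAW builds on (the layer sizes, the Gaussian latent and the Bernoulli-style decoder are assumptions for the example, not details from the talk). The training loss is a reconstruction term plus KL(Q(z|x) || P(z)), the same KL(Q||P) criterion that appears level by level on the 'Extracting Structure' slide; DRAW keeps this objective but makes the encoder and decoder recurrent and adds differentiable read/write attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Plain variational auto-encoder (sketch, hypothetical sizes); DRAW makes enc/dec recurrent."""
    def __init__(self, n_in=784, n_hid=256, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hid)
        self.mu = nn.Linear(n_hid, n_z)        # mean of Q(z|x)
        self.logvar = nn.Linear(n_hid, n_z)    # log-variance of Q(z|x)
        self.dec1 = nn.Linear(n_z, n_hid)
        self.dec2 = nn.Linear(n_hid, n_in)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        x_logits = self.dec2(torch.relu(self.dec1(z)))            # parameters of P(x|z)
        return x_logits, mu, logvar

def neg_elbo(x, x_logits, mu, logvar):
    # Reconstruction term + KL(Q(z|x) || N(0, I)), averaged over the batch.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum') / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + kl

# Generation, as in the conventional-VAE half of the DRAW architecture figure:
# draw z ~ N(0, I), then decode it through P(x|z).
model = VAE()
with torch.no_grad():
    z = torch.randn(16, 20)
    samples = torch.sigmoid(model.dec2(torch.relu(model.dec1(z))))
```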
