Solution Manual for Pattern Recognition and Machine Learning by Bishop


Contents

Chapter 1: Introduction
Chapter 2: Probability Distributions
Chapter 3: Linear Models for Regression
Chapter 4: Linear Models for Classification
Chapter 5: Neural Networks
Chapter 6: Kernel Methods
Chapter 7: Sparse Kernel Machines
Chapter 8: Graphical Models
Chapter 9: Mixture Models and EM
Chapter 10: Approximate Inference
Chapter 11: Sampling Methods
Chapter 12: Continuous Latent Variables
Chapter 13: Sequential Data
Chapter 14: Combining Models

Chapter 1: Introduction

1.1 Substituting (1.1) into (1.2) and then differentiating with respect to $w_i$ we obtain

$$\sum_{n=1}^{N} \left( \sum_{j=0}^{M} w_j x_n^j - t_n \right) x_n^i = 0. \qquad (1)$$

Re-arranging terms then gives the required result.

1.2 For the regularized sum-of-squares error function given by (1.4) the corresponding linear equations are again obtained by differentiation, and take the same form as (1.122), but with $A_{ij}$ replaced by $\widetilde{A}_{ij}$, given by

$$\widetilde{A}_{ij} = A_{ij} + \lambda I_{ij}. \qquad (2)$$

1.3 Let us denote apples, oranges and limes by $a$, $o$ and $l$ respectively. The marginal probability of selecting an apple is given by

$$p(a) = p(a|r)p(r) + p(a|b)p(b) + p(a|g)p(g) = \tfrac{3}{10}\times 0.2 + \tfrac{1}{2}\times 0.2 + \tfrac{3}{10}\times 0.6 = 0.34, \qquad (3)$$

where the conditional probabilities are obtained from the proportions of apples in each box. To find the probability that the box was green, given that the fruit we selected was an orange, we can use Bayes' theorem

$$p(g|o) = \frac{p(o|g)\,p(g)}{p(o)}. \qquad (4)$$

The denominator in (4) is given by

$$p(o) = p(o|r)p(r) + p(o|b)p(b) + p(o|g)p(g) = \tfrac{4}{10}\times 0.2 + \tfrac{1}{2}\times 0.2 + \tfrac{3}{10}\times 0.6 = 0.36, \qquad (5)$$

from which we obtain

$$p(g|o) = \frac{\tfrac{3}{10}\times 0.6}{0.36} = 0.5. \qquad (6)$$
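The arithmetic in Solution 1.3 is easy to verify numerically. The short Python sketch below is an illustrative aside (not part of the original manual); the box priors and per-box fruit proportions are read directly off equations (3)–(5).

```python
# Numerical check of Solution 1.3 (illustrative sketch, not from the original manual).
priors = {"r": 0.2, "b": 0.2, "g": 0.6}      # box priors p(r), p(b), p(g)
p_apple = {"r": 3/10, "b": 1/2, "g": 3/10}   # p(a|box), proportion of apples per box
p_orange = {"r": 4/10, "b": 1/2, "g": 3/10}  # p(o|box), proportion of oranges per box

# Marginal probabilities via the sum rule, as in (3) and (5).
p_a = sum(p_apple[k] * priors[k] for k in priors)
p_o = sum(p_orange[k] * priors[k] for k in priors)

# Posterior over the green box given an orange, via Bayes' theorem (4).
p_g_given_o = p_orange["g"] * priors["g"] / p_o

print(p_a)          # 0.34
print(p_o)          # 0.36
print(p_g_given_o)  # 0.5
```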
1.4 We are often interested in finding the most probable value for some quantity. In the case of probability distributions over discrete variables this poses little problem. However, for continuous variables there is a subtlety arising from the nature of probability densities and the way they transform under non-linear changes of variable.

Consider first the way a function $f(x)$ behaves when we change to a new variable $y$, where the two variables are related by $x = g(y)$. This defines a new function of $y$ given by

$$\widetilde{f}(y) = f(g(y)). \qquad (7)$$

Suppose $f(x)$ has a mode (i.e. a maximum) at $\hat{x}$, so that $f'(\hat{x}) = 0$. The corresponding mode of $\widetilde{f}(y)$ will occur for a value $\hat{y}$ obtained by differentiating both sides of (7) with respect to $y$:

$$\widetilde{f}'(\hat{y}) = f'(g(\hat{y}))\, g'(\hat{y}) = 0. \qquad (8)$$

Assuming $g'(\hat{y}) \neq 0$ at the mode, then $f'(g(\hat{y})) = 0$. However, we know that $f'(\hat{x}) = 0$, and so we see that the locations of the mode expressed in terms of each of the variables $x$ and $y$ are related by $\hat{x} = g(\hat{y})$, as one would expect. Thus, finding a mode with respect to the variable $x$ is completely equivalent to first transforming to the variable $y$, then finding a mode with respect to $y$, and then transforming back to $x$.

Now consider the behaviour of a probability density $p_x(x)$ under the change of variables $x = g(y)$, where the density with respect to the new variable is $p_y(y)$ and is given by (1.27). Let us write $g'(y) = s\,|g'(y)|$ where $s \in \{-1, +1\}$. Then (1.27) can be written

$$p_y(y) = p_x(g(y))\, s\, g'(y).$$

Differentiating both sides with respect to $y$ then gives

$$p_y'(y) = s\, p_x'(g(y))\,\{g'(y)\}^2 + s\, p_x(g(y))\, g''(y). \qquad (9)$$

Due to the presence of the second term on the right-hand side of (9), the relationship $\hat{x} = g(\hat{y})$ no longer holds. Thus the value of $x$ obtained by maximizing $p_x(x)$ will not be the value obtained by transforming to $p_y(y)$, maximizing with respect to $y$, and then transforming back to $x$. This causes modes of densities to be dependent on the choice of variables. In the case of a linear transformation, the second term on the right-hand side of (9) vanishes, and so the location of the maximum transforms according to $\hat{x} = g(\hat{y})$.

This effect can be illustrated with a simple example, as shown in Figure 1. [Figure 1. Example of the transformation of the mode of a density under a non-linear change of variables, illustrating the different behaviour compared to a simple function; the plot shows $p_x(x)$, $p_y(y)$, the sigmoid $g^{-1}(x)$ and the sample histograms. See the text for details.] We begin by considering a Gaussian distribution $p_x(x)$ over $x$ with mean $\mu = 6$ and standard deviation $\sigma = 1$, shown by the red curve in Figure 1. Next we draw a sample of $N = 50{,}000$ points from this distribution and plot a histogram of their values, which as expected agrees with the distribution $p_x(x)$. Now consider a non-linear change of variables from $x$ to $y$ given by

$$x = g(y) = \ln(y) - \ln(1-y) + 5. \qquad (10)$$

The inverse of this function is given by

$$y = g^{-1}(x) = \frac{1}{1 + \exp(-x + 5)}, \qquad (11)$$

which is a logistic sigmoid function, and is shown in Figure 1 by the blue curve. If we simply transform $p_x(x)$ as a function of $x$ we obtain the green curve $p_x(g(y))$ shown in Figure 1, and we see that the mode of the density $p_x(x)$ is transformed via the sigmoid function to the mode of this curve. However, the density over $y$ transforms instead according to (1.27) and is shown by the magenta curve on the left side of the diagram. Note that this has its mode shifted relative to the mode of the green curve. To confirm this result we take our sample of 50,000 values of $x$, evaluate the corresponding values of $y$ using (11), and then plot a histogram of their values. We see that this histogram matches the magenta curve in Figure 1 and not the green curve.
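The histogram experiment in Solution 1.4 can be reproduced with a few lines of NumPy. The sketch below is an illustrative reconstruction rather than the original code: the sigmoid and sample size follow (10)–(11), while the Gaussian parameters ($\mu = 6$, $\sigma = 1$) are those quoted in the figure description and should be treated as assumptions.

```python
# Illustrative sketch of the change-of-variables experiment in Solution 1.4.
import numpy as np

mu, sigma, N = 6.0, 1.0, 50_000
rng = np.random.default_rng(0)

x = rng.normal(mu, sigma, N)            # samples from p_x(x)
y = 1.0 / (1.0 + np.exp(-x + 5.0))      # y = g^{-1}(x), the logistic sigmoid (11)

# Density over y via (1.27): p_y(y) = p_x(g(y)) |dx/dy|, with
# x = g(y) = ln y - ln(1 - y) + 5 and |dx/dy| = 1 / (y (1 - y)).
ys = np.linspace(1e-3, 1 - 1e-3, 2000)
xs = np.log(ys) - np.log(1 - ys) + 5.0
p_x = np.exp(-(xs - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
p_y = p_x / (ys * (1 - ys))

mode_py = ys[np.argmax(p_y)]                     # mode of the transformed density
mode_naive = 1.0 / (1.0 + np.exp(-mu + 5.0))     # sigmoid image of the mode of p_x
hist, edges = np.histogram(y, bins=100, density=True)
mode_hist = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])

# The histogram mode tracks mode_py, not mode_naive, as the solution describes.
print(mode_py, mode_naive, mode_hist)
```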
1.5 Expanding the square we have

$$\mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2] = \mathbb{E}[f(x)^2 - 2f(x)\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2] = \mathbb{E}[f(x)^2] - 2\mathbb{E}[f(x)]\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

as required.

1.6 The definition of covariance is given by (1.41) as $\mathrm{cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$. Using (1.33) and the fact that $p(x, y) = p(x)p(y)$ when $x$ and $y$ are independent, we obtain

$$\mathbb{E}[xy] = \sum_x \sum_y p(x, y)\, x y = \sum_x p(x)\, x \sum_y p(y)\, y = \mathbb{E}[x]\,\mathbb{E}[y]$$

and hence $\mathrm{cov}[x, y] = 0$. The case where $x$ and $y$ are continuous variables is analogous, with (1.33) replaced by (1.34) and the sums replaced by integrals.

1.7 The transformation from Cartesian to polar coordinates is defined by

$$x = r\cos\theta, \qquad (12)$$
$$y = r\sin\theta, \qquad (13)$$

and hence we have $x^2 + y^2 = r^2$, where we have used the well-known trigonometric result (2.177). Also the Jacobian of the change of variables is easily seen to be

$$\frac{\partial(x, y)}{\partial(r, \theta)} = \begin{vmatrix} \partial x/\partial r & \partial x/\partial\theta \\ \partial y/\partial r & \partial y/\partial\theta \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r,$$

where again we have used (2.177). Thus the double integral in (1.125) becomes

$$I^2 = \int_0^{2\pi}\!\!\int_0^{\infty} \exp\!\left(-\frac{r^2}{2\sigma^2}\right) r\, dr\, d\theta \qquad (14)$$
$$= 2\pi \int_0^{\infty} \exp\!\left(-\frac{u}{2\sigma^2}\right) \frac{1}{2}\, du \qquad (15)$$
$$= \pi\left[ -2\sigma^2 \exp\!\left(-\frac{u}{2\sigma^2}\right) \right]_0^{\infty} \qquad (16)$$
$$= 2\pi\sigma^2, \qquad (17)$$

where we have used the change of variables $r^2 = u$. Thus $I = (2\pi\sigma^2)^{1/2}$. Finally, using the transformation $y = x - \mu$, the integral of the Gaussian distribution becomes

$$\int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = \frac{1}{(2\pi\sigma^2)^{1/2}} \int_{-\infty}^{\infty} \exp\!\left(-\frac{y^2}{2\sigma^2}\right) dy = \frac{I}{(2\pi\sigma^2)^{1/2}} = 1$$

as required.

1.8 From the definition (1.46) of the univariate Gaussian distribution, we have

$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) x\, dx. \qquad (18)$$

Now change variables using $y = x - \mu$ to give

$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{y^2}{2\sigma^2}\right)(y + \mu)\, dy. \qquad (19)$$

We now note that in the factor $(y + \mu)$ the first term in $y$ corresponds to an odd integrand and so this integral must vanish (to show this explicitly, write the integral as the sum of two integrals, one from $-\infty$ to $0$ and the other from $0$ to $\infty$, and then show that these two integrals cancel). In the second term, $\mu$ is a constant and pulls outside the integral, leaving a normalized Gaussian distribution which integrates to 1, and so we obtain (1.49).

To derive (1.50) we first substitute the expression (1.46) for the normal distribution into the normalization result (1.48) and re-arrange to obtain

$$\int_{-\infty}^{\infty} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx = (2\pi\sigma^2)^{1/2}. \qquad (20)$$

We now differentiate both sides of (20) with respect to $\sigma^2$ and then re-arrange to obtain

$$\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)(x-\mu)^2\, dx = \sigma^2, \qquad (21)$$

which directly shows that

$$\mathbb{E}[(x-\mu)^2] = \mathrm{var}[x] = \sigma^2. \qquad (22)$$

Now we expand the square on the left-hand side giving $\mathbb{E}[x^2] - 2\mu\mathbb{E}[x] + \mu^2 = \sigma^2$. Making use of (1.49) then gives (1.50) as required. Finally, (1.51) follows directly from (1.49) and (1.50):

$$\mathbb{E}[x^2] - \mathbb{E}[x]^2 = \mu^2 + \sigma^2 - \mu^2 = \sigma^2.$$

1.9 For the univariate case, we simply differentiate (1.46) with respect to $x$ to obtain

$$\frac{d}{dx}\mathcal{N}(x|\mu, \sigma^2) = -\mathcal{N}(x|\mu, \sigma^2)\,\frac{x-\mu}{\sigma^2}.$$

Setting this to zero we obtain $x = \mu$. Similarly, for the multivariate case we differentiate (1.52) with respect to $\mathbf{x}$ to obtain

$$\frac{\partial}{\partial\mathbf{x}}\mathcal{N}(\mathbf{x}|\boldsymbol\mu, \boldsymbol\Sigma) = -\frac{1}{2}\mathcal{N}(\mathbf{x}|\boldsymbol\mu, \boldsymbol\Sigma)\,\nabla_{\mathbf{x}}\!\left\{(\mathbf{x}-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right\} = -\mathcal{N}(\mathbf{x}|\boldsymbol\mu, \boldsymbol\Sigma)\,\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu),$$

where we have used (C.19), (C.20) and the fact that $\boldsymbol\Sigma^{-1}$ is symmetric. Setting this derivative equal to 0, and left-multiplying by $\boldsymbol\Sigma$, leads to the solution $\mathbf{x} = \boldsymbol\mu$. (NOTE: in the 1st printing of PRML, there are mistakes in (C.20); all instances of $\mathbf{x}$ (vector) in the denominators should be $x$ (scalar).)
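Solutions 1.7 and 1.8 establish that the univariate Gaussian integrates to one and has mean $\mu$ and variance $\sigma^2$. A quick numerical quadrature check is sketched below; it is an illustrative aside (not part of the original manual), with arbitrary example values of $\mu$ and $\sigma$.

```python
# Numerical check of the Gaussian results in Solutions 1.7-1.8 (illustrative sketch).
import numpy as np

mu, sigma = 1.3, 0.7
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

norm = np.sum(p) * dx              # ~1                  (normalization, Solution 1.7)
mean = np.sum(x * p) * dx          # ~mu                 (1.49)
second = np.sum(x ** 2 * p) * dx   # ~mu^2 + sigma^2     (1.50)
var = second - mean ** 2           # ~sigma^2            (1.51)

print(norm, mean, second, var)
```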
1.10 Since $x$ and $z$ are independent, their joint distribution factorizes, $p(x, z) = p(x)p(z)$, and so

$$\mathbb{E}[x + z] = \iint (x + z)\, p(x)p(z)\, dx\, dz = \int x\, p(x)\, dx + \int z\, p(z)\, dz = \mathbb{E}[x] + \mathbb{E}[z]. \qquad (23\text{--}25)$$

Similarly for the variances, we first note that

$$(x + z - \mathbb{E}[x+z])^2 = (x - \mathbb{E}[x])^2 + (z - \mathbb{E}[z])^2 + 2(x - \mathbb{E}[x])(z - \mathbb{E}[z]), \qquad (26)$$

where the final term will integrate to zero with respect to the factorized distribution $p(x)p(z)$. Hence

$$\mathrm{var}[x + z] = \iint (x + z - \mathbb{E}[x+z])^2\, p(x)p(z)\, dx\, dz = \int (x - \mathbb{E}[x])^2 p(x)\, dx + \int (z - \mathbb{E}[z])^2 p(z)\, dz = \mathrm{var}[x] + \mathrm{var}[z]. \qquad (27)$$

For discrete variables the integrals are replaced by summations, and the same results are again obtained.

1.11 We use $L$ to denote $\ln p(\mathbf{X}|\mu, \sigma^2)$ from (1.54). By standard rules of differentiation we obtain

$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu).$$

Setting this equal to zero and moving the terms involving $\mu$ to the other side of the equation we get

$$\frac{1}{\sigma^2}\sum_{n=1}^{N} x_n = \frac{N\mu}{\sigma^2},$$

and by multiplying both sides by $\sigma^2/N$ we get (1.55). Similarly we have

$$\frac{\partial L}{\partial\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2\sigma^2},$$

and setting this to zero we obtain

$$\frac{N}{2\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{n=1}^{N}(x_n - \mu)^2.$$

Multiplying both sides by $2(\sigma^2)^2/N$ and substituting $\mu_{\mathrm{ML}}$ for $\mu$ we get (1.56).

1.12 If $m = n$ then $x_n x_m = x_n^2$ and using (1.50) we obtain $\mathbb{E}[x_n^2] = \mu^2 + \sigma^2$, whereas if $n \neq m$ then the two data points $x_n$ and $x_m$ are independent and hence $\mathbb{E}[x_n x_m] = \mathbb{E}[x_n]\mathbb{E}[x_m] = \mu^2$, where we have used (1.49). Combining these two results we obtain (1.130). Next we have

$$\mathbb{E}[\mu_{\mathrm{ML}}] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[x_n] = \mu \qquad (28)$$

using (1.49). Finally, consider $\mathbb{E}[\sigma^2_{\mathrm{ML}}]$. From (1.55) and (1.56), and making use of (1.130), we have

$$\mathbb{E}[\sigma^2_{\mathrm{ML}}]
= \mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}\left(x_n - \frac{1}{N}\sum_{m=1}^{N} x_m\right)^{\!2}\right]
= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\!\left[x_n^2 - \frac{2}{N}x_n\sum_{m=1}^{N}x_m + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N}x_m x_l\right]
= \mu^2 + \sigma^2 - 2\left(\mu^2 + \frac{\sigma^2}{N}\right) + \mu^2 + \frac{\sigma^2}{N}
= \left(\frac{N-1}{N}\right)\sigma^2 \qquad (29)$$

as required.

1.13 In a similar fashion to Solution 1.12, substituting $\mu$ for $\mu_{\mathrm{ML}}$ in (1.56) and using (1.49) and (1.50) we have

$$\mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2\right] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\!\left[x_n^2 - 2x_n\mu + \mu^2\right] = \frac{1}{N}\sum_{n=1}^{N}\left(\mu^2 + \sigma^2 - 2\mu^2 + \mu^2\right) = \sigma^2.$$

1.14 Define

$$w_{ij}^{\mathrm S} = \frac{1}{2}(w_{ij} + w_{ji}), \qquad w_{ij}^{\mathrm A} = \frac{1}{2}(w_{ij} - w_{ji}), \qquad (30)$$

from which the (anti)symmetry properties follow directly, as does the relation $w_{ij} = w_{ij}^{\mathrm S} + w_{ij}^{\mathrm A}$. We now note that

$$\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^{\mathrm A}\, x_i x_j = \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}\, x_i x_j - \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ji}\, x_i x_j = 0, \qquad (31)$$

from which we obtain (1.132). The number of independent components in $w_{ij}^{\mathrm S}$ can be found by noting that there are $D^2$ parameters in total in this matrix, and that entries off the leading diagonal occur in constrained pairs $w_{ij} = w_{ji}$ for $j \neq i$. Thus we start with $D^2$ parameters in the matrix $w_{ij}^{\mathrm S}$, subtract $D$ for the number of parameters on the leading diagonal, divide by two, and then add back $D$ for the leading diagonal, obtaining $(D^2 - D)/2 + D = D(D+1)/2$.
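The bias result of Solution 1.12, $\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\sigma^2$, is easy to confirm by simulation. The following sketch is illustrative (not from the original manual); the parameter values are arbitrary.

```python
# Monte Carlo check of Solution 1.12: E[sigma^2_ML] = (N-1)/N * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, trials = 0.0, 2.0, 5, 200_000

data = rng.normal(mu, sigma, size=(trials, N))
mu_ml = data.mean(axis=1, keepdims=True)          # sample mean, (1.55)
sigma2_ml = ((data - mu_ml) ** 2).mean(axis=1)    # biased ML variance, (1.56)

print(sigma2_ml.mean())          # ~ (N-1)/N * sigma^2 = 3.2
print((N - 1) / N * sigma ** 2)  # 3.2
```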
1.15 The redundancy in the coefficients in (1.133) arises from interchange symmetries between the indices $i_k$. Such symmetries can therefore be removed by enforcing an ordering on the indices, as in (1.134), so that only one member in each group of equivalent configurations occurs in the summation. To derive (1.135) we note that the number of independent parameters $n(D, M)$ which appear at order $M$ can be written as

$$n(D, M) = \sum_{i_1=1}^{D}\sum_{i_2=1}^{i_1}\cdots\sum_{i_M=1}^{i_{M-1}} 1, \qquad (32)$$

which has $M$ sums. This can clearly also be written as

$$n(D, M) = \sum_{i_1=1}^{D}\left\{\sum_{i_2=1}^{i_1}\cdots\sum_{i_M=1}^{i_{M-1}} 1\right\}, \qquad (33)$$

where the term in braces has $M-1$ sums and, from (32), must equal $n(i_1, M-1)$. Thus we can write

$$n(D, M) = \sum_{i_1=1}^{D} n(i_1, M-1), \qquad (34)$$

which is equivalent to (1.135).

To prove (1.136) we first set $D = 1$ on both sides of the equation, and make use of $0! = 1$, which gives the value 1 on both sides, thus showing the equation is valid for $D = 1$. Now we assume that it is true for a specific value of dimensionality $D$ and then show that it must be true for dimensionality $D + 1$. Thus consider the left-hand side of (1.136) evaluated for $D + 1$, which gives

$$\sum_{i=1}^{D+1}\frac{(i + M - 2)!}{(i-1)!\,(M-1)!} = \frac{(D + M - 1)!}{(D-1)!\,M!} + \frac{(D + M - 1)!}{D!\,(M-1)!} = \frac{(D + M - 1)!\,D + (D + M - 1)!\,M}{D!\,M!} = \frac{(D + M)!}{D!\,M!}, \qquad (35)$$

which equals the right-hand side of (1.136) for dimensionality $D + 1$. Thus, by induction, (1.136) must hold true for all values of $D$.

Finally we use induction to prove (1.137). For $M = 2$ we obtain the standard result $n(D, 2) = \frac{1}{2}D(D+1)$, which is also proved in Exercise 1.14. Now assume that (1.137) is correct for a specific order $M - 1$, so that

$$n(D, M-1) = \frac{(D + M - 2)!}{(D-1)!\,(M-1)!}. \qquad (36)$$

Substituting this into the right-hand side of (1.135) we obtain

$$n(D, M) = \sum_{i=1}^{D}\frac{(i + M - 2)!}{(i-1)!\,(M-1)!}, \qquad (37)$$

which, making use of (1.136), gives

$$n(D, M) = \frac{(D + M - 1)!}{(D-1)!\,M!} \qquad (38)$$

and hence shows that (1.137) is true for polynomials of order $M$. Thus by induction (1.137) must be true for all values of $M$.

1.16 NOTE: In the 1st printing of PRML, this exercise contains two typographical errors. On line 4, "$M$6th" should be "$M$th" and on the l.h.s. of (1.139), $N(d, M)$ should be $N(D, M)$.

The result (1.138) follows simply from summing up the numbers of coefficients at all orders up to and including order $M$. To prove (1.139), we first note that when $M = 0$ the right-hand side of (1.139) equals 1, which we know to be correct since this is the number of parameters at zeroth order, which is just the constant offset in the polynomial. Assuming that (1.139) is correct at order $M$, we obtain the following result at order $M + 1$:

$$N(D, M+1) = \sum_{m=0}^{M+1} n(D, m) = \sum_{m=0}^{M} n(D, m) + n(D, M+1)
= \frac{(D+M)!}{D!\,M!} + \frac{(D+M)!}{(D-1)!\,(M+1)!}
= \frac{(D+M)!\,(M+1) + (D+M)!\,D}{D!\,(M+1)!}
= \frac{(D+M+1)!}{D!\,(M+1)!},$$

which is the required result at order $M + 1$.

Now assume $M \gg D$. Using Stirling's formula we have

$$N(D, M) \simeq \frac{(D+M)^{D+M}\, e^{-D-M}}{D!\, M^{M}\, e^{-M}}
= \frac{M^{D+M}\, e^{-D}}{D!\, M^{M}}\left(1 + \frac{D}{M}\right)^{D+M}
= \frac{M^{D}\, e^{-D}}{D!}\left(1 + \frac{D}{M}\right)^{D+M}
\simeq \frac{M^{D}\, e^{-D}}{D!}\left(1 + \frac{D(D+M)}{M}\right)
\simeq \frac{(1+D)\, e^{-D}}{D!}\, M^{D},$$

which grows like $M^{D}$ with $M$. The case where $D \gg M$ is identical, with the roles of $D$ and $M$ exchanged. By numerical evaluation we obtain $N(10, 3) = 286$ and $N(100, 3) = 176{,}851$.
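The recurrence (1.135) and the closed form (1.139) from Solutions 1.15–1.16 can be checked directly. The sketch below is illustrative (not part of the original manual); the base case $n(D, 0) = 1$ is an assumption chosen to be consistent with (1.139). It reproduces the quoted values $N(10, 3) = 286$ and $N(100, 3) = 176{,}851$.

```python
# Check of Solutions 1.15-1.16: recursion (1.135) vs the closed form for N(D, M).
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def n(D, M):
    """Independent coefficients of order M in D dimensions, via the recursion (1.135)."""
    if M == 0:
        return 1                        # single constant term at order zero (assumed base case)
    return sum(n(i, M - 1) for i in range(1, D + 1))

def N_total(D, M):
    """Total number of coefficients up to order M, equation (1.138)."""
    return sum(n(D, m) for m in range(M + 1))

for D, M in [(10, 3), (100, 3)]:
    closed = comb(D + M, M)             # (D+M)! / (D! M!), the closed form (1.139)
    print(D, M, N_total(D, M), closed)  # 286 and 176851 for the two cases
```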
1.17 Using integration by parts we have

$$\Gamma(x+1) = \int_0^{\infty} u^{x} e^{-u}\, du = \left[-e^{-u} u^{x}\right]_0^{\infty} + \int_0^{\infty} x\, u^{x-1} e^{-u}\, du = 0 + x\,\Gamma(x). \qquad (39)$$

For $x = 1$ we have

$$\Gamma(1) = \int_0^{\infty} e^{-u}\, du = \left[-e^{-u}\right]_0^{\infty} = 1. \qquad (40)$$

If $x$ is an integer we can apply proof by induction to relate the gamma function to the factorial function. Suppose that $\Gamma(x+1) = x!$ holds. Then from the result (39) we have $\Gamma(x+2) = (x+1)\Gamma(x+1) = (x+1)!$. Finally, $\Gamma(1) = 1 = 0!$, which completes the proof by induction.

1.18 On the right-hand side of (1.142) we make the change of variables $u = r^2$ to give

$$S_D \int_0^{\infty} e^{-u} u^{D/2 - 1}\, \frac{1}{2}\, du = \frac{S_D}{2}\,\Gamma(D/2), \qquad (41)$$

where we have used the definition (1.141) of the Gamma function. On the left-hand side of (1.142) we can use (1.126) to obtain $\pi^{D/2}$. Equating these we obtain the desired result (1.143). The volume of a sphere of radius 1 in $D$ dimensions is obtained by integration:

$$V_D = S_D \int_0^1 r^{D-1}\, dr = \frac{S_D}{D}. \qquad (42)$$

For $D = 2$ and $D = 3$ we obtain the following results:

$$S_2 = 2\pi, \qquad S_3 = 4\pi, \qquad V_2 = \pi a^2, \qquad V_3 = \frac{4}{3}\pi a^3. \qquad (43)$$

1.19 The volume of the cube is $(2a)^D$. Combining this with (1.143) and (1.144) we obtain (1.145). Using Stirling's formula (1.146) in (1.145) the ratio becomes, for large $D$,

$$\frac{\text{volume of sphere}}{\text{volume of cube}} \simeq \frac{1}{D}\left(\frac{\pi e}{2D}\right)^{D/2}, \qquad (44)$$

which goes to 0 as $D \to \infty$. The distance from the center of the cube to the midpoint of one of the sides is $a$, since this is where it makes contact with the sphere. Similarly, the distance to one of the corners is $a\sqrt{D}$, from Pythagoras' theorem. Thus the ratio is $\sqrt{D}$.

1.20 Since $p(\mathbf{x})$ is radially symmetric, it will be roughly constant over the shell of radius $r$ and thickness $\epsilon$. This shell has volume $S_D r^{D-1}\epsilon$ and, since $\|\mathbf{x}\|^2 = r^2$, we have

$$\int_{\text{shell}} p(\mathbf{x})\, d\mathbf{x} \simeq p(r)\, S_D\, r^{D-1}\epsilon, \qquad (45)$$

from which we obtain (1.148). We can find the stationary points of $p(r)$ by differentiation:

$$\frac{d}{dr}p(r) \propto \left[(D-1)r^{D-2} + r^{D-1}\left(-\frac{r}{\sigma^2}\right)\right]\exp\!\left(-\frac{r^2}{2\sigma^2}\right) = 0. \qquad (46)$$

Solving for $r$, and using $D \gg 1$, we obtain $\hat{r} \simeq \sqrt{D}\,\sigma$. Next we note that

$$p(\hat{r} + \epsilon) \propto (\hat{r}+\epsilon)^{D-1}\exp\!\left(-\frac{(\hat{r}+\epsilon)^2}{2\sigma^2}\right) = \exp\!\left(-\frac{(\hat{r}+\epsilon)^2}{2\sigma^2} + (D-1)\ln(\hat{r}+\epsilon)\right). \qquad (47)$$

We now expand $p(r)$ around the point $\hat{r}$. Since this is a stationary point of $p(r)$ we must keep terms up to second order. Making use of the expansion $\ln(1+x) = x - x^2/2 + O(x^3)$, together with $D \gg 1$, we obtain (1.149). Finally, from (1.147) we see that the probability density at the origin is given by

$$p(\mathbf{x} = \mathbf{0}) = \frac{1}{(2\pi\sigma^2)^{D/2}},$$

while the density at $\|\mathbf{x}\| = \hat{r}$ is given from (1.147) by

$$p(\|\mathbf{x}\| = \hat{r}) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\!\left(-\frac{\hat{r}^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\!\left(-\frac{D}{2}\right),$$

where we have used $\hat{r}^2 \simeq D\sigma^2$. Thus the ratio of densities is given by $\exp(D/2)$.

1.21 Since the square root function is monotonic for non-negative numbers, we can take the square root of the relation $a \leqslant b$ to obtain $a^{1/2} \leqslant b^{1/2}$. Then we multiply both sides by the non-negative quantity $a^{1/2}$ to obtain $a \leqslant (ab)^{1/2}$.

The probability of a misclassification is given, from (1.78), by

$$p(\text{mistake}) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x} = \int_{\mathcal{R}_1} p(\mathcal{C}_2|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathcal{C}_1|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}. \qquad (48)$$

Since we have chosen the decision regions to minimize the probability of misclassification, we must have $p(\mathcal{C}_2|\mathbf{x}) \leqslant p(\mathcal{C}_1|\mathbf{x})$ in region $\mathcal{R}_1$, and $p(\mathcal{C}_1|\mathbf{x}) \leqslant p(\mathcal{C}_2|\mathbf{x})$ in region $\mathcal{R}_2$. We now apply the result $a \leqslant b \Rightarrow a \leqslant (ab)^{1/2}$ to give

$$p(\text{mistake}) \leqslant \int_{\mathcal{R}_1}\{p(\mathcal{C}_1|\mathbf{x})\,p(\mathcal{C}_2|\mathbf{x})\}^{1/2}\, p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2}\{p(\mathcal{C}_1|\mathbf{x})\,p(\mathcal{C}_2|\mathbf{x})\}^{1/2}\, p(\mathbf{x})\, d\mathbf{x} = \int\{p(\mathcal{C}_1|\mathbf{x})p(\mathbf{x})\,p(\mathcal{C}_2|\mathbf{x})p(\mathbf{x})\}^{1/2}\, d\mathbf{x}, \qquad (49)$$

since the two integrals have the same integrand. The final integral is taken over the whole of the domain of $\mathbf{x}$.
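The geometric quantities in Solutions 1.18–1.19, $S_D$ and $V_D$ from (1.143) and (42), and the collapsing sphere-to-cube volume ratio, can be tabulated numerically. The sketch below is illustrative (not from the original manual); it works in log space via the log-Gamma function to avoid overflow, and reports the corner distance $\sqrt{D}$ for a unit-radius sphere in a cube of side 2.

```python
# Numerical illustration of Solutions 1.18-1.19: S_D, V_D and the sphere/cube ratio.
from math import pi, lgamma, exp, log, sqrt

def log_S(D):
    # ln S_D, with S_D = 2 pi^{D/2} / Gamma(D/2), equation (1.143)
    return log(2.0) + 0.5 * D * log(pi) - lgamma(0.5 * D)

for D in (2, 3, 10, 100):
    log_V = log_S(D) - log(D)             # V_D = S_D / D, equation (42)
    log_ratio = log_V - D * log(2.0)      # unit sphere volume / volume of cube of side 2
    print(D, exp(log_S(D)), exp(log_V), exp(log_ratio), sqrt(D))
# The volume ratio collapses toward zero as D grows, while the distance to a
# corner of the cube grows as sqrt(D), as described in Solution 1.19.
```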
1.22 Substituting $L_{kj} = 1 - \delta_{kj}$ into (1.81), and using the fact that the posterior probabilities sum to one, we find that, for each $\mathbf{x}$, we should choose the class $j$ for which $1 - p(\mathcal{C}_j|\mathbf{x})$ is a minimum, which is equivalent to choosing the $j$ for which the posterior probability $p(\mathcal{C}_j|\mathbf{x})$ is a maximum. This loss matrix assigns a loss of one if the example is misclassified and a loss of zero if it is correctly classified, and hence minimizing the expected loss will minimize the misclassification rate.

1.23 From (1.81) we see that, for a general loss matrix and arbitrary class priors, the expected loss is minimized by assigning an input $\mathbf{x}$ to the class $j$ which minimizes

$$\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) = \frac{1}{p(\mathbf{x})}\sum_k L_{kj}\, p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k),$$

and so there is a direct trade-off between the priors $p(\mathcal{C}_k)$ and the loss matrix $L_{kj}$.

1.24 A vector $\mathbf{x}$ belongs to class $\mathcal{C}_k$ with probability $p(\mathcal{C}_k|\mathbf{x})$. If we decide to assign $\mathbf{x}$ to class $\mathcal{C}_j$ we will incur an expected loss of $\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x})$, whereas if we select the reject option we will incur a loss of $\lambda$. Thus, if

$$j = \arg\min_{l}\sum_k L_{kl}\, p(\mathcal{C}_k|\mathbf{x}), \qquad (50)$$

then we minimize the expected loss if we take the following action:

choose class $j$, if $\min_l \sum_k L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) < \lambda$; otherwise reject. $\qquad (51)$

For a loss matrix $L_{kj} = 1 - I_{kj}$ we have $\sum_k L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) = 1 - p(\mathcal{C}_l|\mathbf{x})$, and so we reject unless the smallest value of $1 - p(\mathcal{C}_l|\mathbf{x})$ is less than $\lambda$, or equivalently unless the largest value of $p(\mathcal{C}_l|\mathbf{x})$ is greater than $1 - \lambda$. In the standard reject criterion we reject if the largest posterior probability is less than $\theta$. Thus these two criteria for rejection are equivalent provided $\theta = 1 - \lambda$.

1.25 The expected squared loss for a vectorial target variable is given by

$$\mathbb{E}[L] = \iint \|\mathbf{y}(\mathbf{x}) - \mathbf{t}\|^2\, p(\mathbf{t}, \mathbf{x})\, d\mathbf{x}\, d\mathbf{t}.$$

Our goal is to choose $\mathbf{y}(\mathbf{x})$ so as to minimize $\mathbb{E}[L]$. We can do this formally using the calculus of variations to give

$$\frac{\delta\mathbb{E}[L]}{\delta\mathbf{y}(\mathbf{x})} = \int 2\,(\mathbf{y}(\mathbf{x}) - \mathbf{t})\, p(\mathbf{t}, \mathbf{x})\, d\mathbf{t} = 0.$$

Solving for $\mathbf{y}(\mathbf{x})$, and using the sum and product rules of probability, we obtain

$$\mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\, p(\mathbf{t}, \mathbf{x})\, d\mathbf{t}}{\int p(\mathbf{t}, \mathbf{x})\, d\mathbf{t}} = \int \mathbf{t}\, p(\mathbf{t}|\mathbf{x})\, d\mathbf{t},$$

which is the conditional average of $\mathbf{t}$ conditioned on $\mathbf{x}$. For the case of a scalar target variable we have $y(\mathbf{x}) = \int t\, p(t|\mathbf{x})\, dt$, which is equivalent to (1.89).

1.26 NOTE: In the 1st printing of PRML, there is an error in equation (1.90); the integrand of the second integral should be replaced by $\mathrm{var}[t|\mathbf{x}]\,p(\mathbf{x})$. We start by expanding the square in (1.151), in a similar fashion to the univariate case in the equation preceding (1.90):

$$\|\mathbf{y}(\mathbf{x}) - \mathbf{t}\|^2 = \|\mathbf{y}(\mathbf{x}) - \mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 + (\mathbf{y}(\mathbf{x}) - \mathbb{E}[\mathbf{t}|\mathbf{x}])^{\mathrm T}(\mathbb{E}[\mathbf{t}|\mathbf{x}] - \mathbf{t}) + (\mathbb{E}[\mathbf{t}|\mathbf{x}] - \mathbf{t})^{\mathrm T}(\mathbf{y}(\mathbf{x}) - \mathbb{E}[\mathbf{t}|\mathbf{x}]) + \|\mathbb{E}[\mathbf{t}|\mathbf{x}] - \mathbf{t}\|^2.$$

Following the treatment of the univariate case, we now substitute this into (1.151) and perform the integral over $\mathbf{t}$. Again the cross-terms vanish and we are left with

$$\mathbb{E}[L] = \int \|\mathbf{y}(\mathbf{x}) - \mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2\, p(\mathbf{x})\, d\mathbf{x} + \int \mathrm{var}[\mathbf{t}|\mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x},$$

from which we see directly that the function $\mathbf{y}(\mathbf{x})$ that minimizes $\mathbb{E}[L]$ is given by $\mathbb{E}[\mathbf{t}|\mathbf{x}]$.
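The reject-option rule of Solution 1.24 is straightforward to implement for the 0/1 loss $L_{kj} = 1 - I_{kj}$, where rejecting is optimal exactly when the largest posterior falls below $\theta = 1 - \lambda$. The sketch below is illustrative (not from the original manual), and the posterior vectors in the example are made up.

```python
# Illustrative implementation of the reject-option rule from Solution 1.24,
# for the 0/1 loss L_kj = 1 - I_kj. The posterior vectors below are hypothetical.
import numpy as np

def decide(posteriors, lam):
    """Return the chosen class index, or None to reject, minimizing expected loss."""
    posteriors = np.asarray(posteriors)
    expected_loss = 1.0 - posteriors               # sum_k L_kl p(C_k|x) for the 0/1 loss
    j = int(np.argmin(expected_loss))
    return j if expected_loss[j] < lam else None   # reject when min expected loss >= lambda

lam = 0.3                                          # reject loss; theta = 1 - lam = 0.7
for post in ([0.9, 0.05, 0.05], [0.5, 0.3, 0.2], [0.72, 0.18, 0.10]):
    print(post, "->", decide(post, lam))
# [0.9, ...] -> class 0, [0.5, ...] -> reject, [0.72, ...] -> class 0
```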
1.27 Since we can choose $y(\mathbf{x})$ independently for each value of $\mathbf{x}$, the minimum of the expected $L_q$ loss can be found by minimizing the integrand

$$\int |y(\mathbf{x}) - t|^{q}\, p(t|\mathbf{x})\, dt \qquad (52)$$

for each value of $\mathbf{x}$. Setting the derivative of (52) with respect to $y(\mathbf{x})$ to zero gives the stationarity condition

$$\int q\,|y(\mathbf{x}) - t|^{q-1}\,\mathrm{sign}(y(\mathbf{x}) - t)\, p(t|\mathbf{x})\, dt = q\int_{-\infty}^{y(\mathbf{x})}|y(\mathbf{x}) - t|^{q-1} p(t|\mathbf{x})\, dt - q\int_{y(\mathbf{x})}^{\infty}|y(\mathbf{x}) - t|^{q-1} p(t|\mathbf{x})\, dt = 0,$$

which can also be obtained directly by setting the functional derivative of (1.91) with respect to $y(\mathbf{x})$ equal to zero. It follows that $y(\mathbf{x})$ must satisfy

$$\int_{-\infty}^{y(\mathbf{x})}|y(\mathbf{x}) - t|^{q-1}\, p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{\infty}|y(\mathbf{x}) - t|^{q-1}\, p(t|\mathbf{x})\, dt. \qquad (53)$$

For the case of $q = 1$ this reduces to

$$\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{\infty} p(t|\mathbf{x})\, dt, \qquad (54)$$

which says that $y(\mathbf{x})$ must be the conditional median of $t$. For $q \to 0$ we note that, as a function of $t$, the quantity $|y(\mathbf{x}) - t|^{q}$ is close to 1 everywhere except in a small neighbourhood around $t = y(\mathbf{x})$, where it falls to zero. The value of (52) will therefore be close to 1, since the density $p(t)$ is normalized, but reduced slightly by the 'notch' close to $t = y(\mathbf{x})$. We obtain the biggest reduction in (52) by choosing the location of the notch to coincide with the largest value of $p(t)$, i.e. with the (conditional) mode.

1.28 From the discussion of the introduction of Section 1.6, we have $h(p^2) = h(p) + h(p) = 2h(p)$. We then assume that $h(p^k) = k\,h(p)$ for all $k \leqslant K$. For $k = K + 1$ we have

$$h(p^{K+1}) = h(p^{K} p) = h(p^{K}) + h(p) = K\,h(p) + h(p) = (K+1)\,h(p).$$

Moreover,

$$h(p^{n/m}) = n\,h(p^{1/m}) = \frac{n}{m}\,m\,h(p^{1/m}) = \frac{n}{m}\,h(p^{m/m}) = \frac{n}{m}\,h(p),$$

and so, by continuity, we have that $h(p^{x}) = x\,h(p)$ for any real number $x$. Now consider the positive real numbers $p$ and $q$ and the real number $x$ such that $p = q^{x}$. From the above discussion, we see that

$$\frac{h(p)}{\ln(p)} = \frac{h(q^{x})}{\ln(q^{x})} = \frac{x\,h(q)}{x\ln(q)} = \frac{h(q)}{\ln(q)},$$

and hence $h(p) \propto \ln(p)$.

1.29 The entropy of an $M$-state discrete variable $x$ can be written in the form

$$\mathrm{H}(x) = -\sum_{i=1}^{M} p(x_i)\ln p(x_i) = \sum_{i=1}^{M} p(x_i)\ln\frac{1}{p(x_i)}. \qquad (55)$$

The function $\ln(x)$ is concave and so we can apply Jensen's inequality in the form (1.115) but with the inequality reversed, so that

$$\mathrm{H}(x) \leqslant \ln\sum_{i=1}^{M} p(x_i)\,\frac{1}{p(x_i)} = \ln M. \qquad (56)$$

1.30 NOTE: In PRML, there is a minus sign ('$-$') missing on the l.h.s. of (1.103). From (1.113) we have

$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\, dx + \int p(x)\ln p(x)\, dx. \qquad (57)$$

Using (1.46) and (1.48)–(1.50), we can rewrite the first integral on the r.h.s. of (57) as

$$-\int p(x)\ln q(x)\, dx = \frac{1}{2}\ln(2\pi s^2) + \int \mathcal{N}(x|\mu, \sigma^2)\,\frac{(x-m)^2}{2s^2}\, dx
= \frac{1}{2}\ln(2\pi s^2) + \frac{1}{2s^2}\int \mathcal{N}(x|\mu, \sigma^2)\,(x^2 - 2xm + m^2)\, dx
= \frac{1}{2}\left(\ln(2\pi s^2) + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2}\right). \qquad (58)$$

The second integral on the r.h.s. of (57) we recognize from (1.103) as the negative differential entropy of a Gaussian. Thus, from (57), (58) and (1.110), we have

$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left(\ln(2\pi s^2) + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2} - 1 - \ln(2\pi\sigma^2)\right) = \frac{1}{2}\left(\ln\frac{s^2}{\sigma^2} + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2} - 1\right).$$

1.31 We first make use of the relation $\mathrm{I}(\mathbf{x}; \mathbf{y}) = \mathrm{H}(\mathbf{y}) - \mathrm{H}(\mathbf{y}|\mathbf{x})$, which we obtained in (1.121), and note that the mutual information satisfies $\mathrm{I}(\mathbf{x}; \mathbf{y}) \geqslant 0$ since it is a form of Kullback-Leibler divergence. Finally we make use of the relation (1.112) to obtain the desired result (1.152).

To show that statistical independence is a sufficient condition for the equality to be satisfied, we substitute $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y})$ into the definition of the entropy, giving

$$\mathrm{H}(\mathbf{x}, \mathbf{y}) = -\iint p(\mathbf{x}, \mathbf{y})\ln p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y} = -\iint p(\mathbf{x})p(\mathbf{y})\{\ln p(\mathbf{x}) + \ln p(\mathbf{y})\}\, d\mathbf{x}\, d\mathbf{y} = -\int p(\mathbf{x})\ln p(\mathbf{x})\, d\mathbf{x} - \int p(\mathbf{y})\ln p(\mathbf{y})\, d\mathbf{y} = \mathrm{H}(\mathbf{x}) + \mathrm{H}(\mathbf{y}).$$

To show that statistical independence is a necessary condition, we combine the equality condition $\mathrm{H}(\mathbf{x}, \mathbf{y}) = \mathrm{H}(\mathbf{x}) + \mathrm{H}(\mathbf{y})$ with the result (1.112) to give $\mathrm{H}(\mathbf{y}|\mathbf{x}) = \mathrm{H}(\mathbf{y})$. We now note that the right-hand side is independent of $\mathbf{x}$ and hence the left-hand side must also be constant with respect to $\mathbf{x}$. Using (1.121) it then follows that the mutual information $\mathrm{I}[\mathbf{x}, \mathbf{y}] = 0$. Finally, using (1.120) we see that the mutual information is a form of KL divergence, and this vanishes only if the two distributions are equal, so that $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y})$ as required.
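The closed-form KL divergence derived in Solution 1.30, with $p(x) = \mathcal{N}(x|\mu, \sigma^2)$ and $q(x) = \mathcal{N}(x|m, s^2)$, can be compared against direct numerical integration. The sketch below is illustrative (not from the original manual), with arbitrary example parameters.

```python
# Check of Solution 1.30: KL(p||q) between two univariate Gaussians.
import numpy as np

mu, sigma = 0.5, 1.2   # parameters of p
m, s = -0.3, 2.0       # parameters of q

def normal(x, mean, std):
    return np.exp(-(x - mean) ** 2 / (2 * std ** 2)) / np.sqrt(2 * np.pi * std ** 2)

# Closed form from Solution 1.30.
kl_closed = 0.5 * (np.log(s ** 2 / sigma ** 2)
                   + (sigma ** 2 + mu ** 2 - 2 * mu * m + m ** 2) / s ** 2
                   - 1.0)

# Direct numerical integration of p(x) ln(p(x)/q(x)).
x = np.linspace(-20, 20, 400_001)
p, q = normal(x, mu, sigma), normal(x, m, s)
kl_numeric = np.sum(p * (np.log(p) - np.log(q))) * (x[1] - x[0])

print(kl_closed, kl_numeric)   # the two values agree to several decimal places
```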
1.32 When we make a change of variables, the probability density is transformed by the Jacobian of the change of variables. Thus we have

$$p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{y}}(\mathbf{y})\left|\frac{\partial y_i}{\partial x_j}\right| = p_{\mathbf{y}}(\mathbf{y})\,|\mathbf{A}|, \qquad (59)$$

where $|\cdot|$ denotes the determinant. Then the entropy of $\mathbf{y}$ can be written

$$\mathrm{H}(\mathbf{y}) = -\int p_{\mathbf{y}}(\mathbf{y})\ln p_{\mathbf{y}}(\mathbf{y})\, d\mathbf{y} = -\int p_{\mathbf{x}}(\mathbf{x})\ln\!\left(p_{\mathbf{x}}(\mathbf{x})\,|\mathbf{A}|^{-1}\right) d\mathbf{x} = \mathrm{H}(\mathbf{x}) + \ln|\mathbf{A}| \qquad (60)$$

as required.

1.33 The conditional entropy $\mathrm{H}(y|x)$ can be written

$$\mathrm{H}(y|x) = -\sum_i\sum_j p(y_i|x_j)\, p(x_j)\,\ln p(y_i|x_j), \qquad (61)$$

which equals 0 by assumption. Since the quantity $-p(y_i|x_j)\ln p(y_i|x_j)$ is non-negative, each of these terms must vanish for any value $x_j$ such that $p(x_j) \neq 0$. However, the quantity $p\ln p$ only vanishes for $p = 0$ or $p = 1$. Thus the quantities $p(y_i|x_j)$ are all either 0 or 1. However, they must also sum to 1, since this is a normalized probability distribution, and so precisely one of the $p(y_i|x_j)$ is 1 and the rest are 0. Thus, for each value $x_j$ there is a unique value $y_i$ with non-zero probability.

1.34 Obtaining the required functional derivative can be done simply by inspection. However, if a more formal approach is required we can proceed as follows, using the techniques set out in Appendix D. Consider first the functional

$$I[p(x)] = \int p(x) f(x)\, dx.$$

Under a small variation $p(x) \to p(x) + \epsilon\eta(x)$ we have

$$I[p(x) + \epsilon\eta(x)] = \int p(x) f(x)\, dx + \epsilon\int \eta(x) f(x)\, dx,$$

and hence from (D.3) we deduce that the functional derivative is given by $\delta I/\delta p(x) = f(x)$. Similarly, if we define

$$J[p(x)] = \int p(x)\ln p(x)\, dx,$$

then under a small variation $p(x) \to p(x) + \epsilon\eta(x)$ we have

$$J[p(x) + \epsilon\eta(x)] = \int p(x)\ln p(x)\, dx + \epsilon\left(\int \eta(x)\ln p(x)\, dx + \int p(x)\,\frac{\eta(x)}{p(x)}\, dx\right) + O(\epsilon^2),$$

and hence $\delta J/\delta p(x) = \ln p(x) + 1$. Using these two results we obtain the following result for the functional derivative:

$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2.$$

Re-arranging then gives (1.108). To eliminate the Lagrange multipliers we substitute (1.108) into each of the three constraints (1.105), (1.106) and (1.107) in turn. The solution is most easily obtained by comparison with the standard form of the Gaussian, and noting that the results

$$\lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2} \qquad (62\text{--}64)$$

indeed satisfy the three constraints. Note that there is a typographical error in the question, which should read "Use calculus of variations to show that the stationary point of the functional shown just before (1.108) is given by (1.108)". For the multivariate version of this derivation, see Exercise 2.14.

1.35 NOTE: In PRML, there is a minus sign ('$-$') missing on the l.h.s. of (1.103). Substituting the right-hand side of (1.109) into the argument of the logarithm on the right-hand side of (1.103), we obtain

$$\mathrm{H}[x] = -\int p(x)\ln p(x)\, dx = -\int p(x)\left(-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\right) dx = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\int p(x)(x-\mu)^2\, dx = \frac{1}{2}\left(\ln(2\pi\sigma^2) + 1\right),$$

where in the last step we used (1.107).

1.36 Consider (1.114) with $\lambda = 0.5$ and $b = a + 2\epsilon$ (and hence $a = b - 2\epsilon$):

$$0.5 f(a) + 0.5 f(b) > f(0.5a + 0.5b) = 0.5 f(0.5a + 0.5(a + 2\epsilon)) + 0.5 f(0.5(b - 2\epsilon) + 0.5b) = 0.5 f(a + \epsilon) + 0.5 f(b - \epsilon).$$

We can rewrite this as

$$f(b) - f(b - \epsilon) > f(a + \epsilon) - f(a).$$

We then divide both sides by $\epsilon$ and let $\epsilon \to 0$, giving $f'(b) > f'(a)$. Since this holds at all points, it follows that $f''(x) \geqslant 0$ everywhere.

To show the implication in the other direction, we make use of Taylor's theorem (with the remainder in Lagrange form), according to which there exists an $x^\star$ such that

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x^\star)(x - x_0)^2.$$

Since we assume that $f''(x) > 0$ everywhere, the third term on the r.h.s. will always be positive and therefore

$$f(x) > f(x_0) + f'(x_0)(x - x_0).$$

Now let $x_0 = \lambda a + (1 - \lambda) b$ and consider setting $x = a$, which gives

$$f(a) > f(x_0) + f'(x_0)(a - x_0) = f(x_0) + f'(x_0)\,(1 - \lambda)(a - b). \qquad (65)$$

Similarly, setting $x = b$ gives

$$f(b) > f(x_0) + f'(x_0)\,\lambda(b - a). \qquad (66)$$

Multiplying (65) by $\lambda$ and (66) by $1 - \lambda$ and adding up the results on both sides, we obtain

$$\lambda f(a) + (1 - \lambda) f(b) > f(x_0) = f(\lambda a + (1 - \lambda) b)$$

as required.
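Solution 1.32's result $\mathrm{H}(\mathbf{y}) = \mathrm{H}(\mathbf{x}) + \ln|\mathbf{A}|$ can be checked in closed form for a Gaussian, whose differential entropy is $\frac{1}{2}\ln\{(2\pi e)^D|\boldsymbol\Sigma|\}$ (the multivariate analogue of Solution 1.35; that formula is assumed here rather than derived in this chapter). The sketch below is illustrative, not part of the original manual.

```python
# Check of Solution 1.32 for a Gaussian x: H[y] = H[x] + ln|det A| under y = A x.
import numpy as np

rng = np.random.default_rng(2)
D = 3
A = rng.normal(size=(D, D))           # a random linear map (invertible with prob. 1)
Sigma_x = np.diag([0.5, 1.0, 2.0])    # covariance of x
Sigma_y = A @ Sigma_x @ A.T           # covariance of y = A x

def gaussian_entropy(Sigma):
    d = Sigma.shape[0]
    # 0.5 * ln{(2 pi e)^d |Sigma|}, using the log of the absolute determinant
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

H_x = gaussian_entropy(Sigma_x)
H_y = gaussian_entropy(Sigma_y)
log_abs_det_A = np.linalg.slogdet(A)[1]

print(H_y, H_x + log_abs_det_A)   # the two sides coincide up to floating-point error
```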
1.37 From (1.104), making use of (1.111), we have

$$\mathrm{H}[\mathbf{x}, \mathbf{y}] = -\iint p(\mathbf{x}, \mathbf{y})\ln p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y}
= -\iint p(\mathbf{x}, \mathbf{y})\ln\!\left(p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})\right) d\mathbf{x}\, d\mathbf{y}
= -\iint p(\mathbf{x}, \mathbf{y})\left(\ln p(\mathbf{y}|\mathbf{x}) + \ln p(\mathbf{x})\right) d\mathbf{x}\, d\mathbf{y}
= -\iint p(\mathbf{x}, \mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} - \int p(\mathbf{x})\ln p(\mathbf{x})\, d\mathbf{x}
= \mathrm{H}[\mathbf{y}|\mathbf{x}] + \mathrm{H}[\mathbf{x}].$$

1.38 From (1.114) we know that the result (1.115) holds for $M = 2$. We now suppose that it holds for some general value $M$ and show that it must therefore hold for $M + 1$. Consider the left-hand side of (1.115):

$$f\!\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) = f\!\left(\lambda_{M+1} x_{M+1} + \sum_{i=1}^{M}\lambda_i x_i\right) \qquad (67)$$
$$= f\!\left(\lambda_{M+1} x_{M+1} + (1 - \lambda_{M+1})\sum_{i=1}^{M}\eta_i x_i\right), \qquad (68)$$

where we have defined

$$\eta_i = \frac{\lambda_i}{1 - \lambda_{M+1}}. \qquad (69)$$

We now apply (1.114) to give

$$f\!\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) \leqslant \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1})\, f\!\left(\sum_{i=1}^{M}\eta_i x_i\right). \qquad (70)$$

We now note that the quantities $\lambda_i$ by definition satisfy

$$\sum_{i=1}^{M+1}\lambda_i = 1 \qquad (71)$$

and hence we have

$$\sum_{i=1}^{M}\lambda_i = 1 - \lambda_{M+1}. \qquad (72)$$

Then using (69) we see that the quantities $\eta_i$ satisfy the property

$$\sum_{i=1}^{M}\eta_i = \frac{1}{1 - \lambda_{M+1}}\sum_{i=1}^{M}\lambda_i = 1. \qquad (73)$$

Thus we can apply the result (1.115) at order $M$ and so (70) becomes

$$f\!\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) \leqslant \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1})\sum_{i=1}^{M}\eta_i f(x_i) = \sum_{i=1}^{M+1}\lambda_i f(x_i), \qquad (74)$$

where we have made use of (69).

1.39 From Table 1.3 we obtain the marginal probabilities by summation and the conditional probabilities by normalization, to give

$$p(x{=}0) = \tfrac{2}{3}, \quad p(x{=}1) = \tfrac{1}{3}; \qquad p(y{=}0) = \tfrac{1}{3}, \quad p(y{=}1) = \tfrac{2}{3};$$
$$p(y{=}0|x{=}0) = \tfrac{1}{2}, \quad p(y{=}1|x{=}0) = \tfrac{1}{2}, \quad p(y{=}0|x{=}1) = 0, \quad p(y{=}1|x{=}1) = 1;$$
$$p(x{=}0|y{=}0) = 1, \quad p(x{=}1|y{=}0) = 0, \quad p(x{=}0|y{=}1) = \tfrac{1}{2}, \quad p(x{=}1|y{=}1) = \tfrac{1}{2}.$$

From these tables, together with the definitions

$$\mathrm{H}(x) = -\sum_i p(x_i)\ln p(x_i), \qquad (75)$$
$$\mathrm{H}(x|y) = -\sum_i\sum_j p(x_i, y_j)\ln p(x_i|y_j), \qquad (76)$$

and similar definitions for $\mathrm{H}(y)$ and $\mathrm{H}(y|x)$, we obtain the following results:

(a) $\mathrm{H}(x) = \ln 3 - \frac{2}{3}\ln 2$
(b) $\mathrm{H}(y) = \ln 3 - \frac{2}{3}\ln 2$
(c) $\mathrm{H}(y|x) = \frac{2}{3}\ln 2$
(d) $\mathrm{H}(x|y) = \frac{2}{3}\ln 2$
(e) $\mathrm{H}(x, y) = \ln 3$
(f) $\mathrm{I}(x; y) = \ln 3 - \frac{4}{3}\ln 2$

where we have used (1.121) to evaluate the mutual information. The corresponding diagram is shown in Figure 2. [Figure 2. Diagram showing the relationship between the marginal entropies $\mathrm{H}[x]$ and $\mathrm{H}[y]$, the conditional entropies $\mathrm{H}[x|y]$ and $\mathrm{H}[y|x]$, the joint entropy $\mathrm{H}[x, y]$ and the mutual information $\mathrm{I}[x, y]$.]

1.40 The arithmetic and geometric means are defined as

$$\bar{x}_{\mathrm A} = \frac{1}{K}\sum_{k=1}^{K} x_k \qquad\text{and}\qquad \bar{x}_{\mathrm G} = \left(\prod_{k=1}^{K} x_k\right)^{1/K},$$

respectively. Taking the logarithm of $\bar{x}_{\mathrm A}$ and $\bar{x}_{\mathrm G}$, we see that

$$\ln\bar{x}_{\mathrm A} = \ln\!\left(\frac{1}{K}\sum_{k=1}^{K} x_k\right) \qquad\text{and}\qquad \ln\bar{x}_{\mathrm G} = \frac{1}{K}\sum_{k=1}^{K}\ln x_k.$$

By matching $f$ with $\ln$ and $\lambda_i$ with $1/K$ in (1.115), taking into account that the logarithm is concave rather than convex and the inequality therefore goes the other way, we obtain the desired result.
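The numbers in Solution 1.39 follow directly from the joint distribution of Table 1.3. The sketch below is illustrative (not from the original manual); the joint table used is the one implied by the marginals and conditionals quoted above, and the six printed quantities can be compared with (a)–(f).

```python
# Recomputation of the entropies and mutual information in Solution 1.39.
import numpy as np

# Joint distribution consistent with the marginals/conditionals quoted above
# (rows: x = 0, 1; columns: y = 0, 1).
P = np.array([[1/3, 1/3],
              [0.0, 1/3]])

def H(probs):
    p = probs[probs > 0]                 # take 0 ln 0 = 0
    return float(-(p * np.log(p)).sum())

px, py = P.sum(axis=1), P.sum(axis=0)
H_x, H_y, H_xy = H(px), H(py), H(P)
H_y_given_x = H_xy - H_x                 # via (1.112)
H_x_given_y = H_xy - H_y
I_xy = H_x + H_y - H_xy                  # mutual information, consistent with (1.121)

print(H_x, H_y)                 # both ln 3 - (2/3) ln 2 ~= 0.6365
print(H_y_given_x, H_x_given_y) # both (2/3) ln 2        ~= 0.4621
print(H_xy, I_xy)               # ln 3 ~= 1.0986, ln 3 - (4/3) ln 2 ~= 0.1744
```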