Part I
MATHEMATICAL PRELIMINARIES
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
2 Random Vectors and Independence
In this chapter, we review central concepts of probability theory, statistics, and random
processes. The emphasis is on multivariate statistics and random vectors. Matters
that will be needed later in this book are discussed in more detail, including, for
example, statistical independence and higher-order statistics. The reader is assumed
to have a basic knowledge of single-variable probability theory, so that fundamental
definitions such as probability, elementary events, and random variables are familiar.
Readers who already have a good knowledge of multivariate statistics can skip most
of this chapter. For those who need a more extensive review or more information on
advanced matters, many good textbooks ranging from elementary ones to advanced
treatments exist. A widely used textbook covering probability, random variables, and
stochastic processes is [353].
2.1 PROBABILITY DISTRIBUTIONS AND DENSITIES
2.1.1 Distribution of a random variable
In this book, we assume that random variables are continuous-valued unless stated
otherwise. The cumulative distribution function (cdf) F_x(x_0) of a random variable x at point x = x_0 is defined as the probability that x ≤ x_0:

F_x(x_0) = P(x ≤ x_0)   (2.1)

Allowing x_0 to change from -∞ to ∞ defines the whole cdf for all values of x.
Clearly, for continuous random variables the cdf is a nonnegative, nondecreasing (often monotonically increasing) continuous function whose values lie in the interval [0, 1].
Fig. 2.1 A gaussian probability density function with mean m and standard deviation σ.
From the definition, it also follows directly that F_x(-∞) = 0 and F_x(∞) = 1.
Usually a probability distribution is characterized in terms of its density function
rather than the cdf. Formally, the probability density function (pdf) p_x(x) of a continuous random variable x is obtained as the derivative of its cumulative distribution function:

p_x(x) = dF_x(x)/dx   (2.2)

In practice, the cdf is computed from the known pdf by using the inverse relationship

F_x(x_0) = ∫_{-∞}^{x_0} p_x(ξ) dξ   (2.3)
For simplicity, F_x(x) is often denoted by F(x), and p_x(x) by p(x), respectively. The subscript referring to the random variable in question must be used when confusion is possible.
Example 2.1 The gaussian (or normal) probability distribution is used in numerous
models and applications, for example to describe additive noise. Its density function
is given by

p_x(x) = (1/√(2πσ²)) exp( -(x - m)² / (2σ²) )   (2.4)
Here the parameter m (mean) determines the peak point of the symmetric density function, and σ (standard deviation) its effective width (flatness or sharpness of the peak). See Figure 2.1 for an illustration.
Generally, the cdf of the gaussian density cannot be evaluated in closed form using (2.3). The term 1/√(2πσ²) in front of the exponential in (2.4) is a normalizing factor that guarantees that the cdf becomes unity when x_0 → ∞. However, the values of the cdf can be computed numerically using, for example, tabulated values of the error function

erf(x) = (2/√π) ∫_0^x exp(-t²) dt   (2.5)

The error function is closely related to the cdf of a normalized gaussian density, for which the mean m = 0 and the variance σ² = 1. See [353] for details.
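In code, the connection between the error function and the gaussian cdf can be used directly. A minimal Python sketch using only the standard library; the identity F(x) = (1/2)[1 + erf((x - m)/(σ√2))] follows from a change of variables in (2.3):

```python
import math

def gaussian_pdf(x, m=0.0, sigma=1.0):
    """Density (2.4): (1/sqrt(2*pi*sigma^2)) * exp(-(x-m)^2 / (2*sigma^2))."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_cdf(x, m=0.0, sigma=1.0):
    """cdf evaluated via the error function, since (2.3) has no closed form here."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))
```

For instance, `gaussian_cdf(0.0)` returns 0.5, as it must for the symmetric zero-mean density.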
2.1.2 Distribution of a random vector
Assume now that x is an n-dimensional random vector

x = (x_1, x_2, ..., x_n)^T   (2.6)

where T denotes the transpose. (We take the transpose because all vectors in this book are column vectors. Note that vectors are denoted by boldface lowercase letters.) The components x_i of the column vector x are continuous random variables.
The concept of probability distribution generalizes easily to such a random vector.
In particular, the cumulative distribution function of x is defined by

F_x(x_0) = P(x ≤ x_0)   (2.7)

where P(·) again denotes the probability of the event in parentheses, and x_0 is some constant value of the random vector x. The notation x ≤ x_0 means that each component of the vector x is less than or equal to the respective component of the vector x_0. The multivariate cdf in Eq. (2.7) has similar properties to that of a single random variable. It is a nondecreasing function of each component, with values lying in the interval [0, 1]. When all the components of x_0 approach infinity, F_x(x_0) achieves its upper limit 1; when any component x_{0,i} → -∞, F_x(x_0) = 0.
The multivariate probability density function p_x(x) of x is defined as the derivative of the cumulative distribution function F_x(x) with respect to all components of the random vector x:

p_x(x) = ∂^n F_x(x) / (∂x_1 ∂x_2 ··· ∂x_n)   (2.8)

Hence

F_x(x_0) = ∫_{-∞}^{x_{0,n}} ··· ∫_{-∞}^{x_{0,1}} p_x(ξ) dξ_1 ··· dξ_n   (2.9)
where ξ_i is the ith component of the vector ξ. Clearly,

∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} p_x(ξ) dξ_1 ··· dξ_n = 1   (2.10)

This provides the appropriate normalization condition that a true multivariate probability density must satisfy.
In many cases, random variables have nonzero probability density functions only
on certain finite intervals. An illustrative example of such a case is presented below.
Example 2.2 Assume that the probability density function of a two-dimensional random vector x = (x, y)^T is nonzero only in the unit square 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and zero elsewhere.

Let us now compute the cumulative distribution function of x. It is obtained by integrating over both x and y, taking into account the limits of the regions where the density is nonzero. When either x < 0 or y < 0, the density and consequently also the cdf is zero. In the region where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, the cdf is given by the double integral of the density over [0, x] × [0, y]. In the region where 0 ≤ x ≤ 1 and y > 1, the upper limit in integrating over y becomes equal to 1, and the cdf is obtained by inserting y = 1 into the preceding expression. Similarly, in the region x > 1 and 0 ≤ y ≤ 1, the cdf is obtained by inserting x = 1 into the preceding formula. Finally, if both x > 1 and y > 1, the cdf becomes unity, showing that the probability density has been normalized correctly. Collecting these results yields the complete piecewise expression of the cdf.
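The region-by-region integration of Example 2.2 can be checked numerically. The sketch below assumes, purely for illustration, the density p(x, y) = x + y on the unit square (a hypothetical stand-in, chosen because it is a valid pdf there: it integrates to one); the cdf is approximated by a midpoint-rule Riemann sum:

```python
def cdf(x0, y0, n=200):
    """Riemann-sum approximation of F(x0, y0) = integral of p over [0, x0] x [0, y0],
    for the illustrative density p(x, y) = x + y on the unit square (an assumption,
    not the density of the book's example)."""
    # Clip to the support: the density is zero outside [0, 1] x [0, 1],
    # so integrating beyond the square adds nothing.
    x0, y0 = min(max(x0, 0.0), 1.0), min(max(y0, 0.0), 1.0)
    hx, hy = x0 / n, y0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            u, v = (i + 0.5) * hx, (j + 0.5) * hy   # midpoint of each cell
            total += (u + v) * hx * hy
    return total
```

For this particular density the analytic cdf on the square is F(x, y) = xy(x + y)/2, so for example `cdf(1.0, 1.0)` is 1 (correct normalization) and `cdf(0.5, 0.5)` is 0.125.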
2.1.3 Joint and marginal distributions
The joint distribution of two different random vectors can be handled in a similar manner. In particular, let y be another random vector having in general a dimension different from the dimension of x. The vectors x and y can be concatenated to a "supervector" z = (x^T, y^T)^T, and the preceding formulas used directly. The cdf that arises is called the joint distribution function of x and y, and is given by

F_{x,y}(x_0, y_0) = P(x ≤ x_0, y ≤ y_0)   (2.11)

Here x_0 and y_0 are some constant vectors having the same dimensions as x and y, respectively, and Eq. (2.11) defines the joint probability of the event x ≤ x_0 and y ≤ y_0.
The joint density function p_{x,y}(x, y) of x and y is again defined formally by differentiating the joint distribution function F_{x,y}(x, y) with respect to all components of the random vectors x and y. Hence, the relationship

F_{x,y}(x_0, y_0) = ∫_{-∞}^{x_0} ∫_{-∞}^{y_0} p_{x,y}(ξ, η) dη dξ   (2.12)

holds, and the value of this integral equals unity when both x_0 → ∞ and y_0 → ∞. The marginal densities p_x(x) of x and p_y(y) of y are obtained by integrating over the other random vector in their joint density p_{x,y}(x, y):

p_x(x) = ∫_{-∞}^{∞} p_{x,y}(x, η) dη   (2.13)

p_y(y) = ∫_{-∞}^{∞} p_{x,y}(ξ, y) dξ   (2.14)
Example 2.3 Consider the joint density given in Example 2.2. The marginal densities of the random variables x and y are obtained by integrating the joint density over the other variable as in (2.13) and (2.14); each marginal is nonzero only on the interval [0, 1] and zero elsewhere.
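Continuing with the illustrative density p(x, y) = x + y on the unit square (again an assumption, not the density of the book's example), the marginal of x follows from (2.13) by integrating out y; the analytic result is p_x(x) = x + 1/2 on [0, 1]:

```python
def marginal_x(x, n=1000):
    """Marginal p_x(x) = integral over y of p(x, y), computed by the midpoint rule,
    for the assumed illustrative density p(x, y) = x + y on the unit square."""
    if not 0.0 <= x <= 1.0:
        return 0.0          # zero outside the support
    h = 1.0 / n
    # Sum p(x, y_j) * h over midpoints y_j of [0, 1]; exact here since p is linear in y.
    return sum((x + (j + 0.5) * h) * h for j in range(n))
```

As a check, `marginal_x(0.3)` agrees with the analytic value 0.3 + 0.5 = 0.8.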
2.2 EXPECTATIONS AND MOMENTS
2.2.1 Definition and general properties
In practice, the exact probability density function of a vector or scalar valued random
variable is usually unknown. However, one can use instead expectations of some
functions of that random variable for performing useful analyses and processing. A
great advantage of expectations is that they can be estimated directly from the data,
even though they are formally defined in terms of the density function.
Let g(x) denote any quantity derived from the random vector x. The quantity g(x) may be either a scalar, vector, or even a matrix. The expectation of g(x) is denoted by E{g(x)}, and is defined by

E{g(x)} = ∫_{-∞}^{∞} g(x) p_x(x) dx   (2.15)

Here the integral is computed over all the components of x. The integration operation is applied separately to every component of the vector or element of the matrix, yielding as a result another vector or matrix of the same size. If g(x) = x, we get the expectation E{x} of x; this is discussed in more detail in the next subsection.
Expectations have some important fundamental properties.
1. Linearity. Let x_1, x_2, ..., x_m be a set of different random vectors, and a_1, a_2, ..., a_m some nonrandom scalar coefficients. Then

E{a_1 x_1 + a_2 x_2 + ··· + a_m x_m} = a_1 E{x_1} + a_2 E{x_2} + ··· + a_m E{x_m}   (2.16)

2. Linear transformation. Let x be an n-dimensional random vector, and A and B some nonrandom m × n and n × m matrices, respectively. Then

E{Ax} = A E{x},   E{x^T B} = E{x^T} B   (2.17)

3. Transformation invariance. Let y = g(x) be a vector-valued function of the random vector x. Then

∫_{-∞}^{∞} y p_y(y) dy = ∫_{-∞}^{∞} g(x) p_x(x) dx   (2.18)

Thus E{y} = E{g(x)}, even though the integrations are carried out over different probability density functions.
These properties can be proved using the definition of the expectation operator and properties of probability density functions. They are important and very helpful in practice, allowing expressions containing expectations to be simplified without actually needing to compute any integrals (except possibly in the final phase).
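These properties are also easy to verify empirically, since expectations can be estimated by sample averages. A small sketch of property 1 (linearity) on simulated samples; the distributions and coefficients below are arbitrary illustrations:

```python
import random

random.seed(0)
N = 200_000
# Samples of two scalar random variables: x ~ N(1, 4), y ~ Uniform(0, 4).
xs = [random.gauss(1.0, 2.0) for _ in range(N)]
ys = [random.uniform(0.0, 4.0) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

# Linearity (2.16): E{a*x + b*y} = a*E{x} + b*E{y}, estimated from the samples.
a, b = 3.0, -2.0
lhs = mean([a * x + b * y for x, y in zip(xs, ys)])
rhs = a * mean(xs) + b * mean(ys)
```

Here `lhs` and `rhs` agree up to floating-point rounding, and both approximate the true value a·1 + b·2 = -1 with an error that shrinks as N grows.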
2.2.2 Mean vector and correlation matrix
Moments of a random vector x are typical expectations used to characterize it. They are obtained when g(x) consists of products of components of x. In particular, the first moment of a random vector x is called the mean vector m_x of x. It is defined as the expectation of x:

m_x = E{x}   (2.19)

Each component m_i of the n-vector m_x is given by

m_i = E{x_i} = ∫_{-∞}^{∞} x_i p_{x_i}(x_i) dx_i   (2.20)

where p_{x_i}(x_i) is the marginal density of the ith component x_i of x. This is because integrals over all the other components of x reduce to unity due to the definition of the marginal density.
Another important set of moments consists of correlations between pairs of components of x. The correlation r_ij between the ith and jth component of x is given by the second moment

r_ij = E{x_i x_j}   (2.21)

Note that correlation can be negative or positive. The correlation matrix

R_x = E{x x^T}   (2.22)

of the vector x represents in a convenient form all its correlations, r_ij being the element in row i and column j of R_x.
The correlation matrix has some important properties:
1. It is a symmetric matrix: R_x^T = R_x.

2. It is positive semidefinite:

a^T R_x a ≥ 0   (2.23)

for all n-vectors a. Usually in practice R_x is positive definite, meaning that for any nonzero vector a, (2.23) holds as a strict inequality.

3. All the eigenvalues of R_x are real and nonnegative (positive if R_x is positive definite). Furthermore, all the eigenvectors of R_x are real, and can always be chosen so that they are mutually orthonormal.
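A quick numerical check of these three properties, using a sample estimate of R_x computed from linearly mixed gaussian data (the mixing matrix below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# 5000 samples of a 3-dimensional random vector x = M^T * (gaussian noise).
M = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.7]])
X = rng.standard_normal((5000, 3)) @ M

R = (X.T @ X) / X.shape[0]          # sample estimate of R_x = E{x x^T}

# Property 1: symmetry, R_x^T = R_x.
assert np.allclose(R, R.T)
# Property 2: positive semidefiniteness, a^T R_x a >= 0 for any a.
a = rng.standard_normal(3)
assert a @ R @ a >= 0.0
# Property 3: real, nonnegative eigenvalues; orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(R)
assert np.all(eigvals >= 0) and np.allclose(eigvecs.T @ eigvecs, np.eye(3))
```

`numpy.linalg.eigh` exploits the symmetry of R, returning real eigenvalues and an orthonormal set of eigenvectors, exactly as property 3 promises.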
Higher-order moments can be defined analogously, but their discussion is postponed to Section 2.7. Instead, we shall first consider the corresponding central moments, and second-order moments for two different random vectors.
2.2.3 Covariances and joint moments
Central moments are defined in a similar fashion to usual moments, but the mean vectors of the random vectors involved are subtracted prior to computing the expectation. Clearly, central moments are only meaningful above the first order. The quantity corresponding to the correlation matrix R_x is called the covariance matrix C_x of x, and is given by

C_x = E{(x - m_x)(x - m_x)^T}   (2.24)

The elements

c_ij = E{(x_i - m_i)(x_j - m_j)}   (2.25)

of the n × n matrix C_x are called covariances, and they are the central moments corresponding to the correlations r_ij defined in Eq. (2.21).¹
The covariance matrix C_x satisfies the same properties as the correlation matrix R_x. Using the properties of the expectation operator, it is easy to see that

C_x = R_x - m_x m_x^T   (2.26)

If the mean vector m_x = 0, the correlation and covariance matrices become the same. If necessary, the data can easily be made zero mean by subtracting the (estimated) mean vector from the data vectors as a preprocessing step. This is a usual practice in independent component analysis, and thus in later chapters, we simply denote by C_x the correlation/covariance matrix, often even dropping the subscript x for simplicity.

For a single random variable x, the mean vector reduces to its mean value m_x = E{x}, the correlation matrix to the second moment E{x²}, and the covariance matrix to the variance of x:

σ_x² = E{(x - m_x)²}   (2.27)

The relationship (2.26) then takes the simple form σ_x² = E{x²} - m_x².
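Relationship (2.26) also holds exactly for sample estimates, provided the mean, correlation, and covariance estimates all use the same divisor n; this gives a convenient sanity check:

```python
import numpy as np

rng = np.random.default_rng(2)
# 100000 samples of a 2-dimensional gaussian vector with nonzero mean (1, -3).
X = rng.standard_normal((100_000, 2)) + np.array([1.0, -3.0])

m = X.mean(axis=0)                       # estimate of m_x
R = (X.T @ X) / X.shape[0]               # estimate of R_x = E{x x^T}
C = (X - m).T @ (X - m) / X.shape[0]     # estimate of C_x, divisor n

# (2.26): C_x = R_x - m_x m_x^T, an algebraic identity for these estimates.
assert np.allclose(C, R - np.outer(m, m), atol=1e-8)
```

Note the divisor n (not the unbiased n - 1); with n - 1 the identity (2.26) would only hold approximately for finite samples.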
The expectation operation can be extended for functions g(x, y) of two different random vectors x and y in terms of their joint density:

E{g(x, y)} = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) p_{x,y}(x, y) dx dy   (2.28)

The integrals are computed over all the components of x and y. Of the joint expectations, the most widely used are the cross-correlation matrix

R_xy = E{x y^T}   (2.29)
¹ In classic statistics, the correlation coefficients ρ_ij = c_ij / (σ_{x_i} σ_{x_j}) are used, and the matrix consisting of them is called the correlation matrix. In this book, the correlation matrix is defined by the formula (2.22), which is a common practice in signal processing, neural networks, and engineering.
Fig. 2.2 An example of negative covariance between the random variables x and y.

Fig. 2.3 An example of zero covariance between the random variables x and y.
and the cross-covariance matrix

C_xy = E{(x - m_x)(y - m_y)^T}   (2.30)

Note that the dimensions of the vectors x and y can be different. Hence, the cross-correlation and cross-covariance matrices are not necessarily square matrices, and they are not symmetric in general. However, from their definitions it follows easily that

R_xy = R_yx^T,   C_xy = C_yx^T   (2.31)
If the mean vectors of x and y are zero, the cross-correlation and cross-covariance matrices become the same. The covariance matrix C_z of the sum z = x + y of two random vectors x and y having the same dimension is often needed in practice. It is easy to see that

C_z = C_x + C_xy + C_yx + C_y   (2.32)
Correlations and covariances measure the dependence between the random vari-
ables using their second-order statistics. This is illustrated by the following example.
Example 2.4 Consider the two different joint distributions of the zero mean scalar random variables x and y shown in Figs. 2.2 and 2.3. In Fig. 2.2, x and y have a clear negative covariance (or correlation). A positive value of x mostly implies that y is negative, and vice versa. On the other hand, in the case of Fig. 2.3, it is not possible to infer anything about the value of y by observing x. Hence, their covariance c_xy = 0.
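The two situations of Figs. 2.2 and 2.3 are easy to reproduce by simulation; since both pairs below are zero mean, the sample mean of the product x·y estimates the covariance (the coefficients are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Fig. 2.2 situation: negatively correlated pair; y1 contains a -0.8*x1 component
# (the 0.6 factor keeps the variance of y1 equal to 0.64 + 0.36 = 1).
x1 = rng.standard_normal(n)
y1 = -0.8 * x1 + 0.6 * rng.standard_normal(n)

# Fig. 2.3 situation: independent pair, hence zero covariance.
x2 = rng.standard_normal(n)
y2 = rng.standard_normal(n)

c_neg = np.mean(x1 * y1)    # estimates c_xy = -0.8
c_zero = np.mean(x2 * y2)   # estimates c_xy = 0
```

As the sample size grows, `c_neg` converges to -0.8 and `c_zero` to 0, matching the visual impression of the two scatter plots.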
[...]

Each component of x is perfectly correlated with itself. The best that we can achieve is that different components of x are mutually uncorrelated, leading to the uncorrelatedness condition

C_x = E{(x - m_x)(x - m_x)^T} = D   (2.41)

Here D is an n × n diagonal matrix

D = diag(c_11, c_22, ..., c_nn) = diag(σ²_{x_1}, σ²_{x_2}, ..., σ²_{x_n})   (2.42)

whose n diagonal elements are the variances σ²_{x_i} = E{(x_i - m_{x_i})²} = c_ii of the components of x. Whitening of the original data can be made in infinitely many ways; it will be discussed in more detail in Chapter 6, because it is a highly useful and widely used preprocessing step in independent component analysis. There also exist infinitely many ways to decorrelate the original data, because whiteness is a special case of the uncorrelatedness property.

[...]

Often the components of the noise vector n are all uncorrelated and have equal variance σ², so that its correlation matrix is R_n = σ²I (2.51). Sometimes, for example in a noisy version of the ICA model (Chapter 15), the components of the signal vector are also mutually uncorrelated, so that the signal correlation matrix becomes the diagonal matrix D_s = diag(E{s_1²}, E{s_2²}, ..., E{s_m²}).

[...]

2.3.2 Statistical independence

A key concept that constitutes the foundation of independent component analysis is statistical independence. For simplicity, consider first the case of two different scalar random variables x and y. The random variable x is independent of y if knowing the value of y does not give any information on the value of x.

[...]

An important property of the gaussian pdf is that linear processing methods based on first- and second-order statistical information are usually optimal for gaussian data. For example, independent component analysis does not bring out anything new compared with standard principal component analysis (to be discussed later) for gaussian data. If y = Ax is a linear transformation of a gaussian random vector x, then y is gaussian with covariance matrix C_y = A C_x A^T; a special case of this result says that any linear combination of gaussian random variables is itself gaussian. This result again has implications in standard independent component analysis: it is impossible to estimate the ICA model for gaussian data, that is, one cannot blindly separate gaussian sources from their mixtures.

[...]

For independent and identically distributed random vectors z_i having a common mean m_z and covariance matrix C_z, the limiting distribution (as k → ∞) of the random vector

y_k = (1/√k) Σ_{i=1}^{k} (z_i - m_z)   (2.78)

is multivariate gaussian with zero mean and covariance matrix C_z. This central limit theorem has important consequences in independent component analysis and blind source separation.

[...]

Independent component analysis and blind source separation require the use of higher-order statistics, either directly or indirectly via nonlinearities.