Algorithms for Non-negative Matrix Factorization

Daniel D. Lee
Bell Laboratories
Lucent Technologies
Murray Hill, NJ 07974

H. Sebastian Seung
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02138
Abstract

Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
1 Introduction
Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3].

We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse, combinations to generate expressive reconstructions [6, 7]. In this paper, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data.
2 Non-negative matrix factorization
We formally consider algorithms for solving the following problem:

Non-negative matrix factorization (NMF) Given a non-negative matrix $V$, find non-negative matrix factors $W$ and $H$ such that
$$V \approx WH \qquad (1)$$

NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate $n$-dimensional data vectors, the vectors are placed in the columns of an $n \times m$ matrix $V$, where $m$ is the number of examples in the data set. This matrix is then approximately factorized into an $n \times r$ matrix $W$ and an $r \times m$ matrix $H$. Usually $r$ is chosen to be smaller than $n$ or $m$, so that $W$ and $H$ are smaller than the original matrix $V$. This results in a compressed version of the original data matrix.
What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as $v \approx Wh$, where $v$ and $h$ are the corresponding columns of $V$ and $H$. In other words, each data vector $v$ is approximated by a linear combination of the columns of $W$, weighted by the components of $h$. Therefore $W$ can be regarded as containing a basis that is optimized for the linear approximation of the data in $V$. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data.
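To make this column-by-column reading of Eq. (1) concrete, here is a minimal NumPy sketch; the dimensions, random data, and variable names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, r = 100, 500, 10     # data dimension, number of examples, chosen rank (illustrative)
V = rng.random((n, m))     # non-negative data matrix, one example per column
W = rng.random((n, r))     # basis matrix: r non-negative basis vectors as columns
H = rng.random((r, m))     # encoding matrix: one coefficient vector per example

# Column-by-column view of Eq. (1): each data vector v is approximated by a
# linear combination of the columns of W, weighted by the entries of h.
mu = 0
v_approx = W @ H[:, mu]
assert np.allclose(v_approx, (W @ H)[:, mu])

# Compression: the factors store r*(n+m) numbers instead of the n*m entries of V.
print("entries in V:", n * m, "entries in W and H combined:", r * (n + m))
```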
This paper is not about applications of NMF, but instead focuses on the technical aspects of finding non-negative matrix factorizations. Of course, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the nonnegativity constraint makes much of this previous work inapplicable to the present case [8].

Here we discuss two algorithms for NMF based on iterative updates of $W$ and $H$. Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications. Other algorithms may possibly be more efficient in overall computation time, but are more difficult to implement and may not generalize to different cost functions. Algorithms similar to ours, where only one of the factors is adapted, have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12].

At each iteration of our algorithms, the new value of $W$ or $H$ is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules. In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization.
3 Cost functions

To find an approximate factorization $V \approx WH$, we first need to define cost functions that quantify the quality of the approximation. Such a cost function can be constructed using some measure of distance between two non-negative matrices $A$ and $B$. One useful measure is simply the square of the Euclidean distance between $A$ and $B$ [13],
$$||A - B||^2 = \sum_{ij} (A_{ij} - B_{ij})^2 \qquad (2)$$
This is lower bounded by zero, and clearly vanishes if and only if $A = B$.

Another useful measure is
$$D(A||B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right) \qquad (3)$$
Like the Euclidean distance this is also lower bounded by zero, and vanishes if and only if $A = B$. But it cannot be called a "distance", because it is not symmetric in $A$ and $B$, so we will refer to it as the "divergence" of $A$ from $B$. It reduces to the Kullback-Leibler divergence, or relative entropy, when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that $A$ and $B$ can be regarded as normalized probability distributions.

We now consider two alternative formulations of NMF as optimization problems:
Problem 1 Minimize $||V - WH||^2$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.

Problem 2 Minimize $D(V||WH)$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.
Although the functions $||V - WH||^2$ and $D(V||WH)$ are convex in $W$ only or $H$ only, they are not convex in both variables together. Therefore it is unrealistic to expect an algorithm to solve Problems 1 and 2 in the sense of finding global minima. However, there are many techniques from numerical optimization that can be applied to find local minima.

Gradient descent is perhaps the simplest technique to implement, but convergence can be slow. Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but are more complicated to implement than gradient descent [8]. Gradient-based methods also have the disadvantage of being sensitive to the choice of step size, which is inconvenient for large applications.
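For reference, the two cost functions above can be written down directly in NumPy. The following is a minimal sketch; the function names and the small epsilon guard against log(0) are our own additions, not part of Eqs. (2) and (3).

```python
import numpy as np

def squared_euclidean(A, B):
    """Squared Euclidean distance ||A - B||^2 of Eq. (2)."""
    return np.sum((A - B) ** 2)

def divergence(A, B, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(A||B) of Eq. (3).

    eps guards against log(0) and division by zero for vanishing
    entries; it is a numerical convenience only.
    """
    return np.sum(A * np.log((A + eps) / (B + eps)) - A + B)
```

Both functions are lower bounded by zero and vanish exactly when the two matrices agree entrywise (up to the eps smoothing).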
4 Multiplicative update rules
We have found that the following "multiplicative update rules" are a good compromise between speed and ease of implementation for solving Problems 1 and 2.
Theorem 1 The Euclidean distance $||V - WH||$ is nonincreasing under the update rules
$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \qquad (4)$$
The Euclidean distance is invariant under these updates if and only if $W$ and $H$ are at a stationary point of the distance.
Theorem 2 The divergence $D(V||WH)$ is nonincreasing under the update rules
$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}} \qquad (5)$$
The divergence is invariant under these updates if and only if $W$ and $H$ are at a stationary point of the divergence.
Proofs of these theorems are given in a later section. For now, we note that each update consists of multiplication by a factor. In particular, it is straightforward to see that this multiplicative factor is unity when $V = WH$, so that perfect reconstruction is necessarily a fixed point of the update rules.
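The update rules of Eqs. (4) and (5) translate almost line for line into NumPy. The sketch below is one possible implementation, not the authors' code; the function names, random initialization, iteration count, and epsilon guard against division by zero are illustrative assumptions.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates of Eq. (4), reducing ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W, H = rng.random((n, r)), rng.random((r, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_divergence(V, r, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates of Eq. (5), reducing D(V || WH)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W, H = rng.random((n, r)), rng.random((r, m))
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0, keepdims=True).T + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H
```

Note that the updates automatically preserve nonnegativity, since each factor is multiplied by a ratio of nonnegative quantities.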
5 Multiplicative versus additive update rules
It is useful to contrast these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for $H$ that reduces the squared distance can be written as
$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ (W^T V)_{a\mu} - (W^T W H)_{a\mu} \right] \qquad (6)$$
If the $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. As long as this number is sufficiently small, the update should reduce $||V - WH||$.

Now if we diagonally rescale the variables and set
$$\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}} \qquad (7)$$
then we obtain the update rule for $H$ that is given in Theorem 1. Note that this rescaling results in a multiplicative factor with the positive component of the gradient in the denominator and the absolute value of the negative component in the numerator of the factor.
For the divergence, diagonally rescaled gradient descent takes the form
$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ \sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia} \right] \qquad (8)$$
Again, if the $\eta_{a\mu}$ are small and positive, this update should reduce $D(V||WH)$. If we now set
$$\eta_{a\mu} = \frac{H_{a\mu}}{\sum_i W_{ia}} \qquad (9)$$
then we obtain the update rule for $H$ that is given in Theorem 2. This rescaling can also be interpreted as a multiplicative rule with the positive component of the gradient in the denominator and the negative component as the numerator of the multiplicative factor.

Since our choices for $\eta_{a\mu}$ are not small, it may seem that there is no guarantee that such a rescaled gradient descent should cause the cost function to decrease. Surprisingly, this is indeed the case, as shown in the next section.
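The algebra behind Eq. (7) can be checked numerically: with that choice of step size, the additive update of Eq. (6) collapses to the multiplicative rule of Eq. (4). The random matrices below are hypothetical test data, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 20, 30, 5
V = rng.random((n, m))
W = rng.random((n, r))
H = rng.random((r, m))

eta = H / (W.T @ W @ H)                              # step size of Eq. (7)
H_additive = H + eta * ((W.T @ V) - (W.T @ W @ H))   # additive update of Eq. (6)
H_multiplicative = H * (W.T @ V) / (W.T @ W @ H)     # multiplicative update of Eq. (4)

assert np.allclose(H_additive, H_multiplicative)
```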
6 Proofs of convergence
To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16].
Definition 1 $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions
$$G(h, h') \ge F(h), \qquad G(h, h) = F(h) \qquad (10)$$
are both satisfied.
The auxiliary function is a useful concept because of the following lemma, which is also graphically illustrated in Figure 1.
Lemma 1 If $G$ is an auxiliary function, then $F$ is nonincreasing under the update
$$h^{t+1} = \arg\min_h G(h, h^t) \qquad (11)$$
Proof: $F(h^{t+1}) \le G(h^{t+1}, h^t) \le G(h^t, h^t) = F(h^t)$.

Note that $F(h^{t+1}) = F(h^t)$ only if $h^t$ is a local minimum of $G(h, h^t)$. If the derivatives of $F$ exist and are continuous in a small neighborhood of $h^t$, this also implies that the derivatives $\nabla F(h^t) = 0$. Thus, by iterating the update in Eq. (11), we obtain a sequence of estimates that converge to a local minimum $h_{\min} = \arg\min_h F(h)$ of the objective function:
$$F(h_{\min}) \le \cdots \le F(h^{t+1}) \le F(h^t) \le \cdots \le F(h^2) \le F(h^1) \le F(h^0) \qquad (12)$$
We will show that by defining appropriate auxiliary functions $G(h, h^t)$ for both $||V - WH||$ and $D(V||WH)$, the update rules of Theorems 1 and 2 easily follow from Eq. (11).
Figure 1: Minimizing the auxiliary function $G(h, h^t) \ge F(h)$ guarantees that $F(h^{t+1}) \le F(h^t)$ for $h^{t+1} = \arg\min_h G(h, h^t)$.
Lemma 2 If $K(h^t)$ is the diagonal matrix
$$K_{ab}(h^t) = \delta_{ab} \, (W^T W h^t)_a / h^t_a \qquad (13)$$
then
$$G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T K(h^t) (h - h^t) \qquad (14)$$
is an auxiliary function for
$$F(h) = \frac{1}{2} \sum_i \left( v_i - \sum_a W_{ia} h_a \right)^2 \qquad (15)$$
Proof: Since $G(h, h) = F(h)$ is obvious, we need only show that $G(h, h^t) \ge F(h)$. To do this, we compare
$$F(h) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T (W^T W) (h - h^t) \qquad (16)$$
with Eq. (14) to find that $G(h, h^t) \ge F(h)$ is equivalent to
$$0 \le (h - h^t)^T \left[ K(h^t) - W^T W \right] (h - h^t) \qquad (17)$$
To prove positive semidefiniteness, consider the matrix
$$M_{ab}(h^t) = h^t_a \left( K(h^t) - W^T W \right)_{ab} h^t_b \qquad (18)$$
which is just a rescaling of the components of $K - W^T W$. Then $K - W^T W$ is positive semidefinite if and only if $M$ is, and
$$\nu^T M \nu = \sum_{ab} \nu_a M_{ab} \nu_b \qquad (19)$$
$$= \sum_{ab} h^t_a (W^T W)_{ab} h^t_b \, \nu_a^2 - \nu_a h^t_a (W^T W)_{ab} h^t_b \nu_b \qquad (20)$$
$$= \sum_{ab} (W^T W)_{ab} h^t_a h^t_b \left[ \frac{1}{2} \nu_a^2 + \frac{1}{2} \nu_b^2 - \nu_a \nu_b \right] \qquad (21)$$
$$= \frac{1}{2} \sum_{ab} (W^T W)_{ab} h^t_a h^t_b \, (\nu_a - \nu_b)^2 \qquad (22)$$
$$\ge 0 \qquad (23)$$
One can also show that $K - W^T W$ is positive semidefinite by considering the matrix $K^{-1/2} (K - W^T W) K^{-1/2}$. The rescaled matrix $K^{-1/2} W^T W K^{-1/2}$ has a positive eigenvector with unity eigenvalue, and application of the Frobenius-Perron theorem then shows that Eq. (17) holds.

Now we can demonstrate the convergence of Theorem 1:
Proof of Theorem 1: Replacing $G(h, h^t)$ in Eq. (11) by Eq. (14) results in the update rule
$$h^{t+1} = h^t - K(h^t)^{-1} \nabla F(h^t) \qquad (24)$$
Since Eq. (14) is an auxiliary function, $F$ is nonincreasing under this update rule, according to Lemma 1. Writing the components of this equation explicitly, we obtain
$$h^{t+1}_a = h^t_a \, \frac{(W^T v)_a}{(W^T W h^t)_a} \qquad (25)$$
By reversing the roles of $W$ and $H$ in Lemmas 1 and 2, $F$ can similarly be shown to be nonincreasing under the update rule for $W$.
We now consider the following auxiliary function for the divergence cost function:
Lemma 3 Define
$$G(h, h^t) = \sum_i (v_i \log v_i - v_i) + \sum_{ia} W_{ia} h_a \qquad (26)$$
$$\qquad - \sum_{ia} v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right) \qquad (27)$$
This is an auxiliary function for
$$F(h) = \sum_i v_i \log \frac{v_i}{\sum_a W_{ia} h_a} - v_i + \sum_a W_{ia} h_a \qquad (28)$$
Proof: It is straightforward to verify that $G(h, h) = F(h)$. To show that $G(h, h^t) \ge F(h)$, we use the convexity of the log function to derive the inequality
$$-\log \sum_a W_{ia} h_a \le -\sum_a \alpha_a \log \frac{W_{ia} h_a}{\alpha_a} \qquad (29)$$
which holds for all nonnegative $\alpha_a$ that sum to unity. Setting
$$\alpha_a = \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \qquad (30)$$
we obtain
$$-\log \sum_a W_{ia} h_a \le -\sum_a \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right) \qquad (31)$$
From this inequality it follows that $G(h, h^t) \ge F(h)$.
Theorem 2 then follows from the application of Lemma 1:
Proof of Theorem 2: The minimum of $G(h, h^t)$ with respect to $h$ is found by setting the gradient to zero:
$$\frac{dG(h, h^t)}{dh_a} = -\sum_i v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \frac{1}{h_a} + \sum_i W_{ia} = 0 \qquad (32)$$
Thus, the update rule of Eq. (11) takes the form
$$h^{t+1}_a = \frac{h^t_a}{\sum_k W_{ka}} \sum_i \frac{v_i}{(W h^t)_i} W_{ia} \qquad (33)$$
Since $G$ is an auxiliary function, $F$ in Eq. (28) is nonincreasing under this update. Rewritten in matrix form, this is equivalent to the update rule for $H$ in Eq. (5). By reversing the roles of $H$ and $W$, the update rule for $W$ can similarly be shown to be nonincreasing.
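As an informal numerical sanity check of Theorems 1 and 2 (not a substitute for the proofs above), one can verify on random test data that each cost function is nonincreasing under its update rules; the data sizes, tolerances, and epsilon guard below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 30, 40, 5
V = rng.random((n, m))          # hypothetical non-negative test data
eps = 1e-12                     # guard against division by zero / log(0)

def euclid_cost(V, W, H):
    return np.sum((V - W @ H) ** 2)

def diverg_cost(V, W, H):
    WH = W @ H
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

# Theorem 1: the squared Euclidean distance is nonincreasing under Eq. (4)
W, H = rng.random((n, r)), rng.random((r, m))
prev = euclid_cost(V, W, H)
for _ in range(100):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    cost = euclid_cost(V, W, H)
    assert cost <= prev + 1e-9
    prev = cost

# Theorem 2: the divergence is nonincreasing under Eq. (5)
W, H = rng.random((n, r)), rng.random((r, m))
prev = diverg_cost(V, W, H)
for _ in range(100):
    H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0, keepdims=True).T + eps)
    W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    cost = diverg_cost(V, W, H)
    assert cost <= prev + 1e-9
    prev = cost
```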
7 Discussion
We have shown that application of the update rules in Eqs. (4) and (5) is guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence proofs rely upon defining an appropriate auxiliary function. We are currently working to generalize these theorems to more complex constraints. The update rules themselves are extremely easy to implement computationally, and will hopefully be utilized by others for a wide variety of applications.
We acknowledge the support of Bell Laboratories. We would also like to thank Carlos Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam Roweis, Larry Saul, and Margaret Wright for helpful discussions.
References
[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.
[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86.
[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad. Press.
[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings of the Conference on Neural Information Processing Systems 9, 515–521.
[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
[6] Field, DJ (1994). What is the goal of sensory coding? Neural Comput. 6, 559–601.
[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. The Handbook of Brain Theory and Neural Networks, 895–898. (MIT Press, Cambridge, MA).
[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical Recipes: The Art of Scientific Computing. (Cambridge University Press, Cambridge, England).
[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. MI-2, 113–122.
[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55–59.
[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. J. 74, 745–754.
[12] Bouman, CA & Sauer, K (1996). A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Proc. 5, 480–492.
[13] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analysis. Chemometr. Intell. Lab. 37, 23–35.
[14] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132, 1–64.
[15] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39, 1–38.
[16] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel (eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 81–89. ACL Press.