
Algorithms for Non-negative Matrix Factorization

Daniel D. Lee

Bell Laboratories

Lucent Technologies

Murray Hill, New Jersey 07974

H. Sebastian Seung

Dept. of Brain and Cog. Sci.

Massachusetts Institute of Technology (MIT)

Cambridge, MA 02138

Abstract

Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error, while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

1 Introduction

Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. Vector quantization, on the other hand, uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3].

We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse, combinations to generate expressiveness in the reconstructions [6, 7]. In this paper, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data.

2 Non-negative matrix factorization

We formally consider algorithms for solving the following problem:

Non-negative matrix factorization (NMF): Given a non-negative matrix V, find non-negative matrix factors W and H such that

$V \approx WH$    (1)

NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V, where m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix W and an r x m matrix H. Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. This results in a compressed version of the original data matrix.

What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as $v \approx Wh$, where v and h are the corresponding columns of V and H. In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can be regarded as containing a basis that is optimized for the linear approximation of the data in V. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data.
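The following NumPy sketch illustrates the dimensions involved in Eq. (1) and the column-by-column reading of the approximation; the matrix sizes and random entries are arbitrary choices for illustration, not values taken from the paper.

```python
import numpy as np

# Illustrative sizes only: n-dimensional data, m examples, rank r.
n, m, r = 100, 500, 10
rng = np.random.default_rng(0)

V = rng.random((n, m))   # n x m non-negative data matrix
W = rng.random((n, r))   # n x r matrix whose columns are basis vectors
H = rng.random((r, m))   # r x m matrix whose columns are encodings

# Eq. (1): V is approximated by the product WH.
V_approx = W @ H

# Column-by-column reading: each data vector v (a column of V) is
# approximated by a linear combination of the columns of W, weighted
# by the components of the corresponding column h of H.
mu = 3
assert np.allclose(W @ H[:, mu], V_approx[:, mu])

# The factorization stores (n + m) * r numbers instead of n * m,
# which is the sense in which it compresses the original data matrix.
print((n + m) * r, "numbers in W and H versus", n * m, "in V")
```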

This paper is not about applications of NMF, but focuses instead on the technical aspects of finding non-negative matrix factorizations. Of course, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the nonnegativity constraint makes much of this previous work inapplicable to the present case [8].

Here we discuss two algorithms for NMF based on iterative updates of W and H. Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications. Other algorithms may possibly be more efficient in overall computation time, but they are also more difficult to implement and may not generalize to different cost functions. Algorithms similar to ours, where only one of the factors is adapted, have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12].

At each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules. In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization.

3 Cost functions

To find an approximate factorization $V \approx WH$, we first need to define cost functions that quantify the quality of the approximation. Such a cost function can be constructed using some measure of distance between two non-negative matrices A and B. One useful measure is simply the square of the Euclidean distance between A and B [13],

$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2$    (2)

This is lower bounded by zero, and clearly vanishes if and only if A = B.

Another useful measure is

$D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$    (3)

Like the Euclidean distance, this is also lower bounded by zero, and vanishes if and only if A = B. But it cannot be called a "distance", because it is not symmetric in A and B, so we will refer to it as the "divergence" of A from B. It reduces to the Kullback-Leibler divergence, or relative entropy, when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that A and B can be regarded as normalized probability distributions.
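As a minimal sketch, the two cost functions of Eqs. (2) and (3) can be computed as follows; the small constant guarding the logarithm is an implementation convenience of this sketch, not part of the definitions above.

```python
import numpy as np

def squared_euclidean(A, B):
    """Eq. (2): the squared Euclidean distance between matrices A and B."""
    return np.sum((A - B) ** 2)

def divergence(A, B, eps=1e-12):
    """Eq. (3): D(A||B) = sum_ij (A_ij log(A_ij / B_ij) - A_ij + B_ij).

    eps only guards against log(0) or division by zero for entries that
    are exactly zero; it is not part of the definition in the text.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.sum(A * np.log((A + eps) / (B + eps)) - A + B)

# Both measures are lower bounded by zero and vanish exactly when A = B.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
assert squared_euclidean(A, A) == 0.0
assert abs(divergence(A, A)) < 1e-12
```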

We now consider two alternative formulations of NMF as optimization problems:

Problem 1: Minimize $\|V - WH\|^2$ with respect to W and H, subject to the constraints $W, H \ge 0$.

Problem 2: Minimize $D(V \| WH)$ with respect to W and H, subject to the constraints $W, H \ge 0$.

Although the functions $\|V - WH\|^2$ and $D(V \| WH)$ are convex in W only or H only, they are not convex in both variables together. Therefore it is unrealistic to expect an algorithm to solve Problems 1 and 2 in the sense of finding global minima. However, there are many techniques from numerical optimization that can be applied to find local minima.

Gradient descent is perhaps the simplest technique to implement, but convergence can be slow. Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but they are more complicated to implement than gradient descent [8]. Gradient-based methods also have the disadvantage of being sensitive to the choice of step size, which can be inconvenient for large applications.
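For comparison, a plain projected gradient step on H for the squared cost might look like the sketch below; the fixed step size eta and the clipping back to the non-negative orthant are assumptions of this sketch (the paper's own additive update appears as Eq. (6) in the next section), and they illustrate the step-size sensitivity just mentioned.

```python
import numpy as np

def gradient_step_H(V, W, H, eta=1e-3):
    """One additive gradient step on H for the cost ||V - WH||^2.

    The gradient with respect to H is proportional to W^T W H - W^T V.
    eta is a hand-tuned constant: too large and the cost can increase,
    too small and progress is slow.  Clipping negative entries back to
    zero is a simple way to respect the non-negativity constraint; it is
    an assumption of this sketch, not part of the paper's algorithms.
    """
    grad = W.T @ (W @ H) - W.T @ V
    return np.maximum(H - eta * grad, 0.0)
```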

4 Multiplicative update rules

We have found that the following "multiplicative update rules" are a good compromise between speed and ease of implementation for solving Problems 1 and 2.

Theorem 1: The Euclidean distance $\|V - WH\|$ is nonincreasing under the update rules

$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$    (4)

The Euclidean distance is invariant under these updates if and only if W and H are at a stationary point of the distance.

Theorem 2: The divergence $D(V \| WH)$ is nonincreasing under the update rules

$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$    (5)

The divergence is invariant under these updates if and only if W and H are at a stationary point of the divergence.

Proofs of these theorems are given in a later section. For now, we note that each update consists of multiplication by a factor. In particular, it is straightforward to see that this multiplicative factor is unity when $V = WH$, so that perfect reconstruction is necessarily a fixed point of the update rules.
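A minimal NumPy sketch of the two pairs of update rules is given below. The small constant EPS that protects against division by zero and the alternating order (H first, then W) are implementation choices of this sketch rather than part of Theorems 1 and 2.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero; not part of Eqs. (4) and (5)

def update_euclidean(V, W, H):
    """One alternating pass of the multiplicative updates of Theorem 1, Eq. (4)."""
    H = H * (W.T @ V) / (W.T @ W @ H + EPS)
    W = W * (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H

def update_divergence(V, W, H):
    """One alternating pass of the multiplicative updates of Theorem 2, Eq. (5)."""
    WH = W @ H + EPS
    H = H * (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + EPS)
    WH = W @ H + EPS
    W = W * ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return W, H

# Usage sketch: start from arbitrary non-negative matrices and iterate.
rng = np.random.default_rng(0)
V = rng.random((50, 200))
W, H = rng.random((50, 5)), rng.random((5, 200))
for _ in range(100):
    W, H = update_euclidean(V, W, H)
```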

5 Multiplicative versus additive update rules

It is useful to compare these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for H that reduces the squared distance can be written as

$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ (W^T V)_{a\mu} - (W^T W H)_{a\mu} \right]$    (6)

If the $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. As long as this number is sufficiently small, the update should reduce $\|V - WH\|$.

Now if we diagonally rescale the variables and set

$\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}}$    (7)

then we obtain the update rule for H that is given in Theorem 1. Note that this rescaling results in a multiplicative factor with the positive component of the gradient in the denominator and the absolute value of the negative component in the numerator of the factor.

For the divergence, diagonally rescaled gradient descent takes the form

$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ \sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia} \right]$    (8)

Again, if the $\eta_{a\mu}$ are small and positive, this update should reduce $D(V \| WH)$. If we now set

$\eta_{a\mu} = \frac{H_{a\mu}}{\sum_i W_{ia}}$    (9)

then we obtain the update rule for H that is given in Theorem 2. This rescaling can also be interpreted as a multiplicative rule with the positive component of the gradient in the denominator and the negative component as the numerator of the multiplicative factor.

Since our choices for $\eta_{a\mu}$ are not small, it may seem that there is no guarantee that such a rescaled gradient descent should cause the cost function to decrease. Surprisingly, this is indeed the case, as shown in the next section.
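The algebra above for the Euclidean case can also be checked numerically; the sketch below (with arbitrary random matrices) verifies that the additive update of Eq. (6), combined with the rescaling of Eq. (7), coincides with the multiplicative rule of Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((20, 30))
W = rng.random((20, 4))
H = rng.random((4, 30))

grad_pos = W.T @ W @ H   # positive component of the gradient of the squared cost
grad_neg = W.T @ V       # absolute value of the negative component of the gradient
eta = H / grad_pos       # Eq. (7): element-wise diagonal rescaling

H_additive = H + eta * (grad_neg - grad_pos)   # Eq. (6) with Eq. (7)
H_multiplicative = H * grad_neg / grad_pos     # Theorem 1, Eq. (4)

assert np.allclose(H_additive, H_multiplicative)
```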

6 Proofs of convergence

To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16].

Definition 1: $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions

$G(h, h') \ge F(h), \qquad G(h, h) = F(h)$    (10)

are satisfied.

The auxiliary function is a useful concept because of the following lemma, which is also illustrated graphically in Fig. 1.

Lemma 1: If G is an auxiliary function, then F is nonincreasing under the update

$h^{t+1} = \arg\min_h G(h, h^t)$    (11)

Proof: $F(h^{t+1}) \le G(h^{t+1}, h^t) \le G(h^t, h^t) = F(h^t)$.

Note that $F(h^{t+1}) = F(h^t)$ only if $h^t$ is a local minimum of $G(h, h^t)$. If the derivatives of F exist and are continuous in a small neighborhood of $h^t$, this also implies that $\nabla F(h^t) = 0$. Thus, by iterating the update in Eq. (11) we obtain a sequence of estimates that converges to a local minimum $h_{\min} = \arg\min_h F(h)$ of the objective function:

$F(h_{\min}) \le \cdots \le F(h^{t+1}) \le F(h^t) \le \cdots \le F(h^2) \le F(h^1) \le F(h^0)$    (12)

We will show that by defining the appropriate auxiliary functions $G(h, h^t)$ for both $\|V - WH\|$ and $D(V \| WH)$, the update rules in Theorems 1 and 2 easily follow from Eq. (11).

Figure 1: Minimizing the auxiliary function $G(h, h^t) \ge F(h)$ guarantees that $F(h^{t+1}) \le F(h^t)$ for $h^{t+1} = \arg\min_h G(h, h^t)$.

Lemma 2: If $K(h^t)$ is the diagonal matrix

$K_{ab}(h^t) = \delta_{ab} \, (W^T W h^t)_a / h^t_a$    (13)

then

$G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \tfrac{1}{2} (h - h^t)^T K(h^t) (h - h^t)$    (14)

is an auxiliary function for

$F(h) = \tfrac{1}{2} \sum_i \left( v_i - \sum_a W_{ia} h_a \right)^2$    (15)

Proof: Since $G(h, h) = F(h)$ is obvious, we need only show that $G(h, h^t) \ge F(h)$. To do this, we compare

$F(h) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \tfrac{1}{2} (h - h^t)^T (W^T W) (h - h^t)$    (16)

with Eq. (14) to find that $G(h, h^t) \ge F(h)$ is equivalent to

$0 \le (h - h^t)^T \left[ K(h^t) - W^T W \right] (h - h^t)$    (17)

To prove positive semidefiniteness, consider the matrix

$M_{ab}(h^t) = h^t_a \left( K(h^t) - W^T W \right)_{ab} h^t_b$    (18)

which is just a rescaling of the components of $K - W^T W$. Then $K - W^T W$ is positive semidefinite if and only if M is, and

$\nu^T M \nu = \sum_{ab} \nu_a M_{ab} \nu_b$    (19)

$= \sum_{ab} h^t_a (W^T W)_{ab} h^t_b \, \nu_a^2 - \nu_a h^t_a (W^T W)_{ab} h^t_b \nu_b$    (20)

$= \sum_{ab} (W^T W)_{ab} h^t_a h^t_b \left[ \tfrac{1}{2} \nu_a^2 + \tfrac{1}{2} \nu_b^2 - \nu_a \nu_b \right]$    (21)

$= \tfrac{1}{2} \sum_{ab} (W^T W)_{ab} h^t_a h^t_b (\nu_a - \nu_b)^2$    (22)

$\ge 0$    (23)

One can also prove that $K - W^T W$ is positive semidefinite by considering the matrix $K^{-1} W^T W$. Then $h^t$ is a positive eigenvector of this matrix with unity eigenvalue, and application of the Frobenius-Perron theorem shows that Eq. (17) holds.

We can now demonstrate the convergence of Theorem 1:

Proof of Theorem 1: Substituting $G(h, h^t)$ from Eq. (14) into Eq. (11), the resulting update rule is

$h^{t+1} = h^t - K(h^t)^{-1} \nabla F(h^t)$    (24)

Since Eq. (14) is an auxiliary function, F is nonincreasing under this update rule, according to Lemma 1. Writing the components of this equation explicitly, we obtain

$h^{t+1}_a = h^t_a \, \frac{(W^T v)_a}{(W^T W h^t)_a}$    (25)

By reversing the roles of W and H in Lemmas 1 and 2, F can similarly be shown to be nonincreasing under the update rule for W.
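The monotonicity established above can be observed numerically. The sketch below (random data, an ad hoc eps guard, and 200 iterations are all assumptions of the sketch) tracks the squared Euclidean cost under repeated application of Eq. (4) and checks that it never increases, mirroring the chain of inequalities in Eq. (12).

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((40, 60))
W = rng.random((40, 6))
H = rng.random((6, 60))
eps = 1e-12  # numerical guard only

costs = []
for _ in range(200):
    # Multiplicative updates of Eq. (4), applied alternately to H and W.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    costs.append(np.sum((V - W @ H) ** 2))

# Up to floating-point round-off, the cost sequence is non-increasing.
assert all(b <= a + 1e-8 for a, b in zip(costs, costs[1:]))
print("initial cost %.3f, final cost %.3f" % (costs[0], costs[-1]))
```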

We now consider the following auxiliary function for the divergence cost function:

Lemma 3: Define

$G(h, h^t) = \sum_i (v_i \log v_i - v_i) + \sum_{ia} W_{ia} h_a$    (26)

$\qquad - \sum_{ia} v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right)$    (27)

This is an auxiliary function for

$F(h) = \sum_i \left( v_i \log \frac{v_i}{\sum_a W_{ia} h_a} - v_i + \sum_a W_{ia} h_a \right)$    (28)

Proof: It is straightforward to verify that $G(h, h) = F(h)$. To show that $G(h, h^t) \ge F(h)$, we use the convexity of the logarithm to derive the inequality

$-\log \sum_a W_{ia} h_a \le -\sum_a \alpha_a \log \frac{W_{ia} h_a}{\alpha_a}$    (29)

which holds for all nonnegative $\alpha_a$ that sum to unity. Setting

$\alpha_a = \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b}$    (30)

we obtain

$-\log \sum_a W_{ia} h_a \le -\sum_a \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right)$    (31)

From this inequality it follows that $G(h, h^t) \ge F(h)$.

Theorem 2 then follows from the application of Lemma 1:

Proof of Theorem 2: The minimum of $G(h, h^t)$ with respect to h is determined by setting the gradient to zero:

$\frac{dG(h, h^t)}{dh_a} = -\sum_i v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \frac{1}{h_a} + \sum_i W_{ia} = 0$    (32)

Thus, the update rule of Eq. (11) takes the form

$h^{t+1}_a = \frac{h^t_a}{\sum_k W_{ka}} \sum_i \frac{v_i}{(W h^t)_i} W_{ia}$    (33)

Since G is an auxiliary function, F in Eq. (28) is nonincreasing under this update. Rewritten in matrix form, this is equivalent to the update rule for H in Eq. (5). By reversing the roles of H and W, the update rule for W can similarly be shown to be nonincreasing.

7 Discussion

We have shown that application of the update rules in Eqs. (4) and (5) is guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence proofs rely upon defining an appropriate auxiliary function. We are currently working to generalize these theorems to more complex constraints. The update rules themselves are extremely easy to implement computationally, and will hopefully be utilized by others for a wide variety of applications.

We acknowledge the support of Bell Laboratories. We would also like to thank Carlos Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam Roweis, Larry Saul, and Margaret Wright for useful discussions.

References

[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.

[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71-86.

[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad. Press.

[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings of the Conference on Neural Information Processing Systems 9, 515-521.

[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791.

[6] Field, DJ (1994). What is the goal of sensory coding? Neural Comput. 6, 559-601.

[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. The Handbook of Brain Theory and Neural Networks, 895-898. MIT Press, Cambridge, MA.

[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical Recipes: The Art of Scientific Computing in C. Cambridge University Press, Cambridge, England.

[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. Medical Imaging MI-1, 113-122.

[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55-59.

[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. J. 74, 745-754.

[12] Bouman, CA & Sauer, K (1996). A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Processing 5, 480-492.

[13] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analysis. Chemometr. Intell. Lab. 37, 23-35.

[14] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132, 1-64.

[15] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39, 1-38.

[16] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel (eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 81-89. ACL Press.