Bayesian optimization

First of all, what can Bayesian optimization do? The impression it gives is that it can do almost anything, although some of the results may be unsatisfactory. Bayesian optimization can do regression (though that always feels like a side product); its main purpose is to solve an "optimization problem".

Bayesian optimization solves problems of the following type:

$$x^* = \arg\max_{x \in \mathcal{X}} f(x)$$

Note: there is no substantial difference if $\arg\min$ is used instead; in fact, [1] uses $\arg\min$.

Often we do not know the form (or gradient) of $f$, so it is difficult to solve this kind of problem with classical gradient ascent (gradient descent in the $\arg\min$ case). Bayesian optimization instead handles it with a probabilistic surrogate model. Here $x$ is a decision, and $\mathcal{X}$ is what is often called the decision space. A drug formulation is a decision; the convolution kernel size of a neural network can also be regarded as a decision. Moreover, the relationship between such a decision and the final output is often hard to know, and handling this black-box relationship is precisely the strength of Bayesian optimization.

The two figures above are from [2] and [1] respectively. Because some of their symbols differ, we adopt the notation of [2] throughout.

In each iteration, Bayesian optimization first selects a new evaluation point under the current surrogate model (the "prior" for this iteration) by maximizing the acquisition function, which trades off exploring poorly sampled regions against exploiting promising ones. The new evaluation point is fed into the system as an input, and the new output is used to update the probabilistic surrogate model.
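As a concrete toy illustration of this loop, here is a minimal sketch in Python. Everything in it is an assumption for illustration, not taken from [1] or [2]: an RBF-kernel Gaussian-process surrogate, a UCB acquisition maximized over a fixed grid, and a simple quadratic test objective.

```python
import numpy as np

def gp_posterior(X, y, Xs, ell=0.3, sf=1.0, noise=1e-5):
    """RBF-kernel GP posterior mean and std (zero prior mean)."""
    def k(A, B):
        return sf**2 * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)
    L = np.linalg.cholesky(k(X, X) + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = k(X, Xs)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = sf**2 - np.sum(v * v, axis=0)   # k** - k*^T K^-1 k*
    return mu, np.sqrt(np.maximum(var, 1e-12))

def bayes_opt(f, n_iter=10):
    X = np.array([0.1, 0.5, 0.9])             # initial design (assumed)
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 201)          # candidate evaluation points
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, grid)
        x_new = grid[np.argmax(mu + 2.0 * sd)]  # UCB acquisition
        X, y = np.append(X, x_new), np.append(y, f(x_new))
    i = np.argmax(y)
    return X[i], y[i]

# toy objective with its maximum at x = 0.7
x_best, y_best = bayes_opt(lambda x: -(x - 0.7) ** 2)
```

The loop evaluates the black-box objective only at points the acquisition function deems promising, which is exactly the behavior described above.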

In ...

The figure above is a simple demonstration of Bayesian optimization. The black dotted line is the objective function; the black solid line is the fitted curve (the posterior mean of the probabilistic surrogate model); the blue-purple area is the confidence band. The green curve below is the acquisition function at each iteration; as can be seen, the evaluation point selected in each iteration corresponds to its maximum.

Next, we discuss the probabilistic surrogate model and the acquisition function in turn.

A probabilistic surrogate model is, as the name implies, a probabilistic model that serves as a surrogate for the objective.

A parametric model is one fully determined by a set of parameters $w$. If we place a prior distribution $p(w)$ on them, then Bayes' rule gives the posterior distribution:

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$$

The problem now is that we know neither $p(\mathcal{D} \mid w)$ nor $p(\mathcal{D})$. The likelihood $p(\mathcal{D} \mid w)$ is a distribution we usually have to specify ourselves. As for $p(\mathcal{D})$, it is harder to compute, but it only plays the role of a normalizing constant here, so we can work with the unnormalized posterior (its kernel). In practice, we often choose a conjugate prior as the prior distribution.

Here is an example: the laboratory has K kinds of drugs, and we need to find out through trials which drug works best. Assume that when a drug is given to a patient there are only two possible outcomes, cure and failure, and that the efficacy of one drug tells us nothing about the efficacy of another. This kind of problem is often called A/B testing and is common in, for example, news recommendation.

We use $a \in \{1, \dots, K\}$ to index the drugs, $\theta_a$ for the probability that drug $a$ cures a patient, and $y$ for the treatment outcome (0 failure, 1 cure). The mapping from drug to outcome is complex, so we model it probabilistically: the surrogate model we choose is the Bernoulli likelihood $p(y \mid a, \theta) = \theta_a^{\,y} (1 - \theta_a)^{1 - y}$.

We choose the Beta distribution $\mathrm{Beta}(\alpha, \beta)$ as the prior because it is the conjugate prior of the Bernoulli likelihood.

Definition: let $a_i$ denote the drug selected at the $i$-th evaluation, and let

$$n_{a,0} = \sum_i \mathbb{1}\{a_i = a\}\,(1 - y_i), \qquad n_{a,1} = \sum_i \mathbb{1}\{a_i = a\}\, y_i$$

count the failures and cures of drug $a$; the indicator $\mathbb{1}\{\cdot\}$ is 1 when its condition holds and 0 otherwise.

Then, the posterior distribution of $\theta_a$ is:

$$\theta_a \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + n_{a,1},\; \beta + n_{a,0})$$

See the appendix for the above derivation.

From the above we can also see that the hyperparameters $\alpha$ and $\beta$ act as pseudo-counts of cures and failures. The figure below shows examples of Beta priors.

Thompson sampling-wiki

So, given the posterior, how do we choose the next drug? Thompson sampling (also called posterior sampling) proceeds as follows:

$$\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}), \qquad a^* = \arg\max_a \tilde{\theta}_a$$

that is, we sample from the posterior distribution of each $\theta_a$ and pick the drug whose sample is largest.
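The Beta-Bernoulli surrogate plus Thompson sampling fits in a few lines of Python. The cure probabilities `true_p`, the number of rounds, and the uniform Beta(1, 1) prior below are all assumptions made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
true_p = np.array([0.3, 0.5, 0.8])   # hidden cure probabilities (assumed)
K = len(true_p)
alpha = np.ones(K)                    # Beta prior pseudo-counts of cures
beta = np.ones(K)                     # Beta prior pseudo-counts of failures

counts = np.zeros(K, dtype=int)
for _ in range(2000):
    theta = rng.beta(alpha, beta)     # one posterior sample per drug
    a = int(np.argmax(theta))         # Thompson sampling: act on the sample
    y = rng.random() < true_p[a]      # run the "trial"
    alpha[a] += y                     # posterior update: add the new counts
    beta[a] += 1 - y
    counts[a] += 1
```

After enough rounds, the posterior mean `alpha[a] / (alpha[a] + beta[a])` concentrates near the true cure rate of the drugs that are pulled often.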

The advantages of this model are:

The following is the algorithm of this model:

The above model becomes overstretched when dealing with combinatorial decisions. For example, suppose we seek a combination built from $d$ elements, each with two states, so there are $2^d$ combinations in total. Keeping one parameter per combination is clearly unrealistic. Worse, the previous model's assumption (that one combination's effectiveness tells us nothing about another's) no longer seems tenable, because different combinations are often subtly related.

The linear model can solve this problem well.

Let each decision be represented by a feature vector $x$, and take the surrogate model to be $f(x) = x^\top w$, where $w$ is now a weight vector. This is only part of the surrogate model, because it does not yet reflect the "probability" part.

The observed value is $y = x^\top w + \varepsilon$, which follows a normal distribution. Naturally, we again choose the conjugate prior as the prior distribution: the normal-inverse-gamma distribution (normal_inverse_gamma-wiki).

The normal-inverse-gamma distribution has four hyperparameters $(m_0, V_0, a_0, b_0)$, and the posterior of $(w, \sigma^2)$ satisfies $\mathrm{NIG}(m_n, V_n, a_n, b_n)$ with

$$V_n = (V_0^{-1} + X^\top X)^{-1}, \qquad m_n = V_n\,(V_0^{-1} m_0 + X^\top y),$$
$$a_n = a_0 + n/2, \qquad b_n = b_0 + \tfrac{1}{2}\left(y^\top y + m_0^\top V_0^{-1} m_0 - m_n^\top V_n^{-1} m_n\right),$$

where the first row gives the posterior covariance and mean of $w$.

See the appendix for the derivation.

Regarding the choice of the next $x$, Thompson sampling can again be used: draw $(\tilde{w}, \tilde{\sigma}^2)$ from the posterior and select $x^* = \arg\max_x x^\top \tilde{w}$.
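A minimal sketch of Thompson sampling for the Bayesian linear model, under the simplifying assumption that the noise variance is known (then a plain Gaussian prior on $w$ suffices instead of the full normal-inverse-gamma); the dimensions, prior variance `tau2`, and `w_true` are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 3, 0.25                    # feature dim, known noise variance (assumed)
w_true = np.array([1.0, -2.0, 0.5])    # hidden weights (assumed)

# conjugate Gaussian prior w ~ N(0, tau2 * I); the posterior stays Gaussian
tau2 = 10.0
X = rng.normal(size=(50, d))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

S_inv = np.eye(d) / tau2 + X.T @ X / sigma2   # posterior precision
S = np.linalg.inv(S_inv)                      # posterior covariance
m = S @ (X.T @ y) / sigma2                    # posterior mean

# Thompson sampling: draw one weight vector from the posterior,
# then act greedily with respect to that draw
w_sample = rng.multivariate_normal(m, S)
```

With enough observations the posterior mean `m` approaches the hidden weights, so the sampled weight vectors (and hence the chosen decisions) concentrate on good ones.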

In ...

The linear model has many extensions:

Among them, it is often:

and

Here the remaining quantities are all hyperparameters. As for how to update these hyperparameters, I am not sure.

A nonparametric model does not mean that there are no parameters, but that the number of parameters is not fixed in advance.

Let's first look at how to transform the previous linear model into a nonparametric model.

Suppose the noise variance is fixed and $w$ follows a multivariate normal distribution with mean $m$ and covariance matrix $V$. Then we can integrate $w$ out and obtain a marginal distribution:

$$y \mid X \sim \mathcal{N}\!\left(X m,\; X V X^\top + \sigma^2 I\right)$$

See the appendix for derivation.

As mentioned earlier, we can introduce a feature map $\phi$, mapping the data (design) matrix $X$ to $\Phi = \phi(X)$, so the corresponding marginal distribution becomes:

$$y \mid X \sim \mathcal{N}\!\left(\Phi m,\; \Phi V \Phi^\top + \sigma^2 I\right)$$

Note that we never actually need to specify $\phi$ explicitly; it enters only through the kernel $k(x, x') = \phi(x)^\top V\, \phi(x')$.

Here $x_*$ is a new location and $f_*$ the corresponding prediction; both can be vectors.

The numerator here is a joint Gaussian distribution. At this point we have in fact already built a simple Gaussian process; the Gaussian process is introduced formally below.

Gaussian process-wiki

Gauss process-Mars eleven lang

The core of a Gaussian process is its mean function $\mu(x)$ (in Bayesian optimization often chosen to be 0) and its covariance function $k(x, x')$; the observed value is $y = f(x) + \varepsilon$. An input sequence and its observations drawn from a Gaussian process obey a joint normal distribution:

$$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} K + \sigma^2 I & k_* \\ k_*^\top & k_{**} \end{pmatrix}\right)$$
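Conditioning that joint Gaussian yields the standard GP predictive equations, which can be sketched with numpy. The RBF kernel, the sine test function, the noise level, and the test location below are all assumptions for illustration:

```python
import numpy as np

def rbf(A, B, ell=0.2, sf=1.0):
    # squared-exponential covariance function
    return sf**2 * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 8)                        # training inputs
y = np.sin(2 * np.pi * X) + 0.02 * rng.normal(size=8)
Xs = np.array([0.25])                               # new location x_*

noise = 0.02**2
K = rbf(X, X) + noise * np.eye(len(X))
Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)

# standard GP predictive equations:
#   mean = k*^T K^-1 y,   cov = k** - k*^T K^-1 k*
mu = Ks.T @ np.linalg.solve(K, y)
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
```

The predictive mean interpolates the data, and the predictive variance shrinks near observed inputs, exactly the behavior the figures in this article illustrate.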

Kernel method-wiki

Matérn covariance function-wiki

Literature [1] gives the Matérn covariance in the following form (as a function of the distance $r$):

$$k(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$$

where $\nu$ is the smoothness parameter, $\ell$ the scale (length-scale) parameter, and $K_\nu$ the modified Bessel function of the second kind.

Several commonly used Matérn covariance functions are also given there.
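For the special value $\nu = 5/2$, the Bessel-function form reduces to a well-known closed-form expression that is easy to code directly (the default parameter values below are arbitrary):

```python
import numpy as np

def matern52(r, ell=1.0, sf=1.0):
    """Matern covariance with nu = 5/2, where the general Bessel form
    reduces to (1 + s + s^2/3) * exp(-s) with s = sqrt(5) * |r| / ell."""
    s = np.sqrt(5.0) * np.abs(r) / ell
    return sf**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

The function equals `sf**2` at zero distance and decays monotonically, which is the qualitative behavior expected of any stationary covariance used here.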

Literature [2] gives another expression:

where $\Lambda$ is a diagonal matrix whose diagonal elements are per-dimension length-scales.

These parameters can be understood as follows:

Update methods for some of the above parameters are given below.

The log marginal likelihood function can be expressed as:

$$\log p(y \mid X, \theta) = -\frac{1}{2}\, y^\top (K + \sigma^2 I)^{-1} y \;-\; \frac{1}{2} \log \left| K + \sigma^2 I \right| \;-\; \frac{n}{2} \log 2\pi$$

As can be seen, the right-hand side splits into three parts with different meanings: a data-fit term, a model-complexity penalty, and a normalizing constant.

A natural idea is to maximize this likelihood with respect to the hyperparameters, yielding a maximum-likelihood estimate of them.

The cost of each Gaussian process update is on the order of $O(n^3)$, which comes from inverting the $n \times n$ kernel matrix. Using a Cholesky decomposition instead of a direct inverse reduces the constant factor and improves numerical stability.
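A sketch of this computation: the log marginal likelihood of a zero-mean RBF-kernel GP, evaluated through a Cholesky factorization rather than an explicit inverse (the kernel choice and the data are assumptions for illustration):

```python
import numpy as np

def gp_lml(X, y, ell=0.5, sf=1.0, noise=0.1):
    """Log marginal likelihood of a zero-mean GP with an RBF kernel,
    computed via the Cholesky factorization K = L L^T."""
    n = len(X)
    K = sf**2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2) \
        + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                 # how well the data are explained
    complexity = -np.sum(np.log(np.diag(L)))    # -0.5 log|K|: penalizes complexity
    constant = -0.5 * n * np.log(2 * np.pi)     # normalizing constant
    return data_fit + complexity + constant

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20)
y = np.sin(3 * X) + 0.1 * rng.normal(size=20)
lml = gp_lml(X, y)
```

Since $\log|K| = 2\sum_i \log L_{ii}$, the factorization gives both the quadratic form and the determinant term in one pass.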

Therefore, some algorithms have been developed to try to overcome this difficulty.

SPGP selects $M$ pseudo-inputs to stand in for the $N$ real inputs, achieving a low-rank approximation. The positions of these $M$ pseudo-inputs are themselves treated as parameters (although I do not know how they are updated). The natural advantage is that the complexity can be reduced to $O(NM^2)$. The disadvantage is that there are relatively many parameters, so "over-fitting" occurs easily.

According to Bochner's theorem, any stationary kernel can be represented as the Fourier transform of a positive finite measure (its spectral density):

$$k(x - x') = \int e^{\,i\, \omega^\top (x - x')}\, \hat{p}(\omega)\, d\omega$$

Then, by the Monte Carlo method, $m$ frequencies are sampled to approximate the above integral, yielding an approximate covariance function. When the data set is small, SSGP is also prone to "over-fitting".
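A sketch of this random-feature idea for the RBF kernel, whose spectral density under Bochner's theorem is Gaussian: sample frequencies, build cosine/sine features, and compare the resulting inner product with the exact kernel value. The length-scale and the number of features are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 20000, 0.7

# For the 1-D RBF kernel exp(-d^2 / (2 ell^2)), the spectral density is
# Gaussian with standard deviation 1/ell; sample m frequencies from it.
W = rng.normal(scale=1.0 / ell, size=m)

def phi(x):
    # random Fourier feature map; phi(x1) . phi(x2) approximates k(x1 - x2)
    return np.concatenate([np.cos(W * x), np.sin(W * x)]) / np.sqrt(m)

x1, x2 = 0.3, 0.9
approx = phi(x1) @ phi(x2)
exact = np.exp(-0.5 * (x1 - x2) ** 2 / ell**2)
```

The Monte Carlo error shrinks like $1/\sqrt{m}$, which is why a small feature count on a small data set can both misestimate the kernel and over-fit.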

Random forest-Bohr's notes

A random forest can be used as a substitute for the Gaussian process. Its disadvantage is that away from the observed data the estimate is inaccurate (the prediction extrapolates as a constant). In addition, because a random forest is discontinuous and non-differentiable, its parameters cannot be updated by gradient descent (or ascent). It is also puzzling how, even if we give a prior distribution over the random forest's parameters, we would compute the posterior distribution.

First, we have a utility function $u(x, y)$. As the name implies, it is an index reflecting how valuable an evaluation at $x$ with function value $y$ is (given the data observed so far). Paper [1] does not introduce this utility function; paper [2] introduces the concept for clearer exposition.

The acquisition function is chosen as the expected utility:

$$\alpha(x) = \mathbb{E}\left[\, u(x, y) \mid \mathcal{D} \,\right]$$

Actually this is somewhat curious, since the surrogate's prediction is an expectation and the acquisition function is again an expectation. My understanding is that this is deliberately "fuzzy": if we used an "exact" maximum-likelihood output instead, the evaluation points might not be dispersed enough, and we would fall into a local optimum. It also means the parameters need not be estimated explicitly, at the price of computing an expectation.

We can see from the following algorithm that there is often no such step.

Finally, once more: the design of the acquisition function is a trade-off between exploration and exploitation. That is, we hope the new evaluation point is far from the existing data points (exploring unknown regions) while still promising an improvement (exploiting the current region).

The design idea of the PI (probability of improvement) acquisition function is very simple: we want an evaluation point whose value exceeds the best value observed so far, $f^+ = \max_i y_i$ (the minimum, if the problem was stated with $\arg\min$), usually by some margin $\xi$.

The acquisition function is:

$$\alpha_{\mathrm{PI}}(x) = \Phi\!\left( \frac{\mu(x) - f^+ - \xi}{\sigma(x)} \right)$$

Note that $\Phi$ here is the cumulative distribution function of the standard normal distribution.

The utility function behind this acquisition function is:

$$u(x, y) = \mathbb{1}\{\, y > f^+ + \xi \,\}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function.

When it is the minimum, PI works very well.
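A minimal sketch of PI as a function of the surrogate's predictive mean and standard deviation, in the maximization convention (the default margin value is arbitrary):

```python
import math

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI acquisition: posterior probability that the value at a candidate
    point exceeds the incumbent f_best by a margin xi."""
    z = (mu - f_best - xi) / sigma
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A candidate whose mean far exceeds the incumbent scores near 1, one far below scores near 0, regardless of how large the potential improvement actually is.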

One disadvantage of PI is that it only cares about the probability of an improvement, not its magnitude. EI covers both aspects.

Generally, the improvement function is expressed by the following formula:

$$I(x) = \max\left(0,\; f(x) - f^+\right)$$

The corresponding acquisition function is:

$$\alpha_{\mathrm{EI}}(x) = \left(\mu(x) - f^+\right) \Phi(z) + \sigma(x)\, \varphi(z), \qquad z = \frac{\mu(x) - f^+}{\sigma(x)}$$

where $\varphi$ is the probability density function of the standard normal distribution. The formula can be derived by a change of the integration variable.

The improvement function here in fact plays the role of the utility function.
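A minimal sketch of EI in the maximization convention, using only the standard library (the optional margin parameter `xi` is an assumption):

```python
import math

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """EI acquisition: expected amount by which the candidate's value
    exceeds the incumbent f_best (for maximization)."""
    z = (mu - f_best - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # std normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # std normal PDF
    return (mu - f_best - xi) * Phi + sigma * phi
```

Unlike PI, the result grows with the predictive standard deviation even at a fixed mean, so EI rewards both the probability and the size of an improvement.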

The acquisition function is:

$$\alpha_{\mathrm{UCB}}(x) = \mu(x) + \beta\, \sigma(x)$$

This acquisition function can be understood as follows: for any $x$, the surrogate gives a mean $\mu(x)$ and a standard deviation $\sigma(x)$ (reflecting how far and how much the value may float). We regard $\mu(x) + \beta\sigma(x)$ as a fairly reliable optimistic bound on the value at $x$, so maximizing the acquisition function maximizes the value we can plausibly hope for.

The choice of $\beta$ mentioned in [2] often comes from the Chernoff-Hoeffding bound. It sounds mysterious, but UCB seems to be very popular now. In addition, there is a body of theory that can guide the scheduling of this hyperparameter to make it (near-)optimal.
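A tiny sketch of the UCB trade-off with two hypothetical candidates, showing how $\beta$ shifts the choice between a high-mean point and a high-uncertainty point:

```python
import numpy as np

def ucb(mu, sigma, beta=2.0):
    """UCB acquisition: an optimistic plausible bound mu + beta * sigma."""
    return mu + beta * sigma

mu = np.array([1.0, 0.5])   # candidate A: high mean; candidate B: lower mean
sd = np.array([0.1, 0.6])   # ... but candidate B is far more uncertain

explore = int(np.argmax(ucb(mu, sd, beta=2.0)))  # large beta favors uncertainty
exploit = int(np.argmax(ucb(mu, sd, beta=0.5)))  # small beta favors the mean
```

With $\beta = 2$ the uncertain candidate wins ($0.5 + 1.2 > 1.0 + 0.2$); with $\beta = 0.5$ the high-mean candidate wins, which is precisely the exploration-exploitation dial described above.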

Different from the previous strategies, information-based strategies rely on the posterior distribution of the global optimizer $x^*$. This distribution is implicit in the posterior distribution over the function (different