The so-called "least squares method" is, quite literally, the formulation of an optimization problem; solving it means solving an optimization problem.
My understanding: when the relationship between independent variables and dependent variables can be approximately described by a function, we call this process regression, and that function is called the regression function.
Regression helps us predict the dependent variable: given historical data and the regression function, we can roughly predict how the dependent variable will trend next. This is widely used in the stock market and in time-series problems.
Suppose we generate a set of random points and want to use a function to describe the relationship between the horizontal axis x and the vertical axis y. Assuming the relationship is linear, the function is h(x)=\theta_0+\theta_1 x.
Once we obtain \theta_0 and \theta_1, the regression function is determined, and the corresponding y can be computed from any x.
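A minimal sketch of this setup in Python (the slope, intercept, and noise level below are arbitrary choices for illustration; 1.6 is used so the result lines up with the figure discussed later):

```python
import numpy as np

# Hypothetical data: points scattered around a line (parameters chosen arbitrarily)
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)               # horizontal axis X
y = 1.6 + 1.6 * x + rng.normal(0, 0.3, m)   # vertical axis Y with noise

def h(x, theta0, theta1):
    """Hypothesis function h(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x
```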
How to get h(x)?
We want our hypothesis function to describe the real data well, that is, to be as close as possible to the distribution of the real data. In other words, the error between them should be minimized.
So we define the cost function:
J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2
That is, we square the error between each observed value and the predicted value, then sum over all points to get the total error.
Our goal then becomes:
Minimize J(\theta_0,\theta_1)
That is the idea behind the least squares method.
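Continuing the sketch above (reusing h, x, and y defined earlier), the cost function can be written directly as:

```python
def cost(theta0, theta1, x, y):
    """J(theta_0, theta_1) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(x)
    errors = h(x, theta0, theta1) - y
    return (errors ** 2).sum() / (2 * m)
```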
The least squares method poses the optimization problem; gradient descent is one concrete method for solving it.
When minimizing J(\theta_0,\theta_1) in the least squares problem, if the problem is linear we can solve it directly in matrix form using the normal equation, \theta=(X^TX)^{-1}X^Ty. This only works when (X^TX)^{-1} exists, which is a limitation of the matrix approach.
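A rough sketch of the normal-equation solution, again reusing the x and y generated earlier:

```python
# Design matrix with a leading column of ones, so theta[0] plays the role of theta_0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y  (requires X^T X to be invertible)
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_ne)  # roughly [1.6, 1.6] for the data generated above

# Note: np.linalg.solve(X.T @ X, X.T @ y) or np.linalg.lstsq is numerically
# preferable to forming the explicit inverse.
```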
The general, all-purpose method is gradient descent (even though it is slower).
The essence of gradient descent is iteration: by repeatedly updating the values of \theta, we gradually approach the minimum of J(\theta_0,\theta_1).
As the figure above shows, as the number of iterations (horizontal axis) increases, the value of J (vertical axis) gradually decreases.
We know the cost function J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2.
Our objective is to minimize J(\theta_0,\theta_1).
Therefore, we take the partial derivatives of J(\theta_0,\theta_1) with respect to \theta_0 and \theta_1, which gives:
\frac{\partial J(\theta_0,\theta_1)}{\partial\theta_0}=\frac{1}{m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})
\frac{\partial J(\theta_0,\theta_1)}{\partial\theta_1}=\frac{1}{m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})x^{(i)}
Update \theta:
\theta_0:=\theta_0-\alpha\frac{\partial J(\theta_0,\theta_1)}{\partial\theta_0}
\theta_1:=\theta_1-\alpha\frac{\partial J(\theta_0,\theta_1)}{\partial\theta_1}
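Putting the partial derivatives and the update rule together, a minimal batch gradient descent sketch (learning rate and iteration count are arbitrary; it reuses h, x, and y from above):

```python
def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta_0 + theta_1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0                 # start from theta = (0, 0), as in the figure
    for _ in range(iterations):
        errors = h(x, theta0, theta1) - y
        grad0 = errors.sum() / m              # dJ/dtheta_0
        grad1 = (errors * x).sum() / m        # dJ/dtheta_1
        # update both parameters simultaneously
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent(x, y)       # lands near (1.6, 1.6) for this data
```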
In the figure, the colored lines trace the gradient descent path. Starting from \theta=(0,0), \theta gradually approaches the minimum of the cost and finally arrives near [1.6, 1.6].
Beyond batch gradient descent there are variants such as incremental (stochastic) gradient descent; they are all designed to reach the approximation faster, and the right choice depends on the specific problem.
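For reference, a rough sketch of the incremental (stochastic) variant, which updates the parameters after every single sample instead of after a full pass over the data (hyperparameters are again arbitrary, and h, x, y come from the earlier sketch):

```python
def incremental_gradient_descent(x, y, alpha=0.01, epochs=100):
    """Update theta after each sample rather than after the whole batch."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            error = h(xi, theta0, theta1) - yi
            theta0, theta1 = theta0 - alpha * error, theta1 - alpha * error * xi
    return theta0, theta1
```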