# Linear regression (according to Coursera’s ML course)

The first two weeks of Andrew Ng’s Machine Learning course on Coursera started out quite simple and easy, especially for someone with some prior knowledge of Statistics/Machine Learning. There were two basic tutorials on Linear Algebra and Octave, the programming language used in the course. Although simple, there is always something worth learning and keeping in mind. One thing I try to take from this kind of Machine Learning material is the difference in how statisticians and machine-learning-oriented people treat the same subjects.

## Machine Learning definition

Firstly, the course tries to define Machine Learning. Even though there is no single correct definition, I would use the following one, attributed to Tom Mitchell, to define a learning problem:

“Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

## Supervised and Unsupervised Learning

Secondly, the course gives an intuition about the difference between supervised and unsupervised learning. I don’t know if there is a formal definition for this, but in my head, in supervised learning you have response variables (yes, I use statistical terms), while in unsupervised learning you don’t.

## Linear Regression with Multiple Variables

The main focus of the lectures so far has been on linear regression with multiple variables. The way this is presented sounds rather odd if you come from a statistical background.

Basically, you have your training set ${\{(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})\}}$ and you need to find the parameter ${\theta = (\theta_0, \theta_1, ..., \theta_n)}$ that minimizes the cost function

$\displaystyle J(\theta) = \frac{1}{2m} \sum _{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)})^2 \ \ \ \ \ (1)$

where the “hypothesis” ${h_{\theta} (x)}$ is given by

$\displaystyle h_{\theta} (x) = \theta^Tx = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n,$

with the convention that ${x_0 = 1}$.

After you have found ${\theta^*}$, the value of ${\theta}$ that minimizes ${J(\theta)}$, your best prediction for ${y^{(m+1)}}$ would be ${h_{\theta^*} (x^{(m+1)})}$.
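To make the hypothesis and the cost function concrete, here is a minimal sketch in Python/NumPy (rather than the course’s Octave); the data values are made up purely for illustration:

```python
import numpy as np

# Toy training set: m = 4 examples, one feature plus the intercept column x_0 = 1.
# The values here are made up for illustration.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
m = len(y)

def h(theta, X):
    """Hypothesis h_theta(x) = theta^T x, vectorized over all rows of X."""
    return X @ theta

def J(theta, X, y):
    """Cost function J(theta) = (1 / 2m) * sum of squared residuals."""
    residuals = h(theta, X) - y
    return residuals @ residuals / (2 * m)

theta = np.array([0.0, 2.0])   # theta_0 = 0, theta_1 = 2 fits this toy data exactly
print(J(theta, X, y))          # 0.0, since h_theta(x) = 2x reproduces y
```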

It is interesting to note that there is not much about confidence/credible intervals for parameter/prediction estimates. There is also not much discussion of the (model) assumptions behind all this.

## Finding ${\theta^*}$

### Analytical solution

In the case of Linear regression, there is a closed form solution to find ${\theta^*}$,

$\displaystyle \theta^* = (X^TX)^{-1}X^Ty, \ \ \ \ \ (2)$

where ${X}$ is the design matrix and ${y}$ is the response vector. Even though we have an analytical solution in this case, it becomes quite slow when the number of features ${n}$ is large, e.g. ${n > 10000}$. This happens because computing Eq. (2) requires inverting the ${n \times n}$ matrix ${X^T X}$, which costs roughly ${O(n^3)}$. In such cases it is worthwhile to use a numerical solution instead. The optimization algorithm introduced in the course was gradient descent.
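Eq. (2) can be sketched in a few lines of NumPy. The synthetic data below is an assumption for illustration; also note that solving the linear system ${X^TX\theta = X^Ty}$ is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): m = 100 examples, n = 3 features.
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend intercept column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)             # small additive noise

# Normal equation, Eq. (2), solved as a linear system instead of
# explicitly computing (X^T X)^{-1}.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)  # close to true_theta
```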

### Gradient descent

Gradient descent is a first-order optimization algorithm. If we want to find the minimum of ${J(\theta)}$ with gradient descent, we should repeat

$\displaystyle \theta_{n+1} = \theta_{n} - \alpha_n \nabla J(\theta_n),$

until convergence, where ${\nabla J(\theta)}$ is the gradient of ${J(\theta)}$. With ${J(\theta)}$ defined as in Eq. (1) we have

$\displaystyle \nabla J(\theta) = \frac{1}{m}X^T(X\theta - y).$
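Putting the update rule and the gradient formula together, batch gradient descent can be sketched as follows in NumPy (a fixed step size and iteration budget are assumptions for illustration, not part of the lecture material):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.

    Repeats theta <- theta - alpha * (1/m) X^T (X theta - y)
    for a fixed number of iterations.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J at the current theta
        theta -= alpha * grad
    return theta

# Same toy setup as before (made-up data): intercept column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(gradient_descent(X, y))  # approaches [0, 2]
```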

Some tips on the application of gradient descent were given:

• Feature scaling: Transform all your features so that they share a common range, for example by standardizing them. This helps the optimization algorithm converge faster.
• Debugging: The best way to check the convergence of the algorithm is to plot the value of ${J(\theta)}$ against the number of iterations. A decaying curve is to be expected.
• Choice of ${\alpha_n}$: If ${\alpha_n = \alpha}$ is too small it leads to slow convergence, while if it is too big ${J(\theta)}$ may not decrease on every iteration, or may even diverge. Try different values of ${\alpha}$, as in ${..., 0.001, 0.003, 0.01, 0.03, 0.1, ...}$, to get a feel for what works in your problem.
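The feature-scaling tip above can be sketched as a small standardization helper; the feature values (house size and number of bedrooms) are made-up examples:

```python
import numpy as np

def standardize(X):
    """Standardize each feature column to zero mean and unit variance.

    Returns the scaled matrix together with the means and standard
    deviations, which are needed later to scale new inputs the same way.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Features on very different scales (made-up values): house size vs. bedrooms.
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0], [1416.0, 2.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```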

## Polynomial regression

Polynomial regression was briefly mentioned, basically showing that you can use non-linear functions of your original features as new features in order to capture non-linear trends with a linear regression model.
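A minimal sketch of this idea, using noiseless made-up quadratic data: treat ${x}$ and ${x^2}$ as two features of an ordinary linear model, and fit with the normal equation. The model remains linear in ${\theta}$:

```python
import numpy as np

# Made-up quadratic data: y = 1 + 2x + 3x^2, no noise.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Design matrix with columns [1, x, x^2]: the non-linear term x^2 is
# just another feature as far as linear regression is concerned.
X = np.column_stack([np.ones_like(x), x, x ** 2])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [1, 2, 3] up to floating point
```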
