1.0 | History
Least squares linear regression, as a means of finding a good rough linear fit to a set of points, was performed by Legendre (1805) and Gauss (1809) for the prediction of planetary movement. Quetelet was responsible for making the procedure well known and for using it extensively in the social sciences.
2.0 | Objective
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation (that is, finding a slope and an intercept) to observed data. Its purpose is to predict the value of the dependent variable from the values of the independent variables.
3.0 | Derivation
Assume that we want to fit a linear function to \(n\) datapoints, \(I_{i} = (x_{i}, y_{i})\). Define the model as \(\hat{y_{i}} = \theta_{0} + \theta_{1}x_{i}\).
Define the error function:
\[\text{Error}(\theta_0, \theta_1) = \sum_{i=0}^{n-1}\left(y_{i}- \hat{y_{i}}\right)^{2}\] \[\text{Error}(\theta_0, \theta_1) = \sum_{i=0}^{n-1}\left(y_{i}-(\theta_{0} + \theta_{1}x_{i})\right)^{2}\] \[\text{Error}(\theta_0, \theta_1) = \sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right)^{2}\]
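As a concrete illustration (a minimal sketch; the function name sse is my own choice, not taken from the text), this error can be evaluated in Python for any candidate pair \((\theta_0, \theta_1)\):
def sse(theta0, theta1, xs, ys):
    # sum of squared residuals between the observed y values and the line theta0 + theta1 * x
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))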
The optimal parameter values for the regression line, according to least squares, are the ones which minimize the squared error. To obtain them, take the derivative with respect to \(\theta_{0}\) and \(\theta_{1}\) and set each result to zero.
\[\min\left(\text{Error}(\theta_0, \theta_1)\right)\]
Solve for \(\theta_0\): \[\frac{d}{d \theta_0}\text{Error}(\theta_0, \theta_1) = 0\] \[ \frac{d}{d \theta_0}\sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right)^{2} = 0\] \[ \sum_{i=0}^{n-1}2\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right) (-1) = 0\] \[ 2(-1) \sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right) = 0\] \[ \sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right) = 0\] \[ \sum_{i=0}^{n-1} y_{i} - \sum_{i=0}^{n-1} \theta_{0} - \sum_{i=0}^{n-1} \theta_{1}x_{i} = 0\] \[ \sum_{i=0}^{n-1} y_{i} = \theta_{0}\sum_{i=0}^{n-1} 1 + \sum_{i=0}^{n-1} \theta_{1}x_{i}\] \[ \sum_{i=0}^{n-1} y_{i} = (n-1 - 0 + 1) \theta_{0} + \sum_{i=0}^{n-1} \theta_{1}x_{i}\] \[ \sum_{i=0}^{n-1} y_{i} = n\theta_{0} + \sum_{i=0}^{n-1} \theta_{1}x_{i}\] \[ n\theta_{0} = \sum_{i=0}^{n-1} y_{i} - \theta_{1}\sum_{i=0}^{n-1} x_{i}\] \[ \theta_{0} = \frac{\sum_{i=0}^{n-1} y_{i}}{n} - \theta_{1}\frac{\sum_{i=0}^{n-1} x_{i}}{n} = \overline{y} - \theta_{1}\overline{x}\]
Thus: \[ \theta_{0} = \overline{y} - \theta_{1}\overline{x}\]
Solve for \(\theta_1\): \[\frac{d}{d \theta_1}\text{Error}(\theta_0, \theta_1) = 0\] \[ \frac{d}{d \theta_1}\sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right)^{2} = 0\] \[ 2(-1) \sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right)x_i = 0\] \[ \sum_{i=0}^{n-1}\left(y_{i} - \theta_{0} - \theta_{1}x_{i}\right)x_i = 0\]
Substitute \(\theta_0 = \overline{y} - \theta_{1}\overline{x}\): \[ \sum_{i=0}^{n-1}\left(y_{i}x_i - (\overline{y} - \theta_{1}\overline{x})x_i - \theta_{1}x_{i}^{2}\right) = 0\] \[ \sum_{i=0}^{n-1}\left(y_{i}x_i - \overline{y}x_i + \theta_{1}\overline{x}x_i - \theta_{1}x_{i}^{2}\right) = 0\] \[ \sum_{i=0}^{n-1}y_{i}x_i - \sum_{i=0}^{n-1}\overline{y}x_i + \theta_{1}\sum_{i=0}^{n-1}\overline{x}x_i - \theta_{1}\sum_{i=0}^{n-1}x_{i}^{2} = 0\] \[ \sum_{i=0}^{n-1}\left(y_{i}x_i - \overline{y}x_i\right) + \theta_{1}\sum_{i=0}^{n-1}\left(\overline{x}x_i - x_{i}^{2}\right) = 0\] \[ \sum_{i=0}^{n-1}\left(y_{i}x_i - \overline{y}x_i\right) = -\theta_{1}\sum_{i=0}^{n-1}\left(\overline{x}x_i - x_{i}^{2}\right)\] \[ \theta_{1} = \frac{\sum_{i=0}^{n-1}\left(\overline{y}x_i - y_{i}x_i\right)}{\sum_{i=0}^{n-1}\left(\overline{x}x_i - x_{i}^{2}\right)}\]
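For reference, multiplying the numerator and denominator by \(-1\) and using \(\sum_{i=0}^{n-1} x_i = n\overline{x}\) and \(\sum_{i=0}^{n-1} y_i = n\overline{y}\), this is equivalent to the more familiar covariance-over-variance form: \[ \theta_{1} = \frac{\sum_{i=0}^{n-1}\left(x_i - \overline{x}\right)\left(y_i - \overline{y}\right)}{\sum_{i=0}^{n-1}\left(x_i - \overline{x}\right)^{2}}\]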
4.0 | Python Implementation
Define \(100\) points around \(y = 3x\) with a standard deviation of \(3\):
import numpy as np

# get 100 random points between 0 and 10
x = np.random.uniform(0, 10, 100)
# sample y around the line y = 3x with Gaussian noise (std = 3)
y = 3 * x + np.random.normal(0, 3, 100)
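For reproducible runs, the random draws can be seeded beforehand (optional; the seed value 0 is an arbitrary choice of mine):
np.random.seed(0)  # fix the random state so repeated runs generate the same points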
Implementation of Linear Regression in Python:
# apply linear regression from scratch
def m_linear_regression(x, y):
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # slope from the closed-form solution derived above
    k = np.sum(y_mean * x - y * x) / np.sum(x_mean * x - pow(x, 2))
    # intercept: theta_0 = y_mean - theta_1 * x_mean
    m = y_mean - k * x_mean
    return k, m

k, m = m_linear_regression(x, y)
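As a quick sanity check (a sketch of my own, not part of the original code), the recovered slope and intercept can be compared against NumPy's built-in least squares fit:
# np.polyfit returns [slope, intercept] for a degree-1 fit
k_np, m_np = np.polyfit(x, y, 1)
print(k, m)        # from-scratch estimates
print(k_np, m_np)  # should match closely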
This results in the following graph:
The \(k\) value is off by \(5\%\) and the intercept is shifted upwards by \(0.7\). If I instead define more points sampled from the same initial distribution (meaning that the standard deviation remains the same), the accuracy of the parameters increases.
The following graph is defined by \(1000\) datapoints and a standard deviation of \(3\):
I will try one more time, keeping the standard deviation constant but including \(10,000\) points:
A substantial improvement in accuracy occurred in the transition from \(100\) to \(1000\) points. The model parameters usually do not need to map identically onto reality; using \(1000\) points, one already obtains a percent error of about \(0.67\%\).
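The effect of sample size (and of noise, discussed next) can be checked directly by rerunning the fit over a grid of sample sizes and standard deviations. This is a sketch reusing m_linear_regression; the true slope of \(3\) and the particular grid values are assumptions chosen to mirror the experiments in the text:
# repeat the fit for different sample sizes and noise levels
for sigma in (3, 10):
    for n in (100, 1000, 10000):
        x = np.random.uniform(0, 10, n)
        y = 3 * x + np.random.normal(0, sigma, n)
        k, m = m_linear_regression(x, y)
        error_pct = abs(k - 3) / 3 * 100  # percent error of the slope
        print(f"sigma={sigma}, n={n}: k={k:.3f}, m={m:.3f}, slope error={error_pct:.2f}%")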
Note that including extra points does not always increase the accuracy of the parameters. It only does so if the standard deviation remains constant or decreases. If, for example, the standard deviation were increased to \(10\) with \(100\) points, the result would resemble the image below:
5.0 | Final Remarks
Linear regression is a powerful statistical tool for modeling the relationship between a dependent variable and one or more independent variables. By fitting a linear equation to observed data, we can predict values and understand the underlying trends. In the derivation and implementation, we observed that the accuracy of the linear regression model improves with an increase in the number of data points, provided the noise characteristics remain consistent. This is because more data points provide a better representation of the underlying distribution, reducing the impact of random noise on the model’s parameters. However, increasing noise levels can counteract this improvement. Thus, while more data generally enhances model accuracy, it is crucial to maintain consistent noise levels for reliable predictions.