Linear Regression
Linear regression aims to determine an optimal line that accurately represents a given dataset. Notably, the “best fit” line doesn’t need to traverse all or even the majority of data points. Its key goal is to minimize the vertical distance from the data points. Learn more about linear regression
Collinearity in Datasets
Datasets may contain two or more points lying on the same vertical line, meaning they share the same x-value. However, these points can’t be traversed by a function unless their y-values are also identical. Conversely, functions can accommodate multiple points on a single horizontal line. Remember that a “best fit” line may intersect few or even none of the data points, depending on their distribution.
Perfect Collinearity in Linear Regression
In the case where all data points lie on the same line, such as x values from 1 to 10 and corresponding y values from 2 to 20, the equation for the best-fitting line is straightforward: y=2x+0.
Linear Regression vs Curve-Fitting
Linear regression differs from curve-fitting, a mathematical approach determining a polynomial degree less than or equal to n-1 that traverses all n points in a dataset. Discover the mathematical proof here
Sometimes, a lower degree polynomial suffices. For instance, in a dataset where x equals y for 100 points, the line y=x (a polynomial of degree one) passes through all points. Yet, how well a line “represents” a dataset depends on the variance of the data points, which indicates their collinearity.
Understanding Solution Accuracy
While traditional statistical methods offer exact solutions for linear regression, machine learning algorithms provide approximations that converge to optimal values. Machine learning tools tend to excel with complex, non-linear, multi-dimensional datasets, which are often beyond the scope of linear regression.
Multivariate Analysis Explained
Multivariate analysis, or generalized linear regression in machine learning, extends the concept of linear regression to higher dimensions, forming a hyperplane rather than a line. In this scenario, you must find multiple values (w1, w2,…wn) compared to the simple slope (m) and y-intercept (b) in 2D linear regression.
Going Beyond Linear Regression
While linear regression is powerful, it isn’t always the best fit for all datasets. Other options include quadratic and cubic equations or higher-degree polynomials. However, each choice comes with its trade-offs. Alternatively, a hybrid approach using piece-wise linear functions might be applicable.
Regression involves several key considerations:
- What curve best fits the data and how can we determine this?
- Is there another curve that could be a better fit?
- What exactly does “best fit” mean?
Visual inspection can be helpful for datasets up to three dimensions, but isn’t practical for higher dimensions. However, one proven technique for finding the “best fitting” line for a dataset is by minimizing the Mean Squared Error (MSE), which we will explore in the upcoming sections.
Related Articles