Generally, Linear Regression is used for predictive analysis. It is a linear approximation of a fundamental relationship between two or more variables.
Main processes of linear regression
- Get sample data
- Design a model that works best for that sample
- Make predictions for the whole population
Main uses of regression analysis
- Finding the strength of predictors
- Forecasting an effect
- Trend forecasting
Some types of linear regression analysis
Simple Linear Regression
One dependent variable (interval or ratio) and one independent variable (interval, ratio, or dichotomous)
Multiple Linear Regression
One dependent variable (interval or ratio) and two or more independent variables (interval, ratio, or dichotomous)
Logistic Regression
One dependent variable (dichotomous) and two or more independent variables (interval, ratio, or dichotomous)
Ordinal Regression
One dependent variable (ordinal) and one or more independent variables (nominal or dichotomous)
Multinomial Regression
One dependent variable (nominal) and one or more independent variables (interval, ratio, or dichotomous)
Types of Variables in Linear Regression
In linear regression, there are two types of variables:
- Dependent Variable
- Independent Variable
The dependent variable is the one we want to predict, while the independent variables are the predictors.
Let’s briefly explain them with the help of an example.
y = F(x1, x2, x3, …, xk)
In the above equation, y is the dependent variable, which is a function of the independent variables x1 to xk.
The population formula of the simple linear regression model is given below:
y = β0 + β1x1 + ε
In the above equation, y is the dependent variable, β0 is the regression constant, β1 is the coefficient that quantifies the effect of the independent variable on the dependent variable, x1 is the sample data for the independent variable, and ε is the error of estimation.
To understand this equation, take an example: income is the dependent variable (y) and education is the independent variable (x1). We expect income to depend on education, with more education tending to go along with higher income.
The error of estimation is the difference between the observed income and the income the regression predicted. On average, the error of estimation is zero.
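The role of the error term can be illustrated with a small simulation. This is only a sketch: the values of beta0 and beta1, the sample size, and the noise level are assumptions chosen for illustration, and np.polyfit stands in for the OLS fit used later in the article.

```python
import numpy as np

# Hypothetical population parameters, chosen only for illustration.
beta0, beta1 = 2.0, 0.5

rng = np.random.default_rng(seed=0)
x1 = rng.uniform(0, 10, size=200)    # sample of the independent variable
eps = rng.normal(0, 1, size=200)     # random error term
y = beta0 + beta1 * x1 + eps         # population model

# Least-squares straight-line fit; polyfit returns [slope, intercept].
b1, b0 = np.polyfit(x1, y, deg=1)

y_hat = b0 + b1 * x1
residuals = y - y_hat                # observed minus predicted

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"mean residual = {residuals.mean():.2e}")
```

When the fitted line includes an intercept, the residuals average out to zero up to floating-point rounding, which is what the mean residual printed at the end shows.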
The simple linear regression equation for the sample is given below.
ŷ = b0 + b1x1
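For reference, the coefficients b0 and b1 of the sample equation ŷ = b0 + b1x1 are usually estimated by ordinary least squares. The standard closed-form estimates are sketched below; the index i over the n observations and the bars for sample means are notation introduced here, not taken from the article.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}_1
```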
Difference between Regression and Correlation
|Regression|Correlation|
|---|---|
|It is used to measure how one variable affects another variable|It describes the relationship between two variables|
|It is used to fit a best line and estimate one variable on the basis of another variable|It is used to show the connection between two variables|
|The dependent and independent variables play different roles|There is no difference between dependent and independent variables|
|One way: regressing y on x differs from regressing x on y|Symmetric: ρ(x, y) = ρ(y, x)|
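The last row of the comparison can be checked numerically. The sketch below uses made-up values and NumPy helpers (np.corrcoef, np.polyfit) rather than anything from the article's data set.

```python
import numpy as np

# Small illustrative data set (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.5])

# Correlation is symmetric: swapping the variables changes nothing.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression is directional: the slope of y on x differs from x on y.
slope_y_on_x = np.polyfit(x, y, deg=1)[0]
slope_x_on_y = np.polyfit(y, x, deg=1)[0]

print("correlations:", r_xy, r_yx)
print("slopes:      ", slope_y_on_x, slope_x_on_y)
```

The two correlations agree, while the two slopes do not; their product equals the squared correlation.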
Python Packages Installation
Several Python libraries are used in our practical example of linear regression.
To list the libraries installed with Anaconda, we run the following command in Anaconda Prompt:
C:\Users\Iliya>conda list
We can also install more libraries in Anaconda with this command:
C:\Users\Iliya>conda install numpy
Before we start the practical example of linear regression in Python, we will discuss its important libraries.
NumPy
It is a Python library that lets us work with multidimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.
pandas
It is a Python library for manipulating and analysing data in tabular form.
Matplotlib
It is a 2D plotting library for Python, designed for visualizing NumPy computations.
SciPy
It is an open source Python library used for scientific and technical computing. It contains modules for optimization, linear algebra, integration, and signal and image processing.
Seaborn
It is a Python data visualization library based on Matplotlib. Seaborn offers a high-level interface for drawing attractive and informative graphics.
Statsmodels
It is a Python package that lets users explore data, estimate statistical models, and run statistical tests.
scikit-learn
It is a free software machine learning library for Python.
Practical example of Simple Linear Regression
Import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
Load the data
Now we load the data from a .csv file saved in the same folder as the regression_example.ipynb file, and check what is inside the file, as shown in the figure.
To show descriptive statistics, we use the describe() method, as shown in the figure.
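The loading and inspection steps might look like the sketch below. The article does not give the file name or the data, so the CSV text here is a made-up stand-in read through io.StringIO; only the column names code and salary follow the text.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the real file name and values
# are not given in the text, so these rows are made up for illustration.
csv_text = """code,salary
1,12000
2,17500
3,24000
4,30500
"""

# With a real file, pass its path to pd.read_csv instead of the StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())       # first rows of the table
print(data.describe())   # count, mean, std, min, quartiles, max
```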
Now we define the dependent and independent variables. In our example, code (allotted to each education level) is the independent variable, whereas salary is the dependent variable.
y = data['salary']
x1 = data['code']
To explore the data as a scatter plot, we pass the horizontal axis first and then the vertical axis; see the figure.
Now we add a constant, which means adding a new column that consists only of 1s.
x = sm.add_constant(x1)
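Why the column of 1s matters can be seen by building the design matrix by hand. This is a sketch with made-up numbers that uses NumPy's least-squares solver in place of statsmodels; the column of ones is what allows the fit to estimate an intercept.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])   # made-up predictor values
y = np.array([3.0, 5.0, 7.0, 9.0])    # constructed so that y = 1 + 2*x1

# Design matrix with a leading column of ones, mirroring what
# sm.add_constant produces for a single predictor.
X = np.column_stack([np.ones_like(x1), x1])

# Least-squares solution of X @ [b0, b1] ~ y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = coef
print(X)
print("b0 =", b0, "b1 =", b1)
```

The coefficient attached to the ones column is the intercept b0, and the coefficient attached to x1 is the slope b1.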
Fit the model with the Ordinary Least Squares (OLS) method, using the dependent variable ‘y’ and the independent variable ‘x’ (which now includes the constant).
results = sm.OLS(y,x).fit()
Finally, we print a summary of the regression.
results.summary()
Now we create a scatter plot,
plt.scatter(x1, y)
then define the regression equation,
yhat = 5914.2857*x1 + 6466.6667
and now plot the regression line against the independent variable, i.e. code (used for education).
fig = plt.plot(x1,yhat, lw=4, c='orange', label ='regression line')
Now, label the x-axis and y-axis.
plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.show()
Now, look at the output in the figure below. This is the complete code.
plt.scatter(x1, y)
yhat = 5914.2857*x1 + 6466.6667
fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.show()
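A variant of the plot that computes the line from a fit instead of typing the coefficients by hand could look like this. It is a sketch under assumptions: the data is made up, np.polyfit is swapped in for the statsmodels fit, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the article's data.
x1 = np.array([1, 2, 3, 4], dtype=float)
y = np.array([12000, 17500, 24000, 30500], dtype=float)

# Fit the line; polyfit returns [slope, intercept] for deg=1.
b1, b0 = np.polyfit(x1, y, deg=1)
yhat = b1 * x1 + b0

fig, ax = plt.subplots()
ax.scatter(x1, y)
line, = ax.plot(x1, yhat, lw=4, c="orange", label="regression line")
ax.set_xlabel("Education", fontsize=20)
ax.set_ylabel("Salary", fontsize=20)
ax.legend()             # shows the 'regression line' label
```

In a notebook, the figure is displayed by the cell itself or by calling plt.show().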
Interpret the Regression Results
Now, run the following lines of code to produce the regression results that we will interpret.
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()
Salary is the dependent variable.
R-squared measures how well the model fits the data. Its values range from 0 to 1, and a higher value indicates a better fit. In our example, the R-squared value is 0.911.
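R-squared can also be computed by hand, which makes its meaning concrete: it is the share of the variation in y that the fitted line explains. The sketch below uses made-up data, so its result is unrelated to the 0.911 reported for the article's data.

```python
import numpy as np

x1 = np.array([1, 2, 3, 4], dtype=float)                  # made-up predictor
y = np.array([12000, 17500, 24000, 30500], dtype=float)   # made-up response

b1, b0 = np.polyfit(x1, y, deg=1)      # [slope, intercept]
y_hat = b0 + b1 * x1

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print("R-squared:", r_squared)
```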
The simple linear regression equation is given by
ŷ = b0 + b1x1
In our example, const, i.e. b0, is 5152.5157
and the coefficient of code, i.e. b1, is 6240.5660.
Std err is the standard error of each coefficient estimate. The lower the standard error, the more precise the estimate.
P > | t | is the p-value. A p-value below 0.05 is conventionally considered statistically significant.
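The quantities in these two columns can be reproduced by hand for the slope. The sketch below uses made-up data and the standard textbook formulas for simple regression; scipy.stats.t.sf supplies the tail probability of the t distribution.

```python
import numpy as np
from scipy import stats

x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # made-up predictor
y = np.array([12.1, 17.4, 24.3, 30.2, 35.8, 42.5])    # made-up response

n = len(x1)
b1, b0 = np.polyfit(x1, y, deg=1)
residuals = y - (b0 + b1 * x1)

# Residual variance with n - 2 degrees of freedom (two estimated coefficients).
s2 = np.sum(residuals ** 2) / (n - 2)
sxx = np.sum((x1 - x1.mean()) ** 2)

se_b1 = np.sqrt(s2 / sxx)                     # standard error of the slope
t_b1 = b1 / se_b1                             # t statistic for H0: beta1 = 0
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)    # two-sided p-value

print("std err:", se_b1, "t:", t_b1, "p-value:", p_b1)
print("significant at the 5% level:", p_b1 < 0.05)
```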
Salary = 5152.5157 + 6240.5660 × code
If code = 2, then the predicted salary is
Salary = 5152.5157 + 6240.5660 × 2 = 17633.6477
Hence, according to our model, the expected salary of an employee whose education is FA (code = 2) is 17633.65. This is the predictive power of linear regression.
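The same calculation can be wrapped in a small helper. The function name predict_salary is hypothetical and introduced only for illustration; the two coefficients are the ones quoted from the regression summary.

```python
B0 = 5152.5157   # intercept (const) quoted from the regression summary
B1 = 6240.5660   # coefficient of code quoted from the regression summary

def predict_salary(code: float) -> float:
    """Return the salary predicted by the fitted line for an education code."""
    return B0 + B1 * code

print(round(predict_salary(2), 4))
```

Calling predict_salary with other codes gives the model's prediction for other education levels, although predictions far outside the observed codes should be treated with caution.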
The null hypothesis of this test is that the coefficient equals zero (H0: β = 0). For the intercept, b0 = 0 means that the regression line crosses the y-axis at the origin, as shown in the figure.
plt.scatter(x1, y)
yhat = 5914.2857*x1 + 0
fig = plt.plot(x1, yhat, lw=4, c='red', label='regression line')
plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.xlim(0)
plt.ylim(0)
plt.show()
If b1 = 0, then ŷ = b0. Graphically, the prediction no longer changes with x1, so this variable would not be considered for the model.
Therefore, we conclude that the regression line is then horizontal and always passes through the intercept value.
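A quick numerical check of the b1 = 0 case is sketched below. The intercept is the value quoted from the summary earlier, while the education codes are arbitrary.

```python
import numpy as np

b0 = 5152.5157                         # intercept quoted from the summary
b1 = 0.0                               # slope under the null hypothesis
x1 = np.array([1.0, 2.0, 3.0, 4.0])    # arbitrary education codes

yhat = b0 + b1 * x1                    # the predictor no longer matters
print(yhat)
```

Every prediction equals b0, which is the horizontal line described above.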
Practical example of Multiple Linear Regression
Import the relevant libraries and load the data
To show descriptive statistics, we use the describe() method, as shown in the figure.
Now we define the dependent and independent variables. In our example, code (allotted to each education level) and year are the independent variables, whereas salary is the dependent variable.
To explore the data as a scatter plot, we pass the horizontal axis first and then the vertical axis, as shown in the figure.
Interpret the Regression Results
Now we can easily compare the results of the regression models with one and with more than one independent variable.