Simple Linear Regression [Case Study]

By Sudhanshu Kumar on September 15, 2018

Simple Progression Towards Simple Linear Regression

Introduction :

The goal of this blog post is to get beginners started with the basic concepts of linear regression and to help them quickly build their first linear regression model. We will focus mainly on the modeling side. Data cleaning and preprocessing will be covered in detail in an upcoming post.

Linear regression is one of the most fundamental and widely used machine learning algorithms, and it is usually among the first topics people pick up when learning predictive modeling. Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line, also known as the regression line. The dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

A linear relationship can be either positive or negative. A positive relationship between two variables means that a unit increase in one variable increases the other by a constant amount. A negative relationship between two variables means that a unit increase in one variable decreases the other by a constant amount.
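To make the idea of positive and negative relationships concrete, here is a minimal sketch using NumPy. The toy arrays are made up for illustration and are not part of the taxi data; the sign of the correlation coefficient tells us the direction of the relationship.

```python
import numpy as np

# Hypothetical toy data: y_up rises with x, y_down falls with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # each unit of x adds 2 to y
y_down = -2.0 * x + 11.0   # each unit of x removes 2 from y

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is corr(x, y).
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print("positive relationship, r =", r_up)
print("negative relationship, r =", r_down)
```

Because both toy series lie exactly on a straight line, the coefficients land at the two extremes of the range; real data will fall somewhere in between.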

Mathematical Explanation :

A simple linear regression has one independent variable. Mathematically, the line representing a simple linear regression is expressed through a basic equation:

Y = mX + b + e

Here,
m is the slope
X is the predictor variable
b is the intercept/bias term
Y is the target variable
e is the error term, the part of Y that the line mX + b does not explain
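As a rough sketch of how m and b can be estimated, the snippet below applies the ordinary least squares formulas by hand to the five sample rows shown later in this post (trip_distance as X, fare_amount as Y). It is only an illustration on a tiny sample, not the way the model in this post is actually fitted; np.polyfit is used purely as a cross-check.

```python
import numpy as np

# Five (trip_distance, fare_amount) pairs taken from the sample rows in this post.
x = np.array([3.8, 1.5, 1.4, 0.6, 1.1])
y = np.array([17.5, 8.0, 8.5, 4.5, 8.5])

# Ordinary least squares for one predictor:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
x_centered = x - x.mean()
y_centered = y - y.mean()
m = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
b = y.mean() - m * x.mean()

# The error term e is whatever the fitted line does not explain.
e = y - (m * x + b)

# Cross-check against NumPy's degree-1 polynomial fit, which returns [slope, intercept].
m_check, b_check = np.polyfit(x, y, 1)

print("slope m:", m)
print("intercept b:", b)
print("residuals e:", e)
```

With an intercept in the model, the residuals of a least squares fit sum to (numerically) zero, which is a quick sanity check on the arithmetic.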

Enough theory. Now let’s dive into the implementation of a simple linear regression.

We will use the implementation provided by scikit-learn, a Python machine learning library.

Problem Statement :

Predict the fare of a taxi trip given the trip distance.

Data details

The data sets originally come from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

The label is the column you are trying to predict. The features are the columns used to predict the label.

vendor_id: The ID of the taxi vendor is a feature.

rate_code: The rate type of the taxi trip is a feature.

passenger_count: The number of passengers on the trip is a feature.

trip_time_in_secs: The amount of time the trip took. You won’t know how long the trip takes until after it is completed, so you exclude this column from the model.

trip_distance: The distance of the trip is a feature.

payment_type: The payment method (cash or credit card) is a feature.

fare_amount: The total taxi fare paid is the label.
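The split between features and label described above could be sketched in pandas as follows. The small DataFrame is built by hand from three of the sample rows shown later in this post, and the variable names rows, features and label are illustrative only.

```python
import pandas as pd

# Three rows copied from the sample shown in this post; the real file is much larger.
rows = pd.DataFrame({
    "vendor_id": ["CMT", "CMT", "CMT"],
    "rate_code": [1, 1, 1],
    "passenger_count": [1, 1, 1],
    "trip_time_in_secs": [1271, 474, 637],
    "trip_distance": [3.8, 1.5, 1.4],
    "payment_type": ["CRD", "CRD", "CRD"],
    "fare_amount": [17.5, 8.0, 8.5],
})

# The label is the column we want to predict.
label = rows["fare_amount"]

# Drop the label, and also trip_time_in_secs, which is unknown before the trip ends.
features = rows.drop(columns=["fare_amount", "trip_time_in_secs"])

print(list(features.columns))
```

The simple regression in this post goes one step further and keeps only trip_distance as the single predictor.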

Tools used :
pandas, NumPy, Matplotlib, scikit-learn

Python Implementation with code :

Import necessary libraries

Import the necessary modules from specific libraries.

from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Load the data set

Use the pandas module to read the taxi data from the file system, then check the first few records of the data set.

taxi_train = "data/taxi-fare-train.csv"
taxi_test = "data/taxi-fare-test.csv"

tax_train = pd.read_csv(taxi_train)
tax_train.head()

   vendor_id  rate_code  passenger_count  trip_time_in_secs  trip_distance  payment_type  fare_amount
0  CMT        1          1                1271               3.8            CRD           17.5
1  CMT        1          1                474                1.5            CRD           8.0
2  CMT        1          1                637                1.4            CRD           8.5
3  CMT        1          1                181                0.6            CSH           4.5
4  CMT        1          1                661                1.1            CRD           8.5

Select the predictor feature for simple regression and the target variable

X = tax_train['trip_distance']

y = tax_train['fare_amount']

Check the summary statistics (including the five-number summary) of the selected predictor feature

X.describe()

count    728541.000000
mean          2.741597
std           3.298091
min           0.000000
25%           1.000000
50%           1.700000
75%           3.000000
max          98.700000
Name: trip_distance, dtype: float64

Train test split :

from sklearn.model_selection import train_test_split


x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(X,y)
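When neither test_size nor train_size is passed, train_test_split holds out 25% of the rows for testing, and it shuffles the rows randomly, so the split changes from run to run. The sketch below, on made-up data, shows one way to make the split explicit and repeatable; the value 42 for random_state is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 points on the line y = 2.5x + 3.
X_demo = np.arange(100, dtype=float)
y_demo = 2.5 * X_demo + 3.0

# Spell out the 75/25 split and fix the seed so the split is reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Splitting again with the same seed gives exactly the same rows.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(len(x_tr), len(x_te))
print(np.array_equal(x_te, x_te2))
```

Fixing the seed is mainly useful for tutorials and debugging, where you want the reported metrics to match between runs.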

Do some reshaping of the variables for model fitting and visualisation

x_training_set, x_test_set, y_training_set, y_test_set = x_training_set.values, x_test_set.values, y_training_set.values, y_test_set.values

x_training_set, x_test_set, y_training_set, y_test_set = x_training_set.reshape(-1, 1), x_test_set.reshape(-1, 1), y_training_set.reshape(-1, 1), y_test_set.reshape(-1, 1)
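The reason for the reshape is that scikit-learn estimators expect the predictor as a 2D array of shape (n_samples, n_features), while a single pandas column converts to a 1D array. A small sketch with made-up values:

```python
import numpy as np

# A single feature pulled out of a DataFrame arrives as a 1D array.
distances = np.array([3.8, 1.5, 1.4, 0.6, 1.1])

# reshape(-1, 1) means "one column, and work out the number of rows for me".
distances_2d = distances.reshape(-1, 1)

print(distances.shape)     # 1D: one axis only
print(distances_2d.shape)  # 2D: n_samples rows, 1 feature column
```

The target does not strictly need this treatment. Reshaping it into a column as well is why the fitted coefficient is later printed as a nested 2D array rather than a flat one.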

Do some initial visual inspection between predictor and target variable

# Plot the training data to inspect the relationship visually
plt.title('Relationship between predictor and target variable')
plt.xlabel('trip_distance')
plt.ylabel('fare_amount')
plt.scatter(x_training_set, y_training_set, color='black')
plt.show()

Training / model fitting

Fit the model to the training data

lm = linear_model.LinearRegression()

lm.fit(x_training_set,y_training_set)

Model parameters study :

from sklearn.metrics import mean_squared_error, r2_score

# R squared on the training set gives an idea of the fit (1 is a perfect fit)
model_score = lm.score(x_training_set, y_training_set)
print('R sq: ', model_score)

# Predict on the held-out test set
y_predicted = lm.predict(x_test_set)

# The coefficients (the slope of the fitted line)
print('Coefficients: ', lm.coef_)

# The mean squared error on the test set
print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))

# R squared on the test set: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test_set, y_predicted))

R sq:  0.7729861480277364
Coefficients:  [[2.55423554]]
Mean squared error: 21.33
Variance score: 0.77
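To see what these two metrics measure, here is a minimal sketch that computes them by hand with NumPy. The actual values are the five fares from the sample rows; the predicted values are invented for illustration and do not come from the fitted model.

```python
import numpy as np

y_true = np.array([17.5, 8.0, 8.5, 4.5, 8.5])   # fares from the sample rows
y_hat = np.array([16.0, 9.0, 8.0, 6.0, 8.0])    # hypothetical predictions

# Mean squared error: the average squared gap between actual and predicted values.
mse = np.mean((y_true - y_hat) ** 2)

# R squared: 1 minus (unexplained variation / total variation around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Root mean squared error is in the same units as the target (here, fare units).
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
print("R squared:", r2)
```

Read this way, the mean squared error of 21.33 reported above corresponds to a typical error of roughly 4.6 fare units per trip, and the R squared of 0.77 says that trip distance alone accounts for about three quarters of the variation in fares.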

Accuracy report with test data :

Let’s visualise the goodness of fit by plotting the predictions as a line

# Run the model against the test data
y_predicted = lm.predict(x_test_set)

# Draw the fitted line over the test inputs
plt.title('Comparison of Y values in test and the Predicted values')
plt.xlabel('Test Set')
plt.ylabel('Predicted values')
plt.plot(x_test_set, y_predicted, color='blue', linewidth=3)

# Hide the axis tick marks
plt.xticks(())
plt.yticks(())
plt.show()

Prediction :
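Once a model is fitted, predicting the fare for a new trip is a single call to predict. The sketch below is self-contained, so it fits a small model on the five sample rows from this post instead of the full training set; the new distances are made up, and the numbers it produces will differ from those of the model trained above.

```python
import numpy as np
from sklearn import linear_model

# Five (trip_distance, fare_amount) pairs from the sample rows in this post.
x_sample = np.array([3.8, 1.5, 1.4, 0.6, 1.1]).reshape(-1, 1)
y_sample = np.array([17.5, 8.0, 8.5, 4.5, 8.5])

lm_demo = linear_model.LinearRegression()
lm_demo.fit(x_sample, y_sample)

# Hypothetical new trips; the input must be 2D, one row per trip.
new_distances = np.array([[2.0], [5.0]])
predicted_fares = lm_demo.predict(new_distances)

# The same numbers by hand: fare = slope * distance + intercept.
manual_fares = lm_demo.coef_[0] * new_distances[:, 0] + lm_demo.intercept_

print(predicted_fares)
print(manual_fares)
```

With the model from this post you would call lm.predict on a column of trip distances in exactly the same way.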

Algo Advantages :

Extremely simple method
When the relationship between the independent variables and the dependent variable is almost linear, it gives good results.

Very easy and intuitive to use and understand
Even when it doesn’t fit the data exactly, we can use it to find the nature of the relationship between the two variables.

Algo Disadvantages :

Linear regression is limited to predicting numeric output.
Very sensitive to the anomalies in the data (or outliers)
If the number of parameters is greater than the number of samples available, the model starts to fit the noise rather than the relationship between the variables.
Regression coefficients are biased by the data imbalance.

Sudhanshu Kumar

Data Scientist at Verizon Labs
