Simple Progression Towards Simple Linear Regression

**Introduction :**

The goal of this blog post is to get beginners started with the basic concepts of linear regression and help them quickly build their first linear regression model. We will focus mainly on the modeling side. Data cleaning and preprocessing will be covered in detail in an upcoming post.

Linear regression is one of the most fundamental and widely used machine learning algorithms, and it is usually among the first topics people pick up when learning predictive modeling. Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). The dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

A linear relationship can be either positive or negative. A positive relationship between two variables means that an increase in one variable is accompanied by an increase in the other at a constant rate. A negative relationship means that an increase in one variable is accompanied by a decrease in the other at a constant rate.

**Mathematical Explanation :**

A simple linear regression has one independent variable. Mathematically, the line representing a simple linear regression is expressed through a basic equation :

Y = mX + b + e

Here ,

m is the slope

X is the predictor variable

b is the intercept/bias term

Y is the target (dependent) variable

e is the error term
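To make the equation concrete, here is a minimal sketch that generates synthetic data from Y = mX + b + e and then recovers m and b with the closed-form least-squares estimates for a single predictor. The values `true_m = 2.5` and `true_b = 3.0` and the noise level are assumptions chosen purely for illustration; they do not come from the taxi data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters, chosen only for illustration
true_m, true_b = 2.5, 3.0

X = rng.uniform(0, 10, size=200)   # predictor variable
e = rng.normal(0, 1.0, size=200)   # error term
Y = true_m * X + true_b + e        # observed target variable

# Closed-form least-squares estimates for one predictor:
# slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X)
m_hat = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
b_hat = Y.mean() - m_hat * X.mean()

print(f"estimated slope m: {m_hat:.3f}")
print(f"estimated intercept b: {b_hat:.3f}")
```

The estimates should land close to the assumed values, with the gap caused by the error term e.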

Enough theory. Now let’s dive into the implementation of a simple linear regression.

We will use the implementation provided by scikit-learn, a Python machine learning library.

**Problem Statement :**

Predict the fare of a taxi trip given its trip distance.

**Data details**

The data set is originally from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

The label is the identifier of the column you are trying to predict. The identified features are used to predict the label.

**vendor_id**: The ID of the taxi vendor is a feature.

**rate_code**: The rate type of the taxi trip is a feature.

**passenger_count**: The number of passengers on the trip is a feature.

**trip_time_in_secs**: The amount of time the trip took. You won’t know how long the trip takes until after it is completed, so you exclude this column from the model.

**trip_distance**: The distance of the trip is a feature.

**payment_type**: The payment method (cash or credit card) is a feature.

**fare_amount**: The total taxi fare paid is the label.
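The split into features and label described above can be sketched with pandas as follows. The small DataFrame reuses the five preview rows that appear later in this post; the variable names (`sample`, `label_column`, `excluded_columns`) are illustrative and not part of the original code.

```python
import pandas as pd

# The five preview rows shown later in this post
sample = pd.DataFrame({
    "vendor_id": ["CMT"] * 5,
    "rate_code": [1, 1, 1, 1, 1],
    "passenger_count": [1, 1, 1, 1, 1],
    "trip_time_in_secs": [1271, 474, 637, 181, 661],
    "trip_distance": [3.8, 1.5, 1.4, 0.6, 1.1],
    "payment_type": ["CRD", "CRD", "CRD", "CSH", "CRD"],
    "fare_amount": [17.5, 8.0, 8.5, 4.5, 8.5],
})

label_column = "fare_amount"
excluded_columns = ["trip_time_in_secs"]  # not known until the trip is completed

# Everything except the label and the excluded column is a candidate feature
features = sample.drop(columns=[label_column] + excluded_columns)
label = sample[label_column]

print(list(features.columns))
print(label.tolist())
```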

**Tools used** :

pandas, NumPy, Matplotlib, scikit-learn

**Python Implementation with code :**

**Import necessary libraries**

Import the necessary modules from specific libraries.

    from sklearn import linear_model
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

**Load the data set**

Use the pandas module to read the taxi data from the file system, then check the first few records of the data set.

    taxi_train = "data/taxi-fare-train.csv"
    taxi_test = "data/taxi-fare-test.csv"

    tax_train = pd.read_csv(taxi_train)
    tax_train.head()

      vendor_id  rate_code  passenger_count  trip_time_in_secs  trip_distance payment_type  fare_amount
    0       CMT          1                1               1271            3.8          CRD         17.5
    1       CMT          1                1                474            1.5          CRD          8.0
    2       CMT          1                1                637            1.4          CRD          8.5
    3       CMT          1                1                181            0.6          CSH          4.5
    4       CMT          1                1                661            1.1          CRD          8.5

**Select the predictor feature for simple regression and the target variable**

    X = tax_train['trip_distance']
    y = tax_train['fare_amount']

**Check 5-num summary of selected predictor feature**

    X.describe()

    count    728541.000000
    mean          2.741597
    std           3.298091
    min           0.000000
    25%           1.000000
    50%           1.700000
    75%           3.000000
    max          98.700000
    Name: trip_distance, dtype: float64

**Train test split :**

    from sklearn.model_selection import train_test_split

    x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(X, y)
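When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below makes that explicit on a small synthetic array and also sets `random_state`, which is not used in the original code, so that the split is reproducible. The data and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictor and target (illustrative only)
X_demo = np.arange(100, dtype=float)
y_demo = 2.5 * X_demo + 3.0

# test_size=0.25 matches the default; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_te))  # 75 25
```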

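Do some reshaping of the variables. scikit-learn expects the features as a two-dimensional array, and the same arrays are reused for visualisation.

    # Convert the pandas Series to NumPy arrays
    x_training_set, x_test_set, y_training_set, y_test_set = (
        x_training_set.values, x_test_set.values,
        y_training_set.values, y_test_set.values,
    )

    # Reshape each array into a single column
    x_training_set, x_test_set, y_training_set, y_test_set = (
        x_training_set.reshape(-1, 1), x_test_set.reshape(-1, 1),
        y_training_set.reshape(-1, 1), y_test_set.reshape(-1, 1),
    )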
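To see what the reshaping does, here is a minimal sketch using the five trip distances from the data preview. `reshape(-1, 1)` turns a one-dimensional array of n values into an n-by-1 column, which is the `(n_samples, n_features)` layout scikit-learn estimators expect for their inputs. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

distances = pd.Series([3.8, 1.5, 1.4, 0.6, 1.1], name="trip_distance")

as_1d = distances.values        # plain NumPy array, one dimension
as_2d = as_1d.reshape(-1, 1)    # one column; -1 lets NumPy infer the row count

print(as_1d.shape)  # (5,)
print(as_2d.shape)  # (5, 1)
```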

**Do some initial visual inspection between predictor and target variable**

    # So let's plot some of the data
    # - this gives some core routines to experiment with different parameters
    plt.title('Relationship between predictor and target variable')
    plt.scatter(x_training_set, y_training_set, color='black')
    plt.show()

**Training / model fitting**

Fit the model to the selected training data.

    lm = linear_model.LinearRegression()
    lm.fit(x_training_set, y_training_set)

**Model parameters study :**

    from sklearn.metrics import mean_squared_error, r2_score

    model_score = lm.score(x_training_set, y_training_set)
    # Have a look at R sq to give an idea of the fit,
    # Explained variance score: 1 is perfect prediction
    print('R sq: ', model_score)

    y_predicted = lm.predict(x_test_set)

    # The coefficients
    print('Coefficients: ', lm.coef_)
    # The mean squared error
    print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % r2_score(y_test_set, y_predicted))

    ('R sq: ', 0.7729861480277364)
    ('Coefficients: ', array([[2.55423554]]))
    Mean squared error: 21.33
    Variance score: 0.77
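To show what these two metrics measure, the sketch below computes the mean squared error and R squared by hand and compares them with the scikit-learn functions. The `y_true` values are the five fares from the data preview; the `y_pred` values are hypothetical predictions made up for this example, not output of the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([17.5, 8.0, 8.5, 4.5, 8.5])  # fares from the data preview
y_pred = np.array([16.0, 9.0, 8.0, 6.0, 8.0])  # hypothetical predictions

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE (manual, sklearn):", mse_manual, mean_squared_error(y_true, y_pred))
print("R sq (manual, sklearn):", r2_manual, r2_score(y_true, y_pred))
```

An R squared of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.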

**Accuracy report with test data :**

Let’s visualise the goodness of fit, with the predictions drawn as a line.

    # So let's run the model against the test data
    y_predicted = lm.predict(x_test_set)
    plt.title('Comparison of Y values in test and the Predicted values')
    plt.xlabel('Test Set')
    plt.ylabel('Predicted values')
    plt.plot(x_test_set, y_predicted, color='blue', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()

**Prediction :**
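Once a model is fitted, predicting the fare for a new trip is a single call to `predict`. The sketch below is self-contained, so it fits a small model on only the five preview rows rather than the full training set; its coefficients will therefore differ from the ones reported above. The new trip distances and the name `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The five preview rows (trip_distance, fare_amount)
distances = np.array([3.8, 1.5, 1.4, 0.6, 1.1]).reshape(-1, 1)
fares = np.array([17.5, 8.0, 8.5, 4.5, 8.5])

demo_model = LinearRegression().fit(distances, fares)

# Hypothetical new trips, in the same distance unit as the data
new_distances = np.array([[2.0], [5.0]])
predicted_fares = demo_model.predict(new_distances)

for d, f in zip(new_distances.ravel(), predicted_fares):
    print(f"distance {d:.1f} -> predicted fare {f:.2f}")
```

Each prediction is simply the fitted slope times the distance plus the fitted intercept.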


**Algo Advantages :**

- Extremely simple method.
- Gives very good results when the relationship between the independent variables and the dependent variable is close to linear.
- Very easy and intuitive to use and understand.
- Even when it doesn’t fit the data exactly, we can use it to find the nature of the relationship between the two variables.

**Algo Disadvantages :**

- Linear regression is limited to predicting numeric output.
- Very sensitive to anomalies in the data (outliers).
- If we have more parameters than available samples, the model starts to fit the noise rather than the relationship between the variables.
- Regression coefficients can be biased by imbalanced data.
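The sensitivity to outliers mentioned above can be seen in a minimal sketch. It fits a line to ten perfectly linear synthetic points, then fits again after distorting a single target value. The slope 2.5, intercept 3.0, and outlier size of 50 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Ten perfectly linear points: y = 2.5 * x + 3.0 (illustrative values)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2.5 * X.ravel() + 3.0

clean_slope = LinearRegression().fit(X, y).coef_[0]

# Distort a single observation to simulate an outlier
y_outlier = y.copy()
y_outlier[-1] += 50.0

outlier_slope = LinearRegression().fit(X, y_outlier).coef_[0]

print(f"slope without outlier: {clean_slope:.3f}")
print(f"slope with one outlier: {outlier_slope:.3f}")
```

One anomalous point out of ten is enough to roughly double the estimated slope here, because least squares penalises large residuals quadratically.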