Speeding Up and Benchmarking Logistic Regression With PCA
Introduction :
When a dataset has too many dimensions, learning patterns from it becomes difficult. High dimensionality hurts in two ways: it increases compute and execution time, and it can degrade the quality of the model fit. When the dimensionality is too high, we need a way to reduce it while preserving the original structure of the data. The algorithm discussed in this article does exactly that. It is well known and widely used across a variety of tasks: Principal Component Analysis, or PCA.
The main purpose of principal component analysis is to identify patterns in the data and use them to reduce the dimensionality of the dataset with minimal loss of information.
PCA transforms a high-dimensional dataset into a lower-dimensional subspace expressed in a new coordinate system. In the new coordinate system, the first axis corresponds to the first principal component, which is the component that explains the greatest amount of the variance in the data.
In simple words, principal component analysis is a method of deriving a small set of new variables, known as principal components, from the large set of variables available in a data set. Each principal component is a linear combination of the original variables. Together they capture as much information as possible from the original high-dimensional data and represent it in a new, lower-dimensional space.
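As a minimal sketch of the idea, the snippet below applies scikit-learn's PCA to synthetic data. The data, the variable names and the choice of two components are assumptions made for illustration only; they are not part of the MNIST pipeline in this article.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 points in 3-D that
# mostly vary along a single direction, plus a little noise.
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))

# Project the 3-D data onto its first two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # rows are kept, columns are reduced
print(pca.explained_variance_ratio_)  # share of variance per component
```

The first entry of explained_variance_ratio_ should dominate here, because the data was built to vary mainly along one direction.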
Summary of PCA :
Applications of PCA :
Visualisation
Denoising
Data Compression
Speeding up ML algorithms
Problem Statement :
Speed up the training of a handwritten digit recognition model.
Solution :
We will solve this problem by building a classification pipeline on the MNIST dataset.
About the Dataset :
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
Four files are available on this site:
● train-images-idx3-ubyte.gz: training set images (9912422 bytes)
● train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
● t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
● t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
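The article itself loads the data through scikit-learn, but if you download the four files directly, they can be read with a few lines of Python. The sketch below assumes the files follow the IDX layout described alongside the dataset (a 4-byte header giving the data type and the number of dimensions, then one big-endian 32-bit size per dimension, then the raw unsigned bytes). The function names parse_idx and read_idx_gz are hypothetical helpers, not part of any library.

```python
import gzip
import struct

import numpy as np


def parse_idx(raw: bytes) -> np.ndarray:
    """Parse the bytes of a decompressed IDX file holding unsigned bytes."""
    # Header: two zero bytes, one data-type code, one dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian 32-bit size per dimension follows the header.
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(dims)


def read_idx_gz(path: str) -> np.ndarray:
    """Read a gzip-compressed IDX file such as train-images-idx3-ubyte.gz."""
    with gzip.open(path, "rb") as f:
        return parse_idx(f.read())


# Tiny hand-made example: two 2x2 "images" with pixel values 0..7.
fake = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
images = parse_idx(fake)
print(images.shape)  # (2, 2, 2)
```

With the real files, read_idx_gz would return a three-dimensional array for the image files and a one-dimensional array for the label files.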
The MNIST database of handwritten digits is available on the following website: MNIST Dataset
Train a model with all components
Import necessary packages :
import time
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
Load the Dataset :
# fetch_mldata has been removed from scikit-learn; fetch_openml is its replacement.
# Pass the parameter data_home to choose the directory the data is downloaded to.
# as_frame=False returns NumPy arrays; the labels come back as strings ('0' to '9').
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
The dataset contains 70,000 images, each flattened into 784 features (28 × 28 pixels). The labels form a vector of length 70,000. The images are exposed under the name ‘data’ and the labels under ‘target’.
Split the data into train/test :
# test_size: proportion of the original data that is used for the test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)
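To see what test_size=1/7.0 does to the row counts without downloading anything, here is a small sketch that uses placeholder arrays with the same number of rows as MNIST. The arrays X and y are stand-ins invented for this check and carry no image data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with MNIST's row count; the values do not matter here.
X = np.zeros((70000, 1), dtype=np.uint8)
y = np.zeros(70000, dtype=np.uint8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

# One seventh of 70,000 rows goes to the test set, the rest to training.
print(X_train.shape[0], X_test.shape[0])
```

The resulting sizes match the 60,000 / 10,000 proportions of the original MNIST files, but the split is random, so the rows in each part are not the same as in the official training and test sets.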
Standardize the data :
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
Notice that the scaler is fitted on the training set only and the same transformation is then applied to the test set, so that no information from the test data leaks into the preprocessing step.
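The effect of fitting on the training set only can be checked on a small example. The snippet below is a sketch on synthetic data (the arrays train and test are made up for illustration): the training features end up with mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized because they were scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test features.
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Learn the per-feature mean and standard deviation from the training set only.
scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)
test_std = scaler.transform(test)

print(train_std.mean(axis=0).round(3), train_std.std(axis=0).round(3))
print(test_std.mean(axis=0).round(3), test_std.std(axis=0).round(3))
```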
Initialize a benchmarking dataframe :
Let’s initialise a pandas dataframe that will hold :
Variance : the fraction of the variance of the original data that is retained
N_components : the number of principal components
Timing : the time taken to fit the model on the training set
Accuracy : the accuracy obtained on the test set
We will capture the above attributes from each experiment run.
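The training code further down refers to benchmark_cols and benchmark without showing how they are created. A minimal sketch of that initialisation could look as follows; the exact column labels are an assumption based on the four attributes listed above.

```python
import pandas as pd

# Column labels are assumed from the attribute list above.
benchmark_cols = ['Variance', 'N_components', 'Timing', 'Accuracy']

# Start with an empty dataframe; one row is added per experiment run.
benchmark = pd.DataFrame(columns=benchmark_cols)
print(benchmark)
```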
Train the model with all data: Train a logistic regression on all the features and record the training time and accuracy. Since no dimensionality reduction is applied in this run, the retained variance is 1.0 and the number of components is 784.
variance = 1.0
n_components = train_img.shape[1]
logisticRegr = LogisticRegression(solver = 'lbfgs')
start = time.time()
logisticRegr.fit(train_img, train_lbl)
end = time.time()
timing = end-start
# Predict for Multiple Observations (images) at Once
predicted = logisticRegr.predict(test_img)
# generate evaluation metrics
accuracy = (metrics.accuracy_score(test_lbl, predicted))
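a = dict(zip(benchmark_cols,[variance,n_components,timing,accuracy]))
# DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
benchmark = pd.concat([benchmark, pd.DataFrame([a])], ignore_index=True)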
print(benchmark)
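To repeat this measurement for different amounts of retained variance, the steps can be wrapped in a helper. The sketch below is one possible way to do it, not a definitive implementation: run_experiment is a hypothetical name, scikit-learn's small bundled digits dataset (load_digits, 64 features) stands in for MNIST so that it runs quickly and offline, and max_iter=1000 and time.perf_counter are choices made here rather than taken from the code above. Passing a float between 0 and 1 as n_components asks PCA to keep just enough components to retain that fraction of the variance.

```python
import time

import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def run_experiment(train_x, test_x, train_y, test_y, variance=1.0):
    """Fit logistic regression, optionally after PCA, and return one benchmark row."""
    if variance < 1.0:
        # Fit PCA on the training set only, then project both sets.
        pca = PCA(n_components=variance).fit(train_x)
        train_x, test_x = pca.transform(train_x), pca.transform(test_x)
    clf = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    clf.fit(train_x, train_y)
    timing = time.perf_counter() - start
    return {'Variance': variance, 'N_components': train_x.shape[1],
            'Timing': timing, 'Accuracy': clf.score(test_x, test_y)}


# Stand-in data: the 8x8 digits bundled with scikit-learn, not MNIST.
digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(train_x)
train_x, test_x = scaler.transform(train_x), scaler.transform(test_x)

# One run on all features and one run keeping 95% of the variance.
rows = [run_experiment(train_x, test_x, train_y, test_y, v) for v in (1.0, 0.95)]
benchmark = pd.DataFrame(rows)
print(benchmark)
```

Timings on such a small dataset are noisy, so the numbers are only meaningful as a check that the helper works; the real comparison needs the full MNIST data.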