# Python Sparse Random Projections

##### June 5, 2014

This guest post is brought to you by Adrian Rosebrock, PhD, a computer vision expert and creator of pyimagesearch.com, a blog about computer vision and image search. Adrian is an entrepreneur and creator of ID My Pill, an app for identifying prescription pills using your smartphone’s camera, as well as Chic Engine, a fashion search engine that lets you search clothes with pics.

# Intro

There are many data related tasks in which it's desirable or necessary to perform some form of dimensionality reduction.

Reducing the number of dimensions in your data, at a high level, is all about stripping out unimportant parts of your data to expose and amplify the important ones. Ultimately, the goal is to "enhance" a data set to make it easier to process and to improve the quality of insights you produce downstream.

So what techniques should you be using for dimensionality reduction? PCA? Approximate PCA? Kernel PCA?

##### Well, that depends.

See Bubbles' quote on YouTube (Trailer Park Boys, Season 2 Episode 5. NSFW)

if you’re working with text documents, maybe you’ve used SVD and latent semantic analysis. Or, if you’re studying computer vision, classification, or content-based image retrieval, you might have tried sparse dictionary learning.

In one way or another, all these techniques aim to solve for dimensionality reduction; however, most of the time these ones can't cut it for tasks where performance is critical. They aren’t the fastest algorithms, and you’ll likely have to cross-validate some of their parameters.

The good news is that there's another way to perform dimensionality reduction that answers for performance quite well.

...it’s fast

...it's easy to implement (and is likely already implemented in your favorite ML package)

...and it’s dead simple to understand!

# Sparse Random Projections

This post will clue you in on one of my favorite dimensionality reduction techniques: sparse random projections.

While this method can’t be applied in all cases when you are trying to perform dimensionality reduction, this is normally the first algorithm I try because it's super useful in obtaining a baseline reading on accuracy.

In some instances, it turns out that this "baseline" is actually the quickest and easiest without sacrificing the accuracy of my downstream models.

## The Goal:

The goal of this post is to provide you with an example of how to use sparse random projections in Python's scikit-learn package. I'm not going to dive deep into the theoretical aspects of random projections. Instead, I want to provide you with actionable advice and code that you can take and apply to your own datasets.

I’m also going to focus strictly on sparse random projections rather than dense random projections. While both are useful, I find the benefits of sparse dataset representations to be much more interesting and practical.

If you want to read more about the theoretical foundations of random projections, I've included some resources at the end of the post.

## Overview of what we’re going to do

• Apply sparse random projections to classify handwritten digits using Python and scikit-learn.
• What you’ll learn: How to use Python and scikit-learn to apply sparse random projections to a dataset, train a model on the projected data, and evaluate it.
• What you’ll need: Python, NumPy, Matplotlib, and scikit-learn.

## Let's get started!

Random projections are made possible by the Johnson-Lindenstrauss lemma which states that there exists a mapping from a high-dimensional to a low-dimensional Euclidean space, such that the distance between the points is preserved, within some epsilon variance. Or, simply, the goal is to preserve the pairwise distances between any 2 points in your data.

Furthermore, you can actually calculate the resulting dimensionality of your reduction to preserve these pairwise distances. This is totally unlike methods such as PCA where you typically cross-validate the number of components in conjunction with your classifier.

This minimum number of dimensions is important to keep in mind, but as you’ll see, when we apply sparse random projections to our digits dataset, we’ll violate this minimum number of dimensions and still obtain higher accuracy than using the raw data alone.

Remember, the Johnson-Lindenstrauss lemma makes no assumptions regarding the data itself, it only states that exists a mapping from a high-dimensional to low-dimensional space while preserving the pairwise Euclidean distances within some epsilon value.

Depending your application, there are cases when you easily violate this minimum number of dimensions and still obtain an adequately performing classifier. And there are also cases when you’ll need to increase your minimum number of dimensions.

Again, random projections are not suitable for all datasets. There is no “silver bullet” approach to dimensionality reduction.

However, with that said, random projections are (normally) a great starting point to obtain a baseline accuracy.

## An example

We'll be using the scikit-learn digits dataset, which consists of 1,797 samples of handwritten digit images of the numbers 0-9. Each image is 8x8 pixels. Our goal is to predict which digit each image contains.

We'll be treating each 8x8 image as a flattened feature vector of length 64. All random projections will be fitted on this raw pixel feature vector and a model trained on the resulting random projection.

As I mentioned above, we can easily calculate the minimum number of required dimensions to preserve the pairwise distances:

In [1]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

In [2]:
johnson_lindenstrauss_min_dim(1797,eps=0.1)

Out[2]:
6423


Given that we have 1,797 data points, in order to preserve their pairwise Euclidean distances within a tolerance of 0.1 epsilon, our random projection matrix should have 6,423 components.

But we only have 64 features, and we can't "reduce" 64 features to 6,423! Does this mean we can’t apply sparse random projections?

Of course not! It simply means that we can’t make any assumptions regarding the preservation of pairwise distances between data points.

As I’ve mentioned earlier in this post, if accuracy of your downstream classifier is all you’re looking for, don’t be afraid to break the theoretical guarantees!

Let’s jump into some code:

In [3]:
from sklearn.random_projection import SparseRandomProjection
from sklearn.svm import LinearSVC
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn import datasets
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [4]:
accuracies = []
components = np.int32(np.linspace(2, 64, 20))


Box 3 imports the packages that we’ll be using. This is pretty self-explanatory. scikit-learn will perform the sparse random projection, train a classifier, and perform training/testing splits, along with reporting accuracies. matplotlib will visualize the results.

From there, Box 4 initializes our lists of accuracies. We’ll be evaluating 20 different component sizes, equally spaced from 2 to 64.

Next we load the digits data set, perform a training and testing split, train a Linear SVM, and obtain a baseline accuracy:

In [5]:
digits = datasets.load_digits()
split = train_test_split(digits.data, digits.target, test_size = 0.3,
random_state = 42)
(trainData, testData, trainTarget, testTarget) = split

In [6]:
model = LinearSVC()
model.fit(trainData, trainTarget)
baseline = metrics.accuracy_score(model.predict(testData), testTarget)


Again, our baseline accuracy was obtained by training a Linear SVM on the raw pixel values. We have not yet applied our sparse random projection.

That comes next:

In [7]:
# loop over the projection sizes
for comp in components:
# create the random projection
sp = SparseRandomProjection(n_components = comp)
X = sp.fit_transform(trainData)

# train a classifier on the sparse random projection
model = LinearSVC()
model.fit(X, trainTarget)

# evaluate the model and update the list of accuracies
test = sp.transform(testData)
accuracies.append(metrics.accuracy_score(model.predict(test), testTarget))


Let’s break down this code down to make sure we know what’s going on.

• We start looping over our number of components, equally spaced, in the range 2 to 64.
• Then we instantiate our SparseRandomProjection using the current number of components, fit our random projection to the data (which essentially means generating a sparse matrix of values), and then transforming our original training data by projecting the data.
• Now that we have obtained a sparse representation of the data, let’s train a Linear SVM on it.
• Our Linear SVM is now trained; it’s time to see how it performs on the testing data and update our list of accuracies.

At this point all the hard work is done.

We’ve evaluated a series of sparse random projections for varying numbers of components.

Let’s plot our results and see what we have:

In [8]:
# create the figure
plt.figure()
plt.suptitle("Accuracy of Sparse Projection on Digits")
plt.xlabel("# of Components")
plt.ylabel("Accuracy")
plt.xlim([2, 64])
plt.ylim([0, 1.0])

# plot the baseline and random projection accuracies
plt.plot(components, [baseline] * len(accuracies), color = "r")
plt.plot(components, accuracies)

plt.show()


Our baseline accuracy is plotted in red while our sparse random projection + Linear SVM is plotted in blue. The baseline Linear SVM (trained on the raw pixel values of each 8×8 image) obtains roughly 94% accuracy.

We then see that as the size of our sparse random projection increases, we are able to eventually outperform the original Linear SVM when roughly 30 components are used, which is less than half the number of predictors in the original feature space.

Furthermore, these 30 components are also sparse, thus leading to a much more efficient memory representation (and with higher accuracy).

## Summary

In this post, I showed how to utilize sparse random projections for dimensionality reduction.

While we can calculate the minimum number of dimensions required to preserve the pairwise distances between data points within some epsilon, don’t become overly obsessed with this value. The Johnson-Lindenstrauss lemma makes no assumptions regarding the data itself. It simply states that there is a projection from a high-dimensional to low-dimensional space where the pairwise distances can be preserved.

In practice, you should play around with the number of components. As our digit classification example showed, we were able to able to outperform our original dense feature vectors using half the number of sparse predictors. And with the scikit-learn implementation, it only takes two lines of code to perform!

Definitely consider giving sparse random projections a try with your next project. They're seriously fast, memory efficient, and dead simple to use, and the absolute worst case scenario is that you can use them as a baseline accuracy from which to build further experiments.