Random Forest Regression and Classification in R and Python

by yhat

We've written about Random Forests a few times before, so I'll skip the hot-talk for why it's a great learning method.

But given how many different random forest packages and libraries are out there, we thought it'd be interesting to compare a few of them. Is there a "best" one? Or, is one better suited for your prediction task?

Take a peek at the really great resources at the bottom or read our earlier posts if you want to wet your beak w/ more on this subject. This list from the Butler Analytics blog will get you started if you're keen to explore options.

This is a post exploring how different random forest implementations stack up against one another.

Random Forests in Python

We wrote this post on random forests in Python back in June. Since then, there have been some serious improvements to the scikit-learn RandomForest and Tree modules.

In the 0.14 release, Gilles Louppe (@glouppe) and the scikit-learn dev team greatly improved both the performance and effectiveness of the RandomForestClassifier and ExtraTreesClassifier.

Click thru the link in this tweet to read about some of the ground the scikit-learn guys covered.

This is some pretty incredible stuff. Especially when you consider that the ensemble module in scikit-learn is still relatively new.

Random Forests in R

So how does R fit into this type of evaluation?

The most well established R package for Random Forests is (you guessed it) randomForest.

The package has been around for a while; it's on version 4.6-7 and has some nice features that many other packages lack (e.g. built-in feature importance and MDS plots).

You can read more about the project on their website, which despite the web 1.0 look and feel, is incredibly informative.

Another interesting Random Forest implementation in R is bigrf. I've only played around with it a bit, but it looks like a very promising project focused on making Random Forests work w/ larger data sets. It includes built-in parallelization, so you can train in parallel w/o a lot of manual or complicated setup by the analyst (thank you!).

Comparing the Two Packages

For our comparison we're going to focus on a few performance metrics:

  • Accuracy (for classification)
  • Mean Squared Error and \(R^{2} \) (for regression)
  • Training Time

By no means are these the only ones. But many people know these and consider them to be among the more important benchmarks. They'll at least give us a few quantitative measures for comparing across platforms (Python vs. R).
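To make the three quality metrics concrete, here is a minimal sketch of how each one is computed. The function names are ours, chosen for illustration; in practice you'd likely reach for the equivalents built into scikit-learn or R.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the target scores an \(R^{2} \) of 0, and a perfect model scores 1, which is why it's a handy companion to the raw MSE.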

For our data, we're going to use the Wine Quality data set from the UC Irvine Machine Learning Repo. It gives us a nice mix of classification and regression problems to test on.

Prepping the Data

As with any data project, the first step is getting our data into the right format. R and pandas make these tasks relatively straightforward (lucky for us we have everything in CSV format).

Data prep in R

Data prep in Python
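A minimal sketch of what the pandas side of the prep might look like. It assumes the two semicolon-delimited files as distributed by the UCI repository (one for red wines, one for white); the `load_wine` helper and the inline sample rows are ours, and the sample values are made-up placeholders so the snippet runs without a download.

```python
import io

import pandas as pd


def load_wine(red_source, white_source):
    """Combine the red and white files into one frame with an is_red flag."""
    # The UCI wine quality CSVs use ';' as the delimiter.
    red = pd.read_csv(red_source, sep=";")
    white = pd.read_csv(white_source, sep=";")
    red["is_red"] = 1
    white["is_red"] = 0
    df = pd.concat([red, white], ignore_index=True)
    # Use R-style names (e.g. fixed.acidity) so both languages share columns.
    df.columns = [c.replace(" ", ".") for c in df.columns]
    return df


# Placeholder rows for illustration only: NOT real measurements, and only
# a subset of the real columns.
RED_SAMPLE = '"fixed acidity";"density";"pH";"alcohol";"quality"\n7.0;0.99;3.5;9.5;5\n8.0;0.99;3.2;10.0;6\n'
WHITE_SAMPLE = '"fixed acidity";"density";"pH";"alcohol";"quality"\n6.5;0.99;3.1;11.0;6\n7.2;1.00;3.0;9.0;5\n'

df = load_wine(io.StringIO(RED_SAMPLE), io.StringIO(WHITE_SAMPLE))
print(df.head())
```

With the real data you'd pass the two file paths instead of the `StringIO` buffers.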

Multiclass Classification

For our multiclass classification test (each wine gets exactly one quality label) we're going to try and predict the quality attribute given to each bottle of wine. We're going to use the same hyper-parameters for both the models (same as used in the scikit-learn test above).
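Here's a rough sketch of the scikit-learn half of such a test: fit a RandomForestClassifier, time the fit, and score accuracy on held-out rows. The `benchmark_classifier` helper, the 75/25 split, and `n_estimators=50` are placeholder assumptions rather than the settings actually used in the post, and the synthetic data from `make_classification` merely stands in for the wine features.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def benchmark_classifier(X, y, n_estimators=50, seed=0):
    """Return (held-out accuracy, training time in seconds)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    start = time.perf_counter()
    clf.fit(X_train, y_train)  # only the fit is timed
    train_seconds = time.perf_counter() - start
    acc = accuracy_score(y_test, clf.predict(X_test))
    return acc, train_seconds


# Synthetic stand-in for the wine data: 11 features, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=6, n_classes=3, random_state=0
)
acc, train_seconds = benchmark_classifier(X, y)
print(f"accuracy={acc:.3f}  train_time={train_seconds:.3f}s")
```

To compare fairly against R, you'd want the same number of trees, the same train/test rows, and timing that covers only the fit on both sides.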

Running the tests, you can see that these classifiers perform nearly the same. Any difference can be attributed to randomness/noise. This isn't terribly surprising given the similarity of the implementations.

What is surprising, though, is how much faster the R version is than the Python version. The R model trained more than 3x faster than the Python one.

Regression

So here's where things start to get a little funky. I decided to do a simple regression test: predicting the alcohol content of a given wine based on color (is_red), fixed.acidity, density, and pH. Pretty straightforward.

When I built the regression models for both Python and R, I got totally different results. The scikit-learn version produced an \(R^{2} \) value of ~0.72, whereas the R version was ~0.63. In addition, the MSE was 0.64 for R and 0.42 for Python.
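We didn't pin down the cause of the gap, but one candidate worth checking (an assumption on our part, not something verified here) is a difference in defaults. For regression, R's randomForest tries roughly one third of the features at each split (`mtry = max(floor(p/3), 1)`), while scikit-learn's RandomForestRegressor considers all of them unless told otherwise. The sketch below scores both settings side by side on synthetic data; `score_regressor` and its arguments are hypothetical names for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def score_regressor(X, y, max_features, n_estimators=100, seed=0):
    """Fit a forest with the given max_features; return held-out R^2 and MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    reg = RandomForestRegressor(
        n_estimators=n_estimators, max_features=max_features, random_state=seed
    )
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)
    return {"r2": r2_score(y_test, pred), "mse": mean_squared_error(y_test, pred)}


# Synthetic stand-in with 4 predictors, like the 4 used in the regression test.
X, y = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=0)
n_features = X.shape[1]

# Float 1.0 means "all features" (a fraction); int 1 means "one feature".
all_features = score_regressor(X, y, max_features=1.0)
one_third = score_regressor(X, y, max_features=max(n_features // 3, 1))

print("all features :", all_features)
print("p/3 features :", one_third)
```

If the two settings produce noticeably different scores on the real wine data, aligning them would be a sensible first step before reading much into the R vs. Python numbers.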

The n_jobs Feature

The feature that really makes me partial to using scikit-learn's Random Forest implementation is the n_jobs parameter. Specifying n_jobs will automatically parallelize the training of your RandomForest. The randomForest package in R doesn't have an equivalent feature (although the bigrf package does). It's hard to beat instant parallelization that doesn't change your workflow in any way.
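As a quick illustration, the only change needed is the constructor argument. This sketch times a single-core fit against one that uses every available core (`n_jobs=-1`); the data is synthetic and the tree count is an arbitrary choice, so treat any timings you see as machine-dependent.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; sizes are arbitrary.
X, y = make_classification(n_samples=1000, n_features=11, random_state=0)


def timed_fit(n_jobs):
    """Fit a forest with the given n_jobs; return (model, seconds)."""
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    return clf, time.perf_counter() - start


serial_model, serial_seconds = timed_fit(n_jobs=1)     # one core
parallel_model, parallel_seconds = timed_fit(n_jobs=-1)  # all available cores

print(f"n_jobs=1:  {serial_seconds:.3f}s")
print(f"n_jobs=-1: {parallel_seconds:.3f}s")
```

On small data sets the overhead of spinning up workers can eat the gains, so the benefit tends to show up with more trees or more rows.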

Takeaways