Random Forest Regression and Classification in R and Python

by yhat


We've written about Random Forests a few times before, so I'll skip the hype about why they're such a great learning method.

But given how many different random forest packages and libraries are out there, we thought it'd be interesting to compare a few of them. Is there a "best" one? Or, is one better suited for your prediction task?

Take a peek at the great resources at the bottom, or read our earlier posts, if you want to whet your appetite for more on this subject. This list from the Butler Analytics blog will get you started if you're keen to explore your options.

This is a post exploring how different random forest implementations stack up against one another.

Random Forests in Python

We wrote this post on random forests in Python back in June. Since then, there have been some serious improvements to the scikit-learn RandomForest and Tree modules.

In the 0.14 release, Gilles Louppe (@glouppe) and the scikit-learn dev team greatly improved both the performance and effectiveness of the RandomForestClassifier and ExtraTreesClassifier.


This is some pretty incredible stuff, especially when you consider that the ensemble module in scikit-learn is still relatively new.

Random Forests in R

So how does R fit into this kind of evaluation?

The most well established R package for Random Forests is (you guessed it) randomForest.

The package has been around for a while; it's on version 4.6-7 and has some nice features that many other implementations do not (e.g. built-in feature importance and MDS plots).

You can read more about the project on its website, which, despite the web 1.0 look and feel, is incredibly informative.
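To give a flavor of those built-ins, here's a minimal sketch using R's bundled iris data; the ntree value is an arbitrary placeholder, and proximity = TRUE is needed for the MDS plot:

```r
library(randomForest)

# Fit a forest, keeping importance scores and the proximity matrix
rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                   importance = TRUE, proximity = TRUE)

importance(rf)             # per-feature importance measures
varImpPlot(rf)             # dot chart of the same
MDSplot(rf, iris$Species)  # MDS plot of the proximity matrix
```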

Another interesting Random Forest implementation in R is bigrf. I've only played around with it a bit, but it looks like a very promising project focused on making Random Forests work with larger data sets. It includes built-in parallelization, so forests learn in parallel without a lot of manual or complicated setup by the analyst (thank you!).

Comparing the Two Packages

For our comparison we're going to focus on a few performance metrics:

  • Accuracy (for classification)
  • Mean Squared Error and \(R^2\) (for regression)
  • Training Time

These are by no means the only metrics worth looking at, but they're widely known and considered among the more important benchmarks. At the very least, they give us a few quantitative measures for comparing across platforms (Python vs. R).

For our data, we're going to use the Wine Quality data set from the UC Irvine Machine Learning Repository. It gives us a nice mix of classification and regression problems to test on.

Prepping the Data

As with any data project, the first step is getting our data into the right format. R and pandas make these tasks relatively straightforward (lucky for us, everything is already in CSV format).

Data prep in R
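A minimal sketch of the R side, with a couple of assumptions: the red- and white-wine files come straight from the UCI repository (they're semicolon-delimited), and the is_red flag is added here because the regression test below uses it:

```r
# Read the two semicolon-delimited files from the UCI repository
red   <- read.csv("winequality-red.csv", sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")

# Flag the color, then stack the two data frames into one
red$is_red   <- 1
white$is_red <- 0
wine <- rbind(red, white)

# Treat quality as a factor for the classification task
wine$quality <- as.factor(wine$quality)
```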

Data prep in Python
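And the equivalent sketch in pandas, under the same assumptions:

```python
import pandas as pd

# Read the two semicolon-delimited files from the UCI repository
red = pd.read_csv("winequality-red.csv", sep=";")
white = pd.read_csv("winequality-white.csv", sep=";")

# Flag the color, then stack the two frames into one
red["is_red"] = 1
white["is_red"] = 0
wine = pd.concat([red, white], ignore_index=True)
```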

Multiclass Classification

For our multiclass classification test, we're going to try to predict the quality attribute given to each bottle of wine. We're going to use the same hyper-parameters for both models (the same ones used in the scikit-learn test above).
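Here's a minimal sketch of the scikit-learn side; the wine frame comes from the prep step above, while the split parameters and the n_estimators/max_features values are illustrative placeholders rather than the exact settings from the original run:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Everything except the label we're predicting
features = wine.drop("quality", axis=1)
labels = wine["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

# Mirror these hyper-parameters in the R call, e.g.
# randomForest(quality ~ ., data = train, ntree = 500, mtry = 3)
clf = RandomForestClassifier(n_estimators=500, max_features=3)
clf.fit(X_train, y_train)
print("accuracy: %.3f" % clf.score(X_test, y_test))
```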

Running the tests, you can see that the two classifiers perform nearly identically; any difference can be attributed to randomness/noise. This isn't terribly surprising given the similarity of the implementations.

What is surprising, though, is how much faster the R version is than the Python version: R's training time was more than 3x faster than Python's.


Regression

So here's where things start to get a little funky. I decided to run a simple regression test: predicting the alcohol content of a given wine based on color (is_red), fixed.acidity, density, and pH. Pretty straightforward.
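For reference, here's a minimal sketch of the scikit-learn side of that test; the column names match the raw UCI file (pandas keeps the spaces), and the split and n_estimators values are again placeholders:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Column names as they appear in the raw UCI file
X = wine[["is_red", "fixed acidity", "density", "pH"]]
y = wine["alcohol"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg = RandomForestRegressor(n_estimators=500)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
print("R^2: %.3f" % r2_score(y_test, preds))
print("MSE: %.3f" % mean_squared_error(y_test, preds))
```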

When I built the regression models for both Python and R, I got totally different results. The scikit-learn version produced an \(R^2\) value of ~0.72, whereas the R version's was ~0.63. In addition, the MSE was 0.64 for R and 0.42 for Python.

The n_jobs Feature

The feature that really makes me partial to scikit-learn's Random Forest implementation is the n_jobs parameter. Specifying n_jobs automatically parallelizes the training of your RandomForest across that many cores. The randomForest package in R doesn't have an equivalent feature (although the bigrf package does). It's hard to beat parallelization that doesn't impact your workflow in any way.
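A minimal sketch of what that looks like in practice; X_train and y_train are the training split from the classification test above, and n_jobs=-1 uses all available cores:

```python
from sklearn.ensemble import RandomForestClassifier

# Identical to the earlier classifier, except the trees are now
# trained in parallel across every available core
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, y_train)
```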

