# Random Forest Regression and Classification in R and Python

#### by yhat

##### September 29, 2013

We've written about Random Forests a few times before, so I'll skip the sales pitch for why it's a great learning method.

But given how many different random forest packages and libraries are out there, we thought it'd be interesting to compare a few of them. Is there a "best" one? Or, is one better suited for your prediction task?

Take a peek at the really great resources at the bottom or read our earlier posts if you want to whet your appetite for more on this subject. This list from the Butler Analytics blog will get you started if you're keen to explore options.

This is a post exploring how different random forest implementations stack up against one another.

### Random Forests in Python

We wrote this post on random forests in Python back in June. Since then, there have been some serious improvements to the scikit-learn RandomForest and Tree modules.

In the 0.14 release, Gilles Louppe (@glouppe) and the scikit-learn dev team greatly improved both the performance and effectiveness of the RandomForestClassifier and ExtraTreesClassifier.

Click through the link in this tweet to read about some of the ground the scikit-learn guys covered.

This is some pretty incredible stuff, especially when you consider that the ensemble module in scikit-learn is still relatively new.

### Random Forests in R

So how does R fit into this type of evaluation?

The most well established R package for Random Forests is (you guessed it) randomForest.

The package has been around for a while; it's on version 4.6-7 and has some nice features that many other implementations do not (e.g. built-in feature importance and MDS plots).

You can read more about the project on their website, which despite the web 1.0 look and feel, is incredibly informative.

Another interesting Random Forest implementation in R is bigrf. I've only played around with it a bit, but it looks like a very promising project focused on making Random Forests work with larger data sets. It includes built-in parallelization, so you can train in parallel without a lot of manual or complicated setup by the analyst (thank you!).

### Comparing the Two Packages

For our comparison we're going to focus on a few performance metrics:

- Accuracy (for classification)
- Mean Squared Error and $$R^{2}$$ (for regression)
- Training Time

By no means are these the only ones, but many people know these and consider them to be among the more important benchmarks. At the very least, they give us a few quantitative measures for comparing across platforms (Python vs. R).
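As a minimal sketch of how these metrics are computed on the Python side, `sklearn.metrics` covers all three cases directly. The label and prediction values below are made-up toy numbers, not results from the post:

```python
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Classification: fraction of exactly-correct quality labels (toy values).
y_true_cls = [5, 6, 6, 7]
y_pred_cls = [5, 6, 5, 7]
acc = accuracy_score(y_true_cls, y_pred_cls)  # 0.75

# Regression: MSE and R^2 on predicted alcohol content (toy values).
y_true_reg = [9.4, 10.2, 11.0]
y_pred_reg = [9.0, 10.0, 11.5]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # ≈ 0.15
r2 = r2_score(y_true_reg, y_pred_reg)
```

Training time is simply wall-clock time around the `fit` call (e.g. `time.time()` deltas in Python, `system.time()` in R).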

For our data, we're going to use the Wine Quality data set from the UC Irvine Machine Learning Repo. It gives us a nice mix of classification and regression problems to test on.

### Prepping the Data

As with any data project, the first step is getting our data into the right format. R and pandas make these tasks relatively straightforward (lucky for us we have everything in CSV format).
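On the pandas side, the prep boils down to reading the semicolon-delimited UCI files, flagging the wine color, and stacking red and white into one frame. The snippet below is a sketch that uses tiny inline stand-ins for the real `winequality-red.csv` / `winequality-white.csv` files (the column subset shown is an assumption for brevity):

```python
import io

import pandas as pd

# Stand-ins for the UCI wine-quality CSVs, which are semicolon-delimited.
# In practice, point read_csv at winequality-red.csv / winequality-white.csv.
red_csv = "fixed.acidity;density;pH;alcohol;quality\n7.4;0.9978;3.51;9.4;5\n"
white_csv = "fixed.acidity;density;pH;alcohol;quality\n7.0;1.0010;3.00;8.8;6\n"

red = pd.read_csv(io.StringIO(red_csv), sep=";")
white = pd.read_csv(io.StringIO(white_csv), sep=";")

# Flag the wine color, then stack both sets into a single frame.
red["is_red"] = 1
white["is_red"] = 0
wine = pd.concat([red, white], ignore_index=True)
```

The equivalent R prep is a couple of `read.csv(..., sep=";")` calls followed by `rbind`.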

### Multilabel Classification

For our multilabel classification test, we're going to try to predict the quality attribute given to each bottle of wine. We're going to use the same hyper-parameters for both models (the same ones used in the scikit-learn test above).
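The Python half of the test looks roughly like the sketch below. It uses synthetic data as a stand-in for the wine features and quality labels, and the hyper-parameter values shown are illustrative assumptions, not the exact settings from the benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine feature matrix and quality labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative hyper-parameters; the post's actual settings aren't shown here.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```

The R side is the analogous `randomForest(quality ~ ., data=train, ntree=100)` call with accuracy computed on the held-out set.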

Running the tests, you can see that these classifiers perform nearly the same. Any error can be attributed to randomness/noise. This isn't terribly surprising given the similarity of the implementations.

What is surprising, though, is how much faster the R version is than the Python version: R's training time was more than 3X faster than Python's.

### Regression

So here's where things start to get a little funky. I decided to run a simple regression test: predicting the alcohol content of a given wine based on color (is_red), fixed.acidity, density, and pH. Pretty straightforward.

When I built the regression models for both Python and R, I got totally different results. The scikit-learn version produced an $$R^{2}$$ value of ~0.72, whereas the R version's was ~0.63. In addition, the MSE was 0.64 for R and 0.42 for Python.
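For reference, the Python regression setup follows the same pattern as the classification test. The sketch below substitutes a synthetic four-column feature matrix for the real is_red, fixed.acidity, density, and pH predictors, with a made-up noisy relationship standing in for the true alcohol values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four predictors used in the post:
# is_red, fixed.acidity, density, and pH.
rng = np.random.RandomState(0)
X = rng.rand(400, 4)
# Alcohol content as a noisy function of the predictors (assumed for the demo).
y = 9 + 2 * X[:, 1] - 3 * X[:, 2] + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
```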

### The n_jobs Feature

The feature that really makes me partial to using scikit-learn's Random Forest implementation is the n_jobs parameter. Specifying n_jobs will automatically parallelize the training of your RandomForest. The randomForest package in R doesn't have an equivalent feature (although the bigrf package does). It's hard to beat instant parallelization without impacting your workflow in any way.
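Here's what that looks like in practice; `n_jobs=-1` is a documented scikit-learn convention meaning "use all available cores" (the data below is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_jobs=-1 uses all available cores; individual trees are grown
# in parallel with no other change to the code.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
```

Dropping the `n_jobs` argument (or setting it to 1) gives you the serial behavior back, which is the whole appeal: parallelism is one keyword argument, not a rework of your pipeline.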

### Takeaways

- For multilabel classification, the R and Python implementations performed nearly identically, but R trained more than 3X faster.
- For regression, the two produced noticeably different results ($$R^{2}$$ of ~0.72 for scikit-learn vs. ~0.63 for R).
- scikit-learn's n_jobs parameter makes parallel training effortless; in R you'd need bigrf for comparable convenience.
