R and pandas and what I've learned about each
In step with our recent article about essential R packages, this post explores tools for data analysis in Python.
what is pandas?
pandas is the utility belt for data analysts using Python. The package centers around the pandas DataFrame, a two-dimensional data structure with indexable rows and columns. It has effectively taken the best parts of base R and of R packages like plyr and reshape2 and consolidated them into a single library. It has lots of features (see library highlights). pandas gets its name from panel data, an econometrics term for multidimensional structured datasets (McKinney 5., 2013).
pandas has a lot in common with R (see the pandas comparison with R), and as someone who's familiar with R and Python (but not specifically pandas), I've found pandas to be extremely easy to use. This is a post about R and pandas and about what I've learned about each.
Munging and Plotting in Python
plyr-esque features in Python
Few tools hold a candle to pandas when it comes to Split-Apply-Combine operations. pandas groupby enables transformations, aggregations, and easy-access plotting functions. Virtually anything you can do with R's plyr package has a pandas equivalent.
One thing I like better about groupby than, say, ddply, is the ability to perform an operation in multiple steps. pandas lets you perform the group part on one line, followed by the apply part on the next. This allows you to inspect the combined results on a third line, giving you visibility into what's going on under the hood.
Additionally, in my experience pandas is faster than plyr. In some instances I found equivalent operations to be more than 4x faster using pandas' groupby over plyr's ddply.
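The multi-step groupby workflow described above can be sketched roughly as follows. The sales table and its column names are made up for illustration; only groupby, sum, and agg are actual pandas calls.

```python
import pandas as pd

# Hypothetical data: one row per sale (names are illustrative only).
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units": [10, 20, 5, 15, 25],
})

# Step 1 (split): build the grouped object. No aggregation happens yet.
by_region = sales.groupby("region")

# Step 2 (apply): aggregate the "units" column within each group.
totals = by_region["units"].sum()

# Step 3 (combine): the result is an ordinary Series indexed by region,
# so it can be printed or inspected like any other pandas object.
print(totals)

# Several aggregations at once return a DataFrame, one column per function.
summary = by_region["units"].agg(["sum", "mean"])
print(summary)
```

Keeping the grouped object in its own variable is what makes the intermediate steps easy to poke at.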
applying functions element-wise
If you use R, you know that most of the time you can get by with plyr. But every once in a while you need to bust out lapply or sapply. In pandas, on the other hand, you can use apply on both DataFrames and Series.
When you use apply on a DataFrame, you can apply your function to each column (axis=0) or to each row (axis=1). When you use apply on a Series, the function is called on each value of that Series.
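Here is a small sketch of apply along each axis and on a single Series. The frame and the lambdas are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# On a Series, apply calls the function on each value.
a_squared = df["a"].apply(lambda x: x ** 2)

print(col_range)
print(row_sum)
print(a_squared)
```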
wide to long and back again
R's reshape2 makes it extremely easy to switch your data between wide and long formats. pandas has its own set of functions that provide this functionality, such as melt and pivot. pandas also has a concept called stacking and unstacking, which lets you move labels between the columns and the row index of a pandas DataFrame.
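A minimal sketch of the wide-to-long round trip, assuming a small made-up table of yearly values per city. melt, pivot, stack, and unstack are the pandas calls; everything else is illustrative.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Austin", "Boston"],
    "2012": [10, 20],
    "2013": [11, 21],
})

# Wide to long: one row per (city, year) pair.
long_df = pd.melt(wide, id_vars=["city"], var_name="year", value_name="value")

# Long back to wide: the years become columns again.
wide_again = long_df.pivot(index="city", columns="year", values="value")

# stack moves the column labels into an inner level of the row index,
# and unstack moves that level back out to the columns.
stacked = wide_again.stack()
unstacked = stacked.unstack()

print(long_df)
print(stacked)
```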
plot
One of my favorite parts about R is that you can call plot on just about anything and R will render an appropriate graphic.
pandas measures up with its own out-of-the-box plotting powered by matplotlib. DataFrames and Series can both be plotted using the plot method, along with the standard hist and boxplot.
matplotlib is an excellent plotting library, but I have to say I still prefer the look and feel of ggplot2 graphics. I always end up getting more props when I circulate ggplots. rplot is a module found in this pandas fork providing ggplot2-like interfaces for pandas, though I'm not sure whether the fork is actively being developed at this time.
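The built-in plotting mentioned above might look like this in practice. This is a sketch that assumes matplotlib is installed. The data is made up, and the explicit subplots and the Agg backend are my own choices so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for headless runs)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 8, 16]})

# One figure with three panels, so each plot gets its own axes.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

df.plot(ax=ax1)        # line plot, one line per column
df["y"].hist(ax=ax2)   # histogram of a single Series
df.boxplot(ax=ax3)     # box plot of every numeric column

# fig.savefig("pandas_plots.png") would write the figure to disk.
```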
data.frame
While the implementation might be different, pandas data frames and R data frames have a lot in common. Most of the core functionality is the same in both: they're tabular, they allow column-wise operations on your data, etc. The biggest difference I've found is the way in which you operate on the data frames themselves.
R has a much more functional feel to it. Instead of calling a method on an R data frame, you pass the data frame to a function. pandas has a much more OOP feel to it: methods are called on the DataFrame itself. One feature that I haven't seen in R but that comes in handy in pandas is multi-level indexing. pandas allows you to create indices based not only on row number, but also on dates, numbers, and even categorical variables.
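To make multi-level indexing concrete, here is a rough sketch that indexes a made-up price table by date and by a categorical ticker column. The tickers and prices are hypothetical; set_index, loc, and xs are the pandas calls.

```python
import pandas as pd

# Hypothetical data: two tickers observed on two dates.
df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]),
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "price": [1.0, 2.0, 1.5, 2.5],
})

# Two-level row index: date first, then ticker.
indexed = df.set_index(["date", "ticker"])

# All rows for one date (the result is indexed by ticker).
first_day = indexed.loc[pd.Timestamp("2013-01-01")]

# All rows for one ticker, selected on the inner level.
aaa = indexed.xs("AAA", level="ticker")

# A single cell, addressed by the full (date, ticker) key.
price = indexed.loc[(pd.Timestamp("2013-01-02"), "BBB"), "price"]

print(first_day)
print(aaa)
print(price)
```

Note the method-call style throughout, which is the OOP feel described above.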
This just scratches the surface of pandas' functionality. Another topic that isn't mentioned in this post is the excellent time series capabilities that pandas has (similar to zoo in R). They're extensive enough that they merit their own post. In the meantime you can check out some of Wes McKinney's great tutorials.
yhat is the easiest way to operationalize predictive models.
Contact us at info@yhathq.com for details.