In step with our recent article about essential R packages, this post explores tools for data analysis in Python.
what is pandas?
pandas is the utility belt for data analysts using Python. The package centers around the pandas DataFrame, a two-dimensional data structure with indexable rows and columns. It has effectively taken the best parts of base R and R packages like reshape2 and consolidated them into a single library. It has lots of features (see library highlights). pandas gets its name from panel data, an econometrics term for multidimensional structured datasets (McKinney 5., 2013).
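As a quick illustration of "indexable rows and columns", here is a minimal sketch. The column names, index labels, and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame(
    {"name": ["ann", "bob", "cat"], "score": [90, 75, 82]},
    index=["a", "b", "c"],
)

# Select a column by name; the result is a Series.
scores = df["score"]

# Select a row by its index label, then a single cell by row and column label.
row_b = df.loc["b"]
cell = df.loc["c", "name"]

print(df.shape)  # (number of rows, number of columns)
print(cell)
```

Rows are addressed through the index and columns through their names, which is roughly what rownames and colnames give you on an R data frame.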
pandas has a lot in common with R (pandas comparison with R), and as someone who's familiar with R and Python (but not specifically pandas), I've found pandas to be extremely easy to use. This is a post about R and pandas and about what I've learned about each.
Munging and Plotting in Python
plyr-esque features in Python
Few tools hold a candle to pandas when it comes to Split-Apply-Combine operations. pandas groupby enables transformations, aggregations, and easy-access plotting functions. Virtually anything you can do with R's plyr package has a pandas equivalent.
One thing I like better about pandas than ddply is the ability to perform an operation in multiple steps. pandas lets you perform the group part on one line followed by the apply part on the next. This allows you to inspect the combined results on a third line, giving you visibility into what's going on under the hood.
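The group, apply, inspect sequence described above might look something like this. It is a minimal sketch on a made-up dataframe; the team and points columns are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 5, 15, 40],
})

# Step 1 (split): build the grouped object. Nothing is computed yet.
grouped = df.groupby("team")

# Step 2 (apply + combine): run a function on each group's "points" column.
spread = grouped["points"].apply(lambda s: s.max() - s.min())

# Step 3: inspect the combined result, which is indexed by group key.
print(spread)

# Built-in aggregations reuse the same grouped object.
summary = grouped["points"].agg(["mean", "sum"])
print(summary)
```

Keeping the grouped object in its own variable is what makes the intermediate steps easy to poke at.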
Additionally, pandas is faster than plyr. In some instances I found equivalent operations to be more than 4x faster using pandas.
applying functions element-wise
If you use R, you know that most of the time you can get by with plyr. But every once in a while you need to bust out sapply. In pandas, on the other hand, you can use apply.
When you use apply on a dataframe, you can apply your function along either axis: axis=0 (the default) passes each column to your function, and axis=1 passes each row. When you apply on a series, the function is called on each element of that series.
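To make the axis behaviour concrete, here is a small sketch on a toy dataframe. The x and y columns are invented for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sum = df.apply(lambda row: row["x"] + row["y"], axis=1)

# On a Series, apply calls the function once per element.
squared = df["x"].apply(lambda v: v ** 2)

print(col_range)
print(row_sum)
print(squared)
```

For simple arithmetic like this, vectorized expressions such as df["x"] + df["y"] are usually the faster choice; apply earns its keep when the function is not easily vectorized.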
wide to long and back again
reshape2 makes it extremely easy to switch your data between wide and long formats. pandas has its own set of functions that provide this functionality. pandas also has a concept called stacking and unstacking, which lets you move index levels between the rows and columns of a pandas dataframe.
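A rough sketch of the round trip, using pd.melt and pivot as the pandas counterparts to reshape2's melt and cast, followed by stack and unstack. The id, height, and weight columns are made up for the example.

```python
import pandas as pd

# Hypothetical wide-format data, invented for illustration.
wide = pd.DataFrame({
    "id": [1, 2],
    "height": [150, 160],
    "weight": [50, 60],
})

# Wide to long: one row per (id, variable) pair.
long_df = pd.melt(wide, id_vars="id", var_name="variable", value_name="value")

# Long back to wide: spread the variable names back out into columns.
wide_again = long_df.pivot(index="id", columns="variable", values="value")

# stack moves the column labels into an inner level of the row index...
stacked = wide.set_index("id").stack()

# ...and unstack moves that inner level back out to the columns.
unstacked = stacked.unstack()

print(long_df)
print(stacked)
```

melt and pivot work on ordinary columns, while stack and unstack work on the index, so which pair is more convenient tends to depend on whether your identifiers already live in the index.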
One of my favorite parts about R is that you can call plot on just about anything and R will render the graphic you'd expect.
pandas measures up with its own out-of-the-box plotting powered by matplotlib.
DataFrames and Series can both be plotted using the plot method along with standard matplotlib calls.
matplotlib is an excellent plotting library, but I have to say I still prefer the look and feel of ggplot2 graphics. I always end up getting more props when I circulate ggplots. rplot is a module found in this pandas fork providing ggplot2-like interfaces for pandas, though I'm not sure whether the fork is actively being developed at this time.
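For reference, the built-in matplotlib-backed plotting can be sketched as follows. The data is invented, and the Agg backend is selected only so the example runs without a display; in an interactive session you would normally skip that line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {"sales": [3, 5, 4, 6], "returns": [1, 0, 2, 1]},
    index=["q1", "q2", "q3", "q4"],
)

# DataFrame.plot draws one line per column and returns a matplotlib Axes.
ax = df.plot(title="Quarterly totals")

# Because the result is a regular Axes, standard matplotlib calls apply.
ax.set_xlabel("quarter")
ax.set_ylabel("count")

# A Series plots the same way; kind selects the chart type.
fig2, ax2 = plt.subplots()
df["sales"].plot(kind="bar", ax=ax2)
```

Calling ax.figure.savefig with a file name would write the first chart to disk.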
While the implementation might be different, pandas data frames and R data frames have a lot in common. Most of the core functionality is the same in both: they allow column-wise operations on your data, they're tabular, etc. The biggest difference I've found is the way in which you operate on the data frames themselves.
R has a much more functional feel to it. Instead of calling a particular method on an R data frame, you invoke a function on it. pandas has a much more OOP feel to it: dataframe methods are called on the dataframe itself. One feature that I haven't seen in R but that comes in handy in pandas is multi-level indexing. pandas allows you to create indices based not only on row number, but also on dates, numbers, and even categorical variables.
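Both points, method-style calls and multi-level indexing, can be sketched on a small made-up dataframe. The date, region, and sales columns are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]
    ),
    "region": ["east", "west", "east", "west"],
    "sales": [100, 80, 120, 90],
})

# Methods are called on the object itself, where R would use head(df) or summary(df).
first_rows = df.head(2)
stats = df["sales"].describe()

# Build a two-level index from a date column and a categorical column.
indexed = df.set_index(["date", "region"]).sort_index()

# Take a cross-section on the outer (date) level.
jan_first = indexed.xs(pd.Timestamp("2013-01-01"), level="date")

# Look up a single value with a full (date, region) key.
east_jan_second = indexed.loc[(pd.Timestamp("2013-01-02"), "east"), "sales"]

print(indexed)
print(jan_first)
```

Sorting the index after set_index keeps lookups on the multi-level index predictable.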
This just scratches the surface of pandas' functionality. Another topic that isn't covered in this post is pandas' excellent time series capabilities (similar to zoo in R). They're extensive enough to merit their own post. In the meantime you can check out some of Wes McKinney's great tutorials.