Time-based data can be a pain to work with. Is it a date or a datetime? Are my dates in the right format? Luckily, Python and pandas provide some super helpful utilities for making this easier. In this post, we'll be using pandas and ggplot to analyze time series data.
For these examples, we'll be using the meat data set, which has been made available to us from the U.S. Dept. of Agriculture. It contains metrics on livestock, dairy, and poultry outlook and production.
pip install -U ggplot
pip install -U pandasql
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ggplot import *
# drop columns that have fewer than 800 observations
meat = meat.dropna(thresh=800, axis=1)
ts = meat.set_index(['date'])
Working with dates and times with pandas
pandas has some excellent out-of-the-box functionality for aggregating date- and time-based data.
Since we indexed our data on a datetime column (date), we can group by the year and take the sum over the columns pretty easily.
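Here is a minimal sketch of that yearly grouping. It uses a small made-up frame (toy, toy_ts, and the constant values are assumptions for illustration) rather than the real meat data, so it runs without ggplot installed.

```python
import pandas as pd

# Hypothetical stand-in for the meat data: two years of monthly values
dates = pd.date_range("1944-01-01", "1945-12-01", freq="MS")
toy = pd.DataFrame({"date": dates, "beef": 1.0, "pork": 2.0})
toy_ts = toy.set_index("date")

# The DatetimeIndex exposes .year, so grouping by it gives one row per year
by_year = toy_ts.groupby(toy_ts.index.year).sum()
print(by_year)
```

Each year has twelve monthly rows, so the beef column sums to 12.0 and pork to 24.0 for both years.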
But what if we're keen to look at the sums over the decades?
Grouping by decade
If you're only interested in one or more specific decades, you can accomplish that using the date and time slicing functionality built into pandas. Here we select a slice of the data corresponding to the 1940s.
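# slice by date first (.loc replaces .ix, which was removed in pandas 1.0)
the1940s = ts.loc['1940-01-01':'1949-12-31']
# then sum within each calendar year
the1940s = the1940s.groupby(the1940s.index.year).sum()
the1940s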
Then you could just sum the column or columns you're interested in to get the total for the decade you're looking at.
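As a rough sketch of slicing a decade and summing one column, the example below uses a made-up monthly frame (toy_ts and its constant beef values are assumptions, not the real data). It relies on partial-string slicing of a sorted DatetimeIndex, where both endpoints are inclusive.

```python
import pandas as pd

# Hypothetical monthly data covering 1938 through 1951
dates = pd.date_range("1938-01-01", "1951-12-01", freq="MS")
toy_ts = pd.DataFrame({"beef": 1.0}, index=dates)

# "1940":"1949" keeps every row from January 1940 through December 1949
forties = toy_ts.loc["1940":"1949"]

# Sum just the column of interest to get the decade total
total_beef_1940s = forties["beef"].sum()
print(total_beef_1940s)
```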
But what if you need to look at all the decades?
One quick way is to use Python's unambiguous floor division operator, //, to round each year down to the start of its decade:
def floor_decade(date_value):
    "Takes a date. Returns the decade."
    return (date_value.year // 10) * 10
Timestamp('2013-10-09 00:00:00', tz=None)
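To see what floor_decade returns for a single date like the one above, here is a small sketch. It uses the standard library's datetime instead of a pandas Timestamp (an assumption made to keep the example dependency-free); the function only reads the .year attribute, which both types provide.

```python
from datetime import datetime


def floor_decade(date_value):
    "Takes a date. Returns the decade."
    return (date_value.year // 10) * 10


# 2013 // 10 is 201, and 201 * 10 is 2010
print(floor_decade(datetime(2013, 10, 9)))   # 2010
# The last day of a decade still maps to that decade
print(floor_decade(datetime(1949, 12, 31)))  # 1940
```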
Now we can just apply floor_decade over the dates like so:
the1940s.sum().reset_index(name='meat sums in the 1940s')
| index | meat sums in the 1940s |
And just to sanity check, the numbers tie out no matter which of these approaches you take.
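One way to convince yourself of this is a small check on made-up data. In the sketch below, toy_ts and its values are assumptions for illustration only; it compares the decade total from date slicing with the total from grouping by floor_decade.

```python
import pandas as pd


def floor_decade(date_value):
    "Takes a date. Returns the decade."
    return (date_value.year // 10) * 10


# Hypothetical monthly data spanning parts of three decades
dates = pd.date_range("1935-01-01", "1954-12-01", freq="MS")
toy_ts = pd.DataFrame({"beef": range(len(dates))}, index=dates)

# Approach 1: slice the decade by date, then sum the column
sliced_total = toy_ts.loc["1940":"1949", "beef"].sum()

# Approach 2: groupby calls floor_decade on each index value,
# so every row lands in its decade's bucket; then pick the 1940 bucket
grouped_total = toy_ts.groupby(floor_decade).sum().loc[1940, "beef"]

print(sliced_total == grouped_total)
```

Passing a function to groupby works here because pandas applies it to each value of the index, which in this case is a Timestamp.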
by_decade = ts.groupby(floor_decade).sum()
by_decade.index.name = 'year'
by_decade = by_decade.reset_index()
ggplot(by_decade, aes('year', weight='beef')) + \
    geom_bar() + \
    scale_y_continuous(labels='comma') + \
    ggtitle('Head of Cattle Slaughtered by Decade')
Things are starting to make sense. Now how might we better inspect the trends we're seeing over time? One way is to use the same bar chart as before, but stacking the values for each type of livestock.
by_decade_long = pd.melt(by_decade, id_vars="year")
ggplot(aes(x='year', weight='value', colour='variable'), data=by_decade_long) + \
    geom_bar() + \
    ggtitle("Meat Production by Decade")
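If pd.melt is new to you, the sketch below shows what it does to a tiny made-up frame (wide and its numbers are assumptions for illustration). Each non-id column becomes a set of rows, with the old column name in variable and the cell contents in value, which is the long shape the colour='variable' mapping expects.

```python
import pandas as pd

# Hypothetical wide frame: one column per type of meat
wide = pd.DataFrame({
    "year": [1940, 1950],
    "beef": [10, 20],
    "pork": [30, 40],
})

# Keep "year" as the identifier; stack the remaining columns into rows
long = pd.melt(wide, id_vars="year")
print(long)
```

The result has four rows (two years times two meats) and three columns: year, variable, and value.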
For all you ggplot2 fans wondering why we didn't do a stacked bar chart, don't worry! It's coming in a release in the not-so-distant future.
Trends over time
For our last plot we're going to jump back a little bit. Instead of looking at the data in aggregate, we're going to take another approach to making sense of our time series data. We're going to bring the original meat dataset back into the mix so we can take a look at all of our livestock varieties.
from ggplot import meat
meat_lng = pd.melt(meat, id_vars=['date'])
ggplot(aes(x='date', y='value', colour='variable'), data=meat_lng) + geom_line()
OK, so this plot looks a bit cluttered. We've got way too much zigging and zagging. Sure, the colors are nice, but it's a bit overwhelming.
Instead of getting rid of our data, we're going to apply a smoothing function so that we'll see the trend instead of the noise.
ggplot(aes(x='date', y='value', colour='variable'), data=meat_lng) + \
    stat_smooth(span=0.10) + \
    ggtitle("Smoothed Livestock Production")
Ahh, much better. This plot I can actually make sense of. You can see that chicken production has been growing quickly since the late 1950s, and that sometime in the late 1970s or early 1980s it overtook pork production, and a few years later it overtook beef production.
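To get a feel for how smoothing separates trend from noise outside of a plot, here is a rough sketch using a centered rolling mean in pandas. This is a simpler, swapped-in technique, not what stat_smooth does internally, and the series below (noisy) is made up for illustration.

```python
import pandas as pd

# Hypothetical series: a steady upward trend plus an alternating +/-5 wiggle
dates = pd.date_range("2000-01-01", periods=24, freq="MS")
noisy = pd.Series([i + (5 if i % 2 else -5) for i in range(24)], index=dates)

# A 4-month centered window always averages two +5 and two -5 wiggles,
# so the noise cancels and only the underlying trend remains
smoothed = noisy.rolling(window=4, center=True).mean()

print(noisy.diff().dropna().unique())     # jumps back and forth
print(smoothed.diff().dropna().unique())  # a constant step
```

The window size plays a role loosely similar to span: a wider window gives a smoother line but reacts more slowly to real changes.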
We're still working out some of the kinks in stat_smooth, but you can see that it's already an incredibly useful function. If you're interested in helping build ggplot for Python, drop us a note at firstname.lastname@example.org! We'd love to hear from you.