Aggregating & plotting time series in Python

by yhat

Time-based data can be a pain to work with. Is it a date or a datetime? Are my dates in the right format? Luckily, Python and pandas provide some super helpful utilities that make this easier. In this post, we'll use pandas and ggplot to analyze time series data.

Data set

For these examples, we'll be using the meat data set, which is made available by the U.S. Department of Agriculture. It contains metrics on livestock, dairy, and poultry outlook and production.

You can find the data set in either the ggplot package or the pandasql package, both of which can be installed with pip.

pip install -U ggplot
pip install -U pandasql

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ggplot import *
In [2]:
meat = meat.dropna(thresh=800, axis=1) # drop columns that have fewer than 800 observations
ts = meat.set_index(['date'])
In [3]:
ts.head(10)
Out[3]:
            beef  veal  pork  lamb_and_mutton
date
1944-01-01   751    85  1280               89
1944-02-01   713    77  1169               72
1944-03-01   741    90  1128               75
1944-04-01   650    89   978               66
1944-05-01   681   106  1029               78
1944-06-01   658   125   962               79
1944-07-01   662   142   796               82
1944-08-01   787   175   748               87
1944-09-01   774   182   678               91
1944-10-01   834   215   777              100

Working with dates and times with pandas

pandas has some excellent out-of-the-box functionality for aggregating date- and time-based data.

In [4]:
ts.groupby(ts.index.year).sum().head(10)
Out[4]:
       beef  veal   pork  lamb_and_mutton
1944   8801  1629  11502             1001
1945   9936  1552   8843             1030
1946   9010  1329   9220              946
1947  10096  1493   8811              779
1948   8766  1323   8486              728
1949   9142  1240   8875              587
1950   9248  1137   9397              581
1951   8549   972  10190              508
1952   9337  1080  10321              635
1953  12055  1451   8971              715

Since we indexed our data on a datetime column (date), we can group by the year and take the sum over the columns pretty easily.
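To see why this works, here's a minimal sketch on a made-up data set (the toy frame and its values are our own stand-in, not the meat data). The year attribute of a datetime index gives one integer per row, and groupby uses those integers as the group keys.

```python
import pandas as pd

# Hypothetical stand-in for the meat data: 24 monthly rows over two years.
dates = pd.date_range("1944-01-01", periods=24, freq="MS")
toy = pd.DataFrame({"beef": range(1, 25), "pork": [10] * 24}, index=dates)

# The index is a DatetimeIndex, so .year yields one integer per row,
# and groupby uses those integers as the group keys.
yearly = toy.groupby(toy.index.year).sum()
print(yearly)
```

The result has one row per calendar year, just like the table above.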

But what if we're keen to look at the sums over the decades?

Grouping by decade

If you're only interested in one or more specific decades, you can use the date and time slicing functionality built into pandas. Here we select the slice of the data corresponding to the 1940s and then sum it by year.

In [5]:
# slice by date first (the older .ix indexer has been removed from pandas), then sum per year
decade = ts.loc['1940-01-01':'1949-12-31']
the1940s = decade.groupby(decade.index.year).sum()
the1940s
Out[5]:
       beef  veal   pork  lamb_and_mutton
1944   8801  1629  11502             1001
1945   9936  1552   8843             1030
1946   9010  1329   9220              946
1947  10096  1493   8811              779
1948   8766  1323   8486              728
1949   9142  1240   8875              587

Then you could just sum the column or columns you're interested in to get the total for the decade you're looking at.
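As a rough sketch of that last step, here's how summing one column versus several might look. The yearly frame below is hypothetical (made-up values shaped like the yearly table above).

```python
import pandas as pd

# Hypothetical yearly totals shaped like the table above (values are made up).
yearly = pd.DataFrame(
    {"beef": [100, 200, 300], "veal": [10, 20, 30]},
    index=[1947, 1948, 1949],
)

beef_total = yearly["beef"].sum()            # one column -> a single number
all_totals = yearly[["beef", "veal"]].sum()  # several columns -> a Series
print(beef_total)
print(all_totals)
```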

But what if you need to look at all the decades?

One quick way is to use Python's floor division operator, //, which rounds down no matter which version of Python you're running.

In [6]:
def floor_decade(date_value):
    "Takes a date. Returns the decade."
    return (date_value.year // 10) * 10
In [7]:
pd.to_datetime('2013-10-09')
Out[7]:
Timestamp('2013-10-09 00:00:00', tz=None)
In [8]:
floor_decade(_)
Out[8]:
2010

Voilà!

Now we can just pass floor_decade to groupby, which calls it on each date in the index:

In [9]:
ts.groupby(floor_decade).sum()
Out[9]:
          beef     veal      pork  lamb_and_mutton
1940   55751.0   8566.0   55737.0           5071.0
1950  119161.0  12693.0   98450.0           6724.0
1960  177754.0   8577.0  116587.0           6873.0
1970  228947.0   5713.0  132539.0           4256.0
1980  230100.0   4278.0  150528.0           3394.0
1990  243579.0   2938.0  173519.0           2986.0
2000  260540.7   1685.3  208211.3           1964.7
2010   76391.5    371.9   66491.2            455.6
In [10]:
the1940s.sum().reset_index(name='meat sums in the 1940s')
Out[10]:
             index  meat sums in the 1940s
0             beef                   55751
1             veal                    8566
2             pork                   55737
3  lamb_and_mutton                    5071

And just to sanity check, the 1940s totals match the 1940 row of the grouped table, so the numbers tie out no matter which of these approaches you take.
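If you'd rather not compare the numbers by eye, the same check can be sketched in code. This example uses a made-up monthly frame (toy, with invented values) rather than the meat data, and repeats floor_decade so it runs on its own.

```python
import pandas as pd

def floor_decade(date_value):
    "Takes a date. Returns the decade."
    return (date_value.year // 10) * 10

# Hypothetical monthly data spanning two decades (values are made up).
dates = pd.date_range("1948-01-01", "1951-12-01", freq="MS")
toy = pd.DataFrame({"beef": range(len(dates))}, index=dates)

# Approach 1: slice the 1940s by date, then sum.
sliced_total = toy.loc["1940-01-01":"1949-12-31", "beef"].sum()

# Approach 2: group every row by its decade, then pick the 1940 row.
grouped_total = toy.groupby(floor_decade).sum().loc[1940, "beef"]

# Both approaches should agree on the total for the 1940s.
assert sliced_total == grouped_total
print(sliced_total, grouped_total)
```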

In [11]:
by_decade = ts.groupby(floor_decade).sum()
In [12]:
by_decade.index.name = 'year'
In [13]:
by_decade = by_decade.reset_index()
In [14]:
ggplot(by_decade, aes('year', weight='beef')) + \
    geom_bar() + \
    scale_y_continuous(labels='comma') + \
    ggtitle('Head of Cattle Slaughtered by Decade')
Out[14]:
<ggplot: (277778957)>

Things are starting to make sense. Now how might we better inspect the trends we're seeing over time? One way is to reuse the bar chart from before, but break out the values by type of livestock.

In [15]:
by_decade_long = pd.melt(by_decade, id_vars="year")

ggplot(aes(x='year', weight='value', colour='variable'), data=by_decade_long) + \
    geom_bar() + \
    ggtitle("Meat Production by Decade")
Out[15]:
<ggplot: (277783637)>

For all you ggplot2 fans wondering why we didn't do a stacked bar chart, don't worry! It's coming in a release in the not-so-distant future.
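If the pd.melt call in that last cell went by a little fast, here's a small sketch of what it does. The wide frame is hypothetical (two decades, two meats, made-up values). melt keeps year as an identifier and stacks the other columns into a variable column and a value column, which is the long shape the colour aesthetic needs.

```python
import pandas as pd

# Hypothetical wide table: one row per decade, one column per meat (values made up).
wide = pd.DataFrame({"year": [1940, 1950], "beef": [1, 2], "pork": [3, 4]})

# melt keeps 'year' as an identifier and stacks the remaining columns into
# two new columns: 'variable' (the old column name) and 'value'.
long = pd.melt(wide, id_vars="year")
print(long)
```

Each original cell becomes its own row, so two decades times two meats gives four rows.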

For our last plot, we're going to jump back a little bit. Instead of looking at the data in aggregate, we're going to take another approach to making sense of our time series data. We'll bring the original meat data set back into the mix so we can take a look at all of our livestock varieties.

In [16]:
from ggplot import meat
meat_lng = pd.melt(meat, id_vars=['date'])
ggplot(aes(x='date', y='value', colour='variable'), data=meat_lng) + geom_line()
Out[16]:
<ggplot: (278421225)>

OK, so this plot looks a bit cluttered. We've got way too much zigging and zagging. Sure, the colors are nice, but it's a bit overwhelming.

Instead of getting rid of our data, we're going to apply a smoothing function so that we'll see the trend instead of the noise.

In [17]:
ggplot(aes(x='date', y='value', colour='variable'), data=meat_lng) + \
    stat_smooth(span=0.10) + \
    ggtitle("Smoothed Livestock Production")
Out[17]:
<ggplot: (279162797)>

Ahh, much better. This plot I can actually make sense of. You can see that chicken production has been growing quickly since the late 1950s, that sometime in the late 1970s or early 1980s it overtook pork production, and that a few years later it overtook beef production.
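For some intuition about smoothing in general, here's a sketch using a rolling mean in pandas. To be clear, this is a different and much simpler technique than stat_smooth, and we're not claiming it's what stat_smooth does under the hood. The series is made up (a steady trend plus a yearly wiggle).

```python
import numpy as np
import pandas as pd

# Hypothetical noisy monthly series: an upward trend plus a seasonal wiggle.
dates = pd.date_range("1950-01-01", periods=120, freq="MS")
trend = np.linspace(100.0, 200.0, len(dates))
wiggle = 20.0 * np.sin(np.arange(len(dates)) * 2 * np.pi / 12)
noisy = pd.Series(trend + wiggle, index=dates)

# A centered 12-month rolling mean averages out the within-year wiggle.
# The first and last few values are NaN because the window is incomplete there.
smoothed = noisy.rolling(window=12, center=True).mean()

# Average month-to-month change, raw versus smoothed.
print(noisy.diff().abs().mean(), smoothed.diff().abs().mean())
```

The smoothed series moves far less from month to month, which is the same "trend instead of noise" effect we got in the plot.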

We're still working out some of the kinks in stat_smooth, but you can see that it's already an incredibly useful function. If you're interested in helping build ggplot for Python, drop us a note at info@yhathq.com! We'd love to hear from you.
