Fast summary statistics in R with data.table

by Jeff

September 26, 2013

Calculating summary measures (e.g. sums, counts, means, min, max, percentiles, etc.) across groups of data is one of the most common tasks in data analysis. Unless you're among the poor souls stuck with Hadoop, the right tool for the job could be a SQL GROUP BY, an Excel Pivot Table, or, if you're like me, a scripting language!

This is a post about data.table, a popular package for summarizing data in R.

Installation

Like most R packages, installation is a snap. Just use install.packages.

install.packages("data.table")


How to make a data.table?

Ok so you've got data.table installed. What can you do with it? Well for starters, you should probably know how to create a data.table. I usually make a data.table from a data.frame, but you can also create a data.table from vectors. It's pretty much the same thing as making a data.frame.

library(data.table)
# Making a data.frame from vectors
df.data <- data.frame(a=1:5,b=1:5)

# Making a data.table from vectors
dt.data <- data.table(a=1:5,b=1:5)

# Making a data.table from a data.frame
dt.data <- data.table(df.data)
dt.data
#  a b
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5


Usage & Examples

Data Set

The first step is to pull in some data. For this example, we'll be using two data sets from Basketball Reference: boxscore.csv and team_game_count.csv. If you want to throw them into MySQL, you can use this loadscript. The data comes from boxscores for the 2004-2013 NBA seasons. For simplicity, I'm only going to use these fields: date, team, player, minutes, field goals made (fgm), field goals attempted (fga), and points. Note that I've converted the minutes field to decimal form, so 1.5 minutes is 1 minute and 30 seconds. Since I threw my data into MySQL, I used RMySQL to pull it out.

library(RMySQL)

# con is an existing MySQL connection, e.g.:
# con <- dbConnect(MySQL(), user=..., password=..., host=...)
strQuery <- "SELECT date, team, player, minutes, fgm, fga, points FROM basketball.boxscore"
df.boxscore <- dbGetQuery(con, strQuery)
df.boxscore$date <- as.Date(df.boxscore$date, '%Y-%m-%d')
head(df.boxscore)

#date team player minutes fgm fga points
#1 2012-10-30 WAS A.J. Price 29.40 2 13 7
#2 2012-10-30 WAS Emeka Okafor 24.58 4 10 10
#3 2012-10-30 WAS Trevor Ariza 24.58 3 8 9
#4 2012-10-30 WAS Bradley Beal 21.55 2 8 8
#5 2012-10-30 WAS Trevor Booker 16.75 2 9 4
#6 2012-10-30 WAS Jordan Crawford 28.70 4 13 11


One other data point I wanted was the number of games a team had played up to a certain point in the season. Since my boxscore data has team and date, it had everything I needed.

strQuery <- "SELECT * FROM basketball.team_game_count"
df.games <- dbGetQuery(con, strQuery)
df.games$date <- as.Date(df.games$date, '%Y-%m-%d')


df.games is a data.frame of every NBA game and tells me how many games a team has played up to a certain point in the season. game_num will be 1 for a team's first game of the season and 82 for its last game (assuming the team plays 82 games in a season).

head(df.games)
season       date team game_num
1 2005-2006 2005-11-01  DAL        1
2 2005-2006 2005-11-01  DEN        1
3 2005-2006 2005-11-01  MIL        1
4 2005-2006 2005-11-01  NOK        1
5 2005-2006 2005-11-01  PHI        1
6 2005-2006 2005-11-01  PHO        1
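
As an aside, if you only had one row per team per game, a table like df.games could be built with data.table itself: sort by date, then number the rows within each season/team group. This is just a sketch with made-up rows (the real table comes from the loadscript above):

```r
library(data.table)

# Toy game log: one row per (season, team, date); data is illustrative only
dt.games <- data.table(
  season = "2005-2006",
  team   = c("DAL", "DAL", "DAL", "DEN", "DEN"),
  date   = as.Date(c("2005-11-01", "2005-11-03", "2005-11-05",
                     "2005-11-01", "2005-11-02"))
)

# Sort by season, team, date so the running count follows date order
setkey(dt.games, season, team, date)

# seq_len(.N) numbers the rows 1..N within each group
dt.games[, game_num := seq_len(.N), by = list(season, team)]
dt.games
```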


Comparing plyr vs. data.table

Now I have my data, so let's dig into it a bit and show some examples of where data.tables can replace the plyr package. First, let's get team-level stats instead of player-level stats. Easy, I can just group df.boxscore by date + team and sum the fgm, fga, and points variables. This is easy to accomplish using ddply from the plyr package, but it's also easy to do using data.tables.

Our first file has player-level stats, like the number of points scored per game per player.

Let's compute team-level stats; first we'll use ddply.

library(plyr)

system.time(
  df.statsTeam <- ddply(df.boxscore, ~ date + team, summarize,
                        points=sum(points), fgm=sum(fgm), fga=sum(fga))
)
#   user  system elapsed
# 29.990  18.164  48.174


Ugh, that was slow. This operation is equally simple to perform w/ data.table.

Create a data.table by passing a data.frame to the data.table function.

dt.boxscore <- data.table(df.boxscore)


Then I set its key -- this is something I haven't mentioned yet, but you can find more information about keys on the data.table intro page. As you might expect, the key sorts the data.table so that R can access the data more efficiently. I'll set the key to the variables I want to group by -- date and team.

setkey(dt.boxscore, date, team)

system.time(
  dt.statsTeam <- dt.boxscore[, list(points=sum(points),
                                     fgm=sum(fgm),
                                     fga=sum(fga)),
                              by=list(date, team)]
)
#  user  system elapsed
#  0.095   0.003   0.098 


Wow, that was fast! data.table took 0.098 seconds while ddply took 48.2 seconds. But is the output the same? I'll have to convert dt.statsTeam to a data.frame in order to compare it to df.statsTeam using the identical() function. That is easily accomplished by calling data.frame( {the data table} ).

identical(df.statsTeam, data.frame(dt.statsTeam))
# [1] TRUE


Yep, they're the same! Ok great, so we can use a data.table as a near drop-in replacement for ddply.
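
Since keys come up again below, one handy aside: you can check whether a table is keyed with haskey() and see the key columns with key(). A small sketch with toy data (not the boxscore table):

```r
library(data.table)

dt <- data.table(date   = as.Date('2012-10-30') + 0:3,
                 team   = c('WAS', 'WAS', 'HOU', 'HOU'),
                 points = c(7, 10, 9, 8))

haskey(dt)              # FALSE -- no key set yet
setkey(dt, date, team)  # sorts dt by date, then team
haskey(dt)              # TRUE
key(dt)                 # "date" "team"
```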

A Basic Merge

Say I only want to find the boxscores from the Houston Rockets' last game of the 2012-2013 regular season. Our first step is to subset our data set so that we're only looking at the Rockets' data. Let's see what that would look like in regular R.

df.games.houston <- df.games[df.games$team == 'HOU' & df.games$game_num == 82 &
                             df.games$season == '2012-2013',]
df.games.houston
#          season       date team game_num
# 19179 2012-2013 2013-04-17  HOU       82

## Merge the two data frames
system.time(
  df.merge <- merge(df.boxscore, df.games.houston, by=c('date','team'))
)
#   user  system elapsed
#  3.392   0.201   3.664


To merge data.tables, we can either use the merge() function on the data.tables or we can use data.table's own syntax.

## First let's just try merging the data tables.
dt.games.houston <- data.table(df.games.houston)
dt.boxscore <- data.table(df.boxscore)

system.time(
  dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team'))
)
#   user  system elapsed
#  0.092   0.013   0.106


Even Faster Merges

Using merge was pretty fast, but let's try setting keys on our data before we merge.

setkey(dt.games.houston, date, team)
setkey(dt.boxscore, date, team)

system.time(
  dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team'))
)
#   user  system elapsed
#  0.018   0.002   0.022


Even faster! Finally, I'll use the internal data.table syntax for merging. It's a bit funky, but it can do the same work as an inner, left, or right join. Since we've already set the keys for both data tables, it knows which fields we want to merge on (date and team). Here is an inner join.

system.time(
  dt.merge <- dt.boxscore[dt.games.houston]
)
#   user  system elapsed
#  0.008   0.001   0.011


Some Summary Statistics

Okay, now that I've shown you some of the great things about data tables, let's calculate some interesting stats with the data. I'll be using my original data.table here, called dt.boxscore. I'll normalize the points variable by calculating points per 48 minutes (the number of minutes in an NBA game) and then calculate some summary statistics for the 2012-2013 season for each player -- data tables will be useful here.

dt.boxscore$pp48 <- dt.boxscore$points/dt.boxscore$minutes*48
dt.boxscore.subset <- dt.boxscore[dt.boxscore$date >= '2012-09-01',]

dt.statsPlayer <- dt.boxscore.subset[, list(minutes=sum(minutes),
                                            mean=mean(pp48),
                                            min=min(pp48),
                                            lower=quantile(pp48, .25, na.rm=TRUE),
                                            middle=quantile(pp48, .50, na.rm=TRUE),
                                            upper=quantile(pp48, .75, na.rm=TRUE),
                                            max=max(pp48)),
                                     by='player']
dt.statsPlayer
               player minutes      mean      min    lower   middle    upper      max
  1:     Daequan Cook  476.37  9.810285  0.00000  0.00000  0.00000 18.11817 66.97674
  2:     Robert Sacre  207.72  5.735073  0.00000  0.00000  0.00000 10.68816 38.01980
  3:   Jared Jeffries  349.84  5.619368  0.00000  0.00000  0.00000 11.64374 23.70370
  4: Jared Cunningham   25.70 20.313846  0.00000  0.00000  0.00000 39.60131 75.39267
  5:     Kevin Murphy   52.39  7.397563  0.00000  0.00000  0.00000  0.00000 42.29075
 ---
466:     James Harden 3227.72 32.310256 12.57485 26.12344 32.28516 38.15972 54.24410
467:     LeBron James 3837.39 33.111033 16.88159 28.30721 33.19206 37.20936 53.63128
468:     Kevin Durant 3603.38 34.830615 18.31624 30.20330 34.91109 38.75855 51.85185
469:      Kobe Bryant 3012.59 33.665142  0.00000 25.67263 36.18276 40.76094 53.66972
470:  Carmelo Anthony 2962.78 36.730475 10.66140 30.92032 36.38086 43.00658 62.82723


Let's only look at players who played over 1,000 minutes in the season (no special reason for 1,000 minutes...it just seemed like a good number to me). Then let's order by median (the field is called middle) descending and show the top 10 players. If you follow basketball at all, the top 10 players here make sense.

dt.statsPlayer.1000 <- dt.statsPlayer[dt.statsPlayer$minutes >= 1000, ]
dt.top10 <- head(dt.statsPlayer.1000[order(-dt.statsPlayer.1000$middle),], 10)
dt.top10
player minutes     mean       min    lower   middle    upper      max
1:   Carmelo Anthony 2962.78 36.73048 10.661402 30.92032 36.38086 43.00658 62.82723
2:       Kobe Bryant 3012.59 33.66514  0.000000 25.67263 36.18276 40.76094 53.66972
3:      Kevin Durant 3603.38 34.83062 18.316241 30.20330 34.91109 38.75855 51.85185
4:      LeBron James 3837.39 33.11103 16.881594 28.30721 33.19206 37.20936 53.63128
5:      James Harden 3227.72 32.31026 12.574850 26.12344 32.28516 38.15972 54.24410
6: Russell Westbrook 2928.55 32.12473 13.445378 26.45569 32.22371 36.74335 55.01433
7:      Kyrie Irving 2047.88 30.70178  9.076584 22.59503 30.85990 38.09608 51.90094
8:       Tony Parker 2938.78 28.76046  3.772102 23.31282 29.57983 34.06920 53.08598
9:       Brook Lopez 2516.70 30.51248 16.520076 25.24544 28.80000 36.10256 53.43570
10:     Stephen Curry 3479.12 28.29065  7.902535 22.89263 28.04535 34.24703 54.00000
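
The filter and sort can also be chained inside data.table's brackets, where bare column names work without the dt$ prefix. A toy sketch with made-up numbers (same column names as the stats table above):

```r
library(data.table)

# Tiny stand-in for dt.statsPlayer; values are illustrative only
dt.stats <- data.table(player  = c('A', 'B', 'C'),
                       minutes = c(1200, 800, 2000),
                       middle  = c(30, 40, 35))

# Filter on minutes, order by middle descending, then take the top rows
dt.top <- head(dt.stats[minutes >= 1000][order(-middle)], 10)
dt.top
#    player minutes middle
# 1:      C    2000     35
# 2:      A    1200     30
```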


Other resources

Hopefully this gives you a brief overview of how great data.tables are. I'll follow up with another post that provides some more interesting NBA analysis. Let us know if there is any specific analysis you would be interested in seeing! If you're interested in reading more about data.tables, check out these resources:
