ŷhat

Fast summary statistics in R with data.table

by Jeff

Learn More Tweet

Calculating summary measures (e.g. sums, counts, means, min, max, percentiles, etc.) across groups of data is one of the most common tasks in data analysis. Unless you're among the poor souls stuck with Hadoop, the right tool for the job could be a SQL GROUP BY, an Excel Pivot Table, or, if you're like me, a scripting language!

This is a post about data.table, a popular package for summarising data in R.


Installation

Like most R packages, installation is a snap. Just use install.packages.

install.packages("data.table")

How to make a data.table?

Ok so you've got data.table installed. What can you do with it? Well for starters, you should probably know how to create a data.table. I usually make a data.table from a data.frame, but you can also create a data.table from vectors. It's pretty much the same things as making a data.frame.

library(data.table)
# Making a data.frame from vectors 
df.data <- data.frame(a=1:5,b=1:5)

# Making a data.table from vectors 
dt.data <- data.table(a=1:5,b=1:5)

# Making a data.table from a data.frame 
dt.data <- data.table(df.data)
dt.data
#  a b
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5

Usage & Examples

Data Set

First step is to pull in some data. For this example, we'll be using two data sets from Basketball Reference: boxscore.csv and team_game_count.csv. If you want to throw it into MySQL, you can use this loadscript. The data comes from boxscores from the 2004-2013 NBA season. For simplicity, I'm only going to use the fields: date, team, player, minutes, field goals made (fgm), field goals attempted (fga), and points. Note that I've convereted the minutes field to decimal form, so 1.5 minutes is 1 minute and 30 seconds. Since I threw my data into MySQL, I used RMySQL to pull it out.

library(RMySQL)

con <- dbConnect(drv="MySQL", dbname="basketball", user="{USERNAME}", password="{PASSWORD}")
strQuery <- "SELECT date, team, player, minutes, fgm, fga, points FROM basketball.boxscore"
df.boxscore <- dbGetQuery(con, strQuery)
df.boxscore$date <- as.Date(df.boxscore$date, '%Y-%m-%d')

head(df.boxscore)
#date team player minutes fgm fga points
#1 2012-10-30 WAS A.J. Price 29.40 2 13 7 
#2 2012-10-30 WAS Emeka Okafor 24.58 4 10 10
#3 2012-10-30 WAS Trevor Ariza 24.58 3 8 9
#4 2012-10-30 WAS Bradley Beal 21.55 2 8 8 
#5 2012-10-30 WAS Trevor Booker 16.75 2 9 4 
#6 2012-10-30 WAS Jordan Crawford 28.70 4 13 11

One other data point I wanted to get was the number of games a team had played up to a certain point in the season. Since my boxscore data has team and date, this had all of the data I needed.

strQuery <- "SELECT * FROM basketball.team_game_count"
df.games <- dbGetQuery(con, strQuery)
df.games$date <- as.Date(df.games$date, '%Y-%m-%d')

df.games is a data.frame of every NBA game and tells me how many games a team has played up to a certain point in the season. game_num will be 1 for a team's first game of the season and 82 for its last game (assuming the team plays 82 games in a season).

head(df.games)
     season       date team game_num
1 2005-2006 2005-11-01  DAL        1
2 2005-2006 2005-11-01  DEN        1
3 2005-2006 2005-11-01  MIL        1
4 2005-2006 2005-11-01  NOK        1
5 2005-2006 2005-11-01  PHI        1
6 2005-2006 2005-11-01  PHO        1

Comparing plyr vs. data.table

Now I have my data, so let's dig into it a bit and show some examples of where data.tables can replace the plyr package. First, let's get team-level stats instead of player-level stats. Easy, I can just group df.boxscore by date + team and sum the fgm, fga, and points variables. This is easy to accomplish using ddply from the plyr package, but it's also easy to do using data.tables.

Our first file has player-specific stats like the number of points made by per game per player.

Let's compute team-level stats; first we'll use ddply.

library(plyr)

system.time( 
  df.statsTeam <- ddply(df.boxscore, ~ date + team,
                        summarize, points=sum(points), fgm=sum(fgm), fga=sum(fga)
                        )
  )   
 user  system elapsed 
 29.990  18.164  48.174

Ugh, that was slow. This operation is equally simple to perform w/ data.table.

Create a data.table by passing a data.frame to the data.table function.

dt.boxscore <- data.table(df.boxscore)

Then I set its key -- this is something I haven't mentioned yet, but you can find more information about keys on the data.table intro page. As you might expect, the key sorts the data.table so that R can access the data more efficiently. I'll set the key to the variables I want to group by -- date and team.

setkey(dt.boxscore, date, team)
system.time( 
  dt.statsTeam <- dt.boxscore[,list(points=sum(points),
                                    fgm=sum(fgm),
                                    fga=sum(fga)),
                              by=list(date,team)
                              ] 
  )
#  user  system elapsed 
#  0.095   0.003   0.098 

Wow, that was fast! data.table took 0.098 seconds while ddply took 48.2 seconds. But is the output the same? I'll have to convert dt.statsTeam to a data.frame in order to compare it to df.statsTeam using the identical() function. That is easily accomplished by calling data.frame( {the data table} ).

identical(df.statsTeam, data.frame(dt.statsTeam))
# [1] TRUE

Yep, they're the same! Ok great, so we can use a data.table as a near drop in replacement for ddply.

A Basic Merge

Say I only want to find the boxscores from the Houston Rockets' last game of the 2012/2013 regular season. Our first step is to just subset our dataset so that we're only looking at Rockets' data. Let's see what that would look like in regular R.

df.games.houston <- df.games[df.games$team == 'HOU' & 
                            df.games$game_num == 82 &
                            df.games$season == '2012-2013',]
df.games.houston
#         season       date team game_num
#19179 2012-2013 2013-04-17  HOU       82

## Merge of the two data frames
system.time(
  df.merge <- merge(df.boxscore, df.games.houston, by=c('date','team'))
)
#  user  system elapsed 
#  3.392   0.201   3.664 

Now to merge on data.tables we can either use the merge() function on the data.tables or we can use a unique data.table function.

## First let's just try merging on the data tables.
dt.games.houston <- data.table(df.games.houston);
dt.boxscore <- data.table(df.boxscore)
system.time(
  dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team'))
)
#  user  system elapsed 
#  0.092   0.013   0.106 

Even Faster Merges

Using merge was pretty fast, but let's try setting keys on our data before we merge.

setkey(dt.games.houston, date, team)
setkey(dt.boxscore, date, team)
system.time(
  dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team'))
)
#  user  system elapsed 
#  0.018   0.002   0.022 



Even faster! Finally, I'll use the internal data.table function for merging. It's a bit funky, but you can do the same as an inner, left, or right join. Since we've already set the keys for both data tables, it knows what fields we want to merge on (date and team). Here is an inner join.

system.time(
  dt.merge <- dt.boxscore[dt.games.houston]
)
#  user  system elapsed 
#  0.008   0.001   0.011 

Some Summary Statistics

Okay, now that I've shown you some of the great things about data tables, let's calculate some interesting stats with the data. I'll be using my original data.table here called dt.boxscore.

I'll normalize the points variable by calculating points per 48 minutes (the number of minutes in an NBA game) and then calculate some summary statistics for the 2012-2013 season for each player -- data tables will be useful here.

dt.boxscore$pp48 <- dt.boxscore$points/dt.boxscore$minutes*48
dt.boxscore.subset <- dt.boxscore[dt.boxscore$date >= '2012-09-01',]

dt.statsPlayer <- dt.boxscore.subset[,list(minutes=sum(minutes),
                                           mean=mean(pp48),
                                           min=min(pp48),
                                           lower=quantile(pp48, .25, na.rm=TRUE),
                                           middle=quantile(pp48, .50, na.rm=TRUE),
                                           upper=quantile(pp48, .75, na.rm=TRUE),
                                           max=max(pp48)),
                                     by='player']
dt.statsPlayer
               player minutes      mean      min    lower   middle    upper      max
  1:     Daequan Cook  476.37  9.810285  0.00000  0.00000  0.00000 18.11817 66.97674
  2:     Robert Sacre  207.72  5.735073  0.00000  0.00000  0.00000 10.68816 38.01980
  3:   Jared Jeffries  349.84  5.619368  0.00000  0.00000  0.00000 11.64374 23.70370
  4: Jared Cunningham   25.70 20.313846  0.00000  0.00000  0.00000 39.60131 75.39267
  5:     Kevin Murphy   52.39  7.397563  0.00000  0.00000  0.00000  0.00000 42.29075
 ---                                                                                
466:     James Harden 3227.72 32.310256 12.57485 26.12344 32.28516 38.15972 54.24410
467:     LeBron James 3837.39 33.111033 16.88159 28.30721 33.19206 37.20936 53.63128
468:     Kevin Durant 3603.38 34.830615 18.31624 30.20330 34.91109 38.75855 51.85185
469:      Kobe Bryant 3012.59 33.665142  0.00000 25.67263 36.18276 40.76094 53.66972
470:  Carmelo Anthony 2962.78 36.730475 10.66140 30.92032 36.38086 43.00658 62.82723

Let's only look at players who played over 1,000 minutes in the season (no special reason for 1,000 minutes...it just seemed like a good number to me). And then let's order by median (field is called middle) descending and show the top 10 players. If you follow basketball at all, the top 10 players here make sense.

dt.statsPlayer.1000 <- dt.statsPlayer[dt.statsPlayer$minutes >= 1000, ]
dt.top10 <- head(dt.statsPlayer.1000[order(-dt.statsPlayer.1000$middle),], 10)
dt.top10
               player minutes     mean       min    lower   middle    upper      max
 1:   Carmelo Anthony 2962.78 36.73048 10.661402 30.92032 36.38086 43.00658 62.82723
 2:       Kobe Bryant 3012.59 33.66514  0.000000 25.67263 36.18276 40.76094 53.66972
 3:      Kevin Durant 3603.38 34.83062 18.316241 30.20330 34.91109 38.75855 51.85185
 4:      LeBron James 3837.39 33.11103 16.881594 28.30721 33.19206 37.20936 53.63128
 5:      James Harden 3227.72 32.31026 12.574850 26.12344 32.28516 38.15972 54.24410
 6: Russell Westbrook 2928.55 32.12473 13.445378 26.45569 32.22371 36.74335 55.01433
 7:      Kyrie Irving 2047.88 30.70178  9.076584 22.59503 30.85990 38.09608 51.90094
 8:       Tony Parker 2938.78 28.76046  3.772102 23.31282 29.57983 34.06920 53.08598
 9:       Brook Lopez 2516.70 30.51248 16.520076 25.24544 28.80000 36.10256 53.43570
10:     Stephen Curry 3479.12 28.29065  7.902535 22.89263 28.04535 34.24703 54.00000

Other resources

Hopefully this gives you a brief overview of how great data.tables are. I'll follow-up with another post that provides some more interesting NBA analaysis. Let us know if there is any specific analysis you would be interested in seeing! If you're interested in reading more data.tables, check out these resources:

Interested in ŷhat? Learn More