
Estimating User Lifetimes - the right and many wrong ways

by Cam Davidson-Pilon


We're happy to bring you our first guest blog post. It comes to you from Cam Davidson-Pilon, author of Bayesian Methods for Hackers. Cam is a strategic data analyst at Shopify, a leading ecommerce solution for small and medium businesses, and head organizer of Data Origami, a non-profit organization devoted to helping other non-profits expose value in their data. Follow him on Twitter at @cmrn_dp and at his personal website camdp.com.

iPhone Lifetimes

Suppose I ask you to estimate the median lifespan of your iPhones (because medians are cool and totally hot right now). Simple, you think: my first iPhone lasted 5 months, my second and third only lasted 1 month (defective batteries!), and my fourth I've had now for 9 months. So the median is \( med(1,1,5,9) = 3 \) months. But you feel strange about this. You think: my current iPhone is still with me, so its lifespan is definitely not 9 months. Maybe I'll just exclude it from the analysis.

So your new answer is \( med(1,1,5) = 1 \). This feels even stranger, as it suggests your current iPhone's lifetime (9 months) is a huge outlier! How can we reconcile these differences?

Here's a bunch of wrong ways to find the median lifespan of your iPhone:

  1. Take the average of 3 months and 1 month
  2. Just use 3 months or 1 month, and leave a small footnote as to whether or not it includes your current iPhone
  3. Try to guess how much longer your current iPhone will last using the previous data.

Dealing with iPhones is a simple, artificial example. But what if you are interested in the median lifetime of customers on your service? It makes no sense to exclude the current users, nor does it make sense to set their lifetimes to their current values. We need some new tools to deal with this situation.

Survival Analysis

Let's create an example. We will use the service model - when a customer joins your service, we consider that a birth, and when a customer leaves your service, that is a death. We have a record of all births, but at points in time we do not necessarily see all the deaths. Consider the graphic below. Each line represents a user moving through time. For each user, we of course observe their birth (joins service). At current time \( t \), we only observe some deaths, denoted by red points, whereas the still active users have not suffered this death event yet.
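Since the original graphic isn't reproduced here, below is a minimal sketch of how such right-censored data might be simulated and drawn with NumPy and matplotlib. The variable names (birth, death, observed_dead, T) and the chosen parameters are my own, not from the original analysis; later snippets reuse them.

```python
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(0)
N = 20
T = 10.0                                  # the current time t
birth = np.random.uniform(0, T, size=N)   # each user's sign-up time
true_lifetime = 5.0 * np.random.weibull(2.0, size=N)  # hidden true lifetimes
death = birth + true_lifetime
observed_dead = death < T                 # only these deaths are visible

plt.figure(figsize=(8, 4))
for i in range(N):
    plt.plot([birth[i], min(death[i], T)], [i, i], c="k", lw=1.5)
    if observed_dead[i]:
        plt.scatter(death[i], i, c="r", s=30)   # red point = observed death
plt.axvline(T, linestyle="--", c="b")
plt.xlabel("time")
plt.ylabel("user")
plt.title("User lifetimes up to the current time $t$")
plt.show()
```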

We call deaths we have not yet seen right-censored events. Our objective is to calculate the median lifetime of all users, dead or alive.

What do we do? It is not correct to exclude the current users: your estimate will overemphasize the presence of users who sign up and then leave almost immediately (very short lifespans), and completely underemphasize the surviving users (who generally have long lifetimes - the mere fact they are alive is evidence of this).

The other possibility is to use the alive users' current lifetimes along with all the other lifetimes. This is not advisable either, as the implicit assumption you've made is that no user can have a lifetime, even theoretically, longer than your service's own existence.

No, for this one, we are going Bayesian.

Bayesian?

I'm glad you asked! We are working on answering that question. We are writing an open-source, interactive ebook titled Bayesian Methods for Hackers. It is written entirely in the IPython Notebook format, and offers a computation/understanding-first, mathematics-second point of view. Simply put, we are trying to abstract Bayesian inference away from the mathematics. I'll leave the link here, but here's what you need to know for the remainder of this article:

  1. We are interested in the posterior distribution, which is essentially our updated beliefs about the model's unknown parameters (we'll describe a model soon).
  2. The width of this posterior distribution reflects how uncertain we are about the unknown value: wide and flat == more uncertain, skinny and tall == more confident.
  3. In practice, we are returned thousands of samples from the posterior (so forming a histogram of these samples recreates the posterior).

Okay, let's dig into some data. For this problem, we are likely looking at Weibull-distributed lifetimes. The Weibull distribution is a very flexible and common model for lifetimes (hence why I chose it), and is perhaps also known to engineers as the “bathtub” distribution. The distribution has two parameters: \( \alpha \) and \( \beta \) (these are our model's unknown parameters).
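For reference, here is the two-parameter form assumed throughout, consistent with PyMC's parameterization and with the median formula used later:

\[
f(t \mid \alpha, \beta) = \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha - 1} e^{-(t/\beta)^{\alpha}}, \qquad S(t) = P(T > t) = e^{-(t/\beta)^{\alpha}}.
\]

Setting \( S(t) = 1/2 \) and solving for \( t \) gives the median, \( \beta (\log 2)^{1/\alpha} \).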

Our goal is to find these unknown \( \alpha \) and \( \beta \) parameters using the partial data we have. We will use PyMC, a library for performing Bayesian analysis. I won't go into the syntax of PyMC, though the code does suggest what it does (but see Chapter 2 for an intro to PyMC):
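The original code block isn't reproduced here, so what follows is a sketch in PyMC 2.x syntax, continuing from the simulated data above. Splitting the lifetimes into an observed block (dead users) and a latent block (still-alive users) is one way, though not necessarily the author's, to express the setup.

```python
import pymc as pm  # PyMC 2.x

# Observed quantities derived from the simulation above
age = np.minimum(death, T) - birth   # lifetime if dead, current age if alive
alive = ~observed_dead               # right-censored flags

# Uniform priors over a generous range for the two Weibull parameters
alpha = pm.Uniform("alpha", 0, 20)
beta = pm.Uniform("beta", 0, 20)

# Dead users: their lifetimes are fully observed
obs_dead = pm.Weibull("obs_dead", alpha, beta,
                      value=age[~alive], observed=True)

# Alive users: their true lifetimes are latent; start them at the current
# ages and let MCMC impute values consistent with the censoring
obs_alive = pm.Weibull("obs_alive", alpha, beta,
                       value=age[alive], observed=False)
```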


Censor factor

The following code is our censorship event. Essentially, if a guess doesn't respect the hard constraint that some samples are right-censored, then we discard that guess and try again.
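A sketch of that constraint as a PyMC 2.x potential, continuing the snippet above: any proposed state in which an imputed lifetime is shorter than the age that user has already reached gets an enormous log-probability penalty, so the sampler rejects it.

```python
@pm.potential
def censor_factor(obs_alive=obs_alive):
    # Reject guesses where an imputed lifetime is shorter than the
    # age we have already watched that user survive to
    if np.any(obs_alive < age[alive]):
        return -100000.0   # effectively a log-probability of zero
    return 0.0
```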


Below we plot the posterior distributions of the two unknowns:
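A sketch of the sampling and plotting step, under the same assumptions as above (the iteration counts are illustrative):

```python
mcmc = pm.MCMC([alpha, beta, obs_dead, obs_alive, censor_factor])
mcmc.sample(50000, burn=30000)

alpha_samples = mcmc.trace("alpha")[:]
beta_samples = mcmc.trace("beta")[:]

plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.hist(alpha_samples, bins=40, histtype="stepfilled", density=True)
plt.title(r"posterior of $\alpha$")
plt.subplot(122)
plt.hist(beta_samples, bins=40, histtype="stepfilled", density=True)
plt.title(r"posterior of $\beta$")
plt.show()
```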

Each of the distributions in the figure below is derived from a pair \( (\alpha_i, \beta_i) \) from the posterior. The pair is then used to reconstruct a Weibull distribution. The thickest line is the distribution using the average posterior values of \( \alpha \) and \( \beta \). It is probably pretty close to what the true survival distribution looks like.
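One way such a figure could be reconstructed, continuing the sketch (here plotting each posterior pair as a survival function \( S(t) = e^{-(t/\beta)^\alpha} \)):

```python
t = np.linspace(0.01, 15, 200)
# a faint line per (thinned) posterior sample
for a, b in zip(alpha_samples[::200], beta_samples[::200]):
    plt.plot(t, np.exp(-(t / b) ** a), c="k", alpha=0.1)
# the thickest line: the curve at the average posterior values
plt.plot(t, np.exp(-(t / beta_samples.mean()) ** alpha_samples.mean()),
         c="r", lw=3)
plt.xlabel("time")
plt.ylabel("survival function $S(t)$")
plt.show()
```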

Back to the median lifetime

To compute the median lifetime, we note that the median of a Weibull is \( \beta(\log{2})^{1/\alpha} \). For each \( (\alpha_i, \beta_i) \) sample returned from the posterior, we can compute the median:
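Continuing the sketch, the median formula applied elementwise to the posterior samples:

```python
# median of a Weibull(alpha, beta) is beta * (log 2) ** (1 / alpha)
median_samples = beta_samples * np.log(2) ** (1.0 / alpha_samples)

plt.hist(median_samples, bins=40, histtype="stepfilled", density=True)
plt.title("posterior distribution of the median lifetime")
plt.xlabel("time")
plt.show()
```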

Wow! The true median of our users' lifetimes could fall anywhere in this distribution. It could be as low as 2.5, or as high as 5.5! (And here you thought you could be satisfied with a single number.) This huge range is not the fault of the technique: the same uncertainty would be present using any technique; we simply have a better visualization for seeing it. The uncertainty comes from the lack of data (we've only been using 20 data points). Let's bump that up to 2500 data points.

Unfortunately, the trick using the @pm.potential function will not scale to large numbers of data points. Instead we will directly use the log-likelihood of the Weibull distribution:
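A sketch of that approach, again in PyMC 2.x: the @pm.observed decorator lets us hand the sampler the exact censored-Weibull log-likelihood, in which a dead user contributes \( \log f(t) \) and a still-alive user contributes only the survival term \( \log S(t) = -(t/\beta)^\alpha \). Assume age and alive are built as before, just from the larger dataset.

```python
alpha = pm.Uniform("alpha", 0, 20)
beta = pm.Uniform("beta", 0, 20)

@pm.observed
def survival(value=age, alpha=alpha, beta=beta):
    # log f(t) = log(alpha/beta) + (alpha-1)*log(t/beta) - (t/beta)**alpha
    # log S(t) = -(t/beta)**alpha   (the only term censored users keep)
    return np.sum((~alive) * (np.log(alpha / beta)
                              + (alpha - 1) * np.log(value / beta))
                  - (value / beta) ** alpha)

mcmc = pm.MCMC([alpha, beta, survival])
mcmc.sample(50000, burn=30000)
```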



As we can see, we can be much more confident in what the true median of lifetimes might be, thanks to using (significantly) more data.

Conclusion

Hopefully this solved a burning question you might have had: how do I correctly estimate the median (or mean; the computations work the same) lifetime of my service's users? Though the problem can be expressed in other statistical frameworks (ask your local statistician about MLEs or Kaplan-Meier estimates), choosing the Bayesian route allows us to see the uncertainty and to build more flexible models (it's really easy to make \( \alpha \) or \( \beta \) a function of other variables).

Thanks to Yhat for the guest blog, and have a look at our Bayesian Methods for Hackers project!