
10 Books for Data Enthusiasts

by yhat


Over the last few years, I've invested a lot of time exploring various areas of data analysis and software development. Going down the proverbial coding rabbit hole, I've quietly accumulated a lot of books on various subjects.

This is a post about 10 data books that I've gotten a lot of mileage out of and that really have legs.

  1. Programming Collective Intelligence by Toby Segaran

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    An overview of machine learning and the key algorithms in use today. Each chapter outlines a problem, defines an approach to solving it using a particular algorithm, and then gives you all the sample code you need to solve it.

    Why you should read it

    One of my favorite books (non-technical and technical). I try to re-read it at least once per year. Great explanations of how you can make machine learning useful.

    Everyone has something to learn from PCI. My only criticism: the code is indented with 2 spaces instead of 4. Nitpicky, but annoying. Despite being one of the oldest books on the list, it has managed to stay extremely relevant in the ever-changing landscape of data analysis tools.

  2. Machine Learning for Hackers by Drew Conway and John Myles White

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    A series of real-world case studies and solutions that use machine learning. This is a very practical approach to the subject. The visuals are great and there are plenty of code samples to go around. A few of the chapters focusing on text classification/regression are particularly well done.

    Why you should read it

    I was on the pre-order list for this one. It was a gruelling 3 months on the waiting list, but when it arrived Machine Learning for Hackers didn't disappoint. The code examples are written for readability rather than performance, which makes it much easier to follow along in the book (and translate them to other languages if need be). The code examples were also translated into Python, so I've included the Python logo even though it's not actually in the book.

  3. Super Crunchers by Ian Ayres

    Overall
    No Code Examples
    Depth
    Readability
    Synopsis

    A collection of stories about data, modeling, and analysis, Super Crunchers tells how data and analysis are used in practice. Some of the examples are a little dated, but the core message stands the test of time.

    Why you should read it

    It's a lot higher level than most of the books on this list, and is geared toward people who might not actually be doing the analysis or the modeling. Still, Super Crunchers is a great read, and if you happen to be an analyst or data scientist, it will give you some insight into how the rest of the world views your work (for better or worse). The most important takeaway from the book is not necessarily what algorithms or technologies are being applied, but how they're being applied and how they're changing the way that companies use their data.

  4. Python for Data Analysis by Wes McKinney

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    A few years ago Wes McKinney took one for the team. He quit his job and wrote pandas, the open source Python package for wrangling data. Naturally, Wes is the best person to write the book on pandas. The title may be a little misleading, but Python for Data Analysis shows you the ins and outs of using pandas to improve your workflow.

    Why you should read it

    pandas is a must-have for doing analysis with Python. This book focuses more on munging, wrangling, and formatting data (not modeling, as many people incorrectly assume). So if you need to brush up on your data wrangling (and you probably do), grab this off the shelf.

  5. R Cookbook by Paul Teetor

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    Pretty straightforward. A series of recipes for problems frequently encountered when doing analysis. Things like: building a regression model, merging data, imputing values, file I/O, etc.

    Why you should read it

    R can be a prickly language. The syntax is a little strange when you first start, everything is in tabular form, and weird stuff just tends to happen in general. This is the perfect book for when you have a question like:

    "I just want to loop through a bunch of files and combine them together. I know exactly how I'd do it in Python, but how the heck do I do it in R?"
    I strongly recommend this book if you're learning R, especially if you're coming from another programming language. It'll sit on your desk at work forever and you're guaranteed to pick it up at least a couple times per week.

  6. The Signal and the Noise by Nate Silver

    Overall
    No Code Examples
    Depth
    Readability
    Synopsis

    A great overview of how predictions impact different parts of our lives. The book follows a similar pattern to Super Crunchers, telling stories related to data and prediction, and then tying them all together at the end. A great, quick read for anyone interested in data or analysis.

    Why you should read it

    Just because it's on The Internet doesn't mean it's true. The same goes for data. If you stare at a chart for long enough, a trend begins to emerge. The Signal and the Noise does a great job of teaching you when to throw up a warning flag when someone hands you some analysis.

  7. Visualize This by Nathan Yau

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    This is essentially the first couple years of Nathan Yau's blog, FlowingData, in book format. There are great code examples to go along with some truly spectacular visuals.

    Why you should read it

    You can't show off your work without some nifty data visuals. This book takes you step by step and shows you how easy it is to construct great looking charts, maps, and other visuals if you use the right tools.

  8. ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    The name pretty much sums it up. This book shows you how to use ggplot2 by walking you through some examples and gradually adding complexity.

    Why you should read it

    If you're going to use R, you're inevitably going to be using ggplot2. ggplot2 is one of the most popular R packages and probably the standard for making great looking visualizations. Who better to teach you how to use ggplot2 than the package's creator, Hadley Wickham? The book provides some core examples for making basic plots, and then expands on each of these by detailing some of the more in-depth and advanced features of ggplot2, which makes it great for both beginners and advanced users.

  9. The NLTK Books by Jacob Perkins, Steven Bird, Ewan Klein, and Edward Loper

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    The Natural Language Toolkit (NLTK) is an excellent Python library for processing text and language. It has excellent APIs that can preprocess, classify, and help analyze your text. The Cookbook and the freely available online book serve as the instruction manuals for using NLTK.

    Why you should read it

    Text analytics is really fun. Some of the examples in the NLTK books are really just magical (the text classification chapter is particularly cool). Some of the code examples lean heavily on Python's syntactic sugar, which can make them a little difficult to read for someone who is new to Python, but the breadth of examples more than makes up for it. Top it all off with a really amazing library and it makes for a great read.

  10. Think Stats by Allen B. Downey

    Overall
    Code Examples
    Depth
    Readability
    Synopsis

    This book provides a gentle overview of statistics and a nice tutorial on using Python as well. It's sort of a crash course in statistics for those of us who chose to major in something less mathy in school.

    Why you should read it

    It's short, sweet, and to the point. Think Stats serves as the introduction to statistics course that many people missed out on in school. If you need to brush up on CDFs, PDFs, Normal Variates, or the Central Limit Theorem, then this is the book you're looking for. Also not a bad way to learn Python while picking up some stats skills.
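
For a taste of the kind of idea Think Stats covers, here is a minimal sketch of the Central Limit Theorem using only Python's standard library. It is not taken from the book: the function name `sample_means`, the choice of an exponential distribution, and the sample sizes are all assumptions picked for illustration.

```python
import random
import statistics


def sample_means(n_samples, sample_size, seed=0):
    """Return the mean of each of n_samples draws from a skewed distribution.

    Hypothetical helper for illustration. Each draw consists of
    sample_size values from an exponential distribution with rate 1,
    which has mean 1 and standard deviation 1.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    means = []
    for _ in range(n_samples):
        draw = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(draw))
    return means


means = sample_means(n_samples=2000, sample_size=50)

# The CLT says the sample means should cluster around the population
# mean (1.0) with a spread close to sigma / sqrt(sample_size).
center = statistics.fmean(means)
spread = statistics.stdev(means)
predicted_spread = 1 / 50 ** 0.5

print(f"mean of sample means: {center:.3f}")
print(f"std of sample means:  {spread:.3f}")
print(f"CLT prediction:       {predicted_spread:.3f}")
```

Even though the underlying exponential distribution is heavily skewed, the sample means should land close to 1.0, with a spread near the predicted value. Raising `sample_size` should tighten the spread further.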

Other Books

A few others that didn't quite make the list but we still love:

Misc

Let us know if there are any others you think we missed!