
ML Pitfalls: Measuring Performance (Part 1)
Unfortunately, analysis lives and dies by self-reported metrics. Is feature A better than feature B? Is this classifier better than another? How much confidence can I have in this financial report? From development to consumption, almost every decision regarding analytics inherently asks "How good is this model?" "How good" can mean a lot of things, and it varies across domains and problem sets. But it is the developer's responsibility to provide a fair measurement in the ...
Mar 03 2015 by Eric
Base R Plots
There's a lot of talk about ggplot these days (we even wrote a Python version of it) and for good reason: it's a great plotting package that's easy to use. Despite this, I sometimes find myself wanting something even quicker than ggplot. When that's the case, I turn to base R plots. They're not as pretty and the syntax is a little unpleasant but they're very fast, work on just about anything, and are ...
Feb 23 2015 by Greg
What is Linear Regression? A Qualitative Exploration
When it comes to statistical modeling, few things are as tried and tested as linear regression. It's simple, it's (fairly) easy to conceptualize, and it's fast. Unfortunately, most of the articles I've read about it feel closer to math textbooks than to layman's definitions. In this post I'll give a fairly informal definition of linear regression, give an overview of its goals, and talk about a few things you can use it for. Caveat lector: this ...
Feb 19 2015 by Greg
11 Python Libraries You Might Not Know
There are tons of Python packages out there. So many that no one man or woman could possibly catch them all. PyPI alone has over 47,000 packages listed! Recently, with so many data scientists making the switch to Python, I couldn't help but think that while they're getting some of the great benefits of pandas, scikit-learn, and numpy, they're missing out on some older yet equally helpful Python libraries. In this post, I'm going to ...
Jan 20 2015 by Greg
Running R in Parallel (the easy way)
Like a lot of folks, I have a love/hate relationship with R. One topic that I've seen people struggle with is parallel computing, or more directly, "How do I process data in R when I run out of RAM?" But fear not! R actually has some easy-to-use parallelization packages! Don't let this happen to you! Here's a quick post on doing parallel computing in R. Picking a library: My take on parallel computing has ...
Jan 14 2015 by Greg
Currency Portfolio Optimization Using ScienceOps
Portfolio optimization is a problem faced by anyone trying to invest money (or any kind of capital, such as time) in a known group of investments. Its most obvious, and common, application is investing in the stock market. Typically, portfolio managers have two competing goals: maximize return and minimize risk. Maximizing return means selecting a group of investments that collectively result in the highest expected yield. Minimizing risk means selecting investments that are most likely to actually result in the yields ...
Jan 05 2015 by Ryan J. O'Neil
Scraping and Analyzing Baseball Data with R
We get a lot of emails from people who are interested in analyzing sports data. The usual suspects are moneyball types: SABRmetrics enthusiasts with a love of baseball and a penchant for R. Luckily for us, baseball data is very accessible. The MLB even goes as far as to make low-level details on every pitch publicly available. In this post, I'm going to show you how you can scrape your own baseball data in R and then use it ...
Dec 23 2014 by Greg
Reducing your R memory footprint by 7000x
R is notoriously a memory-heavy language. I don't necessarily think this is a bad thing; R wasn't built to be super performant, it was built for analyzing data! That said, there are times when some implementation patterns are quite...redundant. As an example, I'm going to show you how you can prune a 330 MB glm to 45 KB without losing significant functionality. > Let's trim the R fat Le Model Our model is going ...
Dec 17 2014 by Greg
Post Index