  • ML Pitfalls: Measuring Performance (Part 1)

    Unfortunately, analysis lives and dies by self-reported metrics. Is feature A better than feature B? Is this classifier better than another? How much confidence can I have in this financial report? From development to consumption, almost every decision regarding analytics inherently asks "How good is this model?" "How good" can mean a lot of things, and it varies across domains and problem sets. But it is the developer's responsibility to provide a fair measurement in the ...

    Mar 03 2015
    by Eric
  • Base R Plots

    There's a lot of talk about ggplot these days (we even wrote a Python version of it) and for good reason: it's a great plotting package that's easy to use. Despite this, I sometimes find myself wanting something even quicker than ggplot. When that's the case, I turn to base R plots. They're not as pretty and the syntax is a little unpleasant but they're very fast, work on just about anything, and are ...

    Feb 23 2015
    by Greg
  • What is Linear Regression? A Qualitative Exploration

    When it comes to statistical modeling, few things are as tried and tested as linear regression. It's simple, it's (fairly) easy to conceptualize, and it's fast. Unfortunately, most of the articles I've read about it feel closer to math textbooks than to layman's definitions. In this post I'll give a fairly informal definition of linear regression, give an overview of its goals, and talk about a few things you can use it for. Caveat lector: this ...

    Feb 19 2015
    by Greg
  • 11 Python Libraries You Might Not Know

    There are tons of Python packages out there. So many that no one man or woman could possibly catch them all. PyPI alone has over 47,000 packages listed! Recently, with so many data scientists making the switch to Python, I couldn't help but think that while they're getting some of the great benefits of pandas, scikit-learn, and numpy, they're missing out on some older yet equally helpful Python libraries. In this post, I'm going to ...

    Jan 20 2015
    by Greg
  • Running R in Parallel (the easy way)

    Like a lot of folks, I have a love/hate relationship with R. One topic that I've seen people struggle with is parallel computing, or more directly, "How do I process data in R when I run out of RAM?" But fear not! R actually has some easy-to-use parallelization packages! Don't let this happen to you! Here's a quick post on doing parallel computing in R. Picking a library: My take on parallel computing has ...

    Jan 14 2015
    by Greg
  • Currency Portfolio Optimization Using ScienceOps

    Portfolio optimization is a problem faced by anyone trying to invest money (or any kind of capital, such as time) in a known group of investments. Its most obvious, and common, application is investing in the stock market. Typically, portfolio managers have two competing goals: maximize return and minimize risk. Maximizing return means selecting a group of investments that collectively result in the highest expected yield. Minimizing risk means selecting investments that are most likely to actually result in the yields ...

    Jan 05 2015
    by Ryan J. O'Neil
  • Scraping and Analyzing Baseball Data with R

    We get a lot of emails from people who are interested in analyzing sports data. The usual suspects are moneyball types--SABRmetrics enthusiasts with a love of baseball and a penchant for R. Luckily for us, baseball data is very accessible. MLB even goes as far as to make low-level details on every pitch publicly available. In this post, I'm going to show you how you can scrape your own baseball data in R and then use it ...

    Dec 23 2014
    by Greg
  • Reducing your R memory footprint by 7000x

    R is notoriously a memory-heavy language. I don't necessarily think this is a bad thing--R wasn't built to be super performant, it was built for analyzing data! That said, there are times when some implementation patterns are quite...redundant. As an example, I'm going to show you how you can prune a 330 MB glm to 45 KB without losing significant functionality. Let's trim the R fat. Le Model: Our model is going ...

    Dec 17 2014
    by Greg

Post Index

ML Pitfalls: Measuring Performance (Part 1)

Eric | Mar 03, 2015

Base R Plots

Greg | Feb 23, 2015

What is Linear Regression? A Qualitative Exploration

Greg | Feb 19, 2015

11 Python Libraries You Might Not Know

Greg | Jan 20, 2015

Running R in Parallel (the easy way)

Greg | Jan 14, 2015

Currency Portfolio Optimization Using ScienceOps

Ryan J. O'Neil | Jan 05, 2015

Scraping and Analyzing Baseball Data with R

Greg | Dec 23, 2014

Reducing your R memory footprint by 7000x

Greg | Dec 17, 2014

Naive Bayes in Python

Greg | Dec 11, 2014

Introducing db.r

Greg | Dec 04, 2014

How Yhat Does Cloud Balancing: A Case Study

Ryan J. O'Neil | Nov 10, 2014

Introducing db.py

Greg Lamp | Nov 05, 2014

Using data science to build better products

Colin Ristig | Sep 17, 2014

Analysing your e-commerce funnel with R

Justin Marciszewski | Aug 05, 2014

Fuzzy Matching with Yhat

Greg | Jul 23, 2014

Yhat ScienceBox

Colin Ristig | Jun 17, 2014

Python Sparse Random Projections

Adrian Rosebrock | Jun 05, 2014

Yhat meets Go

Jess Frazelle | May 29, 2014

Neural networks and a dive into Julia

Eric Chiang | May 15, 2014

ggplot tutorial

Greg | May 02, 2014

Python Multi-armed Bandits (and Beer!)

Eric Chiang | Apr 07, 2014

Predicting customer churn with scikit-learn

Eric Chiang | Mar 20, 2014

Real-time NLP with Twitter and Yhat

Greg | Mar 14, 2014

Yhat at NY Enterprise Technology Meetup

Greg | Mar 11, 2014

Yhat at the SF Data Science Meetup

Greg | Feb 17, 2014

Image Processing with scikit-image

Eric Chiang | Jan 30, 2014

What's new in ggplot-0.4?

Yhat | Jan 22, 2014

Data Science in Python

Greg | Jan 13, 2014

Detecting Outlier Car Prices on the Web

Josh Levy | Dec 18, 2013

Weather Forecasting with Twitter & Pandas

Eric Chiang | Dec 05, 2013

Building email reports with R

yhat | Nov 22, 2013

Aggregating & plotting time series in python

yhat | Nov 03, 2013

ggplot for python

Yhat | Oct 13, 2013

Random Forest Regression and Classification in R and Python

yhat | Sep 29, 2013

Fast summary statistics in R with data.table

Jeff | Sep 26, 2013

Two great things that go great together: Yhat and fantasy football

Drew Conway | Aug 25, 2013

Estimating User Lifetimes - the right and many wrong ways

Cam Davidson-Pilon | Aug 20, 2013

Machine Learning for Predicting Bad Loans

yhat | Aug 16, 2013

10 Books for Data Enthusiasts

yhat | Aug 11, 2013

PyData Boston 2013 Slides

yhat | Jul 29, 2013

Intuitive Classification using KNN and Python

yhat | Jul 25, 2013

Recognizing Handwritten Digits in Python

yhat | Jul 14, 2013

Named Entities in Law & Order Episodes

yhat | Jul 04, 2013

Running R in the Cloud (Part 1)

yhat | Jun 27, 2013

Statistical Quality Control in R

yhat | Jun 25, 2013

Recommendation System in R

yhat | Jun 19, 2013

Content-based image classification in Python

yhat | Jun 12, 2013

Random Forests in Python

yhat | Jun 05, 2013

Fitting & Interpreting Linear Models in R

yhat | May 18, 2013

Deploy Your R Models to yhat

yhat | May 10, 2013

pandas & google analytics

yhat | Apr 12, 2013

7 handy SQL features for data scientists

yhat | Apr 09, 2013

yhat is going to PyCon

yhat | Mar 10, 2013

Logistic Regression in Python

yhat | Mar 03, 2013

SQL for pandas DataFrames

yhat | Feb 24, 2013

R and pandas and what I've learned about each

yhat | Feb 16, 2013

Setting Up Scientific Python

yhat | Feb 15, 2013

10 R packages I wish I knew about earlier

yhat | Feb 10, 2013

Predicting SMS spam

yhat | Jan 08, 2013

Repeatable, Scalable, Analytics using yhat

yhat | Jan 05, 2013