Detecting Outlier Car Prices on the Web

by Josh Levy

December 18, 2013

We're pleased to bring you this post, courtesy of Josh Levy, Director of Data Science at Vast.com. Based in Austin, TX, Vast is a leading provider of data and technology powering vertical search for automotive, travel and real estate. Prior to Vast, Josh was an R&D Engineer at Demand Media.

You can find Josh on LinkedIn or Github.

Intro

As a data scientist, I have the great fortune of working on some really cool projects and a range of fascinating analytical problems. If you've never heard of Vast.com before, here's the elevator description.

Vast provides data to publishers, marketplaces, and search engines in 3 industries: cars, real estate, and leisure, lodging & travel. Vast's systems are delivered via a white label integration and improve search results, product recommendations, and special offers within some very popular consumer apps (Southwest GetAway Finder, AOL Travel, Yahoo! Travel, Car and Driver to name a few).

This post explores a real-world outlier detection problem and the approach I took to solving it at my company.

Outliers and outlier detection

"Outliers" are, simply speaking, data points that are especially distant from the other points in your data. They can be problematic when building analytical applications, as they tend to yield misleading results if you're not aware of them or fail to account for them adequately.

Outlier detection is an extremely important problem with a direct application in a wide variety of application domains, including fraud detection (Bolton, 2002), identifying computer network intrusions and bottlenecks (Lane, 1999), criminal activities in e-commerce and detecting suspicious activities (Chiu, 2003).

~ Jayakumar and Thomas, A New Procedure of Clustering Based on Multivariate Outlier Detection (Journal of Data Science 11(2013), 69-84)

Outliers are extremely common, and you'd be hard pressed to find a real world data set entirely without them. They can crop up for a variety of reasons: human error in creating the data, or measurement error caused by inconsistent practices across teams of researchers, to name two.
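Before getting to Vast's specific approach, here's a minimal sketch of the idea in one dimension, using the median absolute deviation (a robust alternative to the mean and standard deviation). This is purely illustrative with made-up numbers, not the method used later in this post.

```python
def median(xs):
    """Middle value of a list (average of the two middle values for even length)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (Iglewicz & Hoaglin) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# mad_outliers([10, 11, 12, 13, 100]) flags only the 100
```

A listing priced wildly away from its peers sticks out just like the 100 does here; the rest of this post replaces the simple median with a regression model's prediction.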

The problem with outliers

At Vast, we ingest listing data from thousands of suppliers and publish listings to thousands of marketplaces that trust the data are accurate. The listing data itself is initially created manually by users and is therefore vulnerable to human error.

Users submit values in the wrong field, or they mistype or fat-finger values inadvertently. 100,000 miles is a sensible number for an odometer reading of an 8 year old vehicle. But intuition tells us $100,000 is an unusual price for most compact cars. And while $42,000 is reasonable for one listing, say, a 2013 Cadillac ATS Luxury Edition, it may be unexpectedly high for another (e.g. a 1997 Buick LeSabre).

Being able to detect these scenarios lets us gracefully correct unwanted errors and deliver a superior product to the end user.

Detecting outliers at Vast

We recently needed to develop a better way to detect erroneous listings in order to resolve them before they reach users. The remainder of this post will outline the problem and the solution we devised using Python, Scikit-Learn, and ŷhat.

Overview of the approach

I'll fit a linear regression model to predict the price for a given car listing. I'll then deploy a classifier to ŷhat to flag suspicious listings based on the estimated price output by the linear regressor. I'll use websockets to stream new listings to ŷhat to identify suspected listing errors on the web in real time.

The code in this blog post was tested against Continuum Analytics' Anaconda 1.8.0 Python distribution, which includes Python 2.7.5 along with:

• pandas 0.12.0, which provides the helper functions I use to read data tables
• scikit-learn 0.14.1, which I use for feature extraction and model building
• Requests 1.2.3, which I use to communicate with ŷhat's REST endpoint

I used pip to install yhat 0.3.1, which is used to deploy my model into ŷhat, and websocket-client 0.12.0, which I used to communicate with ŷhat's websocket interface.

In [40]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [41]:
import json
from operator import itemgetter
import pandas as pd
import requests
import websocket
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from yhat import BaseModel, Yhat

In [42]:
import warnings
warnings.filterwarnings('ignore')
pd.options.display.width = 900


The data set

The training set, accord_sedan_training.csv, contains abbreviated listings for 417 Honda Accord sedans.

All are from the 2006 model year, and all are assumed to have clean titles and to be in good condition. The 2006 Accord came primarily in two trim levels: "LX" and "EX". Leather interior was an option on the "EX"; in this dataset, an "EX" with leather is labeled "EXL". Each trim level had 4 cylinder and 6 cylinder engines available. All combinations of engine and trim were available with an automatic transmission, and a manual transmission was offered in some combinations.

I'll use the read_csv function from pandas to parse the training data. That creates the DataFrame training containing integer values for price, mileage and year, and string values for trim, engine, and transmission.

In [43]:
training = pd.read_csv('data/accord_sedan_training.csv')
training.shape

Out[43]:
(417, 6)

In [44]:
training.head(7)

Out[44]:
price mileage year trim engine transmission
0 14995 67697 2006 ex 4 Cyl Manual
1 11988 73738 2006 ex 4 Cyl Manual
2 11999 80313 2006 lx 4 Cyl Automatic
3 12995 86096 2006 lx 4 Cyl Automatic
4 11333 79607 2006 lx 4 Cyl Automatic
5 10067 96966 2006 lx 4 Cyl Automatic
6 8999 126150 2006 lx 4 Cyl Automatic

In [45]:
training_no_price = training.drop(['price'], 1)
training_no_price.head()

Out[45]:
mileage year trim engine transmission
0 67697 2006 ex 4 Cyl Manual
1 73738 2006 ex 4 Cyl Manual
2 80313 2006 lx 4 Cyl Automatic
3 86096 2006 lx 4 Cyl Automatic
4 79607 2006 lx 4 Cyl Automatic

Extracting features and building the model

Next, I'll use DictVectorizer from sklearn.feature_extraction to map each row of training into a numpy array. DictVectorizer applies one-hot encoding to each string value, creating a 10-dimensional vector space corresponding to the following features:

• engine=4 Cyl
• engine=6 Cyl
• mileage
• price
• transmission=Automatic
• transmission=Manual
• trim=ex
• trim=exl
• trim=lx
• year

Note: price is part of the feature space. This is a bit of hackery. I want price to be available to the outlier detection model, but I don't want it to influence the price prediction model. I'll allow DictVectorizer to see price, but I'll zero it out before passing it through to the LinearRegression model.
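This trick works because DictVectorizer fills in 0 for any feature that is missing from a dict passed to transform. A small standalone sketch (with made-up toy values, separate from the Accord data) illustrates the behavior:

```python
from sklearn.feature_extraction import DictVectorizer

dv_demo = DictVectorizer(sparse=False)
dv_demo.fit([{'price': 100, 'mileage': 5, 'trim': 'lx'},
             {'price': 200, 'mileage': 7, 'trim': 'ex'}])

# Transform a dict with the price key dropped: the price column comes
# back as 0, so a model fit on these vectors never sees the actual price.
row = dv_demo.transform({'mileage': 5, 'trim': 'lx'})[0]
price_col = dv_demo.feature_names_.index('price')
```

So dropping price from the input dicts, as I do below with training_no_price, zeroes out that column while keeping it in the feature space for the outlier detector.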

In [46]:
dv = DictVectorizer()
dv.fit(training.T.to_dict().values())

Out[46]:
DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sparse=True)

In [47]:
len(dv.feature_names_)

Out[47]:
10

In [48]:
dv.feature_names_

Out[48]:
['engine=4 Cyl',
'engine=6 Cyl',
'mileage',
'price',
'transmission=Automatic',
'transmission=Manual',
'trim=ex',
'trim=exl',
'trim=lx',
'year']


Now I'll use the LinearRegression class from sklearn.linear_model to fit a linear model predicting price from the features coming out of the DictVectorizer.

In [49]:
LR = LinearRegression().fit(dv.transform(training_no_price.T.to_dict().values()), training.price)

In [50]:
' + '.join([format(LR.intercept_, '0.2f')] + map(lambda (f,c): "(%0.2f %s)" % (c, f), zip(dv.feature_names_, LR.coef_)))

Out[50]:
'12084.24 + (-337.20 engine=4 Cyl) + (337.20 engine=6 Cyl) + (-0.05 mileage) + (0.00 price) + (420.68 transmission=Automatic) + (-420.67 transmission=Manual) + (208.93 trim=ex) + (674.60 trim=exl) + (-883.53 trim=lx) + (2.23 year)'


The resulting model is

\begin{eqnarray*}PRICE \approx 12084.24 & - & 337.20(engine=4 Cyl) + 337.20(engine=6 Cyl) \\ & - & 0.05(mileage) + 420.68(transmission=Automatic) \\ & - & 420.67(transmission=Manual) \\ & + & 208.93(trim=ex) + 674.60(trim=exl) \\ & - & 883.53(trim=lx) + 2.23(year)\end{eqnarray*}

As previously mentioned, price has not leaked into the linear regression model: its coefficient is 0.00.
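To sanity-check the formula, we can plug in one of the training listings by hand: the 2006 LX, 4-cylinder, automatic with 80,313 miles from row 2 of the head() output above. Since the coefficients printed here are rounded, the result is only approximate:

```python
predicted = (12084.24          # intercept
             - 337.20          # engine=4 Cyl
             - 0.05 * 80313    # mileage
             + 420.68          # transmission=Automatic
             - 883.53          # trim=lx
             + 2.23 * 2006)    # year

# roughly 11741.92 -- close to that row's listed price of 11999
```

An error of a few hundred dollars is typical here, which is why the error percentiles computed below make a sensible basis for an outlier threshold.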

Now we can measure the prediction accuracy on the training set, and choose an error threshold for identifying possible outliers in new data.

In [51]:
trainingErrs = abs(LR.predict(dv.transform(training.T.to_dict().values())) - training.price)

In [52]:
percentile(trainingErrs, [75, 90, 95, 99])

Out[52]:
[1391.7170820764786,
2200.1942672614978,
2626.9376376401688,
3857.4605411615066]

In [53]:
outlierIdx = trainingErrs >= percentile(trainingErrs, 95)
scatter(training.mileage, training.price, c=(0,0,1), marker='s')
scatter(training.mileage[outlierIdx], training.price[outlierIdx], c=(1,0,0), marker='s')

Out[53]:
<matplotlib.collections.PathCollection at 0x109570b90>


I've held out 100 listings to use as a test set. These are in the file accord_sedan_testing.csv, in the same format as the training data. We can visualize both sets to see that the testing data generally follows the same price/mileage trend, but there is one significant outlier that the model does a poor job of predicting.

In [54]:
testing = pd.read_csv('data/accord_sedan_testing.csv')
testing.shape

Out[54]:
(100, 6)

In [55]:
scatter(training.mileage, training.price, c=(0,0,1), marker='s')
scatter(testing.mileage, testing.price, c=(1,1,0), marker='v')

Out[55]:
<matplotlib.collections.PathCollection at 0x1095a6cd0>

In [56]:
errs = abs(LR.predict(dv.transform(testing.T.to_dict().values())) - testing.price)

In [57]:
hist(errs, bins=50)

Out[57]:
(array([ 20.,  17.,   9.,  10.,  14.,   3.,   4.,   6.,   4.,   5.,   1.,
2.,   2.,   0.,   2.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
0.,   0.,   0.,   0.,   0.,   1.]),
array([    18.03662975,    254.07463514,    490.11264053,    726.15064592,
962.18865131,   1198.2266567 ,   1434.26466209,   1670.30266748,
1906.34067287,   2142.37867826,   2378.41668365,   2614.45468904,
2850.49269443,   3086.53069982,   3322.56870521,   3558.6067106 ,
3794.64471599,   4030.68272138,   4266.72072677,   4502.75873216,
4738.79673755,   4974.83474294,   5210.87274833,   5446.91075372,
5682.94875911,   5918.9867645 ,   6155.02476989,   6391.06277528,
6627.10078067,   6863.13878606,   7099.17679145,   7335.21479685,
7571.25280224,   7807.29080763,   8043.32881302,   8279.36681841,
8515.4048238 ,   8751.44282919,   8987.48083458,   9223.51883997,
9459.55684536,   9695.59485075,   9931.63285614,  10167.67086153,
10403.70886692,  10639.74687231,  10875.7848777 ,  11111.82288309,
11347.86088848,  11583.89889387,  11819.93689926]),
<a list of 50 Patch objects>)

In [58]:
percentile(abs(errs), [90, 95, 100])

Out[58]:
[2263.0162371350216, 2840.2272005583504, 11819.936899259745]


Deploying the model

Now it's time to build and deploy a ŷhat model. PricingModel is a subclass of BaseModel from ŷhat.

The PricingModel class has a transform method which maps a raw JSON request into the numpy array expected by our linear model. Then predict evaluates the model on that observation (i.e. on that array).

Here predict returns an object where ["suspectedOutlier"] is 1 when the prediction error is too great, and ["x"], ["predictedPrice"], and ["threshold"] provide diagnostic information.

In [59]:
class PricingModel(BaseModel):
    def transform(self, doc):
        """
        Maps input dict (from json post) into numpy array.
        Delegates to DictVectorizer self.dv.
        """
        return self.dv.transform(doc)

    def predict(self, x):
        """
        Evaluate model on array.
        Delegates to LinearRegression self.lr.
        Returns a dict (will be json encoded) supplying
        "predictedPrice", "suspectedOutlier", "x", "threshold",
        where "x" is the input vector and "threshold" is the error
        cutoff used to decide whether a listing is a suspected outlier.
        """
        doc = self.dv.inverse_transform(x)[0]
        predicted = self.lr.predict(x)[0]
        err = abs(predicted - doc['price'])
        return {'predictedPrice': predicted,
                'x': doc,
                'suspectedOutlier': 1 if (err > self.threshold) else 0,
                'threshold': self.threshold}

In [60]:
pm = PricingModel(dv=dv, lr=LR, threshold=percentile(trainingErrs, 95))

In [61]:
pm.predict(pm.transform(testing.T.to_dict()[0]))

Out[61]:
{'predictedPrice': 13289.967037908384,
'suspectedOutlier': 0,
'threshold': 2626.9376376401688,
'x': {'engine=4 Cyl': 1.0,
'mileage': 68265.0,
'price': 12995.0,
'transmission=Automatic': 1.0,
'trim=ex': 1.0,
'year': 2006.0}}


Let's write a helper function to handle model deployment to Yhat.

In [62]:
def deploy_model(model_name, fitted_model):
    protocol = 'http://'
    deployment_url = protocol + secrets['yhat_url'] + '/'
    print deployment_url
    # connect to yhat with our stored credentials, then deploy
    yh = Yhat(secrets['username'], secrets['apikey'])
    result = yh.deploy(model_name, fitted_model)
    return result


And now we can deploy our model using our helper function we just wrote.

In [63]:
success = deploy_model('levyPricePredictor', pm)
success

http://cloud.yhathq.com/


Out[63]:
{u'status': u'success'}


Predicting new data in production

The model has been deployed to ŷhat, so now we can feed it new data. Yhat exposes several interfaces to our PricingModel.

Accessing models via REST interface

I'm going to set up a few utility functions to handle authentication. This will make it easier for us to access our model via REST and Websockets.

I've stored my Yhat credentials (i.e. my username and apikey) in a json file called yhat_secrets.json. This first function just reads that file and returns a Python dictionary.

In [64]:
def read_secrets_from_file():
    with open('yhat_secrets.json') as f:
        return json.load(f)

secrets = read_secrets_from_file()


This next one simply returns my credentials as a base64 encoded string which is required for authenticating RESTful calls to our model.

In [65]:
import base64

def yhat_base64str():
    auth = '%s:%s' % (secrets['username'], secrets['apikey'])
    base64string = base64.encodestring(auth).replace('\n', '')
    return "Basic %s" % base64string


And since we're going to make requests over http as well as over an open websocket connection, let's make a helper to create the proper URL structure.

In [66]:
def model_url_for(model_name, protocol='http'):
    fmt = '{0}://cloud.yhathq.com/{1}/models/{2}/'
    url = fmt.format(protocol, secrets['username'], model_name)
    return url

In [67]:
url = model_url_for('levyPricePredictor', protocol='http')
url

Out[67]:
'http://cloud.yhathq.com/josh/models/levyPricePredictor/'


We can use our yhat_base64str helper function to compose proper headers for our RESTful API call to our model.

In [68]:
headers = {
    'Content-type': 'application/json',
    'Accept': 'application/json',
    'Authorization': yhat_base64str()
}

In [69]:
payload = testing.T.to_dict()[0]


Here's what's going into our request to our model on Yhat.

In [70]:
print 'headers'
print '*' * 100
print json.dumps(headers, indent=2)
print '*' * 100
print json.dumps(payload, indent=2)

headers
****************************************************************************************************
{
"Content-type": "application/json",
"Authorization": "Basic *************************************=",
"Accept": "application/json"
}

****************************************************************************************************
{
"trim": "ex",
"engine": "4 Cyl",
"mileage": 68265,
"transmission": "Automatic",
"price": 12995,
"year": 2006
}



And here's what a prediction response message looks like coming back from Yhat.

In [71]:
r = requests.post(url, data=json.dumps(payload), headers=headers)
print json.dumps(r.json(), indent=2)

{
"suspectedOutlier": 0,
"x": {
"mileage": 68265,
"price": 12995,
"transmission=Automatic": 1,
"trim=ex": 1,
"year": 2006,
"engine=4 Cyl": 1
},
"yhat_id": "e9b0eb57-619e-40f9-871b-9de88de84144",
"predictedPrice": 13289.96704,
"threshold": 2626.93764
}



REST is the lingua franca of the web and a key interface for accessing models in production software applications. It's great that we have that in our toolbox, but it's not the only way to access our models.

Accessing models via Websocket interface

Yhat also exposes our PricingModel via a streaming websocket interface. This is far more suitable for some types of applications--particularly those where latency is a concern or where you anticipate high throughput or prediction volume (e.g. pricing in app purchases in a mobile game or virtually anything in the ad tech space).

Let's see how we'd access our model via Yhat's streaming API interface. We'll run the entire testing / holdout set and generate a report from the suspected outliers.

In [72]:
url = model_url_for('levyPricePredictor', protocol='ws')
url

Out[72]:
'ws://cloud.yhathq.com/josh/models/levyPricePredictor/'


Websockets let us perform the handshake once to establish a communication channel which remains open. This enables us to send as many messages as we like through that channel without opening and closing a connection for each request.

In a situation where we want to score many listings as soon as they become available, this ends up being more efficient than REST which would require that we "shake hands" each time we send a new message. There's definitely a tradeoff between the high overhead of sending each listing in its own message and introducing latency by collecting listings into a batch that would then get sent all together in a single REST call.
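If we did go the batching route, the grouping itself is simple. Here's a generic sketch (a hypothetical helper, not part of ŷhat's API) that collects listings into fixed-size batches:

```python
def batched(items, batch_size):
    """Group an iterable into lists of at most batch_size, preserving order."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

# each yielded batch could then be sent as the body of a single REST call
```

The batch_size knob is exactly the latency/overhead tradeoff described above: larger batches amortize connection overhead, smaller ones score listings sooner.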

For demo purposes, I'm using synchronous calls to communicate over the websocket: I follow each call to ws.send with a call to ws.recv, then sort the suspected outliers from this finite set.

In a real system, however, the communication would likely be asynchronous. The websocket-client Python package includes an event-driven API that mirrors the Javascript API for websockets. As new listings flow through a pipeline, we can send them through the websocket to ŷhat for scoring in a streaming fashion. A message handler would look at the responses and handle the suspected outliers appropriately; for example, the listing could be flagged to prevent its display until it can be confirmed, and a ticket could be filed to trigger an investigation.
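As a rough sketch of what that asynchronous version might look like (the wiring uses websocket-client's WebSocketApp; the handler and queue names are hypothetical, not production code):

```python
import json

suspected = []  # queue of listings awaiting human review

def on_message(ws, message):
    """Inspect each scored listing as it arrives; queue suspected outliers."""
    record = json.loads(message)
    if record.get('suspectedOutlier') == 1:
        suspected.append(record)

def run(url):
    # blocks and dispatches each response to on_message until the
    # connection closes
    import websocket  # websocket-client package
    app = websocket.WebSocketApp(url, on_message=on_message)
    app.run_forever()
```

The handler is deliberately small and side-effect free apart from the queue, which makes it easy to test without a live connection.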

Let's write a helper function which opens a secure websocket connection. This makes it easier to perform the one-time handshake between us and our model on Yhat.

In [73]:
def open_secure_socket():
    ws = websocket.create_connection(url)
    auth = {
        "apikey": secrets['apikey']
    }
    ws.send(json.dumps(auth))
    return ws


And another little helper which streams data to Yhat to make predictions over websockets.

In [74]:
def findOutliers():
    ws = open_secure_socket()
    for _, item in testing.T.iteritems():
        ws.send(json.dumps(item.to_dict()))
        res = json.loads(ws.recv())
        yield res
    ws.close()

In [75]:
possible_outliers = []
n_records = 0
for record in findOutliers():
    possible_outlier = record['suspectedOutlier'] == 1
    if possible_outlier:
        possible_outliers.append(record)
    n_records += 1

print 'n_records: %d' % n_records
print "n_possible_outliers: %d" % len(possible_outliers)
print "n_possible_outliers / n_records: %2f" % (len(possible_outliers) / float(n_records))

n_records: 100
n_possible_outliers: 7
n_possible_outliers / n_records: 0.070000



This model has identified 7 suspected outliers. Let's look at one.

In [76]:
possible_outliers[0]

Out[76]:
{u'predictedPrice': 10461.03503,
u'suspectedOutlier': 1,
u'threshold': 2626.93764,
u'x': {u'engine=4 Cyl': 1,
u'mileage': 122458,
u'price': 7499,
u'transmission=Automatic': 1,
u'trim=ex': 1,
u'year': 2006},
u'yhat_id': u'db702640-b60a-4df7-bfed-817357d166f3'}


These are possible outliers, so we don't know for sure whether these actually stem from invalid, unwanted, or otherwise "bad" data. But we do know that they at least look a bit fishy. So which among these looks the "most fishy," or the most severe?

We can make a helper function to compute the absolute delta between what the regressor estimated the price to be and the actual price as it's listed on Vast (i.e. abs(predicted_y - actual_y)).

In [77]:
def calc_delta(record):
    """
    Compute the absolute difference between the observed price and the
    value estimated by our regression model.

    Args:
        record: Yhat response as a Python dictionary.

    Returns:
        error: float

    Example:

        record = {
            u'predictedPrice': 14633.81626,
            u'suspectedOutlier': 1,
            u'threshold': 2626.93764,
            u'x': {u'engine=4 Cyl': 1,
                   u'mileage': 51442,
                   u'price': 11800,
                   u'transmission=Automatic': 1,
                   u'trim=exl': 1,
                   u'year': 2006},
            u'yhat_id': u'dfd222d7-7a59-4089-9b78-0da5cc20b336'
        }
        calc_delta(record)  # 2833.8162599999996
    """
    predicted_y = record['predictedPrice']
    actual_y = record['x']['price']
    return abs(predicted_y - actual_y)


This can be used to sort the results:

In [78]:
possible_outliers = sorted(possible_outliers, key=calc_delta, reverse=True)
most_severe = possible_outliers[0]
most_severe

Out[78]:
{u'predictedPrice': 14431.9369,
u'suspectedOutlier': 1,
u'threshold': 2626.93764,
u'x': {u'engine=6 Cyl': 1,
u'mileage': 59308,
u'price': 2612,
u'transmission=Automatic': 1,
u'trim=ex': 1,
u'year': 2006},
u'yhat_id': u'96abe028-45e4-40be-a823-803b801decc4'}


The most severe is a relatively high-end (6 cylinder, EX trim), low-mileage (~60,000 miles) vehicle listed almost $12,000 under what our linear model predicted it should be.

Based on experience, I'd bet this one was either (A) a consequence of mistyped data on the site, or (B) a vehicle with some other undesirable property, like body damage or title issues, that we didn't include in our model.