Real-time NLP with Twitter and Yhat

by Greg


The Premise

We wanted a way to show off the Yhat WebSocket API, so we threw together a small node.js app that does real-time named entity recognition using nltk. It's not perfect, but considering it took me about an hour to build, I think it's off to a good start! Layout and styles are courtesy of Ms. Jess Frazelle.


How it works

The Sprinkler

We used ntwitter and node.js to connect to the Twitter Streaming API. One cool feature is that you can filter for tweets containing certain keywords. We're grabbing any tweets that mention Obama, Putin, or Ukraine. We're seeing between 7 and 12 tweets per second on average (NOTE: I benchmarked this at 5pm ET on Wednesday).

Each time we receive a tweet, we send a message to each connected browser. You can see the rest of the server side code on github.

var tags = ["obama", "putin", "ukraine"];
twit.stream('statuses/filter', { track: tags }, function(stream) {
  stream.on('data', function(tweet) {
    _.each(connections, function(conn) {
      conn.write(JSON.stringify(tweet)); // push the raw tweet out to each connected browser
    });
  });
});

Tagging People and Places

To tag the data, we're using nltk and Yhat! nltk provides off-the-shelf word tokenizers, part-of-speech tagging, and named entity extractors that make it super easy to look like an NLP mastermind. Try it yourself:

import nltk
tweet = """BREAKING: President Obama: U.S. stand with Ukrainian people on sovereignty and 'will
be forced to apply costs' if Russia continues. Via @AP"""
tokens = nltk.word_tokenize(tweet)
pos_tags = nltk.pos_tag(tokens)
trees = nltk.ne_chunk(pos_tags)
Tree('S', [('BREAKING', 'NN'), (':', ':'), ('President', 'NNP'), Tree('PERSON', [('Obama', 'NNP')]), (':', ':'), ('U.S', 'JJ'), ('stand', 'NN'), ('with', 'IN'), Tree('GPE', [('Ukrainian', 'JJ')]), ('people', 'NNS'), ('on', 'IN'), ('sovereignty', 'NN'), ('and', 'CC'), ("'will", 'NNP'), ('be', 'VB'), ('forced', 'VBN'), ('to', 'TO'), ('apply', 'RB'), ('costs', 'VBZ'), ("'", "''"), ('if', 'IN'), Tree('GPE', [('Russia', 'NNP')]), ('continues.', 'NNP'), ('Via', 'NNP'), ('@', 'NNP'), Tree('ORGANIZATION', [('AP', 'NNP')])])

You can then traverse the parsed tree and tag any people or places:

for tree in trees.subtrees():
    etype = None
    if tree.node=="PERSON":
        etype = "PERSON"
    elif tree.node=="GPE":
        etype = "PLACE"
    if etype is not None:
        ne = " ".join([leaf[0] for leaf in tree.leaves()])
        tweet = tweet.replace(ne, "<" + etype + ">" + ne + "</" + etype + ">")
print tweet
"BREAKING: President <PERSON>Obama</PERSON>: <PLACE>U.S.</PLACE> stand with 
<PLACE>Ukrainian</PLACE> people on sovereignty and 'will be forced to apply costs' 
if <PLACE>Russia</PLACE> continues. Via @AP"

I wrapped all of this code in a function called tag_tweet.
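
For reference, here's a minimal sketch of what that wrapper might look like, pieced together from the snippets above (the exact version the app uses is in the repo on github):

import nltk

def tag_tweet(tweet):
    # tokenize, POS-tag, and chunk the tweet into named-entity subtrees
    tokens = nltk.word_tokenize(tweet)
    pos_tags = nltk.pos_tag(tokens)
    trees = nltk.ne_chunk(pos_tags)
    # wrap any PERSON or GPE entities in <PERSON>/<PLACE> tags
    for tree in trees.subtrees():
        etype = None
        if tree.node == "PERSON":
            etype = "PERSON"
        elif tree.node == "GPE":
            etype = "PLACE"
        if etype is not None:
            ne = " ".join([leaf[0] for leaf in tree.leaves()])
            tweet = tweet.replace(ne, "<" + etype + ">" + ne + "</" + etype + ">")
    return tweet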


Wrapping everything in Yhat is easy. Just define how you'll handle incoming data and invoke the tag_tweet function we wrote earlier.

This will automatically deploy as a REST and streaming API.

from yhat import YhatModel, Yhat, preprocess

class Tagger(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, raw):
        tweet = raw['text']
        tagged = tag_tweet(tweet)
        raw['tagged'] = tagged
        return raw

yh = Yhat("greg", "myapikey", "http://cloud.yhathq.com/")
yh.deploy("NamedEntityTagger", Tagger, globals())
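
Once it's deployed, any app can hit the model. As a quick sanity check, here's a hypothetical call assuming the Python client's predict method (the model name matches the deploy above; the credentials and sample text are just placeholders):

# sanity check against the deployed model (assumes the client's predict
# method; swap in your own username and API key)
result = yh.predict("NamedEntityTagger", {"text": "Obama meets Putin to discuss Ukraine"})
print result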

The App

You can check it out at http://twitter-tagger.yhathq.com/, or find all the code on github.
