Real-time NLP with Twitter and Yhat

by Greg

The Premise

We wanted a way to show off the Yhat WebSocket API, so we threw together a small node.js app that does real-time named entity recognition using nltk. It's not perfect, but considering it took me about an hour to build, I think it's off to a good start! Layout and styles are courtesy of Ms. Jess Frazelle.

http://twitter-tagger.yhathq.com/

How it works

The Sprinkler

We used ntwitter and node.js to connect to the Twitter Streaming API. One cool feature is that you can filter for tweets containing certain keywords. We're grabbing any tweets that mention Obama, Putin, or Ukraine. We're seeing between 7 and 12 tweets per second on average (NOTE: I benchmarked this at 5pm ET on Wednesday).

Each time we receive a tweet, we send a message to each connected browser. You can see the rest of the server-side code on GitHub.

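// keywords to track on the Twitter Streaming API
var tags = ["obama", "putin", "ukraine"];
// twit is the ntwitter client; connections holds the open browser sockets
twit.stream('statuses/filter', { track: tags }, function(stream) {
  stream.on('data', function(tweet) {
    // broadcast every incoming tweet to each connected browser
    _.each(connections, function(conn) {
      conn.send(tweet);
    });
  });
});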

Tagging People and Places

To tag the data, we're using nltk and Yhat! nltk provides off-the-shelf word tokenizers, part-of-speech taggers, and named entity extractors that make it super easy to look like an NLP mastermind. Try it yourself:

import nltk
tweet = """BREAKING: President Obama: U.S. stand with Ukrainian people on sovereignty and 'will
be forced to apply costs' if Russia continues. Via @AP"""
tokens = nltk.word_tokenize(tweet)
pos_tags = nltk.pos_tag(tokens)
trees = nltk.ne_chunk(pos_tags)
trees
Tree('S', [('BREAKING', 'NN'), (':', ':'), ('President', 'NNP'), Tree('PERSON', [('Obama', 'NNP')]), (':', ':'), ('U.S', 'JJ'), ('stand', 'NN'), ('with', 'IN'), Tree('GPE', [('Ukrainian', 'JJ')]), ('people', 'NNS'), ('on', 'IN'), ('sovereignty', 'NN'), ('and', 'CC'), ("'will", 'NNP'), ('be', 'VB'), ('forced', 'VBN'), ('to', 'TO'), ('apply', 'RB'), ('costs', 'VBZ'), ("'", "''"), ('if', 'IN'), Tree('GPE', [('Russia', 'NNP')]), ('continues.', 'NNP'), ('Via', 'NNP'), ('@', 'NNP'), Tree('ORGANIZATION', [('AP', 'NNP')])])

You can then traverse the parsed tree and tag any people or places:

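for tree in trees.subtrees():
    etype = None
    # tree.label() is the NLTK 3 accessor; older releases used tree.node
    if tree.label() == "PERSON":
        etype = "PERSON"
    elif tree.label() == "GPE":
        etype = "PLACE"
    if etype is not None:
        # join the entity's tokens back into the text as it appears in the tweet
        ne = " ".join([leaf[0] for leaf in tree.leaves()])
        tweet = tweet.replace(ne, "<" + etype + ">" + ne + "</" + etype + ">")
print(tweet)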
"BREAKING: President <PERSON>Obama</PERSON>: U.S. stand with
<PLACE>Ukrainian</PLACE> people on sovereignty and 'will be forced to apply costs'
if <PLACE>Russia</PLACE> continues. Via @AP"

I wrapped all of this code in a function called tag_tweet.
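
For reference, here is a rough sketch of what such a function could look like. It is not the exact code behind the app. The helper names extract_entities and mark_entities are made up for this example, it assumes NLTK 3 (tree.label()) with the tokenizer, tagger, and chunker models already downloaded, and it swaps the repeated str.replace calls for a single regex pass so that a name appearing twice in a tweet is not wrapped twice.

```python
import re

# Map NLTK chunk labels to the tag names used in this post.
LABEL_TO_TAG = {"PERSON": "PERSON", "GPE": "PLACE"}


def extract_entities(tree):
    """Collect (entity_text, tag) pairs from an nltk.ne_chunk tree (NLTK 3 API)."""
    entities = []
    for subtree in tree.subtrees():
        tag = LABEL_TO_TAG.get(subtree.label())
        if tag is not None:
            name = " ".join(word for word, _pos in subtree.leaves())
            entities.append((name, tag))
    return entities


def mark_entities(text, entities):
    """Wrap every occurrence of each entity in <TAG>...</TAG>, in one pass."""
    tags = dict(entities)  # drops duplicate names
    if not tags:
        return text
    # Try longer names first so "New York City" is preferred over "New York".
    names = sorted(tags, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        name = match.group(0)
        return "<{0}>{1}</{0}>".format(tags[name], name)

    return pattern.sub(wrap, text)


def tag_tweet(tweet):
    """Tokenize, POS-tag, and chunk a tweet, then mark up people and places."""
    import nltk  # imported here so mark_entities can be used without NLTK

    tokens = nltk.word_tokenize(tweet)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    return mark_entities(tweet, extract_entities(tree))
```

Keeping the markup step separate from the NLTK calls also makes it easy to check the tagging logic with hand-written entity lists.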

Yhat

Wrapping everything in Yhat is easy. Just define how you'll handle incoming data and invoke the tag_tweet function we wrote earlier.

This will automatically deploy as a REST and streaming API.

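from yhat import YhatModel, Yhat, preprocess

class Tagger(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, raw):
        # raw is the tweet as a dict; the text lives under the 'text' key
        tweet = raw['text']
        tagged = tag_tweet(tweet)
        # attach the marked-up text and hand the whole tweet back
        raw['tagged'] = tagged
        return raw

# replace the username and API key with your own credentials
yh = Yhat("greg", "myapikey", "http://cloud.yhathq.com/")
yh.deploy("NamedEntityTagger", Tagger, globals())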
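
If you want to sanity-check the request handling before deploying, something like the following sketch works without Yhat or nltk installed. It only mirrors the body of execute as a plain function. tag_tweet_stub is a hypothetical stand-in for the real tag_tweet, and the sample message is invented for illustration.

```python
import json


def tag_tweet_stub(text):
    """Hypothetical stand-in for tag_tweet: tags one hard-coded name."""
    return text.replace("Obama", "<PERSON>Obama</PERSON>")


def execute(raw, tagger=tag_tweet_stub):
    """Mirror Tagger.execute: read raw['text'], store the result in raw['tagged']."""
    tweet = raw['text']
    raw['tagged'] = tagger(tweet)
    return raw


# A tweet arrives as JSON; only its "text" field is read, the rest passes through.
message = json.dumps({"id": 1, "text": "Obama speaks on Ukraine"})
result = execute(json.loads(message))
print(json.dumps(result, sort_keys=True))
```

The point of the design is that the model returns the original tweet with one extra field, so the browser gets everything it needs in a single message.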

The App

You can check it out here, http://twitter-tagger.yhathq.com/, or you can find all the code on GitHub.