Detecting the right Apples and Oranges

Social media text analysis with scikit-learn and NLTK

Disambiguation is hard

  • apple, orange
  • homeland, lost, defiance
  • elite, valve
  • cold, stuffy

Why?

Current approaches

  • 400 million tweets per day
  • existing tools are not trained on social media (they expect long texts, proper grammar, punctuation)
  • they don’t take links into account
  • rules are often built “by hand”

Can we build an auto-updating ... disambiguator?

Data

Scikit-learn example

  • binary task: for tweets containing the text “apple”, label 1 if it means Apple (the computer company), 0 if not
  • feature matrix: one row per tweet, one column per unigram feature (see the sketch after this list)
  • very sparse - the model has to learn from a vast array of zero entries to spot the few ones
  • Gold standard - 2014 tweets classified by hand (5 hours!)
  • 2/3 is-brand, 1/3 not-brand (684)
  • test/train, 584 each
  • validation set, 100 of each
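
A minimal sketch of this setup, assuming the hand-labelled tweets arrive as plain Python lists (the toy tweets and labels below are made up for illustration, not the real gold standard):

    # Rough sketch of the feature-matrix setup described above.
    # The toy tweets/labels stand in for the hand-classified gold standard:
    # label 1 = Apple the brand, 0 = not the brand.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    tweets = [
        "just bought the new apple macbook, loving it",      # brand
        "apple announces a new iphone at the keynote",        # brand
        "apple crumble recipe for a cold autumn evening",     # not brand
        "picked a basket of apple and pears at the farm",     # not brand
    ]
    labels = [1, 1, 0, 0]

    # Unigram bag-of-words features: a sparse matrix with one row per
    # tweet and one column per unigram seen in the corpus.
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(tweets)

    # Split into train and test sets (the talk used 584 tweets in each).
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.5, random_state=0, stratify=labels)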

Results

  • first pass: LogisticRegression, fit to the training data, score the test examples (see the sketch below)
  • tested against OpenCalais: 92.5% precision (2 wrong), 25% recall
  • the new tool: 100% precision, 51% recall
  • not yet generalised - still working on it
  • all on GitHub
  • blogged about it
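
Continuing the sketch above, the first pass amounts to fitting a LogisticRegression model and checking precision and recall on the held-out tweets (the metric calls below are illustrative, not the talk's exact evaluation code):

    # Fit the first-pass classifier and score it on the held-out set.
    # X_train, X_test, y_train, y_test come from the split sketched earlier.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print("accuracy :", clf.score(X_test, y_test))
    print("precision:", precision_score(y_test, y_pred))  # correct among predicted brand tweets
    print("recall   :", recall_score(y_test, y_pred))     # brand tweets actually found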

Future

  • add link following (rough sketch after this list)
  • add temporal factors, e.g. tweets during an Apple keynote
  • hashtags ...
  • NLP meetup in London
  • bootstrap to a larger data set
  • ...
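
A hypothetical sketch of what link following could look like: resolve each URL in a tweet and add the target page's title words as extra features before vectorising (requests and BeautifulSoup are my assumptions here, not the talk's toolchain):

    # Hypothetical link-following helper: fetch each link in a tweet and
    # return the target page's title words, which could be appended to the
    # tweet text before it goes through the vectorizer.
    import re
    import requests
    from bs4 import BeautifulSoup

    URL_RE = re.compile(r"https?://\S+")

    def link_title_tokens(tweet_text):
        tokens = []
        for url in URL_RE.findall(tweet_text):
            try:
                # requests follows t.co-style redirects by default
                resp = requests.get(url, timeout=5)
                title = BeautifulSoup(resp.text, "html.parser").title
                if title and title.string:
                    tokens.extend(title.string.lower().split())
            except requests.RequestException:
                pass  # dead or slow link: just skip it
        return tokens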
