Detecting the right Apples and Oranges
Social media text analysis with scikit-learn and NLTK
Disambiguation is hard
- apple, orange
 
- homeland, lost, defiance
 
- elite, valve
 
- cold, stuffy
 
 
Why?
Current tools
- 400 million tweets per day
 
- not trained on social media - they expect long texts, proper grammar and punctuation
 
- links in tweets aren't taken into account
 
- rule building often “by hand”
 
Can we build an auto-updating ... disambiguator?
 
Data
Scikit-learn example
- label tweets containing the text "apple": 1 if it means Apple (the computer company), 0 if not
 
- build a matrix: rows = tweet index, columns = unigram features (see the sketch after this list)
 
- very sparse - the classifier has to learn from a vast array of zeros to spot the few 1s
 
- Gold standard - 2014 tweets classified by hand (5 hours!)
 
- 2/3 is-brand, 1/3 not-brand (684)
 
- test/train, 584 each
 
- validation set, 100 of each
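
A minimal sketch of how the unigram matrix and the splits above might be built with scikit-learn; the CSV file name, column names and variable names are assumptions for illustration, not the project's actual code.

# Sketch: build a sparse unigram matrix for "apple" tweets and split it.
# Assumes a hand-labelled CSV with columns "tweet" and "is_brand" (1/0);
# the file layout here is illustrative only.
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tweets, labels = [], []
with open("apple_tweets_gold.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        tweets.append(row["tweet"])
        labels.append(int(row["is_brand"]))  # 1 = Apple the brand, 0 = not

# Rows = tweets, columns = unigram counts; the result is a very sparse matrix.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(tweets)

# Hold out a validation set, then split the remainder into train/test halves.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, labels, test_size=200, stratify=labels, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)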
 
 
Results
- first pass - LogisticRegression - fit to the training data, score the test examples (see the sketch after this list)
 
- tested against OpenCalais: 92.5% precision (2 wrong), 25% recall
 
- new tool: 100% precision, 51% recall
 
- not generalised - still working on it
 
- all the code is on GitHub
 
- blogged about it
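
A minimal sketch of the first-pass classifier and the precision/recall check, continuing the variables from the earlier sketch; LogisticRegression defaults are assumed, not the talk's exact settings.

# Sketch: fit logistic regression on the unigram matrix and report
# precision and recall on the held-out test examples.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

clf = LogisticRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
# Precision: of the tweets flagged as brand, how many really are.
print("precision:", precision_score(y_test, predictions))
# Recall: of the real brand tweets, how many were flagged.
print("recall:   ", recall_score(y_test, predictions))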
 
 
Future
- add link following (see the sketch at the end of this list)
 
- add temporal factors - e.g. tweets sent during an Apple keynote
 
- hashtags ...
 
- NLP meet in London
 
- bootstrap to a larger data set
 
- ...
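
One way the link-following and hashtag ideas could be turned into extra features; this is a hypothetical helper for illustration, not the project's implementation.

# Sketch: resolve shortened URLs so the target domain becomes a feature,
# and keep hashtags as explicit tokens. Illustrative only.
import re
from urllib.parse import urlparse

import requests


def extra_tokens(tweet_text):
    tokens = []
    # Hashtags as features, e.g. "#WWDC" -> "hashtag_wwdc".
    tokens += ["hashtag_" + tag.lower() for tag in re.findall(r"#(\w+)", tweet_text)]
    # Follow each link's redirects and use the final domain as a feature.
    for url in re.findall(r"https?://\S+", tweet_text):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=5)
            tokens.append("domain_" + urlparse(resp.url).netloc.lower())
        except requests.RequestException:
            pass  # dead or slow link: just skip it
    return tokens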