Detecting the right Apples and Oranges
Social media text analysis with scikit-learn and NLTK
Disambiguation is hard
- apple, orange
- homeland, lost, defiance
- elite, valve
- cold, stuffy
Why?
Current
- 400 million tweets per day
- existing tools not trained on social media (they expect long texts, proper grammar, punctuation)
- don't take links into account
- rule building often “by hand”
Can we build an auto-updating ... disambiguator?
Data
Scikit-learn example
- is each tweet containing the text "apple" about Apple (computers) or not - 1 or 0
- feature matrix: rows = tweets, columns = unigram features (see the sketch after this list)
- very sparse - the classifier has to learn from a vast array of 0s to spot the few 1s
- Gold standard - 2014 tweets classified by hand (5 hours!)
- 2/3 is-brand, 1/3 not-brand (684)
- test/train, 584 each
- validation set, 100 of each
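A minimal sketch of the data-preparation step above. The file name, column layout and use of NLTK's TweetTokenizer are assumptions for illustration, not the project's actual code; only the task shape (binary labels, sparse unigram matrix, hand-labelled gold standard, train/test/validation split) comes from the notes.

```python
import csv

from nltk.tokenize import TweetTokenizer            # copes with @mentions, #hashtags, URLs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical gold-standard file: one hand-labelled tweet per row, two columns,
# label 1 = "apple the brand", 0 = "not the brand".
tweets, labels = [], []
with open("gold_standard_tweets.csv", newline="", encoding="utf-8") as fh:
    for text, label in csv.reader(fh):
        tweets.append(text)
        labels.append(int(label))

# Unigram bag-of-words features: one row per tweet, one column per unigram.
# The resulting matrix is very sparse - mostly 0s with a few 1s per row.
tokenizer = TweetTokenizer(preserve_case=False)
vectorizer = CountVectorizer(tokenizer=tokenizer.tokenize, lowercase=False)
X = vectorizer.fit_transform(tweets)

# Hold out a validation set, then split the rest into train/test halves
# (the notes use ~584 tweets each for train/test and 100 of each class for
# validation; the proportions below are placeholders).
X_rest, X_val, y_rest, y_val = train_test_split(
    X, labels, test_size=200, stratify=labels, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```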
Results
- first pass: LogisticRegression - fit to the training data, score the test examples (see the sketch after this list)
- tested against OpenCalais: 92.5% precision (2 wrong), 25% recall (true positive rate)
- new tool: 100% precision, 51% recall
- not generalised - still working on it
- all code on GitHub
- blogged about it
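A minimal sketch of the "first pass" classifier, continuing from the split in the previous sketch. LogisticRegression and the precision/recall comparison are from the notes; the variable names carry over from the sketch above and are assumptions, not the project's actual code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Fit to the training tweets, then report mean accuracy on the test tweets.
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Precision/recall on the validation set - the same two numbers the notes
# report for OpenCalais (92.5% / 25%) and for the new tool (100% / 51%).
y_pred = clf.predict(X_val)
print("precision:", precision_score(y_val, y_pred))
print("recall:   ", recall_score(y_val, y_pred))
```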
Future
- add link following
- add temporal factors - e.g. tweets during an Apple keynote
- hashtags ...
- NLP meet in London
- bootstrap to larger data
- ...