Detecting the right Apples and Oranges
Social media text analysis with scikit-learn and NLTK
Disambiguation is hard
- apple, orange
 
- homeland, lost, defiance
 
- elite, valve
 
- cold, stuffy
 
 
Why?
Current tools
- 400 million tweets per day
 
- not trained on social media - they expect long texts, proper grammar and punctuation
 
- links in tweets aren't taken into account
 
- rule building often “by hand”
 
Can we build an auto-updating ... disambiguator?
 
Data
Scikit-learn example
- label tweets containing the text "apple": 1 if it means Apple (the computer company), 0 if not
 
- build a matrix: rows = tweet index, columns = unigram features (see the sketch after this list)
 
- very sparse - the classifier has to learn from a vast array of zeros to spot the few 1s
 
- Gold standard - 2014 tweets classified by hand (5 hours!)
 
- 2/3 is-brand, 1/3 not-brand (684)
 
- test/train, 584 each
 
- validation set, 100 of each
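
A minimal sketch of how the unigram matrix and the splits above might be built with scikit-learn; the CSV file name, column names and variable names are assumptions for illustration, not the project's actual code.

# Sketch: build a sparse unigram matrix for "apple" tweets and split it.
# Assumes a hand-labelled CSV with columns "tweet" and "is_brand" (1/0);
# the file layout here is illustrative only.
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tweets, labels = [], []
with open("apple_tweets_gold.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        tweets.append(row["tweet"])
        labels.append(int(row["is_brand"]))  # 1 = Apple the brand, 0 = not

# Rows = tweets, columns = unigram counts; the result is a very sparse matrix.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(tweets)

# Hold out a validation set, then split the remainder into train/test halves.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, labels, test_size=200, stratify=labels, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)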
 
 
Results
- first pass - LogisticRegression - fit to the training data, score the test examples (see the sketch after this list)
 
- tested against OpenCalais: 92.5% precision (2 wrong), 25% recall
 
- new tool: 100% precision, 51% recall
 
- not generalised - still working on it
 
- all the code is on GitHub
 
- blogged about it
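
A minimal sketch of the first-pass classifier and the precision/recall check, continuing the variables from the earlier sketch; LogisticRegression defaults are assumed, not the talk's exact settings.

# Sketch: fit logistic regression on the unigram matrix and report
# precision and recall on the held-out test examples.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

clf = LogisticRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
# Precision: of the tweets flagged as brand, how many really are.
print("precision:", precision_score(y_test, predictions))
# Recall: of the real brand tweets, how many were flagged.
print("recall:   ", recall_score(y_test, predictions))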
 
 
Future
- add link following (see the sketch at the end of this list)
 
- add temporal factors - e.g. tweets sent during an Apple keynote
 
- hashtags ...
 
- NLP meet in London
 
- bootstrap to a larger data set
 
- ...
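
One way the link-following and hashtag ideas could be turned into extra features; this is a hypothetical helper for illustration, not the project's implementation.

# Sketch: resolve shortened URLs so the target domain becomes a feature,
# and keep hashtags as explicit tokens. Illustrative only.
import re
from urllib.parse import urlparse

import requests


def extra_tokens(tweet_text):
    tokens = []
    # Hashtags as features, e.g. "#WWDC" -> "hashtag_wwdc".
    tokens += ["hashtag_" + tag.lower() for tag in re.findall(r"#(\w+)", tweet_text)]
    # Follow each link's redirects and use the final domain as a feature.
    for url in re.findall(r"https?://\S+", tweet_text):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=5)
            tokens.append("domain_" + urlparse(resp.url).netloc.lower())
        except requests.RequestException:
            pass  # dead or slow link: just skip it
    return tokens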