====================================== Detecting the right Apples and Oranges ====================================== Social media text analysis with scikit-learn and NLTK * Ian Ozsvald * @IanOzsvald * applying parallel/NLP/ML in industry * morconsulting.com * showmedo.com * ianozsvald.com - blog with articles about this * annotate.io * https://github.com/ianozsvald/social_media_brand_disambiguator disambiguation is hard ====================== - apple, orange - homeland, lost, defiance - elite, valve - cold, stuffy Why? ==== current * 400 million tweets per day * not trained on social media (long texts, proper grammar, punctuation) * doesn't take into account links * rule building often "by hand" Can we build auto-updating ... disambiguator Data ==== Scikit-learn example - apple (computers) or not for tweets that have text "apple" in them - 1 or 0 - matrix y="tweet index" x="unigram features" - very sparse - have to learn from vast array of 0 texts to spot the few 1s - Gold standard - 2014 tweets classified by hand (5 hours!) - 2/3 is-brand, 1/3 not-brand (684) - test/train, 584 each - validation set, 100 of each Results ======= - first pass - LogisticRegression - fit to data, score test examples - test against OpenCalais (92.5% precision (true -ve) (2 wrong)), 25% recall (true +ve)) - new tool (100% precision, 51% recall) - not generalised - still working on it - all on github - blogged about it Future ====== * add link following * add temporal factors - tweets during apple keynote * hash tags ... * NLP meet in London * bootstrap to larger data * ...