Ship Data Science¶

@IanOzsvald ian.osvald@modelinsight.io https://speakerdeck.com/ianozsvald/ship-data-science-products-pyconuk-2015

Company “elevate” - match CVs to jobs

Extract data

tools: sql, elasticsearch, numpy, pandas, ...

Visualising data

most data is boring
requires human curation + detective skills to get good stuff
prob needs engineer/researcher and business
tools - bokeh ...

Extracting data from binary files

PDF/PNG data
tools - textract, Apache Tika, Sovren
this might take months!

Augmenting data

identify people, sentiment
tools - opendata, datasift, NLTK, alchemyapi

machine learning

tools: PyMC, scikit learn, statsmodels, sparkit-learn
deep learning: theano, caffe
text: spaCy, NTLK

Delivery - Keep it simple (stupid)

we’re prob not publishing the best result
debuggability is key
“cult of the imperfect” Watson-Watt
dumb models + clean data beat other combinations

Don’t Kill It!

your data is missing, poor, it lies
- missing data
- log everything - eg segment.io
- make dat quality tools & reports
- Note! More data -> desynchronisation - BBAAADDD!
R&D != Engineering
- discovery based, iterative, learn from both success and failure

Internal deployment

CSVs, reports
database updates
IPython Notebook (though note, not secure)
Bokeh

Deploying live systems

Spyre (locked down cf. ipython notebook)
Microservices - Docker, Amazon ECS, Flask + Swagger

Deploying python

make python modules (setup.py)

python setup.py develop

book: data science and visualisation in python

Common gotchas

mysql utf8 is 3 bytes
JS months are 0 based
date times - use ISO 8601
iOS epoch is 2001
Windows Excel convert to CP1252
MongoDB no_timeout_cursor = True
Github 100MB file limit
...

Do you really have big data?

amazon has 32 core servers with 244GB RAM

PyDataLondon meetup

Previous topic

Simplicity is a Feature

Next topic

Testing Django CMS

This Page