Ship Data Science
@IanOzsvald
ian.osvald@modelinsight.io
https://speakerdeck.com/ianozsvald/ship-data-science-products-pyconuk-2015
Company “elevate” - match CVs to jobs
Extract data
- tools: sql, elasticsearch, numpy, pandas, ...
 
Visualising data
- most data is boring
 
- requires human curation + detective skills to get good stuff
 
- prob needs engineer/researcher and business
 
- tools - bokeh ...
 
Extracting data from binary files
- PDF/PNG data
 
- tools - textract, Apache Tika, Sovren
 
- this might take months!
 
Augmenting data
- identify people, sentiment
 
- tools - opendata, datasift, NLTK, alchemyapi
 
Machine learning
- tools: PyMC, scikit-learn, statsmodels, sparkit-learn
 
- deep learning: theano, caffe
 
- text: spaCy, NLTK
 
Delivery - Keep it simple (stupid)
- we’re prob not publishing the best result
 
- debuggability is key
 
- “cult of the imperfect” Watson-Watt
 
- dumb models + clean data beat other combinations
 
Don’t Kill It!
- your data is missing, poor, it lies
- missing data
 
- log everything - eg segment.io
 
- make data quality tools & reports
 
- Note! More data -> desynchronisation - BBAAADDD!
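
A data quality report can start very small. The sketch below is not from the talk: it uses only the standard library, the column names and sample rows are made up for illustration, and `missing_report` is a hypothetical helper. It just counts how often each field is empty.

```python
import csv
import io


def missing_report(rows, fields):
    """Return {field: fraction of rows where the field is empty or absent}."""
    rows = list(rows)
    total = len(rows)
    report = {}
    for field in fields:
        # Treat None, "" and whitespace-only values as missing.
        missing = sum(1 for row in rows if not (row.get(field) or "").strip())
        report[field] = missing / total if total else 0.0
    return report


# Hypothetical sample data standing in for a real export.
SAMPLE_CSV = "name,email,postcode\nAda,ada@example.com,\nBob,,N1 9GU\n,,\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
report = missing_report(reader, ["name", "email", "postcode"])
for field, fraction in sorted(report.items()):
    print(f"{field}: {fraction:.0%} missing")
```

Running something like this on every new data drop, and logging the result, is one way to notice when a feed quietly goes bad.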
 
 
- R&D != Engineering
- discovery based, iterative, learn from both success and failure
 
 
Internal deployment
- CSVs, reports
 
- database updates
 
- IPython Notebook (though note, not secure)
 
- Bokeh
 
Deploying live systems
- Spyre (locked down cf. IPython Notebook)
 
- Microservices - Docker, Amazon ECS, Flask + Swagger
 
Deploying python
- make python modules (setup.py)
 
- book: data science and visualisation in python
 
Common gotchas
- MySQL utf8 is at most 3 bytes per character (use utf8mb4 for full Unicode)
 
- JS months are 0 based
 
- date times - use ISO 8601
 
- iOS epoch is 2001
 
- Windows Excel converts to CP1252
 
- MongoDB (pymongo): no_cursor_timeout=True
 
- Github 100MB file limit
 
- ...
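
A few of the gotchas above can be shown in plain Python. This is an illustrative sketch rather than code from the talk: the helper names (`apple_to_datetime`, `js_month_to_python`) and the sample values are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Date times: ISO 8601 text round-trips cleanly and carries its UTC offset.
moment = datetime(2015, 9, 18, 14, 30, tzinfo=timezone.utc)
iso_text = moment.isoformat()
round_tripped = datetime.fromisoformat(iso_text)

# iOS epoch: Apple timestamps count seconds from 2001-01-01 UTC, not 1970.
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def apple_to_datetime(seconds):
    """Convert seconds since the Apple reference date to an aware datetime."""
    return APPLE_EPOCH + timedelta(seconds=seconds)


# JS months are 0-based (0 = January); Python months are 1-based.
def js_month_to_python(js_month):
    """Convert a JavaScript month index (0-11) to a Python month (1-12)."""
    return js_month + 1


# Windows Excel: bytes saved as CP1252 are not valid UTF-8, so decode explicitly.
excel_bytes = "café".encode("cp1252")
decoded = excel_bytes.decode("cp1252")

# MySQL utf8: characters outside the Basic Multilingual Plane need 4 bytes
# in UTF-8, which is more than the 3 bytes MySQL's utf8 charset allows.
emoji = "\U0001F600"
emoji_byte_length = len(emoji.encode("utf-8"))

print(iso_text)
print(apple_to_datetime(0).isoformat())
print(js_month_to_python(8))
print(decoded, emoji_byte_length)
```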
 
Do you really have big data?
- amazon has 32 core servers with 244GB RAM
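
A back-of-envelope check makes the point concrete. The sketch below is an assumption-laden estimate, not a measurement: it assumes a dense numeric table at 8 bytes per value (float64), treats 244GB as 244e9 bytes, and keeps half the RAM free as working headroom. `fits_in_ram` is a hypothetical helper.

```python
def fits_in_ram(n_rows, n_cols, bytes_per_value=8, ram_gb=244, headroom=0.5):
    """Estimate whether a dense numeric table fits in memory.

    headroom is the fraction of RAM we allow the raw data to occupy,
    leaving the rest for copies and intermediate results (an assumption).
    """
    needed_bytes = n_rows * n_cols * bytes_per_value
    budget_bytes = ram_gb * 1e9 * headroom
    return needed_bytes, needed_bytes <= budget_bytes


# One billion rows of ten float64 columns is about 80 GB of raw values.
needed, ok = fits_in_ram(1_000_000_000, 10)
print(f"{needed / 1e9:.0f} GB needed, fits on one server: {ok}")
```

Real in-memory size depends on dtypes, strings and library overhead, so treat the number as a rough lower bound.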
 
PyDataLondon meetup