Machine Learning Startups - Machine Learning Meetups


My Background

Lessons Learned
Turn hard problems into easy ones
ML in practice requires carefully formulating research problems
...and being creative about bootstrapping training data

Lessons Learned
Many ways to capture dependencies
Training data and features > models

Lessons Learned
A model is not a product
Nobody cares about your ideas


Predicting the real-time state of the global air traffic network

The Prediction Problem

Flight F departing at time T
Likelihood that F departs at T, T+n1, T+n2


Carrier, FAA, weather data
Nightly reset is a natural cadence for feature vecs

Every aircraft has a unique tail #

Fuzzy k-way join on tail #, time, location
Isolate incorrect joins by keeping feature vecs independent
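
A minimal sketch of the join idea above, in Python for illustration only (the original pipeline was Clojure on Hadoop). The record fields (`tail`, `airport`, `time`), the 30-minute tolerance, and the normalization rule are all assumptions; the slides do not give the actual matching rules. It joins two sources; a k-way join could repeat this against each source separately, which also keeps each source's features independent.

```python
from datetime import datetime, timedelta

def normalize_tail(tail):
    """Normalize a tail number so 'n123-ab' and 'N123AB' compare equal."""
    return "".join(ch for ch in tail.upper() if ch.isalnum())

def fuzzy_join(left, right, tolerance=timedelta(minutes=30)):
    """Pair each left record with the closest-in-time right record that has
    the same normalized tail number and airport, within the tolerance."""
    index = {}
    for r in right:
        key = (normalize_tail(r["tail"]), r["airport"])
        index.setdefault(key, []).append(r)
    joined = []
    for l in left:
        candidates = index.get((normalize_tail(l["tail"]), l["airport"]), [])
        in_window = [r for r in candidates if abs(r["time"] - l["time"]) <= tolerance]
        match = min(in_window, key=lambda r: abs(r["time"] - l["time"]), default=None)
        joined.append((l, match))  # match is None when nothing joins
    return joined

# Hypothetical records from two sources.
carrier = [{"tail": "N123AB", "airport": "SFO", "time": datetime(2013, 1, 5, 9, 0)}]
faa = [
    {"tail": "n123-ab", "airport": "SFO", "time": datetime(2013, 1, 5, 9, 10)},
    {"tail": "N123AB", "airport": "SFO", "time": datetime(2013, 1, 5, 14, 0)},
]
pairs = fuzzy_join(carrier, faa)
```

Records with no match are kept with `None` rather than dropped, so a bad join shows up as a missing value in one source's features instead of silently removing the flight.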

positions in past - already delayed at prediction time?

weather and status - FAA groundings at airports on path?

featurizing time - how delayed and how many mins from departure?
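
The time features above could be computed along these lines. This is a sketch under assumptions: the function name, the argument names, and the choice to clip negative delay to zero are hypothetical and not taken from the slides.

```python
from datetime import datetime

def time_features(scheduled_departure, estimated_departure, prediction_time):
    """Return how delayed the flight already is and how many minutes
    remain until scheduled departure, as seen at prediction time."""
    delay_min = (estimated_departure - scheduled_departure).total_seconds() / 60
    mins_to_departure = (scheduled_departure - prediction_time).total_seconds() / 60
    return {
        "delay_min": max(delay_min, 0.0),   # early departures count as zero delay
        "already_delayed": delay_min > 0,
        "mins_to_departure": mins_to_departure,
    }

features = time_features(
    scheduled_departure=datetime(2013, 1, 5, 9, 0),
    estimated_departure=datetime(2013, 1, 5, 9, 25),
    prediction_time=datetime(2013, 1, 5, 7, 30),
)
```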


Trees could pick up dependencies that a linear model couldn’t, but the gain became marginal once we added more sophisticated ways of featurizing dependencies

Tools and Deployment

Clojure on Hadoop for featurizing and model training
Wrap complexity in simple API
FP awesome for data pipelines

Write models to JSON
Product team used Rails
Read JSON and make predictions
Predictions stored in production DB for eval
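
A sketch of the JSON handoff described above. Assumptions: a logistic model, a made-up JSON layout (`intercept`, `weights`), and invented feature names. The product side in the talk was Rails; Python is used here only to show the shape of the idea.

```python
import json
import math

# Training side: write the learned parameters to JSON (values are invented).
model = {"intercept": -1.0, "weights": {"delay_min": 0.04, "faa_grounding": 2.0}}
serialized = json.dumps(model)

# Serving side: read the JSON and score one feature vector.
def predict(model_json, features):
    m = json.loads(model_json)
    z = m["intercept"] + sum(
        w * features.get(name, 0.0) for name, w in m["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

p = predict(serialized, {"delay_min": 25.0, "faa_grounding": 0.0})
```

The serving side has to re-implement the scoring function and the featurization, which is one reason the later slides prefer putting ML behind services.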

Pain Points

Log-based debugging paradigm sucks
Don’t want to catch ETL and feature engineering issues in a Hadoop setting
At the same time, can’t catch them at tiny scale because that needs real data at material scale

Dirty data -- manual entry
Early days of Clojure / Hadoop
Deploying JSON models rather than services

Lessons learned

Model selection mattered less than featurizing
Many ways to capture dependencies

Intuitions of domain experts are useful but also often misleading
Use domain experts to identify data sources
Then build good tools and take a scientific approach to exploring the feature space

Computational graph with HOF in order to log structured data
Inspired fast debugging with plumbing.graph at Prismatic
Isolate issues: single thread, multi thread, multi process and multi machine
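
A rough Python sketch of the idea: declare the computation as a graph of named functions, then wrap every node with a higher-order function that records structured data. This is not the plumbing.graph API (that is a Clojure library); the names `logged` and `run_graph` and the graph layout are hypothetical.

```python
import time

def logged(name, fn, log):
    """Higher-order function: wrap fn so every call appends a structured record."""
    def wrapper(**inputs):
        start = time.perf_counter()
        output = fn(**inputs)
        log.append({
            "node": name,
            "inputs": inputs,
            "output": output,
            "seconds": time.perf_counter() - start,
        })
        return output
    return wrapper

def run_graph(graph, inputs, log):
    """Run nodes in declaration order; each node lists the values it depends on."""
    values = dict(inputs)
    for name, (deps, fn) in graph.items():
        values[name] = logged(name, fn, log)(**{d: values[d] for d in deps})
    return values

# Hypothetical two-node feature graph (times in minutes after midnight).
graph = {
    "delay_min": (("scheduled", "estimated"),
                  lambda scheduled, estimated: estimated - scheduled),
    "is_delayed": (("delay_min",), lambda delay_min: delay_min > 15),
}
log = []
result = run_graph(graph, {"scheduled": 540, "estimated": 565}, log)
```

Because the graph is plain data, the same declaration can in principle be run in one thread first and then handed to a parallel or distributed runner, which is the isolation step the slide describes.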

Production was OK, not great
Better to put ML behind services
Full stack product team calls backend team APIs

A model is not a product
Humans don’t understand probability distributions
Even if discretized or turned into classification
Solve a human need directly -- turn into recommendations, etc


Personalized Ranking of People, Topics, and Content

The Personalized Ranking Problem

Given an index of content, display the content that maximizes the likelihood of user engagement

Intention: max LT engagement
Proxy: max session interactions


Focused crawling of Twitter, FB and web
Maximum coverage algorithms
Spam content and de-duping
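
The slides do not say which maximum coverage algorithm was used. As an illustration only, here is the standard greedy approximation: repeatedly pick the source that adds the most items not yet covered. The source names and item sets are invented.

```python
def greedy_max_coverage(sources, budget):
    """Pick up to `budget` sources, each time taking the one that
    contributes the most items not already covered."""
    covered, chosen = set(), []
    remaining = dict(sources)
    for _ in range(budget):
        best = max(remaining, key=lambda s: len(remaining[s] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new items
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered

# Hypothetical crawl sources and the items each would yield.
sources = {
    "feed_a": {"t1", "t2", "t3"},
    "feed_b": {"t3", "t4"},
    "feed_c": {"t1", "t2"},
}
chosen, covered = greedy_max_coverage(sources, budget=2)
```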


Content and interaction features
Feature crosses and hacks for dependencies
Bootstrapping weight hacks -- can’t train on overly sparse interactions
Scores for interests (topics, people, publishers, …)
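
A small sketch of feature crosses: multiplying user features with content features gives a linear model one weight per pair, which is one way to capture a dependency it could not otherwise express. Feature names and the `_x_` naming scheme are hypothetical.

```python
def cross_features(user_feats, content_feats):
    """Build one crossed feature per (user feature, content feature) pair."""
    return {
        f"{u}_x_{c}": uv * cv
        for u, uv in user_feats.items()
        for c, cv in content_feats.items()
    }

crossed = cross_features(
    {"likes_ml": 1.0, "likes_sports": 0.0},
    {"topic_ml": 1.0, "is_video": 1.0},
)
```

Here `likes_ml_x_topic_ml` fires only when an ML-interested user sees ML content, something neither feature alone can signal to a linear model.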

Models: Personalized Ranking

Logistic -- newsfeed ranking has to be ultra fast in prod (100ms)
Learning to rank -- inversions
Universal features, user-specific weight vectors
Snapshot every session
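
A sketch of the ranking setup, under assumptions: every item is described by the same universal features, each user has their own weight vector, and an "inversion" is taken to mean a pair where an item the user did not engage with was ranked above one they did. The slides do not define the term, and all names and numbers here are invented.

```python
def score(weights, features):
    """Linear score. The logistic sigmoid is monotone, so ordering by the
    linear score gives the same ranking as ordering by probability."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def count_inversions(ranked_items, engaged):
    """Count pairs where a non-engaged item sits above an engaged one."""
    inversions, non_engaged_above = 0, 0
    for item in ranked_items:
        if item in engaged:
            inversions += non_engaged_above
        else:
            non_engaged_above += 1
    return inversions

user_weights = {"topic_ml": 1.5, "is_video": -0.5}  # one vector per user
items = {
    "a": {"topic_ml": 1.0, "is_video": 0.0},
    "b": {"topic_ml": 0.0, "is_video": 1.0},
    "c": {"topic_ml": 1.0, "is_video": 1.0},
}
ranking = sorted(items, key=lambda i: score(user_weights, items[i]), reverse=True)
inversions = count_inversions(ranking, engaged={"b", "c"})
```

Scoring is a sparse dot product per item, which is what keeps a logistic ranker cheap enough for a tight latency budget.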

Models: Classification

How do you train a large set of topic classifiers?
Latent topic models don’t work
But how would we get labeled data to train a classifier for each topic?

Enter distant supervision
Create mechanism to bootstrap training data with noisy labels
Requires lots of heuristics and clever hacks
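
A toy sketch of bootstrapping noisy labels. The heuristics here (a per-topic list of trusted sources plus keyword matching) are assumptions chosen for illustration; the slides only say that many heuristics and hacks were needed. Domains and keywords are invented.

```python
def noisy_labels(doc, topic_rules):
    """Assign topic labels from cheap heuristics. Labels are noisy by design;
    the classifier trained on them is expected to generalize past the rules."""
    text = doc["text"].lower()
    labels = set()
    for topic, rule in topic_rules.items():
        from_known_source = doc.get("source") in rule["sources"]
        has_keyword = any(k in text for k in rule["keywords"])
        if from_known_source or has_keyword:
            labels.add(topic)
    return labels

# Hypothetical rules and documents.
topic_rules = {
    "machine_learning": {"sources": {"mlblog.example"},
                         "keywords": ["neural network", "classifier"]},
    "aviation": {"sources": {"flightnews.example"},
                 "keywords": ["airline", "faa"]},
}
docs = [
    {"source": "mlblog.example", "text": "Notes from the meetup"},
    {"source": "other.example", "text": "The FAA grounded several flights"},
]
training_set = [(d["text"], noisy_labels(d, topic_rules)) for d in docs]
```

The first document gets its label from its source even though its text never mentions the topic, which is the kind of example a keyword rule alone would miss.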

Snarf docs