Lessons Learned
Turn hard problems into easy ones
ML in practice requires carefully formulating research problems
...and being creative about bootstrapping training data
Lessons Learned
Many ways to capture dependencies
Training data and features > models
Lessons Learned
A model is not a product
Nobody cares about your ideas
Flightcaster
Predicting the real-time state of the global air traffic network
The Prediction Problem
Flight F departing at time T
Likelihood that F departs at T, T+n1, T+n2
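One way to read this slide is as a classification problem over delay buckets. The sketch below assumes that reading; the bucket edges of 15 and 60 minutes are hypothetical stand-ins for n1 and n2, and the empirical shares stand in for a model's predicted distribution.

```python
from bisect import bisect_right

# Hypothetical bucket edges in minutes; the slides only name them n1 and n2.
N1, N2 = 15, 60

def delay_bucket(delay_minutes, edges=(N1, N2)):
    """Map a delay to a class index: 0 = under n1, 1 = n1 up to n2, 2 = n2 or more."""
    return bisect_right(edges, delay_minutes)

def bucket_likelihoods(observed_delays, edges=(N1, N2)):
    """Empirical share of flights per bucket (a stand-in for model output)."""
    counts = [0] * (len(edges) + 1)
    for delay in observed_delays:
        counts[delay_bucket(delay, edges)] += 1
    total = len(observed_delays)
    return [count / total for count in counts]
```

Framing the target this way turns "when will it leave?" into a small multi-class problem, which is easier to train and evaluate than a free-form departure time.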
Featurizing
Carrier, FAA, weather data
Nightly reset is a natural cadence for feature vectors
Every aircraft has a unique tail #
Fuzzy k-way join on tail #, time, location
Isolate incorrect joins by keeping feature vectors independent
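A minimal sketch of the fuzzy join idea, reduced to two sources for brevity (the slide describes a k-way join). The field names `tail`, `airport`, and `time` and the 30-minute tolerance are assumptions, not details from the talk.

```python
from datetime import datetime, timedelta

def fuzzy_join(carrier_rows, faa_rows, max_gap=timedelta(minutes=30)):
    """Pair each carrier row with the nearest-in-time FAA row for the same tail and airport."""
    by_key = {}
    for row in faa_rows:
        by_key.setdefault((row["tail"], row["airport"]), []).append(row)

    joined = []
    for c in carrier_rows:
        candidates = by_key.get((c["tail"], c["airport"]), [])
        best = min(candidates, key=lambda f: abs(f["time"] - c["time"]), default=None)
        if best is not None and abs(best["time"] - c["time"]) > max_gap:
            best = None  # too far apart in time to trust the match
        # Each source keeps its own slot, so a bad match only affects the FAA part.
        joined.append({"carrier": c, "faa": best})
    return joined
```

Keeping each source under its own key means a wrong match can be spotted and dropped without discarding the carrier features.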
positions in past - already delayed at prediction time?
weather and status - FAA groundings at airports on path?
featurizing time - how delayed and how many mins from departure?
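The time-related questions above could be turned into features roughly as follows. This is a sketch; the function name and the feature names are hypothetical.

```python
from datetime import datetime

def time_features(scheduled_departure, prediction_time, current_delay_minutes):
    """Describe where a flight stands at the moment a prediction is made."""
    minutes_to_departure = (scheduled_departure - prediction_time).total_seconds() / 60.0
    return {
        "minutes_to_departure": minutes_to_departure,    # how far out the prediction is
        "already_delayed": current_delay_minutes > 0,    # delayed at prediction time?
        "current_delay_minutes": current_delay_minutes,  # how delayed so far
    }
```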
Models
Trees could pick up dependencies that a linear model couldn’t, but the performance gain became trivially incremental once we added more sophisticated ways of featurizing dependencies
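A toy illustration of the point, not taken from the talk: a linear model cannot fit an XOR-style dependency on the raw inputs, but it can once a cross feature is added. The perceptron and the data here are assumptions chosen only because they are small enough to run anywhere.

```python
def train_perceptron(rows, labels, epochs=1000):
    """Plain perceptron; returns weights with the bias as the last entry."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            xb = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, rows, labels):
    """Fraction of rows whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(rows, labels):
        score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
        correct += (1 if score > 0 else -1) == y
    return correct / len(rows)

raw = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]                      # XOR of the two inputs
crossed = [(a, b, a * b) for a, b in raw]    # same rows plus a cross feature

raw_acc = accuracy(train_perceptron(raw, labels), raw, labels)
crossed_acc = accuracy(train_perceptron(crossed, labels), crossed, labels)
```

With the cross feature the data becomes linearly separable, so the simple model reaches the accuracy that would otherwise require a tree.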
Tools and Deployment
Clojure on Hadoop for featurizing and model training
Wrap complexity in a simple API
FP is awesome for data pipelines
Write models to JSON
Product team used Rails
Read JSON and make predictions
Predictions stored in production DB for eval
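The hand-off described above can be sketched as a pair of functions: one side exports the model as JSON, the other loads it and scores a feature vector. The talk's consumer was a Rails app; Python is used here only for illustration, and the logistic form and the JSON layout are assumptions.

```python
import json
import math

def export_model(weights, bias):
    """Serialize a logistic model (feature name -> weight) to a JSON string."""
    return json.dumps({"weights": weights, "bias": bias}, sort_keys=True)

def predict(model_json, features):
    """Load the JSON model and return a probability for one feature dict."""
    model = json.loads(model_json)
    z = model["bias"] + sum(
        weight * features.get(name, 0.0) for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Shipping a file like this is simple, but the scoring logic then has to be reimplemented and kept in sync on the consumer side, which is one reason a service boundary can be preferable.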
Pain Points
Log-based debugging paradigm sucks
Don’t want to catch ETL and feature engineering issues in a Hadoop setting
At the same time, cannot catch them at tiny scale because that needs real data at material scale
Dirty data -- manual entry
Early days of Clojure / Hadoop
Deploying JSON models rather than services
Lessons learned
Model selection mattered less than featurizing
Many ways to capture dependencies
Intuitions of domain experts are useful but also often misleading
Use domain experts to identify data sources
Then build good tools and take a scientific approach to exploring the feature space
Computational graph with HOFs in order to log structured data
Inspired fast debugging with plumbing.graph at Prismatic
Isolate issues: single thread, multi thread, multi process and multi machine
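A minimal sketch of the idea: declare the pipeline as a graph of named functions, then wrap every node with a higher-order function that records structured data about each call. This is not the plumbing.graph API (that is a Clojure library); the names `logged` and `run_graph` and the record layout are hypothetical, and cycles are not detected.

```python
import time

def logged(name, fn, log):
    """Higher-order wrapper: run fn and append a structured record to log."""
    def wrapper(**inputs):
        start = time.perf_counter()
        output = fn(**inputs)
        log.append({
            "node": name,
            "inputs": sorted(inputs),
            "output": repr(output),
            "seconds": time.perf_counter() - start,
        })
        return output
    return wrapper

def run_graph(graph, inputs, log):
    """graph maps node name -> (function, dependency names); dependencies run first."""
    values = dict(inputs)

    def resolve(name):
        if name not in values:
            fn, deps = graph[name]
            values[name] = logged(name, fn, log)(**{d: resolve(d) for d in deps})
        return values[name]

    for name in graph:
        resolve(name)
    return values
```

Because the graph is plain data, the same declaration can be run in a single thread first and moved to more complex execution only after it is known to be correct.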
Production was OK, not great
Better to put ML behind services
Full stack product team calls backend team APIs
A model is not a product
Humans don’t understand probability distributions
Even if discretized or turned into classification
Solve a human need directly -- turn into recommendations, etc
Prismatic
Personalized Ranking of People, Topics, and Content
The Personalized Ranking Problem
Given an index of content, display the content that maximizes the likelihood of user engagement
Intention: max LT engagement
Proxy: max session interactions
Content
Focused crawling of Twitter, FB and web
Maximum coverage algorithms
Spam content and de-duping
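The slide does not say which maximum coverage algorithm was used. The sketch below shows the standard greedy approximation, under the assumption that each crawl source is modeled as the set of items it would contribute.

```python
def greedy_max_coverage(sources, budget):
    """Pick up to `budget` sources, each time taking the one that adds the most new items."""
    covered, chosen = set(), []
    remaining = dict(sources)
    for _ in range(budget):
        best = max(remaining, key=lambda s: len(remaining[s] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered
```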
Featurizing
Content and interaction features
Feature crosses and hacks for dependencies
Bootstrapping weight hacks -- can’t train on overly sparse interactions
Scores for interests (topics, people, publishers, …)
Models: Personalized Ranking
Logistic -- newsfeed ranking has to be ultra fast in prod (100ms)
Learning to rank -- inversions
Universal features, user-specific weight vectors
Snapshot every session
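A sketch of how these pieces could fit together, assuming a shared feature space, one weight vector per user, and inversions counted as skipped items ranked above engaged ones. The feature names and the exact inversion definition are assumptions, not details from the talk.

```python
import math

def score(user_weights, item_features):
    """Logistic score: universal feature names, user-specific weights."""
    z = sum(user_weights.get(name, 0.0) * value for name, value in item_features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank(user_weights, items):
    """Return item ids sorted by descending score for this user."""
    return sorted(items, key=lambda item_id: score(user_weights, items[item_id]), reverse=True)

def count_inversions(ranking, engaged):
    """Count pairs where a skipped item was ranked above an engaged one."""
    inversions, skipped_above = 0, 0
    for item_id in ranking:
        if item_id in engaged:
            inversions += skipped_above
        else:
            skipped_above += 1
    return inversions
```

Scoring is a single dot product per item, which is what keeps a model like this inside a tight latency budget.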
Models: Classification
How do you train a large set of topic classifiers?
Latent topic models don’t work
But how would we get labeled data to train a classifier for each topic?
Enter distant supervision
Create mechanism to bootstrap training data with noisy labels
Requires lots of heuristics and clever hacks
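A minimal sketch of distant supervision, assuming one particular heuristic: a document inherits a topic label from the publisher it came from. The talk does not say which heuristics were used, so the publisher-to-topic mapping and the field names here are hypothetical.

```python
def distant_labels(documents, topic_sources):
    """Assign noisy topic labels from a document's publisher instead of human annotation."""
    labeled = []
    for doc in documents:
        topics = {
            topic for topic, publishers in topic_sources.items()
            if doc["publisher"] in publishers
        }
        if len(topics) == 1:  # skip ambiguous or unmatched documents to limit label noise
            labeled.append((doc["text"], topics.pop()))
    return labeled
```

The labels are noisy by construction, so the resulting classifiers still need a held-out check against a small hand-labeled sample.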