METRICS EVERYWHERE Saturday, April 9, 2011
Make better decisions by using numbers.
Coda Hale @coda github.com/codahale
The enterprise social network.
www.yammer.com
I write code.
But that’s not actually my job.
code
business value
What the hell is business value?
A new feature.
An improved existing feature.
Fewer bugs.
Not pissing our users off with a slow site. (Overlaid on the slide: “ugly”, “pretty”.)
Making future changes easier.
Adding a unit test before fixing that bug.
Business value is anything which makes people more likely to give us money.
We want to generate more business value.
We need to make better decisions about our code.
Our code generates business value when it runs, not when we write it.
We need to know what our code does when it runs.
We can’t do this unless we measure it.
Why measure it?
map ≠ territory
map of San Francisco ≠ city of San Francisco
the way we talk ≠ the way it is
the thing we think of ≠ the thing in itself
perception ≠ reality
MIND THE GAP
We have a mental model of what our code does.
It’s a mental model. It’s not the code.
It is often wrong.
Confusion.
“This code can’t possibly work.”
(It works.)
MIND THE GAP
“This code can’t possibly fail.”
(It fails.)
MIND THE GAP
Which is faster?
  items.sort_by { |i| i.name }
  items.sort { |a, b| a.name <=> b.name }
We don’t know.
  def sort_by(&blk)
    sleep(100) # FIXME: I AM POISON
    super(&blk)
  end

  def sort(&blk)
    # TODO: make not explode
    raise Exception.new("Haw haw!")
  end
We can’t know until we measure it.
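One way to settle the question is to time both candidates. The following is a minimal sketch using Ruby's standard Benchmark module. The Item struct and the data set size are made up for illustration, and the numbers will vary with Ruby version and machine, which is rather the point.

```ruby
require 'benchmark'

# Hypothetical record type and data, for illustration only.
# 7919 is coprime with 50_000, so every generated name is unique.
Item = Struct.new(:name)
items = Array.new(50_000) { |i| Item.new("item-#{(i * 7919) % 50_000}") }

by_key = nil
by_cmp = nil

# Benchmark.realtime returns the elapsed wall-clock seconds for the block.
sort_by_secs = Benchmark.realtime { by_key = items.sort_by { |i| i.name } }
sort_secs    = Benchmark.realtime { by_cmp = items.sort { |a, b| a.name <=> b.name } }

puts format("sort_by: %.1f ms", sort_by_secs * 1000)
puts format("sort:    %.1f ms", sort_secs * 1000)
```

Both calls produce the same ordering here, so only the timings differ. A single run is noisy; repeating the measurement is the safer habit.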
This affects how we make decisions.
“Our application is slow. This page takes 500ms. Fix it.”
Find the bottleneck!
  SQL Query
  Template Rendering
  Session Storage
We don’t know.
Find The Bottleneck 2.0!
  SQL Query: 53ms
  Template Rendering: 1ms
  Session Storage: 315ms
Confusion.
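Per-phase numbers like these come from timing each phase separately. Here is a small sketch of that idea; the three lambdas are hypothetical stand-ins that only sleep, and the phase names are borrowed from the slide.

```ruby
# Hypothetical stand-ins for the three phases of a request; each just sleeps.
PHASES = {
  "SQL Query"          => -> { sleep(0.002) },
  "Template Rendering" => -> { sleep(0.001) },
  "Session Storage"    => -> { sleep(0.050) },
}

# Run a block and return its elapsed time in milliseconds (monotonic clock).
def elapsed_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
end

timings = PHASES.transform_values { |phase| elapsed_ms { phase.call } }
slowest = timings.max_by { |_name, ms| ms }.first

timings.each { |name, ms| puts format("%-20s %6.1f ms", name, ms) }
puts "Bottleneck: #{slowest}"
```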
We made a better decision.
We improve our mental model by measuring what our code does.
map ≠ territory
map → territory
We use our mental model to decide what to do.
A better mental model makes us better at deciding what to do.
A better mental model makes us better at generating business value.
Measuring makes your decisions better.
But only if we’re measuring the right thing.
We need to measure our code where it matters.
In the wild.
Generating business value.
PRODUCTION
Continuously measuring code in production.
Metrics
Java/Scala
github.com/codahale/metrics
Gauges, Counters, Meters, Histograms, Timers
Each metric is associated with a class and has a name.
An autocomplete service for city names.
  > GET /complete?q=San%20Fra
  < HTTP/1.1 200 RAD
  <
  < ["San Francisco"]
What does this code do that affects its business value?
And how can we measure that?
Gauges, Counters, Meters, Histograms, Timers
Gauge
The instantaneous value of something.
# of cities
  metrics.gauge("cities") { cities.size }
“The service has 589 cities registered.”
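Off the JVM, a gauge can be as small as a stored block that is evaluated on every read. This is a sketch, not the Metrics library's implementation; the Gauge class and the cities array are hypothetical.

```ruby
# A gauge holds a block and evaluates it each time the value is read,
# so a reading always reflects the current state.
class Gauge
  def initialize(&reader)
    @reader = reader
  end

  def value
    @reader.call
  end
end

cities = ["San Francisco", "San Jose"]   # hypothetical data set
gauge  = Gauge.new { cities.size }

puts gauge.value   # 2
cities << "San Diego"
puts gauge.value   # 3
```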
Gauges, Counters, Meters, Histograms, Timers
Counter
An incrementing and decrementing value.
# of open connections
  val counter = metrics.counter("connections")
  counter.inc()
  counter.dec()
“There are 594 active sessions on that server.”
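A counter in the same spirit might look like the sketch below. The Counter class is hypothetical; a mutex is used so that concurrent request threads do not lose updates.

```ruby
# A counter that can go up and down, guarded by a mutex so concurrent
# threads do not lose updates.
class Counter
  def initialize
    @count = 0
    @lock = Mutex.new
  end

  def inc(n = 1)
    @lock.synchronize { @count += n }
  end

  def dec(n = 1)
    @lock.synchronize { @count -= n }
  end

  def count
    @lock.synchronize { @count }
  end
end

connections = Counter.new
threads = 8.times.map do
  Thread.new { 1_000.times { connections.inc } }
end
threads.each(&:join)
connections.dec(6)
puts connections.count   # 7994
```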
Gauges, Counters, Meters, Histograms, Timers
Meter
The average rate of events over a period of time.
# of requests/sec
  val meter = metrics.meter("requests", SECONDS)
  meter.mark()
mean rate = # of events / elapsed time
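The mean rate needs only a running count and a start time. A sketch follows, with a hypothetical MeanRateMeter class whose clock is injectable so the example can be driven with fake timestamps.

```ruby
# Tracks a running count and a start time; the mean rate is
# (# of events) / (elapsed time), in events per second.
class MeanRateMeter
  def initialize(clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @start = clock.call
    @count = 0
  end

  def mark(n = 1)
    @count += n
  end

  def mean_rate
    elapsed = @clock.call - @start
    elapsed > 0 ? @count / elapsed : 0.0
  end
end

now = 0.0                         # fake clock, in seconds
meter = MeanRateMeter.new(-> { now })
3_000.times { meter.mark }

now = 10.0
early = meter.mean_rate           # 3,000 events over 10 seconds
now = 3_610.0                     # an hour of silence later
late = meter.mean_rate
puts early, late
```

The second reading is still well above zero even though nothing has happened for an hour: the mean rate averages over the whole lifetime and carries no sense of recency.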
[Chart: # of requests over time]
MIND THE GAP
Recency.
mean rate = # of events / elapsed time
COGNITIVE HAZARD
Exponentially weighted moving average.
[Formula slide, partly lost in extraction. Recoverable terms: (1-α)^k · m_{t-1} and (1-(1-α)^k) · Y_t.]
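For a concrete feel, here is a sketch of an exponentially weighted moving average of an event rate, in the style of the Unix load average. It uses the common update m_t = m_{t-1} + α(Y_t - m_{t-1}), which is not necessarily the exact form shown on the slide. The EWMA class, the 5-second tick interval, and the choice α = 1 - e^(-interval/window) are assumptions made for this example.

```ruby
# Events are counted between ticks; every `interval` seconds the caller
# invokes `tick`, which folds the latest instantaneous rate into the average.
class EWMA
  attr_reader :rate

  def initialize(window_secs, interval_secs = 5.0)
    @interval = interval_secs
    # Smoothing factor: how much weight the newest interval gets.
    @alpha = 1.0 - Math.exp(-interval_secs / window_secs)
    @uncounted = 0
    @rate = nil
  end

  def mark(n = 1)
    @uncounted += n
  end

  def tick
    instant = @uncounted / @interval
    @uncounted = 0
    # The first tick seeds the average; later ticks move it toward `instant`.
    @rate = @rate.nil? ? instant : @rate + @alpha * (instant - @rate)
  end
end

one_minute = EWMA.new(60.0)
12.times { one_minute.mark(500); one_minute.tick }   # a minute at 100 events/sec
steady = one_minute.rate
12.times { one_minute.tick }                         # then a minute of silence
puts steady, one_minute.rate
```

After the silent minute the average has decayed to roughly a third of its steady value, so recent behavior dominates the reading.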
[Chart: # of requests over time]
1-minute rate, 5-minute rate, 15-minute rate
“We went from 3,000 requests/sec to [rest of slide missing]
1,000 req/sec
× 1,000 actions/req
× 1 day
= >86 billion values
>640GB of data/day
Not gonna happen.
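The arithmetic can be checked directly. The slide does not say how large each stored value is; the sketch below assumes 8 bytes (one 64-bit number) per value, which reproduces the >640GB figure when gigabytes are counted in powers of two.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
BYTES_PER_VALUE = 8                 # assumption: one 64-bit number per value

values_per_day = 1_000 * 1_000 * SECONDS_PER_DAY
bytes_per_day  = values_per_day * BYTES_PER_VALUE
gib_per_day    = bytes_per_day / 2.0**30

puts values_per_day                 # 86400000000
puts gib_per_day.round(1)
```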
COGNITIVE HAZARD
Reservoir sampling. Keep a statistically representative sample of measurements as they happen.
Vitter’s Algorithm R.
Vitter, J. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1), 37-57.
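A sketch of Algorithm R: fill the reservoir with the first items, then let item number n replace a random slot with probability size / n. The Reservoir class name and the sizes used are illustrative.

```ruby
# Algorithm R: keep the first `size` items, then replace a random slot
# with item number n with probability size / n.
class Reservoir
  attr_reader :values

  def initialize(size, rng = Random.new)
    @size = size
    @rng = rng
    @values = []
    @seen = 0
  end

  def update(value)
    @seen += 1
    if @values.size < @size
      @values << value
    else
      slot = @rng.rand(@seen)          # uniform integer in 0...@seen
      @values[slot] = value if slot < @size
    end
  end
end

reservoir = Reservoir.new(100, Random.new(42))
(1..100_000).each { |n| reservoir.update(n) }
puts reservoir.values.size   # 100
```

Memory stays fixed at 100 values no matter how long the stream runs, and every item seen so far has the same chance of being in the sample.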
[Chart: # of cities over time]
MIND THE GAP
Vitter’s Algorithm R produces uniform samples.
Recency.
SUPER-DUPER COGNITIVE HAZARD
Forward-decaying priority sampling.
Cormode, G., Shkapenyuk, V., Srivastava, D., & Xu, B. (2009). Forward Decay: A Practical Time Decay Model for Streaming Systems. ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering.
Maintain a statistically representative sample of the last 5 minutes.
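A rough sketch of the idea, under stated assumptions: each value gets a weight that grows exponentially with its timestamp, measured forward from a fixed landmark, and a priority of weight / u for a random u in (0, 1]. Keeping the highest priorities biases the sample toward recent data. The DecayingSample class, the alpha value, and the sample size are made up for illustration, and the periodic landmark rescaling that a long-running process would need to avoid overflow is left out.

```ruby
class DecayingSample
  def initialize(size, alpha, landmark = 0.0, rng = Random.new)
    @size = size
    @alpha = alpha
    @landmark = landmark
    @rng = rng
    @entries = {}                           # priority => value
  end

  def update(value, timestamp)
    weight = Math.exp(@alpha * (timestamp - @landmark))
    priority = weight / (1.0 - @rng.rand)   # 1.0 - rand lies in (0, 1]
    if @entries.size < @size
      @entries[priority] = value
    else
      # Linear scan for the lowest priority; fine for a sketch.
      lowest = @entries.keys.min
      if priority > lowest
        @entries.delete(lowest)
        @entries[priority] = value
      end
    end
  end

  def values
    @entries.values
  end
end

sample = DecayingSample.new(50, 0.05, 0.0, Random.new(42))
(0...1_000).each { |t| sample.update(t, t) }   # value == timestamp, in seconds
mean = sample.values.sum / sample.values.size.to_f
puts mean
```

With a uniform sample the mean of these values would sit near 500; here it lands much closer to the end of the stream.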
[Charts: # of cities over time, with Uniform and Biased samples]
“95% of autocomplete results return 3 cities or less.”
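A statement like this is a percentile read off the sample. The sketch below uses the nearest-rank method, which is one of several common definitions; the percentile helper and the result counts are hypothetical.

```ruby
# Nearest-rank percentile: the smallest sampled value such that at least
# `pct` percent of the sample is less than or equal to it.
def percentile(sample, pct)
  raise ArgumentError, "empty sample" if sample.empty?
  sorted = sample.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical sample of result counts per autocomplete request.
result_counts = [1, 1, 1, 2, 1, 3, 2, 1, 1, 2, 1, 1, 3, 2, 1, 1, 2, 1, 9, 1]

puts percentile(result_counts, 95)   # 3
puts percentile(result_counts, 50)   # 1
```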
Gauges, Counters, Meters, Histograms, Timers
Timer
A histogram of durations and a meter of calls.
# of ms to respond
  val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
  timer.time { handle(req, resp) }
“At ~2,000 req/sec, our 99% latency jumps from 13ms to 453ms.”
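A timer can be sketched as a wrapper that records how long each call took and how many calls there were. The CallTimer class is hypothetical, and it keeps every duration in an array where a real histogram would keep a bounded sample.

```ruby
class CallTimer
  attr_reader :durations_ms

  def initialize
    @durations_ms = []
  end

  # Runs the block, records its duration in milliseconds even if the
  # block raises, and returns the block's result.
  def time
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @durations_ms << (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
  end

  def count
    @durations_ms.size
  end
end

timer = CallTimer.new
result = timer.time { sleep(0.01); :ok }   # stand-in for handling a request
puts result, timer.count, timer.durations_ms.first
```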
Gauges, Counters, Meters, Histograms, Timers
Now what?
Instrument it. If it could affect your code’s business value, add a metric. Our services have 40-50 metrics.
Collect it. JSON via HTTP. Every minute.
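The payload side of collection can be sketched with Ruby's standard JSON library. The field names and the snapshot layout below are invented for illustration, and the HTTP endpoint that would serve this document to a collector once a minute is omitted.

```ruby
require 'json'

# Hypothetical snapshot of a service's metrics at collection time.
def metrics_snapshot
  {
    "cities"      => { "type" => "gauge",   "value" => 589 },
    "connections" => { "type" => "counter", "count" => 594 },
    "requests"    => { "type" => "timer",   "count" => 120_000, "p99_ms" => 13.0 },
  }
end

payload = JSON.generate(metrics_snapshot)
puts payload
```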
Monitor it. Nagios/Zabbix/Whatever. If it affects business value, someone should get woken up.
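A monitoring check boils down to comparing a metric against thresholds. The sketch below follows the Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL); the check_latency function and its thresholds are hypothetical.

```ruby
# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2

def check_latency(p99_ms, warn_ms:, crit_ms:)
  if p99_ms >= crit_ms
    [CRITICAL, "CRITICAL - p99 latency #{p99_ms}ms"]
  elsif p99_ms >= warn_ms
    [WARNING, "WARNING - p99 latency #{p99_ms}ms"]
  else
    [OK, "OK - p99 latency #{p99_ms}ms"]
  end
end

status, message = check_latency(453, warn_ms: 100, crit_ms: 400)
puts message
# A real check script would finish with `exit status`.
```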
Aggregate it. Ganglia/Graphite/Cacti/Whatever. Place current values in historical context. See long-term patterns.
Go faster.
Shorten our decision-making cycle.
Observe, Orient, Decide, Act
Observe: What is the 99% latency of our autocomplete service right now? ~500ms
Orient: How does this compare to other parts of our system, both currently and historically? way slower
Decide: Should we make it faster? Or should we add feature X? make it faster
Act! Write some code.
  def sort_by(&blk)
    #sleep(100) # WTF DUDE
    super(&blk)
  end
10 Print "Rinse"
20 Print "Repeat"
30 Goto 10
If we do this faster we will win.
Fewer bugs.
More features.
Happier users.
Money.
tl;dr
We might write code.
We have to generate business value.
In order to know how well our code is generating business value, we need metrics.
Gauges, Counters, Meters, Histograms, Timers
Monitor them for current problems.
Aggregate them for historical perspective.
map ≠ territory
map → territory
Improve our mental model of our code.
MIND THE GAP
Observe, Orient, Decide, Act
If you’re on the JVM, use Metrics. github.com/codahale/metrics
If not, you can build this.
Please build this.
Make better decisions by using numbers.
Thank you.