Metrics Metrics Everywhere - Coda Hale

METRICS

METRICS EVERYWHERE Saturday, April 9, 2011


Make better decisions by using numbers.

Coda Hale @coda github.com/codahale

The enterprise social network.

www.yammer.com

I write code.

But that’s not actually my job.

code

business value

What the hell is business value?

A new feature.

An improved existing feature.

Fewer bugs.

Not pissing our users off with a slow site. ugly pretty

Making future changes easier.

Adding a unit test before fixing that bug.

Business value is anything which makes people more likely to give us money.

We want to generate more business value.

We need to make better decisions about our code.

Our code generates business value when it runs, not when we write it.

We need to know what our code does when it runs.

We can’t do this unless we measure it.

Why measure it?

map ≠ territory

map of San Francisco ≠ city of San Francisco

the way we talk ≠ the way it is

the thing we think of ≠ the thing in itself

perception ≠ reality

MIND THE GAP

We have a mental model of what our code does.

It’s a mental model. It’s not the code.

It is often wrong.

Confusion.

“This code can’t possibly work.”

(It works.)

MIND THE GAP

“This code can’t possibly fail.”

(It fails.)

MIND THE GAP

Which is faster?

    items.sort_by { |i| i.name }
    items.sort { |a, b| a.name <=> b.name }

We don’t know.

    def sort_by(&blk)
      sleep(100) # FIXME: I AM POISON
      super(&blk)
    end

    def sort(&blk)
      # TODO: make not explode
      raise Exception.new("Haw haw!")
    end

We can’t know until we measure it.
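A minimal sketch of measuring instead of guessing, using Ruby's standard Benchmark module. The Item struct and the 10,000-element list are made up for illustration. The timings depend on the machine and the Ruby version, so no expected numbers are shown.

```ruby
require "benchmark"

# Hypothetical data: the Item struct and the list size are assumptions.
Item = Struct.new(:name)
rng = Random.new(42)
items = Array.new(10_000) { Item.new(rng.rand(1_000_000).to_s) }

# Benchmark.realtime returns the wall-clock time of the block in seconds.
by_time   = Benchmark.realtime { items.sort_by { |i| i.name } }
sort_time = Benchmark.realtime { items.sort { |a, b| a.name <=> b.name } }

puts format("sort_by: %.2fms", by_time * 1000)
puts format("sort:    %.2fms", sort_time * 1000)
```

A single run like this is noisy. Repeating it, and running it against production-shaped data, gives a more trustworthy answer.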

This affects how we make decisions.

“Our application is slow. This page takes 500ms. Fix it.”

Find the bottleneck!

SQL Query
Template Rendering
Session Storage

We don’t know.

Find The Bottleneck 2.0!

SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms

Confusion.

We made a better decision.

We improve our mental model by measuring what our code does.

map ≠ territory

map → territory

We use our mental model to decide what to do.

A better mental model makes us better at deciding what to do.

A better mental model makes us better at generating business value.

Measuring makes your decisions better.

But only if we’re measuring the right thing.

We need to measure our code where it matters.

In the wild.

Generating business value.

PRODUCTION

Continuously measuring code in production.

Metrics

Java/Scala

github.com/codahale/metrics

Gauges Counters Meters Histograms Timers

Each metric is associated with a class and has a name.

An autocomplete service for city names.

    > GET /complete?q=San%20Fra
    < HTTP/1.1 200 RAD
    <
    < ["San Francisco"]

What does this code do that affects its business value?

And how can we measure that?

Gauges Counters Meters Histograms Timers

Gauge

The instantaneous value of something.

# of cities

    metrics.gauge("cities") { cities.size }

“The service has 589 cities registered.”
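For readers not on the JVM, a gauge can be sketched in a few lines of Ruby. The Gauge class below is hypothetical, not the Metrics library's API. It only shows the idea that the value is computed when it is read.

```ruby
# A gauge wraps a block and evaluates it on every read, so the reported
# value is always the current one. Gauge is a hypothetical class.
class Gauge
  def initialize(&block)
    @block = block
  end

  def value
    @block.call
  end
end

cities = ["San Francisco", "San Jose"]
gauge = Gauge.new { cities.size }
puts gauge.value # 2
cities << "San Diego"
puts gauge.value # 3
```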

Gauges Counters Meters Histograms Timers

Counter

An incrementing and decrementing value.

# of open connections

    val counter = metrics.counter("connections")
    counter.inc()
    counter.dec()

“There are 594 active sessions on that server.”
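A counter is similarly small. The Counter class below is a hypothetical Ruby sketch, not the library's implementation. It assumes a Mutex is an acceptable way to keep increments safe when several threads update the same counter.

```ruby
# Counter is a hypothetical class; a Mutex keeps inc/dec safe across threads.
class Counter
  def initialize
    @count = 0
    @lock = Mutex.new
  end

  def inc(n = 1)
    @lock.synchronize { @count += n }
  end

  def dec(n = 1)
    @lock.synchronize { @count -= n }
  end

  def count
    @lock.synchronize { @count }
  end
end

connections = Counter.new
threads = Array.new(4) { Thread.new { 100.times { connections.inc } } }
threads.each(&:join)
connections.dec
puts connections.count # 399
```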

Gauges Counters Meters Histograms Timers

Meter

The average rate of events over a period of time.

# of requests/sec

    val meter = metrics.meter("requests", SECONDS)
    meter.mark()

mean rate = # of events / elapsed time

[Graph: # of requests over time]

MIND THE GAP

Recency.
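A small worked sketch of why a lifetime mean rate hides recent changes. The scenario is an assumption made up for illustration: one hour at 3,000 requests/sec followed by one minute at 500 requests/sec.

```ruby
# Lifetime mean rate = total events / total elapsed time.
def mean_rate(events, elapsed_seconds)
  events.to_f / elapsed_seconds
end

# Assumed scenario: one hour at 3,000 req/sec, then one minute at 500 req/sec.
events  = 3_000 * 3_600 + 500 * 60
elapsed = 3_600 + 60

puts mean_rate(events, elapsed).round(1) # 2959.0
```

The service has been running at a sixth of its usual rate for a full minute, yet the mean rate has barely moved.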

mean rate = # of events / elapsed time

COGNITIVE HAZARD

Exponentially weighted moving average.

$$m_t = (1-\alpha)^k \, m_{t-1} + \left(1 - (1-\alpha)^k\right) \frac{Y_t}{k}$$

[Graph: # of requests over time]

1-minute rate
5-minute rate
15-minute rate

“We went from 3,000 requests/sec to <500 a second.”
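A rough sketch of how 1-minute and 15-minute rates could be kept. It uses the common load-average style update, rate += alpha * (instant - rate) with alpha = 1 - e^(-interval/window). That is an assumption, not necessarily the exact form the Metrics library uses. The Ewma class, the 5-second tick, and the traffic pattern are all hypothetical.

```ruby
# Ewma is a hypothetical class. It is ticked at a fixed interval with the
# number of events seen since the previous tick.
class Ewma
  attr_reader :rate

  def initialize(window_seconds, interval_seconds = 5.0)
    @interval = interval_seconds
    # Smoothing factor: the weight given to the newest interval.
    @alpha = 1.0 - Math.exp(-interval_seconds / window_seconds)
    @rate = nil
  end

  def tick(events_in_interval)
    instant = events_in_interval / @interval
    # Seed with the first observation, then move toward each new one by alpha.
    @rate = @rate.nil? ? instant : @rate + @alpha * (instant - @rate)
  end
end

one_minute     = Ewma.new(60.0)
fifteen_minute = Ewma.new(900.0)

# Assumed traffic: 10 minutes at 3,000 req/sec, then 5 minutes at 500 req/sec.
traffic = [15_000] * 120 + [2_500] * 60
traffic.each do |n|
  one_minute.tick(n)
  fifteen_minute.tick(n)
end

puts one_minute.rate.round     # a little over 500
puts fifteen_minute.rate.round # still well above 2,000
```

The short window reacts to the drop within minutes. The long window keeps the historical context.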

Gauges Counters Meters Histograms Timers

Histogram

The statistical distribution of values in a stream of data.

# of cities returned

    val histogram = metrics.histogram("response-sizes")
    histogram.update(response.cities.size)

minimum
maximum
mean
standard deviation

Quantiles

median
75th percentile
95th percentile
98th percentile
99th percentile
99.9th percentile
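As a sketch of what these summaries mean, here they are computed over a tiny, made-up list of response sizes. The quantile helper is hypothetical and uses the nearest-rank definition, which is only one of several common conventions. Note that it keeps every value in memory.

```ruby
# Hypothetical helper: nearest-rank quantile of an already sorted array.
def quantile(sorted, q)
  return nil if sorted.empty?
  rank = (q * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up response sizes (number of cities returned per request).
sizes  = [0, 1, 1, 1, 2, 2, 3, 3, 5, 12]
sorted = sizes.sort
mean   = sizes.sum.to_f / sizes.size
# Sample standard deviation (divides by n - 1).
stddev = Math.sqrt(sizes.sum { |x| (x - mean)**2 } / (sizes.size - 1))

puts "min=#{sorted.first} max=#{sorted.last} mean=#{mean}"
puts "stddev=#{stddev.round(2)}"
puts "median=#{quantile(sorted, 0.5)} p75=#{quantile(sorted, 0.75)} p95=#{quantile(sorted, 0.95)}"
```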

We can’t keep all of these values.

1,000 req/sec
× 1,000 actions/req
× 1 day
= >86 billion values

>640GB of data/day

Not gonna happen.
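The arithmetic can be checked directly. The one assumption not stated on the slide is the storage size: this sketch assumes each value is kept as one 64-bit (8-byte) number.

```ruby
# Assumption: each measurement is stored as one 64-bit (8-byte) number.
REQUESTS_PER_SEC    = 1_000
ACTIONS_PER_REQUEST = 1_000
SECONDS_PER_DAY     = 24 * 60 * 60
BYTES_PER_VALUE     = 8

values_per_day = REQUESTS_PER_SEC * ACTIONS_PER_REQUEST * SECONDS_PER_DAY
gib_per_day    = values_per_day * BYTES_PER_VALUE / (1024.0**3)

puts values_per_day       # 86400000000
puts gib_per_day.round(1) # 643.7
```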

COGNITIVE HAZARD

Reservoir sampling. Keep a statistically representative sample of measurements as they happen.

Vitter’s Algorithm R.

Vitter, J. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1), 37–57.
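A minimal Ruby sketch of Algorithm R. The UniformReservoir class name and the reservoir size of 100 are assumptions for illustration. The replacement rule is the one from the paper: the n-th value replaces a random slot with probability size/n.

```ruby
# Algorithm R: keep a uniform random sample of at most `size` values from a
# stream of unknown length. UniformReservoir is a hypothetical class name.
class UniformReservoir
  attr_reader :values, :count

  def initialize(size, rng = Random.new)
    @size = size
    @rng = rng
    @values = []
    @count = 0
  end

  def update(value)
    @count += 1
    if @values.size < @size
      @values << value         # fill the reservoir first
    else
      slot = @rng.rand(@count) # uniform integer in 0...count
      @values[slot] = value if slot < @size
    end
  end
end

reservoir = UniformReservoir.new(100, Random.new(1))
100_000.times { |i| reservoir.update(i) }
puts reservoir.values.size # 100
```

Memory stays fixed at 100 values no matter how long the stream runs.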

[Graph: # of cities over time]

MIND THE GAP

Vitter’s Algorithm R produces uniform samples.

Recency.

SUPER-DUPER COGNITIVE HAZARD

Forward-decaying priority sampling.

Cormode, G., Shkapenyuk, V., Srivastava, D., & Xu, B. (2009). Forward Decay: A Practical Time Decay Model for Streaming Systems. ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering.

Maintain a statistically representative sample of the last 5 minutes.
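A simplified sketch of the idea, under stated assumptions. Each value gets a weight that grows exponentially with its timestamp, the weight is divided by a random number to form a priority, and the reservoir keeps the highest priorities. The ForwardDecayReservoir class, the alpha of 0.015, and the size of 100 are illustrative choices. The sketch also omits the periodic rescaling a long-running process would need to keep the weights from overflowing.

```ruby
# ForwardDecayReservoir is a hypothetical class. alpha sets how strongly newer
# values are favoured; the value used below is an arbitrary choice.
class ForwardDecayReservoir
  def initialize(size, alpha, landmark = 0.0, rng = Random.new)
    @size = size
    @alpha = alpha
    @landmark = landmark
    @rng = rng
    @entries = [] # [priority, value] pairs
  end

  def update(value, timestamp)
    # Forward decay: the weight grows with time elapsed since the landmark.
    weight = Math.exp(@alpha * (timestamp - @landmark))
    # 1.0 - rand is in (0, 1], which avoids dividing by zero.
    priority = weight / (1.0 - @rng.rand)
    if @entries.size < @size
      @entries << [priority, value]
    else
      # Replace the lowest-priority entry if the new one outranks it.
      lowest = @entries.each_index.min_by { |i| @entries[i][0] }
      @entries[lowest] = [priority, value] if priority > @entries[lowest][0]
    end
  end

  def values
    @entries.map { |_, value| value }
  end
end

# Feed one value per second for 10,000 seconds; the value is its own timestamp.
reservoir = ForwardDecayReservoir.new(100, 0.015, 0.0, Random.new(1))
10_000.times { |t| reservoir.update(t, t.to_f) }
puts reservoir.values.min # the oldest surviving value is a recent one
```

The linear scan for the lowest priority keeps the sketch short. A sorted structure would suit larger reservoirs better.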

[Graph: # of cities over time; labels: Uniform, Biased]

“95% of autocomplete results return 3 cities or less.”

Gauges Counters Meters Histograms Timers

Timer

A histogram of durations and a meter of calls.

# of ms to respond

    val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
    timer.time { handle(req, resp) }

“At ~2,000 req/sec, our 99% latency jumps from 13ms to 453ms.”
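A bare-bones Ruby sketch of the timing part. The Timer class is hypothetical. It keeps every duration in an array for simplicity, where a fuller version would feed a sampling reservoir and a meter instead.

```ruby
# Timer is a hypothetical class: it records each call's duration (the
# histogram side) and counts calls (the meter side).
class Timer
  attr_reader :durations_ms

  def initialize
    @durations_ms = []
  end

  # Runs the block, records how long it took, and returns the block's result.
  # The ensure clause records the duration even if the block raises.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @durations_ms << elapsed * 1000.0
  end

  def count
    @durations_ms.size
  end
end

timer = Timer.new
result = timer.time { (1..1_000).sum }
puts result      # 500500
puts timer.count # 1
```

The monotonic clock is used so that wall-clock adjustments cannot produce negative durations.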

Gauges Counters Meters Histograms Timers

Now what?

Instrument it.

If it could affect your code’s business value, add a metric.

Our services have 40-50 metrics.

Collect it.

JSON via HTTP.

Every minute.
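A sketch of the JSON side of collection, using Ruby's standard json library. The metric names and the shape of the document are assumptions, and the values are borrowed from the examples earlier in the deck. Serving the string over HTTP and polling it every minute are left out.

```ruby
require "json"

# Hypothetical snapshot of a service's metrics; names and layout are made up.
def metrics_snapshot
  {
    "cities" => 589,
    "connections" => 594,
    "requests" => { "one_minute_rate" => 2_000.0, "p99_ms" => 13.0 }
  }
end

# The string below is what an HTTP endpoint would return as its body.
body = JSON.generate(metrics_snapshot)
puts body
```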

Monitor it.

Nagios/Zabbix/Whatever

If it affects business value, someone should get woken up.

Aggregate it.

Ganglia/Graphite/Cacti/Whatever

Place current values in historical context.

See long-term patterns.

Go faster.

Shorten our decision-making cycle.

Observe Orient Decide Act

Observe

What is the 99% latency of our autocomplete service right now?

~500ms

Orient

How does this compare to other parts of our system, both currently and historically?

way slower

Decide

Should we make it faster? Or should we add feature X?

make it faster

Act!

Write some code.

    def sort_by(&blk)
      #sleep(100) # WTF DUDE
      super(&blk)
    end

    10 Print "Rinse"
    20 Print "Repeat"
    30 Goto 10

If we do this faster we will win.

Fewer bugs.

More features.

Happier users.

Money.

tl;dr

We might write code.

We have to generate business value.

In order to know how well our code is generating business value, we need metrics.

Gauges Counters Meters Histograms Timers

Monitor them for current problems.

Aggregate them for historical perspective.

map ≠ territory

map → territory

Improve our mental model of our code.

MIND THE GAP

Observe Orient Decide Act

If you’re on the JVM, use Metrics.

github.com/codahale/metrics

If not, you can build this.

Please build this.

Make better decisions by using numbers.

Thank you.