Velocity Culture (The Unmet Challenge in Ops)
Jon Jenkins, Amazon.com
[email protected]
O'Reilly Velocity Conference, June 16, 2011
The success of Velocity culture depends on linking it to the business
Web performance has succeeded in doing this, ops less so
Ops needs to focus on closing this gap
2009: Bing, Google
2010: Shopzilla
2011: MSN, DoubleClick
ops = business
ops != business
ops ? business
ops business
What if the size of your server fleet was completely flexible?
Case Study 1: Scaling Down
[Figure: Typical Weekly Traffic to amazon.com, plotted Sunday through Saturday. The final version of the chart is annotated with 39% and 61%.]
[Figure: November Traffic for amazon.com. The chart is annotated with 76% and 24%.]
Capacity Planning = Spending Money
Capacity Optimization = Saving Money
The Problem
Web site hardware is underutilized
Traffic spikes require heroic effort
Scaling is non-linear
November 10, 2010
Outcomes
All traffic for www.amazon.com is now served by EC2
Reduced spending on server capacity
Fleet scales dynamically in increments as small as a single host
Traffic spikes can be handled with ease
Cultural change
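To make the "reduced spending" and "single host increments" outcomes concrete, here is a minimal sketch comparing a fixed fleet sized for the peak hour against a fleet resized every hour. The function names, the per-host capacity, the 25% headroom, and the hourly load figures are all assumptions made up for illustration; none of them come from the talk or from Amazon data.

```python
import math

def hosts_needed(requests_per_sec, per_host_capacity, headroom=0.25):
    """Smallest whole number of hosts that serves the load with spare headroom.

    All parameters are hypothetical; the talk does not give sizing rules.
    """
    if requests_per_sec <= 0:
        return 1  # keep at least one host running
    return max(1, math.ceil(requests_per_sec * (1 + headroom) / per_host_capacity))

def host_hours(hourly_load, per_host_capacity, headroom=0.25):
    """Return (fixed, dynamic) host-hours for a series of hourly loads."""
    per_hour = [hosts_needed(load, per_host_capacity, headroom) for load in hourly_load]
    fixed = max(per_hour) * len(per_hour)  # fixed fleet: sized for the peak, every hour
    dynamic = sum(per_hour)                # elastic fleet: resized one host at a time
    return fixed, dynamic

# Made-up hourly request rates, for illustration only.
example_load = [400, 300, 250, 300, 600, 900, 1200, 1000]
fixed, dynamic = host_hours(example_load, per_host_capacity=100)
print(f"fixed fleet: {fixed} host-hours, dynamic fleet: {dynamic} host-hours")
```

The gap between the two totals is the capacity a fixed fleet pays for but does not use, which is the point of "Capacity Optimization = Saving Money".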
Case Study 2: Scaling Up
Continuous Deployment
Amazon May Deployment Stats (production hosts & environments only)
11.6 seconds: mean time between deployments (weekday)
1,079: max number of deployments in a single hour
10,000: mean number of hosts simultaneously receiving a deployment
30,000: max number of hosts simultaneously receiving a deployment
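As a rough back-of-the-envelope reading of these figures, the sketch below converts the mean gap between deployments into an implied rate and the peak hourly count into an implied gap. It assumes the 11.6-second mean holds across a full 24-hour weekday, which the slide does not state, so the daily total is only indicative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

mean_gap_s = 11.6     # from the slide: mean time between deployments (weekday)
peak_per_hour = 1079  # from the slide: max deployments in a single hour

# Assumption: the mean gap applies uniformly over all 24 hours of a weekday.
implied_per_day = SECONDS_PER_DAY / mean_gap_s
implied_mean_per_hour = SECONDS_PER_HOUR / mean_gap_s

# Average gap between deployments during the busiest hour.
peak_gap_s = SECONDS_PER_HOUR / peak_per_hour

print(f"implied deployments per weekday: {implied_per_day:.0f}")
print(f"implied mean deployments per hour: {implied_mean_per_hour:.0f}")
print(f"gap between deployments in the peak hour: {peak_gap_s:.2f} s")
```

Under that assumption the mean works out to a few hundred deployments per hour, so the peak hour of 1,079 runs at several times the average rate.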
[Diagram, repeated across several slides: a Load Balancer in front of three Availability Zones, each running web hosts WWW1, WWW2, WWW3 ... WWWn.]
The Problem
Upgrading software on a fixed fleet requires a complex workflow
Upgrading software on a fixed fleet is a slow process
Dealing with failure scenarios requires emergent, high-judgment decisions
[Diagram sequence: the Load Balancer in front of three Availability Zones (WWW1, WWW2, WWW3 ... WWWn in each zone), followed by slides showing the same layout with a second set of three Availability Zones drawn alongside it.]
Results
75% reduction in outages triggered by software deployments since 2006
90% reduction in outage minutes triggered by software deployments
~0.001% of software deployments cause an outage
Instantaneous automated rollback
Reduction in complexity
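The slides do not spell out how the rollback works. One plausible reading of the diagrams is that new software goes onto a second, parallel fleet and the load balancer is pointed at it, so rolling back means pointing the load balancer at the old fleet again. The sketch below illustrates that idea only; `Fleet`, `LoadBalancer`, `deploy`, and the health check are hypothetical names, not Amazon's tooling or any real API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Fleet:
    """Hypothetical group of web hosts all running one software version."""
    version: str
    hosts: List[str] = field(default_factory=list)

class LoadBalancer:
    """Hypothetical load balancer that sends all traffic to one active fleet."""

    def __init__(self, active: Fleet) -> None:
        self.active = active
        self.previous: Optional[Fleet] = None

    def switch_to(self, fleet: Fleet) -> None:
        # Keep the old fleet so rollback is a pointer swap, not a redeploy.
        self.previous, self.active = self.active, fleet

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous fleet to roll back to")
        self.active, self.previous = self.previous, None

def deploy(lb: LoadBalancer, new_fleet: Fleet,
           is_healthy: Callable[[Fleet], bool]) -> bool:
    """Switch traffic to new_fleet; roll back automatically if it is unhealthy."""
    lb.switch_to(new_fleet)
    if is_healthy(new_fleet):
        return True
    lb.rollback()
    return False

# Usage: a failed health check leaves traffic on the original fleet.
lb = LoadBalancer(Fleet("v1", ["www1", "www2"]))
failed = deploy(lb, Fleet("v2", ["www1b", "www2b"]), is_healthy=lambda f: False)
version_after_failure = lb.active.version
succeeded = deploy(lb, Fleet("v3", ["www1c", "www2c"]), is_healthy=lambda f: True)
version_after_success = lb.active.version
```

Because the old fleet stays intact until the new one is confirmed healthy, the failure path needs no human judgment call, which is one way to get the "reduction in complexity" the slide mentions.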
The Challenge for Velocity 2012
Long live Ops!