FAWNdamentally Power-efficient Clusters
Vijay Vasudevan, Jason Franklin, David Andersen, Amar Phanishayee, Lawrence Tan, Michael Kaminsky*, Iulian Moraru
Carnegie Mellon University, *Intel Research Pittsburgh
May 20, 2009
Monthly energy statement considered harmful
• Power is a limiting factor in computing
• 3-year TCO soon to be dominated by power cost [EPA 2007]
• Power influences location and technology choices
Approaches to saving power
• Infrastructure efficiency: power generation, power distribution, cooling
• Dynamic power scaling: sleeping when idle, rate adaptation, VM consolidation
• Computational efficiency: FAWN
Goal of computational efficiency: reduce the amount of energy needed to do useful work
FAWN: Fast Array of Wimpy Nodes
Improve computational efficiency of data-intensive computing using an array of well-balanced low-power systems.
[Figure: a FAWN node: AMD Geode CPU, 256MB DRAM, 4GB CompactFlash]
Target: Data-intensive computing
• Large amounts of data
• Highly parallelizable
• Fine-grained, independent tasks
Workloads amenable to a “scale-out” approach
Outline
• What is FAWN?
• Why FAWN?
• When FAWN?
• Challenges (How FAWN?)
Why FAWN?
1. Fixed costs make dynamic power scaling difficult
2. FAWN balances the system to save energy
3. FAWN targets the sweet spot in efficiency
4. FAWN reduces peak power consumption
1. Fixed power costs dominate
[Figure: power (W) vs. system utilization (%) for No DVFS, DVFS, and Ideal; adapted from Tolia et al., HotPower ’08]
Fixed power costs: 70% of peak power is drawn at 0% utilization!
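To see why a large idle draw hurts, here is a minimal sketch that assumes power rises linearly from the idle level to peak as utilization grows (a simplification; the measured curves need not be linear). The 70% idle fraction comes from the annotation above; the helper name `energy_per_work_vs_ideal` is hypothetical. It reports how much more energy each unit of work costs than on an ideal, perfectly proportional system.

```python
def energy_per_work_vs_ideal(utilization: float, idle_fraction: float = 0.7) -> float:
    """Energy per unit of work, relative to an ideal energy-proportional system.

    Assumes a hypothetical linear power model, normalized to peak power = 1:
        P(u) = idle_fraction + (1 - idle_fraction) * u
    An ideal system draws P(u) = u, so the ratio is P(u) / u.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    power = idle_fraction + (1.0 - idle_fraction) * utilization
    return power / utilization


if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5, 1.0):
        print(f"{u:.0%} utilization: {energy_per_work_vs_ideal(u):.2f}x ideal energy per unit of work")
```

Under this model a server at 10% utilization spends 7.3x the ideal energy per unit of work, and only at full utilization does it match the ideal.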
2. Balancing to save energy
[Figure: CPU/Disk gap: CPU-to-disk-seek speed ratio (log scale, 10^4 to 10^8) by year, 1980 to 2005]
• How do we balance?
  • Big CPUs clocked down?
  • Embedded CPUs?
  • Why not use more disks with big CPUs?
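One way to read "balance" is to ask how many disks it takes to keep one CPU busy on seek-bound work. The sketch below uses purely hypothetical figures (a CPU executing 10^9 instructions/sec, 10^5 instructions per lookup, one random seek per lookup, a disk serving 100 seeks/sec); `disks_to_balance` is an illustrative helper, not part of FAWN.

```python
import math


def disks_to_balance(cpu_instr_per_sec: float, instr_per_lookup: float,
                     seeks_per_disk_per_sec: float) -> int:
    """Disks needed so that random seeks keep up with what the CPU can drive.

    All inputs are hypothetical; each lookup is assumed to need one disk seek.
    """
    lookups_cpu_can_drive = cpu_instr_per_sec / instr_per_lookup
    return math.ceil(lookups_cpu_can_drive / seeks_per_disk_per_sec)


if __name__ == "__main__":
    # A fast CPU needs many disks; a CPU 10x slower needs 10x fewer.
    print(disks_to_balance(1e9, 1e5, 100))  # fast, hypothetical CPU
    print(disks_to_balance(1e8, 1e5, 100))  # slower, hypothetical CPU
```

As the CPU-to-seek gap widens, the disk count per CPU grows; slowing the CPU or adding disks are the two levers the slide asks about.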
3. Targeting the sweet spot in efficiency
[Figure: Speed vs. Efficiency: instructions/sec/W (millions) vs. instructions/sec (millions, log scale) for Custom ARM Mote, XScale 800MHz, Atom Z500, and Xeon7350]
• Fast processors mask the memory wall at the cost of efficiency
• Fixed power costs can dominate efficiency for slow processors
• FAWN targets the sweet spot in processor efficiency when fixed costs are included
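The sweet-spot argument can be made concrete with a toy model: a fixed power cost plus a dynamic cost that grows faster than linearly with speed (assumed here as a stand-in for the extra power fast processors spend hiding the memory wall). Efficiency s / (P_fixed + k·s^α) then peaks at an intermediate speed, and a larger fixed cost pushes that peak toward faster processors. The functions and all parameters are hypothetical and unitless.

```python
def efficiency(speed: float, p_fixed: float, k: float, alpha: float) -> float:
    """Work per joule under a hypothetical power model:
    total power = p_fixed + k * speed**alpha, with alpha > 1."""
    return speed / (p_fixed + k * speed ** alpha)


def sweet_spot(p_fixed: float, k: float, alpha: float) -> float:
    """Speed that maximizes efficiency().

    Setting d/ds [s / (F + k s^a)] = 0 gives F - (a - 1) k s^a = 0,
    so s* = (F / (k (a - 1)))^(1/a).
    """
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for an interior optimum")
    return (p_fixed / (k * (alpha - 1))) ** (1.0 / alpha)


if __name__ == "__main__":
    # Log-spaced speeds from 10^-3 to 10^3 (arbitrary units).
    speeds = [10 ** (i / 100) for i in range(-300, 301)]
    for p_fixed in (1.0, 4.0):
        best = max(speeds, key=lambda s: efficiency(s, p_fixed, k=1.0, alpha=2.0))
        print(f"p_fixed={p_fixed}: grid best {best:.3f}, "
              f"analytic {sweet_spot(p_fixed, 1.0, 2.0):.3f}")
```

Very slow speeds lose because fixed power dominates; very fast speeds lose because the superlinear term dominates. The grid search and the closed form agree on where the peak lies.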
4. Reducing peak power consumption
• Provisioning for peak power requires:
  1. worst-case cooling capacity
  2. UPS systems to cover power failures
  3. investment in power generation and substations
What is FAWN good for?
• Random-access workloads (key-value lookup)
• Scan-bound workloads (Hadoop, data analytics)
• CPU-bound workloads (compression, encryption)
Important metrics
• Performance: work / time
• Efficiency: perf / Watt
• Density: perf / volume
• Cost: perf / $
Random access workloads
System              Power   Queries/sec   Queries/Joule
FAWN + CF           4 W     1697          424.25
Traditional + HD    87 W    177           2.034
Traditional + SSD   83 W    5800          69.88
FAWN is 6-200x more efficient (queries/Joule) than traditional systems
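The efficiency figures follow directly from the reported throughput and power, since queries/sec divided by watts (joules/sec) gives queries per joule. This short check recomputes them; the dictionary and function names are illustrative only.

```python
# Reported throughput (queries/sec) and power (W) for each system.
SYSTEMS = {
    "FAWN + CF": (1697, 4),
    "Traditional + HD": (177, 87),
    "Traditional + SSD": (5800, 83),
}


def queries_per_joule(queries_per_sec: float, watts: float) -> float:
    """Efficiency = performance / power; (queries/s) / (J/s) = queries/J."""
    return queries_per_sec / watts


if __name__ == "__main__":
    qpj = {name: queries_per_joule(q, w) for name, (q, w) in SYSTEMS.items()}
    for name, value in qpj.items():
        print(f"{name:18s} {value:8.3f} queries/J")
    fawn = qpj["FAWN + CF"]
    for name in ("Traditional + SSD", "Traditional + HD"):
        print(f"FAWN vs {name}: {fawn / qpj[name]:.1f}x")
```

This reproduces 424.25, 2.034 and 69.88 queries/Joule, and the 6-200x range: about 6.1x over the SSD system and about 208x over the disk system.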
CPU-bound encryption
AES encryption/decryption of a 512MB file with a 256-bit key
System              Power   Encryption speed (MB/s)   Encryption efficiency (MB/J)
FAWN                5 W     3.65                      0.73
Traditional + HD    87 W    51.15                     0.365
FAWN is 2x more efficient for CPU-bound operations!
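Efficiency in MB/J translates directly into the energy needed for a fixed job. This sketch checks FAWN's 0.73 MB/J from its 3.65 MB/s at 5 W and then uses the reported efficiencies as given to compute the energy for the 512MB file; the helper names are hypothetical.

```python
FILE_MB = 512  # size of the encrypted file in the benchmark

# Reported encryption efficiency in MB per joule.
EFFICIENCY_MB_PER_J = {"FAWN": 0.73, "Traditional + HD": 0.365}


def mb_per_joule(mb_per_sec: float, watts: float) -> float:
    """(MB/s) / (J/s) = MB/J."""
    return mb_per_sec / watts


def joules_for_file(size_mb: float, mb_per_j: float) -> float:
    """Energy needed to process size_mb at a given efficiency."""
    return size_mb / mb_per_j


if __name__ == "__main__":
    print(f"FAWN check: {mb_per_joule(3.65, 5):.2f} MB/J")
    for name, eff in EFFICIENCY_MB_PER_J.items():
        print(f"{name}: {joules_for_file(FILE_MB, eff):.0f} J for a {FILE_MB} MB file")
```

Halving the MB/J doubles the energy spent on the same file, which is the 2x in the claim, even though the traditional system finishes the job much sooner.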
When to use FAWN for random access workloads?
• Total cost of ownership
  • Capital cost + 3-year power @ $0.10/kWh
• What is the cheapest architecture for serving random access workloads?
  • Traditional + {Disks, SSD, DRAM}?
  • FAWN + {Disks, SSD, DRAM}?
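The TCO definition on this slide (capital cost plus three years of power at $0.10/kWh) is simple arithmetic. A minimal sketch, assuming the machine draws a constant average power and ignoring cooling and other overheads; the function name and sample inputs are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24
PRICE_PER_KWH = 0.10  # $/kWh, from the slide
YEARS = 3


def tco(capital_dollars: float, avg_watts: float) -> float:
    """Capital cost plus 3 years of electricity at $0.10/kWh.

    Assumes the machine draws avg_watts continuously; cooling and
    other overheads are not included.
    """
    kwh = avg_watts * HOURS_PER_YEAR * YEARS / 1000
    return capital_dollars + kwh * PRICE_PER_KWH


if __name__ == "__main__":
    # Hypothetical input, not a measurement from the talk.
    print(f"${tco(0, 100):.2f} of power for a 100 W box over 3 years")
```

Each continuous watt costs about $2.63 over three years at this price, so savings of a few watts per node add up across a large cluster.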
Architecture with the lowest TCO for random access workloads
[Figure: lowest-TCO architecture as a function of dataset size (TB) and query rate (millions/sec)]
Ratio of query rate to dataset size informs storage technology
FAWN-based systems can provide lower cost per GB and per unit of query rate
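Why the ratio of query rate to dataset size picks the storage technology can be shown with a toy model: a cluster needs enough nodes to hold the data and enough to serve the queries, so its node count is the larger of the two. The sketch below does not reproduce the figure; every node parameter in `OPTIONS` is a hypothetical placeholder, and `NodeType`, `cluster_tco` and `cheapest` are illustrative names. TCO uses the same capital-plus-3-year-power rule at $0.10/kWh.

```python
import math
from dataclasses import dataclass

HOURS_3Y = 3 * 365 * 24
PRICE_PER_KWH = 0.10


@dataclass
class NodeType:
    # All figures are hypothetical placeholders, not measured values.
    name: str
    capital: float      # $ per node
    watts: float        # average W per node
    capacity_gb: float  # data each node can hold
    qps: float          # random queries/sec each node can serve


def node_tco(n: NodeType) -> float:
    return n.capital + n.watts * HOURS_3Y / 1000 * PRICE_PER_KWH


def cluster_tco(n: NodeType, dataset_gb: float, query_rate: float) -> float:
    # Buy enough nodes to hold the whole dataset AND sustain the query rate.
    count = max(math.ceil(dataset_gb / n.capacity_gb), math.ceil(query_rate / n.qps))
    return count * node_tco(n)


def cheapest(options: list[NodeType], dataset_gb: float, query_rate: float) -> NodeType:
    return min(options, key=lambda n: cluster_tco(n, dataset_gb, query_rate))


OPTIONS = [
    NodeType("disk", capital=300, watts=20, capacity_gb=1000, qps=100),
    NodeType("ssd", capital=400, watts=20, capacity_gb=32, qps=5000),
    NodeType("dram", capital=300, watts=20, capacity_gb=4, qps=50000),
]

if __name__ == "__main__":
    for gb, qps in [(10_000, 1_000), (1_000, 100_000), (10, 1_000_000)]:
        print(f"{gb} GB at {qps} queries/s -> {cheapest(OPTIONS, gb, qps).name}")
```

With these made-up numbers, a large dataset with few queries favors capacity-dense disks, a middling ratio favors SSD, and a small hot dataset favors DRAM.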
Challenges
“Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code” - Kathy Yelick
• Algorithms and architectures at 10x scale
• Dealing with Amdahl’s law
• High performance using low-performance nodes
• Today’s software may not run out of the box
• Manageability, failures, network design, power cost vs. engineering cost
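The Amdahl's law challenge can be quantified with a sketch. Assume, hypothetically, that each wimpy node is `slowdown` times slower than one traditional node and that a job's serial fraction must run on a single wimpy node. An array of wimpy nodes can then match the traditional node only if the serial part is small enough; `min_wimpy_nodes` is an illustrative helper, not part of FAWN.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Classic Amdahl's law: 1 / ((1 - p) + p / n)."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / nodes)


def min_wimpy_nodes(parallel_fraction: float, slowdown: float, max_nodes: int = 10_000):
    """Smallest n such that n wimpy nodes, each `slowdown` times slower than
    one brawny node, finish a job no later than the brawny node alone.

    The serial part always runs on a single wimpy node, so if
    slowdown * (1 - p) >= 1 no number of nodes is enough. Returns None in
    that case, or if max_nodes is not enough.
    """
    p = parallel_fraction
    if slowdown * (1.0 - p) >= 1.0:
        return None
    for n in range(1, max_nodes + 1):
        wimpy_time = slowdown * ((1.0 - p) + p / n)  # brawny node takes time 1.0
        if wimpy_time <= 1.0:
            return n
    return None


if __name__ == "__main__":
    print(min_wimpy_nodes(0.95, slowdown=4))  # -> 5
    print(min_wimpy_nodes(0.70, slowdown=4))  # -> None: too much serial work
```

With 4x slower nodes, a 95%-parallel job needs 5 wimpy nodes to keep pace, while a 70%-parallel job can never catch up, because its serial 30% alone takes 1.2x the brawny node's total time.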
Conclusion
• FAWN improves the computational efficiency of datacenters
• Informed by fundamental system power trends
• Challenges: programming for 10x scale, running today’s software on yesterday’s machines...
Hot enough for industry