Are there Exascale Algorithms?
Kathy Yelick
Associate Laboratory Director and Acting NERSC Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
Computational Science has Moved through Difficult Technology Transitions
[Figure: Application performance growth (Gordon Bell Prizes), annotated with the "attack of the killer micros" transition.]
HPC: From Vector Supercomputers to Massively Parallel Systems
[Figure: Top500 systems by architecture (SIMD, single processor, SMP, constellation, cluster, MPP), 1993-2011.]
Vector supercomputers were programmed by "annotating" serial programs; massively parallel systems were programmed by completely rethinking algorithms and software for parallelism.
The Impact of Scientific Libraries in High Performance Computing
• Application complexity grew due to parallelism and more ambitious science problems (e.g., multiphysics, multiscale)
• Scientific libraries enable these applications
• LAPACK: 35% of apps
• ScaLAPACK: 20% of apps
• Overture: ~1200 downloads/yr
• netCDF: ~12% of apps
• Trilinos: 21,000 total downloads
• METIS: 4% of apps
• FFTW: ~25% of apps
• PETSc: ~4800 downloads/yr
• hypre: ~1400 downloads/yr
• HDF5: ~11% of apps
• ParPack: 3% of apps
• FastBit: 6,300 total downloads
• SuperLU: ~4% of apps
• GlobalArrays: 28% of apps
Numbers show downloads per year or in total; percentages are the share of NERSC projects that use the library.
NITRD Projects Addressed Programmer Productivity of Irregular Problems
Message Passing Programming (PVM, MPI, and many libraries):
• Divide up the domain into pieces
• Compute one piece and exchange boundary data with neighbors

Global Address Space Programming (UPC, CAF, X10, Chapel, Fortress, Titanium, GA):
• Each processor starts computing on its portion
• Grab whatever data is needed, whenever it is needed
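As an illustration (not from the talk), the contrast shows up in a 1-D halo exchange: in the message-passing style each process owns a block and explicitly trades ghost cells, while in a global-address-space language the neighbor's data would simply be read from a shared array. The block size and variable names below are assumptions for the sketch.

```c
/* Message-passing sketch of a 1-D domain decomposition: each rank owns a
 * block of the domain and explicitly exchanges boundary ("ghost") cells
 * with its neighbors before computing. */
#include <mpi.h>

#define N 1024                      /* local block size (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2];                /* owned cells plus one ghost cell per side */
    for (int i = 0; i < N + 2; i++) u[i] = rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* "Compute one piece and exchange": trade boundary cells with neighbors. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,     /* send left edge to left    */
                 &u[N + 1], 1, MPI_DOUBLE, right, 0, /* receive right ghost cell  */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,     /* send right edge to right  */
                 &u[0], 1, MPI_DOUBLE, left,  1,     /* receive left ghost cell   */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* In a global-address-space language (UPC, CAF, Titanium, ...) u would be a
     * shared array and a thread would read its neighbor's element directly,
     * "grabbing whatever, whenever". */
    MPI_Finalize();
    return 0;
}
```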
Computing Performance Improvements will be Harder than Ever
[Figure: transistors (thousands), clock frequency (MHz), power (W), and number of cores per chip, 1970-2010.]
Moore's Law continues, but power limits performance growth. Parallelism is used instead.
Scientists Need to Undertake another Difficult Technology Transition
[Figure: Application performance growth (Gordon Bell Prizes), from the "attack of the killer micros" to a first exascale application (a billion-billion operations per second), as the rest of the computing world gets parallelism.]
Energy Efficient Computing is Key to Performance Growth
At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 used 3 MW
• 1 exaflop in 2018 would use 130 MW with "Moore's Law" scaling
[Figure: projected system power, 2005-2020, compared with the usual scaling goal.]
The problem doesn't change if we build 1000 one-petaflop machines instead of one exaflop machine, and it affects every university department cluster and cloud data center.
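A back-of-envelope sketch of the cost at those power levels; treating the $1M-per-MW figure as an annual cost is an assumption here, not something stated on the slide.

```c
/* Rough annual electricity cost at the power levels on the slide,
 * assuming the $1M-per-MW figure is a cost per MW per year. */
#include <stdio.h>

int main(void) {
    const double dollars_per_mw_year = 1e6;   /* assumption: annual cost per MW */
    const double petaflop_2010_mw    = 3.0;   /* from the slide                 */
    const double exaflop_2018_mw     = 130.0; /* from the slide, "Moore's Law"  */

    printf("1 PF system: $%.0fM/year\n", petaflop_2010_mw * dollars_per_mw_year / 1e6);
    printf("1 EF system: $%.0fM/year\n", exaflop_2018_mw  * dollars_per_mw_year / 1e6);

    /* Splitting the exaflop into 1000 one-petaflop machines leaves the total
     * flop/s, and hence the total power bill, unchanged; that is why the
     * problem is the same for department clusters and data centers. */
    return 0;
}
```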
New Processor Designs are Needed to Save Energy
• Cell phone processor: 0.1 Watt, 4 Gflop/s
• Server processor: 100 Watts, 50 Gflop/s
• Server processors are designed for performance
• Embedded and graphics processors use simple, low-power cores and deliver good performance per Watt
• New processor architectures and software are needed for future HPC systems
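Working out the slide's numbers makes the gap concrete; a small illustrative sketch:

```c
/* Energy efficiency implied by the slide's numbers: 0.1 W / 4 Gflop/s for
 * an embedded core versus 100 W / 50 Gflop/s for a server core. */
#include <stdio.h>

int main(void) {
    double phone  = 4.0 / 0.1;      /* 40   Gflop/s per Watt */
    double server = 50.0 / 100.0;   /*  0.5 Gflop/s per Watt */
    printf("embedded: %.1f Gflop/s/W, server: %.1f Gflop/s/W, ratio: %.0fx\n",
           phone, server, phone / server);
    return 0;   /* roughly an 80x gap in performance per Watt */
}
```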
Scientists Need to Undertake another Difficult Technology Transition
[Figure: the same Gordon Bell performance-growth chart, now asking whether an "attack of the killer cellphones" will drive the next transition the way the "attack of the killer micros" did when the rest of the computing world got parallelism. A first exascale application would run at a billion-billion operations per second.]
How to Measure Efficiency?
• For scientific computing centers, the metric should be science output per Watt
  – NERSC in 2010 ran at 450 publications per MW-year
  – But that number drops with each new machine
• Next best: application performance per Watt
  – The newest, largest machine is best: lower energy and cost per core
  – Goes up with Moore's Law
• Race-to-halt generally minimizes energy use
[Figure: cost per core-hour (roughly $0.00 to $0.10) for an old HPC system, a cluster, and a new HPC system, broken down into center, sysadmin, and power & cooling costs.]
Power vs. Energy
• Two related (but different!) problems
  – Minimize peak power: keep machines from exceeding facility power and melting chips
  – Energy efficiency: minimize Joules per science publication
• Race-to-halt to minimize energy
  – Leakage current is nearly 50% of power
  – Finish as quickly as possible (maximizing simultaneous hardware usage)
• Dynamic clock speed scaling
  – Under hardware control to implement power caps and thermal limits; software will probably adapt to this rather than control it
• Dark silicon
  – More transistors than you can afford to power; specialized hardware becomes more likely
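A toy energy model (my illustration, not from the talk) shows why race-to-halt tends to win when leakage is about half of total power: slowing the clock trims dynamic power but stretches run time, so leakage energy grows.

```c
/* Toy model: total energy = (dynamic + static power) * run time.
 * Assumption for illustration: halving the clock halves dynamic power
 * but doubles run time, while leakage (static) power stays fixed. */
#include <stdio.h>

int main(void) {
    double p_dynamic = 50.0, p_static = 50.0;   /* Watts; ~50% leakage as on the slide */
    double t = 100.0;                           /* seconds at full clock speed */

    double e_fast = (p_dynamic + p_static) * t;                 /* race-to-halt */
    double e_slow = (p_dynamic / 2.0 + p_static) * (2.0 * t);   /* half-speed   */

    printf("race-to-halt: %.0f J, half speed: %.0f J\n", e_fast, e_slow);
    return 0;   /* 10000 J vs 15000 J: running slower wastes leakage energy */
}
```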
New Processors Mean New Software
[Figure: projected exascale system power: 130 MW with server processors, 75 MW with manycore processors, 25 megawatts with low-power memory and interconnect; the bars are broken down into interconnect, memory, and processors.]
• Exascale will have chips with thousands of tiny processor cores, and a few large ones
• Architecture is an open question:
  – a sea of embedded cores with heavyweight "service" nodes, or
  – lightweight cores used as accelerators to CPUs
• Low-power memory and storage technology are key
Communication is Expensive in Time
• Running time of an algorithm is the sum of 3 terms:
  – # flops * time_per_flop
  – # words moved / bandwidth
  – # messages * latency
  (the last two terms are the communication cost)
• Time_per_flop
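A minimal sketch of that three-term model as code; the type, function names, and machine numbers are illustrative assumptions, not from the talk.

```c
/* Three-term running-time model:
 *   T = #flops * time_per_flop + #words / bandwidth + #messages * latency
 * The last two terms are the communication cost. */
#include <stdio.h>

typedef struct {
    double time_per_flop;   /* seconds per flop               */
    double bandwidth;       /* words per second between nodes */
    double latency;         /* seconds per message            */
} machine_model;

double run_time(machine_model m, double flops, double words, double messages) {
    return flops * m.time_per_flop + words / m.bandwidth + messages * m.latency;
}

int main(void) {
    machine_model m = { 1e-9, 1e9, 1e-6 };   /* illustrative: 1 Gflop/s, 1 Gword/s, 1 us */
    /* e.g. a solver step doing 1e9 flops, moving 1e7 words in 1e4 messages */
    printf("predicted time: %.3f s\n", run_time(m, 1e9, 1e7, 1e4));
    return 0;
}
```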