Are there Exascale Algorithms?

Are there Exascale Algorithms?
Kathy Yelick
Associate Laboratory Director and Acting NERSC Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley

Computational Science has Moved through Difficult Technology Transitions

[Figure: Application Performance Growth (Gordon Bell Prizes), annotated with the "attack of the 'killer micros'".]

HPC: From Vector Supercomputers to Massively Parallel Systems

Vector supercomputers were programmed by "annotating" serial programs; massively parallel systems were programmed by completely rethinking algorithms and software for parallelism.

[Figure: number of systems (0-500) by architecture class (SIMD, Single Proc., SMP, Constellation, Cluster, MPP), 1993-2011, with industrial use marked at 25% and 50%.]

The Impact of Scientific Libraries in High Performance Computing

•  Application complexity grew due to parallelism and more ambitious science problems (e.g., multiphysics, multiscale)
•  Scientific libraries enable these applications

LAPACK: 35% of apps
ScaLAPACK: 20% of apps
Overture: ~1,200 downloads/yr
netCDF: ~12% of apps
Trilinos: 21,000 total downloads
METIS: 4% of apps
FFTW: ~25% of apps
PETSc: ~4,800 downloads/yr
hypre: ~1,400 downloads/yr
HDF5: ~11% of apps
ParPack: 3% of apps
FastBit: 6,300 total downloads
SuperLU: ~4% of apps
GlobalArrays: 28% of apps

Numbers show downloads per year or total downloads; percentages are the fraction of NERSC projects that use the library.

NITRD Projects Addressed Programmer Productivity of Irregular Problems

Message Passing Programming: divide the domain into pieces; compute on one piece and exchange boundary data with neighbors (see the sketch below). Examples: PVM, MPI, and many libraries.

Global Address Space Programming: each process starts computing and grabs whatever data it needs, whenever it needs it. Examples: UPC, CAF, X10, Chapel, Fortress, Titanium, GA.
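To make the contrast concrete, here is a minimal sketch (not from the slides) of the message-passing style: a 1D halo exchange in MPI, where each rank owns a block of a distributed array and explicitly exchanges boundary values with its neighbors. In a global address space language such as UPC, or with Global Arrays, the neighboring values could instead be read directly from a shared array without matching sends and receives. The block size N_LOCAL and the initialization are made up for illustration.

/* Illustrative 1D halo exchange in the message-passing style (MPI). */
#include <mpi.h>

#define N_LOCAL 1000                   /* made-up local block size */

int main(int argc, char **argv) {
    int rank, size;
    double u[N_LOCAL + 2];             /* owned cells plus two ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 1; i <= N_LOCAL; i++)
        u[i] = (double)rank;           /* fill owned cells with dummy data */

    /* Neighbors; MPI_PROC_NULL turns edge exchanges into no-ops. */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send my first owned cell left, receive my right ghost cell. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my last owned cell right, receive my left ghost cell. */
    MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ...compute on u[1..N_LOCAL] using the ghost values... */

    MPI_Finalize();
    return 0;
}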

Computing Performance Improvements will be Harder than Ever

[Figure: transistors (thousands), clock frequency (MHz), power (W), and number of cores per chip, 1970-2010, plotted on a log scale.]

Moore's Law continues, but power limits performance growth. Parallelism is used instead.

Scientists Need to Undertake another Difficult Technology Transition

[Figure: Application Performance Growth (Gordon Bell Prizes), annotated with the "attack of the 'killer micros'", "the rest of the computing world gets parallelism", and "First Exascale Application? (billion-billion operations / sec)".]

Energy Efficient Computing is Key to Performance Growth

At $1M per MW, energy costs are substantial:
•  1 petaflop in 2010 used 3 MW
•  1 exaflop in 2018 would use 130 MW with "Moore's Law" scaling

[Figure: projected system power, 2005-2020, compared with the usual scaling goal.]

This problem wouldn't change if we built 1,000 one-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.
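A back-of-the-envelope reading of those numbers, assuming the $1M-per-MW figure is an annual (per MW-year) operating cost:

\[
3\ \text{MW} \times \$1\text{M/MW-year} \approx \$3\text{M/year}
\qquad \text{vs.} \qquad
130\ \text{MW} \times \$1\text{M/MW-year} \approx \$130\text{M/year}.
\]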

New Processor Designs are Needed to Save Energy

Cell phone processor: 0.1 Watt, 4 Gflop/s
Server processor: 100 Watts, 50 Gflop/s

•  Server processors are designed for performance
•  Embedded and graphics processors use simple low-power cores, which gives good performance per Watt
•  New processor architectures and software are needed for future HPC systems
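Dividing out the numbers on the slide makes the efficiency gap explicit:

\[
\frac{4\ \text{Gflop/s}}{0.1\ \text{W}} = 40\ \text{Gflop/s per watt}
\qquad \text{vs.} \qquad
\frac{50\ \text{Gflop/s}}{100\ \text{W}} = 0.5\ \text{Gflop/s per watt},
\]

roughly an 80x difference in energy efficiency in favor of the embedded part.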

Scientists Need to Undertake another Difficult Technology Transition

[Figure: Application Performance Growth (Gordon Bell Prizes), annotated with the "attack of the 'killer micros'", "the rest of the computing world gets parallelism", "First Exascale Application? (billion-billion operations / sec)", and the "attack of the 'killer cellphones'?".]

How to Measure Efficiency?

•  For scientific computing centers, the metric should be science output per Watt
   –  NERSC in 2010 ran at 450 publications per MW-year
   –  But that number drops with each new machine
•  Next best: application performance per Watt
   –  The newest, largest machine is best; lower energy and cost per core
   –  Goes up with Moore's Law
•  Race-to-halt generally minimizes energy use

[Figure: cost in $ per core-hour (0.00-0.10) for Old-HPC, Cluster, and New-HPC systems, broken down into Center, SysAdmin, and Power & cooling.]

Power vs. Energy

•  Two related (but different!) problems
   –  Minimize peak power: keep machines from exceeding facility power and melting chips
   –  Energy efficiency: minimize Joules per science publication
•  Race-to-halt to minimize energy (see the model sketched below)
   –  Leakage current is nearly 50% of power
   –  Finish as quickly as possible (maximizing simultaneous hardware usage)
•  Dynamic clock speed scaling
   –  Under hardware control to implement power caps and thermal limits; software will probably adapt to, not control, this
•  Dark silicon
   –  More transistors than you can afford to power; more likely to have specialized hardware
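A rough, textbook-style model (not from the slides) of why heavy leakage favors race-to-halt: with clock frequency f and supply voltage scaled roughly in proportion to f, the dynamic energy to perform W operations grows like f^2, while the leakage energy is the leakage power times the run time W/f:

\[
E(f) \;\approx\; \underbrace{c\,W f^{2}}_{\text{dynamic}} \;+\; \underbrace{P_{\text{leak}}\,\frac{W}{f}}_{\text{leakage}}
\]

Slowing the clock shrinks the first term but inflates the second; when leakage is a large fraction of total power (the slide's ~50%), the minimum of E(f) sits at a high frequency, i.e., finish as quickly as possible and then power down.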

New Processors Mean New Software

[Figure: estimated exascale system power (processors, memory, interconnect): roughly 130 MW with server processors, 75 MW with manycore processors, and 25 megawatts with manycore plus low-power memory and interconnect.]

•  Exascale will have chips with thousands of tiny processor cores, and a few large ones
•  Architecture is an open question:
   –  a sea of embedded cores with heavyweight "service" nodes
   –  lightweight cores as accelerators to CPUs
•  Low-power memory and storage technology are key

Communication is Expensive in Time

•  Running time of an algorithm is the sum of 3 terms:
   –  # flops * time_per_flop
   –  # words moved / bandwidth
   –  # messages * latency

The last two terms are the communication cost.
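Written out as a formula (using the common latency-bandwidth notation; the symbols are not on the slide):

\[
T \;=\; \#\text{flops}\cdot t_{\text{flop}} \;+\; \frac{\#\text{words moved}}{\beta} \;+\; \#\text{messages}\cdot \alpha
\]

where t_flop is the time per flop, beta is the bandwidth, and alpha is the per-message latency.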

•  Time_per_flop