Programming on K computer - Fujitsu

Programming on K computer
Koh Hotta
The Next Generation Technical Computing
Fujitsu Limited
Copyright 2010 FUJITSU LIMITED

System Overview of “K computer”
• Target performance: 10 PFLOPS
  • Over 80,000 processors
  • Over 640K cores
  • Over 1 petabyte of memory
• Cutting-edge technologies
  • CPU: SPARC64 VIIIfx (8 cores, 128 GFLOPS; an extension of SPARC V9)
  • Interconnect "Tofu": 6-D mesh/torus
  • Parallel programming environment

I have a dream

that one day you just compile your programs and enjoy high performance on your high-end supercomputer. So we must provide an easy hybrid parallel programming method, including compiler and run-time system support.

User I/F for Programming on the K computer
[Diagram: a client system (IDE, command interface, debugger interface) connects to the front end, which hosts job control, an interactive debugger GUI, a profiler, a data converter, and a data sampler producing visualized data and sampling data. The front end submits jobs to the back end (80,000 nodes), which provides a debugging partition and an official operation partition; data is staged out over InfiniBand.]

Parallel Programming


Hybrid Parallelism on over 640K cores
• Too large a number of processes to manipulate
  • To reduce the number of processes, hybrid thread-process programming is required
  • But hybrid parallel programming is annoying for programmers
• Even for multi-threading, procedure-level or outer-loop parallelism was desired
  • Little opportunity for such coarse-grain parallelism
  • System support for fine-grain parallelism is required


Targeting inner-most loop parallelization
Automatic vectorization technology has become mature, and vector tuning is easy for programmers. Inner-most loop parallelism, which is fine-grained, should be an important part of peta-scale parallelization.
• Inner-most loop acceleration by multi-threading technology

Inner-most loop acceleration by multi-threading technology
• The CPU architecture is designed to reuse vectorization methodology efficiently.
• Targeting inner-most loop automatic parallelization for the multicore processor.


VISIMPACT™: you need not think about multi-cores
• Efficient multi-thread execution on multiple cores tightly coupled with each other
• Collaboration between the hardware architecture and compiler optimization yields high efficiency
  • Shared L2 cache on a chip
  • High-speed hardware barrier on a chip
  • Automatic parallelization
• The automatic parallelization facility makes the multi-cores behave like a single high-speed core
  • You need not think about the cores in a CPU chip.

VISIMPACT™ Performance on DAXPY
• Euroben kernel 8 (DAXPY):

Do i = 1, n
  y(i,jsw) = y(i,jsw) + c0*x1(i)
End Do

[Chart: rate (Mflop/s, 100 to 10000, log scale) versus problem size (100 to 10000) for SPARC64™ VIIIfx 2.0GHz (SIMD, 8 threads), FX1 (SPARC64™ VII) 2.52GHz (4 threads), and BX900 Nehalem 2.93GHz (8 threads, 2 chips).]
• The shared cache provides twice the performance of Nehalem 2.93GHz.

MPI  Open MPI based  Tuned to “Tofu” interconnect

10

Copyright 2010 FUJITSU LIMITED

MPI Approach for the K computer
• Open MPI based
  • Open standard, open source, multi-platform (including PC clusters)
  • Extensions added to Open MPI for the "Tofu" interconnect
• High performance
  • Short-cut message path for low-latency communication
  • Torus-oriented protocol: sensitive to message size, location, and hop count
  • Trunking communication utilizing multi-dimensional network links via Tofu selective routing

Goals for MPI on the K system
• High performance
  • Low latency and high bandwidth
• High scalability
  • Collective performance optimized for the Tofu interconnect
• High availability, flexibility, and ease of use
  • Providing a logical 3D torus for each job while eliminating failed nodes
  • Providing functions from new versions of the MPI standard as soon as possible

MPI Software stack
The original Open MPI software stack (using the openib BTL) layers:
• MPI and COLL (collectives)
• ob1 PML (Point-to-Point Messaging Layer)
• r2 BML (BTL Management Layer)
• openib BTL (Byte Transfer Layer), over OpenFabrics Verbs

For the K computer, the BTL layer is adapted to Tofu:
• COLL extension: special Bcast, Allgather, Alltoall, and Allreduce implementations for Tofu
• tofu BTL and tofu LLP (Low Latency Path): a special hardware-dependent layer for the Tofu interconnect
• tofu common: common data processing and structures shared by BTL, LLP, and COLL; rendezvous protocol optimization, etc.
• All built on the Tofu Library.

Flexible Process Mapping to the Tofu environment
• You can allocate your processes as you like.
• Dimension specification for each rank:
  • 1D: (x)
  • 2D: (x,y)
  • 3D: (x,y,z)

[Diagram: example layouts. 1D: ranks 0..7 snake through two rows of nodes as (0)(1)(2)(3) then (7)(6)(5)(4). 2D: ranks map to (x,y) on a 4 x 2 grid, (0,0)(1,0)(2,0)(3,0) then (3,1)(2,1)(1,1)(0,1). 3D: ranks map to (x,y,z) on a 2 x 2 x 2 grid, from (0,0,0) through (1,1,1).]


Performance Tuning  Not only by compiler optimization, but also you can manipulate performance  Compiler directives to tune programs.

 Tools to help your effort to tune your programs  ex. Watch your program using event counter 15


Performance Tuning (Event Counter Example)
• 3-D job example
  • Displays 4096 processes in 16 x 16 x 16 cells
  • Cells are painted in colors according to process status (e.g., CPU time)
  • Cut a slice of the job along the x-, y-, or z-axis to view

Conclusion: Automatic and Transparent Performance
• VISIMPACT™ lets you treat the 8-core CPU as a single high-speed core.
  • A collaboration between the CPU architecture and the compiler:
    • High-speed hardware barrier to reduce synchronization overhead
    • Shared L2 cache to improve memory access
    • Automatic parallelization to recognize parallelism and accelerate your program
• Open MPI-based MPI to utilize the "Tofu" interconnect.
• Tuning facilities show the activity of parallel programs.
