Programming on K computer
Koh Hotta
The Next Generation Technical Computing, Fujitsu Limited
Copyright 2010 FUJITSU LIMITED
System Overview of “K computer”
Target performance: 10 PF with over 80,000 processors
  Over 640K cores
  Over 1 petabyte of memory
Cutting-edge technologies
  CPU: SPARC64 VIIIfx (8 cores, 128 GFlops, extension of SPARC V9)
  Interconnect “Tofu”: 6-D mesh/torus
  Parallel programming environment
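The headline figures on this slide are mutually consistent, as a quick calculation suggests. The sketch below uses the rounded values from the slide (80,000 processors, 8 cores and 128 GFlops per CPU); the constant and function names are illustrative only and do not come from the presentation.

```python
# Rounded figures taken from the overview slide.
PROCESSORS = 80_000       # "over 80,000 processors"
CORES_PER_CPU = 8         # SPARC64 VIIIfx
GFLOPS_PER_CPU = 128      # peak per CPU


def total_cores(processors: int, cores_per_cpu: int) -> int:
    """Total number of cores in the system."""
    return processors * cores_per_cpu


def peak_pflops(processors: int, gflops_per_cpu: float) -> float:
    """Aggregate peak performance in PFlops (1 PFlops = 1e6 GFlops)."""
    return processors * gflops_per_cpu / 1e6


if __name__ == "__main__":
    print(total_cores(PROCESSORS, CORES_PER_CPU))    # 640000
    print(peak_pflops(PROCESSORS, GFLOPS_PER_CPU))   # 10.24
```

With these rounded inputs the totals match the "over 640K cores" and "10PF" targets.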
I have a dream
that one day you just compile your programs and enjoy high performance on your high-end supercomputer. So we must provide an easy hybrid parallel programming method, including compiler and run-time system support.
User I/F for Programming for K computer
[Figure: user interface for programming on the K computer, spanning the Client System, the Front End, and the Back End (80,000 nodes). Labels in the figure: IDE, IDE interface, command interface, job control, debugger, debugger interface, interactive debugger GUI, App, data converter, profiler, visualized data, data sampler, sampling data, stage out (InfiniBand), debugging partition, official op partition.]
Parallel Programming
Hybrid Parallelism on over 640K cores
The number of processes is too large to manage.
To reduce the number of processes, hybrid thread-process programming is required.
  But hybrid parallel programming is annoying for programmers.
Even for multi-threading, procedure-level or outer-loop parallelism has been desired.
  There is little opportunity for such coarse-grain parallelism.
  System support for “fine-grain” parallelism is required.
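To make the process-count argument concrete, the following sketch compares a flat one-process-per-core layout with hybrid layouts that run several threads per process. It assumes the rounded figure of 640,000 cores and one thread per core; `process_count` is a hypothetical helper, not part of any Fujitsu tool.

```python
TOTAL_CORES = 640_000  # rounded "over 640K cores" from the slides


def process_count(total_cores: int, threads_per_process: int) -> int:
    """Processes needed when every process runs a fixed number of threads,
    with one thread pinned to each core (an assumption of this sketch)."""
    if threads_per_process <= 0 or total_cores % threads_per_process != 0:
        raise ValueError("threads_per_process must evenly divide total_cores")
    return total_cores // threads_per_process


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, process_count(TOTAL_CORES, threads))
```

Running one process per 8-core CPU with 8 threads brings the process count down from 640,000 to 80,000, which is the reduction the slide is after.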
Targeting inner-most loop parallelization
Automatic vectorization technology has become mature, and vector tuning is easy for programmers.
Inner-most loop parallelism, which is fine-grain, should be an important part of peta-scale parallelization.
Inner-most loop acceleration by multi-threading technology
Inner-most loop acceleration by multi-threading technology
The CPU architecture is designed to reuse the vectorization methodology efficiently.
Targeting automatic parallelization of the inner-most loop for multicore processors.
VISIMPACT™: you need not think about multi-cores
Efficient multi-thread execution on multiple cores tightly coupled with each other
Collaboration between hardware architecture and compiler optimization yields high efficiency
  Shared L2 cache on a chip
  High-speed hardware barrier on a chip
  Automatic parallelization
The automatic parallelization facility makes multiple cores behave like a single high-speed core.
  You need not think about the cores in a CPU chip.
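The work decomposition behind inner-most loop parallelization can be sketched by hand. The example below is only an illustration of the idea: it splits the inner loop into contiguous chunks, one per thread, and uses Python's software `threading.Barrier` as a stand-in for the on-chip hardware barrier. It is not how the Fujitsu compiler generates code, Python threads will not give a real speedup here, and the function name `parallel_inner_loop` is hypothetical.

```python
import threading

NUM_THREADS = 8  # one thread per core of an 8-core CPU


def parallel_inner_loop(a, b, c0):
    """Compute a[j][i] += c0 * b[i], splitting the inner i-loop across threads."""
    n = len(b)
    barrier = threading.Barrier(NUM_THREADS)
    # Contiguous [lo, hi) chunk of the inner loop owned by each thread.
    bounds = [(t * n // NUM_THREADS, (t + 1) * n // NUM_THREADS)
              for t in range(NUM_THREADS)]

    def worker(tid):
        lo, hi = bounds[tid]
        for row in a:                  # outer loop: walked by every thread
            for i in range(lo, hi):    # inner loop: this thread's chunk only
                row[i] += c0 * b[i]
            barrier.wait()             # synchronize at the end of the inner loop

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(NUM_THREADS)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

Because the threads synchronize once per outer iteration, the cost of the barrier is paid very often, which is why a fast barrier matters for this fine-grain style. In this particular loop the rows are independent, so the barrier only models the synchronization point rather than being required for correctness.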
VISIMPACT™ Performance on DAXPY
Euroben 8 (DAXPY):
  Do i = 1, n
    y(i,jsw) = y(i,jsw) + c0*x1(i)
  End Do
[Figure: rate (Mflop/s, axis ticks 100 to 10000) versus problem size (axis ticks 100 to 10000). Series:
  SPARC64™ VIIIfx 2.0 GHz, SIMD, 8 threads (8-way parallel)
  FX1 (SPARC64™ VII) 2.52 GHz, 4 threads (4-way parallel)
  BX900 Nehalem 2.93 GHz, 8 threads (8-way parallel, 2 chips)]
The shared cache provides twice the performance of the Nehalem 2.93 GHz.
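For readers unfamiliar with the kernel, here is a minimal sketch of DAXPY and of how a rate in Mflop/s is commonly derived from it. It assumes the usual count of two floating-point operations (one multiply, one add) per element; how the Euroben benchmark itself measured time is not stated on the slide, and the function names are illustrative.

```python
def daxpy(c0, x, y):
    """In-place y[i] = y[i] + c0 * x[i]; returns y for convenience."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] += c0 * x[i]
    return y


def mflops(n, seconds):
    """Rate in Mflop/s, assuming 2 flops per element (multiply + add)."""
    return 2.0 * n / seconds / 1.0e6
```

For example, a run of length n = 500,000 that takes one second corresponds to 1 Mflop/s under this counting convention.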
MPI
  Open MPI based
  Tuned to the “Tofu” interconnect
MPI Approach for the K computer
Open MPI based
  Open standard, open source, multi-platform including PC clusters
  Adding extensions to Open MPI for the “Tofu” interconnect
High performance
  Short-cut message path for low-latency communication
  Torus-oriented protocol: sensitive to message size, location, and hop count
  Trunking communication utilizing multi-dimensional network links through Tofu selective routing
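A hop-sensitive protocol needs some notion of distance between two nodes. The sketch below shows the standard minimum hop count on a torus, where each axis wraps around. It is an assumption-laden illustration: it treats every axis as a torus, does not model the mesh axes of the 6-D mesh/torus, and is not Fujitsu's routing or protocol-selection logic. `torus_hops` is a hypothetical name.

```python
def torus_hops(a, b, shape):
    """Minimum number of hops between coordinates a and b on a torus
    whose extent along each axis is given by shape."""
    if not (len(a) == len(b) == len(shape)):
        raise ValueError("a, b and shape must have the same number of axes")
    hops = 0
    for ai, bi, n in zip(a, b, shape):
        d = abs(ai - bi) % n
        hops += min(d, n - d)  # the wrap-around link may be the shorter way
    return hops
```

On a 4 x 4 x 4 torus, `torus_hops((0, 0, 0), (3, 0, 0), (4, 4, 4))` is 1 because of the wrap-around link, whereas a mesh without wrap-around would need 3 hops.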
Goals for MPI on the K system
High performance
  Low latency and high bandwidth
High scalability
  Collective performance optimized for the Tofu interconnect
High availability, flexibility, and ease of use
  Providing a logical 3D torus for each job, excluding failed nodes
  Providing functions from new versions of the MPI standard as soon as possible
MPI Software stack
[Figure: two software stacks side by side.]
Original Open MPI software stack (using openib BTL), top to bottom:
  MPI
  COLL
  PML: ob1 PML (Point-to-Point Messaging Layer)
  BML: r2 BML (BTL Management Layer)
  BTL: openib BTL (Byte Transfer Layer)
  OpenFabrics Verbs
Stack adapted to Tofu, top to bottom:
  MPI
  COLL, with an extension supporting special Bcast, Allgather, Alltoall, and Allreduce for Tofu
  PML: ob1 PML
  BML: r2 BML
  BTL: tofu BTL (BTL adapted to Tofu, hardware dependent)
  tofu LLP (Low Latency Path), an extension
  tofu common: provides common data processing and structures for BTL, LLP, and COLL
  Rendezvous protocol optimization, etc.
  Tofu Library: special hardware-dependent layer for the Tofu interconnect
Flexible Process Mapping to Tofu environment
You can allocate your processes as you like.
Dimension specification for each rank:
  1D: (x)
  2D: (x,y)
  3D: (x,y,z)
Example coordinates for eight ranks:
  1D: (0) (1) (2) (3) (7) (6) (5) (4)
  2D: (0,0) (1,0) (2,0) (3,0) (3,1) (2,1) (1,1) (0,1)
  3D: (0,0,0) (1,0,0) (0,1,0) (1,1,0) (0,0,1) (0,1,1) (1,0,1) (1,1,1)
[Figure: rank placement diagrams for the 1D, 2D, and 3D cases, drawn on x, y, and z axes]
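The 2D coordinate list on this slide follows a back-and-forth ("snake") pattern over a 4 x 2 grid. As a hedged illustration, the sketch below generates such a list, assuming that the i-th coordinate in the list is assigned to rank i. The function name `snake_order_2d` is hypothetical, and the actual file format or option used to pass the mapping to the job system is not shown on the slide and is not modeled here.

```python
def snake_order_2d(nx, ny):
    """Coordinates for ranks 0 .. nx*ny-1, sweeping x forward on even rows
    and backward on odd rows so consecutive ranks stay adjacent."""
    coords = []
    for y in range(ny):
        xs = range(nx) if y % 2 == 0 else range(nx - 1, -1, -1)
        coords.extend((x, y) for x in xs)
    return coords


if __name__ == "__main__":
    for rank, coord in enumerate(snake_order_2d(4, 2)):
        print(rank, coord)
```

For a 4 x 2 grid this reproduces the 2D list above, and every pair of consecutive ranks ends up on neighboring nodes, which is one reason a user might choose such a mapping for nearest-neighbor communication.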
Performance Tuning
Performance does not come from compiler optimization alone; you can also tune it yourself.
  Compiler directives to tune programs
  Tools to support your tuning effort, e.g. watching your program with event counters
Performance Tuning (Event Counter Example)
3-D job example
  Displays 4096 processes as 16 x 16 x 16 cells
  Cells are colored according to process status (e.g. CPU time)
  Cut a slice of the job along the x-, y-, or z-axis to view it
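The slicing operation described here can be sketched in a few lines. The example assumes a hypothetical rank-to-cell mapping in which x varies fastest, then y, then z; the real tool's mapping, data format, and coloring are not described on the slide, and both function names are invented for illustration.

```python
SIDE = 16  # 16 x 16 x 16 = 4096 processes


def rank_to_cell(rank, side=SIDE):
    """Hypothetical mapping from rank to (x, y, z): x fastest, z slowest."""
    x = rank % side
    y = (rank // side) % side
    z = rank // (side * side)
    return x, y, z


def slice_along(values, axis, index, side=SIDE):
    """Return a side x side plane of per-process values (e.g. CPU time)
    taken at position `index` on the given axis (0 = x, 1 = y, 2 = z)."""
    plane = [[None] * side for _ in range(side)]
    for rank, value in enumerate(values):
        cell = rank_to_cell(rank, side)
        if cell[axis] != index:
            continue
        # The two remaining coordinates index the plane as plane[w][u].
        u, w = [c for k, c in enumerate(cell) if k != axis]
        plane[w][u] = value
    return plane
```

A viewer would then map each value in the returned plane to a color; passing `axis=2, index=0`, for instance, yields the bottom z-layer of the job.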
Conclusion: Automation and transparency of performance
VISIMPACT™ lets you treat an 8-core CPU as a single high-speed core.
  Collaboration between the CPU architecture and the compiler
  High-speed hardware barrier to reduce synchronization overhead
  Shared L2 cache to improve memory access
  Automatic parallelization to recognize parallelism and accelerate your program
Open MPI-based MPI to utilize the “Tofu” interconnect.
Tuning facility shows the activity of parallel programs.