Introducing CUDA 5
Mark Harris, Chief Technologist, GPU Computing
© 2012 NVIDIA
CUDA Parallel Computing Platform www.nvidia.com/getcuda
Programming Approaches
- Libraries: "Drop-in" acceleration
- OpenACC Directives: Easily accelerate apps
- Programming Languages: Maximum flexibility
Development Environment
- Nsight IDE: GPU debugging and profiling on Linux, Mac, and Windows; CUDA-GDB debugger; NVIDIA Visual Profiler
- Open Compiler Tool Chain: enables compiling new languages to the CUDA platform, and CUDA languages to other architectures
Hardware Capabilities
- SMX
- Dynamic Parallelism
- Hyper-Q
- GPUDirect
CUDA 5: Application Acceleration Made Easier
- Dynamic Parallelism: spawn new parallel work from within GPU code on GK110
- GPU-Callable Libraries: libraries and plug-ins for GPU code
- New Nsight™ Eclipse Edition: develop, debug, and optimize, all in one tool
- GPUDirect™: RDMA between GPUs and PCIe devices
What is CUDA Dynamic Parallelism?
The ability for any GPU thread to launch a parallel GPU kernel:
- Dynamically: based on run-time data
- Simultaneously: from multiple threads at once
- Independently: each thread can launch a different grid
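A device-side launch uses the same triple-angle-bracket syntax as a host-side launch. A minimal sketch of the idea (kernel names, sizes, and the tile layout here are illustrative, not from the slides; this requires a compute capability 3.5 GPU and relocatable device code):

```cuda
// Hypothetical child kernel: processes one tile of the data.
__global__ void childKernel(float *data, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // illustrative work
}

// Parent kernel: each thread may launch its own child grid,
// sized from run-time data rather than by a host-side decision.
__global__ void parentKernel(float *data, int *tileSizes)
{
    int tile = threadIdx.x;
    int n = tileSizes[tile];          // run-time data decides the launch
    if (n > 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, tile * n);
        cudaDeviceSynchronize();      // wait for this thread's child grid
    }
}
```

Compilation would use something like `nvcc -arch=sm_35 -rdc=true ... -lcudadevrt`, as shown on the device-linker slide later in this deck.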
Fermi: only the CPU can generate GPU work.
Kepler: the GPU can generate work for itself.
Dynamic Parallelism
[Diagram: with a Fermi GPU, every kernel launch comes from the CPU; with a Kepler GPU, kernels launch further kernels without returning control to the CPU.]
Dynamic Work Generation
- Fixed grid: statically assign a conservative worst-case grid
- Dynamic grid: start from a coarse initial grid, then dynamically assign work where accuracy is required
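The fixed-versus-dynamic grid contrast can be sketched as a parent kernel that inspects a per-region error estimate at run time and launches a finer child grid only where accuracy demands it (all names, the region layout, and the tolerance test below are illustrative assumptions):

```cuda
// Hypothetical fine-grained kernel, run only over regions that need it.
__global__ void refineRegion(float *field, int regionId)
{
    // ... high-resolution work on one region of the field ...
}

// One parent thread per region. A fixed grid would have to run the
// fine-grained work everywhere (conservative worst case); with dynamic
// parallelism the refinement is launched selectively, from the GPU.
__global__ void simulate(float *field, float *errorEstimate, float tol)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (errorEstimate[r] > tol) {
        refineRegion<<<16, 256>>>(field, r);  // finer grid only here
    }
}
```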
Familiar Syntax and Programming Model

CPU main:

int main() {
    float *data;
    setup(data);

    A<<<...>>>(data);
    B<<<...>>>(data);
    C<<<...>>>(data);

    cudaDeviceSynchronize();
    return 0;
}

GPU:

__global__ void B(float *data) {
    do_stuff(data);

    X<<<...>>>(data);
    Y<<<...>>>(data);
    Z<<<...>>>(data);
    cudaDeviceSynchronize();

    do_more_stuff(data);
}
Before CUDA 5: Whole-Program Compilation

all.cu (main.cpp + a.cu + b.cu + c.cu) -> program.exe

Files had to be included together to build: previous CUDA releases required all of a kernel's device code in a single source file, and linking with external device code was not supported.
CUDA 5: Separate Compilation & Linking

a.cu -> a.o, b.cu -> b.o, c.cu -> c.o; a.o + b.o + c.o + main.cpp -> program.exe

Separate compilation allows building independent object files, and CUDA 5 can link multiple object files into one program.
CUDA 5: GPU-Callable Libraries

a.cu -> a.o, b.cu -> b.o; a.o + b.o -> ab.a; ab.a + main.cpp (calling kernel()) -> program.exe

Object files can be combined into static libraries, and device code can be linked and called externally.
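A plausible build flow for the diagram above, using nvcc's relocatable-device-code and library options (file names follow the diagram; the architecture flag and output names are illustrative):

```shell
# Compile device code to relocatable object files (-dc = --device-c)
nvcc -arch=sm_20 -dc a.cu b.cu

# Combine the objects into a static library of GPU-callable code
nvcc -lib a.o b.o -o ab.a

# Final link: nvcc links host code against the library and performs
# the device-link step for the relocatable device code automatically
nvcc -arch=sm_20 main.cpp ab.a -o program.exe
```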
CUDA 5: GPU-Callable Libraries

The same ab.a (built from a.cu and b.cu) links into program.exe (main.cpp calling foo()) and into program2.exe (main2.cpp calling bar()).

Facilitates code reuse and reduces compile time.
CUDA 5: Callbacks

Enables closed-source device libraries to call user-defined device callback functions: vendor.culib's kernel() calls a callback() defined in the user's foo.cu, and both link with main.cpp into program.exe.
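One way such a callback can work under separate compilation is for the library to declare an external device function that the user defines and links in. A hedged sketch (the function names and the squaring behavior are illustrative, not the slides' example):

```cuda
// --- vendor library side (conceptually closed-source, shipped as an
// --- object library like the slide's vendor.culib) ---
extern __device__ float callback(float x);   // user must supply this

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = callback(data[i]);         // calls into user device code
}

// --- user side (foo.cu), compiled with -dc and linked with the library ---
__device__ float callback(float x)
{
    return x * x;                            // user-defined behavior
}
```

The device linker resolves the extern reference at link time, which is what makes a closed-source library's kernel able to call code it has never seen.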
GPU-Callable Libraries

Direct access to BLAS and other libraries from GPU code: main2.cpp's bar() links against CUBLAS into program2.exe, so device code can call CUBLAS routines directly.
NVIDIA® Nsight™ Eclipse Edition: CUDA-Aware Editor
- Semantic highlighting of CUDA code makes it easy to differentiate GPU code from CPU code
- Generate code faster with CUDA-aware auto code completion and inline help
- Easily port CPU loops to CUDA kernels with automatic code refactoring
- Integrated CUDA samples make it quick and easy to get started
- Hyperlink navigation enables faster code browsing
- Supports automatic makefile generation
NVIDIA® Nsight™ Eclipse Edition: Nsight Debugger
- Seamless and simultaneous debugging of both CPU and GPU code
- View program variables across several CUDA threads
- Examine execution state and mapping of the kernels and GPUs
- View, navigate, and filter to selectively track execution across threads
- Set breakpoints and single-step execution at both source-code and assembly levels
- Includes CUDA-MEMCHECK to help detect memory errors
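Outside the IDE, the same memory checking is available from the command line (assuming the CUDA 5 toolkit is on the PATH; the application name is illustrative):

```shell
# Run the application under the memory checker; it reports out-of-bounds
# and misaligned device memory accesses with kernel and thread context.
cuda-memcheck ./program.exe
```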
NVIDIA® Nsight™ Eclipse Edition: Nsight Profiler
- Easily identify performance bottlenecks using a unified CPU and GPU trace of application activity
- Expert analysis system pinpoints potential optimization opportunities
- Highlights potential performance problems at specific source lines within application kernels
- Close integration with the Nsight editor and builder enables a fast edit-build-profile optimization cycle
- Integrates with the new nvprof command-line profiler to enable visualization of profile data collected on headless compute nodes
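The headless workflow in the last bullet might look like this (file and application names are illustrative):

```shell
# On the headless compute node: collect a profile with the
# command-line profiler, writing it to an output file
nvprof -o myapp.profile ./program.exe

# Then copy myapp.profile to a workstation and import it into the
# Nsight/Visual Profiler GUI for timeline visualization.
```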
NVIDIA® GPUDirect™ Support for RDMA

Direct communication between GPUs and PCIe devices.

[Diagram: two servers connected by a network. Each server has a CPU with system memory and GPUs (GPU1, GPU2) with GDDR5 memory on PCIe, plus a network card. RDMA moves data between GPU memory on Server 1 and GPU memory on Server 2 through the network cards, without staging through system memory.]
GPUDirect enables GPU-aware MPI

Without GPU-aware MPI, device buffers must be staged through host memory:

cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

This simplifies to (for both CPU and GPU buffers):

MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
New Online CUDA Resource Center: All the Latest Information, All in One Place

Programming guides, API references, library manuals, code samples, tools manuals, and platform specs. Over 1600 files! http://docs.nvidia.com
CUDA Open Compiler Tool Chain

The LLVM compiler for CUDA sits between language front-ends and processor back-ends:
- Source languages today: CUDA C, C++, Fortran
- New language support: developers want to build front-ends for Java, Python, R, and DSLs
- Target processors today: NVIDIA GPUs and x86 CPUs
- New processor support: target other processors like ARM, FPGAs, other GPUs, and x86
Thank you!
Download CUDA 5 http://www.nvidia.com/getcuda
Extra Slides
Device Linker Invocation

CUDA 5 introduces an optional link step for device code:

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L -lcudart

Link the device-runtime library for dynamic parallelism:

nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L -lcudadevrt -lcudart

Currently, linking occurs at the cubin level (PTX is not yet supported).
Simpler Code: LU Example

LU decomposition (Fermi), with all control flow in CPU code:

dgetrf(N, N) {
    for j=1 to N
        for i=1 to 64
            idamax<<<...>>>
            memcpy
            dswap<<<...>>>
            memcpy
            dscal<<<...>>>
            dger<<<...>>>
        next i
        memcpy
        dlaswap<<<...>>>
        dtrsm<<<...>>>
        dgemm<<<...>>>
    next j
}

LU decomposition (Kepler): the control flow moves into GPU code and the CPU is free.

CPU code:

dgetrf(N, N) {
    dgetrf<<<...>>>(N, N);
    synchronize();
}

GPU code:

dgetrf(N, N) {
    for j=1 to N
        for i=1 to 64
            idamax();
            dswap();
            dscal();
            dger();
        next i
        dlaswap();
        dtrsm();
        dgemm();
    next j
}
CUDA Dynamic Parallelism: GPU-Side Kernel Launch
- Data-dependent execution and recursive parallel algorithms
- Efficiency: batching to help fill the GPU, dynamic load balancing
- Simplicity: simplify the CPU/GPU divide, library calls from kernels