Introducing CUDA 5
Mark Harris, Chief Technologist, GPU Computing
© 2012 NVIDIA
CUDA Parallel Computing Platform www.nvidia.com/getcuda
Programming Approaches
- Libraries: "Drop-in" acceleration
- OpenACC Directives: Easily accelerate apps
- Programming Languages: Maximum flexibility
Development Environment
- Nsight IDE: GPU debugging and profiling on Linux, Mac, and Windows; CUDA-GDB debugger; NVIDIA Visual Profiler
- Open Compiler Tool Chain: enables compiling new languages to the CUDA platform, and CUDA languages to other architectures
Hardware Capabilities
- SMX
- Dynamic Parallelism
- Hyper-Q
- GPUDirect
CUDA 5: Application Acceleration Made Easier
- Dynamic Parallelism: spawn new parallel work from within GPU code on GK110
- GPU-Callable Libraries: libraries and plug-ins for GPU code
- New Nsight™ Eclipse Edition: develop, debug, and optimize, all in one tool
- GPUDirect™: RDMA between GPUs and PCIe devices
What is CUDA Dynamic Parallelism?
The ability for any GPU thread to launch a parallel GPU kernel:
- Dynamically: based on run-time data
- Simultaneously: from multiple threads at once
- Independently: each thread can launch a different grid
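A device-side launch uses the same triple-angle-bracket syntax as a host-side launch. A minimal sketch of the idea (kernel names, sizes, and the tile layout here are illustrative, not from the slides; this requires a compute capability 3.5 GPU and relocatable device code):

```cuda
// Hypothetical child kernel: processes one tile of the data.
__global__ void childKernel(float *data, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // illustrative work
}

// Parent kernel: each thread may launch its own child grid,
// sized from run-time data rather than by a host-side decision.
__global__ void parentKernel(float *data, int *tileSizes)
{
    int tile = threadIdx.x;
    int n = tileSizes[tile];          // run-time data decides the launch
    if (n > 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, tile * n);
        cudaDeviceSynchronize();      // wait for this thread's child grid
    }
}
```

Compilation would use something like `nvcc -arch=sm_35 -rdc=true ... -lcudadevrt`, as shown on the device-linker slide later in this deck.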
Fermi: only the CPU can generate GPU work.
Kepler: the GPU can generate work for itself.
Dynamic Parallelism
[Diagram: with a Fermi GPU, every kernel launch comes from the CPU; with a Kepler GPU, kernels launch further kernels without returning control to the CPU.]
Dynamic Work Generation
- Fixed grid: statically assign a conservative worst-case grid
- Dynamic grid: start from a coarse initial grid, then dynamically assign work where accuracy is required
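The fixed-versus-dynamic grid contrast can be sketched as a parent kernel that inspects a per-region error estimate at run time and launches a finer child grid only where accuracy demands it (all names, the region layout, and the tolerance test below are illustrative assumptions):

```cuda
// Hypothetical fine-grained kernel, run only over regions that need it.
__global__ void refineRegion(float *field, int regionId)
{
    // ... high-resolution work on one region of the field ...
}

// One parent thread per region. A fixed grid would have to run the
// fine-grained work everywhere (conservative worst case); with dynamic
// parallelism the refinement is launched selectively, from the GPU.
__global__ void simulate(float *field, float *errorEstimate, float tol)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (errorEstimate[r] > tol) {
        refineRegion<<<16, 256>>>(field, r);  // finer grid only here
    }
}
```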
Familiar Syntax and Programming Model

CPU main:

int main() {
    float *data;
    setup(data);

    A<<<...>>>(data);
    B<<<...>>>(data);
    C<<<...>>>(data);

    cudaDeviceSynchronize();
    return 0;
}

GPU:

__global__ void B(float *data) {
    do_stuff(data);

    X<<<...>>>(data);
    Y<<<...>>>(data);
    Z<<<...>>>(data);
    cudaDeviceSynchronize();

    do_more_stuff(data);
}
Before CUDA 5: Whole-Program Compilation

all.cu (main.cpp + a.cu + b.cu + c.cu) -> program.exe

Files had to be included together to build: previous CUDA releases required all of a kernel's device code in a single source file, and linking with external device code was not supported.
CUDA 5: Separate Compilation & Linking

a.cu -> a.o, b.cu -> b.o, c.cu -> c.o; a.o + b.o + c.o + main.cpp -> program.exe

Separate compilation allows building independent object files, and CUDA 5 can link multiple object files into one program.
CUDA 5: GPU-Callable Libraries

a.cu -> a.o, b.cu -> b.o; a.o + b.o -> ab.a; ab.a + main.cpp (calling kernel()) -> program.exe

Object files can be combined into static libraries, and device code can be linked and called externally.
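A plausible build flow for the diagram above, using nvcc's relocatable-device-code and library options (file names follow the diagram; the architecture flag and output names are illustrative):

```shell
# Compile device code to relocatable object files (-dc = --device-c)
nvcc -arch=sm_20 -dc a.cu b.cu

# Combine the objects into a static library of GPU-callable code
nvcc -lib a.o b.o -o ab.a

# Final link: nvcc links host code against the library and performs
# the device-link step for the relocatable device code automatically
nvcc -arch=sm_20 main.cpp ab.a -o program.exe
```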
CUDA 5: GPU-Callable Libraries

The same ab.a (built from a.cu and b.cu) links into program.exe (main.cpp calling foo()) and into program2.exe (main2.cpp calling bar()).

Facilitates code reuse and reduces compile time.
CUDA 5: Callbacks

Enables closed-source device libraries to call user-defined device callback functions: vendor.culib's kernel() calls a callback() defined in the user's foo.cu, and both link with main.cpp into program.exe.
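One way such a callback can work under separate compilation is for the library to declare an external device function that the user defines and links in. A hedged sketch (the function names and the squaring behavior are illustrative, not the slides' example):

```cuda
// --- vendor library side (conceptually closed-source, shipped as an
// --- object library like the slide's vendor.culib) ---
extern __device__ float callback(float x);   // user must supply this

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = callback(data[i]);         // calls into user device code
}

// --- user side (foo.cu), compiled with -dc and linked with the library ---
__device__ float callback(float x)
{
    return x * x;                            // user-defined behavior
}
```

The device linker resolves the extern reference at link time, which is what makes a closed-source library's kernel able to call code it has never seen.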
GPU-Callable Libraries

Direct access to BLAS and other libraries from GPU code: main2.cpp's bar() links against CUBLAS into program2.exe, so device code can call CUBLAS routines directly.
NVIDIA® Nsight™ Eclipse Edition: CUDA-Aware Editor
- Semantic highlighting of CUDA code makes it easy to differentiate GPU code from CPU code
- Generate code faster with CUDA-aware auto code completion and inline help
- Easily port CPU loops to CUDA kernels with automatic code refactoring
- Integrated CUDA samples make it quick and easy to get started
- Hyperlink navigation enables faster code browsing
- Supports automatic makefile generation
NVIDIA® Nsight™ Eclipse Edition: Nsight Debugger
- Seamless and simultaneous debugging of both CPU and GPU code
- View program variables across several CUDA threads
- Examine execution state and mapping of the kernels and GPUs
- View, navigate, and filter to selectively track execution across threads
- Set breakpoints and single-step execution at both source-code and assembly levels
- Includes CUDA-MEMCHECK to help detect memory errors
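Outside the IDE, the same memory checking is available from the command line (assuming the CUDA 5 toolkit is on the PATH; the application name is illustrative):

```shell
# Run the application under the memory checker; it reports out-of-bounds
# and misaligned device memory accesses with kernel and thread context.
cuda-memcheck ./program.exe
```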
NVIDIA® Nsight™ Eclipse Edition: Nsight Profiler
- Easily identify performance bottlenecks using a unified CPU and GPU trace of application activity
- Expert analysis system pinpoints potential optimization opportunities
- Highlights potential performance problems at specific source lines within application kernels
- Close integration with the Nsight editor and builder enables a fast edit-build-profile optimization cycle
- Integrates with the new nvprof command-line profiler to enable visualization of profile data collected on headless compute nodes
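The headless workflow in the last bullet might look like this (file and application names are illustrative):

```shell
# On the headless compute node: collect a profile with the
# command-line profiler, writing it to an output file
nvprof -o myapp.profile ./program.exe

# Then copy myapp.profile to a workstation and import it into the
# Nsight/Visual Profiler GUI for timeline visualization.
```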
NVIDIA® GPUDirect™ Support for RDMA

Direct communication between GPUs and PCIe devices.

[Diagram: two servers connected by a network. Each server has a CPU with system memory and GPUs (GPU1, GPU2) with GDDR5 memory on PCIe, plus a network card. RDMA moves data between GPU memory on Server 1 and GPU memory on Server 2 through the network cards, without staging through system memory.]
GPUDirect enables GPU-aware MPI

Without GPU-aware MPI, device buffers must be staged through host memory:

cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

This simplifies to (for both CPU and GPU buffers):

MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
New Online CUDA Resource Center: All the Latest Information, All in One Place

Programming guides, API references, library manuals, code samples, tools manuals, and platform specs. Over 1600 files! http://docs.nvidia.com
CUDA Open Compiler Tool Chain

The LLVM compiler for CUDA sits between language front-ends and processor back-ends:
- Source languages today: CUDA C, C++, Fortran
- New language support: developers want to build front-ends for Java, Python, R, and DSLs
- Target processors today: NVIDIA GPUs and x86 CPUs
- New processor support: target other processors like ARM, FPGAs, other GPUs, and x86
Thank you!
Download CUDA 5 http://www.nvidia.com/getcuda
Extra Slides
Device Linker Invocation

CUDA 5 introduces an optional link step for device code:

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L -lcudart

Link the device-runtime library for dynamic parallelism:

nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L -lcudadevrt -lcudart

Currently, linking occurs at the cubin level (PTX is not yet supported).
Simpler Code: LU Example

LU decomposition (Fermi), with all control flow in CPU code:

dgetrf(N, N) {
    for j=1 to N
        for i=1 to 64
            idamax<<<...>>>
            memcpy
            dswap<<<...>>>
            memcpy
            dscal<<<...>>>
            dger<<<...>>>
        next i
        memcpy
        dlaswap<<<...>>>
        dtrsm<<<...>>>
        dgemm<<<...>>>
    next j
}

LU decomposition (Kepler): the control flow moves into GPU code and the CPU is free.

CPU code:

dgetrf(N, N) {
    dgetrf<<<...>>>(N, N);
    synchronize();
}

GPU code:

dgetrf(N, N) {
    for j=1 to N
        for i=1 to 64
            idamax();
            dswap();
            dscal();
            dger();
        next i
        dlaswap();
        dtrsm();
        dgemm();
    next j
}
CUDA Dynamic Parallelism: GPU-Side Kernel Launch
- Data-dependent execution and recursive parallel algorithms
- Efficiency: batching to help fill the GPU, dynamic load balancing
- Simplicity: simplify the CPU/GPU divide, library calls from kernels