The “New” Moore’s Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution
Enter the GPU
• Massive economies of scale
• Massively parallel
Graphical processors
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
  - Programmability
  - Precision
  - Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation
Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  - Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programming model scales transparently
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Pictured: GeForce 8800, Tesla D870, Tesla S870]
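As a rough illustration of the SPMD idea on this slide, the Python sketch below emulates it on the CPU: every logical thread runs the same function and selects its own element from a (block, thread) index pair. The names `saxpy_kernel`, `launch`, `block_dim`, and `grid_dim` are hypothetical, chosen only to echo common GPU terminology. This is not GPU code, and the loops here run serially.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body run by every logical thread: same program, different data."""
    i = block_idx * block_dim + thread_idx  # global element index of this thread
    if i < len(x):                          # guard: the last block may be partly idle
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Serial stand-in for a parallel launch over all (block, thread) pairs."""
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


def saxpy(a, x, y, block_dim=4):
    """Compute a*x + y element-wise through the emulated launch."""
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), block_dim, a, x, y, out)
    return out


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5))
```

Because no thread depends on another thread's result, the same kernel body would give the same answer for any `block_dim`. This is one way to read the claim that the programming model scales transparently.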
Computational Power
• GPUs are fast…
  - 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  - NVIDIA GeForceFX 7800: 165 GFLOPS
  - 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
  - ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
  - CPUs: 1.4× annual growth
  - GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
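To make the figures on this slide concrete, here is a small Python sketch of the arithmetic. It divides the quoted GPU numbers by the quoted CPU numbers, then compounds the annual growth rates to show how the gap would widen if those rates held steady. The helper names `speed_ratio` and `relative_gap` are hypothetical, and constant growth is an assumption made only for illustration.

```python
def speed_ratio(gpu_value, cpu_value):
    """How many times larger the GPU figure is than the CPU figure."""
    return gpu_value / cpu_value


def relative_gap(cpu_rate, gpu_rate, years):
    """Factor by which the GPU/CPU ratio grows after `years` of compounding."""
    return (gpu_rate / cpu_rate) ** years


if __name__ == "__main__":
    # Ratios of the figures quoted on the slide.
    print("compute ratio:  ", round(speed_ratio(165.0, 24.6), 1))
    print("bandwidth ratio:", round(speed_ratio(37.8, 8.5), 1))

    # Widening of the gap under the quoted annual growth rates,
    # assuming (hypothetically) that the rates stay constant.
    for years in (1, 3, 5):
        low = relative_gap(1.4, 1.7, years)   # pixel-rate growth vs CPU growth
        high = relative_gap(1.4, 2.3, years)  # vertex-rate growth vs CPU growth
        print(years, "years:", round(low, 2), "to", round(high, 2))
```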
CPU vs GPU
Flexible and Precise
• Modern GPUs are deeply programmable
  - Programmable pixel, vertex, video engines
  - Solidifying high-level language support
• Modern GPUs support high precision
  - 32-bit floating point throughout the pipeline
  - High enough for many (not all) applications
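The "many (not all)" caveat can be illustrated with a short Python sketch that rounds values to 32-bit floating point using the standard `struct` module. It shows one well-known limit: above 2^24, single precision can no longer represent every integer, so small increments to a large running total can be lost. The helpers `to_float32` and `float32_sum` are hypothetical names, and the code only emulates single-precision rounding on the CPU.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def float32_sum(values):
    """Sum left to right, rounding to 32-bit precision after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


if __name__ == "__main__":
    big = 2.0 ** 24  # 16777216: beyond this, not every integer fits in float32

    # Adding 1 to 2**24 is lost when the result is rounded to 32 bits.
    print(to_float32(big + 1.0) == big)

    # The same sum in emulated 32-bit and in native 64-bit precision.
    print(float32_sum([big, 1.0, 1.0]), sum([big, 1.0, 1.0]))
```

Workloads that accumulate many small terms into a large total are one example of where this matters. Workloads with modest dynamic range are typically unaffected.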
GPU for graphics
• GPUs designed for & driven by video games
  - Programming model unusual
  - Programming idioms tied to computer graphics
  - Programming environment tightly constrained
• Underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret
General purpose GPUs
• The power and flexibility of GPUs make them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Previous GPGPU Constraints
• Dealing with graphics API
  - Working with the corner cases of the graphics API
• Addressing modes
  - Limited texture size/dimension
• Instruction sets
  - Lack of integer & bit ops
• Communication limited
  - Between pixels
• Shader capabilities
  - Limited outputs
[Figure: fragment program model with Input Registers, Fragment Program, Texture, Constants, Temp Registers, Output Registers, and FB Memory; storage is marked as per thread, per shader, or per context]
Enter CUDA
• Scalable parallel programming model
• Minimal extensions to familiar C/C++ environment
• Heterogeneous serial-parallel computing
Sound Bite
GPUs + CUDA = The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology
MOTIVATION
146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
149X: Financial simulation of LIBOR model with swaptions
47X: [email protected]: an Mscript API for GPU linear algebra
17X: Fluid mechanics in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact str