GPU Computing with CUDA

The “New” Moore’s Law

• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution


Enter the GPU

• Massive economies of scale
• Massively parallel

Graphical processors

• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
  - Programmability
  - Precision
  - Power

• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation


Parallel Computing on a GPU

• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
• Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programming model scales transparently
• Multithreaded SPMD model uses application data parallelism and thread parallelism

[Figure: NVIDIA GeForce 8800, Tesla D870, and Tesla S870 products]

Computational Power

• GPUs are fast…
  - 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  - NVIDIA GeForce 7800: 165 GFLOPS
  - 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
  - ATI Radeon X850 XT Platinum Edition: 37.8 GB/s

• GPUs are getting faster, faster
  - CPUs: 1.4× annual growth
  - GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth


CPU vs GPU

Flexible and Precise

• Modern GPUs are deeply programmable
  - Programmable pixel, vertex, and video engines
  - Solidifying high-level language support

• Modern GPUs support high precision
  - 32-bit floating point throughout the pipeline
  - High enough for many (not all) applications


GPU for graphics

• GPUs designed for & driven by video games
  - Programming model unusual
  - Programming idioms tied to computer graphics
  - Programming environment tightly constrained

• Underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret


General purpose GPUs

• The power and flexibility of GPUs makes them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor


Previous GPGPU Constraints

• Dealing with the graphics API
  - Working with the corner cases of the graphics API
• Addressing modes
  - Limited texture size/dimension
• Instruction sets
  - Lack of integer & bit ops
• Communication limited
  - Between pixels
• Shader capabilities
  - Limited outputs

[Figure: the fragment-program model; input registers, textures, constants, and temp registers feed a fragment program whose output registers write to the frame buffer (memory), with resources bound per thread, per shader, and per context]

Enter CUDA

• Scalable parallel programming model
• Minimal extensions to familiar C/C++ environment
• Heterogeneous serial-parallel computing

Sound Bite

GPUs + CUDA = The Democratization of Parallel Computing

Massively parallel computing has become a commodity technology

MOTIVATION

Reported CUDA speedups across application domains:

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ionic placement for molecular dynamics simulation on GPU
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab, an M-script API for GPU linear algebra
• 17X: Fluid mechanics in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object-oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences
• 19X: Transcoding HD video stream to H.264

[Figure: runtimes, CPU only vs. heterogeneous with a Tesla GPU, for computational chemistry, neurological modeling, cell-phone RF simulation, and 3D CT ultrasound; jobs that take days or hours on the CPU alone (4.6 days, 2.7 days, 8 hours, 3 hours) finish in minutes with the GPU (30, 27, 16, and 13 minutes)]

GPUs: Turning Point in Supercomputing

• Desktop beats cluster: 4 GPUs vs. 256 CPUs
• Tesla Personal Supercomputer: $10,000
• CalcUA cluster: $5 million
• Source: University of Antwerp, Belgium

CUDA: ‘C’ FOR PARALLELISM

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

So far, today…

• GPU: a powerful coprocessor
• CUDA: a programming model for the GPU
• CUDA makes it easier to parallelize on GPUs
• CUDA extends the GPU to general-purpose computing
• Next we look at thread programming and the memory structure on the GPU

Hierarchy of concurrent threads

• Parallel kernels are composed of many threads
  - All threads execute the same sequential program
• Threads are grouped into thread blocks
  - Threads in the same block can cooperate
• Threads and blocks have unique IDs

[Figure: a thread t; a block b containing threads t0, t1, …, tB; a kernel foo() as a grid of blocks]
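The block and thread IDs are what a kernel combines into a per-thread global index. A minimal sketch (the kernel name and launch shape are illustrative, not from the slides):

    // Each thread derives a unique global index from its block ID, the block size,
    // and its thread ID, then handles exactly one element.
    __global__ void fill_index(int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique across the whole grid
        if (i < n)
            out[i] = i;
    }

    // Example launch: enough 256-thread blocks to cover n elements
    // fill_index<<<(n + 255) / 256, 256>>>(d_out, n);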

Hierarchical organization

[Figure: each thread has per-thread local memory; each block has per-block shared memory, with a local barrier among its threads; kernels (Kernel 0, Kernel 1, …) execute one after another, separated by a global barrier, over per-device global memory]

Heterogeneous Programming

• CUDA = serial program with parallel kernels, all in C
  - Serial C code executes in a CPU thread
  - Parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: serial code, then a parallel kernel launch foo<<<nBlk, nTid>>>(args), more serial code, then another parallel launch bar<<<nBlk, nTid>>>(args)]

What is a thread?

• An independent thread of execution
  - Has its own PC, variables (registers), processor state, etc.
  - No implication about how threads are scheduled

• CUDA threads might be physical threads
  - As on NVIDIA GPUs

• CUDA threads might be virtual threads
  - Might pick 1 block = 1 physical thread on a multicore CPU

What is a thread block?

• Thread block = virtualized multiprocessor
  - Freely choose processors to fit data
  - Freely customize for each kernel launch

• Thread block = a (data-)parallel task
  - All blocks in a kernel have the same entry point
  - But may execute any code they want

• Thread blocks of a kernel must be independent tasks
  - Program must be valid for any interleaving of block executions

Blocks must be independent

• Any possible interleaving of blocks should be valid
  - Presumed to run to completion without pre-emption
  - Can run in any order
  - Can run concurrently OR sequentially

• Blocks may coordinate but not synchronize
  - Shared queue pointer: OK (see the sketch below)
  - Shared lock: BAD … can easily deadlock

• The independence requirement gives scalability
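A minimal sketch of the "shared queue pointer" style of coordination, assuming a hypothetical global counter d_next and work array d_items (neither is from the slides). Threads claim work with an atomic increment and never wait on other blocks, so any block interleaving remains valid:

    // Blocks coordinate through an atomic counter, never through locks.
    __global__ void consume_queue(int *d_next, const int *d_items, int n_items, int *d_out)
    {
        for (;;) {
            int slot = atomicAdd(d_next, 1);   // returns the old value: a unique slot
            if (slot >= n_items)
                return;                        // queue drained, simply exit
            d_out[slot] = d_items[slot] * 2;   // stand-in for real per-item work
        }
    }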

Levels of parallelism

• Thread parallelism
  - Each thread is an independent thread of execution

• Data parallelism
  - Across threads in a block
  - Across blocks in a kernel

• Task parallelism
  - Different blocks are independent
  - Independent kernels

Block = virtualized multiprocessor

• Provides programmer flexibility
  - Freely choose processors to fit data
  - Freely customize for each kernel launch

• Thread block = a (data-)parallel task
  - All blocks in a kernel have the same entry point
  - But may execute any code they want

• Thread blocks of a kernel must be independent tasks
  - Program must be valid for any interleaving of block executions

Scalable Execution Model

• A kernel launched by the host is decomposed into blocks, which run on the available multiprocessors

[Figure: each multiprocessor has a multithreaded instruction unit (MT IU), several scalar processors (SP), and its own shared memory; all multiprocessors share the device memory]

Synchronization & Cooperation

• Threads within a block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

• Blocks coordinate via atomic memory operations
  - e.g., increment a shared queue pointer with atomicInc()

• Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
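A minimal, self-contained sketch of the step-1 / barrier / step-2 pattern (the kernel name and TILE size are illustrative, not from the slides): threads stage data into shared memory, synchronize, then read elements written by other threads.

    #define TILE 256   // launch this kernel with TILE threads per block

    // Reverse TILE consecutive elements per block using shared memory.
    __global__ void reverse_block(float *d_data)
    {
        __shared__ float buf[TILE];
        int base = blockIdx.x * TILE;

        buf[threadIdx.x] = d_data[base + threadIdx.x];              // Step 1: stage into shared memory
        __syncthreads();                                            // barrier: all writes now visible
        d_data[base + threadIdx.x] = buf[TILE - 1 - threadIdx.x];   // Step 2: read another thread's element
    }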

CUDA Memories

G80 Implementation of CUDA Memories

• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read (only) per-grid constant memory

• The host can read/write global, constant, and texture memory

[Figure: a grid of blocks; each block has its own shared memory and per-thread registers; all blocks access the device's global and constant memory, which the host can also access]

[Figure: memory scopes; each thread has local memory, each block has shared memory, and sequentially executed grids (Grid 0, Grid 1, …) share global memory]

Memory model

[Figure: host memory and the memories of each device (device 0 memory, device 1 memory) are separate address spaces]

A Common Programming Strategy

• Global memory resides in device memory (DRAM): much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    - Copying results from shared memory back to global memory

A Common Programming Strategy (Cont.)

• Constant memory also resides in device memory (DRAM): much slower access than shared memory
  - But… cached!
  - Highly efficient access for read-only data

• Carefully divide data according to access patterns (see the sketch below)
  - Read-only → constant memory (very fast if in cache)
  - Read/write, shared within a block → shared memory (very fast)
  - Read/write within each thread → registers (very fast)
  - Read/write inputs/results → global memory (very slow)
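A minimal sketch of how these placements appear in code (the kernel, the c_coeff table, and the 256-thread block size are illustrative assumptions, not from the slides):

    // Read-only table: constant memory (host fills it with cudaMemcpyToSymbol)
    __constant__ float c_coeff[16];

    // Launch with 256 threads per block.
    __global__ void apply(const float *d_in, float *d_out, int n)
    {
        __shared__ float tile[256];                        // R/W, shared within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;     // automatics live in registers

        if (i < n)
            tile[threadIdx.x] = d_in[i];                   // input read from global memory
        __syncthreads();                                   // keep the barrier out of divergent code
        if (i < n)
            d_out[i] = tile[threadIdx.x] * c_coeff[threadIdx.x % 16];   // result to global memory
    }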

Is that all??

• No!!
• Memory coalescing
• Bank conflicts

Memory Coalescing

• When accessing global memory, peak performance utilization occurs when all threads access contiguous memory locations (a code sketch follows the memory-layout figures below)

[Figure: matrices Md and Nd of width WIDTH, with threads 1 and 2 shown accessing Md in a non-coalesced pattern and Nd in a coalesced pattern]

Memory Layout of a Matrix in C

[Figure: a 4×4 matrix M laid out row by row in linear memory (M0,0 M1,0 M2,0 M3,0 M0,1 … M3,3). When the kernel's access direction has each thread T1…T4 walking along its own row, the four accesses made in any one time period are spread WIDTH elements apart in memory: not coalesced.]

[Figure: the same layout with each thread walking down its own column. In each time period, T1…T4 touch four adjacent elements of linear memory, so the accesses coalesce.]
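A minimal sketch of the two access patterns in kernel form (the kernel names are illustrative, not from the slides). In the first, neighboring threads touch neighboring addresses at every iteration, so the accesses coalesce; in the second they are Width elements apart, so they do not:

    // Coalesced: thread j walks down column j of a row-major Width x Width matrix.
    __global__ void sum_columns(const float *M, float *out, int Width)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= Width) return;
        float s = 0.0f;
        for (int row = 0; row < Width; ++row)
            s += M[row * Width + col];   // neighboring threads: addresses differ by 1
        out[col] = s;
    }

    // Not coalesced: thread j walks along row j.
    __global__ void sum_rows(const float *M, float *out, int Width)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= Width) return;
        float s = 0.0f;
        for (int col = 0; col < Width; ++col)
            s += M[row * Width + col];   // neighboring threads: addresses differ by Width
        out[row] = s;
    }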

Parallel Memory Architecture for Shared Memory

• In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth

• Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks

• Multiple simultaneous accesses to the same bank result in a bank conflict
  - Conflicting accesses are serialized

[Figure: shared memory organized as banks 0 through 15]

Bank Addressing Examples

• No bank conflicts
  - Linear addressing, stride == 1: threads 0–15 map to banks 0–15

• No bank conflicts
  - Random 1:1 permutation: each thread still hits a distinct bank

Bank Addressing Examples

• 2-way bank conflicts
  - Linear addressing, stride == 2: pairs of threads (e.g., threads 0 and 8) map to the same bank

• 8-way bank conflicts
  - Linear addressing, stride == 8: eight threads map to each of banks 0 and 8 (x8)
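A minimal sketch of how the stride shows up in shared-memory code (the kernel and array names are illustrative, not from the slides). With 16 banks and 4-byte words, a word's bank is its index modulo 16:

    // Launch with 256 threads per block; the shared array is sized so the
    // stride-8 access stays in bounds (8 * 255 < 2048).
    __global__ void bank_demo(float *d_out)
    {
        __shared__ float s[2048];
        for (int i = threadIdx.x; i < 2048; i += blockDim.x)
            s[i] = (float)i;
        __syncthreads();

        float a = s[threadIdx.x];       // stride 1: bank = t % 16, no conflict
        float b = s[2 * threadIdx.x];   // stride 2: threads t and t+8 share a bank (2-way conflict)
        float c = s[8 * threadIdx.x];   // stride 8: only banks 0 and 8 are used (8-way conflict)

        d_out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;   // keep the reads live
    }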

Summary of CUDA programming tips

• Divide the overall task between concurrent, non-communicating threads
• Design coalesced accesses to global memory
• Avoid bank conflicts when accessing shared memory

Programming on CUDA

Basic steps

• Transfer data from CPU to GPU
• Explicitly call the GPU kernel you designed
  - CUDA implicitly assigns thread blocks to multiprocessors and allocates resources for the computation
• Transfer results back from GPU to CPU

CPU vs GPU

• CPU: operation intensive
  - Goal: reduce the number of operations performed, at the expense of additional memory accesses
• GPU: memory intensive
  - Goal: reduce the number of memory accesses, at the expense of additional operations

Memory model

[Figure: host memory and each device's memory (device 0, device 1) are separate; data moves between them with cudaMemcpy()]

CUDA: Minimal extensions to C/C++

• Declaration specifiers to indicate where things live

    __global__ void KernelFunc(...);   // kernel callable from host
    __device__ void DeviceFunc(...);   // function callable on device
    __device__ int  GlobalVar;         // variable in device memory
    __shared__ int  SharedVar;         // in per-block shared memory

• Extended function-invocation syntax for parallel kernel launch

    KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each

• Special variables for thread identification in kernels

    dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;

• Intrinsics that expose specific operations in kernel code

    __syncthreads();                   // barrier synchronization

CUDA: Features available on the GPU

• Standard mathematical functions
  - sinf, powf, atanf, ceil, min, sqrtf, expf, erfc, and many more

CUDA: Runtime support

• Explicit memory allocation returns pointers to GPU memory
  - cudaMalloc(), cudaFree()
• Explicit memory copy for host ↔ device, device ↔ device
  - cudaMemcpy(), cudaMemcpy2D(), ...
• Texture management
  - cudaBindTexture(), cudaBindTextureToArray(), ...
• OpenGL & DirectX interoperability
  - cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …

Example: Vector Addition Kernel

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
    }

Example: Host code for vecAdd

    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
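The slide stops after the kernel launch; a plausible completion (not shown in the original) copies the result back to the host and releases the device memory:

    // copy the result back to host memory
    float *h_C = (float*) malloc(N * sizeof(float));
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

    // free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);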

CUDA libraries – CUFFT & CUBLAS

CUBLAS and CUFFT

• Standard libraries shipped with the development kit
• CUBLAS
  - CUDA version of BLAS
  - Available in single and double precision, for real and complex numbers
  - The double-precision version is slower
• CUFFT
  - FFT & IFFT on CUDA
  - Faster than the fastest CPU implementations

CUBLAS

• Three classes of routines
  1. Vector operations
     - Vector addition, norm, dot product, etc.
  2. Matrix-vector operations
     - Matrix-vector products for symmetric and general matrices, etc.
  3. Matrix-matrix operations
     - Matrix multiplication, etc.
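As a small illustration of how a level-1 CUBLAS routine is called from host code (this uses the later handle-based v2 API and illustrative names; it is a sketch, not taken from the slides):

    #include <cublas_v2.h>

    // y = alpha*x + y on the device, computed by the library instead of a hand-written kernel.
    // d_x and d_y are device pointers to n floats each.
    void saxpy_with_cublas(int n, float alpha, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // unit stride through both vectors
        cublasDestroy(handle);
    }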

Advantages

• Highly optimized implementations
• Usable as standard C/C++/Fortran libraries
• Cater to the needs of many scientific computing tasks

OpenCL

CUDA: An Architecture for Massively Parallel Computing

ATI’s Compute “Solution”

OpenCL vs. C for CUDA

• OpenCL: entry point for developers who want a low-level API
• C for CUDA: entry point for developers who prefer high-level C
• Both paths share a back-end compiler & optimization technology and compile to PTX, which runs on the GPU

Recall: GPU and CUDA

• GPU: developed for accelerating graphics
• CUDA: developed to harness the power of GPUs for general-purpose applications
  - Like C in syntax
• The GPU is not a panacea
  - Used in a master-slave scenario with the CPU (host) as master

Recall: GPU memories

• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read (only) per-grid constant memory

• Divide data according to access patterns
  - Read-only → constant memory (very fast if in cache)
  - Read/write, shared within a block → shared memory (very fast)
  - Read/write within each thread → registers (very fast)
  - Read/write inputs/results → global memory (very slow)

[Figure: the grid/block/thread memory diagram, with per-block shared memory, per-thread registers, and the global and constant memories accessible from the host]

Recall: Thread organization

[Figure: each thread has local memory, each block has shared memory, and sequentially executed grids (Grid 0, Grid 1, …) share global memory]

Recall: Heterogeneous programming

    // CPU code
    cudaMalloc()       // allocate memory on the device
    cudaMemcpy()       // transfer input data to the device
    Kernel<<< … >>>()  // call CUDA kernels (a kernel is the function executed by each thread)
    cudaMemcpy()       // transfer results from the device

• Keywords: __global__, __shared__, __device__
• Special math functions: sinf, expf, min, etc.

Case Study: Matrix Multiplication

Matrix Multiplication Kernel using Multiple Blocks

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

        Pd[Row*Width+Col] = Pvalue;
    }

How about performance on G80?

• All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating-point multiply-add
  - 4 B/s of memory bandwidth required per FLOPS
  - 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  - The available 86.4 GB/s limits the code to 21.6 GFLOPS

• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: the grid/block/thread memory diagram again; every operand comes from global memory]

Use Shared Memory to Reuse Global Memory Data

• Each input element is read by Width threads
• Load each element into shared memory and have several threads use the local copy to reduce the memory bandwidth
  - Tiled algorithms

[Figure: matrices M, N, and P of width WIDTH; the thread at (tx, ty) reads a row of M and a column of N to produce one element of P]

Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd divided into TILE_WIDTH × TILE_WIDTH tiles; block (bx, by) computes the sub-matrix Pdsub, with thread (tx, ty) indexing within the tile]

A Small Example

[Figure: a small matrix multiplication with elements Md0,0…Md3,1, Nd0,0…Nd1,3, and Pd0,0…Pd3,3, used to illustrate data reuse]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

Every Md and Nd element is used exactly twice in generating a 2×2 tile of P

Access order (each column is one thread, read top to bottom):

    P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
    M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
    M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
    M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
    M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3

Breaking Md and Nd into Tiles

[Figure: the same small Md, Nd, and Pd, now partitioned into 2×2 tiles; each phase of the tiled kernel processes one tile of Md and one tile of Nd]

First-order Size Considerations in G80

• Each thread block should have many threads
  - TILE_WIDTH of 16 gives 16*16 = 256 threads

• There should be many thread blocks
  - A 1024*1024 Pd gives 64*64 = 4096 thread blocks

• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8192 mul/add operations
  - Memory bandwidth is no longer a limiting factor

CUDA Code – Kernel Execution Configuration

    // Setup the execution configuration
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
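With this configuration, the kernel from the previous slides is launched with one thread per Pd element (a sketch; Md, Nd, and Pd are the device pointers passed to the kernel):

    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);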

Tiled Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;

        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }

        Pd[Row*Width+Col] = Pvalue;
    }

Tiled Multiply

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH × TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: in phase m, block (bx, by) loads the m-th TILE_WIDTH-wide tile of Md along its row and of Nd along its column, and thread (tx, ty) accumulates over index k within the tile]

G80 Shared Memory and Threading

• Each SM in G80 has 16 KB of shared memory
  - The size is implementation dependent!
  - For TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 KB of shared memory
  - An SM can therefore have up to 8 thread blocks actively executing
    - This allows up to 8*512 = 4096 pending loads (2 per thread, 256 threads per block)
  - The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time

• Using 16×16 tiling, we reduce accesses to global memory by a factor of 16
  - The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!

Tiling Size Effects

[Figure: GFLOPS achieved by the untiled kernel and by 4×4, 8×8, 12×12, and 16×16 tiles, each "tiled only" and "tiled & unrolled", on a 0-100 GFLOPS scale; performance rises with tile size]

Typical Structure of a CUDA Program

• Global variable declarations
  - __host__, __device__, __global__, __constant__, __texture__
• Function prototypes
  - __global__ void kernelOne(…)
  - float handyFunction(…)
• main()
  - Allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - Transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  - Execution configuration setup
  - Kernel call: kernelOne<<< … >>>( args… );
  - Transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
  - Optional: compare against a golden (host-computed) solution
  - (Repeat the transfer/launch steps as needed)
• Kernel: void kernelOne(type args, …)
  - Variable declarations: __local__, __shared__
    - Automatic variables are transparently assigned to registers or local memory
  - __syncthreads()…
• Other functions
  - float handyFunction(int inVar, …);

GPU for Machine learning

Machine learning

• With improved sensors, the amount of available data has increased severalfold over the past decade
• More robust and sophisticated learning algorithms have been developed to extract meaningful information from that data
• This has led to the application of these algorithms in many areas:
  - Geostatistics, astronomical prediction, weather data assimilation, computational finance

Extracting information from the data

• "Extracting information from the data" means converting the raw data into an interpretable form
  - For example, given a face image, it would be desirable to extract the identity of the person, the face pose, etc.
• Information extraction categories
  - Regression: fitting a continuous function
  - Classification: assigning one of a set of predefined classes
  - Density estimation: evaluating class membership
  - Ranking: preference relationships between classes
• Bottom line: infer relationships from the data
  - Build the relationship model from the data

Relationship modeling

• There are two primary categories of models
  - Parametric
  - Non-parametric
• Parametric models
  - Assume a known parametric form for the "relationship"
  - Estimate the parameters of this form from the data
• Non-parametric models
  - Make no assumptions about the form of the underlying function
  - "Let the data speak for itself"

Kernel methods

• A class of robust non-parametric learning methods
• Project the data into a higher-dimensional space
• Formulate the problem so that only inner products of the higher-dimensional features are required
• The inner products are given by kernel functions
• For example, the Gaussian kernel (a standard form is given below)
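The formula image on this slide did not survive extraction; a standard form of the Gaussian kernel, written with a bandwidth parameter h (an assumption, since the slide's own notation is lost), is

    K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{h^2} \right)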

Scalable learning methods

• Most kernel-based learning approaches scale as O(N²) or O(N³) in time with respect to the data size N
• Many of them also have an O(N²) memory requirement
• This is undesirable for very large datasets
• We would like to develop a parallelized version on the GPU

Kernel methods on GPUs

• Some problems require evaluating summations of kernel functions
  - The algorithm must map the summation onto multiple threads
• Some problems require solving linear systems involving kernel matrices
  - The kernel summation above can be combined with popular iterative solvers such as conjugate gradient
• Other problems require matrix decompositions, such as LU, of kernel matrices
  - A number of GPU approaches already exist

Solving Ky = b

• Can use iterative methods
  - e.g., conjugate gradient (a host-side sketch follows below)
• Each iteration requires evaluating a matrix-vector product Kx
• We discuss the matrix-vector product next
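As a host-side reference for what the GPU accelerates, here is a sketch of conjugate gradient for K y = b in plain C; the matvec() call is the O(N²) kernel summation that would be offloaded to the GPU, and the Gaussian-kernel form, names, and sizes are illustrative assumptions rather than the slides' own code:

    #include <math.h>
    #include <stdlib.h>

    static float gauss_kernel(const float *xi, const float *xj, int dim, float h)
    {
        float d2 = 0.0f;
        for (int k = 0; k < dim; ++k) { float d = xi[k] - xj[k]; d2 += d * d; }
        return expf(-d2 / (h * h));
    }

    /* v = K*x with K[i][j] = gauss_kernel(X_i, X_j): the step to move to the GPU */
    static void matvec(const float *X, int n, int dim, float h, const float *x, float *v)
    {
        for (int i = 0; i < n; ++i) {
            float s = 0.0f;
            for (int j = 0; j < n; ++j)
                s += gauss_kernel(&X[i*dim], &X[j*dim], dim, h) * x[j];
            v[i] = s;
        }
    }

    static float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    /* Conjugate gradient: solve K y = b to tolerance tol or maxit iterations */
    void cg_solve(const float *X, int n, int dim, float h,
                  const float *b, float *y, int maxit, float tol)
    {
        float *r  = (float*) malloc(n * sizeof(float));
        float *p  = (float*) malloc(n * sizeof(float));
        float *Kp = (float*) malloc(n * sizeof(float));

        for (int i = 0; i < n; ++i) { y[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }
        float rr = dot(r, r, n);

        for (int it = 0; it < maxit && rr > tol * tol; ++it) {
            matvec(X, n, dim, h, p, Kp);                 /* the expensive O(N^2) step */
            float alpha = rr / dot(p, Kp, n);
            for (int i = 0; i < n; ++i) { y[i] += alpha * p[i]; r[i] -= alpha * Kp[i]; }
            float rr_new = dot(r, r, n);
            float beta = rr_new / rr;
            for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        free(r); free(p); free(Kp);
    }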

Kernel matrix – special structure

• O(N) dependence
  - The N × N matrix is determined by an O(N)-length vector of points
• So only O(N) storage is needed
• We need to exploit this to minimize the space requirement

Kernel summation on GPU

• Data:
  - Source points x_i, i = 1, …, N
  - Evaluation points y_j, j = 1, …, M
• Each thread evaluates the sum corresponding to one evaluation point
• Algorithm (a fleshed-out kernel sketch follows the code fragment below):
  1. Load the evaluation point corresponding to the current thread into a local register.
  2. Load the first chunk of source data into shared memory.
  3. Evaluate the part of the kernel sum corresponding to the source data in shared memory.
  4. Accumulate the result in a local register.
  5. If all the source points have not been processed yet, load the next chunk and go to Step 3.
  6. Write the sum from the local register to global memory.

Gaussian kernel on GPU

    float sum = 0.0;
    __shared__ float hs[DIM];
    volatile float yr[DIM];
    for (int k = 0; k < …
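The listing above is cut off in this copy. A sketch of a complete kernel following the algorithm on the previous slide is given below; DIM, BLOCK, the weights q, and the bandwidth h are assumed names and are not confirmed by the surviving fragment:

    #define DIM   4     // dimensionality of each point (assumed)
    #define BLOCK 256   // threads per block = source points per shared-memory chunk (assumed)

    // One thread per evaluation point y_j: f_j = sum_i q_i * exp(-||y_j - x_i||^2 / h^2)
    __global__ void gauss_sum(const float *x, const float *q,   // N sources and weights
                              const float *y, float *f,         // M evaluation points, M outputs
                              int N, int M, float h)
    {
        __shared__ float xs[BLOCK * DIM];   // current chunk of source points
        __shared__ float qs[BLOCK];         // current chunk of source weights

        int j = blockIdx.x * blockDim.x + threadIdx.x;

        // Step 1: keep this thread's evaluation point in registers
        float yr[DIM];
        if (j < M)
            for (int d = 0; d < DIM; ++d) yr[d] = y[j * DIM + d];

        float sum = 0.0f;

        // Steps 2-5: sweep the source points one chunk at a time
        for (int base = 0; base < N; base += BLOCK) {
            int i = base + threadIdx.x;
            if (i < N) {
                for (int d = 0; d < DIM; ++d)
                    xs[threadIdx.x * DIM + d] = x[i * DIM + d];
                qs[threadIdx.x] = q[i];
            }
            __syncthreads();                 // chunk fully loaded

            int chunk = min(BLOCK, N - base);
            if (j < M)
                for (int c = 0; c < chunk; ++c) {
                    float d2 = 0.0f;
                    for (int d = 0; d < DIM; ++d) {
                        float diff = yr[d] - xs[c * DIM + d];
                        d2 += diff * diff;
                    }
                    sum += qs[c] * expf(-d2 / (h * h));
                }
            __syncthreads();                 // done with this chunk before it is overwritten
        }

        // Step 6: write the accumulated sum to global memory
        if (j < M)
            f[j] = sum;
    }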