CS842 - Concurrent Programming Mechanisms and Tools

Concurrent GPU Programming

Lesley Northam School of Computer Science University of Waterloo 200 University Avenue West Waterloo, Ontario, Canada N2L 3G1

April 12, 2009.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
  2.1 Stream programming . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
  2.2 General Purpose GPU programming . . . . . . . . . . . . . . . . . . . . 6

3 CUDA Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
  3.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
  3.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
      3.2.1 Global memory . . . . . . . . . . . . . . . . . . . . . . . . . . 8
      3.2.2 Shared memory . . . . . . . . . . . . . . . . . . . . . . . . . . 10
      3.2.3 Local memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
      3.2.4 Constant memory . . . . . . . . . . . . . . . . . . . . . . . . . 11
      3.2.5 Texture memory . . . . . . . . . . . . . . . . . . . . . . . . . . 12
  3.3 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
      3.3.1 Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
      3.3.2 Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
      3.3.3 Coalescing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
  3.4 Warps and branching . . . . . . . . . . . . . . . . . . . . . . . . . . 14
  3.5 Atomic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 CUDA Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
  4.1 Heterogeneous programming . . . . . . . . . . . . . . . . . . . . . . . 15
  4.2 Program setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
      4.2.1 Kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . 15
      4.2.2 Device functions . . . . . . . . . . . . . . . . . . . . . . . . . 16
  4.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  4.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
      4.4.1 Kernel functions must have a void return type . . . . . . . . . . 19
      4.4.2 Kernel and device functions cannot be recursive . . . . . . . . . 20
      4.4.3 Kernel and device functions cannot define static variables . . . . 21
      4.4.4 Kernel and device functions cannot have a variable number of arguments . . 21
      4.4.5 Kernel and device function parameters are limited to 256 bytes . . 22
      4.4.6 Kernel and device functions cannot directly communicate with the host . . 22
      4.4.7 Kernel and device functions cannot spawn new threads . . . . . . . 23
      4.4.8 Device functions cannot have their address taken . . . . . . . . . 23
      4.4.9 Conditional branches . . . . . . . . . . . . . . . . . . . . . . . 24
      4.4.10 Accumulation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
  4.5 Problems Ill-Suited to CUDA . . . . . . . . . . . . . . . . . . . . . . 25

5 A CUDA-based virtual spectrophotometer . . . . . . . . . . . . . . . . . . 25
  5.1 Early design decisions . . . . . . . . . . . . . . . . . . . . . . . . . 25
  5.2 Experiment work-load . . . . . . . . . . . . . . . . . . . . . . . . . . 26
  5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
  5.4 Equipment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
  5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
      5.5.1 Result discussion . . . . . . . . . . . . . . . . . . . . . . . . 29

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


List of Tables

1 CPU and GPU GFlops comparison [7, 3] . . . . . . . . . . . . . . . . . . . 5
2 CUDA available memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Atomic instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Virtual spectrophotometer implementations . . . . . . . . . . . . . . . . . 28
5 Execution time and speed-up for largest work-load (see Table 4 for implementation setup) . . 29

List of Figures

1 NVCC workflow [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 CUDA memory resources [7] . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Thread partitioning [7] . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Coalesced and non-coalesced memory [7] . . . . . . . . . . . . . . . . . . 14
5 Surface reflectance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6 Execution time vs work-load. Note that GPGPU implementations have an overhead of 0.2s, rendering them slower than the CPU until a work-load of at least 10^5 rays. . . 29

List of Listings

1 Global memory allocation and copy . . . . . . . . . . . . . . . . . . . . . 8
2 Global memory accumulator without mutual exclusion . . . . . . . . . . . . 9
3 Global memory accumulator with mutual exclusion . . . . . . . . . . . . . . 10
4 Shared memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 Local memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6 Constant memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Texture memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
8 Coalesced memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
9 Conditional branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
10 Sample kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . 16
11 Sample device function . . . . . . . . . . . . . . . . . . . . . . . . . . 16
12 Synchronizing host and device . . . . . . . . . . . . . . . . . . . . . . 17
13 Wait-queue function . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
14 Kessels algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
15 Kernel function with void return type . . . . . . . . . . . . . . . . . . 20
16 Recursion workarounds . . . . . . . . . . . . . . . . . . . . . . . . . . 20
17 Static variables in functions workaround . . . . . . . . . . . . . . . . . 21
18 Structure for variable arguments . . . . . . . . . . . . . . . . . . . . . 21
19 256 byte parameters workaround . . . . . . . . . . . . . . . . . . . . . . 22
20 Communication between device and host . . . . . . . . . . . . . . . . . . 22
21 Elimination of branches . . . . . . . . . . . . . . . . . . . . . . . . . 24
22 Accumulation with atomic operations . . . . . . . . . . . . . . . . . . . 24
23 Accumulation without atomic operations . . . . . . . . . . . . . . . . . . 24


1 Introduction

Monte Carlo algorithms use repeated random sampling to find solutions to problems. One common example uses points randomly selected from the unit box to approximate the value of π. Another example is a simulation called a virtual spectrophotometer, which measures the reflectance of a modeled material [1]. The repetitive nature of Monte Carlo algorithms usually makes these programs time and energy intensive. The repetitions are identical and mutually independent, making them easy to parallelize, but parallelizing at the granularity of individual repetitions may require an exorbitant number of threads. Multi-core CPU architectures are capable of parallel thread execution, but the massively multi-threaded GPU architecture is better suited to Monte Carlo and virtual spectrophotometer work-loads because it offers orders of magnitude more thread-level parallelism (see Table 1) [7, 4].

Device                      # Cores    Total GFlops
Intel Core i7 965           4          70
NVIDIA GeForce 8800 GTX     128        518
NVIDIA GTX 280              240        933
NVIDIA Tesla S1070          960        4320

Table 1: CPU and GPU GFlops comparison [7, 3]

The purpose of this project is to explore general purpose GPU programming and implement a virtual spectrophotometer that uses the massively multi-threaded nature of the GPU. Particular attention is given to the concurrency issues that limit the performance of the spectrophotometer on the GPU. This report is divided into four sections. The first section provides background information on general purpose GPU programming. The second section details the CUDA hardware abstraction layer. The third section discusses the limitations of CUDA applications and provides potential solutions to these limitations. The final section analyzes several CUDA implementations of a virtual spectrophotometer and compares their execution time with a single-thread CPU implementation.

2 Background

This section provides background on general purpose GPU programming and related software platforms.


2.1 Stream programming

Let X be a dataset, and let f(x) be a computation, or kernel function, to be applied to each element x ∈ X. If the kernel function is independent (i.e. f(x_i) does not rely on f(x_j) for all x_i, x_j ∈ X, i ≠ j), all of the computations f(x_i) for i = 1, ..., n can be evaluated completely in parallel. This is called stream programming.

Traditionally, the graphics processing unit (GPU) renders the computer user interface to a video display. The onset of realistic graphical applications (e.g. video games) has driven the development of GPUs with highly parallel graphical rendering pipelines. This parallel GPU architecture can be harnessed to support stream processing [7, 5].

2.2 General Purpose GPU programming

Programs written for CPUs can be executed on GPUs using a technique called General Purpose GPU (GPGPU) programming. Applications with natural or obvious parallelism are especially suited to stream processing on the GPU. Some of these applications are:

• real-time ray-tracing
• fluid dynamics simulation
• audio/video/image/signal compression
• image segmentation and registration
• searching and sorting

GPGPU applications are made possible by programmable shaders, a hardware feature of the GPU. In rendering applications, a shader or shader program is used to describe the appearance of a geometric surface (e.g. a chrome car bumper). Early GPGPU programs were written using shader languages such as Cg, GLSL, and HLSL [5]. Designing shader-language GPGPU programs requires the developer to:

• understand the graphics rendering pipeline (i.e. the specific datatypes and algorithms used to generate images and animations)
• formulate the problem in terms of rendering primitives (e.g. rays, textures, transformation matrices)

More practical GPGPU platforms reduce this learning curve and development overhead by abstracting away the graphics rendering pipeline. These platforms include:

• Brook/Brook+; a freely available multi-platform GPGPU extension to ANSI C, created by the Stanford University Graphics group [5].
• CUDA; a freely available GPGPU extension to C, created by NVIDIA (especially) for GeForce/Quadro/Tesla/GTX graphics cards [7, 5].


• RapidMind Multi-core Development Platform; a commercially available multi-platform (GPU, CPU and IBM Cell) stream programming extension to C++, created by RapidMind - a University of Waterloo spin-off. This platform and runtime include a load balancer and profiler that optimize core utilization [5, 6].

3 CUDA Platform

NVIDIA's GPU hardware can be harnessed for highly parallel stream processing. Their GPGPU platform, CUDA (Compute Unified Device Architecture), allows kernel functions to be developed and deployed to GPU hardware using an extension to the C language. NVIDIA abstracts the CPU and GPU program components into host and device parts. This abstraction layer ensures CUDA platform portability to non-GPU systems and non-NVIDIA hardware. However, CUDA applications must interact with hardware resources (such as threads and memory) to optimize for specific hardware configurations. A proper balance between portability and optimality can be achieved by considering CUDA's hardware abstraction layer at application design time. This section details CUDA's workflow and hardware abstraction layer.

3.1 Workflow

Figure 1: NVCC workflow [7]

The NVCC compiler (Figure 1) converts application code into host (non-GPGPU) and device (GPGPU) source codes. The host code is compiled directly into object files, whereas the device code is compiled into a library containing:

• GPU configuration and host interfacing
• the kernel function binary loaded by the GPU

Alternatively, the device library can be implemented as an emulation layer, providing a testbed for CUDA applications in the absence of GPGPU hardware. The device library and host objects are linked to produce the GPGPU application binary.


3.2 Memory

CUDA applications can access five specialized memory resources [7]. These specializations reduce memory read times and contention for exclusive access. Figure 2 illustrates host and device (GPU) communication using these memories. Table 2 contrasts the functional differences between the memories.

Figure 2: CUDA memory resources [7]

3.2.1 Global memory

The host (CPU) and device (GPU) have full read/write access to global memory. It can be used to store the kernel function's input and output (Listing 1).

Listing 1: Global memory allocation and copy

// Assume inputData is populated with input values
float inputData[512];


Memory Type        Access                  R/W        Lifetime
global (device)    all threads and CPU     both       application
shared             threads of same block   both       block
local              single thread           both       thread
texture            all threads and CPU     read only  application
constant           all threads and CPU     read only  application

Table 2: CUDA available memory

float outputData[512];
float *input, *output;

// Allocate space for input and output on GPU
cudaMalloc( (void**)&input, 512 * sizeof(float) );
cudaMalloc( (void**)&output, 512 * sizeof(float) );

// Copy input to GPU
cudaMemcpy( input, inputData, 512 * sizeof(float), cudaMemcpyHostToDevice );

// TODO: Perform kernel function here

// Copy output to CPU
cudaMemcpy( outputData, output, 512 * sizeof(float), cudaMemcpyDeviceToHost );

Note that global memory requires mutual exclusion when multiple GPU threads write to the same address (e.g. Listing 2). Without mutual exclusion, only one thread is guaranteed success (e.g. in Listing 2, sum may be less than 512 after Increment has completed). With mutual exclusion, all threads are guaranteed success (e.g. in Listing 3, sum = 512 after Increment has completed) [7].

Listing 2: Global memory accumulator without mutual exclusion

// global device variable
__device__ int sum = 0;

// kernel function - accumulator
__global__ void Increment()
{
    // non-atomic increment - only one thread is guaranteed success
    sum++;
}

int main()
{
    // Kernel function call with 512 threads
    Increment<<<1, 512>>>();
}

Listing 3: Global memory accumulator with mutual exclusion

// global device variable
__device__ int sum = 0;

// kernel function - accumulator
__global__ void Increment()
{
    // atomic increment - all threads guaranteed success
    atomicAdd( &sum, 1 );
}

int main()
{
    // Kernel function call with 512 threads
    Increment<<<1, 512>>>();
}

3.2.2 Shared memory

CUDA partitions the GPU threads into a grid of blocks. Each block of threads has full read/write access to a private shared memory (Listing 4). Mutual exclusion is required when multiple threads write to the same address in shared memory. However, atomic operations only need to stall the threads of the same block to ensure mutual exclusion in shared memory. Shared memory should be used whenever possible to reduce the serialization caused by atomic instructions and mutual exclusion.

Listing 4: Shared memory

// This kernel function uses shared memory to help sum the matrix elements;
// it requires no atomic operations
__global__ void MatrixSumElements( int *matrix, int *sum )
{
    int idx = threadIdx.x;
    int i = idx * 4;
    __shared__ int rowSums[4];

    // each thread will sum a matrix row
    rowSums[idx] = matrix[i] + matrix[i + 1] + matrix[i + 2] + matrix[i + 3];

    // must wait for all threads to complete the previous statement before
    // continuing; hardware barrier - only four threads - fast
    __syncthreads();

    // update sum, no atomic op needed -- guaranteed one thread will succeed
    *sum = rowSums[0] + rowSums[1] + rowSums[2] + rowSums[3];
}

int main()
{
    int matrix[16] = { 1, 1, ... , 1 };
    int sum;
    int *matrix_cuda;
    int *sum_cuda;

    cudaMalloc( (void**)&matrix_cuda, 16 * sizeof(int) );
    cudaMalloc( (void**)&sum_cuda, sizeof(int) );
    cudaMemcpy( matrix_cuda, matrix, 16 * sizeof(int), cudaMemcpyHostToDevice );

    MatrixSumElements<<<1, 4>>>( matrix_cuda, sum_cuda );

    cudaMemcpy( &sum, sum_cuda, sizeof(int), cudaMemcpyDeviceToHost );
}

3.2.3 Local memory

Each CUDA thread has a private stack memory with full read/write access, i.e. neither the host nor other threads can access it (Listing 5).

Listing 5: Local memory

// kernel function
__global__ void DoSomething()
{
    // local variable
    int sum = 0;
}

3.2.4 Constant memory

Constant memory provides relatively faster read access to data that cannot be changed by the GPU (Listing 6) [7].

Listing 6: Constant memory

__constant__ float PI = 3.14159265;

// kernel function
__global__ void DoSomething()
{
    float twoPi = 2 * PI;
}

3.2.5 Texture memory

Texture memory provides linearly-interpolated read-only memory access to all threads [7]. This provides fast hardware support for repetitive graphical and scientific computing operations. Texture memory is addressed by one-, two- or three-dimensional floating point values. Listing 7 demonstrates basic texture memory operations (initialization and fetching).

Listing 7: Texture memory

texture<int, 1> tex;    // One dim texture of int
int memory[256];

void configure_texture( void )
{
    // set up addressing
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.normalized = true;

    // write values into texture memory
    for ( int i = 0; i < 256; i++ )
    {
        memory[i] = i;
    }
    cudaBindTexture( 0, tex, memory, 256 * sizeof(int) );
}

// Kernel function
__global__ void DoSomething()
{
    // texture fetch
    int value = tex1Dfetch( tex, 0.5 );
}

3.3 Threads

GPUs are massively multithreaded, so the CUDA hardware abstraction layer organizes the application threads into a grid-and-block hierarchy. As a result, the application must allocate threads along five spatial dimensions (3D thread blocks in a 2D grid of blocks - Figure 3). At the application level, each of these dimensions represents an iteration index, parameterizing the threads. Performance is usually optimized by partitioning the threads to closely match the underlying GPU or device. Unfortunately, these optimizations introduce device dependencies, reducing portability.


Figure 3: Thread partitioning [7]

3.3.1 Blocks

Thread blocks have three dimensions (e.g. 8 × 2 × 8 threads = 128 threads/block). The total number of threads within a block is limited by the hardware, and all blocks must have the same dimensions. The CUDA platform presents each device as a set of block-executing multi-processors. Each multi-processor supports a limited number of active blocks and threads (e.g. a maximum of 8 blocks and 768 threads distributed amongst those 8 blocks). When a block finishes executing, the multi-processor schedules a remaining block from the grid for execution. It is desirable to keep each multi-processor as busy as possible (maximal assigned threads). The ratio of assigned threads to total threads is called multi-processor utilization. For example, suppose a multi-processor supports a maximum of 8 blocks and 768 total threads, with a maximum of 512 threads/block. If the application requests blocks of 512 threads, the multi-processor utilization is 67% since only one block can be assigned, leaving 256 unused thread slots. However, if the application requests blocks of 256 threads, the multi-processor utilization is 100% since three blocks can be assigned, leaving no unused thread slots.

3.3.2 Grids

A grid of blocks has two dimensions (e.g. a 2 × 2 grid of blocks). The total number of threads in the grid is limited by the particular device.

3.3.3 Coalescing

Threads are best partitioned into a grid-and-block hierarchy that reflects memory access. For example, a linear arrangement of 256 threads parallelizes memory operations when adding 256-element vectors. Similarly, a matrix arrangement of threads can be used when adding matrices. The compiler detects such thread arrangements and memory accesses, and then schedules one DMA-like memory request on behalf of the block's threads. This is called memory coalescing. In particular, memory coalescing occurs when memory is indexed by thread id (i.e. the thread's 3D coordinate in the block). Listing 8 and Figure 4 illustrate the difference between coalesced and non-coalesced memory access.

Listing 8: Coalesced memory

// kernel function
__global__ void DoSomething( int *input, int *result )
{
    // coalesced memory access
    result[threadIdx.x] = input[threadIdx.x] * 5;

    // non-coalesced memory access
    result[3 * threadIdx.x + 2] = input[2 * threadIdx.x - 1];
}

Figure 4: Coalesced and non-coalesced memory [7]

3.4 Warps and branching

Each block is divided into groups of threads called warps. These warp threads execute in lockstep with instruction-level granularity (recall that all threads in a block execute the same kernel function). When the threads in a warp encounter a branch instruction, they cannot continue executing in lockstep because the threads may diverge. Threads taking the same path continue executing together while the divergent threads wait for their turn. After all threads in the warp have taken their turn, they re-converge and continue executing in lockstep [7, 4, 2]. Listing 9 illustrates a kernel function with a branch - some threads take path A, others path B.

Listing 9: Conditional branch

if ( input[threadIdx.x] % 2 == 0 )
    result[threadIdx.x] = 1;  // Path A
else
    result[threadIdx.x] = 2;  // Path B

3.5 Atomic operations

The CUDA architecture provides hardware-level atomic instructions for enforcing mutual exclusion (Table 3) [7].


Operation        Function prototype
Addition         atomicAdd( int *addr, int value )
Subtraction      atomicSub( int *addr, int value )
Exchange         atomicExch( int *addr, int value )
Increment        atomicInc( int *addr, int value )
Decrement        atomicDec( int *addr, int value )
Check and Set    atomicCAS( int *addr, int compare, int value )
And              atomicAnd( int *addr, int value )
Or               atomicOr( int *addr, int value )
Xor              atomicXor( int *addr, int value )

Table 3: Atomic instructions

4 CUDA Applications

This section explains how to write CUDA application code, focusing on specific platform limitations and their workarounds.

4.1 Heterogeneous programming

GPGPU programs execute code on both the CPU (host) and GPU (device). The host is responsible for device initialization, communication, and making kernel calls. The device executes the kernel.

4.2 Program setup

A general purpose GPU program consists of:

1. one or more kernel functions
2. GPU library functions (a.k.a. device functions)
3. constants (e.g. π)
4. global, shared, and texture memory

4.2.1 Kernel functions

Kernel functions are the work horses of GPGPU/stream programming. They may only be called by the host. A kernel function call results in block and thread creation on the device. Each device thread executes the called kernel function exactly once [7]. The AddVectors kernel function, identified by the "__global__" notation in Listing 10, creates 512 threads on the device. The <<<gridDim, blockDim>>> notation represents the grid-and-block thread layout, where gridDim and blockDim are integers or dim3 values (the grid uses two dimensions, the block up to three).


Listing 10: Sample kernel function

__global__ void AddVectors( int *x1, int *x2, int *x3 )
{
    // threadIdx is a built-in variable that stores the thread index
    unsigned int idx = threadIdx.x;
    x3[idx] = x1[idx] + x2[idx];
}

int main()
{
    int v1[512], v2[512], v3[512];
    for ( int i = 0; i < 512; i++ )
    {
        v1[i] = i;
        v2[i] = 512 - i;
    }

    // Create memory on GPU
    int *x1, *x2, *x3;
    cudaMalloc( (void**)&x1, 512 * sizeof(int) );
    cudaMalloc( (void**)&x2, 512 * sizeof(int) );
    cudaMalloc( (void**)&x3, 512 * sizeof(int) );

    // Copy values from CPU to GPU
    cudaMemcpy( x1, v1, 512 * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( x2, v2, 512 * sizeof(int), cudaMemcpyHostToDevice );

    // Call kernel function
    AddVectors<<<1, 512>>>( x1, x2, x3 );

    // Copy result from GPU to CPU
    cudaMemcpy( v3, x3, 512 * sizeof(int), cudaMemcpyDeviceToHost );
}

4.2.2 Device functions

Device functions are only usable by the device (i.e. they cannot be called by the host) [7]. They are used as helper functions for the kernel and other device functions. Listing 11 illustrates a device function that squares a value.

Listing 11: Sample device function

// Device function
__device__ int square( int x )
{
    return x * x;
}

// Kernel function
__global__ void MyKernel( int *x1, int *x2 )
{
    unsigned int idx = threadIdx.x;
    x2[idx] = square( x1[idx] );
}

4.3 Synchronization

The host does not block after issuing the kernel function call to the device, so the CUDA platform provides the cudaThreadSynchronize function (Listing 12) to synchronize the host with the device.

Listing 12: Synchronizing host and device

int main()
{
    int result[512];
    int *result_cuda;

    // Allocate device memory
    cudaMalloc( (void**)&result_cuda, 512 * sizeof(int) );

    // Launch kernel
    DoSomething<<<1, 512>>>( result_cuda );

    // Barrier, forces host to wait until all threads completed
    cudaThreadSynchronize();

    // It is now safe to copy the result of DoSomething
    // from device to host
    cudaMemcpy( result, result_cuda, 512 * sizeof(int),
                cudaMemcpyDeviceToHost );
}

The kernel function may require synchronization on the device if:

• there are sequential reads/writes on memory (e.g. an accumulator), or
• there are multiple kernels with memory dependencies (e.g. producer-consumer)

The CUDA platform provides atomic instructions that can be used to create higher-level concurrency primitives (e.g. monitors). Many synchronization issues can be resolved using hardware-level barriers, which can be used by kernels and by the host between kernel calls. Barriers may reduce performance, however, and should be used wisely [7, 2].


In some cases, CUDA's execution semantics can be exploited to provide synchronization without using the architected concurrency primitives. Listing 13 gives the kernel function for a simple mutex/wait-queue that relies on a bounded busy wait. Within warps, the threads execute in lockstep until they have executed their critical section, at which point they break from the loop and wait for the rest of the threads to reconverge.

Listing 13: Wait-queue function

__device__ int cnt = 0;

__global__ void wait_queue()
{
    while (1)
    {
        // only the thread with cnt as its index may proceed
        if ( cnt == threadIdx.x )
        {
            // critical section
            cnt++;
            break;
        }
    }
}

Assume for now that only one warp executes the mutex. Each thread executes threadIdx.x + 1 iterations of the loop. Let n be the number of threads executing in the warp. The upper bound on the total number of loop iterations within the warp is

    1 + 2 + · · · + n = (n^2 + n) / 2

Extending this result to multiple warps requires the introduction of other variables. When more than one warp executes the mutex, only one of the warps will be executing threads with indices close to cnt. The other warps do not execute in lockstep with this warp, so it might seem like they execute an arbitrary number of iterations. However, if the critical section takes at most m cycles, then each warp executes at most mn iterations while its indices are not near cnt. If there are p warps, then each warp executes at most mnp + (n^2 + n)/2 iterations, yielding a grand total of p(mnp + (n^2 + n)/2) iterations. This establishes an upper bound on the number of iterations performed by the busy wait. Listing 14 illustrates an implementation of Kessels algorithm.

Listing 14: Kessels algorithm

__device__ int intent[2] = { 0, 0 };
__device__ int flag[2]   = { 0, 1 };
__device__ int addAmt[2] = { 2, 5 };

// Assumes only two threads!
__global__ void Kessel( int _id, int *result )
{
    // thread id
    int id = _id;

    for ( int i = 0; i < 10; i++ )
    {
        intent[id] = 1;
        bool temp = flag[!id];
        flag[id] = temp ^ id;
        while ( intent[!id] && flag[id] == (flag[!id] ^ id) ) {}

        (*result) += 1;    // critical section

        intent[id] = 0;
    }
}

4.4 Limitations

CUDA is an extension to the C language, but not all C idioms are supported. Specifically [7]:

• Kernel functions must have a void return type
• Kernel and device functions cannot be recursive
• Kernel and device functions cannot declare static variables
• Kernel and device functions cannot have a variable number of arguments
• Kernel and device function parameters are limited to 256 bytes
• Kernel and device functions cannot directly communicate with the host (i.e. write to disk)
• Kernel and device functions cannot create new threads
• Device functions cannot have their address taken (i.e. no function pointers)

Additionally, some basic C idioms may execute slowly - these are:

• Conditional branches
• Accumulation

The remainder of this section demonstrates potential solutions to these limitations.

4.4.1 Kernel functions must have a void return type

Results can be stored in global memory, since it is visible to the host. Listing 15 uses global memory to store the return value.


Listing 15: Kernel function with void return type

__global__ void Kernel( int *result )
{
    result[threadIdx.x] = 5;
}

int main()
{
    // Allocate space on the device for the result
    int r[512], *result;
    cudaMalloc( (void**)&result, 512 * sizeof(int) );

    // Call kernel function
    Kernel<<< 1, 512 >>>( result );

    // Retrieve results from device
    cudaMemcpy( r, result, 512 * sizeof(int), cudaMemcpyDeviceToHost );
}

4.4.2 Kernel and device functions cannot be recursive

Most recursion can be replaced with a loop. Gotos may also be used to simulate recursion. Listing 16 illustrates both solutions.

Listing 16: Recursion workarounds

// Ideal recursive function - not permitted by CUDA
__device__ int factorial( int n, int fact )
{
    if ( n > 0 )
        return factorial( n - 1, fact * n );
    else
        return fact;
}

// Workaround A: Use a loop
__device__ int factorial( int n )
{
    int fact = 1;
    while ( n > 0 ) {
        fact *= n;
        n--;
    }
    return fact;
}

// Workaround B: Use goto
__device__ int factorial( int n )
{
    int fact = 1;
jmp_point:
    if ( n == 0 )
        return fact;
    else {
        fact *= n;
        n--;
        goto jmp_point;
    }
}

4.4.3 Kernel and device functions cannot declare static variables

The "static" variable can instead be placed in global memory, though it is then visible beyond the function (developer beware). Listing 17 illustrates this idea with a random number generator.

Listing 17: Static variables in functions workaround

// Ideal function with static variable - not permitted in CUDA
__device__ int random()
{
    static int seed = 12345;
    int rand = seed % 1024;
    seed = seed * 119237 + 37;
    return rand;
}

// Workaround: Use global memory
__device__ int seed = 12345;

__device__ int random()
{
    int rand = seed % 1024;
    seed = seed * 119237 + 37;
    return rand;
}

4.4.4 Kernel and device functions cannot have a variable number of arguments

A list of inputs stored in global memory can be used to simulate "varargs". Listing 18 shows a sample list data structure that could be used to create a parameter list.

Listing 18: Structure for variable arguments

// List node
typedef struct Parameter {
    void *item;
    char type;
    struct Parameter *next;
} Parameter;

// Kernel function
__global__ void DoSomething( Parameter *first )
{
    while ( first != NULL ) {
        // Retrieve parameters
        first = first->next;
    }
}

4.4.5 Kernel and device function parameters are limited to 256 bytes

To reduce total parameter size, pass larger items (i.e. data structures) by reference instead of by value, and group similar parameters into arrays or lists (see Listings 18 and 19).

Listing 19: 256 byte parameter workaround

// Large struct
typedef struct {
    int red[1024];
    int green[1024];
    int blue[1024];
} ColourVector;

// Ideal function: not permitted - parameters too large
__device__ int average( ColourVector c ) { ... }

// Workaround: Pass by reference
__device__ int average( ColourVector *c ) { ... }

4.4.6 Kernel and device functions cannot directly communicate with the host

The device can communicate with the host by storing information in global memory, which the host can asynchronously poll. Listing 20 illustrates this idea.

Listing 20: Communication between device and host

// Kernel function
__global__ void FindValue( int *values, int key, int *foundKey )
{
    // if the key is found - set a flag in global memory
    if ( values[threadIdx.x] == key )
        atomicAdd( foundKey, 1 );
}

int main()
{
    int values[512], foundKey = 0;
    int *v, *f;
    for ( int i = 0; i < 512; i++ ) {
        values[i] = i;
    }

    // Allocate and copy memory on device
    cudaMalloc( (void**)&v, 512 * sizeof(int) );
    cudaMalloc( (void**)&f, 1 * sizeof(int) );
    cudaMemcpy( v, values, 512 * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( f, &foundKey, 1 * sizeof(int), cudaMemcpyHostToDevice );

    FindValue<<< 1, 512 >>>( v, 4, f );

    // Poll device until value is found
    while ( 1 ) {
        cudaMemcpy( &foundKey, f, 1 * sizeof(int), cudaMemcpyDeviceToHost );
        if ( foundKey > 0 ) {
            printf( "Key was found!\n" );
            break;
        }
    }
}

4.4.7 Kernel and device functions cannot spawn new threads

Only the host may create new threads on the device. The global-flag model presented in Listing 20 allows the device to signal its need for new threads to the host.

4.4.8 Device functions cannot have their address taken

If each function were assigned a unique id (an unsigned int) at design time, a switchboard function could dispatch the intended call on behalf of the caller; the caller only needs to know the id of the function it intends to call. The overhead of this design grows with the number of functions indexed by the switchboard. The approach also creates a design dependency that might become difficult to maintain as functions are added to and removed from the switchboard.


4.4.9 Conditional branches

Branching can reduce parallelism, and therefore performance. Branches can be eliminated by computing all branch paths and then "blending" the results with conditionals (Listing 21).

Listing 21: Elimination of branches

// Original branch
if ( x % 2 == 0 )
    x *= 2;
else
    x += 1;

// "Blending"
bool doThis = ( x % 2 == 0 );
x = doThis * ( x * 2 ) + !doThis * ( x + 1 );

4.4.10 Accumulation

Kernel functions that require all threads to write to the same memory address require atomic operations, which force serialization and reduce performance (e.g. Listing 22).

Listing 22: Accumulation with atomic operations

// Compute a random point P=(x,y) falling in the square centered
// at (0,0) with side length 1, and count the number of points that
// fall within the circle centered at (0,0) with radius 1
__global__ void ComputePI( int *result )
{
    int P[2];
    randomPoint( &P );
    int inCircle = checkIfInCircle( &P );  // returns 0 if no, 1 if yes
    atomicAdd( result, inCircle );         // accumulate result
}

Instead of using a single storage location, each thread can store its contribution at a unique global memory address, eliminating the need for atomic operations. Accumulation then occurs host-side, where it can be done serially without atomics. Listing 23 demonstrates this idea.

Listing 23: Accumulation without atomic operations

__global__ void ComputePI( int *result )
{
    int P[2];
    randomPoint( &P );
    result[threadIdx.x] = checkIfInCircle( &P );
}

int main()
{
    int resultData[512];
    int *result;

    // Allocate space for the result on the GPU
    cudaMalloc( (void**)&result, 512 * sizeof(int) );

    ComputePI<<< 1, 512 >>>( result );
    cudaThreadSynchronize();  // Barrier to ensure all threads have completed

    // Copy result from GPU to CPU
    cudaMemcpy( resultData, result, 512 * sizeof(int), cudaMemcpyDeviceToHost );

    // Sum results
    int sum = 0;
    for ( int i = 0; i < 512; i++ ) {
        sum += resultData[i];
    }
}

4.5 Problems Ill-Suited to CUDA

The following list details the types of programs that are not suited to general purpose GPU programming:

1. Low-concurrency problems, i.e. problems using few threads. The GPU's performance is realized with a thousand or more threads.

2. Programs with a high ratio of memory transactions (local and/or global) to computation. Memory access is high latency; GPUs perform best when there are more mathematical operations than memory transactions.

3. System-level programs such as operating systems. Instructions such as trap are not supported by the GPU.

5 A CUDA-based virtual spectrophotometer

The previous section discussed some of the higher-level concurrency issues encountered while porting the single-threaded virtual spectrophotometer to the CUDA platform. This section compares the execution times of the single-threaded uniprocessor-based spectrophotometer and the massively-multithreaded CUDA-based spectrophotometer.

5.1 Early design decisions

My Master's research requires the use of my research group's virtual spectrophotometer program. This single-threaded application often takes days to produce acceptably accurate results using commodity single-processor hardware. Improving the total execution time by even a few percent could save days of work per program invocation. One of the project goals involved porting the spectrophotometer to a GPGPU platform to improve total execution time. Using a GPGPU, the improvements to total execution time probably reduce the total power consumption as well, but that topic is beyond the scope of this paper.

Ultimately, a GPGPU implementation was selected over a multi-threaded approach on a multi-processor system because of limited access to resources. Moreover, the virtual spectrophotometer would not harness the memory bandwidth provided by a powerful multi-processor system; in terms of cost of operation, the spectrophotometer's low memory bandwidth requirement is a better match for GPGPU hardware. Focusing on idealized single-precision floating point performance, Table 1 suggests that a CUDA-based GPGPU application could rival or outperform similarly priced multi-processor based implementations. The spectrophotometer's low memory bandwidth requirement should allow it to operate near these idealized performance ratings. An in-depth analysis of this comparison could form the basis of future work (i.e. do non-memory-bandwidth-intensive applications run better on GPGPU clusters or on large multi-processor systems, on an execution time basis or a cost of operation basis?). This work could be augmented by profiling and characterizing the virtual spectrophotometer work-load. Curiosity about GPU programming also played a role in the decision to use a GPGPU platform. CUDA was selected as the GPGPU platform because of the availability of development and testing resources.

5.2 Experiment work-load

A spectrophotometer is a tool that measures the reflectance of a material (one of many properties used to describe appearance). It shines a beam of light at the material’s surface and then measures the reflectance, or intensity of reflected light (Figure 5). Reflectance depends on the light beam’s wavelength and incidence angle, so these parameters are varied to observe their relationship to the material [1]. A virtual spectrophotometer shines simulated light rays on a material model and counts the reflected light rays. Since each light ray reflects off the material randomly, many light rays must be used to get an accurate statistical characterization. This requires the simulation of hundreds of thousands of light rays per incidence angle and wavelength [1]. To measure how execution time varies with work-load, the number of rays (per incidence angle and wavelength) was varied logarithmically from 1 to 10 million, each with 90 different incidence angles. Diamond was selected as the test material and the wavelength was held constant to simplify the test setup. As a result, the heaviest work-load (10 million rays per angle) simulated 900 million rays.


Figure 5: Surface reflectance

5.3 Implementation

Seven different implementations of the virtual spectrophotometer were tested against each of the work-loads. These seven implementations test the effects of conditional branching, accumulation, thread arrangement, and block arrangement by comparing execution time. Execution time was measured in seconds using the C clock() function. Table 4 lists the properties of the seven implementations. The single-thread CPU implementation is the experiment constant.

Implementations using conditional branches were expected to be slower than those without, since branching can reduce parallelism (see Section 3.4). There are two types of ray counters: a shared global counter requiring atomic increments, and a distributed counter which uses the CPU to perform the final accumulation. The distributed counter was expected to be faster than the shared counter because atomic operations cannot occur in parallel (see Section 3.5).

There are two thread arrangements: (250 × 1 × 1) and (500 × 1 × 1). The (250 × 1 × 1) arrangement was expected to be faster because its multi-processor utilization is 97.5%, while the (500 × 1 × 1) arrangement has 65.1% utilization. There are two block arrangements: (100 × 1) and (40 × 1), giving 25000 and 10000 total threads respectively. The (100 × 1) arrangement was expected to be faster because it results in fewer kernel function calls and less barrier synchronization (with one thread per ray, the kernel must be called N = TOTAL_RAYS / 25000 or N = TOTAL_RAYS / 10000 times, with barrier synchronization between kernel calls). For example, if there are 100 million total rays and 25000 total threads per grid, the kernel must be called 4000 times. With 10000 total threads per grid, the kernel must be called 10000 times.


The CUDA emulation implementation (Impl. #6 in Table 4) uses the hypothesized fastest GPGPU setup (Impl. #1 in Table 4). I had no expectation of how its execution time would compare with the constant (single-thread) implementation, but I suspected that large thread overheads would make it slower than any of the five implementations run on the GPU.

Impl. #   Device      Counter Type   Branching?   Total threads   Grid arr.   Block arr.
1         GPU         distributed    no           25000           (100x1)     (250x1x1)
2         GPU         distributed    yes          25000           (100x1)     (250x1x1)
3         GPU         distributed    no           25000           (50x1)      (500x1x1)
4         GPU         distributed    no           10000           (40x1)      (250x1x1)
5         GPU         shared         no           25000           (100x1)     (250x1x1)
6         CPU (emu)   distributed    no           25000           (100x1)     (250x1x1)
7         CPU         shared         yes          1               n/a         n/a

Table 4: Virtual spectrophotometer implementations

5.4 Equipment

The following machine was used as the test platform:

• Dell XPS 1730 laptop
• Intel Core 2 Extreme X9000, 2.8 GHz, set to maximum performance
• 4 GB RAM
• Dual 250 GB 7200 RPM hard disks in RAID 0
• Dual NVIDIA GeForce 8800M GTX, SLI disabled (to ensure only one GPU is used)
• Kubuntu 8.10
• NVIDIA CUDA 2.0

5.5 Results

Execution times are shown in Figure 6 (execution time vs. work-load) and Table 5. Note that the last two data points for the CUDA emulation and the last data point for the single-thread implementation are projected; it was not possible to measure these because the CPU temperature jumped to an unsafe 95 °C. Also note that speed-up refers to how many times faster an implementation is than the single-thread (constant) implementation.


Figure 6: Execution time vs. work-load. Note that the GPGPU implementations have an overhead of 0.2 s, rendering them slower than the CPU until a work-load of at least 10^5 rays.

Impl.          Avg. Time (s)        Speed-up
7 (constant)   859.3 s (14.3 min)   n/a
1              5.8 s                148x
2              3.7 s                232x
3              5.9 s                146x
4              7.5 s                114x
5              278.0 s              3x
6 (emu)        7981.6 s             n/a

Table 5: Execution time and speed-up for the largest work-load (see Table 4 for implementation setup)

5.5.1 Result discussion

Of the seven implementations (Tables 4 and 5), two produced unexpected results. Implementation 2 (with conditional branches) and implementation 3 (low multi-processor utilization) were faster than expected.

I expected that an implementation with conditional branching (Impl. 2) would be slower than one without (Impl. 1), but it was observed to be 1.5 times faster. This could be caused by compiler optimizations (branch predication [7]). Or, it may be the result of fewer total instructions, which can occur when all threads of a warp (see Section 3.4) take the same branch path: only the instructions corresponding to the path taken are executed, unlike the no-branch implementation, which executes code from all paths (see Listing 21).

I expected that a block arrangement of (500 × 1 × 1) threads with a multi-processor utilization of 65.1% (Impl. 3) would be slower than an arrangement with 97.5% utilization (Impl. 1). However, the two implementations had the same execution time. This could occur if the maximum number of threads per multi-processor on the GeForce 8800M GTX were actually 1024 or 512, instead of the documented 768, since the utilization would then be the same for both implementations.

6 Conclusion

The purpose of this project was to improve the total execution time of my group's virtual spectrophotometer program using a GPGPU platform. Seven implementations were tested using a variety of work-loads and their execution times were compared. The experiments found that the GPGPU implementations are faster than a single-threaded CPU implementation. The fastest implementation used conditional branching, a distributed ray counter, and a 97.5% multi-processor utilization (see Table 4, Impl. 2). This implementation was 232 times faster than the original single-thread virtual spectrophotometer. In conclusion, my group's virtual spectrophotometer can achieve a significant improvement in execution time using a GPGPU.

7 Future Work

The experiments conducted in this project have demonstrated that a simple virtual spectrophotometer can be sped up by using a GPGPU. Other tools required for my research are a complex virtual spectrophotometer and a virtual goniophotometer.

The complex spectrophotometer measures surface and subsurface reflectance, absorption, and transmission. These properties provide a more complete description of a material's appearance. This implementation is complex: it requires a geometric material model and more rays. (The simple spectrophotometer does not require a geometric model of the material, since light only interacts with the surface at a single point; in the complex spectrophotometer, light can also interact with the sub-surface structure.) I suspect that a GPGPU could improve its execution time.


I also suspect that a GPGPU could improve the execution time of a virtual goniophotometer. This device is similar to a spectrophotometer, but it also records the direction of reflected and transmitted light. Virtual goniophotometers are slower than spectrophotometers because they require more rays to achieve results of the same accuracy.


References

[1] Baranoski, G., Rokne, J., and Xu, G. Virtual spectrophotometric measurements for biologically and physically-based rendering. In PG '00: Proceedings of the 8th Pacific Conference on Computer Graphics and Applications (Washington, DC, USA, 2000), IEEE Computer Society, p. 398.

[2] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., and Skadron, K. A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput. 68, 10 (2008), 1370–1380.

[3] Wikipedia. en.wikipedia.org.

[4] Halfhill, T. R. Parallel processing with CUDA: NVIDIA's high-performance computing platform uses massive multithreading. Microprocessor Report (2008).

[5] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. GPU computing. Proceedings of the IEEE 96, 5 (2008), 879.

[6] Monteyne, M. RapidMind Multi-Core Development Platform. RapidMind, 2008.

[7] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA, 2008.
