CUDA 7.0 Performance Report, May 2015
CUDA 7.0 Performance Report

• cuFFT: Fast Fourier Transforms Library
• cuBLAS: Complete BLAS Library
• cuSPARSE: Sparse Matrix Library
• cuSOLVER: Linear Solver Library (new in CUDA 7.0)
• cuRAND: Random Number Generation (RNG) Library
• NPP: Performance Primitives for Image & Video Processing
• Thrust: Templated Parallel Algorithms & Data Structures
• math.h: C99 floating-point Library
• cuDNN: Deep Neural Net building blocks

Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries
cuFFT: Multi-dimensional FFTs

• Real and complex, single- and double-precision data types
• 1D, 2D and 3D batched transforms
• Flexible input and output data layouts
• Also supports simple "drop-in" replacement of a CPU FFTW library
• XT interface supports up to 4 GPUs
• Device callbacks optimize use cases such as FFT + datatype conversion
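The features above can be sketched with cuFFT's host API. A minimal example of a batched 1D single-precision complex-to-complex transform; sizes are illustrative and error checking is omitted for brevity:

```cuda
// Sketch: batched 1D C2C FFT with cuFFT (CUDA 7.0-era API).
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;      // transform size
    const int batch = 1000;  // number of independent transforms

    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    // Plan creation is excluded from the timings in this report,
    // so create the plan once and reuse it across many executions.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```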
cuFFT: up to 700 GFLOPS

[Chart: GFLOPS vs. transform size (1 to 1,000,000) for 1D complex batched FFTs, with separate series for sizes that are powers of 2, 3, 5 and 7; single precision peaks near 700 GFLOPS, double precision near 300 GFLOPS. 1D complex batched FFTs are used in signal processing and as a foundation for 2D and 3D FFTs.]

• cuFFT 7.0 on K40m, base clocks, ECC ON
• Batched transforms on 28M-33M total elements, input and output data on device
• Excludes time to create cuFFT "plans"
• Performance may vary based on OS and software versions, and motherboard configuration
New in CUDA 7.0: cuFFT up to 3x faster for sizes that are composites of small primes

[Chart: CUDA 7.0 vs. CUDA 6.5 speedup for 1D single-precision complex-to-complex transforms across sizes 0-140; labeled sizes 15, 30, 31, 121 and 127 show speedups between roughly 1x and 5x.]

• cuFFT 6.5 and 7.0 on K20m, ECC ON
• Batched transforms on 32M total elements, input and output data on device
• Excludes time to create cuFFT "plans"
cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation plus useful extensions:
• Supports all 152 standard routines for single, double, complex, and double complex
• Host- and device-callable interface
• Batched routines for higher performance on small problem sizes

XT interface for Level 3 BLAS:
• Distributed computations across multiple GPUs
• Out-of-core streaming to GPU, no upper limit on matrix size
• "Drop-in" BLAS intercepts CPU BLAS calls, streams to GPU
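As a sketch of the standard interface, a single-precision GEMM call matching the benchmark setup used later in this report (m = n = k = 4096, no transpose); matrix contents and error checking are omitted:

```cuda
// Sketch: C = alpha*A*B + beta*C with cuBLAS.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage, like the Fortran BLAS it mirrors.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```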
cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision

[Chart: GFLOPS (0 to 3,500) for GEMM, SYMM, TRSM and SYRK in single (S), single complex (C), double (D) and double complex (Z) precision; SGEMM exceeds 3,000 GFLOPS.]

• cuBLAS 7.0 on K40m, base clocks, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
cuBLAS-XT: Multi-GPU Performance Scaling, >12 TF on a single node

[Chart: GFLOPS (0 to 14,000) for GEMM, SYRK and TRSM in S/C/D/Z precisions, comparing 1x K80 against 3x K80.]

• cuBLAS 7.0 on K80, base clocks, ECC ON
• Input and output data on host, m=n=k=32768, transpose=no
cuSPARSE: Sparse Linear Algebra Routines

[Illustration: sparse matrix-vector multiply, y = α·A·x + β·y, with the sparse matrix A stored in a compressed format.]

• Optimized sparse linear algebra BLAS routines for matrix-vector, matrix-matrix, and triangular solve
• Support for a variety of formats (CSR, COO, block variants)
• Incomplete-LU and Cholesky preconditioners
• New in CUDA 7.0: Graph Coloring
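The matrix-vector operation illustrated above maps onto the cusparse?csrmv routine in the CUDA 7.0-era API. A sketch, assuming the device arrays (CSR values, row pointers, column indices, and the x and y vectors) are already populated:

```cuda
// Sketch: y = alpha*A*x + beta*y, A in CSR format, via cuSPARSE.
#include <cusparse.h>

void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const float *csrVal, const int *csrRowPtr, const int *csrColInd,
          const float *x, float *y) {
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}
```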
cuSPARSE: 4x Faster than MKL

[Chart: Sparse matrix x dense vector (SpMV) speedup over MKL, roughly 2x-4x across a range of test matrices.]

• Average of S/C/D/Z routines
• cuSPARSE 7.0 on K40m, base clocks, ECC ON, input and output data on device
• MKL 11.0.4 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
New in CUDA 7.0: Graph Coloring & Preconditioned CG Solver

Without coloring:
  Input matrix → Analysis and ilu0 (cuSPARSE) → Solve (cuSPARSE)
  Execution time of the ilu0 and Solve phases is limited by the available parallelism in the input matrix.

With coloring:
  Input matrix → Graph coloring (cuSPARSE) → Reorder matrix (Thrust) → Analysis and ilu0 (cuSPARSE) → Solve (cuSPARSE)
  Coloring and reordering the matrix exposes more parallelism, so the ilu0 and Solve phases run faster.
New in CUDA 7.0: cuSPARSE Graph Coloring Speeds Up Factorization

[Chart: Speedup of incomplete LU factorization (ILU0) after graph coloring across test matrices; most bars fall between 1x and 5x, with outliers at 6x, 9x, 20x and 28x.]

• Full results at: research.nvidia.com/publication/parallel-graph-coloring-applications-incomplete-lu-factorization-gpu
• cuSPARSE 7.0 on K40c
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
New in CUDA 7.0: cuSOLVER

• cusolverDN: dense Cholesky, LU, SVD, and (batched) QR. Applications: optimization, computer vision, CFD.
• cusolverSP: sparse direct solvers and eigensolvers. Applications: Newton's method, oil & gas well models.
• cusolverRF: sparse refactorization solver. Applications: chemistry, ODEs, combustion, circuit simulation.
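A minimal sketch of the cusolverDn path, here a dense single-precision Cholesky factorization (SPOTRF) of an n x n symmetric positive-definite matrix already resident on the device; error checking is omitted:

```cuda
// Sketch: dense Cholesky factorization with cusolverDn.
#include <cusolverDn.h>
#include <cuda_runtime.h>

void cholesky(int n, float *A, int lda) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query workspace size, then factor in place.
    int lwork = 0;
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, A, lda, &lwork);

    float *work;  cudaMalloc(&work, sizeof(float) * lwork);
    int *devInfo; cudaMalloc(&devInfo, sizeof(int));  // 0 on success

    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, A, lda,
                     work, lwork, devInfo);

    cudaFree(work); cudaFree(devInfo);
    cusolverDnDestroy(handle);
}
```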
New in CUDA 7.0: cuSOLVER Dense vs. MKL

[Chart: GFLOPS (0 to 1,800) for Cholesky (POTRF), LU (GETRF) and QR (GEQRF) factorizations in S/C/D/Z precisions, cuSOLVER vs. MKL; cuSOLVER leads in every case.]

• cuSOLVER 7.0 on K40c, ECC ON, M=N=4096
• MKL 11.0.4 on Intel Xeon Haswell 14-core E5-2697 v3 @ 3.6GHz
New in CUDA 7.0: cuSOLVER Sparse QR Analysis, Factorization and Solve

[Chart: Speedup over CPU for the matrices 1138_bus, Chem97ZtZ, Muu, ex9 and nasa1824; individual speedups of 11.3x, 2.0x, 1.9x, 1.4x and 1.2x.]

• cuSOLVER 7.0 on K40c, ECC ON
• SuiteSparse v4.4 on Intel Xeon Haswell 14-core E5-2697 v3 @ 3.6GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
cuRAND: Random Number Generation

Generating high-quality random numbers in parallel is hard. Don't do it yourself, use a library!

• Pseudo- and quasi-RNGs, including Mersenne Twister 19937
• Supports several output distributions
• Statistical test results in the documentation
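A sketch of cuRAND's host API, generating uniform doubles directly into device memory with the MRG32k3a generator (one of the generators benchmarked below); error checking is omitted:

```cuda
// Sketch: bulk uniform-double generation on the device with cuRAND.
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t count = 1 << 20;
    double *d_out;
    cudaMalloc(&d_out, sizeof(double) * count);

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    curandGenerateUniformDouble(gen, d_out, count);  // results stay on the device
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(d_out);
    return 0;
}
```

Swapping the generator constant (e.g. CURAND_RNG_QUASI_SOBOL32) or the generate call (e.g. curandGenerateNormalDouble) selects the other generator/distribution combinations shown in the charts.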
cuRAND: >50x Faster vs. Intel MKL

[Chart: GSamples/sec (0 to 16) for Sobol32 and MRG32k3a generators under uniform, normal and log-normal distributions, cuRAND vs. MKL.]

• cuRAND 7.0 on K40m, base clocks, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
cuRAND: High Performance RNGs

[Chart: GSamples/sec (0 to 18) under uniform, normal and log-normal distributions for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (scrambled Sobol32, scrambled Sobol64).]

• cuRAND 7.0 on K40m, base clocks, ECC ON, double-precision input and output data on device
Thrust: CUDA C++ Parallel Template Library

• Host and device containers that mimic the C++ STL
• Optimized parallel algorithms for sort, reduce, scan, etc.
• TBB and OpenMP CPU backends: performance portable
• Allows applications and prototypes to be built quickly
• Also available on GitHub: thrust.github.com
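The container/algorithm style above in a minimal sketch, sorting integers on the GPU (the workload benchmarked on the next slide uses 32M samples):

```cuda
// Sketch: GPU sort with Thrust containers and algorithms.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h(32 << 20);          // 32M samples
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;              // copy to the device
    thrust::sort(d.begin(), d.end());              // parallel sort on the GPU

    h = d;                                         // copy sorted data back
    return 0;
}
```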
New in CUDA 7.0: Thrust Performance Improvements

• sort: 1.1x - 1.8x faster (3x for user-defined types)
• merge: 2x faster
• scan: 1.15x faster
• reduce_by_key: 1.25x faster

[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5 (32M samples), for char, short, int, long, float and double; per-type speedups range from 1.1x to 1.8x.]

New CUDA streams argument (concurrency between threads):
thrust::count_if(thrust::cuda::par.on(stream1), text, text+n, myFunc());

• CUDA 7.0 and 6.5 on K40m, ECC ON, input and output data on device
NPP: NVIDIA Performance Primitives

Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation, median filter, BGR/YUV conversion, 3D LUT color conversion.
NPP Speedup vs. Intel IPP

[Chart: Speedups of 22.4x, 19.9x, 19.9x, 10.7x, 5.3x and 5.0x across six operations: image set (8-bit RGB), image set channel (8-bit RGB), image resize (8-bit RGB), image Gaussian filter (32-bit float), color conversion 8-bit YUV422 to 8-bit RGB, and JPEG 8x8 forward DCT.]

• NPP 7.0 on K40m, ECC ON, base clocks, input and output data on device
• IPP 7.0 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
NPP Speedup vs. Intel IPP by Image Processing Function Group

• Filter: 4.0x
• Statistics: 5.5x
• Geometry Transforms: 9.1x
• JPEG: 15.2x
• Morphological: 5.5x
• Linear Transform: 7.6x
• Color Processing: 71.6x
• Color Conversion: 8.9x
• Threshold And Compare: 4.1x
• Alpha Composition: 10.8x
• Logical: 5.9x
• Arithmetic: 12.3x
• Data Exchange And Initialization: 9.8x

• NPP 7.0 on K40m, ECC ON, base clocks, input and output data on device
• Each value is the average speedup over all routines in the function group
• IPP 7.0 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
math.h: C99 floating-point library + extras

CUDA math.h is industry proven, high performance, accurate.
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Bessel: j0, j1, jn, y0, y1, yn, cyl_bessel_i0, cyl_bessel_i1
• Vector SIMD: vadd, vsub, vavrg, vabsdiff, vmin, vmax, vset
• Extras: rsqrt, rhypot, rcbrt, exp10, sinpi, sincos[pi], cospi, erf[c]inv, normcdf[inv]

New in CUDA 7.0:
• Significantly optimized double-precision reciprocal, rcp()
• 3D/4D Euclidean norms: [r]norm3d[f], norm4d[f]
NVIDIA Releases cuDNN Version 2

• Accelerates key routines to improve performance of neural net training
• Up to 1.8x faster on AlexNet than a baseline GPU implementation
• New support for 3D convolutions
• Integrated into all major deep learning frameworks: Caffe, Theano, Torch

[Chart: Millions of images trained per day (Caffe AlexNet): 2.5M on a 16-core CPU (E5-2698 v3 @ 2.3GHz), 18M on GTX Titan, 23M on Titan Black with cuDNN v1, and 43M on Titan X with cuDNN v2.]

[Chart: Speedup with cuDNN over a 1.0x GPU baseline: 1.6x and 1.2x for Caffe (GoogLeNet) and Torch (OverFeat).]
CUDA Registered Developer Program

Sign up for free at: www.nvidia.com/paralleldeveloper
• Exclusive access to pre-release CUDA installers
• Submit bugs and feature requests to NVIDIA
• Keep informed about latest releases and training opportunities
• Access to exclusive downloads
• Exclusive activities and special offers