CUDA 7.0 Performance Report, May 2015
CUDA 7.0 Performance Report

• cuFFT: Fast Fourier Transforms Library
• cuBLAS: Complete BLAS Library
• cuSPARSE: Sparse Matrix Library
• cuSOLVER: Linear Solver Library (new in CUDA 7.0)
• cuRAND: Random Number Generation (RNG) Library
• NPP: Performance Primitives for Image & Video Processing
• Thrust: Templated Parallel Algorithms & Data Structures
• math.h: C99 floating-point Library
• cuDNN: Deep Neural Net building blocks

Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries
cuFFT: Multi-dimensional FFTs

• Real and complex, single- and double-precision data types
• 1D, 2D and 3D batched transforms
• Flexible input and output data layouts
• Also supports simple "drop-in" replacement of a CPU FFTW library
• XT interface supports up to 4 GPUs
• Device callbacks optimize use cases such as FFT + datatype conversion
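The features above can be sketched with cuFFT's host API. A minimal example of a batched 1D single-precision complex-to-complex transform; sizes are illustrative and error checking is omitted for brevity:

```cuda
// Sketch: batched 1D C2C FFT with cuFFT (CUDA 7.0-era API).
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;      // transform size
    const int batch = 1000;  // number of independent transforms

    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    // Plan creation is excluded from the timings in this report,
    // so create the plan once and reuse it across many executions.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```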
cuFFT: up to 700 GFLOPS

[Chart: GFLOPS vs. transform size (1 to 1,000,000) for 1D complex batched FFTs, with separate series for sizes that are powers of 2, 3, 5 and 7; single precision peaks near 700 GFLOPS, double precision near 300 GFLOPS. 1D complex batched FFTs are used in signal processing and as a foundation for 2D and 3D FFTs.]

• cuFFT 7.0 on K40m, base clocks, ECC ON
• Batched transforms on 28M-33M total elements, input and output data on device
• Excludes time to create cuFFT "plans"
• Performance may vary based on OS and software versions, and motherboard configuration
New in CUDA 7.0: cuFFT up to 3x faster for sizes that are composites of small primes

[Chart: CUDA 7.0 vs. CUDA 6.5 speedup for 1D single-precision complex-to-complex transforms across sizes 0-140; labeled sizes 15, 30, 31, 121 and 127 show speedups between roughly 1x and 5x.]

• cuFFT 6.5 and 7.0 on K20m, ECC ON
• Batched transforms on 32M total elements, input and output data on device
• Excludes time to create cuFFT "plans"
cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation plus useful extensions:
• Supports all 152 standard routines for single, double, complex, and double complex
• Host- and device-callable interface
• Batched routines for higher performance on small problem sizes

XT interface for Level 3 BLAS:
• Distributed computations across multiple GPUs
• Out-of-core streaming to GPU, no upper limit on matrix size
• "Drop-in" BLAS intercepts CPU BLAS calls, streams to GPU
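As a sketch of the standard interface, a single-precision GEMM call matching the benchmark setup used later in this report (m = n = k = 4096, no transpose); matrix contents and error checking are omitted:

```cuda
// Sketch: C = alpha*A*B + beta*C with cuBLAS.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage, like the Fortran BLAS it mirrors.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```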
cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision

[Chart: GFLOPS (0 to 3,500) for GEMM, SYMM, TRSM and SYRK in single (S), single complex (C), double (D) and double complex (Z) precision; SGEMM exceeds 3,000 GFLOPS.]

• cuBLAS 7.0 on K40m, base clocks, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
cuBLAS-XT: Multi-GPU Performance Scaling, >12 TF on a single node

[Chart: GFLOPS (0 to 14,000) for GEMM, SYRK and TRSM in S/C/D/Z precisions, comparing 1x K80 against 3x K80.]

• cuBLAS 7.0 on K80, base clocks, ECC ON
• Input and output data on host, m=n=k=32768, transpose=no
cuSPARSE: Sparse Linear Algebra Routines

[Illustration: sparse matrix-vector multiply, y = α·A·x + β·y, with the sparse matrix A stored in a compressed format.]

• Optimized sparse linear algebra BLAS routines for matrix-vector, matrix-matrix, and triangular solve
• Support for a variety of formats (CSR, COO, block variants)
• Incomplete-LU and Cholesky preconditioners
• New in CUDA 7.0: Graph Coloring
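The matrix-vector operation illustrated above maps onto the cusparse?csrmv routine in the CUDA 7.0-era API. A sketch, assuming the device arrays (CSR values, row pointers, column indices, and the x and y vectors) are already populated:

```cuda
// Sketch: y = alpha*A*x + beta*y, A in CSR format, via cuSPARSE.
#include <cusparse.h>

void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const float *csrVal, const int *csrRowPtr, const int *csrColInd,
          const float *x, float *y) {
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}
```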
cuSPARSE: 4x Faster than MKL

[Chart: Sparse matrix x dense vector (SpMV) speedup over MKL, roughly 2x-4x across a range of test matrices.]

• Average of S/C/D/Z routines
• cuSPARSE 7.0 on K40m, base clocks, ECC ON, input and output data on device
• MKL 11.0.4 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
New in CUDA 7.0: Graph Coloring & Preconditioned CG Solver

Without coloring:
  Input matrix → Analysis and ilu0 (cuSPARSE) → Solve (cuSPARSE)
  Execution time of the ilu0 and Solve phases is limited by the available parallelism in the input matrix.

With coloring:
  Input matrix → Graph coloring (cuSPARSE) → Reorder matrix (Thrust) → Analysis and ilu0 (cuSPARSE) → Solve (cuSPARSE)
  Coloring and reordering the matrix exposes more parallelism, so the ilu0 and Solve phases run faster.
New in CUDA 7.0: cuSPARSE Graph Coloring Speeds Up Factorization

[Chart: Speedup of incomplete LU factorization (ILU0) after graph coloring across test matrices; most bars fall between 1x and 5x, with outliers at 6x, 9x, 20x and 28x.]

• Full results at: research.nvidia.com/publication/parallel-graph-coloring-applications-incomplete-lu-factorization-gpu
• cuSPARSE 7.0 on K40c
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
New in CUDA 7.0: cuSOLVER

• cusolverDN: dense Cholesky, LU, SVD, and (batched) QR. Applications: optimization, computer vision, CFD.
• cusolverSP: sparse direct solvers and eigensolvers. Applications: Newton's method, oil & gas well models.
• cusolverRF: sparse refactorization solver. Applications: chemistry, ODEs, combustion, circuit simulation.
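A minimal sketch of the cusolverDn path, here a dense single-precision Cholesky factorization (SPOTRF) of an n x n symmetric positive-definite matrix already resident on the device; error checking is omitted:

```cuda
// Sketch: dense Cholesky factorization with cusolverDn.
#include <cusolverDn.h>
#include <cuda_runtime.h>

void cholesky(int n, float *A, int lda) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query workspace size, then factor in place.
    int lwork = 0;
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, A, lda, &lwork);

    float *work;  cudaMalloc(&work, sizeof(float) * lwork);
    int *devInfo; cudaMalloc(&devInfo, sizeof(int));  // 0 on success

    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, A, lda,
                     work, lwork, devInfo);

    cudaFree(work); cudaFree(devInfo);
    cusolverDnDestroy(handle);
}
```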
New in CUDA 7.0: cuSOLVER Dense vs. MKL

[Chart: GFLOPS (0 to 1,800) for Cholesky (POTRF), LU (GETRF) and QR (GEQRF) factorizations in S/C/D/Z precisions, cuSOLVER vs. MKL; cuSOLVER leads in every case.]

• cuSOLVER 7.0 on K40c, ECC ON, M=N=4096
• MKL 11.0.4 on Intel Xeon Haswell 14-core E5-2697 v3 @ 3.6GHz
New in CUDA 7.0: cuSOLVER Sparse QR Analysis, Factorization and Solve

[Chart: Speedup over CPU for the matrices 1138_bus, Chem97ZtZ, Muu, ex9 and nasa1824; individual speedups of 11.3x, 2.0x, 1.9x, 1.4x and 1.2x.]

• cuSOLVER 7.0 on K40c, ECC ON
• SuiteSparse v4.4 on Intel Xeon Haswell 14-core E5-2697 v3 @ 3.6GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
cuRAND: Random Number Generation

Generating high-quality random numbers in parallel is hard. Don't do it yourself, use a library!

• Pseudo- and quasi-RNGs, including Mersenne Twister 19937
• Supports several output distributions
• Statistical test results in the documentation
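A sketch of cuRAND's host API, generating uniform doubles directly into device memory with the MRG32k3a generator (one of the generators benchmarked below); error checking is omitted:

```cuda
// Sketch: bulk uniform-double generation on the device with cuRAND.
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t count = 1 << 20;
    double *d_out;
    cudaMalloc(&d_out, sizeof(double) * count);

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    curandGenerateUniformDouble(gen, d_out, count);  // results stay on the device
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(d_out);
    return 0;
}
```

Swapping the generator constant (e.g. CURAND_RNG_QUASI_SOBOL32) or the generate call (e.g. curandGenerateNormalDouble) selects the other generator/distribution combinations shown in the charts.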
cuRAND: >50x Faster vs. Intel MKL

[Chart: GSamples/sec (0 to 16) for Sobol32 and MRG32k3a generators under uniform, normal and log-normal distributions, cuRAND vs. MKL.]

• cuRAND 7.0 on K40m, base clocks, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
cuRAND: High Performance RNGs

[Chart: GSamples/sec (0 to 18) under uniform, normal and log-normal distributions for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (scrambled Sobol32, scrambled Sobol64).]

• cuRAND 7.0 on K40m, base clocks, ECC ON, double-precision input and output data on device
Thrust: CUDA C++ Parallel Template Library

• Host and device containers that mimic the C++ STL
• Optimized parallel algorithms for sort, reduce, scan, etc.
• TBB and OpenMP CPU backends: performance portable
• Allows applications and prototypes to be built quickly
• Also available on GitHub: thrust.github.com
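The container/algorithm style above in a minimal sketch, sorting integers on the GPU (the workload benchmarked on the next slide uses 32M samples):

```cuda
// Sketch: GPU sort with Thrust containers and algorithms.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h(32 << 20);          // 32M samples
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;              // copy to the device
    thrust::sort(d.begin(), d.end());              // parallel sort on the GPU

    h = d;                                         // copy sorted data back
    return 0;
}
```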
New in CUDA 7.0: Thrust Performance Improvements

• sort: 1.1x - 1.8x faster (3x for user-defined types)
• merge: 2x faster
• scan: 1.15x faster
• reduce_by_key: 1.25x faster

[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5 (32M samples), for char, short, int, long, float and double; per-type speedups range from 1.1x to 1.8x.]

New CUDA streams argument (concurrency between threads):
thrust::count_if(thrust::cuda::par.on(stream1), text, text+n, myFunc());

• CUDA 7.0 and 6.5 on K40m, ECC ON, input and output data on device
NPP: NVIDIA Performance Primitives

Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation, median filter, BGR/YUV conversion, 3D LUT color conversion.
NPP Speedup vs. Intel IPP

[Chart: Speedups of 22.4x, 19.9x, 19.9x, 10.7x, 5.3x and 5.0x across six operations: image set (8-bit RGB), image set channel (8-bit RGB), image resize (8-bit RGB), image Gaussian filter (32-bit float), color conversion 8-bit YUV422 to 8-bit RGB, and JPEG 8x8 forward DCT.]

• NPP 7.0 on K40m, ECC ON, base clocks, input and output data on device
• IPP 7.0 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
NPP Speedup vs. Intel IPP by Image Processing Function Group

• Filter: 4.0x
• Statistics: 5.5x
• Geometry Transforms: 9.1x
• JPEG: 15.2x
• Morphological: 5.5x
• Linear Transform: 7.6x
• Color Processing: 71.6x
• Color Conversion: 8.9x
• Threshold And Compare: 4.1x
• Alpha Composition: 10.8x
• Logical: 5.9x
• Arithmetic: 12.3x
• Data Exchange And Initialization: 9.8x

• NPP 7.0 on K40m, ECC ON, base clocks, input and output data on device
• Each value is the average speedup over all routines in the function group
• IPP 7.0 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
math.h: C99 floating-point library + extras

CUDA math.h is industry proven, high performance, accurate.
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Bessel: j0, j1, jn, y0, y1, yn, cyl_bessel_i0, cyl_bessel_i1
• Vector SIMD: vadd, vsub, vavrg, vabsdiff, vmin, vmax, vset
• Extras: rsqrt, rhypot, rcbrt, exp10, sinpi, sincos[pi], cospi, erf[c]inv, normcdf[inv]

New in CUDA 7.0:
• Significantly optimized double-precision reciprocal, rcp()
• 3D/4D Euclidean norms: [r]norm3d[f], norm4d[f]
NVIDIA Releases cuDNN Version 2

• Accelerates key routines to improve performance of neural net training
• Up to 1.8x faster on AlexNet than a baseline GPU implementation
• New support for 3D convolutions
• Integrated into all major deep learning frameworks: Caffe, Theano, Torch

[Chart: Millions of images trained per day (Caffe AlexNet): 2.5M on a 16-core CPU (E5-2698 v3 @ 2.3GHz), 18M on GTX Titan, 23M on Titan Black with cuDNN v1, and 43M on Titan X with cuDNN v2.]

[Chart: Speedup with cuDNN over a 1.0x GPU baseline: 1.6x and 1.2x for Caffe (GoogLeNet) and Torch (OverFeat).]
CUDA Registered Developer Program

Sign up for free at: www.nvidia.com/paralleldeveloper
• Exclusive access to pre-release CUDA installers
• Submit bugs and feature requests to NVIDIA
• Keep informed about latest releases and training opportunities
• Access to exclusive downloads
• Exclusive activities and special offers