Altera OpenCL Overview - Meetup

3 downloads 229 Views 3MB Size Report
10 gigabit network data. Violin Accelerates Oracle flash Memory Arrays provide a simple, reliable solution for scalable
Accelerated Computing Using FPGAs and OpenCL Nick Finamore – Altera MDM

© 2012 Altera Corporation

Agenda    

 

Emerging FPGA acceleration examples Performance scaling with FPGAs Deep pipelined parallelism Compiling OpenCL kernels to FPGAs OpenCL kernel porting example Mapping multi-threaded kernels to FPGAs  Pipeline Vector Add Example

 

OpenCL roadmap Summary

© 2012 Altera Corporation

Emerging FPGA based solutions Arista FPGA switch revolutionize Layer 4-7 network services Many network engineers might dismiss Arista Networks' new field programmable gate arrays (FPGA) switch as a cool, niche product, but it has the potential to turn the Layer 4-7 network services industry……….

IBM buys Netezza for $1.7B -FPGA hosts disks

Violin Accelerates Oracle Maxeler Delivers New FPGA-Powered Connectivity Platform for HPC Users Maxeler, a company specialized in developing accelerator cards for the high-performance computing sector, has recently announced the introduction of the MaxNode10G, a platform designed for wire-speed processing of multiple 10 gigabit network data

© 2012 Altera Corporation

flash Memory Arrays provide a simple, reliable solution for scalable high performance storage. Data intensive applications like Databases, Business Intelligence/Analytics, and Data Warehousing often suffer from storage I/O bottlenecks that cripple performance. Flash Memory Arrays ensure the lowest latency data access…..

FPGA Computing Acceleration Example (circa 2008) 

~10X the quad core cpu performance  For Mersenne Twister + Monte Carlo Black Scholes

 In double precision floating point, versus single!



FPGA implementation:  Six pipelines with one result per clock for each pipeline  Resources available (768) DSP blocks in the 3SL260 (65nm) XtremeData XD2000i  www.altera.com/literature/wp/wp-01031-coprocessing-evolution.pdf



Performance scales with resources: 

65nm (768 DSP)  40nm (1288 DSP)  28nm (4096 DSP) !! (2) Quad-core nVidia 8800 Monte Carlo cpu's Precision Single Single Paths per sec. ~240M ~900M RNG Type Halton Halton Source RapidMind RapidMind Note: Cpu and gpu results @1M paths per option.

© 2012 Altera Corporation

XD2000i w/(2) 3S260's Double 1.8G Mersenne XtremeData

Growing FPGA Design Complexity 40-/100-Gbps Ethernet

~16x

Density/Performance

Density Performance

DDR2, PCI Express

~4x ~3.5x ~2.25x PCI

1x

1x

180 nm

130 nm

90 nm

Trends © 2012 Altera Corporation 5

65 nm

40 nm and beyond

Modern Altera FPGA: Massively Parallel Build extremely deep pipelines with wide parallelism Load

Load

Load

Load

Load

Load

PCIe Store

Load

Load

Store

Load

Load

•Map coarse grained thread parallelism to fine-grained FPGA parallelism

Store

Load

Load

DDRx Store

© 2012 Altera Corporation 6

Store

Deep Pipelined Parallelism

Store

•Create deeply pipelined representation of a kernel replicated across FPGA •On each clock cycle, send in input data for a new thread

The FPGA programming challenge Typical design flow based on HDL Project definition

Project creation

Design entry/RTL***Verilog coding and early pin planning Design creation

• Behavioral or structural description of design • Early pin planning allows board development in parallel

Functional verification Synthesis (mapping)

• Verify design behavior Functional verification

Logic

Memory

I/O Design compilation

Functional verification

• Translate design into device-specific primitives • Optimization to meet required area and performance constraints

Placement and routing (fitting) • Place design in specific device resources with reference to area and performance constraints • Connect resources with routing lines

Timing analysis

• Verify design will work in target technology Functional verification

• Verify performance specifications were met • Static timing analysis

PC board simulation and test

In-system debug

© 2012 Altera Corporation 7

• Simulate board design • Program and test device on board • On-chip tools for debugging

OpenCL Overview (Open Computing Language)



OpenCL is a SW programming model:  Standard C Language (C99)  OpenCL C extensions (adds parallelism to C)  API (Open standard for different devices)



Provides increased performance  CPU offload  Performance via HW acceleration



The consortium (Short List):

© 2012 Altera Corporation 8

Host CPU

Hardware Acceleration

OpenCL Enables Portability OpenCL code base

True portability across heterogeneous platforms •Performance optimized for each

Multicore CPU CPU

CPU

CPU

FPGA with Integrated CPU

GPU

© 2012 Altera Corporation

FPGA

GPU

FPGA

FPGA

OpenCL Programming Model Accelerator

Host

Typical challenges:  Global/local memory bandwidth  Limited floating point cores  Thread occupancy

© 2012 Altera Corporation 10

Processor Accelerator Accelerator Accelerator

Local Mem Local Mem Local Local Mem Mem



Global Mem

main() { read_data( … ); maninpulate( … ); clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); display_result( … ); }

__kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }

OpenCL Host Program  

Pure software written in standard „C‟ Communicates with the Accelerator Device via a set of library routines which abstract the communication between the host processor and the kernels main() { Copy data from Host to FPGA

read_data_from_file( … ); maninpulate_data( … );

Ask the FPGA to run a particular kernel Copy data from FPGA to Host

© 2012 Altera Corporation 11

clEnqueueWriteBuffer( … ); clEnqueueTask(…, my_kernel, …); clEnqueueReadBuffer( … ); display_result_to_user( … ); }

OpenCL Kernels 

Data-parallel function

__kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }

 Defines many parallel threads

of execution  Each thread has an identifier specified by “get_global_id”  Contains keyword extensions to specify parallelism and memory hierarchy



Executed by compute float object  CPU  GPU

*a =

0

1

2

3

4

5

6

7

float *b =

7

6

5

4

3

2

1

0

 Accelerator

__kernel void sum( … ); float *answer =

© 2012 Altera Corporation 12

7

7

7

7

7

7

7

7

Mapping OpenCL Programs OpenCL Host Program + Kernels

ACL Compiler

Standard C Compiler

SOF

X86 binary

PCIe

x86 © 2012 Altera Corporation 13

Compiling OpenCL to Altera FPGAs

Host Program

__kernel void sum(__global const float *a, __global const float *b, __global float *answer) __kernel void { sum(__global const float *a, int xid = get_global_id(0); __global const float + *b, answer[xid] = a[xid] b[xid]; __global float *answer) } { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }

main() { read_data_from_file( … ); maninpulate_data( … );

Kernel Programs

OpenCL Load Load Load Host Program + Kernels Load

ACL Compiler

Load

PCIe

Standard C Compiler

display_result_to_user( … );

} Store SOF

Load

Load

Store X86 binary Load

Load

Load

clEnqueueWriteBuffer( … ); clEnqueueKernel(…, sum, …); clEnqueueReadBuffer( … );

Store

Load

Load

PCIe

DDRx Store

© 2012 Altera Corporation 14

Store

x86

Store

FPGA OpenCL Architecture Example FPGA PCIe

External Memory Controller & PHY

DDR*

x86 / External Processor

External Memory Controller & PHY

Global Memory Interconnect M9K

M9K M9K

M9K Kernel Pipeline

Kernel Pipeline

Kernel Pipeline

M9K M9K

Local Memory Interconnect

Limited external memory bandwidth Extremely high internal memory bandwidth Highly customizable compute cores © 2012 Altera Corporation 15

Traditional “C to Gates” Approach Logic Only 

For ASIC & FPGA Designers: 

Logic to RTL Solution  Map C algorithms to ip  For ASIC or FPGA



Algorithms maintained in C

Host CPU

FPGA

User program

 Development, Enhancements, Testing



Logic to RTL Examples (2010): 



Driver

Logic

e.g. Mentor Catapult C

Designer adds a few missing pieces: 

Data Management in/out of the ip core  Data Management to external memory  Also called “Platform Support Package”

16

OpenCL is a different approach

© 2012 Altera Corporation

Mapping Multithreaded Kernels to FPGAs 

The simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread

Load

Load

 Inefficient and wasteful 

Better method deep pipeline parallelism  Map coarse grained thread parallelism

to fine-grained FPGA parallelism  Create a deeply pipelined representation of a kernel  Send in new data and retire a new result on each clock cycle

© 2012 Altera Corporation 17

Store

Load

Load

Store

Example Pipeline for Vector Add 8 threads for vector add example 0

Load

1

2

3

4

5

6

7

Load

Thread IDs +



Store



© 2012 Altera Corporation 18

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 1

2

3

4

5

6

7

0

Load

Load

Thread IDs +



Store



© 2012 Altera Corporation 19

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 2

3

4

5

6

7

1

Load

Load

Thread IDs

0 +



Store



© 2012 Altera Corporation 20

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 3

4

5

6

7

2

Load

Load

Thread IDs

1 + 0

Store

© 2012 Altera Corporation 21





On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 4

5

6

7

3

Load

Load

Thread IDs

2 + 1

Store 0

© 2012 Altera Corporation 22





On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

High Definition Rendering CODEC: An example of OpenCL kernel porting Up and Running in a day! Optimized within a few days

23 © 2012 Altera Corporation

Summary    

FPGA‟s are able to accelerate high performance computing applications Deep pipeline parallelism allow highly efficient algorithm acceleration OpenCL enables standard based high level programming of FPGA accelerators OpenCL will enable even more efficient implementation of acceleration on FPGAs in the future

© 2012 Altera Corporation