Multicore and GPU Programming - TECDIS

Multicore and GPU Programming

EDITORS
Miguel A. Vega-Rodríguez
Manuel I. Capel-Tuñón
Antonio J. Tomeu-Hardasmal
Alberto G. Salguero-Hidalgo

2015

Title: Multicore and GPU Programming

Editors:
Miguel A. Vega-Rodríguez, University of Extremadura, Spain
Manuel I. Capel-Tuñón, University of Granada, Spain
Antonio J. Tomeu-Hardasmal, University of Cadiz, Spain
Alberto G. Salguero-Hidalgo, University of Cadiz, Spain

All rights are reserved. Duplication of this publication or parts thereof, by any means or procedure, is not permitted without written authorization from the copyright owners. Violations are liable to prosecution under the applicable copyright law.

© The authors

ISBN: 978-84-606-6036-1
Layout: Miguel A. Vega-Rodríguez
Date: March 2015
Printed in Spain


PREFACE

This book includes the best works presented at the Second Congress on Multicore and GPU Programming (PPMG2015). After the first congress, PPMG2014, this second edition took place at the Escuela Politécnica, Cáceres (University of Extremadura, Spain), on 5-6 March 2015.

The PPMG community has to intensify research on new methods and techniques for parallel programming, so that the maximum possible performance can be obtained from current multicore processors. These new techniques will allow us to exploit locality and bring out the deep parallel structure of many algorithms. The expected results of that research in the coming years will be useful not only for improving general application software development, but also for other lines of research in high-complexity domains such as data mining, networks and communications, artificial intelligence, games, bioinformatics, etc.

The Second Congress on Multicore and GPU Programming (PPMG, after its Spanish acronym) was conceived as a scientific-technological event of "minimum cost and maximum interest" for Portuguese, Latin American and Spanish researchers and professionals. The conference was promoted by TECDIS (Iberoamerican Network of Research on Concurrent, Distributed, and Parallel Technologies) and AUIP (Postgraduate Iberoamerican Academic Association), and was aimed at providing a discussion forum for developers, entrepreneurs and researchers interested in the development, use and dissemination of systems and applications connected to the new multicore processors and the top-market platforms for concurrency. PPMG also aimed to create professional links that lead to a frequent exchange of ideas among experts in different domains (processors, operating systems, compilers, development tools, debuggers, simulators, services architectures and programmers) who come from academia and industry and are brought together in our "Multicore Community". The congress PPMG2015 took place together with the "Winter School" of the "High Performance Computation Network on Heterogeneous Parallel Architectures (CAPAPH5)".

All the contributions in this book were reviewed by at least two expert reviewers. We sincerely hope that PPMG2015 stimulates your interest in the many issues surrounding multicore and GPU programming. The topics covered in the book are timely and important, and the authors have done an excellent job of presenting the material. In fact, this congress would not have been possible without the assistance of both the authors and the Program Committee members, to whom we give many thanks. Before concluding, we also want to express our sincere gratitude to AUIP (Postgraduate Iberoamerican Academic Association) for its funding, which made the PPMG2015 congress possible.


Finally, it is important to highlight that PPMG2015 is held in the lovely city of Cáceres (Spain); we hope you enjoy your stay.

March 2015

Miguel A. Vega-Rodríguez Manuel I. Capel-Tuñón Antonio J. Tomeu-Hardasmal Alberto G. Salguero-Hidalgo


PPMG2015 Second Congress on Multicore and GPU Programming Escuela Politécnica, University of Extremadura, Cáceres, Spain, 5-6 March 2015

Steering Committee
Manuel I. Capel Tuñón. University of Granada
Antonio J. Tomeu Hardasmal. University of Cadiz
Miguel A. Vega-Rodríguez. University of Extremadura

Local Organizing Committee

Chair
Miguel A. Vega-Rodríguez. University of Extremadura

Committee
Víctor Berrocal-Plaza. University of Extremadura
David L. González-Álvarez. University of Extremadura
José M. Granado-Criado. University of Extremadura
Alejandro Hidalgo-Paniagua. University of Extremadura
José M. Lanza-Gutiérrez. University of Extremadura
Álvaro Rubio-Largo. University of Extremadura
Sergio Santander-Jiménez. University of Extremadura


Program Committee
Diego A. Azcurra (Universidad de Buenos Aires, Argentina)
Jorge Barbosa (Universidade de Porto, Portugal)
Guillem Bernat Nicolau (Rapita Systems Ltd.)
Manuel I. Capel Tuñón (Universidad de Granada)
Miguel Cárdenas-Montes (CIEMAT)
José Duato (Universidad Politécnica de Valencia)
Domingo Giménez Cánovas (Universidad de Murcia)
Klaus Havelund (NASA Jet Propulsion Laboratory, USA)
Emilio Hernández (Universidad Simón Bolívar, Venezuela)
Juan A. Holgado Terriza (Universidad de Granada)
Zhiyi Huang (Otago University, New Zealand)
Takahiro Katahiri (Tokyo University, Japan)
Mikel Larrea (Universidad del País Vasco)
Diego R. Llanos (Universidad de Valladolid)
Julio Ortega Lopera (Universidad de Granada)
Carlos E. Pereira (Universidade Federal do Rio Grande Do Sul, Brazil)
Manuel Prieto-Matías (Universidad Complutense de Madrid)
Francisco J. Quiles (Universidad de Castilla-La Mancha)
Enrique Quintana Ortí (Universidad Jaume I)
Mario Rossainz López (Benemérita Universidad Autónoma de Puebla, Mexico)
Alberto G. Salguero Hidalgo (Universidad de Cádiz)
Federico Silla (Universidad Politécnica de Valencia)
Leonel Sousa (Instituto Superior Técnico de Lisboa, Portugal)
Antonio J. Tomeu Hardasmal (Universidad de Cádiz)
Manuel Ujaldón Martínez (Universidad de Málaga)
Miguel A. Vega-Rodríguez (Universidad de Extremadura)
Di Zhao (Ohio State University, USA)
Dieter Zoebel (Koblenz University, Germany)


Sponsors and Collaborators

Iberoamerican Network of Research on Concurrent, Distributed, and Parallel Technologies



PPMG2015 Program

Thursday, 5-March-2015

08:45        Registration opens
09:00-09:15  Welcome and Opening of the Congress
09:15-11:00  Technical Seminar: Concurrency and Parallelism in Modern C++ (Part 1). Dr. José Daniel García and Dr. Javier García-Blas
11:00-11:30  Coffee Break
11:30-13:30  Technical Seminar: Concurrency and Parallelism in Modern C++ (Part 2). Dr. José Daniel García and Dr. Javier García-Blas
13:30        Lunch

* IMPORTANT: All the activities will take place in the room "Salón de Actos" (Escuela Politécnica), except for the Technical Seminar (parts 1 and 2), which will take place in room "C7" (Escuela Politécnica).


Friday, 6-March-2015

09:45        Registration opens
10:00-11:00  Invited Talk: Virtualizing GPUs to Reduce Cost and Improve Performance. Prof. Dr. Jose Duato
11:00-11:30  Coffee Break
11:30-13:30  Oral Presentations. Chair: Miguel A. Vega-Rodríguez
  11:30-11:50  Power and Energy Implications of the Number of Threads Used on the Intel Xeon Phi. Oscar García Lorenzo, Tomás F. Pena, José Carlos Cabaleiro, Juan C. Pichel, Francisco F. Rivera and Dimitrios S. Nikolopoulos
  11:50-12:10  Implementation and Performance Analysis of the AXPY, DOT, and SpMV Functions on Intel Xeon Phi and NVIDIA Tesla using OpenCL. Edoardo Coronado-Barrientos, Guillermo Indalecio and Antonio Jesús García-Loureiro
  12:10-12:30  Modelling Shared-Memory Metaheuristics and Hyperheuristics for Auto-Tuning. José-Matías Cutillas-Lozano and Domingo Giménez
  12:30-12:50  Paralelización de Ramificación y Poda usando Programación Paralela Estructurada. Mario Rossainz-López and Manuel I. Capel-Tuñón
  12:50-13:10  El Nuevo API de C++ para Concurrencia/Paralelismo. Una Comparativa con Java en Problemas con Coeficiente de Bloqueo Nulo: Análisis de un Caso. Antonio J. Tomeu-Hardasmal, Alberto G. Salguero and Manuel I. Capel-Tuñón
13:30-14:00  Closing of the Congress
14:00        Lunch

* IMPORTANT: All the activities will take place in the room "Salón de Actos" (Escuela Politécnica), except for the Technical Seminar (parts 1 and 2), which will take place in room "C7" (Escuela Politécnica).


TABLE OF CONTENTS

Power and Energy Implications of the Number of Threads Used on the Intel Xeon Phi .......... 1
Oscar García Lorenzo, Tomás F. Pena, José Carlos Cabaleiro, Juan C. Pichel, Francisco F. Rivera and Dimitrios S. Nikolopoulos

Implementation and Performance Analysis of the AXPY, DOT, and SpMV Functions on Intel Xeon Phi and NVIDIA Tesla using OpenCL .......... 9
Edoardo Coronado-Barrientos, Guillermo Indalecio and Antonio Jesús García-Loureiro

Modelling Shared-Memory Metaheuristics and Hyperheuristics for Auto-Tuning .......... 19
José-Matías Cutillas-Lozano and Domingo Giménez

Paralelización de Ramificación y Poda usando Programación Paralela Estructurada .......... 29
Mario Rossainz-López and Manuel I. Capel-Tuñón

El Nuevo API de C++ para Concurrencia/Paralelismo. Una Comparativa con Java en Problemas con Coeficiente de Bloqueo Nulo: Análisis de un Caso .......... 39
Antonio J. Tomeu-Hardasmal, Alberto G. Salguero and Manuel I. Capel-Tuñón

Virtualizing GPUs to Reduce Cost and Improve Performance .......... 49
Jose Duato

Concurrency and Parallelism in Modern C++ .......... 51
José Daniel García and Javier García-Blas


Author Index ........................................................................................................................ 53


Power and Energy Implications of the Number of Threads Used on the Intel Xeon Phi

O. G. Lorenzo∗, T. F. Pena∗, J. C. Cabaleiro∗, J. C. Pichel∗, F. F. Rivera∗ and D. S. Nikolopoulos†

∗ Centro de Investigación en Tecnoloxías da Información (CITIUS), Univ. of Santiago de Compostela, Spain
Email: {oscar.garcia, tf.pena, jc.cabaleiro, juancarlos.pichel, ff.rivera}@usc.es
† Queens University Belfast, Northern Ireland, UK
Email: [email protected]

Abstract—Energy consumption has become an important area of research of late. With the advent of new manycore processors, situations have arisen where not all the processors need to be active to reach an optimal relation between performance and energy usage. In this paper, a study of the power and energy usage of a series of benchmarks, the PARSEC and SPLASH-2X benchmark suites, on the Intel Xeon Phi for different thread configurations is presented. To carry out this study, a tool was designed to monitor and record the power usage in real time during execution and afterwards to compare the results of executions with different numbers of parallel threads.

Keywords—Power, energy, manycores, Intel Xeon Phi, benchmarking.

I. INTRODUCTION

Manycore systems have been hailed as an important step towards greater energy efficiency. They increase computer performance by replicating simple and energy-conserving cores on a single chip, and they also promise to reduce energy usage by allowing resources to be used only when necessary. In these systems, optimal allocation of the available power budget to the different parts of the system is becoming a crucial decision. In fact, different computations will suffer different kinds of performance degradation when the power assigned to the components is modified. Therefore, there is a need for analyses and tools that help the user understand how a given computation behaves in terms of power usage.

Intel Xeon Phi is the first commercial manycore x86-based processor, currently available as an accelerator to be used alongside a host system. The Xeon Phi attempts to lower the energy cost of computation by favouring a high degree of parallelism over single-core performance. One of the main advantages of the Xeon Phi over other accelerators, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs), is its compatibility with x86 instructions, which allows the same code to be run on either the host processors or the accelerator board.

For the Xeon Phi, some models have been proposed to improve energy efficiency at core or instruction level [1], and they are certainly useful for new codes or to find the best implementations of legacy ones. Nevertheless, few codes designed to be run on regular SMP machines scale easily to the number of threads available on the Xeon Phi and, if scalability is limited, using all the resources on the Xeon Phi may be detrimental to energy consumption, since as more cores are used, and used more intensely, more power is consumed. Therefore, energy consumption may reach its minimum using a different number of threads for different applications. There may be codes which consume less energy using fewer threads, with the Xeon Phi using less power, even when their execution time is longer. For these reasons it is interesting to find the best thread configuration (in number and also in placement) for legacy applications.

In this paper a study of the power and energy usage of a series of benchmarks, the Princeton Application Repository for Shared-Memory Computers (PARSEC) and the Stanford ParalleL Applications for SHared-memory 2 (SPLASH-2X) benchmark suites [2], on the Xeon Phi for different numbers of threads is presented. To carry out this study, a set of tools was designed to monitor and record the power usage in real time during the execution of the codes and afterwards compare the results of executions with different numbers of parallel threads.

This study shows that finding the best configuration in terms of number of threads, either from the performance or the energy efficiency point of view, is far from straightforward. The optimal balance between performance and energy is not easily reached, and using all the available hardware resources may not yield the best performance or energy usage.

The rest of the paper is organised as follows: Section II reviews some related work. Section III describes the Intel Xeon Phi architecture. The mechanisms for power measurement on the Xeon Phi and the tool we have developed to facilitate obtaining power data are introduced in Section IV. In Section V some of the results of our analysis of the above mentioned benchmarks are presented. Finally, some conclusions are drawn in Section VI.

II. RELATED WORK

Gaining a good understanding of the power and energy usage behaviour of HPC codes is essential, given that the power wall has emerged as the key bottleneck in the design of exascale systems. In [3], a method to obtain this behaviour in a detailed way on a manycore system is shown. This method is integrated in a tool that interacts with the user and shows some measurements graphically.

There are several instrumentation systems available [4], [5] that supply users with power and energy consumption information. However, many suffer from resolution and accuracy problems. Moreover, some of them supply results out-of-band, impeding optimisation strategies to improve energy consumption. Modelling power and energy is a growing topic in HPC computing, and many contributions can be found on this issue [6], [7], [8]. Models to estimate leakage energy in a cache hierarchy are presented in [9]. In [10], the authors present a detailed study of the performance-energy tradeoffs of the Xeon Phi architecture. Leon and Karlin [11] demonstrate key tradeoffs among power, energy and execution time for explicit hydrodynamics codes. In [12], a model to predict the performance effects of applying multiple techniques simultaneously is presented. In [13], models for predicting CPU and DIMM power and energy were introduced.

III. THE INTEL XEON PHI

Fig. 1. Xeon Phi memory architecture.

The architecture of the Intel Xeon Phi is referred to as the Intel Many Integrated Core Architecture, or Intel MIC. It is a coprocessor computer architecture developed by Intel incorporating earlier work on the Larrabee [14] manycore architecture, the Teraflops Research Chip [15] multicore chip research project and the Intel Single-chip Cloud Computer [16] multicore microprocessor. The cores of Intel MIC are based on a modified version of the P54C design, used in the original Pentium. The basis of the Intel MIC architecture is to leverage the x86 legacy by creating an x86-compatible multiprocessor architecture that can utilise existing parallelisation software tools. Design elements inherited from the Larrabee project include the x86 ISA, 4-way SMT per core, 512-bit SIMD units, a 32 KB L1 instruction cache, a 32 KB L1 data cache, a coherent L2 cache (512 KB per core), and an ultra-wide ring bus connecting processors and memory (see Figure 1). Manufactured using Intel's 22 nm technology with 3D Tri-Gate transistors, each coprocessor features more cores, more threads, and wider vector execution units than an Intel Xeon processor. The high degree of parallelism is intended to compensate for the lower speed of each core to deliver higher aggregate performance for highly parallel workloads. Since languages, tools, and applications are compatible for both Intel Xeon processors and Intel Xeon Phi coprocessors, codes initially developed for Intel Xeon can be reused.

With Xeon Phi, a single programming model can be used for all the code. The coprocessor gives developers a hardware design optimised for extreme parallelism, without requiring them to re-architect or rewrite their code. Coming from Xeon programming there is no need to rethink the entire problem or learn a new programming model; existing codes can simply be recompiled and optimised using familiar tools, libraries, and runtimes. Also, by maintaining a single source code between Intel Xeon processors and Intel Xeon Phi coprocessors, developers should be able to optimise once for parallelism but maximise performance on both processor and coprocessor. While designed for high-performance computing, the coprocessor can host an operating system, be fully IP addressable, and support standards such as the Message Passing Interface (MPI), unlike a GPU. This means it can operate in multiple execution modes. Workload tasks can be shared between the host processor and the coprocessor, they can work independently, or parts of the workload can be sent out to the coprocessor as needed (as in a GPU).

IV. POWER MEASUREMENT

A. Power Measurement on Xeon Phi

Power measurements for the Intel Xeon Phi coprocessor can be obtained directly from the Operating System (OS). There are no readable PM PMU events (Power Management Performance Monitoring Unit events) in the current first generation of Intel Xeon Phi coprocessors, although this is likely to change in later generations. The OS gets its temperature readings from off-chip monitoring sensors on the circuit board supporting the processor. As such, it is an aggregate and approximate measurement. When used in native mode, that is, when a user is directly logged into the coprocessor OS, power measurements can be read from /sys/class/micras/power. From the host system these measurements are read using a command line tool provided by Intel [17]. These measurements are updated every 50 ms. The exported data consists of a series of values of power readings from various sensors. The coprocessor board sensors give information about power consumption. Sensors on the board measure the power from the various power inputs in μW. They are the following:

• Total power: two measurements are given, each taken in a different time window (0 and 1).
• Instantaneous power and maximum instantaneous power.
• PCI-E connector power (up to 75 W).
• Power from the auxiliary 2x3 (75 W) and 2x4 (150 W) connectors. These are needed because the Xeon Phi board cannot be powered by the PCI-E connector alone. (Some models may not need the 2x3 connector.)

Sensors near the Phi coprocessor also give information about current (in μA) and voltage (in μV). These sensors are:


• Core rail: this sensor measures the activity of the cores. The more that is demanded from the cores, the larger the power used.
• Uncore rail: this sensor measures the uncore activity, such as the ring bus interconnection among cores and the L2 cache level.
• Memory subsystem rail: this sensor measures the activity of the memory modules and their connection to the processor. While memory and cores are connected by the same ring bus, the memory modules reside outside the processor chip.

In addition to power, temperature measurements are also recorded, from eight different temperature sensors.

B. Power Measurement Tool

To record the power consumption of a benchmark, we have implemented a series of bash scripts, which record to a file the commands used, the output of the benchmark and the power data. These scripts can be used to easily run a series of executions of an application with different thread configurations. To batch process these power data files, we have also implemented an application running on an R environment [18]. This tool computes the mean of the results of the different runs of each thread configuration, to obtain a more representative view of its execution. Moreover, it allows comparisons among different thread configurations to be made easily. A view of this interactive Graphical User Interface (GUI) is shown in Figure 2. On the left hand side of the GUI are the controls to select the kind of graph to plot, for instance power vs. time or energy vs. thread configuration, and the desired thread configuration (if necessary). On the right hand side are the controls to modify the limits of the x axis of the graph (time or thread configuration) and buttons to reset those limits to the default ones.

Fig. 2. The R GUI.
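As a concrete illustration of the measurement path described in Section IV, a minimal native-mode logger could simply copy the sysfs power file once per second. This is a sketch, not the authors' scripts, and it assumes the file exposes one whitespace-separated reading per line.

    /* Minimal native-mode power logger (sketch, not the authors' tool).
     * It samples /sys/class/micras/power once per second and appends the
     * raw readings, prefixed with a timestamp, to power.log.  The exact
     * field layout of the sysfs file is assumed, not documented here. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *log = fopen("power.log", "w");
        if (!log)
            return 1;

        for (;;) {
            FILE *f = fopen("/sys/class/micras/power", "r");
            if (!f)
                break;            /* only readable when logged into the coprocessor OS */

            char line[256];
            fprintf(log, "%ld", (long)time(NULL));
            while (fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';
                fprintf(log, " %s", line);   /* copy each reading verbatim */
            }
            fprintf(log, "\n");
            fflush(log);
            fclose(f);
            sleep(1);             /* the exported values refresh every 50 ms */
        }
        fclose(log);
        return 0;
    }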

V. CASE STUDY

Power and energy consumption of a series of benchmarks executed natively on the Xeon Phi are considered. Benchmarks were executed with different numbers of threads, and during each execution power measurements were taken every second. The results of 5 different executions of each thread configuration were combined to obtain an average of the power usage. This way the evolution in time of the power usage can be extracted. In addition, the energy consumption of each thread configuration can be studied. As more threads are executed more instantaneous power is used, but applications usually run faster, which means that the total energy consumption may be lower, although this is not always the case.

All benchmarks were compiled with icc -O2 and autovectorisation, at least. The Xeon Phi coprocessor used was the 7120A model, with 16 GB of memory and 61 cores at 1.238 GHz. In this coprocessor, up to 240 threads can be used to run applications, with 4 threads per core (one core is reserved for the OS). Thread placement was left to the OS; no directions were given, so threads were assigned in a round-robin fashion. Alternative placements of threads, using affinity options, were not considered in this study. This means that, when up to 60 threads are considered, just one thread is executed per core, whereas from 61 threads up, cores are assigned more threads, which implies that in some configurations there may be some cores executing one more thread than others. In this Xeon Phi board the data storage is implemented as Network Attached Storage (NAS) when programs are executed directly on the Phi.

A. Benchmark Suite

The benchmarks used to carry out this study were the PARSEC and SPLASH-2X benchmark suites [2]. PARSEC is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip multiprocessors. It is complemented by the SPLASH-2X benchmarks, a benchmark suite that includes applications and kernels mostly in the area of high performance computing. It has been widely used to evaluate multiprocessors and their designs for the past 15 years. The SPLASH-2X and PARSEC benchmark suites complement each other in terms of diversity of architectural characteristics such as instruction distribution, cache miss rate and working set size. In this study benchmarks were executed using their native input parameters, the larger standard inputs, to obtain long execution times and be able to appreciate variations in the energy usage. Many of these benchmarks suffer a performance loss compared to their execution on the Xeon host. This is mainly due to I/O, since the host can access the hard drive directly, but the coprocessor must use NAS when used independently.

B. Data Analysis

First, note that there is a great correlation between the power usage measured by the board sensors and that measured by the sensors near the processor. For instance, consider the evolution of power usage during the execution of the Blackscholes benchmark (PARSEC) shown in Figure 3. It is clear how the peaks of power consumption in the processor are met with peaks on the board. It can also be observed how the 2x3 connector (only 75 W) draws less power than the 2x4 connector (150 W), and they seem to work in tandem, each drawing a proportional amount of the total power. Adding the power of these two connectors and the instantaneous power drawn from the PCI-E gives approximately the total power measured during window 1 and window 2 (Figure 3(a)). Furthermore, the total power (approximately 100 W during the initial phase, Figure 3(a)) is close to the sum of the core, uncore and memory rail powers (approximately 25 W, 30 W and 37 W during the initial phase, in Figure 3(b)). This result is reasonable given that these rails do not comprise the entirety of the elements on the board. The fact that the power sensors behave this way seems to indicate that their measurements can be trusted, at least relative to one another. Given this correlation, in order to make figures more readable, only data for the measurements of the processor rails will be shown in the remainder of the paper, since they are more interesting for our analysis.
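To make the averaging step concrete: the per-second samples of the five runs of a configuration are combined into a mean power trace, and with a 1 s sampling interval the total energy in joules is, to a good approximation, the sum of the mean power samples in watts. The following C sketch is an analogue of that computation under those assumptions, not the R tool itself.

    /* Sketch of the post-processing idea (the actual tool is written in R):
     * combine the per-second total-power samples of the runs of one thread
     * configuration into a mean trace and an energy estimate. */
    #define RUNS 5

    double mean_power_trace(const double power[RUNS][/*nsamples*/],
                            int nsamples, double mean[/*nsamples*/])
    {
        double energy = 0.0;
        for (int t = 0; t < nsamples; t++) {
            double sum = 0.0;
            for (int r = 0; r < RUNS; r++)
                sum += power[r][t];
            mean[t] = sum / RUNS;
            energy += mean[t];     /* times the 1 s sampling interval -> joules */
        }
        return energy;             /* average energy of this configuration */
    }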


Fig. 3. Power vs time for the Blackscholes benchmark: (a) power vs time on board; (b) power vs time on chip.

Applications for the Xeon Phi are recommended to be executed with 4 threads per core. However, this configuration may not give the best performance, especially if the application is not modified from its Xeon version for the Phi. An example of an application which performs better with 1 thread per core is Bodytrack (PARSEC). Bodytrack is a benchmark based on a body tracking application, which reads a series of images from disk as if they were video. It performs badly on this Xeon Phi due to the high I/O caused by the need to load the images from NAS. Furthermore, its best performance is obtained when 60 threads are used, one thread per core, and it gets worse as more threads are used, which means it consumes more energy (see Figure 4(a)). In Figure 4 the variability of the power used by the cores can be seen. In Figure 4(b) the usage by the memory and uncore is fairly constant, and similar to the case in Figure 3(b), but the core consumption varies greatly. Figure 4(d) shows Bodytrack executed with 240 threads. Note how the cores, while now executing 4 threads each, actually use less power than in the case when it is executed with 60 threads, probably because they are performing fewer operations per second. A reason for this may be glimpsed in Figure 4(c), Bodytrack executed with 170 threads, one of the worst cases; here the memory is being stressed and the cores are probably starved of data. It is likely the data cannot be well partitioned among the threads.

The Swaptions benchmark (PARSEC) performs well on the Xeon Phi. Its power requirements are very similar in any thread configuration, since it does not stress the cores (see Figure 5(a)). This leads to an energy consumption directly related to the execution time. The stair pattern in Figure 5(b) is due to the resolution of the measurements; because they are taken every second, and the execution time is between 6 and 9 seconds, detail is lost. Nevertheless, it is enough to see that the best configurations are those that use 2 threads per core (between 120 and 180 threads).

In Figure 6, the behaviour of the benchmark Barnes, from the SPLASH-2X suite, shows that some interesting facts about an application's execution can be deduced from its power usage. In Figure 6(a), it seems clear that the application goes through 4 phases of intense computation. Between those phases the memory and uncore rails flare up, possibly indicating a data preload or store. In Figure 6(b), four sections can be differentiated, each corresponding to cores using 1, 2, 3 or 4 threads; this behaviour can be observed in other benchmarks, but it is clearer in this one. These sections are separated by a low energy peak (highlighted in the figure by a rectangle), corresponding to multiples of the number of cores, that is 60, 120 and 180. Usually these configurations produce better results, since they share the workload more evenly, because they imply the same number of threads per core.

Nevertheless, these configurations may not always be the best, neither in execution time nor in energy consumption. With the Water_nsquared benchmark (SPLASH-2X), Figure 7, it can be observed how differences of just a few threads can greatly influence the execution, and how the best thread configurations are not always obvious. In Figure 7(a), this variability, which mostly follows the execution time, is clear. In Figure 7(b), note how the mean power usage per second is almost the inverse of the total energy consumed. This is because the fastest thread configurations draw more power during execution, but use less energy in the end by finishing the execution sooner. The two thread configurations highlighted in these figures are shown in more detail in Figures 7(c) and 7(d). In Figure 7(c), the phases of the execution are shown for one of the worst cases with 4 threads per core, contrary to what Figure 6(a) showed, in which computations and memory accesses seem to occur at the same time. If Figure 7(c) and Figure 7(d), respectively a bad and a good case, are compared, it is clear how, with good thread configurations, power usage remains high during the whole execution. This leads to a greater energy consumption per second, but it is finally compensated by a lower execution time.

In general, power consumption and performance seem to be highly related on the Xeon Phi. This leads to the best thread configuration from a performance point of view usually being the least energy consuming. Nevertheless, the benchmarks considered in this study perform worse on the Xeon Phi than on the host system, a dual Xeon with 16 cores in total, sometimes on the order of 4 or 5 times worse, as with Blackscholes. Some loss of performance is due to the NAS I/O, but other considerations, such as the level of vectorisation, may be of importance. For instance, Xeon Phi cores allow a greater degree of vectorisation than the usual Xeon cores, and the autovectorisation done by the icc compiler may not be enough to obtain good performance on the Xeon Phi.
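As an illustration of this vectorisation point (the loop below is not taken from the benchmark sources): when icc's autovectoriser leaves a hot loop scalar, an explicit SIMD annotation such as the OpenMP 4.0 one accepted by recent icc versions is one way to state that the loop is safe to vectorise for the 512-bit units of the Phi.

    /* Illustration only (not from PARSEC/SPLASH-2X): an explicit SIMD hint
     * for a loop the autovectoriser might otherwise leave scalar. */
    void scale_add(float *restrict r, const float *restrict x,
                   const float *restrict y, float a, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            r[i] = a * x[i] + y[i];
    }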

Multicore and GPU Programming

Fig. 4. Power data for the Bodytrack benchmark: (a) energy vs thread configuration; (b) power vs time, 60 threads; (c) power vs time, 170 threads; (d) power vs time, 240 threads.

Fig. 5. Power and energy data for the Swaptions benchmark: (a) mean power vs thread configuration; (b) energy vs thread configuration.

Fig. 6. Power and energy data for the Barnes benchmark: (a) power vs time for the 60-thread configuration; (b) energy vs thread configuration.

VI. CONCLUSION

Power and energy usage has become an important issue in all areas of computer science and information technology of late, particularly in HPC. The use of hardware accelerators such as the Xeon Phi may help to reach a better performance per watt ratio. Even so, the optimal balance between performance and energy is not easily reached. Using all the available hardware resources in order to obtain the maximum performance may not compensate for the energy consumed, although on the Xeon Phi it seems the best starting point. Nevertheless, the best thread configuration, either from the performance or the energy efficiency point of view, may not be evident at first glance. In future work, a study of different configurations, changing not only the number of threads but also their core placement and affinity, may yield complementary insights.

To help find the best thread configuration for a given problem, a series of tools has been presented here. These tools allow for a fast study of an application's performance and power usage using different numbers of threads on the Xeon Phi. These tools could be improved in the future using data gathered from hardware counters, to try to relate power usage to other performance metrics, such as cache misses, or to gain a more detailed view of the power consumption, at core level or related to the different execution phases of an application.


Fig. 7. Power and energy data for the Water_nsquared benchmark: (a) energy vs thread configuration; (b) mean power vs thread configuration; (c) power vs time for 226 threads; (d) power vs time for 234 threads.

ACKNOWLEDGMENT

This work has been partially supported by the Ministry of Education and Science of Spain, FEDER funds under contract TIN 2013-41129P, and Xunta de Galicia, GRC2014/008 and EM2013/041. It has been developed in the framework of the European network HiPEAC-2, the Spanish network CAPAPH, and the Galician network under the Consolidation Program of Competitive Research Units (Network ref. R2014/041). This work is a result of a collaboration between CITIUS, University of Santiago de Compostela, and the School of Electronics, Electrical Engineering and Computer Science, Queens University Belfast.


REFERENCES

[1] Y. S. Shao and D. Brooks, "Energy characterization and instruction-level energy model of Intel's Xeon Phi processor," in Proceedings of the International Symposium on Low Power Electronics and Design. IEEE Press, 2013, pp. 389–394.
[2] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, January 2011.
[3] P. Kogge et al., "Exascale computing study: Technology challenges in achieving exascale systems," Tech. Rep., 2008.
[4] D. Bedard, M. Y. Lim, R. Fowler, and A. Porterfield, "PowerMon: Fine-grained and integrated power monitoring for commodity computer systems," in Proceedings of IEEE SoutheastCon 2010. IEEE, 2010, pp. 479–484.
[5] M. Tolentino and K. W. Cameron, "The optimist, the pessimist, and the global race to exascale in 20 megawatts," Computer, vol. 45, no. 1, pp. 95–97, 2012.
[6] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Enabling accurate power profiling of HPC applications on exascale systems," in ACM International Workshop on Runtime and Operating Systems for Supercomputers, 2013.
[7] C. Isci and M. Martonosi, "Runtime power monitoring in high-end processors: Methodology and empirical data," in IEEE International Symposium on Microarchitecture, MICRO-36, 2003.
[8] R. Bertran, M. Gonzalez, X. Martorell, N. Navarro, and E. Ayguade, "Decomposable and responsive power models for multicore processors using performance counters," in ACM International Conference on Supercomputing, 2010.
[9] A. Deshpande and J. Draper, "Leakage energy estimates for HPC applications," in ACM International Workshop on Energy Efficient Supercomputing (E2SC'13), 2013.
[10] B. Li, H.-C. Chang, S. Song, C.-Y. Su, T. Meyer, J. Mooring, and K. W. Cameron, "The power-performance tradeoffs of the Intel Xeon Phi on HPC applications," in Proc. 2014 IEEE Int. Parallel and Distributed Processing Symposium Workshops (IPDPSW'14), 2014, pp. 1448–1456.
[11] E. Leon and I. Karlin, "Characterizing the impact of program optimizations on power and energy for explicit hydrodynamics," in IEEE 28th International Parallel and Distributed Processing Symposium Workshops, 2014, pp. 773–781.
[12] D. Nikolopoulos, H. Vandierendonck, N. Bellas, C. Antonopoulos, S. Lalis, G. Karakonstantis, A. Burg, and U. Naumann, "Energy efficiency through significance-based computing," Computer, vol. 47, no. 7, pp. 82–85, 2014.
[13] A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely, "Modeling power and energy usage of HPC kernels," in IEEE 26th Int. Parallel and Distributed Processing Symp. Workshops, 2012, pp. 990–998.
[14] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: A manycore x86 architecture for visual computing," ACM Trans. Graph., vol. 27, no. 3, pp. 18:1–18:15, Aug. 2008. [Online]. Available: http://doi.acm.org/10.1145/1360612.1360617
[15] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," in IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco, USA. IEEE, 2007, pp. 98–99.
[16] J. Rattner, "Single-chip cloud computer: An experimental manycore processor from Intel Labs," available at download.intel.com/pressroom/pdf/rockcreek/SCCAnnouncement, 2011.
[17] Intel Xeon Phi Coprocessor Datasheet, http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-datasheet.html.
[18] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008, ISBN 3-900051-07-0.
8

Multicore and GPU Programming

Implementation and performance analysis of the AXPY, DOT, and SpMV functions on Intel Xeon Phi and NVIDIA Tesla using OpenCL

E. Coronado-Barrientos, G. Indalecio and A. Garcia-Loureiro
Centro de Investigacion en Tecnoloxias da Informacion (CiTIUS), Universidad de Santiago de Compostela, Santiago de Compostela, Spain
Email: [email protected]

Abstract—The present work is an analysis of the performance of the AXPY, DOT and SpMV functions using OpenCL. The code was tested on the NVIDIA Tesla S2050 GPU and the Intel Xeon Phi 3120A coprocessor. Due to the nature of the AXPY function, only two versions were implemented: the routine to be executed by the CPU and the kernel to be executed on the previously mentioned devices. It was studied how they perform for different vector sizes. The results show the NVIDIA architecture better suited for the smaller vector sizes and the Intel architecture for the larger vector sizes. For the DOT and SpMV functions, three versions were implemented. The first is the CPU routine, the second is an OpenCL kernel that uses local memory and the third is an OpenCL kernel that only uses global memory. The kernels that use local memory are tested by varying the size of the work-group; the kernels that only use global memory are tested by varying the array size. In the case of the former, the results show the optimum work-group size and that the NVIDIA architecture benefits from the use of local memory. For the latter kernels, the results show that larger computational loads benefit the Intel architecture.

I. INTRODUCTION

Nowadays, the top supercomputer systems are heterogeneous systems that use more than one kind of processor to gain performance. One of the leading vendors providing accelerators for high performance computing is NVIDIA. As of November 2014, 61.3 percent of the heterogeneous systems in the Top500 use NVIDIA to provide acceleration [1]. More concretely, numbers 2 and 6 on the list run with NVIDIA. This situation differs greatly from the state in June 2012, when 91.5 percent of the heterogeneous systems used NVIDIA. The difference between then and now is the incursion of the Intel Xeon Phi coprocessor for acceleration. In June 2012 only one system in the Top500 was found using this card, and now, in November 2014, 28 percent of the heterogeneous systems use Xeon Phi, including the first on the list. These two devices have different programming APIs and memory models, which complicates the comparison of the behaviour of algorithms between them. In order to simplify development on heterogeneous systems, the OpenCL project provides a single interface that enables the execution of the same code on any compliant device. This is a widely available and actively developed framework maintained by a non-profit consortium, the Khronos Group [2], and adopted by many vendors, like Apple, Intel, AMD and others, by providing support on their devices. Both the NVIDIA Tesla S2050 and the Intel Xeon Phi 3120A are compatible, so it is possible to write code for both and make a direct comparison of the performance.

With that in mind, having the same code that can be executed on several platforms is becoming a necessity instead of an option. The present work is inspired by this reality, so the objective is to analyse the kernels' performance to improve the execution time of algorithms for linear system solvers [3]. The focus was on solvers that can find the solution for systems whose matrix is not symmetric, which are commonly generated by numerical simulations, including those of semiconductor devices [4]. The kernels tested in this work were included in a three-dimensional Finite Element simulation tool developed by our group. The applications of this simulation tool include the study of several semiconductor devices such as Bipolar Junction Transistors [5], High-Electron Mobility Transistors [6], Metal-Oxide-Semiconductor Field-Effect Transistors [7] and Solar Cells [8].

This work has the following organization: in section II there is a brief justification of the present analysis; in section III the technical specifications of the NVIDIA Tesla and Intel Xeon Phi devices are given; in section IV a brief introduction to the terminology used by OpenCL is given; in section V an analysis of the implemented kernels is done; in section VI the results of the tests are registered; and in section VII conclusions are discussed.

II. JUSTIFICATION

The main classifications for sparse linear system solvers are direct solution methods and iterative methods. An example of the first class is LU factorization. The iterative methods can be classified in two classes: basic iterative methods and projection methods. The latter class contains the most important iterative methods for solving large sparse linear systems [9].


Fig. 1: Functions of the FGMRES algorithm. For clarity, one instance of each function is located in the code, but several others can be found in other steps of the code.

Examples of these algorithms are the Full Orthogonalization Method (FOM), the Incomplete Orthogonalisation Method (IOM) and the Generalized Minimal Residual method (GMRES). The most used operations in iterative methods for solving sparse linear systems whose matrix is unsymmetric [9] are the AXPY function (ax + y), the DOT function (x · y) and the SpMV function (Ax), where a is a scalar, x and y are vectors and A is a square sparse matrix. A look at one iterative method for solving sparse linear systems, the Flexible variant of the Generalized Minimal Residual method (FGMRES), shows that the algorithm depends highly on these operations (see Fig. 1). Note that the norm-2 also depends on the DOT function, as its calculation requires a dot product of a vector with itself.
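Of the three operations, the CPU forms of AXPY and DOT appear later in section V; SpMV is the one whose behaviour depends most on how the sparse matrix is stored. As a point of reference only, a CPU sketch of y = Ax for a matrix in CSR form is given below; CSR is assumed here purely for illustration and is not implied to be the format used by our kernels.

    /* Reference CPU sketch of the SpMV operation y = A*x for a sparse matrix
     * stored in CSR form (row_ptr[n+1], col_idx[nnz], val[nnz]).  The CSR
     * format is an assumption made for this illustration only. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const float *val, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                acc += val[k] * x[col_idx[k]];   /* only the nonzeros of row i */
            y[i] = acc;
        }
    }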

The performance of these functions has a considerable dependency on the specific implementation. Considering the differences in architecture and capabilities of the available coprocessors and General Purpose Graphics Processing Units (GPGPUs), changing the implementation of the kernels will have an extra impact on the performance of the functions, hence boosting or decreasing the running time of the linear system solver.

III. SPECIFICATIONS OF DEVICES

In this work, two devices are used in order to comprehend and analyse the differences between kernel implementations and performance. The first is a High Performance Computing (HPC) oriented GPU from NVIDIA, the Tesla S2050, which represents a well tested and established product. The second is the novel Xeon Phi coprocessor from Intel, arising with a different paradigm of more complex cores and logic. Both devices have support for OpenCL, which makes them perfect to be compared using a common application program interface (API). TABLE I summarises the main specifications of the NVIDIA Tesla S2050 and the Intel Xeon Phi 3120A.

TABLE I: Specifications for the NVIDIA Tesla S2050 and Intel Xeon Phi 3120A devices.

                                 Tesla S2050      Xeon Phi 3120A
  Cores                          448              57
  Core clock (GHz)               1.10             1.55
  Total memory (GB)              3                6
  Memory max bandwidth (GB/s)    148              240
  TDP (W)                        225              300
  Maximum work-group size        1024             8192
  Double arithmetic support      yes              yes
  Cache L2 (MB)                  —                28.5
  Cache line size (B)            128              64
  Device type                    GPU              Accelerator card

GPUs were originally developed to be used in graphics manipulation, so they have very simple processing cores; however, GPGPUs have been designed to be more useful in the field of High Performance Computing (HPC), starting with the Fermi line. The NVIDIA Tesla S2050 is a Fermi-based GPU which has two slots, each having two M2050 cards of 14 CUDA cores (Streaming Multiprocessors), for a total of 448 cores and 3,072 MB of memory. Each GPU operates at a frequency of 574 MHz. Its theoretical peak performance is 1028 GFLOPS with single precision [10].

The Intel Xeon Phi coprocessor contains 57 cores, each with a 512-bit vector arithmetic unit for SIMD vector instructions. The coprocessor has an L1 cache in each core of 64 KB (32 KB data + 32 KB instructions) and an L2 cache in each core of 512 KB (combined data and instructions, with the L1 data cache included). Each core can execute up to four hardware threads simultaneously. The theoretical peak performance of the Intel Xeon Phi coprocessor is 1011 GFLOPS [11].

The NVIDIA architecture benefits from fine-grain parallelism, due to its larger quantity of lighter cores that are not capable of supporting SIMD instructions. On the other hand, the Intel Xeon Phi coprocessor has heavier cores that support SIMD instructions, each with its own cache hierarchy, implying the necessity of a complex mechanism for cache coherency. Because of this, the latter benefits from coarse-grain parallelism. Another important difference to consider between these two devices is where each has its local memory allocated. For the NVIDIA Tesla architecture, local memory is allocated on each CUDA core and is available to each of its 32 cores; on the other hand, for the Intel Xeon Phi architecture the local memory is allocated in the regular GDDR memory, so it introduces additional overhead in terms of redundant data copy and management if used [12].
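The limits collected in TABLE I, such as the maximum work-group size and the amount of local memory, can also be queried at run time through the standard OpenCL API. The sketch below prints them for the first device found; device selection and error handling are simplified.

    /* Query some of the TABLE I limits at run time (sketch; error checks omitted). */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

        size_t   max_wg;
        cl_ulong local_mem, global_mem;
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof max_wg, &max_wg, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof local_mem, &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof global_mem, &global_mem, NULL);

        printf("max work-group size : %zu\n", max_wg);
        printf("local memory        : %llu KB\n",
               (unsigned long long)(local_mem / 1024));
        printf("global memory       : %llu MB\n",
               (unsigned long long)(global_mem / (1024 * 1024)));
        return 0;
    }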


IV. OPENCL

OpenCL is a programming framework that enables the use of the same code on multiplatform systems [13], [14], [15]. In this work, OpenCL provided common ground to test functions on two devices with very different architectures. As a new programming paradigm it has its own terminology, much as CUDA does for NVIDIA's GPUs. A description of the components of the OpenCL framework is needed to explain the behaviour of the kernels along the article. Please note that these components will have a different counterpart in the physical device depending on the card, especially the last three (compute unit, work-item and work-group) and the memory model, as explained in section III.

Key terms:

• Host: the central processing unit (CPU) where the code is being executed.
• Host application: the code that contains the C and OpenCL instructions and manages the kernels.
• Kernel: a specially coded function to be executed by one or more devices.
• Device: any processing unit, such as a CPU, GPU or accelerator card (coprocessor), that is OpenCL compliant.
• Context: a container used by the host to manage all the connected devices.
• Program: the container where the kernels are searched for by the host to send to a device.
• Queue: a mechanism used by the host to dispatch kernels from a program to a device.
• Compute unit: a processing core contained within a device.
• Work-item: an individual kernel execution with a specific set of data.
• Work-group: a set of work-items that access the same processing resources (e.g. local memory).

The host runs the application that contains all the code, namely source.c. Once the host finds the instructions that set up the OpenCL frame, it looks into the program container, namely kernels_source.cl, for kernels. Once the host has access to the kernels, it can send them to a specific device (included in the context) through its queue (see Fig. 2).

Fig. 2: OpenCL frame.

Note that each device needs its own queue, so if three devices are available on the platform, three queues will be required to communicate with each one of them. The last three terms of the key terms list are more related to the OpenCL device model. It is important to know this model because it explains how memory address spaces are accessed. The OpenCL device model identifies four address spaces:

• Global memory: read/write memory that stores data for the entire device.
• Constant memory: read-only memory that stores data for the entire device.
• Local memory: stores data for the work-items in a work-group.
• Private memory: stores data for a single work-item.

In Fig. 3 the relations between the different types of memory and the work-items, work-groups and compute units are shown. The global memory is the largest and also the slowest memory region on the device. When the host sends data to the device, it writes directly into global memory, and when the host receives data from the device it reads from global memory. The local memory is faster and commonly much smaller than global memory; the NVIDIA Tesla architecture benefits from using it, as later results will show. Finally, the private memory is the fastest and smallest memory region; it can only be accessed by individual work-items.
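A condensed host-side sketch of the flow described at the start of this section (platform and device discovery, one context, a program built from kernels_source.cl, one queue per device, and a kernel dispatched over the whole array) is shown below. The kernel name and its argument order are assumed for illustration only, and all error checking is omitted.

    /* Sketch of the host application flow: context -> program -> queue -> kernel.
     * The axpy kernel's argument order is assumed here; error checks omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        enum { LENGTH = 1 << 20 };
        float a = 2.0f;
        float *x = malloc(LENGTH * sizeof *x);
        float *y = malloc(LENGTH * sizeof *y);
        float *r = malloc(LENGTH * sizeof *r);
        for (int i = 0; i < LENGTH; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        cl_platform_id platform;  cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* read the program container (kernels_source.cl) into a string */
        FILE *f = fopen("kernels_source.cl", "rb");
        fseek(f, 0, SEEK_END);  size_t len = ftell(f);  rewind(f);
        char *src = malloc(len + 1);
        fread(src, 1, len, f);  src[len] = '\0';  fclose(f);

        cl_program prog = clCreateProgramWithSource(ctx, 1, (const char **)&src, &len, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel axpy = clCreateKernel(prog, "axpy", NULL);

        cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   LENGTH * sizeof(float), x, NULL);
        cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   LENGTH * sizeof(float), y, NULL);
        cl_mem dr = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                   LENGTH * sizeof(float), NULL, NULL);

        clSetKernelArg(axpy, 0, sizeof(float), &a);
        clSetKernelArg(axpy, 1, sizeof(cl_mem), &dx);
        clSetKernelArg(axpy, 2, sizeof(cl_mem), &dy);
        clSetKernelArg(axpy, 3, sizeof(cl_mem), &dr);

        size_t global = LENGTH;       /* one work-item per array element */
        clEnqueueNDRangeKernel(queue, axpy, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, dr, CL_TRUE, 0, LENGTH * sizeof(float), r, 0, NULL, NULL);

        printf("r[0] = %f\n", r[0]);  /* expected 2*1 + 2 = 4 */
        return 0;
    }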


Fig. 3: Schematic representation of the OpenCL device model.

V. KERNELS

Several implementations of the functions are made in order to analyse them. All implementations are written using OpenCL, so the kernels can run on both devices without changes. Besides the length of the arrays to process, there is another parameter that influences the performance of some of these kernels: the work-group size. Section VI shows the impact of this parameter.

A. The AXPY function

There are two versions of the AXPY function, the CPU version and one OpenCL version; this is due to the function itself, which is almost a simple read-write operation. This operation was included to see which device performed better for a highly parallel operation like this.

Fig. 4: Array accesses by work-items in the axpy kernel. This figure contains the example of r = 2x + y with r, x, y ∈ R^100.

The CPU version of the AXPY function is:

    for (i = 0; i < LENGTH; i++)
        r[i] = a*x[i] + y[i];

and the kernel for the AXPY function is:

    __kernel void axpy(

B. The DOT function

There are three versions of the DOT function. The first version is the CPU implementation, the second version is an OpenCL kernel optimized for the NVIDIA architecture, as it uses local memory (see section III), and the third version is an OpenCL kernel optimized for the Intel Xeon Phi architecture, as it does not use local memory (see section III).

The CPU version is:

    r = 0.0f;
    for (i = 0; i < LENGTH; i++)
        r = r + x[i]*y[i];

and the first kernel of the DOT function is:

    __kernel void dot0(__global float *v1,
                       __global float *v2,
                       __global float *vrp)
    {
        unsigned int gid = get_global_id(0);
        unsigned int lid = get_local_id(0);
        unsigned int s;
        __local float LB[CHK];

        LB[lid] = v1[gid]*v2[gid] + v1[gid+CHK]*v2[gid+CHK];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (s = (CHK/2); s > 0; s >>= 1) {
            if (lid
{ unsigned int gid = get global id(0); unsigned int lid = get local id(0); unsigned int s; local float LB[CHK]; LB[lid] = v1[gid]*v2[gid] + v1[gid+CHK]*v2[gid+CHK]; barrier(CLK LOCAL MEM FENCE); for (s=(CHK/2); s>0; s>>=1) { if (lid