Hacking GPUs: insights into the NVidia architecture

Gheorghe Almasi
IBM Research

In this talk I expand on results obtained by us, and by others, in reverse-engineering the NVIDIA GPU family in an attempt to understand the underlying architecture. NVIDIA likes to keep some of these details under wraps, mostly to discourage programmers from writing code that is too specific to a particular instance of the accelerator architecture. We, however, are interested in unpublished details such as how thread divergence is implemented, how the two warp schedulers interact with each other, and what the memory hierarchy looks like.

A Generic Stencil Library

Mauro Bianco

In this era of diverse and heterogeneous computer architectures, programmability issues such as productivity and portable efficiency are crucial to software development and algorithm design. One way to approach the problem is to step away from traditional sequential programming languages and move toward domain-specific programming environments that strike a balance between expressivity and efficiency.

In order to demonstrate this principle, we developed a domain-specific (C++ generic) library for stencil computations, such as PDE solvers. The library features high-level constructs to specify computation and allows the development of parallel stencil computations with very limited effort. The high-abstraction constructs (such as do_all and do_reduce) make programs shorter and cleaner, with increased contextual information for better performance exploitation. The results show good performance from Windows multicores to HPC clusters and machines with accelerators such as GPUs.
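Constructs of this kind hide the loop structure and the parallelization strategy from the application code. As a rough illustration, a do_all-style construct could look as follows; the name and signature below are only a sketch modeled on this abstract, not the library's actual interface:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch of a do_all-style construct: apply a stencil functor
// at every interior point of a 1D grid. The library, not the user, owns the
// traversal and could parallelize or offload this loop.
template <typename Functor>
void do_all(const std::vector<double>& in, std::vector<double>& out, Functor f) {
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = f(in, i);
}

// A three-point averaging stencil, of the kind found in simple PDE solvers.
struct Average3 {
    double operator()(const std::vector<double>& u, std::size_t i) const {
        return (u[i - 1] + u[i] + u[i + 1]) / 3.0;
    }
};
```

The point of the abstraction is that the same user functor could be dispatched unchanged to an OpenMP or GPU backend.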

psort, yet another fast stable external sorting software

Marco Bressan
University of Padova

psort has been the fastest sorting software of the last four years, currently sorting 334 GB of data for $0.01 of computer time (according to the 2011 PennySort benchmark) and sorting 100 MB from disk to disk in less than 60 ms. This talk presents its internals and the careful fitting of its structure to the architecture of modern platforms, which makes it significantly faster than state-of-the-art sorting software such as Nsort and STXXL. Most of our optimizations are not sorting-specific, and would be of help in any data-intensive software such as FFT libraries or DBMSs.

SkePU: A skeleton programming framework for GPU-based systems

Usman Dastgeer
Linkoping University

Skeleton programming is an approach where an application is written with the help of ``skeletons''. A skeleton models common computation and coordination patterns of parallel applications as a pre-defined generic component that can be parametrized with user computations. SkePU is such a skeleton programming framework for multicore CPUs and multi-GPU systems. It has six data-parallel skeletons and one task-parallel skeleton, two container types, and support for execution on multi-GPU systems with both CUDA and OpenCL. Recently, support for hybrid execution, dynamic scheduling, and load balancing has been developed in SkePU.

In this talk, we will briefly discuss how skeleton programming can help in efficiently utilizing the computational power of modern GPUs while providing a high level of abstraction to the application programmer. To show how to achieve performance, we provide a case study on an optimized GPU-based skeleton implementation (CUDA/OpenCL) for 2D stencil computations and introduce two metrics to maximize resource utilization on a GPU. By devising a mechanism to automatically calculate these two metrics, performance can be retained while porting an application from one GPU architecture to another.
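The skeleton idea described above - a pre-defined generic component parametrized with a user computation - can be sketched as follows. This is only an illustration of the programming model on a sequential CPU backend; SkePU's real skeletons, containers, and CUDA/OpenCL backend selection are more elaborate, and the names below are invented:

```cpp
#include <vector>
#include <cstddef>

// Illustrative "map" skeleton in the skeleton-programming spirit: the
// framework owns the iteration and backend choice (CPU, CUDA, OpenCL);
// the user supplies only the per-element computation.
template <typename F>
struct MapSkeleton {
    F userfunc;
    explicit MapSkeleton(F f) : userfunc(f) {}

    std::vector<double> operator()(const std::vector<double>& v) const {
        std::vector<double> out(v.size());
        for (std::size_t i = 0; i < v.size(); ++i)
            out[i] = userfunc(v[i]);     // a GPU backend would launch a kernel here
        return out;
    }
};
```

Because the user code is just the element function, the same program can in principle be retargeted to a different backend without source changes.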

From GPGPUs to Many Core Co-Processors - An exciting journey

Alexander Heinecke
University of Munich

Based on theoretical peak numbers, today's GPUs are faster than standard x86 hardware only by a small factor (2-6x). However, speed-ups greater than 100x are often published. These results raise the question of whether a proper performance engineering approach is generally applied when analyzing GPUs.

The talk outlines the author's view on comparing GPUs to CPUs and other co-processors such as the Intel MIC. The introduced concepts are used for evaluating different applications running on CPU, GPU, and MIC. The examples allow us to draw a rough categorization of application characteristics for tailoring applications to the mentioned architectures. Besides pure performance aspects, ease of use and portability are emphasized in this talk. This includes the fact that CPU optimizations are mandatory when using CPUs and GPUs simultaneously (hybrid programming).

What does it take to achieve energy-efficient computing - an exploration of the software challenge

Paul Kelly
Imperial College London

We have been exploring how to map OpenCL/CUDA onto streaming dataflow architectures implemented using FPGAs. We have also been looking at higher-level models than OpenCL/CUDA that abstract the data access stream, so that effective use of local RAM and DRAM bandwidth can be managed automatically and portably. In this talk I will look at some of the things we are learning from this work in progress. The key idea is to see if we can simultaneously (1) achieve uncompromised performance, (2) support performance portability - so a single body of source code can be mapped automatically to take advantage of different heterogeneous, power-efficient multicore/manycore architectures - and (3) raise the level of abstraction, to promote clarity and reuse in the application code. I will show some of our plans to demonstrate that this can be done.

Experiences in co-design: Tackling the challenges of Performance, Power, and Reliability

Darren Kerbyson

The complexity of large-scale parallel systems necessitates the simultaneous optimization of multiple hardware and software components in order to achieve goals in performance, energy efficiency, and fault tolerance. With system costs amounting to hundreds of millions of dollars and annual operational costs of tens of millions of dollars, there is clear motivation for the development of a co-design methodology that enables maximum return in terms of achievable science per unit cost. Using tools such as performance modeling that enable exploration of design alternatives prior to implementation, we demonstrate the benefits of a co-design methodology for optimizing extreme-scale systems on the path to exascale systems and workloads. We exemplify this by drawing on our experiences in co-design for performance, for power, and for reliability. Performance modeling was used in the design and use of the first petaflop system, Roadrunner, which was ultimately deployed using the IBM Cell processor. On-line models are in use to optimize energy by coupling the application to the run-time through Energy Templates. A fault-tolerant system involving the application, programming model, and run-time was co-designed, allowing continued application execution in the presence of node failures.

Thick control flows - Imperative version of stream programming

Ville Leppanen
University of Turku

We propose a new concept of thick control flows for defining parallel (multithreaded) programs, and consider its influence on the semantics of ordinary language constructs. When a thick control flow (thick in terms of the number of threads) executes a statement or an expression of a program, all the threads of that flow are considered to execute the same program element synchronously in parallel. Considering method calls: when a control flow with thickness t calls a method, the method is not called separately by each of the t threads; the control flow calls it only once, with t threads. A call stack is associated not with each thread but with each parallel control flow, since threads do not have program counters - only control flows have program counters. The concept of a thread is only implicit. A thick, thread-wise variable is an array-like value holding a per-thread actual value. Method signatures naturally extend types with thickness, but non-thick types are also useful.
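These semantics can be illustrated by a small simulation: a flow of thickness t executes each statement once, on behalf of t implicit threads, and thick variables hold one value per thread. All names here are invented for illustration; the authors' actual language design may differ.

```cpp
#include <vector>
#include <functional>
#include <cstddef>

// Simulated "thick control flow": there is one program counter per flow,
// not per thread, so the flow runs each statement once for all its threads.
struct ThickFlow {
    std::size_t thickness;                 // number of implicit threads

    // A thick, thread-wise variable: array-like, one value per thread.
    using Thick = std::vector<int>;
    Thick make(int init) const { return Thick(thickness, init); }

    // Executing a statement: all threads of the flow perform it
    // synchronously (here, sequentially; conceptually in parallel).
    void run(const std::function<void(std::size_t)>& stmt) const {
        for (std::size_t tid = 0; tid < thickness; ++tid)
            stmt(tid);
    }
};
```

A method called by this flow would be entered once, with the whole thickness, rather than once per thread - which is the key difference from conventional threading.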

The concept of thick control flow makes the programmer focus on the cooperation of a few parallel thick control flows instead of a huge number of parallel threads. The concept of the computation's state is promoted, as a flow is seen to have a state (instead of each thread). The concept of state has played a central role in achieving correctness in sequential programs. Our approach is a natural generalization of ordinary imperative sequential programming.

The whole idea of thick control flows is very close to that of vector and stream computing, e.g. the Brook language. Many GPGPU computing approaches (BrookGPU, OpenCL, CUDA) can be seen to have a lot in common with our thick control flows. For example, the stream-based BrookGPU in practice defines computational functions (called kernels) that operate on multiple streams and produce stream values. The streams can have a multidimensional shape, and that shape corresponds to a set of executing threads. Executing a kernel means synchronously executing a thick control flow (the kernel's body) over the stream values. At the execution level, the SIMT (Single Instruction Multiple Thread) approach of GPU devices is of course close to our approach. The main difference with respect to stream computing is its dataflow/functional style versus the imperative style of thick control flows.

The GPU approach provides efficient possibilities for implementing the thick control flow approach, yet existing GPU architectures as such will not be enough.

rCUDA: A tool for accessing a remote GPU for GPU computing

Rafael Mayo Gual
University Jaume I

Current high-performance clusters are equipped with high-bandwidth/low-latency networks, lots of processors and nodes, very fast storage systems, etc. However, due to economic and/or power-related constraints, it is in general not desirable to provide an accelerating co-processor - such as a graphics processing unit (GPU) - per node. The rCUDA framework enables the concurrent remote usage of CUDA devices, thus reducing the number of GPUs in the cluster. This middleware provides applications with the illusion that they are dealing with a real private GPU. The highly tuned TCP communications module included in the rCUDA open-source package enables wide rCUDA usage: Ethernet, InfiniBand over IP, virtual machines, etc. Non-open-source packages provide additional features: rCUDA's InfiniBand communications module - written directly on top of the low-level InfiniBand Verbs API and featuring GPUDirect support - increases bandwidth, reaching a local-main-memory-to-remote-GPU-memory bandwidth close to that of the InfiniBand interconnect.

SIMD - The next generation: Irregular and non-numerical computing

Jose Moreira
IBM Research

Single Instruction Multiple Data (SIMD) computing has been used successfully in numerically intensive computing since the 1960s. It was particularly effective in the vector machines of the 1970s and 1980s. General-purpose microprocessors started incorporating SIMD instructions in the second half of the 1990s. In addition to operations on floating-point data, these processors also included SIMD instructions for manipulating fixed-point data, primarily for digital signal processing and graphics processing. GPUs have extended and enhanced SIMD with the concept of Single Instruction Multiple Threads (SIMT), in which a single stream of instructions can be (conditionally) executed by multiple independent threads. Each of these steps has broadened the applicability of SIMD computing. In this talk we will discuss new features appearing in modern processors that are broadening SIMD computing even further. We will discuss the use of SIMD computing in regular expression processing, business analytics, and sparse matrix computations. We will also discuss the cost and efficiency of hardware mechanisms to support new SIMD features.
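The SIMT idea mentioned above - one instruction stream conditionally executed by many threads - can be illustrated with an explicit execution mask. This is a simplified software model of the mechanism, not a description of any actual GPU hardware:

```cpp
#include <array>
#include <cstddef>

// A toy SIMT model: one control flow drives several lanes ("threads");
// a per-lane mask decides which lanes actually execute a branch.
constexpr std::size_t LANES = 4;
using Vec  = std::array<int, LANES>;
using Mask = std::array<bool, LANES>;

// Execute "if (x < 0) x = -x;" across all lanes with a single
// instruction stream: evaluate the predicate per lane, then run the
// branch body only on the lanes where the predicate held.
Vec simt_abs(Vec x) {
    Mask active{};
    for (std::size_t l = 0; l < LANES; ++l)
        active[l] = x[l] < 0;             // predicate evaluation, all lanes
    for (std::size_t l = 0; l < LANES; ++l)
        if (active[l]) x[l] = -x[l];      // masked execution of the branch
    return x;
}
```

When lanes disagree on the predicate, both paths of a branch are issued with complementary masks, which is why divergence costs performance on SIMT hardware.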

EXA-Scale Computing in 2018: About the Challenges for all Kinds of Technology!

Wolfgang Nagel
Technical University of Dresden

Parallelism and scalability have become the major issues in all areas of computing; nowadays pretty much everybody - even beyond the classical HPC field - is using parallel codes. Moreover, the number of cores on one chip - homogeneous as well as heterogeneous - will increase significantly in the future. Soon we will have millions of cores in one HPC system, the ratios between flops and memory size as well as bandwidth for memory, communication, and I/O will get worse, the energy requirements might be extraordinary, and it is not clear what the best programming paradigm will be.

The talk will describe technology developments, software requirements, and other related issues to identify challenges for the community which have to be carefully addressed, and solved, within the next couple of years.

Introduction to GPU Architecture

Pratap Pattnaik
IBM Research

This talk will provide an introduction to GPU architecture, particularly the NVIDIA Fermi family of processors.

GPU implementations of irregular algorithms

Keshav Pingali
University of Texas at Austin

There is a substantial body of work on using GPUs to accelerate regular applications, which are applications in which the key data structures are dense matrices accessed in systematic ways, such as by columns or rows. We understand much less about how to use GPUs to accelerate irregular applications, in which the key data structures are pointer-based structures such as graphs and trees. In this talk, we discuss our experience in using GPUs to accelerate two irregular applications: an n-body application (Barnes-Hut), in which the key data structure is a tree, and Andersen-style points-to analysis, in which the key data structure is a graph.

This is joint work with Martin Burtscher (Texas State) and Mario Mendez-Lojo (Samsung Research).

Utilising GPUs by the press of a compiler switch? Lessons from the SaC approach

Sven-Bodo Scholz
University of Hertfordshire

This talk gives an overview of the lessons we have learned so far from trying to compile the data-parallel programming language SaC into code for graphics cards. One of the key challenges of this approach is the deliberate absence of any hint from the programmer as to which program parts should be executed on GPUs and which parts should not. The talk highlights the key techniques we have identified as being essential and what level of performance we have achieved through them. The talk also identifies limits of this approach and it presents outstanding issues that require further exploration.

GPUs in Astrophysics? Project ISAAC and Multi-science with AMR based GAMER Framework

Hemant Shukla

The growing need for power-efficient extreme-scale high-performance computing (HPC), coupled with plateauing clock speeds, is driving the emergence of massively parallel compute architectures. Despite the early promise of multi-core, many-core, and commodity accelerator-based systems - for example, graphics processing units (GPUs) - the wide-ranging adoption of these architectures is steeped in difficult learning curves and requires reengineering of existing applications, which mostly leads to expensive and error-prone code rewrites without prior guarantee or knowledge of any speedups.

Project ISAAC is focused on developing a comprehensive and energy-efficient HPC infrastructure comprising tools, libraries, frameworks, and complete turnkey solutions using the emerging many-core and accelerator-based programmable architectures for a broad class of applications critical to research in Physics & Astronomy. The unique ISAAC approach is science/application-driven: the key drivers are identified up front to assess the efficacy of this new approach. The applications are categorized into three separate domains - simulation, instrumentation, and data processing - covering specific real-world challenges in cosmology, radio astronomy, optics, and image/data processing with potential for interdisciplinary relevance.

A concrete example of a simulation application is the common use of adaptive mesh refinement (AMR), advanced hydrodynamics partial differential equation (PDE) solvers, and Poisson-gravity solvers. Taking advantage of these commonalities, we use the GPU-aware AMR code GAMER to solve multi-science problems in astrophysics, hydrodynamics, and particle astrophysics with a single codebase. We demonstrate significant speedups in disparate classes of scientific applications on three separate GPU clusters. By extensively reusing the extendable single codebase, we mitigate the impediments of significant code rewrites. We also collect performance and energy-consumption benchmark metrics. In addition, we propose a strategy and framework for legacy and new applications to successfully leverage the evolving GAMER codebase on massively parallel architectures. The framework and the benchmarks are aimed at helping quantify adoption strategies for legacy and new scientific applications.

Parallel Application Characterization with Quantitative Metrics

Henk Sips
Delft University of Technology

When computer architects re-invented parallelism through multi-core processors, application parallelization became a problem. Now that multi-cores have penetrated from hand-helds to supercomputers, parallelization becomes a large-scale challenge. A lot of research is going into compiler improvements, language extensions, frameworks, and application/platform case studies. While fairly successful, these solutions are based on experimental tools, trial-and-error, and expert knowledge, and do not bring multi-core programming within reach of the whole software industry. We believe that the challenge of ``mass-parallelization'' must be tackled more systematically. Development begins at application specification and algorithm design, followed by application characterization with trade-offs in parallelization strategies and data layouts. With a proper software design, implementation and optimization can start. In this presentation, we focus on quantitative application characterization for such a systematic approach. We introduce a set of metrics to characterize applications and show how they can be evaluated. We present our interpretation of the results and suggest ways to use them to guide design decisions.

Programming heterogeneous, accelerator-based multicore machines: current situation and main challenges

Samuel Thibault
LaBRI, University Bordeaux

Heterogeneous accelerator-based parallel machines, featuring manycore CPUs and GPU accelerators, provide an unprecedented amount of processing power per node. Dealing with such a large number of heterogeneous processing units - providing highly unbalanced computing power - is one of the biggest challenges that developers of HPC applications have to face. To fully tap into the potential of these heterogeneous machines, pure offloading approaches, which consist of running an application on regular processors while offloading part of the code onto accelerators, are not sufficient.

In this talk, I will go through the major programming environments that were specifically designed to harness heterogeneous architectures, including extensions of parallel languages and specific runtime systems. I will discuss some of the most critical issues programmers have to consider to achieve portability of performance, and I will show how advanced runtime techniques can speed up applications in the domain of dense linear algebra.

Finally, I will give some insights into the main challenges that designers of programming environments will have to face in the upcoming years.

The GPU computational model as a bridge to Stream-based Reconfigurable HW design

Pedro Trancoso
University of Cyprus

Reconfigurable hardware can be used as a very efficient co-processing solution to accelerate certain types of applications. However, obtaining efficient hardware is a hard task due to the large flexibility of the design space. Moreover, different applications have different hardware design requirements. To facilitate the design of hardware accelerators, we propose a methodology that adopts the stream-based computing model and the even more widely used Graphics Processing Units (GPUs) as prototyping platforms. Massively parallel programming allows designers to identify contentions and hazards, as well as optimize their parallel approach, at an early stage of the design. We then propose a modular reconfigurable architecture to support the efficient deployment of stream accelerators in hardware. In particular, the architecture consists of a group of slots into which pre-defined accelerators can be deployed by the system at run time. The proposed architecture combines the flexibility of reconfigurable hardware with the advantages of stream computing.

Where do GPUs fit: An analysis based on case studies

Jan Treibig
University of Erlangen

GPUs have experienced a tremendous hype in recent years. While some promises could not be fulfilled, there is still a large impact on how people think about application-specific hardware optimizations. This talk presents an analysis of how GPUs compare to other hardware technologies that focus on data parallelism, such as SIMD-enabled multicore processors and classic vector machines. Using three algorithms as examples (a back-projection algorithm from a medical application, an image filter algorithm, and sparse matrix-vector multiplication), differences and opportunities are discussed. Beyond the performance aspects, we also cover software development issues.

From Total Enthusiasm to Bitter Cynicism - The Religious War on GPUs

Carsten Trinitis
University of Munich

This talk will give an overview of different compute-intensive applications on GPGPUs at TUM and the conclusions drawn by the authors. As both very promising and very disappointing results were obtained (depending on the application), the community, as usual, divides into GPU fans and GPU haters. The talk will try to draw conclusions from this phenomenon and give an outlook on which architectures might be best suited for these applications in the future. The applications comprise examples from medical imaging as well as electrical engineering.


Henry M. Tufo III

OpenCL vs. Cuda: A Programmability Debate

Ana Lucia Varbanescu
University of Amsterdam

OpenCL's portability has been heavily advertised by all the Khronos consortium members. The language provides a flexible model to enable code portability, and so far it seems to work. However, performance portability and productivity remain questionable. Therefore, one should ask what exactly could make OpenCL attractive to GPU programmers. Or, in other words: is it worth the trouble?

In this talk we give several answers to this question, discussing the productivity, portability, and performance of OpenCL. Specifically, we focus on the performance variation that OpenCL code shows when compared with similar CUDA code, and we investigate the reasons behind these performance gaps. We show some interesting results based on several case studies from the SHOC and Rodinia benchmarks, and we conclude that OpenCL is a viable alternative for programming GPUs, but not necessarily a better one.

Performance, efficiency, and programming challenges in many-core/exascale computing architecture: the general purpose GPU answer and more

Lorenzo Verdoscia
CNR Napoli

The availability and advancement of programmable graphics cards have been among the most exciting developments in parallel programming over the past few years. A high-end graphics card costs less than a high-end CPU and provides tantalizing peak performance approaching, or exceeding, one teraflop. Moreover, last year China's GPU-rich TIANHE-1A supercomputer was ranked the fastest system in the world. This fact focused attention, and a lot of discussion within the HPC community, on the advantages of the Graphics Processing Unit (GPU) versus the CPU used in many systems. Thus, the new heterogeneous architectures, based on a CPU in combination with GPUs and/or Field Programmable Gate Arrays (FPGAs), pose new challenges and opportunities in HPC.

Recently, researchers in the HPC scientific community have come to believe that the challenges are far greater than mere energy. The other great hurdles lie in programmability and resiliency, and to arrive at solutions for these problems, a revolutionary approach is required. The historic levels of scaling have largely ended; the challenges related to power and code are the most important ones. The challenge is no longer about flops; it is about data movement. We traditionally think that this is simply a matter of power efficiency, but it is mainly a matter of locality. Algorithms should be designed to perform more work per unit of data movement, and programming systems should further optimize this data movement. Architectures need to facilitate data movement by providing an exposed hierarchy and efficient communication. In this context, heterogeneous architectures do represent the right solution, but a number of problems must be solved before many applications can take advantage of these hybrid architectures. Trends in processor and system architecture drive toward very high core-count designs and extreme software parallelism to solve exascale-class problems.
In order to efficiently exploit the available computing power, programs have to be expressed in a finer-grained manner. As far as the execution model is concerned, the computation units that can be scheduled atomically have to be more fine-grained than a traditional thread. On the other hand, in heterogeneous architectures the programming model is based on a high-level memory management interface, enabling hierarchical description of data domains. In contrast with traditional homogeneous multicore architectures, heterogeneous machines that do not provide a coherent global memory have many more runtime requirements than mere task scheduling. In this talk we discuss the challenging issues introduced by the design of a heterogeneous many-core exascale machine featuring different GPUs and computing units. The most fundamental problems to be solved at the different levels, from the programming model to the computational and execution models, will be analyzed, and the proposed solutions (if any) will be discussed as well. Moreover, we will present our proposal, based on the dataflow execution model, in which the scheduling quantum is the elemental operator and the scheduling mechanism is completely decentralized and operates at a single level. "Learning from the past is necessary to creating future advances".