Abstracts (in progress)


A Machine Learning Approach to Online Fault Classification in HPC Systems

Ozalp Babaoglu
University of Bologna, Italy

As High-Performance Computing (HPC) systems push towards the Exascale goal, failure rates at both the hardware and software levels will increase significantly. Detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they can turn into failures, becomes essential for their continued operation. In this talk, I will present a fault classification method for HPC systems based on machine learning. The novelty of our approach rests on the fact that it can operate on streamed data in an online manner, thus opening the possibility of devising and enacting control actions on the target system in real time. I will then describe a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we injected faults into an in-house experimental HPC system using FINJ and generated an extensive fault dataset. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field.

The I/O complexity of Toom-Cook integer multiplication

Gianfranco Bilardi
University of Padova, Italy

Nearly matching upper and lower bounds are derived for the I/O complexity of the Toom-Cook-k (or Toom-k) algorithm computing the product of two integers, each represented with n digits in a given base s, in a two-level storage hierarchy with M words of fast memory, where different digits are stored in different memory words.

A lower bound IO_{A_k}(n, M) = Ω((n/M)^{log_k(2k−1)} M) on the I/O complexity is established by a technique that combines an analysis of the size of the dominators of suitable sub-CDAGs of the Toom-Cook-k CDAG (Computational Directed Acyclic Graph) with the analysis of a function, which we call "Partial Grigoriev's flow", that captures the amount of information to be transferred between specific subsets of input and output variables by any algorithm that solves the integer multiplication problem. The lower bound applies even if recomputation of partial results is allowed.

A careful implementation of the Toom-Cook-k algorithm, assuming M = Ω(k^3 log_s k), is also developed and analyzed, leading to an I/O complexity upper bound that is within a factor O(k^2) of the corresponding lower bound, hence asymptotically optimal for fixed k.
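
Putting the two results side by side (with M the fast-memory size and n the number of input digits), the bounds read:

\[
IO_{A_k}(n,M) = \Omega\!\left(\left(\tfrac{n}{M}\right)^{\log_k(2k-1)} M\right),
\qquad
IO_{A_k}(n,M) = O\!\left(k^2 \left(\tfrac{n}{M}\right)^{\log_k(2k-1)} M\right) \text{ for } M = \Omega(k^3 \log_s k),
\]

so the gap between them is O(k^2), and for any fixed k the implementation is asymptotically optimal.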

Both the lower and the upper bounds are actually derived in the more general case where the value of k is allowed to vary with the level of recursion, although the quantitative expressions become more involved. Extensions of the lower bound for a parallel model with P processors are also discussed.

Joint work with L. De Stefani - Brown University.

OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations

Biagio Cosenza
T.U. Berlin, Germany

Agent-based simulations are becoming widespread among scientists from different areas, who use them to model increasingly complex problems. To cope with the growing computational complexity, parallel and distributed implementations have been developed for a wide range of platforms. However, it is difficult to have simulations that are portable to different platforms while still achieving high performance. We present OpenABL, a domain-specific language for portable, high-performance, parallel agent modeling. It comprises an easy-to-program language that relies on high-level abstractions for programmability and explicitly exploits agent parallelism to deliver high performance. A source-to-source compiler translates the input code to a high-level intermediate representation exposing parallelism, locality and synchronization, and, thanks to an architecture based on pluggable backends, generates target code for multi-core CPUs, GPUs, large clusters and cloud systems. OpenABL has been evaluated on six applications from various fields such as ecology, animation, and social sciences. The generated code scales to large clusters and performs similarly to hand-written target-specific code, while requiring significantly fewer lines of code.

HPC and AI: Impact and Opportunities

Anne C. Elster
NTNU: Norwegian University of Science and Technology, Norway

HPC (High Performance Computing) techniques are needed in order to achieve the desired performance for computationally intensive tasks, ranging from image processing to astrophysics and weather simulations. Traditionally, the HPC field drove companies like Cray, IBM and others to develop processors for supercomputing. However, since the proliferation of COTS (commercial off-the-shelf) processors, including GPUs for gaming and, more recently, AI, market forces in other fields have driven the innovation in processor design. This means that algorithms, tools and applications should now adapt to and take advantage of tensor processors, machine learning techniques, and other related technology, rather than expecting that old computational models will continue to hold. In this talk, we will discuss these issues, including how this is also an opportunity to help develop better graph algorithms for AI and to apply some of the techniques from AI to HPC challenges.

A Lightweight Programming Model for a Data-Flow Execution Model

Roberto Giorgi
University of Siena, Italy

Data-Flow Threads (DF-Threads) represent a simple and scalable approach to High Performance Computing. To ease the mapping from generic high-level programming models, we defined a low-level, minimalistic API that can be targeted by compilers, ideally by directly mapping their internal data-flow intermediate representations onto it. The API uses the C language as its programming model and can also be used directly by the programmer; some basic examples are provided. We have evaluated our approach against OpenMPI and CUDA, and it achieves higher GFLOPS per core than both. In particular, OpenMPI shows a large overhead due to OS-kernel activity.
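
Purely as an illustration of what such a minimalistic, C-style data-flow API might look like (the identifiers df_schedule, df_write and df_read below are hypothetical and are not the actual DF-Threads interface), the following self-contained sketch includes a toy single-threaded mock of the runtime so that it runs:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// --- Hypothetical minimal data-flow threading API (names are illustrative) ---
typedef uint64_t df_frame_t;
df_frame_t df_schedule(void (*body)(df_frame_t), std::size_t num_inputs);
void       df_write(df_frame_t frame, std::size_t slot, uint64_t value);
uint64_t   df_read(df_frame_t frame, std::size_t slot);

// --- Toy single-threaded mock of the runtime, only to make the sketch run ---
namespace {
struct Frame { void (*body)(df_frame_t); std::vector<uint64_t> in; std::size_t pending; };
std::map<df_frame_t, Frame> frames;
df_frame_t next_id = 1;
}

df_frame_t df_schedule(void (*body)(df_frame_t), std::size_t num_inputs) {
    df_frame_t id = next_id++;
    frames[id] = Frame{body, std::vector<uint64_t>(num_inputs, 0), num_inputs};
    return id;
}

void df_write(df_frame_t frame, std::size_t slot, uint64_t value) {
    Frame& f = frames[frame];
    f.in[slot] = value;
    if (--f.pending == 0) f.body(frame);   // data availability triggers execution
}

uint64_t df_read(df_frame_t frame, std::size_t slot) { return frames[frame].in[slot]; }

// --- Usage: a consumer thread that fires once both of its operands arrive ---
static void add_body(df_frame_t self) {
    std::printf("%llu\n", (unsigned long long)(df_read(self, 0) + df_read(self, 1)));
}

int main() {
    df_frame_t f = df_schedule(add_body, 2);
    df_write(f, 0, 40);   // writing the inputs is the only synchronization:
    df_write(f, 1, 2);    // the thread runs as soon as its last input lands
    return 0;
}
```

The point of such an interface is that a compiler lowering a data-flow intermediate representation only has to emit frame allocations and writes; no locks or explicit barriers appear at the API level.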

ModSim in the AI Era

Adolfy Hoisie
Brookhaven National Laboratory, USA

It is difficult to make predictions. Especially about the future. But for sure it is heterogeneous

Ben Juurlink
T.U. Berlin, Germany

Dennard observed in 1974 that, as transistors become smaller, their power density stays roughly constant, so that the power consumption of integrated circuits stays roughly constant as well. This scaling law broke down around the year 2006, which motivated the multicore revolution. Furthermore, there are strong signs that Moore’s law is close to ending. For these reasons it seems that the only way to provide even higher performance (aside from as-yet unproven technologies such as quantum computers) is through heterogeneous computer systems. In fact, such heterogeneous systems have dominated the TOP500 list of supercomputers in recent years, since they achieve higher performance per watt than homogeneous designs. In this talk I will briefly describe some of the research my team members and I have performed in this area. In particular, some of the research results obtained in the EU projects Encore, LPGPU, and LPGPU2 will be presented.

“Turing Tariff” Reduction: architectures, compilers and languages to break the universality barrier

Paul Kelly
Imperial College, United Kingdom

This talk is about the “Turing tariff”, sometimes called the “Turing tax” (I did not invent the term, and I do not know where it came from). We pay Turing tariffs when we use a general-purpose machine for a purpose where a specialized product would be better. The aim of this talk is to focus attention on how, when and why we pay Turing tariffs, and what strategies we know for avoiding them. I will look at processor architecture and accelerator architecture, as well as the role of the compiler. I will also try to show that there is an analogous price to pay when we use a general-purpose language where we might have done better with a toolchain able to deliver domain-specific optimisations.

High-level programming infrastructure and data movement optimization for accelerator-based HPC architectures

Christoph Kessler
Linköping University, Sweden

This presentation has two parts. In the first part I will briefly present the design of the EXA2PRO high-level programming model and framework, which is currently under development. EXA2PRO (exa2pro.eu) is an H2020 FETHPC project (2018-2021) with 7 partners from 5 countries. The vision of EXA2PRO is to develop and apply a programming environment for the productive deployment of highly parallel, portable applications on exascale computing systems. The EXA2PRO programming model is designed to support convenient, high-level programming of heterogeneous parallel systems by seamlessly integrating multi-backend algorithmic skeletons, multi-variant components, smart data-containers, and a platform modeling and introspection framework in an HPC programming and execution environment.

In the second part I consider the global transfer fusion problem in heterogeneous computer systems with non-shared memory, i.e., the problem of minimizing, for a heterogeneous program modeled by a dataflow graph of kernel calls, the overall number of operand data transfers and thus the accumulated transfer startup overhead. Our approach analyzes the kernel-operand dependence graph and reorders the operand arrays in memory such that transfers and memory allocations of multiple operands adjacent in memory can be merged, saving transfer startup costs and memory allocation overhead.
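
As a toy illustration of the idea (not the actual framework): suppose dependence analysis shows that three kernel operands always travel to the device together. Reordering them into one contiguous host region lets a single merged transfer and a single device allocation replace three separate ones, so the per-transfer startup cost is paid once. The device_transfer function below is a hypothetical stand-in for whatever copy primitive the backend uses; here it is mocked with memcpy so the sketch runs:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a host-to-device copy; each call is assumed to
// carry a fixed startup latency in addition to the per-byte transfer cost.
static void device_transfer(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);   // mock: a plain copy instead of a real DMA
}

int main() {
    const std::size_t n = 1024;
    std::vector<float> dev(3 * n);                     // mock device memory

    // Unfused: three separate operands, three transfers, three startup costs.
    std::vector<float> a(n), b(n), c(n);
    device_transfer(dev.data()        , a.data(), n * sizeof(float));
    device_transfer(dev.data() + n    , b.data(), n * sizeof(float));
    device_transfer(dev.data() + 2 * n, c.data(), n * sizeof(float));

    // Fused: the optimization reorders the operands so that a, b and c are
    // allocated back-to-back in one contiguous block; kernels then use views
    // into that block, and one merged transfer replaces the three above.
    std::vector<float> abc(3 * n);
    float* a_view = abc.data();
    float* b_view = abc.data() + n;
    float* c_view = abc.data() + 2 * n;
    (void)a_view; (void)b_view; (void)c_view;          // would be passed to kernels
    device_transfer(dev.data(), abc.data(), abc.size() * sizeof(float));
    return 0;
}
```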

Performance Software for Deep Learning Chips

Wei Li
Intel, USA

Scaling a Fine-Grained Parallel Biological Sequence Alignment Application to Hundreds of GPUs

Alba Cristina Melo
University of Brasilia, Brazil

In this talk, we discuss related work in the area of parallel biological sequence alignment with the Smith-Waterman algorithm and present CUDAlign, our fine-grained multi-GPU strategy to align DNA sequences with up to 249 million characters on hundreds of GPUs. In order to achieve this, CUDAlign uses (a) a fine-grained parallelogram-shaped strategy to exploit parallelism; (b) overlapping of computation and communication; and (c) an innovative speculation technique, which is able to parallelize a phase of the Smith-Waterman algorithm that is inherently sequential. We will show that CUDAlign is able to attain the impressive rate of 10.3 TCUPS (Trillions of matrix Cells Updated Per Second), the best performance in the literature so far. We also present a pruning technique for one GPU that is able to prune more than 50% of the Smith-Waterman matrix and still retrieve the optimal alignment. Then, we will discuss our thoughts on how to adapt this pruning strategy to a multi-GPU environment and make it scalable to hundreds of GPUs, showing some preliminary results. Finally, we will present open challenges and research directions.
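
For reference, the score matrix that such strategies parallelize is defined by the classical Smith-Waterman recurrence, shown here with a linear gap penalty g (production tools typically use the affine-gap variant):

\[
H_{i,j} = \max\left\{\, 0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g \,\right\},
\qquad H_{i,0} = H_{0,j} = 0,
\]

where s(a_i, b_j) is the match/mismatch score. Every cell depends only on its left, upper and upper-left neighbours, so all cells on the same anti-diagonal can be computed in parallel; this wavefront pattern is the source of the fine-grained parallelism exploited on GPUs.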

On the ROI of Parallel Performance Optimization

Bernd Mohr
Jülich Supercomputing Center, Germany

Developers and users of HPC applications can count on free advice from European experts to analyse the performance of their scientific codes. The Performance Optimization and Productivity (POP) Centre of Excellence, funded by the European Commission under H2020, gathers experts from BSC, IT4I, JSC, HLRS, RWTH Aachen, NAG, Teratec, and UVSQ. A first phase ran from October 2015 to March 2018, and the CoE was renewed for another three years in December 2018. The objective of POP is to provide performance measurement and analysis services to the industrial and academic HPC community, help them better understand the performance behaviour of their codes, and suggest improvements to increase their efficiency. Training and user education regarding application tuning is also provided. Further information can be found at https://www.pop-coe.eu/.

The talk will briefly introduce the POP CoE, its objectives, and the performance analysis and optimization services it provides to the community. Details about the tools, methods, and metrics used for these services will also be given. The main part of the presentation will report on statistics, outcomes, and success stories from the more than 150 performance assessments performed so far in the project. The talk will conclude with lessons learned from providing performance analysis and optimization in a service-oriented professional setting.

Towards Secure Computing Environments

Jose Moreira
IBM T. J. Watson Research Center, USA

Cyber security threats and vulnerabilities draw more public attention to computing than any other issue, and rightly so. Millions of people are impacted every year by unexpected attacks, with varying degrees of financial loss. It is no surprise that the community, including government, industry and academia, is paying much more attention to the problem. In this talk, we will cover some of the basic forms of attack that threaten our computing systems. We will also discuss the architectural innovations being developed and deployed in computer systems to counter those attacks. We plan to discuss a broad range of issues, including return-oriented and jump-oriented programming, memory encryption, memory integrity, side-channel attacks, and homomorphic computing. We will conclude the talk with a list of specific actions that can make our computing systems more secure.

Large Capacity Memory Systems with Intel(R) Optane DC Memory: An Introduction

Ramesh Peri
Intel, USA

A new kind of memory that is much denser than traditional DRAM, but with higher latency, is available on the market today. The high density of this memory technology enables semiconductor manufacturers to build memory DIMMs with very large capacity. One such product is Intel(R) Optane DC Persistent Memory, which allows a user to build a compute platform containing up to 6TB of main memory. In this talk we introduce the system architecture of such a compute platform and share the performance results of a few graph algorithms, operating on very large graphs, that use a significant portion of the memory capacity of this machine. We will also share details of the optimizations, in the operating system and in user code, that were needed to achieve good performance. Finally, we will introduce some development tools available for these machines that proved useful in tuning applications to run efficiently on such large-capacity memory systems.

What can machine learning do for computer systems?

Keshav Pingali
University of Texas at Austin, USA

Machine learning has transformed areas like vision and AI but its impact on computer systems has so far been more limited. In this talk, I discuss some new directions in computer systems for which I believe machine learning is essential.

Leveraging the Use of High Performance Computing Systems in Mining of Streams of Real Valued Data Points

Amir Raoofy
T.U. Munich, Germany

The TurbO project aims at studying and optimizing heavy-duty gas turbines and power plants. Monitoring the operation and combustion process of these turbines generates large and fine-grained data streams, which must be analyzed in detail. Such analyses bring insights to turbine operators and manufacturers, enabling further optimization of turbine operation and better future designs. They also require substantial computational resources; I will discuss how HPC systems can be exploited for this purpose, opening the door to a wide range of new possibilities for gaining fine-grained insights from data streams.

Modern online processing techniques rely on domain knowledge and insights from previous operation of the system under investigation, and various techniques are used to extract such insights. In particular, we focus on the class of “Matrix Profile” approaches, which encode similarities and anomalies in univariate and multivariate streams and simplify data mining tasks on them. In this talk, I will discuss kernels for computing Matrix Profiles and how they can be implemented on HPC systems.

I will present efficient kernels for calculating Matrix Profiles, the resource requirements for their efficient execution, and how one might approach implementing an efficient Matrix Profile calculator on a distributed system. I will introduce a domain decomposition scheme based on partitioning of records, leading to a similar data distribution for the parallel execution of the different kernels. Furthermore, I will discuss the optimization techniques used to boost the performance of the kernels on a CPU-based system.
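
As a baseline illustration only (a naive O(n^2 m) kernel, not one of the optimized kernels discussed in the talk), the self-join Matrix Profile of a univariate series can be computed by comparing every length-m subsequence against every other one, skipping trivial matches near the diagonal:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Naive self-join matrix profile: for each length-m subsequence of ts, the
// Euclidean distance to its nearest neighbour, excluding an exclusion zone of
// +/- m/2 around the diagonal to avoid trivial matches. Optimized kernels use
// z-normalized distances and O(1) incremental dot-product updates instead.
std::vector<double> matrix_profile(const std::vector<double>& ts, std::size_t m) {
    if (m == 0 || ts.size() < m) return {};
    const std::size_t n = ts.size() - m + 1;   // number of subsequences
    const std::size_t excl = m / 2;            // trivial-match exclusion zone
    std::vector<double> mp(n, std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i + excl >= j && j + excl >= i) continue;   // |i - j| <= excl
            double d = 0.0;
            for (std::size_t k = 0; k < m; ++k) {
                const double diff = ts[i + k] - ts[j + k];
                d += diff * diff;
            }
            if (d < mp[i]) mp[i] = d;          // keep squared nearest-neighbour distance
        }
    }
    for (double& v : mp) v = std::sqrt(v);
    return mp;
}
```

A distributed version would partition the (i, j) iteration space across nodes, following a domain decomposition of the record partitions, and reduce the per-node minima at the end.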

Productivity through Abstraction
Transfinite Arrays for Unifying Streams and Arrays

Sven-Bodo Scholz
Heriot-Watt University, United Kingdom

Arrays and streams seem to be fundamentally at odds: arrays require their size to be known in advance, while streams grow dynamically; arrays offer direct access to all of their elements, while streams provide direct access to the front elements only; the list goes on. Despite these differences, many computational problems at some point require shifting from one paradigm to the other. The driving forces for such a shift can be manifold: typically, it is a change in the task requirements at hand, or it is motivated by efficiency considerations. In this talk, we propose a generalisation of arrays, named "Transfinite Arrays", which allows switching between these two worlds without changing the program specification itself. Instead, a compiler can generate both a streamed version and an array-based version from the very same program specification.

Faster Matrix Multiplication via Sparse Decomposition

Oded Schwartz
Hebrew University, Israel

Fast matrix multiplication algorithms are of practical use only if the leading coefficient of their arithmetic complexity is sufficiently small. Many algorithms with low asymptotic cost have large leading coefficients, and are thus impractical. Karstadt and Schwartz (2017) demonstrated a technique that reduces the leading coefficient by introducing fast O(n^2 log n) basis transformations, applied to the input and output matrices. We generalize their technique by allowing bases of larger dimension for the transformations, while maintaining low overhead. Thus we accelerate several matrix multiplication algorithms beyond what is known to be possible using the previous technique. Of particular interest are a few new sub-cubic algorithms with leading coefficient 2, matching that of classical matrix multiplication. For example, we obtain an algorithm with arithmetic complexity 2n^{log_3 23} + o(n^{log_3 23}), compared to the 2n^3 − n^2 of the classical algorithm. Such new algorithms can outperform previous ones (classical included) even on relatively small matrices. We obtain lower bounds matching the coefficient of several of our algorithms, proving them to be optimal.
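
For context, the leading coefficient of a recursive ⟨n_0, n_0, n_0; t⟩ bilinear algorithm can be read off a standard recurrence (a back-of-the-envelope sketch, not taken from the paper): if each recursion step performs t block multiplications and q block additions/subtractions, the arithmetic cost F(n) satisfies, assuming t > n_0^2,

\[
F(n) = t\, F(n/n_0) + q\,(n/n_0)^2, \quad F(1) = 1
\;\;\Longrightarrow\;\;
F(n) = \left(1 + \frac{q}{t - n_0^2}\right) n^{\log_{n_0} t} \;-\; \frac{q}{t - n_0^2}\, n^2 .
\]

For the classical 2×2×2 algorithm (t = 8, q = 4) this gives exactly 2n^3 − n^2; lowering q through cheap basis transformations is what pushes the leading coefficient of fast algorithms down towards 2.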

Joint work with Gal Beniamini.

Toward a Complexity Theory for Massively Parallel Computations

Michele Scquizzato
University of Padova, Italy

The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale data processing frameworks, and has been receiving increasing attention over the past few years, especially in the context of classical graph problems. So far, the only way to argue lower bounds for this model has been to condition on conjectures about the hardness of some specific problems, such as graph connectivity on promise graphs that are either one cycle or two cycles, usually called the “one cycle vs. two cycles” problem. This is unlike the traditional arguments based on conjectures about complexity classes (e.g., P ≠ NP), which are often more robust in the sense that refuting them would lead to groundbreaking algorithms for a whole range of problems.

In this work we present connections between problems and classes of problems that allow the latter type of arguments. Our conjectures concern the class of problems solvable in a sub-logarithmic number of rounds in the MPC model, denoted by MPC(o(log n)), and some standard classes concerning space complexity, namely L and NL. Our conjectures are robust in the sense that refuting them would lead to many surprisingly fast new algorithms in the MPC model. In the process we also obtain new conditional lower bounds by leveraging known reductions and by proving new reductions and equivalences between problems.

Joint work with Danupon Nanongkai.

Advancing General-Purpose Optimising Compilers for GPUs

Nicolai Stawinoga
T.U. Berlin, Germany

Stratton identified eight code optimisation patterns particularly relevant to improving run-time performance on GPUs. Questions regarding policy - when and to what degree to apply each optimisation - are commonly difficult problems that involve managing a complex set of trade-offs with the potential to gain significant speed-ups. I will look at different optimisation patterns and their applicability to general-purpose GPU compilers. I will discuss different approaches to making policy decisions in a fully automated manner, focusing throughout on my own experience of working on restructuring application parallelism in general-purpose GPU codes.

Celerity: High-productivity Programming for Accelerator Clusters

Peter Thoman
University of Innsbruck, Austria

In contemporary large-scale HPC systems, the highest tiers of performance and efficiency are generally achieved by accelerator clusters. However, while efficient, these architectures burden developers with both the complexities inherent in distributed-memory systems and those introduced by accelerator programming. The Celerity system is a more user-focused approach designed to enable high-productivity programming of accelerator clusters.

While several efforts to improve on the status quo have been made in the past, the limited support offered by research toolchains means that the classic "MPI + X" approach – where X is most likely CUDA – remains the most popular real-world choice. To enable broad adoption, Celerity builds on an established industry standard, SYCL. SYCL allows writing single-source parallel programs targeting accelerators through a convenient and modern C++14 API, and Celerity extends this approach to clusters of accelerators.

The main novelty of the Celerity API comes in the form of “range mappers”: while SYCL’s device accessors concisely express which resources are accessed on a compute device, and how, Celerity’s range mappers additionally specify how each individual GPU thread will access a particular buffer. This allows the Celerity runtime to transparently split kernels across multiple workers and to determine the required inputs as well as the produced outputs for each kernel chunk. Based on this information, the Celerity runtime system asynchronously builds and distributes a task graph and a fine-grained command graph, efficiently executing the given workload on an accelerator cluster.
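
To make the idea concrete, here is a schematic sketch of the concept (the types and function names below are illustrative, not the actual Celerity API): a range mapper is a function from the chunk of the iteration space assigned to a worker to the buffer subrange that chunk will touch, which is exactly the information the runtime needs in order to split a kernel and schedule the corresponding data transfers.

```cpp
#include <cstddef>
#include <functional>

// Illustrative types only -- the real Celerity API differs in detail.
struct chunk    { std::size_t offset; std::size_t size; };  // part of the kernel's iteration space
struct subrange { std::size_t offset; std::size_t size; };  // part of a buffer

// A "range mapper" tells the runtime which buffer elements a given chunk touches.
using range_mapper = std::function<subrange(chunk)>;

// Work item i accesses exactly buffer element i (e.g. element-wise kernels).
range_mapper one_to_one() {
    return [](chunk c) { return subrange{c.offset, c.size}; };
}

// Work item i accesses elements [i - r, i + r] (e.g. stencil kernels).
range_mapper neighbourhood(std::size_t r) {
    return [r](chunk c) {
        const std::size_t lo = c.offset > r ? c.offset - r : 0;
        return subrange{lo, c.size + (c.offset - lo) + r};
    };
}

int main() {
    // If one worker is assigned iterations [1024, 2048), a 3-point stencil
    // needs buffer elements [1023, 2049): enough information for the runtime
    // to generate the transfers before launching that chunk on the worker.
    const subrange s = neighbourhood(1)(chunk{1024, 1024});
    return static_cast<int>(s.size == 1026 ? 0 : 1);
}
```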

This talk primarily aims to introduce the Celerity API from a user perspective, but will also provide some insight into the design challenges faced by the underlying system components.

Two Years of TurbO – an Update

Carsten Trinitis
T.U. Munich, Germany

The talk will give an update on the TurbO project, a joint research project between TUM and IfTA, a medium-sized enterprise specialising in thermoacoustic analysis. The goal of the project is to make power plants more productive through failure prediction, monitoring and modelling of combustion processes, and increased energy efficiency.

The talk will focus on the analysis and modelling of real data obtained from a power plant in Munich that exhibited an unwanted operational event. This data was subject to comprehensive investigation and analysis. In the end it turned out that, although everyone is discussing artificial intelligence and machine learning, a relatively simple analytical model helped to significantly improve failure prediction for gas turbines. This case will be presented and discussed in the talk.

The basic approach works as follows: assuming that we have a (recurring) failure pattern, we aim at modelling and predicting future events based on oscillation data. Within this context, the following issues arise (a small illustrative sketch follows the list below):

  • How does one decide when a failure is likely to occur?
  • When modelling the process, one has to distinguish between data-driven approaches based on operational data and traditional engineering models and methods.
  • As a consequence, a trade-off has to be made between analytical methods and machine learning.
  • Very often the former are far more efficient and straightforward.
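
Purely as an illustration of that last point (the actual TurbO model is not described here), even a very simple analytical detector can be useful: track a fast-moving average of the oscillation amplitude against a slow baseline and flag a likely failure when their ratio exceeds a fixed threshold. All names and parameters below are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy analytical detector (illustrative only): a fast exponential moving
// average of the oscillation amplitude is compared against a slowly adapting
// baseline; a sustained surge makes the ratio exceed the threshold.
bool failure_likely(const std::vector<double>& amplitude,
                    double fast_alpha = 0.2, double slow_alpha = 0.01,
                    double threshold = 2.0) {
    if (amplitude.empty()) return false;
    double fast = amplitude.front();
    double slow = amplitude.front();
    for (double a : amplitude) {
        fast = fast_alpha * a + (1.0 - fast_alpha) * fast;
        slow = slow_alpha * a + (1.0 - slow_alpha) * slow;
        if (slow > 0.0 && fast / slow > threshold) return true;   // drift detected
    }
    return false;
}

int main() {
    std::vector<double> amp(1000, 1.0);                            // normal operation
    for (std::size_t i = 900; i < amp.size(); ++i) amp[i] = 3.0;   // late surge
    std::printf("failure likely: %s\n", failure_likely(amp) ? "yes" : "no");
    return 0;
}
```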

The Long and Winding Road to Exascale

Henry Tufo
University of Colorado, Boulder, USA

We present a brief history of high-performance computing (HPC) through the lens of the tera-, peta-, and pre-exascale milestones. Key challenges and (potential) solutions are discussed and trends identified. An overview of the current HPC landscape is provided and, when coupled with technology roadmaps and evolving workloads, several possible avenues to practical exascale computing are identified. But many challenges remain, and we discuss them in the context of several real-world examples.

Notes to the Intel Configurable Spatial Accelerator

Lorenzo Verdoscia
CNR, Naples, Italy

While the demand for computing performance continues to grow exponentially, the advances due to Moore’s law and Dennard scaling that have sustained improvements in performance have been faltering. To meet this demand, hardware accelerators seem to be a promising solution. Common hardware accelerators come in many forms, from the fully customizable ASIC designed for a specific function (e.g., a floating-point unit) to the more flexible graphics processing unit (GPU) and the highly programmable field-programmable gate array (FPGA). The problem with today’s hardware accelerators, however, is that they are ad-hoc solutions for specific compute- or data-intensive queries. Although FPGAs are programmable, reconfiguration, including compilation, synthesis, and routing, usually takes hours; applications must therefore know the query patterns in advance so that the hardware can be prepared for the tasks. On July 5, 2018, Intel, in conjunction with the US Department of Defense, disclosed a US patent about a “configurable spatial accelerator” (CSA) that “targets both high performance computing and the direct execution of a dataflow graph to yield a computationally dense yet energy efficient spatial microarchitecture which far exceeds conventional roadmap architectures.” These notes focus mainly on the dataflow aspects of the CSA engine.

Stream processing of genomic data

Francesco Versaci
CRS4, Cagliari, Italy

Modern sequencing machines produce on the order of a terabyte of data per day, which subsequently needs to go through a complex processing pipeline. The conventional workflow begins with a few independent, shared-memory tools, which communicate by means of intermediate files. Given its lack of robustness and scalability, this approach is ill-suited to exploiting the full potential of sequencing in the context of healthcare, where large-scale, population-wide applications are the norm.

In this work we propose the adoption of stream computing to simplify the genomic resequencing pipeline, boosting its performance and improving its fault tolerance. We decompose the first steps of the genomic processing into two distinct, specialized modules: preprocessing, implemented within the Apache Flink framework, and alignment, based on BWA-MEM. We loosely couple them via Kafka streams, in order to allow easy composition and integration into already-existing YARN-based pipelines. The proposed solution is then experimentally validated on real data and shown to scale almost linearly.
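
As a rough sketch of how loose the coupling can be (the topic name, broker address and aligner stub below are placeholders; only standard librdkafka C++ calls are used), an alignment worker simply consumes batches of preprocessed reads from a Kafka topic and hands them to the aligner:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
#include <librdkafka/rdkafkacpp.h>

// Placeholder for the actual alignment step (e.g. a call into BWA-MEM).
static void align_reads(const char* data, std::size_t len) { (void)data; (void)len; }

int main() {
    std::string errstr;
    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    conf->set("bootstrap.servers", "localhost:9092", errstr);   // placeholder broker
    conf->set("group.id", "aligners", errstr);                  // placeholder group

    RdKafka::KafkaConsumer* consumer = RdKafka::KafkaConsumer::create(conf, errstr);
    if (!consumer) { std::cerr << errstr << "\n"; return 1; }
    consumer->subscribe({"preprocessed-reads"});                // placeholder topic

    while (true) {
        RdKafka::Message* msg = consumer->consume(1000 /* ms */);
        if (msg->err() == RdKafka::ERR_NO_ERROR) {
            // Each message carries a batch of reads emitted by the Flink
            // preprocessing stage; feed it to the alignment module.
            align_reads(static_cast<const char*>(msg->payload()), msg->len());
        }
        delete msg;
    }
}
```

Because the stages only share Kafka topics, either side can be scaled, restarted or replaced independently, which is what makes the pipeline robust and easy to integrate into existing YARN-based deployments.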

Joint work with L. Pireddu and G. Zanetti.

More Science by Better Energy Efficiency and Power Management

Josef Weidendorfer
Leibniz Supercomputing Centre, Germany

Nowadays the cost of energy consumed by large HPC systems makes up a significant portion of the total expense of building and running such a system. For a compute centre with a given budget serving academia, high energy efficiency and power management that adapts to the workload result in more scientific output for the same money. In this talk, I will describe the efforts completed and planned at LRZ towards these goals.