Abstracts (in progress)

An accelerator to aggregate key-value pairs for graph analytics using NAND flash storage


We will describe an in-store accelerator to aggregate a list of key-value pairs and show its usefulness in processing very large graphs. This accelerator, which we will call “Aggregator”, will take as input a list of unsorted key-value pairs stored in NAND flash, and emit a new list of key-value pairs which is sorted by the keys, and where the values for the same key have been aggregated using a user-provided operator. The functionality of Aggregator is motivated by its use in big graph analytics, where many algorithms repeatedly update the vertex values of a graph by applying some function to aggregate the values of edges connected to the vertex. Such repeated updates, performed naively, are expensive when the graph is stored externally in flash or disk storage. We will show how and why the Aggregator can perform these repeated updates to external storage efficiently. We will also present some preliminary results for PageRank, Shortest Path and K-mer counting, which show that a single-node system with such an accelerator can outperform a 32-node system in processing terabyte-size graphs.
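The functional behavior of such an aggregation (sort by key, then merge values for equal keys with a user-provided operator) can be sketched in a few lines; this illustrative Python models only the semantics, not the in-store NAND flash implementation:

```python
from itertools import groupby
from operator import add, itemgetter

def aggregate(pairs, op):
    """Sort unsorted key-value pairs by key, then merge the values
    of equal keys with a user-provided binary operator."""
    out = []
    srt = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(srt, key=itemgetter(0)):
        vals = [v for _, v in group]
        acc = vals[0]
        for v in vals[1:]:
            acc = op(acc, v)  # user-provided aggregation operator
        out.append((key, acc))
    return out

print(aggregate([(3, 1.0), (1, 2.0), (3, 0.5)], add))
# [(1, 2.0), (3, 1.5)]
```

In a vertex-update setting, the key would be a destination vertex id and the operator, for instance, a sum of PageRank contributions or a minimum of tentative shortest-path distances.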

Joint work with Sang-Woo Jun and Andy Wright, CSAIL, MIT, USA.

Cognifying the HPC Ecosystem for Sustainability and Manageability

Ozalp Babaoglu
University of Bologna, Italy

Computation has become the third paradigm of the modern scientific method. As such, High-Performance Computing (HPC) systems have become fundamental instruments for driving scientific discovery and industrial competitiveness, much like the microscope, telescope and steam engine were during the previous century. Exascale is the new frontier for HPC systems. Scaling current systems to the millions of processor cores running billions of threads that are necessary for reaching the exascale goal is unsustainable, since the resulting power consumption will exceed a gigawatt and failures will occur multiple times daily. For exascale computing to be sustainable, power consumption must be limited to 30 MW while perceived failure rates must be severely reduced so that long-running computations can be completed at exascale speeds. While technological progress will continue to contribute towards these goals, I believe that the rest of the progress will come from novel, data-fueled, intelligent software technologies powered by Data Science and Artificial Intelligence. The process of transforming ordinary objects, services or activities into their intelligent counterparts by tapping into existing services for the required data analytics and intelligence is colloquially referred to as “cognification”. The situation is akin to the electrification of many mechanical processes that took place more than a century ago when electricity became ubiquitous as a commodity. Cognification is already disrupting fields such as corporate law, medical diagnosis, investment banking, mortgage appraisal and insurance underwriting. I believe that exascale HPC systems are next in line and are begging for solutions derived from cognification of their entire programming workflow, all the way from top-level applications down to low-level system software. In this talk I will present my vision for what cognified HPC might look like and outline some of the challenges that need to be solved and new technologies that need to be developed in order to realize it.

Optimal On-Line Computation of Stack Distances for MIN and OPT

Gianfranco Bilardi
University of Padova, Italy

The replacement policies known as MIN and OPT are optimal for a two-level memory hierarchy. The computation of the cache content for these policies requires the off-line knowledge of the entire address trace. However, the stack distance of a given access, that is, the smallest capacity of a cache for which that access results in a hit, is independent of future accesses and can be computed on-line. Off-line and on-line algorithms to compute the stack distance in time O(V) per access have been known for several decades, where V denotes the number of distinct addresses within the trace. The off-line time bound was recently improved to O(√V log V).

This paper introduces the Critical Stack Algorithm for the on-line computation of the stack distance of MIN and OPT, in time O(log V) per access. The result exploits a novel analysis of properties of OPT and data structures based on balanced binary trees. A corresponding Ω(log V) lower bound is derived by a reduction from element distinctness; this bound holds in a variety of models of computation and applies even to the off-line simulation of just one cache capacity.
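As a concrete illustration of the definitions above, the following Python computes stack distances by brute force: it simulates Belady's demand-paging MIN policy one capacity at a time and returns the smallest capacity at which a given access hits. This naive sketch costs far more than O(log V) per access; it only illustrates the quantity that the Critical Stack Algorithm computes efficiently.

```python
def min_hits(trace, capacity):
    """Simulate Belady's MIN replacement for one cache capacity.
    Returns, for each access, whether it was a hit."""
    hits = []
    cache = set()
    for t, addr in enumerate(trace):
        hits.append(addr in cache)
        if addr not in cache:
            if len(cache) >= capacity:
                # evict the cached block whose next use is farthest away
                def next_use(a):
                    for u in range(t + 1, len(trace)):
                        if trace[u] == a:
                            return u
                    return float("inf")
                cache.remove(max(cache, key=next_use))
            cache.add(addr)
    return hits

def stack_distance(trace, i):
    """Smallest capacity at which access i hits (inf for a first use)."""
    for c in range(1, len(set(trace)) + 1):
        if min_hits(trace, c)[i]:
            return c
    return float("inf")
```

For the trace a b a, the second access to a has stack distance 2: a capacity-1 cache must evict a to serve b, while a capacity-2 cache retains both.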

Joint work with K. Ekanadham and P. Pattnaik, IBM Research.

Impact of memory data layouts on performance and energy-efficiency on state-of-the-art processors and accelerators

Enrico Calore
University of Ferrara, Italy

HPC systems are often designed using commercial off-the-shelf (COTS) hardware, where computing nodes combine traditional CPU processors with accelerators such as GPUs. The memory subsystems of processors and accelerators often differ considerably in design, and the data layout which provides the best memory performance may differ between them. Intel has recently released the second generation of its processors based on the Many Integrated Core (MIC) architecture, named Knights Landing (KNL). This new processor, targeting the HPC application segment, has a hybrid hardware design in between a multi-core CPU and a GPU. In this work we assess the performance of this new processor for Lattice Boltzmann codes widely used in computational fluid dynamics, evaluating in particular the impact of several data memory layouts on its memory sub-system. We eventually compare its performance and energy-efficiency to other processors and accelerators available on the market.

Diversity maximization in MapReduce and Streaming

Matteo Ceccarello
University of Padova, Italy

Given a dataset of points in a metric space and an integer k, a diversity maximization problem requires determining a subset of k points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this talk we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models.
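For the max-min (remote-edge) objective mentioned above, a classical sequential baseline is greedy farthest-point selection, which yields a 2-approximation. The sketch below shows only this sequential greedy for context; the talk's contribution is the Streaming and MapReduce algorithms, which this code does not implement.

```python
def gmm(points, k, dist):
    """Greedy farthest-point selection (Gonzalez-style):
    a 2-approximation for the max-min diversity objective."""
    S = [points[0]]  # start from an arbitrary point
    while len(S) < k:
        # add the point farthest from the current subset
        nxt = max(points, key=lambda p: min(dist(p, s) for s in S))
        S.append(nxt)
    return S
```

With points 0, 1, 2, 10 on the line and k = 3, the greedy picks 0, then 10, then 2, spreading the subset out as the objective demands.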

One-Sided Communication on a Non-Cache-Coherent Many-Core Architecture

Steffen Christgau
University of Potsdam, Germany

In the light of increasing core counts and high-bandwidth on-die memories, the task of maintaining cache coherence becomes more and more challenging. Although cache coherence will not disappear completely, designs for shared-memory systems with many cores that do not offer global cache coherence are considered likely on the path to exascale. One-sided communication, as found in the Message Passing Interface (MPI) standard, provides remote memory access operations to program such systems without the need for cache coherence.

The talk addresses the efficient support of MPI's one-sided communication on non-cache coherent many-core systems. It presents both a design for the general active target synchronization and a software-managed cache coherence scheme that integrates transparently into the synchronization mechanism. For both designs, the evaluation on a research system reveals significant performance improvements, both in micro-benchmarks and at the application level.

Towards multi-objective, region-based auto-tuning for parallel programs

Thomas Fahringer
University of Innsbruck, Austria

Efficient parallelization and optimization for modern parallel architectures is a time-consuming and error-prone task that requires numerous iterations of code transformations, tuning and performance analysis which in many cases have to be redone for each different architecture. In this talk we introduce a novel auto-tuning compiler named INSIEME which consists of a compiler component featuring a multi-objective optimizer and a runtime system. The multi-objective optimizer derives a set of non-dominated solutions, each of them expressing a trade-off among different conflicting objectives such as runtime, resource efficiency and energy consumption. This set is commonly known as Pareto set. Our search algorithm, which explores code transformations and their parameter settings, is based on differential evolution. Additionally, rough sets are employed to reduce the search space, and thus the number of evaluations required during compilation. We further examine global versus region-based tuning techniques and demonstrate our approach by tuning shared memory parallel programs for execution time, energy consumption and resource usage.

Time complexity of scientific algorithms and memory access

Thomas Husslein


E2MC: Entropy Encoding Based Memory Compression for GPUs

Ben Juurlink
T.U. Berlin, Germany

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Effective memory bandwidth optimizations therefore play a significant role for scaling the performance of GPUs. Memory compression is a promising approach for improving memory bandwidth which can translate into higher performance and energy efficiency. Compression is not for free, however, and its challenges need to be addressed. Otherwise the benefits of compression may be canceled by its overhead. We propose an entropy encoding based memory compression (E2MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. Furthermore, we address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E2MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression and a speedup of 8% over the state of the art. The energy consumption and energy-delay-product are reduced by 13% and 27%, respectively.
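The core of any Huffman-based scheme such as E2MC is that frequent symbols receive short codes. The sketch below is illustrative only: it derives Huffman code lengths from symbol statistics and estimates the resulting compression ratio, ignoring the hardware concerns of probability estimation, symbol length selection, and low-latency decompression that the talk addresses.

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Return the code length of each symbol in a Huffman code
    built from the given symbol frequencies."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # merging two subtrees deepens every code in them by one bit
        merged = {s: l + 1 for s, l in {**c1, **c2}.items()}
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

data = b"aaaabbbccd" * 100          # skewed symbol distribution
freqs = Counter(data)
lengths = huffman_lengths(freqs)
coded_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(len(data) * 8 / coded_bits)    # estimated compression ratio
```

On this skewed input the most frequent byte gets a 1-bit code and the rarest a 3-bit code, so the 8-bit raw symbols compress by roughly a factor of four; real memory traffic is far less skewed, which is why careful probability estimation matters.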

Data transfer optimization for componentized computations in heterogeneous parallel systems using Smart Data-Containers

Christoph Kessler
Linkoping University, Sweden

Smart data-containers are STL-like software abstractions for aggregate data structures like vectors, matrices, images etc. that can be passed as operands to componentized computations (such as algorithmic skeletons or multi-variant functions) and that internally optimize, at runtime, data transfers between different memory units in a heterogeneous parallel system. Especially for iterative computations performing many invocations over the same operand variables, performance gains from avoiding unnecessary memory transfers can be considerable. We summarize work from several recent papers, including (1) the design of smart data-containers for our skeleton-programming framework SkePU; (2) the VectorPU data annotation framework for use with CPU/GPU software components; and (3) smart data-containers with support for tiling and lazy evaluation of multiple skeleton calls for the optimization of data locality in SkePU 2.

Joint work with Usman Dastgeer, Lu Li, and August Ernstsson. Research partly supported by EU FP7 projects PEPPHER and EXCESS and by Swedish e-Science Research Center (SeRC).

Conditions for scalability: Memory capacity vs Network bandwidth

Jose Moreira
IBM, T. J. Watson Research Lab., USA

When analyzing and comparing the performance of parallel machines, it is common to focus on metrics that characterize the basic nodes of these machines. The typical metrics include the computational performance, the memory capacity, and the communication bandwidth of the nodes. It is also common to use the ratio of communication bandwidth to the computational performance. The usual assumption is that “more is better”. That is, nodes with higher values of these metrics lead to superior machines. But things are not that simple. In particular, memory capacity and communication bandwidth can be traded-off for each other. That is, increasing the memory capacity of a node allows it to achieve better performance with less communication bandwidth. Conversely, increasing the communication bandwidth of a node allows it to achieve better performance with a smaller memory capacity. In this talk, we will explore the relationship between computational performance, memory capacity and communication bandwidth for different classes of algorithms.

A sketch of Quantum Computing
Pratap Pattnaik
IBM, T. J. Watson Research Lab., USA

Quantum Computing is a budding and esoteric area of computing that has made significant progress in the past two decades. With advances in materials science, it has the promise to solve a number of very challenging problems, with far-reaching impact on a few areas of society. This talk will provide a bird's-eye view of the subject.

Joint work with K. Ekanadham (IBM, T. J. Watson Research Lab., USA).

Gluon: A Communication Framework for Distributed, Heterogeneous Graph Analytics

Keshav Pingali
University of Texas at Austin, USA

We describe Gluon, a communication substrate for bulk-synchronous distributed graph analytics that supports heterogeneity in programming models, partitioning policies, and processor types (CPU/GPU). We have interfaced Gluon with the Ligra and Galois shared-memory graph analytics systems to produce distributed-memory versions of these systems. We have also interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce the first ever multi-GPU distributed-memory graph analytics system. Experiments on a heterogeneous cluster show that these CPU-only and GPU-only systems based on Gluon give a geometric mean speedup of 3X over the state-of-the-art Gemini system. Gluon therefore demonstrates a new way to engineer high-performance distributed-memory graph analytics systems: assemble them out of plug-and-play components.

Joint work with Marc Snir’s group at the University of Illinois, Urbana-Champaign.

Exploiting non-volatile and high-bandwidth memory for HPC

Dirk Pleiter
Jülich Supercomputing Center, Germany

Integration of both non-volatile and high-bandwidth memories into HPC architectures results in new opportunities but also challenges, as it typically leads to more complex memory and storage hierarchies. In this talk we discuss approaches to analyzing application requirements, which we apply to applications from different science domains. In the second part we will focus on different solutions for integrating non-volatile memory technologies and report on results from the PCP of the Human Brain Project and the SAGE project.

Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

Vijaya Ramachandran
University of Texas at Austin, USA

We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q + S·(M/B)) caching cost for the randomized work-stealing scheduler, where S is the number of steals, Q is the sequential caching cost, and M and B are the cache size and block (or cache line) size, respectively.

Joint work with Richard Cole (NYU, USA).

On using data movement lower bounds to guide code optimization

Saday Sadayappan
Ohio State University, USA

The four-index integral transform is a computationally demanding calculation used in many quantum chemistry software suites such as NWChem. It requires a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Loop fusion, in conjunction with tiling, can be used to reduce the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and for data/computation distribution across a parallel system, makes it challenging to develop an optimized parallel implementation. Lower bounds on data movement are used to establish relationships between the available aggregate physical memory in a parallel computer system and the identification and pruning of ineffective fusion configurations. This enables a characterization of optimality criteria and the development of an improved implementation of the four-index transform with higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.

Joint work with Samyam Rajbnandari, Fabrice Rastello, Karol Kowalski, Sriram Krishnamoorthy.

Matrix multiplication, a little faster

Oded Schwartz
Hebrew University, Israel

Strassen’s matrix multiplication algorithm (1969) was followed by Winograd’s (1971), which improved its complexity by a constant factor. Many asymptotic improvements followed. Yet, they have done so at the cost of large, often gigantic, hidden constants. Consequently, for almost all feasible matrix dimensions, Strassen-Winograd’s remains the fastest. The leading coefficient of Strassen-Winograd’s algorithm was believed to be optimal for matrix multiplication algorithms with a 2 × 2 base case, due to a lower bound of Probert (1976). Surprisingly, we obtain a faster matrix multiplication algorithm, with the same base case size and the same asymptotic complexity, but with a smaller leading coefficient. To this end, we transform the input matrices to a different basis, which can be done fast. We discuss improvements in the communication costs, the effects on parallelization, and the extension of this method to other Strassen-like algorithms. We also prove a generalization of Probert’s lower bound that holds under change of basis. This shows that for matrix multiplication algorithms with a 2 × 2 base case, the leading coefficient of our algorithm cannot be further reduced, and is hence optimal.
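For reference, the classical Strassen recursion with its seven recursive products is sketched below. The talk's algorithm keeps the seven products and the asymptotics but lowers the leading coefficient via a change of basis, which this sketch does not implement.

```python
import numpy as np

def strassen(A, B):
    """Classical Strassen recursion: 7 multiplications per 2x2 block split
    instead of the naive 8, for square matrices of power-of-two order."""
    n = A.shape[0]
    if n <= 2:  # fall back to the definition on small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # the seven recursive products
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    # recombine into the result blocks
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The additions and subtractions around the seven products are exactly where the leading coefficient hides, which is why a cheaper basis for performing them pays off.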

Joint work with Elaye Karstadt.

Expected edit distance for random strings

Michele Schimd
University of Padova, Italy

Recent DNA sequencing technologies have increased the length of the reads produced, at the cost of higher error rates. An interesting question is how this trend impacts algorithms for de-novo DNA assembly. More precisely, what is the minimum error rate above which reconstruction of the DNA is impossible?

It turns out that this threshold is related to the average edit distance divided by the length of the reads. As the read length tends to infinity, this ratio converges to a constant whose value is unknown.

In this work we give an analytical lower bound for this constant and a numerical upper bound obtained through exact calculation (small lengths) and through sampling (large lengths). Using the same sampling approach, we study the convergence rate of the expectation toward this constant and, under the assumption of a specific form for the convergence rate, we give estimates of its parameters.
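The sampled quantity can be illustrated with a minimal Monte Carlo sketch (ours, not the authors' calculation): draw pairs of uniform random strings of length n and average their edit distance divided by n.

```python
import random

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sampled_ratio(n, trials, alphabet="ACGT", seed=0):
    """Monte Carlo estimate of E[edit distance] / n for random strings."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = "".join(rng.choice(alphabet) for _ in range(n))
        b = "".join(rng.choice(alphabet) for _ in range(n))
        total += edit_distance(a, b)
    return total / (trials * n)
```

Plotting `sampled_ratio(n, trials)` for growing n is exactly the kind of experiment from which the convergence rate toward the limiting constant can be studied.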

Joint work with Gianfranco Bilardi.

Application autotuning for energy efficient heterogeneous HPC systems

Cristina Silvano
Politecnico di Milano, Italy

Given the increasing complexity of heterogeneous HPC systems, a wide range of architecture and application parameters must be tuned at design-time to find the best tradeoffs in terms of multiple metrics such as energy and delay. At runtime, HPC architectures offer a set of resources that can be assigned and managed dynamically to get a specified Quality of Service. Applications can expose to the runtime a set of software knobs (including application parameters, code transformations and code variants) to trade-off Quality of Results and Throughput. Resource management and application autotuning are key issues for enabling computing systems to operate close to optimal efficiency by adjusting their behavior in the face of changing conditions, operating environments, usage contexts and resource availability while meeting the requirements on energy-efficiency and Quality-of-Service. This talk will present an application autotuning framework to tune the software knobs in an adaptive scenario for HPC systems. The key techniques include a set of sampling and optimization techniques for finding Pareto points and Design of Experiment techniques to identify the experimentation plan. Machine learning techniques can be used to obtain a prediction of the system behavior based on the set of training data generated by DoE. The approach exploits the concept of orthogonality between application autotuning and runtime management of system resources to support multiple adaptive applications. Overall, the main challenge is to exploit design-time and run-time concepts to lead to an effective way of “self-aware” computing.
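A Pareto set as used above can be computed with a simple dominance filter; a toy sketch (objectives are minimized, e.g. runtime and energy, and the framework's sampling, DoE and machine-learning machinery is of course far richer):

```python
def pareto_front(configs):
    """Keep configurations that no other configuration dominates.
    Each config is a tuple of objective values; lower is better."""
    front = []
    for c in configs:
        dominated = any(
            all(o <= m for o, m in zip(other, c)) and other != c
            for other in configs
        )
        if not dominated:
            front.append(c)
    return front

# (runtime, energy) pairs for five hypothetical knob settings
print(pareto_front([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]))
# [(1, 5), (2, 2), (5, 1)]
```

Here (3, 3) and (4, 4) drop out because (2, 2) beats them on both objectives; the surviving points are the trade-offs the autotuner exposes to the runtime.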

I/O-Efficient Similarity Join

Francesco Silvestri
University of Padova, Italy

We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of N^ρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor of (N/M)^ρ, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will be useful also in other computational settings, such as parallel computation.
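The LSH primitive underlying such a join can be illustrated in memory, setting aside the recursion and I/O analysis that are the actual contribution. The sketch below uses a random-hyperplane (SimHash) family for cosine similarity, which is one standard LSH choice and our assumption here: similar vectors tend to land in the same bucket, and candidate pairs are read off bucket by bucket.

```python
import random

def simhash_buckets(vectors, num_bits, seed=0):
    """Bucket named vectors by random-hyperplane (SimHash) signatures.
    Vectors with equal signatures become join candidates."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    buckets = {}
    for name, v in vectors.items():
        # one bit per hyperplane: which side of the plane the vector lies on
        sig = tuple(sum(p[i] * v[i] for i in range(dim)) >= 0 for p in planes)
        buckets.setdefault(sig, []).append(name)
    return buckets

def candidate_pairs(buckets):
    """All pairs sharing a bucket -- the pairs a similarity join verifies."""
    pairs = set()
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(tuple(sorted((names[i], names[j]))))
    return pairs
```

In the external-memory algorithm, the point of the analysis is how often such bucketing must be repeated and recursed on relative to M, which is where the (N/M)^ρ factor arises.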

A model-based approach to workload partitioning for heterogeneous systems

Henk Sips
Delft Univ of Technology, Netherlands

Most high-performance computing systems today are heterogeneous platforms composed of multi-core CPUs and different types of accelerators. The heterogeneity of the hardware mix, the memory structure, and the diversity of the applications pose significant challenges to exploiting such platforms. In this situation, an effective workload partitioning between processing units is critically important for improving application performance. This partitioning is a function of the processing, memory, and network capabilities as well as the application and the dataset to be used. We present a systematic approach to solve the partitioning problem. Specifically, we use modeling, profiling, and prediction techniques to quickly and correctly predict the optimal workload partitioning and the right hardware configuration to use. Our approach effectively characterizes the platform heterogeneity, efficiently determines the accurate partitioning, and easily adapts to new platforms, different application types, and different datasets.
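As a deliberately simplified stand-in for the modeling and prediction techniques in the talk, the most basic partitioning heuristic splits work in proportion to each device's measured throughput:

```python
def partition(n, throughputs):
    """Split n work items across devices proportionally to their
    measured throughput (items per second), a naive baseline model."""
    devices = sorted(throughputs, key=throughputs.get, reverse=True)
    total = sum(throughputs.values())
    shares = {d: int(n * throughputs[d] / total) for d in devices}
    # hand any rounding leftover to the fastest device
    shares[devices[0]] += n - sum(shares.values())
    return shares

print(partition(100, {"cpu": 1.0, "gpu": 3.0}))
# {'gpu': 75, 'cpu': 25}
```

Real partitioning must also account for data-transfer costs and per-kernel behavior, which is precisely why the talk replaces this naive proportional rule with profiled models.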

Joint work with Jie Shen (NUDT, China) and Ana Varbanescu (University of Amsterdam, Netherlands).

Continuum Computer Architecture for High Performance Computing

Thomas Sterling
School of Informatics, Computing, and Engineering Indiana University, USA

The success of the MPP and commodity cluster class of parallel computer architectures over the span of two dozen years or more hides the fact that, with all of the incorporated innovations, they are still basically derivatives of the foundational von Neumann architectures first introduced before 1950. The driving technology trends described by Moore's Law, in conjunction with architectural instruction-level parallelism and exponential clock rate growth, have powered an overall performance gain of 10^13 during the intervening era. But these opportunities have almost flat-lined, requiring new creative strategies for significant improvements in performance. A consideration of fundamental principles demands novel methods of addressing the key performance parameters of starvation, latency, overhead, and contention as well as energy and resilience. Further examination makes clear that a non-von Neumann architecture may be essential to reverse the optimizations, once of great value, but now counterproductive. In particular, the prescribed requirements of ALU/FPU utilization, sequential instruction issue, separation of memory and logic components (referred to as the “von Neumann bottleneck”), and sequential consistency of data access are all artifacts of an earlier age of technology and force current designs in non-productive ways. An alternative family of computer design, here called the “Continuum Computer Architecture”, borrows from such research domains as cellular automata, dataflow computing, processor in memory, asymmetric multitasking, and object-oriented control (e.g., futures) to project a new stratagem for HPC beyond exascale with greater efficiency, scalability, and performance to cost than conventional systems. This presentation will describe in outline the Simultac Fonton, a simple architecture comprising billions of lightweight logic/memory elements that in synergy produce global parallel computing as an emergent property. When employed with the control discipline specified by the ParalleX execution model and running the HPX+ runtime system, a new possibility for highly scalable computing structures is enabled. Early analysis suggests opportunities for parameter gains with respect to conventional practices.

On the use of Container Solutions for simplifying Simulation Environments

Carsten Trinitis
T.U. Munich, Germany

The full functionality of POLOPT, an electrostatic field simulation code in production use at ABB, has been made available by embedding the code along with all required components into a Singularity container. With our first implementation, jobs can be submitted and controlled in a multi-node environment of compute clusters running SLURM and Torque as workload managers. The overall goal is to make the entire simulation software transparent, such that there are no specific requirements for the clusters the software is supposed to run on. Thus, in principle any cluster installation that supports Singularity can be used to run the simulations.

Through this approach, system administrators do not need to worry about specific libraries to be installed on their systems since the entire simulation environment including all required system components comes “out of the box” as a Singularity container.

We decided to use Singularity as a first approach, as there are some security concerns with the most common container environment, Docker, which will be discussed in the talk. A third alternative, PRoot, will also be discussed.

Parallel memories: a feasible solution for bandwidth-limited applications?

Lucia Varbanescu
University of Amsterdam, Netherlands

Intuitively, a parallel memory is a good solution for any bandwidth-limited application. However, building the right parallel memory for a given application - namely, a custom parallel memory - remains a challenge. In this talk, we present our take on customizing parallel memories. We further show results that demonstrate how parallel memories can speed-up applications with various access patterns, and dive into more details for a specific case-study. We finally discuss our current prototype of a parallel memory based on a reinterpretation of Polymorphic Register Files and a Maxeler FPGA system.

LAIK: A Library for Fault Tolerant Distribution of Global Data for Parallel Applications

Josef Weidendorfer
T.U. Munich, Germany

HPC applications are usually not written in a way that lets them cope with dynamic changes in the execution environment, such as the removal or integration of nodes or node components. However, for higher flexibility with regard to scheduling and fault-tolerance strategies, adequate application-integrated reaction would be worthwhile, yet with legacy MPI codes this is difficult to achieve.

In the talk I will present LAIK, a lightweight library for distributed index spaces and associated data containers for parallel programs, supporting fault-tolerance features. By giving LAIK control over data and its partitioning, the library can free compute nodes before they fail and perform replication for rollback schemes on demand. In this way, applications become more adaptive to changes in available resources. I will show example code using LAIK and results from a first prototype.