**Abstracts (in progress)**

An accelerator to aggregate key-value pairs for graph analytics using NAND flash storage

CSAIL, MIT, USA

We will describe an in-store accelerator to aggregate a list of
key-value pairs and show its usefulness in processing very large
graphs. This accelerator, which we will call “*Aggregator*”, will take
as input a list of unsorted key-value pairs stored in NAND flash, and
emit a new list of key-value pairs which is sorted by the keys, and
where the values for the same key have been aggregated using a
user-provided operator. The functionality of Aggregator is motivated
by its use in big graph analytics, where many algorithms repeatedly
update the vertex values of a graph by applying some function to
aggregate the values of edges connected to the vertex. Such repeated
updates, performed naively, are expensive when the graph is stored
externally in flash or disk storage. We will show how and why the
Aggregator can perform these repeated updates to external storage
efficiently. We will also present some preliminary results for
PageRank, Shortest Path and K-mer counting, which show that a
single-node system with such an accelerator can outperform a 32-node
system in processing terabyte-size graphs.
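The core operation can be sketched in software as a sort-and-reduce over the key-value list (the function and names here are illustrative, not the accelerator's actual interface):

```python
from itertools import groupby
from functools import reduce
from operator import itemgetter

def aggregate(pairs, op):
    """Sort unsorted (key, value) pairs by key and combine the values
    of equal keys with a user-provided binary operator, mimicking the
    Aggregator's output of a sorted, reduced key-value list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, reduce(op, (v for _, v in group)))
            for k, group in groupby(ordered, key=itemgetter(0))]

# Example: summing per-vertex edge contributions, as in a
# PageRank-style update.
updates = [(2, 0.5), (1, 0.25), (2, 0.125), (1, 0.75)]
result = aggregate(updates, lambda a, b: a + b)
# result == [(1, 1.0), (2, 0.625)]
```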

*Joint work with Sang-Woo Jun and Andy Wright, CSAIL, MIT, USA.*

Cognifying the HPC Ecosystem for Sustainability and Manageability

University of Bologna, Italy

Computation has become the third paradigm of the modern scientific
method. As such, High-Performance Computing (HPC) systems have become fundamental instruments for driving scientific discovery and
industrial competitiveness, much like the microscope, telescope and
steam engine were during the previous century. Exascale is the new
frontier for HPC systems. Scaling current systems to millions of
processor cores running billions of threads that are necessary for
reaching the exascale goal is unsustainable since the resulting power
consumption will exceed a gigawatt and failure rates will exceed
multiple instances daily. For exascale computing to be sustainable,
power consumption must be limited to 30MW while perceived failure
rates must be severely reduced so that long-running computations can
be completed at exascale speeds. While technological progress will
continue to contribute towards these goals, I believe that the rest of
the progress will come from novel, data-fueled, intelligent software
technologies powered by Data Science and Artificial Intelligence. The
process of transforming ordinary objects, services or activities into
their intelligent counterparts by tapping into existing services for
the required data analytics and intelligence is colloquially referred
to as “*cognification*”. The situation is akin to the electrification
of many mechanical processes that took place more than a century ago
when electricity became ubiquitous as a commodity. Cognification is
already disrupting fields such as corporate law, medical diagnosis,
investment banking, mortgage appraisal and insurance underwriting. I
believe that exascale HPC systems are next in line and are begging for
solutions derived from cognification of their entire programming
workflow, all the way from top-level applications down to low-level
system software. In this talk I will present my vision for what
a cognified HPC ecosystem might look like and outline some of the challenges
that need to be solved and new technologies that need to be developed
in order to realize it.

Optimal On-Line Computation of Stack Distances for MIN and OPT

University of Padova, Italy

The replacement policies known as MIN and OPT are optimal for a two-level memory hierarchy. The computation of the cache content for these policies requires the off-line knowledge of the entire address trace. However, the stack distance of a given access, that is, the smallest capacity of a cache for which that access results in a hit, is independent of future accesses and can be computed on-line. Off-line and on-line algorithms to compute the stack distance in time O(V) per access have been known for several decades, where V denotes the number of distinct addresses within the trace. The off-line time bound was recently improved to O(√V logV).

This paper introduces the Critical Stack Algorithm for the on-line computation of the stack distance of MIN and OPT, in time O(log V) per access. The result exploits a novel analysis of properties of OPT and data structures based on balanced binary trees. A corresponding Ω(log V) lower bound is derived by a reduction from element distinctness; this bound holds in a variety of models of computation and applies even to the off-line simulation of just one cache capacity.
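For intuition, the classical list-based O(V)-per-access computation applies to LRU stack distances (Mattson et al.); the MIN/OPT case handled above is harder and achieves O(log V). A minimal sketch of the LRU version, for illustration only:

```python
def lru_stack_distances(trace):
    """On-line stack distances for the LRU policy: the distance of an
    access is the number of distinct addresses referenced since the
    previous access to the same address (infinity on a first access),
    i.e. the smallest LRU cache capacity for which the access hits.
    This is the classical O(V)-per-access list scan."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            d = stack.index(addr) + 1   # capacity needed for a hit
            stack.remove(addr)
        else:
            d = float('inf')            # cold miss at any capacity
        stack.insert(0, addr)           # most recently used on top
        dists.append(d)
    return dists

distances = lru_stack_distances(['a', 'b', 'c', 'a', 'b', 'b'])
# → [inf, inf, inf, 3, 3, 1]
```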

*Joint work with K. Ekanadham and P. Pattnaik, IBM Research.*

Impact of memory data layouts on performance and energy-efficiency on state-of-the-art processors and accelerators

University of Ferrara, Italy

HPC systems are often designed using commercial off the shelf (COTS)
hardware, where computing nodes combine traditional CPU processors
with accelerators such as GPUs. The memory subsystems of processors
and accelerators often differ considerably in their design, and the
data layout that provides the best memory performance may differ
between them. Intel has recently released the second generation of
processors based on the Many Integrated Cores (MIC) architecture,
named Knights Landing (KNL). This new processor, targeting the HPC
application segment, has a hybrid hardware design between a multi-core
CPU and a GPU. In this work we assess the performance of
this new processor for Lattice Boltzmann codes widely used in
computational fluid-dynamics, evaluating in particular the impact of
several data memory layouts on its memory sub-system. We eventually
compare its performance and energy-efficiency to other processors and
accelerators available on the market.

Diversity maximization in MapReduce and Streaming

University of Padova, Italy

Given a dataset of points in a metric space and an integer k, a
diversity maximization problem requires determining a subset of k
points maximizing some diversity objective measure, e.g., the minimum
or the average distance between two points in the subset.
Diversity maximization is computationally hard, hence only approximate
solutions can be hoped for. Although its applications are mainly in
massive data analysis, most of the past research on diversity
maximization focused on the sequential setting. In this talk we
present space and pass/round-efficient diversity maximization
algorithms for the Streaming and MapReduce models.
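As a concrete sequential baseline, the classical greedy (GMM) algorithm of Gonzalez gives a 2-approximation for the remote-edge objective (maximizing the minimum pairwise distance); a minimal sketch, not the streaming/MapReduce algorithms of the talk:

```python
def gmm(points, k, dist):
    """Greedy 2-approximation for remote-edge diversity maximization:
    start from an arbitrary point, then repeatedly add the point
    farthest from the current subset."""
    subset = [points[0]]
    while len(subset) < k:
        far = max((p for p in points if p not in subset),
                  key=lambda p: min(dist(p, q) for q in subset))
        subset.append(far)
    return subset

euclid = lambda p, q: ((p[0]-q[0])**2 + (p[1]-q[1])**2) ** 0.5
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 0)]
chosen = gmm(pts, 3, euclid)
# the three chosen points are mutually far apart
```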

One-Sided Communication on a Non-Cache-Coherent Many-Core Architecture

University of Potsdam, Germany

In the light of increasing core counts and high-bandwidth on-die memories, the task of maintaining cache coherence becomes more and more challenging. Although cache coherence will not disappear completely, designs for shared-memory systems with many cores that do not offer global cache coherence are considered likely on the path to exascale. One-sided communication, as found in the Message Passing Interface (MPI) standard, provides remote memory access operations to program such systems without the need for cache coherence.

The talk addresses the efficient support of MPI's one-sided communication on non-cache-coherent
many-core systems. It presents both a design for the general active target synchronization and a
software-managed cache coherence scheme that integrates transparently into the synchronization
mechanism. For both designs, the evaluation on a research system reveals significant performance
improvements, both in micro-benchmarks and at the application level.

Towards multi-objective, region-based auto-tuning for parallel programs

University of Innsbruck, Austria

Efficient parallelization and optimization for modern parallel
architectures is a time-consuming and error-prone task that requires
numerous iterations of code transformations, tuning and performance
analysis which in many cases have to be redone for each different
architecture. In this talk we introduce a novel auto-tuning compiler
named INSIEME which consists of a compiler component featuring a
multi-objective optimizer and a runtime system. The multi-objective
optimizer derives a set of non-dominated solutions, each of them
expressing a trade-off among different conflicting objectives such as
runtime, resource efficiency and energy consumption. This set is
commonly known as Pareto set. Our search algorithm, which explores
code transformations and their parameter settings, is based on
differential evolution. Additionally, rough sets are employed to
reduce the search space, and thus the number of evaluations required
during compilation. We further examine global versus region-based
tuning techniques and demonstrate our approach by tuning shared
memory parallel programs for execution time, energy consumption and
resource usage.

Time complexity of scientific algorithms and memory access

Optware

TBA

E^{2}MC: Entropy Encoding Based Memory Compression for GPUs

T.U. Berlin, Germany

Modern Graphics Processing Units (GPUs) provide much higher off-chip
memory bandwidth than CPUs, but many GPU applications are still
limited by memory bandwidth. Unfortunately, off-chip memory bandwidth
is growing slower than the number of cores and has become a
performance bottleneck. Effective memory bandwidth optimizations
therefore play a significant role for scaling the performance of
GPUs. Memory compression is a promising approach for improving memory
bandwidth which can translate into higher performance and energy
efficiency. Compression is not for free, however, and its challenges
need to be addressed. Otherwise the benefits of compression may be
canceled by its overhead. We propose an entropy encoding based memory
compression (E^{2}MC) technique for GPUs, which is based on the
well-known Huffman encoding. We study the feasibility of entropy
encoding for GPUs and show that it achieves higher compression ratios
than state-of-the-art GPU compression techniques. Furthermore, we
address the key challenges of probability estimation, choosing an
appropriate symbol length for encoding, and decompression with low
latency. The average compression ratio of E^{2}MC is 53% higher than
the state of the art. This translates into an average speedup of 20%
compared to no compression and a speedup of 8% over the state of the
art. The energy consumption and energy-delay-product are reduced by
13% and 27%, respectively.
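E^{2}MC builds on Huffman coding, in which frequent symbols receive shorter codewords. A minimal software sketch of building a code table from symbol frequencies (the hardware design, probability estimation, symbol-length choice and low-latency decoding are the paper's contributions and are not modeled here):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a Huffman code table from symbol frequencies;
    frequent symbols receive shorter codewords."""
    freq = Counter(data)
    if len(freq) == 1:                  # degenerate: a single symbol
        return {next(iter(freq)): '0'}
    # heap entries: (frequency, tiebreaker, {symbol: partial codeword})
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)       # two least frequent
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

data = b'aaaaaaabbbccd'
code = huffman_code(data)
encoded = ''.join(code[s] for s in data)
# the dominant symbol 'a' gets a 1-bit codeword
```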

Data transfer optimization for componentized computations in heterogeneous parallel systems using Smart Data-Containers

Linkoping University, Sweden

Smart data-containers are STL-like software abstractions for aggregate data structures like vectors, matrices, images etc. that can be passed as operands to componentized computations (such as algorithmic skeletons or multi-variant functions) and that internally optimize, at runtime, data transfers between different memory units in a heterogeneous parallel system. Especially for iterative computations performing many invocations over the same operand variables, performance gains from avoiding unnecessary memory transfers can be considerable. We summarize work from several recent papers, including (1) the design of smart data-containers for our skeleton-programming framework SkePU; (2) the VectorPU data annotation framework for use with CPU/GPU software components; and (3) smart data-containers with support for tiling and lazy evaluation of multiple skeleton calls for the optimization of data locality in SkePU 2.
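The core idea of such a container can be sketched as coherence bookkeeping with per-unit validity flags, so that a transfer happens only when an operand is accessed on a unit where its copy is stale (a toy model, not the actual SkePU or VectorPU API):

```python
class SmartVector:
    """Toy smart data-container: tracks where the current copy of the
    data is valid and performs (counted) transfers lazily, only when
    an access touches the data on a unit where it is stale."""
    def __init__(self, data):
        self.copies = {'host': list(data)}
        self.valid = {'host'}
        self.transfers = 0

    def read(self, unit):
        if unit not in self.valid:        # stale: fetch a valid copy
            src = next(iter(self.valid))
            self.copies[unit] = list(self.copies[src])
            self.valid.add(unit)
            self.transfers += 1
        return self.copies[unit]

    def write(self, unit, data):
        self.copies[unit] = list(data)
        self.valid = {unit}               # other copies become stale

v = SmartVector([1, 2, 3])
v.read('gpu')             # first GPU access: one transfer
v.read('gpu')             # already valid on the GPU: no transfer
v.write('gpu', [4, 5, 6])
v.read('host')            # host copy is now stale: one transfer back
# v.transfers == 2 despite four accesses
```

Iterative computations that invoke many kernels over the same operands on the same unit thus pay for at most one transfer per direction, which is where the performance gains come from.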

*Joint work with Usman Dastgeer, Lu Li, and
August Ernstsson. Research partly supported by EU FP7 projects
PEPPHER and EXCESS and by Swedish e-Science Research Center (SeRC).*

Conditions for scalability: Memory capacity vs Network bandwidth

IBM, T. J. Watson Research Lab., USA

When analyzing and comparing the performance of parallel machines, it is common to focus on metrics that characterize the basic nodes of
these machines. The typical metrics include the computational
performance, the memory capacity, and the communication bandwidth of
the nodes. It is also common to use the ratio of communication
bandwidth to the computational performance. The usual assumption is
that “more is better”. That is, nodes with higher values of these
metrics lead to superior machines. But things are not that simple. In particular, memory capacity and communication bandwidth can be
traded off against each other. That is, increasing the memory capacity of
a node allows it to achieve better performance with less communication
bandwidth. Conversely, increasing the communication bandwidth of a
node allows it to achieve better performance with a smaller memory
capacity. In this talk, we will explore the relationship between
computational performance, memory capacity and communication bandwidth
for different classes of algorithms.
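A classical illustration of this trade-off (a back-of-the-envelope sketch, not taken from the talk) is dense matrix multiplication: by the Hong-Kung lower bound, a node with local memory M must move Ω(n³/√M) words, so quadrupling the memory halves the required traffic and can compensate for half the bandwidth.

```python
from math import sqrt

def words_moved(n, M):
    """Asymptotic communication volume for an n x n matrix multiply
    on a node with local memory M (Hong-Kung lower bound, up to
    constant factors)."""
    return n**3 / sqrt(M)

n = 4096
small_mem = words_moved(n, 2**20)
large_mem = words_moved(n, 2**22)
# With 4x the memory capacity, the same computation needs only half
# the communication volume: memory capacity trades off for bandwidth.
```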

IBM, T. J. Watson Research Lab., USA

Quantum Computing is a budding and esoteric area of computing that has made significant progress in the past two decades. With advances in materials science, it promises to solve a number of very challenging problems, with far-reaching impact on several areas of society. This talk will provide a bird's-eye view of the subject.

*Joint work with K. Ekanadham (IBM, T. J. Watson Research Lab., USA).*

Gluon: A Communication Framework for Distributed, Heterogeneous Graph Analytics

University of Texas at Austin, USA

We describe Gluon, a communication substrate for bulk-synchronous distributed graph analytics that supports heterogeneity in programming models, partitioning policies, and processor types (CPU/GPU). We have interfaced Gluon with the Ligra and Galois shared-memory graph analytics systems to produce distributed-memory versions of these systems. We have also interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics to produce the first ever multi-GPU distributed-memory graph analytics system. Experiments on a heterogeneous cluster show that these CPU-only and GPU-only systems based on Gluon give a geometric mean speedup of 3X over the state-of-the-art Gemini system. Gluon therefore demonstrates a new way to engineer high-performance distributed-memory graph analytics systems: assemble them out of plug-and-play components.

*Joint work with Marc Snir’s group at the University of Illinois, Urbana-Champaign.*

Exploiting non-volatile and high-bandwidth memory for HPC

Jülich Supercomputing Center, Germany

The integration of both non-volatile and high-bandwidth memories in HPC architectures results in new opportunities but also challenges, as it typically leads to more complex memory and storage hierarchies. In this talk we discuss approaches to analyzing application requirements, which we apply to applications from different science domains.
In the second part we will focus on different solutions for integrating non-volatile memory technologies and report on results from the PCP of the Human Brain Project and the SAGE project.

Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

University of Texas at Austin, USA

We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q + S·(M/B)) caching cost for the randomized work stealing scheduler, where S is the number of steals, Q is the sequential caching cost, and M and B are the cache size and block (or cache line) size, respectively.

*Joint work with Richard Cole (NYU, USA).*

On using data movement lower bounds to guide code optimization

Ohio State University, USA

The four-index integral transform is a computationally demanding calculation used in many quantum chemistry software suites such as NWChem. It requires a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Loop fusion, in conjunction with tiling can be used to reduce the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation. Lower bounds on data movement are used to establish relationships between available aggregate physical memory in a parallel computer system and identification and pruning of ineffective fusion configurations. This enables a characterization of optimality criteria and the development of an improved implementation of the four-index transform with higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
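The structure of the computation can be illustrated in a few lines: the direct O(N^8) summation is replaced by four successive single-index contractions of O(N^5) work each (a schematic, pure-Python sketch of the textbook factorization; the paper's contribution lies in the fusion, tiling and distribution choices, not in this factorization itself):

```python
def transform(A, C, n):
    """Four-index transform
    B[p][q][r][s] = sum_{i,j,k,l} C[i][p] C[j][q] C[k][r] C[l][s] A[i][j][k][l],
    computed as four successive single-index contractions (O(n^5)
    work each) instead of one direct O(n^8) summation."""
    T = A
    for _ in range(4):
        # Contract the leading index with C and rotate the indices,
        # so each pass transforms the next index in turn:
        # T'[b][c][d][p] = sum_a C[a][p] * T[a][b][c][d]
        T = [[[[sum(C[a][p] * T[a][b][c][d] for a in range(n))
                for p in range(n)]
               for d in range(n)]
              for c in range(n)]
             for b in range(n)]
    return T
```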

*Joint work with Samyam Rajbhandari, Fabrice Rastello, Karol Kowalski,
and Sriram Krishnamoorthy.*

Matrix multiplication, a little faster

Hebrew University, Israel

Strassen’s matrix multiplication algorithm (1969) was followed by Winograd’s variant (1971), which improved its complexity by a constant factor. Many asymptotic improvements followed. Yet, they have done so at the cost of large, often gigantic, hidden constants. Consequently, for almost all feasible matrix dimensions, Strassen-Winograd’s algorithm remains the fastest. The leading coefficient of Strassen-Winograd’s algorithm was believed to be optimal for matrix multiplication algorithms with a 2 × 2 base case, due to a lower bound of Probert (1976). Surprisingly, we obtain a faster matrix multiplication algorithm, with the same base case size and the same asymptotic complexity, but with a smaller leading coefficient. To this end, we transform the input matrices to a different basis, which can be done fast. We discuss improvements in the communication costs, the effects on parallelization, and the extension of this method to other Strassen-like algorithms. We also prove a generalization of Probert’s lower bound that holds under change of basis. This shows that for matrix multiplication algorithms with a 2 × 2 base case, the leading coefficient of our algorithm cannot be further reduced; hence it is optimal.
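For reference, the 2 × 2 base case in question is Strassen's seven-product scheme; one recursion level over scalars looks as follows (the textbook formulation, not the new basis-transformed algorithm of the talk):

```python
def strassen_2x2(A, B):
    """One level of Strassen's recursion on 2x2 matrices:
    7 multiplications instead of 8. Winograd's variant reduces the
    18 additions here to 15; the talk lowers the leading coefficient
    further via a change of basis."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

C = strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8)))
# C == ((19, 22), (43, 50)), the ordinary matrix product
```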

*Joint work with Elaye Karstadt.*

Expected edit distance for random strings

University of Padova, Italy

Recent DNA sequencing technologies have increased the length of the reads produced, at the cost of higher error rates. An interesting question is how this trend impacts algorithms for de-novo DNA assembly. More precisely, what is the minimum error rate above which reconstruction of the DNA is impossible?

It turns out that this threshold is related to the average edit distance divided by the length of the reads. As the read length tends to infinity, this ratio converges to a constant whose value is unknown.

In this work we give an analytical lower bound for this constant and a numerical upper bound obtained through exact calculation (small lengths) and through sampling (large lengths). Using the same sampling approach, we study the convergence rate of the expectation toward the constant and, under the assumption of a specific form for the convergence rate, we give estimates of its parameters.
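The sampling estimate in question can be reproduced with a standard dynamic-programming edit distance over uniformly random strings (a minimal sketch; the alphabet size and sample counts below are illustrative):

```python
import random

def edit_distance(s, t):
    """Standard O(|s||t|) dynamic program for Levenshtein distance,
    keeping only one row of the table at a time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def sampled_ratio(n, alphabet, samples, rng):
    """Monte Carlo estimate of E[edit distance] / n for two
    independent, uniformly random strings of length n."""
    total = 0
    for _ in range(samples):
        s = [rng.randrange(alphabet) for _ in range(n)]
        t = [rng.randrange(alphabet) for _ in range(n)]
        total += edit_distance(s, t)
    return total / (samples * n)

rng = random.Random(0)
estimate = sampled_ratio(50, 4, 20, rng)  # DNA-like 4-letter alphabet
# for a finite alphabet the ratio lies strictly between 0 and 1
```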

*Joint work with Gianfranco Bilardi.*

Application autotuning for energy efficient heterogeneous HPC systems

Politecnico di Milano, Italy

Given the increasing complexity of heterogeneous HPC systems, a wide range of architecture and application parameters must be tuned at design-time to find the best tradeoffs in terms of multiple metrics such as energy and delay. At runtime, HPC architectures offer a set of resources that can be assigned and managed dynamically to get a specified Quality of Service. Applications can expose to the runtime a set of software knobs (including application parameters, code transformations and code variants) to trade-off Quality of Results and Throughput. Resource management and application autotuning are key issues for enabling computing systems to operate close to optimal efficiency by adjusting their behavior in the face of changing conditions, operating environments, usage contexts and resource availability while meeting the requirements on energy-efficiency and Quality-of-Service.
This talk will present an application autotuning framework to tune the software knobs in an adaptive scenario for HPC systems. The key techniques include a set of sampling and optimization techniques for finding Pareto points and Design of Experiment techniques to identify the experimentation plan. Machine learning techniques can be used to obtain a prediction of the system behavior based on the set of training data generated by DoE. The approach exploits the concept of orthogonality between application autotuning and runtime management of system resources to support multiple adaptive applications. Overall, the main challenge is to exploit design-time and run-time concepts to lead to an effective way of “self-aware” computing.
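The Pareto set mentioned above can be extracted from sampled configurations by non-dominance filtering; a minimal sketch over two objectives, e.g. execution time and energy, both to be minimized (the sampling, DoE and machine-learning machinery of the framework is not modeled here):

```python
def pareto_front(points):
    """Keep the configurations not dominated by any other: a point
    dominates another if it is no worse in every objective and
    strictly better in at least one (all objectives minimized)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# (execution time [s], energy [J]) of candidate configurations
candidates = [(10, 50), (12, 40), (15, 35), (11, 60), (12, 45)]
front = pareto_front(candidates)
# (11, 60) is dominated by (10, 50); (12, 45) by (12, 40)
```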

I/O-Efficient Similarity Join

University of Padova, Italy

We present an I/O-efficient algorithm for computing similarity joins
based on locality-sensitive hashing (LSH). In contrast to the
filtering methods commonly suggested our method has provable
sub-quadratic dependency on the data size. Further, in contrast to
straightforward implementations of known LSH-based algorithms on
external memory, our approach is able to take significant advantage of
the available internal memory: Whereas the time complexity of
classical algorithms includes a factor of N^{ρ}, where ρ is a
parameter of the LSH used, the I/O complexity of our algorithm merely
includes a factor (N/M)^{ρ}, where N is the data size and M is
the size of internal memory. Our algorithm is randomized and outputs
the correct result with high probability. It is a simple, recursive,
cache-oblivious procedure, and we believe that it will be useful also
in other computational settings such as parallel computation.
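The LSH backbone of such a join can be sketched with bit-sampling hashes over binary vectors: only pairs that collide in some bucket are verified (an in-memory toy under Hamming distance; the paper's contribution is the recursive, cache-oblivious external-memory organization, which this sketch does not capture):

```python
import random

def lsh_join(vectors, radius, bands, bits_per_band, rng):
    """Toy LSH similarity join under Hamming distance: hash each
    vector by a few sampled bit positions, then verify only the
    pairs that collide in at least one band."""
    d = len(vectors[0])
    out = set()
    for _ in range(bands):
        positions = rng.sample(range(d), bits_per_band)
        buckets = {}
        for idx, v in enumerate(vectors):
            key = tuple(v[p] for p in positions)
            buckets.setdefault(key, []).append(idx)
        for group in buckets.values():
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    a, b = group[i], group[j]
                    ham = sum(x != y for x, y in zip(vectors[a], vectors[b]))
                    if ham <= radius:       # verify the candidate pair
                        out.add((a, b))
    return out

rng = random.Random(1)
vs = [(0,) * 8, (0,) * 7 + (1,), (1,) * 8]
pairs = lsh_join(vs, radius=1, bands=4, bits_per_band=3, rng=rng)
# every reported pair is guaranteed to be within the radius;
# close pairs are found only with high probability
```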

A model-based approach to workload partitioning for heterogeneous systems

Delft Univ of Technology, Netherlands

Most high performance computing systems today are heterogeneous platforms composed of multi-core CPUs and different types of accelerators. The heterogeneity of the hardware mix, the memory structure, and the diversity of the applications pose significant challenges to exploiting such platforms. In this situation, an effective workload partitioning between processing units is critically important for improving application performance. This partitioning is a function of the processing, memory, and network capabilities as well as the application and the dataset to be used. We present a systematic approach to solve the partitioning problem. Specifically, we use modeling, profiling, and prediction techniques to quickly and correctly predict the optimal workload partitioning and the right hardware configuration to use. Our approach effectively characterizes the platform heterogeneity, efficiently determines an accurate partitioning, and easily adapts to new platforms, different application types, and different datasets.
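At its simplest, model-based partitioning balances the predicted finish times of the processing units: if profiling predicts throughputs t_cpu and t_gpu (work items per second), the GPU share that equalizes them is t_gpu / (t_cpu + t_gpu). This is a deliberately simplified sketch; the talk's models also account for memory and data-transfer costs.

```python
def optimal_split(throughput_cpu, throughput_gpu):
    """GPU share of the workload that makes both units finish at the
    same time, given predicted throughputs (items/second)."""
    return throughput_gpu / (throughput_cpu + throughput_gpu)

def predicted_time(work, split, throughput_cpu, throughput_gpu):
    """Makespan of a given split: the slower of the two parts."""
    return max(work * split / throughput_gpu,
               work * (1 - split) / throughput_cpu)

tc, tg, work = 20.0, 80.0, 1000.0
best = optimal_split(tc, tg)                  # 0.8 of the work -> GPU
t_best = predicted_time(work, best, tc, tg)
t_gpu_only = predicted_time(work, 1.0, tc, tg)
# the balanced split beats running everything on the faster unit
```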

*Joint work with Jie Shen (NUDT, China) and Ana Varbanescu (University of Amsterdam, Netherlands).*

Continuum Computer Architecture for High Performance Computing

School of Informatics, Computing, and Engineering Indiana University, USA

The success of the MPP and commodity cluster class of parallel
computer architectures over the span of two dozen years or more hides
the fact that with all of the incorporated innovations they are still
basically derivatives of the foundational von Neumann architectures
first introduced before 1950. The driving technology trends described
by Moore's Law, in conjunction with architectural instruction-level
parallelism and exponential clock rate growth, have powered an overall
performance gain of 10^13 during the intervening era. But these
opportunities have almost flat-lined, requiring new creative strategies
for significant improvements in performance. A consideration of
fundamental principles demands novel methods of addressing the key
performance parameters of starvation, latency, overhead, and
contention as well as energy and resilience. Further examination makes
clear that a non von Neumann architecture may be essential to reverse
the optimizations, once of great value, but now counterproductive. In
particular, the prescribed requirements of ALU/FPU utilization,
sequential instruction issue, separation of memory and logic
components (referred to as the “von Neumann bottleneck”), and
sequential consistency of data access are all artifacts of an earlier
age of technology and force current designs in non-productive ways. An
alternative family of computer designs, here called the “Continuum
Computer Architecture”, borrows from such research domains as cellular
automata, data flow computing, processor in memory, asymmetric
multitasking, and object-oriented control (e.g., futures) to project a
new stratagem for HPC beyond exascale with greater efficiency,
scalability, and performance to cost than conventional systems. This
presentation will describe in outline the Simultac Fonton, a simple
architecture comprising billions of lightweight logic/memory elements
that in synergy produce global parallel computing as an emergent
property. When employed with the control discipline specified by the
ParalleX execution model and running the HPX+ runtime system, a new
possibility for highly scalable computing structures is enabled. Early
analysis suggests opportunities for parameter gains with respect to
conventional practices.

On the use of Container Solutions for simplifying Simulation Environments

T.U. Munich, Germany

The full functionality of POLOPT, an electrostatic field simulation code in production use at ABB, has been achieved by embedding the code along with all required components into a Singularity container. With our first implementation, jobs can be submitted and controlled in a multi-node environment of compute clusters running SLURM and Torque as workload managers. The overall goal is to make the entire simulation software transparent, such that there are no specific requirements for the clusters the software is supposed to run on. Thus, in principle, any cluster installation that supports Singularity can be used to run the simulations.

Through this approach, system administrators do not need to worry about specific libraries to be installed on their systems since the entire simulation environment including all required system components comes “out of the box” as a Singularity container.

We decided to use Singularity as a first approach, as there are some
security concerns with the most common container environment, Docker,
which will be discussed in the talk. A third alternative, PRoot, will
also be discussed.

Parallel memories: a feasible solution for bandwidth-limited applications?

University of Amsterdam, Netherlands

Intuitively, a parallel memory is a good solution for any
bandwidth-limited application. However, building the right parallel
memory for a given application - namely, a custom parallel memory -
remains a challenge. In this talk, we present our take on customizing
parallel memories. We further show results that demonstrate how
parallel memories can speed-up applications with various access
patterns, and dive into more details for a specific case-study. We
finally discuss our current prototype of a parallel memory based on a
reinterpretation of Polymorphic Register Files and a Maxeler FPGA
system.

LAIK: A Library for Fault Tolerant Distribution of Global Data for Parallel Applications

T.U. Munich, Germany

HPC applications are usually not written in a way that lets them cope with dynamic changes in the execution environment, such as removing or integrating nodes or node components. For higher flexibility with regard to scheduling and fault tolerance strategies, however, an adequate application-integrated reaction would be worthwhile. With legacy MPI codes, this is difficult to achieve.

In the talk I will present LAIK, a lightweight library for distributed
index spaces and associated data containers for parallel programs
supporting fault tolerance features. By giving LAIK control over data
and its partitioning, the library can free compute nodes before they
fail and do replication for rollback schemes on demand. This way,
applications become more adaptive to changes of available resources.
I show example code using LAIK and results on a first prototype.