Abstracts (in progress)


Experience in scaling performance-portable weather and climate applications

Mauro Bianco
CSCS Lugano, Switzerland

Developing performance-portable libraries is challenging: first, one needs a clear separation of concerns and clear interfaces to provide the right abstraction layers for a given application field; second, it is often necessary to constrain the users in ways not enforceable with (language-provided) syntactic checks. When scaling applications to large machines, the communication primitives also pose additional requirements depending on which programming paradigms will be supported for the on-node computations. This talk will discuss some of the experiences we have had, and the challenges we are facing, in the GridTools project.

The I/O complexity of Strassen’s matrix multiplication

Gianfranco Bilardi
University of Padova, Italy

A tight Ω((n/√M)^(log₂7) M) lower bound is derived on the I/O complexity of Strassen’s algorithm to multiply two n × n matrices, in a two-level storage hierarchy with M words of fast memory. A proof technique is introduced, which exploits the Grigoriev flow of the matrix multiplication function as well as some combinatorial properties of the Strassen computation DAG. Applications to parallel computation are also developed. The result generalizes a similar bound previously obtained under the constraint of no-recomputation, that is, that intermediate results cannot be computed more than once. For this restricted case, another lower bound technique is presented, which leads to a simpler analysis of the I/O complexity of Strassen’s algorithm and can be readily extended to other “Strassen-like” algorithms.
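
For reference, the bound can be written out as follows (a restatement of the claim above, with M the number of words of fast memory):

\[
  IO_{\mathrm{Strassen}}(n, M) \;=\; \Omega\!\left( \left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} M \right).
\]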

Joint work with Lorenzo De Stefani (Brown University).

Automatic Tuning of Stencil Computations with Ordinal Regression

Biagio Cosenza
T.U. Berlin, Germany

Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may not be affordable in many scenarios. Meanwhile, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. In this presentation, I will introduce a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples (i.e., rankings), the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes and by analyzing the quality of the obtained ranking in terms of Kendall rank correlation coefficients.
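
As a rough illustration of the ranking idea (not the system described in the talk), the sketch below trains a pairwise comparator on synthetic measurements of hypothetical stencil configurations (tile size, unroll factor) and uses it to rank unseen configurations; the cost model, features, and parameters are all made up for the example.

# Minimal sketch (not the authors' tool): learning to rank stencil
# configurations via pairwise comparisons, a simple stand-in for the
# ordinal-regression formulation described in the abstract.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def runtime(config):
    # Hypothetical cost model standing in for real measurements:
    # config = (tile_size, unroll_factor)
    tile, unroll = config
    return (tile - 32) ** 2 + 4 * (unroll - 4) ** 2 + rng.normal(0, 5)

configs = [(t, u) for t in (8, 16, 32, 64, 128) for u in (1, 2, 4, 8)]
times = {c: runtime(c) for c in configs}

# Pairwise training set: feature difference -> "is the first config faster?"
X, y = [], []
for a, b in combinations(configs, 2):
    X.append(np.subtract(a, b))
    y.append(int(times[a] < times[b]))

model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

# Rank unseen configurations by tournament score (predicted pairwise wins).
candidates = [(24, 2), (32, 4), (96, 8)]
score = {c: sum(model.predict([np.subtract(c, d)])[0]
                for d in candidates if d != c) for c in candidates}
print(sorted(candidates, key=lambda c: -score[c]))

A real autotuner would replace the synthetic cost model with measured runtimes and the simple comparator with the ordinal-regression model from the talk.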

Joint work with Juan Durillo (Leibniz Supercomputing Centre), Stefano Ermon (Stanford University) and Ben Juurlink (TU Berlin).

Going from C to stream-like RTL implementations for FPGA acceleration

Alain Darte
Xilinx, San Jose, California, USA

TBA

MemComputing with self-organizing logic gates

Massimiliano Di Ventra
University of California, San Diego, USA

I will discuss how to employ memory (time non-locality) in a novel physics-based approach to computation, memcomputing, and its practical realization with self-organizing logic gates (SOLGs). SOLGs are terminal-agnostic gates that self-organize to always satisfy their logical proposition regardless of which terminal(s) the truth value is assigned to. As examples, I will show the polynomial-time solution of prime factorization, the search version of the subset-sum problem, approximations to Max-SAT beyond the inapproximability gap, and the acceleration of deep learning using polynomial resources. I will also show that these digital memcomputing machines compute via an instantonic phase, implying that they are robust against noise and disorder. The digital memcomputing machines we propose can be efficiently simulated, are scalable, and can be easily realized with available nanotechnology components. Work supported in part by MemComputing, Inc. (http://memcpu.com/).

Accelerating general purpose parallel computing with the TPA architecture

Martti Forsell
VTT, Finland

The approaching end of Moore’s Law, coinciding with the dramatic rise of AI technologies, is causing confusion in the computer architecture community. Some researchers are ready to dump the current architectures entirely and switch to neural networks or tensor units (TPUs), while others emphasize the role of application-specific architectures in the form of GPUs, TPUs, etc. Characteristic of both groups seems to be a sudden and surprising change of research direction, or could we have seen this coming all along?

In this presentation we will first take a short but constructive look at how these approaches cover a simplistic computational space. As these trends seem (again) to ignore the need for efficient general-purpose parallel computing, we will then have a look at our Thick Control Flow Processor Architecture (TPA), which tries to contribute to covering that vast area of the computational space. We will show that many parallel-computing-specific techniques can be implemented efficiently in TPA despite the generality of the underlying Thick Control Flow (TCF) model. We will also consider the usefulness of some ideas for tuning this kind of architecture for the linear algebra used in ML.

Accelerating Dense Matrix Multiplication through DataFlow-Threads (DF-Threads)

Roberto Giorgi
University of Siena, Italy

Dataflow techniques have been explored at different levels of granularity, such as the instruction level in the very successful superscalar processors, where instructions fire out of order when their operands are ready, or at a coarser level, in programming models such as OmpSs/StarSs (now converging into the next OpenMP), where the runtime manages the data flow among tasks and schedules a possibly large group of instructions (i.e., a dataflow task) on the available computational units (CPU cores, GPU cores, accelerators). However, the hardware-software interface does not yet provide: i) proper support for parallelizing threads; ii) a universally accepted memory consistency model. Most of the burden is caused by the need for synchronization, consistency and coherency: an old problem, now exacerbated by the intensive use of inexpensive massively parallel systems. In the recent projects TERAFLUX and AXIOM, DataFlow-Threads (DF-Threads) have been explored as a possible approach for scaling performance while providing a cleaner interface for future massively parallel systems: DF-Threads can be supported in the architecture by a few new instructions, and can extend existing processors so that they provide more efficient and better-performing parallelism. Our experimental results show almost perfect scalability for systems with 1000+ general-purpose x86_64 (extended) cores running dense matrix multiplications on a Linux-based OS. Could DF-Threads become an easier and more efficient way to deploy highly scalable general-purpose systems?

Using AI to Transform Informational Videos and Our Watching Behavior

Manish Gupta
VideoKen and IIIT Bangalore

Videos account for about 75% of Internet traffic, and enterprises are increasingly using videos for various informational purposes, including training of customers, partners and employees, marketing, and internal communication. However, most viewers do not have the patience to watch these videos end-to-end, and our video watching experience has not evolved much in over a decade. We present an AI-based approach to automatically index videos in the form of a table of contents, a phrase cloud and a searchable transcript, which helps summarize the key topics in a video and lets viewers navigate directly to the topics of interest. We use a combination of visual classification, object detection, automated speech recognition, text summarization, and domain classification, and show the results achieved on a range of informational videos. We conclude with some thoughts on the promise of transforming how informational videos are consumed as well as open problems and future directions.

Power Modeling of Heterogeneous Mobile SoCs using Machine Learning

Ben Juurlink
T.U. Berlin, Germany

GPUs have recently become important computational units on mobile devices, resulting in heterogeneous SoCs that can run a variety of parallel processing applications. While developing and optimizing such applications, estimating power consumption is of immense importance, as energy efficiency has become the key design constraint to optimize for on these devices. In this work, we apply machine learning techniques to build a predictive model for estimating the power consumption of parallel applications on heterogeneous mobile SoCs. Our model is an artificial neural network trained using CPU and GPU performance counter values along with measured power data. The model is trained and evaluated with data collected using a set of OpenGL graphics workloads as well as OpenCL compute benchmarks. Our evaluations show that the model achieves accurate power estimates, with a mean relative error of 4% for both compute and graphics workloads, and that it is about 4x better than a state-of-the-art statistical linear regression model.
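
A minimal sketch of this kind of modeling pipeline is shown below; the counters, data, and network size are invented for illustration and do not reflect the actual model or benchmarks.

# Illustrative sketch only: a small neural network mapping performance-counter
# readings to measured power, trained on synthetic data.  Counter semantics
# and the data itself are made up for the example.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
counters = rng.uniform(0, 1, size=(n, 5))      # e.g. normalized CPU cycles, cache
                                               # misses, GPU busy, mem BW, vertex ops
true_w = np.array([1.5, 0.8, 2.0, 1.2, 0.5])
power = 2.0 + counters @ true_w + 0.1 * rng.normal(size=n)   # synthetic watts

X_tr, X_te, y_tr, y_te = train_test_split(counters, power, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                     random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
mre = np.mean(np.abs(pred - y_te) / y_te)      # mean relative error
print(f"mean relative error: {mre:.2%}")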

Theory of Accelerator Computing

Bill McColl
Huawei, Paris, France

The talk will survey basic problems, algorithms, and results related to computing "deep computations" on memory-limited edge devices and accelerators. Pebbling and Space-Time tradeoffs are revisited for the new AI Era.

Implementing the GraphBLAS C API

Jose Moreira
IBM Research, NY, USA

This talk will describe our implementation of the GraphBLAS C API. The GraphBLAS C API defines a set of objects and methods that are building blocks for graph algorithms. Furthermore, the same building blocks have been demonstrated to also support deep neural network algorithms. The implementation fully hides the internals of GraphBLAS objects from application programs, which only operate on pointers to opaque objects. The interface is defined by a C11-compliant include file, while the methods are implemented in C++ and packaged as a library with C language bindings. Internally, the library is built using a layered approach, in which most of the heavy work is done by member methods of the objects, while an API layer provides compatibility with the specification. We present an overview of the GraphBLAS C API and then proceed to describe the organization of the library, explain how it translates the polymorphic interface to its nonpolymorphic version, and describe the inner workings of two GraphBLAS methods, reduction and matrix-matrix multiplication, which serve as running examples of the implementation.
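
To make the two running examples concrete, the following is a conceptual illustration in plain numpy (not the GraphBLAS C API) of a matrix-matrix multiplication and a reduction over a Boolean semiring:

# Conceptual illustration only: "mxm" and "reduce" over the (OR, AND)
# semiring on the adjacency matrix of a small directed graph.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=bool)

# mxm over (OR, AND): A2[i, j] is true iff j is reachable from i in two steps.
A2 = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A2[i, j] = np.any(A[i, :] & A[:, j])   # OR-reduce of AND products

# reduce each row with the OR monoid: does node i reach anything in two steps?
row_reduce = A2.any(axis=1)
print(A2.astype(int))
print(row_reduce.astype(int))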

From HPC to Data Analytics: The Challenges Ahead

Wolfgang Nagel
T.U. Dresden, Germany

For many years, HPC has been using fast supercomputers to provide solutions for many different scientific areas. Besides technology improvements, modern algorithms and methods contribute significantly to the successes achieved in modeling and simulation. More and more often these days, progress also comes from large data sets and innovative data analytics methods. To obtain excellent results in both areas, HPC infrastructures have to adequately support data and data analytics capabilities. For that reason, TU Dresden has extended its existing HPC infrastructure – already organized as Infiniband islands – by three components:

  1. A large 2 PB SSD farm supporting an I/O-bandwidth of up to 2 TB/s;
  2. An object store infrastructure for about 10 PB;
  3. A Power 9 based machine learning infrastructure from IBM with more than 100 Nvidia V100 GPUs.

The talk will introduce the architectural environment and – based on very detailed measurements – will compare the linear algebra capabilities of several Nvidia GPUs, especially of the Nvidia V100.

Manycore/Multicore Parallel Traversal of Large Ensembles of Decision Trees

Salvatore Orlando
Ca’ Foscari University of Venice, Italy

Machine-learnt models based on additive ensembles of regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. The deployment of such models is computationally demanding: to compute the final prediction, the whole ensemble must be traversed by accumulating the contributions of all its trees. In particular, traversal cost impacts applications where the number of candidate items is large, the time budget available to apply the learnt model to them is limited, and the users’ expectations in terms of quality of service are high. Document ranking in web search, where sub-optimal ranking models are deployed to find a proper trade-off between efficiency and effectiveness of query answering, is probably the most typical example of this challenging issue. This paper investigates multi/many-core parallelization strategies for speeding up the traversal of large ensembles of regression trees, thus obtaining machine-learnt models that are, at the same time, effective, fast, and scalable. Our best results are obtained by the GPU-based parallelization of the state-of-the-art algorithm, with speedups of up to 102.6x.
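
For context, a naive sequential version of the traversal being parallelized looks roughly like the sketch below; the toy ensemble and documents are made up, and the talk's contribution lies precisely in restructuring this computation for multi-core, many-core, and GPU targets.

# Naive baseline (not the parallel algorithm from the talk): score documents
# by traversing every regression tree of an additive ensemble and summing
# the leaf contributions.
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=0.0):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value
    def is_leaf(self):
        return self.left is None

def score(ensemble, x):
    total = 0.0
    for root in ensemble:
        node = root
        while not node.is_leaf():
            node = node.left if x[node.feature] <= node.threshold else node.right
        total += node.value
    return total

# A toy ensemble of two decision stumps over two features.
t1 = Node(feature=0, threshold=0.5, left=Node(value=0.1), right=Node(value=0.9))
t2 = Node(feature=1, threshold=0.3, left=Node(value=-0.2), right=Node(value=0.4))
ensemble = [t1, t2]

docs = np.array([[0.2, 0.7], [0.8, 0.1]])
print([score(ensemble, d) for d in docs])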

Joint work with F. Lettich, C. Lucchese, F. M. Nardini, R. Perego, N. Tonellotto, R. Venturini.

LORE: A Loop Repository for the Evaluation of Compilers and other Educational Applications

Alex Nicolau and Alex Veidenbaum
University of California Irvine, USA

Although numerous loop optimization techniques have been designed and deployed in commercial compilers in the past, virtually no common experimental infrastructure or repository exists to help the compiler community evaluate the effectiveness of these techniques. This paper describes a repository, LORE, that maintains a large number of C-language for-loop nests extracted from popular benchmarks, libraries, and real applications. It also describes the infrastructure that builds and maintains the repository. Each loop nest in the repository has been compiled, transformed, executed, and measured independently. These loops cover a variety of properties that can be used by the compiler community to evaluate loop optimizations using a broad and representative collection of loops. To illustrate the usefulness of the repository, we also present two example applications. One is assessing the capabilities of the auto-vectorization features of three widely used compilers. The other is our use of this repository as a vehicle for quick hands-on teaching of some core parallel processing concepts and vectorization.

Aspects of provenance and performance for computational workflows

Line Pouchard
Brookhaven National Laboratory, NY, USA

In extreme-scale computing environments such as the DOE Leadership Computing Facilities, which exhibit increased heterogeneity and expected asynchrony, scientific workflows are routinely used to coordinate software processes for the execution of complex computational applications that perform in-silico experiments run as model simulations. Workflows enable the orchestration of the numerous processes that read, write, and analyze data and calculate quantities of interest for parallel and distributed scientific applications ranging from quantum chemistry to molecular dynamics, climate modeling, and many others. Important factors for any application running in such environments include execution time and performance, accuracy of calculations, and the ability to analyze results. Given the limitations of a fixed resource budget (a number of cores allocated for a specific period of time), and with simulations running for several days or weeks, it is important to determine whether a simulation run is progressing as expected, what variations in performance a run exhibits, and to what they can be attributed. Monitoring the performance of workflows in HPC provides insights into this progression, into how the computational resources are used, and into where execution bottlenecks occur. But monitoring performance without also simultaneously tracking provenance is not sufficient to understand variations between runs, configurations, versions of a code, and changes in the implemented stack and systems, i.e., the variability of performance metrics data in their historical context. To gain this type of insight, the provenance of workflow performance is needed. We define the provenance of workflow performance as provenance that captures and correlates traditional provenance characteristics and performance metrics data. This type of provenance is used for performance analysis in empirical studies on the performance of a software package or workflow during a development phase or in different computational environments.

Against the backdrop of complicated architectures with large heterogeneity of devices, complex memory hierarchies, and communication patterns, teams face an increasing need to establish credibility by producing reproducible science. We propose an approach for improved reproducibility in High Performance Computing that captures and relates fine-grained, targeted provenance, including the run-time environment. Extracting provenance and environment will become even more important in the context of exascale systems that will feature multiple GPUs per node. We extend our approach by capturing provenance characteristics of machine learning algorithms and performance metrics in a data analysis architecture, as such techniques are increasingly used in science for a variety of purposes, in particular anomaly detection and data reduction. The system capabilities are illustrated on a quantum chemistry use case provided by NWChemEx, where we detail performance reproducibility in molecular dynamics workflows on HPC computing platforms.

Program Generation for Small-Scale Linear Algebra

Markus Püschel
Department of Computer Science, ETH Zürich, Switzerland

Many performance-critical computations in communication, control, multimedia processing, machine learning, or graphics fall into the domain of linear algebra. Existing libraries for linear algebra are usually optimized for large-scale computation and for use in scientific computing. For small-scale computations in other domains they are often suboptimal. In this talk I present our work on generating optimized linear algebra code directly from a mathematical description using techniques developed in Spiral (www.spiral.net): layers of domain-specific languages to express the mathematics, and the use of rewriting systems to reshape the computation at a high level of abstraction in order to overcome known compiler limitations.

This is the thesis work by Daniele Spampinato; project website: https://acl.inf.ethz.ch/research/LGen/.

Towards tools for performance bottleneck analysis

Saday Sadayappan
Ohio State University, Columbus, OH, USA

Accurate modeling of the performance of alternative configurations is very desirable for optimizing compilers but is an extremely challenging problem. This talk will discuss a new approach being pursued for performance bottleneck identification in GPU code. It involves abstract emulation of the binary code of a GPU kernel along with sensitivity analysis with respect to fundamental latency/throughput parameters of component resources. The use of this bottleneck analysis approach to optimize tensor contraction kernels will also be discussed.
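
As a toy illustration of the sensitivity-analysis idea (not the abstract-emulation tool itself), one can perturb the latency/throughput parameters of a simple analytical kernel model and observe which parameter the predicted runtime reacts to; the model and numbers below are invented.

# Toy sensitivity analysis: halve each resource parameter of a simple
# analytical model and see which one the predicted runtime is sensitive to.
def predicted_time(flops, bytes_moved, flop_rate, mem_bw, launch_latency):
    return launch_latency + max(flops / flop_rate, bytes_moved / mem_bw)

kernel = dict(flops=2e9, bytes_moved=8e8)             # hypothetical kernel
params = dict(flop_rate=1e12, mem_bw=5e11, launch_latency=1e-5)

base = predicted_time(**kernel, **params)
for name in params:
    perturbed = dict(params, **{name: params[name] * 0.5})
    t = predicted_time(**kernel, **perturbed)
    # The largest ratio points to the bottleneck resource.
    print(f"halving {name}: runtime ratio {t / base:.2f}x")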

Joint work with C. Hong, A. Sukumaran-Rajam, J. Kim, P. Rawat, F. Rastello, L.-N. Pouchet, S. Krishnamoorthy.

Tensor Comprehensions for Machine Learning in SaC

Sven-Bodo Schölz
Heriot-Watt University, United Kingdom

Over the last few years, several machine learning frameworks such as TensorFlow, Caffe, Caffe2, or Torch have established themselves. They can be seen as specialised DSLs with highly elaborate, hand-tuned high-performance implementations at their core, typically with many person-hours of implementation and tuning effort behind them. This talk looks at an alternative approach based on the generic programming language SaC, which aims at high productivity. We show how the tensor comprehension constructs in SaC can be used to implement deep-learning algorithms very concisely, reflecting the underlying mathematics very directly. We also present some initial performance figures which suggest that the generated codes are competitive with those obtained from the highly tuned frameworks.

Adaptive MapReduce Similarity Joins

Francesco Silvestri
University of Padova, Italy

Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity joins, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumuller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving on them significantly for more structured data. We show that this adaptive strategy can be carried over to the parallel setting, combining the advantages of both approaches. In particular, we show that a simple modification to Hu et al.’s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
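
For intuition, a bare-bones LSH-based similarity join follows the hash-bucket-verify pattern sketched below (a single repetition, with invented data; it omits the output- and density-adaptive machinery that is the subject of the talk):

# Simplified LSH join sketch: hash both sets, bucket by signature, verify
# candidate pairs that collide.  One repetition only, so some pairs are missed.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
d, r = 8, 1.0
S = rng.normal(size=(50, d))
R = rng.normal(size=(60, d))

# Random-hyperplane signatures (an LSH family for angular distance).
planes = rng.normal(size=(d, 6))
sig = lambda X: [tuple(row) for row in (X @ planes > 0)]

buckets = defaultdict(lambda: ([], []))
for i, h in enumerate(sig(S)):
    buckets[h][0].append(i)
for j, h in enumerate(sig(R)):
    buckets[h][1].append(j)

# Verify candidate pairs that collided in a bucket.
pairs = [(i, j)
         for s_ids, r_ids in buckets.values()
         for i in s_ids for j in r_ids
         if np.linalg.norm(S[i] - R[j]) <= r]
print(len(pairs), "near pairs found in a single repetition")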

Joint work with Samuel McCauley.

Efficient Implementation of Unconstrained Submodular Function Minimization

Tyler Smith
ETH Zürich, Switzerland

Submodular set functions are an important topic in machine learning. These are functions that operate on subsets, returning a real number. They exhibit a diminishing returns property whereby it is more valuable to add an element to a smaller set than to a larger one. This is a property that arises naturally in many applications, giving rise to tasks that boil down to either submodular function minimization or maximization. Examples of interesting tasks that can be solved with either submodular function minimization or maximization include: optimal sensor placement, finding the maximum flow in a flow network, corpus selection, and image de-noising.
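
In symbols, the diminishing-returns property reads, for a set function f on a ground set V:

\[
  f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
  \qquad \text{for all } A \subseteq B \subseteq V,\; x \in V \setminus B.
\]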

In this talk we will discuss an efficient implementation of the Fujishige-Wolfe minimum norm point algorithm. This algorithm solves a submodular set function minimization problem by turning it into a geometric one. This geometric problem is solved by maintaining a set of vectors, represented as a matrix S, and finding the point closest to the origin within their affine hull. The computations involved in this algorithm can be broken into three categories: incrementally updating the QR factorization of S, using that QR factorization to find the point closest to the origin within the affine hull of S, and evaluating the submodular function to decide the next vector to add to S. We will describe efficient, scalable implementations of these operations and analyze their performance.
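
The affine-minimizer step can be illustrated with the small sketch below, which solves the KKT system of the constrained least-squares problem directly; the talk's implementation instead maintains an incremental QR factorization of S, which is what makes this step efficient and scalable.

# Illustrative sketch: given the columns of S, find the point in their
# affine hull closest to the origin by solving the small KKT system
#   minimize 0.5 * y^T (S^T S) y   subject to   sum(y) = 1.
import numpy as np

def affine_minimizer(S):
    """Return (point, coefficients y with sum(y) == 1) minimizing ||S @ y||."""
    k = S.shape[1]
    Q = S.T @ S
    kkt = np.block([[Q,               -np.ones((k, 1))],
                    [np.ones((1, k)),  np.zeros((1, 1))]])
    rhs = np.concatenate([np.zeros(k), [1.0]])
    sol = np.linalg.solve(kkt, rhs)
    y = sol[:k]
    return S @ y, y

S = np.array([[2.0, 0.0],
              [0.0, 2.0]])       # two 2-D points as columns
point, y = affine_minimizer(S)
print(point, y)                  # nearest point on the line x + y = 2 is (1, 1)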

How to spot Linear Algebra in Spreadsheets

Nicolai Stawinoga
Imperial College, United Kingdom

Spreadsheets provide Turing-complete programmability to a large number of businesses and users. In this talk, we discuss benefits of vectorising the compute-graph of spreadsheets, with a view towards finding patterns matching the compute structures of linear algebra, in order to optimise evaluation times of end nodes.

On the Use of Data Mining for multi-dimensional Sensor Data

Carsten Trinitis and Amir Raoofy
T.U. Munich, Germany

The talk will present TurbO (Turbine Optimisation), a project funded by the Bayerische Forschungsstiftung which investigates the applicability of data mining algorithms to industrial sensor data gathered from gas-fired power plants across the world. As a follow-up to the talk given at ScalPerf’16, which presented some preliminary ideas, we will show results obtained within the first year of the project, which started in 2017 and will run for three years in total.

Operation of gas turbines creates huge amounts of data (within the range of several TB per year for each power plant) through comprehensive sensor measurements. To be able to process and analyse these huge amounts of sensor data, which come as time series, dedicated hardware is required in terms of storage, compute power, and network, as well as a specific software infrastructure. Once this infrastructure has been set up, appropriate data mining and machine learning algorithms need to be identified and adapted to properly cope with the data.

We will start with introducing the hardware and software solutions that we set up for our investigations. The main part of the talk will focus on algorithms we consider as suitable so far.

The first algorithm under investigation (SCRIMP++) is based on the so-called matrix profile, which contains information on similarities within multi-dimensional time series. By detecting these similarities we are able to analyse recurring events (which could be a hint of some unwanted behaviour) within the gas-fired power station.
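
To make the notion concrete, a brute-force matrix profile for a one-dimensional series can be computed as in the sketch below; SCRIMP++ computes the same profile far more efficiently, and the project targets multi-dimensional sensor data. The series and planted motif here are synthetic.

# Brute-force matrix profile: for every length-m subsequence, the
# z-normalized distance to its nearest non-trivial neighbor.
import numpy as np

def matrix_profile(ts, m):
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    profile = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - m // 2): i + m // 2 + 1] = np.inf   # exclusion zone
        profile[i] = d.min()
    return profile

rng = np.random.default_rng(3)
ts = rng.normal(size=300)
ts[50:70] = ts[200:220] = np.sin(np.linspace(0, 3, 20))   # planted repeated motif
mp = matrix_profile(ts, 20)
print(mp.argmin())   # lands at one of the planted motif positions (50 or 200)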

Another algorithm being investigated is TICC (Toeplitz Inverse Covariance-based Clustering), which is based on Markov Random Fields (MRFs). Through this, clusters with similar characteristics can be identified within multi-dimensional time series.

These two approaches are currently being applied to sensor data from Stadtwerke München, which serves as a real-world example to demonstrate their feasibility.

The talk will conclude with an outlook on future analysis using higher sampling frequencies (25 kHz instead of 1 Hz) and additional data sources. Open questions with regard to specific frameworks or machine learning methods as well as hardware shall be discussed.

A model-driven approach for a new generation of adaptive libraries

Flavio Vella
Free University of Bozen, Italy

Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, so traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we use a predictive model to select the optimal algorithmic parameters, trained with synthetic and real datasets. We demonstrate the effectiveness of our approach on a BLAS library, and specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.
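
The general idea can be illustrated with the toy model-selection sketch below; the variant labels, features, and "measurements" are synthetic and do not correspond to the actual library, model, or datasets.

# Toy version of the general idea (not the paper's model): learn to pick the
# best matrix-multiplication variant from input shape features, training on
# synthetic labels produced by a hypothetical ground-truth rule.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

def best_variant(m, n, k):
    # Hypothetical ground truth: small problems favor variant 2,
    # tall-skinny shapes variant 1, everything else variant 0.
    if m * n * k < 1e5:
        return 2
    if max(m, n) > 8 * k:
        return 1
    return 0

shapes = rng.integers(8, 2048, size=(500, 3))
labels = [best_variant(*s) for s in shapes]

model = DecisionTreeClassifier(max_depth=4).fit(shapes, labels)
print(model.predict([[2048, 2048, 64], [32, 32, 32], [512, 512, 512]]))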

Components for a High Performance GraphBLAS

Richard Veras
Louisiana State University, Baton Rouge, LA, USA

The GraphBLAS provides a powerful API for instantiating graph analytic operations in the language of linear algebra. Like its namesake, the BLAS, a goal of this project is to provide a set of efficient building blocks for creating performance portable graph routines. However, for unstructured graph operations it is particularly difficult to extract performance from modern hardware features, such as multilevel caches and vector instructions. In the case of more structured domains, like dense linear algebra, this is achieved through the careful organization of data movement and computation based on the underlying structure of the data. In this work, we adapt this approach for linear algebra-based graph analytic algorithms. When applicable, we exploit the hierarchical structure observed in real-world networks to fit the data to the cache structure of the target system. In turn, these efficient access patterns feed generated and tuned computational kernels. The end result of this work includes high performance implementations of key graph operations along with a collection of efficient components for a broader set of GraphBLAS operations.

A collective communication model for WK-recursive network topologies

Lorenzo Verdoscia
CNR-ICAR Napoli, Italy

Deep learning techniques have seen great success in diverse application domains including speech processing, computer vision, and natural language processing. The two main reasons for their recent success are: 1) the availability of large-scale training data; and 2) advances in computing hardware that allow large-scale neural networks to be trained efficiently on this data. Aside from the high data parallelism, convolutional neural network (CNN) training also involves high volumes of message transfers between the many-core accelerators, mainly due to the forwarding and storing of data between adjacent CNN layers. Most of these accelerators are spatial in nature, i.e., an array of interconnected processing elements (PEs) is used to provide parallelism. The performance of a many-core architecture is highly dependent on the capabilities of its communication backbone, namely the network-on-chip (NoC). An efficient NoC designed for a many-core platform should align its connectivity with the application’s on-chip traffic patterns. However, there has been little research on the architecture or implementation of a NoC interconnecting the PEs. NoCs usually rely on message passing, since it is scalable and more efficient than shared memory. For computation-intensive tasks, parallel applications are typically employed that leverage the underlying parallel architecture; they use multiple cooperating and communicating processes to speed up the computation. Communication operations may be either point-to-point, involving a single source and a single destination, or collective, in which more than two processes participate. Collective communication traffic is more challenging to handle on conventional NoC platforms than unicast traffic, because the long-range data exchanges and dense communication it requires can cause heavy congestion. Without proper support from the underlying NoC architecture and routing protocol, this communication can lead to high network and queueing latencies.

In a WK-recursive interconnect topology, the number of nodes can scale up to the thousands. Consequently, simple policies are required to efficiently manage inter-node communication, especially when groups of cooperating nodes need to exchange groups of related information. The solution proposed for this collective communication consists in treating message transfer as communication among homologous level-nodes. This solution, which takes advantage of the recursive properties of the network, allows each gather node of a sender level-node 1) to identify the correct message sequence; 2) to evaluate the quantum to assign to the scheduler; and 3) to transfer the sequence as a single message. The resulting message is then transferred to its homologous nodes in a receiver level-node over a level channel in pipelined fashion to increase throughput. Delivery is carried out with a straightforward communication algorithm.

Are existing Parallel Programming Models ready for Future HPC Systems?

Josef Weidendorfer
T.U. Munich, Germany

In this talk, I will discuss the pros and cons of existing and proposed parallel programming models with regard to scalability challenges on future, deeply hierarchical heterogeneous systems with huge numbers of compute components, as well as new usage expectations such as malleability and fault tolerance. To this end, I will present the current state of our own library proposal (“LAIK”) on top of MPI, together with first promising results.