Abstracts (in progress)

A Practical Randomized CP Tensor Decomposition

Casey Battaglino
Georgia Tech

The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multi-way data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined least squares problems. We show that, by extending randomized least squares methods to tensors, the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as efficient computation of stopping conditions. We also show more generally that the Khatri-Rao product (used within the CP iteration) produces conditions favorable for direct sampling without any preprocessing. In preliminary results, we see improvements in speed, storage, and robustness.

Joint work with Grey Ballard (Wake Forest) and Tamara G. Kolda (Sandia).

Lower bound techniques for communication in the memory hierarchy

Gianfranco Bilardi
University of Padova

Communication requirements between different levels of the memory hierarchy of a DAG computation are strictly connected to the space requirements of the subset in suitable partitions of the computation DAG.

We discuss a versatile technique based on a generalized notion of visit of a DAG to obtain space lower bounds. We illustrate the technique by deriving space lower bounds for specific DAGs and also establish some limitations to the power of the proposed technique.

We will also outline a recent result on the I/O complexity of Strassen’s matrix multiplication, with recomputation.

Joint work with Lorenzo De Stefani (Brown University) and Andrea Pietracaprina (University of Padova).

Big Data in Atomistic Molecular Dynamics Simulations

Andrea Cavalli
University of Bologna and IIT

Atomistic molecular dynamics (MD) is a physical method for studying the interaction and motion of atoms and molecules according to Newton’s physics. A potential energy term is used to estimate the forces between interacting atoms and calculate the overall energy of the system. Then, during MD simulations, the integration of Newton’s laws of motions generates successive configurations of the evolving system, providing trajectories that specify positions and velocities of the particles over time. From these MD trajectories, a variety of properties can be calculated, including free energy, kinetics measures, and other macroscopic quantities, which can be compared with experimental observables. The integration of the Newton’s equation of motions is performed at a predefined time-step, which is usually in the order of 1 or a few fs, and nowadays simulations lasting up to a few milliseconds (corresponding to 1012 time steps) are now possible. In addition several different trajectories are recommended to obtain adequate statistics and exhaustive sampling of the conformational space. MD can thus produce terabytes of trajectories that calls for automatic analysis methods. One of the most representative projects in this direction is the Folding@Home project lead at the Stanford University. There, a grid all over the world manages thousands parallel and independent simulations. In particular, the aim of the project is studying the kinetics of protein folding. To reach that goal Machine Learning algorithms and statistical methods are of fundamental importance because they are able to rationalize this huge amount of data. In particular Markov State Models and scalable clustering algorithms are needed for this kind of tasks. While such analyses deal with just protein systems, one could imagine to apply the same world-wide simulation setup to more drug discovery oriented aims. For instance, simulating unbinding/binding of drugs by using an equivalent technological platform. Beside strictly speaking of big-data, machine learning based methods can be used in several ways to analyze MD trajectories: indeed also supervised classification and regression algorithms can be used to estimate specific properties/observables of relevance such the free energy. On this regards kernels methods and decision trees can be extremely useful tools together with the possible exigency of developing ad hoc algorithms.

On the Analysis of Explicitly-Parallel Languages

Alain Darte
ENS Lyon

Guided by both performance and power objectives, today’s computing infrastructures evolve toward architectures with an increasing complexity, including sophisticated memory organizations and the use of hardware accelerators. They require the user or the compiler to be able to perform parallelism extraction, locality optimization, and explicit data movements. This is the case even for the mass market with the advent of multicores and GPGPUs, which has exposed the need for parallel programming and related code optimizations to a larger public. Many different approaches are investigated, based on parallel languages, libraries, or runtime systems, possibly domain-specific, in an attempt to find suitable trade-offs between expressiveness, predictability, and performance. What is lacking is a way to make the compiler, the runtime system, and the user interact in a better way. In this quest, we explore several aspects of the analysis of explicitly-parallel languages, in an attempt to improve the communication between these three ”actors”. This includes the analysis of the underlying scheduling freedom, the analysis of liveness of scalar and arrays, and their memory mapping, illustrated through specifications of pipelines, OpenMP, OpenStream, or X10 features.

Getting stable lists of biomarkers from high-throughput data: integrating prior knowledge in the learning process

Barbara Di Camillo
University of Padova

The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for biomarker discovery using high-throughput genomic data often provide results with limited overlap.

It has been suggested that one reason for these inconsistencies may be that in complex diseases, multiple genes belonging to one or more physiological pathways are associated with the outcome. Thus, a possible approach to improve list stability is to integrate biological information from genomic databases in the learning process.

In this talk, I will discuss the reasons of biomarker list instability and present a method to regularize the discriminant functions of the classification algorithms using prior biological knowledge. The use of different biological information like functional annotations, protein-protein interaction networks and expression correlation among genes will be also assessed.

Directions and early performance results on the Thick Control Flow architecture

Martti Forsell
VTT Electronics

Thick control flows (TCF) collect together threads that are following the same control path into single data parallel entities. The degree of parallelism in TCFs can be changed by a programmer to adapt the thickness of computation according needs of functionality. The aim is to simplify programming by raising the level of abstraction and make computation more cost efficient by eliminating redundant parts from hardware and software.

At VTT, we have developed the first architecture, Thick control flow Processor Architecture (TPA), for executing TCFs natively. The architecture adopts the shared memory processing techniques from latest emulated shared memory (ESM) architectures but organizes the hardware so that control flow related execution can be overlapped with unbounded data parallel computation. In this presentation we discuss the directions we would like to explore with our TCF research, some details of the architecture, our attempt to implement it on FPGA, and present early performance results.

A portable heterogeneous implementation of the D2Q37 Lattice Boltzmann Method

Alessandro Gabbana
University of Ferrara

Lattice Boltzmann Methods (LBM) are a family of CFD algorithms widely used to solve fluid flows in several regimes. This class of algorithms – discrete in space, time and momentum space and living on a discrete and regular grid of points – offers a large amount of easily identified available parallelism; LBM algorithms are therefore ideally suited for massively parallel implementations, targeting modern many-core HPC systems.

Currently, hi-end HPC systems are more and more often based on parallel heterogeneous architectures, in which each of a large number of computing nodes has a host processor and one or more accelerators, based on GPUs, Xeon-Phi processors or – in some cases – FPGAs.

Over the years, several efficient LBM codes have been developed for these architectures, mapping the parallel structure of the underlying algorithm onto the architectural features available on the target system. In most cases, however, when considering accelerator based implementations, these codes offload almost all the intensive computational steps onto the accelerator; this choice is done to avoid bottlenecks associated to the limited communication bandwidth (and large latency) between host and accelerator, but does not exploit precious computing resources available on the host.

In this contribution we go one step further, describing a carefully optimized implementation of a D2Q37 LBM code that uses concurrently the computing resources of the accelerator and those of the host node. Several data layouts have been evaluated in order to identify a common layout capable of optimizing memory accesses both on accelerator and host side, enabling performance portability. The main advantages of this implementation are better single-node performance, increased data locality, reduced communication overheads for massively parallel implementations and better (strong and weak) scaling behavior. We provide performance results on clusters of commodity multi-core CPUs accelerated in turn by the Intel Xeon-Phi and NVIDIA Tesla GPUs, comparing with more traditional optimized implementations of the same code.

Joint work with Enrico Calore (INFN), Sebastiano Fabio Schifano and Raffaele Tripiccione (University of Ferrara).

Using Machine Learning to Support Human Learning

Manish Gupta
Xerox Research Center India

Students in many educational institutions, especially those in developing countries like India, continue to suffer from poor learning outcomes despite the growing availability of high quality educational content through MOOCS, i.e., massive open online courses. We describe our attempts to deal with this problem through a learning platform called KenSpace (previously called TutorSpace). Our platform uses machine learning techniques to evaluate videos, support search and recommendation of educational videos on a given topic, index them to support more effective navigation, e.g., by automatically generating a table of contents and glossary, and to gain insights from the learner’s interactions with the videos. We believe it is important to personalize the learning experience to suit an individual learner, to his/her goals, background, strengths and weaknesses, and learning styles. We describe many outstanding challenges in creating such a personalized learning experience, and in supporting higher order learning, especially creativity.

Data Analytics in the automotive industry – a report from aftersales

Thomas Husslein

Manufacturing companies are using data from customer interactions not only to inform marketing and sales but also to improve product development decisions. Increasingly, it is economically feasible to embed sensors in products that can generate data about actual product usage and performance and report this data home. An industry where sensors in the products are already widely spread is the automotive industry. Since the first use of embedded computers in the car, sensor data has been created, used and recorded. In the beginning data was only available when cars were serviced. Nowadays manufacturers even can obtain real-time input and adjust immediately. Typical applications within the field of aftersales are presented and from there opportunities for future work are developed.

Mitigating Impact of Memory Access Latencies in Large Graph Analytics

Manoj Kumar
IBM, T. J. Watson Research Center

Often the design point of modern multi-core/multi-threaded processors is suboptimal for large graphanalytics because, under the current hardware limits for memory access latencies and circuit speeds, memory access latency is the gating factor due to unpredictable memory access addresses. Three approaches to minimizing memory access latency are: 1) a programming model friendlier to memory accesses such as the linear algebra formulation; 2) program transformation or deep optimization to mitigate memory access latencies by improving locality in accesses; and 3) a design point of the processors devoting power and silicon area to handling large number of outstanding memory accesses arising from unpredictable memory addresses, stealing these resources from caches and data prefetch hardware which do not work for unpredictable memory accesses.

We will first present experimental evidence on the efficacy of the first two approaches and their limitations in terms of the attributes of the data and algorithms needed to benefit from these approaches. We will then present thoughts on additional attributes for which we must study the sensitivity of performance with respect to the attribute. This is essential to build synthetic data to predict the performance of an application when true data is not shareable, for example, for medical, financial or security applications. We will present our observations on elements of modern processors that are not as effective or detrimental to performance of large graph applications, while being very effective for traditional scientific-engineering or system or record applications, and finally, our thoughts on hardware capabilities for future general purpose processors that can benefit large graph analytics.

Joint work with Pratap Pattnaik, K. Ekanadham, Bill Horn, Joefon Jann, Jose Moreira, Mauricio Serrano and Gabi Tanase (IBM, T. J. Watson Research Center).

Translating HPC Performance Tools to Big Data and Analytics

Allen D. Malony
University of Oregon

The high-performance computing (HPC) field has seen a strong effort over many years by a dedicated community of researchers to create robust performance instrumentation, measurement, analysis, and optimization techniques and tools that can keep up with the rapidly evolving HPC landscape. With the rise of big data and analytics, where scalable system platforms are being targeted for high performance implementation, it is worthwhile to consider the opportunities and challenges for translating HPC performance tools to this arena. Interestingly, advances towards exascale have already driven disruptions in traditional performance tool requirements, given the move to asynchronous runtime systems, heterogeneous computing, scientific workflows, many task computing, and dynamic resource management, together with the performance problems that ensue (e.g., performance variability, need for runtime introspection, online feedback). Several of these aspects have relevance to big data and analytics environments. The talk will provide a perspective on the issues and give examples of current research activities where a carry-forward of techniques and technologies might be possible.

Bringing High Performance to Big Data and Cloud Computing

Bill McColl
Huawei Research

High Performance Computing has been extremely successful by focusing on how best to support software developers that have the technical skills required to design highly parallel algorithms, to optimize data partitioning and load balancing, to minimize communications and synchronization, and to do all of the other things necessary to achieve the highest possible performance on large-scale dedicated clusters and supercomputers with CPUs, GPUs, advanced networking, and highly parallel file systems. In HPC, where choices have to be made between, for example, high performance and fault tolerance, the tendency has almost always been to focus on the former.

Big Data and Cloud Computing have also been extremely successful in recent years by focusing on exactly the opposite community, those software developers who do not have the technical skills, or motivation, required for HPC. In Big Data and Cloud Computing, many/most of the key design decisions regarding parallelization, data partitioning, load balancing, communications, synchronization, redundancy and fault tolerance are automated, and the key objective is to enable developers to produce applications with a minimum of effort, applications that will typically be run on low-cost pools of commodity cloud hardware resources, that will often be virtualized or based on containers.

In this talk, I will describe some of the challenges in bringing high performance to big data and cloud computing, and outline some encouraging results. As machine learning moves to the center of computing, and "AI-as-a-Service" becomes a major new business opportunity, this area will become more and more important. If HPC also moves to a new era in which cost-effective "cloudonomics" matters more, then it will become important there too.


Wolfgang Nagel
T.U. Dresden

Cognitive Systems: A Status

Pratap Pattnaik
IBM, T. J. Watson Research Center

A year ago, at Scalperf ’15, we presented our view on the applicability and future popularity of cognitive systems for solving real world problems. The past year has shown these systems to be more effective and more popular than we had anticipated. Using perspectives based on algorithms, hardware, and software, this talk will present an updated view on the topic. The talk will also identify areas where major scientific breakthroughs are required before significant progress can be realized.

Joint work with K. Ekanadham, Bill Horn, Joefon Jann, Manoj Kumar, Jose Moreira, Mauricio Serrano and Gabi Tanase (IBM, T. J. Watson Research Center)

Parallel Graph Analytics on the Galois System

Keshav Pingali
The University of Texas at Austin

Unstructured data is being generated at a tremendous rate in modern applications as diverse as social networks, recommender systems, genomics, health care and energy management. Networks are an important example of unstructured data and may arise explicitly, as in social networks, or implicitly, as in recommender systems. These networks are challenging to handle; not only are they large-scale but they are constantly evolving, and many applications require difficult prediction tasks to be solved, such as link or ratings prediction.

In this talk, we will discuss scalable algorithms for a class of prediction tasks on large-scale networks. We will also present performance numbers for implementations of some of these algorithms on the Galois system, which is a data-centric parallel programming system that provides an easy-to-use programming model for implementing and tuning parallel irregular applications such as graph analytics algorithms.


J. (Ram) Ramanujam
Louisiana State University

Sparse Matrix-Matrix Multiplication for Data Analytics on GPUs

Saday Sadayappan
Ohio State University

Sparse matrix-matrix multiplication (SpMM) is required in several data and graph analytics applications, such as Markov clustering. Several recent efforts have addressed the development of efficient SpMM implementations for GPUs. However, in contrast to GPU performance easily exceeding a teraflop for matrix-matrix multiplication of large dense matrices, it has been challenging to achieve even 10 GFLOPs with multiplication of large sparse matrices. In this talk, we describe various approaches that have been proposed for SpMM on GPUs and seek to characterize factors that limit performance.

Joint work with Rakshith Kunchum and Aravind Sukumaran-Rajam (Ohio State University).

Communication costs in sparse linear algebra

Oded Schwartz
Hebrew University

The performance of parallel algorithms in numerical linear algebra is often determined by the amount of inter-processor communication. When input matrices are dense, communication costs are typically well understood. Lower bounds are known for most computation kernels, and these are attainable by communication minimizing algorithms. This, however, is not the case with sparse input matrices.

For such matrices the performance is typically determined by the communication cost as well, which in turn depends on the nonzero structure of the input matrices. Sparse algorithms are involved in fundamental computations. For example, sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multi-grid. Still, tight lower and upper bounds are known for very few combinations of algorithms and input classes. In this talk I will present some successes in recent years, and the main open challenges remaining.

Joint work with Ariful Azad, Grey Ballard, Aydin Buluç, James Demmel, Alex Druinsky, Laura Grigori, Nick Knight, Sivan Toledo, and Samuel Williams

Alignment-free techniques for comparison of genomic and meta-genomic sequencing data

Michele Schimd
University of Padova

Continuously evolving sequencing technologies generate large amount of omic data that can be used to solve a wide variety of biological problems such as classification of organisms and gene analysis.

It is natural to ask whether or not these problems could be solved using directly the sequencing data without performing assembly and alignment tasks that require substantial computational effort.

We present a distance measure based on reads data and incorporating quality values. Results obtained for genomic and meta-genomic classification show that these measures can be effectively used and that the inclusion of quality values improves the classification accuracy of such methods.

Joint work with Matteo Comin (University of Padova)

Random Lightweight Messaging through the Data Vortex Network for Graph Processing

Thomas Sterling
Indiana University

Conventional system area networks (e.g., Ethernet, IBA) design and operation emphasize bandwidth of data movement. Large packets are preferentially employed and measured as they amortize the overheads and latencies with respect to payload bits per second. Such communication is well suited for many data transfer patterns including message passing computing models (e.g., MPI) and distributed dense vector/matrix operations. It is also appropriate for streaming applications. Alternative classes of computation are emerging as increasingly relevant that require lightweight messaging and random access of communication. These include the array of scientific and data analytics problems that are based on graph processing as well as dynamic adaptive runtime computing. Adaptive mesh refinement, Particle in cell, Barnes Hut N-body, Wavelets, Fast Multipole Methods, and other sparse matrix applications rely increasingly on graph computations as do many data-intensive domains such as data analytics and machine learning, just to mention a few.

An innovative network, the Data Vortex, has been developed as a fine-grained gather-scatter engine with a new controller and switch that carries packet with 64 bit payloads. The network is devised as a multi-level minimum-logic topology with fully distributed routing control for contention avoidance. A family of clusters employing the Data Vortex network has been developed and delivered as testbeds.

This presentation will describe the Data Vortex and provide some empirical measurement data demonstrating the operational properties of this new class of network. The talk will also identify areas of interest in which such a network may have significant impact for future scaling. Among these is the opportunity for improvements in runtime system behavior for greater efficiency and scalability as well as for more general graph based problems.

Joint work with Matthew Anderson

Using Predictive Modeling to Evaluate Analytics Workloads

Nathan Tallent

Many “big data” and “analytics” workloads have irregular execution behaviors – load imbalance, irregular memory accesses, unbiased branches, and dynamic dispatch – that make it difficult to construct predictive models. In this talk, we apply predictive modeling to evaluate and optimize irregular analytics workloads. We discuss modeling and optimization of a distributed High Energy Physics workflow. We evaluate the impact of silicon photonics on graph analytics. To create models, we use the Palm model generation tool, which combines static and dynamic application analysis. Finally, we show that incorporating models into a parallel runtime can improve the energy usage of scientific applications with irregularity.

Optimization of Gas Turbines through analyzing large amounts of Sensor Data

Carsten Trinitis
T.U. Munich

Germany’s "Energiewende" and the transition to renewable energy sources which comes along with it impose major challenges to power grid and power station operators. Generating electric current by gas power plants is not considered renewable energy technology, however, it ill play an increasingly important role among future energy sources for various reasons: The higher the fraction of power generated from renewable energy sources that is submitted to the power grid, the more likely the risk of variations due to weather conditions (e.g. for solar and wind power). Therefore, the ability to rapidly supply to suit to market need is, besides their low CO2 emissions, a major advantage of gas power stations. With this, they are an ideal candidate for compensating peak loads in power grids and can avoid blackouts. Because of the high number of parameters which need to be adjusted to optimally operate a gas power station, it would be interesting to be able to analyze the data which is supplied by its numerous sensors. However, this can not yet be accomplished with the tools currently being used. Much of the recorded data has to be abandoned after a short period of time as new data has to be stored. In additions, there exists no simple way to compare data from different power stations. This is where Big Data comes into play: On the one hand, the goal is to analyze data from a single power station over a long period of time, on the other hand it is also important to analyze and compare "fleets" of gas turbines in order to obtain even more detailed insight into the processes inside the turbines.

First of all, the monitored and archived data has to be analyzed and filtered in terms of quality and significance. Moreover, meta-data needs to be identified, checked and extended (documented) if necessary. Once this pre-selection process has been finished, methods for evaluating, consolidating and analyzing can be developed and tested. The results of these steps can be used to develop an improved data management in order to optimize archiving of data and future evaluations.

In terms of data volume, large installations generate up to 200GB per turbine per day, which means that with the current configurations there is a limited amount of storage available. Also, in terms of network bandwidth, power stations can reside in areas with very slow network connections. In contrast to "classical" Big Data scenarios (e.g. Google, Facebook, Amazon) where data is generated, processed and stored IN the data centre, the conditions are significantly different for the environment described above: On the one hand, a power station generates a significant amount of data through its sensors, on the other hand there is only limited compute and storage capacity available on-site. However, the data centers capable of processing and storing the data cannot direct process the data because of the stations’ low bandwidth connections, which makes a preprocessing step necessary.

The talk will start with an introduction and motivation of the problem by explaining the background and outlining the status quo in the area. Then, possible solutions to preprocessing and analyzing the data will be introduced. A long term goal of the project is to develop a formal performance model which is capable of generating an optimal (or a nearly optimal) configuration of compute and storage components in power stations depending on e.g. hard disk throughput, compute power (on-site as well as in the data centre) and also cost factors.

Algorithms and Heuristics for Scalable Betweenness Centrality Computation

Flavio Vella
Sapienza University of Rome and CNR-IAC

Betweenness Centrality (BC) is a popular metrics widely used for finding the most important nodes in a network. However, only high performance parallel computing techniques can provide results in a reasonable amount of time since the computation of the BC score of each node is an extraordinary challenge for large-scale networks.

The approach we propose combines bi-dimensional (2-D) decomposition of the graph representing the network and multi-level parallelism. We also describe a suitable data-thread mapping that overcomes most of the difficulties caused by the irregularity of the computation on GPUs. Finally, we propose novel heuristics which exploit the topology information of the graph in order to reduce time and space requirements of BC computation.

Experimental results on synthetic and real-world graphs show that the proposed techniques allow the BC computation of graphs which are too large to fit in the memory of a single computational node along with a significant reduction of the computing time.

An adaptive algorithm for WK topologies

Lorenzo Verdoscia
CNR - ICAR Napoli

Big data has emerged as an important application domain which has shown its huge impact on scientific research, business, and society. Big data is known for its large volume, high velocity, and a variety of formats. The collection, storage, retrieval, processing and visualization of big data issue many challenges to computer architecture and systems. On the other hand, new computer architecture and system technologies enable people to handle larger and more complex data more efficiently. Moreover, the design of this computer architectures results challenging with regard to data movement, which is one of the most time consuming parts of big data processing. In addition, these computers are going to employ more and more nodes with many-cores on die that require efficient communication among the nodes and the cores, given the tremendous amount of data needs to move from node to node and from core to core. As a result of this increased flow of data, static routing cannot efficiently manage link fault, deadlock condition, load-balance traffic, resulting in an increased network contention and a reduced throughput. In contrast, adaptive routing can provide a solution to this contention problem. In fact, adaptive routing can reach efficient traffic spreading, even when traffic flow is high. Furthermore, when using adaptive routing, switches need to know little about the global state of the network or about the states of other switches. Even though we do not claim to have a definite answer to the problem of big data communication, the talk refers to an adaptive routing algorithm for the highly scalable WK networks.

Dynamic Binary Rewriting for Optimization of Parametric Data Processing

Josef Weindendorfer
T.U. Munich

There often is a trade-off between achievable performance and generic applicability of the implementation of algorithms. E.g. an implementation that can match strings for arbitrary regular expressions is always slower than specific code for a fixed known regular expression. JIT techniques can help here by automatically generating specialized versions at runtime using profile information, but this needs corresponding language runtime support which may not be available for a language stack chosen by a project. For example, HPC projects typically are based on languages compiling to native binary code for best performance, which makes runtime code adaptation difficult.

This talk presents DBrew, a library allowing for explicit control of transformation of binary code at runtime. It can generate specialized variants of functions available in binary form, and the resulting new code can be called as drop-in replacement of the existing function. We present the motivation and design criteria behind the library, as well as first evaluations with a prototype for x86-64.