Pervasive Cognitive Computing: The Energy Efficiency Challenge

Luca Benini
University of Bologna

Cognitive computing at the "Internet-of-Things" scale requires massive amounts of local signal processing, coupled with advanced multi-standard/multi-mode communication capabilities. From the computational viewpoint, the challenge is formidable and can be addressed only by pushing computing fabrics toward massive parallelism and brain-like energy efficiency levels. We believe that CMOS technology can still take us a long way toward this vision. Our recent results with the PULP (parallel ultra-low power) open computing platform demonstrate that pJ/op (i.e., GOPS/mW) computational efficiency is within reach in today's 28nm CMOS FDSOI technology. This energy efficiency level enables the embedding of several key building blocks of cognitive computing, such as learning and classification algorithms, within a few mm^2 of silicon and a few mW of power. Looking forward, even pJ/op won't be enough, and we will need to fully exploit the flexibility of heterogeneous 3D integration, stop being religious about analog vs. digital and von Neumann vs. "new" computing paradigms, and seriously look into relaxing traditional "hardware-software contracts" such as numerical precision and error-free permanent storage.

Mapping what we know on what we know: the case of GridTools

Mauro Bianco
CSCS, ETH Zurich

Applications on regular grids are still highly relevant in HPC, ranging from weather forecasting and climate modeling to geophysics. The GridTools project will provide a set of C++ libraries to develop efficient, portable (explicit) finite-difference, finite-volume, and finite-element algorithmic motifs on regular grids. The goal of GridTools is to allow computational scientists to provide an architecture-agnostic specification of the computation that will then be performed efficiently on a variety of platforms. This is made practically feasible by restricting GridTools to those algorithmic motifs. The restriction allows the computation graph to be determined by a number of parameters that does not depend on the size of the input, so that an efficient mapping to an actual architecture is possible.
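The separation that GridTools relies on can be illustrated with a minimal, architecture-neutral sketch (plain Python for brevity, not the actual GridTools C++ API; all names here are invented): the point-wise stencil operator is specified independently of the grid traversal, so a backend is free to tile, vectorize, or offload the loop nest.

```python
# Illustrative sketch only: a 5-point Laplacian stencil on a regular grid,
# with the point-wise operator separated from the traversal.

def laplacian(u, i, j):
    # Point-wise stencil operator: depends only on a fixed neighborhood.
    return u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j]

def apply_stencil(op, u):
    # Architecture-agnostic traversal over the interior of the grid.
    # A backend could tile, vectorize, or offload this loop nest.
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = op(u, i, j)
    return out

# u(i, j) = i^2 + j^2 has discrete Laplacian 4 everywhere in the interior.
u = [[float(i * i + j * j) for j in range(5)] for i in range(5)]
lap = apply_stencil(laplacian, u)
```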

Energy vs. time performance of HPC applications on low-power processors

Enrico Calore
University of Ferrara

Energy efficiency has lately become of paramount importance in the HPC field, and high-end processors are quickly evolving towards more advanced power-saving and power-monitoring technologies. On the other hand, low-power processors, designed for the mobile market, are gaining interest in the HPC area thanks to their increasing computing capabilities, in conjunction with their competitive pricing and low power consumption. In this work we explore the energy and computational performance of low-power processors, in the context of HPC, using a Lattice Boltzmann application as a benchmark. Current draw is monitored using custom hardware for different CPU, GPU, and memory clock frequencies. We discuss several energy and performance metrics, interesting per se and also in view of a prospective use of these processors in an HPC context.
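Two of the standard metrics in such studies can be sketched directly from a trace of (time, power) samples as produced by a power monitor: energy-to-solution (the time integral of power) and the energy-delay product. The numbers below are invented for illustration, not measurements from this work.

```python
# Minimal sketch of common energy metrics, computed from a trace of
# (time_seconds, power_watts) samples. Illustrative values only.

def energy_to_solution(samples):
    # Trapezoidal integration of power over time -> energy in joules.
    e = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        e += 0.5 * (p0 + p1) * (t1 - t0)
    return e

def energy_delay_product(samples):
    # EDP = energy * time-to-solution; lower is better.
    t = samples[-1][0] - samples[0][0]
    return energy_to_solution(samples) * t

# A constant 5 W draw for 10 s -> 50 J of energy, EDP = 500 J*s.
trace = [(t, 5.0) for t in range(11)]
```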

Directive-Based Parallel Programming at Exascale? Really?

Barbara Chapman
Stony Brook University and University of Houston

High-end computer architectures continue to evolve in order to provide exceptional levels of application performance and to facilitate the processing of enormous data sets, while meeting the challenge of reducing power utilization. The compute power of individual nodes continues to grow, with increasing core counts, potential heterogeneity and a deep memory hierarchy. Architectural diversity will exacerbate the difficulty of creating parallel applications that are capable of exploiting these platforms.

Despite significant and promising research into novel programming models and approaches, today’s HPC applications are still overwhelmingly programmed using MPI. Increasingly, MPI is coupled with programming directives to better exploit the compute power within individual nodes. It appears increasingly likely that this will remain one of the means available for application development on exascale platforms. We recap the current status of OpenMP and OpenACC and consider how well equipped they are for the next generation of high-end parallel systems.

TRIÉST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size

Lorenzo De Stefani
University of Padova and Brown University

Computing the global number of triangles in a graph and the local number of triangles incident to each vertex are fundamental primitives in graph analysis. When the graphs are huge and dynamic (i.e., edges can be inserted and deleted adversarially), as is the case for many modern networks, obtaining the exact number of triangles is infeasible (due to the size) and has limited value (due to the ever-changing nature of this quantity). Hence, approximations are necessary and sufficient.

We present TRIÉST, a suite of novel one-pass streaming algorithms that, for the first time, use a (user-specified) fixed amount of memory to compute probabilistically-guaranteed, high-quality, unbiased estimations of the global and local number of triangles in a fully-dynamic graph, represented as a stream of edge insertions and deletions. Our method neither requires any global knowledge about the stream (e.g., an upper bound on the total number of edges or triangles), nor, in contrast with previous approaches, does it need any hard-to-guess, not-immediately-interpretable parameter (e.g., a sampling probability). Instead, we devise a novel adaptive reservoir sampling scheme that fully exploits the fixed available memory space at all times.
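The flavor of the approach can be conveyed with a heavily simplified, insertion-only sketch: keep a fixed-size reservoir of edges, count triangles inside the sample as it evolves, and re-scale the sample count into a global estimate. This is illustration only; the actual TRIÉST suite also handles deletions, maintains local (per-vertex) counters, and uses refined estimators with proven guarantees.

```python
import random

class TriangleEstimator:
    # Simplified, insertion-only sketch of a fixed-memory reservoir scheme
    # in the spirit of TRIEST. Not the paper's algorithm: deletions, local
    # counts, and the refined unbiased estimators are omitted.
    def __init__(self, m, seed=0):
        self.m = m                     # fixed number of edges kept
        self.sample = set()            # reservoir of (u, v) edges, u < v
        self.adj = {}                  # adjacency over sampled edges
        self.t = 0                     # edges seen so far in the stream
        self.tau = 0                   # triangles currently in the sample
        self.rng = random.Random(seed)

    def _add(self, u, v):
        # Count sample triangles closed by (u, v), then store the edge.
        self.tau += len(self.adj.get(u, set()) & self.adj.get(v, set()))
        self.sample.add((u, v))
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def _remove(self, u, v):
        self.sample.discard((u, v))
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self.tau -= len(self.adj[u] & self.adj[v])

    def insert(self, u, v):
        u, v = (u, v) if u < v else (v, u)
        self.t += 1
        if len(self.sample) < self.m:
            self._add(u, v)
        elif self.rng.random() < self.m / self.t:
            # Standard reservoir step: evict a uniform sampled edge.
            self._remove(*self.rng.choice(sorted(self.sample)))
            self._add(u, v)

    def estimate(self):
        # Re-scale the in-sample count to a global-count estimate.
        t, m = self.t, self.m
        if t <= m:
            return float(self.tau)     # sample is exact: no scaling
        return self.tau * (t * (t - 1) * (t - 2)) / (m * (m - 1) * (m - 2))
```

When the memory budget exceeds the stream length the reservoir holds the whole graph and the estimate is exact; otherwise it is a scaled sample count.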

The results of our extensive experimental evaluation on very large graphs show that our algorithms outperform current state-of-the-art approaches in accuracy.

Joint work with Alessandro Epasto, Matteo Riondato and Eli Upfal.

Introducing the ECM diagnostic performance model

Jan Eitzinger
Friedrich-Alexander-Universitaet Erlangen-Nuernberg

Getting the best performance out of current hardware systems is becoming more and more difficult. Systems expose a complex heterogeneous topology, and parallelism on multiple levels is the dominating strategy for increasing performance. Most performance-enhancing technologies are explicit with respect to the software: software must make use of them to benefit. The responsibility for exploiting the performance potential is thus placed almost fully on the shoulders of the software developer. Performance modelling is an indispensable tool for a systematic performance engineering effort. It helps to understand observed performance, learn about details of software-hardware interaction, and pinpoint performance bottlenecks. Furthermore, it can help to decide on optimization efforts by estimating the possible benefits. The Execution-Cache-Memory (ECM) performance model is a simple, resource-oriented diagnostic performance model which can be set up with pen and paper. At the moment it is applicable to instruction-throughput- or bandwidth-limited scenarios. This talk will introduce the basic methodology of the model and illustrate its benefits on several examples. An outlook discusses how the model could be adapted to latency-bound situations as well.
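The "pen and paper" nature of the model can be sketched in a few lines: per cache line of work, combine the overlapping in-core cycles, the non-overlapping in-core cycles, and the inter-level transfer times. The cycle counts below are invented, illustrative inputs, not numbers from the talk, and the simple non-overlapping composition shown is only one possible machine assumption.

```python
# Pen-and-paper ECM-style estimate, sketched in code. All cycle counts
# are invented inputs per cache line of work (e.g. 8 doubles).

def ecm_prediction(t_ol, t_nol, transfers):
    # t_ol:  in-core cycles that overlap with data transfers (arithmetic)
    # t_nol: in-core cycles that do not overlap (load/store issue)
    # transfers: cycles per cache line for each inter-level transfer,
    #            assumed here not to overlap with each other.
    return max(t_ol, t_nol + sum(transfers))

# Example: a streaming kernel with data in main memory:
# 4 cycles of arithmetic, 2 cycles load/store, plus L2->L1, L3->L2,
# Mem->L3 transfers of 2, 4, and 8 cycles per cache line.
cycles_per_cl = ecm_prediction(t_ol=4, t_nol=2, transfers=[2, 4, 8])
# With data already in L1 the same kernel costs max(4, 2) = 4 cycles.
```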

Outline of the TCF architecture

Martti Forsell
University of Oulu

As the number of processors (cores) scales up, providing the flexibility of the full MIMD model can become a burden. Namely, writing a unique code for a high but fixed number of hardware cores can make programming tedious, and fetching instructions for all those cores makes the design of the instruction memory system painful. On the other hand, if the workload contains self-similarities with respect to control and address computation, there is a risk of substantial energy waste in typical MIMD machinery. A recently invented solution/optimization, the thick control flow (TCF) programming model, promises to address these problems. The model packs together threads flowing through the same control path and supports an unbounded number (or unbounded thickness) of "fibers" for them. Unfortunately, there has been no known practical execution architecture for it.

In this presentation we outline a CMP architecture for executing TCFs efficiently. The architecture uses most of the techniques/solutions of the REPLICA emulated shared memory CMP, including scalable latency hiding with multithreading, radical synchronization cost reduction via a wave-based synchronization mechanism, and improved low-level parallelism exploitation via chaining of FUs. The related methodology borrows REPLICA's strong PRAM model for sophisticated parallel programming. According to our studies, the TCF model helps to conquer both the programmability and performance concerns related to algorithms with unbounded structures/characteristics, and to address the energy inefficiency of the full MIMD scheme. We also believe that the TCF model provides a very solid starting point for a simple and low-cost task model, something that is typically complex in multithreaded systems. Results of our on-going analysis of the performance of some key algorithmic kernels are presented.

Streams, Photos, Sharing, and Deep Learning

Filippo Gioachin

In this talk, I will start by introducing some of the work my team is doing to support next-generation mobile-ready applications. I will then use one example, Photos, one of the recent successes from my Product Area, to drive the discussion around cognitive computing. I will give several examples of how machine learning is used heavily inside Google to provide a wide variety of world-class products.

Performance and power modeling for energy efficient computing

Adolfy Hoisie

Deriving models of automotive production through data analysis

Thomas Husslein

In this talk we present a stochastic model for processing times of an automotive paint shop, which can be used for strategic decision support. This model is derived directly from the statistical analysis of current production data.

In order to support strategic planning using simulation and planning systems, a model of the production process is required. Due to very limited time and resources, detailed modeling by humans is not possible. In our approach we create a robust model with good accuracy and reduced complexity automatically, by data analysis. The result is a prediction of the systematic behavior of the car bodies within the plant. The automated approach enables us to keep the model up to date at almost no additional cost. We then use the obtained model as input for automated optimization algorithms. The presented approach enables generalized application in many environments.

Joint work with S. Müller, C. Danner, J. Breidbach, M. Seidl and W. Lauf.

Enabling design-space exploration for robot SLAM - for accuracy, performance and energy

Paul Kelly
Imperial College London

SLAM - simultaneous localization and mapping - is a key platform for understanding 3D environments for a huge range of applications, spanning robotics, augmented reality, and beyond. Building a really usable map of the environment requires "dense" methods, currently feasible in real time only on powerful hardware. This talk will introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives. This work is part of a larger research agenda aiming to push the limits of compiler technology up the "food chain", exploring higher-level algorithmic aspects of the design space together with low-level implementation choices, and I will present some preliminary results showing the potential of this idea.

New Directions in BSP Computing

Bill McColl
Huawei Research Center in Paris

BSP now powers parallel computing, big data analytics, graph computing and machine learning at many leading-edge companies. It is used at Google, Facebook, Yahoo, Twitter, LinkedIn, Microsoft, Tencent, and many other web and big data companies for their analysis of big data from massive graphs and networks with billions of nodes and trillions of edges. BSP is also at the heart of new open source projects such as Petuum, for leading-edge machine learning with billions of parameters. Companies such as Teradata are also now looking at BSP to reinvent database and data warehouse architectures to handle major new types of massive data sets. At Huawei Research in Paris we are developing new BSP algorithms and software platforms that will power next-generation applications and services for telecoms and IT.

Cognitive Computing and Trans-Turing Machines

José E. Moreira
IBM Thomas J. Watson Research Center

With the increased academic and commercial interest in machine learning, it is natural to revisit the old question: "Can machines think?". This as-yet-unanswered question can also be formulated as "Is thinking, and consciousness in particular, a computable process (in the Turing sense)?" and generalized to "Is the universe computable?". In this talk we will present a quick summary of conjectures on what lies beyond Turing computation, and even some outlandish, non-realizable machine models that serve as discussion points. Our operational goal is to decide whether there is research activity in this field that can help with practical applications of cognitive computing.

Cognitive Systems: A perspective

Pratap Pattnaik
IBM Thomas J. Watson Research Center

During the past few years, Cognitive Systems have become significant in many different endeavors of society; they touch most of mankind in a meaningful way. They have renewed research in many areas of computer science, such as algorithms, computational models, hardware design (including accelerators), operating systems, services, middleware organization, and data center design, to name a few. All of these have been focus topics at past ScalPerf meetings. This talk will give a very brief introduction to machine learning, a key component of cognitive systems, and will point out the role of linear algebra in it. The talk will also summarize the major progress that has been made in the linear algebra formulation of graph analytics, which is itself a major tool in machine learning.

Joint work with K. Ekanadham, Bill Horn, Joefon Jann, Manoj Kumar, Jose Moreira, Mauricio Serrano and Gabi Tanase.

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

Andrea Pietracaprina
University of Padova

We present a space/time-efficient practical parallel algorithm for approximating the diameter of massive graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. The number of parallel rounds executed by the algorithm is proportional to the unweighted radius of the clusters, and can be significantly smaller than the number of rounds required by the natural SSSP-based approximation algorithm, which is the only practical competitor in the aforementioned computational scenario. We present an extensive experimental analysis testing various design choices, which result in an implementation featuring a good approximation ratio and running times up to one order of magnitude faster than the SSSP-based competitor on a number of very large benchmark graphs.
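The SSSP-based baseline mentioned above can be sketched sequentially in a few lines (the actual competitor runs SSSP in parallel rounds; this is only the underlying idea): one Dijkstra computation from an arbitrary source s of a connected undirected weighted graph yields the eccentricity ecc(s), and by the triangle inequality the diameter D satisfies ecc(s) <= D <= 2*ecc(s).

```python
import heapq

# Sequential sketch of the "natural SSSP-based approximation":
# one Dijkstra run brackets the diameter within a factor of 2.

def dijkstra_eccentricity(graph, s):
    # graph: dict node -> list of (neighbor, weight), weights > 0.
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                     # stale heap entry
        for v, w in graph[u]:
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return max(dist.values())            # eccentricity of s

# Weighted path 0 -(1)- 1 -(2)- 2: true diameter 3.
g = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 2.0)], 2: [(1, 2.0)]}
ecc = dijkstra_eccentricity(g, 1)        # so D lies in [ecc, 2*ecc]
```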

Joint work with Matteo Ceccarello, Geppino Pucci, and Eli Upfal.

The TAO of Parallelism in Algorithms

Keshav Pingali
University of Texas at Austin

I will describe a data-centric foundation for parallel programming called the operator formulation in which algorithms are described in terms of unitary actions on data structures.

This data-centric view of parallel algorithms shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation and partitioning algorithms, graph analytics, and machine learning applications. Binding time considerations provide a unification of parallelization techniques ranging from static parallelization to speculative parallelization.

We have built a system called Galois, based on these ideas, for exploiting amorphous data-parallelism on multicores and GPUs. I will present experimental results from our group as well as from other groups that are using the Galois system.
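The operator formulation can be conveyed with a minimal sequential sketch (Galois executes such operators speculatively and in parallel; this illustration, with invented names, is sequential): a worklist of active elements, a unitary operator applied to each, and activation of neighbors as the only way new work arises.

```python
from collections import deque

# Minimal sequential sketch of the operator formulation: connected-
# component labeling as repeated application of a unitary operator
# ("propagate the smaller label") to active nodes on a worklist.

def label_components(graph):
    # graph: dict node -> iterable of neighbors (undirected).
    label = {v: v for v in graph}       # initial label = own id
    worklist = deque(graph)             # all nodes start active
    while worklist:
        u = worklist.popleft()
        for v in graph[u]:
            if label[u] < label[v]:     # operator on the active node
                label[v] = label[u]
                worklist.append(v)      # activation: v becomes active
    return label

# Two components: {0, 1, 2} and {3, 4}.
g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = label_components(g)
```

Because the operator touches only a small neighborhood, independent active nodes can be processed in parallel whenever their neighborhoods do not overlap, which is exactly the amorphous data-parallelism the talk describes.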

System level evaluation of an in-memory-processing architecture

Dirk Pleiter
Jülich Supercomputing Center

In this talk we will present IBM's Active Memory Cube (AMC) architecture and the outcome of a research project that aimed at assessing the usability and efficiency of this architecture for selected scientific applications. The AMC architecture augments the Hybrid Memory Cube (HMC) architecture with significant processing capabilities. Such processing-in-memory architectures promise increased compute performance at decreased energy costs, as they allow for a significant reduction of off-chip data movement. The goal of this work was a system-level assessment of the potential for efficient exploitation of the AMC architecture by relevant scientific applications. This was achieved by combining cycle-accurate simulations of relevant application kernels with performance models to explore performance for proxy-architectures comprising AMC modules.

Fully Dynamic Betweenness Centrality

Vijaya Ramachandran
UT Austin

We present fully dynamic algorithms for maintaining betweenness centrality (BC) of vertices in a directed graph G=(V,E) with positive edge weights. BC is a widely used parameter in the analysis of large complex networks. We achieve an amortized O((ν*)^2 log^3 n) time per update with our basic algorithm, and O((ν*)^2 log^2 n) time with a more complex algorithm, where n = |V| and ν* bounds the number of distinct edges that lie on shortest paths through any single vertex. For recomputing BC values after updates to edges incident on a single vertex in a dense graph with compact shortest-path dags, these algorithms give almost a factor of n improvement over the static algorithm of Brandes [Brandes 2001].

Our basic algorithm contains within it a method and analysis, different from Demetrescu and Italiano [DI 2004], for obtaining fully dynamic APSP from a decremental algorithm. Further, for graphs with ν* = O(n), our algorithms match the fully dynamic APSP bounds of Demetrescu and Italiano [DI 2004] and Thorup [Thorup 2004] for unique shortest paths.
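For context, the static baseline [Brandes 2001] that the dynamic algorithms are measured against can be sketched as follows, here for unweighted directed graphs (the talk's setting has positive edge weights, which would replace the BFS with Dijkstra):

```python
from collections import deque

# Sketch of Brandes' static betweenness-centrality algorithm for an
# unweighted directed graph given as dict node -> list of successors.

def betweenness(graph):
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s: shortest-path counts (sigma) and predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in graph}; sigma[s] = 1
        preds = {v: [] for v in graph}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft(); order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1; queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]; preds[v].append(u)
        # Dependency accumulation in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for v in reversed(order):
            for u in preds[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                bc[v] += delta[v]
    return bc

# Directed path 0 -> 1 -> 2 -> 3: vertices 1 and 2 each lie on two
# shortest paths between other vertex pairs.
path = {0: [1], 1: [2], 2: [3], 3: []}
```

The dynamic algorithms avoid re-running this whole computation after each edge update, which is where the factor-of-n improvement comes from.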

Joint work with Matteo Pontecorvi.

Data Set Dependence of the Performance of Parallel Sparse Matrix Computations

Saday Sadayappan
Ohio State University

The performance of sparse matrix computations can vary widely depending on the non-zero structure of the matrix. The use of multiple algorithmic variants and/or multiple data representations may be necessary in order to achieve good performance across a range of sparse matrices arising from different application domains. This talk will present some experimental data using hundreds of matrices from the Florida collection, on the variance of performance of sparse matrix-vector multiplication on GPUs. An attempt to develop simple predictive models to choose among alternative data structure representations will also be discussed.
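The kernel in question, and why its performance is data-set dependent, can be sketched with the standard CSR (compressed sparse row) representation: the inner-loop trip count varies per row, so a very irregular row-length distribution causes load imbalance on GPUs and penalizes padded formats such as ELL that pad every row to the longest one.

```python
# Minimal CSR sparse matrix-vector multiply (y = A @ x).

def csr_spmv(row_ptr, col_idx, vals, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Trip count row_ptr[r+1] - row_ptr[r] varies per row: the
        # source of the data-set-dependent performance behavior.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# 3x3 matrix [[2, 0, 1], [0, 3, 0], [0, 0, 4]] in CSR form:
row_ptr = [0, 2, 3, 4]
col_idx = [0, 2, 1, 2]
vals    = [2.0, 1.0, 3.0, 4.0]
y = csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
```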

Minimizing Writes with Non-Volatile Memories

Oded Schwartz
Hebrew U.

Communication, i.e., moving data, either between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. There has been much recent work designing algorithms that minimize communication, when possible attaining lower bounds on the total number of reads and writes. However, most of this previous work does not distinguish between the costs of reads and writes. Writes can be much more expensive than reads in some current and emerging technologies. The first example is nonvolatile memory, such as Flash and Phase Change Memory. Second, in cloud computing frameworks like MapReduce, Hadoop, and Spark, intermediate results may be written to disk for reliability purposes, whereas read-only data may be kept in DRAM. Third, in a shared memory environment, writes may result in more coherency traffic over a shared bus than reads.

This motivates us to first ask whether there are lower bounds on the number of writes that certain algorithms must perform, and when these bounds are asymptotically smaller than bounds on the sum of reads and writes together. When these smaller lower bounds exist, we then ask when they are attainable; we call such algorithms "write-avoiding (WA)", to distinguish them from "communication-avoiding (CA)" algorithms, which only minimize the sum of reads and writes. We identify a number of cases in linear algebra and direct N-body methods where known CA algorithms are also WA (some are and some aren't). We also identify classes of algorithms, including Strassen's matrix multiplication, Cooley-Tukey FFT, and cache-oblivious algorithms for classical linear algebra, where a WA algorithm cannot exist: the number of writes is unavoidably high, within a constant factor of the total number of reads and writes. We explore the interaction of WA algorithms with cache replacement policies and argue that the Least Recently Used (LRU) policy works well with the WA algorithms in this paper. We provide empirical hardware counter measurements from Intel's Nehalem-EX microarchitecture to validate our theory. In the parallel case, for classical linear algebra, we show that it is impossible to attain lower bounds both on interprocessor communication and on writes to local memory, but either one is attainable by itself.
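The distinction the talk draws can be made concrete with a toy LRU simulator that counts the two costs separately: reads from slow memory (misses) and writes back to slow memory (dirty evictions). This is an illustrative sketch only; the talk's analysis concerns asymptotic lower bounds, not this simulator.

```python
from collections import OrderedDict

# Toy write-back LRU cache simulator separating slow-memory reads
# (misses) from slow-memory writes (dirty evictions).

def simulate_lru(trace, capacity):
    # trace: list of (address, is_write); returns (mem_reads, mem_writes).
    cache = OrderedDict()              # address -> dirty bit, LRU order
    reads = writes = 0
    for addr, is_write in trace:
        if addr in cache:
            dirty = cache.pop(addr)    # hit: refresh LRU position
        else:
            reads += 1                 # miss: fetch from slow memory
            dirty = False
            if len(cache) >= capacity:
                _, evicted_dirty = cache.popitem(last=False)
                if evicted_dirty:
                    writes += 1        # dirty eviction: write-back
        cache[addr] = dirty or is_write
    return reads, writes

# One write to address 0, then enough reads to evict it: 4 fetches
# from memory but only 1 write-back.
counts = simulate_lru([(0, True), (1, False), (2, False), (0, False)], 2)
```

A write-avoiding algorithm is one whose access pattern keeps the second counter asymptotically smaller than the first; a communication-avoiding one only bounds their sum.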

Based on joint work with Erin Carson, James Demmel, Laura Grigori, Nicholas Knight, Penporn Koanantakool, and Harsha Vardhan Simhadri. Tech-report available at:


Runtime System Requirements, Challenges, and Accomplishments for Exascale

Thomas Sterling
Indiana University

Significant progress has been made by the HPC community in advancing the state of the art of runtime system software, yielding substantial improvements in operational efficiency and scalability, at least for classes of dynamic adaptive computational algorithms. These include, but are not limited to, a wide range of adaptive mesh refinement (AMR) algorithms, wavelets, and fast multipole methods (FMM), among others. In a context of real-world multi-scale multi-physics problems sustaining single-digit efficiencies, the runtime is recognized as a possible means of exploiting continuous system state information to guide task scheduling and resource allocation. With complex data structures defined by meta-data, parallelism discovery through the use of runtimes can increase scalability. But challenges, potentially severe, imposed by added overhead may limit or in some cases eliminate any advantages. This presentation will describe and discuss recent results that expose both opportunities and challenges using the current release of the HPX-5 version 1.2 runtime system software. Also discussed will be implications for improved user productivity made possible by runtime support. Included will be working examples of applications of the types above.

Simulation and optimization of electric power devices: A practical example for high performance scientific computing

Carsten Trinitis

When designing insulation systems in high voltage engineering, knowledge of the electric field distribution is a very important aspect. Traditionally, experimental methods were used to determine this field, e.g. an electrolytic tank or semi-conductive paper. Today these methods have been replaced by numerical simulation methods such as charge simulation, finite element methods, or boundary element methods. In order to realistically simulate and optimise three-dimensional complex geometries, the simulation code, which had mainly been developed with sequential execution in mind, must be adapted to high performance computing architectures accordingly. In an industrial environment, a key priority has always been cost effectiveness, which also affects research and development. As a consequence, software that has been used for many years is only changed when absolutely necessary, e.g. to incorporate additional functionality or to fix bugs. With the free lunch being over, comprehensive code adaptations with regard to both single-thread performance and scalability had to be carried out in the area of electric field simulation as well. The talk will give an overview of various cooperation projects between TU Munich and Asea Brown Boveri (ABB)'s corporate research centre in Baden-Dättwil, Switzerland. Topics covered are code optimisation and the investigation of alternative simulation methods, as well as the automatic optimisation of real-world examples from ABB.

The Road to Exascale (Bataan Death March or Road to Nirvana?)

Henry Tufo
U. Colorado Boulder

We discuss the U.S. Exascale Program and its impact on the HPC ecosystem.

An Interconnection Mode Configurable Data-flow processor

Lorenzo Verdoscia
CNR - ICAR Napoli

As computer models of scientific phenomena have increased in scale, compute power, typically in the form of FLOPS, has increased exponentially, and the requirement for increasing FLOPS is not likely to slacken in the foreseeable future. However, the nature of the workloads to which these systems are applied is rapidly evolving. Even today, the performance of many complex simulations is dominated less by the performance of floating-point operations than by memory and integer operations. Moreover, the nature of the problems of greatest security, industrial and scientific interest is becoming increasingly data-driven. Indeed, in the future, HPC computers should be able to (1) quantify the uncertainty associated with the behavior of complex systems-of-systems (e.g. hurricanes, nuclear disasters, seismic exploration, engineering design) and thereby predict outcomes (e.g. the impact of intervention actions, business implications of design choices); (2) learn and refine underlying models based on constant monitoring and past outcomes; and (3) provide real-time interactive visualization and accommodate "what if" questions in real time, as required for cognitive computing.

In this scenario, together with other non-von-Neumann options like quantum computing and neuromorphic (brain-like) computing, configurable computing has become the subject of a great deal of research, given its potential to greatly accelerate a wide variety of applications that need high throughput. As a possible way to accelerate applications and increase throughput, the principles on which dataflow is based still appear promising for sustaining cognitive computing, given the natural vocation of dataflow execution for data-driven requirements. Even though we do not claim to have a definite answer for either cognitive or HPC computing, the talk presents a dataflow processor based on the Interconnection Mode Configurable Machine.

Dynamic code generation for HPC

Josef Weidendorfer

There are cases in HPC applications where code generation at runtime can provide significant benefits. However, this is not well supported by today's HPC programming environments. Different approaches exist (especially from the compiler research community), but they are often found difficult to use. In this talk, we will present two example applications where we researched the benefits of dynamic code generation, showing the effort needed to make it work. The talk finishes with first results from a proposal for a simple low-level interface for code generation at runtime. We believe it to be programmer-friendly and flexible enough for wider adoption.