Abstracts


Twelve ways to fool the masses: fast forward to 2013

David Bailey
Lawrence Berkeley National Laboratories

In 1991, the present author wrote a humorous essay, “Twelve Ways to Fool the Masses”, which outlined a number of highly questionable “techniques” often used at the time by researchers in the high-performance computing field to inflate their performance results, or to otherwise obfuscate issues and difficulties. This was followed by an article, published in 1993, that gave specific examples of many of these abuses from published papers. This talk will review those 12 “techniques” and mention some new “techniques” now emerging in the world of very highly parallel computing and graphics processing units (GPUs). The talk will also explain why vigilance is so important, by researchers as well as by reviewers and editors, particularly in light of the ongoing effort to promote better “reproducibility” in the field of high-performance computing.


Thermal challenges in large-scale parallel computing systems

Luca Benini
University of Bologna & ETH

We are in the middle of a temperature crisis in digital systems design: from mobile SoCs to supercomputers, thermal design power (TDP) constraints are increasingly hard to meet. Achieving exascale at a reasonable cost means we cannot embrace dark silicon, but moving to near-threshold “dim silicon” significantly increases temperature sensitivity and requires closed-loop temperature control coupled with active cooling. This talk will give insights on how we can deal with temperature and thermal issues through a well-balanced mix of technology, circuits, architecture and system-level design, without forgetting energy proportionality and maximum GOPS/mW.


A Lower Bound Technique for Communication on BSP with Application to the FFT

Gianfranco Bilardi
University of Padova

Communication complexity is defined, within the Bulk Synchronous Parallel (BSP) model of computation, as the sum of the degrees of all the supersteps. A lower bound on the communication complexity is derived for a given class of DAG computations in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields a novel and tight lower bound for the FFT graph.
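To give the flavour of the result, the tight FFT bound obtained this way matches the cost of the classic BSP FFT algorithm. The following is a hedged sketch with standard BSP notation assumed (n inputs, p processors); the exact statement, constants and side conditions are those of the talk, not reproduced here:

```latex
% If a DAG, viewed as a switching network, can realize \gamma permutations
% (its switching potential), then roughly \log\gamma units of routing work
% must be performed, bounding the BSP communication complexity from below.
% For the n-input FFT DAG on p processors this yields the tight bound
\[
  C_{\mathrm{FFT}}(n,p) \;=\; \Theta\!\left( \frac{n}{p}\cdot\frac{\log n}{\log (n/p)} \right),
\]
% matched by the standard BSP FFT algorithm, which performs
% \log n / \log(n/p) supersteps, each of communication degree n/p.
```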

Joint work with Michele Scquizzato and Francesco Silvestri


OpenStack - Technical deep dive

Jose Castanos
IBM Research

At IBM Research we are building an OpenStack lab to support systems research and our SDE goals. We have been using fairly advanced components of OpenStack on a variety of heterogeneous hardware (x86, POWER, GPUs, OpenFlow switches, local and remote Cinder volumes, SSDs and HDDs, etc.). In this talk we provide an overview of these components, describe the current features and limitations that determined our selection process, and present our experience building and managing a lab with state-of-the-art open-source components.


A uniform approach for programming distributed heterogeneous systems

Biagio Cosenza
University of Innsbruck

Large-scale clusters of heterogeneous nodes equipped with multicore CPUs and GPUs are becoming increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms, making application development very challenging.

This talk will introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, a transparent abstraction of the underlying distributed architecture that offers advanced features such as inter-context and inter-node device synchronization. Its runtime system tracks the dependency information enforced by event synchronization to dynamically build a DAG of commands, to which it automatically applies two optimizations: collective communication pattern detection and device-host-device copy removal. libWater also integrates the Insieme OpenCL compiler, which can perform problem-size-sensitive task partitioning on a single heterogeneous node.

The talk will also answer typical questions about workload and performance on heterogeneous large-scale compute clusters, showing results obtained on the Vienna Scientific Cluster and the Barcelona Supercomputing Center.


Preview of the REPLICA architecture and methodology

Martti Forsell
VTT Electronics

There is significant evidence that current CMP architectures are tedious to program and often provide poor speedup compared to conventional sequential (single-core) processors. The main reasons for this are the lack of fast synchronization, weakly scalable latency-hiding mechanisms and inefficient integration of TLP and ILP. In this presentation we take a preliminary look at our REPLICA CMP architecture and methodology, which aim to address these problems. REPLICA provides a synchronization mechanism whose cost is several orders of magnitude lower than in current CMPs, a latency-toleration mechanism that scales up to hundreds or thousands of processors, and a virtual ILP exploitation scheme that allows even fully dependent code to be executed like low-level ILP. Together these make it possible to offer the programmer a strong and easy-to-program PRAM model of computation, with access to the related well-developed theory of parallel algorithms. For workloads with limited parallelism, REPLICA allows multiple threads to be bunched together to mimic the NUMA execution scheme. The first four-core FPGA prototype and a C-like parallel language/compiler toolset are currently under testing, and we expect the first results to be available by ScalPerf’13.


Affordable 100PFlops systems with Multiscale Dataflow

Georgi N Gaydadjiev
Chalmers University

With exascale computing plans in the USA postponed to the second half of 2020, we need to accelerate the deployment of 100PF systems to meet the expectations of the HPC user community. Multiscale Dataflow by Maxeler Technologies provides affordable systems of 100PF and beyond that do not suffer from the severe energy constraints currently limiting traditional computing systems. A 100PF Multiscale Dataflow system will consume significantly less energy and occupy only a fraction of the floor space currently required by Top500 supercomputers. These advantages bring with them a drastic change in the programming paradigm, which is slowing down adoption in traditionally conservative domains such as defense. In addition, the application-specific performance and power advantages require comparing systems by fixing datacenter space first and then comparing total cost of ownership, which is not yet a widely accepted methodology for ranking computer systems.

In this talk we present the Maxeler Multiscale Dataflow technology, explain our proposed comparison method based on total cost of ownership, and give a brief overview of the significant speedups already achieved in several HPC domains. The key is to allow Multiscale Dataflow programmers to easily exploit all benefits of this technology at all levels of abstraction, from algorithms, math and models down to arithmetic structures, data orchestration and individual logic gates.


A peek at some data analytics problems in industry

Manish Gupta
Xerox


Advanced scientific computing on Intel Xeon Phi

Alexander Heinecke
University of Munich

We will focus on the recently released Intel Xeon Phi coprocessor, the first device to implement Intel’s Many Integrated Core Architecture (Intel MIC Architecture). Compute clusters equipped with Intel Xeon Phi coprocessors can be used in various ways; usage is not limited to regarding the coprocessor as an add-in device suitable only for offloading highly parallel sub-parts of applications. Such a cluster offers basically five different usage models: (1) CPU-hosted: a classic CPU cluster without any coprocessor usage; (2) offload to MIC: the “GPU” model, in which highly parallel kernels are offloaded to the Intel Xeon Phi coprocessor; (3) symmetric: MPI tasks are started on both devices (CPU and MIC) and the communication between the host and the coprocessor is transparent thanks to the message-passing programming paradigm, although sophisticated load balancing may be required; (4) reverse offload to CPU: if an application is in general highly parallel but suffers from small and rare sequential parts, these parts can be “offloaded” back to the CPU; and (5) MIC-hosted: suitable for highly parallel, well-scaling codes, where the CPUs are no longer used as the application runs entirely on the coprocessor boards.

We will present these usage models for a set of benchmarks and scientific applications. In addition, we will discuss important code modifications, such as vectorization and highly scalable shared- and distributed-memory parallelization on both the CPU and the Xeon Phi, which pave the road to high performance on an Intel Xeon Phi accelerated compute cluster. The selected applications are HPL, molecular dynamics simulation and data mining.


Optimization algorithms in the automotive industry

Thomas Husslein
Optware

The availability of reliable data has opened up the opportunity to apply large optimization models in a consistent and regular way within the automotive industry. Optimization systems are applied to problems spanning whole supply chains and large companies, sometimes involving millions of variables. Presenting two applications from the automotive industry, we survey the state of the art and derive the challenges that optimization poses for high-performance computing systems.


Amdahl’s law for predicting the future of multicores considered harmful

Ben Juurlink
TU Berlin

Several recent works predict the future of multicore systems or identify scalability bottlenecks based on Amdahl’s law. Amdahl’s law implicitly assumes, however, that the problem size stays constant, whereas in most cases more cores are used to solve larger and more complex problems. A related law, known as Gustafson’s law, assumes that the runtime, not the problem size, is constant. In other words, it assumes that the runtime on p cores is the same as the runtime on one core and that the parallel part of an application scales linearly with the number of cores. We apply Gustafson’s law to symmetric, asymmetric and dynamic multicores and show that this leads to fundamentally different results than applying Amdahl’s law. We also generalize Amdahl’s and Gustafson’s laws and study how this quantitatively affects the dimensioning of future multicore systems.
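The contrast between the two laws for a symmetric multicore can be made concrete with the standard textbook formulas (our illustration; the talk's generalizations and the asymmetric/dynamic cases are not reproduced here):

```python
def amdahl_speedup(f, p):
    """Fixed problem size: with parallel fraction f on p cores,
    speedup is capped at 1/(1-f) no matter how large p grows."""
    return 1.0 / ((1.0 - f) + f / p)

def gustafson_speedup(f, p):
    """Fixed runtime: the parallel part grows with p, so the scaled
    speedup grows linearly in the number of cores."""
    return (1.0 - f) + f * p

# With a 95% parallel fraction, Amdahl's speedup saturates below
# 1/(1-0.95) = 20, while Gustafson's keeps growing with p:
table = [(p, amdahl_speedup(0.95, p), gustafson_speedup(0.95, p))
         for p in (16, 256, 4096)]
```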


Watching the data flow: streaming and re-use in unstructured-mesh CFD

Paul Kelly
Imperial College London

Unstructured-mesh CFD is a rich domain at the frontier of research in program optimization for locality and parallelism. An unstructured mesh is essentially a graph spanning a two- or three-dimensional space. A key idea is to generate code that analyses the mesh at runtime to determine the partitioning and scheduling of the computation on it; such "inspector-executor" schemes rely on the mesh connectivity not changing, or changing infrequently. This talk will review some of our recent work on inspector-executor strategies. I will show how to structure the dataflow of a parallel loop over an unstructured mesh so that it essentially reaches peak streaming DRAM performance, despite being dominated by pointer-based data accesses. I will also show how to apply loop fusion and tiling across a chain of such loops, with dependencies arising through such pointers, to beat this limit.
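The inspector-executor pattern itself can be sketched in a few lines. The function names and the greedy edge-colouring scheme below are our own illustrative assumptions, not Imperial's code: the inspector analyses the mesh connectivity once, and the executor then runs the (conflict-free) scheduled loop every timestep.

```python
def inspect_mesh(edges):
    """Inspector: run once per mesh.  Greedily colour edges so that no
    two edges in a colour class share a node; each class can then be
    executed in parallel without conflicting indirect writes."""
    colours, node_used = [], []
    for eid, (u, v) in enumerate(edges):
        for c, used in enumerate(node_used):
            if u not in used and v not in used:
                colours[c].append(eid)
                used.update((u, v))
                break
        else:
            colours.append([eid])
            node_used.append({u, v})
    return colours

def execute(edges, colours, node_data):
    """Executor: run every timestep.  Pointer-based gather/scatter per
    edge; edges within one colour class are mutually independent."""
    for cls in colours:
        for eid in cls:                      # safe to parallelise per class
            u, v = edges[eid]
            flux = 0.5 * (node_data[u] - node_data[v])
            node_data[u] -= flux
            node_data[v] += flux
    return node_data
```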


Open-System, adiabatic quantum annealing at the USC - Lockheed Martin Quantum Computing Center

Bob Lucas
USC/ISI

With the end of Dennard scaling, and soon of Moore’s Law as well, those seeking ever-increasing computational capability need to start considering alternatives to large-scale clusters of commercial microprocessors. Towards that end, the USC - Lockheed Martin Quantum Computing Center has taken delivery of two generations of D-Wave open-system, adiabatic quantum annealers. In this talk, we will report on our experience assessing the quantum mechanical behavior of these devices and benchmarking their performance. We will present experimental results that strongly indicate that quantum annealing is indeed being performed by the D-Wave devices, and show how the device performs compared to classical computers on a particular optimization problem. We will also discuss ongoing work to find applications where such a device can uniquely augment our ability to solve them.


Jülich - still - on the way to exascale

Bernd Mohr
Jülich Supercomputing Center

This presentation will give an update on the ScalPerf’10 talk “Jülich on the Way to Exascale”. It will summarize experiences in application and tool scaling and report on Jülich’s involvement in exascale activities and projects, comparing the latest achievements and results with the goals and objectives of 2010.

First, we will report on the status of the “High-Q Club”, a showcase for codes that can utilize the entire 28-rack Blue Gene/Q machine at JSC, which translates to 458,752 cores or 1,835,008 threads. Next, we will report on scaling experiments with parallel debugging and performance analysis tools. Finally, we present the various exascale-related activities and projects of JSC: the International Exascale Software Program (IESP), the European Exascale Software Initiative (EESI), the Exascale Innovation Centre (EIC) and the ExaCluster Laboratory (ECL), as well as Jülich’s involvement in the EU FP7 exascale projects DEEP, DEEP-ER and Mont-Blanc, and in the Human Brain Project.


Multicore energy efficiency by vectorization? Results and challenges

Lasse Natvig
NTNU

It is not clear which architecture will dominate future microprocessor chips, since different characteristics are needed for optimal performance in different market segments. The presentation starts by outlining four main computing segments: mobile devices, desktop computers, servers and supercomputers/HPC, discussing the main performance-critical architectural elements for each workload group. Energy efficiency is of central importance for both the smallest and the largest computers, and we present results from recent studies at NTNU demonstrating the energy efficiency gained by vectorization on both ARM and Intel multicore processors. Heterogeneous processors are expected to become increasingly important in the future, despite the added programming complexity. The presentation ends by sketching our ongoing research and future plans in this area, covering existing state-of-the-art heterogeneous multicores, our heterogeneous research prototype SHMAC, and our ambitions within Horizon 2020.


Parallel graph analytics in the Galois system

Donald Nguyen
University of Texas at Austin

Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this talk, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation in the Galois system.

We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.
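Point (iii), application-specific scheduling, can be made concrete with a tiny sequential model of a priority-ordered worklist. This is plain Dijkstra-style SSSP in Python, our own illustration rather than the Galois C++ API: ordering the worklist by distance is exactly the kind of policy that pays off on high-diameter graphs such as road networks.

```python
import heapq

def sssp(graph, source):
    """Single-source shortest paths with a priority-ordered worklist.
    graph: {node: [(neighbour, weight), ...]}"""
    dist = {source: 0}
    worklist = [(0, source)]            # tasks ordered by tentative distance
    while worklist:
        d, u = heapq.heappop(worklist)
        if d > dist.get(u, float("inf")):
            continue                    # stale task: a better path was found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(worklist, (nd, v))
    return dist
```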

(Joint work with Keshav Pingali)


Processor micro-architecture: from the perspective of a practitioner

Pratap Pattnaik
IBM Research

This talk, continuing points I have presented at past ScalPerf meetings, will give a perspective on processor micro-architecture evolution over the past few decades. It will reflect on how the architecture has responded to changes in VLSI technology capabilities and to workload evolution. It will particularly discuss the explicit impacts on the micro-architecture of the large-scale augmentation of workloads with social media and analytics-related computing.


Space-efficient parallel algorithms for combinatorial search problems

Andrea Pietracaprina
University of Padova

We present space-efficient parallel strategies for two fundamental combinatorial search problems, namely backtrack search and branch-and-bound, both involving the visit of an n-node tree of height h under the assumption that a node can be accessed only through its father or its children. For both problems we propose efficient algorithms that run on a distributed-memory machine with p processors. For backtrack search, we give a deterministic algorithm running in O(n/p + h log p) time, and a Las Vegas algorithm requiring optimal O(n/p + h) time, with high probability. Building on the backtrack search algorithm, we also derive a Las Vegas algorithm for branch-and-bound which runs in O((n/p + h log p log n) h log n) time, with high probability. A remarkable feature of our algorithms is the use of only constant space per processor, which constitutes a significant improvement upon previously known algorithms, whose space requirements per processor depend on the (possibly huge) tree to be explored.
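The constant-space claim suggests a stackless tree walk: a visit that uses only the father/children access primitives and O(1) extra state. The sequential sketch below is our own illustration of that idea, not the parallel algorithms of the talk.

```python
def traverse(root, first_child, next_sibling, father, visit):
    """Depth-first visit of the tree below root using O(1) state:
    no stack, only the access primitives named in the abstract."""
    node = root
    while True:
        visit(node)
        child = first_child(node)
        if child is not None:
            node = child                             # descend
            continue
        # leaf reached: climb until a node with an unvisited sibling
        while node != root and next_sibling(node) is None:
            node = father(node)
        if node == root:
            return
        node = next_sibling(node)

# toy tree r -> (a, b), a -> (c,), encoded with explicit links
kids = {"r": ["a", "b"], "a": ["c"]}
par = {"a": "r", "b": "r", "c": "a"}

def fc(n): return (kids.get(n) or [None])[0]
def fa(n): return par[n]
def ns(n):
    sibs = kids[par[n]]
    i = sibs.index(n)
    return sibs[i + 1] if i + 1 < len(sibs) else None

order = []
traverse("r", fc, ns, fa, order.append)   # preorder: r, a, c, b
```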

(Joint work with Geppino Pucci, Francesco Silvestri, and Fabio Vandin)


Realizing active storage concepts for today’s and future HPC systems

Dirk Pleiter
Jülich Supercomputing Center

With computing performance improving at a faster pace than the performance of the I/O subsystem, the adoption of new storage concepts will be necessary on the path towards exascale systems. Both architectural and technological opportunities have to be explored. In this talk we consider, on the one hand, the architectural concept of active storage and, on the other hand, advances in non-volatile memory technologies. Specifically, we report on how both are used in the Blue Gene Active Storage (BGAS) architecture. We give an overview of the architecture and explore usage concepts. Finally, we discuss how this design can be extended to future architectures.


Efficient resource-oblivious algorithms for multicores

Vijaya Ramachandran
University of Texas at Austin

We consider cost measures and algorithmic strategies that are geared to current parallel multicore environments. We view the multicore as consisting of multiple cores, each having a private cache, which communicate through shared memory, with data organized in blocks of words. It has a run-time scheduler that schedules parallel tasks that are generated by the parallel algorithm being executed.

We introduce the notion of a resource-oblivious parallel algorithm: one designed without reference to machine parameters such as the number of cores, cache and block sizes, or the run-time scheduler. We characterize a class of multithreaded resource-oblivious algorithms for which we establish low cache-miss and low false-sharing overheads on multicores with a wide range of machine parameters, as long as the run-time scheduler is efficient in terms of the number of parallel tasks it schedules across the cores. This class includes algorithms with high levels of parallelism for matrix computations, FFT, sorting, list ranking and other basic problems.
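The flavour of resource-obliviousness can be illustrated (our own sketch, not the algorithms of the papers below) by a recursive divide-and-conquer matrix transpose: no cache-size, block-size or core-count parameter appears anywhere, yet each split yields independent subtasks for the run-time scheduler and good locality at every scale.

```python
def transpose(A, B, r0, r1, c0, c1):
    """Write the transpose of A[r0:r1, c0:c1] into B, recursively.
    No machine parameter (cache size, block size, core count) appears."""
    if (r1 - r0) * (c1 - c0) <= 16:          # small constant base case
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif r1 - r0 >= c1 - c0:                 # split the longer dimension;
        m = (r0 + r1) // 2                   # the two halves are independent
        transpose(A, B, r0, m, c0, c1)       # tasks a scheduler may run
        transpose(A, B, m, r1, c0, c1)       # in parallel
    else:
        m = (c0 + c1) // 2
        transpose(A, B, r0, r1, c0, m)
        transpose(A, B, r0, r1, m, c1)
```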

  • R. Cole and V. Ramachandran, Efficient resource oblivious algorithms for multicores with false sharing, IPDPS 2012.
  • R. Cole and V. Ramachandran, Resource-oblivious sorting on multicores, Proc. International Colloquium of Automata, Languages and Programming (ICALP), Track A, Springer LNCS Volume 6198, pp. 226-237, 2010.


Architecture and applications of software defined environments

Radha Ratnaparkhi
IBM Research

IBM Research is making a significant investment in Software Defined Environments (SDE) to guide the design of the next generation of cloud infrastructures. As cloud adoption by enterprise customers accelerates, we observe a shift from cost reduction to agility and optimization, with more volatile workloads. The industry is responding to these challenges with programmable control planes and infrastructures that allow for dynamic optimization in response to these dynamic business requirements.

SDE encompasses a complete virtualization layer that incorporates Software Defined Compute (SDC), Storage (SDS) and Network (SDN) in an integrated solution, and provides an abstraction of workloads, services and infrastructure. This talk will discuss current industry trends around the architecture of Software Defined Environments at IBM and related efforts in the open community. A few typical use-case scenarios of SDE will be shared and analyzed.


Data access complexity: revisiting the red-blue pebble game

Saday Sadayappan
Ohio State University

Technology trends will cause data movement to account for the majority of the energy expenditure as well as execution time on emerging/future computers. Therefore the computational complexity of algorithms will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of the data access complexity of algorithms will be increasingly important.

In this talk we revisit the problem of data access complexity (also called I/O complexity) in a two-level memory hierarchy, first addressed by the seminal work of Hong and Kung using the formalism of the red/blue pebble game. We improve on prior work in several ways: 1) enabling the development of I/O lower bounds for composite computations from the lower bounds of constituent sub-computations; 2) developing a graph min-cut based bounding strategy complementary to Hong and Kung’s S-partitioning approach; and 3) developing an automated approach to generating I/O lower bounds for arbitrary, possibly irregular, computational directed acyclic graphs.
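As a hedged reminder of the framework being revisited, the standard form of the S-partitioning bound from the literature is sketched below (notation assumed; this is not claimed to be the talk's exact formulation):

```latex
% Red/blue pebble game: red pebbles model the S words of fast memory,
% blue pebbles model slow memory; an I/O operation recolours a pebble.
% S-partitioning bound: if every 2S-partition of the CDAG has at least
% P(2S) parts, then any complete execution incurs I/O volume
\[
  Q \;\ge\; S \cdot \bigl( P(2S) - 1 \bigr).
\]
% Classic corollaries: Q = \Omega\!\left(n^{3}/\sqrt{S}\right) for
% n x n matrix multiplication, and Q = \Omega\!\left(n \log n / \log S\right)
% for the n-point FFT.
```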


On the effectiveness of application-aware self-management for scientific discovery in large-scale distributed systems

Michela Taufer
University of Delaware

An important challenge faced by high-throughput, multi-scale applications is that human intervention plays a central role in driving their success. However, manual intervention is inefficient, error-prone and wasteful of resources. This talk presents an application-aware modular framework that provides self-management for computational multi-scale applications in large-scale distributed systems. Our framework consists of a learning engine and three modules that can be easily adapted to different distributed systems, including volunteer computing, desktop grids and cloud computing. The learning engine is based on our novel tree-like structure called KOTree, a fully automatic method that organizes statistical information in a multidimensional structure that can be efficiently searched and updated at runtime. Our empirical evaluation on a group of diverse volunteer computing projects shows that the framework can effectively provide application-aware self-management. Additionally, we observed that the KOTree algorithm is able to accurately predict the expected length of new jobs, resulting in an average throughput increase of 85% with respect to other algorithms.


Who is this petaflop anyway? Strong scaling matters!

Jan Treibig
University of Erlangen

The race to new scales pushes the HPC community to new frontiers: petascale is followed by exascale. Still, most of this performance is achieved by replicating parallel resources. Unfortunately, many real production problems are bound by strong-scaling limitations; examples are problems with global state or with time propagation. These problems also have very high performance demands, but due to their strong-scaling properties they cannot even make full use of one modern compute node. This talk will discuss the issues that limit strong scaling on modern compute platforms. The current state and possible improvements are presented using several examples.


Tera-Peta-Exa-Zetta-Yotta-... : Driven by hardware vendors or driven by applications?

Carsten Trinitis
University of Munich

With each new generation of supercomputers, the question arises whether there are indeed applications able to fully exploit the new machine's potential in terms of optimal sequential performance, scaling, use of accelerators, etc. Despite the many ideas for increasing an architecture's peak performance (SIMD, many-core, accelerators, interconnect), scientists always lag behind in getting close to peak performance on top supercomputers. On the other hand, the demand for compute power seems to be unlimited for certain applications, yet these have been designed as "traditional workloads". Thus the question arises whether it makes sense to consider "exotic" approaches, such as dataflow-driven or FPGA-based architectures, which in turn require significant rethinking and reimplementation of workloads and algorithms.


MapReduce algorithmics

Sergei Vassilvitskii
Google

The MapReduce paradigm is a de facto standard for large-scale data analysis. In this talk we focus on the challenge of designing efficient algorithms in this setting. We introduce a model of computation for the system, and show simple and scalable algorithms for many graph problems (connectivity, triangle counting and densest subgraph, among others).
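To give a taste of the algorithmic style, here is a toy in-memory rendition of triangle counting in the node-iterator flavour, written as a "map" phase (wedge generation) and a "reduce" phase (wedge closing). This is a simplified illustration of the paradigm, not the talk's algorithms:

```python
from collections import defaultdict

def count_triangles(edge_list):
    """Count triangles: map emits, per node, the wedges (neighbour
    pairs) it centres; reduce checks each wedge against the edge set."""
    edges = {frozenset(e) for e in edge_list}
    adj = defaultdict(set)
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    # "map": each node emits the wedges it centres
    wedges = []
    for v, nbrs in adj.items():
        ns = sorted(nbrs)
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                wedges.append((ns[i], ns[j]))

    # "reduce": a wedge closed by an edge is a triangle; each triangle
    # is discovered once per centre node, i.e. exactly three times
    closed = sum(1 for (a, b) in wedges if frozenset((a, b)) in edges)
    return closed // 3
```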


OutFlank routing: increasing throughput in toroidal interconnection networks

Francesco Versaci
TU Wien

We present a new deadlock-free routing scheme for toroidal interconnection networks, called OutFlank Routing (OFR). OFR is an adaptive strategy which exploits non-minimal links at both the source and the destination nodes. When minimal links are congested, OFR deroutes packets to carefully chosen intermediate destinations, in order to obtain travel paths that are only an additive constant longer than the shortest ones. Since routing performance is very sensitive to changes in the traffic model or in the router parameters, an accurate discrete-event simulator of the toroidal network has been developed to empirically validate OFR, comparing it against other relevant routing strategies over a range of typical real-world traffic patterns. On the simulated 16×16×16 (4096-node) network, OFR exhibits improvements in maximum sustained throughput of between 15% and 115% with respect to Adaptive Bubble Routing.