Scalability of Gossip-Based Solutions to Dynamic Network Problems

Ozalp Babaoglu
University of Bologna

Gossiping has been shown to be capable of solving interesting problems arising in dynamic network environments in an extremely fast, highly adaptive and robust manner. In this talk, I will explore how these solutions behave as the networks they cover grow to ever larger dimensions. I will show that, in the case of three specific problems, gossip-based solutions can be considered extremely scalable in addition to their other desirable properties.

Exploring Power Consumption in Extreme Scale Systems

Kevin Barker

Reaching Exascale-level performance will require extreme scale in both application concurrency and system component count. As the number of system components scales into the tens or even hundreds of thousands, and execution threads potentially into the millions, of particular concern is the power consumed in order to achieve such scalability and performance levels. Current estimates for the power required for such a system range from the high tens of megawatts into the hundreds of megawatts, making such performance levels unachievable without significant and radical innovation. In this talk, we describe ongoing research at Pacific Northwest National Laboratory in characterizing the power consumption of both hardware and software, optimizing the power consumption of large-scale systems, and ultimately developing an analytical modeling capability that will enable researchers to quantify the power consumption of given workloads on architectures of interest. Such a power modeling capability can then be integrated with existing performance modeling techniques to evaluate the trade-offs between power consumption and achievable performance for particular workloads at extreme scale.

A Stochastically Optimal, Quasi-Competitive, Stack Replacement Policy for the Memory Hierarchy

Gianfranco Bilardi
University of Padova

The classical replacement problem for memory hierarchies is considered, both in infinite and finite temporal horizon, assuming memory references distributed according to the Least Recently Used Stack Model (LRUSM). An optimal control formulation is adopted and arbitrary stack-distance distributions are considered.

The concept of Least Profit Rate (LPR) policy is introduced and it is shown that, for the LRUSM model, LPR is an optimal policy over an infinite horizon. The parameters characterizing LPR can be computed for all buffer capacities in time $O(V)$, where $V$ is the size of the virtual space. LPR satisfies the inclusion property: the content of a given buffer is also contained in all larger buffers. Exploiting these properties and further structure of the policy, an algorithm is developed to compute the LPR misses for all buffer capacities in time $O(\log V)$ per access. The miss rate of LPR is compared with that of OPT, the well-known optimal off-line policy, to investigate the difference between an exact and a statistical knowledge of the future of the trace. It is shown that the stochastic competitive ratio of LPR is $\chi \in O(\log C)$ for any access distribution and $\chi \in O(1)$ for a significant subset of distributions. For comparison, the stochastic competitive ratio of LRU is $\Omega(C)$ for some distributions. The more difficult miss-minimization problem over a finite horizon is solved for monotone stack-distance distributions, by establishing the optimality of the LRU (MRU) policy for any non-increasing (non-decreasing) distribution. The optimality of LRU is shown to hold even under considerably weaker assumptions. (Joint work with Francesco Versaci.)
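The stack-distance machinery underlying the LRUSM can be conveyed with a short sketch (an illustrative toy, not the authors' $O(\log V)$-per-access algorithm): the stack distance of each access determines, via the inclusion property of LRU, whether a buffer of any given capacity hits or misses.

```python
def stack_distances(trace):
    """Compute the LRU stack distance of each access in a trace.

    The stack distance of an access is the 1-based depth of the
    accessed address in the LRU stack, i.e. the number of distinct
    addresses touched since the previous access to the same address
    (infinity for a first-time access).  Under the LRU Stack Model,
    these distances are drawn from a fixed distribution.
    """
    stack = []            # most-recently-used address first
    dists = []
    for addr in trace:
        if addr in stack:
            d = stack.index(addr) + 1   # 1-based stack depth
            stack.remove(addr)
        else:
            d = float("inf")            # cold (first-reference) miss
        stack.insert(0, addr)
        dists.append(d)
    return dists

def lru_misses(trace, capacity):
    """By the inclusion property, an LRU buffer of capacity C
    misses exactly on the accesses whose stack distance exceeds C,
    so one pass yields miss counts for every capacity at once."""
    return sum(1 for d in stack_distances(trace) if d > capacity)

trace = ["a", "b", "a", "c", "b", "a"]
# stack distances: inf, inf, 2, inf, 3, 3
```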

Compilers are from Mars, Dynamic Scripting Languages are from Venus

Jose Castanos
IBM Research

There has been a recent surge of interest in compiling dynamic scripting languages (DSLs), such as Javascript, Python, PHP, and Lua, to improve performance. The most popular implementations of DSLs today are primarily interpreted and can be 10 to 1000 times slower than C or C++. Given this performance disparity, one might expect compilation to easily yield orders-of-magnitude performance improvements. To our surprise, performance improvement from today's DSL compilers is still quite elusive: speed-up factors are typically within 2X (with one notable exception of TraceMonkey for some Javascript workloads). In some cases, a DSL JIT can even incur significant slowdown.

In this talk, we will examine various approaches taken by today's DSL compilers, with a special focus on Python. Python offers almost every flavor of JITting approach taken by the community, such as attaching an existing optimizer (LLVM) to the CPython interpreter (Unladen Swallow), converting Python to an intermediate language with a mature JIT (e.g., Jython and IronPython), exploring trace JITs (PyPy), and exploring runtime type specialization (Psyco). We observe two interesting directions taken in today's Python JITs: one is to compile primarily at the level of the language implementation and runtime (bottom-up approach); the other is to customize existing optimizers designed for statically typed languages to compile DSLs. We will discuss the trade-offs and limitations of both approaches.

We will also briefly talk about our early experience in building a JIT compiler for Python in our ongoing FIORANO project and the optimization space we are exploring, including simple JITting, Python bytecode level optimization, runtime feedback, and tracing.

Programming Large-Scale Systems with Heterogeneous Multicore Nodes

Barbara Chapman
University of Texas, Houston

Recent high-end systems, such as the Leadership Class Facilities at Oak Ridge National Laboratory, are large compute clusters whose nodes include graphics processing units to provide the highest levels of performance on suitable code. Application developers working on a variety of large-scale computations in astrophysics, climate research, chemistry and more are attempting to exploit this computational power in their codes. In order to do so, they are faced with the challenge of restructuring their computation to identify kernels that may be accelerated, and with creating CUDA versions of these code fragments. Moreover, they must adapt their program to arrange for the necessary data transfers, which must be carefully tuned. In this presentation, we describe the state of the art in efforts to exploit such recent platforms and consider how we may be able to reduce the programming effort involved in adapting a range of scientific codes to run well on heterogeneous nodes.

Chasing High Performances for Scientific Applications: the AuroraScience Project

Francesco Di Renzo
University of Parma

AuroraScience is a scientific collaboration gathering several partner institutions with scientific and technological know-how at the crossroads of Computational Science and Computer Architecture. The first goal of AuroraScience has been to define a computational platform able to support scientific simulations in basic and applied science and technology. This work has converged with the efforts of the Eurotech company in developing the Aurora parallel system. In particular, AuroraScience has developed a 3D toroidal low-latency, high-bandwidth network. We will describe the ongoing AuroraScience project to deploy a prototype Aurora system, validating its capability to match the computational demands of various scientific applications.


Manish Gupta
IBM Research

Expanding the Parallel Code Synthesis Design Space in Finite-Element CFD

Paul Kelly
Imperial College London

What is the right code to generate, for a given hardware platform? How does this change as problem parameters change? This talk presents some recent work-in-progress in the finite-element fluid dynamics domain; we show some of the fruits of our attempt to map out the design space. Our goal is to build tools that automatically synthesise the optimal implementation. By getting the abstraction right, we can capture design choices far beyond what a conventional compiler can do. For example, we show that in low-order finite-element formulations, assembling the global sparse system matrix is efficient on CPUs, but on GPUs the balance is shifted to favour a different, local assembly algorithm, with a better memory access pattern. In high-order problems, this turns out to be attractive on CPUs as well. The choice of high- or low-order is a tunable parameter, giving a rich space of implementation alternatives with different accuracy-performance characteristics on different hardware.
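The global-versus-local assembly trade-off can be sketched on a toy 1D Poisson problem (an illustration only, not the speakers' implementation; the talk concerns far richer finite-element formulations). Both paths compute the same matrix-vector product, but the local, matrix-free path never materializes the global matrix and touches memory in a different pattern.

```python
import numpy as np

def element_matrix(h):
    """Stiffness matrix of one 1D linear element of length h
    for the Poisson problem: (1/h) * [[1, -1], [-1, 1]]."""
    return (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

def assemble_global(n, h):
    """Global assembly: accumulate all element contributions into
    one system matrix (dense here for brevity; sparse in practice)."""
    A = np.zeros((n + 1, n + 1))
    for e in range(n):
        A[e:e + 2, e:e + 2] += element_matrix(h)
    return A

def apply_local(u, h):
    """Local / matrix-free application: compute A @ u element by
    element without ever storing the global matrix -- the variant
    with the friendlier memory-access pattern in some settings."""
    y = np.zeros_like(u)
    Ke = element_matrix(h)
    for e in range(len(u) - 1):
        y[e:e + 2] += Ke @ u[e:e + 2]
    return y

n, h = 8, 0.125
u = np.arange(n + 1, dtype=float)
# Both paths yield the same product: assemble_global(n, h) @ u == apply_local(u, h)
```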

Architecture-Independent Metrics for Characterizing Parallelism and Locality

Milind Kulkarni
Purdue University

The spread of parallel architectures has been accompanied by a concomitant increase in their variety and complexity, which has increased the importance of performance tuning and optimization of parallel programs. A critical first step in performance tuning is to characterize the behavior of a parallel program. Unfortunately, many readily-measurable metrics are tied to particular architectural parameters, such as cache sizes or number of cores. Nevertheless, fundamental attributes such as the amount of parallelism in a program, or its locality behavior, are inherent to the program and are not architecture dependent. This talk will present two analysis tools for characterizing the behavior of parallel programs. The first tool, ParaMeter, reveals the amount of parallelism available in a program, independent of system overheads and architectural limitations. The second tool, sampling-based multicore-aware reuse distance analysis, takes the popular metric of reuse distance and adapts it to efficiently and effectively account for the complexities of shared-memory multithreaded programs, providing a measure of locality for both shared-cache and private-cache machines.

Maintaining Indexes on Streams of Complex Data

Bradley Kuszmaul

Many high-performance programs need to index complex data that is produced at high rates, and does not fit in main memory. The simple cases, in which the index is one dimensional with fixed sized keys, are handled by streaming B-trees, but many important applications need multidimensional indexes, or indexes on long keys (which are not always strings). In this talk, I'll present some of our recent results on maintaining such indexes, with a focus on variable sized opaque keys. (This is joint work with Michael Bender, and Martin Farach-Colton, and is inspired by some of the problems we've seen at Tokutek. This research was supported in part by the National Science Foundation.)

Jülich on the Way to Exascale

Bernd Mohr
Jülich Supercomputing Centre

Having housed the world's most parallel machine, JUGENE, a 72-rack BlueGene/P with a total of 294,912 cores, for more than a year now, the Jülich Supercomputing Centre (JSC) finds extreme scalability of application software and programming tools more and more important. This talk will summarize experiences in application and tool scaling and report on Jülich's involvement in Exascale activities and projects. First, we will review experiences made and results obtained during the two ``Jülich BlueGene/P Extreme Scaling Workshops'' of October 2009 and March 2010. Next, we will report on scaling experiments with parallel debugging and performance analysis tools which were done in conjunction with the Dagstuhl Seminar ``Program Development for Extreme-Scale Computing'' organized by LLNL, BSC, and JSC in May 2010. Finally, we present the various activities and projects of JSC related to Exascale: the International Exascale Software Program (IESP), the European ExaScale Software Initiative (EESI), the Exascale Innovation Centre (EIC) and the ExaCluster Laboratory (ECL).

Application Scaling through Improved Single-Thread Performance

Jose Moreira
IBM Research

We are all familiar with Amdahl's law, which limits the speedup that a parallel program can achieve based on the fraction of the work that is sequential. Similar limitations to speedup also arise from critical sections in parallel programs, although the details of the limitations are different. Both sequential sections and critical sections make a strong case for continuing improvements in single-thread performance in modern processors. In this talk, we show a quantitative analysis of the limitations to parallelism imposed by sequential and critical sections, and of the benefits from improved single-thread performance. Other ways to reduce the impact of those sections on parallelism are also discussed. Finally, we discuss the impact on processor design of the requirement to improve both parallelism and single-thread performance.
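As a concrete reading of Amdahl's law, the sketch below (illustrative numbers, not figures from the talk) shows how boosting single-thread performance on the sequential fraction raises the achievable speedup.

```python
def amdahl_speedup(f_seq, p, seq_boost=1.0):
    """Amdahl's law with a single-thread improvement factor.

    Speedup on p cores when a fraction f_seq of the work is
    sequential and single-thread performance on that sequential
    part is improved by factor seq_boost:
        S = 1 / (f_seq / seq_boost + (1 - f_seq) / p)
    """
    return 1.0 / (f_seq / seq_boost + (1.0 - f_seq) / p)

# With 5% sequential work on 256 cores:
amdahl_speedup(0.05, 256)        # ~18.6x, far below 256x
# Doubling single-thread speed on the sequential part:
amdahl_speedup(0.05, 256, 2.0)   # ~34.8x, nearly doubling overall speedup
```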

Large-Scale HPC systems: How to get Correct and Optimized Programs on Millions of Cores?

Wolfgang Nagel
Technische Universität Dresden

Scalability has become the major issue in High Performance Computing. The number of cores - homogeneous as well as heterogeneous - will significantly increase above the numbers we already see these days. The ratios between flops and memory size, as well as the bandwidths for memory, communication and I/O, get worse, and it is not clear what the best programming paradigm will be.

Nevertheless, the goal is to understand what happens on such systems, to identify shortcomings in programming and system bottlenecks, and to improve the execution behaviour to get a faster solution with better utilization of system resources. However, given these current trends in technology, we might lose the capabilities of the infrastructure needed to investigate such systems; this has become a severe issue already today. The talk will present challenges in this field which have to be addressed and solved in the next years.

The Hardware Compiling Environment (HCE): current status and future developments

Paolo Palazzari
ENEA and Ylichron Srl

The talk will explain the current status of the HCE high-level synthesis design flow and will present the research activities planned to improve HCE functionality. The presentation will describe the theoretical framework within which HCE has been developed and the directives used to guide the compiler toward an efficient translation of standard ANSI C programs into efficient parallel architectures, described in VHDL-RTL and implemented on FPGA devices. The talk will then illustrate the development actions planned to introduce support for pointers and streams in HCE and to efficiently exploit multi-port memories.

Information Technology in the Age of Smarter Planet

Pratap Pattnaik
IBM Research

The aim of Smarter Planet is to significantly improve the yield per unit of resource consumed (i.e., the value to mankind). It is already happening, at a moderate scale, in various isolated areas around the globe. The driving force behind it is our ability to faithfully represent a large number of physical events and a large amount of data in high-fidelity digital form. It is ushering in a new phase for mankind, much as the Industrial Revolution did by improving production per unit of labor through the deployment of, and advances in, mechanical and electrical technology. Smarter Planet will require a significant number of new features and much support from the Information Technology (IT) infrastructure. It will lead to the development of new features at all levels of the IT stack: processor microarchitecture, system architecture, programming languages, autonomic optimizations, etc. This talk will briefly touch on various examples in this space and discuss the areas where major inventions are needed. For illustrative purposes, the talk will also describe one of our innovations in processor design to accommodate the needs of Smarter Planet.

Large-Scale Graph Computations

Keshav Pingali
University of Texas, Austin

The parallel programming community has a deep understanding of the structure of parallelism and locality in regular applications, in which the key data structures are vectors and dense matrices. However, we have relatively little understanding of these issues in the context of irregular applications, in which the key data structures are pointer-based structures like trees and graphs. In this talk, I argue that these problems arise largely from the limitations of program-centric abstractions, like dependence graphs, that we currently use to think about parallelism. I will then propose a novel data-centric abstraction called the operator formulation of algorithms, which reveals that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous in diverse applications ranging from mesh generation/refinement/partitioning to SAT solvers, maxflow algorithms, stencil computations and event-driven simulation. I will also show that the operator formulation can be used to perform a structural analysis of algorithms that can be exploited for efficient implementations of these algorithms. Finally, I will describe a system based on these ideas, called Galois, for programming multicore processors.

Asynchronous computing of irregular applications using the SVPN model and S-Net coordination

Alex Shafarenko
University of Hertfordshire

This talk will address a new design representation for irregular distributed applications: a Spinal Vector Petri Net (SVPN).

SVPN is a vector of identical graphs which have processing functions at the nodes and which have messages floating along the edges. It is a distant relative of Coloured Petri Nets in that it reveals the synchronisation structure of an asynchronous application and can be used for analysis of its behaviour. SVPN can naturally support irregular applications, such as Particle-in-Cell simulations of plasmas and molecular dynamics, as message-driven processes, without the programmer being aware of any subtle concurrency issues. Our main contribution to knowledge is a demonstration that the mapping of SVPN to the coordination language S-Net is nearly mechanical, which makes it possible, at least in principle, to write asynchronous, irregular applications directly in S-Net and utilise its advanced support for software engineering.

Oblivious Algorithms for Multicores and Network of Processors

Francesco Silvestri
University of Padova

We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel shared-memory machine with hierarchical multi-level caching, and we introduce a multicore-oblivious (MO) approach to algorithms and schedulers for HM. An MO algorithm is specified with no mention of any machine parameters, such as the number of cores, number of cache levels, cache sizes and block lengths. However, it is equipped with a small set of instructions that can be used to provide hints to the run-time scheduler on how to schedule parallel tasks. We present efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components. The notion of an MO algorithm is complementary to that of a network-oblivious (NO) algorithm, recently introduced by Bilardi et al. for parallel distributed-memory machines where processors communicate point-to-point. We show that several of our MO algorithms translate into efficient NO algorithms, adding to the body of known efficient NO algorithms.
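The flavour of parameter-oblivious design can be conveyed by the classic sequential cache-oblivious transpose, which the multicore-oblivious notion generalizes (a simple sketch for illustration, not one of the HM algorithms from the talk): no cache size, block length, or core count appears anywhere in the code, yet the recursive subdivision eventually produces subproblems fitting every level of any cache hierarchy.

```python
def transpose(A, B, r0=0, c0=0, rows=None, cols=None, base=8):
    """Cache-oblivious transpose: write B = A^T.

    Recursively split the larger dimension of the current submatrix
    of A (top-left corner (r0, c0), size rows x cols) until it is
    small, then transpose directly.  The cutoff 'base' is only a
    recursion-overhead knob, not a tuned cache parameter.
    """
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows <= base and cols <= base:
        for i in range(r0, r0 + rows):
            for j in range(c0, c0 + cols):
                B[j][i] = A[i][j]
    elif rows >= cols:
        h = rows // 2
        transpose(A, B, r0, c0, h, cols, base)
        transpose(A, B, r0 + h, c0, rows - h, cols, base)
    else:
        h = cols // 2
        transpose(A, B, r0, c0, rows, h, base)
        transpose(A, B, r0, c0 + h, rows, cols - h, base)

A = [[i * 10 + j for j in range(10)] for i in range(12)]   # 12 x 10
B = [[0] * 12 for _ in range(10)]                          # 10 x 12
transpose(A, B, base=4)
# Now B[j][i] == A[i][j] for all i, j.
```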

The HPX Runtime System for Exascale Computing

Thomas Sterling
Louisiana State University

While projected performance based on Moore's Law suggests Exaflops scale before the end of this decade, more detailed analysis exposes a number of factors that may obstruct this achievement. The limitations of power consumption have resulted in the flatlining of core clock rates, and the exhaustion of core complexity has similarly flatlined ILP and the concomitant IPC. Future performance growth will be achieved only through dramatic increases in the number of cores (multicore) and heavy multithreading (e.g., GPU). Efficiency and scalability must be significantly enhanced in order to effectively employ future systems comprising hundreds of millions of cores and tens of billions of threads. HPX is an experimental runtime system that serves as a reference implementation for the ParalleX execution model (presented last year). HPX permits dynamic adaptive resource management and task scheduling in a global address space context, using message-driven multithreaded computation with lightweight synchronization. It allows work to be moved to the data when this is more efficient than always moving the data to the work, both reducing and hiding system-wide latency effects. It also enables migration of continuations for distributed control flow, which in turn supports data-directed execution. This exploits the intrinsic parallelism of meta-data, such as in large directed-graph data structures. HPX has been employed on conventional platforms for application classes such as Adaptive Mesh Refinement (AMR) for numerical relativity and for symbolic computations such as game playing (e.g., Chess). HPX is being employed to investigate the limitations of conventional architecture scalability and to guide the development of future-generation Exascale computer architectures.

UpSizeR: Synthetically Scaling Up a Given Database State

Y.C. Tay
National University of Singapore

E-commerce and social networking services must ensure that their systems are scalable. Engineering for rapid growth requires intensive testing with scaled-up datasets. Although such a larger dataset is synthetically generated, it must be similar to a real dataset if it is to be useful. This talk presents UpSizeR, a tool for scaling up relational databases. Given a database state D and a positive number s, UpSizeR generates a synthetic state D' that is s times the size of D, yet similar to D in terms of query results. UpSizeR does this by extracting inter-column (e.g. key-induced) and inter-row (e.g. clustering) information from D. Experiments with Flickr data show good agreement between crawled data and UpSizeR output for various sizes.

Performance Modeling and Optimization on the Example of Stencil Algorithms

Jan Treibig
University of Erlangen

Stencil algorithms are a central part of many scientific and engineering codes. This talk will present an analysis of architectural developments with regard to the demands of bandwidth-limited loop kernels. Using the Jacobi and Gauss-Seidel methods as examples, a variant of temporal blocking is presented that is specifically designed to exploit the shared caches of modern multicore processors. To obtain an analytic performance estimate, a diagnostic model is introduced that allows predicting the performance of streaming loop kernels at all memory-hierarchy levels. Results for various modern multicore processors are presented.
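For reference, a plain 2D Jacobi sweep in its bandwidth-limited baseline form (a minimal sketch, not one of the tuned kernels from the talk): each sweep streams the whole grid, so for grids larger than cache every sweep reloads all data from main memory. Temporal blocking would instead perform several sweeps on a cache-sized block before streaming onward, reusing it from the shared cache.

```python
import numpy as np

def jacobi_sweeps(u, sweeps):
    """Plain 2D Jacobi relaxation for the Laplace equation with
    fixed (Dirichlet) boundary values: each interior point is
    repeatedly replaced by the average of its four neighbours,
    ping-ponging between two grid buffers."""
    u = u.copy()          # keep the caller's array intact
    v = u.copy()          # second buffer; boundary rows/cols copied once
    for _ in range(sweeps):
        v[1:-1, 1:-1] = 0.25 * (u[1:-1, :-2] + u[1:-1, 2:]
                                + u[:-2, 1:-1] + u[2:, 1:-1])
        u, v = v, u       # swap buffers; u now holds the newest sweep
    return u

# Boundary fixed at 1, interior starting at 0:
g = np.ones((8, 8))
g[1:-1, 1:-1] = 0.0
# With enough sweeps the interior converges to the boundary value 1.
```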