Performance of Fractal-Tree Databases

Michael Bender
Stony Brook University and Tokutek, Inc.

Insertion bottlenecks lie at the heart of database and file-system innovations, best practices, and system workarounds. Most databases and file systems are based on B-tree data structures, and suffer from the performance cliffs and unpredictable run times of B-trees. In this talk, I introduce the Fractal Tree data structure and explain how it provides dramatically improved performance in both theory and practice.

From a theoretical perspective, if $B$ is the block-transfer size, the B-tree performs $O(\log_B N)$ block transfers per insert in the worst case. In contrast, the Fractal Tree structure performs $O((\log_B N)/B)$ memory transfers per insert, which translates to run-time improvements of two orders of magnitude.
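To make the gap concrete, here is a back-of-the-envelope evaluation of the two bounds above, with illustrative parameter values (not measurements from the talk):

```python
import math

def btree_io(N, B):
    # Worst-case block transfers per B-tree insert: O(log_B N)
    return math.log(N, B)

def fractal_io(N, B):
    # Amortized block transfers per Fractal Tree insert: O((log_B N) / B)
    return math.log(N, B) / B

# With a billion keys and ~100 keys per block, the Fractal Tree bound is
# a factor of B = 100 -- two orders of magnitude -- below the B-tree bound.
N, B = 10**9, 100
print(btree_io(N, B))    # ~4.5 block transfers per insert
print(fractal_io(N, B))  # ~0.045 block transfers per insert
```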

To relate that theory to practice, I present an algorithmic model for B-tree performance bottlenecks. I explain how the bottlenecks affect best practice and how database designers typically modify B-trees to try to mitigate the bottlenecks. Then I show how Fractal Tree structures can attain faster insertion rates, intuitively by transforming disk-seek bottlenecks into disk-bandwidth bottlenecks.

I conclude with performance results from Tokutek's transaction-safe Fractal-Tree storage engine for MySQL, showing how the system can maintain rich indexes more efficiently than B-trees. Surprisingly, Fractal Tree structures seem to maintain their order-of-magnitude advantage over B-trees on both traditional rotating media and SSDs.

Network Obliviousness

Gianfranco Bilardi
University of Padova

The design of algorithms that can run unchanged yet efficiently on a variety of machines characterized by different degrees of parallelism and communication capabilities is a highly desirable goal. We propose a framework for network-obliviousness based on a model of computation where the only parameter is the problem's input size. Algorithms are then evaluated on a model with two parameters, capturing parallelism and granularity of communication. We show that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in a block-variant of the Decomposable BSP model, which effectively describes a wide and significant class of parallel platforms. We illustrate our framework by providing optimal network-oblivious algorithms for a few key problems, and also establish some negative results.

Automatic Performance Tuning: Today and Future Challenges

Jacqueline Chame
University of Southern California

On the Implementation of Circuits for the Discrete Fourier Transform

Carlo Fantozzi
University of Padova

The Discrete Fourier Transform (DFT) remains a key building block in countless computing systems. Commercial solutions usually rely on ASIC implementations of the radix-2 FFT algorithm, but this classical, static approach is being questioned by the demanding requirements of modern applications. A clear example is given by new communication standards, which mandate support for many combinations of parameters to ensure good performance in different scenarios: for instance, DVB-T2 requires switching among 6 different DFT lengths. Other applications, such as cellular telephony, mandate support for multiple standards at once. The need for adaptivity is foreseen to grow further in the future, with the spread of software-defined and cognitive radio systems.

In this scenario, the talk will present some remarks on the silicon implementation of the DFT in modern processors, such as multi-core chips or application-specific instruction-set processors. The objective is to shed light on key properties that are valid regardless of the size, algorithm, and other details of the DFT implementation.

SW/HW Approach for Optimizing the Performance
of Synchronous Shared Memory Architectures to Low-TLP Situations

Martti Forsell
VTT Electronics

Synchronous shared memory (SSM) architectures are promising candidates for future CMP architectures due to their ability to execute general-purpose parallel code efficiently down to the finest granularity and their support for easy-to-use parallel programming models. While recent SSM architectures are tuned for fast execution of parallel workloads and co-exploitation of ILP and TLP, the solutions used in them do not support efficient execution of low-TLP code fragments. More generally, this inability of architectures optimized for parallel execution to execute sequential code efficiently has been shown to be one of the design bottlenecks in the theory of architectures.

In this presentation we propose a SW/HW approach for dynamically optimizing the performance of recent SSM architectures in low-TLP situations. The HW part includes changes to the processor pipeline and instruction set, as well as a new technique called bunching that combines the execution slots of multiple threads into a single bunch executing a single thread, with a speedup proportional to the number of threads. The SW part includes a language-level mechanism to support seamless bunching concurrently with parallel execution. A preliminary evaluation of the approach is given.

Abstraction in Manycore Programming: Active Libraries and Indexed Metadata

Paul Kelly
Imperial College London

This talk is a review and manifesto, looking at what makes building high-performance software hard. I'll talk about some of the correctness and performance issues, with a specific focus on how these concerns are destructive to software quality - how performance hacking breaks software abstractions and reusability. My research has attacked this problem from multiple perspectives; the main aim of this talk is to set out our agenda for software technologies that allow performance engineering to be isolated from higher-level algorithmic development - by delivering pluggable domain-specific, or library-specific, optimisation tools. Our recent examples of this include a collaboration with The Foundry Ltd. on image processing in visual-effects postproduction, an active library for dense and sparse linear algebra, and program generation tools for finite-element CFD.

Context-aware Composition of Parallel Components

Christoph Kessler
Linköping University

Programming parallel systems efficiently is notoriously difficult. Components are a well-proven concept for managing software design and implementation complexity. However, to be reusable, components are often more general than necessary. Moreover, they hardcode and hide too many design decisions, such as scheduling, algorithm selection, or data-representation selection, which could better be bound at a later point in time, e.g. at run time, when more information about available resources or problem parameters is known. We show how such decisions can be guided at run time with low overhead by look-up tables that are computed off-line from profiling data. This context-aware composition is a powerful optimization technique that can be seen as a generalization of the currently very popular autotuning methods for special domains and library functions. While formulated for a parallel scenario, our method also applies to optimized composition of sequential programs as a special case.
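As an illustration of the idea (a sketch of mine, not the authors' implementation), table-driven variant selection for a sort component might look as follows; the size thresholds here are invented, standing in for values the real system would compute off-line from profiling data:

```python
import random

def insertion_sort(xs):
    # In-place insertion sort: the variant of choice for small inputs.
    for i in range(1, len(xs)):
        key, j = xs[i], i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs

def merge_sort(xs):
    # Out-of-place merge sort: stands in for a (parallel) large-input variant.
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# Look-up table of (problem-size threshold, variant) pairs, consulted at
# run time; the last entry is the default.
DISPATCH = [(32, insertion_sort), (float("inf"), merge_sort)]

def context_aware_sort(xs):
    n = len(xs)                       # the run-time "context" (input size)
    for threshold, variant in DISPATCH:
        if n <= threshold:
            return variant(xs)

data = [random.randrange(1000) for _ in range(100)]
assert context_aware_sort(data) == sorted(data)
```

In the real setting the context would also include available cores and data representation, and the table would select among parallel schedules as well as algorithms.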

(This is joint work with Welf Löwe, Univ. of Växjö, Sweden.)


Bradley C. Kuszmaul
Massachusetts Institute of Technology

To achieve high performance and high productivity, systems must support oblivious programmers. Programmers do not and cannot deal with complex systems well, so, as systems theorists and designers, we must hide a bunch of details. Here are some examples:

  • Scheduler-Obliviousness: Programmers know there are many cores in their computer, but it's better if they don't know exactly how many. The number of cores will certainly change over the useful life of a program, and it may even change dynamically over a single run. A provably good thread scheduler (such as a work-stealing scheduler) can hide the details of scheduling work onto cores, and a provably good job scheduler can hide the details of which jobs get how many cores. Instead of worrying about how to schedule a program, a programmer can worry about minimizing the work and span (critical path length) of a program.
  • Cache-Obliviousness: To achieve good performance in a cache hierarchy (which may include a disk system), software must operate on memory in a blocked fashion. The optimal block size may be difficult to ascertain and to program, especially in a multi-level hierarchy. With disk systems, the optimal block size may change over time (because of prefetching) and over space (e.g., disk inner tracks have smaller effective block sizes than outer tracks). Cache-oblivious algorithms provide a way to get good performance without measuring the cache hierarchy.
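The second point can be illustrated with the classic recursive matrix transpose, sketched here by the editor (not taken from the talk); the base-case constant is arbitrary:

```python
# Cache-oblivious transpose: recurse on the longer dimension until the
# subproblem is small. At some recursion depth the submatrix fits in each
# level of the cache hierarchy, whatever its size, so the code performs
# well without ever naming a block size.
def transpose(A, B, r0, r1, c0, c1):
    """Write the transpose of A[r0:r1][c0:c1] into B[c0:c1][r0:r1]."""
    if (r1 - r0) * (c1 - c0) <= 16:      # small base case; any constant works
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif r1 - r0 >= c1 - c0:             # split the longer dimension
        mid = (r0 + r1) // 2
        transpose(A, B, r0, mid, c0, c1)
        transpose(A, B, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(A, B, r0, r1, c0, mid)
        transpose(A, B, r0, r1, mid, c1)

rows, cols = 5, 7
A = [[i * cols + j for j in range(cols)] for i in range(rows)]
B = [[0] * rows for _ in range(cols)]
transpose(A, B, 0, rows, 0, cols)
assert all(B[j][i] == A[i][j] for i in range(rows) for j in range(cols))
```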

Moving Threads instead of Data - Benefits and Challenges

Ville Leppänen
University of Turku

A traditional solution in multiprocessor/multicore systems based on the shared-memory concept is to move data read/write requests between processors/cores. This can lead to memory-consistency problems if cache memories are attached to each processor/core and memory values can be replicated into several caches to compensate for the slowness of the memory hierarchy. Other possible problems are an imbalance of the threads allocated to cores and/or dynamic hotspots in the memory system due to an unsuccessful allocation of the program's data structures.

In this talk, we consider dynamically moving threads instead of data read/write requests. The moving-threads approach makes it possible to avoid consistency problems, if each core can access only a separate portion of the whole shared memory through its cache memory. In principle, randomness in mapping the program's data structures into the shared memory can make the threads load each core (and its cache) evenly. An obvious drawback, however, is the rather high thread-moving capability required. On the other hand, one can also consider having special annotations in programs to guide the allocation of data structures so that temporal locality in the execution of threads is maximized and the thread-moving requirements are minimized. Of course, this complicates programming considerably.

We present a RISC-based multicore architecture framework supporting the moving-threads approach and consider the cost of synchronous execution of programs on it. We consider issues such as the size of threads (the number of registers), the number of threads per core needed to hide thread-moving and memory-hierarchy latencies, which of these two types of latency dominates, and heap construction.

Enhanced Performance Analysis of Parallel Applications with an Integrated Tool-chain

Bernd Mohr
Jülich Supercomputing Center

Programming and optimizing large parallel applications is an ambitious and time-consuming challenge. Therefore a number of software tools have been developed in the past to assist programmers in analysing and optimizing their codes. Scalasca from the Jülich Supercomputing Centre and Vampir from the Technical University of Dresden are two of these performance-analysis tools that are already well established and recognized in the community. While Scalasca shows performance flaws due to certain common patterns in a compact and easy-to-use tree representation, the Vampir framework visualizes all details of the function-call sequences and communication patterns in a great variety of time-line or statistics displays. Scalasca shows the problem and its severity but gives limited practical hints toward finding a solution. The huge amount of data in Vampir can make it hard to find the interesting spots, in spatial and temporal resolution, for identifying problems and selecting them for further investigation. This observation led to the combination of these tools, resulting in an information gain higher than the added gain of the two tools each taken by itself. In this presentation we present our approach to connecting Scalasca and Vampir through a common bus system. A practical real-world example is provided, and the beneficial impact of the integration on the work-flow is shown by optimizing the adaptive mesh-refinement and load-balancing phases of the metal-forming simulation FE software INDEED.

The Importance and Effectiveness of Speculative Optimizations

José E. Moreira
IBM T. J. Watson Research Center

Dynamic approaches have the potential for significantly more impact on performance optimization than static approaches. As an example, just-in-time or runtime compilers have access to much more information about the behavior of a program than static compilers. That information includes execution metrics such as method frequency and loop bounds, as well as dependence information that can be useful in enabling compiler transformations. Nevertheless, it can still be difficult (or expensive) to determine the legality of certain optimizations even at runtime. In this talk, we will discuss the effectiveness of performing optimizations speculatively. That is, optimized code is executed even though the compiler and runtime system are not sure of its legality. Possible violations of proper program semantics are then detected while the code is executing, and some form of correction is performed to undo the effects of the incorrect code. We will review mechanisms to both detect and correct the violations and discuss the performance impact of those mechanisms and the optimizations that they enable.
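A toy sketch of the detect-and-undo pattern (an editor's illustration, not the mechanisms discussed in the talk): a loop `a[i] = a[idx[i]] + 1` is "optimized" by hoisting all loads, which is legal only if no iteration reads a location written by an earlier one.

```python
def speculative_loop(a, idx):
    # Original (sequential) semantics: for i: a[i] = a[idx[i]] + 1, in order.
    snapshot = list(a)                   # undo log
    for i in range(len(idx)):            # speculative version: all loads
        a[i] = snapshot[idx[i]] + 1      # hoisted, reading pre-loop values
    # Detect: hoisting is illegal if some iteration was meant to read a
    # value written by an earlier iteration, i.e. idx[i] < i.
    if any(idx[i] < i for i in range(len(idx))):
        a[:] = snapshot                  # undo the speculative writes
        for i in range(len(idx)):        # re-execute with original semantics
            a[i] = a[idx[i]] + 1
    return a

# No conflict: the speculative result stands.
print(speculative_loop([5, 6, 7], [1, 2, 2]))  # [7, 8, 8]
# Conflict (every iteration reads a[0]): detected, undone, re-executed.
print(speculative_loop([0, 0, 0], [0, 0, 0]))  # [1, 2, 2]
```

A real runtime would detect violations with hardware or instrumented checks rather than a full post-hoc scan, and would undo only the affected state.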

Scalable performance analysis on large systems: What to do with all the data?

Wolfgang Nagel
Technical University of Dresden

Performance analysis and optimization is an interesting - and challenging - field in computer science. The goal is to understand what happens on a computer system, to identify shortcomings in programming and system bottlenecks, and to improve the execution behaviour to get a faster solution with better utilization of system resources. If everything goes well, the user gets a faster solution - sometimes by one or two orders of magnitude.

Over time, the community has learned how to deal with parallel machines. Nevertheless, looking at current trends in technology, we might lose the infrastructure capabilities needed to investigate systems with half a million cores and beyond; this has already become a severe issue today. Therefore, new ideas and software concepts are needed, and the talk will present a strategy and a prototype implementation to address that field.

HCE: The Optimized Design Flow from ANSI C Programs to Optimized FPGA Architectures

Paolo Palazzari
ENEA and Ylichron Srl

The HCE (HARWEST Compiling Environment) design flow, developed by Ylichron, will be briefly presented. Starting from an algorithm expressed as an ANSI C program, HCE generates an optimized parallel architecture which implements the algorithm and is described as synthesizable VHDL-RTL, ready to be implemented in FPGA technology. The theoretical basis of HCE will be reviewed, as well as the latest developments and improvements to the design flow. Examples will be shown to illustrate the impact of different C programming styles on the performance achievable by the final architecture.

Programming Landscape and its Impact on System Design

Pratap Pattnaik
IBM T. J. Watson Research Center

In recent years, besides leading programming languages like C, COBOL, Fortran, etc., a series of newer languages such as Java, Perl, PHP, Python, and Ruby have become quite popular. Productivity considerations and the globalization of code development and code service are shaping the evolution of these languages. It is expected that, in enterprise deployments of applications written in these languages, the runtime and the hardware will synergistically and autonomically provide the needed performance to applications. This talk, after briefly describing the evolution of the programming-language landscape, will deal with some of its impact on hardware design and close with a number of significant but yet-to-be-answered questions.

Self-adaptive Networks: a Coordination Concept for Open Environments

Frank Penczek
University of Hertfordshire

In this talk we will briefly introduce the coordination language S-Net and then discuss some extensions to it that allow a distributed application to be self-adaptive. Extended S-Net will enable the most complete possible separation of concerns between the stream processing schema, which can be reasoned about in terms of data manipulation, correctness, etc., and a coexisting problem-specific management layer which shares the communication infrastructure and computational resources with the rest of the code and which has the ability to monitor and adapt the schema to achieve better utilisation of resources in a distributed environment. Our proposed solution is in terms of a component technology and a coordination "glue" that enable concurrency engineers and subject-area programmers to collaborate without knowing much of each other's toolkit. The presentation will be illustrated by code examples.

This work was funded by the EU framework 6 project AETHER and its continuation is being funded under framework 7, the project ADVANCE.

It's a Computer Jim; but not as we know it!

Ian Phillips

For the conventional computation model, increased performance equates directly to increased processing throughput. Embedded Intelligent Systems are Computers; just not like 'proper' computers! So when we talk about scaling the performance of such computers, it is by no means obvious what this might mean. Typically, scaling of an Embedded Computer is much less concerned with raw throughput than with improvements in non-functional aspects of performance, such as Power, Cost, System Integration, Productivity, Quality and Reusability. Any understanding of Scalable Performance in this domain must relate to this context. The presenter will explore the architecture of typical embedded systems and what scalability might mean to them.

Amorphous Data-Parallelism

Keshav Pingali
University of Texas, Austin

Most client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as sparse graphs and trees. However, we understand very little about the nature of parallelism in irregular algorithms. In this talk, we define a generalized form of data-parallelism that we call amorphous data-parallelism and show that it is ubiquitous in irregular algorithms. We also show that the concepts behind amorphous data-parallelism lead to a natural categorization of irregular algorithms. We argue that optimistic or speculative execution is essential for exploiting amorphous data-parallelism in many irregular algorithms, but show that the overheads of speculative execution can be reduced dramatically in most cases by exploiting algorithmic structure.
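The worklist pattern that underlies this view of irregular algorithms can be sketched on a simple example (an editor's sketch, not from the talk): breadth-first labeling of a graph, where each iteration touches one active node and its neighborhood, and iterations with non-overlapping neighborhoods could run in parallel.

```python
from collections import deque

def bfs_levels(adj, src):
    # "Active" elements live in a worklist; processing one may activate
    # others. Overlapping neighborhoods are where speculation and
    # conflict detection come into play in a parallel implementation.
    level = {src: 0}
    work = deque([src])
    while work:
        u = work.popleft()           # grab one active node
        for v in adj[u]:             # its neighborhood: incident edges
            if v not in level:
                level[v] = level[u] + 1
                work.append(v)       # newly activated work
    return level

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```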

STAPL: A Parallel Programming Infrastructure

Lawrence Rauchwerger
Texas A&M University

The Standard Template Adaptive Parallel Library (STAPL) is a collection of generic data structures and algorithms that provides a high-productivity parallel programming infrastructure, with an approach that draws heavily in its design from the C++ Standard Template Library (STL). By abstracting away much of the complexity of parallelism from the end user, STAPL provides a platform for high productivity by enabling the user to focus on algorithmic design instead of lower-level parallel implementation issues. In this talk, we provide an overview of the major STAPL components, discuss its framework for adaptive algorithm selection, and show that several important scientific applications can be written with relative ease in STAPL and still achieve scalable performance.

The QPACE Project

Fabio Schifano
University of Ferrara & INFN - Ferrara

The talk focuses on the QPACE project, a Petaflops-range supercomputer system designed by a collaboration of academic and industrial partners. QPACE is a massively parallel system based on the IBM PowerXCell 8i processor and interconnected via a nearest-neighbor 3D-torus network. The talk will give an overview of the hardware and software of the system, and an update on its status.

HPC in Phase Change

Thomas Sterling
Louisiana State University

Since 2007, high performance computing has been at the beginning of the most dramatic change in form and function in the last decade and a half. Since the advent of the killer micro and the MPPs and commodity clusters it spawned, supported by message-passing programming techniques, most notably MPI, HPC has been on an exponential curve, augmenting performance at historic rates through incremental changes to feature size, clock rate, and architectural complexity. But as always happens with S-curves, HPC is turning towards its final asymptote and is undergoing what may prove to be its 6th and potentially final phase change. Most visible is the adoption of multicore heterogeneous system architectures, driven by constraints in power, complexity, clock rate, and reliability, while continuing to exploit improvements in feature size to achieve growth in performance. To realize this goal, and the achievement of Exascale performance by the end of the next decade within practical limitations, critical advances in efficiency, scalability, energy, and programmability will be required. In all previous such metamorphoses in HPC, the underlying principles of a new execution model were used to guide the co-design of new architectures, programming methods, and system software. Such is the case for the emerging HPC Phase VI. This presentation will discuss the likely elements of the new execution model, based on the exploratory ParalleX model of computation, and describe key attributes of architecture, operating and runtime system software, and programming methods that are likely to gain ascendancy over the next decade.

Green Flash: Co-designing Hardware/Applications/Auto-tuners for Extreme Scale Computing Systems

Erich Strohmaier
Lawrence Berkeley National Laboratory

Green Flash is a project to develop a new design methodology for extreme scale computing systems. The approach is to take a vertical slice through the space of applications, algorithms, software, and hardware to perform a study on how to build the most scalable, energy-efficient system to solve a particular computational science problem. The initial target is climate simulation, and the solution is likely to require higher degrees of parallelism in the climate algorithms and software, as well as new hardware designs based on technology from outside the traditional workstation and server domains, particularly low-power embedded technology. Auto-tuning technologies are one key component in the design cycle between hardware and software to achieve optimal performance and power efficiency. The result will be a proof-of-concept machine design, along with supporting analysis of how this compares to alternative designs, and new algorithms and software techniques as needed to make the case for feasibility of computing the science problem on the system.

A Bridging Model for Multi-Core Computing

Leslie Valiant
Harvard University

We propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for such a bridging model. Portable algorithms would contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. We show that for several fundamental problems, namely matrix multiplication, fast Fourier Transform, and sorting, algorithms do exist that are optimal in a defined sense, for all combinations of input size and parameter values for this model.

Dynamic Adaptability for a Many-computing Element Dataflow Processor

Lorenzo Verdoscia
Institute for High Performance Computing and Networking (ICAR) - CNR

In the last twenty years, industry has delivered dramatic performance gains by increasing the frequency of its processors, from 5 MHz to more than 3 GHz, while at the same time improving IPC (instructions per cycle). However, power and thermal issues, such as dissipating heat from increasingly densely packed transistors, have begun to limit the rate at which processor frequency can be increased. Although frequency increases have been a design staple for the last 20 years, the next 20 years will require a new approach. Basically, industry needs to develop improved micro-architectures at a faster rate, and in coordination with each new silicon manufacturing process, from 45 nm to 32 nm and beyond. For this new approach we can take advantage of Moore's law, because transistor feature size is expected to continue to shrink at a rate similar to that of the past. New technologies, such as 3-D die stacking, may allow even greater increases in total transistor counts within a given footprint, beyond the increases made possible by improvements in lithography alone.

In this scenario, many cores on a single chip are going to become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with thousands of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. Because memory access will continue to be an important limiting factor on achieving performance, heterogeneous systems may make use of cores with varying capabilities and performance characteristics, and an appropriate programming model can address scalability and expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores.

In this talk we present some ideas on our dynamic approach to the dataflow processor design of the PHOENIX project, whose key concepts are dynamic dataflow-graph reconfiguration, virtual computing units, and the adaptation of hardware resources to data parallelism.