Abstracts

Dynamically Computing the Performance of Statically Optimal Page Replacement

Gianfranco Bilardi
University of Padova

Implementing the optimal replacement policy proposed by Belady in the sixties requires static, off-line knowledge of the address trace. We show that, nevertheless, the number of misses incurred while processing any given prefix of the trace only depends upon that prefix and can be efficienlty computed by an on-line algorithm, simultaneously for all cache sizes.

Adaptative Libraries for Heterogeneous Multicore Architectures

Francois Bodin
Irisa/University of Rennes/Caps entreprise

Libraries are at the heart of high performance systems (embedded or not) and users should be encouraged to use them when writing efficient applications for heterogeneous multicores. However, performance of such code depends on numerous factors (problem size, workload, hardware configuration, ...), which results into the need of several implementations addressing different hardware components or specific input data cases. The best implementation has to be chosen at runtime according to the routine input parameters.

In this talk we present the Library Adapter that allows exploiting heterogeneous implementations of a library routine. The proposed solution deals with runtime decisions as well as the configuration of the target platform.

Continuous Program Optimization

Jose Castanos
IBM T. J. Watson Research Center

As the complexity of both machine architecture and software systems increases, it is likely that neither current static nor dynamic compilation techniques are going to be adequate to handle emerging workloads written in multiple languages and requiring optimizations across a complex software stack. In this talk we present a design for next generation compilation systems that combines static analysis and optimization with dynamic code generation and machine optimization. Most current compilers have built-in extensive static analysis algorithms that are very powerful but limited by a lack of runtime information. Dynamic compilers have access to a wealth of runtime information, but can not afford expensive optimization algorithms. By combining dynamic and static compilation we allow the system to exploit the best features of both approaches. Our infrastructure provides for extensive program behavior monitoring across the execution stack. Performance data is collected at runtime and used by the dynamic compiler as well as stored externally for post-run analysis and profile directed optimization in the static compiler. This allows for on-the-fly, profile driven optimization without burdening the user.

Collecting and Exploiting Information on the Dynamic Behavior of OpenMP Code

Barbara Chapman
University of Houston

Each OpenMP compiler has its own runtime environment, customized to suit the chosen implementation strategy. The OpenMP ARB's Tools Committee therefore developed an interface known as the Collector API to support the statistical sampling and tracing of OpenMP applications. This API enables performance tools to collect information about a program's runtime behavior by defining a set of states and events that can be dynamically queried. We are extending the runtime system that is part of our OpenUH compiler framework to provide one of the first implementations of the Collector API. We have moreover extended the API sanctioned by the OpenMP ARB in order to fully support the events needed by popular profiling and tracing tools such as TAU and KOJAK/Scalasca. We describe this effort and our current work that explores how this infrastructure might be useful for dynamic optimization.

Increasing Granularity of OpenMP Tasks

Alejandro Duran
UPC, Barcelona

The latest OpenMP specification (3.0) introduced a new execution model that allows the creation of independent execution units called tasks. The OpenMP specification purposely leaves freedom to the implementation of the scheduling of these tasks as long as a few restrictions are met. This allows the implementations to optimize task creation by embedding one task into another thus increasing the granularity. In this presentation we describe our work to implement dynamic granularity control in the context of OpenMP and how the compiler allows us to reduce the associated overheads.

Enabling Dynamic Optimization and Adaptation for Statically Compiled Programs Using Function Multi-Versioning

Grigori Fursin
INRIA Saclay

Applying the right compiler optimizations to a program can considerably improve performance, reduce code size and in some cases power consumption. However finding a good solution in large non-linear optimization spaces is a tedious and time-consuming task particularly on rapidly evolving architectures. Iterative compilation and machine learning have been recently introduced to automate and speed-up the search process. However, similar to traditional feedback-directed compilation, iterative compilation adapts program only to one specific architecture and dataset at a time, and do not take into account varying dynamic behavior of the system. On the other hand, dynamic and JIT optimizations can adapt to any context at run-time but are limited by analysis and optimization time constraints, usually do not apply aggressive optimizations and often require complex run-time recompilation frameworks.

In this talk I will present our on-going work to enable run-time optimization and adaptation for statically compiled programs. Unlike other complex dynamic recompilation frameworks, our technique relies on function/loop versioning and static low-overhead monitoring with adaptation routines. We perform costly iterative search for powerful optimizations off-line using machine learning to find different optimizations for programs with multiple datasets and varying system context. I will present preliminary results demonstrating our adaptation techniques for programs with regular and irregular behavior using MILEPOST GCC, Collective Compilation Framework and MiBench with MiDatasets. We plan to use our techniques for continuous program optimizations, generation of adaptive static libraries, program adaptation on multi-ISA architectures, split-compilation, etc. We also believe that our techniques may simplify future JIT optimizers by passing hints from the analysis and optimization stages of current static compilers to the JIT as annotations and at run-time using statistical and machine learning techniques based on program static and dynamic features to quickly figure out which optimizations to use.

Scalable Resource Management in Grid Environments

Elisa Heymann
Universitat Autònoma de Barcelona

I will present our experiences in the development of CrossBroker, a job management service that provides transparent and reliable support for executing parallel and interactive applications on the Grid. I will outline the main components of CrossBroker and how they interact with other middleware services. In addition specific features of the scheduler used to guarantee resource co-allocation for running MPI jobs remotely over multiple resources spread across several Grid sites, or starting interactive applications as fast as possible will be described. Although resource allocation in Grid environments is generally carried out under the assumption that there is one primary scheduling system, it becomes increasingly likely that scalable deployments will involve multiple independent schedulers allocating jobs over the same resources. Therefore I will explain how we have applied common techniques borrowed from IP networks - particularly approaches based on Random Early Detection (RED) buffer management and its subsequent extensions/variations to schedule jobs on the Grid.

The SVP Model and the Place Abstraction

Chris Jesshope
University of Amsterdam

This talk will introduce the SVP model and in particular look at various implementations of the model's abstraction of place. It will also propose a novel hierarchical architecture and resource-management protocol for the delegation of work within ubiquitous and heterogeneous environments. The protocol is based on serving an SVP place, where a component of work is delegated together with the responsibility for meeting any non-functional computational requirements such as deadline or throughput constraints. The protocol is based on a market where SANE processors bid for jobs to execute and selection is based on a cost model that reflects the energy required to meet the jobs requirements.

Dynamic vs Static Binary Translation

Wei Li
Intel

In this talk, I will discuss dynamic and static binary translation, in particular the technologies and products developed by Intel. I would explore the tradeoffs of the two approaches to binary translation.

The Performance Engineering Research Institute

Bob Lucas
USC/ISI

This talk will describe the Performance Engineering Research Institute (PERI), an Enabling Technologies Institute of the Scientific Discovery through Advanced Computing 2 (SciDAC-2) program of the U.S. Department of Energy (DOE) Office of Science (SC). As we look to the future, achieving expected levels of performance on high-end computing (HEC) systems is growing ever more challenging due to enormous scale, increasing architectural complexity, and increasing application complexity. To address these challenges, PERI is implementing a unified, tripartite research plan encompassing: (1) performance modeling and prediction; (2) automatic performance optimization; and (3) performance engineering of high profile applications. The PERI performance modeling and prediction activity is developing and refining performance models, significantly reducing the cost of collecting the data upon which the models are based, and increasing model fidelity, speed and generality. Our primary research activity is automatic tuning of scientific software. This is spurred by the strong user preference for automatic tools and is based on previous successful activities such as ATLAS, which has automatically tuned components of the LAPACK linear algebra library, the highly successful FFTW library, and other recent work. Our third major component is application engagement, in which we are working directly with SciDAC-2 applications. This last activity not only helps DOE scientists meet their near-term performance goals, but will also help keep PERI's research focused on the real challenges facing DOE computational scientists as they enter the Petascale era.

A Framework for Binary Code Analysis and Static and Dynamic Patching

Barton P. Miller
University of Wisconsin, Madison

Tools that analyze and modify binary code are crucial to many areas of computer science, including cyber forensics, program tracing, debugging, testing, performance profiling, performance modeling, and software engineering. While there are many tools used to support these activities, these tools have significant limitations in functionality, efficiency, accuracy, portability, and availability.

To overcome these limitations, we are designing and implementing a new framework for binary code analysis and modification. This framework provides a component architecture to support tools that analyze binary code and modify it both statically (binary rewriting) and dynamically (dynamic instrumentation), allow for interoperability of the static and dynamic code modification, and enable the sharing and leveraging of this complex technology.

Characteristics of this framework include:

multi-architecture, multi-format, and multi-operating system;
library-based, so that components can be used separately as needed;
open source, to allow both local control and auditing;
extensible data structures, so that new analyses and interfaces can be added easily;
exportable data structures, so that all analysis products will be stored in a format that can be readily used by other tools;
batch enabled, so that tools can operate effectively without interactive control;
testable, with each separate component provided with a detailed test suite;
accurate and efficient, using best-known current algorithms and the addition of new algorithms for code parsing;
up to date, handling modern binary code idioms like exceptions, and functions with non-contiguous and shared code.

This component-based approach requires identifying key portable and multi-platform abstractions at every level of functionality. If we (as a community) are successful in transitioning to this model, we can break the ``hero model'' of tool development of having each group trying to produce its own end-to-end complete tool sets.

The Paradyn group is working on several initial library components of this effort, including symbol table parsers, binary code scanners (instruction decoders), binary code parsers (control flow analysis), dynamic code generators, stack walkers, process execution controllers, and a visual binary code editor.

The goal of this talk is to lay out the motivation, plans, and current progress for this project. We also hope to solicit feedback on both its design and functionality.

Scalasca: A Scalable Performance Measurement and Analysis Toolset

Bernd Mohr
Jülich Supercomputing Center

Scalasca is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance analysis process that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. A distinctive feature is its ability to identify wait states in applications with very large numbers of processes. Especially when trying to scale communication-intensive applications to large processor counts, such wait states can present severe challenges to achieving good performance.

In this talk, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of Scalasca are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems.

Data Parallel Operations and the Future of Parallel Programming

David Padua
University of Illinois at Urbana-Champaign

Data parallel programming is the standard paradigm for massively parallel machines and their importance for multicores will increase with the number of processors. Although data parallelism is most often expressed today using loops or SPMD code, the alternative of representing parallelism using array data types and their associated collective operations (including element-by-element operation extensions and reductions) has great advantages. In fact, array operations can encapsulate parallelism and raise the level of abstraction helping in this way program structure and facilitating debugging, optimization, and porting. The main disadvantage of today's array notation is its inability to represent cleanly some classes of array operations and many collective operations in symbolic computing.

In this talk, I will discuss our work at Illinois on extending data parallel operations by adding new functionality to the array data type and going beyond arrays to other aggregate data types, notably sets. Our goal is to represent data parallelism for a wide range of problem domains. We expect these data types and operations not only to facilitate parallel programming, but also to enable a more effective automatic optimization by presenting compilers the additional information provided by the semantics of the data parallel operations.

HCE: a Tool to Automatically Translate C Programs onto Dedicated Architectures for FPGAs

Paolo Palazzari
ENEA and Ylichron Srl

The talk will present the HARWEST Compiling Environment (HCE), a collection of tools which is able to automatically translate an ANSI C program into an optimized, dedicated architecture to be implemented on FPGA technology. The HCE design flow will be reviewed, the optimization techniques will be discussed and some guidelines to write C programs which generate efficient architectures will be given. Results from actual synthesis, originated from C programs, will be reported.

Issues Confronting a System Designer

Pratap Pattnaik
IBM T. J. Watson Research Center

Over the past two decades, the VLSI technologies, the Software development & deployment styles, the I/O & Network technologies, and business models have gone through significant advancements. The role of a system designer is to judiciously synthesize these advancements to produce a system yielding significant economic value to the society. This talk will focus on the

current challenges faced by system designers to achieve a successful system design.

Amorphous Data-parallelism in Irregular Programs

Keshav Pingali
University of Texas, Austin

Client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as graphs and trees. In her 2007 Turing award lecture, Fran Allen raised an important question about such programs: Do irregular programs have data parallelism, and if so, how do we exploit it on multicore processors?

In this talk, we argue using concrete examples that irregular programs have a generalized kind of data parallelism that arises from the use of iterative algorithms that manipulate worklists of various sorts. We then describe the approach taken in the Galois project to exploit this data-parallelism. There are three main aspects to the Galois system: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation. We present experimental results that demonstrate that the Galois approach is practical, and discuss ongoing work on this system.

Automatic Parallelization with Hybrid Analysis

Lawrence Rauchwerger
Texas A&M University

Hybrid Analysis (HA) is a compiler technology that can seamlessly integrate all static and run-time analysis of memory references into a single framework capable of generating sufficient information for most memory related optimizations. In this talk, we will present Hybrid Analysis as a framework to perform automatic parallelization of loops. For the cases when static analysis does not give conclusive results, we extract sufficient conditions which are then evaluated dynamically and can (in)validate the parallel execution of loops. The HA framework has been fully implemented in the Polaris compiler and has parallelized 22 benchmark codes with 99% coverage and speedups superior to the Intel Ifort compiler.

Trends in Computing for Theoretical Physics

Fabio Schifano
University of Ferrara & INFN - Ferrara

The talk will give an overview of the computing trends for theoretical physics. Starting from computing intensive applications relevant in theoretical physics, like the Lattice-QCD and the Spin-Glasses, the talk will present the computing solutions adopted to solve these problems efficiently.

S-Net: Scalability via Decontextualisation

Alex Shafarenko
University of Hertfordshire

While useful for exploitation of locality, the nodal context creates a significant difficulty in achieving application scalability in current distributed computing technologies. Indeed most message passing schemes abstract an algorithm in the form of a set of communicating state-machines, where each machine holds its state until a transition triggered by an incoming message. For regular applications with predictable quasi-periodic behaviour that works really well: the data distribution is static, any communication patterns are stable and performance can be optimised by increasing the ratio of held to exchanged data. For irregular problems such as computational chemistry, advanced CFD, and various fields of scientific and grand challenge engineering computing, scalability tends to require bespoke solutions, but perhaps better results could be achieved by a change in programming paradigm.

This talk will introduce S-Net: a coordination language based on our ideas of extreme encapsulation and what can be described as "decontextualisation". The application is abstracted as a set of black-box components of a purely computational nature. They are thus stripped of all contextual information between instances of message arrival. Any copy of a user-coded box behaves exactly as an original at any time and on any processor in a multiprocessor system. A component has a single-input single-output interface for which the API library consists of only one library function: namely a function for passing data tuples from a box to the distributed coordination environment. Any language can serve as a box language. The S-Net coordination infrastructure manages the components' persistent states and the exchange of messages, which is specified in a very small language that describes the virtual network topology in the form of algebraic expressions. There are only 4 operators (called combinators) used in these expressions and only two special boxes for the purposes of data management and synchronisation. The glue being used for holding boxes together is a variety of typed data streams, hence S-Net can be described as a streaming language.

The fact that the language is so small while being capable of defining all types of networks and network behaviour is due to the extensive use of the type system and nondeterminism. The former minimises notation and provides type-directed routing; the latter is used as a positive factor for reducing the variety of virtual topologies as well as improving network latencies. The crucial aspect of S-Net is its focus on dynamic optimisation techniques and the propensity for self-adaptivity. The talk will briefly touch on these and on the progress in S-Net implementation to date.

Dynamic System Optimization through Performance Modelling: Universality and Decomposition

Y. C. Tay
National University of Singapore

Dynamic optimization of system performance is necessary but hard. Some modeling techniques can help overcome this difficulty. Two techniques illustrate this possibility:

Universality. We could surrender to system complexity by adopting black-box control. However, there is no doubt that we can achieve more if there is an explicit closed-form expression for the black-box function. A model is necessary for deriving such an expression. Ironically, the key to dealing with the complexity is not to model the many intricate details, but to search for simple approximations that are universal, i.e. they apply regardless of hardware, software and workload. This idea is illustrated with the problem of maintaining a target probability for cache misses.

Decomposition. System behavior is determined by multiple factors. They often interact through complex feedback mechanisms. Changing one factor can activate countervailing forces that drive the system to a new equilibrium that is hard to determine. The fixed point for such a feedback cycle is thus a moving target for performance optimization. We need to break the cycle by decoupling the forces, so each factor can be studied in isolation. This idea is illustrated with the problem of preventing a collapse in network performance when users react to congestion (e.g. by aborting connections).

Preparing Climate Science for the Massively Parallel Era

Henry M. TUFO III
University of Colorado at Boulder

We describe the efforts undertaken over the past four years to move NCAR's climate applications from O(100) cores to O(100K) cores. The recipe for for this has been fairly straightforward: modern methods, improved algorithms, and better implementations. The next 1000x jump won't be so easy. First, the long-time integrations required for simulating climate place restrictions on resolution. Second, many of the approximations currently being employed are at or near their resolution limits. Third, because individual thread execution rates have stalled, future systems will have significantly increased parallelism at the chip/node level. Unfortunately, how that parallelism will manifest itself and how it will be harvested remain open questions.

Some Experiments with Machine Learning for Performance Optimisation

Ian Watson
University of Manchester

Dynamic optimisation of systems is now widely used in pursuit of increased performance. There are many instances of dynamic optimisation where it is clear what information should be gathered and how it should be used to tune the system. However, there are many other occasions where it is clear that the use of run time information might be of use, but it is much less clear what to measure and how to use the data. We have been investigating the use of Machine Learning techniques applied to a variety of system problems where these conditions hold. This talk will cover experiences with the consideration of branch prediction as an ML problem,ML approaches to optimise garbage collection and a description of our current work on the use of ML in thread level speculation and the scheduling of transactions.

Home >
Abstracts >

ScalPerf'08

Scalable Approaches to High Performance and High Productivity Computing

Abstracts