Dynamically Computing the Performance of Statically Optimal Page Replacement
University of Padova
Implementing the optimal replacement policy proposed by Belady in the
sixties requires static, off-line knowledge of the address trace. We
show that, nevertheless, the number of misses incurred while
processing any given prefix of the trace only depends upon that prefix
and can be efficienlty computed by an on-line algorithm,
simultaneously for all cache sizes.
Adaptative Libraries for Heterogeneous Multicore Architectures
Irisa/University of Rennes/Caps entreprise
Libraries are at the heart of high performance systems (embedded or not) and users should be encouraged to use them when writing efficient applications for heterogeneous multicores. However, performance of such code depends on numerous factors (problem size, workload, hardware configuration, ...), which results into the need of several implementations addressing different hardware components or specific input data cases. The best implementation has to be chosen at runtime according to the routine input parameters.
In this talk we present the Library Adapter that allows exploiting
heterogeneous implementations of a library routine. The proposed
solution deals with runtime decisions as well as the configuration of
the target platform.
Continuous Program Optimization
IBM T. J. Watson Research Center
As the complexity of both machine architecture and software systems
increases, it is likely that neither current static nor dynamic
compilation techniques are going to be adequate to handle emerging
workloads written in multiple languages and requiring optimizations
across a complex software stack. In this talk we present a design for
next generation compilation systems that combines static analysis and
optimization with dynamic code generation and machine optimization.
Most current compilers have built-in extensive static analysis
algorithms that are very powerful but limited by a lack of runtime
information. Dynamic compilers have access to a wealth of runtime
information, but can not afford expensive optimization algorithms. By
combining dynamic and static compilation we allow the system to
exploit the best features of both approaches. Our infrastructure
provides for extensive program behavior monitoring across the
execution stack. Performance data is collected at runtime and used by
the dynamic compiler as well as stored externally for post-run
analysis and profile directed optimization in the static compiler.
This allows for on-the-fly, profile driven optimization without
burdening the user.
Collecting and Exploiting Information on the Dynamic Behavior of OpenMP Code
University of Houston
Each OpenMP compiler has its own runtime environment, customized to
suit the chosen implementation strategy. The OpenMP ARB's Tools
Committee therefore developed an interface known as the Collector API
to support the statistical sampling and tracing of OpenMP
applications. This API enables performance tools to collect
information about a program's runtime behavior by defining a set of
states and events that can be dynamically queried. We are extending
the runtime system that is part of our OpenUH compiler framework to
provide one of the first implementations of the Collector API. We have
moreover extended the API sanctioned by the OpenMP ARB in order to
fully support the events needed by popular profiling and tracing tools
such as TAU and KOJAK/Scalasca. We describe this effort and our
current work that explores how this infrastructure might be useful for
Increasing Granularity of OpenMP Tasks
The latest OpenMP specification (3.0) introduced a new execution model
that allows the creation of independent execution units called
tasks. The OpenMP specification purposely leaves freedom to the
implementation of the scheduling of these tasks as long as a few
restrictions are met. This allows the implementations to optimize task
creation by embedding one task into another thus increasing the
granularity. In this presentation we describe our work to implement
dynamic granularity control in the context of OpenMP and how the
compiler allows us to reduce the associated overheads.
Enabling Dynamic Optimization and Adaptation for Statically Compiled Programs Using Function Multi-Versioning
Applying the right compiler optimizations to a program can considerably improve performance, reduce code size and in some cases power consumption. However finding a good solution in large non-linear optimization spaces is a tedious and time-consuming task particularly on rapidly evolving architectures. Iterative compilation and machine learning have been recently introduced to automate and speed-up the search process. However, similar to traditional feedback-directed compilation, iterative compilation adapts program only to one specific architecture and dataset at a time, and do not take into account varying dynamic behavior of the system. On the other hand, dynamic and JIT optimizations can adapt to any context at run-time but are limited by analysis and optimization time constraints, usually do not apply aggressive optimizations and often require complex run-time recompilation frameworks.
In this talk I will present our on-going work to enable run-time
optimization and adaptation for statically compiled programs. Unlike
other complex dynamic recompilation frameworks, our technique relies
on function/loop versioning and static low-overhead monitoring with
adaptation routines. We perform costly iterative search for powerful
optimizations off-line using machine learning to find different
optimizations for programs with multiple datasets and varying system
context. I will present preliminary results demonstrating our
adaptation techniques for programs with regular and irregular behavior
using MILEPOST GCC, Collective Compilation Framework and MiBench with
MiDatasets. We plan to use our techniques for continuous program
optimizations, generation of adaptive static libraries, program
adaptation on multi-ISA architectures, split-compilation, etc. We
also believe that our techniques may simplify future JIT optimizers by
passing hints from the analysis and optimization stages of current
static compilers to the JIT as annotations and at run-time using
statistical and machine learning techniques based on program static
and dynamic features to quickly figure out which optimizations to use.
Scalable Resource Management in Grid Environments
Universitat Autònoma de Barcelona
I will present our experiences in the development of CrossBroker, a
job management service that provides transparent and reliable support
for executing parallel and interactive applications on the Grid. I
will outline the main components of CrossBroker and how they interact
with other middleware services. In addition specific features of the
scheduler used to guarantee resource co-allocation for running MPI
jobs remotely over multiple resources spread across several Grid
sites, or starting interactive applications as fast as possible will
be described. Although resource allocation in Grid environments is
generally carried out under the assumption that there is one primary
scheduling system, it becomes increasingly likely that scalable
deployments will involve multiple independent schedulers allocating
jobs over the same resources. Therefore I will explain how we have
applied common techniques borrowed from IP networks - particularly
approaches based on Random Early Detection (RED) buffer management and
its subsequent extensions/variations to schedule jobs on the Grid.
The SVP Model and the Place Abstraction
University of Amsterdam
This talk will introduce the SVP model and in particular look at
various implementations of the model's abstraction of place. It will
also propose a novel hierarchical architecture and resource-management
protocol for the delegation of work within ubiquitous and
heterogeneous environments. The protocol is based on serving an SVP
place, where a component of work is delegated together with the
responsibility for meeting any non-functional computational
requirements such as deadline or throughput constraints. The protocol
is based on a market where SANE processors bid for jobs to execute and
selection is based on a cost model that reflects the energy required to
meet the jobs requirements.
Dynamic vs Static Binary Translation
In this talk, I will discuss dynamic and static binary translation, in
particular the technologies and products developed by Intel. I would
explore the tradeoffs of the two approaches to binary translation.
The Performance Engineering Research Institute
This talk will describe the Performance Engineering Research Institute
(PERI), an Enabling Technologies Institute of the Scientific Discovery
through Advanced Computing 2 (SciDAC-2) program of the U.S. Department
of Energy (DOE) Office of Science (SC). As we look to the future,
achieving expected levels of performance on high-end computing (HEC)
systems is growing ever more challenging due to enormous scale,
increasing architectural complexity, and increasing application
complexity. To address these challenges, PERI is implementing a
unified, tripartite research plan encompassing: (1) performance
modeling and prediction; (2) automatic performance optimization; and
(3) performance engineering of high profile applications. The PERI
performance modeling and prediction activity is developing and
refining performance models, significantly reducing the cost of
collecting the data upon which the models are based, and increasing
model fidelity, speed and generality. Our primary research activity
is automatic tuning of scientific software. This is spurred by the
strong user preference for automatic tools and is based on previous
successful activities such as ATLAS, which has automatically tuned
components of the LAPACK linear algebra library, the highly successful
FFTW library, and other recent work. Our third major component is
application engagement, in which we are working directly with SciDAC-2
applications. This last activity not only helps DOE scientists meet
their near-term performance goals, but will also help keep PERI's
research focused on the real challenges facing DOE computational
scientists as they enter the Petascale era.
A Framework for Binary Code Analysis and Static and Dynamic Patching
University of Wisconsin, Madison
Tools that analyze and modify binary code are crucial to many areas of
computer science, including cyber forensics, program tracing,
debugging, testing, performance profiling, performance modeling, and
software engineering. While there are many tools used to support
these activities, these tools have significant limitations in
functionality, efficiency, accuracy, portability, and availability.
To overcome these limitations, we are designing and implementing a new
framework for binary code analysis and modification. This framework
provides a component architecture to support tools that analyze binary
code and modify it both statically (binary rewriting) and dynamically
(dynamic instrumentation), allow for interoperability of the static
and dynamic code modification, and enable the sharing and leveraging
of this complex technology.
Characteristics of this framework include:
- multi-architecture, multi-format, and multi-operating system;
- library-based, so that components can be used separately as needed;
- open source, to allow both local control and auditing;
- extensible data structures, so that new analyses and interfaces can be added easily;
- exportable data structures, so that all analysis products will be stored in a format that can be readily used by other tools;
- batch enabled, so that tools can operate effectively without interactive control;
- testable, with each separate component provided with a detailed test suite;
- accurate and efficient, using best-known current algorithms and the addition of new algorithms for code parsing;
- up to date, handling modern binary code idioms like exceptions, and functions with non-contiguous and shared code.
The Paradyn group is working on several initial library components of
this effort, including symbol table parsers, binary code scanners
(instruction decoders), binary code parsers (control flow analysis),
dynamic code generators, stack walkers, process execution controllers,
and a visual binary code editor.
The goal of this talk is to lay out the motivation, plans, and current
progress for this project. We also hope to solicit feedback on both
its design and functionality.
Scalasca: A Scalable Performance Measurement and Analysis Toolset
Jülich Supercomputing Center
Scalasca is a performance toolset that has been specifically designed
to analyze parallel application execution behavior on large-scale
systems. It offers an incremental performance analysis process that
integrates runtime summaries with in-depth studies of concurrent
behavior via event tracing, adopting a strategy of successively
refined measurement configurations. A distinctive feature is its
ability to identify wait states in applications with very large
numbers of processes. Especially when trying to scale
communication-intensive applications to large processor counts, such
wait states can present severe challenges to achieving good
In this talk, we review the current toolset architecture, emphasizing
its scalable design and the role of the different components in
transforming raw measurement data into knowledge of application
execution behavior. The scalability and effectiveness of Scalasca are
then surveyed from experience measuring and analyzing real-world
applications on a range of computer systems.
Data Parallel Operations and the Future of Parallel Programming
University of Illinois at Urbana-Champaign
Data parallel programming is the standard paradigm for massively parallel machines and their importance for multicores will increase with the number of processors. Although data parallelism is most often expressed today using loops or SPMD code, the alternative of representing parallelism using array data types and their associated collective operations (including element-by-element operation extensions and reductions) has great advantages. In fact, array operations can encapsulate parallelism and raise the level of abstraction helping in this way program structure and facilitating debugging, optimization, and porting. The main disadvantage of today's array notation is its inability to represent cleanly some classes of array operations and many collective operations in symbolic computing.
In this talk, I will discuss our work at Illinois on extending data parallel operations by adding new functionality to the array data type and going beyond arrays to other aggregate data types, notably sets. Our goal is to represent data parallelism for a wide range of problem domains. We expect these data types and operations not only to facilitate parallel programming, but also to enable a more effective automatic optimization by presenting compilers the additional information provided by the semantics of the data parallel operations.
HCE: a Tool to Automatically Translate C Programs onto Dedicated Architectures for FPGAs
ENEA and Ylichron Srl
The talk will present the HARWEST Compiling Environment (HCE), a collection of tools which is able to automatically translate an ANSI C program into an optimized, dedicated architecture to be implemented on FPGA technology. The HCE design flow will be reviewed, the optimization techniques will be discussed and some guidelines to write C programs which generate efficient architectures will be given. Results from actual synthesis, originated from C programs, will be reported.
Issues Confronting a System Designer
IBM T. J. Watson Research Center
Over the past two decades, the VLSI technologies, the Software
development & deployment styles, the I/O & Network technologies, and
business models have gone through significant advancements. The role
of a system designer is to judiciously synthesize these advancements
to produce a system yielding significant economic value to the
society. This talk will focus on the
current challenges faced by system designers to achieve a successful system design.
Amorphous Data-parallelism in Irregular Programs
University of Texas, Austin
Client-side applications running on multicore processors are likely to
be irregular programs that deal with complex, pointer-based data
structures such as graphs and trees. In her 2007 Turing award
lecture, Fran Allen raised an important question about such programs:
Do irregular programs have data parallelism, and if so, how do we
exploit it on multicore processors?
In this talk, we argue using concrete examples that irregular programs
have a generalized kind of data parallelism that arises from the use
of iterative algorithms that manipulate worklists of various sorts. We
then describe the approach taken in the Galois project to exploit this
data-parallelism. There are three main aspects to the Galois system:
(1) a small number of syntactic constructs for packaging optimistic
parallelism as iteration over ordered and unordered sets, (2)
assertions about methods in class libraries, and (3) a runtime scheme
for detecting and recovering from potentially unsafe accesses to
shared memory made by an optimistic computation. We present
experimental results that demonstrate that the Galois approach is
practical, and discuss ongoing work on this system.
Automatic Parallelization with Hybrid Analysis
Texas A&M University
Hybrid Analysis (HA) is a compiler technology
that can seamlessly integrate all static and
run-time analysis of memory references into a
single framework capable of generating sufficient
information for most memory related optimizations.
In this talk, we will present Hybrid Analysis as a framework
to perform automatic parallelization of loops.
For the cases when static analysis does not give conclusive results,
we extract sufficient conditions which are then evaluated
dynamically and can (in)validate the parallel execution of loops.
The HA framework has been fully implemented in the Polaris compiler
and has parallelized 22 benchmark codes with 99% coverage
and speedups superior to the Intel Ifort compiler.
Trends in Computing for Theoretical Physics
University of Ferrara & INFN - Ferrara
The talk will give an overview of the computing trends for theoretical
physics. Starting from computing intensive applications relevant in
theoretical physics, like the Lattice-QCD and the Spin-Glasses, the
talk will present the computing solutions adopted to solve these
S-Net: Scalability via Decontextualisation
University of Hertfordshire
While useful for exploitation of locality, the nodal context creates a
significant difficulty in achieving application scalability in current
distributed computing technologies. Indeed most message passing
schemes abstract an algorithm in the form of a set of communicating
state-machines, where each machine holds its state until a transition
triggered by an incoming message. For regular applications with
predictable quasi-periodic behaviour that works really well: the data
distribution is static, any communication patterns are stable and
performance can be optimised by increasing the ratio of held to
exchanged data. For irregular problems such as computational
chemistry, advanced CFD, and various fields of scientific and grand
challenge engineering computing, scalability tends to require bespoke
solutions, but perhaps better results could be achieved by a change in
This talk will introduce S-Net: a coordination language based on our
ideas of extreme encapsulation and what can be described as
"decontextualisation". The application is abstracted as a set of
black-box components of a purely computational nature. They are thus
stripped of all contextual information between instances of message
arrival. Any copy of a user-coded box behaves exactly as an original
at any time and on any processor in a multiprocessor system. A
component has a single-input single-output interface for which the API
library consists of only one library function: namely a function for
passing data tuples from a box to the distributed coordination
environment. Any language can serve as a box language. The S-Net
coordination infrastructure manages the components' persistent states
and the exchange of messages, which is specified in a very small
language that describes the virtual network topology in the form of
algebraic expressions. There are only 4 operators (called combinators)
used in these expressions and only two special boxes for the purposes
of data management and synchronisation. The glue being used for
holding boxes together is a variety of typed data streams, hence S-Net
can be described as a streaming language.
The fact that the language is so small while being capable of defining
all types of networks and network behaviour is due to the extensive
use of the type system and nondeterminism. The former minimises
notation and provides type-directed routing; the latter is used as a
positive factor for reducing the variety of virtual topologies as well
as improving network latencies. The crucial aspect of S-Net is its
focus on dynamic optimisation techniques and the propensity for
self-adaptivity. The talk will briefly touch on these and on the
progress in S-Net implementation to date.
Dynamic System Optimization through Performance Modelling: Universality and Decomposition
National University of Singapore
Dynamic optimization of system performance is necessary but hard.
Some modeling techniques can help overcome this difficulty.
Two techniques illustrate this possibility:
Universality. We could surrender to system complexity by adopting black-box control. However, there is no doubt that we can achieve more if there is an explicit closed-form expression for the black-box function. A model is necessary for deriving such an expression. Ironically, the key to dealing with the complexity is not to model the many intricate details, but to search for simple approximations that are universal, i.e. they apply regardless of hardware, software and workload. This idea is illustrated with the problem of maintaining a target probability for cache misses.
Decomposition. System behavior is determined by multiple factors. They often interact through complex feedback mechanisms. Changing one factor can activate countervailing forces that drive the system to a new equilibrium that is hard to determine. The fixed point for such a feedback cycle is thus a moving target for performance optimization. We need to break the cycle by decoupling the forces, so each factor can be studied in isolation. This idea is illustrated with the problem of preventing a collapse in network performance when users react to congestion (e.g. by aborting connections).
Preparing Climate Science for the Massively Parallel Era
University of Colorado at Boulder
We describe the efforts undertaken over the past four years to move NCAR's climate applications from O(100) cores to O(100K) cores. The recipe for for this has been fairly straightforward: modern methods, improved algorithms, and better implementations. The next 1000x jump won't be so easy. First, the long-time integrations required for simulating climate place restrictions on resolution. Second, many of the approximations currently being employed are at or near their resolution limits. Third, because individual thread execution rates have stalled, future systems will have significantly increased parallelism at the chip/node level. Unfortunately, how that parallelism will manifest itself and how it will be harvested remain open questions.
Some Experiments with Machine Learning for Performance Optimisation
University of Manchester
Dynamic optimisation of systems is now widely used in pursuit of increased performance. There are many instances of dynamic optimisation where it is clear what information should be gathered and how it should be used to tune the system. However, there are many other occasions where it is clear that the use of run time information might be of use, but it is much less clear what to measure and how to use the data. We have been investigating the use of Machine Learning techniques applied to a variety of system problems where these conditions hold. This talk will cover experiences with the consideration of branch prediction as an ML problem,ML approaches to optimise garbage collection and a description of our current work on the use of ML in thread level speculation and the scheduling of transactions.