Architectural Support for Software Transactional Memory

Ali-Reza Adl-Tabatabai

Intel
Transactional memory provides a concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization. Researchers have proposed several different implementations of transactional memory, broadly classified into software transactional memory (STM) and hardware transactional memory (HTM). Both approaches have their pros and cons: STMs provide rich and flexible transactional semantics on stock processors but incur significant overheads. HTMs, on the other hand, provide high performance but implement restricted semantics or add significant hardware complexity.

This talk proposes architectural support for accelerating transactions executed entirely in software. We propose instruction set architecture (ISA) extensions and novel hardware mechanisms that improve STM performance. We adapt a high-performance STM algorithm supporting rich transactional semantics to our ISA extensions; we call the combination hardware-accelerated software transactional memory, or HASTM. HASTM accelerates fully virtualized nested transactions, supports language integration, and provides both object-based and cache-line-based conflict detection. Our results show that (1) HASTM single-thread performance is comparable to a conventional HTM implementation; (2) HASTM scaling is comparable to an STM implementation; and (3) HASTM is resilient to spurious aborts and can scale better than HTM in a multi-core setting. HASTM thus provides the flexibility and rich semantics of STM, while giving the performance of HTM.
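
As a flavor of how such acceleration can help an STM, the toy sketch below models a read barrier that skips redundant bookkeeping once a location has been marked; the Txn class, its fields, and the software-simulated mark set are illustrative assumptions, not the proposed ISA extensions (which would track such marks in hardware).

    # Toy model of hardware-filtered STM read barriers: only the first
    # transactional read of a location pays the full software barrier;
    # the "marked" set stands in for hardware-tracked mark bits.
    class Txn:
        def __init__(self, memory):
            self.memory = memory
            self.read_set = []      # locations to validate at commit
            self.marked = set()     # simulated hardware mark bits

        def read(self, addr):
            if addr not in self.marked:    # slow path: full software barrier
                self.read_set.append(addr)
                self.marked.add(addr)      # repeat reads take the fast path
            return self.memory[addr]

    txn = Txn({0x10: 42})
    txn.read(0x10); txn.read(0x10)
    assert len(txn.read_set) == 1          # barrier work done only once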

Stochastic Analysis and Optimality of Replacement Policies

Gianfranco Bilardi

University of Padova
The performance of the memory hierarchy is crucial to the overall performance of chip multiprocessors, since chip I/O bandwidth is increasingly likely to become a bottleneck as the number of instructions executed per cycle grows. In spite of extensive investigations over nearly half a century, relatively few analytical results are known on the performance of replacement policies for managing the different levels of the memory hierarchy; most studies are based on simulations.

In this talk, we consider a framework for stochastic analysis. The address trace is modeled as a stochastic process characterized by independent, identically distributed LRU stack distances; a wide class of workloads can be modeled by varying the stack-distance distribution. The framework enables analytical evaluation of the miss probability for some well-known policies. It also allows one to derive the optimal policy for a given distribution.
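
As a concrete instance of the kind of result such a framework yields (a standard consequence of the stack-distance model, stated here for illustration): an LRU cache of capacity $C$ hits a reference exactly when its stack distance $D$ satisfies $D \le C$, so under an i.i.d. stack-distance distribution $p(d)$ the steady-state LRU miss probability is simply the tail mass

    $P_{miss}(C) = \Pr[D > C] = \sum_{d > C} p(d)$.

Other policies do not admit such an immediate closed form, which is where a more delicate analysis within the framework comes in.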

(The above line of research is in collaboration with K. Ekanadham and P. Pattnaik of IBM Research and with A. Ferrante and F. Versaci of the University of Padova.)

Productivity: The Quest for Performance without Pain

Calin Cascaval

IBM T. J. Watson Research Center

Productivity is the latest buzzword in High Performance Computing (HPC) programming, motivated by the DARPA HPCS investment of $600 million to change the landscape of HPC and make supercomputing performance available to the masses. Is that an attainable goal? In this talk we present our experience with two programming paradigms that enhance productivity: PGAS languages and Transactional Memory. We discuss some of their advantages in terms of ease of use, the implementation challenges they pose, and how adding these features affects the entire execution stack. We conclude by discussing how these programming paradigms may affect the design of commercial systems.

Adaptive Winograd's Matrix Multiplication

Paolo D'Alberto

Yahoo!
We present a novel, hybrid, and adaptive recursive Strassen-Winograd matrix multiplication (MM) that uses MMs from automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to matrices of any size and shape --i.e., rectangular-- stored in either row- or column-major layout; it is efficiently applicable to both C and FORTRAN implementations and to single, double, and complex precision. In addition, our algorithm divides the computation into sub-MMs of equivalent complexity and does not require any computation to combine the intermediate sub-MM results.

In our experimental evaluation, we achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone on a single core, and up to 19% on a system with two dual-core processors. Even for matrices as small as ${1500}\times{1500}$, our approach already attains a 10% execution-time reduction and, for matrices larger than ${3000}\times{3000}$, it delivers performance that would correspond, for a classic $O(n^3)$ algorithm, to faster-than-peak processor performance (e.g., our algorithm delivers the equivalent of 5 GFLOPS on a single core with 4.4 GFLOPS peak performance, where GotoBLAS achieves only 4 GFLOPS). Our algorithm is therefore faster than any classic MM algorithm could ever be for matrices of these sizes. Furthermore, we present experimental evidence that our algorithm is --for practical purposes-- as accurate as the classic algorithms.
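
For reference, the sketch below is a plain (non-adaptive) Strassen-Winograd recursion over square operands that falls back to the tuned BLAS behind NumPy's matrix product below a cutoff; the cutoff value, the NumPy fallback, and the even-square restriction are illustrative assumptions, not the authors' adaptive, any-shape algorithm.

    import numpy as np

    def winograd_mm(A, B, cutoff=64):
        """Strassen-Winograd recursion: 7 sub-multiplications and 15
        additions per level; delegates to BLAS (np.dot) below the
        cutoff or for odd dimensions (a simplification)."""
        n = A.shape[0]
        if n <= cutoff or n % 2:
            return A @ B                       # base case: vendor BLAS
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
        T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
        M1 = winograd_mm(S2, T2, cutoff)       # the 7 recursive products
        M2 = winograd_mm(A11, B11, cutoff)
        M3 = winograd_mm(A12, B21, cutoff)
        M4 = winograd_mm(S3, T3, cutoff)
        M5 = winograd_mm(S1, T1, cutoff)
        M6 = winograd_mm(S4, B22, cutoff)
        M7 = winograd_mm(A22, T4, cutoff)
        U1 = M1 + M2; U2 = U1 + M4
        C = np.empty_like(A)
        C[:h, :h] = M2 + M3                    # C11
        C[:h, h:] = U1 + M5 + M6               # C12
        C[h:, :h] = U2 - M7                    # C21
        C[h:, h:] = U2 + M5                    # C22
        return C

    A, B = np.random.rand(1024, 1024), np.random.rand(1024, 1024)
    assert np.allclose(winograd_mm(A, B), A @ B)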

High-Performance Embedded Architecture and Compilation (HiPEAC) Roadmap

Koen De Bosschere

Ghent University

The roadmap details several of the key challenges that need to be tackled in the coming decade in order to achieve scalable performance in multi-core systems and to make them a practical mainstream technology for high-performance embedded systems.

The HiPEAC roadmap is organized around 10 central themes: (i) single-core architecture, (ii) multi-core architecture, (iii) interconnection networks, (iv) programming models and tools, (v) compilation, (vi) run-time systems, (vii) benchmarking, (viii) simulation and system modeling, (ix) reconfigurable computing, and (x) real-time systems. For each theme, a list of challenges is identified; in total, the roadmap lists 55 key challenges. This list can serve as a valuable source of reference for researchers active in the field, it can help companies build their own R&D roadmaps, and - although not intended as a tutorial document - it can even serve as an introduction for scientists and professionals interested in learning about high-performance embedded architecture and compilation.

(Joint work with: Koen De Bosschere, Wayne Luk, Xavier Martorell, Nacho Navarro, Mike O’Boyle, Dionisios Pnevmatikatos, Alex Ramirez, Pascal Sainrat, André Seznec, Per Stenström, and Olivier Temam)

A Data-Driven Co-operative Approach to Scaling of Commercial Java Codes

Manish Gupta

IBM India Systems and Technology Lab
The rising power dissipation in microprocessor chips is leading to a fundamental shift in the computing paradigm, one that requires software to exploit ever-increasing levels of parallelism. We describe a data-driven approach, requiring co-operation across multiple layers of software, to scaling J2EE codes to large-scale parallelism. We show that several key characteristics of application and middleware behavior are normally hidden from the JVM runtime and the operating system, and argue that we need to tackle the problem by applying optimizations across the stack (comprising the application, application server, JVM, operating system, and hardware layers).

(When) Will Multiprocessors Hit the Power Wall?

Ben Juurlink

Delft University of Technology
The total number of transistors available on a chip is ever increasing. The extra transistors have mainly been used to exploit ILP through approaches such as superscalar and out-of-order execution, but developments in this area show diminishing returns. We therefore see a switch to chip multiprocessors (CMPs) and the exploitation of more levels of parallelism, i.e., DLP, ILP, and TLP. As we research architectures for future CMPs, the question arises of how many transistors will be available in the future and how many of them can actually be used given thermal constraints. Using all transistors concurrently at the maximum frequency is expected to consume too much power, resulting in too high a temperature; in other words, CMPs are expected to hit the power wall.

In this talk we estimate the power consumption of future CMPs. Our goal is to get an indication of the performance that will be possible in the future, considering thermal aspects. Moreover, what does the power wall imply for computer design? Considering a CMP consisting of Alpha 21264 cores, scaled to future technology using the ITRS roadmap parameters, we calculated several properties, the most important one being the total power. This calculation assumes all cores are active and run at the maximum frequency. The results show that a CMP in 2020, which could contain 625 cores running at 73 GHz, would consume about 60 kW. The roadmap, however, predicts that packaging constrains the maximum power to 198 W, about 304 times less. If the CMP were throttled down to the maximum allowed power, the achievable performance increase over the 2007-2020 time frame would be only about 10x.

Our calculations indicate that CMPs will definitely, and soon, hit the power wall. Power should therefore be considered the major design constraint at every design level, performance measurements should include power efficiency, and transistors can no longer be considered costless.
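
To make the flavor of such an estimate concrete, here is a back-of-envelope dynamic-power model, $P = N \cdot C_{eff} \cdot V^2 \cdot f$; the effective capacitance and supply voltage below are illustrative values calibrated so the toy model lands near the talk's 60 kW figure, not the ITRS-derived parameters actually used.

    # Back-of-envelope dynamic power: P = N * C_eff * V^2 * f (leakage
    # ignored). C_eff and Vdd are illustrative, calibrated to land near
    # the talk's 60 kW; the talk's numbers come from ITRS parameters.
    def chip_power(n_cores, c_eff, vdd, freq):
        return n_cores * c_eff * vdd ** 2 * freq   # watts

    unconstrained = chip_power(625, 3.7e-9, 0.6, 73e9)  # all cores, full clock
    budget = 198.0                                      # packaging limit (W)
    print(f"{unconstrained / 1e3:.1f} kW, "
          f"{unconstrained / budget:.0f}x over budget")
    # Since P scales linearly with f, fitting the budget means cutting the
    # aggregate core-cycles per second (peak throughput) by the same factor.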

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors

Jose F. Martinez

Cornell University

Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance growth by integrating more cores on the die with each new technology generation. In the short term, on-chip integration of a few relatively large cores may yield sufficient throughput when running multiprogrammed workloads. However, harnessing the full potential of CMPs in the long term makes a broad adoption of parallel programming inevitable.

We envision a CMP-dominated future where a diverse landscape of software in different stages of parallelization exists at all times. Unfortunately, in this future, the inherent rigidity in current proposals for CMP designs makes it hard to come up with a ``universal'' CMP that can accommodate this software diversity.

In this talk I will discuss Core Fusion, a CMP architecture where cores can ``fuse'' into larger cores on demand to execute sequential code very fast, while still retaining the ability to operate independently to run highly parallel code efficiently. Core Fusion builds upon a substrate of fundamentally independent cores and conventional memory coherence/consistency support, and enables the CMP to dynamically morph into different configurations to adapt to the changing needs of software at run time. Core Fusion does not require specialized software support, it leverages mature microarchitecture technology, and it can interface with the application through small extensions encapsulated in ordinary parallelization libraries, macros, or directives.

System Wide Software Development Tools for Heterogeneous Programming, Debug and Profiling of Mixed Multi-Core and Acceleration Processors

Ray McConnell

ClearSpeed

Recent Technological Trends and their Impact on System Design

Pratap Pattnaik

IBM T. J. Watson Research Center
Over the past two decades, significant advances in VLSI technology (e.g., Moore's Law), network bandwidth, and disk storage capacity have fueled an unprecedented integration of information technologies into the global economy. These advances have enabled the I/T (information technology) community to develop and deploy containerized and composable software stacks while providing adequate performance to the end users. This model of driving high programmer productivity and responsiveness in I/T enterprises is based partly on software containerization; it also depends significantly on obtaining high performance and robustness from the hardware and the operating systems autonomically. Because programmer productivity and system responsiveness are key factors for global economic growth, significant research and development efforts have been invested in continuing the innovations in hardware and operating systems. This talk will discuss recent technological breakthroughs in VLSI (e.g., high-k dielectrics) and in disk recording (e.g., perpendicular recording), and the role of CMPs and their evolution.

Huge Networks, Tiny Faulty Nodes

Enoch Peserico

University of Padova

As a network of processing elements grows in size and complexity, must the computational resources of individual nodes also grow? Or is there a ``universal processor'' with limited state (say, 32 bits, total) and reliability (say, one out of every hundred operations produces a wrong result) out of which we can build reliable, arbitrarily large networks of arbitrarily complex topology?

In this talk, after presenting some reasons why one might want to build systems out of such skimpy processing units, we'll review what is known and what appear to be the main open problems. In a nutshell, with a handful of (unreliable) bits, one might be able to achieve much more than might be expected.

Network-Oblivious Algorithms

Andrea Pietracaprina

University of Padova

The design of algorithms that can run unchanged yet efficiently on a variety of machines characterized by different degrees of parallelism and communication capabilities is a highly desirable goal. We propose a framework for network-obliviousness based on a model of computation where the only parameter is the problem's input size. Algorithms are then evaluated on a model with two parameters, capturing parallelism and granularity of communication. We show that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in a block-variant of the Decomposable BSP model, which effectively describes a wide and significant class of parallel platforms. We illustrate our framework by providing optimal network-oblivious algorithms for a few key problems, and also establish some negative results.

The Cache-Oblivious Gaussian Elimination Paradigm

Vijaya Ramachandran

University of Texas at Austin

A cache-oblivious algorithm is a system-independent, cache-efficient algorithm that simultaneously adapts to all levels of a multi-level memory hierarchy.

In this talk we present the cache-oblivious Gaussian Elimination Paradigm (GEP), a cache-oblivious framework for problems solvable using a triply nested loop construct similar to the computation in Gaussian elimination without pivoting.
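
Concretely, the GEP construct is a triply nested loop whose update may read only $c[i][j]$, $c[i][k]$, $c[k][j]$, and $c[k][k]$; the sketch below renders that construct in Python and instantiates it with Floyd-Warshall (a minimal sketch of the loop form itself, not of the cache-oblivious reorganization).

    def gep(c, f):
        """Generic GEP triply nested loop; the update f is restricted to
        the values c[i][j], c[i][k], c[k][j], and c[k][k]."""
        n = len(c)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])

    # Floyd-Warshall all-pairs shortest paths as a GEP instance:
    INF = float("inf")
    dist = [[0, 3, INF], [INF, 0, 1], [2, INF, 0]]
    gep(dist, lambda cij, cik, ckj, ckk: min(cij, cik + ckj))
    assert dist[0][2] == 4   # path 0 -> 1 -> 2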

The traditional GEP code executes $O(n^3)$ operations and incurs $O(n^3 / B)$ I/Os (transfers of memory blocks), where $B$ is the block size. Our cache-oblivious version executes the same $O(n^3)$ operations but incurs only $O(n^3 / (B \sqrt{M}))$ I/Os, where $M$ is the size of the cache.

We present two versions of cache-oblivious GEP:

  • IGEP, which is in-place and applies to Gaussian elimination without pivoting, Floyd-Warshall's all-pairs shortest paths, and other important special cases of GEP, and
  • CGEP, which is completely general and applies to all GEP problems, at the expense of using a modest amount of extra space.

We compare the performance of IGEP both to traditional code and to high-performance BLAS. Our results show that IGEP incurs fewer cache misses than BLAS. In terms of performance, IGEP is somewhat slower than BLAS, but overall it provides a good trade-off between computation speed on one hand, and simplicity and portability on the other. Our code for IGEP and CGEP is easy to parallelize, leading to good scalable performance for these important applications.

Both IGEP and CGEP have potential application in optimizing compilers as cache-oblivious tiling techniques.

(Joint work with Rezaul Alam Chowdhury)

Improving Performance by Improving Memory Sharing

Larry Rudolph

MIT & VMWare

The challenge for chip multiprocessors is to keep the memory bandwidth requirements constant while increasing the number of operations happening in parallel. This assumes that there are enough parallel activities to keep all the processors or cores active, and that there is some time-sharing of the processors to hide the memory latency. In such cases, gang scheduling for parallel jobs reduces the memory footprint. Not only that: some form of gang scheduling also pays off for independent, non-communicating jobs. There may be many identical but independent tasks or jobs, such as Apache servers, that can share pages of code and data. We show how to use machine virtualization techniques to share identical pages across processors. The challenge is how to modify the scheduler to take advantage of this sharing.
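
A minimal sketch of the content-based sharing idea, assuming pages are compared by a content hash and the shared copy would be mapped copy-on-write by the virtual machine monitor; the function and its in-Python page model are hypothetical, not the actual implementation.

    import hashlib

    def share_pages(pages):
        """Map virtual pages with identical contents onto one physical
        copy; a real VMM would mark the shared copy copy-on-write."""
        canonical = {}                    # content digest -> shared page
        mapping = []                      # virtual page -> physical page
        for page in pages:
            digest = hashlib.sha256(page).digest()
            if digest not in canonical:
                canonical[digest] = page  # first copy becomes the shared one
            mapping.append(canonical[digest])
        return mapping, canonical

    # Two identical 4 KB code pages collapse to a single physical page:
    pages = [b"\x90" * 4096, b"\x90" * 4096, b"\x00" * 4096]
    mapping, canonical = share_pages(pages)
    assert mapping[0] is mapping[1] and len(canonical) == 2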

On-Chip/Off-Chip Multiprocessing; Shared Memory or Message Passing

Einar Rustad

Dolphin Interconnect Solutions

We see a clear trend in multi-core development for standard microprocessors, with on-chip cores presenting a cache-coherent shared-memory view. How does this relate to multiprocessing beyond the chip and the programming paradigms involved? Can an efficient off-chip cache coherency protocol provide a seamless shared-memory view to ease the burden on programmers creating parallel software?

Lattice QCD on Cell

Fabio Schifano

University of Ferrara & INFN - Ferrara
Lattice QCD (LQCD) is the physical theory used to study with high accuracy the interactions of the elementary particles of matter, and it is a grand-challenge computing problem. The talk gives an overview of the HPC systems used in the past and today to attack LQCD, and focuses on the development of a new generation of HPC systems based on the Cell BE processor.

Transactional Memory for a Modern Microprocessor

Marc Tremblay

Sun Microsystems

Transactional Memory has emerged as a leading technique that enables applications to better take advantage of multi-threaded, multi-core microprocessors. Setting goals for the scope of an implementation of Transactional Memory is a key milestone that has a pervasive impact upon the overall architecture of a modern microprocessor. In this talk, I will describe what we believe is the first hardware implementation of Transactional Memory. The synergy between a modern pipeline capable of handling today's memory latencies and support for sophisticated multithreading is the key enabler of our approach to Transactional Memory. This will be the second public disclosure of TM support on the Rock microprocessor.

Stability and Efficiency of a Random Local Load Balancing Protocol

Eli Upfal

Brown University

We study the long term (steady state) performance of a simple, randomized, local load balancing technique. We assume a system of $n$ processors connected by an arbitrary network topology. Jobs are placed in the processors by a deterministic or randomized adversary. The adversary knows the current and past load distribution in the network and can use this information to place the new tasks in the processors. The adversary can put a number of new jobs in each processor, in each step, as long as the (expected) total number of new jobs arriving at a given step is bounded by $\lambda n$. A node can execute one job per step, and also participate in one load balancing operation in which it can move tasks to a direct neighbor in the network. In the protocol we analyze here, a node equalizes its load with a random neighbor in the graph.
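
To illustrate the protocol, here is a toy simulation under assumptions that are not from the talk: a ring topology, Bernoulli arrivals of rate $\lambda$ per node in place of the adversary, and one equalization per step between a single random node and a random neighbor.

    import random

    def simulate(n=64, lam=0.9, steps=100_000, seed=1):
        """Toy run of the equalization protocol on a ring; arrivals are
        Bernoulli per node (a simplification of the adversarial model)."""
        random.seed(seed)
        load = [0] * n
        for _ in range(steps):
            for i in range(n):                      # new jobs arrive
                load[i] += random.random() < lam
            load = [max(0, x - 1) for x in load]    # each node serves one job
            i = random.randrange(n)                 # one balancing operation:
            j = random.choice((i - 1, i + 1)) % n   # random ring neighbor
            total = load[i] + load[j]
            load[i], load[j] = total // 2, total - total // 2
        return sum(load) / n

    print(f"mean load per node: {simulate():.2f}")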

We first study the stability of a system running our load balancing protocol. Clearly, if $\lambda >1$ the system cannot be stable. We show that for any $\lambda<1$, and any connected network topology, the system is stable.

When the system is stable, the next performance parameter of interest is the waiting time of jobs. We develop high probability bounds and bounds on the expectation of the waiting time of jobs in terms of the network topology. In particular, if the network is an expander graph the expected wait of a task is $O(\log n)$, and the waiting time of a task that enters the network at an arbitrary time is $O(\log n)$ with high probability.

We contrast these results with the work stealing load balancing protocol, where we show that, in sparse networks, the load in the system and the execution time can be exponential in the network size.

(Joint work with Aris Anagnostopoulos and Adam Kirsch.)

An Asynchronous and Reconfigurable Computational Model for Multi/Many-Core Processors

Lorenzo Verdoscia

CNR - ICAR
Nowadays the radical changes that computer architectures are undergoing reflect the state of information technology. Multi-core processing is now mainstream, and the future will be massively parallel computing performed on many-core processors. As has been argued, ``successful many-core architectures and supporting software technologies could reset microprocessor hardware and software roadmaps for the next 30 years.'' As Moore's Law and nanoscale physics conspire to force chip designers to add transistors rather than increase processor clocks, many people see the creation of many-core architectures (hundreds to thousands of cores per processor) as a natural evolution of multi-core. On the other hand, if many-core is destined to be the way forward, a new parallel computing environment will need to be developed.

One of the main issues is the type of hardware building block to be used for many-core systems. On this topic, the scientific and industrial community has taken a reasonably definitive stand that envisions processors with thousands of simple (i.e., RISC) processing cores: small, simple cores are the most efficient structures for parallel codes, providing the best tradeoff between energy consumption, performance, and manufacturability. Intel's own 80-core terascale prototype processor, Polaris, uses simple RISC-type cores to achieve a teraflop (in less than 70 watts!), although the company implied that commercial versions would support Intel Architecture-based cores. Yet even 80 cores is an order of magnitude fewer than is typical of a many-core design. Advances in integrated circuit technology thus pose a new challenge: how to implement a high-performance, low-power application on processors built from hundreds of (coarse- or fine-grained) cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.

From the architectural point of view, this new trend raises at least two questions: how to exploit such spatial parallelism, and how to program such systems.
The first question leads us to seriously reconsider the dataflow paradigm, given the fine-grain nature of its operations. Instead of carrying out a set of operations in sequence as a von Neumann processor does, a many-core dataflow processor could calculate a function by first connecting and configuring a number of identical simple cores as a dataflow graph and then letting data flow asynchronously through them. Despite the model's simplicity, in the past technology limits and the heterogeneity of links and actor I/O conditions shifted this approach more and more towards the von Neumann model. But with the introduction of the homogeneous High-Level Dataflow System (hHLDS), we believe the dataflow approach is once again a valid proposal to increase performance and seriously challenge the von Neumann model, at least at the processor level. A toy sketch of this execution model follows.
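
In the sketch below, identical two-input actors are wired into a dataflow graph and each fires as soon as both operand tokens arrive; this is an illustrative toy of the general firing-rule idea, not the hHLDS actor semantics.

    import operator

    class Actor:
        """A two-input dataflow actor: fires when both operands arrive."""
        def __init__(self, op):
            self.op, self.inputs, self.wires = op, [None, None], []

        def offer(self, port, token):
            self.inputs[port] = token
            if None not in self.inputs:            # firing rule
                result, self.inputs = self.op(*self.inputs), [None, None]
                for actor, port in self.wires:     # tokens flow downstream
                    actor.offer(port, result)

    class Sink(Actor):
        def __init__(self): self.value = None
        def offer(self, port, token): self.value = token

    # Configure "cores" as the graph for f(a, b, c) = (a + b) * c:
    add, mul, out = Actor(operator.add), Actor(operator.mul), Sink()
    add.wires.append((mul, 0))
    mul.wires.append((out, 0))
    add.offer(0, 2); add.offer(1, 3); mul.offer(1, 4)   # inject tokens
    assert out.value == 20
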
The second question leads us to seriously reconsider the functional programming style, given its intrinsic simplicity in writing parallel programs. Functional languages have three key properties that make them attractive for parallel programming: they have powerful mechanisms for abstracting over both computation and coordination; they eliminate unnecessary dependencies; and their high-level coordination achieves a largely architecture-independent style of parallelism. Moreover, thanks to properties that stem from the mathematical foundation of functions, functional programming constitutes a valid computational model for naturally extracting the fine-grain parallelism that programs present and a dataflow engine can exploit.

Enabling ``Rugged'' HPC

Pierfrancesco Zuccato

Eurotech SpA