Reconfigurable Optical Interconnection Networks
for Shared-Memory Multiprocessor Architectures

Ph.D. thesis by Wim Heirman, 9 July 2008.

PDF version
(2.6 MB)
Powerpoint presentation
(in Dutch, 4.3 MB)
Some pictures of my defense
by Frederik Vandeputte
Keyword cloud
by Wordle

Abstract

Parallel processing in computer systems is gaining importance: not only supercomputers, but today also desktop and laptop computers have multiple processors. To make all these processors work efficiently on a single problem, they have to communicate. To this end, processors are connected to each other, and to the outside world, through a communication network. This network should allow a fast exchange of information, such that the processors receive new data to process in a timely manner.

Current technologies, using electrical signaling, are reaching the end of their capabilities, however. Moreover, the traffic on such networks is very irregular, making some parts of the network highly saturated while other parts are barely used. The precise patterns also depend on the application, and can even change during the run time of a single application. This makes it very difficult to build an efficient communication network.

This doctoral dissertation explores the possibilities of optical, reconfigurable networks. Optical connections are one part of the solution to the communication problem, since they allow much higher data rates. Moreover, optics allows the network to be reconfigured during the course of a program, in a data-transparent way, to match the current traffic patterns. This way, all network connections are utilized more efficiently, which can both increase the network's performance and reduce its power usage.

Summary

Parallel processing is gaining importance. While its applications used to be limited to supercomputing or large web and database servers, the era of the multicore processor is now bringing parallel computing to desktop PCs, laptops and game consoles. Putting all these processors to good use on a single problem requires them to communicate. To this end, the processors are connected to each other, and to the outside world, with an interconnection network. This network should support communication at high bandwidths and low latencies, allowing the processors to concentrate on doing useful work rather than having to wait for data to trickle in. However, current interconnect technologies, using electronic signaling over copper wires, are quickly reaching the end of their capabilities.

Optical interconnections are a possible solution, one that is at the moment being investigated by academic and industrial labs around the world, including big players such as IBM and Intel. It is expected that optically communicating multiprocessor systems will be on the market in just a few years. Multicore processor chips, with on-chip optical networks, may follow in the next five to ten years.

Reconfigurable networks solve another problem of current interconnection networks. The traffic on these networks is usually very non-uniform, causing some of the network links to be highly saturated while others remain mostly unused. The exact patterns depend on the application, and can even change during the runtime of a single program. This makes it hard to define an efficient architecture, i.e., one that provides high performance but is not scaled for the worst-case traffic patterns – which would result in most of the capacity being unused for a large fraction of time.

Fortunately, optical interconnections, which will most likely be introduced in the near future anyway, provide the possibility of creating runtime reconfigurable networks, in a data-transparent way. This can be realized using novel components such as tunable lasers, liquid crystal switches, etc.

This thesis focuses on the application of reconfigurable, optical interconnection networks in the context of shared-memory multiprocessors. In this kind of parallel machine, communication is implicit, i.e., it happens without explicit involvement of the programmer or compiler. This makes shared-memory machines easy to program, since the programmer need not have intimate knowledge of the target machine. It does make such a machine very sensitive to network latency: several clever tricks that overlap communication with useful computation, which are possible under a message-passing paradigm, cannot be used here, because one does not always know exactly when communication will occur. Therefore, this type of machine would benefit the most from reconfigurable network architectures, which are designed precisely to provide lower latency.

One mismatch in an otherwise fitting puzzle is the speed of reconfiguration. Ideally, one would like to reconfigure the network such that minimal latency can be provided for each request. The available optical components do not allow this, however. We therefore looked for locality in the communication, and found that network traffic often remains very similar for long periods of time. Reconfiguration can exploit this fact: by changing the distribution of available bandwidth in the network, such that the most voluminous traffic streams use the fastest connections, latency can be brought down significantly. All this can be done in an automatic, application-invisible way, staying true to the shared-memory abstraction in which the programmer need not be aware of the implementation details of the underlying machine.

Traffic patterns. Since reconfiguration happens, by necessity, at a time scale slower than individual memory accesses, the slow dynamics of network traffic patterns are very important to this work. It is known that memory references exhibit locality in space and time, in a fractal or self-similar way. This locality is exploited by caches, which keep recently accessed data in a small, fast memory – because the probability of the same data being accessed again is high. Due to the self-similar nature of locality, this effect is present at all time scales, from the very fast nanosecond scales exploited by first-level caches, up to the micro- and millisecond scales which are visible on the interconnection network. We modeled this behavior as traffic bursts: periods of high-intensity communication between specific processor pairs. These bursts were observed to be active for up to several milliseconds, on a background of more uniform traffic with a much lower intensity.
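The burst-on-background behavior can be sketched as a toy traffic generator. This is a hypothetical illustration only: the function name, rates, and the per-step distribution are assumptions for the sketch, not measurements or code from the thesis.

```python
import random

def traffic_step(n_procs, bursts, burst_rate=100, background_rate=2,
                 rng=random):
    """One time step of synthetic traffic: message counts per (src, dst) pair.

    `bursts` is the set of processor pairs currently in a high-intensity
    burst; all other pairs only carry low-intensity background traffic.
    """
    traffic = {}
    for src in range(n_procs):
        for dst in range(n_procs):
            if src == dst:
                continue
            rate = burst_rate if (src, dst) in bursts else background_rate
            # Uniform count in [0, 2*rate], so the mean is roughly `rate`.
            traffic[(src, dst)] = rng.randint(0, 2 * rate)
    return traffic

# A burst pair stays active across many consecutive steps (milliseconds),
# which is what makes millisecond-scale reconfiguration worthwhile.
step = traffic_step(4, bursts={(0, 2)}, rng=random.Random(1))
```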

A reconfigurable network architecture. This behavior prompted us to design the following reconfigurable network architecture: a base network with fixed topology, augmented with reconfigurable extra links (elinks). The base network has a regular topology such as a mesh, torus or hypercube, and connects all processors. The reconfigurable elinks are placed such that they provide a direct, high-bandwidth connection between those processor pairs that communicate with the highest intensity. The locations of these elinks are changed after every reconfiguration interval, which is usually on the order of milliseconds in length. This way, most of the high-volume burst traffic is carried by the elinks, while the base network is still available for background traffic, or during reconfiguration of the elinks.
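The elink placement step can be sketched roughly as follows: measure per-pair traffic over the last interval, then assign the limited pool of elinks to the heaviest-communicating pairs. The function name, data layout and greedy selection rule are illustrative assumptions; the thesis's actual placement heuristic may differ.

```python
import heapq

def place_elinks(traffic, n_elinks):
    """Pick the processor pairs that receive a direct elink.

    traffic: dict mapping directed (src, dst) pairs to traffic volume
    measured over the last reconfiguration interval.
    Returns the n_elinks undirected pairs with the highest total volume.
    """
    # Aggregate both directions: one elink serves a processor pair.
    pair_volume = {}
    for (a, b), vol in traffic.items():
        key = (min(a, b), max(a, b))
        pair_volume[key] = pair_volume.get(key, 0) + vol
    # Greedily take the heaviest-communicating pairs.
    return heapq.nlargest(n_elinks, pair_volume, key=pair_volume.get)

# Example: bursts between pairs (0,5) and (2,7) on light background traffic.
traffic = {(0, 5): 900, (5, 0): 850, (2, 7): 700, (1, 3): 40, (4, 6): 30}
print(place_elinks(traffic, 2))  # → [(0, 5), (2, 7)]
```

Re-running this selection once per reconfiguration interval, on freshly measured (or predicted) traffic, is what moves the elinks to follow the bursts.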

We also propose an implementation of this architecture, using tunable lasers and a selective broadcast method that allows for efficient scaling of the architecture to a large number of processors. This implementation was developed in cooperation with the Vrije Universiteit Brussel, who are currently building a prototype of the selective broadcast system.

Speeding up design-space explorations. A large number of free parameters exist in the design of each interconnection network. Reconfiguration adds even more parameters, such as the reconfiguration interval and the number of elinks. Moreover, performance is now much more dependent on the specific characteristics of the network traffic, including its temporal behavior. When an interconnection network is designed, a trade-off has to be made that optimizes all these parameters towards a goal that combines good performance, low power consumption and limited cost. This requires the evaluation of a huge number of design alternatives.

Because dependence on temporal traffic behavior is present only to a limited extent in existing, non-reconfigurable networks, methods that aim to speed up the initial evaluation of network candidates did not account for this temporal behavior. We therefore extended these methods, and proposed new ones that do allow a quick but sufficiently accurate evaluation of reconfigurable network architectures. Our methods are also usable on non-reconfiguring designs, where they can improve on the accuracy of older methods, precisely because temporal behavior is now taken into account.

Performance evaluation. Finally, we explored the performance of our proposed reconfigurable network architecture, under a wide range of workloads – using real benchmark applications – and architectural parameters. We combined large-scale explorations, using our faster methods described before, with detailed studies using highly accurate – but much more time consuming – execution-driven simulations.

Also, since some elements of the architecture rely on heuristics to allow their evaluation in a limited time, we derived upper bounds on the performance attainable if these heuristics were improved. For instance, the placement of the elinks is based on a prediction of network traffic for the next reconfiguration interval. We showed that even perfect traffic prediction could not increase the overall performance by more than a few percent, demonstrating the quality of the heuristics that were developed.

Overall, we found that reconfigurable networks could, under the assumptions and architectural parameters used in our study, improve network performance significantly. Depending on the traffic pattern of the application, and on the characteristics of the reconfigurable components available, latency could be lowered by 10 to 20% on small, 16-processor networks; for larger, 64-processor networks this increased to a 20 to 40% reduction. And when reconfiguration is already present in the system, for instance for reliability reasons, this performance improvement comes at very low added cost.