PARILLU - Parallel simulation of global illumination |

Next: Locality Preserving Dynamic Load Balancing
With Provably Small Overhead

Prev: Introduction

We present a detailed experimental analysis of the
*task access graph*, defined on the tasks of the HRA.
We give quantitative
results underpinning ``common knowlegde'', that
HRA is difficult to parallelize.
Measurements regarding three
important issues of a parallel implementation are presented:
load balance, communication, and congestion.
The results are rated quantitatively
and indicate that there exist several bottlenecks.
Different graph
partitioners are examined with respect to
their ability to
cope with these bottlenecks. One main finding is
that a spatial partitioning
method overcomes most of the problems well.

We do not perform the measurements on a real parallel implementation. Instead we abstract from several possible techniques of dynamic load balancing and communication reduction by analyzing each radiosity iteration in a static manner. However, this approach does not assume a static partitioning strategy. Instead, it allows conclusions regarding the performance characteristics of any possible real dynamic partitioning method.

We use three partitioners: a naive partitioner, a spatial partitioner and an ``optimum'' partitioner. The optimum partitioner cannot be used in a real implementation, since it needs a priori knowledge of the graph, which itself evolves during computations. We consider the optimum partitioner as a generator of a reference solution.

It turns out, that the task access graph is very poorly partitionable. Even the best partitioner achieves a cut size of only 85 percent of all edges running across processor boundaries, when partitioning into 16 parts. The reason of this poor behaviour is the overwhelming amount of communication due to visibility computations. Therefore, in the following experiments we remove the visibility access edges from the definition of the graph. The resulting task access graphs are better partitionable and allow greater insight in the structure of the HRA.

We see, that the HRA
is a very dynamic algorithm. The amount of work
changes dramatically from iteration to iteration.
We also see that load balancing is strictly necessary,
since balance of load is poor, when not
accounting for load balance at all.
But, the changes in the balance of load are relatively small.
Hence, one could guess that load balancing
in early iterations pays off in later
iterations. An important result
is that load balancing amortizes, which is
confirmed by *load continuity matrices*.
These relate overloaded processor sets of different
iterations to each other.

Measured data on
the cut size and the connectivity balance
for different partitioners is presented. A result
is that all partitioners do achieve comparable bad
partition qualities.
The spatial partitioner seems to be the best choice,
because it exploits locality of reference,
leading to a reduced chance of network congestion.
We discuss this issue based on a new measure,
the *channel usage*.

Next: Locality Preserving Dynamic Load Balancing
With Provably Small Overhead

Prev: Introduction

Robert Garmann, February 22, 2002 |