An Efficient Use of Threads for SmartSpice Parallelization


As circuit size and simulation time increase dramatically, any technique that reduces computational time can be crucial to improving productivity.

Our approach takes advantage of recent technology advances in both hardware and software on multiprocessor SMP machines. It relies on intensive use of multi-threaded programming via the IEEE POSIX 1003.1c (Pthreads) implementation.

In the first part we briefly present the concept of multi-threading; we then show how we have used it to parallelize SmartSpice.



With multiprocessor systems available on several popular architectures, multi-threaded programming provides a powerful way to speed up computationally intensive programs. This part describes how to take advantage of some of the capabilities offered by multi-threading in a shared memory multiprocessor environment.

Threads are a new and efficient way to utilize the parallelism of the hardware. A thread is a sequence of execution within a process. A traditional single-threaded application follows a single sequence of control while executing. A multi-threaded process has several sequences of control, which allows several independent actions to occur at the same time. In addition, the threads can share a large amount of data within the same address space. This provides extremely high bandwidth and low latency communication between parallel tasks within a single application.

To speed up the program, the computation must be divided into a set of tasks so that the tasks interact little with each other and so that the number of tasks can be easily adapted to the number of processors. There are many ways to make a program parallel, though the main concern is to avoid processor starvation.

A first natural way follows a master-slave paradigm or control structure. The main master thread launches a set of slave threads, each with a predefined portion of the work to be done. The master starts the threads and then waits for all of them to terminate at a synchronization point. In our case this must be done every computational cycle. This paradigm is illustrated in Figure 1.
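The master-slave cycle can be sketched with POSIX threads as follows. This is a minimal illustration, not SmartSpice code: the real per-cycle work is replaced by a dummy chunk summation, and the names `master`, `slave`, `data` and the sizes are ours.

```c
#include <pthread.h>

#define NTHREADS 4
#define N 1000

static double data[N];              /* shared input, visible to all threads */
static double partial[NTHREADS];    /* one result slot per slave: no locking needed */

/* Each slave processes its predefined portion of the work. */
static void *slave(void *arg) {
    int id = (int)(long)arg;
    int lo = id * (N / NTHREADS);
    int hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

double master(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++)
        data[i] = 1.0;
    /* the master launches the slaves... */
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, slave, (void *)(long)t);
    /* ...and waits for all of them at the synchronization point */
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    return total;
}
```

Because each slave writes only its own `partial` slot, the threads communicate through shared memory without any explicit synchronization other than the final joins.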


Figure 1. Master-slave paradigm running on 4 processors


A second useful control structure is the workpile. In this paradigm the threads get their jobs from a set of chunks of work, usually organized as a queue. The threads keep taking new jobs from the queue until it is empty. This second approach can be more flexible than the previous one, as it lets the threads compete with each other for their tasks.
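A workpile can be sketched as a shared counter protected by a mutex: each worker repeatedly claims the next job until none remain. The task body here is a stand-in (marking a flag); `worker`, `run_workpile` and the sizes are illustrative names of ours.

```c
#include <pthread.h>

#define NTASKS 64
#define NTHREADS 4

static int next_task = 0;                               /* head of the queue */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int done[NTASKS];

/* Workers compete for jobs until the queue is empty. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int t = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&qlock);
        if (t < 0)
            return NULL;        /* queue empty: this thread is finished */
        done[t] = 1;            /* stand-in for the real task body */
    }
}

int run_workpile(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    int completed = 0;
    for (int i = 0; i < NTASKS; i++)
        completed += done[i];
    return completed;
}
```

Since threads grab work as they become free, a fast thread naturally ends up doing more tasks than a slow one, which is exactly the load-balancing flexibility described above.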


Multi-Threaded SmartSpice

The first step in parallelizing an application is to isolate the time-consuming tasks, as they will be the most productive in terms of speed-up. In SmartSpice, an execution cycle consists mainly of an assembly phase and a matrix inversion. This is shown in Figure 2, which distinguishes between parallel and sequential execution.


Figure 2. A time step in Parallel SmartSpice on 4 processors


The second step is to analyze these time-consuming tasks to determine which paradigm is best suited to parallelizing them. In standard circuit simulation, most of the time is spent in assembling and inverting a sparse matrix, each row of which represents a node in the circuit. The parallelization of these two phases is discussed in the following sections.

The load phase consists essentially of two nested loops. Every instance of every model in the circuit is successively evaluated and then stamped into a sparse matrix and a right-hand side. The amount of work to be done is known in advance and the computation can easily be split into independent tasks. Load balancing is thus obtained naturally by splitting the loops into chunks of the same size. It is then very convenient and efficient to use a master-slave paradigm to schedule the computation between the threads. The master waits for all slave threads to complete their work and then continues the simulation. The assembly phase presents a typical coarse level of parallelism. Note that the performance can be improved by letting the master thread do a part of the job instead of waiting for the slaves to join it.
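This chunked load phase, including the refinement where the master keeps one chunk for itself, can be sketched as follows. The model evaluation is a hypothetical `load_chunk` that stamps a dummy value; the names and sizes are ours, not SmartSpice internals.

```c
#include <pthread.h>

#define NTHREADS 4
#define NINST 100               /* hypothetical number of device instances */

static double rhs[NINST];       /* stand-in for the matrix + right-hand-side stamps */

/* Evaluate and stamp instances [lo, hi): here a dummy model evaluation. */
static void load_chunk(int lo, int hi) {
    for (int i = lo; i < hi; i++)
        rhs[i] = 2.0 * i;
}

struct chunk { int lo, hi; };

static void *slave(void *arg) {
    struct chunk *c = arg;
    load_chunk(c->lo, c->hi);
    return NULL;
}

void parallel_load(void) {
    pthread_t tid[NTHREADS - 1];
    struct chunk c[NTHREADS];
    int per = NINST / NTHREADS;         /* equal-size chunks: natural balance */
    for (int t = 0; t < NTHREADS; t++) {
        c[t].lo = t * per;
        c[t].hi = (t == NTHREADS - 1) ? NINST : (t + 1) * per;
    }
    /* slaves take chunks 1..NTHREADS-1 */
    for (int t = 1; t < NTHREADS; t++)
        pthread_create(&tid[t - 1], NULL, slave, &c[t]);
    /* the master does chunk 0 itself instead of idling at the join */
    load_chunk(c[0].lo, c[0].hi);
    for (int t = 1; t < NTHREADS; t++)
        pthread_join(tid[t - 1], NULL);
}
```

Equal-size chunks balance well here because each instance evaluation costs roughly the same; the master's own chunk means only NTHREADS - 1 threads need to be created per cycle.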

In SPICE the matrix inversion is performed using a Gaussian elimination with pivoting. Although this LU decomposition with pivoting is a robust algorithm for solving sparse symmetric linear systems, its efficient parallel implementation is not obvious. Before the factorization, a re-ordering of the matrix for sparsity and accuracy may occur. Then, as a first step in parallel factorization, the elimination tree is built. The elimination tree is the smallest structure that shows the dependency information between the eliminations of the unknowns of the system. Each node of the tree represents a column of the matrix. A column corresponding to a node must be computed after all its children nodes in the tree. In the example below, node 3 must be computed after nodes 6 and 9.
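Once parent links of the elimination tree are known, the level of each column (leaves at level 0, each parent above all its children) can be computed in a single pass. This sketch assumes columns are numbered so that every child precedes its parent, as a postorder traversal of the tree guarantees; the function name and the `-1` root convention are ours.

```c
/* Compute the level of each column in the elimination tree.
   parent[i] is the parent column of i, or -1 for a root.
   Assumes parent[i] > i (children numbered before parents),
   so level[i] is final when the loop reaches it.
   Returns the number of levels. */
int tree_levels(const int *parent, int *level, int n) {
    int maxlev = 0;
    for (int i = 0; i < n; i++)
        level[i] = 0;                       /* leaves stay at level 0 */
    for (int i = 0; i < n; i++) {
        if (parent[i] >= 0 && level[parent[i]] < level[i] + 1)
            level[parent[i]] = level[i] + 1; /* parent above its deepest child */
        if (level[i] > maxlev)
            maxlev = level[i];
    }
    return maxlev + 1;
}
```

All columns sharing a level are mutually independent, which is what makes the level-by-level parallel schedule of the next paragraph legal.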

The ancestor-descendant relations contained in the structure of the elimination tree are now used to build a queue of tasks. Each node is assigned to a level so that each level contains completely independent columns whose computations can be made in parallel. The scheduling queue is constructed so that the tasks in the queue have increasing level numbers. Each thread picks up the first available task in the queue and computes the corresponding column, having previously checked that all the tasks of the previous level were completed; this ensures that all dependency constraints are satisfied. This control structure is very close to a workpile paradigm and is the most suitable for very sparse SPICE matrix elimination.
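The level-ordered workpile can be sketched as follows: threads race for the next column within a level, and a barrier between levels plays the role of the "previous level completed" check. The queue contents, level boundaries, and all names are illustrative; the column update is a stand-in flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define NCOLS 8
#define NTHREADS 2
#define NLEVELS 3

/* Queue of columns sorted by increasing level number. */
static int order[NCOLS]    = {0, 1, 3, 5, 2, 4, 6, 7};
static int level_end[NLEVELS] = {4, 7, 8};   /* end index of each level in the queue */
static int next_col;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t level_barrier;
static int factored[NCOLS];

static void *factor_worker(void *arg) {
    (void)arg;
    for (int lev = 0; lev < NLEVELS; lev++) {
        for (;;) {
            pthread_mutex_lock(&qlock);
            int k = (next_col < level_end[lev]) ? next_col++ : -1;
            pthread_mutex_unlock(&qlock);
            if (k < 0)
                break;                       /* this level's queue is empty */
            factored[order[k]] = 1;          /* stand-in for the column update */
        }
        /* all columns of this level are done before anyone starts the next */
        pthread_barrier_wait(&level_barrier);
    }
    return NULL;
}

int run_factorization(void) {
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&level_barrier, NULL, NTHREADS);
    next_col = 0;
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, factor_worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&level_barrier);
    int n = 0;
    for (int i = 0; i < NCOLS; i++)
        n += factored[i];
    return n;
}
```

Within a level the threads behave exactly like a workpile, so load balancing is preserved even when the columns of a level have very different costs.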



The parallelization of SmartSpice has been performed very carefully in order to preserve the accuracy of the simulator. The accuracy of sequential SmartSpice is fully maintained while the simulation speed is noticeably increased. Some speed-ups are presented in the May 1997 issue of The Simulation Standard, and new results will soon be published.

Figure 3. Elimination tree.