Parallel SmartSpice:
Fast and Accurate Circuit Simulation Finally Available

Introduction

Circuit simulation is an essential everyday tool for circuit designers, who must constantly verify and debug their circuits during the design process. As engineers face larger, more complex designs and tighter project schedules, fast SPICE simulation with no loss of accuracy has become a necessity. Indeed, simulation accounts for a large portion of the time spent designing and optimizing a new circuit.

Several approaches are used to speed up SPICE circuit simulation, including:

  • Table lookup methods: instead of evaluating the model equations for every device, as a regular SPICE simulator does, this approach precomputes currents and charges as a function of the node voltages and stores them in a lookup table. The table is then consulted to obtain the values for a specific device at its present node voltages (a sketch of the idea follows this list).
  • Circuit partitioning: these methods take advantage of the latency in some parts of the circuit to apply different time steps to different sub-circuits during a timing analysis. This leads to an event-driven simulation in which only a particular event (a voltage or current crossing a threshold) triggers the re-evaluation of a sub-circuit's outputs.
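
To make the first idea concrete, here is a minimal sketch of table-lookup device evaluation, assuming a hypothetical full-model routine bsim_ids() and an illustrative grid size; it is not SmartSpice code. The interpolation error of the table is precisely the loss of accuracy discussed below.

```c
#define NV   64          /* grid points per voltage axis (illustrative) */
#define VMAX 3.3         /* assumed voltage range                       */

static double ids_table[NV][NV];                 /* precomputed currents */

extern double bsim_ids(double vgs, double vds);  /* full model, hypothetical */

/* Evaluate the expensive model once per grid point. */
void build_table(void)
{
    for (int i = 0; i < NV; i++)
        for (int j = 0; j < NV; j++)
            ids_table[i][j] = bsim_ids(i * VMAX / (NV - 1),
                                       j * VMAX / (NV - 1));
}

/* Bilinear interpolation replaces the model evaluation at simulation time. */
double lookup_ids(double vgs, double vds)
{
    double x = vgs / VMAX * (NV - 1), y = vds / VMAX * (NV - 1);
    int i = (int)x, j = (int)y;
    if (i > NV - 2) i = NV - 2;
    if (j > NV - 2) j = NV - 2;
    double fx = x - i, fy = y - j;
    return (1 - fx) * (1 - fy) * ids_table[i][j]
         + fx * (1 - fy)       * ids_table[i + 1][j]
         + (1 - fx) * fy       * ids_table[i][j + 1]
         + fx * fy             * ids_table[i + 1][j + 1];
}
```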

These methods can significantly speed up the simulation of a circuit, but both suffer from a loss of accuracy. Depending on the particular application and circuit specification, that loss may or may not be acceptable. At the final test and verification stage of a circuit, it usually is not.

In order to keep SPICE-level accuracy, the approach chosen to speed up SmartSpice is parallelization: several processors cooperate to simulate one circuit in a fraction of the time a single processor would need. The only prerequisite is the availability of a multiprocessor computer. There are two affordable ways to make multiprocessing capability available:

  • through a compute server: small departments can set up a compute server hosting 4 to 16 processors and a fair amount of memory to serve as a central computing resource, accessed through X terminals, PCs, or low-end workstations.
  • through desktop computers: most desktop workstations nowadays (and even PCs) already host two or more processors readily available to any parallelized application.

The compute server approach has some advantages over the desktop workstation approach. First, a central compute server can accommodate a large number of users, provided they do not all need its computing power at the same time, whereas a desktop workstation is usually reserved for individual use regardless of its actual load over time. Second, the large amount of memory in a compute server is a great help in simulating very large circuits, even if the application remains sequential and uses only a single processor; the same applies to any memory-bound application. Third, because of operating-system background processes, multiprocessor computers from some hardware vendors cannot scale a parallel application all the way up to the number of available processors: a parallel application using 2 processors, for instance, runs faster on a 4-processor computer than on a 2-processor one, since the spare processors absorb the background load.

In the next section, we outline the structure of a SPICE simulator with emphasis on the time-consuming tasks. Then we describe the methods applied to parallelize SmartSpice and present some experimental results.

 

Workload chart

Figure 1 shows a simplified flowchart that describes the order in which the functions contained within each device model are called. The "setup" function is called at the very beginning of the simulation of a circuit. It initializes the model parameters to their default values and allocates storage for the assembled matrix, taking into account the model's internal nodes.

 

Figure 1. Simplified flowchart of a transient simulation.

 

The "temp" function is called at the beginning of each analysis to evaluate temperature dependent variables and some time independent variables too. These will be used during the time loop that follows.

As for the "load" routine, it performs the actual model evaluation for every single device in the circuit. After evaluating the model equations, the result is assembled in a sparse matrix and a right hand side. Each row of the matrix corresponds to a node in the circuit or a model's internal node.

Once the sparse matrix and the right-hand side are assembled, the linear system is solved in the "solve" routine using an LU decomposition followed by forward and backward triangular solves.
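
For reference, the sequential pattern of this step looks roughly as follows. The sketch below is dense and unpivoted for clarity; the actual solver operates on a sparse matrix with pivoting.

```c
/* Solve A*x = b by in-place LU decomposition (no pivoting) followed by
 * forward and backward triangular solves; the solution overwrites b. */
void lu_solve(int n, double A[n][n], double b[n])
{
    /* Factorization: A = L*U, unit lower-triangular L stored in place */
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            A[i][k] /= A[k][k];                  /* multiplier l_ik */
            for (int j = k + 1; j < n; j++)
                A[i][j] -= A[i][k] * A[k][j];
        }
    /* Forward solve: L*y = b */
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            b[i] -= A[i][j] * b[j];
    /* Backward solve: U*x = y */
    for (int i = n - 1; i >= 0; i--) {
        for (int j = i + 1; j < n; j++)
            b[i] -= A[i][j] * b[j];
        b[i] /= A[i][i];
    }
}
```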

During a transient simulation, the "load" and "solve" routines account for almost the entire execution time. Depending on the size of the circuit and the particular device model used, the proportion of time spent in the "load" varies between 20 and 80 percent of the total execution time. Usually, the larger the circuit, the larger the share of time spent in the "solve".
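
Taken together, these per-model entry points follow the function-table pattern common to SPICE-derived simulators. A minimal sketch, with struct and field names that are illustrative rather than SmartSpice's actual interface:

```c
typedef struct circuit CIRCUIT;   /* global circuit state         */
typedef struct model   MODEL;     /* one device model's instances */

/* One entry per device model; the simulator calls these in the order
 * shown in Figure 1: setup once, temp per analysis, load per iteration. */
typedef struct {
    int (*setup)(MODEL *m, CIRCUIT *ckt); /* defaults, matrix storage     */
    int (*temp) (MODEL *m, CIRCUIT *ckt); /* temperature-dependent values */
    int (*load) (MODEL *m, CIRCUIT *ckt); /* evaluate and stamp devices   */
} model_ops;
```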

Parallelization

To parallelize SmartSpice, we focus on the two main time-consuming tasks: the matrix assembly ("load") and the linear system solution ("solve"). This leaves a fraction of sequential code in SmartSpice, such as preprocessing, post-processing, and all I/O. It is of the utmost importance to keep this fraction of sequential code as small as possible in order to get acceptable overall performance. Figure 2 shows the maximum theoretical speedup achievable by any code depending on the fraction that remains sequential, assuming optimum efficiency in the parallelized section of the code.
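
This bound is Amdahl's law: with a sequential fraction s of the run time and p processors, the achievable speedup is at most

    S(p) = 1 / (s + (1 - s) / p)

so, for example, a code that is 10% sequential can never run more than 10 times faster, however many processors are used.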

 

Figure 2. Theoretical speedup achievable as a function of the fraction of sequential code.

Because of the large amount of data involved and the inherent sequentiality of the waveform calculation, SmartSpice is an application that exhibits data parallelism rather than functional parallelism. Therefore, an SPMD (Single Program, Multiple Data) paradigm is better suited to its parallelization than a fork-and-join paradigm.

Due to the tightly coupled nature of the computations and the rather small time spent in one iteration of a transient analysis for most circuit sizes, it is essential to keep the communication between collaborating processors as low as possible. Therefore, we chose a tightly coupled architecture, namely a shared memory architecture, to implement the SPMD paradigm. On the one hand, this allows the processors to communicate through a common data space instead of passing messages through a communication channel. On the other hand, communicating through shared memory requires explicit synchronization, unlike message passing, where synchronization is implicit and execution more asynchronous. Nevertheless, the gain is in the access to communicated data: shared memory goes through the system bus, whereas message passing goes through a much slower communication network.

On Unix-like operating systems, there are two ways of implementing a parallel paradigm: multiprocessing and multithreading. A thread is essentially a lightweight version of a process. Several processes on the same computer share the same physical address space but have separate virtual address spaces, protected by access security mechanisms; threads, on the contrary, share both physical and virtual address spaces. In addition, threads inherit the parent's variables, whereas separate processes do not, making it necessary to communicate or recompute these variables for each process. Thus, a multithreaded implementation is much more efficient than a multiprocess one. Recall that PVM and MPI, two portable parallel development environments, use processes and not threads in their shared memory implementations.

To summarize, parallel SmartSpice relies on an SPMD paradigm on shared memory parallel computers and uses POSIX 1003.1c standard threads, which makes it portable across platforms from different hardware vendors.
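
The pattern this implies is simple: every worker thread runs the same code on its own slice of the circuit data. A minimal POSIX-threads sketch, in which the worker body and the data slicing are placeholders rather than SmartSpice code:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

typedef struct { int id; int nthreads; } worker_arg;

/* SPMD: the same function runs in every thread, on a different data slice. */
void *worker(void *p)
{
    worker_arg *w = p;
    /* ... process devices w->id, w->id + w->nthreads, ... (cyclic slice) */
    printf("worker %d of %d running\n", w->id, w->nthreads);
    return NULL;
}

int main(void)
{
    pthread_t  tid[NTHREADS];
    worker_arg arg[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        arg[i] = (worker_arg){ i, NTHREADS };
        pthread_create(&tid[i], NULL, worker, &arg[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);      /* wait for all workers */
    return 0;
}
```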

 

Parallelization of the assembly

The parallelization of the assembly phase is rather straightforward. The circuit is partitioned into as many equally sized chunks as there are processors in order to achieve a good load balance, and each chunk is assigned to a processor for model evaluation. Code transformation was necessary in the model evaluation to handle the multiple-write conflicts that arise when two or more processors update devices sharing a common node. Thanks to a combination of optimization techniques, including write caching and inner loop blocking, the use of synchronization locks has been greatly reduced for optimum performance. Although the flow of control has been altered to some extent, the computations performed by parallel SmartSpice are exactly the same as in the original SmartSpice, leading to the same accuracy.
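
The conflict in question is easy to picture: two devices handled by different threads that share a node both accumulate into the same matrix entry. A naive sketch with a single lock around the stamping step is shown below; the write caching and loop blocking mentioned above exist precisely to avoid most of this locking. All names are illustrative.

```c
#include <pthread.h>

#define NTHREADS 4

extern int  ndevices;
extern void evaluate_device(int d);   /* model equations (thread-safe) */
extern void stamp_device(int d);      /* adds results into the matrix  */

static pthread_mutex_t matrix_lock = PTHREAD_MUTEX_INITIALIZER;

/* Naive parallel "load": evaluation is independent per device, but
 * stamping must be serialized because devices may share circuit nodes. */
void *load_worker(void *arg)
{
    int id = *(int *)arg;

    for (int d = id; d < ndevices; d += NTHREADS) {
        evaluate_device(d);                /* no lock needed here      */
        pthread_mutex_lock(&matrix_lock);  /* shared-node updates must */
        stamp_device(d);                   /* not race with each other */
        pthread_mutex_unlock(&matrix_lock);
    }
    return NULL;
}
```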

 

Parallelization of the solver

The parallelization of the "solve" routine is much more complex; parallelizing an LU decomposition with pivoting is still an active research field. To apply state-of-the-art knowledge in this area, we have implemented a parallel LU decomposition based on level scheduling of the elimination tree with event-based synchronization. We first factorize the matrix symbolically, after it has been reordered for sparsity and better parallelism. Afterwards, we build the elimination tree that describes the computational dependencies between the columns of the matrix. Once this is done, the nodes of the tree are assigned to levels such that no dependency exists between nodes of the same level. This guarantees that all nodes (corresponding to tasks) in the same level are entirely independent and can be eliminated in parallel. Then, the nodes are numbered according to a bottom-up walk of the elimination tree and arranged in a task queue. Processors pick up columns to process from that queue and make sure the dependency constraints are satisfied before performing any actual computation. Instead of using a barrier synchronization at each level of the elimination tree, as similar algorithms in the literature suggest, we improved performance by switching to an event-based synchronization. This allows for a more asynchronous execution and exposes more parallelism, which is essential given the extremely sparse nature of SPICE-like matrices.
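
The level-scheduling step itself is compact. In the sketch below, parent[j] encodes the elimination tree (with parent[j] > j for every non-root, as holds for elimination trees), so one forward sweep assigns each column a level; columns sharing a level have no mutual dependencies. Names are illustrative.

```c
/* Assign every matrix column to a level of the elimination tree.
 * parent[j] is column j's parent, or -1 at a root. Returns the number
 * of levels, i.e. the number of parallel elimination phases. */
int compute_levels(int n, const int parent[], int level[])
{
    int nlevels = 0;

    for (int j = 0; j < n; j++)
        level[j] = 0;                      /* leaves sit at level 0    */
    for (int j = 0; j < n; j++)            /* children precede parents */
        if (parent[j] >= 0 && level[parent[j]] < level[j] + 1)
            level[parent[j]] = level[j] + 1;
    for (int j = 0; j < n; j++)
        if (level[j] + 1 > nlevels)
            nlevels = level[j] + 1;
    return nlevels;
}
```

In the event-based variant described above, a worker that picks column j from the task queue waits only on the completion flags of the specific columns j depends on, instead of waiting at a barrier for the whole previous level to finish.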

 

Experimental Results

As a first step, we present performance results on several circuits using Silvaco's implementation of the BSIM3v3 transistor model. The circuits are part of the MCNC benchmark suite, with sizes ranging from 288 to 18816 MOS devices. The first experiments were carried out using 2 and 4 processors of a SUN SPARC server running Solaris 2.4. As Figure 3 shows, the average simulation speed exceeds 1.5 times that of a sequential execution, so a single extra processor already saves over 33% of the simulation time (a speedup S reduces the run time by the fraction 1 - 1/S); over 60% of the simulation time is saved when using 4 processors. Further results covering our latest experiments on several hardware platforms hosting up to 8 processors will be publicly available shortly.

 

Figure 3. Parallel SmartSpice speedup on 2 UltraSparc CPUs using the MCNC benchmark suite.

Conclusion

In an effort to reduce circuit development cycles, Silvaco introduces the first parallel version of SmartSpice. This product is available on Sun servers running Solaris 2.4 and 2.5, SGI servers running IRIX 6.2 and 6.4, and HP S/X/V-Class servers running HPUX 11.0. This paper reports an average speedup of 1.58 on 2 processors and 2.55 on 4 processors for a broad range of circuits from the MCNC benchmark suite. The accuracy of the original SmartSpice is maintained while the simulation speed is significantly increased.