Performance Evaluation of a New Hybrid MPI-thread Parallelized Direct Solver

 

Background

The solution of linear systems lies at the core of any TCAD simulation: at every nonlinear step of the computation a linear system needs to be solved. The size and condition number of the matrices in these linear systems vary significantly depending on the specific type of TCAD simulation. So, to achieve fast convergence, the linear solver must have good performance and good accuracy, must handle ill-conditioned matrices, and ideally should work well on linear systems of any size.

The two main types of linear system solvers are direct and iterative solvers, and each has its pros and cons. Direct solvers are very accurate, but they can require large amounts of memory for large problems (for example, 3D problems), and their performance on such problems is usually not very good. Iterative solvers, on the other hand, are less accurate than direct solvers and can diverge for linear systems with ill-conditioned matrices, but they have very good performance and are designed to handle large problems.

The question is: can we have a direct solver that handles large 3D problems without excessive memory requirements and with performance similar to an iterative solver? The answer is yes, if we split the large problem into smaller ones and design a parallel direct solver with several levels of parallelism.

 

New Hybrid MPI-thread Parallelized Direct Solver

The new PAS solver, part of the Silvaco suite of linear system solvers, is a direct solver parallelized with both MPI and pthreads.

The large matrix coming from a large 3D problem can be divided into pieces by a graph partitioning tool and distributed to several machines by launching an MPI process on each machine. This resolves the memory issue.

Since most modern machines have multiple cores, on each machine multiple threads can work in parallel on the piece of the matrix assigned to that machine, as sketched below. So here we have two levels of parallelism. The parallel implementation of the PAS solver resolves the performance issue: the PAS solver is orders of magnitude faster than older generation direct solvers and can be close in speed to an iterative solver.
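To make the two levels of parallelism concrete, here is a minimal structural sketch in C of one way such a hybrid solver can be organized. It is not the actual PAS implementation: solve_piece is a hypothetical placeholder for the per-thread numerical work, and NUM_THREADS is an arbitrary choice; only the MPI and pthread calls themselves are standard.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8  /* second level of parallelism: threads per MPI process */

/* Hypothetical placeholder for the per-thread work: in a real solver each
   thread would factorize its sub-block of the local matrix piece. */
void *solve_piece(void *arg) {
    int thread_id = *(int *)arg;
    (void)thread_id;  /* sub-block selection would use this id */
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank, nprocs;

    /* First level of parallelism: one MPI process per machine (or per
       matrix piece); request thread support since we spawn pthreads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* In the real solver, a graph partitioner assigns piece `rank`
       of the global matrix to this process. */

    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    /* Second level of parallelism: threads work concurrently on the
       local piece, using the cores of this machine. */
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        pthread_create(&threads[t], NULL, solve_piece, &ids[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    /* Processes exchange boundary data for pieces that couple; this
       transfer is the communication overhead discussed later. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)  /* total threads = MPI processes x threads per process */
        printf("solved with %d MPI processes x %d threads\n", nprocs, NUM_THREADS);
    MPI_Finalize();
    return 0;
}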

On smaller problems it can even be faster than iterative solvers. The biggest advantage, of course, is the accuracy. For problems where high accuracy is required and the matrix of the linear system is ill-conditioned, so that the iterative method diverges or takes a large number of iterations to converge, the PAS solver provides fast convergence of the nonlinear solver and can make the overall computation faster.

To illustrate the performance of the PAS solver, we ran a 3D Victory Device simulation with a matrix of size 392,298 on a machine with two Intel Xeon E5 Series CPUs, each with 18 cores, for a total of 36 cores.

On the same machine we launched MPI processes, and the first column of Table 1 shows the number of processes launched.

PAS      1 thr/MPI   2 thr/MPI   4 thr/MPI   8 thr/MPI   16 thr/MPI  32 thr/MPI
1 MPI    10451.58 s  5950.09 s   3607.05 s   2465.28 s   2098.07 s   1605.51 s
2 MPI     8133.70 s  4296.44 s   2913.44 s   2263.81 s   1827.38 s
4 MPI     5252.20 s  3424.69 s   2358.55 s   1656.67 s
8 MPI     3977.47 s  2719.87 s   1736.52 s
16 MPI    3347.99 s  2197.41 s
32 MPI    2903.98 s
Table 1. Total time of a Victory Device simulation.

 

The first row shows how many threads per MPI process we used. In the table we recorded the total time of the simulation in seconds. To get the total number of threads used for the computation, multiply the number of MPI processes by the number of threads per MPI process.

On every diagonal, going from the bottom left to the top right, the total number of threads is the same.
Some of the slots in the table are not filled because the machine had only 36 cores, so we did not run with, for example, 2 × 32 = 64 threads or 4 × 16 = 64 threads.

The graph in Figure 1 shows the data from the 1 MPI row of Table 1. This means that only one process is launched, and 1, 2, ..., 32 threads, respectively, work in parallel on the whole problem.

Figure 1. Victory Device example run on 1 MPI process with different numbers of threads.

 

The time for the simulation decreases significantly as the number of threads increases. The speedup with 32 threads is over 6.5 (see Table 2 below). Another observation is that there is no saturation point, which occurs in many solvers: past it, using more threads does not make the computation faster and may even make it slower.

 

PAS      1 thr/MPI  2 thr/MPI  4 thr/MPI  8 thr/MPI  16 thr/MPI  32 thr/MPI
1 MPI    1          1.76       2.9        4.24       4.98        6.5
2 MPI    1.3        2.43       3.59       4.6        5.72
4 MPI    1.99       3.05       4.43       6.3
8 MPI    2.63       3.84       6.02
16 MPI   3.12       4.76
32 MPI   3.6
Table 2. Speedup of a Victory Device simulation calculated based on the timing information from Table 1.

 

For the PAS solver, the more threads we use, the faster the simulation.

Table 2 shows the speedup that we get with the PAS solver; the calculations are based on the timing information from Table 1. The speedup with 1 MPI process is graphed in Figure 2 and is almost linear. The speedup increases in a similar fashion in every row of Table 2, so when you split the matrix you continue to get a close-to-linear increase in speedup.

 

Figure 2. Speedup on 1 MPI process with different numbers of threads.
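The speedup numbers are easy to reproduce from Table 1: each entry of Table 2 is the baseline time (1 MPI process, 1 thread) divided by the time of the given configuration. A minimal check in C for the 1 MPI row, which matches Table 2 up to rounding:

#include <stdio.h>

int main(void) {
    /* Timings from the 1 MPI row of Table 1, in seconds,
       for 1, 2, 4, 8, 16, and 32 threads. */
    const double times[] = {10451.58, 5950.09, 3607.05,
                            2465.28, 2098.07, 1605.51};
    const int threads[]  = {1, 2, 4, 8, 16, 32};
    const double baseline = times[0];  /* 1 MPI process, 1 thread */

    for (int i = 0; i < 6; i++)
        printf("%2d threads: speedup %.2f\n",
               threads[i], baseline / times[i]);
    return 0;
}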

 

If you look at the first column of Table 2, you can see that splitting the matrix into pieces and using only 1 thread for each piece is not such a good idea: the speedup increases more slowly. This is due to the communication overhead caused by the transfer of data between the MPI processes.

As you increase the number of MPI processes, the pieces per process become smaller, and with only 1 thread per piece the second level of parallelism is not used at all.

So the conclusion is that if you are running on one machine, it is better not to launch too many processes and to use more threads instead. The best speedup we get for this example, 6.5, occurs when 1 MPI process is launched and 32 threads are used. If you do decide to launch more than one MPI process, splitting the matrix into 4 pieces with 4 MPI processes and running 8 threads on each piece also works very well; the speedup is 6.3 in this case.

In this case the pieces of the matrix are large enough (there are only 4), and each piece has enough resources (8 threads per piece), so the whole computation is close to optimal.

Next we illustrate the performance of the PAS solver on more than one machine. The 3 machines that we used are identical, each with two Intel Xeon E5 Series CPUs with 18 cores apiece, for a total of 36 cores per machine. On each machine we launched one MPI process.
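For readers reproducing such a run, a simple way to verify this placement is to have each rank print the name of the machine it runs on; MPI_Get_processor_name is the standard MPI call for this. A minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each of the (here, 3) MPI processes reports which machine it landed
       on; with one process per machine, each host name appears once. */
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d running on %s\n", rank, nprocs, host);

    MPI_Finalize();
    return 0;
}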

We can see from the timing information in Table 3 that the best performance is achieved on 3 machines, with 32 threads working in parallel on each of the 3 pieces into which the problem has been divided. The best time in this case is better than what we got on 1 machine, so again we confirm that the more resources are used for solving the problem, the better the time. The speedup that we got with 3 machines is 8.14 (10451.58 s / 1283.02 s), which is a very good result.

 

PAS      1 thr/MPI   2 thr/MPI   4 thr/MPI   8 thr/MPI   16 thr/MPI  32 thr/MPI
1 mach   10451.58 s  5950.09 s   3607.05 s   2465.28 s   2098.07 s   1605.51 s
2 mach    8249.34 s  4231.33 s   2815.28 s   1935.84 s   1592.76 s   1321.30 s
3 mach    7071.58 s  3893.42 s   2262.28 s   1717.47 s   1413.03 s   1283.02 s
Table 3. Timing information for the PAS solver on more than one machine.

 

Conclusion

We have added PAS, a new hybrid MPI-thread parallelized direct solver, to the Silvaco suite of linear system solvers. It complements our existing set of solvers and, in particular, serves as an alternative to our domain decomposition MPI-parallelized iterative solver PAM in cases with convergence issues. Together the two solvers cover the large range of linear systems that need to be solved in TCAD simulations, both in 2D and 3D.

The PAS solver is a modern parallel solver, very accurate and robust, with great performance and reasonable memory needs. It has the advantage of being able to run on several machines, and in parallel on each machine, thus using their resources in an optimal manner.