HLIBpro
2.6

While the technical details of solving and preconditioning are discussed in Solvers, this section should comment on the practical issues related to the various preconditioners.
As a benchmark, we will use the same problem as in the Boundary Element Methods example, i.e., the Helmholtz SLP operator on the unit sphere. The wavenumber \(\kappa\) was chosen to be 16.
The source code for setting up the matrix and RHS is identical to the Boundary Element Methods example.
For this specific test, we use a grid with 62.466 vertices and 124.928 triangles. With a constant ansatz, this results in an Hmatrix of dimension \(124.928 \times 124.928\).
Furthermore, the matrix was built unsymmetric, i.e., both lower and upper triangular part are constructed, to prevent algorithmic properties of symmetric matrices, e.g., during matrixvector solves, to influence the numerical results.
For comparison with the timings below, we need the time to build the Hmatrix, which is 75.9 seconds for a blockwise accuracy of \(10^{4}\) with ACA+.
All tests below are performed using a single Intel Xeon E78857 CPU with 12 cores and 3GHz.
First, the various solvers in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 should try to solve the system without any preconditioner. The stop criterion is chosen with a maximal number of 1000 steps and a reduction of the residual norm of \(10^{6}\) compared to the initial residual norm. Furthermore, the lower bound for the absolute residual norm was set to \(10^{6}\):
The results are
The fastest solver is TFQMR, which managed to reach the wanted residual norm within 575 steps. However, it needed 58.12 seconds to do so, which is almost as long as for building the matrix.
The next test will use the Jacobi preconditioner, first the pointwise version:
The results now look as
The GMRES solver now is able to reduce the residual within 193 steps in 22.97 seconds. Also improved are the convergence rates of BiCGStab and TFQMR. CGS now is not able to converge, resulting in absent data.
However, pointwise Jacobi only uses the diagonal entries of \(A\) as an approximation. In the next test, the diagonal blocks of \(A\) are used instead:
The third argument to the TJacobi constructor defines the block size and since no block smaller than diagonal blocks are supported by 𝖧𝖫𝖨𝖡𝗉𝗋𝗈, a 2
results in using all diagonal blocks for a blockwise Jacobi preconditioner.
Instead of a blocksize value of 2
, one could also use \(n_{\min}\) if known, since this is equivalent as it defines the size of the smallest diagonal blocks.
The results are:
The convergence rate actually decreases for GMRES and BiCGStab. Only TFQMR benefits a little.
The reason for this behaviour is, that although going from pointwise diagonal to blockdiagonal with blocksize \(n_{\min}\) will increase the number of coefficients in the preconditioner, but the change in the overall approximation of \(A^{1}\) is still small, and hence the quality of the preconditioner does not change enbough to guarantee a better convergence for this example.
To improve this, we increase the diagonal blocksize to 100
:
which gives the following results:
For comparison, also a blocksize of 500
is used:
Since the approximation to \(A\) and hence also to \(A^{1}\) is getter better with an increased blocksize, the convergence rates are also decreasing.
However, an increased block size also has drawbacks:
The increased time for matrixvector multiplications can be seen for the BiCGStab and the TFQMR solver, which have similar solve times for both block sizes, although the number of iteration steps is reduced.
For pointwise Jacobi and for a blocksize equal to \(n_{\min}\), the setup was actually negligible. So lets correct the above data by the measured time for setting up the preconditioner. First for a blocksize of \(100\):
and also for a blocksize of 500
:
The relatively large increase in the setup time is due to Hinversion for the diagonal blocks, as performed by the TJacobi class. For small blocks, this is comparable to dense inversion, which is very fast for small matrices. However, depending on the Hblock structure, a more than linear increase may be visible for larger blocks (before Hcomplexity is dominant).
TJacobi also accepts a fourth argument defining the accuracy of the Hinversion, which defaults to exact inversion (up to machine precision):
In this example however, a relaxed accuracy only leads to very little change in the setup and solve times.
This is a typicall example that a more advanced preconditioner does not necessarily results in an overall performance increase. Only if the reduction in iteration steps is large enough to actually compensate the increase in the more costly setup phase, will the overall time also be reduced.
In addition to the diagonal of \(A\), the GaussSeidel preconditioner also uses the lower (forward GS), upper (backward GS) or both (symmetric GS) triangular part(s) of the matrix. This gives as the following preconditioners with \(A = D  E  F\) and \(E\) being the strict (block) lower triangular part of \(A\) and \(F\) the strict upper triangular part respectively:
Let us test all three versions of GaussSeidel for the Helmholtz model problem.
ForwardGS:
BackwardGS:
SymmetricGS:
Only SymmetricGS shows a significant improvement over blockwise Jacobi in terms of convergence rates, with less that 100 iteration steps to solve the Helmholtz problem.
However, the solve time is significantly higher compared to Jacobi which is the result of the matrix solve algorithm in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈. For one, solving a lower/upper triangular Hmatrix takes much longer than matrixvector multiplication with a blockdiagonal matrix since much more data is involved. Furthermore this matrix solve algorithm does not scale as efficiently with the number of CPUs as does matrixvector multiplication, especially for a blockdiagonal matrix.
Already BlockJacobi and GaussSeidel made use of the block structure of Hmatrices and the correponding Harithmetic. Especially BlockJacobi used Hmatrix inversion for the diagonal blocks.
Using HLU factorisation, a better approximation to \(A\) can be computed by using the whole matrix instead of just the (block) diagonal, which then in theory should serve as an excellent preconditioner. In fact, an exact (up to machine precision) computation of \(A^{1}\) would result in a direct solver, i.e., only a single matrixvector multiplication (or matrixvector solve) would be needed for solving the equation system.
Harithmetic allows us to control the accuracy of the approximation, which thereby permits us to optimize the overall solution process, e.g., invest more time into setup (high accuracy HLU) or more time in the iteration process (low accuracy HLU) depending on our problem to solve.
For the Helmholtz problem, we will increase the blockwise Harithmetic accuracy \(\varepsilon\) from \(1\) to \(10^{4}\).
Except for the accuracy setting, the source for all examples is:
The numerical results are:
\(\varepsilon = 1\)
\(\varepsilon = 10^{2}\)
\(\varepsilon = 10^{4}\)
As expected, with a better Haccuracy, the convergence rate also becomes better, until only two iteration steps are needed for \(\varepsilon = 10^{4}\). However, the setup time for computing the HLU preconditioner also increases significantly with a smaller \(\varepsilon\), already passing matrix construction time for \(\varepsilon = 10^{2}\).
Regarding overall solve time: already HLU with the coarsest accuracy needs longer than a simple blockJacobi preconditioner. Furthermore, matrixvector solves as are needed for the application of HLU as a preconditioner are expensive, which already was a problem for GaussSeidel preconditioners.
All together, the advantage of HLU for this specific example will only show for many righthand sides to solve. As an example, we compare TFQMR+BlockJacobi (blocksize 100) with Linear Iteration+HLU ( \(\varepsilon = 10^{2}\)). Both solvers would need the same overall time for five righthand side vectors (110.47s vs. 110.73s). Only for more vectors HLU shows a better performance than BlockJacobi.
The main problem with blockwise Jacobi for large block sizes was the large setup costs due to inversion of the diagonal matrix blocks. Instead of inversion, these blocks may also be factorised using HLU factorisation. This permits larger block sizes for blockJacobi.
𝖧𝖫𝖨𝖡𝗉𝗋𝗈 provides two functions to either restrict a given matrix to it's blockdiagonal part
or to copy the blockdiagonal of a matrix:
In both cases, the blockdiagonal is either defined by the level of the diagonal blocks or by their sizes, e.g., stop recursion if a certain level or a certain blocksize was reached. The default value lvl_leaf
signals, that recursion should only be stopped at leaf level (or blocksize = \(n_{\min}\)).
For our problem, a corresponding preconditioner is obtained via
The results for different block sizes are:
blocksize \(100\)
blocksize \(500\)
blocksize \(2000\)
blocksize \(10000\)
blocksize \(20000\)
The results for a blocksize of \(100\) and \(500\) are (almost) identical to the blockJacobi preconditioner above, only the setup time is reduced, e.g., from 5.19 seconds to 0.58 seconds for blocksize \(500\). The solve times are slightly worse than in the Jacobi version, since matrixvector solves are involved.
However, already for a blocksize of \(500\), the overall time is faster and even better for a blocksize of \(2000\) with a minimum for \(10000\). Although a larger block size will result in a better convergence but the setup time also increases making blockdiagonal HLU to the fastest solver for our example in terms of overall time.
We have not discussed the accuracy used for HLU in case of block diagonal preconditioning. For the above tests, the accuracy was \(1\), i.e., a very coarse approximation. Tests with higher accuracy did not resulted in an improvment, for one because the block diagonal structure already sets a lower limit on the approximability of \(A^{1}\) (see blockJacobi) and second, because the setup time will then be higher. Only for very large block sizes, this could improve convergence but not necessarily solve times.
Another alternative for computing a Hbased preconditioner is to restrict the matrix to the nearfield part only, as defined by the dense matrix blocks. All other lowrank blocks are considered to be zero and will not be updated during LU factorization. This algorithm can be considered as an incomplete LU with limited fillin.
Since only dense arithmetic is involved, all computations are performed exact w.r.t. machine precision. Furthermore, the computation time is much faster compared to standard HLU. However, the approximation of \(A^{1}\) is also limited.
As for block diagonal matrices, we have the following two functions to either restrict a given matrix or to copy the nearfield part:
The second parameter defines, whether lowrank blocks should be deleted/not copied or just zeroed.
In our case, this looks as:
And the results are:
The solve times are not as fast as in the case with block diagonal (Jacobi or HLU), but not much worse. Especially the fast setup makes this version interesting. Also matrixvector solves are faster than for standard HLU.
For the given Helmholtz problem with a single righthand side, the fastest solver of the blockdiagonal HLU factorisation with a block size of \(10000\). The reason was the fast factorisation of the diagonal blocks, i.e., a low setup time, with a reasonable number of iteration steps.
Using the standard HLU preconditioner with \(\varepsilon = 10^{2}\) for comparsion (HLU time is similar to matrix construction time), it needs 6 righthand side vectors to be faster than blockdiagonal HLU.