HLIBpro  2.9
Preconditioning

While the technical details of solving and preconditioning are discussed in Solvers, this section should comment on the practical issues related to the various preconditioners.

Attention
The discussion of the numerical tests below only applies to the equation system used in this example. The results, especially for classical preconditioners may vary strongly for other matrices. This example should therefore only be considered as a hint, that different preconditioners are available and may be tested for a specific problem to obtain optimal results.

Problem Setup

As a benchmark, we will use the same problem as in the Boundary Element Methods example, i.e., the Helmholtz SLP operator on the unit sphere. The wavenumber \(\kappa\) was chosen to be 16.

The source code for setting up the matrix and RHS is identical to the Boundary Element Methods example.

For this specific test, we use a grid with 62.466 vertices and 124.928 triangles. With a constant ansatz, this results in an H-matrix of dimension \(124.928 \times 124.928\).

Furthermore, the matrix was built \i unsymmetric, i.e., both lower and upper triangular part are constructed, to prevent algorithmic properties of symmetric matrices, e.g., during matrix-vector solves, to influence the numerical results.

For comparison with the timings below, we need the time to build the H-matrix, which is 75.9 seconds for a block-wise accuracy of \(10^{-4}\) with ACA+.

All tests below are performed using a single Intel Xeon E7-8857 CPU with 12 cores and 3GHz.

Remarks
Although the example uses a dense matrix, for sparse matrices similar tests may be done. However, note that point-wise versions of the various preconditioners, e.g., Jacobi, Gauss-Seidel, SOR or also ILU, do not need H-arithmetic or 𝖧𝖫𝖨𝖡𝗉𝗋𝗈. Furthermore, several direct solvers are available, e.g., UMFPACK, PARDISO or MUMPS, which are very fast if they are applicable to a equation system.

No Preconditioner

First, the various solvers in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 should try to solve the system without any preconditioner. The stop criterion is chosen with a maximal number of 1000 steps and a reduction of the residual norm of \(10^{-6}\) compared to the initial residual norm. Furthermore, the lower bound for the absolute residual norm was set to \(10^{-6}\):

TSolver::TStopCriterion sstop( 1000, 1e-13, 1e-6 );
TBiCGStab solver( sstop );
solver->solve( A, x, b );

The results are

Lin. Iter. not converged after 1000 steps with rate 1.0000e+00, |r| = 1.6751e-02, in 98.89s
CG not converged after 1000 steps with rate 1.0040e+00, |r| = 9.0055e-01, in 98.76s
CGS not converged after 1000 steps with rate 1.0075e+00, |r| = 2.9027e+01, in 99.17s
BiCGStab converged after 722 steps with rate 9.8083e-01, |r| = 1.4219e-08, in 72.25s
MINRES not converged after 1000 steps with rate 9.9531e-01, |r| = 2.6489e-03, in 102.44s
GMRES(20) not converged after 1000 steps with rate 9.8640e-01, |r| = 1.8802e-08, in 109.60s
TFQMR converged after 575 steps with rate 9.7618e-01, |r| = 2.2689e-09, in 58.12s

The fastest solver is TFQMR, which managed to reach the wanted residual norm within 575 steps. However, it needed 58.12 seconds to do so, which is almost as long as for building the matrix.

Remarks
In the following, only results for solvers are printed, which managed to converge within the given iteration limits.

Jacobi Preconditioner

The next test will use the Jacobi preconditioner, first the point-wise version:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 1 );
auto solver = ...;
solver->solve( A, x, b, Jacobi.get() );

The results now look as

BiCGStab converged after 293 steps with rate 9.5385e-01, |r| = 4.2289e-07, in 32.03s
GMRES(20) converged after 193 steps with rate 9.3092e-01, |r| = 4.4926e-07, in 22.97s
TFQMR converged after 304 steps with rate 9.5552e-01, |r| = 6.3025e-08, in 33.59s

The GMRES solver now is able to reduce the residual within 193 steps in 22.97 seconds. Also improved are the convergence rates of BiCGStab and TFQMR. CGS now is not able to converge, resulting in absent data.

However, point-wise Jacobi only uses the diagonal entries of \(A\) as an approximation. In the next test, the diagonal blocks of \(A\) are used instead:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 2 );

The third argument to the TJacobi constructor defines the block size and since no block smaller than diagonal blocks are supported by 𝖧𝖫𝖨𝖡𝗉𝗋𝗈, a 2 results in using all diagonal blocks for a block-wise Jacobi preconditioner.

Instead of a blocksize value of 2, one could also use \(n_{\min}\) if known, since this is equivalent as it defines the size of the smallest diagonal blocks.

The results are:

BiCGStab converged after 317 steps with rate 9.5593e-01, |r| = 7.7363e-08, in 32.17s
GMRES(20) converged after 403 steps with rate 9.6628e-01, |r| = 1.1475e-07, in 44.76s
TFQMR converged after 293 steps with rate 9.5275e-01, |r| = 9.7448e-09, in 30.21s

The convergence rate actually decreases for GMRES and BiCGStab. Only TFQMR benefits a little.

The reason for this behaviour is, that although going from point-wise diagonal to block-diagonal with blocksize \(n_{\min}\) will increase the number of coefficients in the preconditioner, but the change in the overall approximation of \(A^{-1}\) is still small, and hence the quality of the preconditioner does not change enbough to guarantee a better convergence for this example.

To improve this, we increase the diagonal blocksize to 100:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 100 );

which gives the following results:

BiCGStab converged after 215 steps with rate 9.3745e-01, |r| = 5.7064e-08, in 22.47s
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4591e-08, in 29.42s
TFQMR converged after 207 steps with rate 9.3496e-01, |r| = 4.9908e-09, in 22.00s

For comparison, also a blocksize of 500 is used:

BiCGStab converged after 173 steps with rate 9.2311e-01, |r| = 3.1961e-08, in 20.74s
GMRES(20) converged after 148 steps with rate 9.1078e-01, |r| = 4.3545e-08, in 19.29s
TFQMR converged after 174 steps with rate 9.2273e-01, |r| = 3.4310e-09, in 21.19s

Since the approximation to \(A\) and hence also to \(A^{-1}\) is getter better with an increased blocksize, the convergence rates are also decreasing.

However, an increased block size also has drawbacks:

  1. The time for matrix-vector multiplication with the preconditioner is bigger.
  2. The setup time for the preconditioner increases, i.e., the time for inverting the diagonal blocks.

The increased time for matrix-vector multiplications can be seen for the BiCGStab and the TFQMR solver, which have similar solve times for both block sizes, although the number of iteration steps is reduced.

For point-wise Jacobi and for a block-size equal to \(n_{\min}\), the setup was actually negligible. So lets correct the above data by the measured time for setting up the preconditioner. First for a blocksize of \(100\):

BiCGStab converged after 215 steps with rate 9.3745e-01, |r| = 5.7064e-08, in 22.47s + 0.47s = 22.94
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4591e-08, in 29.42s + 0.47s = 29.89
TFQMR converged after 207 steps with rate 9.3496e-01, |r| = 4.9908e-09, in 22.00s + 0.47s = 22.47

and also for a blocksize of 500:

BiCGStab converged after 173 steps with rate 9.2311e-01, |r| = 3.1961e-08, in 20.74s + 10.72s = 31.46
GMRES(20) converged after 148 steps with rate 9.1078e-01, |r| = 4.3545e-08, in 19.29s + 10.72s = 30.01
TFQMR converged after 174 steps with rate 9.2273e-01, |r| = 3.4310e-09, in 21.19s + 10.72s = 31.91

The relatively large increase in the setup time is due to H-inversion for the diagonal blocks, as performed by the TJacobi class. For small blocks, this is comparable to dense inversion, which is very fast for small matrices. However, depending on the H-block structure, a more than linear increase may be visible for larger blocks (before H-complexity is dominant).

TJacobi also accepts a fourth argument defining the accuracy of the H-inversion, which defaults to exact inversion (up to machine precision):

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 100, fixed_prec( 1e-4 ) );

In this example however, a relaxed accuracy only leads to very little change in the setup and solve times.

This is a typicall example that a more advanced preconditioner does not necessarily results in an overall performance increase. Only if the reduction in iteration steps is large enough to actually compensate the increase in the more costly setup phase, will the overall time also be reduced.

Gauss-Seidel Preconditioner

In addition to the diagonal of \(A\), the Gauss-Seidel preconditioner also uses the lower (forward GS), upper (backward GS) or both (symmetric GS) triangular part(s) of the matrix. This gives as the following preconditioners with \(A = D - E - F\) and \(E\) being the strict (block) lower triangular part of \(A\) and \(F\) the strict upper triangular part respectively:

  • Forward-GS: \((D-E)^{-1}\),
  • Backward-GS: \((D-F)^{-1}\),
  • Symmetric-GS: \((D-F)^{-1} D (D-E)^{-1}\).

Let us test all three versions of Gauss-Seidel for the Helmholtz model problem.

Forward-GS:

auto FGS = make_unique< TGaussSeidel >( A.get(), GS_FORWARD );
auto solver = ...;
solver->solve( A, x, b, FGS.get() );
BiCGStab converged after 300 steps with rate 9.5495e-01, |r| = 2.8519e-08, in 76.56s
GMRES(20) converged after 281 steps with rate 9.5196e-01, |r| = 4.4621e-08, in 77.02s
TFQMR converged after 250 steps with rate 9.4581e-01, |r| = 6.5929e-09, in 64.65s

Backward-GS:

auto BGS = make_unique< TGaussSeidel >( A.get(), GS_BACKWARD );
auto solver = ...;
solver->solve( A, x, b, BGS.get() );
BiCGStab converged after 279 steps with rate 9.5164e-01, |r| = 2.8068e-08, in 72.67s
GMRES(20) converged after 280 steps with rate 9.5179e-01, |r| = 4.7332e-08, in 77.79s
TFQMR converged after 260 steps with rate 9.4799e-01, |r| = 2.5888e-09, in 68.42s

Symmetric-GS:

auto SGS = make_unique< TGaussSeidel >( A.get(), GS_SYMMETRIC );
auto solver = ...;
solver->solve( A, x, b, SGS.get() );
BiCGStab converged after 88 steps with rate 8.5257e-01, |r| = 1.9405e-08, in 37.84s
GMRES(20) converged after 76 steps with rate 8.3227e-01, |r| = 7.9860e-08, in 34.66s
TFQMR converged after 93 steps with rate 8.6043e-01, |r| = 8.8380e-09, in 40.82s

Only Symmetric-GS shows a significant improvement over block-wise Jacobi in terms of convergence rates, with less that 100 iteration steps to solve the Helmholtz problem.

However, the solve time is significantly higher compared to Jacobi which is the result of the matrix solve algorithm in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈. For one, solving a lower/upper triangular H-matrix takes much longer than matrix-vector multiplication with a block-diagonal matrix since much more data is involved. Furthermore this matrix solve algorithm does not scale as efficiently with the number of CPUs as does matrix-vector multiplication, especially for a block-diagonal matrix.

Remarks
𝖧𝖫𝖨𝖡𝗉𝗋𝗈 does not support block-GS with a larger block size than \(n_{\min}\), which possibly would further improve convergence rates albeit with higher setup costs comparable to block-Jacobi.

H-LU Preconditioner

Already Block-Jacobi and Gauss-Seidel made use of the block structure of H-matrices and the correponding H-arithmetic. Especially Block-Jacobi used H-matrix inversion for the diagonal blocks.

Using H-LU factorisation, a better approximation to \(A\) can be computed by using the whole matrix instead of just the (block-) diagonal, which then in theory should serve as an excellent preconditioner. In fact, an exact (up to machine precision) computation of \(A^{-1}\) would result in a direct solver, i.e., only a single matrix-vector multiplication (or matrix-vector solve) would be needed for solving the equation system.

H-arithmetic allows us to control the accuracy of the approximation, which thereby permits us to optimize the overall solution process, e.g., invest more time into setup (high accuracy H-LU) or more time in the iteration process (low accuracy H-LU) depending on our problem to solve.

For the Helmholtz problem, we will increase the block-wise H-arithmetic accuracy \(\varepsilon\) from \(1\) to \(10^{-4}\).

Except for the accuracy setting, the source for all examples is:

auto acc = fixed_prec( ... );
auto A_copy = A->copy( acc );
auto A_inv = factorise_inv( A_copy.get(), acc );
auto solver = ...;
solver->solve( A, x, b, A_inv.get() );
Remarks
Note, that already the copy of \(A\) is constructed with a reduced accuracy, thereby starting with less data for H-LU.

The numerical results are:

\(\varepsilon = 1\)

BiCGStab converged after 43 steps with rate 7.0292e-01, |r| = 2.7785e-09, in 15.28s + 28.98s = 44.26
MINRES converged after 104 steps with rate 8.7492e-01, |r| = 1.7772e-03, in 37.02s + 28.98s = 66.00
GMRES(20) converged after 36 steps with rate 6.7405e-01, |r| = 1.3937e-08, in 13.63s + 28.98s = 42.61
TFQMR converged after 43 steps with rate 7.2430e-01, |r| = 2.8872e-09, in 15.93s + 28.98s = 44.91

\(\varepsilon = 10^{-2}\)

Lin. Iter. converged after 6 steps with rate 8.5859e-02, |r| = 6.3671e-11, in 2.79s + 96.78s = 99.57
CG converged after 42 steps with rate 7.1875e-01, |r| = 1.0744e-13, in 17.49s + 96.78s = 114.27
CGS converged after 8 steps with rate 1.0733e-01, |r| = 1.3481e-12, in 3.90s + 96.78s = 100.68
BiCGStab converged after 6 steps with rate 8.1799e-02, |r| = 4.7609e-11, in 2.81s + 96.78s = 99.59
MINRES converged after 6 steps with rate 8.8599e-02, |r| = 1.5918e-09, in 3.12s + 96.78s = 99.90
GMRES(20) converged after 6 steps with rate 8.6919e-02, |r| = 3.8247e-11, in 3.12s + 96.78s = 99.90
TFQMR converged after 7 steps with rate 1.0685e-01, |r| = 4.6321e-12, in 3.92s + 96.78s = 100.70

\(\varepsilon = 10^{-4}\)

Lin. Iter. converged after 2 steps with rate 6.8749e-04, |r| = 3.2750e-13, in 1.40s + 248.34s = 249.74
CG converged after 5 steps with rate 5.7639e-02, |r| = 4.0160e-16, in 3.18s + 248.34s = 251.52
CGS converged after 2 steps with rate 7.2098e-04, |r| = 3.2766e-13, in 1.78s + 248.34s = 250.12
BiCGStab converged after 2 steps with rate 6.8746e-04, |r| = 3.2746e-13, in 1.41s + 248.34s = 249.75
MINRES converged after 2 steps with rate 7.2055e-04, |r| = 3.2755e-13, in 1.78s + 248.34s = 250.12
GMRES(20) converged after 2 steps with rate 7.2055e-04, |r| = 3.2758e-13, in 1.78s + 248.34s = 250.12
TFQMR converged after 2 steps with rate 8.5698e-04, |r| = 3.2750e-13, in 2.25s + 248.34s = 250.59

As expected, with a better H-accuracy, the convergence rate also becomes better, until only two iteration steps are needed for \(\varepsilon = 10^{-4}\). However, the setup time for computing the H-LU preconditioner also increases significantly with a smaller \(\varepsilon\), already passing matrix construction time for \(\varepsilon = 10^{-2}\).

Regarding overall solve time: already H-LU with the coarsest accuracy needs longer than a simple block-Jacobi preconditioner. Furthermore, matrix-vector solves as are needed for the application of H-LU as a preconditioner are expensive, which already was a problem for Gauss-Seidel preconditioners.

All together, the advantage of H-LU for this specific example will only show for many right-hand sides to solve. As an example, we compare TFQMR+Block-Jacobi (blocksize 100) with Linear Iteration+H-LU ( \(\varepsilon = 10^{-2}\)). Both solvers would need the same overall time for five right-hand side vectors (110.47s vs. 110.73s). Only for more vectors H-LU shows a better performance than Block-Jacobi.

Attention
Again, this is only valid for this example matrix. Already for a larger problem size for the same Helmholtz example, the Jacobi preconditioner will need more iteration steps, which will shift the break-even point in favour of H-LU.

Block-Diagonal H-LU Preconditioner

The main problem with block-wise Jacobi for large block sizes was the large setup costs due to inversion of the diagonal matrix blocks. Instead of inversion, these blocks may also be factorised using H-LU factorisation. This permits larger block sizes for block-Jacobi.

𝖧𝖫𝖨𝖡𝗉𝗋𝗈 provides two functions to either restrict a given matrix to it's block-diagonal part

void restrict_blockdiag ( TMatrix * A,
const uint lvl = lvl_leaf,
const size_t blocksize = 0 );

or to copy the block-diagonal of a matrix:

std::unique_ptr< TMatrix >
blockdiag ( const TMatrix * A,
const uint lvl = lvl_leaf,
const size_t blocksize = 0 );

In both cases, the block-diagonal is either defined by the level of the diagonal blocks or by their sizes, e.g., stop recursion if a certain level or a certain blocksize was reached. The default value lvl_leaf signals, that recursion should only be stopped at leaf level (or blocksize = \(n_{\min}\)).

For our problem, a corresponding preconditioner is obtained via

auto A_bd = blockdiag( A.get(), lvl_leaf, ... );
auto acc = fixed_prec( ... );
auto A_inv = factorise_inv( A_bd.get(), acc );
auto solver = ...;
solver->solve( A, x, b, A_inv.get() );

The results for different block sizes are:

blocksize \(100\)

BiCGStab converged after 221 steps with rate 9.3928e-01, |r| = 5.9560e-08, in 33.35s + 0.45s = 33.80s
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4604e-08, in 41.84s + 0.45s = 42.29s
TFQMR converged after 207 steps with rate 9.3494e-01, |r| = 4.9847e-09, in 31.61s + 0.45s = 32.06s

blocksize \(500\)

BiCGStab converged after 175 steps with rate 9.2347e-01, |r| = 2.9163e-08, in 43.48s + 1.44s = 44.92s
GMRES(20) converged after 148 steps with rate 9.1077e-01, |r| = 4.3472e-08, in 39.37s + 1.44s = 40.81s
TFQMR converged after 174 steps with rate 9.2358e-01, |r| = 4.1605e-09, in 43.81s + 1.44s = 45.25s

blocksize \(2000\)

BiCGStab converged after 143 steps with rate 9.0767e-01, |r| = 2.5690e-08, in 40.33s + 4.65s = 44.98s
GMRES(20) converged after 124 steps with rate 8.9395e-01, |r| = 4.4211e-08, in 37.50s + 4.65s = 42.15s
TFQMR converged after 144 steps with rate 9.0833e-01, |r| = 6.2214e-09, in 41.22s + 4.65s = 45.87s

blocksize \(10000\)

BiCGStab converged after 94 steps with rate 8.6236e-01, |r| = 1.9162e-08, in 29.53s + 11.29s = 40.82s
GMRES(20) converged after 76 steps with rate 8.3217e-01, |r| = 2.7094e-08, in 25.47s + 11.29s = 36.76s
TFQMR converged after 106 steps with rate 8.7687e-01, |r| = 4.5741e-09, in 33.89s + 11.29s = 45.18s

blocksize \(20000\)

BiCGStab converged after 72 steps with rate 8.2350e-01, |r| = 1.6905e-08, in 24.45s + 15.79s = 40.24s
GMRES(20) converged after 59 steps with rate 7.8976e-01, |r| = 3.1821e-08, in 21.31s + 15.79s = 37.10s
TFQMR converged after 93 steps with rate 8.5858e-01, |r| = 1.8237e-09, in 32.14s + 15.79s = 47.93s

The results for a blocksize of \(100\) and \(500\) are (almost) identical to the block-Jacobi preconditioner above, only the setup time is reduced, e.g., from 5.19 seconds to 0.58 seconds for blocksize \(500\). The solve times are slightly worse than in the Jacobi version, since matrix-vector solves are involved.

However, already for a blocksize of \(500\), the overall time is faster and even better for a blocksize of \(2000\) with a minimum for \(10000\). Although a larger block size will result in a better convergence but the setup time also increases making block-diagonal H-LU to the fastest solver for our example in terms of overall time.

We have not discussed the accuracy used for H-LU in case of block diagonal preconditioning. For the above tests, the accuracy was \(1\), i.e., a very coarse approximation. Tests with higher accuracy did not resulted in an improvment, for one because the block diagonal structure already sets a lower limit on the approximability of \(A^{-1}\) (see block-Jacobi) and second, because the setup time will then be higher. Only for very large block sizes, this could improve convergence but not necessarily solve times.

Nearfield H-LU Preconditioner

Another alternative for computing a H-based preconditioner is to restrict the matrix to the nearfield part only, as defined by the dense matrix blocks. All other low-rank blocks are considered to be zero and will not be updated during LU factorization. This algorithm can be considered as an incomplete LU with limited fill-in.

Remarks
If started with a sparse matrix, this method corresponds to an \i incomplete LU, where the fill-in is restricted to those matrix coefficients defined by the dense matrix blocks.

Since only dense arithmetic is involved, all computations are performed exact w.r.t. machine precision. Furthermore, the computation time is much faster compared to standard H-LU. However, the approximation of \(A^{-1}\) is also limited.

As for block diagonal matrices, we have the following two functions to either restrict a given matrix or to copy the nearfield part:

void
restrict_nearfield ( TMatrix * A,
const bool delete_farfield );
std::unique_ptr< TMatrix >
nearfield ( const TMatrix * A,
const bool without_farfield );

The second parameter defines, whether low-rank blocks should be deleted/not copied or just zeroed.

In our case, this looks as:

auto A_nf = nearfield( A.get(), true );
auto A_inv = factorise_inv( A_nf.get(), acc_exact );
auto solver = ...;
solver->solve( A, x, b, A_inv.get() );

And the results are:

BiCGStab converged after 122 steps with rate 8.9181e-01, |r| = 1.6491e-08, in 30.07s + 1.47s = 31.54s
GMRES(20) converged after 117 steps with rate 8.8802e-01, |r| = 2.4539e-08, in 30.78s + 1.47s = 32.25s
TFQMR converged after 132 steps with rate 8.9907e-01, |r| = 1.0807e-09, in 33.00s + 1.47s = 34.47s

The solve times are not as fast as in the case with block diagonal (Jacobi or H-LU), but not much worse. Especially the fast setup makes this version interesting. Also matrix-vector solves are faster than for standard H-LU.

Conclusion

For the given Helmholtz problem with a single right-hand side, the fastest solver of the block-diagonal H-LU factorisation with a block size of \(10000\). The reason was the fast factorisation of the diagonal blocks, i.e., a low setup time, with a reasonable number of iteration steps.

Using the standard H-LU preconditioner with \(\varepsilon = 10^{-2}\) for comparsion (H-LU time is similar to matrix construction time), it needs 6 right-hand side vectors to be faster than block-diagonal H-LU.

Attention
To emphasize again: the results are for a specific problem. What is not visible in the data is for instance the dependency of the number of iteration steps on the problem dimension which is typical for classical precondtioners like Jacobi or Gauss-Seidel. Furthermore, convergence rates are strongly dependent on the properties of a matrix (we leave out convergence theory here).
HLIB::factorise_inv
std::unique_ptr< TFacInvMatrix > factorise_inv(TMatrix *A, const TTruncAcc &acc, const fac_options_t &options=fac_options_t())
compute factorisation of A and return inverse operator