HLIBpro  2.9
Many Right Hand Sides

For equation systems with many right hand sides, e.g.

$A x_i = b_i$

with $$i = 1 \ldots n_{rhs}, n_{rhs} \gg 1,$$ different ways are supported in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 to solve the equations, which shall be introduced in this example.

# Problem Description

We assume a given system matrix $$A$$ and right hand sides (RHSs) $$b_i$$. We are looking for the solutions $$x_i$$ in $$A x_i = b_i$$.

As with all previous applications of H-matrices, the H-LU factorisation shall be used to solve the equation systems. This may either be performed directly, i.e., as a direct solver involving a single matrix-vector evaluation, or with the help of an iterative solver. In all examples so far, the iterative approach was used since it usually relaxes the accuracy demands on the H-LU factorisation and hence reduces the factorisation runtime.

However, iterative solvers perform several matrix-vector multiplications for a single RHS. Therefore, this approach may increase the overall time significantly for many RHSs.

Alternatively, the direct solver approach may be used, which normally demands a much higher accuracy for H-LU to obtain the corresponding accuracy in the solution vector. A higher H-LU accuracy often increases the runtime and memory requirement of the factorisation process. However, depending on the number of RHSs, this additional costs may shorten the overall solution process.

Beside solving for a single RHS, 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 also offers to solve for all RHSs simultaneously when used as a direct solver, as described below.

# Problem Setup

In the following, we use a system matrix $$A$$ and corresponding RHSs $$b_i$$ from the Boundary Element Methods example. However, instead of the Helmholtz kernel, the Laplace kernel is used. Furthermore, the matrix is constructed as an unsymmetric matrix, i.e., with upper and lower triangular part (see below for standard symmetric case).

The initialisation and construction of cluster trees is:

#include <iostream>
#include <string>
#include <boost/format.hpp>
#include <tbb/parallel_for.h>
#include "hlib.hh"
using namespace std;
using namespace HLIB;
using boost::format;
using real_t = HLIB::real;
using fnspace_t = TConstFnSpace;
int
main ( int argc, char ** argv )
{
string gridfile = "grid.tri";
double eps = 1e-4;
double fac_eps = 1e-6;
if ( argc > 1 )
gridfile = argv;
try
{
INIT();
TConsoleProgressBar progress;
TWallTimer timer;
TAutoBSPPartStrat part_strat;
TBSPCTBuilder ct_builder( & part_strat );
TBCBuilder bct_builder;
auto grid = make_grid( gridfile );
auto fnspace = make_unique< fnspace_t >( grid.get() );
auto coord = fnspace->build_coord();
auto ct = ct_builder.build( coord.get() );
auto bct = bct_builder.build( ct.get(), ct.get(), & adm_cond );

This is followed by H-matrix construction:

slpbf_t bf( fnspace.get(), fnspace.get() );
TBFCoeffFn< slpbf_t > bem_coeff_fn( & bf );
TPermCoeffFn< real_t > coeff_fn( & bem_coeff_fn,
bct->row_ct()->perm_i2e(),
bct->col_ct()->perm_i2e() );
TACAPlus< real_t > lrapx( & coeff_fn );
TTruncAcc acc = fixed_prec( eps );
TDenseMBuilder< real_t > h_builder( & coeff_fn, & lrapx );
auto A = h_builder.build( bct.get(), unsymmetric, acc, & progress );

Here, the unsymmetric matrix is constructed instead of the standard symmetric version.

# RHS Representation

Next, the RHSs shall be computed. For the BEM problem this is accomplished by deriving from TBEMFunction. Since multiple, different RHSs are to be constructed, two parameters alpha and beta are added, which are used in the evaluation function:

class TMyRHS : public TBEMFunction< real_t >
{
private:
const double _alpha, _beta;
public:
TMyRHS ( const double alpha,
const double beta )
: _alpha( alpha )
, _beta( beta )
{}
// actual RHS function
virtual real_t
eval ( const T3Point & pos,
const T3Point & ) const
{
return real_t( sin( 0.5 * pos.x() * cos( _alpha * 0.25 * pos.z() ) ) * cos( _beta * pos.y() ) );
}
};

Up to now, the RHS was represented as a single vector. For multiple RHSs this may be extended to a list of vectors. However, even more compact, and also useful for the different solve methods, is the representation as a dense matrix, i.e., each column corresponds to a different RHS:

$B := \left[ b_0 b_1 \ldots \right]$

In the source code the matrix $$B \in R^{n \times n_{rhs}}$$ is stored in the dense matrix RHS:

const size_t nrhs = 500;
TQuadBEMRHS< fnspace_t, real_t > rhs_build( 4 );
TDenseMatrix RHS( fnspace->n_indices(), nrhs );

For the actual construction, each column of RHS is taken as a standard vector and computed in parallel using tbb::parallel_for :

tbb::parallel_for( size_t(0), nrhs,
[&RHS,&rhs_build,&fnspace] ( const size_t i )
{
TMyRHS rhsfn( 100 * drand48() - 50,
100 * drand48() - 50 );
auto b = rhs_build.build( fnspace.get(), & rhsfn );
auto rhs_i = RHS.blas_rmat().column( i );
BLAS::copy( cptrcast( b.get(), TScalarVector )->blas_rvec(), rhs_i );
} );
RHS.permute( ct->perm_e2i(), nullptr );

The parameters alpha and beta of TMyRHS are chosen randomly. Similar to vectors, also the dense matrix has to be reordered to have the same ordering as the indices in the H-matrix. The permutation is here only applied to the rows, since they correspond to geometrical indices, while the column index only represents the index of the RHS.

Remarks
For simplicity, we have assumed that the RHS is stored in an object of type TScalarVector, which should be true for almost all cases.

# Solving the Equations

For all methods to solve the above equations, the LU factorisation $$A = LU$$ is needed, which is computed next:

auto B = A->copy();
TTruncAcc fac_acc( fac_eps, 0.0 );
auto A_inv = factorise_inv( B.get(), fac_acc, & progress );

## Iterative Solver

First, the classical method using an iterative solver is used. For this, each equation $$A x_i = b_i$$ is solved in parallel using the solve function. Analogously to the RHSs, the solution vectors are stored in a dense matrix SOL:

{
TDenseMatrix SOL( A->cols(), nrhs );
tbb::parallel_for( size_t(0), nrhs,
[&RHS,&SOL,&A,&A_inv,fac_eps] ( const size_t i )
{
BLAS::Vector< real_t > rhs_i( RHS.blas_rmat().column( i ) );
BLAS::Vector< real_t > sol_i( SOL.blas_rmat().column( i ) );
TScalarVector b( A->row_is(), rhs_i );
TScalarVector x( A->col_is(), sol_i );
solve( A.get(), & x, & b, A_inv.get() );
} );

After all equations are solved, the residual is computed with respect to all RHSs:

auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}

## Direct Vector Solves

Next, the \H-LU should be used as a direct solver. The only difference to the previous method is, that instead of using the function solve, the function apply of the operator A_inv is used, which applies the inverse of $$A$$ to the RHS $$b_i$$:

{
TDenseMatrix SOL( A->cols(), nrhs );
tbb::parallel_for( size_t(0), nrhs,
[&RHS,&SOL,&A,&A_inv] ( const size_t i )
{
BLAS::Vector< real_t > rhs_i( RHS.blas_rmat().column( i ) );
BLAS::Vector< real_t > sol_i( SOL.blas_rmat().column( i ) );
TScalarVector b( A->row_is(), rhs_i );
TScalarVector x( A->col_is(), sol_i );
A_inv->apply( & b, & x );
} );
auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}

Compared to the iterative method, this approach is faster but also the residual is usually higher. To achieve the same approximation as with the previous method, the H-LU factorisation has to be performed with an higher accuracy.

## Direct Matrix Solves

Finally, the LU factors $$L$$ and $$U$$ are used directly to solve all equations at once. Due to

$A X = L U X = B$

with $$X \in R^{n \times n_{rhs}}$$ being the solution matrix, first with the lower triangular matrix $$L$$ is solved and afterwards with the upper triangular matrix $$U$$:

{
TDenseMatrix SOL( A->cols(), nrhs );
solve_option_t opt_LL( block_wise, unit_diag );
solve_option_t opt_UR( block_wise, general_diag );
RHS.copy_to( & SOL );
solve_lower_left( apply_normal, B.get(), & SOL, fac_acc, opt_LL );
solve_upper_left( apply_normal, B.get(), & SOL, fac_acc, opt_UR );
auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}

Since $$L$$ is computed with unit diagonal, this also has to be defined for solving. Furthermore, instead of A_inv, which also represents the inverse operator, the matrix A_copy has to be used since it holds the actual LU factors.

Remarks
If $$A$$ is computed as a symmetric matrix, LDL factorisation would have been applied. In this case the matrix solve method looks as:
solve_lower_left( apply_normal, A_copy.get(), & SOL, fac_acc, opt_LL );
solve_diag_left( & SOL, apply_normal, A_copy.get(), fac_acc, opt_UR );
solve_lower_left( apply_transposed, A_copy.get(), & SOL, fac_acc, opt_LL );

In the hermitian case, apply_adjoint is to be used for the second solve_lower_left call.

The matrix approach yields to same residual as in the previous case. However, it may be more efficient in terms of runtime than the vector approach since it can exploit data locality much better.

However, the direct vector approach scales better with the number of cores. Hence, depending on your CPU configuration, this approach may be faster. Of course, the number of RHSs should be reasonably large for a good parallel scaling.

# The Plain Program

#include <iostream>
#include <string>
#include <boost/format.hpp>
#include <tbb/parallel_for.h>
#include "hlib.hh"
using namespace std;
using boost::format;
using namespace HLIB;
using real_t = HLIB::real;
using fnspace_t = TConstFnSpace;
namespace
{
class TMyRHS : public TBEMFunction< real_t >
{
private:
const double _alpha, _beta;
public:
TMyRHS ( const double alpha,
const double beta )
: _alpha( alpha )
, _beta( beta )
{}
// actual RHS function
virtual real_t
eval ( const T3Point & pos,
const T3Point & ) const
{
return real_t( sin( 0.5 * pos.x() * cos( _alpha * 0.25 * pos.z() ) ) * cos( _beta * pos.y() ) );
}
};
}
int
main ( int argc, char ** argv )
{
string gridfile = "grid.tri";
double eps = 1e-4;
double fac_eps = 1e-6;
if ( argc > 1 )
gridfile = argv;
try
{
INIT();
TConsoleProgressBar progress;
TAutoBSPPartStrat part_strat;
TBSPCTBuilder ct_builder( & part_strat );
TBCBuilder bct_builder;
auto grid = make_grid( gridfile );
auto fnspace = make_unique< fnspace_t >( grid.get() );
auto coord = fnspace->build_coord();
auto ct = ct_builder.build( coord.get() );
auto bct = bct_builder.build( ct.get(), ct.get(), & adm_cond );
slpbf_t bf( fnspace.get(), fnspace.get() );
TBFCoeffFn< slpbf_t > bem_coeff_fn( & bf );
TPermCoeffFn< real_t > coeff_fn( & bem_coeff_fn,
bct->row_ct()->perm_i2e(),
bct->col_ct()->perm_i2e() );
TACAPlus< real_t > lrapx( & coeff_fn );
TTruncAcc acc = fixed_prec( eps );
TDenseMBuilder< real_t > h_builder( & coeff_fn, & lrapx );
auto A = h_builder.build( bct.get(), unsymmetric, acc, & progress );
const size_t nrhs = 500;
TQuadBEMRHS< fnspace_t, real_t > rhs_build( 4 );
TDenseMatrix RHS( fnspace->n_indices(), nrhs );
tbb::parallel_for( size_t(0), nrhs,
[&RHS,&rhs_build,&fnspace] ( const size_t i )
{
TMyRHS rhsfn( 100 * drand48() - 50,
100 * drand48() - 50 );
auto b = rhs_build.build( fnspace.get(), & rhsfn );
auto rhs_i = RHS.blas_rmat().column( i );
BLAS::copy( cptrcast( b.get(), TScalarVector )->blas_rvec(), rhs_i );
} );
RHS.permute( ct->perm_e2i(), nullptr );
auto A_copy = A->copy();
TTruncAcc fac_acc( fac_eps, 0.0 );
auto A_inv = factorise_inv( A_copy.get(), fac_acc, & progress );
{
TDenseMatrix SOL( A->cols(), nrhs );
tbb::parallel_for( size_t(0), nrhs,
[&RHS,&SOL,&A,&A_inv,fac_eps] ( const size_t i )
{
BLAS::Vector< real_t > rhs_i( RHS.blas_rmat().column( i ) );
BLAS::Vector< real_t > sol_i( SOL.blas_rmat().column( i ) );
TScalarVector b( A->row_is(), rhs_i );
TScalarVector x( A->col_is(), sol_i );
solve( A.get(), & x, & b, A_inv.get() );
} );
auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}
{
TDenseMatrix SOL( A->cols(), nrhs );
tbb::parallel_for( size_t(0), nrhs,
[&RHS,&SOL,&A,&A_inv] ( const size_t i )
{
BLAS::Vector< real_t > rhs_i( RHS.blas_rmat().column( i ) );
BLAS::Vector< real_t > sol_i( SOL.blas_rmat().column( i ) );
TScalarVector b( A->row_is(), rhs_i );
TScalarVector x( A->col_is(), sol_i );
A_inv->apply( & b, & x );
} );
auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}
{
TDenseMatrix SOL( A->cols(), nrhs );
solve_option_t opt_LL( block_wise, unit_diag );
solve_option_t opt_UR( block_wise, general_diag );
RHS.copy_to( & SOL );
solve_lower_left( apply_normal, A_copy.get(), & SOL, fac_acc, opt_LL );
solve_upper_left( apply_normal, A_copy.get(), & SOL, fac_acc, opt_UR );
auto X = RHS.copy();
multiply( real_t(-1), apply_normal, A.get(), apply_normal, & SOL, real_t(1), X.get(), acc_exact );
cout << " |AX-B| = " << norm_F( X.get() ) << endl;
}
DONE();
}
catch ( Error & e )
{
cout << "HLIB::error : " << e.what() << endl;
}
catch ( std::exception & e )
{
cout << "std::exception : " << e.what() << endl;
}
return 0;
}
HLIB::TDenseMatrix
Represent a dense matrix.
Definition: TDenseMatrix.hh:32
HLIB::TBCBuilder
Recursively build block cluster tree with supplied admissibility condition.
Definition: TBCBuilder.hh:27
HLIB::TBSPCTBuilder
Base class for all cluster tree constructors based on BSP.
Definition: TBSPCTBuilder.hh:162
HLIB::TLaplaceSLPBF
Bilinear form for Laplace single layer potential.
Definition: TLaplaceBF.hh:46
HLIB::TBCBuilder::build
virtual std::unique_ptr< TBlockClusterTree > build(const TClusterTree *rowct, const TClusterTree *colct, const TAdmCondition *ac) const
HLIB::TACAPlus
Implements ACA+, which corrects some of the deficits of the original ACA algorithm.
Definition: TLowRankApx.hh:474
HLIB::TBlockClusterTree::row_ct
const TClusterTree * row_ct() const
return row cluster tree
Definition: TBlockClusterTree.hh:71
HLIB::TAutoBSPPartStrat
Automatic choice of best partitioning strategy.
Definition: TBSPPartStrat.hh:174
HLIB::eval
void eval(const TMatrix *A, TVector *v, const matop_t op, const eval_option_t &options)
evaluate L·U·x=y with lower triangular L and upper triangular U
HLIB::TClusterTree::perm_e2i
const TPermutation * perm_e2i() const
return external to internal permutation
Definition: TClusterTree.hh:73
HLIB::solve_option_t
options for how to solve with given matrix
Definition: solve_types.hh:28
HLIB::TClusterTree::perm_i2e
const TPermutation * perm_i2e() const
return internal to external permutation
Definition: TClusterTree.hh:76
HLIB::BLAS::solve
std::enable_if_t< is_matrix< T1 >::value &&is_vector< T2 >::value &&is_same_type< typename T1::value_t, typename T2::value_t >::value, void > solve(T1 &A, T2 &b)
solve A·x = b with known A and b; x overwrites b (A is overwritten upon exit!)
Definition: Algebra.hh:990
HLIB::BLAS::Vector
Standard vector in basic linear algebra, i.e. BLAS/LAPACK.
Definition: Vector.hh:38
HLIB::multiply
void multiply(const T_value alpha, const matop_t op_A, const TMatrix *A, const matop_t op_B, const TMatrix *B, const T_value beta, TMatrix *C, const TTruncAcc &acc, TProgressBar *progress=NULL)
compute C ≔ β·C + α·op(A)·op(B)
HLIB::TBlockClusterTree::col_ct
const TClusterTree * col_ct() const
return column cluster tree
Definition: TBlockClusterTree.hh:74
HLIB::BLAS::copy
std::enable_if_t< is_vector< T1 >::value &&is_vector< T2 >::value &&is_same_type< typename T1::value_t, typename T2::value_t >::value, void > copy(const T1 &x, T2 &y)
copy x into y
Definition: Algebra.hh:145
HLIB::TPermCoeffFn
Eval coefficient function with reordered indices, e.g. change internal to external ordering.
Definition: TCoeffFn.hh:138
HLIB::TBFCoeffFn
Provide matrix coefficients defined by bilinear forms.
Definition: TBFCoeffFn.hh:27
HLIB::TTruncAcc
Defines accuracy for truncation of low rank blocks.
Definition: TTruncAcc.hh:33
Standard admissibility for FEM/BEM applications normal : adm iff min( diam(τ), diam(σ) ) ≤ η·dist(τ,...