HLIBpro
2.8.1

Accumulator based Harithmetic uses a special accumulator for updates to matrix blocks during arithmetic. Furthermore, updates are applied following the Hhierarchy. By this, the number of lowrank truncations is significantly reduced and with this also the runtime. However, the savings are matrix dependent and also result in slightly reduced arithmetic accuracy.
To activate accumulator arithmetic, you have to set the parameter CFG::Arith::use_accu
to true
, e.g.,
or
or
in the configuration file .hlibpro.conf
(see also Parameters).
With accumulator arithmetic block updates may be computed as soon as possible (eager) and applied to the accumulator. Or the update computation and application is postponed to the time where the matrix block is needed for further computations (lazy).
By default, updates are applied in an eager mode. To switch to lazy mode, the parameter CFG::Arith::lazy_eval
has to be set to true
.
Furthermore, with lazy evaluation, the individual updates may be sorted to increase accuracy, e.g., if the updates differ much with respect to their norm, possibly letting smaller updates vanish when combined with larger updates.
Sorting updates in lazy mode is activated with the parameter CFG::Arith::sort_updates
.
We will run HLU benchmarks (without using symmetry) with three different Hmatrices to show the effect of accumulator Harithmetic compared to standard arithmetic.
All benchmarks are performed on a Intel i54670 CPU using a single CPU core.
This benchmark uses the Laplace SLP operator with a spherical with 32768 triangles and piecewise linear ansatz (see Boundary Element Methods).
Type  t(LU)  size(LU)  error 

standard  14.59s  199 MB  1.636₁₀3 
accu/eager  9.32s  216 MB  5.937₁₀3 
accu/lazy  9.64s  215 MB  4.117₁₀3 
acc/lazy/sort  10.22s  215 MB  4.294₁₀3 
Standard arithmetic gives smallest memory and best approximation but also highest runtime. Best runtime is obtained with accumulator arithmetic in eager mode. However, approximation is worst with this setting. Slightly slower is lazy mode and also improves approximation.
This benchmark uses the Helmholtz SLP operator with a spherical with 32768 triangles and piecewise constant ansatz (see Boundary Element Methods).
Type  t(LU)  size(LU)  error 

standard  69.43s  861 MB  6.726₁₀4 
accu/eager  42.25s  916 MB  1.167₁₀3 
accu/lazy  42.66s  915 MB  1.031₁₀3 
acc/lazy/sort  45.17s  915 MB  1.015₁₀3 
Again, standard arithmetic results in smallest memory and best approximation as well as highest runtime. Best runtime is again obtained with accumulator arithmetic in eager mode, which also has worst approximation. Lazy mode improves the latter slightly as does update sorting.
The third benchmark uses a sparse matrix from a 3D convection diffusion equation, partitioned using nested dissection (see Solving a Sparse Equation System). The number of unknowns is 132651.
Type  t(LU)  size(LU)  error 

standard  40.67s  1350 MB  1.271₁₀3 
accu/eager  35.62s  1350 MB  1.358₁₀3 
accu/lazy  35.08s  1350 MB  1.391₁₀3 
acc/lazy/sort  38.00s  1350 MB  1.367₁₀3 
The situation changes a little with this benchmark. Best runtime is now with lazy evaluation. Also the difference compared to standard arithmetic is smaller for runtime and for approximation error. The memory consumption is equal for all forms of arithmetic.
In all benchmark problems, accumulator based arithmetic results in significant runtime savings compared to the standard arithmetic. This however, goes along with a reduced accuracy, which can be compensated by slightly increasing the predefined accuracy of the arithmetic.
For instance, if the blockwise accuracy in the HLU for the Laplace example, where the relative difference in the approximation is highest, is reduced from \(10^{4}\) to \(3 \cdot 10^{5}\), the same error is obtained as with the standard arithmetic, while the runtime is then 11.24s.
In the other examples, the (relative) difference between the different arithmetic versions is even smaller.
There seem to be no significant improvments in runtime or accuracy when using lazy mode or sorting the updates.
Nevertheless, it is recommended to benchmark all different versions for a specific application to optimize runtime, memory and error.
See also Approximation and Truncation Algorithms for other parameters influencing arithmetic.