HLIBpro  2.8.1
Accumulator Arithmetic

# Introduction

Accumulator based H-arithmetic uses a special accumulator for updates to matrix blocks during arithmetic. Furthermore, updates are applied following the H-hierarchy. By this, the number of low-rank truncations is significantly reduced and with this also the runtime. However, the savings are matrix dependent and also result in slightly reduced arithmetic accuracy.

To activate accumulator arithmetic, you have to set the parameter CFG::Arith::use_accu to true, e.g.,

HLIB::CFG::Arith::use_accu = true;

or

export HLIB_Arith_use_accu=1

or

[Arith]
use_accu=1

in the configuration file .hlibpro.conf (see also Parameters).

## Eager vs Lazy Evaluation

With accumulator arithmetic block updates may be computed as soon as possible (eager) and applied to the accumulator. Or the update computation and application is postponed to the time where the matrix block is needed for further computations (lazy).

By default, updates are applied in an eager mode. To switch to lazy mode, the parameter CFG::Arith::lazy_eval has to be set to true.

Furthermore, with lazy evaluation, the individual updates may be sorted to increase accuracy, e.g., if the updates differ much with respect to their norm, possibly letting smaller updates vanish when combined with larger updates.

Sorting updates in lazy mode is activated with the parameter CFG::Arith::sort_updates.

# Benchmarks

We will run H-LU benchmarks (without using symmetry) with three different H-matrices to show the effect of accumulator H-arithmetic compared to standard arithmetic.

All benchmarks are performed on a Intel i5-4670 CPU using a single CPU core.

## Laplace

This benchmark uses the Laplace SLP operator with a spherical with 32768 triangles and piecewise linear ansatz (see Boundary Element Methods).

Type t(LU) size(LU) error
standard 14.59s 199 MB 1.636₁₀-3
accu/eager 9.32s 216 MB 5.937₁₀-3
accu/lazy 9.64s 215 MB 4.117₁₀-3
acc/lazy/sort 10.22s 215 MB 4.294₁₀-3

Standard arithmetic gives smallest memory and best approximation but also highest runtime. Best runtime is obtained with accumulator arithmetic in eager mode. However, approximation is worst with this setting. Slightly slower is lazy mode and also improves approximation.

## Helmholtz

This benchmark uses the Helmholtz SLP operator with a spherical with 32768 triangles and piecewise constant ansatz (see Boundary Element Methods).

Type t(LU) size(LU) error
standard 69.43s 861 MB 6.726₁₀-4
accu/eager 42.25s 916 MB 1.167₁₀-3
accu/lazy 42.66s 915 MB 1.031₁₀-3
acc/lazy/sort 45.17s 915 MB 1.015₁₀-3

Again, standard arithmetic results in smallest memory and best approximation as well as highest runtime. Best runtime is again obtained with accumulator arithmetic in eager mode, which also has worst approximation. Lazy mode improves the latter slightly as does update sorting.

## Convection-Diffusion

The third benchmark uses a sparse matrix from a 3D convection diffusion equation, partitioned using nested dissection (see Solving a Sparse Equation System). The number of unknowns is 132651.

Type t(LU) size(LU) error
standard 40.67s 1350 MB 1.271₁₀-3
accu/eager 35.62s 1350 MB 1.358₁₀-3
accu/lazy 35.08s 1350 MB 1.391₁₀-3
acc/lazy/sort 38.00s 1350 MB 1.367₁₀-3

The situation changes a little with this benchmark. Best runtime is now with lazy evaluation. Also the difference compared to standard arithmetic is smaller for runtime and for approximation error. The memory consumption is equal for all forms of arithmetic.

# Conclusion

In all benchmark problems, accumulator based arithmetic results in significant runtime savings compared to the standard arithmetic. This however, goes along with a reduced accuracy, which can be compensated by slightly increasing the predefined accuracy of the arithmetic.

For instance, if the block-wise accuracy in the H-LU for the Laplace example, where the relative difference in the approximation is highest, is reduced from $$10^{-4}$$ to $$3 \cdot 10^{-5}$$, the same error is obtained as with the standard arithmetic, while the runtime is then 11.24s.

In the other examples, the (relative) difference between the different arithmetic versions is even smaller.

There seem to be no significant improvments in runtime or accuracy when using lazy mode or sorting the updates.

Nevertheless, it is recommended to benchmark all different versions for a specific application to optimize runtime, memory and error.