
'Alternating direction implicit' preconditioner #297

Merged: 43 commits merged into master from ADI-solver on Jan 10, 2025

Conversation

@johnomotani (Collaborator) commented Dec 4, 2024

The preconditioner within the Jacobian-free Newton Krylov (JFNK) solver used for implicit timestepping needs to 'invert a matrix'. The initial implementation does this with an LU factorisation, using Julia's LinearAlgebra package. This LU solver is serial. Some parallelisation can be achieved by constructing separate preconditioners in each distributed-MPI block, at the cost of dropping the elements of the preconditioner that would couple different blocks, but the LU solver cannot exploit our shared-memory parallelism. This creates a bottleneck: for a 1D1V test with 32 elements in the z-direction, no speed-up is gained by using 128 cores on ARCHER2 compared to 8 cores.

An alternative, which is less accurate but more parallelisable, is based on the 'alternating direction implicit' (ADI) approach. The aim is to solve

$$J.\delta x = R$$

for $\delta x$.
In the ADI approach we:

* First identify all the terms that only couple in $w_\parallel$. Split the Jacobian matrix into an 'implicit' matrix $J_{I,w}$ containing the terms that couple in $w_\parallel$ - which is block diagonal because grid points at different $z$ values are uncoupled - and an 'explicit' matrix $J_{E,w}$ with the remainder

$$J = J_{I,w} + J_{E,w}$$

Take an initial guess $\delta x_0 = \vec{0}$ and find an updated guess $\delta\tilde{x}_1$ from

$$J_{I,w}.\delta\tilde{x}_1 = R - J_{E,w}.\delta x_0 = R$$

* Second, identify all the terms that only couple in $z$, and make a similar split

$$J = J_{I,z} + J_{E,z}$$

where $J_{I,z}$ is block diagonal; this time points at different $w_\parallel$ are uncoupled ($g_e$ and $p_{e\parallel}$ are also uncoupled). Solve for a new guess $\delta x_1$ from

$$J_{I,z}.\delta x_1 = R - J_{E,z}.\delta \tilde{x}_1$$

* Optionally, iterate these two steps to get $\delta x_2$, $\delta x_3$, etc., which should be successively better approximations to the solution of $J.\delta x = R$; a minimal sketch of one ADI iteration is given after this list. Initial tests suggest that a single iteration might be enough: using more iterations in the ADI solve would slightly reduce the number of JFNK iterations, but not by the factor of 2, 3, 4, ... that would be needed to cover the cost of the extra ADI iterations.
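
As a minimal sketch (not the actual implementation in this PR; the names `J_Iw`, `J_Ew`, `J_Iz`, `J_Ez` and the use of plain pre-factorised matrices are assumptions for illustration), one ADI iteration could look like:

```julia
using LinearAlgebra, SparseArrays

# One ADI iteration for J.δx = R, given the splittings J = J_Iw + J_Ew
# (implicit in w∥) and J = J_Iz + J_Ez (implicit in z). The block-diagonal
# factors are factorised once and reused for every iteration.
function adi_iteration(J_Iw_lu, J_Ew, J_Iz_lu, J_Ez, R, δx)
    # 'w∥-implicit' half step: J_Iw.δx̃ = R - J_Ew.δx
    δx_tilde = J_Iw_lu \ (R - J_Ew * δx)
    # 'z-implicit' half step: J_Iz.δx_new = R - J_Ez.δx̃
    return J_Iz_lu \ (R - J_Ez * δx_tilde)
end

# Usage sketch: factorise once, start from δx = 0, iterate as many times as wanted.
# J_Iw_lu = lu(J_Iw); J_Iz_lu = lu(J_Iz)
# δx = zeros(length(R))
# for _ in 1:n_adi_iterations
#     δx = adi_iteration(J_Iw_lu, J_Ew, J_Iz_lu, J_Ez, R, δx)
# end
```

With $\delta x_0 = \vec{0}$ the first half step reduces to $J_{I,w}.\delta\tilde{x}_1 = R$, as in the description above.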

Todo:

  • Add a test that checks that an ADI solver converges to the LU solution after enough iterations?

Hopefully helps the compiler to remove unneeded branches at compile
time.
Hopefully improves compile/run time.
Also provide positional-arguments-only form of
`steady_state_residuals()`, which helps const-propagation and therefore
type stability (the compiler knows that the function returns a Vector
when `only_max_abs` is passed a const `true`).
Declare the type of the value returned by `MPI.Bcast()` to avoid type
instability.
After @BoundsCheck, need to actually throw an error.
Means less hacking to do if we ever do want to change (temporarily or
permanently) `mk_float` to a different type.
A variation on the 'alternating direction implicit' (ADI) method might
be useful as a preconditioner. It will require split Jacobians where
some 'implicit' parts only couple a subset of dimensions (i.e. z- or
velocity-dimensions), while 'explicit' parts may couple all dimensions
(but may also have some terms removed to make the matrix more sparse for
numerical efficiency).
Avoid using more than two Jacobian-sized buffer arrays at any time in
the Jacobian matrix tests. Using more than this would cause the Github
Actions CI servers to run out of shared memory, causing an error.
Add a preconditioner for kinetic electrons using a variation on the
'alternating direction implicit' (ADI) method. Terms that couple
velocity space are first solved implicitly (over vpa only for the 1V
case), with terms with z-coupling treated 'explicitly' by being
subtracted from the right-hand-side; then terms that couple in z are
solved implicitly, with terms with v-coupling treated 'explicitly'. The
two steps may be iterated more times if this is needed.
If delta_x is all-zero, then P^-1.delta_x is also all-zero, so no need
to evaluate it.
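
A hedged sketch of this shortcut (the function name `apply_preconditioner!` and its signature are illustrative, not the actual API):

```julia
# If δx is identically zero then P⁻¹.δx is also zero, so the relatively
# expensive preconditioner solve can be skipped entirely.
function apply_preconditioner!(result, precon, δx)
    if all(iszero, δx)
        result .= 0
        return result
    end
    # ... otherwise apply the ADI or LU preconditioner stored in `precon` ...
    return result
end
```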
One iteration of ADI preconditioning seems to be enough to make the JFNK
solver for kinetic electrons converge. The number of linear (Krylov)
iterations required increases slightly, but overall this should reduce
the computational cost.
When not parallelising using shared memory, there is no need to split
the preconditioner and the LU preconditioner should be the most
efficient. Therefore use the LU precon in serial, and use ADI only when
`block_size[] > 1`.
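
In pseudocode (illustrative only; `block_size[]` is the shared-memory block size referred to above), the selection amounts to:

```julia
# Exact LU preconditioner when a distributed-MPI block runs on a single core,
# shared-memory-parallelisable ADI preconditioner otherwise.
preconditioner_type = block_size[] == 1 ? :lu : :adi
```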
Indexing error meant that half the time the cutoff did not interpolate
smoothly between grid points.
Will allow the calculation of these parameters to be reused when
calculating a Jacobian matrix for the wall bc.
Where integrals over the into-the-sheath part of the distribution
function are split into two parts at +/- vcut, modify the way that the
split is done so that the 'part 2' integral between 0 and +/-vcut is
calculated the same way as an integral over the out-from-the-sheath part
of the distribution which is cut off at -/+vcut.

Not sure if this is necessary, but seems nicer to be more consistent.
If the prefactor that sets the correction terms to be proportional to
vpa^2 near vpa=0 is too broad, then it is hard for the correction terms
to fix errors in low moments (e.g. density moment), so making it a bit
narrower reduces the size of the coefficients of the correction terms.
If phi_wall=0, giving vcut=0, then it is not possible to apply moment
constraints, which causes an error due to a singular matrix. This case
is unphysical, so should not ever be a converged solution (vcut=0
corresponds to a fully electron-absorbing sheath). To avoid the problem,
check if vcut==0, and if so set vcut to some small value (we choose the
value at the next grid point after the one closest to 0).

[skip ci]
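
A sketch of the vcut guard described above, with illustrative variable names (assuming `vpa_grid` holds the $v_\parallel$ grid values):

```julia
# phi_wall = 0 gives vcut = 0, which would make the moment-constraint matrix
# singular. Bump vcut to the value at the next grid point after the one
# closest to v∥ = 0.
if vcut == 0.0
    izero = argmin(abs.(vpa_grid))   # index of the grid point closest to zero
    vcut = abs(vpa_grid[izero + 1])  # small but non-zero cutoff
end
```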
By default (if `implicit_electron_ppar = true`), use LU when
`block_size[] == 1` and ADI otherwise, but now
`implicit_electron_ppar = "lu"` or `implicit_electron_ppar = "adi"` can be
passed to pick the preconditioner type explicitly.
Hoped this might help find a bug, but did not help with that. Do not
know any particular reason to update, but might as well keep up to date.
The previous scheme just interpolated using existing values in the whole
element containing zero. This meant that the result of the boundary condition
depended on some points that are overwritten by the boundary condition.
Improve on this by doing special interpolation in the element containing
zero. Instead of using the usual `interpolate_to_grid_1d!()` function,
use `interpolate_symmetric!()` which does an interpolation that is
forced to be symmetric around v_parallel=0, as the interpolation
polynomial is a polynomial in `v_parallel^2` (the interpolating
polynomial is constructed using a Lagrange polynomial method).  The
inputs to the interpolation are now just the function values on grid
points (within the element containing zero) that are not set by the
boundary condition.

Also optimises the interpolation of the points in the elements not
containing zero by restricting the interpolation to just the points
needed for output, instead of interpolating to the full reversed grid.
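
For illustration, a standalone sketch (not the actual `interpolate_symmetric!()` implementation) of why building the Lagrange interpolant in $v_\parallel^2$ forces the result to be symmetric about $v_\parallel=0$:

```julia
# Build and evaluate a Lagrange interpolant in s = v∥². Because the polynomial
# depends only on v∥², the interpolated values are even in v∥ by construction.
function interpolate_symmetric(v_out, v_nodes, f_nodes)
    s_nodes = v_nodes .^ 2
    result = similar(v_out, Float64)
    for (i, v) in enumerate(v_out)
        s = v^2
        acc = 0.0
        for j in eachindex(s_nodes)
            ℓ = 1.0                      # Lagrange basis polynomial ℓ_j(s)
            for k in eachindex(s_nodes)
                k == j && continue
                ℓ *= (s - s_nodes[k]) / (s_nodes[j] - s_nodes[k])
            end
            acc += f_nodes[j] * ℓ
        end
        result[i] = acc
    end
    return result
end
```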
Skipping the single update of vcut, which was previously done even when
`bc_constraints=false` was passed to
`enforce_boundary_condition_on_electron_pdf!()`, makes it possible to match
the result with an interpolation matrix that does not couple every point in
the vpa grid (the coupling would otherwise arise from the integral nature of
the update of vcut).
Improves convergence a bit, allowing electron solver to take larger
pseudo-timesteps, in at least one case.
Hopefully reduce allocations. Maybe save a little compilation time by
removing some unnecessary `@views`.
Several small updates. In particular need to be careful with variables
that are captured by a locally-defined function like `residual_func!()`.
Input parameters that control maximum number of pseudotimesteps and
maximum total pseudotime for each kinetic electron pseudotimestepping
loop.
Helps if something didn't previously clean up the timers.
@johnomotani (Collaborator, Author) commented Dec 4, 2024

This PR is a draft for the moment, because I want to test whether it is possible and useful to add communication between different distributed-MPI blocks after each ADI iteration - could doing that and taking 2 or 3 ADI iterations speed up the JFNK convergence?

@johnomotani (Collaborator, Author) commented

Initial performance testing for a 1D1V kinetic electron simulation on ARCHER2 with 32 elements in the $z$-direction split into 8 blocks. The grid in $z$ is uniform. Times and iteration counts are taken over 5 output steps, running from 0.004 to 0.009 $L_{ref}/c_{s,ref}$. This is still in an initial transient phase, which is maybe not ideal, but is as far as the 8-core runs (or the 128-core LU run) would get in 4 hours on 1 node of ARCHER2.

|  | Run time (s) | Total Krylov iterations | Time per preconditioner evaluation (ms) |
| --- | --- | --- | --- |
| LU, 8 cores | 57.4 | 600093 | 4.60 |
| LU, 128 cores | 58.2 | 586143 | 4.24 |
| ADI, 8 cores | 115.8 | 885619 | 6.19 |
| ADI, 128 cores | 23.4 | 913841 | 0.397 |

This seems to agree with what we would expect. The ADI solver takes a few more iterations because it is only an approximate matrix inverse, whereas the LU is exact (within each block), but the LU solve does not scale with the number of shared-memory cores, whereas the ADI solve does, with almost perfect scaling: 6.19/0.397 = 15.6 times speed-up with 16 times more cores.

I suspect the run time for the 8-core ADI case may be slightly high due to random fluctuation, as the iteration count would suggest more like 90 s. This might make the near-perfect scaling mentioned above over-optimistic, depending on whether the random slowdown (if it happened at all) occurred during a call to the preconditioner or while the code was somewhere else.

@johnomotani johnomotani force-pushed the ADI-solver branch 2 times, most recently from 53c4bdc to db88708 Compare December 6, 2024 13:42
This boundary condition function should act only at one `ir`, which is
passed as an argument, so there should be no loop over `ir`. This bug
has not impacted simulations so far because we have only used r.n=0 for
kinetic electrons.
Can (?) skip the problematic solve, which was run to update the electron
shape function without updating the electron pressure, on the 'explicit
first stage' of ESDIRK schemes. This might have the effect of reducing
the order of accuracy of the scheme somehow, as the qpar_e used for the
'explicit' calculation of the time derivative of electron_ppar is taken
from the most recent implicit solve rather than a solve updated with the
new ion/electron profiles. However, the change is probably small (?) and
at least the solver does run now - it is useful to have an adaptive ion
timestep as it may let the code recover by reducing the ion timestep
when an electron implicit solve fails to converge.
When running kinetic electron simulations, it can cause problems to take
a very short ion timestep. When writing outputs at exactly fixed output
times, this can happen if the previous timestep happened to end just
before the output time. To avoid the very short step, by default write
output at the end of whichever timestep first exceeds the nominal output
time. There is an option to force the previous behaviour of a decreased
timestep so that output is written exactly at the nominal output time.
This seems to avoid compiler errors on macOS. No idea why this should
help, or what the original problem was!
This test is extremely slow when run in parallel on macOS (maybe the
macOS servers on Github Actions don't have enough memory?), so skip it
in this case to avoid test failures.
This might help to remove some of the performance loss due to inverting
the preconditioner separately in each distributed-MPI block.
@johnomotani johnomotani marked this pull request as ready for review January 10, 2025 23:59
@johnomotani johnomotani merged commit bf751dd into master Jan 10, 2025
21 checks passed
@johnomotani johnomotani deleted the ADI-solver branch January 10, 2025 23:59
@johnomotani johnomotani added the enhancement New feature or request label Jan 11, 2025