
'Alternating direction implicit' preconditioner #297

Merged: 43 commits merged into master from ADI-solver on Jan 10, 2025

Conversation

@johnomotani (Collaborator) commented Dec 4, 2024

The preconditioner within the Jacobian-free Newton Krylov (JFNK) solver used for implicit timestepping needs to 'invert a matrix'. The initial implementation does this with an LU factorisation, using Julia's LinearAlgebra package. This LU solver is serial. Some parallelisation can be achieved by constructing separate preconditioners in each distributed-MPI block, at the cost of dropping the elements of the preconditioner that would couple different blocks, but the LU solver cannot exploit our shared-memory parallelism. This creates a bottleneck: for a 1D1V test with 32 elements in the z-direction, no speed-up is gained by using 128 cores on ARCHER2 compared to 8 cores.

An alternative, which is less accurate but more parallelisable, is based on the 'alternating direction implicit' (ADI) approach. The aim is to solve

$$J.\delta x = R$$

for $\delta x$.
In the ADI approach we:

* First identify all the terms that only couple in $w_\parallel$. Split the Jacobian matrix into an 'implicit' matrix $J_{I,w}$ containing the terms that couple in $w_\parallel$ - which is block diagonal because grid points at different $z$ values are uncoupled - and an 'explicit' matrix $J_{E,w}$ with the remainder

$$J = J_{I,w} + J_{E,w}$$

Take an initial guess $\delta x_0 = \vec{0}$ and find an updated guess $\delta\tilde{x}_1$ from

$$J_{I,w}.\delta\tilde{x}_1 = R - J_{E,w}.\delta x_0 = R$$

* Second, identify all the terms that only couple in $z$, and make a similar split

$$J = J_{I,z} + J_{E,z}$$

where $J_{I,z}$ is block diagonal; this time points at different $w_\parallel$ are uncoupled ($g_e$ and $p_{e\parallel}$ are also uncoupled). Solve for a new guess $\delta x_1$ from

$$J_{I,z}.\delta x_1 = R - J_{E,z}.\delta \tilde{x}_1$$

* Optionally, iterate these two steps to get $\delta x_2$, $\delta x_3$, etc., which should be successively better approximations to the solution of $J.\delta x = R$; a minimal sketch of one ADI iteration is given after this list. Initial tests suggest that a single iteration might be enough: using more iterations in the ADI solve would slightly reduce the number of JFNK iterations, but not by the factor of 2, 3, 4, ... that would be needed to cover the cost of the extra ADI iterations.
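
As a minimal sketch (not the actual implementation in this PR; the names `J_Iw`, `J_Ew`, `J_Iz`, `J_Ez` and the use of plain pre-factorised matrices are assumptions for illustration), one ADI iteration could look like:

```julia
using LinearAlgebra, SparseArrays

# One ADI iteration for J.δx = R, given the splittings J = J_Iw + J_Ew
# (implicit in w∥) and J = J_Iz + J_Ez (implicit in z). The block-diagonal
# factors are factorised once and reused for every iteration.
function adi_iteration(J_Iw_lu, J_Ew, J_Iz_lu, J_Ez, R, δx)
    # 'w∥-implicit' half step: J_Iw.δx̃ = R - J_Ew.δx
    δx_tilde = J_Iw_lu \ (R - J_Ew * δx)
    # 'z-implicit' half step: J_Iz.δx_new = R - J_Ez.δx̃
    return J_Iz_lu \ (R - J_Ez * δx_tilde)
end

# Usage sketch: factorise once, start from δx = 0, iterate as many times as wanted.
# J_Iw_lu = lu(J_Iw); J_Iz_lu = lu(J_Iz)
# δx = zeros(length(R))
# for _ in 1:n_adi_iterations
#     δx = adi_iteration(J_Iw_lu, J_Ew, J_Iz_lu, J_Ez, R, δx)
# end
```

With $\delta x_0 = \vec{0}$ the first half step reduces to $J_{I,w}.\delta\tilde{x}_1 = R$, as in the description above.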

Todo:

  • Add a test that checks that an ADI solver converges to the LU solution after enough iterations?

Hopefully helps the compiler to remove unneeded branches at compile
time.
Hopefully improves compile/run time.
Also provide positional-arguments-only form of
`steady_state_residuals()`, which helps const-propagation and therefore
type stability (the compiler knows that the function returns a Vector
when `only_max_abs` is passed a const `true`).
Declare the type of the value returned by `MPI.Bcast()` to avoid type
instability.
After @BoundsCheck, need to actually throw an error.
Means less hacking to do if we ever do want to change (temporarily or
permanently) `mk_float` to a different type.
A variation on the 'alternating direction implicit' (ADI) method might
be useful as a preconditioner. It will require split Jacobians where
some 'implicit' parts only couple a subset of dimensions (i.e. z- or
velocity-dimensions), while 'explicit' parts may couple all dimensions
(but may also have some terms removed to make the matrix more sparse for
numerical efficiency).
Avoid using more than two Jacobian-sized buffer arrays at any time in
the Jacobian matrix tests. Using more than this would cause the Github
Actions CI servers to run out of shared memory, causing an error.
Add a preconditioner for kinetic electrons using a variation on the
'alternating direction implicit' (ADI) method. Terms that couple
velocity space are first solved implicitly (over vpa only for the 1V
case), with terms with z-coupling treated 'explicitly' by being
subtracted from the right-hand-side; then terms that couple in z are
solved implicitly, with terms with v-coupling treated 'explicitly'. The
two steps may be iterated more times if this is needed.
If delta_x is all-zero, then P^-1.delta_x is also all-zero, so no need
to evaluate it.
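
A hedged sketch of this shortcut (the function name `apply_preconditioner!` and its signature are illustrative, not the actual API):

```julia
# If δx is identically zero then P⁻¹.δx is also zero, so the relatively
# expensive preconditioner solve can be skipped entirely.
function apply_preconditioner!(result, precon, δx)
    if all(iszero, δx)
        result .= 0
        return result
    end
    # ... otherwise apply the ADI or LU preconditioner stored in `precon` ...
    return result
end
```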
One iteration of ADI preconditioning seems to be enough to make the JFNK
solver for kinetic electrons converge. The number of linear (Krylov)
iterations required increases slightly, but overall this should reduce
the computational cost.
When not parallelising using shared memory, there is no need to split
the preconditioner and the LU preconditioner should be the most
efficient. Therefore use the LU precon in serial, and use ADI only when
`block_size[] > 1`.
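
In pseudocode (illustrative only; `block_size[]` is the shared-memory block size referred to above), the selection amounts to:

```julia
# Exact LU preconditioner when a distributed-MPI block runs on a single core,
# shared-memory-parallelisable ADI preconditioner otherwise.
preconditioner_type = block_size[] == 1 ? :lu : :adi
```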
Indexing error meant that half the time the cutoff did not interpolate
smoothly between grid points.
Will allow the calculation of these parameters to be reused when
calculating a Jacobian matrix for the wall bc.
Where integrals over the into-the-sheath part of the distribution
function are split into two parts at +/- vcut, modify the way that the
split is done so that the 'part 2' integral between 0 and +/-vcut is
calculated the same way as an integral over the out-from-the-sheath part
of the distribution which is cut off at -/+vcut.

Not sure if this is necessary, but seems nicer to be more consistent.
If the prefactor that sets the correction terms to be proportional to
vpa^2 near vpa=0 is too broad, then it is hard for the correction terms
to fix errors in low moments (e.g. density moment), so making it a bit
narrower reduces the size of the coefficients of the correction terms.
If phi_wall=0, giving vcut=0, then it is not possible to apply moment
constraints, which causes an error due to a singular matrix. This case
is unphysical, so should not ever be a converged solution (vcut=0
corresponds to a fully electron-absorbing sheath). To avoid the problem,
check if vcut==0, and if so set vcut to some small value (we choose the
value at the next grid point after the one closest to 0).

[skip ci]
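
A sketch of the vcut guard described above, with illustrative variable names (assuming `vpa_grid` holds the $v_\parallel$ grid values):

```julia
# phi_wall = 0 gives vcut = 0, which would make the moment-constraint matrix
# singular. Bump vcut to the value at the next grid point after the one
# closest to v∥ = 0.
if vcut == 0.0
    izero = argmin(abs.(vpa_grid))   # index of the grid point closest to zero
    vcut = abs(vpa_grid[izero + 1])  # small but non-zero cutoff
end
```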
By default (if `implicit_electron_ppar = true`), use LU when
`block_size[] == 1` and ADI otherwise, but now
`implicit_electron_ppar = "lu"` or `implicit_electron_ppar = "adi"` can be
passed to pick the preconditioner type explicitly.
Hoped this might help find a bug, but did not help with that. Do not
know any particular reason to update, but might as well keep up to date.
The previous scheme just interpolated using existing values in the whole
element containing zero. This meant that the result of the boundary condition
depended on some points that are overwritten by the boundary condition.
Improve on this by doing special interpolation in the element containing
zero. Instead of using the usual `interpolate_to_grid_1d!()` function,
use `interpolate_symmetric!()` which does an interpolation that is
forced to be symmetric around v_parallel=0, as the interpolation
polynomial is a polynomial in `v_parallel^2` (the interpolating
polynomial is constructed using a Lagrange polynomial method).  The
inputs to the interpolation are now just the function values on grid
points (within the element containing zero) that are not set by the
boundary condition.

Also optimises the interpolation of the points in the elements not
containing zero by restricting the interpolation to just the points
needed for output, instead of interpolating to the full reversed grid.
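
For illustration, a standalone sketch (not the actual `interpolate_symmetric!()` implementation) of why building the Lagrange interpolant in $v_\parallel^2$ forces the result to be symmetric about $v_\parallel=0$:

```julia
# Build and evaluate a Lagrange interpolant in s = v∥². Because the polynomial
# depends only on v∥², the interpolated values are even in v∥ by construction.
function interpolate_symmetric(v_out, v_nodes, f_nodes)
    s_nodes = v_nodes .^ 2
    result = similar(v_out, Float64)
    for (i, v) in enumerate(v_out)
        s = v^2
        acc = 0.0
        for j in eachindex(s_nodes)
            ℓ = 1.0                      # Lagrange basis polynomial ℓ_j(s)
            for k in eachindex(s_nodes)
                k == j && continue
                ℓ *= (s - s_nodes[k]) / (s_nodes[j] - s_nodes[k])
            end
            acc += f_nodes[j] * ℓ
        end
        result[i] = acc
    end
    return result
end
```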
Skipping the single update of vcut, which was previously done even when
`bc_constraints=false` was passed to
`enforce_boundary_condition_on_electron_pdf!()`, makes it possible to match
the result with an interpolation matrix that does not couple every point in
the vpa grid (the coupling would otherwise arise from the integral nature of
the update of vcut).
Improves convergence a bit, allowing electron solver to take larger
pseudo-timesteps, in at least one case.
Hopefully reduce allocations. Maybe save a little compilation time by
removing some unnecessary `@views`.
Several small updates. In particular need to be careful with variables
that are captured by a locally-defined function like `residual_func!()`.
Input parameters that control maximum number of pseudotimesteps and
maximum total pseudotime for each kinetic electron pseudotimestepping
loop.
Helps if something didn't previously clean up the timers.
@johnomotani (Collaborator, Author) commented Dec 4, 2024

This PR is a draft for the moment, because I want to test whether it is possible and useful to add communication between different distributed-MPI blocks after each ADI iteration - could doing that and taking 2 or 3 ADI iterations speed up the JFNK convergence?

@johnomotani (Collaborator, Author) commented

Initial performance testing for a 1D1V kinetic electron simulation on ARCHER2 with 32 elements in the $z$-direction split into 8 blocks. The grid in $z$ is uniform. Times and iteration counts are taken over 5 output steps, running from 0.004 to 0.009 $L_{ref}/c_{s,ref}$. This is still in an initial transient phase, which is maybe not ideal, but is as far as the 8-core runs (or the 128-core LU run) would get in 4 hours on 1 node of ARCHER2.

|  | Run time (s) | Total Krylov iterations | Time per preconditioner evaluation (ms) |
| --- | --- | --- | --- |
| LU, 8 cores | 57.4 | 600093 | 4.60 |
| LU, 128 cores | 58.2 | 586143 | 4.24 |
| ADI, 8 cores | 115.8 | 885619 | 6.19 |
| ADI, 128 cores | 23.4 | 913841 | 0.397 |

This seems to agree with what we would expect. The ADI solver takes a few more iterations because it is only an approximate matrix inverse, whereas the LU is exact (within each block), but the LU solve does not scale with the number of shared-memory cores, whereas the ADI solve does, with almost perfect scaling: 6.19/0.397 = 15.6 times speed-up with 16 times more cores.

I suspect the run time for the 8-core ADI case may be slightly high due to random fluctuation, as the iteration count would suggest more like 90 s. This might make the near-perfect scaling mentioned above over-optimistic, depending on whether the random slowdown (if it happened at all) occurred during a call to the preconditioner or while the code was somewhere else.

@johnomotani johnomotani force-pushed the ADI-solver branch 2 times, most recently from 53c4bdc to db88708 Compare December 6, 2024 13:42
This boundary condition function should act only at one `ir`, which is
passed as an argument, so there should be no loop over `ir`. This bug
has not impacted simulations so far because we have only used r.n=0 for
kinetic electrons.
Can (?) skip the problematic solve, which was run to update the electron
shape function without updating the electron pressure, on the 'explicit
first stage' of ESDIRK schemes. This might have the effect of reducing
the order of accuracy of the scheme somehow, as the qpar_e used for the
'explicit' calculation of the time derivative of electron_ppar is taken
from the most recent implicit solve rather than a solve updated with the
new ion/electron profiles. However, the change is probably small (?) and
at least the solver does run now - it is useful to have an adaptive ion
timestep as it may let the code recover by reducing the ion timestep
when an electron implicit solve fails to converge.
When running kinetic electron simulations, it can cause problems to take
a very short ion timestep. When writing outputs at exactly fixed output
times, this can happen if the previous timestep happened to end just
before the output time. To avoid the very short step, by default write
output at the end of whichever timestep first exceeds the nominal output
time. There is an option to force the previous behaviour of a decreased
timestep so that output is written exactly at the nominal output time.
This seems to avoid compiler errors on macOS. No idea why this should
help, or what the original problem was!
This test is extremely slow when run in parallel on macOS (maybe the
macOS servers on Github Actions don't have enough memory?), so skip it
in this case to avoid test failures.
This might help to remove some of the performance loss due to inverting
the preconditioner separately in each distributed-MPI block.
@johnomotani johnomotani marked this pull request as ready for review January 10, 2025 23:59
@johnomotani johnomotani merged commit bf751dd into master Jan 10, 2025
21 checks passed
@johnomotani johnomotani deleted the ADI-solver branch January 10, 2025 23:59
@johnomotani johnomotani added the enhancement New feature or request label Jan 11, 2025