Kokkos Automatic Differentiation (Undisplaced) #32597

Open
NamjaeChoi wants to merge 13 commits into idaholab:next from NamjaeChoi:kokkos_ad

Conversation

@NamjaeChoi
Contributor

@NamjaeChoi NamjaeChoi commented Apr 1, 2026

Refs #32699.

@NamjaeChoi NamjaeChoi changed the title [WIP] Kokkos Automatic Differentation [WIP] Kokkos Automatic Differentiation (Undisplaced) Apr 1, 2026
@NamjaeChoi NamjaeChoi moved this to In progress in NEAMS MP CONNECT/GPU 26 Apr 2, 2026
@NamjaeChoi NamjaeChoi force-pushed the kokkos_ad branch 2 times, most recently from 1e1ea85 to 9e48e13 Compare April 2, 2026 20:01
@NamjaeChoi NamjaeChoi changed the title [WIP] Kokkos Automatic Differentiation (Undisplaced) Kokkos Automatic Differentiation (Undisplaced) Apr 2, 2026
@moosebuild
Contributor

moosebuild commented Apr 2, 2026

Job Documentation, step Docs: sync website on a2f3df9 wanted to post the following:

View the site here

This comment will be updated on new commits.

@NamjaeChoi NamjaeChoi force-pushed the kokkos_ad branch 4 times, most recently from e09c807 to 50259af Compare April 3, 2026 15:30
@idaholab idaholab deleted a comment from moosebuild Apr 13, 2026
@NamjaeChoi NamjaeChoi force-pushed the kokkos_ad branch 4 times, most recently from be28b83 to c18bb1b Compare April 21, 2026 21:53
@idaholab idaholab deleted a comment from moosebuild Apr 21, 2026
@NamjaeChoi NamjaeChoi marked this pull request as ready for review April 22, 2026 04:01
@NamjaeChoi NamjaeChoi force-pushed the kokkos_ad branch 3 times, most recently from 17b218d to a8ac7f0 Compare April 22, 2026 14:49
@NamjaeChoi NamjaeChoi requested a review from cticenhour as a code owner April 22, 2026 14:49
@NamjaeChoi
Contributor Author

They can also do copy construction; it would just be a bit wasteful.

_dot = tag == problem.getVectorTagID(Moose::SOLUTION_DOT_TAG);

if (problem.vectorTagExists(Moose::OLD_SOLUTION_TAG))
_old = _old || tag == problem.getVectorTagID(Moose::OLD_SOLUTION_TAG);
Member

Why the || conditional here?

Contributor Author

No reason

_dot = tag == problem.getVectorTagID(Moose::SOLUTION_DOT_TAG);

if (problem.vectorTagExists(Moose::OLD_SOLUTION_TAG))
_old = _old || tag == problem.getVectorTagID(Moose::OLD_SOLUTION_TAG);
Member

same question

@lindsayad
Member

could we maybe build some unit tests to bring down the 140 unused lines introduced in Coupleable?


Kokkos-MOOSE kernels also support [automatic differentiation (AD)](automatic_differentiation/index.md).
AD kernels can be derived from `Moose::Kokkos::ADKernel` and should be registered with `registerKokkosADResidualObject()`.
AD kernels requires `computeQpResidual()` to be defined with the following signature, where everything remains the same with the ordinary kernels except the return type being `Moose::Kokkos::ADReal`:
Member

Suggested change
AD kernels requires `computeQpResidual()` to be defined with the following signature, where everything remains the same with the ordinary kernels except the return type being `Moose::Kokkos::ADReal`:
AD kernels require `computeQpResidual()` to be defined with the following signature, where everything remains the same with the ordinary kernels except the return type being `Moose::Kokkos::ADReal`:

```cpp
AssemblyDatum & datum) const;
```

`computeQpJacobian()` and `computeQpOffDiagJacobian()` are unused, as AD automatically assembles Jacobian.
Member

Suggested change
`computeQpJacobian()` and `computeQpOffDiagJacobian()` are unused, as AD automatically assembles Jacobian.
`computeQpJacobian()` and `computeQpOffDiagJacobian()` are unused, as AD automatically assembles the Jacobian.

_dims.copyToDevice();
_offsets.copyToDevice();
_data.create(_offsets.last() + stride);
_data.create(_offsets.last() + stride + 1);
Member

was this a bug before? Some comment in the code regarding the +1 here would be nice

auto & sys = kokkosSystem(_kokkos_var.sys(comp));
auto row = sys.getElemLocalDofIndex(datum.elem().id, i, _kokkos_var.var(comp));

for (dof_id_type t = 0; t < _matrix_tags.size(); ++t)
Member

I'd really like to get KOKKOS_FUNCTION support for make_range and index_range (FYI @roystgnr). Because strictly speaking we could be promoting the integer type returned by _matrix_tags.size(), which is the safe way to do it compared to the other way around. But the pedantic/perfectionist part of me doesn't love it

Contributor

Wow, now that you mention it, so do I!

I don't want to put it in libMesh/libmesh#4451 (because that's almost ready to get merged, because I've already hassled @rochi00 enough about it, and because it is a logically separate feature) ... although, now that I look at things, I'm quite surprised that Rochi didn't have to get it in there. Looks like, although we #include <int_range.h> in type_vector.h, we only got around to using it for l1_norm() (which his code doesn't use) and we never switched any of the older methods to it!

{
auto tag = _matrix_tags[t];

if (sys.isMatrixTagActive(tag) && !sys.hasNodalBCMatrixTag(row, tag))
Member

So this is to avoid an intermediate matrix close between the application of integrated residual objects and application of nodal BC constraints? I'm surprised I haven't noticed this before; maybe because of our brief convo with respect to the Claude review. I don't like the asymmetry with the residual. We don't check for a nodal BC vector tag when deciding whether to accumulate into the vector or not, which would allow us to avoid the intermediate vector close.

I don't know if we'll ever apply Kokkos to contact, but we do rely on the intermediate population of residuals and Jacobians by integrated residual objects as then the residual at the interface represents forces which we sometimes then move over to the non-constrained surface of the contact interface. Skipping the intermediate population of the global residual/matrix would make it impossible to do that kind of kinematic constraint enforcement

Contributor Author

@NamjaeChoi NamjaeChoi May 13, 2026

This becomes very tricky in parallel. If an elemental kernel contributes to nodes covered by nodal BCs but not belonging to the current process, the remote process can't zero the nodes. I can't retrieve the closed matrix back into the COO ordering that I used for preallocation and assembly. PETSc doesn't have such capability.

Member

That sucks. Is the COO ordering required for things like nodal bcs?

Contributor Author

MatSetPreallocationCOO and MatSetValuesCOO are the only GPU matrix assembly methods they support.

Contributor Author

@NamjaeChoi NamjaeChoi May 13, 2026

I wonder if they can make something like MatGetValuesCOO. Then I can compute nodal BCs on it.

Member

What is different between MatZeroRows (which is what we've always used in MOOSE) and MatZeroRowsColumns with respect to the COO format?

Contributor Author

@NamjaeChoi NamjaeChoi May 14, 2026

According to the documentation:

Unlike MatZeroRows(), this ignores the MAT_KEEP_NONZERO_PATTERN option value set with MatSetOption(), it merely zeros those entries in the matrix, but never removes them from the nonzero pattern. The nonzero pattern of the matrix can still change if a nonzero needs to be inserted on a diagonal entry that was previously missing.

Contributor Author

I'm already zeroing out the row where nodal BCs apply in the current code I think?

Member

So is the plan then to remove these nodal bc matrix tags, close the matrix after element assembly, apply MatZeroRowsColumns, and then do the nodal BC matrix application? I still feel like MatZeroRows with MAT_KEEP_NONZERO_PATTERN would be a more robust pattern, as in general we don't know if a nodal BC has dof coupling, in which case the preset trick won't work

Contributor Author

@NamjaeChoi NamjaeChoi May 14, 2026

I implemented something. I hope the performance is not too bad, because MatZeroRows is a CPU API and MatSetValuesCOO is now called twice.

Comment on lines -542 to +714
Real3 grad = 0;
Real3 grad;
Member

I don't like this change combined with the += below. I guess it must work, but a developer who sees this would be worried that we're doing += with uninitialized memory

auto n_dofs = kokkosAssembly().getNumDofs(info.type, fe);
auto & grad_phi = kokkosAssembly().getGradPhiFace(info.subdomain, info.type, fe)(side);

ADReal3 grad;
Member

same concern as above

input = 'kokkos_2d_ad_diffusion_test.i'
cli_args = 'Outputs/exodus=false'
recover = false
requirement = 'The Kokkos system shall provide exact Jacobian using automatic differentation.'
Member

Suggested change
requirement = 'The Kokkos system shall provide exact Jacobian using automatic differentation.'
requirement = 'The Kokkos system shall provide an exact Jacobian using automatic differentiation.'

run_sim = True
ratio_tol = 1e-7
difference_tol = 1e-6
requirement = 'The Jacobian from KokkosADTimeDerivative shall be perfect.'
Member

Suggested change
requirement = 'The Jacobian from KokkosADTimeDerivative shall be perfect.'
requirement = 'The Jacobian from a Kokkos-based time derivative kernel shall be perfect.'

run_sim = True
ratio_tol = 1e-7
difference_tol = 1e-6
requirement = 'The Jacobian from KokkosADCoupledTimeDerivative shall be perfect.'
Member

Suggested change
requirement = 'The Jacobian from KokkosADCoupledTimeDerivative shall be perfect.'
requirement = 'The Jacobian from a Kokkos-based coupled variable time derivative kernel shall be perfect.'

@NamjaeChoi
Contributor Author

NamjaeChoi commented May 14, 2026

could we maybe build some unit tests to bring down the 140 unused lines introduced in Coupleable?

No, this requires the full problem to be built for testing. Hopefully we can get more coverage as time goes on and more tests get added.

@moosebuild
Contributor

Job Test, step Results summary on a2f3df9 wanted to post the following:

Framework test summary

Compared against 54450aa in job civet.inl.gov/job/3821125.

Added tests

| Test | Time (s) | Memory (MB) |
| --- | --- | --- |
| kokkos/kernels/ad_time_derivative.test_jacobian | 1.44 | 126.46 |
| kokkos/kernels/ad_time_derivative.coupled_jacobian | 0.92 | 122.19 |
| kokkos/kernels/ad_time_derivative.test | 0.72 | 114.28 |
| kokkos/bcs/ad_coupled_var_neumann.nonlinear/jac | 0.69 | 116.36 |
| kokkos/kernels/ad_time_derivative.coupled | 0.65 | 77.65 |
| kokkos/bcs/ad_coupled_var_neumann.test | 0.64 | 62.82 |
| kokkos/kernels/ad_diffusion.jacobian | 0.61 | 112.65 |
| kokkos/kernels/ad_time_derivative.separate | 0.60 | 127.85 |
| kokkos/bcs/ad_coupled_var_neumann.nonlinear/exo | 0.59 | 102.80 |
| kokkos/kernels/ad_diffusion.neumann | 0.57 | 64.75 |
| kokkos/kernels/ad_diffusion.dirichlet | 0.52 | 50.39 |

Modules test summary

Compared against 54450aa in job civet.inl.gov/job/3821125.

No change
