AD performance #16466

hugary1995 · 2020-12-11T09:00:28Z

hugary1995
Dec 11, 2020
Collaborator

I have been using AD for tensor mechanics applications for quite some time. In some cases AD is only slightly slower, which is acceptable, but in other cases AD is ~5x slower.

We've had this discussion many times at many different places. I think it is time to sit down and look at the performance seriously. I hope this can be a good starting point for some AD performance improvements in tensor mechanics.

I put together three really simple test cases using the tensor mechanics module only.
The input files can be found at https://github.com/hugary1995/moose/tree/AD_performance/modules/tensor_mechanics/examples/perf

Model

This is a 2D square domain with RZ coordinates. The bottom edge is fixed in z, and the left edge is fixed in r. The top edge is being pulled upward (on the displaced mesh). There is a nonhomogeneous eigenstrain applied on the domain.

Each test case uses a different constitutive law: a total small strain linear elastic model, an incremental small strain power law creep model, and an incremental finite strain power law creep model.

Performance

All tests ran in serial.

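| | total small strain elastic: total_nl_its | total small strain elastic: total_time (sec) | incremental small strain creep: total_nl_its | incremental small strain creep: total_time (sec) | incremental finite strain creep: total_nl_its | incremental finite strain creep: total_time (sec) |
|---|---|---|---|---|---|---|
| non-AD | 26 | 3.900 | 78 | 23.239 | 95 | 29.928 |
| AD (global sparse) | 20 | 6.980 | 65 | 72.463 | 74 | 118.143 |
| AD (local dense) | 20 | 8.985 | 65 | 89.159 | 74 | 134.956 |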
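
To make the slowdown in the table easier to read, here is a small sketch that turns the raw numbers into ratios. The iteration counts and wall times are copied from the table; the dictionary layout and the function names are only illustrative.

```python
# Nonlinear iteration counts and wall times (seconds), copied from the table above.
RESULTS = {
    "total small strain elastic": {
        "non-AD": (26, 3.900),
        "AD (global sparse)": (20, 6.980),
        "AD (local dense)": (20, 8.985),
    },
    "incremental small strain creep": {
        "non-AD": (78, 23.239),
        "AD (global sparse)": (65, 72.463),
        "AD (local dense)": (65, 89.159),
    },
    "incremental finite strain creep": {
        "non-AD": (95, 29.928),
        "AD (global sparse)": (74, 118.143),
        "AD (local dense)": (74, 134.956),
    },
}


def slowdown(case, variant):
    """Total wall time of `variant` divided by the non-AD wall time."""
    return RESULTS[case][variant][1] / RESULTS[case]["non-AD"][1]


def time_per_nl_iteration(case, variant):
    """Average wall time spent per nonlinear iteration."""
    its, seconds = RESULTS[case][variant]
    return seconds / its


for case, rows in RESULTS.items():
    for variant in rows:
        print(
            f"{case:32s} {variant:20s} "
            f"{slowdown(case, variant):5.2f}x total, "
            f"{time_per_nl_iteration(case, variant):6.3f} s per nonlinear iteration"
        )
```

On these numbers the total-time slowdown ranges from about 1.8x to about 4.5x. Because the AD runs converge in fewer nonlinear iterations, the cost per nonlinear iteration grows by more than the total time does, up to roughly 5.8x for the finite strain case with local dense indexing.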


hugary1995 · 2020-12-11T09:03:23Z

hugary1995
Dec 11, 2020
Collaborator Author

@bwspenc @dschwen @lindsayad @tophmatthews I guess you will be interested in seeing this...

4 replies

lindsayad Dec 11, 2020
Maintainer

PJFNK or NEWTON? I’m happy to see that global sparse is faster than local dense in all your tests.

hugary1995 Dec 11, 2020
Collaborator Author

All tests are NEWTON with -pc_type lu.

tophmatthews Dec 11, 2020

5x...ouch. I have seen similar slowdowns. I have found that the ADRankFourTensor is pretty rough. You should push this up into the repo eventually?

hugary1995 Dec 11, 2020
Collaborator Author

Yeah, I think we all believe the tensor operations are slowing things down here. I plan to test the small strain test case without using RankFourTensors and see how it performs. If the result is promising, I will try to implement a specialized IsotropicRankFourTensor next week.

I am not sure where to put these tests if I were to push them to the repo. They don't provide additional code/capability coverage so it doesn't make a lot of sense to add them as regression tests. Also they are not "examples" in the sense that I am not modeling any real-world application here.

dschwen · 2020-12-11T16:45:59Z

dschwen
Dec 11, 2020
Collaborator

With AD we need to refocus our attention on the cost of full tensor operations. I'd like to see high symmetry tensor classes and either

1. the use of templating to support them in calculations, or
2. an enum in the tensor class to indicate symmetry and active entries

Gary, you mentioned that your group is already working on point 1. It would be great if we could convince them to share their approach. Otherwise we need to waste resources to do something similar.
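
As a rough illustration of the second option (a symmetry tag plus a list of active entries), here is a Python sketch. It is not MOOSE code: the `Symmetry` enum, the dict-based storage and all function names are hypothetical, and the isotropic form C_ijkl = lam d_ij d_kl + mu (d_ik d_jl + d_il d_jk) is the standard textbook expression.

```python
from enum import Enum
from itertools import product


class Symmetry(Enum):
    GENERAL = "general"      # all 81 entries may be nonzero
    ISOTROPIC = "isotropic"  # lam d_ij d_kl + mu (d_ik d_jl + d_il d_jk)


def isotropic_entries(lam, mu):
    """All 81 entries of an isotropic tensor, keyed by (i, j, k, l)."""
    d = lambda a, b: 1.0 if a == b else 0.0
    return {
        (i, j, k, l): lam * d(i, j) * d(k, l)
        + mu * (d(i, k) * d(j, l) + d(i, l) * d(j, k))
        for i, j, k, l in product(range(3), repeat=4)
    }


def active_entries(symmetry):
    """Index tuples that can be nonzero for the given symmetry tag."""
    all_idx = list(product(range(3), repeat=4))
    if symmetry is Symmetry.ISOTROPIC:
        return [
            (i, j, k, l)
            for i, j, k, l in all_idx
            if (i == j and k == l) or (i == k and j == l) or (i == l and j == k)
        ]
    return all_idx


def contract(C, symmetry, eps):
    """sigma_ij = C_ijkl eps_kl, visiting only the active entries."""
    sigma = [[0.0] * 3 for _ in range(3)]
    for i, j, k, l in active_entries(symmetry):
        sigma[i][j] += C[(i, j, k, l)] * eps[k][l]
    return sigma


C = isotropic_entries(lam=2.0, mu=3.0)
eps = [[0.1, 0.2, 0.0], [0.2, -0.3, 0.4], [0.0, 0.4, 0.5]]
print(len(active_entries(Symmetry.GENERAL)), len(active_entries(Symmetry.ISOTROPIC)))
print(contract(C, Symmetry.ISOTROPIC, eps))
```

In this sketch the isotropic tag cuts the contraction from 81 multiply-adds to 21. With AD number types each of those operations also carries derivative work, which is presumably why skipping known zeros is attractive.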

1 reply

hugary1995 Dec 11, 2020
Collaborator Author

There might be some misunderstanding. We tried to write a simple IsotropicRankFourTensor with some basic operators defined. That was just a couple lines of code. We haven't used any templating yet.
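
For readers who want a picture of what "a simple IsotropicRankFourTensor with some basic operators" could look like, here is a Python stand-in. It is only a sketch under assumptions: the class name, the choice of storing the two Lamé parameters, and the operators shown are hypothetical, not the code on the branch.

```python
class IsotropicRankFour:
    """Hypothetical stand-in: C_ijkl = lam d_ij d_kl + mu (d_ik d_jl + d_il d_jk)."""

    def __init__(self, lam, mu):
        self.lam = lam
        self.mu = mu

    def __add__(self, other):
        # The sum of two isotropic tensors is isotropic: add the parameters.
        return IsotropicRankFour(self.lam + other.lam, self.mu + other.mu)

    def __rmul__(self, scalar):
        # Scaling only touches the two stored parameters.
        return IsotropicRankFour(scalar * self.lam, scalar * self.mu)

    def contract(self, eps):
        """C : eps for a symmetric 3x3 eps, i.e. lam tr(eps) I + 2 mu eps."""
        tr = eps[0][0] + eps[1][1] + eps[2][2]
        return [
            [self.lam * tr * (1.0 if i == j else 0.0) + 2.0 * self.mu * eps[i][j]
             for j in range(3)]
            for i in range(3)
        ]

    def to_full(self):
        """Expand to a nested 3x3x3x3 list, for comparison only."""
        d = lambda a, b: 1.0 if a == b else 0.0
        return [[[[self.lam * d(i, j) * d(k, l)
                   + self.mu * (d(i, k) * d(j, l) + d(i, l) * d(j, k))
                   for l in range(3)] for k in range(3)]
                 for j in range(3)] for i in range(3)]


def contract_full(C, eps):
    """Reference contraction sigma_ij = C_ijkl eps_kl over all 81 entries."""
    return [
        [sum(C[i][j][k][l] * eps[k][l] for k in range(3) for l in range(3))
         for j in range(3)]
        for i in range(3)
    ]


C = IsotropicRankFour(lam=1.5, mu=0.7)
eps = [[0.01, 0.002, 0.0], [0.002, -0.004, 0.003], [0.0, 0.003, 0.005]]
print(C.contract(eps))
print(contract_full(C.to_full(), eps))
```

The two printed stresses agree for a symmetric strain. The specialized contraction needs one trace plus nine scaled entries instead of 81 multiply-adds.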

hugary1995 · 2020-12-11T22:13:48Z

hugary1995
Dec 11, 2020
Collaborator Author

I wrote a somewhat optimized ADComputeIsotropicLinearElasticStress stress calculator. But the performance improvement isn't great. Running the small strain elastic case with AD (global sparse) still takes 6.075 seconds.

3 replies

dschwen Dec 11, 2020
Collaborator

15% speedup is nothing to scoff at, but I had hoped for more.

dschwen Dec 11, 2020
Collaborator

Love the sarcastic rocket emoji @recuero

recuero Dec 11, 2020
Collaborator

No sarcasm: From 9s to 6s using global indexing plus streamlined elasticity tensor isn't bad to me at all. It'd be nice to see how that speedup translates to assessment-like TensorMechanics problems
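
The "15%" and "9s to 6s" figures use different baselines. A short sketch with the timings quoted in this thread makes that explicit; the variable and function names are only illustrative.

```python
# Wall times in seconds for the small strain elastic case, as quoted in this thread.
ad_local_dense = 8.985    # AD (local dense), original stress calculator
ad_global_sparse = 6.980  # AD (global sparse), original stress calculator
ad_optimized = 6.075      # AD (global sparse), optimized stress calculator


def speedup(before, after):
    """How many times faster the `after` run is."""
    return before / after


def time_saved_fraction(before, after):
    """Fraction of the original wall time that was removed."""
    return (before - after) / before


print(f"stress calculator only: {speedup(ad_global_sparse, ad_optimized):.2f}x, "
      f"{100 * time_saved_fraction(ad_global_sparse, ad_optimized):.0f}% less time")
print(f"indexing + calculator:  {speedup(ad_local_dense, ad_optimized):.2f}x, "
      f"{100 * time_saved_fraction(ad_local_dense, ad_optimized):.0f}% less time")
```

Measured against the global sparse run, the optimized stress calculator is about 1.15x faster. Measured against the local dense run, the combined change is about 1.48x faster.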

tophmatthews · 2020-12-11T22:38:32Z

tophmatthews
Dec 11, 2020

The other price is likely the inner newton loop. It feels like there can be a smarter way to avoid doing the inner newton loop twice, or maybe at all for the jacobian?
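
One generic way to avoid carrying derivatives through an inner Newton loop is the implicit function theorem: solve r(x; s) = 0 with plain numbers, then evaluate dx/ds = -(dr/ds) / (dr/dx) once at the converged x. The sketch below is not taken from MOOSE. The scalar power-law creep residual, the parameter values and all function names are assumptions chosen only to illustrate the idea, and a central finite difference is used as a check.

```python
# Illustrative parameters (assumed, not from the thread).
A, N, G, DT = 1.0e-14, 4, 8.0e4, 1.0e3


def residual(x, s_trial):
    """r(x; s_trial) = A * (s_trial - 3 G x)^N * DT - x for a scalar creep increment x."""
    sigma = s_trial - 3.0 * G * x
    return A * sigma**N * DT - x


def d_residual_dx(x, s_trial):
    sigma = s_trial - 3.0 * G * x
    return -3.0 * G * A * N * sigma ** (N - 1) * DT - 1.0


def d_residual_ds(x, s_trial):
    sigma = s_trial - 3.0 * G * x
    return A * N * sigma ** (N - 1) * DT


def solve_increment(s_trial, tol=1e-16, max_its=50):
    """Inner Newton loop on plain floats; no derivative information is propagated."""
    x = 0.0
    for _ in range(max_its):
        r = residual(x, s_trial)
        if abs(r) < tol:
            return x
        x -= r / d_residual_dx(x, s_trial)
    raise RuntimeError("inner Newton loop did not converge")


def dx_ds_implicit(s_trial):
    """dx/ds from the implicit function theorem, evaluated once at the converged x."""
    x = solve_increment(s_trial)
    return -d_residual_ds(x, s_trial) / d_residual_dx(x, s_trial)


def dx_ds_finite_difference(s_trial, h=1.0e-3):
    """Central difference through two full inner solves, used only as a check."""
    return (solve_increment(s_trial + h) - solve_increment(s_trial - h)) / (2.0 * h)


s = 200.0
print(dx_ds_implicit(s), dx_ds_finite_difference(s))
```

In this toy problem the derivative costs one extra evaluation of two partial derivatives after the solve, regardless of how many Newton iterations the solve needed. Whether the same trick applies cleanly to the actual stress update models would need to be checked against their residuals.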

1 reply

lynnmunday Dec 17, 2020

Based on what Gary did, I put in a nonAD version of the Laromance model and, for some simple creep load cases, found the nonAD Laromance model to be 9x faster. These simulate a 10x10x10 element block of material with Neumann and Dirichlet BCs. I start the simulation at a constant stress for 1e6 s and then ramp the stress up to twice its value over 35e3 s; the total time being simulated is 37 days. I will be running these for 20 years of simulation time. The loading is pretty simple, so both AD and nonAD using Newton take 84 linear iterations, 12 nonlinear iterations, and 28 simulation steps. The AD version takes 856 s of wall time and nonAD takes 95 s. The PR is here: #16521

lindsayad · 2020-12-17T02:43:42Z

lindsayad
Dec 17, 2020
Maintainer

You guys should do some profiling with gperftools or Instruments.


3 replies

lynnmunday Dec 17, 2020

I'll profile the ad laromance model.

lynnmunday Dec 18, 2020

I profiled the ADLaromance function and it has some big vector<vector<vector>> containers that it fills once per time step. I was able to make some of them into Reals, and all the tests still pass. This reduces the ADLaromance code runtime from 856 s to 180 s. It is now 2x slower than the nonAD version. I pushed this version to the same PR. I also attached the profile output. ADLAROMANCEStressUpdateBase::precomputeROM takes 40% of the run time, but it no longer has any AD data in it, so I think this function would take a similar amount of time in the nonAD version.
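
A quick check of the ratios behind the "9x" and "2x" statements, using only the wall times reported above. The variable names are illustrative, and the 40% share is applied to the 180 s run on the assumption that the profile was taken on that version.

```python
# Wall times in seconds, as reported in this thread.
ad_before = 856.0  # AD Laromance before the change
ad_after = 180.0   # AD Laromance after moving some containers to Reals
non_ad = 95.0      # nonAD Laromance

# Reported share of run time spent in precomputeROM (assumed to refer to the 180 s run).
precompute_rom_share = 0.40

ratio_before = ad_before / non_ad
ratio_after = ad_after / non_ad
ad_improvement = ad_before / ad_after
rom_seconds = precompute_rom_share * ad_after

print(f"AD vs nonAD before: {ratio_before:.2f}x")
print(f"AD vs nonAD after:  {ratio_after:.2f}x")
print(f"AD before vs after: {ad_improvement:.2f}x")
print(f"precomputeROM time: {rom_seconds:.0f} s")
```

So the change makes the AD model about 4.8x faster than it was, and leaves it a little under 2x slower than the nonAD version.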

lynnmunday Dec 18, 2020

junk5.pdf

hugary1995 · 2020-12-18T16:14:51Z

hugary1995
Dec 18, 2020
Collaborator Author

On that same branch, I added a SymmetricIsotropicRankFourTensor class, which, as its name suggests, is optimized for symmetric isotropic rank four tensors. For the incremental finite strain creep case, the speedup is roughly 40 seconds -> 35 seconds.

@dschwen I spent all of yesterday trying to compile gperftools on my Ubuntu machine, but I ran into some issues. Could you help me sort it out? Alternatively, you could just run some profiling on your machine.

The input files I am comparing here are AD_small_strain_power_law_creep.i and AD_small_strain_power_law_creep_optimized.i.

4 replies

lynnmunday Dec 18, 2020

Here are the commands I used to build gperftools on my mac

INSTALL GPERF
git clone from gperftools github page
>cd ~/code/gperftools/
>autoreconf -i
>./configure --prefix=/Users/mundlb/code/gperftools/installed
>mkdir /Users/mundlb/code/gperftools/installed
>make install

hugary1995 Dec 18, 2020
Collaborator Author

Thanks Lynn, I'll try it on a mac.

lynnmunday Dec 18, 2020

I can try running your code later

----------------------------------------------------
RUNNING GPERF PPROF

INSTALL GPERF
git clone from gperftools github page
>cd ~/code/gperftools/
>autoreconf -i
>./configure --prefix=/Users/mundlb/code/gperftools/installed
>mkdir /Users/mundlb/code/gperftools/installed
>make install

SETUP ENV FOR MOOSE BUILD
(should probably put exports in bashrc)
>export GPERF_DIR=~/code/gperftools/installed
>export PATH=$PATH:$GPERF_DIR/bin
>cd ~/projects/blackbear
>METHOD=oprof make -j 8

RUN CODE WITH GPERF
>MOOSE_PROFILE_BASE=run1_ mpiexec -n 1 ~/projects/blackbear_laromance/blackbear-oprof -i creepRamp_AD_P91.i Outputs/file_base=oprof
make a pdf bubble chart of runtime
>~/code/gperftools/installed/bin/pprof --pdf ~/projects/blackbear_laromance/blackbear-oprof run1_0.prof > junk.pdf
look at in command line
>~/code/gperftools/installed/bin/pprof ~/projects/blackbear_laromance/blackbear-oprof run1_0.prof

Daniel said to use the go version of pprof but I can't get it to work with my mpi
>brew install go
>go get -u github.com/google/pprof
then use this
>~/go/bin/pprof

lynnmunday Dec 18, 2020

I can't get it to stop crossing out lines.
I also had some trouble with pprof: I couldn't get the Go version of pprof to work with my MPI.

dschwen · 2020-12-18T16:24:28Z

dschwen
Dec 18, 2020
Collaborator

I had to do `conda deactivate; conda deactivate` before building the go pprof. It will run fine in the activated environment later


0 replies
