Periodic gravity calculation

For the record, here is the accuracy histogram for the particles in the EAGLE-25 case with periodic BCs switched on:


added 1 commit

fcb8e0e0 - Make the long-range gravity task lock/unlock the multipoles and not the gparts.


added 1 commit

e987e310 - Move the addition to the interaction counter before the MAC check in the long-range task.


Tried running EAGLE_50 on COSMA7:

  ../swift -a -t 28 -s -G -S eagle_50.yml

with the undefined sanitizer on and we get a couple of signed int overflows:

engine.c:3116:35: runtime error: signed integer overflow: 69457614 * 54 cannot be represented in type 'int'
engine.c:3120:35: runtime error: signed integer overflow: 69457614 * 125 cannot be represented in type 'int'

So s->size_links is going to be wrong. I'm sure I've run this on one node before, so either the total number of cells has grown a lot, or it wasn't fatal for some reason. Will try this out on master.

The explanation could be simpler. I see the cell_split_size value has dropped to 40 from 400 in eagle_50.yml. Is this necessary? Suggest we should promote s->size_links to size_t anyway...
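To make the arithmetic in the sanitizer report concrete, here is a minimal sketch of the overflow and of the size_t variant. The function names are hypothetical and are not taken from engine.c; only the numbers come from the log above, and a 64-bit size_t is assumed.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if count * per_cell does not fit in a
 * signed int. The check divides instead of multiplying, so it never
 * triggers the overflow it is testing for. */
int int_product_overflows(int count, int per_cell) {
  if (count <= 0 || per_cell <= 0) return 0;
  return count > INT_MAX / per_cell;
}

/* Hypothetical helper: the same product carried out in size_t.
 * Assumes size_t is 64 bits wide, as on the x86-64 nodes used here. */
size_t links_needed(size_t count, size_t per_cell) {
  return count * per_cell;
}
```

With the values from the report, 69457614 * 54 is 3750711156 and 69457614 * 125 is 8682201750. Both exceed INT_MAX (2147483647 for a 32-bit int), while both fit comfortably in a 64-bit size_t.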

cell_split_size should probably be dropped to a lower value, yes. Although we will have to do some performance tests to identify what a good value is.

Agreed, it should be a size_t, although having that many links sounds a bit scary.

Mmmh... There is some atomic work done on the links table... Does not look like fun is lying ahead.

The docs suggest integer types up to unsigned long long are OK, so size_t should be OK. If that worries you, use long long.

Ah, yes, that's right. Any integer type is accepted. My recollection was that only 4-byte integers were supported. Good, that simplifies things.
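As an illustration of the point about atomics on wider integers, here is a small sketch. It assumes the atomics in question are the GCC/Clang builtins; the helper name is hypothetical and this is not the code that operates on the links table.

```c
#include <stddef.h>

/* Hypothetical helper: atomically reserve n consecutive slots in a table
 * and return the index of the first one. Uses the GCC/Clang builtin
 * __atomic_fetch_add, which accepts a size_t operand. */
size_t reserve_links(size_t *next_free, size_t n) {
  return __atomic_fetch_add(next_free, n, __ATOMIC_SEQ_CST);
}
```

The call returns the counter value before the addition, so each caller gets a distinct starting index even when the counter is already beyond INT_MAX.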

Since it is easier for me to test, should I go ahead and make the changes?

I will not have time before tomorrow, so yes, please.

added 1 commit

0024fd69 - Use size_t for number of task links


OK, done. That job is now running steps.

The next single-node job on COSMA7 ran until step 2, at which point it exceeded the memory limit and was killed (by SLURM). I'll try raising cell_split_size a bit...

Got a different result this time, still stopped during step 2 with:

[06370.6] mesh_gravity.c:mesh_to_gparts_CIC():202: Invalid gpart position in x

This is with debugging checks enabled.

Repeating this with EAGLE_25 sees the same check failing during step 4.

... and with EAGLE_12 during step 2.

Here are the details:

gnu_comp/7.3.0 gsl/2.4 parallel_hdf5/1.8.20 fftw/3.3.7

./configure --enable-debugging-checks --enable-sanitizer --enable-undefined-sanitizer

  ../swift -a -t 14 -s -G -S eagle_12.yml

Running with -c seems to work for EAGLE_12, I'll try that on the larger volumes, but this is one for you.

Thanks for reporting this. I'll investigate what happens here.

added 1 commit

4032bd5a - Allow the code to compile with lower-order multipoles.


Ok. I have tracked it down to NaNs arising because of too large numbers in the multipole-multipole calculation.

If I reduce the order of the calculation with --with-multipole-order=3 then everything runs fine. I will re-arrange the order of operations to prevent that FPE from happening.
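For readers wondering how the order of operations can matter, here is a generic single-precision sketch. It is not the SWIFT multipole kernel, and the function names and values are purely illustrative. It only shows how building a large power first can overflow to infinity (and give NaN once multiplied by zero), while folding the small coefficient in first keeps every intermediate finite.

```c
#include <math.h>

/* Naive ordering: build r_inv^n first, then scale by the coefficient.
 * The power can overflow to +inf before the coefficient is applied.
 * volatile forces each intermediate to be rounded to float. */
float term_naive(float coeff, float r_inv, int n) {
  volatile float power = 1.0f;
  for (int i = 0; i < n; i++) power = power * r_inv;
  return coeff * power;
}

/* Re-arranged ordering: start from the coefficient and multiply by r_inv
 * one factor at a time, so a small coefficient offsets the large powers. */
float term_reordered(float coeff, float r_inv, int n) {
  volatile float acc = coeff;
  for (int i = 0; i < n; i++) acc = acc * r_inv;
  return acc;
}

/* Returns 1 if x is NaN or infinite. */
int is_bad(float x) { return isnan(x) || isinf(x); }
```

With r_inv = 1e5f and n = 9, the naive power exceeds the float range, so a zero coefficient yields NaN (0 * inf) and a coefficient of 1e-30f yields inf. The re-arranged version returns 0 and roughly 1e15 respectively.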

added 1 commit

268196ea - Make scheduler_reweight() robust against integer over-flow


Thanks, it looks like the EAGLE_50 volume is running with -c and an increase of cell_split_size to 128.

This job ran for 671 steps, then failed with:

[27469.8] runner.c:runner_do_ghost():918: Smoothing length failed to converge on 47 particles.

I guess I'll wait for this update, in case that is related.

added 1 commit

55f9b74f - Do not compress the EAGLE_50 snapshots.


added 1 commit

009a24cf - Code formatting


added 1 commit

c4f17f91 - Create the infrastructure to stop and restart with the pm_mesh switched on.


It seems I can't reproduce this. Although I should say I ran with the intel compiler.

I'll look more into it.

I'll update with these changes and see if the problem repeats.

So these changes should be unrelated actually. They correct other things. I'll restart with GCC but the code is about 8x slower than with ICC... I reach step 671 in just over 1 hour...

I got to step 671 in 7 hours 40 minutes, so that is about right, but of course I have the sanitizers switched on, and Intel doesn't do those.

Do you think the sanitizer slows you down? I think I struggle to get GCC to auto-vectorize the gravity whilst ICC does it out of the box.

Only one response to that... I'll try it without them.

Actually the debugging checks will likely be the culprit here. The code can't be optimized fully with them on.

Sadly that only got to step 2 and aborted with:

[02314.4] runner.c:runner_do_ghost():922: Smoothing length failed to converge on 1 particles.

BTW, that check seems to be just a repeat of the code in the debugging section just before it (I also disabled debugging checks this time). I'm sure we could lose that...

At least that happens more quickly. What modules and configuration options did you use here?

Modules:

gnu_comp/7.3.0 openmpi/3.0.1 metis/5.1.0 parallel_hdf5/1.8.20 gsl/2.4 fftw/3.3.7

Configure:

./configure

Command:

../swift -a -t 28 -c -s -G -S eagle_50.yml

Thanks. I also stumbled upon a segfault on the EAGLE_25 case so there is more work for me here.

For reference, this test actually broke in the same place on a second run.

added 1 commit

37b66011 - Documentation fixes.


added 1 commit

5106ef95 - Do not compute the high-order softened derivatives as the resulting term is 0 anyway.


added 1 commit

a5d29e70 - Default to order 4 multipoles by default.


This last batch of fixes seems to do the trick.

added 4 commits

a7bfbf6f - Add an option to cancel the sqrt(a) factors in the velocities of Gadget ICs.
0609ef8d - More uniform prop --> props in the single_io code.
eab55966 - Make doxygen ignore the Markdown files in the repository.
4b01a513 - Documentation fixes and code formatting
