Rewait tasks can deadlock
Getting back to the topic of raciness (branch thread_safety, issue #58): we have a definite deadlock issue with the rewait tasks. Running:
swift_fixdt -t 1 -f sodShock.hdf5 -m 0.01 -w 5000 -c 0.01 -d 1e-7 -e 0.01
many, many times will eventually deadlock, with the following backtraces (first the main thread, then the runner thread):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x0000000000436b6b in scheduler_start (s=s@entry=0x7fffe4d2d080, mask=mask@entry=32768, submask=submask@entry=0) at scheduler.c:1016
#2 0x0000000000430832 in engine_launch (e=e@entry=0x7fffe4d2d060, nr_runners=nr_runners@entry=1, mask=mask@entry=32768, submask=submask@entry=0) at engine.c:1494
#3 0x0000000000405cf8 in space_parts_sort (s=s@entry=0x7fffe4d2ce60, ind=ind@entry=0x2e711f0, N=1024128, min=min@entry=0, max=14399, verbose=verbose@entry=0) at space.c:585
#4 0x0000000000406ba8 in space_rebuild (s=0x7fffe4d2ce60, cell_max=cell_max@entry=0, verbose=0) at space.c:403
#5 0x000000000042f86f in engine_rebuild (e=e@entry=0x7fffe4d2d060) at engine.c:1307
#6 0x000000000042fa58 in engine_prepare (e=e@entry=0x7fffe4d2d060) at engine.c:1354
#7 0x0000000000430935 in engine_init_particles (e=e@entry=0x7fffe4d2d060) at engine.c:1525
#8 0x00000000004034f8 in main (argc=<optimised out>, argv=<optimised out>) at main.c:508
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x0000000000436ffa in scheduler_gettask (s=s@entry=0x7fffe4d2d080, qid=0, prev=0x0) at scheduler.c:1298
#2 0x00000000004293aa in runner_main (data=0x16cd060) at runner.c:982
#3 0x00002b5dffcb9182 in start_thread (arg=0x2b5e14cfa700) at pthread_create.c:312
#4 0x00002b5e00b8047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
i.e. both threads are blocked in pthread_cond_wait on s->sleep_cond, at these lines of scheduler.c:
1014 pthread_mutex_lock(&s->sleep_mutex);
1015 while (s->waiting > waiting_old) {
1016 pthread_cond_wait(&s->sleep_cond, &s->sleep_mutex);
and
1297 pthread_mutex_lock(&s->sleep_mutex);
1298 if (s->waiting > 0) pthread_cond_wait(&s->sleep_cond, &s->sleep_mutex);
1299 pthread_mutex_unlock(&s->sleep_mutex);
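For reference, here is a minimal sketch of the usual condition-variable pattern, where the counter is only ever changed while holding the mutex and every waiter re-checks it in a loop. This is not a diagnosis of the deadlock above, nor a patch for scheduler.c: struct sketch_sched, finish_task, wait_all, worker and run_demo are hypothetical names made up for illustration, and only the field names waiting, sleep_mutex and sleep_cond are borrowed from the excerpts.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the scheduler fields seen in the excerpts. */
struct sketch_sched {
  pthread_mutex_t sleep_mutex;
  pthread_cond_t sleep_cond;
  int waiting; /* Number of tasks not yet finished. */
};

/* Mark one task as done. The counter is changed while holding the mutex,
   so a waiter cannot miss the update between its check and its wait. */
static void finish_task(struct sketch_sched *s) {
  pthread_mutex_lock(&s->sleep_mutex);
  assert(s->waiting > 0); /* Finishing more tasks than were queued is a bug. */
  s->waiting--;
  pthread_cond_broadcast(&s->sleep_cond);
  pthread_mutex_unlock(&s->sleep_mutex);
}

/* Block until no tasks are left. The loop re-checks the predicate after
   every wake-up, which also covers spurious wake-ups. */
static void wait_all(struct sketch_sched *s) {
  pthread_mutex_lock(&s->sleep_mutex);
  while (s->waiting > 0) pthread_cond_wait(&s->sleep_cond, &s->sleep_mutex);
  pthread_mutex_unlock(&s->sleep_mutex);
}

/* Arguments for the hypothetical worker thread. */
struct worker_arg {
  struct sketch_sched *s;
  int ntasks;
};

/* Worker thread: finishes ntasks tasks one after the other. */
static void *worker(void *data) {
  struct worker_arg *a = data;
  for (int k = 0; k < a->ntasks; k++) finish_task(a->s);
  return NULL;
}

/* Queue ntasks tasks, let one worker finish them and wait for completion.
   Returns the number of tasks still waiting, or -1 if the thread could
   not be created. */
int run_demo(int ntasks) {
  struct sketch_sched s;
  pthread_mutex_init(&s.sleep_mutex, NULL);
  pthread_cond_init(&s.sleep_cond, NULL);
  s.waiting = ntasks;

  struct worker_arg a = {&s, ntasks};
  pthread_t t;
  int left = -1;
  if (pthread_create(&t, NULL, worker, &a) == 0) {
    wait_all(&s);
    pthread_join(t, NULL);
    left = s.waiting;
  }

  pthread_cond_destroy(&s.sleep_cond);
  pthread_mutex_destroy(&s.sleep_mutex);
  return left;
}
```

Built with -pthread, a call such as run_demo(1000) should return 0 however the two threads interleave, because the waiter only sleeps after checking the counter under the same mutex that the worker holds when changing it.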
Tom sees this effect much more often with the gravity tests...