[libdispatch-dev] Question on race condition when using dispatch_group_notify_f
Christopher Jones
cdj at fnal.gov
Mon Apr 2 14:28:28 PDT 2012
To whom it may concern,
I have found a race condition in the port of libdispatch from http://mark.heily.com/sites/mark.heily.com/files/libdispatch-f16-SRPMS.tgz. The race condition manifests when doing the equivalent of:
    void run(void* iContext) {
        dispatch_group_t g = reinterpret_cast<dispatch_group_t>(iContext);
        dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), g, do_work);
        dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, run);
    }
The idea is to start a new 'run' task after the previous 'do_work' task has finished. However, on a system under heavy load I see that 'run' can be called before its instance of 'do_work' has finished. This happens because of a race condition in '_dispatch_group_wake':
    _dispatch_group_wake(dispatch_semaphore_t dsema)
    {
        struct dispatch_sema_notify_s *tmp;
        struct dispatch_sema_notify_s *head = (struct dispatch_sema_notify_s *)dispatch_atomic_xchg(&dsema->dsema_notify_head, NULL);
        ...
        while (head) {
            dispatch_async_f(head->dsn_queue, head->dsn_ctxt, head->dsn_func);
            _dispatch_release(head->dsn_queue);
            do {
                tmp = head->dsn_next;
            } while (!tmp && !dispatch_atomic_cmpxchg(&dsema->dsema_notify_tail, head, NULL));
            free(head);
            head = tmp;
        }
        ...
    }
Under heavy load, the thread running '_dispatch_group_wake' can be swapped out right after the first call to 'dispatch_async_f'. The dispatched task then calls 'run(void* iContext)', which can make it all the way to the end of that function, thereby adding a new entry to the tail of the dispatch_group's notify list. Once the '_dispatch_group_wake' thread resumes, it finds a new entry in 'head->dsn_next' (since, for this problem, the 'head' and 'tail' seen by dispatch_group_notify_f refer to the same memory address). This new entry is then processed, which causes 'run' to go off before the associated 'do_work' finishes.
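To make the interleaving described above easier to follow, here is a minimal single-threaded model of it in plain C++. It is only a sketch and not libdispatch code: the names Notify, Group, group_notify, cas_tail, group_wake and simulate_premature_runs are hypothetical stand-ins, the atomic operations are replaced by ordinary assignments, and the "swapped out" thread is modelled by running the dispatched callback to completion in the middle of the wake loop. The work submitted by 'run' is deliberately left pending, as it would be on a loaded machine.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical, simplified stand-in for dispatch_sema_notify_s.
struct Notify {
    Notify* next = nullptr;
    std::function<void()> func;
};

// Hypothetical, simplified stand-in for the group and its notify list.
struct Group {
    Notify* head = nullptr;
    Notify* tail = nullptr;
    int pending_work = 0;    // do_work tasks submitted but not yet finished
    int premature_runs = 0;  // times run() fired while work was still pending
    bool rearmed = false;    // the model re-arms only once so that it terminates
};

// Models the append in dispatch_group_notify_f: link a node at the tail.
void group_notify(Group& g, std::function<void()> f) {
    Notify* n = new Notify;
    n->func = std::move(f);
    Notify* prev = g.tail;
    g.tail = n;
    if (prev) prev->next = n; else g.head = n;
}

// Models 'run': submit do_work (which stays pending), then re-arm the notify.
void run(Group& g) {
    if (g.pending_work > 0) ++g.premature_runs;  // fired before do_work finished
    if (g.rearmed) return;
    g.rearmed = true;
    ++g.pending_work;
    group_notify(g, [&g] { run(g); });
}

// Non-atomic model of the compare-and-swap on the notify tail.
bool cas_tail(Group& g, Notify* expected) {
    if (g.tail != expected) return false;
    g.tail = nullptr;
    return true;
}

// Models the loop in _dispatch_group_wake quoted above.
void group_wake(Group& g) {
    Notify* head = g.head;
    g.head = nullptr;  // models the exchange on the notify head
    while (head) {
        // Models dispatch_async_f followed by the task running to completion
        // while this thread is swapped out.
        head->func();
        Notify* tmp;
        do {
            tmp = head->next;  // sees the node appended by run()
        } while (!tmp && !cas_tail(g, head));
        delete head;
        head = tmp;
    }
    assert(g.tail == nullptr);
}

int simulate_premature_runs() {
    Group g;
    g.pending_work = 1;                 // first do_work is in flight
    group_notify(g, [&g] { run(g); });  // first notify is armed
    g.pending_work = 0;                 // first do_work finishes ...
    group_wake(g);                      // ... which wakes the group
    return g.premature_runs;
}
```

In this model simulate_premature_runs() returns 1: the second 'run' is picked up by the same pass of the wake loop while the work submitted by the first 'run' is still pending.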
You can find a small test that usually exhibits the error and causes an assert to fail at:
http://dl.dropbox.com/u/11356841/raceCondition.cpp
This test has exhibited the error for me when run on a 16-core machine (4 CPUs, each with 4 cores) under Scientific Linux 6 [which is derived from Red Hat Enterprise Linux 6].
So my question is: is having dispatch_group_notify_f effectively call itself an unsupported use, or is this a bug that should be fixed?
Sincerely,
Chris
Dr Christopher Jones
Fermi National Accelerator Laboratory
cdj at fnal.gov