Question on race condition when using dispatch_group_notify_f
To whom it may concern,

I have found a race condition using the port of libdispatch from http://mark.heily.com/sites/mark.heily.com/files/libdispatch-f16-SRPMS.tgz. The race condition manifests when doing the equivalent of

    void run(void* iContext) {
        dispatch_group_t g = reinterpret_cast<dispatch_group_t>(iContext);
        dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), g, do_work);
        dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, run);
    }

The idea is to have a new 'run' task start only after the previous 'do_work' task has finished. However, on a system under heavy load I see that 'run' can be called before its 'do_work' has finished. This happens because of a race condition in '_dispatch_group_wake':

    _dispatch_group_wake(dispatch_semaphore_t dsema)
    {
        struct dispatch_sema_notify_s *tmp;
        struct dispatch_sema_notify_s *head = (struct dispatch_sema_notify_s *)
            dispatch_atomic_xchg(&dsema->dsema_notify_head, NULL);
        ...
        while (head) {
            dispatch_async_f(head->dsn_queue, head->dsn_ctxt, head->dsn_func);
            _dispatch_release(head->dsn_queue);
            do {
                tmp = head->dsn_next;
            } while (!tmp && !dispatch_atomic_cmpxchg(&dsema->dsema_notify_tail, head, NULL));
            free(head);
            head = tmp;
        }
        ...
    }

Under heavy load, the thread running _dispatch_group_wake can be swapped out right after the first call to 'dispatch_async_f'. The dispatched notify task can then call 'run(void* iContext)' and make it all the way to the end of that function, thereby adding a new task to the tail of the dispatch group. Once the '_dispatch_group_wake' thread wakes up again, it finds a new entry in 'head->dsn_next' (since, with only one notify registered, the 'head' and 'tail' kept by dispatch_group_notify_f point at the same node, so the newly appended notify shows up as head->dsn_next). That new entry is processed immediately, which causes 'run' to fire before the associated 'do_work' has finished.

You can find a small test which can usually exhibit the error and cause an assert to fail at http://dl.dropbox.com/u/11356841/raceCondition.cpp. The test has reproduced the error for me on a 16-core machine (4 CPUs with 4 cores each) running Scientific Linux 6 (which is derived from Red Hat Enterprise Linux 6).

So my question is: is having dispatch_group_notify_f effectively re-register itself an unsupported use, or is this a bug that should be fixed?

Sincerely,
Chris

Dr Christopher Jones
Fermi National Accelerator Laboratory
cdj@fnal.gov
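P.S. In case the link above stops working, the shape of the test is roughly the following. This is a simplified sketch, not the exact raceCondition.cpp; the flag and counter names (gWorkDone, gIterations) are made up for illustration:

    // Sketch of a reproducer: do_work sets a flag that the notify (run) asserts on.
    // If the race fires, run() is called before its do_work() completed and the
    // assert trips. Ordering of the flag relies on the group/notify semantics
    // under test, so this is only meaningful as a stress test.
    #include <dispatch/dispatch.h>
    #include <cassert>
    #include <cstdlib>

    static bool gWorkDone = true;     // start true so the very first run() passes
    static unsigned gIterations = 0;

    static void do_work(void*) {
        gWorkDone = true;             // last thing the grouped task does
    }

    static void run(void* iContext) {
        dispatch_group_t g = reinterpret_cast<dispatch_group_t>(iContext);

        // If the race fires, this notify ran before its do_work finished.
        assert(gWorkDone && "run() called before do_work() finished");

        if (++gIterations == 1000000) {
            std::exit(0);             // enough successful rounds; stop trying
        }

        gWorkDone = false;
        dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), g, do_work);
        dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, run);
    }

    int main() {
        dispatch_group_t g = dispatch_group_create();
        run(g);                       // kick off the first do_work/notify pair
        dispatch_main();              // park the main thread; run() exits the process
    }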
Hi Christopher,

Please see the Lion branch on Mac OS Forge for the latest sources: http://svn.macosforge.org/repository/libdispatch/branches/Lion

In particular, semaphore.c in those sources has a change in this area: the list of notify waiters is snapshotted before any waiters are woken up, to support exactly this use case (a new group notify added from a notify callout).

Do note, however, that notify blocks/functions for a group are not themselves part of the group, so care needs to be taken when reusing groups like this (e.g. there is no guarantee that a group is empty by the time a group notify function is called if new work is being submitted to the group asynchronously to the notification).

Daniel
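P.S. A rough sketch of what I mean by that caveat (my own illustration with made-up names, not code from the libdispatch sources): reusing one group is well defined as long as the next round of work is submitted only from inside the notify callout; work added to the group concurrently from another thread is not necessarily finished by the time the notify function runs.

    // Illustration only; next_round/do_work are assumed names, not library API.
    #include <dispatch/dispatch.h>
    #include <cstdlib>

    static void do_work(void*) { /* some grouped work */ }

    static void next_round(void* ctx) {
        dispatch_group_t g = static_cast<dispatch_group_t>(ctx);
        static int rounds = 0;
        if (++rounds == 100) std::exit(0);

        // Fine: this callout only runs after the previous round's grouped work
        // has completed, so resubmitting from here keeps the ordering that the
        // notification is meant to provide.
        dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), NULL, do_work);
        dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, next_round);

        // Not fine: if another thread calls
        //   dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), NULL, do_work);
        // concurrently with this callout, nothing guarantees that work has
        // completed (or even started) by the time the next next_round() fires.
    }

    int main() {
        dispatch_group_t g = dispatch_group_create();
        dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), NULL, do_work);
        dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, next_round);
        dispatch_main();
    }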
participants (2)
- Christopher Jones
- Daniel A. Steffen