[libdispatch-dev] Question on race condition when using dispatch_group_notify_f

Mon Apr 2 14:49:07 PDT 2012

Hi Christoper,

please see the Lion branch on macosforge for the latest sources:
	http://svn.macosforge.org/repository/libdispatch/branches/Lion

in particular, semaphore.c in those sources has a change in this area where the list of notify waiters is snapshot before any waiters are woken up, to support exactly this use case (new group notify added from a notify callout)

Do note however that notify blocks/functions for a group are not themselves part of the group, so care needs to taken when reusing groups like this (e.g. there is no guarantee that a group is empty by the time a group notify function is called if new work is being submitted asynchronously to the notification).

Daniel

On Apr 2, 2012, at 14:28, Christopher Jones <cdj at fnal.gov> wrote:

> To whom it may concern,
> 
> I have found a race condition using the port of libdispatch from http://mark.heily.com/sites/mark.heily.com/files/libdispatch-f16-SRPMS.tgz. The race condition manifests when doing the equivalent of
> 
> void run(void* iContext) {
> 	dispatch_group_t g = reinterpret_cast<dispatch_group_t>(iContext);
> 	
> 	dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), g, do_work);
> 	dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, run); 
> }
> 
> The idea is to have a new 'run' task start up after the previous 'do_work' task has finished. However, with a system under heavy load I see that 'run' can be called before its instance of 'do_work' has finished. This happens because of a race condition in '_dispatch_group_wake' 
> 
> _dispatch_group_wake(dispatch_semaphore_t dsema)
> {
>        struct dispatch_sema_notify_s *tmp;
>        struct dispatch_sema_notify_s *head = (struct dispatch_sema_notify_s *)dispatch_atomic_xchg(&dsema->dsema_notify_head, NULL);
> 	...
>        while (head) {
>                dispatch_async_f(head->dsn_queue, head->dsn_ctxt, head->dsn_func);
>                _dispatch_release(head->dsn_queue);
>                do {
>                        tmp = head->dsn_next;
>                } while (!tmp && !dispatch_atomic_cmpxchg(&dsema->dsema_notify_tail, head, NULL));
>                free(head);
>                head = tmp;
>        }
> 	…
> }
> 
> Under heavy load, the thread processing _dispatch_group_wake can be swapped out right after the first call to 'dispatch_async_f'. The task can then call 'run(void* iContext)' which can make it all the way to the end of that function thereby adding a new task to the tail of the dispatch_group. Once the '_dispatch_group_wake' thread reawakens, it now has a new entry in 'head->dsn_next' (since in dispatch_group_notify_f the 'head' and 'tail' refer to the same memory address for this problem). This new entry is processed which causes the 'run' to go off before the associated 'do_work' finishes.
> 
> You can find a small test which can, usually, exhibit the error and cause an assert to fail
> 	http://dl.dropbox.com/u/11356841/raceCondition.cpp
> 
> This test has succeeded in exhibiting the error for me when run on a 16 core (4CPUs with each CPU having 4 cores) machine under Scientific Linux 6 [which is derived from Red Hat Enterprise 6].
> 
> So my question is, is the idea of having dispatch_group_notify_f effectively call itself not a supported activity or is this a bug that should be fixed?
> 
> 	Sincerely,
> 		Chris
> 
> Dr Christopher Jones
> Fermi National Accelerator Laboratory
> cdj at fnal.gov
> 
> _______________________________________________
> libdispatch-dev mailing list
> libdispatch-dev at lists.macosforge.org
> http://lists.macosforge.org/mailman/listinfo.cgi/libdispatch-dev