[libdispatch-dev] Question on race condition when using dispatch_group_notify_f

Mon Apr 2 14:28:28 PDT 2012

To whom it may concern,

I have found a race condition using the port of libdispatch from http://mark.heily.com/sites/mark.heily.com/files/libdispatch-f16-SRPMS.tgz. The race condition manifests when doing the equivalent of

void run(void* iContext) {
	dispatch_group_t g = reinterpret_cast<dispatch_group_t>(iContext);

	dispatch_group_async_f(g, dispatch_get_global_queue(0, 0), g, do_work);
	dispatch_group_notify_f(g, dispatch_get_global_queue(0, 0), g, run); 
}

The idea is to have a new 'run' task start up after the previous 'do_work' task has finished. However, with a system under heavy load I see that 'run' can be called before its instance of 'do_work' has finished. This happens because of a race condition in '_dispatch_group_wake' 

_dispatch_group_wake(dispatch_semaphore_t dsema)
{
        struct dispatch_sema_notify_s *tmp;
        struct dispatch_sema_notify_s *head = (struct dispatch_sema_notify_s *)dispatch_atomic_xchg(&dsema->dsema_notify_head, NULL);
	...
        while (head) {
                dispatch_async_f(head->dsn_queue, head->dsn_ctxt, head->dsn_func);
                _dispatch_release(head->dsn_queue);
                do {
                        tmp = head->dsn_next;
                } while (!tmp && !dispatch_atomic_cmpxchg(&dsema->dsema_notify_tail, head, NULL));
                free(head);
                head = tmp;
        }
	…
}

Under heavy load, the thread processing _dispatch_group_wake can be swapped out right after the first call to 'dispatch_async_f'. The task can then call 'run(void* iContext)' which can make it all the way to the end of that function thereby adding a new task to the tail of the dispatch_group. Once the '_dispatch_group_wake' thread reawakens, it now has a new entry in 'head->dsn_next' (since in dispatch_group_notify_f the 'head' and 'tail' refer to the same memory address for this problem). This new entry is processed which causes the 'run' to go off before the associated 'do_work' finishes.

You can find a small test which can, usually, exhibit the error and cause an assert to fail
	http://dl.dropbox.com/u/11356841/raceCondition.cpp

This test has succeeded in exhibiting the error for me when run on a 16 core (4CPUs with each CPU having 4 cores) machine under Scientific Linux 6 [which is derived from Red Hat Enterprise 6].

So my question is, is the idea of having dispatch_group_notify_f effectively call itself not a supported activity or is this a bug that should be fixed?

	Sincerely,
		Chris

Dr Christopher Jones
Fermi National Accelerator Laboratory
cdj at fnal.gov