[libdispatch-dev] lib dispatch worker threads may loop depending on compiler optimization

Dave Zarzycki zarzycki at apple.com
Fri Sep 9 00:23:12 PDT 2011

On Sep 8, 2011, at 10:09 PM, Paolo Bonzini wrote:

> On Thu, Sep 8, 2011 at 18:16, Daniel A. Steffen <dsteffen at apple.com> wrote:
>> The Lion libdispatch source already uses clang's __sync_swap() for dispatch_atomic_xchg() when available, cf. src/shims/atomic.h
>> #if __has_builtin(__sync_swap)
>> #define dispatch_atomic_xchg(p, n) \
>>                ((typeof(*(p)))__sync_swap((p), (n)))
>> #else
>> #define dispatch_atomic_xchg(p, n) \
>>                ((typeof(*(p)))__sync_lock_test_and_set((p), (n)))
>> #endif
>> unless GCC 4.5.1 has something similar, switching dispatch_atomic_xchg() to an
>> inline asm volatile("xchg") on intel when building with that compiler is the cleanest
>> workaround IMO
> I doubt it, __sync_lock_test_and_set is a full barrier on x86.
> Compiler-wise it is always a full optimization barrier, the actual
> semantics depend on the processor.

Strictly speaking, that isn't true. From the documentation:

"This builtin is not a full barrier, but rather an acquire barrier. This means that references after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied."

In practice, GCC and clang have historically treated __sync_lock_test_and_set() as a full barrier, and that is why GCD was able to get away with using it to get at the "xchg" instruction.

The behavior of __sync_lock_test_and_set() may have changed in recent GCC releases, which would explain the bug observed in this thread, where a store instruction was statically moved to after the "xchg" instruction. That is why the recently introduced __sync_swap() intrinsic is preferable when available.
