[libdispatch-dev] lib dispatch worker threads may loop depending on compiler optimization

Fri Sep 9 00:39:49 PDT 2011

On 09/09/2011 09:23 AM, Dave Zarzycki wrote:
>>> I doubt it, __sync_lock_test_and_set is a full barrier on x86.
>>> Compiler-wise it is always a full optimization barrier, the
>>> actual semantics depend on the processor.
>
> Strictly speaking, that isn't true. From the documentation:
>
> http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
>
> "This builtin is not a full barrier, but rather an acquire barrier.
> This means that references after the builtin cannot move to (or be
> speculated to) before the builtin, but previous memory stores may not
> be globally visible yet, and previous memory loads may not yet be
> satisfied."
>
> In practice, GCC and clang have historically treated
> __sync_lock_test_and_set() as a full barrier, and that is why GCD was
> able to get away with using it to get at the "xchg" instruction.

Yes, the documentation is conservative.  However, if you look at the 
code (and this hasn't changed in recent GCC):

* moving later references before the builtin is clearly prohibited, and
so is speculating them;

* depending on the target, previous memory stores may not be globally
visible yet, and previous memory loads may not yet be satisfied;

* however, the compiler will *never* sink earlier references below the
builtin, which is what I meant by "compiler-wise it is always a full
optimization barrier", just like asm("":::"memory").
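
To make the distinction concrete, here is a minimal sketch, assuming GCC
or clang.  The names compiler_barrier, exchange_flag, publish, flag and
payload are made up for illustration and are not taken from libdispatch.

```c
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
 * forbids the compiler from moving loads or stores across this point. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static int flag;     /* hypothetical lock word */
static int payload;  /* hypothetical data published before taking it */

/* Atomically store 1 into flag and return the previous value.
 * The documentation only promises acquire semantics for this builtin. */
static int exchange_flag(void)
{
    return __sync_lock_test_and_set(&flag, 1);
}

/* Store the payload, then exchange the flag.  The explicit barrier
 * spells out the ordering instead of relying on how the compiler
 * happens to treat the builtin. */
static int publish(int value)
{
    payload = value;
    compiler_barrier();
    return exchange_flag();
}
```

The barrier in publish() costs nothing at run time; it only documents
and enforces that the payload store must not sink below the exchange.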

So I find it extremely unlikely that this is the cause of the problem.

It is more likely that an optimization barrier like the no-op asm above
is missing in the source, and that clang merely happens to get away
without it.  Remember that while x86 does not need explicit read or
write barrier instructions in the assembly (only full barriers), you
still need to write the barriers in the code and expand them to no-op
asms.  Otherwise the compiler may move references across the point
where the barrier should have been.
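
As a rough sketch of what "write the barriers in the code and expand
them to no-op asms" can look like, assuming GCC or clang: the macro
names read_barrier and write_barrier, and the producer/try_consume
pair, are hypothetical and not the actual libdispatch source.

```c
#if defined(__i386__) || defined(__x86_64__)
/* x86 keeps loads ordered with loads and stores ordered with stores,
 * so no instruction is needed; the compiler still has to be stopped. */
#define read_barrier()  __asm__ __volatile__("" ::: "memory")
#define write_barrier() __asm__ __volatile__("" ::: "memory")
#else
/* Conservative fallback for other targets: a full hardware barrier. */
#define read_barrier()  __sync_synchronize()
#define write_barrier() __sync_synchronize()
#endif

static int data;   /* hypothetical payload */
static int ready;  /* hypothetical "payload is valid" flag */

static void producer(int value)
{
    data = value;
    write_barrier();  /* keep the data store ahead of the flag store */
    ready = 1;
}

/* Returns 1 and fills *out if the payload has been published. */
static int try_consume(int *out)
{
    int seen = ready;
    read_barrier();   /* forces a fresh load of ready on every call and
                         keeps the data load after the flag load */
    if (!seen)
        return 0;
    *out = data;
    return 1;
}
```

Without the barrier in try_consume(), a polling loop such as
"while (!try_consume(&v));" could have its load of ready hoisted out of
the loop once the function is inlined, which would leave a worker
thread spinning forever at higher optimization levels.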

(Former GCC developer here :)).

Paolo