Hi Marius,
It's great that you have enough interest in the libdispatch project to work on a port to Linux and Win32. If you weren't aware, we already have Linux and partial Win32 support in the primary libdispatch source repository. I would encourage you to contribute modifications to the primary project rather than attempt a complete rewrite.
I reviewed some of the sources of "libxdispatch" and I can see that it approximates the libdispatch API at a high level, but it's important to keep in mind some of the aspects of the libdispatch implementation that set it apart — specifically thread safety and wait-free algorithms — which allow it to scale well with large numbers of cores and heavy load. I noted several places in the libxdispatch/core/queue.c implementation where thread safety was questionable, and other cases where simple spin locks (waiting) were in use.
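For illustration, this is the general shape of the "simple spin lock (waiting)" pattern I mean (a generic C11 sketch, not the actual libxdispatch code): a thread that loses the race just burns CPU until the holder releases the lock, which is exactly the kind of waiting that hurts scalability under contention.

    #include <stdatomic.h>

    /* Generic spin lock sketch -- illustrative only, not libxdispatch code. */
    typedef struct {
        atomic_flag locked;   /* initialize with ATOMIC_FLAG_INIT */
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* Busy-wait until the flag is acquired; under contention this
         * loop can spin for an unbounded number of iterations. */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ; /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }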
I'd be happy to answer any specific questions you may have about the libdispatch implementation or scalability goals.
There's one part of this message that caught my eye, and that I'm particularly curious about:
but it's important to keep in mind some of the aspects of the libdispatch implementation that set it apart — specifically thread safety and wait-free algorithms — which allow it to scale well with large numbers of cores and heavy load.
Specifically, the "wait-free algorithms" part. To be clear, "wait-free" may mean different things to different people, but in high concurrency circles, "wait-free" tends to have a very specific meaning. See the wikipedia article
http://en.wikipedia.org/wiki/Non-blocking_algorithm for the definition(s) I'm using.
Is libdispatch truly "wait-free"? I looked at the code when it first came out, and nothing in particular struck me as rising to the level of "wait-free". Obstruction-free, sure. "Practical" lock-free, most likely. Wait-free? I don't see how that's possible. From the Wikipedia article, with respect to wait-free:
An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
This basically eliminates anything that uses a CAS or LL/SC primitive with "loop and retry on failure" logic: there is no guarantee that the operation will ever complete. If I recall correctly, libdispatch uses this technique to implement its lockless data structures.

Even ensuring (for some very strong definition of ensuring, like provably) "lock-free" behavior with this kind of logic is extremely difficult, hence the informal term "practical" lock-free. In practice, proving truly lock-free or wait-free behavior almost requires access to a Verilog description (or equivalent) of all the components and doing very low-level simulation to ensure all possible permutations are shown to "work". As you can imagine, this just isn't practical for a modern, out-of-order, superscalar CPU, let alone everything it connects to.

To the best of my knowledge, the x86 architecture is one of the few out there that has a CAS operation that ensures "at least one succeeds" semantics (which I believe is largely due to the requirement to be "mostly" backwards compatible all the way back to the 8086, particularly on bus transactions). Most other architectures do not. In fact, many implementations of LL/SC can result in a live-lock condition in extremely rare circumstances (it's happened on IBM big iron, for example).
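To make the distinction concrete, here is the classic loop-and-retry pattern in a minimal C11 sketch (the names are illustrative, not libdispatch's actual code): the structure as a whole is lock-free because some thread's CAS always succeeds, but any individual thread can lose the race indefinitely, so there is no bound on its steps and it is not wait-free.

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        struct node *next;
        void *item;
    };

    /* Lock-free stack push via CAS loop-and-retry.
     * Illustrative sketch only; not libdispatch's implementation. */
    static void stack_push(_Atomic(struct node *) *head, struct node *n)
    {
        struct node *old = atomic_load_explicit(head, memory_order_relaxed);
        do {
            n->next = old;
            /* If another thread changed *head, the CAS fails, 'old' is
             * reloaded with the current head, and we retry. Some thread
             * always wins (lock-free), but THIS thread may lose an
             * unbounded number of times (hence not wait-free). */
        } while (!atomic_compare_exchange_weak_explicit(
                     head, &old, n,
                     memory_order_release, memory_order_relaxed));
    }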
Is this just a case where a different choice of words would have been more appropriate? Because a truly wait-free implementation would be quite a coup, and I'd appreciate any pointers to the details of how you did this.