The code is faster now because the continuation is immediately made available for reuse (as opposed to being freed after dc_func returns). The better assembly is a side effect of this optimization. If you want to see the difference, just apply the following gratuitous "use after free" fix and compare the resulting assembly.
Index: src/queue.c
===================================================================
--- src/queue.c (revision 55311)
+++ src/queue.c (working copy)
@@ -358,15 +358,15 @@
 	// The ccache version is per-thread.
 	// Therefore, the object has not been reused yet.
 	// This generates better assembly.
-	if ((long)dou._do->do_vtable & DISPATCH_OBJ_ASYNC_BIT) {
-		_dispatch_continuation_free(dc);
-	}
 	if ((long)dou._do->do_vtable & DISPATCH_OBJ_GROUP_BIT) {
 		dg = dc->dc_group;
 	} else {
 		dg = NULL;
 	}
 	dc->dc_func(dc->dc_ctxt);
+	if ((long)dou._do->do_vtable & DISPATCH_OBJ_ASYNC_BIT) {
+		_dispatch_continuation_free(dc);
+	}
 	if (dg) {
 		dispatch_group_leave(dg);
 		_dispatch_release(dg);
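
To make the reuse argument concrete, here is a minimal standalone sketch of the two orderings. It is not libdispatch code: struct cont, cont_alloc, cont_free, callback and demo_reuse are hypothetical names invented for illustration, and the cache is assumed to be a simple per-thread free list (C11 _Thread_local). The point it tries to show is that when the object goes back to the per-thread cache before the callback runs, an allocation made inside the callback gets the same "hot" object, and that this is safe only because no other thread can take it from the cache in the meantime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a continuation; not the libdispatch type. */
struct cont {
	struct cont *next;
	void (*func)(void *);
	void *ctxt;
};

/* Assumed per-thread free list: only this thread can reuse a cached object. */
static _Thread_local struct cont *cache;

static struct cont *cont_alloc(void) {
	struct cont *c = cache;
	if (c) {
		cache = c->next;            /* reuse the most recently freed object */
	} else {
		c = malloc(sizeof(*c));
		assert(c != NULL);
	}
	return c;
}

static void cont_free(struct cont *c) {
	c->next = cache;                /* push onto the cache; memory stays valid */
	cache = c;
}

/* Callback that allocates a continuation and records which object it got. */
static void callback(void *ctxt) {
	struct cont *inner = cont_alloc();
	*(struct cont **)ctxt = inner;
	cont_free(inner);
}

/* Returns 1 if the callback's allocation reused the outer continuation. */
static int demo_reuse(int free_before_call) {
	struct cont *seen = NULL;
	struct cont *c = cont_alloc();
	c->func = callback;
	c->ctxt = &seen;
	if (free_before_call) {
		cont_free(c);               /* cached, but not yet reused by anyone */
		c->func(c->ctxt);           /* func and ctxt are read before the call */
	} else {
		c->func(c->ctxt);
		cont_free(c);               /* the callback had to allocate elsewhere */
	}
	return seen == c;
}
```

In this sketch demo_reuse(1) returns 1 (the callback got the outer object back) and demo_reuse(0) returns 0 (the callback had to obtain a different object). Note that the early-free ordering reads func and ctxt from an object that is already on the cache, which is why it looks like a use after free even though the memory is still owned by the calling thread.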