libdispatch object caches
Hi,

I have a question about a possible optimization and would appreciate feedback on whether it would be considered a) worthwhile to investigate and b) a candidate for integration if it gives good results. I'm asking because I've seen significantly better performance (around 2x) on Solaris for e.g. 'dispatch starfish' lap times, which appear to be malloc-bound (as is also noted in the starfish implementation).

libdispatch today uses TSD-based dispatch continuation object caches, with a heavily optimized implementation for Darwin. This cache appears to be unbounded during the lifetime of a given drain of the queue, after which a forced cleanup is performed (by calling _dispatch_force_cache_cleanup()).

The proposed change would be to let _dispatch_continuation_alloc_from_heap() allocate its objects from a magazine-based object cache, and to let _dispatch_cache_cleanup2() return objects to that cache. This is the approach taken in libumem (see http://www.usenix.org/event/usenix01/full_papers/bonwick/bonwick.pdf for background). In fact, the easiest approach might even be to depend on libumem being available (https://labs.omniti.com/trac/portableumem exists for non-Solaris platforms). This could of course be compile-time optional, depending on the availability of libumem, but it would implicitly replace the default memory allocator unless we 'cherry-pick' out e.g. the magazine layer. It seems this would let us reuse dispatch continuation objects across invocations/threads in an efficient way.

So there are a few questions:

- Would a magazine-based object cache be interesting for libdispatch?
- Has any analysis of typical TSD continuation cache sizes been done already? (Has anyone seen issues here in GCD-heavy applications?)
- What would be a reasonable test for performance validation?
- Would relying on libumem be acceptable to people in general?

One other possibility would be to keep only a single item in the TSD and fall back to such an object cache when that single object is not available (that is, releasing it back to the object pool in _dispatch_continuation_free(); the 'hot continuation' trick from _dispatch_continuation_pop() would thus still work). But this would require more serious investigation; the magazine approach above would be a 'safer' change.

Thoughts?

Joakim

PS: Apple folks also have rdar://4944235 for libumem/OS X, as it would be a nice analogue to GCD's system-level thread management, but for the memory allocation / object caching subsystems instead.
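To make the proposal concrete, here is a minimal sketch (not libdispatch source) of the kind of per-thread magazine backed by a shared depot that the email describes, in the spirit of the Bonwick/umem magazine layer. All names here (dc_alloc, dc_free, MAGAZINE_SIZE, CONTINUATION_SIZE) are illustrative only, and the real implementation would of course hook into the existing TSD continuation cache rather than plain malloc().

```c
#include <pthread.h>
#include <stdlib.h>

#define CONTINUATION_SIZE 64   /* stand-in for the real continuation size */
#define MAGAZINE_SIZE     16   /* objects cached per thread before spilling */

/* A freed object is reused as a freelist node. */
typedef struct dc_node { struct dc_node *next; } dc_node_t;

/* Global depot: overflow from per-thread magazines lands here. */
static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;
static dc_node_t *depot_head;

/* Per-thread magazine: a small stack of cached objects. */
static __thread void *magazine[MAGAZINE_SIZE];
static __thread unsigned magazine_count;

void *
dc_alloc(void)
{
	/* Fast path: reuse an object cached by this thread. */
	if (magazine_count > 0) {
		return magazine[--magazine_count];
	}
	/* Medium path: pull one object from the shared depot. */
	pthread_mutex_lock(&depot_lock);
	dc_node_t *n = depot_head;
	if (n) depot_head = n->next;
	pthread_mutex_unlock(&depot_lock);
	if (n) return n;
	/* Slow path: fall back to the system allocator. */
	return malloc(CONTINUATION_SIZE);
}

void
dc_free(void *obj)
{
	/* Fast path: cache the object in the per-thread magazine. */
	if (magazine_count < MAGAZINE_SIZE) {
		magazine[magazine_count++] = obj;
		return;
	}
	/* Overflow: return the object to the shared depot so other threads
	 * (or a later drain on this thread) can reuse it. */
	dc_node_t *n = obj;
	pthread_mutex_lock(&depot_lock);
	n->next = depot_head;
	depot_head = n;
	pthread_mutex_unlock(&depot_lock);
}
```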
On May 20, 2011, at 4:26 AM, Joakim Johansson wrote:
- Would a magazine-based object cache be interesting for libdispatch?
malloc(3) on Mac OS X is already magazine based.
- Has any analysis of typical TSD continuation cache sizes been done already? (Has anyone seen issues here in GCD-heavy applications?)
One has to define the applications that will be evaluated first. In any case, most programs are extremely bursty in nature, so the caches are cleaned regularly in practice.
- What would be a reasonable test for performance validation?
The "starfish test" was setup to, as much as possible, take advantage of thread and continuation cache recycling. It isn't a great test, and depending on tuning knobs it can spillover into swap files, but at least it is something.
- Would relying on libumem be acceptable to people in general?
Hard to say.

davez
On 05/24/2011 12:49 PM, Dave Zarzycki wrote:
- Would relying on libumem be acceptable to people in general?
Hard to say.
libumem should probably be used by the Solaris port of libdispatch, since the default malloc() on Solaris is not optimized for multithreaded applications [1]. On Linux, I would want to see concrete performance benefits for real-world applications before using libumem. It appears that malloc() on Linux is already optimized for concurrency [2].

Regards,

- Mark

[1] http://developers.sun.com/solaris/articles/multiproc/multiproc.html
[2] http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives
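For reference, a compile-time-optional libumem backing for the continuation allocator, as suggested above, could look roughly like the sketch below. The HAVE_LIBUMEM guard, the cache name, and the continuation_* wrappers are hypothetical; umem_cache_create(), umem_cache_alloc() and umem_cache_free() are the actual libumem object-cache API, and umem itself supplies the per-CPU magazine layer on top of the cache.

```c
#include <stdlib.h>

#if HAVE_LIBUMEM
#include <umem.h>          /* link with -lumem */

static umem_cache_t *continuation_cache;

static void
continuation_cache_init(size_t continuation_size)
{
	/* One object cache sized for dispatch continuations. */
	continuation_cache = umem_cache_create("dispatch_continuation",
	    continuation_size, 0 /* natural alignment */,
	    NULL, NULL, NULL,  /* no constructor/destructor/reclaim */
	    NULL, NULL, 0);
}

static void *
continuation_alloc(void)
{
	return umem_cache_alloc(continuation_cache, UMEM_DEFAULT);
}

static void
continuation_free(void *dc)
{
	umem_cache_free(continuation_cache, dc);
}

#else  /* !HAVE_LIBUMEM: fall back to the system allocator */

static size_t continuation_size_g;

static void continuation_cache_init(size_t sz) { continuation_size_g = sz; }
static void *continuation_alloc(void) { return malloc(continuation_size_g); }
static void continuation_free(void *dc) { free(dc); }

#endif /* HAVE_LIBUMEM */
```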
participants (3)

- Dave Zarzycki
- Joakim Johansson
- Mark Heily