Joakim Johansson wrote:
On Sat, 16 Apr 2011 22:34:21 +0700 "C. Bergström" <cbergstrom@pathscale.com> wrote:
Does anyone have ideas on Solaris or Linux non-portable tsd or semaphore optimizations? Thanks! ./C
A few small comments regarding TSD optimization for Linux/Solaris (I did have a brief look at this for Solaris):
I don't know how things look on Linux, but Solaris already has a fairly well-optimized pthread_getspecific()/pthread_setspecific() for the first 8 TSD keys allocated [1]. For most "normal" applications, the libdispatch keys would be allocated in that range during library initialization at startup. See the links below for details (it really isn't much code to read...).
Of course, the Apple implementation is one notch further 'down to the metal' [2]. For Solaris I could see the following possible optimizations:
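To illustrate the point about keys being allocated early, here is a minimal sketch of creating TSD keys from a library constructor. The key and function names (demo_queue_key, demo_set_queue, ...) are made up for illustration and are not libdispatch's real ones, and __attribute__((constructor)) is a compiler extension (GCC/Clang), not part of POSIX. Whether the keys actually land among the first 8 depends on what else has created keys before this runs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical key names, for illustration only. */
static pthread_key_t demo_queue_key;
static pthread_key_t demo_cache_key;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static int demo_keys_ready;

static void demo_create_keys(void)
{
    /* Mark the keys usable only if both creations succeed. */
    if (pthread_key_create(&demo_queue_key, NULL) == 0 &&
        pthread_key_create(&demo_cache_key, NULL) == 0)
        demo_keys_ready = 1;
}

/* Runs at library load time, so the keys are requested before most
 * application code has had a chance to call pthread_key_create(). */
__attribute__((constructor))
static void demo_tsd_init(void)
{
    pthread_once(&demo_once, demo_create_keys);
}

static int demo_set_queue(void *value)
{
    pthread_once(&demo_once, demo_create_keys); /* safe if the constructor did not run */
    assert(demo_keys_ready);
    return pthread_setspecific(demo_queue_key, value);
}

static void *demo_get_queue(void)
{
    pthread_once(&demo_once, demo_create_keys);
    assert(demo_keys_ready);
    return pthread_getspecific(demo_queue_key);
}
```

The pthread_once() guard is only there so the accessors stay correct if the constructor is not supported or has not run yet.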
1. Use the non-portable thr_getspecific()/thr_setspecific() instead of the pthread interface. It saves one function call when setting TSD, but costs one extra load/store in the get interface.
2. Inline the appropriate logic, similar to the Apple TSD optimization, using knowledge of the libc implementation from opensolaris.org.
"1" seems easy and "safe", but it trades slightly worse TSD lookup performance (see the comments in the link below) for avoiding an extra function call when setting a TSD value. It depends on the usage pattern of libdispatch; I think this has a decent chance of being a net loss in reality…
"2" would be difficult in practice in terms of robustness (Oracle might change the implementation…), but could fairly easily be done as a proof of concept, just to test the performance difference, by peeking at the implementation referenced below. Getting the required interfaces supported directly by Oracle might then be a small challenge as the next step :-)
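For reference, a rough sketch of what option "1" could look like behind a small wrapper. The demo_* names are hypothetical. On Solaris it uses thr_keycreate()/thr_setspecific()/thr_getspecific() from <thread.h>; elsewhere it falls back to the pthread calls. Note how thr_getspecific() hands the value back through a pointer, which is where the extra store/load on the get path comes from.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__sun)
#include <thread.h>

typedef thread_key_t demo_key_t;

static int demo_key_create(demo_key_t *key)
{
    return thr_keycreate(key, NULL);
}

static int demo_set(demo_key_t key, void *value)
{
    return thr_setspecific(key, value);
}

static void *demo_get(demo_key_t key)
{
    void *value = NULL;
    /* The value is written to memory and read back by the caller. */
    if (thr_getspecific(key, &value) != 0)
        return NULL;
    return value;
}

#else
#include <pthread.h>

typedef pthread_key_t demo_key_t;

static int demo_key_create(demo_key_t *key)
{
    return pthread_key_create(key, NULL);
}

static int demo_set(demo_key_t key, void *value)
{
    return pthread_setspecific(key, value);
}

static void *demo_get(demo_key_t key)
{
    /* The value comes back directly as the return value. */
    return pthread_getspecific(key);
}
#endif
```

Keeping both variants behind the same wrapper would make it easy to flip between them when measuring, without touching the callers.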
Frankly, I don't think either one is that attractive unless someone from Oracle/Sun would like to step in and directly address "2" so it could be done in a supported way.
A good performance test case would be needed before going either route though. So far I haven't seen either function in any samples on Solaris, so I decided to postpone further investigation until one of them shows up in a sample.
This will end up getting used in HPC, so we'll likely try to squeeze out every bit of performance possible to make it competitive against OpenMP. (Yes, I know it's not an Apple/Apple comparison.) Personally I don't know much about Linux, but I have a hobby OpenSolaris-based project I work on and may be able to do this there.
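As a starting point for the performance test case mentioned above, something like the following crude microbenchmark could be used. It is only a sketch: demo_bench_tsd and demo_now_sec are made-up names, it measures a single thread with a hot cache, and it says nothing about how libdispatch actually mixes gets and sets.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for clock_gettime() */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

static double demo_now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns the average nanoseconds per set+get pair, or -1.0 on error. */
static double demo_bench_tsd(long iterations)
{
    pthread_key_t key;
    uintptr_t sink = 0;

    assert(iterations > 0);
    if (pthread_key_create(&key, NULL) != 0)
        return -1.0;

    double start = demo_now_sec();
    for (long i = 1; i <= iterations; i++) {
        if (pthread_setspecific(key, (void *)(uintptr_t)i) != 0) {
            pthread_key_delete(key);
            return -1.0;
        }
        sink += (uintptr_t)pthread_getspecific(key);
    }
    double elapsed = demo_now_sec() - start;

    pthread_key_delete(key);

    /* Keep the accumulated result observable to the compiler. */
    volatile uintptr_t keep = sink;
    (void)keep;

    return elapsed * 1e9 / (double)iterations;
}
```

Calling demo_bench_tsd(10000000) and printing the result gives a rough per-pair cost. Comparing that number between the pthread and thr_* variants on the same machine would be more meaningful than the absolute value.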