libdispatch's thread model
Hi,

when dispatching a bunch of jobs (blocks) to a global concurrent queue, libdispatch running on Linux or FreeBSD always spawns a thread for each job submitted. On Mac OS X it works as expected. This somewhat interferes with libdispatch's initial intention of a set of worker threads, equal to the number of processors, that execute each job in turn. Is there some chance to get that fixed?

I know that Apple's non-POSIX extension to pthreads isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?

ciao, Mario
Hi,

On 27.01.2010 16:41, Mario Schwalbe wrote:
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?
As it turns out, the patch is rather simple:

Index: queue.c
===================================================================
--- queue.c	(Revision 181)
+++ queue.c	(Arbeitskopie)
@@ -1188,6 +1188,8 @@
 		if (r != ENOTSUP) {
 			(void)dispatch_assume_zero(r);
 		}
+#else /* HAVE_PTHREAD_WORKQUEUES */
+		_dispatch_root_queue_contexts[i].dgq_thread_pool_size = _dispatch_hw_config.cc_max_active;
 #endif /* HAVE_PTHREAD_WORKQUEUES */
 #if USE_MACH_SEM
 		// override the default FIFO behavior for the pool semaphores

There's similar code in _dispatch_root_queues_init() for the implementation using pthread workqueues, but disabled. Why?

ciao, Mario
On Jan 27, 2010, at 8:01 AM, Mario Schwalbe wrote:
Hi,
On 27.01.2010 16:41, Mario Schwalbe wrote:
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?
As it turns out, the patch is rather simple:
Index: queue.c
===================================================================
--- queue.c	(Revision 181)
+++ queue.c	(Arbeitskopie)
@@ -1188,6 +1188,8 @@
 		if (r != ENOTSUP) {
 			(void)dispatch_assume_zero(r);
 		}
+#else /* HAVE_PTHREAD_WORKQUEUES */
+		_dispatch_root_queue_contexts[i].dgq_thread_pool_size = _dispatch_hw_config.cc_max_active;
 #endif /* HAVE_PTHREAD_WORKQUEUES */
 #if USE_MACH_SEM
 		// override the default FIFO behavior for the pool semaphores
There's similar code in _dispatch_root_queues_init() for the implementation using pthread workqueues, but disabled. Why?
That code is in fact not intended for the pthread workqueue implementation (which never looks at dgq_thread_pool_size) but for the pthread-pool implementation; it was mistakenly included inside the #if HAVE_PTHREAD_WORKQUEUES and should be moved out, cf. below.

The reason it is disabled is explained by the comment: the kernel workqueue mechanism has the ability to create new worker threads when all existing threads for a workqueue are blocked, and the pthread-pool implementation does not. So a non-overcommitting queue with a limited pool size can result in deadlocks for client code that expects to be able to submit many concurrently executing interdependent workitems that block.

diff --git i/src/queue.c w/src/queue.c
index dd18edf..33537ba 100644
--- i/src/queue.c
+++ w/src/queue.c
@@ -1171,18 +1171,17 @@ _dispatch_root_queues_init(void *context __attribute__((unused)))
 #endif
 	for (i = 0; i < DISPATCH_ROOT_QUEUE_COUNT; i++) {
-#if HAVE_PTHREAD_WORKQUEUES
-		r = pthread_workqueue_attr_setqueuepriority_np(&pwq_attr, _dispatch_rootq2wq_pri(i));
-		(void)dispatch_assume_zero(r);
-		r = pthread_workqueue_attr_setovercommit_np(&pwq_attr, i & 1);
-		(void)dispatch_assume_zero(r);
 		// some software hangs if the non-overcommitting queues do not overcommit when threads block
 #if 0
 		if (!(i & 1)) {
 			dispatch_root_queue_contexts[i].dgq_thread_pool_size = _dispatch_hw_config.cc_max_active;
 		}
 #endif
-
+#if HAVE_PTHREAD_WORKQUEUES
+		r = pthread_workqueue_attr_setqueuepriority_np(&pwq_attr, _dispatch_rootq2wq_pri(i));
+		(void)dispatch_assume_zero(r);
+		r = pthread_workqueue_attr_setovercommit_np(&pwq_attr, i & 1);
+		(void)dispatch_assume_zero(r);
 		r = 0;
 		if (disable_wq || (r = pthread_workqueue_create_np(&_dispatch_root_queue_contexts[i].dgq_kworkqueue, &pwq_attr))) {
 			if (r != ENOTSUP) {
Hi,

On 27.01.2010 17:43, Daniel A. Steffen wrote:
On Jan 27, 2010, at 8:01 AM, Mario Schwalbe wrote:
There's similar code in _dispatch_root_queues_init() for the implementation using pthread workqueues, but disabled. Why?
that code is in fact not intended for the pthread workqueue implementation (which never looks at dgq_thread_pool_size) but for the pthread-pool implementation, it was mistakenly included inside the #if HAVE_PTHREAD_WORKQUEUES, it should be moved out, c.f. below
The reason it is disabled is explained by the comment... the kernel workqueue mechanism has the ability to create new worker threads when all existing threads for a workqueue are blocked, ...
This sounds like the main advantage of work queues over thread pools: blocked threads release the processor, allowing other jobs in the queue to run instead. However, if libdispatch isn't able to create new worker threads when existing threads block (I assume due to I/O), only the overall completion time of a set of jobs will increase. This isn't that bad for tasks that use libdispatch to perform asynchronous computations and use dispatch sources to respond to I/O.

@Mark Heily: Is the Apache thread pool implementation you suggested capable of creating new threads? I'm asking because I didn't find any documentation that briefly describes the exact semantics. If it is not, I see no benefit in implementing a work queue API that isn't better than the existing thread pool.
... the pthread-pool implementation does not, so a non-overcommitting queue with a limited pool size can result in deadlocks for client code that expects to be able to submit many concurrently executing interdependent workitems that block.
How?

1. Let's assume someone manages to submit n jobs that block each other, causing a deadlock on a machine having n processors. The thread pool implementation isn't able to execute any other jobs anymore, so the application can be considered erroneous/dead. The work queue implementation is still able to execute jobs waiting in the queue, so the application as a whole can still make some progress, but it cannot finish either: assuming the results of those blocked jobs are important, it at some point has to wait for their results (dispatch_group_wait()), which will block forever as well.

2. A deadlock requires a cycle in the dependency graph. But jobs submitted that are still waiting in the queue won't be executed yet and, hence, won't be able to acquire any resource that prevents other jobs from executing.

ciao, Mario
On Jan 28, 2010, at 7:07 AM, Mario Schwalbe wrote:
1. Let's assume someone manages to submit n jobs that block each other causing a deadlock on a machine having n processors. The thread pool implementation isn't able to execute any other jobs anymore, so the application can be considered erroneous/dead. The work queue implementation is still able to execute jobs waiting in the queue, so the application (as a whole) can still make some progress, but cannot finish either, because - assuming the results of those blocked jobs are important - it at some point has to wait for their results (dispatch_group_wait()) which will block forever as well.
2. A deadlock requires a cycle in the dependency graph. But jobs submitted that are still waiting in the queue, won't be executed yet, and, hence, won't be able to acquire any resource to prevent other jobs from executing.
not necessarily a cycle, a dependency chain longer than the size of the thread pool is sufficient:

sem1 = dispatch_semaphore_create(0);
sem2 = dispatch_semaphore_create(0);
dispatch_async(q, ^{ dispatch_semaphore_wait(sem1, DISPATCH_TIME_FOREVER); });
dispatch_async(q, ^{ dispatch_semaphore_wait(sem2, DISPATCH_TIME_FOREVER); dispatch_semaphore_signal(sem1); });
dispatch_async(q, ^{ dispatch_semaphore_signal(sem2); });

this will deadlock if the pool size is 2 but work correctly with a larger pool (or with the workqueue implementation)
On Thu, Jan 28, 2010 at 12:20 PM, Daniel A. Steffen <dsteffen@apple.com> wrote:
On Jan 28, 2010, at 7:07 AM, Mario Schwalbe wrote:
1. Let's assume someone manages to submit n jobs that block each other causing a deadlock on a machine having n processors. The thread pool implementation isn't able to execute any other jobs anymore, so the application can be considered erroneous/dead. The work queue implementation is still able to execute jobs waiting in the queue, so the application (as a whole) can still make some progress, but cannot finish either, because - assuming the results of those blocked jobs are important - it at some point has to wait for their results (dispatch_group_wait()) which will block forever as well.
2. A deadlock requires a cycle in the dependency graph. But jobs submitted that are still waiting in the queue, won't be executed yet, and, hence, won't be able to acquire any resource to prevent other jobs from executing.
not necessarily a cycle, a dependency chain longer than the size of the thread pool is sufficient:

sem1 = dispatch_semaphore_create(0);
sem2 = dispatch_semaphore_create(0);
dispatch_async(q, ^{ dispatch_semaphore_wait(sem1, DISPATCH_TIME_FOREVER); });
dispatch_async(q, ^{ dispatch_semaphore_wait(sem2, DISPATCH_TIME_FOREVER); dispatch_semaphore_signal(sem1); });
dispatch_async(q, ^{ dispatch_semaphore_signal(sem2); });
this will deadlock if the pool size is 2 but work correctly with a larger pool (or with the workqueue implementation)
Or to put it another way, every dispatch block has an implicit dependency on the thread pool itself, so when you add that to the analysis, this situation *does* have a cycle: the first two blocks acquire thread pool resources, and the last block waits on thread pool resources before it can run.

This is something to watch out for even on the Mac, because the thread pool maxes out at, I believe, 512 threads. It's tough, but not impossible, to hit that number inadvertently and have everything seize up.

Mike
On Jan 28, 2010, at 11:36 AM, Michael Ash wrote:
On Thu, Jan 28, 2010 at 12:20 PM, Daniel A. Steffen <dsteffen@apple.com> wrote:
On Jan 28, 2010, at 7:07 AM, Mario Schwalbe wrote:
1. Let's assume someone manages to submit n jobs that block each other causing a deadlock on a machine having n processors. The thread pool implementation isn't able to execute any other jobs anymore, so the application can be considered erroneous/dead. The work queue implementation is still able to execute jobs waiting in the queue, so the application (as a whole) can still make some progress, but cannot finish either, because - assuming the results of those blocked jobs are important - it at some point has to wait for their results (dispatch_group_wait()) which will block forever as well.
2. A deadlock requires a cycle in the dependency graph. But jobs submitted that are still waiting in the queue, won't be executed yet, and, hence, won't be able to acquire any resource to prevent other jobs from executing.
not necessarily a cycle, a dependency chain longer than the size of the thread pool is sufficient:

sem1 = dispatch_semaphore_create(0);
sem2 = dispatch_semaphore_create(0);
dispatch_async(q, ^{ dispatch_semaphore_wait(sem1, DISPATCH_TIME_FOREVER); });
dispatch_async(q, ^{ dispatch_semaphore_wait(sem2, DISPATCH_TIME_FOREVER); dispatch_semaphore_signal(sem1); });
dispatch_async(q, ^{ dispatch_semaphore_signal(sem2); });
this will deadlock if the pool size is 2 but work correctly with a larger pool (or with the workqueue implementation)
Or to put it another way, every dispatch block has an implicit dependency on the thread pool itself, so when you add that to the analysis, this situation *does* have a cycle: the first two blocks acquire thread pool resources, and the last block waits on thread pool resources before it can run.
This is something to watch out for even on the Mac, because the thread pool maxes out at, I believe, 512 threads. It's tough, but not impossible, to hit that number inadvertently and have everything seize up.
Yes, there is a finite limit to threads even on a Mac. :)

The thread workqueue code in the kernel hooks before and after the thread context switch so it can monitor how long threads have been blocked on I/O (besides hooking thread yields). If threads have been blocked for a while, it will add more threads to the work pool. If threads have been "parked" (not doing anything) for a while, it will reduce the pool size as well.

Before doing an implementation similar to Apple's, I partially implemented some kernel code that would send "hints" (via kevents) to a user-level thread manager about creating new threads. Unfortunately, that didn't seem to work so well: it would stall for a relatively long period of time when a bunch of work items were submitted to the work queue. Having the kernel create the new threads seems to do much better.

-stacey.
On Jan 27, 2010, at 9:41 AM, Mario Schwalbe wrote:
Hi,
when dispatching a bunch of jobs (blocks) to a global concurrent queue, libdispatch, running on Linux or FreeBSD, always spawns a thread for each job submitted. On Mac OS X it works as expected.
This somewhat interferes with libdispatch's initial intention of a set of worker threads - equal to the number of processors - that execute each job in turn. Is there some chance to get that fixed?
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?
Can I interest you in testing an experimental patch for FreeBSD that adds pthread work queues? :-)

See http://people.freebsd.org/~sson/thrworkq/

Best Regards,

-stacey.
Mario Schwalbe wrote:
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?
A more useful solution would be to create a portable implementation of Apple's pthread_workqueue interfaces [1]. This has been on my TODO list for some time, and I filed a Trac issue about this problem [2].

I don't really want to reinvent the wheel, so my first attempt will use the APR thread pool routines [3]. To prevent deadlock, each workqueue will be serviced by a separate thread pool. This won't scale to large numbers of workqueues, and makes it impossible to implement priority levels, but it should be simple and work better than creating a new thread for each task.

Stacey Son has kindly documented the pthread_workqueue(3) API, which should be very helpful to get the correct semantics [4]. I have taken a stab at implementing the most important functions, and it looks promising [5].

Regards,

 - Mark

[1] datatypes:
      pthread_workqueue_t
      pthread_workqueue_attr_t
    functions:
      pthread_workqueue_init_np
      pthread_workqueue_attr_init_np
      pthread_workqueue_attr_destroy_np
      pthread_workqueue_attr_setqueuepriority_np
      pthread_workqueue_attr_setovercommit_np
      pthread_workqueue_create_np
      pthread_workqueue_additem_np

[2] http://libdispatch.macosforge.org/trac/ticket/2

[3] http://apr.apache.org/docs/apr/trunk/group___a_p_r___util___t_p.html

[4] http://people.freebsd.org/~sson/thrworkq/pthread_workqueue_2009_12_14.diff

[5] #include <pthread.h>
    #include <apr-1.0/apr_thread_pool.h>

    #define pthread_workqueue_t apr_thread_pool_t

    static pthread_once_t init_once = PTHREAD_ONCE_INIT;
    static apr_pool_t *mpool;

    static int
    _pthread_workqueue_init(void)
    {
        return (apr_pool_create(&mpool, NULL) == APR_OK ? 0 : -1);
    }

    int
    pthread_workqueue_init_np(void)
    {
        pthread_once(_pthread_workqueue_init, init_once);
    }

    int
    pthread_workqueue_create_np(pthread_workqueue_t *workqp,
        const pthread_workqueue_attr_t *attr)
    {
        return (apr_thread_pool_create(workqp, 1, 255, mpool) == APR_OK ? 0 : -1);
    }

    int
    pthread_workqueue_additem_np(pthread_workqueue_t workq,
        void (*workitem_func)(void *), void *workitem_arg,
        pthread_workitem_handle_t *itemhandlep, unsigned int *gencountp)
    {
        int rv;

        rv = apr_thread_pool_push(workq, workitem_func, workitem_arg,
            APR_THREAD_TASK_PRIORITY_NORMAL, NULL);
        /* FIXME - Hope these aren't actually used :) */
        *itemhandlep = NULL;
        *gencountp = 0;
        return (rv == APR_OK ? 0 : -1);
    }
On Jan 27, 2010, at 9:32 PM, Mark Heily wrote:
Mario Schwalbe wrote:
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?
Stacy Son has kindly documented the pthread_workqueue(3) API which should be very helpful to get the correct semantics [4].
FYI, I didn't document all of Apple's pthread_workqueue_*_np() functions, just the more interesting ones. Also, my system call interface is very different from Apple's and isn't documented, but I wasn't trying to keep it compatible with their implementation.
[4]
http://people.freebsd.org/~sson/thrworkq/pthread_workqueue_2009_12_14.diff
You might find http://people.freebsd.org/~sson/thrworkq/pthread_workqueue.3.txt a bit easier to read. Regards, -stacey.
Hi,

On 28.01.2010 04:32, Mark Heily wrote:
Mario Schwalbe wrote:
I know that Apple's non-posix extension to pthread isn't available. Nevertheless, it should be possible to at least use a bounded number of worker threads less than MAX_THREAD_COUNT (255). What about dynamically setting MAX_THREAD_COUNT (or a similar variable) to the number of processors instead?

A more useful solution would be to create a portable implementation of Apple's pthread_workqueue interfaces [1]. This has been on my TODO list for some time, and I filed a Trac issue about this problem [2].
Ack. But since it most likely requires a kernel extension to implement overcommitting, it cannot really be portable.
I don't really want to reinvent the wheel, so my first attempt will use the APR thread pool routines [3]. To prevent deadlock, each workqueue will be serviced by a separate thread pool. This won't scale to large numbers of workqueues, and makes it impossible to implement priority levels, but it should be simple and work better than creating a new thread for each task.
Stacy Son has kindly documented the pthread_workqueue(3) API which should be very helpful to get the correct semantics [4]. I have taken a stab at implementing the most important functions, and it looks promising [5].
And I played with it a bit and filled in some lines of code. (You didn't actually compile or run it, did you?)

The APR thread pool implementation is capable of creating new threads, but only when a certain threshold of waiting jobs is reached. Also, in contrast to your suggestion, I used only one global thread pool, since we cannot prevent deadlocks anyway, but at least this way we have priority support.

ciao, Mario
participants (5)
- Daniel A. Steffen
- Mario Schwalbe
- Mark Heily
- Michael Ash
- Stacey Son