[launchd-dev] Hanging system calls with non-system users

Axel Luttgens AxelLuttgens at swing.be
Thu Mar 14 05:12:42 PDT 2013


Le 2 mars 2013 à 17:42, Axel Luttgens a écrit :

> Going a bit further with the topics of my previous post. ;-)
> 
> [...]

And possibly another perspective.

As a reminder, I wrote a very minimalist code (a "skeleton") so as to try to circumscribe what was happening with Dovecot's child processes running as "virtual" users.

Going further with my investigations, I was inserting various syslog calls into Dovecot's code for debugging purposes. And suddenly noticed that having a call to syslog() somewhere before the call to setgroups() in Dovecot's child process avoids those mysterious launchd interpositions with "pure" virtual users...

Surprising, isn't it?

Anyway, annoying too, since I wasn't comparing similar things: up to now, my pseudochild process was running under cases 1 and 2 hereafter, while Dovecot's child processes were running under cases 3 and 4.

I thus went back to my original pseudochild.c code, the one with the call to gethostbyname(), and the results were:

         existing   call to     skeleton's     Dovecot's
           gid      syslog()      child          child
         --------   --------    ----------     ---------
CASE 1     no         yes          OK             OK
CASE 2     yes        yes          OK             OK
CASE 3     no         no          fails          fails
CASE 4     yes        no          fails           OK

Something is still missing in my skeleton code so as to mimic more closely Dovecot's children behavior (see case 4); finding that missing piece might help to further diagnose the phenomena coming into play.

So, why does a preliminary call to syslog() have such an influence?
Looks like there's a bug somewhere...

On the other hand, perhaps should some provisions be made, in the plist or/and in the code, so as to avoid those hangs?

Or are such problems fundamentally unavoidable, for example because of some changes in the security model? 
(but then, one would expect to face something more well-behaved...)

I describe hereafter the behaviors displayed by my skeleton code under cases 1 to 4; for the sake of completeness, as well as to avoid any ambiguity, I also reproduce that code.

Still desperately in quest of an explanation. ;-)

TIA,
Axel


CASE 1:
=======

Compile pseudochild.c with lines 12 and 16 set to:

	line 12: gid_t gid = 100016;
	line 16: syslog(LOG_NOTICE, "Pseudochild: started.");

Launching the master process with:

	sudo launchctl load -w /Library/LaunchDaemons/pseudomaster.plist

yields expected results in the log:

	Mar 14 09:59:30 ALMba.local pseudomaster[45210] <Notice>: Master: started.
	Mar 14 09:59:30 ALMba.local pseudomaster[45212] <Notice>: Child: forked.
	Mar 14 09:59:30 ALMba.local pseudochild[45212] <Notice>: Pseudochild: started.
	Mar 14 09:59:30 ALMba.local pseudochild[45212] <Notice>: Pseudochild: gethostbyname() returned almba.local
	Mar 14 09:59:30 ALMba.local pseudomaster[45210] <Notice>: Master: child done with 0.
	Mar 14 09:59:35 ALMba.local pseudomaster[45210] <Notice>: Master: looping.
	Mar 14 09:59:35 ALMba.local pseudomaster[45213] <Notice>: Child: forked.
	[and so on]

Unload the master process:

	sudo launchctl unload -w /Library/LaunchDaemons/pseudomaster.plist


CASE 2:
=======

Compile pseudochild.c with:

	line 12: gid_t gid = 20;
	line 16: syslog(LOG_NOTICE, "Pseudochild: started.");

Launch the master process and look at the log:

	Mar 14 10:12:41 ALMba.local pseudomaster[45310] <Notice>: Master: started.
	Mar 14 10:12:41 ALMba.local pseudomaster[45312] <Notice>: Child: forked.
	Mar 14 10:12:41 ALMba.local pseudochild[45312] <Notice>: Pseudochild: started.
	Mar 14 10:12:42 ALMba.local pseudochild[45312] <Notice>: Pseudochild: gethostbyname() returned almba.local
	Mar 14 10:12:42 ALMba.local pseudomaster[45310] <Notice>: Master: child done with 0.
	Mar 14 10:12:47 ALMba.local pseudomaster[45310] <Notice>: Master: looping.
	Mar 14 10:12:47 ALMba.local pseudomaster[45313] <Notice>: Child: forked.
	[and so on]

Unload the master process.


CASE 3:
=======

Compile pseudochild.c with:

	line 12: gid_t gid = 100016;
	line 16: // syslog(LOG_NOTICE, "Pseudochild: started.");

Launch the master process and look at the log:

	Mar 14 10:22:52 ALMba.local pseudomaster[45402] <Notice>: Master: started.
	Mar 14 10:22:52 ALMba.local pseudomaster[45403] <Notice>: Child: forked.
	Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Error>: getpwuid("100016") failed
	Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process
	Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Notice>: Job failed to exec(3) for weird reason: 3

Processes pseudomaster and pseudochild are now sleeping (according to top) forever.
A backtrace of the hung pseudochild process shows:

	(gdb) bt full
	#0  0x00007fff8644e686 in mach_msg_trap ()
	No symbol table info available.
	#1  0x00007fff8644dc42 in mach_msg ()
	No symbol table info available.
	#2  0x00007fff921e47c4 in _vproc_transaction_try_exit ()
	No symbol table info available.
	#3  0x00007fff921e3515 in bootstrap_look_up3 ()
	No symbol table info available.
	#4  0x00007fff921e3591 in bootstrap_look_up ()
	No symbol table info available.
	#5  0x00007fff8f6091a1 in ___notify_lib_init_block_invoke_0 ()
	No symbol table info available.
	#6  0x00007fff8dd8b0b6 in _dispatch_client_callout ()
	No symbol table info available.
	#7  0x00007fff8dd8b041 in dispatch_once_f ()
	No symbol table info available.
	#8  0x00007fff8f60573f in _notify_lib_init ()
	No symbol table info available.
	#9  0x00007fff8f606834 in notify_register_check ()
	No symbol table info available.
	#10 0x00007fff8aee8a39 in __si_module_static_ds_block_invoke_0 ()
	No symbol table info available.
	#11 0x00007fff8dd8b0b6 in _dispatch_client_callout ()
	No symbol table info available.
	#12 0x00007fff8dd8b041 in dispatch_once_f ()
	No symbol table info available.
	#13 0x00007fff8aee89c0 in si_module_static_ds ()
	No symbol table info available.
	#14 0x00007fff8aee8489 in si_module_with_name ()
	No symbol table info available.
	#15 0x00007fff8aee88dd in si_module_config_modules_for_category ()
	No symbol table info available.
	#16 0x00007fff8aee85e0 in __si_module_static_search_block_invoke_0 ()
	No symbol table info available.
	#17 0x00007fff8dd8b0b6 in _dispatch_client_callout ()
	No symbol table info available.
	#18 0x00007fff8dd8b041 in dispatch_once_f ()
	No symbol table info available.
	#19 0x00007fff8aee8580 in si_module_static_search ()
	No symbol table info available.
	#20 0x00007fff8aee8489 in si_module_with_name ()
	No symbol table info available.
	#21 0x00007fff8aeec086 in gethostbyname ()
	No symbol table info available.
	#22 0x0000000105231da0 in main ()
	No symbol table info available.

Unload the master process.
Note that this this is sufficient to have the pseudochild process to be stopped as well; in the case of Dovecot, the child process has to be KILLed, probably because of a more complex environment.

Also remove the peruser launchd job that has just been created:
	sudo launchctl remove com.apple.launchd.peruser.100016
otherwise subsequent failures won't be explicitly logged anymore.


CASE 4:
=======

Compile pseudochild.c with:

	line 12: gid_t gid = 20;
	line 16: // syslog(LOG_NOTICE, "Pseudochild: started.");

Launch the master process and look at the log:

	Mar 14 10:42:41 ALMba.local pseudomaster[45608] <Notice>: Master: started.
	Mar 14 10:42:41 ALMba.local pseudomaster[45609] <Notice>: Child: forked.
	Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Error>: getpwuid("100016") failed
	Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process
	Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Notice>: Job failed to exec(3) for weird reason: 3

So, same behavior as with CASE 3.
Note that with Dovecot, making use of an existing gid allows the child process to run without any problem. Still trying to find what I should add to my skeleton code so as to reproduce that ability.

Unload the master process.
Remove the peruser launchd job.


CODE:
=====

#include <syslog.h>
#include <unistd.h>
#include <stdlib.h>

int   main( int argc, const char * argv[]) 
{
	pid_t pid;
	int retstatus;
	char * dummyargs[] = { "pseudochild", NULL };

	syslog(LOG_NOTICE, "Master: started.");

	for (;;)
	{
		pid = fork();
	
		if (pid < 0)
		{
			syslog(LOG_NOTICE, "Master: fork() failed.");
			return(1);
		}
		else if (pid == 0)
		{
			syslog(LOG_NOTICE, "Child: forked.");
		
			if (execv("/_COD/pseudochild", dummyargs) < 0)
			{
				syslog(LOG_NOTICE, "Child: execv() failed.");
				_exit(1);
			}
		}

		waitpid(pid, & retstatus, 0);
		syslog(LOG_NOTICE, "Master: child done with %i.", retstatus);
		sleep(5);
		syslog(LOG_NOTICE, "Master: looping.");
	}
}


#include <syslog.h>
#include <netdb.h>
#include <unistd.h>
#include <stdlib.h>

int   main( int argc, const char * argv[]) 
{
	struct hostent *hent;
	gid_t gidset[1];

	uid_t uid = 100016;
	gid_t gid = 100016;
	
	gidset[0] = gid;

	syslog(LOG_NOTICE, "Pseudochild: started.");
	
	if (setgid(gid) != 0)
	{
		syslog(LOG_NOTICE, "Pseudochild: setgid() failed.");
		_exit(1);
	}
	if (setgroups(1, gidset) != 0)
	{
		syslog(LOG_NOTICE, "Pseudochild: setgroups() failed.");
		_exit(1);
	}
	if (setuid(uid) != 0)
	{
		syslog(LOG_NOTICE, "Pseudochild: setuid() failed.");
		_exit(1);
	}

	hent = gethostbyname("ALMba.local");
	if ( hent != NULL )
	{
		syslog(LOG_NOTICE, "Pseudochild: gethostbyname() returned %s\n", hent->h_name);
		return(0);
	}
	else
	{
		syslog(LOG_NOTICE, "Pseudochild: gethostbyname() failed.");
		return(1);
	}
}


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>ServiceDescription</key>
	<string>Master test</string>
	<key>Label</key>
	<string>pseudomaster</string>
	<key>Disabled</key>
	<true/>
	<key>KeepAlive</key>
	<true/>
	<key>ProgramArguments</key>
	<array>
		<string>/_COD/pseudomaster</string>
	</array>
</dict>
</plist>



More information about the launchd-dev mailing list