Hanging system calls with non-system users
Hello, I'm encountering a problem that seems to be launchd-related, hence my post here. Beside Dovecot's case, I tend to believe this could be of interest for everyone devising daemons. Dovecot may be configured with pure "virtual" users, i.e. users with an uid/gid pair unknown in the system; this allows to fetch those users from, say, a database while providing increased security (virtual user A should not be able to access the mailbox of virtual user B). On Mac OS X (10.8.2 here), with Dovecot 2.1.14, this works up to a certain point; for example, connecting as "user1" (100016/100016) for the first time, one gets this in the logs: dovecot[97622] <Info>: master: Dovecot v2.1.14 starting up (core dumps disabled) dovecot[97624] <Info>: pop3-login: Login: user=<user1>, method=PLAIN, rip=127.0.0.1, lip=127.0.0.1, mpid=97632, secured, session=<rziWFXnW0gB/AAAB> dovecot[97624] <Debug>: pop3(user1): Debug: Effective uid=100016, gid=100016, home=/_Data/Mailstores/100016 dovecot[97624] <Debug>: pop3(user1): Debug: Home dir not found: /_Data/Mailstores/100016 dovecot[97624] <Debug>: pop3(user1): Debug: fs: root=/_Data/Mailstores/100016/mboxes, index=, control=, inbox=/_Data/Mailstores/100016/mboxes/inbox, alt= dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : /_Data/Mailstores/100016/mboxes doesn't exist yet, using default permissions dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : Using permissions from /_Data/Mailstores/100016/mboxes: mode=0700 gid=-1 dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : /_Data/Mailstores/100016/mboxes/inbox doesn't exist yet, using default permissions dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : Using permissions from /_Data/Mailstores/100016/mboxes: mode=0700 gid=-1 dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : /_Data/Mailstores/100016/mboxes/inbox doesn't exist yet, using default permissions dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : Using permissions from /_Data/Mailstores/100016/mboxes: mode=0700 gid=-1 dovecot[97624] <Debug>: pop3(user1): Debug: In my_hostdomain: 100016 100016 100016 100016 com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Error>: getpwuid("100016") failed com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Notice>: Job failed to exec(3) for weird reason: 3 and the pop3 process then hangs forever; but many things have nevertheless been done for user1, especially the creation of needed files/directories. Subsequent connections as "user1" (after having restarted Dovecot and sent KILL to the pop3 process) systematically end with hanging pop3 processes, but without those com.apple.launchd.peruser messages in the logs anymore. Debugging has been quite a challenge (at least, with my poor skills), since all of this becomes quite elusive with the silent interposition of launchd; but it is now almost certain that the hangs occur on a call to gethostbyname() from the code of the pop3 executable. I suspect this to be more general and to be liable to occur with other system calls as well (probably those from LibInfo). So, I've devised the pieces of code at the end of this message so as to provide a skeleton of how Dovecot's master process spawns its children, for example a child pop daemon upon a client connection on port 110. They really are a skeleton, as Dovecot performs much more initializations (pre-exec() and post-exec() setups in the comments below). Yet, they may be viewed as a proof of concept: a child process running as a virtual user may invoke a system call such as gethostbyname() without encountering any problem; in the logs: pseudomaster[11946] <Notice>: Master: started. pseudomaster[11947] <Notice>: Child: forked. Unknown[11947] <Notice>: Pseudochild: started. Unknown[11947] <Notice>: Pseudochild: gethostbyname() returned almba.local pseudomaster[11946] <Notice>: Master: child done. pseudomaster[11946] <Notice>: Master: looping. pseudomaster[11949] <Notice>: Child: forked. Unknown[11949] <Notice>: Pseudochild: started. Unknown[11949] <Notice>: Pseudochild: gethostbyname() returned almba.local pseudomaster[11946] <Notice>: Master: child done. pseudomaster[11946] <Notice>: Master: looping. [...] So, the question is to know what Dovecot may do so as to trigger hangs of gethostbyname(). Note that the problem disappear with Dovecot as soon as one makes use of a "real" gid, such as the one of group "staff": a "semi-virtual" user with 100016/20 may connect without any problem; this may perhaps provide a clue about the kind of settings Dovecot makes use of. Conversely, what kind of additional settings should I consider in my skeleton so as to trigger similar hangs in the child process? But more generally, what exactly is the rationale for having launchd entering the game? Many thanks in advance, Axel pseudomaster.c ============== #include <syslog.h> #include <unistd.h> #include <stdlib.h> int main( int argc, const char * argv[]) { pid_t pid; int retstatus; char * dummyargs[] = { NULL }; syslog(LOG_NOTICE|LOG_MAIL, "Master: started."); for (;;) { pid = fork(); if (pid < 0) { syslog(LOG_NOTICE|LOG_MAIL, "Master: fork() failed."); } else if (pid == 0) { syslog(LOG_NOTICE|LOG_MAIL, "Child: forked."); // Pre-exec() setups, e.g. wrt file descriptors. if (execv("/_COD/pseudochild", dummyargs) < 0) { syslog(LOG_NOTICE|LOG_MAIL, "Child: execv() failed."); _exit(1); } } waitpid(-1, & retstatus, 0); syslog(LOG_NOTICE|LOG_MAIL, "Master: child done."); sleep(5); syslog(LOG_NOTICE|LOG_MAIL, "Master: looping."); } } pseudochild.c ============= #include <syslog.h> #include <netdb.h> #include <unistd.h> #include <stdlib.h> int main( int argc, const char * argv[]) { struct hostent *hent; gid_t gidset[1]; uid_t uid = 100016; gid_t gid = 100016; gidset[0] = gid; syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: started."); if (setgid(gid) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setgid() failed."); _exit(1); } if (setgroups(1, gidset) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setgroups() failed."); _exit(1); } if (setuid(uid) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setuid() failed."); _exit(1); } // Other post-exec() setups hent = gethostbyname("ALMba.local"); if ( hent != NULL ) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: gethostbyname() returned %s\n", hent->h_name); return(0); } else { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: gethostbyname() failed."); return(1); } } pseudomaster.plist ================== <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>ServiceDescription</key> <string>Master test</string> <key>Label</key> <string>pseudomaster</string> <key>Disabled</key> <true/> <key>KeepAlive</key> <true/> <key>ProgramArguments</key> <array> <string>/_COD/pseudomaster</string> </array> </dict> </plist>
Going a bit further with the topics of my previous post. ;-) As hinted in the log excerpt, it could well be that Dovecot invokes getpwuid(3), directly or indirectly:
dovecot[97622] <Info>: master: Dovecot v2.1.14 starting up (core dumps disabled) [...] dovecot[97624] <Debug>: pop3(user1): Debug: Namespace : Using permissions from /_Data/Mailstores/100016/mboxes: mode=0700 gid=-1 com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Error>: getpwuid("100016") failed com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process com.apple.launchd[1] (com.apple.launchd.peruser.100016[97633]) <Notice>: Job failed to exec(3) for weird reason: 3
In this precise case (opening a first time pop connection), this seems to be an indirect call to getpwuid(3). Anyway, keeping pseudomaster.plist and pseudomaster.c unchanged, I quickly modified pseudochild.c so that it now calls getpwuid(3). When loading pseudomaster.plist for the first time, this is what get written in the logs: ALMba.local pseudomaster[66978] <Notice>: Master: started. ALMba.local pseudomaster[66980] <Notice>: Child: forked. ALMba.local pseudochild[66980] <Notice>: Pseudochild: started. ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100018[66981]) <Error>: getpwuid("100018") failed ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100018[66981]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100018[66981]) <Notice>: Job failed to exec(3) for weird reason: 3 and process pseudochild is just hanging. Note the process number 66981, which is not pseudochild's one (66980), as if there were an attempt to spawn a subprocess. And this is a very elusive one; for example, no way to catch it with execsnoop or similar tools. Even launchd appears somewhat lost. Subsequent unloads/reloads of pseudomaster.plist always end with a hanging pseudochild process, yet without those com.apple.launchd.peruser messages anymore. In fact, in order to get those messages back, one has to remove the job bearing label "com.apple.launchd.peruser.100018". And this is without mentioning the directories /var/log/com.apple.launchd.peruser.100018 and /var/db/launchd/com.apple.launchd.peruser.100018 created under such circumstances; moreover, those directories seem to persist across reboots... So, all of this looks quite similar to the problems encountered with Dovecot; the fact that the stack of the hung pop process ends around a gethostbyname call could thus just be a red herring. On the other hand, Dovecot manages to go a bit further than my pseudochild.c code: with an uid/gid pair such as 100018/20, Dovecot's pop3 process doesn't hang and performs without a glitch, while pseudochild desperately insists on entering in a stuck state. Currently, I'm with a wtf? mood... I would really appreciate some hints, some explanations for all those phenomenona... TIA, Axel pseudochild.c ============= #include <syslog.h> #include <unistd.h> #include <sys/errno.h> #include <sys/types.h> #include <pwd.h> #include <uuid/uuid.h> int main( int argc, const char * argv[]) { struct passwd * pw; gid_t gidset[1]; uid_t uid = 100018; gid_t gid = 100018; gidset[0] = gid; syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: started."); if (setgid(gid) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setgid() failed."); _exit(1); } if (setgroups(1, gidset) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setgroups() failed."); _exit(1); } if (setuid(uid) != 0) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: setuid() failed."); _exit(1); } errno = 0; pw = getpwuid(uid); if ( pw != NULL ) { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: getpwuid() succeeded"); _exit(0); } else { syslog(LOG_NOTICE|LOG_MAIL, "Pseudochild: getpwuid() failed with rc %i", errno); _exit(1); } }
Le 2 mars 2013 à 17:42, Axel Luttgens a écrit :
Going a bit further with the topics of my previous post. ;-)
[...]
And possibly another perspective. As a reminder, I wrote a very minimalist code (a "skeleton") so as to try to circumscribe what was happening with Dovecot's child processes running as "virtual" users. Going further with my investigations, I was inserting various syslog calls into Dovecot's code for debugging purposes. And suddenly noticed that having a call to syslog() somewhere before the call to setgroups() in Dovecot's child process avoids those mysterious launchd interpositions with "pure" virtual users... Surprising, isn't it? Anyway, annoying too, since I wasn't comparing similar things: up to now, my pseudochild process was running under cases 1 and 2 hereafter, while Dovecot's child processes were running under cases 3 and 4. I thus went back to my original pseudochild.c code, the one with the call to gethostbyname(), and the results were: existing call to skeleton's Dovecot's gid syslog() child child -------- -------- ---------- --------- CASE 1 no yes OK OK CASE 2 yes yes OK OK CASE 3 no no fails fails CASE 4 yes no fails OK Something is still missing in my skeleton code so as to mimic more closely Dovecot's children behavior (see case 4); finding that missing piece might help to further diagnose the phenomena coming into play. So, why does a preliminary call to syslog() have such an influence? Looks like there's a bug somewhere... On the other hand, perhaps should some provisions be made, in the plist or/and in the code, so as to avoid those hangs? Or are such problems fundamentally unavoidable, for example because of some changes in the security model? (but then, one would expect to face something more well-behaved...) I describe hereafter the behaviors displayed by my skeleton code under cases 1 to 4; for the sake of completeness, as well as to avoid any ambiguity, I also reproduce that code. Still desperately in quest of an explanation. ;-) TIA, Axel CASE 1: ======= Compile pseudochild.c with lines 12 and 16 set to: line 12: gid_t gid = 100016; line 16: syslog(LOG_NOTICE, "Pseudochild: started."); Launching the master process with: sudo launchctl load -w /Library/LaunchDaemons/pseudomaster.plist yields expected results in the log: Mar 14 09:59:30 ALMba.local pseudomaster[45210] <Notice>: Master: started. Mar 14 09:59:30 ALMba.local pseudomaster[45212] <Notice>: Child: forked. Mar 14 09:59:30 ALMba.local pseudochild[45212] <Notice>: Pseudochild: started. Mar 14 09:59:30 ALMba.local pseudochild[45212] <Notice>: Pseudochild: gethostbyname() returned almba.local Mar 14 09:59:30 ALMba.local pseudomaster[45210] <Notice>: Master: child done with 0. Mar 14 09:59:35 ALMba.local pseudomaster[45210] <Notice>: Master: looping. Mar 14 09:59:35 ALMba.local pseudomaster[45213] <Notice>: Child: forked. [and so on] Unload the master process: sudo launchctl unload -w /Library/LaunchDaemons/pseudomaster.plist CASE 2: ======= Compile pseudochild.c with: line 12: gid_t gid = 20; line 16: syslog(LOG_NOTICE, "Pseudochild: started."); Launch the master process and look at the log: Mar 14 10:12:41 ALMba.local pseudomaster[45310] <Notice>: Master: started. Mar 14 10:12:41 ALMba.local pseudomaster[45312] <Notice>: Child: forked. Mar 14 10:12:41 ALMba.local pseudochild[45312] <Notice>: Pseudochild: started. Mar 14 10:12:42 ALMba.local pseudochild[45312] <Notice>: Pseudochild: gethostbyname() returned almba.local Mar 14 10:12:42 ALMba.local pseudomaster[45310] <Notice>: Master: child done with 0. Mar 14 10:12:47 ALMba.local pseudomaster[45310] <Notice>: Master: looping. Mar 14 10:12:47 ALMba.local pseudomaster[45313] <Notice>: Child: forked. [and so on] Unload the master process. CASE 3: ======= Compile pseudochild.c with: line 12: gid_t gid = 100016; line 16: // syslog(LOG_NOTICE, "Pseudochild: started."); Launch the master process and look at the log: Mar 14 10:22:52 ALMba.local pseudomaster[45402] <Notice>: Master: started. Mar 14 10:22:52 ALMba.local pseudomaster[45403] <Notice>: Child: forked. Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Error>: getpwuid("100016") failed Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process Mar 14 10:22:52 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45404]) <Notice>: Job failed to exec(3) for weird reason: 3 Processes pseudomaster and pseudochild are now sleeping (according to top) forever. A backtrace of the hung pseudochild process shows: (gdb) bt full #0 0x00007fff8644e686 in mach_msg_trap () No symbol table info available. #1 0x00007fff8644dc42 in mach_msg () No symbol table info available. #2 0x00007fff921e47c4 in _vproc_transaction_try_exit () No symbol table info available. #3 0x00007fff921e3515 in bootstrap_look_up3 () No symbol table info available. #4 0x00007fff921e3591 in bootstrap_look_up () No symbol table info available. #5 0x00007fff8f6091a1 in ___notify_lib_init_block_invoke_0 () No symbol table info available. #6 0x00007fff8dd8b0b6 in _dispatch_client_callout () No symbol table info available. #7 0x00007fff8dd8b041 in dispatch_once_f () No symbol table info available. #8 0x00007fff8f60573f in _notify_lib_init () No symbol table info available. #9 0x00007fff8f606834 in notify_register_check () No symbol table info available. #10 0x00007fff8aee8a39 in __si_module_static_ds_block_invoke_0 () No symbol table info available. #11 0x00007fff8dd8b0b6 in _dispatch_client_callout () No symbol table info available. #12 0x00007fff8dd8b041 in dispatch_once_f () No symbol table info available. #13 0x00007fff8aee89c0 in si_module_static_ds () No symbol table info available. #14 0x00007fff8aee8489 in si_module_with_name () No symbol table info available. #15 0x00007fff8aee88dd in si_module_config_modules_for_category () No symbol table info available. #16 0x00007fff8aee85e0 in __si_module_static_search_block_invoke_0 () No symbol table info available. #17 0x00007fff8dd8b0b6 in _dispatch_client_callout () No symbol table info available. #18 0x00007fff8dd8b041 in dispatch_once_f () No symbol table info available. #19 0x00007fff8aee8580 in si_module_static_search () No symbol table info available. #20 0x00007fff8aee8489 in si_module_with_name () No symbol table info available. #21 0x00007fff8aeec086 in gethostbyname () No symbol table info available. #22 0x0000000105231da0 in main () No symbol table info available. Unload the master process. Note that this this is sufficient to have the pseudochild process to be stopped as well; in the case of Dovecot, the child process has to be KILLed, probably because of a more complex environment. Also remove the peruser launchd job that has just been created: sudo launchctl remove com.apple.launchd.peruser.100016 otherwise subsequent failures won't be explicitly logged anymore. CASE 4: ======= Compile pseudochild.c with: line 12: gid_t gid = 20; line 16: // syslog(LOG_NOTICE, "Pseudochild: started."); Launch the master process and look at the log: Mar 14 10:42:41 ALMba.local pseudomaster[45608] <Notice>: Master: started. Mar 14 10:42:41 ALMba.local pseudomaster[45609] <Notice>: Child: forked. Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Error>: getpwuid("100016") failed Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Notice>: Job failed to exec(3). Setting up event to tell us when to try again: 3: No such process Mar 14 10:42:41 ALMba com.apple.launchd[1] (com.apple.launchd.peruser.100016[45610]) <Notice>: Job failed to exec(3) for weird reason: 3 So, same behavior as with CASE 3. Note that with Dovecot, making use of an existing gid allows the child process to run without any problem. Still trying to find what I should add to my skeleton code so as to reproduce that ability. Unload the master process. Remove the peruser launchd job. CODE: ===== #include <syslog.h> #include <unistd.h> #include <stdlib.h> int main( int argc, const char * argv[]) { pid_t pid; int retstatus; char * dummyargs[] = { "pseudochild", NULL }; syslog(LOG_NOTICE, "Master: started."); for (;;) { pid = fork(); if (pid < 0) { syslog(LOG_NOTICE, "Master: fork() failed."); return(1); } else if (pid == 0) { syslog(LOG_NOTICE, "Child: forked."); if (execv("/_COD/pseudochild", dummyargs) < 0) { syslog(LOG_NOTICE, "Child: execv() failed."); _exit(1); } } waitpid(pid, & retstatus, 0); syslog(LOG_NOTICE, "Master: child done with %i.", retstatus); sleep(5); syslog(LOG_NOTICE, "Master: looping."); } } #include <syslog.h> #include <netdb.h> #include <unistd.h> #include <stdlib.h> int main( int argc, const char * argv[]) { struct hostent *hent; gid_t gidset[1]; uid_t uid = 100016; gid_t gid = 100016; gidset[0] = gid; syslog(LOG_NOTICE, "Pseudochild: started."); if (setgid(gid) != 0) { syslog(LOG_NOTICE, "Pseudochild: setgid() failed."); _exit(1); } if (setgroups(1, gidset) != 0) { syslog(LOG_NOTICE, "Pseudochild: setgroups() failed."); _exit(1); } if (setuid(uid) != 0) { syslog(LOG_NOTICE, "Pseudochild: setuid() failed."); _exit(1); } hent = gethostbyname("ALMba.local"); if ( hent != NULL ) { syslog(LOG_NOTICE, "Pseudochild: gethostbyname() returned %s\n", hent->h_name); return(0); } else { syslog(LOG_NOTICE, "Pseudochild: gethostbyname() failed."); return(1); } } <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>ServiceDescription</key> <string>Master test</string> <key>Label</key> <string>pseudomaster</string> <key>Disabled</key> <true/> <key>KeepAlive</key> <true/> <key>ProgramArguments</key> <array> <string>/_COD/pseudomaster</string> </array> </dict> </plist>
participants (1)
-
Axel Luttgens