Stray process with PGID equal to this dead job
Sorry for reopening a thread that was already dealt with, but I ran into a serious problem that I'm not able to work around in any way.

I have a Perl script (let's call it backupfather.pl) that calls a second Perl script (backupchild.pl) 15 times to initiate rsync-over-ssh backups from 15 different client hosts.

The caller script knows how to deal with open processes (keep them running for a designated time, kill them if the target host it fetches the backup from isn't reachable or the backup script is over time, don't launch the child script for a specific host if a process for that host is still running). Additionally, it knows how to deal with the child processes of backupchild.pl (rsync and ssh) by building up PID and parent-PID trees and killing all processes involved before launching again with the same arguments. So the process management itself is clean.

In order to start the backupchild.pl script 15 times I need to background them. They may run for several hours, so backupfather.pl has to be free to start the finished backups again after an hour, without restarting the ones that are still running.

I've tried several ways to get out of the launchctl process group prison that backupfather.pl is running in. setpgrp doesn't work on Mac OS X (that way it would be possible to let it run under root's process group ID). Forking and exiting backupchild.pl didn't help either, nor did creating 15 instances, one for each client backup, and launching them as necessary. As soon as it comes to backgrounding I'm stuck with:

"Stray process with PGID equal to this dead job: PID _pidnumber_ PPID _parentpidnumber_ perl"

Any ideas how to get the backupchild.pl processes out of the launchctl prison?
Thanks in advance for any help,
Johannes

PS: If you're wondering why I don't use Time Machine for backups: we used it until AppleFileServer started eating 100% CPU on the server, rendering the file service unusable for the regular AFP clients, while SSH at the same time ran at full speed with no problems. So splitting backup from file service was the way to make the file service usable again.
On Jul 27, 2009, at 11:26 AM, Johannes wrote:
Sorry for reopening a thread that was already dealt with, but I ran into a serious problem that I'm not able to work around in any way.
I have a Perl script (let's call it backupfather.pl) that calls a second Perl script (backupchild.pl) 15 times to initiate rsync-over-ssh backups from 15 different client hosts.
The caller script knows how to deal with open processes (keep them running for a designated time, kill them if the target host it fetches the backup from isn't reachable or the backup script is over time, don't launch the child script for a specific host if a process for that host is still running).
Additionally, it knows how to deal with the child processes of backupchild.pl (rsync and ssh) by building up PID and parent-PID trees and killing all processes involved before launching again with the same arguments.
So the process management itself is clean.
Why bother creating your own tree? You already have a process tree maintained by the kernel. "Clean" process management is having the parent kill and reap its children as necessary and allowing the semantics of POSIX parent-child relationships to create a chain of responsibility. So if your job sends SIGTERM to its immediate child, that child will, in turn, forward the SIGTERM to its children, wait for them to exit and then die accordingly.
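The chain of responsibility described in the reply above can be sketched in a few lines. This is only an illustration in Python, not the poster's Perl code: the helper names `terminate_and_reap` and `install_forwarder` are made up for this example, and the children stand in for backupchild.pl.

```python
import signal
import subprocess
import sys


def terminate_and_reap(children, timeout=10.0):
    """Send SIGTERM to every still-running child, then wait for each one."""
    for child in children:
        if child.poll() is None:          # child has not exited yet
            child.send_signal(signal.SIGTERM)
    codes = []
    for child in children:
        try:
            codes.append(child.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            # Assumption for this sketch: escalate to SIGKILL after the timeout.
            child.kill()
            codes.append(child.wait())
    return codes


def install_forwarder(children):
    """Make this process forward SIGTERM to its children before exiting."""
    def on_sigterm(signum, frame):
        terminate_and_reap(children)
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_sigterm)
```

A parent would start its children with `subprocess.Popen`, call `install_forwarder(children)`, and then simply wait on them. If every level of the tree does the same, a single SIGTERM sent to the top-level job propagates downward without any separately maintained PID tree.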
In order to start the backupchild.pl script 15 times I need to background them. They may run for several hours, so backupfather.pl has to be free to start the finished backups again after an hour, without restarting the ones that are still running.
I've tried several ways to get out of the launchctl process group prison that backupfather.pl is running in.
"launchctl process group prison"? I think you mean "POSIX standard behavior". Your child processes inherit a PGID equal to their parent's PID. launchd has nothing to do with this behavior; it's enforced by the kernel. launchd complains about this because your job has attempted to daemonize without calling setsid(2), setpgrp(2), etc.
setpgrp doesn't work on Mac OS X (that way it would be possible to let it run under root's process group ID).
Mac OS X conforms to POSIX, so setpgrp(2) does work. What errno does it set when you call it? Are you calling it from the child process with 0 as the first argument, or from the parent with the child's PID as the first argument? Also, please read setsid(2).
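To answer the errno question, the call has to be checked rather than assumed to have failed. Below is a hedged sketch in Python of the parent-side form, where the parent passes the child's PID as the first argument. Python exposes the two-argument call as `os.setpgid(pid, pgid)`; the helper names `try_setpgid` and `child_in_own_group` are made up for this illustration.

```python
import errno
import os


def try_setpgid(pid, pgid):
    """Call setpgid(2); return None on success, else the symbolic errno name."""
    try:
        os.setpgid(pid, pgid)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


def child_in_own_group():
    """Fork a child and have the parent move it into its own process group."""
    release_r, release_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: block until the parent closes its end of the pipe, then exit.
        try:
            os.close(release_w)
            os.read(release_r, 1)
        finally:
            os._exit(0)
    os.close(release_r)
    error = try_setpgid(pid, pid)         # parent-side form: child's PID first
    child_pgid = os.getpgid(pid)
    os.close(release_w)                   # let the child exit
    os.waitpid(pid, 0)
    return pid, child_pgid, error
```

The child-side form is the same call with zeros, `os.setpgid(0, 0)`, made from inside the child. Note that the parent-side form is only permitted before the child has called exec; after that it fails with EACCES, which `try_setpgid` would report by name.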
Forking and exiting backupchild.pl didn't help either, nor did creating 15 instances, one for each client backup, and launching them as necessary. As soon as it comes to backgrounding I'm stuck with:
"Stray process with PGID equal to this dead job: PID _pidnumber_ PPID _parentpidnumber_ perl"
Any ideas how to get the backupchild.pl processes out of the launchctl prison?
I would recommend making backupchild.pl into a launchd job that your main job kicks off by running `launchctl start`. If you need multiple instances of the job, you can create them on the fly with `launchctl submit`, and launchd will take care of the process management.
--
Damien Sorresso
BSD Engineering
Apple Inc.
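As a rough sketch of that recommendation, a driver script could build one `launchctl submit` command per client host. The example is in Python for illustration only; the label scheme, script path, and host name are hypothetical. It assumes the `submit -l label -- command [args]` form documented in the launchctl man page of that era.

```python
import subprocess


def submit_argv(label, command):
    """Build the argv for `launchctl submit -l <label> -- <command...>`."""
    return ["launchctl", "submit", "-l", label, "--", *command]


def start_argv(label):
    """Build the argv for `launchctl start <label>`."""
    return ["launchctl", "start", label]


def submit_backup(host, script="/path/to/backupchild.pl"):
    """Submit one backup job for a host (label and path are assumptions)."""
    label = f"com.example.backup.{host}"
    return subprocess.run(submit_argv(label, ["/usr/bin/perl", script, host]), check=True)
```

Each host gets its own label, so launchd tracks every backup as a separate job and the driver never has to background anything itself.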