#18 closed bug (fixed)
Possible race condition in jobq daemon launch
Reported by: | Vincent Caron | Owned by: | zecrazytux
---|---|---|---
Priority: | major | Component: | jobq
Keywords: | | Cc: |
Description
Jobq needs at most one daemon instance per queue. It uses a pidfile but does not lock it, so there is a race condition between the time the pidfile is read, the pid is tested, and a new instance is possibly started.
Hints:
- Ugly hack: minimize the time/instructions between reading the pidfile and writing it if another instance was started
- Correct fix: lock the pidfile, but I'm not sure there are many portable locking primitives available from a shell (see the sketch below)
- We're kind of re-implementing start-stop-daemon or something similar that every distro probably has... I'd prefer to keep this simple shell script portable, but if the correct way is a Debian dependency, let's go.
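For illustration, one locking primitive that is atomic and available from nearly any shell is mkdir: the directory creation either succeeds or fails as a single operation. A minimal sketch of the idea (the helper name is hypothetical, this is not jobq code):

lockdir="$pidfile.lock"
if mkdir "$lockdir" 2>/dev/null; then
    # we own the lock: reading, testing and rewriting the pidfile is now race-free
    trap 'rmdir "$lockdir"' EXIT
    start_queue_runner_if_needed    # hypothetical helper
else
    # someone else is in the critical section: assume a runner is being started
    exit 0
fi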
Change History (17)
comment:1 Changed 11 years ago by
Status: | new → assigned
---|---
comment:2 Changed 11 years ago by
OK, could you:
- port run_daemon() to start-stop-daemon in source:/jobq/jobq
- add the Debian dep to source:/jobq/debian/control ?
Thanks.
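A rough idea of what the port could look like, as a sketch only: the options follow start-stop-daemon(8), but the surrounding function and variables are illustrative, not the actual jobq source.

run_daemon() {
    # start-stop-daemon checks the pidfile itself and refuses to start
    # a second instance, which closes the read/test/start race
    start-stop-daemon --start --quiet --background \
        --make-pidfile --pidfile "$pidfile" \
        --startas "$jobqpath" -- --runner "$queue"
}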
comment:3 Changed 11 years ago by
comment:4 Changed 11 years ago by
Owner: | changed from zecrazytux to Vincent Caron
---|---
Status: | assigned → new
I committed my changes.
Does that look correct to you?
comment:5 Changed 11 years ago by
-d/--daemon is an option meant to start a "queue runner daemon", and it is not really meant to be used directly (since as soon as you submit a job, jobq makes sure a queue runner is up). So -s/--startqueue looks like a dupe, and the whole thing might need a little bit of refactoring.
jobqpath="$(cd "$(dirname "$0")"; pwd)/$(basename $0)"
does not look very pretty, does it? Is it a "realpath" emulation? I could live without it.
comment:6 Changed 11 years ago by
comment:7 follow-up: 8 Changed 11 years ago by
Owner: | changed from Vincent Caron to zecrazytux
---|---
Ideas:
- start-stop-daemon handles the pidfile and thus the backgrounding
- the end user does not need an explicit way to start the queue runner since we do it for him at the most convenient time
Since, with --background, start-stop-daemon cannot properly check the return status of the daemon (it's asynchronous), I moved the "queue runner started ..." info message into queue_runner itself. Not much, but it at least lets us know whether something failed in the start-stop-daemon invocation or in the jobq --runner startup itself.
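Something along these lines, as a sketch (the function body is illustrative, not the actual commit):

queue_runner() {
    # with --background, start-stop-daemon has already returned by the time
    # we run, so successful startup is reported from inside the daemon itself
    echo "queue runner started for queue '$queue' (pid $$)"
    # ... main job-picking loop follows ...
}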
I did some tests and it looked fine, I'd love you to double-check it...
PS: there's a FIXME that's maybe one or two lines of signal handling trickery, but I think it's painful to test, although the feature would be very interesting. If you have time and still some sanity...
comment:8 Changed 11 years ago by
Owner: | changed from zecrazytux to Vincent Caron
---|---
> I did some tests and it looked fine, I'd love you to double-check it...
Looks fine to me too :)
> PS: there's a FIXME that's maybe one or two lines of signal handling trickery, but I think it's painful to test, although the feature would be very interesting. If you have time and still some sanity...
Fixed: send USR1 to the `sleep' pid, obtained with ps from its known ppid (the jobq pid stored in the pidfile).
comment:10 follow-up: 12 Changed 11 years ago by
Owner: | changed from Vincent Caron to sbo
---|---
> sleeppid=$(ps -o pid,cmd --ppid $(cat "$pidfile") --noheadings | awk '/sleep/{ print $1 }')
Yummy :)
My intuition was rather that you could send a signal to the queue runner itself to interrupt the current command. Checking the bash manual, "SIGNALS" section:
> If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes. When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.
Which lets me believe that something like
sleep & wait $!
could be interrupted by sending a HUP to the queue runner itself. Could you try it? Is it a bashism? Or only a Bournism?
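A minimal way to test it (hypothetical snippet, not jobq code; per the manual excerpt above, a trap must be set on the signal for wait to return early):

trap 'echo "got HUP"' HUP   # without a trap, HUP would simply kill the shell
sleep 60 &
wait $!
echo "wait returned with status $?"   # greater than 128 if interrupted

Sending HUP to that shell from another terminal should make wait return well before the 60 seconds elapse.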
comment:11 Changed 11 years ago by
Owner: | changed from sbo to zecrazytux
---|---
comment:12 Changed 11 years ago by
> My intuition was rather that you could send a signal to the queue runner itself to interrupt the current command. Checking the bash manual, "SIGNALS" section:
> If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes. When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.
> Which lets me believe that something like
> sleep & wait $!
> could be interrupted by sending a HUP to the queue runner itself.
Yes, it is another solution. TIMTOWTDI :]
But we need to add a dummy `trap' call (else the default behaviour is "exit"), and there is a "pollution of sleep processes" (killing the right process is better to me, but that's just a point of view :))
> Is it a bashism? Or only a Bournism?
Neither. This is a regular POSIX command that you will find in (at least) most shells on (at least) most Unix OSes :)
Do you want me to switch to a trap/wait mechanism?
comment:13 Changed 11 years ago by
> But we need to add a dummy `trap' call
There used to be a trap before we used start-stop-daemon; we can set one again. Let's trap USR1.
> there is a "pollution of sleep processes"
If you submit jobs faster than one every 5 sec, yes. Let's kill the sleep if it's still there:
sleep 5 &
wait $!
kill $! 2>/dev/null
A very cool thing would be a builtin sleep shell command, sigh.
The only thing is that it relies on the pidfile, but that is more reliable than scrubbing the process table. If there are several daemons, by accident or not, the pidfile arbitrates which daemon is the "one". If there's no pidfile, well... things are already badly broken :).
> Do you want me to switch to a trap/wait mechanism?
If it still makes sense to you, yes.
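For the record, a sketch of how those pieces could fit together in the queue runner loop (the helper name is hypothetical, this is not the actual jobq source):

trap ':' USR1              # dummy trap: the default USR1 disposition would exit the shell
while :; do
    process_pending_jobs   # hypothetical helper
    sleep 5 &              # fallback polling delay
    wait $!                # returns immediately (>128) when a submitter signals us
    kill $! 2>/dev/null    # reap the sleep if it is still running
done

On the submission side, waking the runner then boils down to:

kill -USR1 "$(cat "$pidfile")" 2>/dev/null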
comment:15 Changed 11 years ago by
Resolution: | → fixed
---|---
Status: | new → closed
comment:16 Changed 11 years ago by
I've played with it and it works very nicely: I can't fill the job queue by manually pushing jobs from the CLI now :).
I've noticed a production situation where there was a 'storm of jobq submissions', and the 5sec delays between job pickups would add up to a huge amount of lost time.
AFAIK, there are no shell internals for file locking. Locking with touch/cat/rm on *.lock files is IMHO not the solution: that's racy too.
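For instance, the classic test-then-create pattern has exactly the same read-then-act race as the unlocked pidfile (illustrative snippet, not jobq code):

# racy: another process can pass the same test before we create the file
if [ ! -f "$lockfile" ]; then
    touch "$lockfile"
    # ... critical section ...
    rm -f "$lockfile"
fi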
There is [lockfile-progs http://packages.debian.org/fr/lenny/lockfile-progs], but it looks Debian-centric, meaning that it is certainly about as 'portable' as start-stop-daemon is... or less.
Debian-based distros and Gentoo (at least) use start-stop-daemon. There's the idea of porting it to Fedora in their wishlist.
I think using start-stop-daemon is the way to go. People using a distro that doesn't ship it might take the script out of dpkg, or use the appropriate method for their distribution (this is a shell script after all, hacking it is 'easy').