Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#18 closed bug (fixed)

Possible race condition in jobq daemon launch

Reported by: vcaron Owned by: sbocahu
Priority: major Component: jobq
Keywords: Cc:

Description

Jobq needs at most one daemon instance per queue. It uses a pidfile but does not lock it, so there is a race condition between the time this pidfile is read, the pid tested and a new instance eventually started.

Hints:

  • Ugly hack: minimize time/instructions between pidfile reading and pidfile writing if another intsance was stared
  • Correct fix: lock pidfile, but I'm not sure there's a lot of portable locking primtives from a shell
  • We're kind of re-implemting start-stop-daemon or something like every distro might have... I'll prefer to keep this simple shell script portable, but if the correct way is a Debian dependency, let's go.

Change History (17)

comment:1 in reply to: ↑ description Changed 7 years ago by sbocahu

  • Status changed from new to assigned
  • Correct fix: lock pidfile, but I'm not sure there's a lot of portable locking primtives from a shell

AFAIK, There's no internals for file locking. Locking using touch/cat/rm *.lock is IMHO not the solution: that's racy too.

There is [lockfile-progs http://packages.debian.org/fr/lenny/lockfile-progs] but it looks like debian-centric, meaning that it is certainly as 'portable' as start-stop-daemon is... or less

  • We're kind of re-implemting start-stop-daemon or something like every distro might have... I'll prefer to keep this simple shell script portable, but if the correct way is a Debian dependency, let's go.

Debian-based and gentoo (at least) uses start-stop-daemon. There's the idea of porting it to fedora in their wishlist.

I think using start-stop-daemon is the way to go. People using a distro that don't distribute it might take the script out from dpkg or use the appropriate method for their distribution (this is a shell script after all, hack is 'easy')

comment:2 Changed 7 years ago by vcaron

OK, could you:

Thanks.

comment:3 Changed 7 years ago by zecrazytux

(In [525]) Use `start-stop-daemon' instead of reinventing the wheel, see #18

comment:4 Changed 7 years ago by sbocahu

  • Owner changed from sbocahu to vcaron
  • Status changed from assigned to new

I commited my changes.

Is that correct to you ?

comment:5 Changed 7 years ago by vcaron

-d/--daemon is an option meant to start a "queue runner daemon", and is also not meant to be really used directly (since as soon as you submit a job, jobq makes sure a queue runner is up). So -s/--startqueue looks like a dupe and the whole thing might need a little bit of refactoring.

jobqpath="$(cd "$(dirname "$0")"; pwd)/$(basename $0)" does not look very pretty, isn't it ? Is it a "realpath" emulation ? I could live without it.

comment:6 Changed 7 years ago by zerodeux

(In [527]) Refactor the --daemon/start-stop-daemon stuff a bit, hopefully making some sense out of it - see #18

comment:7 follow-up: Changed 7 years ago by vcaron

  • Owner changed from vcaron to sbocahu
I actually kept the -sstartqueue option, renamed as -rrunner. I removed the -ddaemon one which did not make any sense.

Ideas:

  • start-stop-daemon handles the pidfile and thus the backgrounding
  • the end user does not need an explicit way to start the queue runner since we do it for him at the most convenient time
-rrunner starts queue_runner in the foreground which makes for a handy debug mode, hence I removed the -xdebug option

Since with --background start-stop-daemon cannot properly check the return status of the daemon (it's asynchronous), I moved the "queue runner started ...." info into queue_runner itself. Not much, but at least let us know if something failed in start-stop-daemon invocation or in jobq --runner startup itself.

I did some tests and it looked fine, I'd love you to double-check it...

PS: there's a FIXME that's maybe one or two lines of signal handling trickery but I think it's painful to test although the feature would be very interesting. If you have time and still some sanity...

comment:8 in reply to: ↑ 7 Changed 7 years ago by sbocahu

  • Owner changed from sbocahu to vcaron

I did some tests and it looked fine, I'd love you to double-check it...

Looks fine to me too :)

PS: there's a FIXME that's maybe one or two lines of signal handling trickery but I think it's painful to test although the feature would be very interesting. If you have time and still some sanity...

Fixed: send USR1 to the `sleep' pid, got with ps from its known ppid (jobq pid stored in the pidfile)

comment:9 Changed 7 years ago by sbocahu

I forget the "see #18" trick in the commit message... see 528

comment:10 follow-up: Changed 7 years ago by vcaron

  • Owner changed from vcaron to sbo
sleeppid=$(ps -o pid,cmd --ppid $(cat "$pidfile") --noheadings | awk '/sleep/{ print $1 }') 

Yummy :)

My inuitition was rather that you could send a signal to the queue runner itself to interrupt the current command. Checkin "bash manual", "SIGNALS" chapter:

If bash is waiting for a command to complete and receives a signal for  which  a  trap
has been set, the trap will not be executed until the command completes.  When bash is
waiting for an asynchronous command via the wait builtin, the reception  of  a  signal
for  which  a trap has been set will cause the wait builtin to return immediately with
an exit status greater than 128, immediately after which the trap is executed.

Which let me believe something like sleep & wait $! could be interrupted by sending a HUP to the queue runner itself. Could you try it ? Is it a bashism ? Or only a bournism ?

comment:11 Changed 7 years ago by vcaron

  • Owner changed from sbo to sbocahu

comment:12 in reply to: ↑ 10 Changed 7 years ago by sbocahu

My inuitition was rather that you could send a signal to the queue runner itself to interrupt the current command. Checkin "bash manual", "SIGNALS" chapter:

If bash is waiting for a command to complete and receives a signal for  which  a  trap
has been set, the trap will not be executed until the command completes.  When bash is
waiting for an asynchronous command via the wait builtin, the reception  of  a  signal
for  which  a trap has been set will cause the wait builtin to return immediately with
an exit status greater than 128, immediately after which the trap is executed.

Which let me believe something like sleep & wait $! could be interrupted by sending a HUP to the queue runner itself.

Yes, it is another solution. TIMTOWTDI :]

But we need to had a dummy `trap' call (else the default behaviour is "exit") and there is a "polution of sleep processes" (killing the right process is better to me, but that's just a point of view :))

Is it a bashism ? Or only a bournism ?

Neither. This is a POSIX regular command, that you may find in (at least) most of shells on (at least) most Unix OS :)

Do you want me to switch to a trap/wait mechanism ?

comment:13 Changed 7 years ago by vcaron

But we need to had a dummy `trap' call

There used to be a trap before we used start-stop-daemon, we can set it one again. Let's trap USR1.

there is a "polution of sleep processes"

If you submit jobs faster than one every 5sec, yes. Let's kill the sleep if it's there:

sleep & wait $!
kill $! 2>/dev/null

A very cool thing would be a builtin sleep shell command, sigh.

The only thing is that it relies on the pidfile which is more reliable than scrubbing the process table. If there are several daemon, by accident or not, the pidfile arbitrate which daemon is the "one". If there's no pidfile, well... things are already badly broken :).

Do you want me to switch to a trap/wait mechanism ?

If it still makes sense to you, yes.

comment:14 Changed 7 years ago by zecrazytux

(In [533]) Use wait' & trap', see #18

comment:15 Changed 7 years ago by sbocahu

  • Resolution set to fixed
  • Status changed from new to closed

comment:16 Changed 7 years ago by vcaron

I've played with it, it works very nicely, I can't fill the jobqueue by manually pushing jobs from the cli now :).

I've noticed a production situation where there was a 'storm of jobq submissions', and the 5sec delays between job pickups would add up to a huge amount of lost time.

comment:17 Changed 7 years ago by zerodeux

(In [534]) Bump queue pickup timer from 5 to 60sec, now that we can wake up the queue runner ASAP (laptop friendly, etc) - see #18

Note: See TracTickets for help on using tickets.