[lug] monitoring jobs on linux

Davide Del Vento davide.del.vento at gmail.com
Fri Mar 16 12:30:32 MDT 2012

We have a server where users have shell access, and they usually
submit nohupped background jobs (or cron jobs). I would like to
monitor what users are doing: at the bare minimum, how long the jobs
last on average and what the distribution looks like. Better yet if I
can get more details, such as when those jobs run (e.g. does the
distribution change over the weekend? Is any particular user doing
something very different from the others?). I am particularly
interested in long-running jobs, so sampling would work fine, even
at low frequency (e.g. once every 1-10 minutes).

None of this is rocket science: filtering the output of ps from a
cron job every 5 minutes or so would do the trick. However, I don't
want to do this myself, since there are many small details that would
make this a serious project and not a quick test to collect some data
to slap on a manager's desk. For example: what if the PID rolls over? What
about spawned processes? I care only about the "top level" jobs
submitted by the user, so if the system runs only a single 10-hour
bash script calling ten 1-hour things, I want an easy way to find the
information I want, which is "the average running time is 10 hours",
and not the quick answer "the average running time is 1.8 hours"
(since there have been one 10h + ten 1h processes running, i.e. 20
hours over 11 processes). Again, since ps can report parent-child
relationships, this is possible.
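
To make the "top level" idea concrete, here is a minimal sketch in
Python of the ps filtering described above. It is only an illustration
under stated assumptions, not a finished tool: it assumes a procps-ng
ps that supports the "etimes" column (elapsed seconds), and it treats a
process as top level when its parent is missing from the snapshot or is
owned by a different user (which is what a nohupped job reparented to
init looks like). The names Proc, take_snapshot, parse_ps, top_level
and mean_hours are made up for this example.

```python
import subprocess
from collections import namedtuple

# One row of a ps snapshot; etimes is the elapsed run time in seconds.
Proc = namedtuple("Proc", "pid ppid user etimes cmd")

# Assumption: a procps-ng ps that understands the "etimes" column.
PS_CMD = ["ps", "-eo", "pid=,ppid=,user=,etimes=,args="]


def take_snapshot():
    """Run ps once and return its raw text (e.g. called from cron)."""
    result = subprocess.run(PS_CMD, capture_output=True, text=True, check=True)
    return result.stdout


def parse_ps(text):
    """Parse ps output into a dict mapping pid -> Proc."""
    procs = {}
    for line in text.splitlines():
        fields = line.split(None, 4)  # keep the command line in one piece
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        pid, ppid, user, etimes, cmd = fields
        procs[int(pid)] = Proc(int(pid), int(ppid), user, int(etimes), cmd)
    return procs


def top_level(procs, user):
    """Return the user's processes whose parent is absent or not theirs."""
    tops = []
    for p in procs.values():
        if p.user != user:
            continue
        parent = procs.get(p.ppid)
        if parent is None or parent.user != user:
            tops.append(p)
    return tops


def mean_hours(rows):
    """Average elapsed time in hours over Proc rows (0.0 if there are none)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(p.etimes for p in rows) / len(rows) / 3600.0
```

A cron entry could call parse_ps(take_snapshot()) every few minutes and
log the result of top_level() per user. Two caveats with this sketch:
etimes is the run time at the moment of sampling, so a job's final
duration is only known from the last sample in which it appears, and an
interactive login shell owned by the user also counts as top level, so
jobs started from it would be hidden behind the shell until the user
logs out. PID rollover is not handled here; one possible approach is to
identify a job by its PID together with its start time.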

But instead of reinventing the wheel, I'm wondering if such a tool
exists (maybe within Nagios and/or Ganglia, which are already running
on the system - I can just go to the system administrators and ask for
what I need). I didn't find anything on Google, but that's probably
because I am not a system administrator, so I asked the "wrong"
question (and Google is not smart enough to accept very elaborate
queries like this by email :-) ).

