[lug] monitoring jobs on linux
Davide Del Vento
davide.del.vento at gmail.com
Fri Mar 16 13:16:17 MDT 2012
This in fact is what we will do (for other reasons).
However, we would like to have that information on the running jobs
*before* we can have the scheduler installed and configured and the
users (they are several hundreds!) trained and convinced to use it.
Any other ideas?
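For reference, the ps-in-cron approach from the original message quoted below could be sketched roughly like this. It is only a sketch, and several things in it are assumptions rather than anything settled in this thread: the column list passed to ps (including `etimes`, elapsed seconds, which needs a reasonably recent procps), the helper names (`sample_ps`, `parse_ps`, `top_level`, `mean_hours`), and above all the heuristic that a "top level" job is a process whose parent is not owned by the same user (which is what a nohupped or cron job tends to look like). PID rollover is not handled here; a real collector would probably key records on PID plus start time.

```python
import subprocess
from collections import namedtuple

# One row of ps output. "etimes" is elapsed run time in seconds.
Proc = namedtuple("Proc", "pid ppid user etimes comm")


def sample_ps():
    """Take one snapshot of all processes (intended to be run from cron)."""
    cmd = ["ps", "-e"]
    for col in ("pid", "ppid", "user", "etimes", "comm"):
        cmd += ["-o", col + "="]  # a trailing "=" suppresses the column header
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_ps(text):
    """Turn the text produced by sample_ps() into a list of Proc records."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        pid, ppid, user, etimes, comm = parts
        procs.append(Proc(int(pid), int(ppid), user, int(etimes), comm))
    return procs


def top_level(procs, user):
    """Assumed heuristic: keep a user's processes whose parent is missing
    or belongs to somebody else (e.g. init after nohup, or cron)."""
    by_pid = {p.pid: p for p in procs}
    jobs = []
    for p in procs:
        if p.user != user:
            continue
        parent = by_pid.get(p.ppid)
        if parent is None or parent.user != user:
            jobs.append(p)
    return jobs


def mean_hours(procs):
    """Average elapsed time of the given processes, in hours."""
    return sum(p.etimes for p in procs) / len(procs) / 3600.0


# Made-up records matching the example in the quoted message: one 10-hour
# script (PID 100) that ran ten 1-hour steps (PIDs 101-110).
DEMO = "\n".join(
    ["1 0 root 90000 init", "100 1 alice 36000 run_all.sh"]
    + ["%d 100 alice 3600 step" % pid for pid in range(101, 111)]
)

procs = parse_ps(DEMO)
alice_all = [p for p in procs if p.user == "alice"]
print(round(mean_hours(top_level(procs, "alice")), 1))  # top-level jobs only
print(round(mean_hours(alice_all), 1))                  # every process counted
```

On the made-up data the first number is the 10 hours I am after and the second is the misleading 1.8 hours. The records in DEMO stand for final run times accumulated over many samples; a single snapshot would of course only show the steps that are running at that moment.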
On Fri, Mar 16, 2012 at 12:37, Will Sterling <will.sterling at gmail.com> wrote:
> Install a job scheduler then have your users submit their jobs using
> the scheduler. You will then be able to run canned reports for all
> kinds of info you never knew you were missing.
> On Mar 16, 2012, at 12:31 PM, Davide Del Vento
> <davide.del.vento at gmail.com> wrote:
>> we have a server where users have shell access, and they usually
>> submit nohupped background jobs (or cron jobs). I would like to
>> monitor what users are doing. At the bare minimum how long the jobs
>> last on average and what the distribution looks like. Better yet if I
>> can get more details, such as when those jobs run (e.g. is the
>> distribution changing during the weekends? is there any particular
>> user doing something very different from the others? etc.) I am particularly
>> interested in long-running stuff, so a sampling would work fine, even
>> at low frequency (e.g. 1-10 minutes)
>> None of this is rocket science: filtering the output of ps from a
>> cron job every 5 minutes or so would do the trick. However I don't want to
>> do this myself, since there are many small details that would make
>> this a serious project and not a quick test to collect some data to
>> slap on a manager's desk. For example: what if PID rolls over? What
>> about spawned processes? I care only about the "top level" jobs
>> submitted by the user, so if in the system there is only a single
>> 10-hour bash script calling 10 1-hour things, I want an easy way to
>> be able to find the information I want which is "the average running
>> time is 10 hours", and not the quick answer "the average running time
>> is 1.8 hours" (since there have been 1 10h + 10 1h processes running).
>> Again, since ps can do some parent-child stuff this is possible....
>> But instead of reinventing the wheel, I'm wondering if such a tool
>> exists (maybe within Nagios and/or Ganglia which are already running
>> on the system - I can just go to the system administrators and ask for
>> what I need). I didn't find anything on Google, but that's probably
>> because I am not a system administrator so I asked the "wrong"
>> question (and Google is not smart enough to accept very elaborate
>> queries like this by email :-)
>> Web Page: http://lug.boulder.co.us
>> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety