I’ve recently had problems with collectd crashing without notice on a server that aggregates a large volume of stats from ~20 nodes. Initially I set up a shell script to check whether it was up and restart it if not, but there’s a much more elegant solution in the form of collectdmon.
Its design is simple and quite elegant: collectdmon starts collectd with the -f flag, which makes collectd run in the foreground. If collectd exits for whatever reason, collectdmon notices (it is waiting for the child process to exit) and starts it back up. You can also send signals to collectdmon to restart or shut down the collectd process at any time.
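To make the idea concrete, here is a minimal shell sketch of the same supervise-and-restart loop. It is not how collectdmon is implemented (collectdmon is a separate program shipped with collectd); the function name supervise, the start cap, and the RESTART_DELAY variable are all made up for illustration, and the sketch does no signal forwarding.

```shell
#!/bin/sh
# Illustrative sketch only; this is NOT collectdmon's actual code.
# supervise MAX_STARTS COMMAND [ARGS...]
# Runs COMMAND in the foreground and starts it again each time it exits.
# MAX_STARTS is a hypothetical cap so the sketch terminates; a real
# supervisor would loop until it is told to stop.
supervise() {
    max_starts=$1
    shift
    starts=0
    while [ "$starts" -lt "$max_starts" ]; do
        starts=$((starts + 1))
        status=0
        "$@" || status=$?   # blocks here until the child exits
        echo "child exited with status $status (start $starts of $max_starts)" >&2
        # RESTART_DELAY is an assumed knob to avoid a tight restart loop.
        sleep "${RESTART_DELAY:-1}"
    done
    return 0
}
```

Something like `supervise 5 /usr/sbin/collectd -f -C "$CONFIG"` would then keep collectd in the foreground and bring it back up to five times. The -f flag matters here: without it collectd would daemonize, the foreground process would exit immediately, and the loop would see that as a crash.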
The only thing left to do is modify the init script to start collectd through collectdmon. On Red Hat I did this with the following modification:
diff -u etc/rc.d/init.d/collectd /etc/init.d/collectd
--- etc/rc.d/init.d/collectd 2008-10-14 05:15:29.000000000 +1100
+++ /etc/init.d/collectd 2008-10-10 00:27:17.000000000 +1100
@@ -25,7 +25,8 @@
 echo -n $"Starting $prog: "
 if [ -r "$CONFIG" ]
 then
- daemon /usr/sbin/collectd -C "$CONFIG"
+ daemon collectdmon -c /usr/sbin/collectd -P /var/run/collectdmon.pid -- -C "$CONFIG"
 RETVAL=$?
 echo
 [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog
@@ -33,7 +34,7 @@
 }
 stop () {
 echo -n $"Stopping $prog: "
- killproc $prog
+ killproc collectdmon
 RETVAL=$?
 echo
 [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog
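With the PID file written by the -P option above, signalling collectdmon by hand is straightforward. The helper below is a hypothetical convenience wrapper (signal_pidfile is not part of collectd). As far as I understand the collectdmon man page, SIGHUP makes it restart collectd, while SIGTERM or SIGINT make it stop collectd and exit; check the man page for your version before relying on that.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with collectd:
# send signal SIG to the process whose PID is stored in PIDFILE.
# signal_pidfile SIG PIDFILE
signal_pidfile() {
    sig=$1
    pidfile=$2
    if [ ! -r "$pidfile" ]; then
        echo "cannot read PID file: $pidfile" >&2
        return 1
    fi
    kill -s "$sig" "$(cat "$pidfile")"
}
```

For example, `signal_pidfile HUP /var/run/collectdmon.pid` should restart collectd, and `signal_pidfile TERM /var/run/collectdmon.pid` should shut both processes down, which is essentially what killproc collectdmon does in the stop function.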