collectdmon for a crashing collectd

I’ve recently been having problems with collectd crashing without notice on a server that aggregates a large volume of stats from ~20 nodes. Initially I set up a shell script to check whether collectd was up and restart it if not, but there’s a much more elegant solution in the form of collectdmon.

Its design is really simple: collectdmon starts collectd with the -f flag, which keeps collectd running in the foreground. If collectd exits for whatever reason, collectdmon notices (because it’s waiting for the child process to exit) and starts it back up. You can also send signals to collectdmon to restart or shut down the collectd process at any time.
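
To make the idea concrete, here is a rough shell sketch of the same wait-and-restart loop. It is not collectdmon’s actual code, just an illustration of the pattern described above. The supervise function, the restart cap and the RESTART_DELAY variable are names I made up for this sketch, and it leaves out the signal handling entirely.

```shell
#!/bin/sh
# Sketch of a "run in the foreground, restart on exit" loop.
# Assumptions (not taken from collectdmon): the function name, the restart
# cap and the RESTART_DELAY variable exist only for illustration.

supervise() {
    max_restarts=$1   # how many times to restart before giving up
    shift             # the remaining arguments are the command to run
    restarts=0

    while :; do
        "$@"          # runs in the foreground and blocks until it exits
        status=$?
        echo "child exited with status $status" >&2

        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts" >&2
            return "$status"
        fi

        restarts=$((restarts + 1))
        sleep "${RESTART_DELAY:-1}"   # short pause so a broken child can't spin
    done
}
```

With that function loaded, something like supervise 5 /usr/sbin/collectd -f -C /etc/collectd.conf would keep collectd running through up to five crashes (the config path here is only an example). The cap is just there to keep the sketch from looping forever on a broken config.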

The only thing left to do is modify the init script so that it starts collectd through collectdmon and stops collectdmon rather than collectd. On Red Hat I did this with the following modification:

diff -u etc/rc.d/init.d/collectd /etc/init.d/collectd 
--- etc/rc.d/init.d/collectd    2008-10-14 05:15:29.000000000 +1100
+++ /etc/init.d/collectd        2008-10-10 00:27:17.000000000 +1100
@@ -25,7 +25,8 @@
        echo -n $"Starting $prog: "
        if [ -r "$CONFIG" ]
        then
-               daemon /usr/sbin/collectd -C "$CONFIG"
+               daemon collectdmon -c /usr/sbin/collectd -P /var/run/collectdmon.pid -- -C "$CONFIG"
                RETVAL=$?
                echo
                [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog
@@ -33,7 +34,7 @@
 }
 stop () {
        echo -n $"Stopping $prog: "
-       killproc $prog
+       killproc collectdmon
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog