A(n awful) day in the life of a System Administrator...

Today had to be one of the more annoying sequences of everyday events that administrators deal with.

  • To start off the day, I managed to get my new remote machines. Great! Er, why is everything dropping off every few minutes, and why is arp showing really odd things happening? Where are the machines going?
  • Oh, good. The new systems have a 4-minute hardware watchdog that keeps resetting them because Linux is not talking to it.
  • Now someone disabled ACPI on my new remote machines during the ‘watchdog reboot’, which killed the second CPU. I managed to fix that after wasting time believing that my custom kernel configuration, with some new security patches, may have caused the issue – after all, remote hands technicians never lie about (or forget) what they were doing. (With the stock SMP kernels, the CPUs would be inaccessible and throw a rather generic ACPI error, yet they’d still show up in the scheduler and in /proc/cpuinfo. Yay for inconsistencies within Linux!)
  • So, I finally get a KVM installed so I can fix ACPI and the watchdog timeout for good (and see what other BIOS settings are incorrect), only to FINALLY have the new Debian Security Update released, so I can revert to semi-stock configurations before these go live. I prefer to have more support than less, after all.
  • The machines are once again running with all CPUs, and not rebooting themselves fifteen times an hour, and I am presented with the joy of recompiling the custom kernels all over again.
  • Finally manage to get every machine globally updated, fixing PAE support for machines with over 4GB of RAM (this was implied before, and mentioned nowhere in the config, of course), and during the test run, the system refuses to come back. Finally get Java shoehorned onto a local Linux system so I can test it. The machine’s rebooted a total of 30 times, so it’s forcing itself to check the full drive for errors, but, no, everything’s fine. Finally.

...how was YOUR day?