Friday, February 18, 2005

E-mail I sent to my colleagues

At some point very late tonight (after midnight), Waite exhausted its memory and the web server crashed. I tried to restart the server, but it abended like crazy as it unloaded the NLMs, and then of course hung instead of rebooting.

So at this point, the portal is down, and no one can get to webmail. I feel really bad about this, especially since it's so soon after its official release, and we've been getting really positive feedback about it.
I considered driving there, but as soon as I got on the road I quickly reconsidered, as I observed my car was able to break traction on a straightaway using only the accelerator pedal (not on snow). I consider myself to be a fairly ballsy driver, but that's my cut-off :)

The problem does not appear to be associated with Extend (Portal Services). I think Apache2 (or some supporting module) is leaking memory. I found a TID that proves it's possible, even if it is a little out of date: . A short-term fix we can do is to load Apache2 in protected address space. I bet this will at least allow us to restart the server successfully despite the abends. Also, when we get the web power switch, we should move Waite down to the library server room so we can hook it up to that. Waite is now on the list of mission-critical servers (during off hours).

So whoever gets in first tomorrow, please reboot Waite.

