Long service outage on 26-May-15

At about 8:00 a.m. on Tuesday, 26-May-15, the construction crew on 18th Ave to the south of Newman & Wolfrom accidentally cut the power to NW and Evans. The departmental server room NW2109 lost all power for about 15 minutes. Unfortunately, one of the UPS units drained completely before power was restored.

When we attempted to restart the servers in that room, the hosted virtual servers on two of the three virtualization hosts were unable to see their disk storage. After several hours of work, we threw in the towel and started restoring servers to the remaining functional virtualization host. This process is imperfect because any given virtualization host can only hold so many servers. As of now we have restored essential services, but we will need to work through the rest of the day to get back to full production capacity.

UPDATE:  26-May-15 @ 7:00 p.m.

We isolated the problem and were able to bring all services back online. Details: The UPS supporting the SAN switch ran out of battery charge before anything else in the server room; in fact, it was the only UPS that went dead during the power outage this morning. Two of our three ESXi boxes were in a blocked I/O state, and that state was determined by the management cluster, not by the individual hosts. (Since the SAN switch went down while everything else stayed up, the management cluster was trying to protect the storage arrays on the SAN.) Therefore, even restarting the ESXi boxes would not fix the problem. We needed to clear the block from the management cluster; specifically, we needed to re-add the I/O channels to the storage arrays for those two ESXi hosts. Once we finally learned that this was the fix (shout-out to our recently-departed-but-still-answering-our-emails virtualization engineer Deric Crago), we were able to bring all ESXi hosts back online, and from there we had the capacity to bring all the hosted VMs back online as well.
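
For the curious, here is a rough sketch of what "re-adding the I/O channels" amounts to. It is a hypothetical illustration using the pyVmomi vSphere SDK, not the exact procedure we ran: it connects to vCenter and asks each ESXi host to rescan its storage adapters and VMFS volumes so that paths to the SAN arrays are rediscovered. The vCenter address, credentials, and function name below are placeholders.

    # Hypothetical sketch (pyVmomi): after the SAN switch is back, ask each
    # ESXi host to rescan its storage adapters and VMFS volumes so the paths
    # to the SAN storage arrays are rediscovered.  Placeholders throughout.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    VCENTER = "vcenter.example.edu"            # placeholder address
    USER = "administrator@vsphere.local"       # placeholder account
    PASSWORD = "********"

    def rescan_all_hosts():
        ctx = ssl._create_unverified_context()  # lab-only: skip cert checks
        si = SmartConnect(host=VCENTER, user=USER, pwd=PASSWORD, sslContext=ctx)
        try:
            content = si.RetrieveContent()
            view = content.viewManager.CreateContainerView(
                content.rootFolder, [vim.HostSystem], True)
            for host in view.view:
                storage = host.configManager.storageSystem
                print("Rescanning storage on %s ..." % host.name)
                storage.RescanAllHba()   # rediscover paths on every HBA
                storage.RescanVmfs()     # pick up the VMFS datastores again
        finally:
            Disconnect(si)

    if __name__ == "__main__":
        rescan_all_hosts()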