MSI Systems Outage

April 6 Outage

Some MSI systems went down on the morning of Saturday, April 6, and some remain down. Here is what happened and what we know so far.

Early on Saturday morning, one of the three Uninterruptible Power Supply (UPS) systems protecting the MSI data center failed completely. As a result, much of the Mesabi and Mangi clusters lost power, as did the Stratus OpenStack system.

MSI systems staff were onsite by 9 a.m. Saturday and began recovery. Most of the day was spent rerouting power from other sources to restore the Stratus cluster and the most essential components of Mangi and Mesabi. We contacted the UPS vendor, but it had no on-call engineers available to look at our UPS.

As of Sunday evening, approximately half of the Mangi and Mesabi clusters remain down, including the Mangi login and GPU nodes. We have rerouted the default login address "login.msi.umn.edu" to the Mesabi cluster instead of Mangi so that users can still log in as normal. One third of the Stratus compute nodes remain offline, so we migrated nodes in order to restore the most critical Virtual Machines (VMs) for MIDB and GEMS efforts.

The Agate cluster is almost completely unaffected. We believe that most critical resources needed for the Open OnDemand service are also available, though there is a reduction in available interactive nodes.

MSI storage systems and core infrastructure are also largely unaffected from a user perspective, although these services are currently running on a single power feed rather than the usual dual feed.

We expect to hear back from the UPS vendor with an ETA for a technician, who will assess how long it will take to return the MSI data center to fully functioning power infrastructure.
