SGP1 - Power/availability issue for single rack

Incident Report for Teraswitch

Postmortem

On May 17, 2025, the facility provider for Teraswitch SGP1 (Equinix SG3) began scheduled maintenance on the B-side power bus for select data halls. This maintenance was expected to be non-service impacting to Teraswitch, as all our customer equipment and critical infrastructure are powered by redundant PSUs connected to diverse rack PDUs and facility power buses.

At 10:00am UTC, the Equinix scheduled maintenance window began.

At 01:38pm UTC, Teraswitch monitoring alerted to degraded power redundancy on SGP1 systems, which was expected given the scheduled maintenance.

At 02:25pm UTC, Teraswitch monitoring provided the earliest indication of an issue affecting a single SGP1 compute rack.

By 02:30pm UTC, additional alerts clearly indicated that this rack and associated addressing had gone offline.

At 02:31pm UTC, Teraswitch engineering declared a likely power loss in the rack and contingency operations began. A Teraswitch technician located in Singapore was promptly dispatched to Equinix SG3, and a P1 smart hands ticket was opened with Equinix at 02:56pm UTC.

At 03:28pm UTC, Equinix smart hands reported that the Bank 2 breaker on one of the PDUs in the affected rack had tripped, and all devices connected to it had consequently lost power.  This suggested that an individual bank on the A-side PDU had been overloaded while the B-side power bus was offline for maintenance.

At 03:46pm UTC, the Teraswitch technician arrived onsite and commenced work after clearing security.

  • After further investigation, it was confirmed that equipment in the rack had not been properly distributed across the PDU power banks. This had not previously caused an issue because load is split across the A-side and B-side PDUs under normal operating conditions. During the facility B-side maintenance, however, the entire rack load was transferred to the A-side PDU, and Bank 2 tripped due to the improper distribution. This caused a loss of power to much of the equipment in this rack.
  • Some customer equipment was connected to Bank 1 and did not lose power. However, Teraswitch network equipment was connected to Bank 2, causing a loss of network connectivity to all customer equipment in the rack regardless of power status.
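
The failure mode described above comes down to simple arithmetic: if a dual-corded device draws roughly half its power from each feed when both are live, it draws all of it from the surviving feed when one is offline. The sketch below illustrates this with made-up numbers. The device wattages, the 230 V supply, the 16 A breaker rating, the even A/B split, and the names `bank_load_amps` and `BREAKER_AMPS` are all assumptions for illustration and are not taken from the SGP1 rack.

```python
# Hypothetical figures for illustration only; none come from the incident.
VOLTS = 230.0          # assumed single-phase supply voltage
BREAKER_AMPS = 16.0    # assumed per-bank breaker rating

# Watts drawn by each dual-corded device, grouped by the A-side bank it is plugged into.
a_side_banks = {
    "bank1": [300.0, 300.0, 300.0],
    "bank2": [700.0, 700.0, 700.0, 700.0, 700.0, 700.0],
}

def bank_load_amps(device_watts, share):
    """Current on one bank when each device draws `share` of its power from this PDU."""
    return sum(device_watts) * share / VOLTS

for bank, watts in a_side_banks.items():
    normal = bank_load_amps(watts, 0.5)    # both feeds live: load assumed split evenly
    failover = bank_load_amps(watts, 1.0)  # B-side offline: A-side carries everything
    status = "TRIPS" if failover > BREAKER_AMPS else "ok"
    print(f"{bank}: normal {normal:.1f} A, failover {failover:.1f} A -> {status}")
```

With these invented numbers, bank2 sits comfortably below the breaker rating while both feeds are live, but exceeds it once the full load lands on a single PDU, which is why an uneven layout can go unnoticed until a feed is taken offline.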

The Teraswitch technician redistributed power connections across the PDU banks on both A and B-side PDUs to avoid an overload of any one bank, and reset the Bank 2 breaker.
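
One simple way to plan such a redistribution is a greedy pass: sort devices by power draw, largest first, and assign each to whichever bank currently carries the least load. This is a sketch of that heuristic under stated assumptions, not the procedure the technician followed; the device names, wattages, and the function `balance_banks` are hypothetical.

```python
import heapq

def balance_banks(device_watts, bank_names):
    """Greedy largest-first assignment of devices to the least-loaded bank."""
    # Heap entries are (current_load_watts, bank_name); the lightest bank pops first.
    heap = [(0.0, name) for name in bank_names]
    heapq.heapify(heap)
    plan = {name: [] for name in bank_names}
    for device, watts in sorted(device_watts.items(), key=lambda kv: kv[1], reverse=True):
        load, bank = heapq.heappop(heap)
        plan[bank].append(device)
        heapq.heappush(heap, (load + watts, bank))
    totals = {bank: load for load, bank in heap}
    return plan, totals

# Hypothetical rack contents (watts per device).
devices = {"switch-1": 350.0, "server-1": 700.0, "server-2": 700.0,
           "server-3": 650.0, "server-4": 600.0, "server-5": 300.0}

plan, totals = balance_banks(devices, ["bank1", "bank2"])
for bank in sorted(plan):
    print(bank, totals[bank], plan[bank])
```

A plan produced this way would still need to be checked against the breaker rating at full single-feed load, and it ignores physical constraints such as cord length and outlet type.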

At 04:25pm UTC, Teraswitch monitoring alerted to the final recovery of rack network equipment following power redistribution. Customer services recovered within a few minutes of that time.

Teraswitch has taken several measures in response to this incident:

  • In the affected SGP1 rack, power cabling was redistributed properly across the PDU banks, so that rack should not be impacted in subsequent power events.
  • Teraswitch engineering evaluated PDU models in use globally that divide power capacity into banks, and added monitoring to flag similar cases where an individual bank could be improperly loaded and its breaker could trip in a future power event.
  • As a result of the increased monitoring, Teraswitch staff identified two other racks in our global network that are susceptible to bank overload in their current configuration.  Internal work tickets have been created to redistribute the power cabling in these racks to avoid any impact in a future power event.
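
A check of this kind can be sketched as a sweep over per-cord current readings: for every bank, add up the total draw (A-side plus B-side) of each device plugged into it, and flag the bank if that projected single-feed load exceeds the breaker rating. This is only an illustration of the idea, not Teraswitch's actual monitoring; the `Cord` record, the `at_risk_banks` function, the 16 A rating, and all rack names and readings are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Cord:
    """One dual-corded device: where each cord lands and what it currently draws."""
    rack: str
    a_bank: str
    b_bank: str
    a_amps: float
    b_amps: float

def at_risk_banks(cords, breaker_amps):
    """Return {(rack, side, bank): projected_amps} for banks over the breaker rating on a single feed."""
    projected = defaultdict(float)
    for c in cords:
        total = c.a_amps + c.b_amps  # whole device load lands on the surviving feed
        projected[(c.rack, "A", c.a_bank)] += total
        projected[(c.rack, "B", c.b_bank)] += total
    return {key: amps for key, amps in projected.items() if amps > breaker_amps}

# Hypothetical readings; rack names and currents are invented.
readings = [
    Cord("rack-01", "bank2", "bank1", 3.0, 3.0),
    Cord("rack-01", "bank2", "bank2", 3.1, 2.9),
    Cord("rack-01", "bank2", "bank1", 3.0, 3.0),
    Cord("rack-01", "bank1", "bank2", 1.0, 1.0),
    Cord("rack-02", "bank1", "bank1", 2.0, 2.0),
    Cord("rack-02", "bank2", "bank2", 2.0, 2.0),
]

for key, amps in sorted(at_risk_banks(readings, breaker_amps=16.0).items()):
    print(key, round(amps, 1))
```

Projecting from both cords matters because a bank can look lightly loaded in normal operation while still being unable to carry its devices alone.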

We apologize for any inconvenience caused by this incident and thank you for your continued support as a Teraswitch customer.

If you have any questions or concerns, feel free to reach out to Teraswitch Support at support@teraswitch.com.

Posted May 19, 2025 - 18:43 UTC

Resolved

This incident has been resolved. A root cause analysis will be posted shortly.
Posted May 17, 2025 - 20:32 UTC

Monitoring

The power issue affecting a single compute rack in SGP1 has been resolved - all affected customer systems should be back online at this time. Teraswitch staff are monitoring for stability but we do not expect further impacts.
Posted May 17, 2025 - 16:52 UTC

Identified

Teraswitch has isolated the issue to loss of power on a single compute rack in SGP1, causing downtime for bare metal customers in that rack. Customers in other racks at SGP1 are unaffected at this time. Teraswitch has engaged the facility to identify the cause and provide immediate restoration. We will provide an update once service has been restored.
Posted May 17, 2025 - 15:05 UTC

Investigating

Teraswitch is investigating a power loss event affecting at least one compute rack inside SGP1.
Posted May 17, 2025 - 14:46 UTC
This incident affected: SGP1 - Singapore.