Unexpected Downtime


FH-Dave
09-21-2003, 09:26 AM
At approximately 9:44 AM local time (EDT, GMT -4:00), a problem affected part of Internap's facility, including us. By around 10:15 AM local time, most servers were responding normally again. We are still waiting for the rest of the servers to come back up and will update you along the way.

If you have any questions or concerns, please email support[at]fluidhosting.com.

FH-Dave
09-21-2003, 09:48 AM
All servers should be up, except the VPS Control Panel server (which is also ns4.fluidhosting.com).

To VPS customers: the VPSes are being brought up one by one.

FH-Dave
09-21-2003, 09:41 PM
We received this email from Internap at 11:58 AM. Unfortunately, I was out of town at the time, so my apologies for only updating you now.


Subject:

Hello,

The Internap NOC has received additional information regarding the BSN outage experienced by several customers between approximately 09:45 and 10:16 EDT this morning, 9/21/03. This outage was related to a power failure which led to customers with colocated equipment in the BSN facility losing power during the outage window.

This morning at the BSN PNAP, electricians were performing preventive maintenance on the UPSA & UPSB power systems with the intent of rolling customers to battery backup power and then back to the UPS once the work was complete. When power was rolled back to UPSA and UPSB, the Maintenance Isolation Breaker tripped, causing loss of power to UPSA. Customers with power homed solely to UPSA lost power to equipment at this point, at approximately 09:45 EDT. Once the outage occurred, the Internap NOC contacted the BSN facility and began to troubleshoot the outage in order to restore customer connectivity as quickly as possible. Power was restored to the UPSA feed at 10:16 EDT, at which time all affected customers were restored. Electricians are currently investigating the cause of the Maintenance Isolation Breaker tripping, and we will forward a more complete RFO report to our customers after a full analysis of what occurred has been completed.

Unfortunately, customers in BSN were not notified of this maintenance in advance. The NOC is currently investigating why BSN customers were not notified of this event via our normal notification procedure, and will also pass on that information, as well as what steps will be taken to prevent this sort of miscommunication in the future.

If you have any questions or concerns regarding this issue, please email [Internap email contact deleted], or call [Internap phone contact deleted] and reference ticket 128019.

Thank you,
--------------------------------------------------------------------------
[Internap contact deleted]


All of our shared hosting, reseller, colo and dedicated servers were up and running again within 5-10 minutes of power being restored. The VPS servers suffered the most downtime: they were back up roughly 30 minutes after power was restored (most likely because of the longer time it took to fsck the servers). On top of that, each VPS had to be rebooted/restarted one by one, so some VPSes were brought up considerably later than others.

A refund of 5-10% in accordance with our SLA (fluidhosting.com/sla.php) will be given to VPS customers only, as this outage has already put us below our uptime guarantee for September. Please contact billing[at]fluidhosting.com to request a refund credit adjustment.
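
For a rough sense of the numbers, VPS uptime for September can be estimated from the times quoted above. This is only a sketch under stated assumptions: the 61-minute downtime figure (about 31 minutes without power plus roughly 30 minutes of recovery) is an approximation, individual VPSes came back at different times, and the function name is made up for illustration. The actual guarantee threshold and refund tiers are defined in the SLA and are not reproduced here.

```python
# Rough estimate of monthly uptime for a VPS affected by the outage.
# All figures are approximations taken from the posts above.

MINUTES_IN_SEPTEMBER = 30 * 24 * 60  # 43,200 minutes


def uptime_percent(downtime_minutes, period_minutes=MINUTES_IN_SEPTEMBER):
    """Return uptime as a percentage of the given period (hypothetical helper)."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


power_outage = 31  # 09:45 to 10:16 EDT, per Internap's first email
vps_recovery = 30  # assumed: "roughly 30 minutes" of fsck and restarts

print(f"{uptime_percent(power_outage + vps_recovery):.3f}%")  # prints 99.859%
```

A VPS that was restarted later than the others would have a correspondingly lower figure.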

We sincerely apologize for what happened this morning. We have, and always will have, the highest respect for and belief in Internap's quality and professionalism. We believe this morning's mistake was simply human error, both in communication and in the execution of the maintenance work itself.

Should you have any further questions or concerns, please do not hesitate to contact us at support[at]fluidhosting.com.

FH-Dave
09-23-2003, 05:03 PM
Further update/explanation from Internap regarding the brief power outage on Sunday.


Hello,

Regarding the outage to BSN customers that stemmed from a power loss at the BSN facility yesterday, Sunday 9/21/03 between 09:45 and 10:17 EDT, Internap has completed our preliminary Reason for Outage report, which is included in this email. Additionally, this email will outline the steps that Internap will be taking to replace the failed equipment and the impact of this to customers.

The outage resulted from planned maintenance work that the Internap Field Operations department was performing on 9/21/03. The maintenance was intended to include regular preventive maintenance on our UPS power systems, and should not have resulted in any loss of power.

On Sunday 9/21/03 at approximately 09:45 EDT, the Internap NOC was alerted that a large number of customer circuits had failed in the BSN facility. The failure impacted customers that did not have redundant power connections to both the UPS-A and UPS-B feeds. The NOC contacted the BSN facility and after some investigation was informed that there had been a failure in the UPS-A feed. At this point the cause of the power failure was unknown; however, the NOC began to work with the Field Ops personnel to restore power. By 10:17 EDT power had been restored to all customers off of the UPS-A feed.

Subsequent analysis of the failure has revealed that several minutes after the routine maintenance to UPS-A was completed, the Main Isolation Breaker, which is located on the output side of UPS-A, tripped and failed. Unfortunately, there were no alarms evident to the Field Operations personnel that the breaker had failed. To resolve the power failure, the UPS technician placed the UPS in bypass mode, which allowed the UPS-A fed circuits to be re-energized on commercial power. At this point, customers that had power failures due to the UPS-A feed failing are still on commercial power. Please see below for information on Internap's plan to replace the failed breaker and get customers back onto the UPS system. At this time commercial power is backed up by generators, so should a failure of commercial power occur, customers will be powered by generators after a brief failover.

With regard to Internap's plan moving forward to replace the failed breaker, the Internap Field Ops department is currently working with our vendors to locate and ship replacement equipment as quickly as possible. Because of the downtime associated with the equipment replacement, the Internap NOC and BSN facility personnel will work directly with customers who will be impacted before the breaker is replaced. Before the replacement of the breaker can occur, it will be necessary to move any customers homed only to the UPS-A power feed to the UPS-B feed, so that when the replacement occurs, there will not be a loss of power. If you will be directly impacted by this replacement, the NOC will contact you in a separate email in the next 24 hours to notify you of the plan and timeline, and of the assistance we will need from you in order to minimize the impact to your services as we restore the UPS-A feed.

Finally, regarding the lack of customer notification about the planned maintenance in the BSN facility, our investigation revealed that there had been a miscommunication between the BSN facility engineers who were to perform the maintenance and the NOC, which coordinates all customer notifications of such work. Unfortunately, the request for a maintenance event, which triggers customer notification, was not entered correctly into the appropriate systems, which led to customers not being notified of the event. The NOC and the Field Operations department have reviewed the process that occurs when setting up maintenance windows, and have built additional safeguards into the process to ensure that this sort of miscommunication does not recur.

As stated earlier, moving forward, Internap will first need to replace the failed breaker in order to move customers off of commercial power and back onto UPS-A. In the interim, customers who will need to move their power feed to the UPS-B feed before the breaker replacement will be contacted separately in the next 24 hours with further instructions.

If you have any additional questions or concerns, please direct them to [Internap email contact deleted] or [Internap phone contact deleted] and reference ticket 128019.

Thank you.
[Internap contact deleted]