Production Issue - Sept 29, 2017
Incident Report for OpsRamp
Postmortem

October 2, 2017

Hello Valued OpsRamp Customer,

As part of our standard process, OpsRamp is performing a full root cause analysis for the intermittent application availability event on Friday, September 29.

At this time, we are waiting for Sungard, our primary data center provider, to deliver a detailed RCA outlining the timeline of the UPS power supply system failure. We expect to receive Sungard's RCA this week.

We understand the impact this has had on our affected customers and extend our apologies. We take this situation very seriously and will deliver the full root cause analysis by COB Thursday, October 5 (PST).

Kindest regards,

The OpsRamp Team


September 30, 2017

Dear Valued Customer,

On Friday, September 29, OpsRamp experienced intermittent application availability. We would like to share details on the cause of this incident, the impact it may have had on your environments, and the further actions we are taking in response.

Our primary data center in Rancho Cordova, California, suffered a failure in the UPS power supply system that supports the racks in the cage where OpsRamp is hosted. This power failure triggered an automated failover to our secondary data center, followed by an automated failback to the primary data center. As a result, the platform was unavailable for a few minutes.

We recognize the impact this service interruption may have had on your operations. We value the trust you place in OpsRamp and remain committed to delivering the best service levels for our customers. Please do not hesitate to contact our Customer Success team (support@opsramp.com) if you have further questions.

Thank you for your continued business and partnership.

Sincerely,

The OpsRamp Team

Posted about 1 year ago. Sep 30, 2017 - 16:11 PDT

Resolved
This incident has been resolved.
Posted about 1 year ago. Sep 29, 2017 - 17:01 PDT
Update
We will provide a detailed RCA summarizing today's major event.
Posted about 1 year ago. Sep 29, 2017 - 13:39 PDT
Monitoring
Connectivity has stabilized and is under observation.
Posted about 1 year ago. Sep 29, 2017 - 13:38 PDT
Update
We are targeting full restoration of agent and gateway connectivity in the next hour. We will provide an update if there are any changes.
Posted about 1 year ago. Sep 29, 2017 - 13:18 PDT
Update
We are actively working on fixing the agent and gateway connectivity issue.
Posted about 1 year ago. Sep 29, 2017 - 13:12 PDT
Update
We have confirmed that agents and gateways are still down, and we are actively working on fixing this issue.

We have confirmed that the APP UI and APIs should be accessible.

We will provide an update shortly.
Posted about 1 year ago. Sep 29, 2017 - 12:33 PDT
Update
We are experiencing a production outage that has resulted in agent and device connectivity issues. We are working on fixing this issue.
Posted about 1 year ago. Sep 29, 2017 - 12:17 PDT
Identified
Update - We have identified a production issue that has impacted connectivity. We are actively working to resolve this issue. We will provide an update in the next 15 minutes.
Posted 35 minutes ago. Sep 29, 2017 - 11:37 PDT

Identified - We have been experiencing a production outage for the last 30 minutes. We are actively working on the issue.
Posted about 1 hour ago. Sep 29, 2017 - 11:14 PDT
Posted about 1 year ago. Sep 29, 2017 - 12:13 PDT