Network instabilities

Incident Report for RG System

Resolved

This incident has been resolved.

As a result of this outage, we had to choose between keeping the service up and letting the data rebuild run for 10 days with a major impact on performance. We decided to take the history of all custom script result logs offline. If you need access to that data, please send an email to help@rgsystem.com.

Posted Jun 29, 2019 - 10:05 CEST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 28, 2019 - 22:39 CEST

Identified

The issue has been identified. We lost a database node today at 13:00, with no impact on production at that point. Three hours later, the lost node was still down, so the other nodes started replicating its data onto the remaining live nodes. This task consumes a lot of resources, which impacts the whole stack. Databases are now being synchronized at a lower rate, which should reduce the load and restore the service.

Posted Jun 28, 2019 - 18:33 CEST

Investigating

We are currently investigating this issue.

Posted Jun 28, 2019 - 17:21 CEST

This incident affected: Agents Listeners Cluster (Lisa).