504 errors on Dashboard
Incident Report for RG System
Customers experienced intermittent 504 Gateway Time-out errors on the Dashboard. A DNS network error occurred overnight as a consequence of network maintenance; we are still waiting for details from the Service Provider. Our services rely heavily on DNS for High Availability and Load Balancing. When one DNS node goes down, the default behavior of a Linux server is to query the first server in its list, wait for the timeout, and only then move on to the next one, without remembering that the first server is offline. As a result, every subsequent call was slow (about 5 seconds, the DNS lookup timeout) before eventually succeeding. All services (alerting, agent connectivity, real-time alerts, etc.) remained available for the whole time the status page showed degraded performance; the status page reported degradation because its own checks were also waiting 5 seconds to retrieve the information.
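To illustrate the resolver behavior described above, here is a minimal Python sketch that simulates sequential DNS server fallback. It is not our production code and performs no real network calls: the `NameServer` class, the `simulated_lookup` function, the addresses (taken from the documentation range 192.0.2.0/24), and the 20 ms answer time are all assumptions made for the example. Only the 5-second timeout comes from the incident itself.

```python
from dataclasses import dataclass


@dataclass
class NameServer:
    address: str  # hypothetical address, for illustration only
    online: bool


def simulated_lookup(servers, timeout_s=5.0, answer_s=0.02):
    """Simulate one lookup that tries servers strictly in list order.

    Returns (address of the server that answered, or None, and the
    total number of seconds the caller would have waited).
    """
    waited = 0.0
    for server in servers:
        if server.online:
            # A healthy server answers quickly.
            return server.address, waited + answer_s
        # An offline server costs the full timeout before the next one is
        # tried, and nothing is remembered for the following lookup.
        waited += timeout_s
    return None, waited


# First server in the list is down, second one is healthy.
degraded = [
    NameServer("192.0.2.1", online=False),
    NameServer("192.0.2.2", online=True),
]
answered_by, latency = simulated_lookup(degraded)
print(f"answered by {answered_by} after {latency:.2f}s")
```

In this simulation every lookup still succeeds, but each one pays the full timeout first, which is consistent with requests that eventually worked while the Dashboard gateway returned 504 errors. The 5-second figure matches the default per-server timeout of the glibc resolver (the `timeout` option in `resolv.conf`).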
Posted Apr 04, 2018 - 09:50 CEST