Connectivity Issues
Incident Report for Vonage Video API
Postmortem

Final Postmortem

What happened

On July 27th, 2020, between 10:27 UTC and 11:14 UTC, as well as between 22:51 UTC and 23:26 UTC, API calls on the Standard or Enterprise environments may have failed. Affected APIs include clients joining a session and server-side REST API calls. This incident did not impact ongoing sessions for which no API calls were invoked.

Causes

A security agent running on the non-persistent database servers, which monitors and scans disk access, caused a resource conflict with the database. As part of its normal operation, the non-persistent database writes log files to disk. The security agent performs in-line anti-malware analysis of file accesses. The non-persistent database is unusual in that it is single-threaded and opens and closes the log file for each log entry.
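
As a rough illustration of why this access pattern amplifies the cost of in-line scanning, the sketch below (Python, with a hypothetical file name, delay value, and helper; it is not the actual database code) simulates a logger that opens and closes its file for every entry, with and without a fixed per-open scan delay:

    import time

    SCAN_DELAY_S = 0.002  # assumed 2 ms of in-line scanning per file open (hypothetical)

    def write_log_entry(path, entry, scan_delay=0.0):
        # The non-persistent database is single-threaded and opens and
        # closes the log file for every entry; an in-line scanner that
        # hooks each open() adds its delay to every entry written.
        time.sleep(scan_delay)
        with open(path, "a") as f:
            f.write(entry + "\n")

    if __name__ == "__main__":
        entries = [f"event {i}" for i in range(200)]

        start = time.perf_counter()
        for e in entries:
            write_log_entry("db.log", e)
        print(f"without scan: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        for e in entries:
            write_log_entry("db.log", e, scan_delay=SCAN_DELAY_S)
        print(f"with per-open scan: {time.perf_counter() - start:.3f}s")

Because every log entry pays the scan cost again on each open, the added latency scales with write volume instead of being amortized over a long-lived file handle.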

The agent's scanning of these log files created a resource conflict with the database; this increased request latency and, in turn, the number of open connections that needed to be serviced.

This cascaded into requests timing out and triggered a failover to a new primary node. The new primary node exhibited the same behavior, and the effect cascaded further. At that point the cluster was unable to handle all traffic and became entirely unresponsive.
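
As a back-of-envelope illustration of the connection growth (the numbers below are hypothetical, not measurements from the incident): by Little's Law, the number of requests in flight is roughly the arrival rate multiplied by the per-request latency.

    RATE_RPS = 1000             # assumed request rate (hypothetical)
    LATENCY_NORMAL_S = 0.005    # assumed normal per-request latency (hypothetical)
    LATENCY_DEGRADED_S = 0.200  # assumed latency under the scan-induced conflict (hypothetical)

    # Little's Law: open connections ~= arrival rate * latency
    print(RATE_RPS * LATENCY_NORMAL_S)    # ~5 connections in flight
    print(RATE_RPS * LATENCY_DEGRADED_S)  # ~200 connections in flight

A forty-fold jump in concurrent connections at the same request rate is enough to exhaust connection limits, trip request timeouts, and trigger the failover described above.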

Timeline

The non-persistent database cluster management nodes were restarted at 10:46 UTC, and the platform resumed normal operations.

When the second incident occurred, the team was still investigating the cause of the first one, and the security agent had not yet been identified as the root cause.

Restarting the non-persistent database cluster management node at 23:10 UTC did not resolve the second incident. Further investigation found excessive resource utilization by the security agent. The agent was disabled on all the non-persistent database cluster nodes at 23:24 UTC, and normal operations resumed shortly thereafter.

Preventive Actions

  • Migrate the non-persistent database to a managed service.
  • Review the recommendations from the security agent provider to update the policies and prevent the resource conflict. 
  • Increase the scope of testing to include the installation of third-party software, such as security agents, in the test harness.
  • Continue improving our incident management processes to ensure customers are notified as soon as an incident is detected.

Interim Postmortem

What happened

On July 27th, 2020, between 10:27 UTC and 11:14 UTC, as well as between 22:51 UTC and 23:26 UTC, users on all client SDKs may not have been able to connect to a session running in the Standard or Enterprise environments. This incident did not impact users who had already joined a session.

Causes

For reasons that are currently under investigation by internal and external teams, a third-party software component running on the nodes of the non-persistent database made them unresponsive.

It was discovered that this component was consuming almost all of the cluster's computing power at the time, causing the cluster to fail.

The fix for the initial incident started to roll out at 10:46 UTC; the incident appeared to resolve after restarting the API Gateway, and the platform resumed normal operations.

When the second incident occurred, the team was still investigating the cause of the first one, and this software component had not yet been identified as the root cause of the issue.

While rebooting the API Gateways resolved the first incident, the issue persisted after doing the same when we were notified of the new outage. During the second outage, it was discovered that the software component was utilizing most of the API Gateway's resources. The component was quickly disabled across the database cluster, and normal operations resumed shortly thereafter.

Preventative Actions

  • Migrate the non-persistent database to a managed service.
  • Review the suitability of the third-party software and associated policies for this particular service.
  • Continue improving our incident management processes to ensure customers are notified as soon as an incident is detected.
Posted Aug 06, 2020 - 19:09 UTC

Resolved
This incident has been resolved.
If you have any questions, please reach out to support@tokbox.com.
Posted Jul 28, 2020 - 14:40 UTC
Monitoring
A fix has been implemented and we are currently monitoring the results. Services were impacted between 22:54 UTC and 23:28 UTC. Please reach out to support@tokbox.com if you have any further questions.
Posted Jul 27, 2020 - 23:34 UTC
Update
We are continuing to investigate this issue.
Posted Jul 27, 2020 - 23:22 UTC
Update
At this time, new clients will not be able to connect to sessions. Ongoing sessions with connected clients should not be impacted by this incident.
Posted Jul 27, 2020 - 23:14 UTC
Investigating
We are currently experiencing connectivity issues. The issue is under investigation and we will post updates as we have more information available.
Posted Jul 27, 2020 - 23:05 UTC
Monitoring
Service appears to have recovered since 11:14 UTC; we are monitoring the situation closely. If you have any questions, please reach out to us at support@tokbox.com.
Posted Jul 27, 2020 - 11:39 UTC
Update
We are continuing to investigate this issue.
Posted Jul 27, 2020 - 11:24 UTC
Investigating
Starting at 10:45 UTC, we are experiencing connectivity issues. During this time, the platform API may appear to be out of service. The issue is under investigation. We are already observing recovery at this point. If you have any questions, please reach out to us at support@tokbox.com.
Posted Jul 27, 2020 - 11:23 UTC
This incident affected: Enterprise (Enterprise Video, Enterprise API, Enterprise Broadcast, Enterprise SIP, Enterprise Session Monitoring, Enterprise Archiving), Standard (Standard Video, Standard API, Standard Broadcast, Standard SIP, Standard Session Monitoring, Standard Archiving), and Tools (Account Portal, China Relay, Advanced Insights).