TokBox API server not responding
Incident Report for TokBox
Postmortem

Production Outage: February 5, 2018 4:46 pm - 5:53 pm PST

For more than a year, TokBox has exceeded its 99.95% availability targets. On February 5, we experienced a significant service interruption in our North American data centers that lasted a little more than an hour.

Timeline

  • All connectivity to the platform in the North America region was impacted because API servers could not service the HTTP requests used for session negotiation.
  • API traffic routed to other data centers, such as London, worked as expected.
  • Our first internal alert was triggered at 4:48pm.
  • At 5:05pm, our severity-one channel was alerted, all stakeholders were notified, and the Production SWAT team convened immediately for analysis.
  • At 5:14pm, the status page at status.tokbox.com was updated with the findings from triage based on internal tests.
  • At 5:50pm, connectivity and service were restored.

Root-Cause Analysis

  • We saw a sudden service degradation in HTTP API response latency.
  • Initially, our production team noticed an apparent loss of connectivity between our top-level HTTP API servers and Redis (cache machines).
  • API servers started running out of file descriptors as part of this loss of response.
  • We determined that the cause of the failure was a 700% increase in traffic to our API service. This essentially resulted in a deadlock in the API servers: requests were not being serviced while new requests piled up, further exacerbating the problem.
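
The pile-up described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a reconstruction of our API servers: the baseline rate, the per-tick capacity, the descriptor limit, and the function names are all hypothetical, and each queued request is assumed to hold exactly one file descriptor. The only figure taken from this report is the 700% increase, which corresponds to 8 times the baseline traffic.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the number of unserved requests after each tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                  # new requests arrive
        backlog -= min(backlog, capacity_per_tick)    # server drains what it can
        history.append(backlog)
    return history


def first_tick_over_limit(history, fd_limit):
    """Return the 1-based tick at which the backlog first exceeds fd_limit, or None."""
    for tick, backlog in enumerate(history, start=1):
        if backlog > fd_limit:
            return tick
    return None


# All values below are hypothetical and chosen only for illustration.
BASELINE = 100          # requests per tick under normal load
CAPACITY = 200          # requests the server can complete per tick
FD_LIMIT = 1024         # assumed per-process file descriptor limit
SURGE = BASELINE * 8    # a 700% increase means 8x the baseline

normal = simulate_backlog(BASELINE, CAPACITY, ticks=5)
surge = simulate_backlog(SURGE, CAPACITY, ticks=5)

print("normal backlog:", normal)
print("surge backlog: ", surge)
print("descriptor limit exceeded at tick:", first_tick_over_limit(surge, FD_LIMIT))
```

Under these assumed numbers the backlog stays at zero at baseline load, but grows by a fixed amount every tick once arrivals exceed capacity, so any fixed descriptor limit is eventually exceeded.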

Remediation

  • We have approximately doubled our API server capacity in three major geographical regions (San Jose, New York, and London) to make sure an additional buffer exists.
Posted Feb 15, 2018 - 10:09 PST

Resolved
Even though all traffic has resumed normal operations, we are continuing to monitor and investigate the root cause.
Posted Feb 05, 2018 - 17:50 PST
Investigating
We are aware of an issue affecting the TokBox API servers and are working to resolve the issue.
Posted Feb 05, 2018 - 16:46 PST
This incident affected: Enterprise (Enterprise Video, Enterprise API, Enterprise Broadcast, Enterprise SIP, Enterprise Session Monitoring, Enterprise Archiving), Standard (Standard Video, Standard API, Standard Broadcast, Standard SIP, Standard Session Monitoring, Standard Archiving), and Tools, Account Portal.