Tokbox Services Impacted
Incident Report for TokBox
Postmortem

Summary

On Thursday June 6, the New Jersey data center (DC) hosted by one of our external providers lost connectivity to the public Internet. All the core services present in the DC became unreachable. Some of these services are essential for the operation of the platform, in particular the database (DB), the non-persistent DB and the API server.

The failover mechanism that routes DB traffic from the New Jersey DC to a secondary DC in California was executed successfully. This restored the developer tools (Playground, Inspector, Pre-Call test, Archive inspector, GraphQL Explorer), the account portal and session creation.

Unfortunately, an issue with the failover mechanism for the non-persistent database prevented new entries from being created. This impacted new session connections within the New Jersey DC. Existing sessions running in other DCs continued to work without issues. DC connectivity was restored at 01:49 PDT, and the non-persistent DB was then reconfigured to recover the remaining services.

The outage began at 22:35 PDT on June 6 and ended at 02:15 PDT on June 7; time to full service restoration was 220 minutes.
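
The 220-minute figure can be checked directly from the two timestamps above. The following is a minimal sketch, not part of the original report; the variable names are illustrative, and both times are treated as naive PDT values because they share the same timezone.

```python
from datetime import datetime

# Start and end of the outage as stated in the summary (both PDT,
# so no timezone conversion is needed for the difference).
outage_start = datetime(2019, 6, 6, 22, 35)
outage_end = datetime(2019, 6, 7, 2, 15)

# The interval crosses midnight; datetime subtraction handles that.
outage_minutes = int((outage_end - outage_start).total_seconds() // 60)
print(outage_minutes)  # 220
```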

Root Cause

The New Jersey DC became unreachable when substantial fiber damage affected one of the external hosting providers we currently use. In particular:

The primary fiber backbone was lost, causing connectivity failure to services hosted within the DC.

A carrier-diverse secondary fiber backbone had independently lost upstream connectivity, which prevented a successful failover.

Note: A formal RCA has been requested from this hosting provider to gain additional clarity regarding the secondary failure. A further update may be published should we receive any details that warrant it.

A communication received from this provider on June 10 confirmed an official incident closure time of 10:12 PDT on June 8. Network redundancy was fully restored to the facility on Monday June 10 at approximately 08:50 PDT. As a precaution, we maintained a monitoring state until full redundancy was confirmed, and we closed our own incident at 16:47 PDT on June 10.

On Thursday June 6, when the DC went down, TokBox invoked its redundancy mechanism to fail over connections from New Jersey to the California DC. However, the sudden unavailability of an entire DC led to unresolvable problems with the non-persistent database (Redis).

Timeline

Time Period (PDT)       Major incident milestones
June 6, 2019, 22:30     Alert received by Engineering Ops.
June 6, 2019, 22:45     DC External Hosting Provider announces issue on their status page.
June 6, 2019, 23:43     Incident posted on TokBox Status Page.
June 7, 2019, 00:04     DB usage migrated to secondary DC; Tools, Account Portal and session creation are restored.
June 7, 2019, 01:49     Primary DC available again and configuration restored.
June 7, 2019, 02:15     Database synchronisation completed, and all services were restored.
June 10, 2019, 08:50    Connectivity resilience restored at NJ Datacenter.
June 10, 2019, 10:19    Communication received from provider confirming incident closure. Monitoring state maintained during review.
June 10, 2019, 16:35    Incident closed by TokBox.
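
To read the timeline as elapsed time rather than wall-clock time, the milestones through full restoration can be expressed as minutes since the first alert. This is an illustrative sketch only: the `milestones` list and the `elapsed_minutes` helper are hypothetical names, and the timestamps are copied from the table above.

```python
from datetime import datetime

# Milestones from the timeline, up to full service restoration (all PDT).
milestones = [
    ("2019-06-06 22:30", "Alert received by Engineering Ops"),
    ("2019-06-06 22:45", "Hosting provider announces issue"),
    ("2019-06-06 23:43", "Incident posted on TokBox Status Page"),
    ("2019-06-07 00:04", "DB usage migrated to secondary DC"),
    ("2019-06-07 01:49", "Primary DC available again"),
    ("2019-06-07 02:15", "All services restored"),
]

def elapsed_minutes(entries):
    """Return (minutes since the first entry, label) for each entry."""
    fmt = "%Y-%m-%d %H:%M"
    first = datetime.strptime(entries[0][0], fmt)
    result = []
    for timestamp, label in entries:
        delta = datetime.strptime(timestamp, fmt) - first
        result.append((int(delta.total_seconds() // 60), label))
    return result

for minutes, label in elapsed_minutes(milestones):
    print(f"+{minutes:>3} min  {label}")
```

Measured this way, the status page post came 73 minutes after the first alert, the DB migration 94 minutes after it, and full restoration 225 minutes after it.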

Remediation

As a result of our post-mortem investigation, we have identified the following areas for improvement:

A complete analysis of the database failover procedure to exceed Disaster Recovery requirements.

Analysis and improvement of our Redis architecture to prevent future race conditions during multiple DB or cluster failures.

Review of Incident and Escalation Management processes for continual service improvement.

Evaluation of host provider operational level agreements.

Posted 3 months ago. Jun 11, 2019 - 14:14 PDT

Resolved
Our supplier has now confirmed that this incident is fully resolved.

Our support and engineering teams continue to work on a post-mortem, which will include actions relating to issue prevention, along with improvements to the processes for incident handling and customer communication. This will be finalized and published with this incident once we receive the full RCA from our provider.
Posted 3 months ago. Jun 10, 2019 - 16:35 PDT
Update
We are continuing to monitor our service, which remains fully operational.

Although our service provider has not yet confirmed a final resolution for their data center incident, they do not expect any further disruption. We are keeping the incident status as 'monitoring'.

Our support and engineering teams continue to work on a post-mortem, which will include actions relating to issue prevention, along with improvements to the processes for incident handling and customer communication. This will be finalized and published once we receive the full RCA from our provider.

We will provide another update within 24 hours.
Posted 3 months ago. Jun 10, 2019 - 07:10 PDT
Update
We are continuing to monitor our service, which remains fully operational. We will keep this incident in 'Monitoring' status until our service provider has confirmed full resolution, and will update it as needed if anything changes.

Once this incident is marked as fully resolved and we have all the necessary information, we will publish a post-mortem.
Posted 4 months ago. Jun 07, 2019 - 15:00 PDT
Update
We are continuing to monitor our service, which remains fully operational. Our service provider has not yet confirmed that their incident is resolved, so we are choosing to keep this incident in 'Monitoring' status for the time being.
We will include the incident timeline in our post-mortem. The current estimate is that the main service disruption lasted from 22:30 PDT on June 6 to 02:15 PDT on June 7, although this may be subject to revision after later analysis.
Posted 4 months ago. Jun 07, 2019 - 05:46 PDT
Update
All services have now been restored. We are continuing to monitor to ensure no further disruption occurs.
Posted 4 months ago. Jun 07, 2019 - 02:37 PDT
Monitoring
Most services on our platform have now been restored, although some customers are continuing to report issues with the Account Portal.

Our upstream provider has confirmed that they have partially restored their service, while they continue to work on their full recovery. We will continue to work on resolving the Account Portal issues and monitoring our other services. We will provide an update on progress soon.
Posted 4 months ago. Jun 07, 2019 - 02:26 PDT
Update
Our engineering teams and upstream provider continue to work on a resolution. We do not have new information at this time. We will provide more details as they become available.
Posted 4 months ago. Jun 07, 2019 - 01:52 PDT
Update
Our engineering teams are continuing to work on a temporary resolution, which has already restored some services including the Account Portal, Inspector, Archive Inspector, and Precall Test. This work continues, while our provider works on restoring the service overall.
Posted 4 months ago. Jun 07, 2019 - 01:17 PDT
Update
One of our upstream providers is experiencing a major outage. They are continuing to work to resolve the issue. We will provide an ETA as soon as possible. A further update will be provided here soon.
Posted 4 months ago. Jun 07, 2019 - 00:30 PDT
Investigating
We are currently investigating a connectivity issue with our servers.
Users are unable to log in to their TokBox account, run the Precall test tool or access the Playground tool, and TokBox APIs are returning 500 Internal Server Error or 503 Service Unavailable errors.
Posted 4 months ago. Jun 06, 2019 - 23:43 PDT
This incident affected: Enterprise (Enterprise Video, Enterprise API, Enterprise Broadcast, Enterprise SIP, Enterprise Session Monitoring, Enterprise Archiving), Standard (Standard Video, Standard API, Standard Broadcast, Standard SIP, Standard Session Monitoring, Standard Archiving), and Tools, Account Portal.