On Thursday, June 6, the New Jersey data center (DC) hosted by one of our external providers lost connectivity to the public Internet, and all of the core services in that DC became unreachable. Several of these services are essential to the operation of the platform, in particular the database (DB), the non-persistent DB, and the API server.
The failover mechanism to route DB traffic from the New Jersey DC to a secondary DC in California was executed successfully. This restored the developer tools (Playground, Inspector, Pre-Call test, Archive inspector, GraphQL Explorer), the account portal, and session creation.
Unfortunately, an issue with the failover mechanism for the non-persistent database prevented new entries from being created, which impacted new session connections within the New Jersey DC. Existing sessions running in other DCs continued to work without issues. Connectivity to the DC was restored at 01:49 PDT, and the non-persistent DB was then reconfigured to recover the remaining services.
The outage began at 22:35 PDT on June 6 and ended at 02:15 PDT on June 7; time to full service restoration was 220 minutes.
The New Jersey DC became unreachable when substantial fiber damage occurred at the facility of one of the external hosting providers we currently use. In particular:
- The primary fiber backbone was lost, causing connectivity failures to services hosted within the DC.
- A carrier-diverse secondary fiber backbone had independently lost upstream connectivity, preventing a successful failover.
Note: A formal RCA has been requested from this hosting provider to gain additional clarity regarding the secondary failure. A further update may be published should we receive any details that warrant it.
The communication received from this provider on June 10 confirmed an official incident closure time of 10:12 PDT on June 8. Network redundancy was fully restored to the facility on Monday, June 10 at approximately 08:50 PDT. As a precaution, we maintained a monitoring state until full redundancy was confirmed, and we closed the incident at 16:47 PDT on June 10.
On Thursday, June 6, when the DC went down, TokBox invoked its redundancy mechanism to fail connections over from New Jersey to the California DC. However, the sudden unavailability of an entire DC led to unresolvable problems with the non-persistent database (Redis), so new entries could not be created.
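To make this failure mode concrete, the sketch below shows how a session-entry write can be left with nowhere to go when an entire DC disappears: the primary is unreachable, and a secondary that has not yet been promoted rejects writes. This is an illustrative sketch only, written against the redis-py client; the host names, key format, helper function, and retry policy are assumptions made for illustration, not our actual implementation.

```python
# Illustrative sketch only: a hypothetical session-store write during a DC failover.
# Host names, key format, and retry policy are assumptions, not TokBox's implementation.
# Requires the redis-py package.
import redis

PRIMARY = {"host": "redis-nj.example.internal", "port": 6379}      # hypothetical NJ endpoint
SECONDARY = {"host": "redis-ca.example.internal", "port": 6379}    # hypothetical CA endpoint

def create_session_entry(session_id: str, payload: str) -> bool:
    """Try to write the new session entry to the primary DC, then the secondary."""
    for target in (PRIMARY, SECONDARY):
        client = redis.Redis(host=target["host"], port=target["port"],
                             socket_connect_timeout=2, socket_timeout=2)
        try:
            # Write the entry with a TTL so stale sessions expire on their own.
            client.set(f"session:{session_id}", payload, ex=3600)
            return True
        except redis.exceptions.ConnectionError:
            # Primary DC unreachable (the June 6 scenario): fall through and
            # try the secondary DC instead.
            continue
        except redis.exceptions.ReadOnlyError:
            # The secondary is still acting as a replica: until it is promoted
            # or reconfigured, new entries cannot be created, which is the
            # failure mode observed during this incident.
            return False
    return False
```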
| Time Period (PDT) | Major incident milestones |
| --- | --- |
| June 6, 2019, 22:30 | Alert received by Engineering Ops. |
| June 6, 2019, 22:45 | DC external hosting provider announces issue on their status page. |
| June 6, 2019, 23:43 | Incident posted on TokBox Status Page. |
| June 7, 2019, 00:04 | DB usage migrated to secondary DC; tools, account portal, and session creation restored. |
| June 7, 2019, 01:49 | Primary DC available again and configuration restored. |
| June 7, 2019, 02:15 | Database synchronisation completed and all services restored. |
| June 10, 2019, 08:50 | Connectivity resilience restored at NJ data center. |
| June 10, 2019, 10:19 | Communication received from provider confirming incident closure. Monitoring state maintained during review. |
| June 10, 2019, 16:35 | Incident closed by TokBox. |
As a result of our post-mortem investigation, we have identified the following areas of improvement:
- A complete analysis of the database failover procedure to exceed Disaster Recovery requirements.
- Analysis of, and improvements to, the architecture of our Redis deployment to prevent future race conditions during multiple DB or cluster failures.
- Review of Incident and Escalation Management processes for continual service improvement.
- Evaluation of hosting provider operational level agreements.