Many customers using the OpenTok platform experienced a serious outage on Tuesday, 21 May 2019.
We understand how critical the platform is for our customers, so we strive to ensure the highest possible reliability. We know that Tuesday’s incident caused disruption for those customers impacted, for which we sincerely apologize.
Our focus for any incident is to ensure:
The incident is resolved quickly, minimizing any further disruption to customers.
Customers are provided with current and accurate information on status.tokbox.com.
We conduct a thorough post-mortem so we fully understand any lessons that can be learned, allowing us to prevent any recurrence, and to further improve our platform’s reliability.
We’d like to summarize what happened and what we’re doing next. Our analysis is continuing in some areas, so this summary is subject to revision.
Timeline & Business Impact
21 May 2019 00:57 - 01:24 PST - A significant subset of customers were unable to use the OpenTok video communication platform or account portal.
21 May 2019 00:57 - 01:24 PST - A subset of customers experienced intermittent failures to upload archives in progress.
21 May 2019 08:15 - 09:46 PST - A subset of customers received errors when attempting to start a new archive or broadcast, or to stop an archive in progress.
The primary cause for the incident was a database server going offline at one of our datacenter hosting locations.
Our platform is built to mitigate such disruption by failing over to alternative locations. However, an issue with the failover process caused the failover to take a little longer than expected.
This initial issue left the affected customers unable to use the platform or account portal. Because the database was not accessible, archive file IDs could not be written to it, so the corresponding archive files could not be uploaded to storage. Any archive that was recoverable has now been uploaded and is available.
The follow-on issues, in which some customers were unable to start or stop archives and broadcasts, occurred because session information was left in a temporarily inconsistent state, which in turn caused temporary capacity issues.
A program of work has started that includes but is not limited to:
Review power source redundancy with our data center provider.
Add improved monitoring for the archiving service.
Review and enhance the failover system.
Conduct a comprehensive internal post-mortem to highlight any additional improvements in terms of communication with customers, the process for handling incidents, and other preventive actions that could be taken.
Review the post-mortem requested from our hosting provider and their own plan for improvements.
Please contact your account manager or firstname.lastname@example.org with any questions.