Intermittent Issue with Archiving Feature
Incident Report for TokBox
Postmortem

Many customers using the OpenTok platform experienced a serious outage on Tuesday, the 21st of May.

We understand how critical the platform is for our customers, so we strive to ensure the highest possible reliability. We know that Tuesday’s incident caused disruption for those customers impacted, for which we sincerely apologize.

Our focus for any incident is to ensure:

The incident is resolved quickly, minimizing any further disruption to customers.

Customers are provided with current and accurate information on status.tokbox.com.

We conduct a thorough post-mortem so we fully understand any lessons that can be learned, allowing us to prevent any recurrence, and to further improve our platform’s reliability.

We’d like to summarize what happened and what we’re doing next. Our analysis is continuing in some areas, so this summary is subject to revision.

Timeline & Business Impact

21 May 2019 00:57 - 01:24 PST - A significant subset of customers were unable to use the OpenTok video communication platform or account portal.

21 May 2019 00:57 - 01:24 PST - A subset of customers experienced intermittent failures to upload archives in progress.

21 May 2019 08:15 - 09:46 PST - A subset of customers received errors when attempting to start a new archive or broadcast, or to stop an archive in progress.

Root Cause

The primary cause of the incident was a database server going offline at one of our datacenter hosting locations.

Our platform is built to mitigate such disruption by failing over to alternative locations. There was an issue with the failover process, which resulted in the failover taking longer than expected.

This initial issue left the affected customers unable to use the platform or account portal. Because the database was not accessible, the archive file ID could not be written to the database, so the archive file could not be uploaded to storage. Any archive that was recoverable has now been uploaded and is available.
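As a rough illustration of how a customer might check which of their archives overlapped the first incident window, here is a minimal Python sketch. It is not an official tool. It assumes each archive record is a dictionary with `createdAt` (milliseconds since the Unix epoch), `duration` (seconds), and `status` fields, that the timeline's "PST" means UTC-8, and that `available` and `uploaded` are the statuses indicating a stored archive. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timeline's "PST" is taken literally as UTC-8.
PST = timezone(timedelta(hours=-8))

# First incident window, as listed in the timeline.
WINDOW_START = datetime(2019, 5, 21, 0, 57, tzinfo=PST)
WINDOW_END = datetime(2019, 5, 21, 1, 24, tzinfo=PST)


def archives_in_window(archives, start=WINDOW_START, end=WINDOW_END):
    """Return archives whose recording interval overlaps [start, end]."""
    affected = []
    for archive in archives:
        # Assumption: createdAt is milliseconds since the Unix epoch.
        created = datetime.fromtimestamp(
            archive["createdAt"] / 1000, tz=timezone.utc
        )
        # Assumption: duration is in seconds; treat a missing value as 0.
        finished = created + timedelta(seconds=archive.get("duration") or 0)
        if created <= end and finished >= start:
            affected.append(archive)
    return affected


def needs_follow_up(archive):
    """True if the archive is not in a status that indicates it was stored."""
    # Assumption: these two statuses mean the file reached storage.
    return archive.get("status") not in ("available", "uploaded")
```

An archive that stopped abnormally may report a duration of zero, so a record created shortly before the window could be missed by the overlap check. Widening `start` by a margin is one way to account for that.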

The follow-on issues, in which some customers were unable to start or stop archives and broadcasts, were caused by a temporarily inconsistent state of the session information, which led to temporary capacity issues.

Next Steps

A program of work has started that includes but is not limited to:

Review power source redundancy with our data center provider.

Add improved monitoring for the archiving service.

Review and enhance the failover system.

Conduct a comprehensive internal post-mortem to highlight any additional improvements in terms of communication with customers, the process for handling incidents, and other preventive actions that could be taken.

Review the post-mortem requested from our hosting provider and their own plan for improvements.

Please contact your account manager or support@tokbox.com with any questions.

Posted May 24, 2019 - 12:59 PDT

Resolved
The Archiving Feature has been fully operational again since May 21st at 9:30am PST for Enterprise customers, and 11:10am PST for all other customers.

We are still assessing the full impact that this incident had on archives at the time it was occurring. We will provide more details in a post-mortem, which will be published in the next few days.

Please contact support@tokbox.com if you have any questions.
Posted May 21, 2019 - 15:52 PDT
Monitoring
Our engineering team has identified an issue that appears to be isolated to a few specific servers. As a result, the archiving service has been impacted.
The impacted servers are now working normally again, and starting an archive should work as expected. We are looking into recovering archives that failed to upload because of this issue.

We are currently monitoring the systems and we will publish a post-mortem as soon as possible.

Please contact support@tokbox.com if you have any questions.
Posted May 21, 2019 - 12:01 PDT
Update
Thank you for your continued patience while we continue our investigation into an issue related to our archive service. Engineers are working to restore service for those affected and will provide an update as soon as possible.
Posted May 21, 2019 - 11:00 PDT
Update
We are currently investigating issues with starting archives in some regions.
Posted May 21, 2019 - 09:07 PDT
Update
Our Engineering team has confirmed that records generated during the previous incident (12:56am to 2:00am PDT) may not be immediately available to customers for download. Records generated outside of this time window are not impacted.
Posted May 21, 2019 - 07:46 PDT
Investigating
We are currently investigating an ongoing issue with our Archiving feature. Archives are not being moved to the stopped state, so users are unable to access Archiving information via the Inspector tool. Some archives are also not being uploaded to Amazon S3 or Microsoft Azure accounts. There also appear to be some issues with using the Archiving API.

Further updates will be provided as we have more information.
Posted May 21, 2019 - 06:22 PDT
This incident affected: Standard (Standard Archiving) and Enterprise (Enterprise Archiving).