Incorrectly marked media server causing failures
Incident Report for TokBox
Postmortem

Timeline

At approximately 12:00 pm PST, some Enterprise media servers in the SJS datacenter were put into rotation after an OS upgrade, but the servers were not configured correctly.

A manual sanity test was performed before the servers were put into rotation, but it did not detect that audio was not working.

At approximately 3:30 pm PST, the misconfigured servers were taken out of rotation.

Root-Cause Analysis

Due to human error, an incorrect configuration (chef recipe) was used that was not fully compatible with the deployed binary. This triggered a bug in which the misconfigured media server rejected the audio codec in the SDP. As a result, clients did not publish any audio.
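The manual sanity test missed this failure, so an automated check on the negotiated SDP may be worth considering. The following is a minimal sketch in Python, not the check TokBox uses. It assumes the rejection is visible in the server's SDP answer as an audio m= line with port 0, which is how the SDP offer/answer model (RFC 3264) signals a rejected media stream. The function names media_ports and audio_accepted are hypothetical.

```python
from typing import List, Tuple


def media_ports(sdp: str) -> List[Tuple[str, int]]:
    """Return (media type, port) for every m= line in an SDP body."""
    result = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if not line.startswith("m="):
            continue
        # Format: m=<media> <port>[/<number of ports>] <proto> <fmt> ...
        fields = line[2:].split()
        if len(fields) < 2:
            continue
        try:
            port = int(fields[1].split("/")[0])
        except ValueError:
            continue  # malformed port, skip the line
        result.append((fields[0], port))
    return result


def audio_accepted(sdp_answer: str) -> bool:
    """True if the answer has at least one audio m= line and none is rejected.

    Assumption: a rejected stream appears with port 0 (RFC 3264).
    """
    audio = [port for media, port in media_ports(sdp_answer) if media == "audio"]
    return bool(audio) and all(port != 0 for port in audio)
```

A pre-rotation test could call audio_accepted on the answer returned by each upgraded server and fail the rollout when it returns False. This only covers signaling; it does not prove that audio packets actually flow.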

Remediation

We fixed the server configuration, and newly generated sessions then worked as required.

The team is discussing improvements needed to ensure this won't happen again.

Posted 5 months ago. May 09, 2018 - 13:12 PDT

Resolved
Both the connection errors and audio issues have now been resolved.
Posted 5 months ago. May 08, 2018 - 16:47 PDT
Identified
One of our enterprise TURN servers was incorrectly marked as a media server. This caused clients to make requests to it that repeatedly failed. The problem may manifest as websocket failures or audio missing from a call.

We initially received reports around 12:15 pm PST, and the server was taken out of rotation at 12:33 pm PST. Some stale sessions may still be making requests.
Posted 5 months ago. May 08, 2018 - 14:14 PDT
This incident affected: Enterprise (Enterprise Video, Enterprise API).