Anvil issues centered around SJC
Incident Report for TokBox
Postmortem

API Service Interruption: February 15, 2018 2:01 am PST to 3:40 am PST

Summary

API servers in NYJ and SJC began crashing around 2 am PST on February 15, 2018, due to a JVM bug on the servers. This caused a production outage for customers trying to create new sessions. Because API servers are shared by the Standard and Enterprise lines, customers on either line could have been affected.

Timeline

  • At 2:01 am, API servers began crashing in the SJC and NYJ regions. Initial reports suggested that only SJC was affected.
  • At 2:28 am, the Ops team began restarting affected servers while continuing to investigate.
  • At 2:34 am, we started redirecting traffic to other regions such as NYJ, but NYJ was similarly affected.
  • From 3:00 am, we started to roll back servers.
  • API service was restored by 3:40 am PST.

Root-Cause Analysis

  • The JVM crashed on all of these machines, and we believe this is related to a bug in the JDK. OpenJDK Bug Link
  • The bug is in a crypto library that the servers use and causes a non-deterministic crash in old versions of the JDK (1.8.0_45).
  • We had also updated some TLS-related workflows in our API servers and released them to production on Feb 14, 2018.

Remediation

  • Upgrade the JDK on API servers to v8u152 or later.
  • While we always release to a few machines first and wait until a required load threshold is met before rolling out further, we may revisit this threshold.
  • We are working on a full separation of API, Messaging & Media Servers between Standard and Enterprise clients.
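
As an illustration of the first remediation item, a server could check its own JVM version at startup and warn if it is older than the target update. The following is a minimal sketch, not the check TokBox runs: the class name `JdkUpdateGuard` and its methods are hypothetical, and it only understands JDK 8 style version strings such as `1.8.0_45`. Any other release line is reported as not meeting the minimum.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and structure are illustrative assumptions.
final class JdkUpdateGuard {
    // Minimum JDK 8 update taken from the remediation item (v8u152+).
    static final int MIN_JDK8_UPDATE = 152;

    // Matches the start of JDK 8 version strings such as "1.8.0_45" or "1.8.0_161-b12".
    private static final Pattern JDK8_UPDATE = Pattern.compile("^1\\.8\\.0_(\\d+)");

    // Returns the update number of a JDK 8 version string, or -1 if the
    // string is null or is not a JDK 8 version with an update suffix.
    static int jdk8Update(String version) {
        if (version == null) {
            return -1;
        }
        Matcher m = JDK8_UPDATE.matcher(version);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // True only for JDK 8 versions at or above MIN_JDK8_UPDATE.
    static boolean meetsMinimum(String version) {
        return jdk8Update(version) >= MIN_JDK8_UPDATE;
    }

    public static void main(String[] args) {
        // "java.version" is a standard system property set by the JVM.
        String version = System.getProperty("java.version");
        if (meetsMinimum(version)) {
            System.out.println("JVM version " + version + " meets the minimum update.");
        } else {
            System.err.println("WARNING: JVM version " + version
                    + " is not JDK 8 update " + MIN_JDK8_UPDATE + " or later.");
        }
    }
}
```

With this sketch, `meetsMinimum("1.8.0_45")` is false and `meetsMinimum("1.8.0_152")` is true. A real deployment would more likely enforce the version in provisioning or configuration management than in application code.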
Posted Feb 17, 2018 - 13:23 PST

Resolved
This incident has been resolved.
Posted Feb 15, 2018 - 10:01 PST
Investigating
At 2:01 am PST, Anvil connectivity issues occurred around the SJC region, lasting until 2:57 am. Traffic was rerouted. We are continuing to monitor.
Posted Feb 15, 2018 - 02:58 PST
This incident affected: Standard (Standard Video, Standard API, Standard Broadcast, Standard SIP, Standard Session Monitoring, Standard Archiving), Enterprise (Enterprise Video, Enterprise API, Enterprise Broadcast, Enterprise SIP, Enterprise Session Monitoring, Enterprise Archiving), and Tools, Account Portal.