[2024-05-27] Web Application Collaboration Server Failure

May 28, 2024 3:55 AM[ 2024-05-27 20:55 PDT ]
Our team received the first reports that some web applications could not be opened in the web application designer.


May 28, 2024 4:10 AM[ 2024-05-27 21:10 PDT ]
We obtained initial information about the nature of the issue and identified the failed component: the Redis database of the Studio Collaboration Service.


May 28, 2024 6:45 AM[ 2024-05-27 23:45 PDT ]
Backup snapshots were applied, and our team manually restored the affected applications.


May 28, 2024 8:00 AM[ 2024-05-28 01:00 PDT ]
The Studio Collaboration Service was fully restored.


Details of the incident
During an infrastructure maintenance window, our team updated the production cluster of the collaboration service, including its Redis database. At the time, our Redis configuration used RDB persistence with no AOF. As part of the server update procedure, a new configuration with AOF enabled was introduced. According to the Redis documentation, AOF must first be enabled on the live server before the AOF parameter is added to the server configuration; otherwise data can be lost.
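The live switch described in the Redis documentation can be sketched roughly as follows. This is a minimal illustration, not the procedure our team actually ran: the function name `enable_aof_live` and its parameters are hypothetical, and `client` is assumed to be a redis-py style client that exposes `config_set()`, `info()` and `config_rewrite()`.

```python
import time


def enable_aof_live(client, poll_interval=1.0, timeout=600.0):
    """Turn on AOF on a running Redis server before touching its config file.

    `client` is assumed to behave like a redis-py client (hypothetical setup).
    """
    # Enable AOF at runtime; the server then starts a background rewrite
    # that dumps the current in-memory dataset into the new append-only file.
    client.config_set("appendonly", "yes")

    deadline = time.monotonic() + timeout
    while True:
        persistence = client.info("persistence")
        rewrite_done = (
            int(persistence.get("aof_enabled", 0)) == 1
            and int(persistence.get("aof_rewrite_in_progress", 0)) == 0
            and int(persistence.get("aof_rewrite_scheduled", 0)) == 0
            and persistence.get("aof_last_bgrewrite_status") == "ok"
        )
        if rewrite_done:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError("initial AOF rewrite did not finish in time")
        time.sleep(poll_interval)

    # Only now persist the setting to the config file, so that a later
    # restart finds an AOF that already contains the dataset.
    client.config_rewrite()
```

Writing the setting to the configuration file only after the rewrite has finished is the point of the sketch: a restart then loads an AOF that already holds the data. Editing the configuration file by hand instead of calling `config_rewrite()` would serve the same purpose.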

We did not enable AOF on the live Redis server first, so the updated configuration led to data loss. The total data loss is estimated at about 190 KB and affected fewer than 50 web applications.
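A pre-restart check for this failure mode could look something like the sketch below. It is an illustrative assumption, not part of our actual tooling: the function name `restart_would_risk_data_loss` is hypothetical, and `persistence` is assumed to be the parsed output of `INFO persistence` from the live server.

```python
def restart_would_risk_data_loss(planned_appendonly, persistence):
    """Return True if restarting with the planned config looks unsafe.

    planned_appendonly: whether the new config file sets `appendonly yes`.
    persistence: dict parsed from `INFO persistence` on the live server.
    """
    if not planned_appendonly:
        # The new config keeps RDB-only persistence; nothing to check here.
        return False

    aof_live = int(persistence.get("aof_enabled", 0)) == 1
    rewrite_pending = (
        int(persistence.get("aof_rewrite_in_progress", 0)) == 1
        or int(persistence.get("aof_rewrite_scheduled", 0)) == 1
    )
    rewrite_ok = persistence.get("aof_last_bgrewrite_status") == "ok"

    # Unsafe if the live server has no AOF yet, or its first rewrite
    # has not completed successfully.
    return (not aof_live) or rewrite_pending or (not rewrite_ok)
```

An update script could call this before restarting the server and abort when it returns `True`.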

Although we had additional persistent storage (PostgreSQL), we were unable to recover the collaboration data from it immediately because it had been overwritten by new transactions.

Most of the applications were restored automatically from cached data in the users' browsers (IndexedDB); some applications were restored manually from PostgreSQL backups.

To prevent similar issues in the future, we introduced additional snapshotting of the collaboration data to persistent storage and updated our operating manuals.