Post mortem on the Global Services Outage
Outage of the 1st of February

All timestamps are in UTC+2.
Beginning on Thursday, February 1st, 2024, at 8:48 AM, curioustudios' services began performing slowly and shortly after experienced an outage, leaving them unable to respond to any requests. These services included the Stytsko bot, the Ukraine Info bot, the Ukraine Info Beta Programme bot, internal databases, artificial-intelligence service gateways, and several services related to monitoring air raid alerts in Ukraine.
All other services and bots, such as the Curiousity bot and Stytsko's analytics database, were unaffected, as they are hosted off-site.
This incident lasted from February 1st at 8:48 AM until 9:43 AM, when all services had fully returned to normal operation. The migration of the Ukraine Info Beta Programme bot marked the end of the incident.
How it started
The incident started with a sudden drop in network traffic: the server stopped transmitting any data for five minutes (from 8:49 AM to 8:54 AM). Right before the drop, the server's load average began climbing:

At 8:49 AM, right before the traffic got cut off, the server was at 100% CPU usage.
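For reference, this kind of snapshot can be grabbed straight from the shell (illustrative commands, not our exact monitoring stack):

```shell
# 1-, 5- and 15-minute load averages (runnable + uninterruptible tasks)
uptime

# Top CPU consumers right now -- this is the sort of view that would have
# pointed at mysqld during the incident (GNU ps, i.e. Linux)
ps -eo pid,pcpu,comm --sort=-pcpu | head -5
```

A load average that climbs well above the number of CPU cores while traffic drops is a strong hint that processes are stuck waiting on a saturated resource rather than doing useful work.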
The cause
In this screenshot, each category (CPU %, I/O read bytes and Memory (resident)) shows two colours of points: one green and one purple.
Keep in mind that we are looking at Memory (resident) here; that will be important later.
The green one is a process called mysqld, the MySQL server daemon. As you'll see in the screenshot, it had only just launched when the incident started, suggesting it might be one of the causes of the incident.
The purple one is a process called unattended-upgr (short for unattended-upgrades). On Ubuntu, it automatically installs security updates without requiring user intervention.
This screenshot is in UTC.

As you can see, the MySQL process had just been launched at around 8:49 AM, while unattended-upgr was running and had itself started only a minute earlier.
We rely on unattended-upgr to keep our systems secure at all times. However, could it launch MySQL? It very much could.
Earlier, I said we would be measuring Memory (resident). Resident memory is the physical RAM a process is actually using. Memory (virtual), on the other hand, shows the total address space the process has allocated, which can far exceed physical RAM; that over-allocation may also explain the slowness.

After looking at Memory (virtual), we can immediately see that MySQL was allocating far more memory than the server could comfortably provide.
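The resident-vs-virtual distinction is easy to check by hand with ps. A minimal sketch, run here against the current shell's own PID as a stand-in (mysqld won't be running everywhere):

```shell
# RSS = resident set size (KiB of physical RAM actually in use)
# VSZ = virtual size (KiB of address space the process has allocated)
ps -o pid,rss,vsz,comm -p $$

# Against the real daemon, assuming it is running:
# ps -o pid,rss,vsz,comm -C mysqld
```

A large VSZ with a much smaller RSS is normal on its own; trouble starts when resident usage grows toward the machine's physical RAM, which is what we saw here.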
What caused the launch of MySQL and its enormous usage?
In a typical setup, unattended-upgrades is responsible for automatically installing security updates, including updates to packages like MySQL: it checks for available updates and applies them without user intervention. However, it's important to note that the logs do not explicitly show unattended-upgrades triggering the MySQL upgrade, and we will be looking for follow-up evidence.
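For anyone wanting to dig through the same evidence, these are the standard places unattended-upgrades and dpkg record their activity on Ubuntu (the grep pattern is just an example, and missing files are treated as "no evidence" rather than an error):

```shell
# unattended-upgrades keeps its own log under /var/log/unattended-upgrades/;
# dpkg.log records every package state change with a timestamp.
check_mysql_upgrade_evidence() {
  grep -h "mysql" /var/log/unattended-upgrades/unattended-upgrades.log* 2>/dev/null
  grep -h " upgrade mysql" /var/log/dpkg.log* 2>/dev/null
  return 0
}
check_mysql_upgrade_evidence
```

A matching dpkg.log line timestamped around 8:48 AM would confirm the upgrade-restart theory; silence in both logs would point the investigation elsewhere.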
The launch of MySQL could have caused unexpected interactions between the upgrade process and other services, or simply increased resource demands. One example is our old Ukraine Info bot software, which started demanding more CPU when it didn't get what it needed, which in turn resulted in overallocation and unavailability of the server.
Lessons learned, changes to be made
Here are some things that we'll be implementing in the next few months or weeks:
set soft limits on processes and applications to prevent a complete outage
always leave spare resources
isolate and dockerize all applications to make sure others aren't affected by local outages
implement more logging and monitoring
completely uninstall obsolete or unused software
schedule and perform manual upgrades from time to time
improve our alerting system
try no more than 3 times to start a process if it fails to do so, keeping resource usage low even if something goes extremely south 💥
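As a sketch of the first and last points, assuming our services are managed by systemd (the unit name and values are illustrative, not tuned): a drop-in like this soft-caps memory and CPU and stops restart loops after three failed attempts. The real file would live at /etc/systemd/system/mysql.service.d/limits.conf; here we just write it to the current directory for review.

```shell
cat > mysql-limits.conf <<'EOF'
[Unit]
# No more than 3 start attempts within a 5-minute window
StartLimitIntervalSec=300
StartLimitBurst=3

[Service]
# Soft memory cap: the kernel reclaims aggressively above this line
MemoryHigh=2G
# At most 80% of one CPU
CPUQuota=80%
Restart=on-failure
EOF
cat mysql-limits.conf
```

After installing the real file, `systemctl daemon-reload` makes systemd pick it up; MemoryHigh throttles rather than kills, which is the "soft limit" behaviour we want.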
Those are some improvements we'll be making to avoid incidents we've had in the past, including this one.
We are sorry for any inconvenience caused on the 1st of February, and we'll improve as much as we can to make sure things like this never happen again. While you're here, check out our projects ⬇️
🚜 Stytsko
🪐 Curiousity
- Artem
