7/13/2022 in news
Hey, given that we had one of the largest involuntary downtimes two days ago, I thought a proper post going into the details of the incident was in order. TLDR: due to an oversight we ran out of disk space and lost some data.
On Monday night, around midnight (GMT), players started noticing strange behaviors: hydration timeouts, connection problems and similar issues. For some players the game didn't even start anymore. Usually if a server node fails, which produces similar results, we have an automatic process in place to restart it and players can continue playing. It takes a while for the new node to boot up, but eventually everything is back to normal. Not this time though. Some players tried to ping the team in Discord, but since we are a small team, currently all based in Germany, we didn't notice: we were sound asleep.
The next morning
On Tuesday morning I woke up to several direct messages, mentions and an application log full of errors. I quickly realized that we had run into a serious problem, so I shut down the server to be able to investigate the issue without causing more damage. After a first assessment of the server log files it became clear that the issue was the database. One of three database nodes had failed. The mechanism that restarts failing server nodes also restarts failing database nodes. During the startup of said node a problem occurred, causing it to fail again and restart again. Over and over.
After digging further into the database issue it became clear that most of the data this node had written since midnight was corrupted. The reason became obvious very fast: the node had run out of disk space. It is as simple as that. The database was able to store a few writes every now and then, basically whenever some disk space became available, but most of the writes not only failed, but also produced corrupted data.
I then quickly enlarged the disk volumes and removed the corrupted files. After yet another restart the database node booted up and reported for duty. At this point I knew that we had lost some data that had been created during the night, but not how bad it would be. There were two options to choose from: restoring the database from a backup (yes, we do have them 😀) or trying to boot up the server and see what the damage was. I decided to see what the damage was first, because the backup was more than a week old.
I fired up the servers, but kept the client deactivated to prevent players from connecting. I was astonished that the server booted up almost without errors. To be perfectly honest, I expected it to go up in flames. That gave me the confidence to activate the client again and let players onto the servers.
Once the players started pouring in, so did the bug reports. Most of the bug reports were of the same kind: a contract that was not in sync with the partner, a commodity exchange order that was not in sync with the broker and so on. There were hardly any errors that were contained in a single entity like a company or user.
From an architectural point of view that makes sense: every entity has an identifier, and that identifier determines which database node its data is stored in. If it was to be stored in the faulty database node, then the writes just failed that night and the respective player wasn't able to play properly, because they could not make any changes to their company. If the writes went to a healthy node everything worked as before.
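To make the routing idea concrete, here is a minimal sketch of identifier-based placement. It is not our actual implementation: the node names, the `node_for` function and the choice of a SHA-256 hash are all assumptions made for illustration. The only thing taken from the description above is that an entity's identifier alone decides which of the three nodes holds its data.

```python
import hashlib

# Hypothetical node names; the post only states that there are three nodes.
NODES = ["db-0", "db-1", "db-2"]


def node_for(entity_id: str, nodes=NODES) -> str:
    """Map an entity identifier to one node, deterministically.

    Assumption: a stable hash modulo the node count. A real database
    may use token ranges or consistent hashing instead.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    for entity in ("company-1", "company-2", "user-7"):
        print(entity, "->", node_for(entity))
```

Because the mapping is deterministic, an entity that lands on the faulty node stays there for every write, which matches the observation that some players could not change anything at all while others were unaffected.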
The trouble started with actions that involve two entities, one writing to a healthy database node and one to the faulty database node. In these cases the healthy one stored its data, for example a contract condition fulfillment, without problems, while the other side tried to store the data, but failed. This led to the mentioned inconsistencies.
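The failure mode can be reproduced in a few lines. The following is a toy simulation, not our server code: `Node`, `NodeDown`, `fulfill_condition` and `find_inconsistent` are hypothetical names, and the assumption is that both sides of a contract are written independently, with nothing coordinating the two writes.

```python
class NodeDown(Exception):
    """Raised when a simulated node cannot accept a write."""


class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def write(self, key, value):
        if not self.healthy:
            # Stand-in for the real failure: the disk is full.
            raise NodeDown(f"{self.name}: no space left on device")
        self.data[key] = value


def fulfill_condition(contract_id, side_a, side_b):
    """Write the same update to both partners' nodes, independently."""
    results = {}
    for label, node in (("a", side_a), ("b", side_b)):
        try:
            node.write(contract_id, "fulfilled")
            results[label] = True
        except NodeDown:
            results[label] = False
    return results


def find_inconsistent(contract_ids, side_a, side_b):
    """Return the contracts whose two sides disagree."""
    return [c for c in contract_ids
            if side_a.data.get(c) != side_b.data.get(c)]


if __name__ == "__main__":
    healthy, faulty = Node("db-0"), Node("db-1", healthy=False)
    print(fulfill_condition("contract-1", healthy, faulty))
    print(find_inconsistent(["contract-1"], healthy, faulty))
```

One side records the fulfillment and the other never does, so the two partners end up looking at different versions of the same contract.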
Luckily most of the inconsistencies are not too bad and can be fixed by, for example, breaching a contract or deleting a commodity exchange order.
So far we are pretty confident that we don't have to roll back to the old backup and can continue as is. This isn't the final universe yet, so it doesn't matter too much. Hopefully this sentence will age well 😉
The most important lesson learnt during this incident: Don't let the database run out of disk space, ever! We do have monitoring for the amount of disk space left and should have reacted way before it ran out. Unfortunately the disk space monitoring is not yet attached to our alert system, which is what sends out notifications to the team, so nobody was warned that we were running low on disk space. We'll add it as soon as possible.
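For illustration, this is roughly what a low-disk-space check wired to a notification could look like. It is a sketch under assumptions, not our monitoring setup: the 20% threshold, the `check_disk` function and the `notify` callback (which just prints by default) are made up for this example. The only real API used is `shutil.disk_usage` from the Python standard library.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the share of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def check_disk(path: str, warn_below: float = 0.20, notify=print) -> bool:
    """Call `notify` and return True if free space is below the threshold.

    `warn_below` and `notify` are hypothetical knobs; a real setup would
    send the message to whatever alert channel the team already uses.
    """
    fraction = free_fraction(path)
    if fraction < warn_below:
        notify(f"LOW DISK: {path} has {fraction:.0%} free")
        return True
    return False


if __name__ == "__main__":
    check_disk("/")
```

The point of the threshold is to get the warning well before the volume is full, while there is still time to enlarge it.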