The Good, the Bad, the Ugly - Development Log #491

Michi talks about the recent maintenance release and the server outage on Friday

Michi (molp)

What a week! A lot has happened over the last days, so let's break it down!

The good news is that the maintenance update was released on July 9th without much trouble. It contains some technical improvements to mission planning, adjustments to inflation and a buff to the faction contract rewards. The SYSI command now shows more information, and a lot of small quality-of-life improvements have been added. As usual, you can find the full release notes on the forums.

You probably experienced the bad news yourself: on Friday we had a server outage. Unfortunately, I wasn't working that day, so I didn't notice it immediately. Shortly before midday I received a ping and rushed to the office to see what was happening. It was pretty clear from the beginning that this was a major problem and not just an entity-got-stuck-and-needs-percussive-maintenance thing. It was easy to locate the problem though: one of the three volumes containing the actual data of the database was full. That led to the failure of one of the three database nodes and impacted the game as a whole.

We had a similar problem a few years back, and ever since, we have kept plenty of empty space available for the database. This time though, it wasn’t enough. During the night of the 5th to the 6th, the volume utilization grew by almost 30% in just a few hours, something we’ve never experienced before. The database managed to cope for a few more days until it failed on Friday.

Unfortunately, increasing the available space didn't help. The database wouldn't start, and that brings us to the ugly part: the database files had become corrupted. Everything on the failed node from Friday 09:37 onwards had to be removed. There wasn't really any other option. The last backup was from before 09:37, so we decided to give it a try: if it didn't work, we would have had to go back to the backup in any case. We prepared for the worst, deleted the files and restarted the servers.

To our surprise, the servers started up, the game started and everything seemed to work as expected. No exceptions in the log, no problems loading the game's entities like companies, brokers etc. After a few more tests we let the players join and watched the logs, but nothing bad happened. Since it was getting late already and I had family duties, I called it a day. From time to time I had a peek at the logs, but everything seemed to run normally.

I’m pretty happy that we built the game to be resilient enough to withstand even such a data loss. If you’re one of the affected players, the data loss can look like a ship that you sent on a flight that never started, an inventory transfer you made that has been reverted, or a production order you placed that has vanished. It is very unlikely that anything has been lost entirely.

There have been some reports on Discord of general sluggishness and multiple comex brokers that wouldn't load. I haven't been able to reproduce the issue yet; maybe it has resolved itself. If you run into any issues and think they might be related to the outage, please report them in this topic on the forums.

We're still not sure what caused the sudden spike in disk utilization. After we increased the volume and restarted the server, the additional space was freed automatically. We’re definitely going to improve our alerting in the coming days.
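
For illustration only: a spike like the one described above would stay under a fixed fill-level threshold for a while, but a check on how fast the volume is filling up could flag it within hours. The following is a minimal Python sketch of that idea, not the actual monitoring setup. The names `Sample`, `read_usage` and `check_volume`, as well as both threshold values, are assumptions made for this example.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement of how full a volume is (hypothetical helper type)."""
    timestamp: float      # seconds since the epoch
    used_fraction: float  # 0.0 (empty) to 1.0 (full)


def read_usage(path: str, timestamp: float) -> Sample:
    """Measure the volume that contains `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return Sample(timestamp, usage.used / usage.total)


def check_volume(previous: Sample, current: Sample,
                 max_used: float = 0.70,
                 max_growth_per_hour: float = 0.05) -> list[str]:
    """Return alert messages for two consecutive samples.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    alerts = []

    # Classic alert: the volume is fuller than the allowed level.
    if current.used_fraction >= max_used:
        alerts.append(f"volume is {current.used_fraction:.0%} full")

    # Growth alert: utilization is rising unusually fast, even if the
    # absolute level still looks harmless.
    hours = (current.timestamp - previous.timestamp) / 3600
    if hours > 0:
        growth = (current.used_fraction - previous.used_fraction) / hours
        if growth >= max_growth_per_hour:
            alerts.append(f"utilization grew by {growth:.1%} per hour")

    return alerts
```

With these assumed thresholds, two samples taken three hours apart at 40% and 69% utilization would trigger only the growth alert, because the fill level is still below 70%. In practice `read_usage` would be called periodically (for example `read_usage("/", time.time())`) and the result compared with the previous sample.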

As always, we'd love to hear what you think: join us on Discord or the forums!

Happy trading!