Devlog, July 31, 2023
This week Michi talks about the Liquidity release and the technical difficulties it brought with it. Join us for a livestream on YouTube next week!
Next week we'll hold a dev livestream where we'll talk about the Liquidity update and what's next for Prosperous Universe. You'll find the livestream here on Monday, August 7th at 8am PST / 11am EST / 6pm CET.
With that announcement out of the way, it is time to talk about the Liquidity release. I am sure every active player felt the problems it brought with it: for almost two days, the servers were down more often than up. What happened?
Not long after the new code had been deployed, we noticed that server performance was much worse than before the update. Usually we deploy the server part first, then the client part a couple of minutes later. That gives the server time to run updates, and gives us the opportunity to check for potential problems before players can join and add additional load. Although everything was slow, the logs looked fine, and we allowed players to join the server.
After a while a few entities still weren't up, showing hydration timeouts, and after a quick investigation it was clear that all of them were faction agents. A few minutes later the servers ran out of memory and had to restart. This chain of events repeated roughly every ten minutes. Sometimes it would bring all servers down, sometimes only a single node. For the players it looked like random disconnects combined with a really slow APEX interface, since most of the data had to be loaded from scratch on every iteration.
The faction agents are rather large entities, since they provide the faction contracts to almost all players. We suspected that our nodes simply didn't have enough memory to load them all at once, so we increased the memory allocated to the server nodes as much as possible. It didn't help. We then ordered new nodes with triple the available memory and moved the server over. Still no success.
You might recall from older devlogs that in our architecture, all events that happen during the lifetime of an entity are stored in the database and loaded again as a stream of events once the entity "wakes up". To make that process quicker, an entity can store snapshots of its current state. We implemented snapshotting for faction agents for the Liquidity release, because we knew they could potentially grow very large.
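The recovery flow described above can be sketched roughly like this. This is a minimal toy in Python, not the actual game code; the entity type, the event shape, and the helper names are all illustrative:

```python
# Toy sketch of event sourcing with snapshots: an entity is recovered
# either by replaying its full event stream, or by starting from a
# snapshot and replaying only the events that came after it.

class Entity:
    def __init__(self):
        self.state = {"contracts": 0}
        self.version = 0  # number of events applied so far

    def apply_event(self, event):
        # Each stored event mutates the in-memory state.
        if event["type"] == "contract_created":
            self.state["contracts"] += 1
        self.version += 1

def recover(events, snapshot=None):
    """Wake up an entity: start from the latest snapshot (if any),
    then replay only the events recorded after it."""
    entity = Entity()
    start = 0
    if snapshot is not None:
        entity.state = dict(snapshot["state"])
        entity.version = snapshot["version"]
        start = snapshot["version"]
    for event in events[start:]:
        entity.apply_event(event)
    return entity

def take_snapshot(entity):
    # Persisting this lets future recoveries skip all earlier events.
    return {"state": dict(entity.state), "version": entity.version}

# Without a snapshot, all 100,000 events are replayed; with one, none are.
events = [{"type": "contract_created"}] * 100_000
entity = recover(events)                  # full replay
snap = take_snapshot(entity)
fast = recover(events, snapshot=snap)     # replays zero events
assert fast.state == entity.state
```

Once the event data from before the snapshot is no longer needed for recovery, it can also be dropped from memory, which is exactly what the faction agents needed.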
So the goal was to provide as much memory as possible, so that the agents could write their first snapshot and discard all the event data that precedes it.
After much debugging we found out that memory was not as much of a problem as we had thought! The problem really lay in the event processing during the loading phase of the faction agents. Certain events (condition-fulfilled events, for example) took tens or even hundreds of milliseconds to process, while regular events took only a few milliseconds or less. That of course poses a real problem for an entity that consists mainly of hundreds of thousands of contract-related events. While the faction agents were busy working through these slow events, the database kept shoving new event data into the agents' mailboxes, overwhelming them and eating up all the available memory.
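The dynamic is easy to reproduce with a toy simulation. The Python sketch below is not the actual server code; the 100 ms tick, the batch size, and the per-event costs are made-up numbers chosen only to show the effect:

```python
# Toy simulation of a mailbox being flooded during recovery: the database
# delivers events in batches while the entity works through them. If each
# event is slow to process, the mailbox (and thus memory use) balloons;
# once events are cheap again, the mailbox stays small.

from collections import deque

def replay(event_cost_ms, delivery_batch=50, total_events=10_000):
    """Deliver events in batches while the consumer processes them
    within 100 ms ticks. Returns the peak mailbox size reached."""
    mailbox = deque()
    peak = 0
    pending = total_events  # events still waiting in the database
    budget_ms = 0
    while pending or mailbox:
        # Producer: the database pushes the next batch into the mailbox.
        if pending:
            batch = min(delivery_batch, pending)
            mailbox.extend([None] * batch)
            pending -= batch
        peak = max(peak, len(mailbox))
        # Consumer: process as many events as fit into one 100 ms tick.
        budget_ms += 100
        while mailbox and budget_ms >= event_cost_ms:
            mailbox.popleft()
            budget_ms -= event_cost_ms
    return peak

# A 50 ms event keeps the consumer far behind the database:
peak_slow = replay(event_cost_ms=50)
# With ~1 ms events, the consumer keeps up and the mailbox never piles up:
peak_fast = replay(event_cost_ms=1)
```

In the slow case the mailbox grows to nearly the full event count before it ever drains, which is the memory blow-up we saw in production; making the slow events cheap keeps the backlog flat.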
After fixing that bottleneck, the faction agents loaded relatively quickly, could throw away old data, and could take a snapshot of themselves, speeding up further recoveries by orders of magnitude.
There are many things that can go wrong with a release, but this problem was special: It didn't show up in testing and wasn't visible on our test server. It only occurred in production.
It all ended well though: APEX is back up, and we learned a lot about debugging performance problems. I want to thank you for staying patient and positive during that time!