I wanted to share some details about the migration of the lobby we did on Friday and the problems we encountered while doing so. That'll take a bit to explain, so I suggest you grab a cup of tea and some cookies before you continue reading.
The motivation for migrating the lobby to a new server was to get it into an up-to-date state and to improve its performance. The operating system on the previous server was getting old, and as it had been updated manually over the past years, a lot had accumulated on the server that wasn't really necessary for operating the lobby. Therefore, to start with a clean slate, we decided to set up a new server from scratch. Server in this case means a new virtual machine as part of the infrastructure Wildfire Games has available to host everything related to 0 A.D.
Instead of manually configuring the new server, we wrote so-called playbooks for a tool called Ansible, which are instructions written as code that describe how to configure such a server. We did so to increase transparency and to document what's running on the lobby server. Going forward, instead of making manual changes on the live lobby, changes are written as Ansible code and can be reviewed like any other code related to 0 A.D. This also makes it possible to test changes in an isolated environment instead of having to use the live lobby for that, as the Ansible code makes it easy to configure other computers the same way. With an additional tool called Vagrant, which allows easy creation of virtual machines on your local computer, it's now easy to get a nearly identical copy of the official lobby running on your own computer for testing purposes.
If you're interested in the details, please check out the git repository where we published all of that: https://github.com/0ad/lobby-infrastructure/. If you're interested in participating and improving the infrastructure behind the lobby, contributions are always welcome there as well.
While having the configuration of the lobby available as Ansible playbooks means that configuring a new server is just a matter of running a single command and waiting a few minutes, that's only true for a new lobby without any existing state. State in this context means lobby accounts, ratings and certain logs we wanted to keep. As migrating the state takes additional time and care, and because unexpected things might (and in fact did) happen during such a migration, we planned a fairly long downtime of 4 hours. However, the actual migration only took roughly one hour, and once everything looked good we made the lobby available to all of you again. Unfortunately, that's when we and some of you started to notice some problems, which took us quite a while to debug and fix. Here is a list of the most critical and interesting problems we encountered:
Games becoming invisible
Once more and more of you joined the lobby after the migration, eager to play again, we quickly noticed that something wasn't right. Games would become invisible or wouldn't even show up in the first place.
After searching for a while, we figured out that this was caused by rate-limiting of messages sent through the server. There is rate-limiting in place to prevent spamming with large amounts of messages. That means that each user can only send a certain amount of text per second. During the migration we made a mistake in the configuration and applied the rate-limiting that applies to all players to the bot managing the games as well. While you don't see many written messages from WFGBot, it's actually a pretty busy bot and sends out a lot of messages which get processed by 0 A.D. to be able to show you the list of games. With WFGBot no longer able to send all the messages it wanted to send, the list of shown games would be outdated or even completely empty, because your instance of 0 A.D. wouldn't get up-to-date information about the available games. We didn't catch this problem in our testing prior to the migration, as our test setup had too few online players and hosted games to trigger the rate-limiting.
Fortunately the fix for this problem was easy and just required fixing the mistake in the configuration.
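To illustrate why a per-user limit that is harmless for players can starve a busy bot, here is a minimal sketch of a token-bucket rate limiter in Python. It is not the lobby's actual configuration: the class and function names, the limit of 1000 bytes per second and the message sizes are all made up for illustration, and a real server may delay messages instead of dropping them.

```python
from typing import List, Optional, Tuple


class TokenBucket:
    """Allows `rate` bytes per second, with bursts of up to `capacity` bytes."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous message

    def allow(self, size: int, now: float) -> bool:
        # Refill for the time that has passed, but never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the limit: this sketch simply drops the message


def delivered(bucket: Optional[TokenBucket], messages: List[Tuple[float, int]]) -> int:
    """Count how many (timestamp, size) messages get through.

    `bucket=None` models a sender that is exempt from rate-limiting.
    """
    if bucket is None:
        return len(messages)
    return sum(bucket.allow(size, ts) for ts, size in messages)


# Hypothetical traffic: a player chats a little, the bot sends bursts of game updates.
player = [(t, 100) for t in range(10)]                # one 100-byte message per second
bot = [(t, 500) for t in range(4) for _ in range(5)]  # five 500-byte messages per second

print(delivered(TokenBucket(1000, 1000), player))  # 10 of 10 get through
print(delivered(TokenBucket(1000, 1000), bot))     # only 8 of 20 get through
print(delivered(None, bot))                        # 20 of 20 once the bot is exempt
```

With only a handful of test games the bot would stay under such a limit, which matches why the problem only showed up under real load.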
Lots of stale and outdated games being shown
With invisible games not being a problem anymore, the list of games constantly grew and quickly started to show games whose hosts had already left the lobby. That's not a new problem, and you've probably seen such stale games in the past already. It happens when WFGBot doesn't get a notification when a player hosting a game leaves the lobby, as WFGBot then doesn't know that this player isn't hosting a game anymore. We don't know why that happens sometimes, but it does, and when it does it leaves behind such stale games.
To avoid this problem going forward, we added a filter so that only hosted games whose host is still online are shown.
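The idea behind that filter can be sketched in a few lines of Python. This is not the actual implementation; `Game`, `visible_games` and the example names are hypothetical and only show the principle of checking each game's host against the set of users currently online.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Game:
    name: str
    host: str  # lobby user who registered the game


def visible_games(games: Iterable[Game], online_users: Iterable[str]) -> List[Game]:
    """Keep only games whose host is still present in the lobby."""
    online = set(online_users)  # set lookup keeps the filter cheap for long lists
    return [game for game in games if game.host in online]


# Hypothetical example: bob left without the bot being notified.
games = [Game("2v2 mainland", "alice"), Game("1v1 rated", "bob")]
print(visible_games(games, ["alice", "carol"]))  # only alice's game remains
```

The advantage of filtering at display time is that it works even when the notification about a leaving host never arrives.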
Windows users not able to join anymore with TLS-encrypted connections
Another problem which became visible was that Windows users weren't able to join the lobby anymore if they had TLS-encrypted connections enabled in the settings (which is the default and a good idea to have set).
To explain why that happened I have to back up a bit. At the core of the lobby is a protocol called XMPP, which is an extensible chat protocol. When you connect to the lobby using 0 A.D., it'll establish an XMPP connection to an XMPP-server running as part of the lobby. Such connections can be unencrypted or encrypted with TLS. TLS is the encryption protocol also used when you visit websites whose protocol is HTTPS, like your beloved https://play0ad.com/. TLS is available in multiple versions. For historical reasons 0 A.D. up to Alpha 26 on Windows only works with TLS v1.0, which is deprecated nowadays and usually disabled by default. Connecting to the lobby with TLS-encrypted connections didn't work for Windows users right after the migration, because the lobby XMPP-server didn't offer TLS v1.0 anymore, but only more recent TLS versions. However, the configuration of the XMPP-server itself was fine. What we missed during the migration of the lobby was to enable TLS v1.0 in OpenSSL, the library the XMPP-server uses for all the heavy-duty TLS-related work. Interestingly, even if we hadn't missed that, it wouldn't have worked, because the newer OpenSSL version on the new server requires slightly different configuration parameters than before. Nevertheless, this problem should have surfaced during testing before the migration, but didn't, because we simply forgot to test with Windows.
The fix was once again straightforward and just involved setting the correct OpenSSL configuration once we had figured out what exactly the culprit was. Going forward we'd love to disable such old TLS versions, but we'll have to wait with that until all recent 0 A.D. versions support newer TLS versions as well.
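As a simplified illustration of why the connection failed, TLS version selection can be thought of as picking the highest version both sides support. The following Python sketch models only that idea and is not how OpenSSL or the XMPP-server implement it; the function name and the exact version sets for the server are assumptions, since all we know is that TLS v1.0 was missing after the migration.

```python
from typing import Iterable, Optional

# TLS versions from oldest to newest.
TLS_ORDER = ["TLSv1.0", "TLSv1.1", "TLSv1.2", "TLSv1.3"]


def negotiate(client: Iterable[str], server: Iterable[str]) -> Optional[str]:
    """Return the newest TLS version both sides support, or None if there is none."""
    common = set(client) & set(server)
    if not common:
        return None  # no shared version: the handshake fails
    return max(common, key=TLS_ORDER.index)


windows_client = {"TLSv1.0"}                         # 0 A.D. up to Alpha 26 on Windows
server_after_migration = {"TLSv1.2", "TLSv1.3"}      # assumed: TLS v1.0 not offered
server_after_fix = {"TLSv1.0", "TLSv1.1", "TLSv1.2"}  # assumed: TLS v1.0 offered again

print(negotiate(windows_client, server_after_migration))  # None
print(negotiate(windows_client, server_after_fix))        # TLSv1.0
```

Clients that support newer versions are not affected by TLS v1.0 being offered, as they still end up with the newest version they share with the server.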
Some Linux users not able to connect anymore with TLS-encrypted connections
With login for Windows users fixed, we received reports from some Linux users not being able to connect to the lobby with TLS-encrypted connections enabled. Figuring out what was causing this took us a while, as it did work for the majority of Linux users, but not for all of them. The culprit was once again the set of TLS versions supported by the XMPP-server, but this time it wasn't an old version that was missing, but a new version that caused problems in certain cases.
As part of the migration we enabled the most recent TLS version, TLS v1.3. This usually works without having to change clients, because only clients supporting that version will use it. The client which didn't work correctly was gloox, the XMPP-client library used by 0 A.D. We don't know yet why it doesn't work, but it apparently doesn't. The interesting piece is why that was so difficult to track down. Contrary to Windows, it's common on Linux that applications don't contain all of their dependencies, but rely on them being provided by the operating system in one way or another. The version of gloox adding support for TLS v1.3 got released less than 4 weeks ago, and Linux distributions usually don't include software which was released just a few weeks ago. That's why the majority of Linux users had no problem, as the version of gloox they were using didn't support TLS v1.3. The affected users were mainly those who installed 0 A.D. as a Flatpak application, as the Flatpak app for 0 A.D. included such a new version of gloox.
Our workaround for this problem was to disable support for TLS v1.3 again on the lobby server, as that makes gloox and therefore 0 A.D. happy again. That's of course just a workaround, as we'd like to be able to offer TLS v1.3 in the future as well, but before we can enable it again we have to figure out why it's currently not working with gloox and get that fixed.
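For readers who want to see what capping the TLS version looks like in code, here is a small sketch using Python's standard `ssl` module. It only demonstrates the general idea of a server refusing to negotiate TLS v1.3; the lobby's XMPP-server is configured differently, and the function name and the `allow_tls13` parameter are made up for this example.

```python
import ssl


def make_server_context(allow_tls13: bool) -> ssl.SSLContext:
    """Create a TLS server context, optionally capped at TLS v1.2."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if not allow_tls13:
        # Clients that support TLS v1.3 will fall back to TLS v1.2 or older.
        ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_server_context(allow_tls13=False)
print(ctx.maximum_version)
```

A real server would additionally load a certificate and private key into the context before accepting connections.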
As you can see, preparing and carrying out the migration of the lobby took quite some effort and was not without challenges. While most of the problems which appeared during the migration could've been avoided, that's always easy to say in hindsight. I'm personally very satisfied with the result of the migration though, as it's a great base for further improvements and the performance of the lobby is even better than before.