
Lobby problem


borg-

Thanks for the heads-up. However, the forums aren't a good place for such urgent information, as notifications can arrive and be read with a significant delay. In fact, I only saw this post when I came to the forums to publish the post-mortem below. Pinging me in the lobby (as long as it's still available) or in the 0ad channels on IRC is usually much faster.

Today's lobby outage happened between 18:31 UTC and 19:04 UTC. During the outage, the ability to host games or join hosted games was limited, and the lobby bots rapidly quit and rejoined the lobby, which flooded it with quit and join messages.

Here is why that happened:

Earlier today I merged some lobby-related infrastructure improvements (https://github.com/0ad/lobby-infrastructure/) and deployed the changes to the lobby VM. As part of that, I accidentally also deployed the production configuration (lobby-config.yml) to a local instance of the lobby running as a Vagrant VM on my machine. This happened because the hosts file I use for Ansible contained not just the VM of the official lobby but my local instance as well, and I didn't limit the run to the official lobby (using ansible-playbook --limit …). My local lobby instance was in the hosts file because I sometimes test things that are easier to do with ansible-playbook than with vagrant provision. While I did notice the incorrect deployment of the production config to my local VM, I didn't recognize the impact it might have. Instead, I figured that running vagrant provision later on would apply the correct config to my local instance and fix everything.
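One way to guard against an unrestricted run like the one described above is a small wrapper that refuses to build an ansible-playbook command without an explicit host limit. This is only a sketch: the function name, the playbook file name and the inventory path are hypothetical; only the `-i` and `--limit` options are real ansible-playbook options.

```python
def build_playbook_command(playbook, inventory, limit):
    """Build an ansible-playbook command line that always carries --limit.

    Hypothetical helper: it only assembles the argument list and leaves
    running the command to the caller.
    """
    # Reject missing limits and patterns that would match every host anyway.
    if not limit or limit.strip() in {"all", "*"}:
        raise ValueError("refusing to run without an explicit --limit host pattern")
    return ["ansible-playbook", "-i", inventory, "--limit", limit, playbook]


if __name__ == "__main__":
    # "site.yml" and "hosts" are placeholder names for illustration.
    print(build_playbook_command("site.yml", "hosts", "lobby.wildfiregames.com"))
```

The resulting list can be passed to `subprocess.run`. The point of the design is that forgetting the limit becomes an error instead of a deployment to every host in the inventory.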

When I later ran vagrant provision to test some unrelated changes locally, the correct config (lobby-config-vagrant.yml) did get used, but instead of fixing things, it caused the outage. That's because deploying changes to the lobby bots doesn't remove already existing instances of the lobby bots. For example, if a bot named xpartamupp1 is present on a target host, it doesn't get removed by a new deployment whose config no longer includes xpartamupp1 but includes, say, xpartamupp2 instead. That alone wouldn't have been an issue, but the deployment also removed the mapping of lobby.wildfiregames.com to 127.0.0.1 in /etc/hosts. That mapping is set up for the production lobby to ensure the lobby bots don't need a network round-trip when connecting to the ejabberd running on the same host. Without it, the production lobby bots left over on my local VM connected to the official ejabberd server instead of the one running on localhost.
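The leftover-instance problem boils down to a set difference between what is present on the host and what the current config describes. A minimal sketch, assuming both are available as plain lists of bot names (the function name is hypothetical, and how the real deployment would discover the instances on the host is not covered here):

```python
def find_stale_bots(present_on_host, in_config):
    """Return bot instances that exist on the host but are absent from the config."""
    return sorted(set(present_on_host) - set(in_config))


if __name__ == "__main__":
    present = ["xpartamupp1"]     # left behind by the earlier deployment
    configured = ["xpartamupp2"]  # what the newly deployed config describes
    print(find_stale_bots(present, configured))  # prints ['xpartamupp1']
```

A deployment that stopped or removed every name returned here would not have left the old bots running.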

Suddenly, for every bot, two instances with the same XMPP resource were connecting, which caused ejabberd to kick out the one that had connected first. As each kicked bot immediately reconnected, this resulted in a kick loop that caused the rapid quits and rejoins of the lobby bots. To get out of this situation, I simply stopped the bots on my local lobby instance. That took a while, as I first had to figure out the cause of the problems.
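The kick loop can be reproduced with a toy model. The sketch below is not ejabberd or the lobby bots' code; it only assumes the behavior described above: a new session for an address that is already in use displaces the older session, and a displaced client reconnects immediately. The class, the client labels and the address are hypothetical.

```python
class ToyServer:
    """Toy stand-in for a server that lets the newest session for an address win."""

    def __init__(self):
        self.sessions = {}  # address -> label of the client currently holding it
        self.kicks = 0

    def connect(self, address, client):
        """Register a session and return the client it displaced, if any."""
        displaced = self.sessions.get(address)
        self.sessions[address] = client
        if displaced is not None and displaced != client:
            self.kicks += 1
            return displaced
        return None


def simulate(rounds):
    """Count kicks when two clients fight over one address for `rounds` connects."""
    server = ToyServer()
    address = "bot@lobby.example/resource"  # hypothetical address
    server.connect(address, "official-host")
    pending = "local-vm"  # the second instance with the same resource
    for _ in range(rounds):
        # Whoever gets kicked reconnects right away and kicks the other one.
        pending = server.connect(address, pending)
    return server.kicks


if __name__ == "__main__":
    print(simulate(10))  # prints 10
```

Every connect produces a kick, so the loop never settles on its own. It only ends when one side stops reconnecting, which is what stopping the local bots achieved.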

While the outage was ultimately my fault, I believe there are some things we can improve to avoid such problems, or reduce their impact, in the future:

- Adjust the Ansible code to delete (or at least disable) lobby bots present on the target host, but not in the configuration used to deploy.
- Fix the lobby bots so they honor the reconnect delay when getting kicked like in this situation.
- Avoid putting hosts which aren't the official lobby host in Ansible's `hosts` file.
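For the second item, a reconnect delay is often implemented as an exponential backoff with a cap. The following is a sketch of that general technique and not the lobby bots' actual reconnect logic; the function name and the default values are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=2.0, cap=300.0, jitter=0.0, rng=random.random):
    """Seconds to wait before reconnect attempt number `attempt` (starting at 0).

    The delay doubles with every attempt up to `cap`. `jitter` adds a random
    fraction of the delay on top, so two bots don't retry in lockstep.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + jitter * delay * rng()


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, reconnect_delay(attempt))
```

With a delay like this applied after a kick as well, two conflicting instances would still displace each other, but at a rapidly decreasing rate instead of in a tight loop.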

If somebody wants to step in and contribute to these improvements or would like to see other improvements, pull requests for https://github.com/0ad/lobby-bots/ and https://github.com/0ad/lobby-infrastructure/ are always welcome.
