Dunedan

WFG Programming Team
  • Posts

    238
  • Joined

  • Days Won

    4

Everything posted by Dunedan

  1. We're having some issues with the lobby right now, but we're working on it. I'll post an update once it's back up.
  2. Thanks for the heads up. However, the forums aren't a good place for such urgent information, as notifications from here might be received and read with a significant delay. In fact, I only saw this post when I came to the forums to post the post-mortem you can see below. Pinging me in the lobby (as long as it's still available) or in the 0ad channels on IRC is usually much faster.

Today's lobby outage happened between 18:31 UTC and 19:04 UTC. During that outage the ability to host games or join hosted games was limited, and the lobby bots rapidly quit and rejoined the lobby, leading to a lot of spam from the quit and join messages.

Here is why that happened: Earlier today I merged some lobby-related infrastructure improvements (https://github.com/0ad/lobby-infrastructure/) and deployed the changes to the lobby VM. As part of that I accidentally deployed the production configuration (lobby-config.yml) to a local instance of the lobby running as a Vagrant VM on my machine as well. This happened because the hosts file I use for Ansible didn't just contain the VM of the official lobby, but my local instance as well, and I didn't limit the deployment to the official lobby (using ansible-playbook --limit …). I had my local lobby instance in the hosts file because I sometimes test things which are easier to do with ansible-playbook than with vagrant provision.

While I did notice the incorrect deployment of the production config to my local VM, I didn't recognize the impact it might have. Instead I figured that running vagrant provision later on would cause the correct config to be used for my local instance and would fix everything again. When I ran vagrant provision later to test some unrelated changes locally, the correct config (lobby-config-vagrant.yml) did get used, but instead of fixing things, it resulted in the outage.

That's because deploying changes to the lobby bots doesn't remove already existing instances of the lobby bots. So if, for example, a bot named xpartamupp1 is present on a target host, it doesn't get removed when a new deployment happens whose config doesn't include xpartamupp1 anymore, but includes for example xpartamupp2 instead. That alone wouldn't have been an issue, but that deployment also removed the mapping of lobby.wildfiregames.com to 127.0.0.1 in /etc/hosts. That mapping exists on the production lobby to ensure the lobby bots don't need a network round-trip when connecting to the ejabberd running on the same host. Without it, the production lobby bots left over on my local instance connected to the official ejabberd server instead of the one running on localhost. Suddenly two instances of every bot were connecting with the same XMPP resource, and that caused ejabberd to kick out the one which had connected first. As each bot then immediately reconnected, the result was a kick-loop which caused the rapid quits and rejoins of the lobby bots.

To get out of this situation I simply stopped the bots on my local lobby instance. That took a bit, as I had to figure out the reason for the problems first.

While that whole outage was ultimately my fault, I believe there are some things we can improve to avoid such problems or reduce their impact in future:

- Adjust the Ansible code to delete (or at least disable) lobby bots present on the target host, but not in the configuration used to deploy.
- Fix the lobby bots so they honor the reconnect delay when getting kicked like in this situation.
- Avoid putting hosts which aren't the official lobby host in Ansible's `hosts` file.

If somebody wants to step in and contribute to these improvements or would like to see other improvements, pull requests for https://github.com/0ad/lobby-bots/ and https://github.com/0ad/lobby-infrastructure/ are always welcome.
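The first suggested improvement comes down to comparing the bots present on a host with the bots named in the deployed configuration. The following is only a minimal Python sketch of that comparison, not the actual Ansible code from the lobby-infrastructure repository; the function name `find_stale_bots` and the bot name `echelon1` are made up for illustration.

```python
def find_stale_bots(present_on_host, configured):
    """Return bot names that exist on the host but are missing from the config."""
    return sorted(set(present_on_host) - set(configured))


# Hypothetical example mirroring the situation described in the post:
# xpartamupp1 was deployed earlier, the new config only knows xpartamupp2.
present = ["xpartamupp1", "echelon1"]
configured = ["xpartamupp2", "echelon1"]

stale = find_stale_bots(present, configured)
print(stale)  # these are the instances a deployment would have to stop or remove
```

In a real deployment the list of present bots would have to be gathered from the target host first, for example from its service units; how that is done is left open here.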
  3. I see little value in automated matchmaking with the current number of players playing multiplayer games, as matching with equally ranked players would either take much too long or would match the same opponents over and over again. To make a meaningful difference compared to just selecting a game as it works right now, matchmaking IMO would have to happen server-side (as it could easily be gamed with modified clients otherwise), such games must not be visible in the existing lobby, and 0 A.D. clients must not know the identity of the opponent before a game starts. While one of the players would still need to host the game, its settings shouldn't be set by the host, but by the server logic, based on the preferences of all matched players. Implementing it this way would also offer the ability to centrally punish known smurfs or cheaters, as we could deny them participation in such automatically matched games (or shadowban them, match them with stronger players, ...). Maybe it'd make sense to introduce automated matchmaking only when WFG-hosted games are an option as well, as that'd remove a whole bunch of problems.
  4. I wanted to share some details about the migration of the lobby we did on Friday and what problems we encountered while doing so. That'll take a bit to explain, so I suggest you grab a cup of tea and some cookies before you continue to read.

The motivation for migrating the lobby to a new server was to get it into an up-to-date state and to improve its performance. The operating system on the previous server was getting old, and as it had been updated manually over the past years, there was lots of stuff on the server which wasn't really necessary for operating the lobby. Therefore, to start with a clean slate, we decided to set up a new server from scratch. Server in this case means a new virtual machine as part of the infrastructure Wildfire Games has available to host everything related to 0 A.D.

Instead of manually configuring the new server, we wrote so-called playbooks for a tool called Ansible, which are instructions, written as code, for how to configure such a server. We did so to increase transparency and to document what's running on the lobby server. Going forward, instead of doing manual changes on the live lobby, changes are written in Ansible code and can be reviewed like any other code related to 0 A.D. This also improves the ability to test changes in an isolated environment instead of having to use the live lobby for that, as the Ansible code makes it easy to configure other computers the same way. With an additional tool called Vagrant, which allows easy creation of virtual machines running on your local computer, it's now easy to get a nearly identical copy of the official lobby running for testing purposes on your own computer. If you're interested in the details, please check out the git repository where we published all of that: https://github.com/0ad/lobby-infrastructure/. If you're interested in participating and improving the infrastructure behind the lobby, contributions are always welcome there as well.

While having the configuration of the lobby available as Ansible playbooks means that configuring a new server is just a matter of running a single command and waiting a few minutes, that's only true for a new lobby without any existing state. State in this context means lobby accounts, ratings and certain logs we wanted to keep. As migrating the state takes additional time and care, and because unexpected things might (and in fact did) happen during such a migration, we planned a fairly long downtime of 4 hours. However, the actual migration only took roughly one hour, and once everything looked good we made the lobby available for all of you again. Unfortunately, that's when we and some of you started to notice some problems, which took us quite a while to debug and fix. Here is a list of the most critical and interesting problems we encountered:

Games becoming invisible

Once more and more of you joined the lobby after the migration, eager to play again, we quickly noticed that something wasn't right. Games would become invisible or wouldn't even show up in the first place. After searching for a while, we figured out that this was caused by rate-limiting of messages sent through the server. Rate-limiting is in place to prevent spamming with large amounts of messages. It means that each user can only send a certain amount of text per second. During the migration we made a mistake in the configuration and applied the rate limit meant for players to the bot managing the games as well. While you don't see many written messages from WFGBot, it's actually a pretty busy bot and sends out a lot of messages which get processed by 0 A.D. to be able to show you the list of games. With WFGBot no longer able to send all the messages it wanted to send, the list of shown games would be outdated or even completely empty, because your instance of 0 A.D. wouldn't get up-to-date information about the available games. We didn't catch this problem in our testing prior to the migration, as our test setup had too few online players and hosted games to trigger the rate-limiting. Fortunately the fix for this problem was easy and just required correcting the mistake in the configuration.

Lots of stale and outdated games being shown

With invisible games not being a problem anymore, the list of games constantly grew and quickly started to show games whose hosts had already left the lobby. That's not a new problem, and you've probably seen such stale games in the past already. It happens when WFGBot doesn't get a notification when a player hosting a game leaves the lobby, as WFGBot then doesn't know that this player isn't hosting a game anymore. We don't know why that sometimes happens, but when it does, it leaves behind such stale games. To avoid this problem going forward we added a filter to only show hosted games whose host is still online, which fixes this problem.

Windows users not able to join anymore with TLS-encrypted connections

Another problem which became visible was that Windows users weren't able to join the lobby anymore if they had TLS-encrypted connections enabled in the settings (which is the default and a good idea to have set). To explain why that happened I have to back up a bit. The core of the lobby is a protocol called XMPP, which is an extensible chat protocol. When you connect to the lobby using 0 A.D., it'll establish an XMPP connection to an XMPP server running as part of the lobby. Such connections can be unencrypted or encrypted with TLS. TLS is the encryption protocol also used when you visit websites whose protocol is HTTPS, like your beloved https://play0ad.com/. TLS is available in multiple versions. For historical reasons 0 A.D. up to Alpha 26 on Windows only works with TLS v1.0, which is deprecated nowadays and usually disabled by default.

Connecting to the lobby with TLS-encrypted connections didn't work for Windows users right after the migration, because the lobby XMPP server didn't offer TLS v1.0 anymore, but only more recent TLS versions. However, the configuration of the XMPP server was fine. What we missed during the migration of the lobby was to enable TLS v1.0 in OpenSSL, the library the XMPP server uses for all the heavy-duty TLS-related work. Interestingly, even if we hadn't missed that, it wouldn't have worked, because the newer OpenSSL version required slightly different configuration parameters than before. Nevertheless, this problem should have surfaced during testing before the migration, but didn't because we simply forgot to test with Windows. The fix was once again straightforward and just involved setting the correct OpenSSL configuration after we figured out what exactly the culprit was. Going forward we'd love to disable such old TLS versions, but we'll have to wait with that until all recent 0 A.D. versions support newer TLS versions as well.

Some Linux users not able to join anymore with TLS-encrypted connections

With login for Windows users fixed, we received reports from some Linux users not being able to connect to the lobby with TLS-encrypted connections enabled. Figuring out what was causing this took us a while, as it did work for the majority of Linux users, but not for all of them. The culprit in this case was again the set of TLS versions supported by the XMPP server, but this time it wasn't an old version missing, but a new version causing problems in certain cases. As part of the migration we enabled the most recent TLS version, TLS v1.3. This usually works without having to change clients, because only clients supporting such a version will use it. The client which didn't work correctly was gloox, which is the XMPP client library used by 0 A.D. We don't know yet why it doesn't work, but it apparently doesn't.

The interesting piece is why that was so difficult to track down. Contrary to Windows, it's common on Linux that applications don't contain all of their dependencies, but rely on them being provided by the operating system in one way or another. The version of gloox adding support for TLS v1.3 was released less than 4 weeks ago, and Linux distributions usually don't include software which was released just a few weeks ago. That's why the majority of Linux users had no problem, as the version of gloox they were using didn't support TLS v1.3. The affected users were mainly users who installed 0 A.D. as a Flatpak application, as the Flatpak app for 0 A.D. included such a new version of gloox. Our workaround for this problem was to disable support for TLS v1.3 again on the lobby server, as that makes gloox and therefore 0 A.D. happy again. That's of course just a workaround, as we'd like to be able to offer support for TLS v1.3 in future as well, but to enable it again, we have to figure out why it's currently not working with gloox and get that fixed first.

Conclusion

As you can see, preparing and carrying out the migration of the lobby was quite some effort and not without challenges. While most of the problems which appeared during the migration could've been avoided, that's always easy to say in hindsight. I'm personally very satisfied with the result of the migration though, as it's a great base for further improvements, and the performance of the lobby is even better than before as well.
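The rate-limiting problem described under "Games becoming invisible" can be illustrated with a small token-bucket sketch in Python. This is not how ejabberd implements its rate limits, and all numbers, the message size and the names `TokenBucket`, `make_limiter` and `count_delivered` are assumptions chosen for illustration. It only shows why a busy bot loses most of its messages once it's subject to a limit sized for players, and why exempting it fixes that.

```python
class TokenBucket:
    """Allow `rate_per_ms` bytes per millisecond, with bursts up to `burst` bytes."""

    def __init__(self, rate_per_ms, burst):
        self.rate_per_ms = rate_per_ms
        self.burst = burst
        self.tokens = burst
        self.last_ms = 0

    def allow(self, size, now_ms):
        # Refill for the elapsed time, but never above the burst size.
        refill = (now_ms - self.last_ms) * self.rate_per_ms
        self.tokens = min(self.burst, self.tokens + refill)
        self.last_ms = now_ms
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


def make_limiter(user, exempt, rate_per_ms=1, burst=1000):
    """Return None (unlimited) for exempt accounts, a bucket for everyone else."""
    return None if user in exempt else TokenBucket(rate_per_ms, burst)


def count_delivered(limiter, messages, size, interval_ms):
    """Count how many of `messages` equally sized messages pass the limiter."""
    delivered = 0
    for i in range(messages):
        if limiter is None or limiter.allow(size, i * interval_ms):
            delivered += 1
    return delivered


# A hypothetical bot sending 100 messages of 500 bytes, one every 10 ms.
limited = count_delivered(make_limiter("WFGBot", exempt=set()), 100, 500, 10)
exempted = count_delivered(make_limiter("WFGBot", exempt={"WFGBot"}), 100, 500, 10)
print(limited, exempted)
```

With these made-up numbers the limited bot gets only a handful of its messages through, while the exempted bot delivers all of them. A test setup with few players and games never sends enough to drain the bucket, which matches why the problem didn't show up before the migration.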
  5. Connecting to the lobby from Linux with TLS-encrypted connections enabled, which didn't work after the migration with certain configurations (e.g. when 0ad got installed via Flatpak), should work again now.
  6. Stale games shouldn't be shown anymore now. If you do still encounter stale games or have other issues with the list of shown games in the lobby, please let us know.
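The idea behind the fix, only showing hosted games whose host is still online, can be sketched in a few lines of Python. This is an illustration under assumed data structures, not the actual lobby bot code; the `host` and `name` keys and the example players are hypothetical.

```python
def visible_games(games, online_users):
    """Keep only the games whose host is currently online in the lobby."""
    online = set(online_users)
    return [game for game in games if game["host"] in online]


# Hypothetical game list: bob left without the bot being notified.
games = [
    {"name": "1v1 Mainland", "host": "alice"},
    {"name": "4v4 team game", "host": "bob"},
]
shown = visible_games(games, ["alice", "carol"])
print(shown)
```

Filtering at display time means a missed "host left" notification no longer matters, because the list of online users is checked on every update.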
  7. No worries, you're not the only one encountering this problem. We'll figure out how to fix it. In the meantime you can use the workaround of disabling TLS I mentioned above.
  8. Did you have TLS-encrypted connections enabled prior to today as well? That sounds like your Linux Mint installation is missing the Let's Encrypt root certificate. Let's Encrypt is the certificate authority which issues the certificates we use for TLS-encrypted connections. Verifying them only works if your operating system can build a chain of trust, which in turn requires the root certificate to be in the certificate store of your operating system. That's a known issue. These games are stale: their host left, but we didn't get a notification for that and therefore haven't been able to remove the games from the list. What Linux distribution are you using? How did you install 0ad there?
  9. Connecting to the lobby with Windows should now work again with TLS enabled. Tomorrow or the day after we'll publish a more detailed post about the issues we encountered during the migration and what caused them. Everything should now work again as it did before the migration. If you notice anything not working as expected with the lobby, please let us know.
  10. The lobby is already online again, however we're still having some problems with encrypted connections from Windows users. If you're using Windows and can't connect, disabling encrypted connections should work as a workaround, as explained in the FAQ (second point below "Multiplayer"): https://trac.wildfiregames.com/wiki/FAQ#Multiplayer
  11. Tomorrow, on 7 April 2023, between 12:00 and 16:00 UTC we'll migrate the multiplayer lobby to a new server. During that time no connections to the multiplayer lobby will be possible and it won't be possible to play multiplayer games. We expect everything to work as usual afterwards. We'll post relevant updates in here. Edit: A previous version of this post stated that the downtime was to be scheduled between 08:00 and 12:00 UTC. Unfortunately we had to reschedule and the updated post reflects that now.
  12. @rossenburg Are the changes proposed by @smiley something you can work with for the UI part?
  13. The feature to change the password won't help such users at all, as they can't prove that the account they want to change the password for is their own account. There is a difference though between users who don't remember their password and don't have it stored in 0ad (these are the ones who can't change their password) and users who don't remember their password, but have it stored in 0ad (those can still log in and should also be able to change it). How would you do that? As you don't have their password you can't be logged in to their account, which is a prerequisite for changing the password. Essentially it's the same as for most websites which offer accounts: To change your password, you have to log in first. Once logged in, there is a form somewhere you can use to change the password. Note that this is different from a password reset process, which usually doesn't require a login, but something like a recovery email instead. That's a use case we can't cover for now.
  14. AFAIK neither gloox nor ejabberd supports checking the current password during password changes, because you can only change your password when already authenticated anyway. So to implement that you either have to check against a locally stored password or do some re-authentication during the password change. Checking against a locally stored password can be circumvented with a patched version of 0ad and isn't straightforward anyway, because storing the XMPP password is optional. Doing re-authentication with the XMPP server during the password change adds a lot of additional complexity and isn't how it's usually done with XMPP. Also keep in mind that most players who have stored their password in 0ad probably don't even know what their password is anymore and therefore wouldn't be able to change it at all if a password change required typing in the current password. What would be the benefit of checking the current password anyway? In which situations would that provide a benefit? The only situation I can imagine right now is when granting an untrusted person access to your computer, but that's a case we shouldn't try to handle. If an untrusted person has access to your computer they can also simply install a keylogger and get the password that way. My suggestion is to keep the initial implementation as simple and straightforward as possible. Avoid stuff which adds complexity in terms of code as well as in terms of user experience if it doesn't provide a significant benefit.
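For context: password changes in XMPP are usually done via In-Band Registration (XEP-0077), where an already authenticated client sends an IQ stanza containing just the username and the new password. The Python sketch below builds such a stanza with the standard library to show that it has no field for the current password. It's an illustration only, not code from 0ad or gloox; the server name, username and function name are placeholders.

```python
import xml.etree.ElementTree as ET

REGISTER_NS = "jabber:iq:register"


def build_password_change(server, username, new_password, stanza_id="change1"):
    """Build an XEP-0077 password change request as an ElementTree element."""
    iq = ET.Element("iq", {"type": "set", "to": server, "id": stanza_id})
    query = ET.SubElement(iq, f"{{{REGISTER_NS}}}query")
    # Only the username and the new password are part of the request.
    ET.SubElement(query, f"{{{REGISTER_NS}}}username").text = username
    ET.SubElement(query, f"{{{REGISTER_NS}}}password").text = new_password
    return iq


# Placeholder values; a real client sends this over its authenticated stream.
stanza = build_password_change("lobby.example.org", "player1", "new-secret")
print(ET.tostring(stanza, encoding="unicode"))
```

Because the request is only accepted on an authenticated session, the login itself is what proves ownership of the account.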
  15. While I'm not that familiar with the Pyrogenesis side of things, that sounds like it'd make a lot of sense. AFAIK changing the password is only possible after authenticating. I guess you're aware, but there are several Pyrogenesis-related functions (like "ChangeGame" and "GetProfile") missing. As a disclaimer: I'm currently working on the server side for PubSub (https://trac.wildfiregames.com/ticket/4203), which will require some additional changes for Pyrogenesis (like subscribing to the relevant PubSub nodes), however I believe that should happen separately and after such a refactoring.
  16. *bump* Anybody interested in contributing password change functionality? If there are any roadblocks regarding that just let me know.
  17. Please consider that @Stan` as the project lead might have a better picture of the situation than you do. The code for the bots is open on https://github.com/0ad/lobby-bots and contributions are always welcome. In case you're willing to contribute, but don't know what to tackle, feel free to reach out to me. Aside from the bots, contributions to the 0ad client-side lobby code would be much appreciated, as I only work on the lobby bots and the server-side lobby setup. A month ago I called for an implementation of password changing functionality in 0ad, but so far nobody has come up with one. Please also be aware that I'm by no means blocked by technical topics, but rather by how certain things are getting handled at WFG, but that's nothing to discuss here.
  18. No idea if the lobby code in pyrogenesis supports something like popups. Afaik announcements are currently displayed as chat messages from "system".
  19. First of all there needs to be a consensus on how this should work. Should there be one rating, or one for 1vs1 and one for multiplayer games? Also I'm not sure how good the results for rating multiplayer games would be with the current ELO implementation. Maybe it'd make sense to switch to another rating algorithm for that. Once all these details are clarified it's just a matter of enabling/implementing the necessary support in EcheLOn and enabling the sending of game reports for multiplayer games with more than 2 players in pyrogenesis.
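To illustrate why rating games with more than 2 players isn't obvious, here is the textbook Elo update for a two-player game as a Python sketch. This is the standard formula, not necessarily what EcheLOn implements, and the K-factor of 32 is an assumption.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under standard Elo."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def updated_rating(rating_a, rating_b, score_a, k=32):
    """New rating of player A; score_a is 1 for a win, 0.5 draw, 0 loss."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Two equally rated players: the winner gains half of the K-factor.
print(updated_rating(1200, 1200, 1))
```

The formula takes exactly two ratings and one result. For team games, somebody has to decide how to combine the ratings within a team and how to distribute the gain or loss among its members, which is the kind of detail that needs consensus first.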
  20. These are just warnings. It should work nonetheless. There is already an open pull request, which will get rid of these warnings once merged: https://github.com/0ad/lobby-bots/pull/13
  21. That's the problem, it should be this instead:

    muc_admin:
      - allow: admin
      - allow: bots
  22. Of course it does and I didn't say otherwise. You likely didn't properly configure ejabberd, so EcheLOn doesn't get the JID of users joining the MUC room. To fix that you have to either ensure that the MUC room isn't anonymous or that the bots have MUC admin permissions. How to configure any of that is documented in the README. If you can't get that to work, posting your ejabberd.yml here would be helpful for further debugging.
  23. My guess would be that you didn't open UDP Port 3478 in the security group of your EC2 instance. At least that port isn't reachable from the internet.
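One way to check this from outside is to send a STUN Binding Request to the port and see whether anything answers. The Python sketch below assumes that the service expected on UDP port 3478 is STUN, which is the port's conventional use; the function names are made up, and a missing reply can also mean packet loss or a stopped service rather than a closed security group.

```python
import os
import socket
import struct

STUN_BINDING_REQUEST = 0x0001
STUN_MAGIC_COOKIE = 0x2112A442


def build_binding_request():
    """Build a 20-byte STUN Binding Request (RFC 5389) without attributes."""
    transaction_id = os.urandom(12)
    # Header: message type, message length (0, no attributes), magic cookie.
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id


def stun_port_reachable(host, port=3478, timeout=3.0):
    """Return True if something at host:port answers a STUN Binding Request."""
    request = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (host, port))
            response, _ = sock.recvfrom(2048)
        except OSError:
            # Covers timeouts, DNS failures and ICMP "port unreachable".
            return False
    # A STUN response echoes the 12-byte transaction ID of the request.
    return len(response) >= 20 and response[8:20] == request[8:20]
```

Calling `stun_port_reachable()` with the public hostname of the EC2 instance from a machine outside of AWS should return `True` once the port is opened in the security group.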