ArenaNet Staff Rubi Bayer.8493 Posted May 3, 2023 Hi all, I want to give you a behind-the-scenes look at the unusual number of issues after yesterday's game update. I talked with one of our Senior Software Engineers to help me outline what happened, so here you go!
In Monday's studio update we talked about "taking on projects that will keep the game healthy in the long run, like refactoring large swaths of old code". Development on a game that's been live for a decade can be a minefield: fixes like that risk breaking content that was inadvertently depending on a bug in order to "work." That's what happened here when we fixed one of those pieces of old code. We did catch the problem and fix it, but Murphy's Law was now paying attention.
You see, shipping a huge live game like GW2 requires a complex system for managing the flow of code and content going out the door. Behind the scenes there are multiple "copies" of a lot of stuff we work with: copies that are getting tinkered with, copies that are getting tested, and copies that are right on the precipice of being released. Which means that even when a fix is done and dusted it still needs to move forward on various conveyor belts, so to speak, to make sure it gets to you. In this case, the fix missed one of those conveyor belts along the way.
What did this mean for you? In this case, the piece of code that broke was related to content that depends on real time, what we call "time spans". When it went live, it quickly affected a lot of content that appeared to be unrelated at first glance--metas and world bosses, guild missions, raid buffs, Karmic Converters and portal devices, and more--but they all relied on that bit of code in one way or another.
So why didn't we just ship that missing fix right away? It was right there! Well, tracking down the problem and getting a fix together was a bit of a scavenger hunt.
When you all flag an issue in the live game the first thing we do is find or recreate the problem based on the information we have so we can look it over firsthand. Yesterday's issues began with a player reporting that the Verdant Brink night boss meta was not triggering, so we started looking to see what was wrong with Verdant Brink. Two more reports about metas elsewhere rolled in shortly afterward, so we switched our search to try to find what was wrong with meta events. Meanwhile, other players started reporting that certain gizmos were broken, so we began looking into this apparently unrelated issue as well. Then we learned that Pact Supply Agents had all vanished.
At this point we started looking for a connection, because it was clear that something bigger was going on here. We knew there was probably a connection, but there were a number of possibilities that we were investigating--including time spanning. Having a list of possible connections allowed us to start tagging in people from specific teams to help--including a software engineer who was able to trace the various issues back to that piece of code, confirm the connection, find the lost fix, stick it back on the conveyor belt, and hand it off to our Release Management team to get a hotfix ready.
So how did we not catch such major issues before we shipped the update? When there is a bug and something breaks, it doesn't proactively send up a "Help, I'm broken!" flare--we find it when we test that specific content or item. We of course can't test the entire game before every update since there are millions of things to test (every item, skill, enemy, event, NPC, and so on), so we have to use our time wisely while casting the widest possible net for issues. Our QA teams test a broad range of categories, including the new things in the build, things that have broken in the past, and things related to recent major changes to fundamental systems to make sure they are still working properly.
In this case, none of the affected systems had broken in this way before, nor were they in any of the "to check" categories. That said, when we ship major bugs like this, it's an opportunity to look at our processes for holes and find ways to improve!
This is obviously a simplified explanation, but I hope this peek at the workflow and the sometimes-very-surprising vagaries of game development provides some insight. In the meantime, our teams are continuing to work on issues as they arise, so thank you again for letting us know when something isn't working properly!
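The "conveyor belt" problem in the post above (a fix that exists in one copy of the game but never reaches the copy being released) can be illustrated with a small sketch. Everything here is hypothetical: the branch names, the commit labels, and the `branches_missing_fix` helper are invented for illustration and say nothing about how ArenaNet's actual pipeline is built.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """One hypothetical 'copy' of the game, i.e. a stage on the conveyor belt."""
    name: str
    commits: set[str] = field(default_factory=set)


def branches_missing_fix(fix_id: str, branches: list[Branch]) -> list[str]:
    """Return the names of the branches that do not contain the given fix."""
    return [b.name for b in branches if fix_id not in b.commits]


# Hypothetical pipeline: the refactor reached every stage, but the follow-up
# fix was never carried forward to the release candidate.
pipeline = [
    Branch("dev", {"refactor-timespans", "fix-timespans"}),
    Branch("test", {"refactor-timespans", "fix-timespans"}),
    Branch("release-candidate", {"refactor-timespans"}),
]

print(branches_missing_fix("fix-timespans", pipeline))  # ['release-candidate']
```

A check of this shape, run before a build ships, would answer the question "does this release actually have the fix?" for any fix that is known to be required.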
Inculpatus cedo.9234 Posted May 3, 2023 Thanks for the insights. 👍
Diruuo.6314 Posted May 4, 2023 Thank you for the insight and explanations! Always interesting to get a better understanding of the process like this!
TheQuickFox.3826 Posted May 4, 2023 Thanks for the post. I appreciate the communication and explanation. As a beta tester for a community project I know how easy it is to miss things during testing.
HotDelirium.7984 Posted May 4, 2023 (edited) 5 hours ago, Rubi Bayer.8493 said: Hi all, I want to give you a behind-the-scenes look at the unusual number of issues after yesterday's game update. [...]
After Cam's departure, I'm so anxious for you all. I love transparency but I hope a motivation in that isn't inspired by any vitriol from some players. It will take you however long to fix everything and we should all be just fine because players SHOULD have a life outside of the game service. I'm sorry if some players aren't patient but after the roadmap, there seems like so much to be grateful for and I hope people remember that and keep it in perspective. Be well!
Rakotia.5812 Posted May 4, 2023 (edited) Thanks a lot for the detailed explanation! Really appreciate the communication of what happened. Keep the good work going!
Sirius.4510 Posted May 4, 2023 ...yeah... "wait, does this release actually have the fix?" is a startlingly common question in my line of work 😄
Marge.4035 Posted May 4, 2023 Thanks for the info, stay well, team!
nitemsg.4537 Posted May 4, 2023 Thanks for the update. What about the bugs that have existed for months or even years (some of them since the game launched)?
Suzu.1546 Posted May 4, 2023 Very interesting read, love the write up on this sort of topic because it's always curious to me to see why/how things go wrong, get identified and then addressed.
Cyninja.2954 Posted May 4, 2023 Thanks for the heads up.
vandrefalk.6823 Posted May 4, 2023 Interesting read; thanks for the update. And now I'm expecting "time spans" are added to the conveyor belt for QA testing of fixes... "does this new code impact time-related events" 😄
Mia.5386 Posted May 4, 2023 Out of curiosity, has this series of events led to the team considering the possibility of a public test realm? If there are issues like this under the hood that the QA team can't always find right away, I think the game could benefit from having a PTR that just ships with the current iteration of 'possibly buggy upcoming changes' every week or so, so we can get in, break stuff, and give you all a heads up before it goes live. (The best part is, you could ship it on Fridays and see how many reports come in over the weekend!) So many other MMOs do this (hell, look at Blizzard, who at some points has 5 different PTRs running for their flagship game) and I think Guild Wars is large enough that this would be a very good idea to start on.
Robdalf.2561 Posted May 4, 2023 (edited) Been there, done that. Excellent explanation. Anyone who deals with multiple code branches, release testing and the ever-present fear that something didn't get pulled into the right release candidate knows exactly what you are talking about. It's a testament to your team that this is a very unusual occurrence given the incredibly complex codebase you have (and I'm guessing parts of it might be nearly 15 years old). And you also do it with live patching which is just a phenomenal capability.
Robdalf.2561 Posted May 4, 2023 7 hours ago, Sirius.4510 said: ...yeah... "wait, does this release actually have the fix?" is a startlingly common question in my line of work 😄
Oh I was having flashbacks reading this!
kumiorava.9674 Posted May 4, 2023 A fascinating read! I love these peeks behind the scenes.
JeffJeth.9518 Posted May 4, 2023 My husband is a software engineer for a Fortune 100 company, and yeah. I related the gist of this post to him and his response was, "Oh yeah. I've seen that any number of times." This, on huge projects written by programmers who make 6 figures for their work. No one is immune.
Kain Francois.4328 Posted May 4, 2023 This bug has given birth to one of the greatest memes: Reaper is now so strong, it breaks time itself. Thanks for being so swift to fix all this, and kudos for the great laughs! 👍
Stadsport.8714 Posted May 4, 2023 The communication is really appreciated. It's a 10 year old game, kitten happens. But getting to understand why and what happened is really interesting and it definitely helps us to understand what's going on behind the scenes.
Rose Solane.1027 Posted May 4, 2023 Great to see a post about why a major update issue happened. Unfortunately these kinds of errors happen and are not always caught by testers. I think the communication around this case was pretty good and now we even have an explanation 🙂
Chaba.5410 Posted May 4, 2023 15 hours ago, Rubi Bayer.8493 said: content that was inadvertently depending on a bug in order to "work."
This was the best line. xD
murven.7581 Posted May 4, 2023 I would recommend the implementation of a sentry system that has a list of "expected events" that should happen at certain times. Running the sentry system in the test instance would fire alarms if an expected event does not happen. This relieves the QA team from having to worry about testing this kind of issue and makes sure that any regression is caught before the final deployment. I know that all this is easier said than done, but I thought of mentioning it, just in case it is something that is feasible without a lot of effort.
I would imagine that there is an actor system firing these events, so the sentry would be a specialized client of this actor system that has a checklist of things that should happen and could simply log an error when one of the expected events does not occur. Somebody could just take a quick look at the sentry log after a deployment. The sentry system log should remain empty most of the time, so just seeing data there would be enough to investigate.
I may be assuming implementation details that are wrong, but even if the implementation is not exactly the same, the translation to your architecture should not be too complex. Of course, it is work, but it is work that would avoid future issues, so it is a good time investment in my opinion. Just my two cents, for whatever it's worth. The transparency and insight are appreciated.
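The sentry idea in the post above could look roughly like the following. This is a minimal sketch under stated assumptions: the `ExpectedEvent` type, the `missing_events` function, the event names, and the idea that observed events arrive as timestamps keyed by name are all made up for illustration, and none of it reflects the game's real event system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ExpectedEvent:
    """A hypothetical checklist entry: something that should fire at a known time."""
    name: str
    due: datetime          # when the event should fire
    tolerance: timedelta   # how late it may be before the sentry complains


def missing_events(expected, observed, now):
    """Return the names of overdue events with no observation in their window.

    `observed` maps an event name to the list of times it was seen firing.
    """
    missing = []
    for ev in expected:
        deadline = ev.due + ev.tolerance
        if now < deadline:
            continue  # not overdue yet, nothing to report
        seen = observed.get(ev.name, [])
        if not any(ev.due <= t <= deadline for t in seen):
            missing.append(ev.name)
    return missing


# Hypothetical test-instance data: one event fired on time, one never fired.
day = datetime(2023, 5, 2)
expected = [
    ExpectedEvent("night_boss_meta", day + timedelta(hours=12), timedelta(minutes=5)),
    ExpectedEvent("supply_agents_spawn", day + timedelta(hours=8), timedelta(minutes=5)),
]
observed = {"supply_agents_spawn": [day + timedelta(hours=8, minutes=1)]}
now = day + timedelta(hours=13)

print(missing_events(expected, observed, now))  # ['night_boss_meta']
```

As the post suggests, the result would normally be an empty list, so any entry at all is a reason to investigate.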
ArenaNet Staff Rubi Bayer.8493 Posted May 4, 2023 1 hour ago, Kain Francois.4328 said: This bug has given birth to one of the greatest memes: Reaper is now so strong, it breaks time itself. Thanks for being so swift to fix all this, and kudos for the great laughs! 👍
Chronomancers furious, film at 11.
jussi.8139 Posted May 4, 2023 Thank you for the insights, swift fixes, and the courage to refactor legacy code! 😁 ✨
Mild Disaster.2596 Posted May 4, 2023 (edited) Thanks for the update. I'm sure a lot of angst comes from folks not understanding how things work behind the scenes; communication like this helps. And for those of us who are familiar with these types of environments, I'm sure a lot of 'been there done that' and sympathetic sentiments came up while reading the synopsis.
But considering the nature of this bug, I have to wonder about your QA tests. This seems like a pretty fundamental functionality that could have probably been caught with simple automated tests (although I will plead ignorance).
Also have to temper this against the ongoing issues plaguing WvW that never seem to get attention/communication (think recent reset night crash fests). Maybe some communication there would help the community.
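For readers wondering what a "simple automated test" for time-dependent content might look like, here is a rough sketch. It assumes a time span is just a start time plus a duration, which is a guess; the `TimeSpan` class and `run_checks` function are hypothetical and are not taken from the game's code. The point is only that the boundaries of a time window can be checked with fixed timestamps instead of waiting for the real clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeSpan:
    """Hypothetical model of real-time content: active from start for duration."""
    start: datetime
    duration: timedelta

    def is_active(self, now: datetime) -> bool:
        # Half-open window: active at the start instant, inactive at the end instant.
        return self.start <= now < self.start + self.duration


def run_checks() -> str:
    """Check the window boundaries using fixed timestamps, not the wall clock."""
    start = datetime(2023, 5, 2, 20, 0, tzinfo=timezone.utc)
    span = TimeSpan(start, timedelta(minutes=30))
    assert not span.is_active(start - timedelta(seconds=1))
    assert span.is_active(start)
    assert span.is_active(start + timedelta(minutes=29, seconds=59))
    assert not span.is_active(start + timedelta(minutes=30))
    return "ok"


print(run_checks())  # ok
```

Whether a check like this would have caught the actual bug depends on details the thread does not give.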