
Details Of May 2 Game Update Issues



5 hours ago, Rubi Bayer.8493 said:

Hi all, I want to give you a behind-the-scenes look at the unusual number of issues after yesterday's game update. I talked with one of our Senior Software Engineers to help me outline what happened, so here you go!

In Monday's studio update we talked about "taking on projects that will keep the game healthy in the long run, like refactoring large swaths of old code". Development on a game that's been live for a decade can be a minefield: fixes like that risk breaking content that was inadvertently depending on a bug in order to "work." That's what happened here when we fixed one of those pieces of old code. 

We did catch the problem and fix it, but Murphy's Law was now paying attention. You see, shipping a huge live game like GW2 requires a complex system for managing the flow of code and content going out the door. Behind the scenes there are multiple "copies" of a lot of stuff we work with: copies that are getting tinkered with, copies that are getting tested, and copies that are right on the precipice of being released. That means that even when a fix is done and dusted, it still needs to move forward on various conveyor belts, so to speak, to make sure it gets to you. In this case, the fix missed one of those conveyor belts along the way. 
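
To make the "conveyor belt" idea concrete, here is a minimal sketch, assuming a hypothetical three-stage pipeline with made-up branch names and change IDs; it is an outside illustration, not ArenaNet's tooling. The scenario described in the post corresponds to a fix that is present on the earlier belts but never reached the release candidate.

```python
# Illustrative only: check that a fix has reached every "conveyor belt" (branch)
# it needs to ride before release. Branch names and change IDs are hypothetical.

RELEASE_STAGES = ["dev", "staging", "release-candidate"]

# Hypothetical record of which change IDs each branch currently contains.
branch_contents = {
    "dev": {101, 102, 103},
    "staging": {101, 102, 103},
    "release-candidate": {101, 102},  # change 103 (the fix) never made it here
}

def missing_stages(change_id: int) -> list[str]:
    """Return the stages that do not yet contain the given change."""
    return [stage for stage in RELEASE_STAGES if change_id not in branch_contents[stage]]

fix_id = 103
gaps = missing_stages(fix_id)
if gaps:
    print(f"Change {fix_id} is missing from: {', '.join(gaps)}")  # -> release-candidate
else:
    print(f"Change {fix_id} is on every conveyor belt.")
```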

What did this mean for you? In this case, the piece of code that broke was related to content that depends on real time, what we call "time spans". When it went live, it quickly affected a lot of content that appeared to be unrelated at first glance--metas and world bosses, guild missions, raid buffs, Karmic Converters and portal devices, and more--but they all relied on that bit of code in one way or another.  
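
As a rough outside illustration of why one broken helper can surface as so many unrelated-looking problems, here is a minimal sketch; the helper, the timers, and the features shown are hypothetical stand-ins, not the game's actual code.

```python
# Illustrative only: several unrelated-looking features all lean on one shared
# "time span" helper, so a bug in that helper breaks all of them at once.
from datetime import datetime, timedelta

def time_span_elapsed(start: datetime, span: timedelta, now: datetime) -> bool:
    """Shared helper: has the given real-time span elapsed since `start`?
    A subtle bug here ripples out to everything below."""
    return now - start >= span

def world_boss_should_spawn(last_spawn: datetime, now: datetime) -> bool:
    return time_span_elapsed(last_spawn, timedelta(hours=2), now)

def karmic_converter_ready(last_use: datetime, now: datetime) -> bool:
    return time_span_elapsed(last_use, timedelta(days=1), now)

def raid_buff_expired(applied_at: datetime, now: datetime) -> bool:
    return time_span_elapsed(applied_at, timedelta(minutes=10), now)
```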

So why didn't we just ship that missing fix right away? It was right there! Well, tracking down the problem and getting a fix together was a bit of a scavenger hunt. When you all flag an issue in the live game the first thing we do is find or recreate the problem based on the information we have so we can look it over firsthand. Yesterday's issues began with a player reporting that the Verdant Brink night boss meta was not triggering, so we started looking to see what was wrong with Verdant Brink. Two more reports about metas elsewhere rolled in shortly afterward, so we switched our search to try to find what was wrong with meta events. 

Meanwhile, other players started reporting that certain gizmos were broken, so we began looking into this apparently unrelated issue as well. Then we learned that Pact Supply Agents had all vanished. At this point we started looking for a connection, because it was clear that something bigger was going on. We knew there was probably a connection, but there were a number of possibilities we were investigating--including time spanning. Having a list of possible connections allowed us to start tagging in people from specific teams to help--including a software engineer who was able to trace the various issues back to that piece of code, confirm the connection, find the lost fix, stick it back on the conveyor belt, and hand it off to our Release Management team to get a hotfix ready. 

So how did we not catch such major issues before we shipped the update? When there is a bug and something breaks, it doesn't proactively send up a "Help, I'm broken!" flare--we find it when we test that specific content or item. We of course can't test the entire game before every update since there are millions of things to test (every item, skill, enemy, event, NPC, and so on), so we have to use our time wisely while casting the widest possible net for issues. Our QA teams test a broad range of categories, including the new things in the build, things that have broken in the past, and things related to recent major changes to fundamental systems to make sure they are still working properly. In this case, none of the affected systems had broken in this way before, nor were they in any of the "to check" categories. That said, when we ship major bugs like this, it's an opportunity to look at our processes for holes and find ways to improve!
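
For what that prioritization might look like in the abstract, here is a minimal sketch with made-up category and test names (not ArenaNet's QA suite): a fixed testing budget is spent category by category in priority order, which also shows how content outside every category can slip through untested.

```python
# Illustrative only: fill a limited testing budget from prioritized categories.
TEST_CATEGORIES = {
    "new_in_this_build": ["test_new_meta_achievements", "test_new_weapon_skills"],
    "broke_before": ["test_dragon_bash_races", "test_wvw_reset_timers"],
    "recent_fundamental_changes": ["test_map_load_times", "test_inventory_sorting"],
}

def build_test_plan(hours_available: float, hours_per_test: float = 0.5) -> list[str]:
    """Pick tests in category priority order until the time budget runs out."""
    budget = int(hours_available / hours_per_test)
    plan: list[str] = []
    for tests in TEST_CATEGORIES.values():
        for test in tests:
            if len(plan) >= budget:
                return plan
            plan.append(test)
    return plan  # anything not listed in any category is never scheduled

print(build_test_plan(hours_available=2.0))
```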

This is obviously a simplified explanation, but I hope this peek at the workflow and the sometimes-very-surprising vagaries of game development provides some insight. In the meantime, our teams are continuing to work on issues as they arise, so thank you again for letting us know when something isn't working properly! 

After Cam's departure, I'm so anxious for you all. I love transparency, but I hope the motivation for it isn't driven by vitriol from some players. However long it takes to fix everything, we should all be just fine, because players SHOULD have a life outside of the game. I'm sorry some players aren't patient, but after the roadmap there seems to be so much to be grateful for, and I hope people remember that and keep things in perspective. Be well!

Edited by HotDelirium.7984

Out of curiosity, has this series of events led the team to consider the possibility of a public test realm? If there are issues like this under the hood that the QA team can't always find right away, I think the game could benefit from a PTR that ships the current iteration of 'possibly buggy upcoming changes' every week or so, so we can get in, break stuff, and give you all a heads-up before it goes live.

(The best part is, you could ship it on Fridays and see how many reports come in over the weekend!)

 

So many other MMOs do this (hell, look at Blizzard, which at some points has had five different PTRs running for its flagship game), and I think Guild Wars is large enough that this would be a very good idea to start on.


Been there, done that. Excellent explanation. Anyone who deals with multiple code branches, release testing, and the ever-present fear that something didn't get pulled into the right release candidate knows exactly what you are talking about. It's a testament to your team that this is a very unusual occurrence given the incredibly complex codebase you have (and I'm guessing parts of it might be nearly 15 years old). And you also do it with live patching, which is just a phenomenal capability. 

 

Edited by Robdalf.2561

My husband is a software engineer for a Fortune 100 company, and yeah. I related the gist of this post to him and his response was, "Oh yeah. I've seen that any number of times."

This, on huge projects written by programmers who make 6 figures for their work. No one is immune.


I would recommend implementing a sentry system that has a list of "expected events" that should happen at certain times. Running the sentry system in the test instance would fire alarms if an expected event does not happen. This relieves the QA team from having to worry about testing this kind of issue and makes sure that any regression is caught before the final deployment.

I know all of this is easier said than done, but I thought I'd mention it in case it is feasible without a lot of effort. I would imagine there is an actor system firing these events, so the sentry would be a specialized client of that actor system with a checklist of things that should happen, and it could simply log an error when one of the expected events does not occur. Somebody could take a quick look at the sentry log after a deployment. The log should remain empty most of the time, so just seeing data there would be enough to trigger an investigation.

I may be assuming implementation details that are wrong, but even if the implementation is not exactly the same, the translation to your architecture should not be too complex. Of course, it is work, but it is work that would avoid future issues, so it is a good time investment in my opinion. Just my two cents, for whatever it's worth. The transparency and insight are appreciated.
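
For what it's worth, here is a minimal sketch of that sentry idea, assuming a hypothetical event feed; the event names, deadlines, and the record() hook are illustrative stand-ins rather than anything from the game's actual architecture.

```python
# Illustrative only: a "sentry" that holds a checklist of expected events with
# deadlines and logs an error for anything that fails to fire on the test instance.
from datetime import datetime, timedelta
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sentry")

# Checklist: event name -> how long after the sentry starts it must have fired.
EXPECTED_EVENTS = {
    "verdant_brink_night_boss": timedelta(hours=2),
    "pact_supply_agent_spawn": timedelta(minutes=15),
    "world_boss_rotation_tick": timedelta(minutes=30),
}

class Sentry:
    def __init__(self) -> None:
        self.started = datetime.now()
        self.seen: set[str] = set()

    def record(self, event_name: str) -> None:
        """Called by the (hypothetical) event system whenever an event fires."""
        self.seen.add(event_name)

    def check(self) -> None:
        """Log an error for every expected event whose deadline passed without firing."""
        now = datetime.now()
        for name, deadline in EXPECTED_EVENTS.items():
            if name not in self.seen and now - self.started >= deadline:
                log.error("Expected event %r did not fire within %s", name, deadline)
```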


  • ArenaNet Staff
1 hour ago, Kain Francois.4328 said:

This bug has given birth to one of the greatest memes: Reaper is now so strong, it breaks time itself.

 

Thanks for being so swift to fix all this, and kudos for the great laughs! 👍

Chronomancers furious, film at 11.


Thanks for the update. I'm sure a lot of angst comes from folks not understanding how things work behind the scenes; communication like this helps. And for those of us who are familiar with these types of environments, I'm sure a lot of "been there, done that" and sympathetic sentiments came up while reading the synopsis.

But considering the nature of this bug, I have to wonder about your QA tests. This seems like pretty fundamental functionality that probably could have been caught with simple automated tests (although I will plead ignorance).

I also have to temper this against the ongoing issues plaguing WvW that never seem to get attention or communication (think of the recent reset-night crash fests). Maybe some communication there would help the community.

Edited by Mild Disaster.2596
