
Inside ArenaNet: Live Game Outage Analysis


Recommended Posts

I very much enjoyed the News Blog post.  It was very entertaining (you may have a second career in writing :classic_wink: ) and most informative.

 

Thank you for sharing, and I look forward to many more News Blog posts.

(Hopefully, we will receive them regularly from a variety of ArenaNet staff, on a multitude of topics.)

 

Keep up the good work; it is appreciated. 


Interesting article! As someone who works in QA for a financial institution, writes automation code, and helps manage the test databases, it was a super interesting read.

Reminds me of incidents we've had that ended up being something silly (like a space in a production password that had always been there, but a new vendor driver no longer trimmed it, so devices started "randomly" going offline when they did their check-in).
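If it helps to picture it, here's a tiny, made-up illustration of that failure mode (the names and values are invented, not our actual stack):

```python
# Tiny illustration of the trailing-space failure mode (all names made up).
STORED_PASSWORD = "s3cret "          # the space that had "always been there"
EXPECTED = "s3cret"                  # what the auth backend actually compares against

def old_driver_auth(pw: str) -> bool:
    # Old vendor driver quietly trimmed whitespace from credentials.
    return pw.strip() == EXPECTED

def new_driver_auth(pw: str) -> bool:
    # New driver passes credentials through verbatim.
    return pw == EXPECTED

print(old_driver_auth(STORED_PASSWORD))  # True  -> devices check in fine for years
print(new_driver_auth(STORED_PASSWORD))  # False -> devices start "randomly" going offline
```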

 

 


Thank you very much for that read!!!
I am enjoying learning about the game's behind-the-scenes and, really, KUDOS TO YOU ALL!!!

It has been a widespread meme in the community that you at ANet just don't communicate with the players beyond "There's this new shiny in the Gem-Shop."

But you are really turning the wheel around lately. You're blowing us out of the water and out of our comfort zone (but in a good way :D). I am absolutely LOVING the amount of intel and communication we receive: from the big summer announcement, to the balance patch, to your reaction to the community's reaction to the balance patch, to how freaking well the Legendary Armory turned out. I am absolutely stunned, and I got no Stun-break.

Definitely hyped for the future :D


As a developer who is always interested in how other companies work and who has spent a ton of time dealing with both AWS and Azure, not only was this interesting to read, but it also makes me think you should look into more managed solutions, like avoiding managing your own instances and using their managed databases instead.

 

As a player, I love this continued feedback and insight into the company and how hard you guys work to ensure a stable game.


Thanks for this post; it was nicely thorough and very insightful. Thanks for the excellent uptime.

Probably not the right place to mention this, but it reminded me that I have long wished for a similarly detailed explanation of the pros and cons around WvW blob-versus-blob skill lag (with respect to server hardware and packet size, or the amount of data to/from each client per second). I believe I've previously read that sending data to every other player becomes the issue when more and more players are nearby.
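For what it's worth, here's a rough back-of-the-envelope sketch of why that blows up; the update size and tick rate are pure guesses on my part, not ArenaNet's actual numbers:

```python
# Back-of-the-envelope: per-second broadcast cost in a blob-v-blob fight.
# Update size and tick rate are guesses for illustration only.

def per_second_traffic(nearby_players: int,
                       bytes_per_update: int = 50,   # assumed size of one player-state update
                       ticks_per_second: int = 10):  # assumed server send rate
    """Return (bytes per client, total bytes the server sends) per second."""
    # Each client needs an update about every *other* nearby player, every tick.
    per_client = (nearby_players - 1) * bytes_per_update * ticks_per_second
    # The server repeats that for every client, so total traffic grows roughly quadratically.
    return per_client, nearby_players * per_client

for n in (10, 50, 150):  # small skirmish, squad fight, three-way blob
    per_client, total = per_second_traffic(n)
    print(f"{n:>3} players: {per_client / 1e3:6.1f} kB/s per client, "
          f"{total / 1e6:6.2f} MB/s total from the server")
```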


That was a very interesting read. While I'm not in dev-ops, or in this case platform-ops, I do understand the difficulties of managing multiple interconnected systems, each with its own dependencies and technical requirements. Thank you, Grouch, for making sure ArenaNet communicates with the community. Welcome back.


Really a very interesting read, thank you so much for sharing it! I think this kind of openness (without revealing critical business details, of course) is great for both players and developers out there (like myself).

It can help the community be more understanding and patient when such things occur, and learn that things aren't as easy and simple as they can look from the outside. It can also help developers from different areas out there learn from others' experiences, mistakes, and successes, and know how to move forward.

It reminded me of this post (https://steamcommunity.com/app/323380/discussions/3/361787186427352059/) from another game (which, ironically, is all about satirizing game development and the relationship between devs and players). Out of nowhere, one of the devs made that post sharing how they did an AI optimization, even including technical details and code snippets. Really cool stuff!

Kudos to all the devs and teams, and thanks again for all your work to make this game great! ❤️


So happy to hear you love the team you work with. We have a similar incident management process at work, and I agree the culture plays a huge role. In the end it should never be about finding someone to blame; it's about learning how to prevent the issue from happening in the future. People make mistakes, and programs and processes help us catch them early! 😄

 

It's great to read a bit about your setup; it makes one feel closer to the team.

 

GW2 is absolutely amazing with its uptime, keep up the amazing work! 🥰 You rock


As a very (very, very) senior platform engineer and architect, I think this outage raises a number of serious questions from an architectural and process standpoint.

  1. Why weren't outstanding queries and disk throughput/latency/queue depth already being monitored? Those are all very much core metrics, and often the only warning sign you get when AWS has another disk outage because of a dependency on us-east, or when the host dies with AWS completely unaware of it. (Oh yes, this has happened. Many times.) See the sketch after this list for the kind of alarm I mean.
  2. Why is the database still architected as a single A/P pair? (We'll avoid the digression on live service issues for the moment; I might come back to it.) This either leaves you vulnerable to single-region outages or subjects you to a self-inflicted Denial of Cash attack. It also means that in the event of future corruption or failures, you still have the exact same problem.
  3. Which brings me to three: why wasn't there a cold or warm standby? If there were known good backups, what prevented operations from spinning up a new instance while troubleshooting was going on? (Yes, I know the caveats. But that should have brought things back in a lot less time, or at least provided an A/B set.)
  4. While I know it's completely unthinkable and we totally don't want to admit it could ever happen, why wasn't there a predefined "EPO" (Emergency Power Off) process outside of change control? Believe me, I can tell you exactly how bad it can get if bad data is being written to a live service and you can't perform an immediate shutdown. Especially a shared tenant live service. (Buy me a beer and I'll tell you the whole story, Robert. It's a good one.)
  5. Stepping away from the live service portion, why is ANet self-managing the guest OS? (I can make some educated guesses. Neither of us want me to. Trust me. They come with a lecture series.) Unless you've got a full time junior me doing nothing but maintaining that, that's a recipe for exactly this kind of root cause.
  6. What changes has ANet made to the dev/stage/live test process to better align it with real-world loads? Yes, I know you can't fire up 5,000+ desktops running bot clients. But have you implemented things like Hell Queries and Keyboard Chewers?
  7. Okay, I said I might come back to it. So let's. Why is the game still using a single database architecture? Yes, certain data must be present always (account, inventory, guilds, etc) and synchronized across all instances. But as GW2 is not a seamless world, I would have thought backend would have switched to zone shard and flush. (Also would have assumed RDS, but good on you for avoiding that Hotel California. Seriously.) Especially as backend changes like that require fewer or no client changes.
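To make point 1 concrete: below is a minimal sketch of the kind of alarm I have in mind, assuming an EBS-backed, self-managed database host and boto3. The region, volume ID, threshold, and SNS topic are all placeholders, not anything from ANet's actual setup.

```python
# Minimal sketch: alert when an EBS volume's I/O queue starts backing up.
# Region, volume ID, threshold, and SNS topic are placeholder assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="db-volume-queue-depth-high",
    AlarmDescription="Outstanding I/O on the DB data volume is piling up",
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",           # outstanding read/write operations
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Average",
    Period=60,                                # one-minute samples
    EvaluationPeriods=5,                      # sustained for five minutes
    Threshold=32.0,                           # tune to the volume's provisioned IOPS
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",             # a silent volume is also a bad sign
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-pager"],
)
```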

That's all that's coming to mind for the moment, though. From the outside looking in, there are only so many questions you can ask without making bad assumptions about things like the database type (I mean, I'm guessing MySQL but hoping it's PostgreSQL, though MS SQL is also fine) or the operating system ("is.. is that Mandriva?!") or the like.


Honest question:

 

Are you (the Dev Team) considering revealing what happened in that 'Cassiano' event and creating a blog post about it? How did he duplicate all those infusions and gold? How did your systems not detect the issue until he made a terrible mistake on live? Has he been completely removed from the game (including all alt and other suspicious accounts), along with the duplicated items? Is the duplication process fixed, or is it still happening somewhere else?

 

Thank you!


Dear Robert, I feel you. As a back-end dev and current team leader on my project, I also migrated servers from on-premises to AWS, all in this year of remote working. Now I feel even more connected to GW2 than before. Thanks for sharing your story, which I relate to so much.

Also, it's nice to know what happened on those days; we will not forget them either xD


Super interesting article. I appreciate your hard work, but now I know who to moan at because the BLTC is running like crap. 🤣

 

And it does run like crap... It freezes up every 5-10 seconds, and it times out/loses connection to the server CONSTANTLY. It's total crap.

 

But on a more positive note, it's a hell of a lot better than it was a year ago - a literal night-and-day difference compared to back then - but it's not quite as smooth as when the game first launched.


Very interesting read. Nice to get some insight into the rollback. We players speculated about what had happened; there were several theories.

I was one of the people who logged in during the rollback. Initially I was so confused by what had happened to my characters. (On one character I had lost several levels, and a set of rare armour). Then I saw in the general chat that I wasn't the only one affected. Some people were much worse off than me, having lost important items or achievements. After that I was uncertain whether I should continue playing, or not log in for a few days until the issue was fixed. It was quite unnerving. Hopefully it won't happen again.


As an ex-developer myself (software and database engineer, not games), topics like this are always fun reads. They also help to flesh out ANet as a company from a marketing perspective. Whoever is behind this initiative is on the right path; they have identified a real strength of GW2 and are happy to market it in a way that highlights the people behind the studio.


Great post, well written and very informative. Keep up the good work, and YES, I would love to see more "Behind the Scenes" posts! Handling the massive amount of data and being in that "always online" state is obviously a huge challenge, but you are doing everything you can to maintain a stable platform/world for us all to enjoy! I'm really pleased to see you have backups and backup-backups, haha! Thanks again, and it's only a mistake if you don't learn from it! Cheers.


This topic is now closed to further replies.