
Inside ArenaNet: Live Game Outage Analysis


Recommended Posts

I very much enjoyed the News Blog post.  It was very entertaining (you may have a second career in writing :classic_wink: ) and most informative.

 

Thank you for sharing, and I look forward to many more News Blog posts.

(Hopefully, we will receive them regularly from a variety of ArenaNet staff, on a multitude of topics.)

 

Keep up the good work; it is appreciated. 


Interesting article! As someone who works in QA for a financial institution, writes automation code, and helps manage the test databases, it was a super interesting read.

Reminds me of incidents we've had that ended up being something silly (like a space in a production password that had always been there, but a new vendor driver no longer trimmed it, so devices started "randomly" going offline when they did their check-in).
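If it helps to picture it, here's a tiny, made-up illustration of that failure mode (the names and values are invented, not our actual stack):

```python
# Tiny illustration of the trailing-space failure mode (all names made up).
STORED_PASSWORD = "s3cret "          # the space that had "always been there"
EXPECTED = "s3cret"                  # what the auth backend actually compares against

def old_driver_auth(pw: str) -> bool:
    # Old vendor driver quietly trimmed whitespace from credentials.
    return pw.strip() == EXPECTED

def new_driver_auth(pw: str) -> bool:
    # New driver passes credentials through verbatim.
    return pw == EXPECTED

print(old_driver_auth(STORED_PASSWORD))  # True  -> devices check in fine for years
print(new_driver_auth(STORED_PASSWORD))  # False -> devices start "randomly" going offline
```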

 

 


Thank you very much for that read!!!
I am enjoying learning about the game's behind-the-scenes and, really, KUDOS TO YOU ALL!!!

It has been a widespread meme in the community that you at ANet just don't communicate with the players beyond "There's this new shiny in the Gem-Shop."

But you are really turning the wheel around lately. You're blowing us out of the water and out of our comfort zone (but in a good way :D). I am absolutely LOVING the amount of intel and communication we receive: from the big summer announcement, to the balance patch, to your reaction to the community's reaction to the balance patch, to how freaking well the Legendary Armory turned out. I am absolutely stunned, and I got no Stun-break.

Definitely hyped for the future :D


As a developer who is always interested in how other companies work and who has spent a ton of time dealing with both AWS and Azure, not only was this interesting to read, but it also makes me think you should look into more managed solutions, like avoiding managing your own instances and using their managed databases instead.

 

As a player, I love this continued feedback and insight into the company and how hard you guys work to ensure a stable game.


Thanks for this post; it was nicely thorough and very insightful. Thanks for the excellent uptime.

Probably not the right place to mention this, but it reminded me that I have long wished for a similarly detailed explanation of the pros and cons around WvW blob-versus-blob skill lag (with respect to server hardware and packet size, or the amount of data to/from each client per second). I believe I've previously read that sending data to every other player becomes the issue when more and more players are nearby.
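For what it's worth, here's a rough back-of-the-envelope sketch of why that blows up; the update size and tick rate are pure guesses on my part, not ArenaNet's actual numbers:

```python
# Back-of-the-envelope: per-second broadcast cost in a blob-v-blob fight.
# Update size and tick rate are guesses for illustration only.

def per_second_traffic(nearby_players: int,
                       bytes_per_update: int = 50,   # assumed size of one player-state update
                       ticks_per_second: int = 10):  # assumed server send rate
    """Return (bytes per client, total bytes the server sends) per second."""
    # Each client needs an update about every *other* nearby player, every tick.
    per_client = (nearby_players - 1) * bytes_per_update * ticks_per_second
    # The server repeats that for every client, so total traffic grows roughly quadratically.
    return per_client, nearby_players * per_client

for n in (10, 50, 150):  # small skirmish, squad fight, three-way blob
    per_client, total = per_second_traffic(n)
    print(f"{n:>3} players: {per_client / 1e3:6.1f} kB/s per client, "
          f"{total / 1e6:6.2f} MB/s total from the server")
```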


That was a very interesting read. While I'm not in dev-ops, or in this case platform-ops, I do understand the difficulties of managing multiple interconnected systems, each with its own dependencies and technical requirements. Thank you, Grouch, for making sure ArenaNet communicates with the community. Welcome back.


Really a very interesting read, thank you so much for sharing it! I think this kind of openness (without revealing critical business details, of course) is great for both players and developers out there (like myself).

It can help the community be more understanding and patient when such things occur, and learn that things aren't as easy and simple as they can look from the outside. It can also help developers from different areas out there learn from others' experiences, mistakes, and successes, and know how to move forward.

It reminded me of this post (https://steamcommunity.com/app/323380/discussions/3/361787186427352059/) from another game (which, ironically, is all about satirizing game development and the relationship between devs and players). Out of nowhere, one of the devs made that post sharing how they did an AI optimization, even including technical details and code snippets. Really cool stuff!

Kudos to all the devs and teams, and thanks again for all your work to make this game great! ❤️


So happy to hear you love the team you work with. We have a similar incident management process at work, and I agree the culture plays a huge role. In the end it should never be about finding someone to blame; it's about learning how to prevent the issue from happening in the future. People make mistakes, and programs and processes help us catch them early! 😄

 

It's great to read a bit about your setup; it makes one feel closer to the team.

 

GW2 is absolutely amazing with its uptime, keep up the amazing work! 🥰 You rock


As a very (very, very) senior platform engineer and architect, I think this outage raises a number of serious questions from an architectural and process standpoint.

  1. Why weren't outstanding queries and disk throughput/latency/queue depth already being monitored? Those are all very much core metrics, and often the only warning sign you get when AWS has another disk outage because of a dependency on us-east, or when the host dies with AWS completely unaware of it. (Oh yes, this has happened. Many times.) See the sketch after this list for the kind of alarm I mean.
  2. Why is the database still architected as a single A/P pair? (We'll avoid the digression on live service issues for the moment; I might come back to it.) This either leaves you vulnerable to single-region outages or subjects you to a self-inflicted Denial of Cash attack. It also means that in the event of future corruption or failures, you still have the exact same problem.
  3. Which brings me to three: why wasn't there a cold or warm standby? If there were known good backups, what prevented operations from spinning up a new instance while troubleshooting was going on? (Yes, I know the caveats. But that should have brought things back in a lot less time, or at least provided an A/B set.)
  4. While I know it's completely unthinkable and we totally don't want to admit it could ever happen, why wasn't there a predefined "EPO" (Emergency Power Off) process outside of change control? Believe me, I can tell you exactly how bad it can get if bad data is being written to a live service and you can't perform an immediate shutdown. Especially a shared tenant live service. (Buy me a beer and I'll tell you the whole story, Robert. It's a good one.)
  5. Stepping away from the live service portion, why is ANet self-managing the guest OS? (I can make some educated guesses. Neither of us want me to. Trust me. They come with a lecture series.) Unless you've got a full time junior me doing nothing but maintaining that, that's a recipe for exactly this kind of root cause.
  6. What changes has ANet made to the dev/stage/live test process to better align it with real-world loads? Yes, I know you can't fire up 5,000+ desktops running bot clients. But have you implemented things like Hell Queries and Keyboard Chewers?
  7. Okay, I said I might come back to it. So let's. Why is the game still using a single database architecture? Yes, certain data must be present always (account, inventory, guilds, etc) and synchronized across all instances. But as GW2 is not a seamless world, I would have thought backend would have switched to zone shard and flush. (Also would have assumed RDS, but good on you for avoiding that Hotel California. Seriously.) Especially as backend changes like that require fewer or no client changes.
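To make point 1 concrete: below is a minimal sketch of the kind of alarm I have in mind, assuming an EBS-backed, self-managed database host and boto3. The region, volume ID, threshold, and SNS topic are all placeholders, not anything from ANet's actual setup.

```python
# Minimal sketch: alert when an EBS volume's I/O queue starts backing up.
# Region, volume ID, threshold, and SNS topic are placeholder assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="db-volume-queue-depth-high",
    AlarmDescription="Outstanding I/O on the DB data volume is piling up",
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",           # outstanding read/write operations
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Average",
    Period=60,                                # one-minute samples
    EvaluationPeriods=5,                      # sustained for five minutes
    Threshold=32.0,                           # tune to the volume's provisioned IOPS
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",             # a silent volume is also a bad sign
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-pager"],
)
```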

That's all that's coming to mind for the moment, though. From the outside looking in, there are only so many questions you can ask without making bad assumptions about things like the database type (I mean, I'm guessing MySQL but hoping it's PostgreSQL, though MS SQL is also fine) or the operating system ("is.. is that Mandriva?!") or the like.


Honest question:

 

Are you (the Dev Team) considering revealing what happened in that 'Cassiano' event and creating a blog post about it? How did he duplicate all those infusions and gold? How did your systems not detect the issue until he made a terrible mistake on live? Has he been completely removed from the game (including all alt and other suspicious accounts), along with the duplicated items? Is the duplication process fixed, or is it still happening somewhere else?

 

Thank you!


Dear Robert, I feel you. As a back-end dev and current team leader on my project, I also migrated servers from on-premises to AWS, all in this year of remote working. Now I feel even more connected to GW2 than before. Thanks for sharing your story, which I relate to so much.

Also, it's nice to know what happened on those days; we will not forget them either xD


Super interesting article. I appreciate your hard work, but now I know who to moan at because the BLTC is running like crap. 🤣

 

And it does run like crap... It freezes up every 5-10 seconds, and it times out/loses connection to the server CONSTANTLY. It's total crap.

 

But on a more positive note, it's a hell of a lot better than it was a year ago - a literal night-and-day difference compared to back then - but it's not quite as smooth as when the game first launched.


Very interesting read. Nice to get some insight into the rollback. We players speculated about what had happened; there were several theories.

I was one of the people who logged in during the rollback. Initially I was so confused by what had happened to my characters. (On one character I had lost several levels, and a set of rare armour). Then I saw in the general chat that I wasn't the only one affected. Some people were much worse off than me, having lost important items or achievements. After that I was uncertain whether I should continue playing, or not log in for a few days until the issue was fixed. It was quite unnerving. Hopefully it won't happen again.


As an ex-developer myself (software and database engineer, not games), topics like this are always fun reads. They also help to flesh out ANet as a company from a marketing perspective. Whoever is behind this initiative is on the right path; they have identified a real strength of GW2 and are happy to market it in a way that highlights the people behind the studio.


Great post, well written and very informative. Keep up the good work, and YES, I would love to see more "Behind the Scenes" posts! Handling the massive amount of data and being in that "always online" state is obviously a huge challenge, but you are doing everything you can to maintain a stable platform/world for us all to enjoy! I'm really pleased to see you have backups and backup-backups, haha! Thanks again, and it's only a mistake if you don't learn from it! Cheers.


This topic is now closed to further replies.