Robert Neckorcuk.1502

ArenaNet Staff
Everything posted by Robert Neckorcuk.1502

  1. Hi All! Many thanks for the feedback about this post and communication in general! I really enjoyed preparing this content - I hope my passion for this game and community showed through! Thanks for reading! Also, there was a Skritt insider. I've made some new graphs to watch for it 😄 -R
  2. The EU tourn has restarted; however, it appears to have still pulled a bad value on startup. We are starting another opening at 1:30pm PST, and the tournament will start at 1:45 - this one will have a team limit of 100. -R
  3. We are looking into this. And we will still be running the mAT this weekend! -R
  4. We made a few minor changes to deal with "missed network messages", which appear to be working in both normal matches and this monthly tourn. Overall this specific tournament ran quite smoothly, so we did not have to use any of the new tools to correct anything. There is still a known issue in the case of the underlying microservice or hardware dying completely; we can recover about 90% of the information that we need to resume the tournament from the point of failure. This is a change that is still being worked on and tested, to ensure no new bugs or regressions are introduced.
  5. Hello there! Just wanted to pop in and say that we have identified Sparkfly Fen as one of the more server intensive maps, and we are monitoring the numbers to see if there is something we can do to improve them. -R
  6. The original spelling (and the inflection marks) have been altered/removed, but it's pronounced 'neck-or-chuck'. I usually add emphasis on the 'or', but sometimes my dad or uncle will stress the 'Neck'.

This was an item we looked at when testing the fix. At the current traffic levels, we saw no measurable difference. Because the Gateways are essentially just routers, if we do start to see a large uptick in messages sent, the impact would be slightly increased latency for object data/state updates. If things become measurably slower, there is always the option of getting beefier hardware, and/or a larger software change where rosters and other objects would have knowledge of their specific gateway connection and would update the backing service if and when that gateway connection changes (see the first sketch after these posts).

Thanks for all the positive feedback! I'll have to keep digging into interesting bugs and writing them up for you all!
  7. Hello Again PvP Community! I wanted to provide you with a status update on the Queue Instability issues. For many of you, noticed or not, there has been a noticeable increase in the reliability of the queue system (about 10x) since we deployed a change last Wednesday, July 10. That being said, we are still seeing two additional types of 'stuck' screens that we are continuing to dive into.

So what fix did we push out last week? Depending on how closely you follow our tech or the industry's, you may know that our infrastructure is built using micro-services. Each service deals with (ideally) one core task, and can talk to other micro-services through messaging. The micro-service that handles arena-based PvP is called PvpSrv. (Go figure...) When creating objects (arenas, matches, rosters, etc.), PvpSrv will "talk" to other services to persist the current data and state of each of these objects.

For some clusters of micro-services, each service is able to talk to the others directly - no middlemen or gatekeepers or anything. Some micro-services, however, live in different clusters. For PvpSrv to talk to those services, it must make a connection to a "gateway" micro-service, and that gateway will forward the message to the appropriate micro-service in a different cluster. This all works well for the case of a few micro-services sending a few messages, but PvpSrv is not the only service talking cross-cluster. We have... several... gateways that handle the traffic of... several... micro-services.

So there's our background - PvpSrv, when setting up a player in a new roster, will send messages to some local micro-services for data and persistence, and will send messages through gateways for additional data and state persistence.

How was this causing "stuck" rosters? The PvpSrv config was set up to use 'round robin' gateway connections; each roster would get its state updates through a different gateway. (e.g. If we had 4 gateways, 25% of all rosters would be on gateway 1, 25% of rosters on gateway 2, etc.) This worked well for distributing the message load, but didn't work so well for restoration and resilience. There are many reasons a connection can be lost: a service can restart, hardware can die (much less likely), or a network can disconnect (more common than you think). In the case of PvpSrv talking to the gateways, if and when a gateway connection terminated, PvpSrv would have all the rosters re-connect to the new pool of available gateways. The majority of rosters would retain their existing connection. However, rosters that had been talking through the terminated gateway would create a new connection to another gateway, but the micro-service they were talking to would not know where to send any in-progress response messages. If a state update was made, the backing service would now be sending a message to a gateway that may or may not be connected to a given roster object. Then of course, the roster object would miss its state update, and it would, well, stick.

In terms of code changes, the actual change was very simple - instead of round-robin assignment, PvpSrv now connects to one gateway with a single connection. If and when this one connection is severed, PvpSrv will connect to another, single gateway. All rosters are associated with that single gateway, and the backing micro-services have only one location through which to send messages (see the second sketch after these posts). This was a great find, and I am glad to have seen the incident count drop dramatically over the past week.

As stated, we still have some work to do, and we are currently eyes-deep in an issue surrounding map voting and sticking progress. We hope this and other upcoming changes positively impact your PvP experiences! -R
  8. Hello GW2 PvPers! My name is Robert and I'm on the Platform/Server team here at ArenaNet. I've been working with the PvP team for the past few weeks on this issue of queue instability. I wanted to write a quick note here and try to give a bit of insight as to what's going on.

First and foremost, this is a very visible issue for the team here. We understand your frustration, and we are working both on improved mitigation and on a resolution for this issue.

This week, alongside the new release, we deployed additional code to the PvP servers. The goals for these changes were additional logging and metrics to help us better understand the patterns around this issue, as well as an attempted timer-based mitigation to reduce the overall time that a stuck roster takes to get cleaned up (see the third sketch after these posts). We placed the timer update in one of the main code paths for roster destruction; however, we have not seen any instances of this timer having been activated thus far. This tells us that the main destroy path is working, but that does not yet help those players who are getting stuck.

The root cause of this issue deals with the mechanism we are using to sync multiple pieces of information from a single source of truth. When a roster is "finished" (either through completing its series of states, or getting invalidated during one of its states), we need to perform update and cleanup steps across multiple objects and multiple networked servers. These steps must be resilient to transient errors caused by internal server errors, network timeouts, and more. The simplest explanation for where we are right now is: we have 100 different ways to manage an object, and 99 of them are working correctly.

With the new logging and metrics, we are already gaining additional insight into where the divergent behaviour is originating. This is good news, and we are making progress, but writing and rolling out the root cause fix will still take some time.

Thanks, and we appreciate your patience, -R
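A minimal sketch of the alternative design mentioned in post 6, where each roster would know its own gateway connection and update the backing service when it changes. All names here (Roster, BackingService, UpdateRouteForRoster, the gateway IDs) are hypothetical, and the Go code is only an illustration of the idea, not PvpSrv's actual implementation:

```go
// Sketch: each roster tracks which gateway it currently uses and tells the
// backing service whenever that changes, so replies never go to a stale gateway.
package main

import "fmt"

// BackingService stands in for whatever cross-cluster service persists roster
// state; here it only needs to learn which gateway to route replies through.
type BackingService interface {
	UpdateRouteForRoster(rosterID, gatewayID string)
}

// Roster remembers its own gateway connection instead of relying on a
// process-wide assignment.
type Roster struct {
	ID        string
	gatewayID string
	backing   BackingService
}

// OnGatewayChanged is called when the roster's connection is re-established
// through a different gateway; it pushes the new route to the backing service.
func (r *Roster) OnGatewayChanged(newGatewayID string) {
	if newGatewayID == r.gatewayID {
		return // same gateway, nothing to do
	}
	r.gatewayID = newGatewayID
	r.backing.UpdateRouteForRoster(r.ID, newGatewayID)
}

// logBacking is a stand-in backing service that just prints the route update.
type logBacking struct{}

func (logBacking) UpdateRouteForRoster(rosterID, gatewayID string) {
	fmt.Printf("roster %s now reachable via gateway %s\n", rosterID, gatewayID)
}

func main() {
	r := &Roster{ID: "roster-42", backing: logBacking{}}
	r.OnGatewayChanged("gateway-1") // initial connection
	r.OnGatewayChanged("gateway-3") // gateway-1 died, reconnected elsewhere
}
```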
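A minimal sketch of the single-gateway failover described in post 7: the service keeps exactly one active gateway connection, every roster's cross-cluster messages share it, and when that connection is severed the service fails over to another gateway for all rosters at once. GatewayPool, Connect, Send, OnDisconnect, and the gw-N addresses are hypothetical names for illustration only, not the actual PvpSrv change:

```go
// Sketch: one shared gateway connection with simple failover, so backing
// services always have a single, current location to send responses to.
package main

import (
	"errors"
	"fmt"
)

// GatewayPool holds the addresses of the available gateway micro-services and
// the single connection currently in use.
type GatewayPool struct {
	addresses []string
	active    string
}

// Connect picks one gateway and makes it the single active connection.
func (p *GatewayPool) Connect() error {
	for _, addr := range p.addresses {
		if addr != p.active { // pick any gateway other than the one just lost
			p.active = addr
			fmt.Println("connected to", addr)
			return nil
		}
	}
	return errors.New("no gateway available")
}

// Send routes a roster's state update through the single active gateway,
// connecting first if no gateway is active yet.
func (p *GatewayPool) Send(rosterID, msg string) error {
	if p.active == "" {
		if err := p.Connect(); err != nil {
			return err
		}
	}
	fmt.Printf("roster %s -> %q via %s\n", rosterID, msg, p.active)
	return nil
}

// OnDisconnect is called when the active gateway dies; every roster follows
// the service to the next gateway because they all share the one connection.
func (p *GatewayPool) OnDisconnect() error {
	fmt.Println("lost connection to", p.active)
	return p.Connect()
}

func main() {
	pool := &GatewayPool{addresses: []string{"gw-1", "gw-2", "gw-3"}}
	pool.Send("roster-7", "state update")
	pool.OnDisconnect()
	pool.Send("roster-7", "state update") // same roster, new shared gateway
}
```

The trade-off, as noted in post 6, is that one shared connection concentrates message load on a single gateway, which is why heavier hardware or per-roster gateway tracking remain options if traffic grows.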
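A minimal sketch of the timer-based mitigation described in post 8: a watchdog timer is armed when a roster reaches a terminal state; the normal destroy path cancels it, and if that path never runs, the timer forces the cleanup. ArmCleanupWatchdog and Destroy are hypothetical names, and this only illustrates the general technique, not the deployed code:

```go
// Sketch: a watchdog timer that force-cleans a roster if the normal destroy
// path never fires, so players are not left on a stuck screen indefinitely.
package main

import (
	"fmt"
	"sync"
	"time"
)

// Roster is the object whose cleanup we are guarding.
type Roster struct {
	ID        string
	mu        sync.Mutex
	destroyed bool
	watchdog  *time.Timer
}

// ArmCleanupWatchdog starts the mitigation timer when the roster reaches a
// terminal state.
func (r *Roster) ArmCleanupWatchdog(timeout time.Duration) {
	r.watchdog = time.AfterFunc(timeout, func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		if !r.destroyed {
			fmt.Printf("watchdog: force-destroying stuck roster %s\n", r.ID)
			r.destroyed = true
		}
	})
}

// Destroy is the normal destruction path; if it runs, the watchdog is cancelled
// and never fires.
func (r *Roster) Destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return
	}
	r.destroyed = true
	if r.watchdog != nil {
		r.watchdog.Stop()
	}
	fmt.Printf("roster %s destroyed via the normal path\n", r.ID)
}

func main() {
	healthy := &Roster{ID: "roster-1"}
	healthy.ArmCleanupWatchdog(100 * time.Millisecond)
	healthy.Destroy() // normal path wins, timer is cancelled

	stuck := &Roster{ID: "roster-2"}
	stuck.ArmCleanupWatchdog(100 * time.Millisecond)
	// the normal Destroy never happens for this one...
	time.Sleep(200 * time.Millisecond) // ...so the watchdog fires
}
```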