
REQUEST: Dynamic Events monitored by watchdog process



An RTS game should use many of the same techniques used in an RTOS. I'm referring to a system watchdog process.

In an embedded system, if the watchdog process detects a stuck process or out-of-range conditions, it sends a signal to restart the process, or even the entire system, in an orderly fashion after notifying the user (if possible). It is an EXTREMELY low-overhead feature that creates a robust system.

In the RTS (RPG game), I think the same idea should apply. Your system and instances already have logic to start and stop instances based on conditions. When a stuck event is detected, a bug report could be sent to the logging server and the corresponding process would be restarted. If necessary, the map instance would mark itself for recycling and follow the same logic you already have for low-population maps. The watchdog can even be used to grab a memory snapshot for later debugging. During lulls or available time, staff could gradually schedule and fix one or two of the top offenders on the failure list along with other development activity on that map.
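To make the idea concrete, here is a minimal sketch of a heartbeat-style watchdog in Python. Everything in it is an assumption for illustration: the `Watchdog` and `EventState` names, the idea that each event reports progress through `heartbeat()`, and the in-memory `reports` list standing in for the logging server. It is not how the game's servers actually work.

```python
import time
from dataclasses import dataclass


@dataclass
class EventState:
    """Hypothetical record for one dynamic event (not a real game structure)."""
    name: str
    last_progress: float  # time of the last observed progress
    restarts: int = 0


class Watchdog:
    """Flags events that have made no progress for `stall_after_s` seconds."""

    def __init__(self, stall_after_s, clock=time.monotonic):
        self.stall_after_s = stall_after_s
        self.clock = clock      # injectable so the logic can be tested
        self.events = {}
        self.reports = []       # stand-in for a logging server

    def register(self, name):
        self.events[name] = EventState(name, self.clock())

    def heartbeat(self, name):
        """Called by an event whenever it makes progress."""
        self.events[name].last_progress = self.clock()

    def check(self):
        """Report and restart every stalled event; return their names."""
        now = self.clock()
        restarted = []
        for ev in self.events.values():
            stalled_for = now - ev.last_progress
            if stalled_for >= self.stall_after_s:
                self.reports.append((ev.name, stalled_for))  # the bug report
                ev.restarts += 1
                ev.last_progress = now  # the restart counts as fresh progress
                restarted.append(ev.name)
        return restarted
```

A periodic call to `check()` is the whole runtime cost, which is why this kind of monitor is usually considered cheap.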

We have to generate so many bug reports and tickets because events are sometimes stuck for days. It is so common that when someone says "event is glitched" in map chat, people don't even need an explanation.

I know of a few events that are almost ALWAYS stuck. This has been the case for almost 4 years. It breaks immersion and leaves a negative impression.


Sounds like a hackish solution, but preferable to our current crisis of abandoned, stalled events until reset.

Depending on how difficult it is to implement, Anet could even just force close maps once a day, which would be the easiest, and dumbest, solution. That would still be preferable to having stalled events on non-popular maps going on for multiple days. :/

Perhaps all active events could just get a timeout timer of 30 minutes. I am sure that no matter how stalled an event is, these triggers would still take effect and manage to return the map to a working condition for the next cycle of the event?
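A rough Python sketch of that timer, just to show how little logic it needs. The `TimedEvent` class, its `state` strings, and the `tick()` method are made up for illustration; only the 30 minutes comes from the suggestion above.

```python
from dataclasses import dataclass

TIMEOUT_S = 30 * 60  # the 30-minute limit suggested above


@dataclass
class TimedEvent:
    """Hypothetical event with a hard deadline, independent of its own logic."""
    name: str
    started_at: float
    state: str = "active"

    def tick(self, now):
        """Force the event to 'failed' once the deadline has passed."""
        if self.state == "active" and now - self.started_at >= TIMEOUT_S:
            self.state = "failed"  # the normal failure path then resets the cycle
        return self.state
```

The point of the design is that the deadline check does not depend on whatever part of the event got stuck, so it should still fire when the event's own triggers do not.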


You mention hackish when in fact it is a best practice when designing a fault-tolerant system. It sounds complicated because it is complex. At work, I always cringe when I hear a programmer say "It's too hard". They probably made a bad career choice. Systems that can perform triage, report or log an error, and continue working save workers untold hours.

But a well-designed, robust system does have complex elements so that the resulting RELIABLE application appears simple to the end-user.


"Systems that can perform triage, report or log an error, and continue working." If only Windows were that way. :P

Though I am used to most games not really being designed to be resilient to bugs, it would be great, where possible, to include this kind of QoL feature, which greatly reduces the frustration of dealing with bugs such as stalled events.

Unfortunately, Anet seems not to have the budget to embark on complex upgrades that won't have significant returns. See how other hot topics like build templates have been pending for years. So... I would not hold my breath for a properly coded solution if the investment doesn't actually pay out beyond "QoL" level.

Thus, I suppose that if something like this ever gets introduced, it'd be in a hackish way, like adding a 30-minute failure timer that is hidden and doesn't even show in the UI?


@"keenedge.9675" said:An RTS game should use many of the same techniques used in an RTOS. I'm referring to a system watchdog process.


This is another self-defeating suggestion, since you are appealing to an emphasis on quality control, the lack of which is what caused these bugs to go unfixed in the first place. Let's not play the game of finding convoluted alternatives to fixing bugs. Just fix the bugs.


Do you realize how many of these stalls are player-created as well? I'm not going to say a huge percentage, but my guess is that around 50% of these end up being caused by the players themselves, mostly accidentally but sometimes intentionally. (Yes, players can intentionally cause an event to stall or bug out if they know how the event works and do certain things during it, and it's usually fairly easy.)


@"nosleepdemon.1368" said:Gee. If only the developers had thought of a way to cure stuck events. Clearly, the solution is as simple as writing it out in a forum post!

Sometimes things are overlooked, even simple things; actually, the simple things are often overlooked. Sure, this isn't "simple" given the amount of work it would need, but a piece of code that force-restarts an event when it breaks is something I would have set up if I were in the coding world. It would also make bugs easier to find and possibly fix: if reports of an event breaking at the same point every time started popping up, you could investigate that a lot more easily than the entire event chain. And the snark was a little unneeded.
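For what it's worth, grouping those reports is only a few lines of Python. This is a sketch under assumptions: each automatic report is taken to be a hypothetical `(event_name, step)` pair, and `top_offenders` is a made-up helper name.

```python
from collections import Counter


def top_offenders(reports, n=2):
    """Return the n most frequent (event_name, step) pairs with their counts.

    `reports` is assumed to be an iterable of (event_name, step) tuples,
    one per automatic restart; the format is hypothetical.
    """
    return Counter(reports).most_common(n)
```

If one event keeps breaking at the same step, that pair rises to the top of the list and tells you where in the chain to look first.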


Sounds great in theory, but "detecting a stuck event" would be non-trivial. Even assuming there is already a way to query an instance for the state of all events, the watchdog process would also need to be aware of how all those states interact.

A much easier solution would be giving every instance a maximum lifetime.
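Something like the following Python sketch is what I have in mind. The `MapInstance` class and its fields are hypothetical, and closing only once the map is empty is my own assumption, borrowed from the low-population recycling mentioned earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class MapInstance:
    """Hypothetical map instance with a fixed maximum lifetime."""
    created_at: float
    max_lifetime_s: float
    players: int = 0

    def accepts_new_players(self, now):
        """An expired instance stops taking new players."""
        return now - self.created_at < self.max_lifetime_s

    def should_close(self, now):
        """Close only after expiry and once the last player has left."""
        return not self.accepts_new_players(now) and self.players == 0
```

No knowledge of individual events is needed: any stuck state disappears when the instance is recycled, at the cost of waiting up to one full lifetime.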


Archived

This topic is now archived and is closed to further replies.
