REQUEST: Dynamic Events monitored by watchdog process

keenedge.9675 · August 21, 2018

An RTS game should use many of the same techniques used in an RTOS. I'm referring to a system watchdog process.

In an embedded system, if the watchdog process detects a stuck process or out-of-range conditions, it sends a signal to restart the process or even the entire system in an orderly fashion after notifying the user (if possible). It is an EXTREMELY low overhead feature that creates a robust system.

In the RTS (RPG game), I think the same idea should apply. Your system and instances already have logic to start and stop instances based on conditions. When a stuck event is detected, a bug report could be sent to the logging server and the corresponding process would be restarted. If necessary, the map instance would mark itself for recycling and follow the same logic you already have for low-population maps. The watchdog can even be used to grab a memory snapshot for later debugging. During lulls or available time, staff could gradually schedule and fix 1 or 2 of the top offenders on the failure list along with other development activity on that map.
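A minimal sketch of the detect, report, and restart loop described above, in Python. Everything here is an assumption made for illustration: the `EventState` and `Watchdog` classes, the heartbeat approach to spotting a stall, and the `reports` list standing in for the logging server are hypothetical and say nothing about how the game servers actually work.

```python
from dataclasses import dataclass, field


@dataclass
class EventState:
    """Hypothetical record for one dynamic event; not an actual game structure."""
    name: str
    last_progress: float  # time (seconds) at which the event last advanced
    restarts: int = 0


@dataclass
class Watchdog:
    stall_after: float  # seconds without progress before an event counts as stuck
    events: dict = field(default_factory=dict)
    reports: list = field(default_factory=list)  # stand-in for the logging server

    def heartbeat(self, name: str, now: float) -> None:
        """Called by an event whenever it makes progress."""
        state = self.events.get(name)
        if state is None:
            self.events[name] = EventState(name, now)
        else:
            state.last_progress = now

    def check(self, now: float) -> list:
        """Report and restart every event that has been silent for too long."""
        restarted = []
        for state in self.events.values():
            idle = now - state.last_progress
            if idle > self.stall_after:
                self.reports.append(f"{state.name} stuck for {idle:.0f}s")
                state.restarts += 1
                state.last_progress = now  # a restart resets the progress clock
                restarted.append(state.name)
        return restarted


def top_offenders(dog: Watchdog, n: int = 2) -> list:
    """Events with the most restarts, i.e. the first candidates for a real fix."""
    ranked = sorted(dog.events.values(), key=lambda s: s.restarts, reverse=True)
    return [s.name for s in ranked[:n] if s.restarts > 0]
```

The restart counter is what would feed the "top offenders" list: the watchdog keeps the map playable while the counts show which events deserve a proper fix first.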

We have to generate so many bug reports and tickets because events are sometimes stuck for days. It is so common that when someone says "event is glitched" in map chat, people don't even need an explanation.

I know of a few events that are almost ALWAYS stuck. This has been the case for almost 4 years. It breaks immersion and leaves a negative impression.

Skotlex.7580 · August 21, 2018

Sounds like a hackish solution, but preferable to our current crisis of abandoned, stalled events until reset.

Depending on how difficult to implement it is, Anet could even just force close maps once a day, which would be the easiest, and dumbest, solution. That would still be preferable to having stalled events on non-popular maps going on for multiple days. :/

Perhaps all active events could just get a timeout timer of 30 minutes. I am sure that no matter how stalled an event is, these triggers would still take effect and manage to return the map to a working condition for the next cycle of the event?
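The timeout idea could look roughly like this in Python. It is only a sketch: the `TimedEvent` class and its "active" and "failed" states are made up for illustration, and it assumes a failed event falls back into the normal event cycle.

```python
EVENT_TIMEOUT_SECONDS = 30 * 60  # the 30-minute limit suggested above


class TimedEvent:
    """Hypothetical event wrapper; the state names are invented for illustration."""

    def __init__(self, name: str, started_at: float):
        self.name = name
        self.started_at = started_at
        self.state = "active"

    def tick(self, now: float) -> str:
        """Force the event to fail once it has been active for too long."""
        if self.state == "active" and now - self.started_at >= EVENT_TIMEOUT_SECONDS:
            self.state = "failed"  # failure hands control back to the normal cycle
        return self.state

    def reset(self, now: float) -> None:
        """Start the next cycle of the event from a clean state."""
        self.started_at = now
        self.state = "active"
```

The timer only looks at elapsed time, so it needs no knowledge of why the event stalled, which is what would keep it cheap to add.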

keenedge.9675 · August 22, 2018

You mention hackish when in fact it is a best practice when designing a fault-tolerant system. It sounds complicated because it is complex. At work, I always cringe when I hear a programmer say "It's too hard". They probably made a bad career choice. Systems that can perform triage, report or log an error, and continue working save workers untold hours.

But a well-designed, robust system does have complex elements so that the resulting RELIABLE application appears simple to the end-user.

Skotlex.7580 · August 22, 2018

"Systems that can perform triage, report or log an error, and continue working" If only Windows were that way. :P

Though I am used to most games not really being designed to be resilient to bugs, it would be great, where possible, to include such QoL features, which greatly reduce the frustration of dealing with bugs such as stalled events.

Unfortunately, Anet seems not to have the budget to embark on complex upgrades that won't have significant returns. See how other hot topics like build templates have been pending for years. So... I would not hold my breath for a properly coded solution if the investment doesn't actually pay out beyond "QoL" level.

Thus, I suppose if something like this ever gets introduced, it'd be in a hackish way, like adding a 30 minute failure timer... which is hidden and doesn't even show in the UI?

Leablo.2651 · August 23, 2018

@"keenedge.9675" said: An RTS game should use many of the same techniques used in an RTOS. I'm referring to a system watchdog process.
In an embedded system, if the watchdog process detects a stuck process or out-of-range conditions, it sends a signal to restart the process or even the entire system in an orderly fashion after notifying the user (if possible). It is an EXTREMELY low overhead feature that creates a robust system.
In the RTS (RPG game), I think the same idea should apply. Your system and instances already have logic to start and stop instances based on conditions. When a stuck event is detected, a bug report could be sent to the logging server and the corresponding process would be restarted. If necessary, the map instance would mark itself for recycling and follow the same logic you already have for low-population maps. The watchdog can even be used to grab a memory snapshot for later debugging. During lulls or available time, staff could gradually schedule and fix 1 or 2 of the top offenders on the failure list along with other development activity on that map.
We have to generate so many bug reports and tickets because events are sometimes stuck for days. It is so common that when someone says "event is glitched" in map chat, people don't even need an explanation.
I know of a few events that are almost ALWAYS stuck. This has been the case for almost 4 years. It breaks immersion and leaves a negative impression.

This is another self-defeating suggestion since you are appealing to an emphasis on quality control, the lack of which is what caused these bugs to go unfixed in the first place. Let's not play the game of finding convoluted alternatives to fixing bugs. Just fix the bugs.

Zaklex.6308 · August 23, 2018

Do you realize how many of these stalls are player created as well? I'm not going to say a huge percentage, but my guess is that around 50% of these end up being caused by the players themselves, mostly accidentally but sometimes intentionally. (Yes, players can intentionally cause an event to stall or bug out if they know how the event works and do certain things during that particular event, and it's usually fairly easy.)

nosleepdemon.1368 · August 23, 2018

Gee. If only the developers had thought of a way to cure stuck events. Clearly, the solution is as simple as writing it out in a forum post!

Drecien.4508 · August 26, 2018

Why not just flush all maps at reset, or shortly after in case an event is going on?

Dante.1763 · August 26, 2018

@"nosleepdemon.1368" said: Gee. If only the developers had thought of a way to cure stuck events. Clearly, the solution is as simple as writing it out in a forum post!

Sometimes things are overlooked, even simple things. Actually, the simple things are often overlooked. Sure, this isn't "simple" given the amount of work it would need, but having a piece of code that force restarts an event when it breaks is something I would have done if I was in the coding world. It would also make bugs easier to find and possibly fix: if reports of an event breaking at the same point every time started popping up, you could investigate that a lot more easily than the entire event chain. And the snark was a little unneeded.

Khisanth.2948 · August 26, 2018

Sounds great in theory, but "detecting a stuck event" would be non-trivial. Assuming there is already a way to query an instance for the state of all events, the watchdog process would also need to be aware of how all those states interact.

A much easier solution would be giving every instance a maximum lifetime.
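A rough Python sketch of the maximum-lifetime idea, under stated assumptions: the `MapInstance` record, the `accepting_players` flag, and the 24-hour limit are all hypothetical, since the thread gives no actual value or data model. An instance past its limit simply stops taking new players so it can drain and be recycled, with no need to understand event state at all.

```python
from dataclasses import dataclass

MAX_LIFETIME_SECONDS = 24 * 60 * 60  # assumed limit; the thread does not give one


@dataclass
class MapInstance:
    """Hypothetical map instance record; not an actual game structure."""
    map_name: str
    created_at: float  # creation time in seconds
    accepting_players: bool = True


def retire_old_instances(instances: list, now: float) -> list:
    """Close instances that exceeded the lifetime; return the names just retired."""
    retired = []
    for inst in instances:
        if inst.accepting_players and now - inst.created_at >= MAX_LIFETIME_SECONDS:
            inst.accepting_players = False  # let it drain, then recycle it
            retired.append(inst.map_name)
    return retired
```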
