
Glicko-2 algorithm put into code (Updated). Conclusion about win-streaks.

Tiah.3091 Member ✭✭✭
edited October 24, 2018 in PVP

UPDATED 24 Oct, 11:19 PM (GMT+3)

INTRO

A few weeks ago there was a thread where some dude **hypothesised** that "the MatchMaking algorithm forces you to lose games in streaks after you've had a win streak." Here it is.
The thread was met with healthy criticism, and one dude, @Megametzler.5729, even linked a pdf where the algorithm was described step by step.
I read it thoroughly (at least I think I did), and afterwards I had the feeling that the situation the OP described is kinda-sorta possible-ish. Except the MM doesn't force anyone, ofc. But more on that later.

So, as you can check yourself, the math that describes the algorithm is quite trivial. _(Well, the math that describes it is indeed trivial, but the math it took the author of the paper to prove it actually works is slightly more complicated. Sadly, the full process is not described in the paper.)_
Despite that, the formulas look quite clunky and not very fit for visual comprehension.
That's why I thought it would be fun to put them into code. So, yesterday I was really bored and gave it a try: link to Python 2.7 Jupyter Notebook (updated). The code itself is a little bit trashy, but it should be easy enough to read.

The main goal of the code is to simulate the game history of some player in a 1v1 scenario (although GW2 sPvP happens in the form of 5v5, in the context of our hypothesis it doesn't really matter).
In order to simulate something, you have to provide a model of some level of adequacy.
In the case of this code, there are supposed to be 2 models (the 2nd I'll add later)

MODEL-1

Description:

  • There's ONE very high TRUE SKILL level player. Let's say, 1900.
  • Although, he's initially Unranked and has to play 10 seeding games against various opponents of some skill level.
    And, to his misfortune, he does it while being STONED AS kitten, which in terms of math means his winrate against 800-1200 scrubs is precisely 50% (for those 10 games only)

  • Then he finishes with some result and feels like "kitten, man, that won't do, I must tryhard." And he starts doing exactly that, playing at his full potential.

  • Important - the MatchMaker Algorithm: the matchmaker assumes that all players in the game have their ratings distributed according to a Gauss distribution, with a mean of 1000 rating and a standard deviation of 266 rating (see image 1).
    (1200 was taken from here, as well as the other constants, and the standard deviation I took assuming that 1800 is the 3-sigma level, where only 0.3% of the elitist dudes play. 30 ppl above 1800 makes the playerbase something like 10000 - that makes sense, I guess.)
    It works like this: the matchmaker rolls a normally distributed number (with the mu and sigma above).
    If it does NOT fall within a range of +/- 10 rating around our dude's current rating, we increase this range by 10 (making it +/-20, then +/-30, and so on) and re-iterate the process.
    If it DOES, however, we take this number as our opponent's rating for the current game. The higher our dude's rating, the tougher it is for him to find decent opponents (see image 2)

  • Then we calculate the win probability against this opponent, according to the Glicko manual (see the formula for E in glicko-2.pdf, Step 3). Then we record a win or loss for this game, in accordance with that probability.

  • Then we calculate the updated rating, rating deviation and volatility from the game he just played, and update his game history with those values. Initially the history holds the 10 seeding games, and then it starts growing. When it reaches 100 games, we remove the first element of the history array, shift the whole array left by 1, and record the latest game as before (i.e., making space for new games and forgetting very old ones). See the code sketch after this list.
  • Then we reiterate the process until our guy has played 500 games.
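
To make the loop concrete, here's a minimal sketch of the pieces above in Python (the notebook's language). The constants mirror this post; the fixed opponent RD of 50 and the function names are my own simplifications for illustration, and the full Glicko-2 rating update (Steps 3-7 of the paper, which the notebook implements) is left out:

```python
import math
import random

SCALE = 173.7178   # Glicko-2 internal scale factor (glicko-2.pdf, Step 2)
BASE = 1500.0      # baseline rating used by the Glicko-2 paper

def g(phi):
    # g() from glicko-2.pdf, Step 3
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(r, opp_r, opp_rd):
    # E from glicko-2.pdf, Step 3: win probability against one opponent
    mu, mu_j = (r - BASE) / SCALE, (opp_r - BASE) / SCALE
    return 1.0 / (1.0 + math.exp(-g(opp_rd / SCALE) * (mu - mu_j)))

def find_opponent(r, mean=1000.0, sigma=266.0, step=10.0):
    # Expanding-window matchmaking as described above: roll a normally
    # distributed opponent rating, widening the window on every miss.
    window = step
    while True:
        candidate = random.gauss(mean, sigma)
        if abs(candidate - r) <= window:
            return candidate
        window += step

TRUE_SKILL = 1900.0
OPP_RD = 50.0       # assumed fixed opponent RD, just to keep the sketch short
MAX_HISTORY = 100

history = []        # (opponent rating, opponent RD, result 0/1)
for game in range(500):
    # TRUE_SKILL stands in for his current rating, since the update is omitted
    opp = find_opponent(TRUE_SKILL)
    win = 1 if random.random() < expected_score(TRUE_SKILL, opp, OPP_RD) else 0
    history.append((opp, OPP_RD, win))
    if len(history) > MAX_HISTORY:
        history.pop(0)  # shift left: forget the oldest game
    # ...the notebook then runs the full Glicko-2 update on `history`
    # to get the player's new rating, RD and volatility.
```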

UPDATE from 24 Oct:
1) removed RD decay over time, introduced a hard cap of RD=30,
2) updated the mean and standard deviation for the gaussian of skill,
3) took the system constant and other parameters from the wiki page

And, finally, we can see how his rating changes over the course of a season. See image 3.

(from top to bottom):

IMAGE-1: Gauss distribution of the TRUE SKILL levels of players on the GW2 leaderboard. Approximately, of course. 10000 players, mean is 1000, sigma is 266.
These numbers are derived from the assumption that the 3-sigma level is 1800 rating and there are 30 players above 1800.

IMAGE-2: Representative matchmaking samples for players with 1000 TRUE SKILL (Blue) and 1900 TRUE SKILL (Green). As you can see, a 1000 rating player will almost always be playing similar-level opponents, while a 1900 rating player will not only be playing against a _much wider_ range of opponents, but will also be forced to play against lower-skill players most of the time.

IMAGE-3: The game history of our 1900-rated player. Rating is displayed on the left scale (Red) and Rating Deviation on the right scale (Blue). Note how quickly the rating converges to the 1800-2000 range and stays there throughout the whole season. Even though win streaks occur, they won't push the rating too high.

MODEL-2

@Airdive.2613 said:
I've come up with an idea of an interesting (at least to me) experiment.
...
The data of interest:
1. I'm curious to see the distribution of the overall number of people depending on their rating (like a histogram of the number of players to divisions), as well as "bad" and "good" ones independently.
2. Using the same data as in point 1, calculate the sum of all players' ratings. What I mean is, how does the sum of all players' ratings after 100,000 games compare to the initial sum of their ratings? (The initial ratings' sum would be, for example, 1,500x5,000 = 7,500,000.)

This would require a slightly different model, and... It's coming soon ;)


OUTRO

Well, you've got the idea - graph 3 clearly indicates that **while win streaks (and loss streaks) MAY EXIST to a certain extent, they shouldn't take you much farther than +/-50 rating below or above your TRUE SKILL LEVEL.** Especially at the end of the season.

As always, take it with a grain of salt, because the model is STILL quite simplified and there are STILL a lot of uncertainties and unknowns.
Constructive criticism is welcome. Please check the code yourself if you're interested.


ALL THE MATERIALS, IN CASE SOMEONE MISSED SOMETHING:
1) Code (python 2.7 notebook)
2) Glicko-2.pdf
3) previous thread
4) Gauss Distribution


Comments

  • Megametzler.5729 Member ✭✭✭✭
    edited October 21, 2018

    Haha, great job man. :smiley: You actually did it! Thanks a lot for this!

    Also, absolutely correct side note about matchmaking <> Glicko. The matchmaker is developed by ANet and we have rather little information on it. But Glicko(2) is a solid, widely used and proven system.

    Still, I am kind of surprised about the huge deviations. Did you turn some screws, the system constant tau (τ) for example? Could you add two or three graphs for different values, since that is supposed to determine the volatility? But only if you get bored (again). :lol: I don't know Python, and I am abroad, so I don't have Matlab on my private computer...

    Anyway, again, thanks a lot! Really fun to see.

    €: I see, you used 0.8. Could you give it a shot with lower values? :smile:

  • Arlette.9684 Member ✭✭✭✭
    edited October 21, 2018

    Great Work! Can we get a "sticky" on this?


  • jportell.2197 Member ✭✭✭

    Wow, the major swings in the rating are pretty insane. I wonder if this is representative of other people's experience. However, the ability to duo queue definitely skews this. I play exclusively solo, and the highest I've gotten this season is 1657. When I hit that, I noticed teams almost always had two duo queuers. Kinda annoyed with that.

  • Arlette.9684 Member ✭✭✭✭

    @jportell.2197 said:
    Wow, the major swings in the rating are pretty insane. I wonder if this is representative of other people's experience. However, the ability to duo queue definitely skews this. I play exclusively solo, and the highest I've gotten this season is 1657. When I hit that, I noticed teams almost always had two duo queuers. Kinda annoyed with that.

    I seem to get better average results when I solo queue past low Plat than when I duo queue. That's over the course of about 300 games last season; I'm as yet unable to play competitively this season due to an ongoing hand injury.


  • Tiah.3091 Member ✭✭✭

    @Megametzler.5729 said:
    Still, I am kind of surprised about the huge deviations. Did you turn some screws, the system constant tau (τ) for example? Could you add two or three graphs for different values, since that is supposed to determine the volatility? But only if you get bored (again). :lol: I don't know Python, and I am abroad, so I don't have Matlab on my private computer...

    Anyway, again, thanks a lot! Really fun to see.

    €: I see, you used 0.8. Could you give it a shot with lower values? :smile:

    Yeah, well, I'd suggest you guys not take THE ACTUAL VALUES of the swings too seriously, because they are most likely just too huge.
    My matchmaking (actual matchmaking, lol) algorithm doesn't take into account that by the end of the season the majority of players have their rating deviations gradually decreased.
    My assumption about Rating Deviation is that when it goes below a certain threshold for an opponent, it instead rerolls randomly into the range 0-50.
    Which, again, is FALSE for the VAST MAJORITY of players.

    My point is that those swings shouldn't be THAT HUGE, as you guys correctly noticed.
    Instead, if the player's TRUE SKILL rating is 1500, the graph should look something like this:

    (Sorry for the Paint.)
    In other words - it should converge to 1500. I'm pretty sure one can achieve that by playing with the parameters a little bit. But with that many free parameters (including the most major one - the ACTUAL MM algorithm), it's like finding a needle in a haystack.

    What can be learned from it RIGHT NOW, however, is that you are most likely to converge to your TRUE SKILL rating at one point or another, but it might be quite a bumpy ride, with big loss streaks and equally big win streaks.

    I'll play with it tomorrow, hopefully. I mean, with variable RD for opponents. And I'll also try a run with a bigger TAU (kinda like a "viscosity" parameter, eh?).
    But you guys are more than welcome to give it a try as well!
    Cheers!

  • Arlette.9684 Member ✭✭✭✭

    The volatility will undoubtedly drop by a lot if you take more variables into account.


  • Airdive.2613 Member ✭✭✭

    So, the question. If the matchmaking is not provided by Glicko, what exactly does it do? Is its only output a number indicative of a player's rating? If so, is its only function to determine the amount of points gained/lost?

  • Tiah.3091 Member ✭✭✭
    edited October 22, 2018

    @Airdive.2613 said:
    So, the question. If the matchmaking is not provided by Glicko, what exactly does it do? Is its only output a number indicative of a player's rating? If so, is its only function to determine the amount of points gained/lost?

    Well, assuming there WAS decent matchmaking, Glicko can be described as follows: it takes the player's current rating, rating deviation and volatility, plus the ratings and RDs of the enemies from N of his previous matches, and then returns updated values of rating, rating deviation and volatility (for the player).
    3 numbers, to be precise. Not 1.
    But to put it simply - yes, it "doesn't do much". All the matchmaking magic happens thanks to some cryptic ANet algorithm.

    The matchmaking algorithm I used in my code is kitten simple: if the enemy fits into the range [current_player_rating - 50; current_player_rating + 50], then this is our guy. The exact value of the enemy's rating is taken randomly, ofc.
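
    A one-liner version of that early matchmaker, as I read the description (the name `simple_matchmaker` is mine):

    ```python
    import random

    def simple_matchmaker(rating, margin=50):
        # Pre-update matchmaker as described: any opponent within +/- margin
        # of the player's current rating, chosen uniformly at random.
        return random.uniform(rating - margin, rating + margin)
    ```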

  • Mbelch.9028 Member ✭✭✭

    This, this is neat.

  • mrauls.6519 Member ✭✭✭

    @Arlette.9684 said:

    @jportell.2197 said:
    Wow, the major swings in the rating are pretty insane. I wonder if this is representative of other people's experience. However, the ability to duo queue definitely skews this. I play exclusively solo, and the highest I've gotten this season is 1657. When I hit that, I noticed teams almost always had two duo queuers. Kinda annoyed with that.

    I seem to get better average results when I solo queue past low Plat than when I duo queue. That's over the course of about 300 games last season; I'm as yet unable to play competitively this season due to an ongoing hand injury.

    Solo queue tends to yield better results UNLESS you have a solid duo queue partner. Duo queue means you have a good chance of ending up with 3 people who don't know kitten they're doing - since duo queuing inflates your combined MMR


  • Arlette.9684 Member ✭✭✭✭

    @mrauls.6519 said:

    @Arlette.9684 said:

    @jportell.2197 said:
    Wow, the major swings in the rating are pretty insane. I wonder if this is representative of other people's experience. However, the ability to duo queue definitely skews this. I play exclusively solo, and the highest I've gotten this season is 1657. When I hit that, I noticed teams almost always had two duo queuers. Kinda annoyed with that.

    I seem to get better average results when I solo queue past low Plat than when I duo queue. That's over the course of about 300 games last season; I'm as yet unable to play competitively this season due to an ongoing hand injury.

    Solo queue tends to yield better results UNLESS you have a solid duo queue partner. Duo queue means you have a good chance of ending up with 3 people who don't know kitten they're doing - since duo queuing inflates your combined MMR

    Is this factual? If it is, it's rather counterintuitive.


  • Trevor Boyer.6524 Member ✭✭✭✭
    edited October 22, 2018

    I actually experience elongated win/lose streaks extremely frequently, to the point that most of my seasons are played as win or lose streaks. Why does this happen to me? Good question, but it makes me wonder if:

    • (A) There is something going on that Arenanet isn't aware of
    • or (B) There is something going on that they don't talk to us about

    Either way, in 6 years and almost 13,000 matches played, I've come to the conclusion that the algorithm notes and all simulations are somewhere incorrect and in no way reflect the deep applied effects of 3rd-party programs/smurfing/win trading/whatever the hell else is going on, and their effects on actual matchmaking.

    Don't believe me? I woke up earlier this morning to play some games and went on a 10 or 11 game win streak. And no, there is no win trading here. This is just random legit ranked solo queue. I mean, does someone have a plausible explanation for this happening so frequently to some players? I'd love to hear it.

  • Arlette.9684 Member ✭✭✭✭
    edited October 22, 2018

    @Trevor Boyer.6524 said:
    I actually experience elongated win/lose streaks extremely frequently, to the point that most of my seasons are played as win or lose streaks. Why does this happen to me? Good question, but it makes me wonder if:

    • (A) There is something going on that Arenanet isn't aware of
    • or (B) There is something going on that they don't talk to us about

    Either way, in 6 years and almost 13,000 matches played, I've come to the conclusion that the algorithm notes and all simulations are somewhere incorrect and in no way reflect the deep applied effects of 3rd-party programs/smurfing/win trading/whatever the hell else is going on, and their effects on actual matchmaking.

    Don't believe me? I woke up earlier this morning to play some games and went on a 10 or 11 game win streak. And no, there is no win trading here. This is just random legit ranked solo queue. I mean, does someone have a plausible explanation for this happening so frequently to some players? I'd love to hear it.

    Am I correct in my assumption that all of the under-10-minute games were blowouts? I have never really bothered to time PvP games, so I'm curious to get some input on the subject. Also, what was your baseline rating at the beginning of the streak vs where it ended?


  • Tiah.3091 Member ✭✭✭
    edited October 22, 2018

    @Trevor Boyer.6524 said:
    Don't believe me? I woke up earlier this morning to play some games and went on a 10 or 11 game win streak. And no, there is no win trading here. This is just random legit ranked solo queue. I mean, does someone have a plausible explanation for this happening so frequently to some players? I'd love to hear it.

    I do believe you, because I experience exactly the same - 10-win and 10-loss streaks with maybe 1-2 outliers. Well, the normal "2 wins, 1 loss, 1 win, 2 losses" stuff also happens, ofc. But to my feeling, these win and loss streaks happen WAY too often. MUCH more often than in other games with ratings.

    So, an attempt to find an explanation is precisely the reason for this thread.
    So far, what I can tell for sure: if you win, say, 7 games in a row, the matchmaker SHOULD put you against a "statistically tougher" opponent, because normally it should put you against an equal opponent. The absolute case of an equal opponent is your "mirror" - someone with exactly the same rating and RD as you.
    BUT!
    In the case of a win streak, a precise "mirror match" (when your supposed opponent has the EXACT same rating and RD as you) results in a WIN expectancy >50%.

    Though, for ultimate success I still need to make that graph converge.

  • Airdive.2613 Member ✭✭✭

    I've come up with an idea of an interesting (at least to me) experiment.
    Could you please provide the data on the following?

    • Let's assume there are 5,000 players in the system and this number does not change.
    • Let's assume every player's initial scores are exactly the same (with the rating of, for example, 1,500).
    • Let's assume there are three tiers of players (1,000 "bad", 3,000 "average", and 1,000 "good"; tiers do not change throughout the experiment) and the randomly formed team's chances of winning somewhat depend on each player's tier.
    • Let the matchmaking then randomly create a large number (say, 100,000) of games of 10 players (maybe in the form of "choose 5 players, then choose another 5 players, then do a random roll for victory (normally it isn't just 50%), then compare and update each player's scores, then use the updated scores for the rest of the experiment").

    The data of interest:
    1. After these 100,000 games our "players" will presumably spread across the ladder. I'm curious to see the distribution of the overall number of people depending on their rating (like a histogram of the number of players to divisions), as well as "bad" and "good" ones independently.
    2. Using the same data as in point 1, calculate the sum of all players' ratings. What I mean is, how does the sum of all players' ratings after 100,000 games compare to the initial sum of their ratings? (The initial ratings' sum would be, for example, 1,500x5,000 = 7,500,000.)
    3. Moving on to another stage, a "soft reset" occurs somehow at the end of the "season". How does it affect the sum of all players' ratings, if it does at all (or maybe it is just a volatility reset)?
    4. After running several seasons of the same experiment (with the soft reset occurring in between), I'd like to once again see the histogram of the number of players per division for the whole population, the "bad" subgroup, and the "good" subgroup, as well as the sum of all players' ratings.

    I know it's a lot to ask, but unfortunately I'm not familiar with programming and it seems like too daunting a task to do it using spreadsheets.

  • Tiah.3091 Member ✭✭✭
    edited October 22, 2018

    @Airdive.2613 said:
    I've come up with an idea of an interesting (at least to me) experiment.
    Could you please provide the data on the following?

    That is indeed an interesting experiment, and actually quite a common one when you do statistical analysis. Thank you for pointing it out! ;)

    It's actually a much easier task than you might think, when using that code.
    1)

    Let's assume every player's initial scores are exactly the same (with the rating of, for example, 1,500).

    This assumption is already in the code.
    2)

    Let's assume there are 5,000 players

    this assumption is done with 1 line, basically. Right now I only provide statistics for 1 player - his entire rating history throughout the season. To make it 5000, you just have to run 1 additional loop from 0 to 4999. And since we're only interested in the FINAL rating value for each player, the answer can be represented as a 1-dimensional array of values, something like this: [1485, 1257, ... 1763].
    3)

    Let's assume there are three tiers of players

    This is called a normal distribution.
    I was planning to introduce it at some point, though in a different place.
    Right now the MM algorithm simply takes random opponents from an interval of ratings, while a normal distribution would suggest a lower probability of taking an opponent with an extremely different rating. E.g., the interval is 1400-1600 (+/- 100 around 1500), and the chance of taking an opponent with a 1410 rating is 10%, while the chance of taking a 1490 guy is 90% (right now those chances are equal)
    4)

    the randomly formed team's chances of winning somewhat depend on each player's tier

    Right now the TRUE SKILL of our experiment-rabbit player is 1500, and the probability to win is determined linearly: 1500 vs 500 rating is a 100% win, 1500 vs 1500 is a 50% win, 1500 vs 2500 is a 0% win. (I was thinking of replacing this model with something more sophisticated, like the mentioned Elo - it gives the win probability as a function of the difference in players' ratings.)
    What you suggest is that I take those 5000 players and simply introduce a normal distribution to their TRUE SKILL levels. I.e., we had 1 guy with 1500; now we'll have 500 guys between 1400 and 1500, 10 guys between 0 and 200 and 50 guys between 1900 and 2000.
    This is easy =)
    5)

    maybe in the form of "choose 5 players, then choose another 5 players

    Ultimately, the Glicko algorithm only calculates the outcome of 1v1 matches. In GW2, as we know, sPvP matches are 5v5. So how does the algorithm transfer to 5v5, you ask? Well, my suggestion: it simply takes the average rating of the enemy team and treats that team as a single opponent for each player on the ally team (see the sketch right below).
    So, in the end it doesn't really matter - the results won't be much different between 1v1 and 5v5. So the suggestion to make it 5v5 is, I think, unnecessary.
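
    A minimal sketch of that 5v5 -> 1v1 reduction (my reading of the suggestion above, not confirmed as what ANet actually does):

    ```python
    def team_as_single_opponent(enemy_ratings):
        # Treat the enemy team as one opponent rated at the team average;
        # each ally then gets a plain 1v1 Glicko update against that "player".
        return sum(enemy_ratings) / len(enemy_ratings)

    # team_as_single_opponent([1500, 1500, 1500, 1400, 1400]) -> 1460.0
    ```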


    But all in all, this is a great suggestion, and I was planning to do it myself at one point or another (well, after I make the player rating converge to a certain value and decrease the volatility).
    Thanks, man!

  • Dreddo.9865 Member ✭✭✭
    edited October 22, 2018

    Want to add one more 'variable' into your theorycrafting regarding the MM. Just a while ago I got a match with 4 engineers, 2 of them scrappers (both on my team) while the other 2 holos were enemies. I think most people can understand what happened. I mean, wouldn't it be possible for the engineers to be equally distributed between teams? Why 2 scrappers - a profession that is weak, especially compared to holo - on one team?

  • Tiah.3091 Member ✭✭✭

    @Dreddo.9865 said:

    Want to add one more 'variable' into your theorycrafting regarding the MM. Just a while ago I got a match with 4 engineers, 2 of them scrappers (both on my team) while the other 2 holos were enemies. I think most people can understand what happened. I mean, wouldn't it be possible for the engineers to be equally distributed between teams? Why 2 scrappers - a profession that is weak, especially compared to holo - on one team?

    That's not exactly a "variable". That's not really anything. I understand what you're talking about, but how do you suggest I evaluate the "strength" of a spec? Just take a "wild guess"? Say there are 2 equally skilled players, one playing holo, the other playing scrapper, and the one playing holo has a better chance of winning. Better by how much exactly? 20%? 10%? 1%?
    That is basically a free parameter - whatever value I chose would affect the model significantly.
    The model has enough free parameters on its own; I'm not sure it would be a good idea to introduce EVEN MORE of them.

  • Airdive.2613 Member ✭✭✭

    @Tiah.3091 said:

    Ultimately, the Glicko algorithm only calculates the outcome of 1v1 matches. In GW2, as we know, sPvP matches are 5v5. So how does the algorithm transfer to 5v5, you ask? Well, my suggestion: it simply takes the average rating of the enemy team and treats that team as a single opponent for each player on the ally team.
    So, in the end it doesn't really matter - the results won't be much different between 1v1 and 5v5. So the suggestion to make it 5v5 is, I think, unnecessary.

    Oh, that's cool!
    To clarify: what I wanted to look at (in my latter points) is whether it might be possible that the total sum of the players' ratings changes with time (if games aren't zero-sum), thus causing the leaderboard to become "biased" after several seasons - upping or lowering the mean/median player rating within the same (0, 2100) borders, or revealing some sort of pattern.
    I agree the normal distribution is clearly a better choice in terms of modelling the playerbase, but it must get increasingly harder to calculate the probability of winning. I mean, it shows well "how many people" are better than you, but I don't know for sure just "how much better" an 1800 player is than a 1500 one. Some math needs to be done. :0

  • I always knew they must use some algorithm. Good job!

  • Tiah.3091 Member ✭✭✭

    @Airdive.2613 said:
    To clarify: what I wanted to look at (in my latter points) is whether it might be possible that the total sum of the players' ratings changes with time (if games aren't zero-sum), thus causing the leaderboard to become "biased" after several seasons - upping or lowering the mean/median player rating within the same (0, 2100) borders, or revealing some sort of pattern.

    Some of your questions I can answer without any modelling.
    - Even in the case of a zero-sum game, the rating sum for a constant number of people doesn't remain constant; it inflates very slightly, because ratings < 0 are not allowed. I.e., when you win against a 0-rating player and receive 10 points, he doesn't lose 10 - he remains at 0. Which means you got those 10 points from nowhere; hence the inflation.
    - However, the leaderboard doesn't "become biased after several seasons", because it resets every season. So the player rating sums are equal at the beginning of every season.

    I know what you mean: a few years ago there were people with 2k ratings (well, I've only been playing for 2 months, so that's just my assumption).
    Now the top 10 people are all <1900. HOW COME?
    I'll tell you how: it has nothing to do with "biasing". It's simply that the number of playing people has dropped significantly, therefore the absolute sum of player ratings became lower (simply because it was ~1500×N, and now it's ~1500×0.5N). It's like the system ran out of fuel - those skilled players simply can't take rating away from others, because there's nothing left to take.

    However, an important note: unlike Elo, Glicko is not a zero-sum system (at least I think it's not). How this non-zero sum affects the rating sum - that's a very interesting question.
    Although I really doubt the effect will be much, because even if the sum change is not ZERO for a SINGLE player, averaged over N players it should be more or less ZERO.
    Where N is the system constant - in the case of my code (and in the case of GW2 sPvP) it's 10. Remember how they ask you to play 10 games to determine your rating - that's the constant I'm talking about.
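
    The floor-inflation mechanism is easy to see in a toy exchange (illustrative numbers only, not the actual point amounts):

    ```python
    def exchange(winner, loser, delta=10, floor=0):
        # Zero-sum except at the floor: the winner always gains delta,
        # but the loser can't drop below `floor`, so rating can inflate.
        new_loser = max(floor, loser - delta)
        created = delta - (loser - new_loser)  # points made "from nowhere"
        return winner + delta, new_loser, created

    # exchange(1500, 0) -> (1510, 0, 10): ten points enter the system.
    ```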

  • Exedore.6320 Member ✭✭✭

    So basically this only shows what we've known since the Glicko / Glicko2 algorithms were released:

    • Initially a player's rating has high volatility, which decreases over time. This is exactly how the algorithm is supposed to work: it doesn't know a lot about the player initially, so the rating adjustments are larger.
    • You can never view a player's rating as a single number; it should be viewed as their true rating falling within their Glicko-2 rating ± their Glicko-2 deviation. Given that GW2 rating shifts by 20-30 points per game once volatility stabilizes, I would expect the deviation (once stable) to be in the 50-100 range. A deviation of 50 means that a 1200-rated player could be rated between 1100 and 1300 with 95% certainty. This deviation is why tier groupings are a better choice for games than raw rating.
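
    (In code, that reading of rating ± deviation is just a two-sigma interval - a sketch, nothing from the notebook:)

    ```python
    def rating_interval(rating, rd, k=2):
        # ~95% of a normal distribution lies within k=2 deviations of the mean
        return rating - k * rd, rating + k * rd

    # rating_interval(1200, 50) -> (1100, 1300)
    ```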

    The main flaw in the original work is how win/loss is determined. It's definitely not linear. That flaw leads to more volatility in the rating than there should be. I would suggest using the Elo win-probability calculation to determine the outcome, using the player's true skill and the opponent's actual rating (assuming all opponents are rated correctly). Going further, you could do additional tweaks:

    • Toss in a small, random fudge factor on top of that probability to account for people DC'ing or manipulating.
    • Simulate a pool of players at once rather than one at a time. Your current method knows your test player's true rating, but it assumes all other teammates and opponents are correctly rated - this is never the case.
  • Tiah.3091 Member ✭✭✭

    @Exedore.6320 said:
    The main flaw in the original work is how win/loss is determined. It's definitely not linear. That flaw leads to more volatility in the rating than there should be. I would suggest using the Elo win-probability calculation to determine the outcome, using the player's true skill and the opponent's actual rating (assuming all opponents are rated correctly).

    Yeah, I was thinking the same:

    @Tiah.3091 said:
    Right now the TRUE SKILL of our experiment-rabbit player is 1500, and the probability to win is determined linearly. I was thinking of replacing this model with something more sophisticated, like the mentioned Elo - it gives the win probability as a function of the difference in players' ratings.

    Elo should indeed give a much better approximation than the linear model.
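
    For concreteness, the standard Elo expected score that could replace the linear model (the 400-point logistic scale is Elo's usual convention, not something from the notebook):

    ```python
    def elo_win_probability(rating_a, rating_b):
        # Standard Elo expected score: a 400-point gap means 10:1 odds.
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

    # elo_win_probability(1900, 1500) -> ~0.91
    ```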

    Going further, you could do additional tweaks:

    • Toss in a small, random fudge factor on top of that probability to account for people DC'ing or manipulating.
    • Simulate a pool of players at once rather than one at a time. Your current method knows your test player's true rating, but it assumes all other teammates and opponents are correctly rated - this is never the case.

    Those, on the other hand, are interesting suggestions. I'll give them a try, thanks!

  • Megametzler.5729 Member ✭✭✭✭

    @Tiah.3091 said:
    (...)



    (...)

    On a side note: the rating deviation (RD) usually has a lower limit to account for "skill changes" - in the GW environment of course also balance patches, class changes and stuff. After the first 15-20 games we all seem to hit that limit - that is when the placement-match deviations have become low. Ever wondered why the tenth match gives ±30 rating, but the 20th game at the exact same rating only gives ±15 (and stays like that)? That is the RD, or its lower limit respectively. You can still have statistical streaks, but the impact is lower.

    On the topic of the matchmaker: we have rather little information here indeed. https://wiki.guildwars2.com/wiki/PvP_Matchmaking_Algorithm gives some hints that it tests loops to look for the "best roster" of teams, but we do not know many details about the criteria or, for example, how duos are implemented. Ben once showed an example here on the forums*, where it seemed not to be accounted for at all. That would indeed be a major flaw, but not one connected to win/loss streaks.

    So if your rating inflates by duoQing, you might indeed experience a loss streak to get you back to your solo rating. :wink:

    Final note: Glicko-2 is solid, though I would still like to see some changes. Matchmaking itself, however, could be an issue. DuoQs are just, in my personal opinion, the worst of it. It does not, however, seem to look at your previous match outcomes - unless they keep lying to us. :smile:

    *Here: https://en-forum.guildwars2.com/discussion/54656/match-ranking/p2

    @Ben Phongluangtham.1065 said:

    @Reikou.7068 said:

    @Ben Phongluangtham.1065 said:

    @Axelteas.7192 said:
    This season I'm finding pro teams in silver, that's unacceptable... matches losing 15-500

    Can you tell me the date/time and exact score of the match where you as a silver player had pro teams? I'd like to look at that match.

    Please help me look at a match that ended a minute ago.

    Maybe about 10/7/2018 - 12:28AM Server time.

    Didn't grab screenshots of it, but this is the only match on my account for the day. Horrible match, a completely one-sided blowout. It was 2x duos vs a full team of solo queuers, and I'm pretty sure half of my team were bots, and one guy said he was silver, just in the game for pips.

    I think I found the match; at least this one had 2 duo queues on the opposite team. No silver players. The average skill rating difference between the teams is 5 points.

    Blue team (Defeated):
    Ranger - 1393
    Necromancer - 1432
    Necromancer - 1440
    Thief - 1515
    Guardian - 1521
    Average Skill Rating - 1460.2
    Std. Deviation - 55.72

    Red team (Winner):
    Guardian - 1359
    Thief - 1391
    Necromancer - 1475
    Necromancer - 1514
    Mesmer - 1587
    Average Skill Rating - 1465.2
    Std. Deviation - 92.33

  • Airdive.2613 Member ✭✭✭
    edited October 22, 2018

    @Tiah.3091 said:

    @Exedore.6320 said:
    Going further, you could do additional tweaks:

    • Toss in a small, random fudge factor on top of that probability to account for people DC'ing or manipulating.
    • Simulate a pool of players at once rather than one at a time. Your current method knows your test player's true rating, but it assumes all other teammates and opponents are correctly rated - this is never the case.

    Those, on the other hand, are interesting suggestions. I'll give them a try, thanks!

    As for the first point - I believe it's reasonable to assume the "different skill tiers" (whatever you call them) already include stuff like disconnections; these things are just more likely to happen to "worse" players.
    The second one, in my opinion, is what should be done, and it's what I suggested with a fixed pool of 5,000 players. ^^

  • All I read from this is somewhere on the enemy team, there is a soul mate for us all. Let's use this to kill the toxicity ❤️❤️❤️

  • Tiah.3091 Member ✭✭✭
    edited October 23, 2018

    Okay, guys. A HUGE UPDATE here (check the OP).
    1. Ratings now resemble the REAL GW2 sPvP leaderboard ratings as closely as possible, according to a Gauss distribution.
    2. Updated the matchmaking algorithm with that good ole' Gauss.
    3. Rating Deviation now converges as the season progresses.
    4. The player's game history interval now increases as he plays more games, up to 100 games (it was 10, constant).

    As a result, the volatility VASTLY decreased and win streaks/loss streaks are almost gone. They are still here, ofc, but not as huge as they were before.
    I would advise you to re-read the first post from the "MODEL-1" paragraph on, and run the code if you want a more thorough look at the thought process.
    Cheers!


    Well, NOW we can ask for a sticky, I guess (drops mic) B)

  • Exedore.6320 Member ✭✭✭

    I like that you put in the effort. But the people who complain about the rating system are still going to ignore any mathematical proof.

    A few interesting things to try:
    1. Use a slightly skewed Gaussian distribution for player rating (not centered at the mean rating). This draws from an experiment I saw done in Overwatch. A player would play one account on weekdays and another account on weekends. The weekend account ended up over 500 rating lower (OW goes 0 to 5000) than the weekday account.
    2. Try to better reflect the matchmaker's behavior for fringe and low population. After ~5min, the matchmaker expands the rating margin for matching a player. This is particularly evident outside of prime play time.

    On your question about duo queue, it uses the average of the players in a party as the roster's rating.

  • Tiah.3091 Member ✭✭✭
    edited October 23, 2018

    @Exedore.6320 said:
    1. Use a slightly skewed Gaussian distribution for player rating (not centered at the mean rating). This draws from an experiment I saw done in Overwatch. A player would play one account on weekdays and another account on weekends. The weekend account ended up over 500 rating lower (OW goes 0 to 5000) than the weekday account.

    Could you please provide a link to that thread? Because I don't get the idea of this experiment.
    If it's the same guy playing both accounts, his TRUE SKILL level is still the same. It's not like he's playing worse on one account than on the other, right?
    I use the normal distribution explicitly for the TRUE SKILL level - the rating which you IDEALLY should have.
    I don't get why it should be skewed. I mean, most things in nature are distributed according to Gauss with very good precision, starting from something as fundamental as the Cosmic Microwave Background and ending with something as simple as boob size.

    To me it seems like he got a 500-lower rating simply because he didn't play enough games on the second account.

    1. Try to better reflect the matchmaker's behavior for fringe and low population. After ~5min, the matchmaker expands the rating margin for matching a player.

    I already did that, and I tried my best to explain how in the updated OP. Yes, I expand the rating margin, exactly as you suggest.

    On your question about duo queue, it uses the average of the players in a party as the roster's rating.

    That wasn't MY question.
    I'm not interested in duo queue results, at least for now. Probably that's a topic for MODEL-3 ;)

  • Exedore.6320 Member ✭✭✭

    @Tiah.3091 said:

    @Exedore.6320 said:
    1. Use a slightly skewed Gaussian distribution for player rating (not centered at the mean rating). This draws from an experiment I saw done in Overwatch. A player would play one account on weekdays and another account on weekends. The weekend account ended up over 500 rating lower (OW goes 0 to 5000) than the weekday account.

    Could you please provide a link to that thread? Because I don't get the idea of this experiment.
    If it's the same guy playing both accounts, his TRUE SKILL level is still the same. It's not like he's playing worse on one account than on the other, right?
    I use the normal distribution explicitly for the TRUE SKILL level - the rating which you IDEALLY should have.
    I don't get why it should be skewed. I mean, most things in nature are distributed according to Gauss with very good precision, starting from something as fundamental as the Cosmic Microwave Background and ending with something as simple as boob size.

    Too lazy to look for the post, but he did play enough to stabilize his rating. The idea is that different groups of people play on weekdays vs. weekends. In particular, weekends may have more casual (slightly less skilled) players, possibly a higher number of middle school and high school kids, etc. Since rating is a representation of an individual's skill against the population, changing the overall skill level of the population will change an individual's rating.

  • Faux Play.6104 Member ✭✭✭

    A couple of points.

    Every season I've played, I've had stretches with 10+ wins in a row, and stretches where wins are hard to come by.

    The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

    Glicko will give accurate results for mismatched opponents. Because of this...

    You are better off matching players of similar skill on the same team. The current method boosts the bad players and punishes the good, which tends to drive people towards the same rating rather than separating them.

    When I ran the numbers with the constants ANet uses, the deviation wouldn't go below 60. That means the system is 95% confident you are within 2 deviations of your current rating, +/- 120.

  • Tiah.3091 Member ✭✭✭
    edited October 24, 2018

    @Faux Play.6104 said:
    The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

    Yeah, that's a good point, man.
    How it affects the actual rating approximation for every player - negatively or otherwise - is a subject for research. I was thinking about it myself last night before sleep, because I remembered the post in this thread where a dude cited an ANet dev post confirming that they use exactly that: they balance one team's average rating against the other team's average rating.

    When I ran the numbers with the constants ANet uses, the deviation wouldn't go below 60. That means the system is 95% confident you are within 2 deviations of your current rating, +/- 120.

    Did you run the code from the original post?
    Because I just remembered that I forgot to upload the new version after the update.
    Also, the code uses slightly different constants from those ANet uses. I'll fix it and upload the new version. (However, that point about team vs team - I'm not yet sure what to do with it.)

  • Faux Play.6104 Member ✭✭✭

    @Tiah.3091 said:

    @Faux Play.6104 said:
    The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

    Yeah, that's a good point, man.
    How it affects the actual rating approximation for every player - negatively or otherwise - is a subject for research. I was thinking about it myself last night before sleep, because I remembered the post in this thread where a dude cited an ANet dev post confirming that they use exactly that: they balance one team's average rating against the other team's average rating.

    When I ran the numbers with the constants ANet uses, the deviation wouldn't go below 60. That means the system is 95% confident you are within 2 deviations of your current rating, +/- 120.

    Did you run the code from the original post?
    Because I just remembered that I forgot to upload the new version after the update.
    Also, the code uses slightly different constants from those ANet uses. I'll fix it and upload the new version. (However, that point about team vs team - I'm not yet sure what to do with it.)

    I made my own based off the Guild Wars wiki and the Glicko paper. Most of the posts I made on the subject were on the old forum. I'll have to dig them up when I get home.

    For teams, I'd just do a sum of squares for the deviations and assume the player is at the midpoint.
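
    My reading of that suggestion in code (an assumption on my part - Faux Play may have meant something slightly different):

    ```python
    import math

    def team_opponent(ratings, rds):
        # Average rating as the "midpoint"; combined deviation from a
        # root-sum-square of the individual RDs (the RD of the average).
        n = len(ratings)
        return sum(ratings) / n, math.sqrt(sum(rd * rd for rd in rds)) / n
    ```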

    I thought the wiki said 30 was the low cap, but I could not reach it when I ran multiple matches. Starting at 0, it would slowly grow until it reached 60. Same if you started at 700 - it would shrink to 60. Regardless, I think your deviation numbers are too low.

  • Trevor Boyer.6524 Member ✭✭✭✭

    @Tiah.3091 said:

    @Faux Play.6104 said:
    The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

    Yeah, that's a good point, man.
    How it affects the actual rating approximation for every player - negatively or otherwise - is a subject for research. I was thinking about it myself last night before sleep, because I remembered the post in this thread where a dude cited an ANet dev post confirming that they use exactly that: they balance one team's average rating against the other team's average rating.

    Before you read! Know that this response is, admittedly, mostly conjecture and conspiracy theory.

    I actually did this once with my best simulation. Say you take 100 players ranging from 1800 down to 800. I found that realistically, over the course of a season, those rating margins will implode on themselves, not expand. Meaning, the season starts with ratings ranging from 1800 to 800, but at the end of the season, with how rating is affected by wins & losses, it ends up looking more like 1400 to 1200. At least this is what seems to happen on paper when the population is very low and the matchmaker is putting together matches where our teammates and opponents can be several hundred rating higher or lower than each other. It creates a situation where high-rated players are punished more than intended for losses and low-rated players receive too much rating reward for wins where they are being carried. When I saw this result, it made me question why our leaderboards weren't doing this in the higher rating margins. What was keeping the 1800+ margins of the board expanding and not imploding? The lower rating margins also seem to be stunted from inevitable implosion towards a median. Is it win trading creating unrealistically high margins? Is it... something else in the algorithm that isn't mentioned in the notes?

    That is when I REALLY sat back and started thinking about those win/lose streaks that everyone seems to talk about, that happen so frequently. Season after season I began paying close attention to many different players' "rising & falling" rhythms in the leaderboards. I began noticing something odd indeed. The same players would always go on win or lose streaks at the same time. What I mean is: players (A)(B)(C)(D) always seem to hit some win rhythm at the same time, whilst players (E)(F)(G)(H) are all hitting a lose rhythm at the same time A B C D are on a win rhythm. And it would usually go on for 2 or 3 days, until the rhythms apparently swap? Then I'll see A B C D all take on a sudden lose rhythm whilst E F G H all go on a win rhythm? I mean, these are consistent patterns I have been watching for many, many seasons now. I began to wonder if the algorithm had a secret function that was enforcing win & lose streak rhythms. From what I had seen, it would seem to be done in some kind of Control Group (A) and Control Group (B) kind of thing. When A is on a good rhythm for matchmaking, B is on a bad rhythm. When B is on a good rhythm, A is on a bad rhythm. It certainly would explain the "according to algorithm notes" highly improbable but somehow super frequent win & lose streaks. It would also explain why the rating margins somehow magically expand high and low instead of imploding - a system that makes us take turns being ping-ponged around the rating margins. Why would they program something like that in? I dunno - to make sure Glicko margins work for a 5v5 game mode that makes you queue with random people, to avoid implosion? Why would they not just tell us about it? Well, it certainly wouldn't be a strong selling point for the game mode when a player read the Glicko notes and realized the algorithm was sniping them during ranked queues.

    I mean, does no one else find it odd that rating never settles, as if your skill never settles? I have nearly 13,000 matches played and I'm sure I've peaked in my skill at Guild Wars 2 at this point! So why is my rating so ridiculously volatile every season, all season? I would expect to bounce around between 1600ish and 1500ish, but to bounce around between 1650 and like 1350 four or five times a season, due to win & lose streaks that come and go like scheduled clockwork? ^^ It makes one wonder, but again, this is all just conjecture and a lot of conspiracy theory.

    There is one other thing I wanted to note about running Glicko simulations on paper. I've been pointing this out for years now and I'm going to say it again. The math would all seem to be perfectly accurate on paper, yes. But there are factors going on here that cannot be equated with numbers. Amongst these factors are things like "Are your teammates on their mains or are they alting for PvP wing achievements?", "Are some of your opponents smurfing on low-rated f2p accounts while playing at a plat 2 level?", "Did you land a bad team comp while the enemy has a meta comp?", "Is anyone using 3rd-party programs and/or win trading?", "Does someone randomly AFK to answer their door and pay for a pizza?", etc., etc. But by far the BIGGEST factor that stunts the accuracy of simulations is that they in no way consider how Conquest is actually played. This will be easier to explain in a list:

    • A 1700 guy queues and the matchmaker makes him a match. He gets put into a team looking like RED: 1700 1400 1400 1400 1400
    • He gets put against BLUE: 1500 1500 1500 1400 1400
    • This looks perfectly balanced on paper, but is this actually balanced for a Guild Wars 2 Conquest match? In short, no.
    • What usually happens in these situations is that the 1700 will take and defend any node he is at the entire game, but his 1400 teammates on the other two nodes, where he won't be, are going to get crunched the entire game by the 1500s. The BLUE team will likely hold 2 nodes much more often throughout the game and win the match as a result.
    • In other words, a team with one high-rated player and several lows is at a disadvantage vs. a team of players who have a tighter spread within the Glicko algorithm, despite the math looking like "perfect matchmaking". This is because of how Conquest is actually played. The smaller our population gets, the more frequently this particular problem occurs.

    But yeah, something to think about.

  • coro.3176 Member ✭✭✭✭

    I've always wondered how well the rank distribution accounts for certain factors that come up in GW2-style matches based on playstyle and class/build choice.

    • Carry factor: e.g., some players will dominate the match if left unchecked. Sometimes they are shut down completely; other times they run wild and win the game for their team. E.g., a glass-cannon DPS that can single-handedly wipe the other team.
    • Acceptable-Teammate factor: other players are decent and will perform similarly in a game at any rating. E.g., a support player who tries to be in the middle of a teamfight healing and buffing, but relies on other players to actually clear the point and get kills.

    Would we expect both types of players to have the same win/loss streaks in their results?

    I'm completely speculating here, but I would expect Carry to be less streaky as the season goes on. They should settle right around the point where players can deal with them. If they lose too much, they start to dominate games until they are back at the point where players can counter them again. If they win too much, they get countered every game.

    Acceptable-Teammate, though... I think they could rise or fall to potentially any rating, as they are more dependent on the team they get. Obviously there's a fair amount of shuffling of teams, but I think Acceptable-Teammate may be more prone to streaks of luck with the matchmaker.

  • Tiah.3091 Member ✭✭✭
    edited October 24, 2018

    @Trevor Boyer.6524 said:
    but again, this is all just conjecture and a lot of conspiracy theory.

    Pretty much what you said, yeah, lol xD
    (Also, it's quite a tough hypothesis to falsify, because it's hard to account for such a factor.)

    • A 1700 guy queues and the matchmaker makes him a match. He gets put into a team looking like RED: 1700 1400 1400 1400 1400
    • He gets put against BLUE: 1500 1500 1500 1400 1400
    • This looks perfectly balanced on paper, but is this actually balanced for a Guild Wars 2 Conquest match? In short, no.

    Yeah, that was my primary concern when thinking about how the matchmaker should behave in such situations - especially how such a skill distribution between teams would affect the winrate. I.e., will the 1700 dude carry the game, or will those three 1500 guys do it?
    I don't know how to approach it yet :/


    @Faux Play.6104 said:
    I thought the wiki said 30 was the low cap, but I could not reach it when I ran multiple matches. Starting at 0, it would slowly grow until it reached 60. Same if you started at 700 - it would shrink to 60. Regardless, I think your deviation numbers are too low.

    Well, my deviation numbers are indeed low, **because I took active steps to reduce them.**
    I introduced decaying RD for the whole playerbase, assuming that towards the end of the season people settle at their true rating.
    Now I've updated the code - shifted the mean of the gaussian to 1200 and reduced the standard deviation to 200 (so the 3-sigma level is still 1800). I also removed the decaying RD and introduced a hard cap of 30 - all according to the wiki page (I hadn't read it before, lol)

    The second step I took, which plays by far the more important role:
    Glicko-2 takes 6 parameters to calculate a player's new rating, RD and volatility. Those are: current rating, current RD and volatility - all 3 scalars - plus 3 arrays which provide info about his opponents from the N previous games: opponents' ratings, RDs and match results (either 0 or 1). Or, if you like, a 2-dimensional array of shape (N, 3).
    As we all know, when we are Unranked, the game asks us to play 10 matches for "seeding".
    Now, I initially took those 10 games as N and kept it constant throughout the whole ordeal. The results looked something like this:

    As you can see: huge volatility, RD never drops below 40, and, obviously, HUUUUUUUUUUUUGE WIN STREAKS AND LOSE STREAKS.

    Then I assumed, like, man, the devs can't be that shallow. They definitely have this N increasing as the season progresses. I.e., the "game history array" should grow with time. It definitely should have more than 10 games recorded.
    That was my assumption.
    So, I introduced the "growing array" - after every new match our player played, the algorithm "remembered" all his previous games, up until it reached 100 games. I had to stop at 100, because otherwise my laptop was basically saying "there's no way I'm doing this in the next millennium".
    After it reached 100 games, the first game (historically) was removed from the array, the 2nd game became the 1st, the 3rd became the 2nd, and so on, freeing space for the latest game.
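
    (Incidentally, Python's `collections.deque` does that shift-left bookkeeping for free:)

    ```python
    from collections import deque

    # A deque with maxlen drops the oldest entry automatically on append -
    # the same behaviour as the manual array shifting described above.
    history = deque(maxlen=100)
    for game_number in range(500):
        history.append(game_number)  # stand-in for (opp_rating, opp_rd, result)
    assert len(history) == 100 and history[0] == 400
    ```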

    And this is what I got for doing that (the same picture is in the OP post):

    Now, if I hadn't introduced the RD cap of 30, it would have dropped to ~0 quite soon. Volatility is almost non-existent, and the rating stabilises at ~1900 (which is the TRUE SKILL level of our test dude).

    The wiki doesn't have that info, and you can see yourself how significantly it affects the results. Therefore I'm asking @Ben Phongluangtham.1065: can you tell us what exactly this constant is? Is it 10, or does it gradually increase to a certain level (like in my simulation, where it grew to 100)?
    The question is super important.

  • Megametzler.5729 Member ✭✭✭✭
    edited October 24, 2018

    @Tiah.3091 said:
    (...)
    So, I introduced the "growing array" - after every new match, that our player played, the algorithm "remembered" all his previous games. (...)

    As far as I know, this is how Glicko-2 works in general. The example paper on it shows it like this - an ever-decreasing RD computed out of all previous games (which is fewer games there, though, since no one plays 100 matches of chess per year. Maybe not representative.^^)

    Also, I think a lower limit on RD is kind of okay - but I think it is too high. I once asked for it to be reduced: that would mean slightly less responsiveness to skill changes throughout the season, balance patches and stuff, but way less volatility in late matches, reducing the punishment for playing many games per season and maybe decreasing the toxicity of later games. Maybe we can get a hint here too, and maybe it could be reduced?

  • Exedore.6320Exedore.6320 Member ✭✭✭

    @Tiah.3091 said:
    As we all know, when we are Unranked, the game asks us to play 10 matches for "seeding".

    The 10 seeding games are there to mask the fact that your rating is potentially shifting by hundreds of points in the first few games. The less of that players see, the less they freak out about it before thinking.


    @Trevor Boyer.6524 said:
    I mean, does no one else find it odd that rating never settles as if your skill never settles? I have nearly 13,000 matches played and I'm sure I've peaked in my skill at Guild Wars 2 at this point! So why is my rating so ridiculously volatile every season all season? I would expect to bounce around between 1600ish and 1500ish, but to bounce around between 1650 and like 1350 four or five times a season, due to win & lose streaks that come and go like scheduled clockwork? ^^ It makes one wonder, but again, this is all just conjecture and a lot of conspiracy theory.

    With the numbers used in GW2, I would expect a fluctuation of 150-200 points to be normal (two standard deviations in each direction). Further variation can be explained by changes in the player from day to day. Maybe you're tired, playing a different build, getting frustrated with a few losses and letting it cloud your judgment, etc.
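
    For what it's worth, the arithmetic behind that estimate (assuming a settled deviation somewhere around 40-50, which is my reading of "two standard deviations in each direction"):

    ```python
    # If a player's displayed rating is r with deviation RD, Glicko treats the
    # "true" rating as roughly r +/- 2*RD (a ~95% interval). With a settled
    # RD of 40-50, that interval spans 160-200 points:
    for RD in (40.0, 50.0):
        print("RD = %2.0f -> interval width = %3.0f points" % (RD, 4 * RD))
    # RD = 40 -> interval width = 160 points
    # RD = 50 -> interval width = 200 points
    ```
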

    @Trevor Boyer.6524 said:
    But by far the BIGGEST factor that stunts the accuracy of simulations is that they in no way consider how Conquest is actually played. This will be easier to explain in a list:

    • 1700 guy queues and the matchmaker makes him a match. He gets put into a team looking like, RED: 1700 1400 1400 1400 1400
    • He gets put against BLUE: 1500 1500 1500 1400 1400
    • This looks "perfect" on paper, but is this actually balanced for a Guild Wars 2 Conquest match? In short, no.
    • What usually happens in these situations is that the 1700 will take and defend whatever node he is at for the entire game, but his 1400 teammates on the other two nodes, where he won't be, are going to get crunched the entire game by the 1500s. The BLUE team will likely hold 2 nodes much more often throughout the game and win the match as such.

    I would actually posit the opposite outcome. The 1700 player, if playing a solo/assassin build (holosmith, mesmer, and thief are all good at this) would single out opponents and easily defeat them. This causes the opposing team to stagger, which makes it all that much easier for him to control at a choke point shared between multiple nodes. The remaining 1400 players can zerg around and overwhelm their opponents.

    If you're looking for a theory with some weight behind it, try this:
    HoT and PoF have introduced many builds with a low skill threshold. If you can hit buttons quickly, you can do decently - mechanical skill has dramatically decreased as a discriminator. Further, fight/run decisions and map strategy (rotations) are considerably more advanced skills. This causes a large number of players to sit at ratings just below those players who do have the fight/run and map-strategy skill set. The rating system tries to smooth them into a normal distribution, but because there are a lot of players with little skill difference between them, there is significant volatility. If you're above that threshold and have a bad day, you're stuck with that fickle group, and luck in matchmaking can pull you down. This is especially true if you play a role which needs teamplay to succeed.

  • Tiah.3091Tiah.3091 Member ✭✭✭

    @Exedore.6320 said:

    @Trevor Boyer.6524 said:

    • This looks "perfect" on paper, but is this actually balanced for a Guild Wars 2 Conquest match? In short, no.
    • What usually happens in these situations is that the 1700 will take and defend whatever node he is at for the entire game, but his 1400 teammates on the other two nodes, where he won't be, are going to get crunched the entire game by the 1500s. The BLUE team will likely hold 2 nodes much more often throughout the game and win the match as such.

    I would actually posit the opposite outcome. The 1700 player, if playing a solo/assassin build (holosmith, mesmer, and thief are all good at this) would single out opponents and easily defeat them. This causes the opposing team to stagger, which makes it all that much easier for him to control at a choke point shared between multiple nodes. The remaining 1400 players can zerg around and overwhelm their opponents.

    And whoever thought it would be a nice thing to do if I accounted for 5v5 fights instead of 1v1 -
    do you see that kitten?
    How the heck am I supposed to evaluate the win probability in chaos like this? :#
    (And the win probability is THE MOST crucial part of the algorithm, because otherwise how would it converge to your "true skill level"?)
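
    For reference, that win probability is the E function from glicko-2.pdf, Step 3. A direct transcription (the paper centers ratings at 1500; the example matchup at the end is mine):

    ```python
    import math

    SCALE = 173.7178  # glicko-2.pdf Step 1: converts rating points to the mu/phi scale

    def g(phi):
        """Shrinks the impact of a game against an uncertain opponent (Step 3)."""
        return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

    def E(mu, mu_j, phi_j):
        """Expected score (win probability) against opponent j (Step 3)."""
        return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

    # Example: a 1900 player vs a 1500 opponent whose RD is 60:
    mu, mu_j = (1900.0 - 1500.0) / SCALE, (1500.0 - 1500.0) / SCALE
    phi_j = 60.0 / SCALE
    print(round(E(mu, mu_j, phi_j), 3))  # 0.906 - heavy favourite
    ```
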

  • Reikou.7068Reikou.7068 Member ✭✭
    edited October 25, 2018

    @Ben Phongluangtham.1065 said:

    @Faux Play.6104 said:
    The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

    This is not accurate. When a match is being built around a player, the matchmaker first looks for 9 other people within 25 points of rating. If it doesn't find enough people after 5 minutes (in ranked), it starts expanding the range over time until it finds enough players. Note: this doesn't mean that if you've been in queue for less than 5 minutes, everyone in the match is going to be within 25 points. If you're in a match with people more than 25 rating points away from you, it just means whichever player the matchmaker built the match around was probably in queue for 5 minutes or more.

    After those 10 people are found is when it arranges the teams, to ensure that each side is close in average skill rating and in standard deviation from that skill rating.

    Additional note: We've experimented with making it so that everyone in a match had to have been waiting over 5 minutes before their ranges expanded. It didn't generally result in better matches and people at the higher end of skill rating ended up sometimes waiting in excess of 40 minutes for matches.

    Is there any limit on the expanded search range? Have you experimented with that?

    Perhaps lowering the search time before expanding to about 2-3 minutes, BUT having a hard maximum search range of +/- 50 or 100, would help?
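
    A sketch of how I read that search, for the sake of discussion: the 25-point window and the 5-minute delay are from Ben's post, but the expansion rate and the toy player pool are entirely made-up placeholders:

    ```python
    import random

    def find_lobby(anchor_rating, pool, window=25.0, grow_after=300.0, grow_rate=5.0):
        """Collect 9 other players around an anchor player.

        window     -- initial allowed rating difference (25 points, per Ben)
        grow_after -- seconds before the window starts expanding (5 min, per Ben)
        grow_rate  -- extra window points per second after that (made up)
        """
        t = 0.0
        while True:
            allowed = window + max(0.0, t - grow_after) * grow_rate
            candidates = [p for p in pool if abs(p - anchor_rating) <= allowed]
            if len(candidates) >= 9:
                return random.sample(candidates, 9), t
            t += 1.0  # wait one more second; the window may widen

    # Toy pool: 200 players, most of them mid-rated.
    pool = [random.gauss(1200, 200) for _ in range(200)]
    lobby, waited = find_lobby(1700.0, pool)
    print("found 9 players after %.0f s" % waited)
    ```
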

  • Faux Play.6104Faux Play.6104 Member ✭✭✭

    @Tiah.3091 said:
    (...)
    Well, my deviation numbers are low indeed, **because I took active steps to reduce them.** I introduced a decaying RD for the whole playerbase (...)
    So, I introduced the "growing array" - after every new match that our player played, the algorithm "remembered" all his previous games, up until it reached 100 games. (...)
    Now, if I didn't introduce the RD cap of 30, it would drop to ~0 values quite soon. (...)
    Therefore, I'm asking @Ben Phongluangtham.1065: can you tell us what exactly this constant is? Is it 10, or is it gradually increasing to a certain level (like in my simulation, where it grew to 100)?

    Cooking the books to force it to 0 deviation isn't realistic statistics. You never know something with 100% certainty :-)

    Something still looks off. I don't get wild swings of deviation like that, or of rating once it settles. Once you get about 20-30 matches in, the +/- is about 12 points per game. I'm not summing up matches and then calculating a new rating, since this isn't run like a chess tournament: after every match I calculate a new rating and deviation for the players. It is an assumption, but maintaining a queue of match history for 10s-100s of thousands of players seems like a waste of computing resources. The main reason the paper batches games is so you don't have to repeat the iterative part of the calculation if you do it by hand. With a computer that is trivial.
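
    To spell out the difference between the two readings (glicko2_update here is a stand-in for any implementation of Steps 2-8, e.g. the full sketch further down the thread; the point is only where the loop sits):

    ```python
    def glicko2_update(r, RD, vol, opp_rs, opp_RDs, scores):
        # stand-in for a real Steps 2-8 implementation (see the full
        # sketch near the end of the thread); here it just echoes the inputs
        return r, RD, vol

    r, RD, vol = 1200.0, 350.0, 0.06
    season_games = [(1250.0, 60.0, 1.0), (1300.0, 45.0, 0.0), (1180.0, 90.0, 1.0)]

    # My reading: per-match updating (m = 1). Each game is folded into the
    # rating immediately, and the "history" then lives inside (r, RD, vol):
    for opp_r, opp_RD, score in season_games:
        r, RD, vol = glicko2_update(r, RD, vol, [opp_r], [opp_RD], [score])

    # Tiah's reading instead calls glicko2_update ONCE, passing whole arrays
    # of the last N opponents and results, re-evaluating old games each time.
    ```
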

  • Faux Play.6104Faux Play.6104 Member ✭✭✭
    edited October 25, 2018

    @Ben Phongluangtham.1065 said:
    (...) When a match is being built around a player, the matchmaker first looks for 9 other people within 25 points of rating. If it doesn't find enough people after 5 minutes (in ranked), it starts expanding the range over time until it finds enough players. (...)

    Thanks for the info!

    Based on that I took a stab at definitions on the wiki for the following code:
    <Rating start="5m" end="10m" max="1200" min="25"/>
    https://wiki.guildwars2.com/wiki/PvP_Matchmaking_Algorithm

    Filter/Rating/@Min
    The maximum rating difference between rosters the filter starts at.
    Filter/Rating/@Max
    The maximum rating difference between rosters that can exist after padding is applied.
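
    If that reading is right, the allowed roster difference as a function of queue time would look something like this (the 25/1200/5m/10m numbers are from the wiki line above; the linear ramp between them is purely my guess):

    ```python
    def max_roster_diff(seconds_in_queue,
                        start=300.0, end=600.0,  # "5m" and "10m"
                        lo=25.0, hi=1200.0):     # min="25", max="1200"
        """Allowed rating difference between rosters after a given queue time,
        assuming a linear ramp between start and end (the ramp shape is a guess)."""
        if seconds_in_queue <= start:
            return lo
        if seconds_in_queue >= end:
            return hi
        frac = (seconds_in_queue - start) / (end - start)
        return lo + frac * (hi - lo)

    for t in (0, 300, 450, 600, 900):
        print("%4d s in queue -> %6.0f allowed difference" % (t, max_roster_diff(t)))
    ```
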

  • Tiah.3091Tiah.3091 Member ✭✭✭

    @Faux Play.6104 said:

    After every match I calculate a new rating and deviation for the players. It is an assumption, but maintaining a queue of match history for 10s-100s of thousands of players seems like a waste of computing resources.

    Wait, I don't get it: in your code you don't feed the history of the player's previous matches to Glicko?
    But that is just simply wrong!
    Even in the pdf the author does the example run over 3 matches.

    The logic is: the better the algorithm knows the history, the more precise it is.

  • Exedore.6320Exedore.6320 Member ✭✭✭

    @Tiah.3091 said:
    How the heck am I supposed to evaluate the win probability in chaos like this? :#
    (And the win probability is THE MOST crucial part of the algorithm, because otherwise how would it converge to your "true skill level"?)

    I'm fairly certain that the rating adjustment is player rating vs. averaged team rating. Several seasons ago I did a few ranked games with a friend. I'm in platinum; he's somewhere around silver/gold. My rating adjustments for a win were tiny compared to when I play in platinum; his were huge. For a loss, mine were huge and his were tiny. A player-vs-5-players setup could also produce this, but ANet would have to do something to account for the magnitude of the adjustment, which I find unlikely.
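
    That hypothesis is easy to play with using the E function from the paper, if the adjustment really is player rating vs. the opposing team's average (a sketch of the hypothesis, not confirmed code; the team RD of 60 is an arbitrary assumption):

    ```python
    import math

    SCALE = 173.7178

    def g(phi):
        return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

    def E(mu, mu_j, phi_j):
        return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

    def expected_vs_team(player_r, team_ratings, team_RD=60.0):
        avg = sum(team_ratings) / len(team_ratings)
        return E((player_r - 1500.0) / SCALE, (avg - 1500.0) / SCALE, team_RD / SCALE)

    enemy_team = [1350.0, 1300.0, 1250.0, 1200.0, 1150.0]  # avg 1250: a gold-ish lobby
    print(round(expected_vs_team(1600.0, enemy_team), 2))  # ~0.88: tiny gain on a win, big loss on a defeat
    print(round(expected_vs_team(1250.0, enemy_team), 2))  # 0.50: the normal-sized +/-
    ```
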

  • Faux Play.6104Faux Play.6104 Member ✭✭✭

    @Tiah.3091 said:

    @Faux Play.6104 said:

    After every match I calculate a new rating and deviation for the players. It is an assumption, but maintaining a queue of match history for 10s-100s of thousands of players seems like a waste of computing resources.

    Wait, I don't get it: in your code you don't feed the history of the player's previous matches to Glicko?
    But that is just simply wrong!
    Even in the pdf the author does the example run over 3 matches.

    The logic is: the better the algorithm knows the history, the more precise it is.

    The history is built into the rating and deviation numbers. The history you are referring to exists to minimize the number of times you have to do the iterative calculation if you are doing it by hand.

    Each match would still be evaluated once.

    The term for increasing volatility due to inactivity is disabled, so the rating period isn't of much use.

  • Deimos.4263Deimos.4263 Member ✭✭✭

    Do any of these models take into account a player's skill improving over time? Because of course it will. You learn stuff.

  • Megametzler.5729Megametzler.5729 Member ✭✭✭✭
    edited October 25, 2018

    @Deimos.4263 said:
    Do any of these models take into account a player's skill improving over time? Because of course it will. You learn stuff.

    Of course. However, only the relative skill improvement. So if everybody else improves at the same rate, you remain at your current rating. :wink:

    The lower limit on RD exists to account for skill changes throughout the season, as well as balance patch changes and stuff like that. It keeps your rating from being fixed too quickly (and therefore too precisely), so things can still change.

  • Tiah.3091Tiah.3091 Member ✭✭✭
    edited October 25, 2018

    @Deimos.4263 said:
    Do any of these models take into account a player's skill improving over time? Because of course it will. You learn stuff.

    No, my code doesn't account for it. But this code and this entire thread are mostly dedicated to one problem: winstreaks and losestreaks.
    Which tend to happen over MUCH shorter intervals than it takes a player to learn his stuff.
    I mean, you have a lose streak of 5-10 games, then a winstreak of 5-10 games.
    I really doubt a player can improve his skills any faster than over 100-200 games. Therefore, the effect is absolutely insignificant.


    @Faux Play.6104 said:
    The history is built into the rating and deviation numbers. The history you are referring to exists to minimize the number of times you have to do the iterative calculation if you are doing it by hand.

    Well, this is just plain wrong. Did you really read the paper?

    "m opponents with ratings μ1, μ2, ..., μm" or "scores against EACH opponent"
    Can you see the capital Greek Sigma letter? With "j=1" below and "m" above?
    Do you know what this means?

    I'm just asking, though. Probably I have misunderstood you.
    But the RESULTS of the matches vs the m previous opponents are DEFINITELY taken into account. The results of the matches - that is what I call "match history".
    Please tell me if I'm still unclear.
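
    For the record, that capital Sigma is Step 3 of glicko-2.pdf - the estimated variance v = 1 / (sum over j = 1..m of g(φj)² · Ej · (1 - Ej)). In code, with the paper's own example numbers:

    ```python
    import math

    SCALE = 173.7178

    def g(phi):
        return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

    def E(mu, mu_j, phi_j):
        return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

    def variance_v(mu, opp_mus, opp_phis):
        """Step 3: the capital-Sigma sum over j = 1..m, then inverted."""
        total = sum(g(phi_j) ** 2 * E(mu, mu_j, phi_j) * (1.0 - E(mu, mu_j, phi_j))
                    for mu_j, phi_j in zip(opp_mus, opp_phis))
        return 1.0 / total

    # The paper's example: player at 1500, opponents at 1400/30, 1550/100, 1700/300
    mu = 0.0
    opp_mus  = [(r - 1500.0) / SCALE for r in (1400.0, 1550.0, 1700.0)]
    opp_phis = [rd / SCALE for rd in (30.0, 100.0, 300.0)]
    print(round(variance_v(mu, opp_mus, opp_phis), 4))  # ~1.78; the paper's Step 3 gives 1.7785
    ```
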


    @Exedore.6320 said:
    I'm fairly certain that the rating adjustment is player rating vs. averaged team rating.

    Oh, of that I'm fairly certain as well. Perhaps with some lowering coefficient, but yeah, I've been in that situation, where I lose 15 and my friend loses 8.

    What I was talking about is NOT the "win probability" from Glicko, which is required for the rating update:

    No.
    I meant the REAL win probability. Why is it not the same? Because Glicko takes your (and your opponents') current rating for the calculation, which is likely not exactly your real rating. Especially if the season has just begun.
    I.e., the dude who was 1900 last season plays a game with 9 scrubs who were 800-1300 last season.
    However, on paper, EVERYONE'S rating might be 1200 (first game of the season for all 10 ppl).
    What Glicko will calculate in this situation is obvious - it'll just take all those 1200 ratings, do its magic, and BOOM - everyone's equal, the winrate is 50/50.

    But is it true? No. So what was the ACTUAL win probability for that game?
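
    It's easy to put a number on that gap with the same E function (the "true" ratings below are of course hypothetical - that's the whole point):

    ```python
    import math

    SCALE = 173.7178

    def g(phi):
        return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

    def E(mu, mu_j, phi_j):
        return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

    def win_prob(r_a, r_b, RD_b=350.0):
        return E((r_a - 1500.0) / SCALE, (r_b - 1500.0) / SCALE, RD_b / SCALE)

    # First game of the season, on paper: everyone at 1200.
    print(round(win_prob(1200.0, 1200.0), 2))  # 0.5 - what Glicko "sees"

    # Reality: last season's 1900 dude against a ~1100 lobby.
    print(round(win_prob(1900.0, 1100.0), 2))  # ~0.96 - the REAL win probability
    ```
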

  • Faux Play.6104Faux Play.6104 Member ✭✭✭

    @Tiah.3091 said:
    (...)
    Can you see the capital Greek Sigma letter? With "j=1" below and "m" above?
    Do you know what this means?
    (...)
    But the RESULTS of the matches vs the m previous opponents are DEFINITELY taken into account. The results of the matches - that is what I call "match history".
    (...)

    There is nothing that says m needs to be greater than 1. That portion of the calculation sits inside an iterative loop where you need the answers to converge. If you are calculating the iterative portion by hand, it is more convenient to sum up several matches and do the iterative portion once.

    It doesn't say to treat the summation portion like a FIFO queue where each match is evaluated m times.
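
    To make the argument concrete, here is a complete single-player update - my own straightforward transcription of Steps 2-8 of glicko-2.pdf, definitely not ArenaNet's code - written so that m can be anything from 1 up:

    ```python
    import math

    SCALE, TAU, EPS = 173.7178, 0.5, 1e-6  # TAU is the system constant from Step 1

    def g(phi):
        return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

    def E(mu, mu_j, phi_j):
        return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

    def glicko2_update(r, RD, sigma, opp_rs, opp_RDs, scores):
        """One full update (Steps 2-8 of glicko-2.pdf); works for any m >= 1."""
        # Step 2: convert to the Glicko-2 scale
        mu, phi = (r - 1500.0) / SCALE, RD / SCALE
        mus  = [(x - 1500.0) / SCALE for x in opp_rs]
        phis = [x / SCALE for x in opp_RDs]

        # Steps 3-4: estimated variance v and estimated improvement delta
        Es = [E(mu, mj, pj) for mj, pj in zip(mus, phis)]
        v = 1.0 / sum(g(pj) ** 2 * e * (1.0 - e) for pj, e in zip(phis, Es))
        gsum = sum(g(pj) * (s - e) for pj, e, s in zip(phis, Es, scores))
        delta = v * gsum

        # Step 5: new volatility - THE iterative part. It converges for
        # m = 1 just as well as for m = 100.
        a = math.log(sigma ** 2)
        def f(x):
            ex = math.exp(x)
            return (ex * (delta ** 2 - phi ** 2 - v - ex)
                    / (2.0 * (phi ** 2 + v + ex) ** 2)) - (x - a) / TAU ** 2
        A = a
        if delta ** 2 > phi ** 2 + v:
            B = math.log(delta ** 2 - phi ** 2 - v)
        else:
            k = 1
            while f(a - k * TAU) < 0.0:
                k += 1
            B = a - k * TAU
        fA, fB = f(A), f(B)
        while abs(B - A) > EPS:
            C = A + (A - B) * fA / (fB - fA)
            fC = f(C)
            if fC * fB <= 0.0:
                A, fA = B, fB
            else:
                fA /= 2.0
            B, fB = C, fC
        sigma_new = math.exp(A / 2.0)

        # Steps 6-8: new RD and rating, converted back to the original scale
        phi_star = math.sqrt(phi ** 2 + sigma_new ** 2)
        phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
        mu_new = mu + phi_new ** 2 * gsum
        return SCALE * mu_new + 1500.0, SCALE * phi_new, sigma_new

    # m = 1: fold in one match at a time, using the paper's example games.
    r, RD, sigma = 1500.0, 200.0, 0.06
    for opp_r, opp_RD, score in [(1400.0, 30.0, 1.0), (1550.0, 100.0, 0.0), (1700.0, 300.0, 0.0)]:
        r, RD, sigma = glicko2_update(r, RD, sigma, [opp_r], [opp_RD], [score])
    print(round(r, 1), round(RD, 1))
    ```

    Feeding those same three games in as one batch (m = 3) reproduces the paper's worked example (r' ≈ 1464.06, RD' ≈ 151.52); folding them in one at a time gives a similar but not identical result - which is exactly the difference being argued about in this thread.
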
