UPDATED 24 Oct, 11:19 PM (GMT+3)

A few weeks ago there was a thread where some dude **hypothesised** that *"the MatchMaking algorithm forces you to lose games in streaks after you've had a win streak."* Here it is.

The thread was met with healthy criticism, and one dude, @Megametzler.5729, even linked a pdf where the MM was described step by step.

I read it thoroughly (at least I think I did), and afterwards I had a feeling that the situation OP described was worth testing.

So, as you can check yourself, the math that describes the algorithm is quite trivial. *(Well, the math that **describes** it is indeed trivial, but the math it took the paper's author **to prove it actually works** is slightly more complicated. Sadly, the full process isn't described in the paper.)*

Despite that, the formulas look quite clunky and not very fit for visual comprehension.

That's why I thought it would be a fun thing to put them into code. So yesterday I was really bored and gave it a try: link to Python 2.7 Jupyter Notebook (updated). The code itself is a little bit trashy, but should be easy enough to read.

The main goal of the code was to **simulate the game history** of some player in a 1v1 scenario (although GW2 sPvP happens in the form of 5v5, in the context of our hypothesis it doesn't really matter).

In order to simulate something, you have to provide a model of some adequate level of fidelity.

In the case of this code, there are supposed to be two models (I'll add the second later).

*Description:*

- There's ONE very high TRUE SKILL level player. Let's say, 1900 level.
Although he's initially Unranked and has to play 10 games for seeding against various opponents of some skill level.

And, to his misfortune, he does it while being STONED AS kitten, which in terms of math means his winrate against 800-1200 scrubs is **precisely** 50% *(for those 10 games only)*. Then he finishes with some result and feels like "kitten, man, that won't do, I must tryhard." And he starts doing exactly that, playing at his full potential.

**Important, the MatchMaker algorithm:** the matchmaker now assumes that all players in the game have their ratings distributed according to a Gauss distribution, with mean 1000 rating and standard deviation 266 rating **(see image 1)**.

*(1200 was taken from here, as well as the other constants; the standard deviation I took by assuming that 1800 is the 3-sigma level, where only 0.3% of the elitist dudes play. 30 people above 1800 makes the playerbase something like 10,000 - that makes sense, I guess.)*

It works like this: the matchmaker rolls a normally distributed number (with mu and sigma of 1000 and 266, as above).

If it does NOT fall within a range of +/-10 rating around our dude's *current rating*, we increase this range by 10 (making it +/-20, then +/-30, and so on) and re-iterate the process.

If it DOES, however, we take this number as our opponent's rating for the current game. The higher our dude's rating, the tougher it is for him to find decent opponents **(see image 2)**.

- Then we calculate the win expectancy against this opponent, according to the Glicko manual (see the formula for **E** in glicko-2.pdf, Step 3), and record a win or a loss for this game in accordance with that expectancy.
- Then we calculate the updated rating, rating deviation and volatility from the game he just played, and update his game history with those values. Initially the history holds the 10 seeding games, and then it starts growing. Once it reaches 100 games, we start removing the first element of the history array, shifting the whole array left by 1, and recording the newest game at the end (i.e., making space for new games and forgetting very old ones).
- Then we reiterate the process until our guy has played 500 games.
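The loop above can be sketched roughly like this. `g` and `expected_score` are the Step 3 formulas from glicko-2.pdf (note they work on Glicko-2's internal scale, where the paper converts ratings via mu = (r - 1500) / 173.7178); the matchmaker constants are the ones quoted in this post, and the function names are mine:

```python
import math
import random

def g(phi):
    # glicko-2.pdf, Step 3: g(phi) = 1 / sqrt(1 + 3*phi^2 / pi^2)
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(mu, mu_j, phi_j):
    # glicko-2.pdf, Step 3: win expectancy E against opponent (mu_j, phi_j),
    # expressed on the internal Glicko-2 scale.
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def find_opponent(player_rating, mu=1000.0, sigma=266.0, window=10.0, step=10.0):
    """Roll a normally distributed rating; if it misses the accepted
    window around the player, widen the window by `step` and roll again."""
    while True:
        candidate = random.gauss(mu, sigma)
        if abs(candidate - player_rating) <= window:
            return candidate
        window += step  # +/-10 -> +/-20 -> +/-30 ...
```

Note that a precise mirror match (`mu == mu_j`) gives `expected_score` exactly 0.5, whatever the opponent's RD.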

*UPDATE from 24 Oct:
1) removed RD decay over time, introduced a hard cap of RD = 30,
2) updated the mean and standard deviation for the gaussian of skill,
3) took the system constant and other parameters from the wiki page*

And, finally, we can see how his rating changes with time by the end of a season. See **image 3**.

(from top to bottom):

**IMAGE-1:** Gauss distribution of TRUE SKILL levels of players on the GW2 leaderboard. Approximate, of course. 10,000 players, mean 1000, sigma 266.

These numbers are derived from the assumption that the 3-sigma level is 1800 rating, and there are 30 players above 1800.
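For what it's worth, that calibration is easy to check directly. One nitpick: 0.3% is the two-sided 3-sigma figure; the one-sided tail above 3 sigma is roughly 0.135%, so 30 players above 1800 would actually imply a playerbase closer to ~22,000 than 10,000:

```python
import math

mean = 1000.0
sigma = (1800.0 - mean) / 3.0      # 1800 at 3 sigma  ->  sigma ~ 266.7

# One-sided probability of a normal value landing above mean + 3*sigma.
tail = 0.5 * math.erfc(3.0 / math.sqrt(2.0))   # ~ 0.00135, i.e. 0.135%

players_above_1800 = 30
implied_population = players_above_1800 / tail  # ~ 22,000
```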

**IMAGE-2:** Matchmaking representative samples for players with 1000 TRUE SKILL level (blue) and 1900 TRUE SKILL (green). As you can see, a 1000-rating player will *almost always be playing with similar-level* opponents, while a 1900-rating player will not only be playing against a *much wider* range of opponents, but *he will also be forced to play against lower-skill players most of the time*.

**IMAGE-3:** The game history of our 1900-rated player. Rating is displayed on the left scale (red) and the rating deviation on the right scale (blue). Note how quickly it converges to the 1800-2000 range and stays there throughout the whole season. Even though win streaks occur, they won't bring the rating too high.

@Airdive.2613 said:

I've come up with an idea of an interesting (at least to me) experiment.

...

The data of interest:

1. I'm curious to see the distribution of the overall number of people depending on their rating (like a histogram of the number of players to divisions), as well as "bad" and "good" ones independently.

2. Using the same data as in point 1, calculate the sum of all players' ratings. What I mean is, how does the sum of all players' ratings after 100,000 games compare to the initial sum of their ratings? (The initial ratings' sum would be, for example, 1,500x5,000 = 7,500,000.)

This would require a slightly different model, and... It's coming soon

Well, you've got the idea - although **graph 3** clearly indicates that **while winstreaks (and lose streaks) MAY EXIST to a certain extent, they shouldn't take you much farther than +/-50 rating below or above your TRUE SKILL LEVEL**, especially at the end of the season.

As always, take it with a grain of salt, because the model is **STILL** quite simplified and there are **STILL** a lot of uncertainties and unknowns.

Constructive criticism is welcome. Please check the code yourself if you're interested.

ALL THE MATERIALS, IN CASE SOMEONE MISSED SOMETHING:

1) Code (python 2.7 notebook)

2) Glicko-2.pdf

3) previous thread

4) Gauss Distribution


## Comments

Haha, great job man. You actually did it! Thanks a lot for this!

Also, an absolutely correct sidenote about matchmaking <> Glicko. The matchmaker is developed by Anet and we have rather little information on it. But Glicko(2) is a solid, widely used and proven system.

Still, I am kind of surprised about the huge deviations. Did you turn some screws, the system constant tau (τ) for example? Could you add two or three graphs for different values, since that is supposed to determine the volatility? But only if you happen to be bored (again). I don't know Python, and I am abroad, so I don't have Matlab on my private computer...

Anyway, again, thanks a lot! Really fun to see.

€: I see, you used 0.8. Could you give it a shot with lower values?

Great Work! Can we get a "sticky" on this?

Vae Victus!

[Hcm] Promotraitor

Wow, the major swings in the rating are pretty insane. I wonder if this is representative of other people's experience. However, the ability to duo queue definitely skews this. I play exclusively solo, and the highest I've gotten this season is 1657. When I hit that, I noticed teams almost always had two duo queuers. Kinda annoyed with that.

I seem to get better average results when I solo queue past low Plat than when I duo queue. Talking over the course of about 300 games last season; I'm as of yet unable to play competitively this season, due to an ongoing hand injury.


Yeah, well, I'd suggest you guys not take THE ACTUAL VALUES of the swings too seriously, because they are most likely just too huge.

My matchmaking (actual matchmaking, lol) algorithm doesn't take into account that by the end of the season the majority of players have had their rating deviations gradually decrease.

My assumption about rating deviation is that when it goes below a certain threshold for an opponent, it instead rerolls randomly in the range of 0-50.

Which, again, is FALSE for the VAST MAJORITY of players.

My point is that those swings shouldn't be THAT HUGE, like you guys correctly noticed.

Instead, if the player TRUE SKILL rating is 1500, the graph should look something like this:

(Sorry for the Paint).

In other words - it should converge to 1500. I'm pretty sure one can do that by playing with the parameters a little bit. But with that many free parameters (including the most major one - the ACTUAL MM algorithm), that's like finding a needle in a haystack.

What can be learned from it RIGHT NOW, however, is that you are most likely to converge to your TRUE SKILL rating at one point or another, but that might be quite a bumpy ride with big lose streaks and equally big win streaks.

I'll play with it tomorrow, hopefully. I mean, with a variable RD for opponents. And I'll also try one with a bigger TAU (kinda like a "viscosity" parameter, eh?).

But you guys are more than welcome to give it a try as well!

Cheers!

The volatility will undoubtedly drop by a lot if you include more variables to take into account.


So, the question. If the matchmaking is not provided by Glicko, what exactly does it do? Is its only output a number indicative of a player's rating? If so, is its only function to determine the amount of points gained/lost?

Well, assuming that there WAS a decent matchmaking, Glicko can be described as follows: it takes the player's current rating, rating deviation and volatility, plus the ratings and RDs of the enemies from N of his previous matches, and then returns the updated values of rating, rating deviation and volatility (for the player). 3 numbers, to be precise. Not 1.

But to put it simply - yes, it "doesn't do much". All the matchmaking magic happens thanks to some cryptic A-Net algorithm.

The matchmaking algorithm which I used in my code is kitten simple: if the enemy fits into the range [current_player_rating - 50; current_player_rating + 50], then this is our guy. The exact value of the enemy's rating is taken randomly, ofc.

This, this is neat.

Solo queue tends to yield better results UNLESS you have a solid duo queue partner. Duo queue means you have a good chance of ending up with 3 people who don't know kitten they're doing - since duo queue inflates your combined MMR.

Is this factual? If it is, it's rather counter-intuitive.


I actually experience elongated win/lose streaks extremely frequently to the point that most of my seasons are played as win or lose streaks. Why does this happen to me? Good question, but it makes me wonder if:

Either way, in 6 years and almost 13,000 matches played, I've come to the conclusion that the algorithm notes and all simulations are somewhere incorrect, and in no way reflect the deep applied effects of 3rd-party programs/smurfing/win trading/whatever the hell else is going on, and their effects on actual matchmaking.

Don't believe me? I woke up earlier this morning to play some games and went on a 10 or 11 game win streak. And no, there is no win trading here. This is just random legit ranked solo queue. I mean, does someone have a plausible explanation for this happening so frequently to some players? I'd love to hear it.

The 10 Commandments Of Conquest - Abide by the commandments or God shalt deliver unto thee a packet of salt as often as thou did break them -> https://en-forum.guildwars2.com/discussion/38081/the-10-commandments-of-conquest#1

Am I correct in my assumption that all under-10-minute games were blowouts? I have never really bothered to time PvP games, so I'm curious to get some input on the subject. Also, what was your baseline rating at the beginning of the streak vs where it ended?


I do believe you, because I experience exactly the same - 10-win and 10-loss streaks with maybe 1-2 outliers. Well, the normal "2 wins, 1 loss, 1 win, 2 losses" stuff also happens, ofc. But by my feeling, these win and lose streaks happen WAY too often. MUCH more often than in other games with rating.

So, an attempt to find an explanation is precisely the reason for this thread.

So far, what I can tell for sure: if you win, say, 7 games in a row, the matchmaker SHOULD put you against a "statistically tougher" opponent, because normally it should put you against an equal opponent. The absolute case of an equal opponent is your "mirror" - someone with exactly the same rating and RD as you. BUT! In the case of a winstreak, a precise "mirror match" (when your supposed opponent has the EXACT same rating and RD as you) results in a win expectancy >50%.

Though, for the ultimate success I still need to converge that graph.

I've come up with an idea of an interesting (at least to me) experiment.

Could you please provide the data on the following?

...somewhat depend on each player's tier.

The data of interest:

1. After these 100,000 games our "players" will presumably spread across the ladder. I'm curious to see the distribution of the overall number of people depending on their rating (like a histogram of the number of players to divisions), as well as "bad" and "good" ones independently.

2. Using the same data as in point 1, calculate the sum of all players' ratings. What I mean is, how does the sum of all players' ratings after 100,000 games compare to the initial sum of their ratings? (The initial ratings' sum would be, for example, 1,500x5,000 = 7,500,000.)

3. Moving on to another stage, a "soft reset" occurs somehow at the end of the "season". How does it affect the sum of all players' ratings, if it does at all (or maybe it is just a volatility reset)?

4. After running several seasons of the same experiment (with the soft reset occurring in-between), I'd like to once again see the histogram of the number of players by division for the whole population, the "bad" subgroup, and the "good" subgroup, as well as the sum of all players' ratings.

I know it's a lot to ask, but unfortunately I'm not familiar with programming, and it seems like too daunting a task to do using spreadsheets.

That is indeed an interesting experiment, and actually quite a common one when you do statistical analysis. Thank you for pointing it out!

It's actually a much easier task than you might think, when using that code.

1) This assumption is already in the code.

2) This is done with 1 line, basically. Right now I only provide statistics for 1 player - his entire rating history throughout the season. To make it 5000, you just have to run 1 additional loop from 0 to 4999. And since we're only interested in the FINAL rating value for each player, the answer can be represented as a 1-dimensional array of values, something like [1485, 1257, ... 1763].

3) This is called a normal distribution.

I was planning to introduce it at some point, though in a different place.

Right now the MM algorithm simply takes random opponents from an interval of ratings, while a normal distribution would suggest that it has a lesser probability of taking an opponent with an extremely different rating. E.g., the interval is 1400-1600 (+/- 100 around 1500), and the probability of taking an opponent with 1410 rating is 10%, while the probability of taking the 1490 guy is 90% (right now those probabilities are equal).

4) Right now the TRUE SKILL of our experiment-rabbit player is 1500, and the probability to win is determined linearly - 1500 vs 500 rating is a 100% win, 1500 vs 1500 is a 50% win, 1500 vs 2500 is a 0% win. (I was thinking of replacing this model with something more sophisticated, like the mentioned ELO - it has win probability depend on the difference in players' ratings.)

What you suggest is that I take those 5000 players and simply introduce a normal distribution to their TRUE SKILL level. I.e., we had 1 guy with 1500; now we'll have 500 guys between 1400 and 1500, 10 guys between 0 and 200, and 50 guys between 1900 and 2000. This is easy.
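That density-weighted pick could look something like this (a sketch; the example percentages in the post aren't normalized probabilities, so here closeness is weighted with a normal density instead, and the `sigma` value is my own placeholder):

```python
import math
import random

def pick_weighted_opponent(rating, window=100.0, sigma=50.0):
    """Rejection-sample an opponent from [rating - window, rating + window],
    accepting candidates with probability proportional to a normal density
    centred on the player, so extreme rating gaps become rare."""
    while True:
        cand = random.uniform(rating - window, rating + window)
        accept = math.exp(-((cand - rating) ** 2) / (2.0 * sigma ** 2))
        if random.random() < accept:
            return cand
```

Opponents close to the player's rating are then picked far more often than opponents at the edge of the window, which is exactly the effect described above.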

5) Ultimately, the Glicko algorithm only calculates the outcome of 1v1 matches. In GW2, as we know it, sPvP matches are 5v5. Then how does the algorithm transfer to 5v5, you ask? Well, my suggestion: it simply takes the average rating of the enemy team and treats that team as a singular opponent for each player on the ally team.

So, in the end it doesn't really matter - the results won't be much different between 1v1 and 5v5. So the suggestion to make it 5v5 is, I think, unnecessary.
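That averaging suggestion, as a one-liner sketch (averaging the RDs as well is my own naive guess; the text above only talks about rating):

```python
def team_as_single_opponent(enemy_ratings, enemy_rds):
    """Collapse an enemy team into one Glicko opponent by averaging
    the team's ratings (and, naively, its RDs as well)."""
    n = len(enemy_ratings)
    return sum(enemy_ratings) / n, sum(enemy_rds) / n
```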

But all in all, this is a great suggestion, and I was planning to do it myself at one point or another (well, after I make the player's rating converge to a certain value and decrease the volatility).

Thanks, man!

Want to add one more 'variable' to your theorycrafting regarding the MM. Just a while ago I got a match with 4 engineers, 2 being scrappers (both on my team), while the other 2 holos were enemies. I think most people can understand what happened. I mean, wouldn't it be possible for the engineers to be equally distributed between teams? Why 2 scrappers - a profession that is weak, especially compared to holo - on one team?

> @Dreddo.9865 said:

That's not exactly a "variable". That's not really anything. I understand what you're talking about, but how do you suggest I evaluate the "strength" of a spec? Just take a "wild guess"? That there are 2 equally skilled players, one playing holo, the other playing scrapper, and the one playing holo has a better chance of winning? Better by how much, exactly? 20%? 10%? 1%?

That is basically a free parameter - whatever I chose to take would affect the model significantly.

The model has enough free parameters on its own; I'm not sure it would be a good idea to introduce EVEN MORE of those.

Oh, that's cool!

To clarify: what I wanted to look at (in my latter points) is whether it might be possible that the total sum of the players' ratings changes with time (if they aren't zero-sum), thus causing the leaderboard to become "biased" after several seasons - upping or lowering the mean/median player rating within the same (0, 2100) borders or revealing some sort of a pattern.

I agree the normal distribution is clearly a better choice in terms of modelling the playerbase, but it must get increasingly harder to calculate the probability of winning. I mean, it shows well "how many people" are better than you, but I don't know for sure just "how much better" a 1800 player is over a 1500 one. Some math needs to be done. :0

I always knew they must use some algorithm. Good job!

Some of your questions I can answer without any modelling.

- Even in the case of a zero-sum game, the ratings for a constant number of people don't remain constant: they inflate very slightly, because ratings < 0 are not allowed. I.e., when you win against a 0-rating player and receive 10 points, he didn't lose 10 - he remains at 0. Which means you got those 10 points from nowhere; therefore, inflation.

- However, the leaderboard doesn't "become biased after several seasons", because every season it resets. So the player rating sums are equal at the beginning of every season.

I know what you mean: a few years ago there were people with 2k ratings (well, I've only played for 2 months, that's just my assumption). Now the top 10 people are all <1900. HOW COME?

I'll tell you how: it has nothing to do with "biasing". It's simply that the number of active players reduced significantly, therefore the absolute sum of players' ratings became lower (simply because it was ~1500·N, and now it's ~1500·0.5·N). It's like the system ran out of fuel - those skilled players simply can't take rating away from others, because there's nothing left to take anymore.

However, an important note is that, unlike ELO, Glicko is not a zero-sum system (at least I think it's not). How this non-zero sum would affect the player sum - that's a very interesting question. Although I really doubt it will be much, because even if the sum is not ZERO for a SINGULAR player, on AVERAGE over N players it should be more or less ZERO. Where N is the system constant - in the case of my code (and in the case of GW2 sPvP) it's 10. Remember how they ask you to play 10 games to determine your rating - that's the constant I'm talking about.

So basically this only shows what we've known since the Glicko / Glicko2 algorithms were released:

The main flaw in the original work is how win/loss is determined. It's definitely not linear. That flaw leads to more volatility in the rating than there should be. I would suggest using the Elo probability-of-winning calculation for determining the outcome, using the player's true skill and the opponent's actual rating (assuming all opponents were rated correctly). Going further, you could do additional tweaks:

Yeah, I was thinking the same:

ELO indeed should give a much better approximation than the linear model.
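For reference, the standard Elo expectancy replaces the linear win probability with a logistic curve in the rating difference:

```python
def elo_win_probability(r_a, r_b):
    # Standard Elo expected score for player A against player B.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
```

A 200-point favourite wins about 76% of the time under Elo, versus 60% under the linear model above (which moves 5% per 100 rating points).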

Those, on the other hand, are interesting suggestions. I'll give it a try, thanks!

On a side note: the rating deviation (RD) usually has a lower limit to account for "skill changes" - in the GW environment, of course, also balance patches, class changes and such. After the first 15-20 games we all seem to hit that limit - that is when the placement-match deviations have become low. Ever wondered why the tenth match gives ±30 rating, while the 20th game at the exact same rating only gives ±15 (and stays like that)? That is the RD, or its lower limit, respectively. You can still have statistical streaks, but the impact is lower.

On the topic of the matchmaker: we have rather little information here indeed. https://wiki.guildwars2.com/wiki/PvP_Matchmaking_Algorithm gives some hints that it tests loops to look for the "best roster" of teams, but we do not know many details about the criteria or, for example, how duos are implemented. Ben once showed an example here on the forums*, where it seemed not to be accounted for at all. That would indeed be a major flaw, but not connected to win/loss streaks. So if your rating inflates by duo-queueing, you might indeed experience a loss streak to get you back to your solo rating.

Final note: Glicko-2 is solid, though I would still like to see some changes. Matchmaking itself, however, could be an issue. DuoQs are only my personal worst idea. It does not, however, seem to look at your previous match outcomes, unless they keep lying to us.

*Here: https://en-forum.guildwars2.com/discussion/54656/match-ranking/p2

As for the first point - I believe it's reasonable to assume "different skill tiers" (however you call them) already include things like disconnections; these are just more likely to happen to "worse" players.

The second one, in my opinion, is what should be done and what I suggested with a fixed pool of 5,000 players. ^^

All I read from this is somewhere on the enemy team, there is a soul mate for us all. Let's use this to kill the toxicity ❤️❤️❤️

Okay, guys. A HUGE UPDATE here (check the OP):

1. Ratings now resemble the REAL GW2 sPvP leaderboard ratings as closely as possible, according to a Gauss distribution.

2. Updated the matchmaking algorithm with that neat ole' good Gauss.

3. Rating deviation now converges as the season progresses.

4. The player's game-history interval now grows as he plays more games, up to 100 games (it was a constant 10).

As a result, the volatility VASTLY decreased, and winstreak-losestreak swings are almost gone. They are still here, ofc, but not as huge as they were before.

I would advise you to re-read the first post from the "MODEL-1" paragraph, and run the code if you want a more thorough look at the thought process.

Cheers!

Well, NOW we can ask for sticky, I guess (drops mic)

I like that you put in the effort. But the people who complain about the rating system are still going to ignore any mathematical proof.

A few interesting things to try:

1. Use a slightly skewed Gaussian distribution for player rating (not centered at the mean rating). This draws from an experiment I saw done in Overwatch. A player would play one account on weekdays and another account on weekends. The weekend account ended up over 500 rating lower (OW goes 0 to 5000) than the weekday account.

2. Try to better reflect the matchmaker's behavior for fringe and low population. After ~5min, the matchmaker expands the rating margin for matching a player. This is particularly evident outside of prime play time.

On your question about duo queue, it uses the average of the players in a party as the roster's rating.

Could you please provide a link to that forum post? Because I don't get the idea of this experiment.

If it's the same guy playing both accounts, his TRUE SKILL level is still the same. It's not like he plays worse on one account than on the other, right?

I use the normal distribution explicitly for the TRUE SKILL level - the rating which you IDEALLY should have.

I don't get why it should be skewed. I mean, most things in nature are distributed according to Gauss with very good precision, starting from something as fundamental as the Cosmic Microwave Background and ending with something as simple as boob size.

To me it seems like he got a 500-lower rating simply because he didn't play enough games on the second account.

I already did that, and I tried my best to explain how in the updated OP. Yes, I expand the rating margin, exactly as you suggest.

That wasn't MY question.

I'm not interested in duo q results. At least for now. Probably that's a topic for MODEL-3

Too lazy to look for the post, but he did play enough to stabilize his rating. The idea is that different groups of people play on weekdays vs. weekends. In particular, weekends may have more casual (slightly less skilled) players; possibly a higher number of middle school and high school kids; etc. Since rating is a representation of an individual's skill against the population, changing the overall skill level of the population will change an individual's rating.

Couple points.

Every season I've played, I've had runs of 10+ wins in a row, and streaks where wins are hard to come by.

The matchmaker doesn't try to match you with teammates that are close to your rating. Instead it matches your team's average rating with the other team's average rating. Teammates can have several hundred points between the best and the worst player.

Glicko will give accurate results for mismatched opponents. Because of this...

You are better off matching players of similar skill on the same team. The current method boosts the bad players and punishes the good, which tends to drive people towards the same rating rather than separating them.

When I ran the numbers with the constants Anet uses, the deviation wouldn't go below 60. That means the system is 95% confident you are within 2 deviations of your current rating: -/+ 120.

Yeah, that's a good point, man.

How it affects the actual rating approximation for every player - negatively or otherwise - is a subject for research. I was thinking about it myself last night before sleep, because I remembered the post in this thread where a dude cited an A-net dev post confirming that they use exactly that: they balance one team's average rating against the other team's average rating.

Did you run the code from the original post?

Because I just remembered that I forgot to upload the new version after the update.

Also, the code uses slightly different constants from those A-net uses. I'll fix it and upload the new version. (However, that point about team vs team - I'm not yet sure what to do with it.)

I made my own based on the guild wars wiki and the Glicko paper. Most of the posts I made on the subject were on the old forum. I'll have to dig them up when I get home.

For teams, I'd just do a sum of squares for the deviations and assume the player is at the midpoint.
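The post doesn't spell the formula out; one plausible reading ("sum of squares" as the RD of the team's mean rating, assuming independent players) is:

```python
import math

def team_rd(rds):
    """RD of a team's average rating, treating individual ratings as
    independent: sqrt(sum of squared RDs) / team size."""
    return math.sqrt(sum(rd ** 2 for rd in rds)) / len(rds)
```

For five players who all sit at the RD-60 floor mentioned above, this gives a team RD of 60/√5 ≈ 26.8.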

I thought the wiki said 30 was the low cap, but I could not reach it when I ran multiple matches. Starting at 0, it would slowly grow until it reached 60. Same if you started at 700 - it would shrink to 60. Regardless, I think your deviation numbers are too low.

This is not accurate. When a match is being built around a player, the matchmaker first looks for 9 other people within 25 points of rating. If it doesn't find enough people after 5 minutes (in ranked), it starts expanding the range over time until it finds enough players. Note: this doesn't mean that if you've been in queue for less than 5 minutes, everyone in the match will be within 25 points. If you're in a match with people more than 25 rating points away from you, it just means whichever player the matchmaker built the match around had probably been in queue for 5 minutes or more.

After those 10 people are found, it arranges teams to ensure that each side is close in average skill rating and in standard deviation from that skill rating.

Additional note: We've experimented with making it so that everyone in a match had to have been waiting over 5 minutes before their ranges expanded. It didn't generally result in better matches and people at the higher end of skill rating ended up sometimes waiting in excess of 40 minutes for matches.

Ben Phongluangtham

Game Designer

Reddit: ANET_BenP

Twitch: AnetBenP
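A rough sketch of the search Ben describes above (the 25-point window and 5-minute wait are from his post; the expansion rate is a made-up placeholder, since he doesn't give one):

```python
def rating_window(seconds_in_queue, base=25.0,
                  expand_after=300.0, per_minute=25.0):
    """Search window around the anchor player's rating: +/-25 for the
    first 5 minutes of queue, then widening over time at a placeholder
    rate until enough players are found."""
    if seconds_in_queue <= expand_after:
        return base
    return base + per_minute * (seconds_in_queue - expand_after) / 60.0
```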

Before you read: know that this response is, admittedly, mostly conjecture and conspiracy theory.

I actually did this once with my best simulation. Say you take 100 players ranging from 1800 down to 800. I found that realistically, over the course of a season, those rating margins will implode on themselves, not expand. Meaning, the season starts with a margin of difference between ratings ranging from 1800 to 800, but at the end of the season, with how rating is affected by wins & losses, it will end up looking more like 1400 to 1200. At least this is what seems to happen on paper when the population is very low and the matchmaker is putting together matches where our teammates and opponents can be several hundred rating higher or lower than each other. It creates a situation where high-rated players are punished more than intended for losses and low-rated players receive too much rating reward for wins where they are being carried. When I saw this result, it made me question why our leaderboards weren't doing this in the higher ranked margins. What was keeping the 1800+ margins of the board expanding and not imploding? The lower rated margins also seem to be stunted from inevitable implosion towards a median. Is it win trading creating unrealistic higher margins? Is it.. something else in the algorithm that isn't mentioned in the notes?

That is when I REALLY sat back and started thinking about those win/lose streaks that everyone seems to talk about that happen so frequently. Season after season I began paying close attention to many different players' "rising & falling" rhythms in the leaderboards. I began noticing something odd indeed. The same players would always go on win or lose streaks at the same time. What I mean is: players (A)(B)(C)(D) always seem to hit some win rhythm at the same time, whilst players (E)(F)(G)(H) are all hitting a lose rhythm at the same time A B C D are on a win rhythm. And it would usually go on for 2 or 3 days, until the rhythms apparently swap? Then I'll see A B C D all take on a sudden lose rhythm whilst E F G H all go on a win rhythm? I mean, these are consistent patterns I have been watching for many many seasons now. I began to wonder if the algorithm had a secret function that was enforcing win & lose streak rhythms. From what I had seen, it would seem to be done in some kind of Control Group (A) and Control Group (B) kind of thing. When A is on a good rhythm for matchmaking, B is on a bad rhythm. When B is on a good rhythm, A is on a bad rhythm. It certainly would explain the "according to algorithm notes" highly improbable but somehow super frequent win & lose streaks. It would also explain why the rating margins somehow magically expand high and low instead of imploding - a system that makes us take turns being ping-ponged around the rating margins. Why would they program something like that in? I dunno, to make sure Glicko margins work for a 5v5 game mode that makes you queue with random people, to avoid implosion? Why would they not just tell us about it? Well, it certainly wouldn't be a strong selling point for the game mode when a player read the Glicko notes and realized the algorithm was sniping them during ranked queues.

I mean, does no one else find it odd that rating never settles as if your skill never settles? I have nearly 13,000 matches played and I'm sure I've peaked in my skill at Guild Wars 2 at this point! So why is my rating so ridiculously volatile every season all season? I would expect to bounce around between 1600ish and 1500ish, but to bounce around between 1650 and like 1350 four or five times a season, due to win & lose streaks that come and go like scheduled clockwork? ^^ It makes one wonder, but again, this is all just conjecture and a lot of conspiracy theory.

There is one other thing I wanted to note about running Glicko simulations on paper. I've been pointing this out for years now and I'm going to say it again. The math would all seem to be perfectly accurate on paper, yes. But there are factors going on here that cannot be captured by numbers. Amongst these factors are things like: "Are your teammates on their mains, or are they alting for PvP wing achievements?" "Are some of your opponents smurfing on low-rated f2p accounts while playing at a plat 2 level?" "Did you land a bad team comp while the enemy has a meta comp?" "Is anyone using 3rd-party programs and/or win trading?" "Does someone randomly AFK to answer their door and pay for a pizza?" etc., etc. But by far the BIGGEST factor that stunts the accuracy of simulations is that they in no way consider how Conquest is actually played. This will be easier to explain in a list:

But yeah, something to think about.

The 10 Commandments Of Conquest: Abide by the commandments or God shalt deliver unto thee a packet of salt as often as thou did break them -> https://en-forum.guildwars2.com/discussion/38081/the-10-commandments-of-conquest#1

I've always wondered how well the rank distribution takes into account certain factors that come up in GW2-style matches based on playstyle and class/build choice.

Would we expect both types of players to have the same win/loss streaks in their results?

I'm completely speculating here, but I would expect Carry to be less streaky as the season goes on. They should settle to right around the point where players can deal with them. If they lose too much, they start to dominate games until they are back at the point where players can counter them again. If they win too much, they get countered every game.

Acceptable-Teammate though.. I think they could rise or fall to potentially any rating, as they are more dependent on the team they get. Obviously, there's a fair amount of shuffling of teams, but I think Acceptable-Teammate may be more prone to streaks of luck with the matchmaker.

Pretty much what you said, yeah, lol xD

(Also, quite a tough hypothesis to falsify, because it's hard to account for such factor.)

Yeah, that was my primary concern when thinking about how the matchmaker should behave in such situations. Especially how such a skill distribution between teams would affect the winrate. I.e., will the 1700 dude carry the game, or will those three 1500 guys do it? I don't know how to approach it yet.

Well, my deviation numbers are low indeed, **because I took active steps to reduce them.**

I introduced the decaying RD for all playerbase, assuming that towards the end of the season people settle to their true rating.

Now I updated the code: shifted the mean of the gaussian to 1200 and reduced the standard deviation to 200 (so the 3-sigma level is still 1800). I also removed the decaying RD and introduced a hard cap of 30 - all according to the wiki page (I didn't read it before, lol).
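The updated parameters can be sketched like this (a minimal sketch; the function names and the use of numpy are my own, only the numbers 1200, 200 and 30 come from the post):

```python
import numpy as np

RD_FLOOR = 30    # hard cap: RD is not allowed to drop below 30, per the wiki
POP_MEAN = 1200  # mean of the true-skill gaussian
POP_SD = 200     # so the 3-sigma level is still 1800

rng = np.random.default_rng(42)

def draw_population(n):
    """Draw n players' true skill levels from N(1200, 200)."""
    return rng.normal(POP_MEAN, POP_SD, size=n)

def cap_rd(rd):
    """Clamp the rating deviation to the hard floor of 30."""
    return max(rd, RD_FLOOR)

skills = draw_population(1000)
print(round(skills.mean()), round(skills.std()))
print(cap_rd(12.5))  # -> 30
```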

The second step that I took - and it plays a far more important role: Glicko-2 takes 6 parameters for its calculation of a player's new rating, RD and volatility. Those are: current rating, current RD and volatility - all 3 are scalars. And also 3 arrays, which provide info about his opponents from N previous games: opponents' ratings, RDs and match results (either 0 or 1). Or, if you want, a 2-dimensional array of shape (N, 3).
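Those six inputs can be sketched as a single function. This is a hedged sketch, not the notebook's actual code: the names are mine, and it skips Glickman's volatility iteration (the volatility is held constant), which barely changes the numbers over a single rating period.

```python
import math
import numpy as np

SCALE = 173.7178  # Glicko-2 conversion factor between rating points and mu

def g(phi):
    """Glicko-2 weighting function for an opponent's uncertainty."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def glicko2_update(rating, rd, vol, history):
    """Simplified Glicko-2 step. `history` is the (N, 3) array described
    above: one row [opp_rating, opp_rd, result] per game, result 0 or 1.
    Volatility is treated as constant (no iteration), so only the new
    rating and RD are returned."""
    history = np.asarray(history, dtype=float)
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    mu_j = (history[:, 0] - 1500.0) / SCALE   # opponents' ratings
    phi_j = history[:, 1] / SCALE             # opponents' RDs
    s_j = history[:, 2]                       # match results

    g_j = np.array([g(p) for p in phi_j])
    e_j = 1.0 / (1.0 + np.exp(-g_j * (mu - mu_j)))  # expected scores
    v = 1.0 / np.sum(g_j ** 2 * e_j * (1.0 - e_j))  # estimated variance
    phi_star = math.sqrt(phi ** 2 + vol ** 2)       # pre-period deviation
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * np.sum(g_j * (s_j - e_j))
    return mu_new * SCALE + 1500.0, phi_new * SCALE
```

On the worked example in Glickman's paper (a 1500/200 player who beats a 1400/30 opponent and loses to a 1550/100 and a 1700/300 one), this lands close to the paper's 1464.06 rating / 151.52 RD, since freezing the volatility barely affects that example.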

As we all know, when we are Unranked, the game asks us to play 10 matches for "seeding". Now, I initially took those 10 games as N and kept it constant throughout the whole ordeal. The results were looking something like this:

As you can see: huge volatility, RD never drops below 40, and, obviously, HUUUUUUUUUUUUGE WIN STREAKS AND LOSE STREAKS.

Then I assumed, like, man, devs can't be that shallow. They definitely have this N increasing as the season progresses. I.e., the "game history array" should be growing with time; it definitely should have more than 10 games recorded. That was my assumption.

So, I introduced the "growing array": after every new match that our player played, the algorithm "remembered" all his previous games, up until it reached 100 games. I had to stop at 100, because otherwise my laptop was just basically saying "there's no way I'm doing this in the next millennium".

So, after it reached 100 games, the first game (historically) was removed from the array, the 2nd game became the 1st, the 3rd became the 2nd and so on, freeing space for the latest game.
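That grow-then-slide behaviour can be sketched with a bounded deque (my illustration, not the notebook's code): it grows freely up to the window size, then drops the oldest game and shifts everything down on each new append, exactly as described.

```python
from collections import deque

MAX_HISTORY = 100            # the point where my laptop gave up
history = deque(maxlen=MAX_HISTORY)

def record_match(opp_rating, opp_rd, result):
    """Append one game; once 100 games are stored, the oldest falls off."""
    history.append((opp_rating, opp_rd, result))

# Simulate 150 matches: the first 50 should be forgotten.
for i in range(150):
    record_match(1200 + i, 50, i % 2)

print(len(history))      # 100 - capped at the window size
print(history[0][0])     # 1250 - games 0..49 have been dropped
```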

And that's what I got for doing that (the same picture is in OP post):

Now, if I hadn't introduced the RD cap of 30, it would have dropped to ~0 values quite soon. Volatility is almost non-existent, and the rating stabilises at ~1900 (which is the TRUE SKILL level of our test dude).

Wiki doesn't have that info. And you see yourself how significantly it's affecting the results. Therefore, I'm asking @Ben Phongluangtham.1065:

can you tell us what exactly this constant is? Is it 10, or is it gradually increasing to a certain level (like in my simulation, where it was 100)? The question is super-important.

As far as I know, this is how Glicko-2 works in general. The example paper on it shows it like this: an ever-decreasing RD computed out of all previous games (with fewer games, though, since no one plays 100 matches of chess per year - maybe not representative there ^^).

Also, I think a limit on the minimum RD is kind of okay, but I think it is too high. I once asked for it to be reduced: slightly less responsiveness to skill changes throughout the season, balance patches and stuff, but way less volatility in late matches, reducing the punishment for many games played per season and maybe decreasing the toxicity of later games. Maybe we can get a hint here too, and maybe it could be reduced?

The 10 seeding games are to mask the fact that your rating is potentially shifting by hundreds of points in the first few games. The less players see, the less they freak out about it before thinking.

With the numbers used in GW2, I would expect a fluctuation of 150-200 points to be normal (two standard deviations in each direction). Further variation can be explained by changes in the player from day to day. Maybe you're tired, playing a different build, getting frustrated with a few losses and letting it cloud your judgment, etc.

I would actually posit the opposite outcome. The 1700 player, if playing a solo/assassin build (holosmith, mesmer, and thief are all good at this) would single out opponents and easily defeat them. This causes the opposing team to stagger, which makes it all that much easier for him to control at a choke point shared between multiple nodes. The remaining 1400 players can zerg around and overwhelm their opponents.

If you're looking for a theory with some weight behind it, try this:

HoT and PoF have introduced many builds with a low skill threshold. If you can hit buttons quickly, you can do decently - mechanical skill has dramatically decreased as a discriminator. Further, fight/run decisions and map strategy (rotation) are considerably advanced skills. This causes a large number of players to sit at ratings just below the players who do have the fight/run and map-strategy skill set. The rating system tries to smooth them into a normal distribution, but because there are a lot of people with little skill difference between them, there is significant volatility. If you're above that threshold and have a bad day, you're stuck with that fickle group, and luck in matchmaking can pull you down. This is especially true if you play a role which needs teamplay to succeed.

So, I thought it would be a nice thing to do if I accounted for 5v5 fights instead of 1v1.

Do you see that kitten? How the heck am I supposed to evaluate the win probability in a chaos like this?

(And win probability is THE MOST crucial part of the algorithm, because otherwise how should it converge to your "true skill level"?)

Is there any limit on the expanded search range? Have you experimented with that?

Perhaps lowering the search time to expand to about 2-3 minutes, BUT having a hard maximum search range of +/- 50 or 100 would help?

Cooking the books to force it to 0 deviation isn't realistic statistics. You never know something with 100% certainty :-)

Something still looks off. I don't get wild swings of deviation or rating like that once it settles. Once you get about 20-30 matches in, the +/- is about 12 points per game. I'm not summing matches and calculating a new rating, as this isn't done like a chess tournament: after every match I calculate a new rating and deviation for the players. It is an assumption, but maintaining a queue of match history for tens to hundreds of thousands of players seems like a waste of computing resources. The main reason they did that was so you didn't have to repeat the iterative part of the calculation if you do it by hand. With a computer that is trivial.
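The per-match approach described above - recomputing rating and deviation immediately after every single game, with no stored history - can be sketched like this (my simplified illustration: one Glicko-2 step with m = 1 per update, volatility held constant, names my own):

```python
import math

SCALE = 173.7178  # Glicko-2 conversion factor between rating points and mu

def g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def update_one_match(rating, rd, vol, opp_rating, opp_rd, result):
    """One Glicko-2 step with m = 1 (volatility held fixed for brevity)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
    e = 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))  # expected score
    v = 1.0 / (g(phi_j) ** 2 * e * (1.0 - e))
    phi_star = math.sqrt(phi ** 2 + vol ** 2)
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * g(phi_j) * (result - e)
    return mu_new * SCALE + 1500.0, phi_new * SCALE

# Per-match bookkeeping: each game immediately yields a new rating/RD,
# so no match-history queue is ever needed.
r, rd = 1500.0, 350.0
for result in [1, 1, 0, 1, 0]:
    r, rd = update_one_match(r, rd, 0.06, 1500.0, 100.0, result)
print(round(r, 1), round(rd, 1))
```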

Thanks for the info!

Based on that I took a stab at definitions on the wiki for the following code:

`<Rating start="5m" end="10m" max="1200" min="25"/>`

https://wiki.guildwars2.com/wiki/PvP_Matchmaking_Algorithm

Filter/Rating/@Min - The maximum rating difference between rosters the filter starts at.

Filter/Rating/@Max - The maximum rating difference between rosters that can exist after padding is applied.
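One plausible reading of that config line, as a sketch (the linear widening between `start` and `end` is my assumption; the snippet only fixes the endpoints):

```python
# Hypothetical interpretation of <Rating start="5m" end="10m" max="1200" min="25"/>:
# the allowed rating gap between rosters is 25 until 5 minutes in queue,
# then widens to 1200 by the 10-minute mark. Linear growth is assumed.
START, END = 5 * 60, 10 * 60   # seconds in queue
GAP_MIN, GAP_MAX = 25, 1200

def allowed_gap(seconds_in_queue):
    """Allowed rating difference between rosters at a given queue time."""
    if seconds_in_queue <= START:
        return GAP_MIN
    if seconds_in_queue >= END:
        return GAP_MAX
    frac = (seconds_in_queue - START) / (END - START)
    return GAP_MIN + frac * (GAP_MAX - GAP_MIN)

print(allowed_gap(300))   # 25 - at 5 minutes
print(allowed_gap(600))   # 1200 - at 10 minutes
```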

Wait, I didn't get it: in your code you don't feed the history of player's previous matches to glicko?

But that is just simply wrong!

Even in the pdf the author does the example run for 3 matches.

Logic is: the better the algorithm knows the history, the more precise it is.

I'm fairly certain that the rating adjustment is player rating vs. averaged team rating. Several seasons ago I did a few ranked games with a friend. I'm in platinum, he's somewhere around silver/gold. My rating adjustments were tiny for a win compared to me playing in platinum; his were huge. For a loss, mine were huge and his were tiny. A player vs. 5 players setup could also produce this, but ANet would have to do something to account for the magnitude of adjustment, which I find unlikely.

The history is built into the rating and deviation numbers. The history you are referring to is to minimize the number of times you have to do the iterative calculation if you are doing it by hand.

Each match would still be evaluated once.

The term for increasing volatility due to inactivity is disabled, so the period isn't of much use.

Do any of these models take into account a player's skill improving over time? Because of course it will. You learn stuff.

Of course. However, only the relative skill improvement. So if everybody else improves at the same rate, you remain at your current rating.

The lower limit of RD exists to account for skill changes throughout the season, as well as balance patch changes and stuff like that. It does not fix your rating too quickly (and therefore too precisely), so things can still change.

No, my code doesn't account for it. But this code and this entire thread are mostly dedicated to one problem: winstreaks and losestreaks.

Which tend to happen at MUCH shorter intervals than it takes a player to learn his stuff.

I mean, you have a lose streak of 5-10 games, then a winstreak 5-10 games.

I really doubt the player can improve his skills any faster than 100-200 games. Therefore, the effect is absolutely insignificant.

Well, this is just plain wrong. Did you really read the paper?

"

m opponents with ratings" or "mu1, mu2, ... mumscores against"EACHopponentCan you see capital greek Sigma letter? With

"j=1"below and"m"above?Do you know what this means?

I'm just asking, though. Probably I have misunderstood you.

But the RESULTS of the matches vs the m previous opponents are DEFINITELY taken into account. The results of the matches - that is what I call "match history".

Please tell me, if I'm still unclear.

Oh.. of that I'm fairly certain as well. Perhaps with some lowering coefficient, but yeah, I've been in that situation, where I lose 15 and friend loses 8.

What I was talking about IS NOT the "win probability" from Glicko, which is required for the rating update. No.

I meant the REAL win probability. Why is it not the same? Because Glicko takes your (and your opponents') current rating for the calculation, which is likely not exactly your real rating - especially if the season has just begun.

I.e., the dude who was 1900 last season plays a game with 9 scrubs who were 800-1300 last season. However, on paper, EVERYONE'S rating might be 1200 (first game of the season for all 10 ppl). What Glicko will calculate in this situation is obvious: it'll just take all those 1200 ratings, do its magic, and BOOM - everyone's equal, the win probability is 50/50.

But is it true? No. So, what was the ACTUAL win probability for that game?
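To put numbers on that gap, here is a small illustration using the standard Elo-style expected-score formula that Glicko variants build on (my illustration, not the game's code; the 1900 and ~1000 true-skill values are just the hypothetical players from the example above):

```python
def expected_score(r_a, r_b):
    """Elo/Glicko-style expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# On paper: everyone starts the season at 1200, so Glicko sees a coin flip.
print(expected_score(1200, 1200))   # 0.5

# In reality: a 1900 true-skill player against a ~1000 true-skill scrub.
print(round(expected_score(1900, 1000), 3))   # ~0.994
```

So the "on paper" win probability is 50%, while the actual one is near-certain - which is exactly why the first games of a season look nothing like coin flips.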

> @Tiah.3091 said:

There is nothing that says m needs to be greater than 1. That portion of the calculation is inside an iterative loop where you need the answers to converge. If you are calculating the iterative portion by hand, it is more convenient to sum up several matches and do the iterative portion once.

It doesn't say to treat the summation portion like a FIFO queue where each match is evaluated m times.