TV++ for Inducement – Will It Work?

In a recent post on the Cyanide Forums, plasmoid asked me whether TV++ can realistically be expected to work as a method for determining who gets inducements and how much they get. My answer to this is: eventually, yes. Now, conditional answers probably don’t fill you with confidence, and you should certainly want me to elaborate on “yes” rather than just taking my word for it, so let’s get to that.

First, I could say “yes” pretty confidently if we were talking about using TV++ in a brand-new matchmaking environment, one in which teams are created after the matching system is implemented. In such a system, a given team’s TV++ value would reach its relatively stable point (there’s no such thing as perfect stability in a game whose outcome is so heavily based on chance) as quickly as could ever be expected.

TV++ is self-correcting across multiple matches: if it gives one team too much power in terms of inducements (which is unlikely to happen spontaneously, given what we know of inducements, but that’s not relevant here), the TV++ underdog winning the match will bring the two teams much closer together in TV++ for the next match. If those two teams faced each other over and over, without facing anyone else, their TV++ values would eventually reach a point where the TV++ underdog was getting only enough inducements to give them a roughly 50-50 chance of winning.
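To make that feedback loop concrete, here’s a toy simulation of two teams playing each other repeatedly. Everything specific in it is an assumption for illustration only: the rosters, the idea that inducements deliver half their nominal value in real strength, and the win-probability model are all stand-ins; only the 90k-per-zSum figure comes from the discussion here.

```python
import random

# Toy model of the TV++ self-correction loop. All constants below are
# illustrative assumptions, not real Blood Bowl numbers.
random.seed(42)

PER_ZSUM = 90_000

def tvpp(tv, zsum):
    return tv + PER_ZSUM * zsum

tv_a, tv_b = 1_500_000, 1_000_000   # team A has the stronger roster
z_a = z_b = 0                       # zSum = wins - losses
results = []

for _ in range(500):
    # Inducements worth the TV++ gap go to the TV++ underdog; assume
    # they deliver half their nominal value in real playing strength.
    inducement_gap = tvpp(tv_a, z_a) - tvpp(tv_b, z_b)
    effective_edge = (tv_a - tv_b) - 0.5 * inducement_gap
    p_a = min(0.95, max(0.05, 0.5 + effective_edge / 500_000))
    if random.random() < p_a:
        z_a += 1; z_b -= 1; results.append(1)
    else:
        z_a -= 1; z_b += 1; results.append(0)

# A wins heavily at first, but each win widens the TV++ gap and so the
# underdog's inducements, pushing the matchup back toward 50-50.
late_rate = sum(results[100:]) / len(results[100:])
print(round(late_rate, 2))  # hovers near 0.50 once TV++ has adjusted
```

The point of the sketch is the shape of the dynamic, not the numbers: any win by the favourite increases the underdog’s inducements next time, so the system hunts for the 50-50 point on its own.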

Second, I can say “yes, eventually” if we’re talking about converting an existing environment to a TV++ matching environment without using fresh teams. The time it takes a given team to reach a stable-ish TV++ value in an existing environment will depend quite a bit on how old the team is, and how far its initial zSum is from zero. All matching systems implicitly assume that the rating used for matching was achieved under that system; when that’s not the case there will be an adjustment period during which the errors (or differences) of the past system are ironed out. The worst hit during the adjustment period would, of course, be teams that had been doing unrealistically well under the old system, while teams that had been ground under everybody’s boots will suddenly have a string of very strong games.

What’s important to understand, though, is that the entire environment is a large-scale version of the two teams above playing each other over and over: TV++ will stabilize eventually for everybody who continues to play in the environment. Even if it dishes out a few rough matches during the adjustment period, those will pass in favour of increasingly appropriate challenges, even for teams that played in the environment before the conversion.

Another thing that may not be obvious on first read is that the 90k value for zSum is not absolutely key to the working of the TV++ system. You can apply nearly any TV-equivalency value to zSum and the system will still work; 90k just happens to be the lowest TV value that maintains the maximum predictive power in OCC (not a matchmaking environment), so it was used as the base. You could set it to 50k, or 150k, and it would still work fine in the long run. The difference between an arbitrary number and a finely tuned value is only how quickly TV++ stabilizes, and how resistant it is to fluctuation; the long-term effect is the same either way. Each game played brings those two theoretical teams <x> TV++ closer to one another, until the inducements given to the TV++ underdog are strong enough to give them a fighting chance of winning the match, but not an advantage. Regardless of what value <x> is, that will happen eventually.
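For reference, the TV++ calculation itself can be sketched as follows. The function name and the example records are mine; the 90k default is the tuned base discussed above, and as noted, any constant in its place preserves the long-run behaviour.

```python
# Hypothetical sketch of the TV++ calculation: base Team Value plus a
# tunable per-point cost applied to zSum (wins minus losses).
def tv_plus_plus(tv, wins, losses, per_zsum=90_000):
    zsum = wins - losses
    return tv + per_zsum * zsum

# A 1,200k team at 7W/3L vs a 1,500k team at 2W/6L: the nominally
# "weaker" roster is the TV++ favourite, so its opponent is induced up.
print(tv_plus_plus(1_200_000, 7, 3))  # 1560000
print(tv_plus_plus(1_500_000, 2, 6))  # 1140000
```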

Hopefully that helps you understand the answer.  If you have more questions, you can fire me an email, PM, or post to the thread linked to above.

A Few Other Matching Ideas

Given that we found that WF (an arbitrary name, by the way), defined as wins minus losses, seems to be the best method for creating reasonably even match-ups between teams in a matchmaking league, based on both FUMBBL Box and OCC League data, the question was asked: what about other performance-based systems? WF is equivalent to a league scoring system of 1W/0D/-1L, which is not as popular as points systems like 3W/1D/0L, or the win% system, which treats draws as half a win. So let’s take a quick look at how those fare compared to WF, which I now think would look cooler if we called it zSum (since the score across an entire league should average to exactly 0).
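To pin down what’s being compared, here’s a sketch of the three scoring systems applied to a match record given as a string of W/D/L results. The function names and the example record are mine:

```python
# Three ways of scoring the same record ('W', 'D', 'L' per match).
def zsum(record):            # 1W / 0D / -1L: wins minus losses
    return record.count('W') - record.count('L')

def league_points(record):   # 3W / 1D / 0L: the common table score
    return 3 * record.count('W') + record.count('D')

def win_pct(record):         # draws count as half a win
    return (record.count('W') + 0.5 * record.count('D')) / len(record)

bad_team = 'LLLLWLLLDL'      # 1W 1D 8L
print(zsum(bad_team))          # -7: clearly negative
print(league_points(bad_team)) # 4: keeps rising over time, never negative
print(win_pct(bad_team))       # 0.15
```

Note that only zSum can go negative: a losing team’s zSum keeps falling, while its league points keep creeping upward, which is exactly the cumulative-versus-performance distinction discussed below.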

So, first off, let’s take a look at the 3/1/0 system, which I’m going to abbreviate as LP (league points):

Box LP: r = 0.102, p < 0.01, n = 137448
OCC LP: r = 0.094, p < 0.01, n = 26450

So… that sucks. It’s better than TV difference in Box, but worse than even TV difference in OCC, and certainly a whole lot worse than zSum at predicting match outcome, which is unsurprising. Why? Well, LP is a cumulative system where your “rating” climbs over time unless you lose every game. It’s more of a qualifying stat than a performance stat: if you’re absolutely terrible at the game, you won’t be put up against worse teams over time… you’ll just rise slower.

Next, I decided to take a look at relative win% as a predictor of match outcome. I started with OCC:

OCC w%: r = 0.228, p < 0.01, n = 26450

Not terrible, but worse than ELO and a lot worse than zSum. I then saved my OCC data over my Box data, so… I guess we won’t be looking at win% in Box, though we can pretty much predict that it would have similar power (with the exception of TV++ at certain TV levels, all the predictive measures have been about the same in both environments). I’ll check eventually, when I reconstitute my data.

So, zSum is still the champion method for deciding matches in a matchmaking environment.  Too bad nobody uses it (yet)!

Matching by Numbers

One of the things we’ve seen from past analyses is that TV seems to be the best matching criterion of the variables we have available to analyze – but is it really the best matching system? Can it be improved by somehow including other variables in the matching system?

First off, it’s important to declare what we mean by “best”, and in this case it’s “most even match-up” in terms of the match outcome. We basically want a system that matches two teams for a match such that they have as equal a chance of victory as possible, statistically. I’ve been told this isn’t necessarily the most important thing… but let’s assume that, for the most part, people tend to enjoy games where they are challenged, but not thoroughly disadvantaged.

Now, another thing new to our analyses is that we’ve got match-level data from the OCC league, a perpetual-play scheduled-match league NOT matched by TV. This gives us some extra perspective, so we will look at both FUMBBL Box (TV-based MM) data and OCC data (separately) to see what differences and similarities emerge when using various variables for match outcome prediction.

Before continuing, let me state that for these analyses I use “mirror matches”, which is to say any given match is actually two datapoints, each one being the perspective of one team in that match. Why do this rather than treat each match as a single datapoint? A given match has different results for each of the teams involved, so we can look at outcomes from the standpoint of losers or winners, and we can create more detailed models with greater ease. Ok… onward.
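Concretely, the mirroring step looks something like this. The record layout and field names here are illustrative, not the actual dataset’s schema:

```python
# Expand one recorded match into two datapoints, one per team's
# perspective, with a 1 / 0.5 / 0 outcome value for win / draw / loss.
def mirror(match):
    # match: (team_a, team_b, tv_a, tv_b, score_a, score_b)
    a, b, tv_a, tv_b, sa, sb = match

    def outcome(own, opp):
        return 1.0 if own > opp else 0.0 if own < opp else 0.5

    return [
        {'team': a, 'tv_diff': tv_a - tv_b, 'win': outcome(sa, sb)},
        {'team': b, 'tv_diff': tv_b - tv_a, 'win': outcome(sb, sa)},
    ]

rows = mirror(('Orcs', 'Elves', 1_100_000, 1_000_000, 2, 1))
print(rows)  # one winner's-perspective row, one loser's-perspective row
```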

Let’s first recap our comparison of TV difference and games-played difference in terms of predicting (correlating with) match victory:

Box TV – r = 0.085, p < 0.01, n = 137448
Box GP – r = 0.075, p < 0.01, n = 137448

OCC TV – r = 0.167, p < 0.01, n = 22990
OCC GP – r = -0.038, p < 0.01, n = 26450

As seen previously, TV is a better predictor of match outcome even in a TV-matched environment (which, to some degree, is already controlling for that variable), and this becomes even more obvious when we look at a non-TV-matched environment. Games played actually correlates negatively with match outcome in OCC… so… let’s just toss that puppy out for now.

One thing the OCC data does have is a team’s ELO rating at the time of a match, so let’s look at ELO difference as a predictor:

OCC ELO – r = 0.285, p < 0.01, n = 26450

Well then… it looks like ELO is better than TV at predicting outcome in the non-TV-matched environment. Maybe that’s the answer? Let’s look at some more candidates…

One thing that has been speculated is that Fan Factor could be incorporated in some way. Given that FF is already part of TV, there will certainly be some interaction between TV and FF. Add to this the fact that FF at time of match is not in any of the datasets, and not much has been done with FF thus far. Let’s assume, for the sake of analysis, that most teams will not be buying FF at creation. The Box data includes the change in FF from each match, so we can estimate a team’s FF going into any match by accumulating those changes from its earlier matches.

Box FF – r = 0.175, p < 0.01, n = 137448
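The estimation step can be sketched like this, assuming (as above) that teams start at FF 0 and that the data supplies a per-match FF change:

```python
# Estimate the FF a team holds entering each match by accumulating the
# recorded per-match FF changes, starting from an assumed FF of 0.
def estimated_ff(ff_changes):
    ff = 0
    for change in ff_changes:
        yield ff          # FF entering this match
        ff += change      # apply this match's recorded change

changes = [1, 0, 1, -1, 1]            # per-match FF change from the data
print(list(estimated_ff(changes)))    # [0, 1, 1, 2, 1]
```

The starting-at-zero assumption is the weak point: any team that did buy FF at creation is underestimated by a constant, though a constant offset doesn’t affect a correlation on the *difference* between two teams’ estimates.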

So, using estimated FF, we see that the difference in FF between two teams is a much stronger predictor of match outcome than TV in a TV-matched environment. FF is included in TV (at 10k per FF), so can we make TV “better” by increasing the amount of TV contributed by FF? Turns out we can! I’ll save you the progression, but the optimal per-FF cost, as far as making TV best predict match outcome, is 60k per FF. The resulting “TV Plus” difference, applied to Box, gives us:

Box TV+ – r = 0.187, p < 0.01, n = 137448

That’s even better, especially when you consider the very low granularity of the match outcome variable (there are only 3 possible states).
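The cost sweep I skipped over can be sketched like so. The dataset here is tiny and synthetic, with a continuous outcome constructed to be exactly TV plus 60k per FF, so the sweep recovers 60k by design; real outcomes are W/D/L, so this only illustrates the search itself:

```python
from math import sqrt

# Pearson correlation, computed by hand to keep the sketch dependency-free.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Try each candidate per-FF cost and keep the one whose adjusted-TV
# difference correlates best with outcome.
def best_ff_cost(matches, costs):
    # matches: (tv_diff_excluding_ff, ff_diff, outcome) triples
    return max(costs, key=lambda c: pearson(
        [tv + c * ff for tv, ff, _ in matches],
        [out for _, _, out in matches]))

# Synthetic data where outcome = tv_diff + 60k * ff_diff by construction.
data = [(tv, ff, tv + 60_000 * ff)
        for tv, ff in [(50_000, -2), (-30_000, 3), (10_000, 1),
                       (-80_000, -1), (0, 4)]]
print(best_ff_cost(data, range(10_000, 100_001, 10_000)))  # 60000
```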

But let’s think about what FF is for a second. FF has a minimum value of 0 and a maximum value of 18, and there is only a *chance* of FF changing each match. What if we cut out those minimums and maximums and removed the random element… might we have a variable that is even better? Enter WF, or “Win Factor”, which is simply “wins – losses” for the team at the time of the match:

Box WF – r = 0.285, p < 0.01, n = 137448

Cripes, that’s quite a jump! It’s better than relative FF, and even better than optimizing the cost of FF’s contribution to TV. So then, let’s see about using WF as part of TV to improve things further. The optimal cost ends up being 30k per WF:

Box TV++ – r = 0.291, p < 0.01, n = 137448

That appears to be the best we can get. We notice, however, that TV++ is not all that much better than relative WF. Let’s look at TV++ and WF for teams at or below the average number of games played, which is 10.

Box TV++ <= 10GP – r = 0.348, p < 0.01, n = 70058
Box WF <= 10GP – r = 0.348, p < 0.01, n = 70058

In the range where the most games are played, WF and TV++ have equal power in terms of predicting match outcome. Let’s look at matches where TV was 1250 or less (a recent subset used in the discussion of minmaxing):

Box TV++ <= 1250TV – r = 0.401, p < 0.01, n = 51549
Box WF <= 1250TV – r = 0.409, p < 0.01, n = 51549

WF alone is stronger than the combination of WF and TV in this case. There’s certainly some interaction, though: WF correlates with FF, FF is part of TV, and the environment we’re looking at is, to a degree, controlling for TV. Let’s look at the OCC data, calculating both TV++ and WF:

OCC TV++ – r = 0.193, p < 0.01, n = 26450
OCC WF – r = 0.407, p < 0.01, n = 26450

In OCC, TV++ is not that great a predictor (well, it’s better than TV alone, or games played), being less predictive than ELO. WF, on the other hand, is stronger than anything, and in Box WF is also highly predictive (more so than TV, adjusted or otherwise).

ERRATA: I calculated TV++ wrong when I ran these tests, having not noticed that OCC uses a different scale for TV than the Box data does. TV++ is actually superior to ELO in OCC at the 30k mark, and superior even to WF/zSum at the 90k mark.

So, that pretty much gives us our conclusion. We can greatly improve perpetual-play matchmaking by performing the matches on what we’ve found to be the most predictive variable for match outcome:

WF = wins – losses

Match by “Games Played” – Analysis

One of the persistent suggestions from certain segments of the community has been that online matchmaking would somehow be better off if matches were arranged based on number of games played rather than similar Team Values. The logic is that this would better simulate scheduled league play (where teams typically play the same number of games, due to the schedule) and where (supposedly) none of the problems that people feel exist in matchmaking environments show up.

For our analysis of this idea, we’ll use the FUMBBL BlackBox data. The B league on FUMBBL is a matchmaking environment based exclusively on similar TVs, with no ability to refuse the matches you are assigned. Each match is assigned a “win” value of 1 for a win, 0 for a loss, and 0.5 for a draw (as per the BBRC’s win% calculation). Due to the very low granularity of those values, we can expect to see small r values in our correlations without those small values meaning the results are of low practical significance. Luckily, we’re really just looking at the relative strength of TV versus games played as a predictor of a given match’s outcome.

Test 1 – overall prediction

First, let’s look at how well relative TVs and relative team ages correlate with match outcome. This means we’re not breaking the matches down into any specific TV ranges; we’re just looking at all matches played within the Black Box division.

Relative TV:  r = 0.085, p < 0.01, N = 137448
Relative Age: r = 0.075, p < 0.01, N = 137448

If we control for the effect of each on the other (TV and team age themselves correlate, for obvious reasons), we find that TV difference remains the stronger predictor of match outcome, by a slightly wider margin:

Relative TV:  r = 0.076, p < 0.01, N = 137448
Relative Age: r = 0.065, p < 0.01, N = 137448

Result:  Across all TV levels, TV difference is a better predictor of match outcome than relative games played.
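For the curious, “controlling for the effect of each on the other” is a first-order partial correlation, which removes the third variable’s shared influence from both sides of a pairwise correlation. A sketch, using the overall figures quoted above; note the 0.2 TV-age intercorrelation is an assumed value for illustration, since the actual figure isn’t quoted here:

```python
from math import sqrt

# First-order partial correlation r(x, y | z), built from the three
# pairwise correlations.
def partial_r(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# TV-vs-outcome (0.085) controlling for team age (age-vs-outcome 0.075),
# assuming a TV-age intercorrelation of 0.2 (illustrative only):
print(round(partial_r(0.085, 0.2, 0.075), 3))  # 0.072
```

With plausible intercorrelations, both partial values land a little below their raw counterparts, which matches the pattern in the controlled figures above.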

Test 2 – low games played (where a team has 10 or fewer games played)

Next, let’s look at how the measures predict match outcome for a team that has 10 or fewer games under its belt. We’re not going to limit the number of games the opponent has played, because we want to allow for the supposed effects of “minmaxing”: low-TV, high-games-played teams that are well developed but kept at a low TV to abuse new teams.

Relative TV: r = 0.076, p < 0.01, N = 70053
Relative Age: r = 0.074, p < 0.01, N = 70053

If we control for each in the calculation of the other, the gap widens slightly:

Relative TV: r = 0.067, p < 0.01, N = 70053
Relative Age: r = 0.064, p < 0.01, N = 70053

Result:  Close, but TV difference remains the stronger predictor of match outcome, even when we allow for scenarios where, for example, one team has played 3 games and the other 300.

Conclusion

Both across the entire dataset and when looking at matches involving teams with a low number of games played, TV difference remains the stronger predictor of match outcome. Limiting our view to matches played by teams with 10 or fewer games covers both the mean and median number of games that teams in that environment play (10 and 5 respectively). The data certainly shows that the likelihood of a win tends to decrease as the TV gap between a team and its opponent grows (in the opponent’s favour), and as the gap in games played grows (in the opponent’s favour), but TV difference appears to have more effect than team age.

Given this, there does not seem to be a strong case for matching by “games played” as a superior method to matching by TV similarity. If anything, it would result in less equal match-ups.