Machine Learning And Big Data Is All Just Fun And Games

Was your bracket busted? Maybe predictive analytics can help! Or maybe it can just offer you a fun way to learn something new! (Photo credit: Wikipedia.)
Analytics and machine learning are increasingly relevant tools in the software professional’s toolbox. Statistics and probability now play a greater role than ever in our work.
Games can be a fun “gateway drug” for learning these techniques. In particular, the sports industry has recently turned to mathematics in the quest to gain a competitive advantage. Organizations like FiveThirtyEight, originally known for using polling data to make predictions about political races, now routinely predict outcomes for sporting events based on performance models and historical data.
For example, they predicted the outcome of every game of the NCAA Men’s and Women’s Basketball tournaments. Which raises a question.
How do they make predictions like this?
Start With What We Know
We’re going to answer that question step-by-step, using a process that works well for modeling problems like these. And along the way, we’re going to express our solutions in code.
The first thing we need to determine is what we can do with the data we have. In this case, we know the outcomes of all the games each team has played so far in a given season. For example, on the eve of the tournament, the favorite, the University of Kansas Jayhawks, had 30 wins against 4 losses.
Let’s assume that, over a large enough sample of games, we can use the ratio of wins to losses as a proxy for the odds that a team will win any given game against an average team. That is, we can say that Kansas was 30-to-4, or 15-to-2, odds to win against an average team. Of course, we know this isn’t entirely true. For example, perhaps Kansas’ opponents are collectively better or worse than average. But we can use this as a first pass approximation.
Let’s code that up.
# odds of winning, given a [wins, losses] pair
odds = ([w, l]) -> w / l
What would the likelihood be that Kansas would defeat its first-round opponent, Austin Peay? Well, since Austin Peay had 18 wins against 17 losses, we could just use Kansas’ odds against an average team, since Austin Peay’s record suggests it was close to an average team.
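For instance, here’s a quick sketch using the records above:
# Kansas, pre-tournament: 30 wins, 4 losses
odds [30, 4]    # 7.5, or 15-to-2
# Austin Peay: 18 wins against 17 losses
odds [18, 17]   # roughly 1.06, nearly even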
But what about their second-round opponent, the 25-10 UConn Huskies? Heading into their match-up against Kansas, UConn would be 25-to-10, or 5-to-2, odds to win against an average team, well above average. How can we determine the odds of Kansas prevailing over UConn?
Apples To Oranges
The nice thing about odds is that they can be expressed as ratios. And we can triangulate ratios that express similar quantities. In this case, both ratios express the odds of winning against an average team. What we want is the ratio expressing the odds of one team beating the other. Which is simply the ratio of the two ratios.
To see this more clearly, consider how we might determine the ratio of apples to oranges, given that all we know is the ratios of apples and oranges to bananas. Suppose we know we have 3 apples for every 2 bananas and 4 oranges for every 3 bananas. The ratio of apples to oranges is simply:
assert = require "assert"
applesToBananas = 3/2
orangesToBananas = 4/3
applesToOranges = applesToBananas/orangesToBananas
assert.equal 9/8, applesToOranges
That is, there are nine apples for every eight oranges.
Similarly, we can simply divide the ratios of wins to losses for each team against an average team, to get the ratio of wins to losses against each other.
We can code this up pretty simply:
# odds of team A beating team B
# (where A and B are [w, l] pairs)
# is just the ratio of odds
win = (a, b) -> (odds a) / (odds b)
Check Your Model
Let’s do a sanity check. An average team will have the same number of wins and losses, and, thus, even odds. And so the ratio of wins to losses will always be one. The trivial case of computing the odds against an average team works, since dividing something by one returns itself.
# the average team wins as much as it loses
average = [1, 1]
# should be 2:1 odds
assert.equal (win [2, 1], average), odds [2, 1]
# should be 1:2 odds
assert.equal (win [1, 2], average), odds [1, 2]
For a team that never wins, the ratio of wins to losses is zero, so an opponent’s odds against it will always be infinite. And vice versa: against a team that always wins, an opponent’s odds will always be zero.
neverWins = [0, 1]
alwaysWins = [1, 0]
assert.equal (win [2, 1], neverWins), Infinity
assert.equal (win [2, 1], alwaysWins), 0
First Pass Approximation
Thus the odds of Kansas, now 31-4 after their opening-round victory over Austin Peay, defeating UConn, 25-10, were 31-to-10:
kansas = [31, 4]
uconn = [25, 10]
assert.equal 3.1, (win kansas, uconn)
That is, if the two teams were to play 41 games, we’d expect Kansas to win 31 times.
We can normalize odds into probabilities. At 31-to-10 odds, Kansas would win 31 out of 41 times, or about 76% of the time. Thus, Kansas had a 76% probability of defeating UConn.
# p returns probability from odds
p = (odds) -> odds/(1 + odds)
# convert to a percentage
pct = (n) -> n * 100
assert.equal 76,
Math.round pct p win kansas, uconn
This seems like a reasonable result. Kansas had won 89% of its games to that point, but UConn had won 71% of theirs, so we’d expect Kansas’ probability of beating UConn to be less than 89%, since UConn is better than an average team.
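In fact, we can double-check those season percentages with the functions we already have:
# Kansas won 31 of 35 games, about 89%
assert.equal 89, Math.round pct p odds kansas
# UConn won 25 of 35 games, about 71%
assert.equal 71, Math.round pct p odds uconn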
Refactoring
Refactoring is just as important, and possibly more so, when building models like this. In this case, let’s redefine our odds function to be the inverse of our probability function. The new function will take a probability instead of a pair. This allows us to define each team in terms of a single number: the probability that it wins against an average team.
# return the odds given a probability
odds = (p) -> p/(1 - p)
# win probability given wins and losses
wp = (w, l) -> p (w / l)
# define teams as a single number...
kansas = wp 31, 4
uconn = wp 25, 10
# make sure it all still checks out...
assert.equal 76,
Math.round pct p win kansas, uconn
Iterate
Once you have a first pass approximation, test it out and see if it’s good enough. In this case, we’ll use FiveThirtyEight’s predictions as a way to test our own. Their win probability for Kansas was 86%, ten points higher than ours. That suggests there’s still plenty of room for improvement!
Some possibilities include:
- Use points scored and allowed to determine expected wins and losses, instead of using wins and losses directly. This gives us more data to work with, and thus increases the accuracy of our results. (See the sketch after this list.)
- Take into account other factors that might influence the outcome, such as injuries, where the game is being played, and so on.
- Instead of using odds directly, we could use them as inputs to logistic functions to improve upon our approach here.
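To make the first idea concrete, here’s a minimal sketch of one well-known approach, the Pythagorean expectation, which estimates a team’s win probability from points scored and allowed. The exponent of 2 and the point totals below are illustrative assumptions, not tuned values:
# estimate win probability from points for (pf)
# and points against (pa); an exponent of 2 is the
# classic default, though tuned exponents fit better
pythagorean = (pf, pa, exp = 2) ->
  pf ** exp / (pf ** exp + pa ** exp)
# hypothetical season totals, for illustration only
estimate = pythagorean 2800, 2400
assert.ok estimate > 0.5    # roughly 0.58
Its output is an expected winning percentage, which could stand in directly for the value wp computes from raw wins and losses.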
Fun And Games Is Serious Business
This same basic technique is applicable to a wide variety of problems. For example, a variation of this technique is widely used in sociology.
We started with games and used that to add some new tools to our toolbox. I find it fascinating how often exploring a question that is interesting to me ultimately informs my approach to seemingly unrelated problems. And the more tools you have in that toolbox, the more likely you are to be able to choose the right tool for the job.
We’d love to hear about the tools you’re using. How did you first learn about them? Were you able to take something you’d learned in another domain and apply it? Have you struggled with an analytics problem? Did you prevail? If so, how? If not, what prevented you from succeeding? We’ll feature your stories in future posts about big data and machine learning.
Update: For those of you who might be wondering, yes, this is closely related to Bill James’ log5 method. However, the technique dates back much further, to Ernst Zermelo’s work in the early 20th century. Bill James was apparently unaware of that work, and his derivation is both fascinating and more empirical in nature.
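As a minimal sketch of that connection, here’s the log5 formula expressed with our probability-based team definitions; it agrees with the odds-ratio result we computed above:
# James’ log5: the probability that team A beats team B,
# given each one’s probability of beating an average team
log5 = (pa, pb) ->
  (pa * (1 - pb)) / (pa * (1 - pb) + pb * (1 - pa))
# agrees with our result of about 76%
ours = Math.round pct p win kansas, uconn
theirs = Math.round pct log5 kansas, uconn
assert.equal ours, theirs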