I've been playing around with Retrosheet to see how often a baseball team wins a game if it's, say, five runs ahead at the end of the fourth inning. I plan to get that up sometime during this long weekend, but while doing the study I found another interesting result.
Retrosheet's Play-by-Play files, along with the cwevent program from Chadwick, let you extract all sorts of information from almost every MLB game played between 1950 and 2010. From that data I extracted every game that was tied at the end of a half-inning, and figured out who eventually won. Then I counted up the number of times the home team won for each half-inning. The results are shown below:
[Graph: home-team winning percentage in games tied at the end of each half-inning]
The black diamond represents the situation at the start of the game, the red diamonds the situation where the game is tied in the middle of the inning, and the blue diamonds when it's tied at the end of an inning. We'll get to the error bars in a minute.
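For anyone who wants to reproduce the tally, here is a minimal Python sketch of the counting step. It assumes the cwevent output has already been boiled down to one record per game. The record layout (the `half_inning_scores` and `winner` keys) and the function name are made up for illustration; they are not something cwevent emits directly.

```python
from collections import defaultdict


def tally_tied_half_innings(games):
    """Count home wins among games tied after each completed half-inning.

    Each game is a dict with two hypothetical keys:
      "half_inning_scores": list of (away_runs, home_runs) after each
          completed half-inning, in order (top 1st, bottom 1st, ...)
      "winner": "home", "away", or None for a game called while tied
    Returns {(inning, when): [home_wins, decisions]}, where `when` is
    "middle" (tied after the top half) or "end" (tied after the bottom half).
    """
    counts = defaultdict(lambda: [0, 0])
    for game in games:
        if game["winner"] is None:
            continue  # throw out games that never reached a decision
        for i, (away, home) in enumerate(game["half_inning_scores"]):
            if away != home:
                continue  # only tied half-innings are of interest
            key = (i // 2 + 1, "middle" if i % 2 == 0 else "end")
            counts[key][1] += 1
            if game["winner"] == "home":
                counts[key][0] += 1
    return dict(counts)


# Two toy games, shortened for illustration
toy_games = [
    {"half_inning_scores": [(0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 2)],
     "winner": "home"},
    {"half_inning_scores": [(0, 0), (0, 0), (2, 0), (2, 0)],
     "winner": "away"},
]
toy_counts = tally_tied_half_innings(toy_games)
for (inning, when), (wins, decisions) in sorted(toy_counts.items()):
    print(inning, when, wins, decisions)
```

Dividing `home_wins` by `decisions` for each key gives one point on the graph: the "middle" keys are the red diamonds and the "end" keys are the blue ones.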
So what is all of this? Well, at the beginning of a game the score is obviously tied, so that should be part of the study. So if we look at all 115,748 games in the database, we find:
- The Home Team won 62,418 games,
- the Visiting Team won 53,192 games, and
- there were 138 games that were tied when the game was called.
If we throw out the ties, then the Home Team won 53.990% of the games that went to a decision. That's the black diamond at the far left of the graph. The 54% win rate is baseball's version of Home Field Advantage, and it has been very constant:
[Graph: home-team winning percentage]
I performed the same calculation for all the games which were tied at the end of a half inning. For example, if the game is tied at the end of the fifth, the home team has a 52.0% chance of eventually winning the game. Tied after the top of the sixth? It's up to 60%.
So what do the error bars represent? Basically they give you an idea of the number of games in the sample. Suppose that in a given game the home team wins with probability p. Then in an N-game sample the probability that the home team wins n games follows the binomial distribution, namely
$$P(N,n) = \frac{N!}{n!\,(N-n)!}\, p^{n} (1-p)^{N-n}$$
If we look at a large number of N-game samples, then we'll find that on average the home team will win N p games, which makes sense. The standard deviation will be $\sqrt{N p (1-p)}$. Since the graph normalized everything by the number of games played, the error bars are the standard deviation divided by N, or $\sqrt{p (1-p)/N}$. Since most values of p are between 0.5 and 0.7, wider error bars basically tell you that fewer games have gotten to that point. And when the error bars get really wide, as they do after the fourteenth inning or so, it says there aren't enough games in the sample to give you meaningful information.
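As a quick check on the formula, here is a small Python sketch that computes the win rate and its error bar from the start-of-game totals quoted earlier (62,418 home wins and 53,192 visitor wins). The function name is just for illustration.

```python
import math


def home_win_rate(home_wins, away_wins):
    """Return (p, sigma): the home share of decisions and its binomial error bar."""
    n = home_wins + away_wins          # games that went to a decision
    p = home_wins / n                  # observed home-team win rate
    sigma = math.sqrt(p * (1 - p) / n) # standard deviation divided by N
    return p, sigma


p, sigma = home_win_rate(62_418, 53_192)
print(f"{100 * p:.3f}% +/- {100 * sigma:.3f}%")
```

With more than 115,000 decisions the error bar comes out to roughly 0.15 percentage points, which is why the black diamond's error bar is so small.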
What does it all mean, you ask? Well, first it says that the home team advantage is real. Why there is a home field advantage is another question, and there is not enough information here to answer that question.
Then there's the observation that the home team has a larger advantage if the game is tied in the middle of the inning than it does if the game is tied at the end of an inning. That's just common sense. In the middle of the fourth inning, the visiting team has five more innings at the plate. The home team has six: five for sure, and one more if they need it. This isn't the "home team bats last" advantage, it's the "home team gets one more at-bat than the visitors" advantage, which is not the same thing.
Next, we see that if the game is tied at the end of an inning, the home team's advantage decreases slightly, so that at the end of eight innings it's only 51.94%. Presumably that's because the home team does have some advantage in being at home, but as the game progresses they have less and less chance to use that advantage. After the fifth inning the home team's advantage oscillates around 52%, down from the 54% advantage they had at the start of the game.
Indeed, the fact that the blue dots go down from innings 1-4 suggests that the "home team bats last" advantage isn't worth a whole lot. If it were, you'd expect the advantage to be greater in tie-game situations as the game wears on, because that last at-bat becomes a larger and larger proportion of what's left of the game.
Finally, there is that dip in the red diamonds between the first and second innings. If the game is tied going into the bottom of the first, meaning that the visiting team didn't score, then the home team will win 59.15% of the time, with a standard deviation of 0.17%. If the game is tied going into the bottom of the second, however, then the home team only has a 58.01% chance of winning, with σ = 0.22%. That's a drop of 1.14 percentage points, roughly five times the second-inning σ, which I would think is statistically significant. The win rate is pretty much constant in the third inning (58.16% ± 0.27%) and then starts going up, as you'd expect, since the home team has proportionately more at-bats than the visitors at the middle of an inning.
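How big is the dip in units of σ? That depends on which σ you divide by. The sketch below does the arithmetic two ways using only the numbers quoted above: against the second-inning σ alone, and against the two error bars added in quadrature. The quadrature version assumes the two samples are independent, which is my assumption and is not strictly true here, since many of the same games show up in both samples.

```python
import math

# Figures quoted in the text, in percent:
# tied in the middle of the 1st vs. tied in the middle of the 2nd
p_first, sigma_first = 59.15, 0.17
p_second, sigma_second = 58.01, 0.22

drop = p_first - p_second

# Measured against the second-inning error bar alone
z_single = drop / sigma_second

# Assumption: independent samples, so the error bars add in quadrature
z_combined = drop / math.hypot(sigma_first, sigma_second)

print(f"drop = {drop:.2f} points")
print(f"vs. second-inning sigma: {z_single:.1f} sigma")
print(f"vs. combined sigma:      {z_combined:.1f} sigma")
```

Either way the drop is several σ, so the conclusion that the dip is probably real doesn't hinge on the choice.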
Why is this so? I have no idea. All I can say is that if you're a visiting baseball team, it's better to be tied with the home team in the middle of the second inning than it is to be tied before the home team comes to bat. And you'd better be ahead by the middle of the third.
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.