We've all been there. The home team drops seven runs in an early inning. By the seventh-inning stretch they've shown no sign of an offense, and are still six runs down. It's a hot, muggy Washington night, there are thunderstorms brewing over the horizon, and your wife just phoned that she heard on WTOP that Rt. 50 to Annapolis was closing for repair work at 10 p.m.
Question: If I go home now, am I likely to miss anything? Aside from heat stroke and road rage?
Answer: Probably not. If the visiting team is ahead by six runs after the top of the seventh, historically the chance that the home team will pull out a victory is 1.6%, or about 60 to 1 against. Unless you're of the extreme optimist persuasion, I'd suggest going home.
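For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion from a win probability to odds against. The function name `odds_against` is just an illustrative choice.

```python
def odds_against(p):
    """Convert a win probability p (0 < p < 1) to 'x to 1 against' odds."""
    return (1 - p) / p

# A 1.6% chance of a comeback works out to roughly 60 to 1 against.
print(round(odds_against(0.016), 1))  # 61.5
```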
I got to thinking about this a few years ago, when Bill James published an article in Slate on when a college basketball game is really over. Based on his observations, he was able to come up with an algorithm which predicts when a team has a safe lead, based on the lead, time left in the game, and who has the ball.
In baseball we can do the same kind of thing, except that there are only a finite number of logical stopping points (the end of a half-inning) and leads (the largest of which was less than 30 runs). Plus, we have a line score for just about every major league baseball game played since 1900, so we have a lot of data. This means that we don't need no stinkin' algorithm: we can give you the historical probability that any given lead was overcome.
I didn't go all the way back to 1900. I stopped at 1948, because that is as far back as the data available in Retrosheet's Play-by-Play Files went. With the Chadwick Software Tools we can go through all the games in the database and see how many times, say, the visiting team was ahead by five runs after the end of the first, and how often the home team won in that situation. Do that for all possible combinations of leads and innings and we get the table below. (You may have to widen your browser window to see everything.)
|Visitors Lead||Tie||Home Lead|
So what is all of this?
- The left hand column represents the situation at the end of either the top (visitor's) half of an inning, or the bottom (home team's) half, so T 7 describes the situation just before the Cubs let that day's designated karaoke singer ruin Take Me Out to the Ballgame. Since all innings after the ninth are played under the same conditions — the team ahead after the bottom of the inning wins, and if it's tied we try, try again — I combined all the data from those games into the labels T 9+ and B 9+.
- The other columns represent the lead by either the home team (on the right) or the visiting team (left) at the end of the half-inning. Ties are right in the middle.
- The decimal fraction in each block indicates the home team's chance of winning in that situation. So, for example, if you've just sat through an agonizing top of the third, when the Yankees have scored a bunch of runs and lead by 7, we look at row T 3, column Visitors 7, and see that the Orioles have a 0.043 (4.3%) chance of winning the game. No, that's not true. Over the course of the last 60 years, 4.3% of all major league home teams down by 7 going into the bottom of the third have come back to win. These are the Orioles, however, so they have about a 0.001% chance of coming back.
- I put all leads of ten or more runs in the 10 + categories. The software I wrote to write the table is easy to modify to list bigger leads, if you like. I stopped at 10 because that will more or less fit on a standard blog page.
- The blank spaces are impossible situations. The home team can't score before it gets up to bat, so the right-hand side of the T 1 row can never be reached. And the home team doesn't need the bottom of the ninth or later inning unless it was tied or behind after the top of the inning. In that case they can never win the game by more than four runs.
- Indeed, I debated about putting in the B 9+ row, since it's a trivial case, but I wanted to highlight the fact that the home team still has a big advantage if the score is tied in late or extra innings. I've discussed this elsewhere.
- Finally, I haven't told you about the statistical significance of these results, i.e., the standard deviation or the sample size. Suffice it to say that there are more than enough events here for leads of nine runs or less. When we get to the 10+ category, especially in the early innings, there just aren't that many games. If you want to see the raw numbers (home wins/visitor wins/total games) drop me a line and I'll send them to you.
- Really last finally: I made a slight modification to the Chadwick source code to make it easier to parse the runs scored per inning. I also wrote a Perl script and some Fortran code to parse the output from Chadwick and write the HTML for the table above. If you're interested in the code, drop me a line. If there's enough interest I'll make all of it available on my website.
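If you'd rather try the counting yourself, here is a minimal Python sketch of the tally described in the notes above. It is not the Perl and Fortran code I mentioned, and it doesn't read Chadwick output: the input format (one pair of per-inning run lists per game) and the names `tally`, `clamp_lead`, `row_label` and `home_win_rate` are all assumptions made for illustration. It also assumes that every half-inning of an extra-inning game counts as its own event in the 9+ rows, and it uses the usual binomial standard error as a rough measure of how much to trust each cell.

```python
import math
from collections import defaultdict

MAX_LEAD = 10      # leads of ten or more runs share the "10+" columns
LAST_INNING = 9    # the ninth and all extra innings share the "9+" rows

def clamp_lead(diff):
    """Home runs minus visitor runs, capped at +/- MAX_LEAD."""
    return max(-MAX_LEAD, min(MAX_LEAD, diff))

def row_label(half, inning):
    """'T 3', 'B 7', ... with the ninth and later lumped into 'T 9+' / 'B 9+'."""
    return f"{half} 9+" if inning >= LAST_INNING else f"{half} {inning}"

def tally(games):
    """Count home wins and total events for each (row, lead) situation.

    games: iterable of (visitor_runs, home_runs), each a list of runs per
    inning (hypothetical format). home_runs may be one entry shorter when
    the home team did not need the bottom of the last inning.
    Returns {(row, lead): [home_wins, events]}; negative lead = visitors ahead.
    """
    counts = defaultdict(lambda: [0, 0])
    for visitor, home in games:
        if sum(visitor) == sum(home):
            continue  # skip the rare game that ended in a tie
        home_won = sum(home) > sum(visitor)
        v = h = 0
        for i, v_runs in enumerate(visitor):
            inning = i + 1
            # Situation after the top (visitors') half of the inning.
            v += v_runs
            states = [(row_label("T", inning), clamp_lead(h - v))]
            # Situation after the bottom half, if the home team batted.
            if i < len(home):
                h += home[i]
                states.append((row_label("B", inning), clamp_lead(h - v)))
            for key in states:
                counts[key][0] += int(home_won)
                counts[key][1] += 1
    return counts

def home_win_rate(counts, row, lead):
    """Return (fraction of home wins, events, binomial standard error), or None."""
    wins, n = counts.get((row, lead), (0, 0))
    if n == 0:
        return None  # impossible or never-observed situation (a blank cell)
    p = wins / n
    return p, n, math.sqrt(p * (1 - p) / n)
```

Calling `home_win_rate(counts, "T 7", -6)` on the result of `tally` would give the fraction for the go-home-or-stay question at the top of this post, along with the number of events behind it. Cells with only a handful of events, like the 10+ columns in the early innings, will show a large standard error.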