Saturday, June 11, 2011

Take Me Out to the SQualL Game

As some of you know, I have a moderate interest in Baseball and Baseball statistics. However, I'm not a database programmer, and the little bit of DB manipulation I've looked at has left me hopelessly confused.

Several years ago I did buy a copy of Baseball Hacks, but I never really got started with it. Last weekend I had some time on my hands and started playing around with it. Turns out you actually have to try to do something with a program in order to learn it (who'da thunk?). And it also turns out the Baseball Hacks, even though it's an O'Reilly book, is pretty oriented to Windows/Microsoft Access, although it does have substantial hints for Linux and Mac users, not to mention an online collection of scripts from the book (ZIP file), along with some other stuff.

So what should we do? As a first shot, how about updating my table of averages for several modern baseball statistics to include 2009 and 2010? I did the previous tables by doing some judicious editing of the Batting.csv file in Sean Lahman's baseball database, but the folks at have all of the data packaged neatly into a mySQL database (ZIP file), so we'll use that.

So as a start, we're going to

  • Install appropriate parts of the mySQL database program in Ubuntu,
  • Set it up to read the database file,
  • Find the eligible batters for 2009 and 2010,
  • Get their batting data into a spreadsheet,
  • Find the appropriate averages, runs created, etc., and
  • add the results to the appropriate tables.

That should be enough for one day. It's going to be a long journey, though, so when you've got some time join us after the break.

Ready? OK, let's go.

Installing mySQL under Ubuntu

and, probably, any Linux system, with the usual modifications for different distributions. For what it's worth, I set up a virtual machine which didn't previously have mySQL installed so that we can see how everything works on a clean system.

This will install everything we're going to use today, and then some:

$ sudo apt-get install mysql-server mysql-query-browser

During the installation process you'll see a screen pop up that looks like this:

mySQL server password screen

This is the password to the mySQL server. As it says, you may be able to get away with no password, but it's probably a good idea to have one, especially if you share your machine with others. Given this password, you can change content and access to any database that's installed on you system. We'll make the password here gBrett.

After installation, the mysqld daemon will always be running on your system, though it may spend a lot of time sleeping:

$ ps xa | grep mysqld
25154 ? Ssl 0:00 /usr/sbin/mysqld

Probably better check to see if it works:

$ mysql -u root -p
Enter password: gBrett
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 123
Server version: 5.1.54-1ubuntu4 (Ubuntu)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit

Set up the Database

Now download the Baseball-Databank mySQL database (ZIP file) and unzip into a convenient directory:

$ mkdir Documents/baseballDB
$ cd Documents/baseballDB
$ wget
$ unzip

Now we want to set up access to the database for a user. You can use your current username for this, but you could create another user name if you wished. Here, we'll use our own user name, and we'll follow Baseball Hacks and call the database bbdatabank:

$ mysql -p -u root
Enter password: gBrett
mysql> GRANT ALL ON bbdatabank.* TO 'dave'@'localhost' IDENTIFIED BY 'cRojas';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE DATABASE bbdatabank;
Query OK, 1 row affected (0.00 sec)

mysql> quit

(Note the semicolons at the end of the line, which tells mySQL you're finished with the command. If you don't enter a semicolon, nothing much will happen until you do.)

This command allows dave to log into mySQL and play with the database called bbdatabank. Of course, there's nothing in bbdatabank as yet, but now dave can fix that:

$ mysql -u dave -p -s bbdatabank < BDB-sql-2011-03-28.sql
Enter password: cRojas

This will take a bit of time, but eventually the DB will load. To see if it's correct, do something like this:

$ mysql -u dave -p -s bbdatabank
Enter password: cRojas
mysql> show tables;
mysql> quit

That lists all of the various tables in the database. One's really important, that's the one called Master, and if you look into the BDB-sql-2011-03-28.sql near the top you'll find something like this:

/*!40000 ALTER TABLE `Master` DISABLE KEYS */;
INSERT INTO `Master` VALUES (1,'aaronha01','','aaronha01h',1934,2,5,'USA','AL','Mobile',NULL,NULL,NULL,NULL,NULL,NULL,'Hank','Aaron',NULL,'Henry Louis','Hammer,Hammerin\' Hank,Bad Henry',180,72.0,'R','R','1954-04-13','1976-10-03','','aaronha01','aaronha01','aaroh101','aaronha01','aaronha01')

which tells you that somebody named Hank Aaron, born in 1934 in Mobile, Alabama, will be referred to as aaronha01 in other parts of the database. You can probably figure out most of the rest of the things here. This is the key to the entire database. In the Batting table, for example, there will be twenty-two references to aaronha01, one for each year he played in the majors.

OK, we've got all of that. What do we do with it? First, as a warm-up, let's identify the players in MLB since 2000 who came to the plate more than 740 times (just to fit the list onto one screen) in a year, and see which one of them had the best batting average. This is a modification of Baseball Hacks Hack 18, at the bottom of page 95.

$ mysql -u dave -p -s bbdatabank
Enter password: cRojas
mysql> select b.H/b.AB as AVG,
    -> b.AB+b.BB+b.HBP+b.SH+b.SF as PA,
    -> m.nameLast, m.nameFirst, b.teamID, b.yearID 
    -> from Batting b inner join Master m
    -> where b.playerID=m.playerID
    -> and
    -> b.yearID>1999
    -> and b.AB+b.BB+b.HBP+b.SH+b.SF > 735
    -> order by AVG DESC;
AVG PA nameLast nameFirst teamID yearID
0.3722 762 Suzuki Ichiro SEA 2004
0.3510 736 Suzuki Ichiro SEA 2007
0.3497 738 Suzuki Ichiro SEA 2001
0.3260 748 Pierre Juan FLO 2004
0.3223 752 Suzuki Ichiro SEA 2006
0.3140 748 Young Michael TEX 2006
0.3130 739 Young Michael TEX 2004
0.3105 749 Suzuki Ichiro SEA 2008
0.3089 752 Jeter Derek NYA 2005
0.3085 739 Utley Chase PHI 2006
0.3054 746 Pierre Juan FLO 2003
0.3034 739 Suzuki Ichiro SEA 2005
0.3003 741 Soriano Alfonso NYA 2002
0.2997 736 Furcal Rafael LAN 2006
0.2965 763 Reyes Jose NYN 2008
0.2961 778 Rollins Jimmy PHI 2007
0.2918 750 Pierre Juan CHN 2006
0.2901 751 Sizemore Grady CLE 2006
0.2805 765 Reyes Jose NYN 2007
0.2772 758 Rollins Jimmy PHI 2006
0.2771 748 Sizemore Grady CLE 2007
0.2700 739 Jeter Derek NYA 2010
0.2688 754 Weeks Rickie MIL 2010
0.2681 745 Sizemore Grady CLE 2008

(Note that I added a lot of carriage returns to input, just to make it visible. I could have run it all together:

select b.H/b.AB as AVG, b.AB+b.BB+b.HBP+b.SH+b.SF as PA, m.nameLast, m.nameFirst, b.teamID, b.yearID from Batting b inner join Master m where b.playerID=m.playerID and b.yearID>1999 and b.AB+b.BB+b.HBP+b.SH+b.SF > 735 order by AVG DESC;

I'd recommend keeping the commands you're trying in a text file, and then cutting and pasting into the database program.)

Let's analyze that command line by line and see what it does:

# Set up a variable called AVG, which measures hits (H) divided by At Bats (AB), and print it out:
select b.H/b.AB as AVG,
# Set up another variable called PA, for plate appearances (every time the player came to the plate) and print it out:
-> b.AB+b.BB+b.HBP+b.SH+b.SF as PA,
# This line print the Players name (Last name first), the team he played for, and the year:
-> m.nameLast, m.nameFirst, b.teamID, b.yearID
# This identifies the tables. A prefix of "b." means the data comes from the Batting table, while "m." comes from the Master table:
-> from Batting b inner join Master m
# This makes sure that we're looking at the same player in the Master table as in the Batting table:
-> where b.playerID=m.playerID
# But wait, there's more:
-> and
# We only want to look at years after 1999:
-> b.yearID>1999
# and we only want to see batters who came to the plate a large number of times:
-> and b.AB+b.BB+b.HBP+b.SH+b.SF > 735
# Finally, we want this ordered by average, starting from the top and going down:
-> order by AVG DESC;

Find Eligible Batters for 2009 and 2010

This is pretty easy, right? To be eligible for a batting title, Rule 10.22(a) says that you must come to the plate at least 3.1 times for every scheduled game your team plays. Since every team is scheduled for 162 games per year, that works out to 502.2 plate appearances. So let's work it out for 2009:

$ mysql -u dave -p -s bbdatabank
Enter password: gBrett
mysql> select b.AB+b.BB+b.HBP+b.SH+b.SF as PA,
    -> m.nameLast, m.nameFirst, b.teamID, b.yearID
    -> from Batting b inner join Master m
    -> where b.playerID=m.playerID and b.yearID=2009
    -> and b.AB+b.BB+b.HBP+b.SH+b.SF>502
    -> order by PA DESC;

Whoa! That will list 140 names, headed by Jimmy Rollins with 734 PA and ending with Grady Sizemore and Martin Prado, who came up 503 times each. Unless you have a really big monitor, or you like to cut and paste from your terminal buffer, you'll need a better interface. and LibreOffice have Base components, and with sufficient work I'm told you can get them to interact with mySQL. Unfortunately, despite pursuing various tutorials I couldn't get that to work. Fortunately, mySQL has its own graphical interface, mysql-query-browser. Basically it the mySQL commands above and puts them in table form on your screen. We downloaded it as part of the mySQL installation process, so let's try it on this problem:

mysql-query-browser &

This launches a graphical interface. The first time through you'll have to fill in the boxes, after that it will remember everything except the password (which is cRojas):

mysql-query-browser password screen

Once you're logged in, click the tab that says Result Set 1 or something like that. You'll get a screen that looks like this:

Startup mysql-query-browser

The top box on the left gives you the details of the database. For example, if you click on Batting, you'll see categories such as playerID, teamID, H, etc. These are categories you can search or print out with the database commands. Of course, if you don't know that GIDP means Grounded Into Double Play then you are going to be mystified, but presumably the database creator (which may be you) will have all of these things identified somewhere.

The top box is the query box. Enter the entire command as we did before, including the ;, then click the Execute icon to the right:

select b.AB+b.BB+b.HBP+b.SH+b.SF as PA,
m.nameLast, m.nameFirst, b.teamID, b.yearID
from Batting b inner join Master m
where b.playerID=m.playerID and b.yearID=2009
and b.AB+b.BB+b.HBP+b.SH+b.SF>502
order by PA DESC;

This gives you a spreadsheet like format

Players eligible for batting title in 2009

which is scrollable and much easier to use. You can also export it into a spreadsheet or table. Click on File => Export Resultsheet and you can save this table as a set of Comma Separated Values, or into HTML, XML, or Excel. Why Excel? Beats me. I'd hope for something open source, but I guess .xls is the default spreadsheet these days. Fortunately all of the graphical open source spreadsheets I know can work with these spreadsheets.

Enter Eligible Players batting data into the spreadsheet:

Simple enough right? We just need to dump everything for both years into a spreadsheet. Try this in the query box:

select b.yearID, m.nameLast, m.nameFirst,
b.teamID, b.G, b.AB+b.BB+b.HBP+b.SH+b.SF as PA, b.AB, b.R, b.H, b.2B,
b.3B, b.HR, b.RBI, b.SB, b.CS, b.BB, b.IBB, b.HBP, b.SH, b.SF, b.GIDP
from Batting b inner join Master m
where b.playerID=m.playerID and b.yearID>2008 and
b.AB+b.BB+b.HBP+b.SH+b.SF > 502
order by b.yearID ASC, m.nameLast, m.nameFirst;

Note that the sort is on year, last name, and first name. We want 2009 first in the sort, so we go in ascending order (ASC) rather than descending (DESC). However, ASC is the default, so we don't really need it, and I left it off the first and last name search.

Export this to a spreadsheet I'll call batting0910.xls and open it up with your favorite spreadsheet program. You should get something that looks like the picture below, except that I did a bit of editing to make it look cleaner. It's rather small here, so right-click and chose view image in another tab so see it more clearly.

Batting spreadsheet for title-qualified batters in 2009 and 2010

We'll be using total bases at least twice, so let's make a column for it.

  • In cell V1 add the label TB
  • In cell V2 enter the string =I2+J2+2*K2+3*L2 (remember that hits (H) includes doubles, triples and home runs)

Your spreadsheet should look something like this:

Spreadsheet after adding Total Bases column

Now let's fill in some other columns. Most of these will be fractions, so format them for as many decimal places as you like. Following long-time baseball convention I'm going to print out everything to three places.

Column Category Symbol Row 2 Formula
W Outs OUTS =G2-H2+O2+S2+T2+U2
X Batting Average BA =I2/G2
Y On-Base Pct. OBP =(I2+P2+R2)/F2
Z Slugging Pct. SLG =V2/G2
AA On-Base + Slugging OPS =X2+Y2
AB Runs Created RC =(I2+P2-O2+R2-U2)*(V2+0.26*(P2-Q2+R2)+0.52*(S2+T2+N2))/F2
AB Runs Created / Game RC/G =24*AB2/W2
AC Base Runs BsR =(I2+P2-L2)*(1.02*(1.4*V2-0.6*I2-3*L2+0.1*P2))/(1.02*(1.4*V2-0.6*I2-3*L2+0.1*P2)+G2-H2+L2)
AD Base Runs/Game BsR/G =24*AC2/W2

I'm not really sure about the last one: the original Wikipedia article for Baseruns has lost its formulas, I can't find other web page that mentions 1.02 or 1.4, and in my original calculations I seem to have miscalculated total bases. So take that with a grain of salt.

Once you have all of that, you can find average and standard deviations of any category you like, by year or over the range of years you selected. I'm going to recreate the entire table sometime this week, but for now, here are the results for 2006-10, including results from my original tables. For now I'll leave out the standard deviations.

2006 146 0.286 20.46 80.95 0.355 0.469 0.824 113.09 96.13
2007 156 0.283 19.13 79.76 0.354 0.458 0.812 110.98 93.58
2008 137 0.280 19.62 78.03 0.350 0.457 0.807 108.99 92.95
2009 140 0.282 19.93 78.13 0.353 0.461 0.814 90.75 58.35
2010 136 0.273 18.73 75.12 0.343 0.444 0.787 85.30 55.42

OK, problems: I'm not sure if the falloff in Runs Created is really that dramatic, or if I just made a mistake in the previous calculations. The falloff is certainly consistent with, if more dramatic than, the falloff in the other categories, and fits with the cough end of the steroid era. The falloff in Base Runs is undoubtedly due to errors in my previous calculations. So for now we'll stop here, and I'll give you a fully redone table in a week or so.

Sunday, June 12, 2011: I updated all of the tables mentioned here, including corrected Base Runs and the years 2009 and 2010. All of that data is available here.

At least we've covered the basics of running mySQL and doing some analysis. I'll try to cover some more of this, along with other tricks and tips from Baseball Hacks in the near future. If you've got anything you'd like to look at, leave a comment.