Sunday, August 14, 2011

Transparent PNG (or GIF) From the Command Line

Given the popularity of my post about making transparent PNG images with the Gimp, I should share a recent discovery: If you know the exact color you want to make transparent, you can do it from the command line.

What you need is the ImageMagick package, picture manipulation software that's available for all major Linux distributions, Windows, and Mac OS X (through MacPorts). If none of those work for you, you can also compile the source code.

Specifically, you need the convert command line tool. The syntax is:
convert -transparent color original_picture.png picture_with_transparent_background.png
and you can convert from png to gif, or vice versa, with the same command.

An example: here's the good ol' pengjay, with a horrid green (#00FF00) background:

Pengjay with green background

Since we know the background color is #00FF00, we can run the image through convert (note the quotes; an unquoted # would start a shell comment):

convert -transparent '#00FF00' pengjay_green.png pengjay_transparent.png

and we get this:

Pengjay with transparent background

Note that if you don't know the color, you're better off using the GIMP.

Saturday, August 06, 2011

Can I Go Home Now?

We've all been there. The home team drops seven runs in an early inning. By the seventh-inning stretch they've shown no sign of an offense, and are still six runs down. It's a hot, muggy Washington night, there are thunderstorms brewing over the horizon, and your wife just phoned that she heard on WTOP that Rt. 50 to Annapolis was closing for repair work at 10 p.m.

Question: If I go home now, am I likely to miss anything? Aside from heat stroke and road rage?

Answer: Probably not. If the visiting team is ahead by six runs after the top of the seventh, historically the chance that the home team will pull out a victory is 1.6%, or about 60 to 1 against. Unless you're of the extreme optimist persuasion I'd suggest going home.

I got to thinking about this a few years ago, when Bill James published an article in Slate on when a college basketball game is really over. Based on his observations, he was able to come up with an algorithm which predicts when a team has a safe lead, based on the lead, time left in the game, and who has the ball.

In baseball we can do the same kind of thing, except that there are only a finite number of logical stopping points (the end of a half-inning) and leads (the largest of which was less than 30 runs). Plus, we have a line score for just about every major league baseball game played since 1900, so we have a lot of data. This means that we don't need no stinkin' algorithm; we can give you the historical probability that any given lead was overcome.

I didn't go all the way back to 1900. I stopped at 1948, because that's the earliest data available in Retrosheet's Play-by-Play Files. With the Chadwick Software Tools we can go through all the games in the database and see how many times, say, the visiting team was ahead by five runs after the end of the first, and how often the home team won in that situation. Do that for all possible combinations of leads and innings and we get the table below. (You may have to widen your browser window to see everything.)

Visitors Lead Tie Home Lead
Inn. 10+ 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 10+
T 1 .077 .000 .040 .041 .083 .157 .187 .304 .378 .486 .591                    
B 1 .000 .000 .063 .056 .055 .132 .148 .236 .313 .416 .533 .643 .740 .819 .893 .912 .933 .960 1.00 .944 1.00
T 2 .000 .027 .067 .057 .080 .169 .180 .262 .345 .452 .580 .692 .781 .846 .925 .929 .969 .974 1.00 .941 1.00
B 2 .000 .016 .043 .032 .055 .134 .127 .218 .295 .397 .532 .656 .756 .829 .895 .920 .956 .966 .957 .976 1.00
T 3 .034 .032 .019 .043 .075 .133 .166 .250 .339 .456 .582 .705 .801 .863 .927 .939 .972 .984 .980 1.00 1.00
B 3 .015 .024 .016 .031 .047 .095 .122 .200 .278 .394 .526 .655 .766 .840 .909 .930 .958 .977 .986 .989 .990
T 4 .012 .021 .028 .033 .071 .096 .145 .225 .314 .446 .589 .719 .820 .886 .933 .963 .971 .988 .990 .994 .988
B 4 .004 .011 .022 .028 .039 .062 .094 .164 .252 .369 .526 .677 .788 .866 .917 .949 .969 .986 .990 .992 .996
T 5 .004 .009 .020 .031 .045 .078 .114 .199 .290 .422 .591 .741 .843 .904 .945 .967 .982 .995 .995 .991 1.00
B 5 .002 .005 .009 .019 .026 .049 .077 .139 .224 .344 .523 .692 .808 .887 .936 .962 .981 .989 .995 .993 .999
T 6 .001 .005 .011 .019 .031 .055 .091 .161 .268 .406 .600 .771 .873 .931 .964 .975 .994 .995 .999 .998 .998
B 6 .001 .000 .005 .009 .014 .030 .049 .097 .185 .305 .520 .725 .848 .917 .956 .972 .989 .994 .998 .999 .998
T 7 .001 .001 .004 .010 .016 .038 .061 .118 .217 .357 .610 .819 .912 .958 .977 .991 .995 .998 .999 1.00 .999
B 7 .000 .002 .002 .002 .007 .015 .033 .059 .130 .245 .523 .772 .894 .949 .973 .990 .995 .998 .999 .999 .999
T 8 .000 .001 .003 .004 .007 .017 .041 .074 .153 .296 .634 .890 .956 .985 .992 .997 .999 1.00 .999 1.00 1.00
B 8 .000 .000 .000 .001 .002 .006 .013 .027 .067 .150 .519 .865 .944 .981 .990 .997 .999 1.00 .999 1.00 1.00
T 9+ .000 .000 .000 .002 .003 .007 .014 .033 .071 .157 .615 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
B 9+ .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .524 1.00 1.00 1.00 1.00            

So what is all of this?

  • The left hand column represents the situation at the end of either the top (visitor's) half of an inning, or the bottom (home team's) half, so T 7 describes the situation just before the Cubs let that day's designated karaoke singer ruin Take Me Out to the Ballgame. Since all innings after the ninth are played under the same conditions — the team ahead after the bottom of the inning wins, and if it's tied we try, try again — I combined all the data from those games into the labels T 9+ and B 9+.
  • The other columns represent the lead by either the home team (on the right) or the visiting team (left) at the end of the half-inning. Ties are right in the middle.
  • The decimal fraction in each block indicates the home team's chance of winning in that situation. So, for example, if you've just sat through an agonizing top of the third, when the Yankees have scored a bunch of runs and lead by 7, we look at row T 3, column Visitors 7, and see that the Orioles have a 0.043 (4.3%) chance of winning the game — no, that's not true. Over the course of the last 60 years, 4.3% of all major league home teams down by 7 going into the bottom of the third have come back to win. These are the Orioles, however, so they have about a 0.001% chance of coming back.
  • I put all leads of ten or more runs in the 10+ categories. The software I wrote to generate the table is easy to modify to list bigger leads, if you like. I stopped at 10 because that will more or less fit on a standard blog page.
  • The blank spaces are impossible situations. The home team can't score before it gets up to bat, so the right-hand side of the T 1 row can never be reached. And the home team doesn't need the bottom of the ninth or later inning unless it was tied or behind after the top of the inning. In that case they can never win the game by more than four runs.
  • Indeed, I debated about putting in the B 9+ row, since it's a trivial case, but I wanted to highlight the fact that the home team still has a big advantage if the score is tied in late or extra innings. I've discussed this elsewhere.
  • Finally, I haven't told you about the statistical significance of these results, i.e., the standard deviation or the sample size. Suffice it to say that there are more than enough events here for leads of nine runs or less. When we get to the 10+ category, especially in the early innings, there just aren't that many games. If you want to see the raw numbers (home wins/visitor wins/total games) drop me a line and I'll send them to you.
  • Really last finally: I made a slight modification to the Chadwick source code to make it easier to parse the runs scored per inning. I also wrote a Perl script and some Fortran code to parse the output from Chadwick and write the HTML for the table above. If you're interested in the codes, drop me a line. If there's enough interest I'll make all the code available on my website.
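The counting described above is conceptually simple; here's a minimal sketch of it. The input format is entirely made up for illustration (the real records come from Retrosheet via the Chadwick tools):

```python
from collections import defaultdict

def lead_table(games):
    """Tally how often the home team won from each (half-inning, lead) state.

    games: iterable of (checkpoints, home_won), where checkpoints is a list
    of ((inning, half), home_lead) pairs, one per completed half-inning,
    and home_won is 1 or 0.  This record layout is hypothetical; in
    practice it would be built by parsing Chadwick's output.
    """
    wins = defaultdict(int)
    seen = defaultdict(int)
    for checkpoints, home_won in games:
        for state in checkpoints:
            seen[state] += 1
            wins[state] += home_won
    return {state: wins[state] / seen[state] for state in seen}

# Two toy games, both tied at the end of the first inning; the home team
# wins one of them, so that state shows a .500 home win fraction.
toy = [([((1, 'B'), 0)], 1), ([((1, 'B'), 0)], 0)]
```

Every cell in the table above is just this win fraction for one (half-inning, lead) pair, computed over sixty-plus seasons of games.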

Sunday, July 31, 2011

Almost Spot On

I'm currently listening to The Essential Kansas, the 70's boy band (I jest) from Topeka. For some reason, I've never bought a Kansas album, and I don't plan to in the near future. Yet I'm listening completely legally, and not paying a dime.

That's because Spotify has finally made it from Europe to the U.S. I read about it in a New York Times article (I just access the free stuff there, too) the other day and decided to try it out. As far as I can tell, the deal is:

  • With a free-as-in-beer account, you can stream any song in Spotify's vast library (and it's vast, if somewhat uneven).
  • For now, you can listen to an unlimited amount of music, interrupted only by two minutes of commercials each hour.
  • After six months, your free account is limited to ten hours of music per month. I read that in the NYT article; good luck finding it in Spotify's account description.
  • Of course, there are other plans that let you stream unlimited amounts of music, without ads, for a price.
  • You have to use Spotify's music player.

That last part was almost a deal breaker. (Can you break a free deal?) There is no generally available Linux client. There is an alpha version of a Linux player, but it doesn't seem to have been worked on for a year or so, and it can only be used with the paid accounts, because "we haven't found a reliable way to display ads yet."

However, the Windows client runs under Wine, and Spotify gives detailed instructions on how to get it started. You can't play your local MP3 files through Spotify (not sure I'd want to), but the stuff that comes over the web sounds good — according to the NYT article, it's in 160-kbps Ogg Vorbis format.

All right, how is it? Funny you should ask.

The sound quality is good enough for my speakers-in-the-monitor setup. Beyond that I couldn't say. Audio purists will probably find some fault, but the real purists are listening to vinyl anyway. The playlist is extensive, but not complete. OK, I didn't expect the Beatles here, but the Eagles are mostly missing. You do get what seems to be complete coverage of the Rolling Stones, Simon & Garfunkel, Creedence Clearwater Revival, Bob Marley, Peter, Paul & Mary, the Mamas and the Papas, the aforementioned Kansas, Roger Miller, Glen Campbell, Johnny Cash, a limited amount of Dylan, and even Hugh Laurie reading Three Men in a Boat. They have what seems to be the complete Jimmie Rodgers the Elder, but I can't find anybody's version of Jimmie Rodgers the Younger's classic ballad It's Over. This would frustrate me if I were actually paying for this stuff, but since it's all free I really can't complain. It seems as though they're working on it.

Oh yes, one more thing: if you want to sign up for the free account, you need an invite. Apparently paid-for account holders can give you an invite, or you can go to Spotify.com and ask for one. It took about thirty seconds for me to get an email after I asked, but YMMV.

All in all, it seems to be a pretty good service, at least for now. If you put a map of the continental U.S. on a dartboard so that Kansas is the bull's eye, Spotify's dart hits at about Oklahoma City. Add the Eagles, more Dylan, and the Beatles (Ha!), and they can hit Holyrood.

Sunday, July 17, 2011

When He's Right, He's Right

Penguin Pete. Opinionated and passionate about it.

And boy, when he's right, he's right.

Monday, July 11, 2011

Baseball After the All Star Break

Some years ago, I did a predictive study on how Major League Baseball teams would rank at the end of the season, based on their records at the All Star Break and a Pythagorean projection of future wins computed from the runs scored and allowed by each team.

It didn't work all that well. In particular, I predicted that Boston would win the AL East pennant, and Washington would be the NL Wild Card. That sorta didn't happen.

Nevertheless, I'll try again. Here's the table, based on the MLB standings at the All Star Break. The method is the same as last time, so you can read all about it there.

American League
East  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
New York Yankees 53 35 0.602 2 1 455 334 0.637 74 47.14 26.86 100.14 61.86 0.618 0.00
Boston 55 35 0.611 1 0 482 371 0.617 72 44.42 27.58 99.42 62.58 0.614 0.73
Tampa Bay 49 41 0.544 3 6 380 343 0.546 72 39.35 32.65 88.35 73.65 0.545 11.80
Toronto 45 47 0.489 4 11 426 416 0.511 70 35.76 34.24 80.76 81.24 0.498 19.39
Baltimore 36 52 0.409 5 18 355 454 0.390 74 28.85 45.15 64.85 97.15 0.400 35.29
Central  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
Cleveland 47 42 0.528 2 0.5 386 382 0.505 73 36.85 36.15 83.85 78.15 0.518 0.00
Detroit 49 43 0.533 1 0 413 421 0.491 70 34.39 35.61 83.39 78.61 0.515 0.46
Chicago White Sox 44 48 0.478 3 5 366 383 0.479 70 33.55 36.45 77.55 84.45 0.479 6.29
Minnesota 41 48 0.461 4 6.5 347 414 0.420 73 30.69 42.31 71.69 90.31 0.443 12.16
Kansas City 37 54 0.407 5 11.5 402 449 0.450 71 31.94 39.06 68.94 93.06 0.426 14.91
West  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
Texas 51 41 0.554 1 0 457 404 0.556 70 38.91 31.09 89.91 72.09 0.555 0.00
Los Angeles Angels 50 42 0.543 2 1 355 330 0.533 70 37.32 32.68 87.32 74.68 0.539 2.59
Seattle 43 48 0.473 3 7.5 301 319 0.474 71 33.63 37.37 76.63 85.37 0.473 13.28
Oakland 39 53 0.424 4 12 315 339 0.467 70 32.66 37.34 71.66 90.34 0.442 18.24
National League
East  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
Philadelphia 57 34 0.626 1 0 384 295 0.618 71 43.86 27.14 100.86 61.14 0.623 0.00
Atlanta 54 38 0.587 2 3.5 365 312 0.571 70 39.96 30.04 93.96 68.04 0.580 6.89
New York Mets 46 45 0.505 3 11 399 388 0.513 71 36.40 34.60 82.40 79.60 0.509 18.46
Washington 46 46 0.500 4 11.5 352 354 0.497 70 34.82 35.18 80.82 81.18 0.499 20.04
Florida 43 48 0.473 5 14 352 396 0.447 71 31.71 39.29 74.71 87.29 0.461 26.15
Central  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
St. Louis 49 43 0.533 2 0 433 407 0.528 70 36.97 33.03 85.97 76.03 0.531 0.00
Milwaukee 49 43 0.533 1 0 405 406 0.499 70 34.92 35.08 83.92 78.08 0.518 2.05
Pittsburgh 47 43 0.522 3 1 354 346 0.510 72 36.75 35.25 83.75 78.25 0.517 2.22
Cincinnati 45 47 0.489 4 4 437 408 0.531 70 37.18 32.82 82.18 79.82 0.507 3.79
Chicago Cubs 37 55 0.402 5 12 375 459 0.409 70 28.63 41.37 65.63 96.37 0.405 20.34
Houston 30 62 0.326 6 19 358 464 0.384 70 26.89 43.11 56.89 105.11 0.351 29.08
West  W   L  PCT Place GB  RS   RA  Pyth GL PW PL TW TL PCT GB
San Francisco 52 40 0.565 1 0 332 322 0.514 70 35.97 34.03 87.97 74.03 0.543 0.00
Arizona 49 43 0.533 2 3 416 407 0.510 70 35.70 34.30 84.70 77.30 0.523 3.28
Colorado 43 48 0.473 3 8.5 395 407 0.486 71 34.53 36.47 77.53 84.47 0.479 10.44
Los Angeles Dodgers 41 51 0.446 4 11 340 373 0.458 70 32.06 37.94 73.06 88.94 0.451 14.92
San Diego 40 52 0.435 5 12 304 338 0.452 70 31.63 38.37 71.63 90.37 0.442 16.34

Abbreviations:

  • W: Current team wins
  • L: Current team losses
  • PCT: Winning percentage
  • Place: Current place in standings
  • GB: Games Behind
  • RS: Total Runs scored by team
  • RA: Total Runs allowed by team
  • Pyth: Pythagorean expected win rate. Following MLB, I used an exponent of 1.82 rather than the original James value of 2. It doesn't make a lot of difference, and didn't change the order.
  • GL: Games left in season for the team
  • PW: Projected wins in remainder of season, assuming they win at the Pythagorean rate
  • PL: Projected Pythagorean losses
  • TW: Total wins, current + projected Pythagorean
  • TL: Total losses
  • PCT: Projected final winning ratio
  • GB: Projected final games behind
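The arithmetic behind each row of the table is easy to sketch. This is just the standard Pythagorean formula with the 1.82 exponent; the function name and layout are mine:

```python
def pythagorean_projection(w, l, rs, ra, games_left, exponent=1.82):
    """Project a team's final record from its Pythagorean win rate."""
    pyth = rs**exponent / (rs**exponent + ra**exponent)
    pw = pyth * games_left           # projected wins, rest of season
    pl = games_left - pw             # projected losses
    return pyth, w + pw, l + pl      # Pyth, TW, TL

# The Yankees' line from the table: 53-35, 455 RS, 334 RA, 74 games left.
pyth, tw, tl = pythagorean_projection(53, 35, 455, 334, 74)
```

Plugging in the Yankees' numbers reproduces the table's 0.637 Pythagorean rate and 100.14 projected total wins.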

OK, not a lot of changes going on. Despite an anemic offense, San Francisco's fantastic pitching will keep them in first in the NL West. Philadelphia will win the NL East going away, with Atlanta taking the NL Wild Card. Texas will hang on in the AL West.

There are a few predicted swaps, highlighted in yellow: St. Louis will pull ahead of Milwaukee. And Cleveland will (yawn) edge out Detroit. Surprisingly, the only changes occur at the top, which probably says something about competitive balance in MLB.

And, finally, the Red Sox will be the AL Wild Card. Which means …

Frak

Saturday, July 02, 2011

The Home Team Wins Most Ties (In Baseball, Anyway)

I've been playing around with Retrosheet to see how often a baseball team wins a game if it's, say, five runs ahead at the end of the fourth inning. I plan to get that up sometime during this long weekend, but while doing the study I found another interesting result.

Retrosheet's Play-by-Play files, along with the cwevent program from Chadwick, let you extract all sorts of information from almost every MLB game played from 1950 through 2010. From that data I extracted every game that was tied at the end of a half-inning, and figured out who eventually won. Then I counted up the number of times the home team won for each half-inning. The results are shown below:

Probability Home Team wins baseball game if it is tied at the end of a half-inning

Click on graph to see a larger figure

The black diamond represents the situation at the start of the game, the red diamonds the situation where the game is tied in the middle of the inning, and the blue diamonds when it's tied at the end of an inning. We'll get to the error bars in a minute.

So what is all of this? Well, at the beginning of a game the score is obviously tied, so that should be part of the study. So if we look at all 115,748 games in the database, we find:

  • The Home Team won 62,418 games,
  • the Visiting Team won 53,192 games, and
  • there were 138 games that were tied when the game was called.

If we throw out the ties, then the Home Team won 53.990% of the games that went to a decision. That's the black diamond at the far left of the graph. The 54% win rate is baseball's version of Home Field Advantage, and it has been very constant:

Probability Home Team wins baseball game in a given year

Click on graph to see a larger figure

I performed the same calculation for all the games which were tied at the end of a half inning. For example, if the game is tied at the end of the fifth, the home team has a 52.0% chance of eventually winning the game. Tied after the top of the sixth? It's up to 60%.

So what do the error bars represent? Basically they give you an idea of the number of games in the sample. Suppose that in a given game the home team wins with probability p. Then in an N game sample the probability that the home team wins n games follows the binomial distribution, e.g.

P(N,n) = [N! / (n! (N-n)!)] p^n (1-p)^(N-n)

If we look at a large number of N-game samples, then we'll find that on average the home team will win N p games, which makes sense. The standard deviation will be [N p (1-p)]^(1/2). Since the graph normalizes everything by the number of games played, the error bars are the standard deviation divided by N, or [p (1-p)/N]^(1/2). Since most values of p are between 0.5 and 0.7, wider error bars basically tell you that fewer games have gotten to that point. And when the error bars get really wide, as they do after the fourteenth inning or so, it says there aren't enough statistics available to give you meaningful information.
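For the record, the error-bar calculation fits in a few lines. This is just the binomial standard error applied to the full-sample numbers quoted earlier in the post:

```python
import math

def win_rate_and_error(home_wins, decisions):
    """Home-team win fraction and its binomial standard error.

    The plotted error bar is sqrt(N p (1-p)) / N = sqrt(p (1-p) / N).
    """
    p = home_wins / decisions
    return p, math.sqrt(p * (1 - p) / decisions)

# Full-sample numbers from this post, with the 138 ties thrown out:
# 62,418 home wins out of 115,610 decisions.
p, sigma = win_rate_and_error(62418, 62418 + 53192)
```

With the whole 1950-2010 sample the error bar is tiny, around 0.15 percentage points; it's only in the rarely-reached late extra innings that N gets small enough to matter.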

What does it all mean, you ask? Well, first it says that the home team advantage is real. Why there is a home field advantage is another question, and there is not enough information here to answer that question.

Then there's the observation that the home team has a larger advantage if the game is tied in the middle of the inning than it does if the game is tied at the end of an inning. That's just common sense. In the middle of the fourth inning, the visiting team has five more innings at the plate. The home team has six — five for sure, and one more if they need it. This isn't the "home team bats last" advantage; it's the "home team gets one more at-bat than the visitors" advantage, which is not the same thing.

Next, we see that if the game is tied at the end of an inning, the home team's advantage decreases slightly, so that at the end of eight innings it's only 51.94%. Presumably that's because the home team does have some advantage in being at home, but as the game progresses they have less and less chance to use that advantage. After the fifth inning the home team's advantage oscillates around 52%, down from the 54% advantage they had at the start of the game.

Indeed, the fact that the blue dots go down from innings 1-4 suggests that the home team bats last advantage isn't worth a whole lot. If it was, you'd expect the advantage to be greater in tie-game situations as the game wears on, because that last at-bat becomes a larger and larger proportion of what's left of the game.

Finally, there is that dip in the red diamonds between the first and second inning. If the game is tied going into the bottom of the first, meaning that the visiting team didn't score, then the home team will win 59.15% of the time, with a standard deviation of 0.17%. If the game is tied going into the bottom of the second, however, then the home team only has a 58.01% chance of winning, with σ = 0.22%. That's a five-σ change in the probability, which I would think is statistically significant. The win rate is pretty much constant in the third inning ( 58.16% ± 0.27%) and then starts going up, as you'd expect, since the home team has proportionately more at-bats than the visitors at the middle of an inning.

Why is this so? I have no idea. All I can say is that if you're a visiting baseball team, it's better to be tied with the home team in the middle of the second inning than it is to be tied before the home team comes to bat. And you'd better be ahead by the middle of the third.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.

Saturday, June 25, 2011

Near Miss

You may have heard that Asteroid 2011 MD will pass by Earth Monday morning. And by pass by, we mean within 8,000 miles.

That's pretty close. How close? Well, have a look at these animations:

Asteroid 2011 MD Flyby

Sunday, June 12, 2011

An Update on the Average Major League Hitter

This is the promised update to my table of pretty good averages for everyday baseball players at the major league level. As before, I consider an everyday ball player to be one who was eligible for a league batting title in a given year, which from 1996-2010 means that he made at least 502 plate appearances in a given year.

The raw data for this study was taken from the mySQL database thoughtfully provided by the folks at Baseball-DataBank.org. As outlined in yesterday's post, I used mySQL to pull out the appropriate data, this time using the command:

select b.yearID as Year, m.nameLast as Last, m.nameFirst as First,
b.teamID as TEAM, b.G, b.AB+b.BB+b.HBP+b.SH+b.SF as PA, b.AB, b.R, b.H, b.2B,
b.3B, b.HR, b.RBI, b.SB, b.CS, b.BB, b.SO, b.IBB, b.HBP, b.SH, b.SF, b.GIDP
from Batting b inner join Master m
where b.playerID=m.playerID and b.yearID>1995 and
b.AB+b.BB+b.HBP+b.SH+b.SF >= 502
order by b.yearID ASC, m.nameLast, m.nameFirst;

to get a list of all players from 1996 on who were eligible for a batting title. I then used mysql-query-browser to export all of the data into a spreadsheet, and there computed all of the averages and standard deviations. All the calculations are the same as in my original post, except:

  • I added the batting data for the 2009 and 2010 seasons.
  • My original post fraked up David Smyth's Base Runs statistic. I used the right formula (the second one on the page), but miscalculated total bases by forgetting that doubles, triples, and home runs are already counted as hits. So the numbers found in my earlier study are too high.
  • I dropped 1995 from the study this time because only 144 games were scheduled for each team, so the batting eligibility criterion was 144 × 3.1 = 447 plate appearances. Just lazy.
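Since the total-bases slip described above is an easy one to make, here's the correct bookkeeping in a few lines (the numbers in the example are made up):

```python
def total_bases(h, doubles, triples, hr):
    """Total bases from standard counting stats.

    H already includes doubles, triples, and home runs, so each
    extra-base hit contributes only its extra bases on top of the
    one base already counted in H.
    """
    singles = h - doubles - triples - hr
    return singles + 2 * doubles + 3 * triples + 4 * hr

# Equivalent shortcut: TB = H + 2B + 2*3B + 3*HR.
tb = total_bases(200, 40, 5, 30)
```

Counting a double as two bases on top of the hit already in H, which is what I originally did, inflates TB by one base per extra-base hit, and every Base Runs figure built on it.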

Saturday, June 11, 2011

Take Me Out to the SQualL Game

As some of you know, I have a moderate interest in Baseball and Baseball statistics. However, I'm not a database programmer, and the little bit of DB manipulation I've looked at has left me hopelessly confused.

Several years ago I did buy a copy of Baseball Hacks, but I never really got started with it. Last weekend I had some time on my hands and started playing around with it. Turns out you actually have to try to do something with a program in order to learn it (who'da thunk?). And it also turns out that Baseball Hacks, even though it's an O'Reilly book, is pretty oriented to Windows/Microsoft Access, although it does have substantial hints for Linux and Mac users, not to mention an online collection of scripts from the book (ZIP file), along with some other stuff.

So what should we do? As a first shot, how about updating my table of averages for several modern baseball statistics to include 2009 and 2010? I did the previous tables by doing some judicious editing of the Batting.csv file in Sean Lahman's baseball database, but the folks at Baseball-Databank.org have all of the data packaged neatly into a mySQL database (ZIP file), so we'll use that.

So as a start, we're going to

  • Install appropriate parts of the mySQL database program in Ubuntu,
  • Set it up to read the database file,
  • Find the eligible batters for 2009 and 2010,
  • Get their batting data into a spreadsheet,
  • Find the appropriate averages, runs created, etc., and
  • Add the results to the appropriate tables.

That should be enough for one day. It's going to be a long journey, though, so when you've got some time join us after the break.

Monday, May 30, 2011

The Natty Narwhal Upgrade

Like Travis McGee, this week I'm taking an installment of my retirement. Unlike McGee, my retirement seems to consist of fixing up computers.

So today, Ladies and Gentlemen, Boys and Girls, Desktops and Netbooks, we're going to talk about upgrading Hal here to Ubuntu 11.04.

Yawn. This shouldn't really be that big a deal, should it? You just hit the update button, or type do-release-upgrade at the command line, right?

Well, it's not always that simple. A friend of mine did that, and he seemed to have a lot of problems. I'm not sure why, as I never got to examine his system before he reinstalled 10.10. Maybe it was the change from the classic Gnome desktop to the new Unity desktop. Or not. Maybe it was just the way the wind was blowing that day. But I decided not to take a chance on the upgrade, and instead to install 11.04 from the ground up.

I didn't have to be pushed hard, because I was leaning toward a clean install anyway. When I bought this version of Hal I was in a hurry to get Linux set up, but wanted to keep Windows 7 around. As a result, I let Ubuntu do the repartitioning of the disk, and ended up with the entire Linux system in one giant partition.

This is less than optimal. (Thanks, Dave. Don't mention it, Hal.) Ideally one wants /home and possibly /usr/local and /opt in partitions separate from the root operating system. That way when you upgrade the OS, you don't lose your previous data. (That doesn't mean you shouldn't back up that data before upgrading. There's a word for people who don't. The polite form of it is idiot.)

But this requires repartitioning Hal's disk. Fortunately, all of Hal's memories reside in the extended partition, so I don't have to fiddle with the Windows partition. After about 30 seconds of thought, I decided to repartition Hal's big Linux partition into 15GB for /, the root system, another 15GB for /usr/local, /opt, and /scratch (more on how to do this later), and leave the rest for /home. That's probably too much for /, and maybe not enough for /scratch, depending on what I write there, but that's how I'm going to do it this go-round. Ideally gparted would leave the data on /home as is, but that takes forever because of all the data that needs to be moved around, so we're just going to repartition everything and hope the backup holds.

Preparation

First, get the software we'll need. I downloaded and burned CDs for the 64-bit versions of:

OK, Gparted is obvious: that's what's going to do the disk repartitioning. But why the alternative distributions?

Because you never know. The largely hypothetical long-time reader will remember that I used to use Fedora, but that I switched to Ubuntu when the installation of Fedora 5 failed miserably, and Ubuntu was there, ready and waiting. So having alternative Linux CDs on hand seems to be a really good idea. Besides, we bought 100 CDs maybe five years ago and still have a bunch left. They just aren't used much anymore. (That's right, rcjhawk is still a trailing indicator for device popularity.)

OK, we've got hopefully every bit of software that we need, so let's go.

Find out what's on your machine

I never remember which packages are installed from one upgrade to the next. So let's make a list:

dpkg --get-selections > installed.txt

Edit installed.txt, leaving out any packages you think will be installed by the system (e.g., kernel, window manager, etc.). If you can't remember what a package does, delete it too. If you need it, it will eventually show up as a dependency to a package you want, or you can reinstall with apt-get or synaptic. The important thing to do here is to remove cruft from the system. For example, I have the bsd-games package on here. I never play those games, so why leave them in place?

It would have been nice to make a copy of /etc/fstab, but some dingbat forgot to do that (more on this later).

Back it all up

Well, not everything. No need to back up the kernel, or Emacs, or latex, or any of that. The distribution will provide. No, I want to back up my data (basically $HOME), the Intel Fortran compiler, a royal pain to install, and my renegade version of SoX. So do the following (/backup is my ext4 formatted USB disk):

$ cd /home
# -rpv == recurse directories,
#  preserve permissions and timestamps,
#  and speak up about it verbosely.
$ cp -rpv dave /backup
# Assuming you're named dave, of course
$ cd /usr
$ cp -rpv local /backup
$ cd /
$ cp -rpv opt /backup

To be safe, I also had backintime, my backup manager, do an additional labeled backup, which should keep it on the disk forever and a day, or at least until the end of 2012.

Repartition

The next step is the scariest one. I've taken what was an 800+GB partition and broken it up into three parts: 15GB for /, 15GB for what will be /usr/local, /opt and /scratch, and the remainder for /home. It's scary because this wipes out the primary data, so you are now relying on the kindness of backups. It's theoretically possible to have gparted keep all the data in what's going to be the /home partition, but that requires moving a lot of stuff around, at least the way I tried it, and an estimated 18 hours to get it done. So I just said a prayer, bit my lip, and reformatted. Fortunately it all went well.

Installation

Not much to say here. I told Ubuntu that I wanted to do a custom installation, had it mount everything where I wanted it, and let it go. The usual waiting around occurred.

Copying Files

This was rather easy: turn on the backup USB disk drive, and copy the files back. I copied back configuration files I knew I wanted to keep, e.g. things like .emacs and the .devilspie directory, but I'm letting Ubuntu set up the desktop. This means I'll have to play with preferences later, but it should eliminate any conflicts between old and new configuration parameters.

Mounting the Backup Drive

I could just plug the backup drive in and access it at /media/really_long_hexadecimal_string, but what I really want to do is have the backup drive mounted to /backup. I've covered this before, but as I mentioned above, I forgot the UUID of the drive. You get that with the command:

blkid

which gives you the UUIDs of all connected disks.
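With the UUID in hand, one line in /etc/fstab handles the mount at boot. A sketch, where the UUID is a made-up placeholder to be replaced by whatever blkid reports (and ext4 by your drive's actual filesystem):

```
# /etc/fstab -- the UUID below is a fake placeholder
UUID=0a1b2c3d-e4f5-6789-abcd-ef0123456789  /backup  ext4  defaults  0  2
```

After adding the line (and creating the /backup mount point), sudo mount /backup mounts it immediately, no reboot needed.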

Restoring /usr/local and /opt

This is awkward. Add-on programs that you compile yourself generally go into /usr/local. Other programs, such as the Intel Fortran compiler and Google's Picasa, end up in /opt. I want both to stick around between upgrades, so they need to be out of the / partition. But I don't want two extra partitions. So here's what I did:

  • Create a partition mounted at /usr/local
  • Create a directory /usr/local/opt
  • Copy all of the /usr/local files where they belong.
  • Copy all of the /opt files to /usr/local/opt
  • Then, as root, run the command
    ln -s /usr/local/opt /opt
  • If you want a /scratch directory, do the same thing, but don't forget to make the original world writable:
    # mkdir /usr/local/scratch
    # chmod 777 /usr/local/scratch
    # ln -s /usr/local/scratch /scratch
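Put together, the steps above amount to a short script. A sketch using a scratch root so it can be tried harmlessly; on the real machine you'd run it as root against / itself, with the copies coming from /backup:

```shell
# Recreate the /opt-inside-/usr/local layout under a throwaway root.
# On a live system, ROOT would be / and the commented cp lines would
# pull the backed-up files from /backup.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/usr/local/opt"
# cp -rp /backup/local/. "$ROOT/usr/local/"      # restore /usr/local
# cp -rp /backup/opt/.   "$ROOT/usr/local/opt/"  # restore /opt
ln -s "$ROOT/usr/local/opt" "$ROOT/opt"
# The optional world-writable scratch area, same pattern:
mkdir "$ROOT/usr/local/scratch"
chmod 777 "$ROOT/usr/local/scratch"
ln -s "$ROOT/usr/local/scratch" "$ROOT/scratch"
```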
    

Restoring your add-on packages

Now we want to get back all of our old packages, the ones that aren't installed by default, at least so far as Ubuntu will let us. Remember that file installed.txt I mentioned you should create and edit? Here's how we'll use it:

  • Select System => Administration => Update Manager
  • Click on Settings in the lower left corner
  • Under Ubuntu Software make sure all of the sources you want enabled are. If you don't want proprietary drivers, etc., turn them off. Do the same under Other Software
  • Open a terminal window and go to the directory where you have that installed.txt
  • Run the following commands:
    sudo apt-get update
    sudo apt-get install `awk '{print $1}' installed.txt | xargs`
    
  • You'll probably get some error messages, but they're pretty clear about what you need to fix. Edit the installed.txt file to match.
  • This will take a while, as all the packages have to be downloaded from the net.
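For the curious, the backtick expression just flattens the first column of installed.txt into one long argument list for apt-get. A quick demonstration on a made-up sample file (the package names are arbitrary):

```shell
# Build a fake package list and show what the backquoted
# awk | xargs pipeline hands to apt-get.
cat > installed-sample.txt <<'EOF'
emacs install
sox install
gnuplot install
EOF
awk '{print $1}' installed-sample.txt | xargs
```

That prints emacs sox gnuplot, which is exactly the package list apt-get install sees.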

What Works

Pretty much everything. I selected Gnome-classic for my desktop, and after a little fiddling with System => Preferences => Startup Applications got the system to look pretty much as it did before.

As for third party software, the Intel Fortran compiler still works fine. I was able to reinstall Picasa from Google's supplied 64-bit .deb.

Wayland seems to work just like Xorg, at least from my point of view. It pops up windows in the same way as before, and I can run, say
ssh -X majel
to get me to another machine, and then run
firefox
on majel, and the window pops up here on Hal. So no major problems for me there.

What doesn't work

Google Earth. Maybe this is a 64-bit problem, I don't know. I tried using Google's .deb file, and the official Ubuntu method. Neither worked.

By default Ubuntu installs the 3-D version of Unity, which requires a pretty good graphics accelerator. Hal lives in 2-D, so that wouldn't run. (Funny, the Unity desktop shows up when you run the live CD.) I installed the 2-D version and that works. Which brings me to

What Sucks

Unity. From my limited experience (about two minutes, after which I ran away screaming), it's an overblown and somewhat hideous version of the Mac desktop. But don't mind me, I was the last person on Earth to use fvwm. I suppose if I played around awhile I could make Unity behave the way I wanted it to, but why bother, since Ubuntu still supplies the classic Gnome desktop. Mind you, I've seen Gnome 3, and I don't like that, either. (The word you're looking for is Luddite.)

Summary

All in all, a successful update, as long as I stay away from Unity. I have one more computer that needs updating to 11.04; I think I'll just try that as a distribution upgrade. If it doesn't work, I can always do a full install.

More Later

Troubles will surface, they always do. When they do, I'll write about them here.

Saturday, May 21, 2011

Compression

The other day an email arrived from SourceForge which mentioned that they were hosting the file compression program 7-zip. Now I had used 7-zip under Windows, as an all-purpose archiving tool, mostly for reading zip files. I'd never thought of it in a Linux context. But both openSUSE and Ubuntu have the command-line version available, and what more do you need?

In openSUSE, the RPM is called, simply enough, 7z. In Ubuntu it's a bit more complicated. There's p7zip, which provides the bare-bones standalone version of the compression program, 7zr, and a wrapper, p7zip, which makes 7zr work like gzip. For the full-blown 7z program, you want the package p7zip-full, which includes 7z. While you're at it you might want to get the p7zip-rar package, which lets you decompress RAR files.

The reason you want 7z and not just 7zr is that the smaller program only compresses to the 7z format, but, as it says in the blurb,

not only does [7z] handle 7z but also ZIP, Zip64, CAB, RAR, ARJ, GZIP, BZIP2, TAR, CPIO, RPM, ISO and DEB archives.

And by handle, they mean read and write to these formats (except RAR). So I could create a zip archive with 7z, or a gzip/bzip2 compressed tarball. And guess what?

7z compression is 30-50% better than ZIP compression.

Is it? Well, let's find out.

Test 1

Presented for your consideration: an uncompressed tarball:

$ ls -l Ru.tar
-rw-r--r-- 1 dave dave 26716160 2011-05-21 15:41 Ru.tar

This is I/O from the elk FP-LAPW code, so it has a lot of repeated text and, for our purposes, a bunch of effectively random numbers. First we'll try compressing it with each program's native format, going for maximum compression in all cases. Note that zip and 7z create separate archives, while gzip and bzip2 compress the file in place.
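The Ratio column below is just the compressed size divided by the original 26716160 bytes. As a sanity check of that arithmetic, and of how gzip compresses in place, here's the same calculation on a small synthetic file (the filename and contents are made up):

```shell
# Create a highly repetitive file, compress it in place with gzip -9,
# and compute the compression ratio the same way as in the tables.
yes "some repetitious FP-LAPW-style output line" | head -n 5000 > sample.txt
orig=$(wc -c < sample.txt)
gzip -9 sample.txt                  # sample.txt is replaced by sample.txt.gz
comp=$(wc -c < sample.txt.gz)
awk -v c="$comp" -v o="$orig" 'BEGIN { printf "ratio: %.3f\n", c/o }'
```

Text this repetitive compresses far better than the real tarball does; the random-looking numbers in Ru.tar are what keep the ratios below from being even smaller.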

Program  Command                   File Size  Ratio
zip      zip -9 Ru Ru.tar            3899523  0.146
gzip     gzip -9 Ru.tar              3899386  0.146
bzip2    bzip2 -9 Ru.tar             2992422  0.112
7z       7zr a -mx=9 Ru.7z Ru.tar    2242708  0.084

Pretty good, huh? As advertised, 7z is about 40% better than zip/gzip, and 25% better than bzip2. But wait, there's more. Not every computer is going to have 7z available, so you may want to compress files using a more established protocol. 7z can do that, too, which is why we wanted it, not just 7zr:

Format  Command                                 File Size  Ratio
zip     7z a -mx=9 -tzip Ru.zip Ru.tar            3287420  0.123
gzip    7z a -mx=9 -tgzip Ru.tar.gz Ru.tar        3287335  0.123
bzip2   7z a -mx=9 -tbzip2 Ru.tar.bz2 Ru.tar      2989193  0.112

So 7z compresses to zip/gzip better than the native programs do it themselves. It doesn't really outperform bzip2 here, though. The only disadvantage compared to gzip or bzip2 is that it doesn't compress the files in place, unless you go through a script such as the one in /usr/bin/p7zip.

Test 2

Pi to 4 million Decimals has, duh, π to, actually, 4,194,034 places. The file pi.tar.gz has it in ascii, with a bit of header information. If we uncompress that file, it comes in at 4362370 bytes. Since the digits of π don't repeat, it's hard for a compression program to find blocks of bytes to compress. The following table lists the compressions achieved by our test programs, in whatever formats they can use. (See the above tables for the appropriate commands.) Let's see how everybody does:

Program  Protocol  File Size  Ratio
None     None        4362370  1.000
zip      zip         2041130  0.468
gzip     gzip        2040997  0.468
bzip2    bzip2       1863892  0.427
7z       zip         1983378  0.455
7z       gzip        1983297  0.455
7z       bzip2       1860047  0.426
7z       7z          1884999  0.432

Here, bzip2 is competitive with 7z. Oddly, though, you should use 7z to do the bzip2 compression. Weird, huh? But still, 7z is pretty good.
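The effect is easy to reproduce: feed gzip genuinely random bytes and it gains essentially nothing, which is the situation the digits of π approximate. A minimal sketch:

```shell
# Random data is incompressible; gzip's output is about the same
# size as (or slightly larger than) its input.
head -c 100000 /dev/urandom > rand.bin
gzip -9 -c rand.bin > rand.bin.gz
wc -c rand.bin rand.bin.gz
```

The ASCII file of π beats this (roughly 0.43 above) only because each byte uses just ten of 256 possible values, not because there's any pattern to exploit.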

Wrapping Up

So what's not to like?

For one thing, 7z is slow. If you just want to quickly compress a file, go ahead and use gzip, or bzip2. There is a price to pay for better compression.

Then, too, there's a warning on the 7z man page:

DO NOT USE the 7-zip format for backup purpose on Linux/Unix because 7-zip does not store the owner/group of the file.

You can get around this by piping tar into 7z:

tar cf - directory_to_be_archived | 7z a -si directory.tar.7z

which creates the analog of a gzip/bzip2 tarball.

And finally, the native 7z format isn't standard, yet, so it's not going to be available everywhere, and might even vanish. But 7z and its compression algorithm LZMA are open source, so they are likely to stay around for a while. A few years ago bzip2 wasn't standard, and once upon a time neither was gzip. It's probably safe to compress your files to 7z format, but if you want to be really safe, use 7z to compress to gzip or bzip2 format.

Friday, March 18, 2011

Why I Hate March

I just spent the last three hours not watching the Kansas-Boston University first round NCAA game. In case you didn't join me in not watching, KU spent the first 25 or so minutes wondering why the BU players weren't genuflecting. Then they concentrated for a few minutes, and the game was over. Because of course, you know, when #1 meets #16, #16 has never won, right?

Trust me. Some day it will happen. And which team will that be? Well, let's look at the past for some clues, starting here:

  • 2010: #1 Seed, lost in second round
  • 2007: #1 Seed, lost in Sweet 16
  • 1998: #1 Seed, lost in second round
  • 1997: #1 Seed, lost in Sweet 16
  • 1995: #1 Seed, lost in Sweet 16
  • 1992: #1 Seed, lost in second round

Not to mention Bradley, or Bucknell, or the suffering I did during the Ted Owens years.

It's enough to make one start a website, except a) someone beat me to it, and b) Roy was as bad, or worse, because he didn't win a Championship until he clicked his heels three times and went home.

But mark my words, the #1 Seed in NCAA history to lose in the first round will have a six-letter name on the front of the jersey, and have two Crimson and Blue mascots that look something like this.

Sunday, February 20, 2011

Superbowl Commercials

There weren't many that were any good this year. I'll show you my favorite, below, but I found that a lot of people didn't get the reference. So let's go back to September, 1979, when this famous commercial aired, starring Pittsburgh's Mean Joe Greene:

And now that we've set the stage, here's my favorite commercial from this year's Superbowl:

And one more reference to TV gone by:

Wednesday, January 26, 2011

FUSE on the Mac

A while ago I wrote about how one can mount remote computer filesystems using ssh. That was for Linux. You can do it with Mac OS X as well, but FUSE (Filesystem in Userspace) isn't installed by default, as it is in modern Linux systems.

So first you need to get FUSE installed. The package I use is called, duh, MacFUSE. It installs like a standard Mac package: Download the disk image, click on it, and go.

Next is sshfs itself. There seem to be many versions, but Google's seems to work well. Download the appropriate version for your OS (the Leopard version works on Snow Leopard), rename it sshfs, make it executable, and put it in your path. From then on it works pretty much like in Linux, with minor differences. Assuming you have a directory ~/hal on your Mac, you can look at the files on your Linux machine hal with the command

sshfs rcjhawk@hal:/home/rcjhawk ~/hal

So a file named /home/rcjhawk/Documents/testthis.doc on hal will appear as ~/hal/Documents/testthis.doc on your Mac.

And it's simple to unmount it:

umount ~/hal

Wednesday, January 05, 2011

A More Perfect World

http://xkcd.com/843 I wish I lived in this universe.

Me, too.

Wikipedia's List of Common Misconceptions.