Tuesday, January 01, 2013

Sorting From Back to Front

The other day I was presented with a list of names and email addresses, something like this one:

Fred Flinstone <flintstone@bedrock.sag>
Barney & Wilma Rubble <bambamsfolks@bedrock.sag>
Steadholder Honor Harrington <dutchess@harrington.mdc>
Count Miles Vorkosigan <auditor@vorkosigan.byr>
Dennis & Margaret Mitchell <stilltrouble@funnies.comics.net>
Homer Simpson <donuts@springfield.st.us>
Rudolph <rednose@reindeer.np>
Kimball Kinnison <kinnison@graylens.gp>
Wile E. Coyote <genius@acme.net>
Tiberius Claudius Drusus Nero Germanicus Julius Caesar <imperator@spqr.rm>
Gen. Jack O'Neill <jack.oneill@stargate.oml>

Except that it had around 100 names. What I wanted to do was to alphabetize this by last name, to make it easier to figure out who was missing from the list, but keep the final result as
     FirstName MiddleName(s) LastName <email>
since this was input to an email list in that format.

This would not be difficult if each person had exactly two names, say
     FirstName LastName <email>
in which case we'd just run the command
     sort -k 2 < elist
and we'd be done.

Unfortunately each line contains between two and eight fields, counting the email address, and we want to sort on the next-to-last one. As far as I can tell, sort has no way to count fields from the end of the line.
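To see the problem concretely, here is a small sketch using two lines from the list above. The sample_lines helper is just something I made up for the illustration, and LC_ALL=C is only there to make the ordering predictable:

```shell
# Hypothetical helper that prints two entries from the sample list.
sample_lines() {
  printf '%s\n' \
    'Homer Simpson <donuts@springfield.st.us>' \
    'Count Miles Vorkosigan <auditor@vorkosigan.byr>'
}

# Field 2 is "Simpson" on the first line but "Miles" on the second,
# so Vorkosigan is ordered by his first name and lands ahead of Simpson.
sample_lines | LC_ALL=C sort -k 2
```

Sorted by last name, Simpson should come first, so a fixed field number isn't enough here.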

However, the awk (or gawk) command does. For example, the command
     awk '{print $NF}' < elist
would list just the email addresses from the above file, and
     awk '{print $(NF-1)}' < elist
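would list the last names. The parentheses are needed because $ binds more tightly than subtraction in awk: without them, $NF-1 means "the last field, minus one" rather than "field number NF-1".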
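A quick way to check how awk reads these expressions is to feed it a single line. This is only a sketch; the variable name line is mine:

```shell
line='Wile E. Coyote <genius@acme.net>'

# NF is the number of whitespace-separated fields on the line: 4 here.
printf '%s\n' "$line" | awk '{print NF}'

# $(NF-1) is field number NF-1, the next-to-last field: Coyote.
printf '%s\n' "$line" | awk '{print $(NF-1)}'

# $NF-1 is parsed as ($NF)-1. The last field is not a number,
# so awk treats it as 0 and prints -1.
printf '%s\n' "$line" | awk '{print $NF-1}'
```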

So what we need is a way to have awk pull out the last name from the file, sort those, then put everything back together. It turns out we can do that with a one-liner. I found it on the web yesterday, but I've lost the link, so I can't give proper credit. I did save the command, or my modification of it, at least:

awk '{print $(NF-1), $0}' < elist | sort | cut -f2- -d' '

Let's look at that in detail:

  • awk '{print $(NF-1), $0}' < elist
    prints out the next to last column of each line, followed by the entire line ($0).
  • sort
    then sorts the lines. Since each line now starts with the last name, that puts them in last-name order. Unfortunately, that leaves you with entries like this:
     Simpson Homer Simpson <donuts@springfield.st.us>
    To get rid of these, we need
  • cut -f2- -d' '
    which splits each line on single spaces (the -d' ') and prints everything from the second field on (the -f2-; if we wanted just the second and third fields it would be -f2-3).
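The three steps above can be run one at a time to watch the lines change. This pattern is often called decorate-sort-undecorate. The sketch below uses three lines from the sample list; the sample_lines helper is hypothetical and LC_ALL=C is just there to pin down the sort order:

```shell
# Hypothetical helper that prints three entries from the sample list.
sample_lines() {
  printf '%s\n' \
    'Homer Simpson <donuts@springfield.st.us>' \
    'Wile E. Coyote <genius@acme.net>' \
    'Count Miles Vorkosigan <auditor@vorkosigan.byr>'
}

# Step 1: put a copy of the last name in front of each line.
sample_lines | awk '{print $(NF-1), $0}'

# Step 2: sort the decorated lines; the leading last name decides the order.
sample_lines | awk '{print $(NF-1), $0}' | LC_ALL=C sort

# Step 3: cut the leading last name off again, leaving the original lines.
sample_lines | awk '{print $(NF-1), $0}' | LC_ALL=C sort | cut -f2- -d' '
```

The last command prints the Coyote line first, then Simpson, then Vorkosigan.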

And the correctly sorted output is:

Tiberius Claudius Drusus Nero Germanicus Julius Caesar <imperator@spqr.rm>
Wile E. Coyote <genius@acme.net>
Fred Flinstone <flintstone@bedrock.sag>
Steadholder Honor Harrington <dutchess@harrington.mdc>
Kimball Kinnison <kinnison@graylens.gp>
Dennis & Margaret Mitchell <stilltrouble@funnies.comics.net>
Gen. Jack O'Neill <jack.oneill@stargate.oml>
Barney & Wilma Rubble <bambamsfolks@bedrock.sag>
Rudolph <rednose@reindeer.np>
Homer Simpson <donuts@springfield.st.us>
Count Miles Vorkosigan <auditor@vorkosigan.byr>

Fairly simple, huh? I generalized it a bit, so that we can sort on an arbitrary column from the end:

#! /bin/bash

# Usage

# lastsort N filename
# Sorts the file filename on the field N columns from the end
# N=0 is last column of the file

awk '{print $(NF-'$1'), $0}' $2 | sort | cut -f2- -d' '

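Note the single quotes around the $1 in the awk command: they close and reopen the quoted awk program, so that the shell substitutes the first argument of the calling command at that spot. Without them, awk would read $1 as the first field of each line rather than the argument.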
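An alternative to splicing the argument into the program text is awk's -v option, which assigns a value to an awk variable before the program runs. Here is a sketch of the same idea written as a shell function. The variable name n is my own choice, and quoting "$2" is an addition so that filenames with spaces survive:

```shell
lastsort() {
  # $1: how many columns from the end to sort on (0 = last column)
  # $2: the file to sort
  # -v n="$1" hands the count to awk as the variable n.
  awk -v n="$1" '{print $(NF-n), $0}' "$2" | sort | cut -f2- -d' '
}

# Usage: lastsort 1 elist
```

This way the awk program stays inside one pair of single quotes.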

OK, this could have a few bells and whistles, but I'm not going to bother with that now.