Using comm, uniq, and sort: ttftw 2023w21
By Robert Russell
6 minutes read - 1272 words

Three things from this week
This week I’ve been finagling a lot of files, trying to organize my backups and my archives a little better. It turns out I have a lot of partially-duplicated copies of ad hoc backups. Don’t judge me.
Trying to compare lists of files across directories gets complicated and it can sink a whole lot of time. I’m using some shell one-liners to compare lists of files, so this week I thought I’d share three good bash commands for this kind of work.
comm
This is a new one for me. It takes two lists and figures out which items are in only the first list, only the second list, or common to both lists. The output is a table with three columns but you can suppress one or two columns. So here’s how I applied it.
Let’s use this made-up file structure.
$ tree
.
├── left
│   ├── backup(1)
│   ├── copy of copy of Documents
│   ├── important-stuff
│   └── junkdrawer
└── right
    ├── backup(1)
    ├── important-stuff
    └── top-secret
9 directories, 0 files
Here’s what `comm` does with the directories `left` and `right`:
$ comm <(ls left) <(ls right)
		backup(1)
copy of copy of Documents
		important-stuff
junkdrawer
	top-secret
The tab-delimited columns are hard to read but they are columns.
More interesting for my case is just the list of files that are in both directories:
$ comm -12 <(ls left) <(ls right)
backup(1)
important-stuff
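The other suppression combinations came in handy too. A quick sketch, recreating the made-up tree above in a throwaway directory so the demo doesn’t clutter anything:

```shell
cd "$(mktemp -d)"   # scratch directory for the demo
mkdir -p left/{'backup(1)','copy of copy of Documents',important-stuff,junkdrawer}
mkdir -p right/{'backup(1)',important-stuff,top-secret}

# Suppress columns 2 (right-only) and 3 (common): entries only in left
comm -23 <(ls left) <(ls right)
# copy of copy of Documents
# junkdrawer

# Suppress columns 1 and 3: entries only in right
comm -13 <(ls left) <(ls right)
# top-secret
```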
The `-12` option means to suppress columns 1 and 2. It’s pretty cryptic but it came in handy. The other cryptic part here is how I’ve supplied the output of `ls` to `comm`. I adapted this from an example given by tldr and assumed the `<(command)` syntax was some sort of redirect. It acts a lot like a redirect but actually it’s a process substitution.
If you do process substitution wrong you might see a message with `/dev/fd/number` in it. Things like this:
$ echo <(ls left) <(ls right)
/dev/fd/63 /dev/fd/62
You might expect this command would do something like just printing out the names of all the files - since `echo` should just print the text that it’s given. And `echo` is doing that, but the text that it’s been given is sort of like a temporary file name through which the output from the `ls` processes can be referenced. Using `cat` instead of `echo` would dump the contents of those files, and they are indeed the output from the two `ls` commands:
$ cat <(ls left) <(ls right)
backup(1)
copy of copy of Documents
important-stuff
junkdrawer
backup(1)
important-stuff
top-secret
I don’t understand all the finer details here but if that’s your cup of tea then here’s a nice post on /dev/fd to ease into discovering more.
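Process substitution isn’t specific to `comm` - any command that expects filenames will happily open those `/dev/fd` paths. `diff` is a classic example for comparing two directory listings:

```shell
cd "$(mktemp -d)"   # scratch directory with one entry on each side
mkdir -p left/junkdrawer right/top-secret

# diff opens the /dev/fd/NN paths like ordinary files.
# It exits nonzero when the inputs differ, hence the || true.
diff <(ls left) <(ls right) || true
# 1c1
# < junkdrawer
# ---
# > top-secret
```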
Using `comm` was nice. I had a bunch of directories, so I made lists of various combinations that helped me pare down extraneous copies and consolidate the remains.
uniq
This one isn’t new to me but it does more than I realized. That seems to be a theme with shell commands when I read the man pages. Usually `uniq` is a nice quick way to remove duplicate lines from a file or command output. For example, using the same directory tree from above, I might want to know just how many folders I’m dealing with. So I’d use
$ for d in left right ; do ls -1 ${d} ; done | sort | uniq
backup(1)
copy of copy of Documents
important-stuff
junkdrawer
top-secret
In this sorted list we get all the directories, with each appearing only once. I had hundreds of these things and I just wanted to know how much more work I’d have to do - how many directories do I have to deal with? That’s just a count of the number of lines here, given by `wc -l`.
$ for d in left right ; do ls -1 ${d} ; done | sort | uniq | wc -l
5
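Relatedly, `uniq -c` prefixes each distinct line with the number of times it appeared, which is a quick way to see how many copies of each directory are floating around (directory names here are from the made-up tree):

```shell
cd "$(mktemp -d)"   # scratch copy of part of the tree
mkdir -p left/{'backup(1)',junkdrawer} right/'backup(1)'

# -c counts how many times each (sorted, adjacent) line occurs
for d in left right ; do ls -1 "${d}" ; done | sort | uniq -c
# backup(1) shows a count of 2, junkdrawer a count of 1
```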
Removing duplicates is the common use of `uniq` I’d done before, though. The interesting thing I discovered is a different sort of measurement. With the `--unique` flag we can also get the lines that appear only once in the input. This is, in a way, the opposite of what `comm -12` is doing. Where `comm -12` tells me which lines show up in both inputs, `uniq --unique` (or `uniq -u` if you like short flags) tells me which ones appear in only one input.
$ sort <(ls -1 left) <(ls -1 right) | uniq --unique
copy of copy of Documents
junkdrawer
top-secret
Here I’ve switched to process substitution again. The `sort` command takes a list of filenames and outputs the result of sorting the contents of all those files together. The file descriptors (that we saw can look like `/dev/fd/NN`) are what `sort` opens up to get the lines it needs to sort. Then the sorted result is sent as input to `uniq --unique`.
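`uniq` also has the mirror-image flag `--repeated` (`-d` for short), which prints one copy of each line occurring more than once. On this input it reproduces the `comm -12` result:

```shell
cd "$(mktemp -d)"   # recreate the made-up tree
mkdir -p left/{'backup(1)','copy of copy of Documents',important-stuff,junkdrawer}
mkdir -p right/{'backup(1)',important-stuff,top-secret}

# --repeated / -d: one copy of each duplicated line
sort <(ls -1 left) <(ls -1 right) | uniq --repeated
# backup(1)
# important-stuff
```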
In set terms, the output from `comm -12` is the intersection of the two lists, whereas the output from `uniq --unique` is everything outside that intersection - the items that belong to only one list or the other.
sort
Both `uniq` and `comm` require sorted input. When I was using these commands earlier in the week I actually also piped things to `sort`. What I learned today (again?) is that `ls` actually does sort alphabetically by default. I mean, I always observe that the output is sorted, but for some reason I assumed that it wasn’t guaranteed to be, or that it was some environment variable controlling that behaviour. In fact `ls` does “Sort entries alphabetically if none of -cftuvSUX nor --sort is specified” per `man 1 ls`.
I still want to share my favourite `sort` flag all the same: `-h`. Human-friendly sorting for lists that have kB, MB, GB, etc. makes it easy to work with the output of `du` and similar tools. Apparently it has long forms too, but they’re very long: `--human-numeric-sort` and `--sort=human-numeric`. I love/hate that `sort` has a `--sort=` flag, but apparently it can also parse other things, such as version numbers with `--sort=version`. Looking at the weirder `sort` options and features like defining a sort key (with `-k`) makes me think that if you need those then maybe you want to check out VisiData instead of relying on `sort` and friends.
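Version sort is handy often enough that it also gets the short flag `-V`. A tiny demonstration, plus a sketch of a `-k` sort key (assuming the usual colon-separated `/etc/passwd` layout where field 3 is the UID):

```shell
# Plain lexicographic sort puts 1.10 before 1.2
printf '1.10\n1.2\n1.9\n' | sort
# 1.10
# 1.2
# 1.9

# Version sort compares the numeric components instead
printf '1.10\n1.2\n1.9\n' | sort -V
# 1.2
# 1.9
# 1.10

# A -k sort key: sort /etc/passwd numerically by its 3rd colon-separated field
sort -t: -k3,3n /etc/passwd | head -n 3
```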
My earlier directories are all empty, but just for completeness, here’s what I mean about `sort -h` with `du`:
$ du -h | sort -h
4.0K ./left/backup(1)
4.0K ./left/copy of copy of Documents
4.0K ./left/important-stuff
4.0K ./left/junkdrawer
4.0K ./right/backup(1)
4.0K ./right/important-stuff
4.0K ./right/top-secret
16K ./right
20K ./left
40K .
Et cetera
The GNU Coreutils manual is full of little tidbits that you might not run into elsewhere. And maybe that’s fine since most people don’t need to make a permuted index anymore1. However, you might save yourself a lot of unnecessary text munging if you know that `ls -1` outputs only filenames, so you don’t need to use `cut`. Or that `split` and `csplit` can slice up a file for you, so you don’t need to run `sed` or use `grep` with a bunch of context lines.
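For instance, a quick sketch of `split` (the input file here is made up):

```shell
cd "$(mktemp -d)"
seq 1 10 > numbers.txt   # a made-up 10-line input file

# -l 4: at most 4 lines per piece; default output names are xaa, xab, xac, ...
split -l 4 numbers.txt

wc -l xa*   # xaa and xab get 4 lines each, xac the remaining 2
```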
Even though I hate on bash scripting quite vocally, I do find that shell proficiency is undeniably helpful in speeding up the mundane work we have to do on computers all the time. Packing your history full of one-liners puts them just a quick ctrl-r away.
-
But if you do need a permuted index then `ptx` apparently has you covered. Sorry, but learning about Key Word in Context indices was really interesting and I had to shoehorn a link in here. ↩︎