Using comm, uniq, and sort: ttftw 2023w21
By Robert Russell
6 minutes read - 1272 words

Three things from this week
This week I’ve been finagling a lot of files, trying to organize my backups and my archives a little better. It turns out I have a lot of partially-duplicated copies of ad hoc backups. Don’t judge me.
Trying to compare lists of files across directories gets complicated and it can sink a whole lot of time. I’m using some shell one-liners to compare lists of files, so this week I thought I’d share three good bash commands for this kind of work.
comm
This is a new one for me. It takes two lists and figures out which items are in only the first list, only the second list, or common to both lists. The output is a table with three columns but you can suppress one or two columns. So here’s how I applied it.
Let’s use this made-up file structure.
$ tree
.
├── left
│   ├── backup(1)
│   ├── copy of copy of Documents
│   ├── important-stuff
│   └── junkdrawer
└── right
    ├── backup(1)
    ├── important-stuff
    └── top-secret
9 directories, 0 files
Here’s what `comm` does with the directories `left` and `right`:
$ comm <(ls left) <(ls right)
		backup(1)
copy of copy of Documents
		important-stuff
junkdrawer
	top-secret
The tab-delimited columns are hard to read but they are columns.
More interesting for my case is just the list of files that are in both directories:
$ comm -12 <(ls left) <(ls right)
backup(1)
important-stuff
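The other suppression combinations came in handy too. A quick sketch, recreating the made-up tree above in a throwaway directory so the demo doesn’t clutter anything:

```shell
cd "$(mktemp -d)"   # scratch directory for the demo
mkdir -p left/{'backup(1)','copy of copy of Documents',important-stuff,junkdrawer}
mkdir -p right/{'backup(1)',important-stuff,top-secret}

# Suppress columns 2 (right-only) and 3 (common): entries only in left
comm -23 <(ls left) <(ls right)
# copy of copy of Documents
# junkdrawer

# Suppress columns 1 and 3: entries only in right
comm -13 <(ls left) <(ls right)
# top-secret
```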
The `-12` option means to suppress columns 1 and 2. It’s pretty cryptic but it came in handy. The other cryptic part here is how I’ve supplied the output of `ls` to `comm`. I adapted this from an example given by tldr and assumed the `<(command)` syntax was some sort of redirect. It acts a lot like a redirect but actually it’s a process substitution.
If you do process substitution wrong you might see a message with `/dev/fd/number` in it. Things like this:
$ echo <(ls left) <(ls right)
/dev/fd/63 /dev/fd/62
You might expect this command would do something like just printing out the names of all the files - since `echo` should just print the text that it’s given. And `echo` is doing that, but the text that it’s been given is sort of like a temporary file name through which the output from the `ls` processes can be referenced. Using `cat` instead of `echo` would dump the contents of those files, and they are indeed the output from the two `ls` commands:
$ cat <(ls left) <(ls right)
backup(1)
copy of copy of Documents
important-stuff
junkdrawer
backup(1)
important-stuff
top-secret
I don’t understand all the finer details here but if that’s your cup of tea then here’s a nice post on /dev/fd to ease into discovering more.
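Process substitution isn’t specific to `comm` - any command that expects filenames will happily open those `/dev/fd` paths. `diff` is a classic example for comparing two directory listings:

```shell
cd "$(mktemp -d)"   # scratch directory with one entry on each side
mkdir -p left/junkdrawer right/top-secret

# diff opens the /dev/fd/NN paths like ordinary files.
# It exits nonzero when the inputs differ, hence the || true.
diff <(ls left) <(ls right) || true
# 1c1
# < junkdrawer
# ---
# > top-secret
```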
Using `comm` was nice. I had a bunch of directories, so I made lists of various combinations that helped me pare down extraneous copies and consolidate the remains.
uniq
This one isn’t new to me but it does more than I realized. That seems to be a theme with shell commands when I read the man pages. Usually `uniq` is a nice quick way to remove duplicate lines from a file or command output. For example, using the same directory tree from above, I might want to know just how many folders I’m dealing with. So I’d use
$ for d in left right ; do ls -1 ${d} ; done | sort | uniq
backup(1)
copy of copy of Documents
important-stuff
junkdrawer
top-secret
In this sorted list we get all the directories, with each appearing only once. I had hundreds of these things and I just wanted to know how much more work I’d have to do - how many directories do I have to deal with? That’s just a count of the number of lines here, given by `wc -l`.
$ for d in left right ; do ls -1 ${d} ; done | sort | uniq | wc -l
5
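Relatedly, `uniq -c` prefixes each distinct line with the number of times it appeared, which is a quick way to see how many copies of each directory are floating around (directory names here are from the made-up tree):

```shell
cd "$(mktemp -d)"   # scratch copy of part of the tree
mkdir -p left/{'backup(1)',junkdrawer} right/'backup(1)'

# -c counts how many times each (sorted, adjacent) line occurs
for d in left right ; do ls -1 "${d}" ; done | sort | uniq -c
# backup(1) shows a count of 2, junkdrawer a count of 1
```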
Removing duplicates is the common use of `uniq` I’d done before, though. The interesting thing I discovered is a different sort of measurement. With the `--unique` flag we can also get the lines that appear only once in the input. This is, in a way, the opposite of what `comm -12` is doing. Where `comm -12` tells me which lines show up in both inputs, `uniq --unique` (or `uniq -u` if you like short flags) tells me which ones appear in only one input.
$ sort <(ls -1 left) <(ls -1 right) | uniq --unique
copy of copy of Documents
junkdrawer
top-secret
Here I’ve switched to process substitution again. The `sort` command takes a list of filenames and outputs the result of sorting the contents of all those files together. The file descriptors (that we saw can look like `/dev/fd/NN`) are what `sort` opens up to get the lines it needs to sort. Then the sorted result is sent as input to `uniq --unique`.
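`uniq` also has the mirror-image flag `--repeated` (`-d` for short), which prints one copy of each line occurring more than once. On this input it reproduces the `comm -12` result:

```shell
cd "$(mktemp -d)"   # recreate the made-up tree
mkdir -p left/{'backup(1)','copy of copy of Documents',important-stuff,junkdrawer}
mkdir -p right/{'backup(1)',important-stuff,top-secret}

# --repeated / -d: one copy of each duplicated line
sort <(ls -1 left) <(ls -1 right) | uniq --repeated
# backup(1)
# important-stuff
```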
In set terms, the output from `comm -12` is the intersection of the two lists, whereas the output from `uniq --unique` is everything outside that intersection - the items that belong to only one list or the other.
sort
Both `uniq` and `comm` require sorted input. When I was using these commands earlier in the week I actually also piped things to `sort`. What I learned today (again?) is that `ls` actually does sort alphabetically by default. I mean, I always observe that the output is sorted, but for some reason I assumed that it wasn’t guaranteed to be, or that it was some environment variable controlling that behaviour. In fact `ls` does “Sort entries alphabetically if none of -cftuvSUX nor --sort is specified” per `man 1 ls`.
I still want to share my favourite `sort` flag all the same: `-h`. Human-friendly sorting for lists that have kB, MB, GB, etc. makes it easy to work with the output of `du` and similar tools. Apparently it has long forms too, but they’re very long: `--human-numeric-sort` and `--sort=human-numeric`. I love/hate that `sort` has a `--sort=` flag, but apparently it can also parse other things, such as version numbers with `--sort=version`. Looking at the weirder `sort` options and features like defining a sort key (with `-k`) makes me think that if you need those then maybe you want to check out VisiData instead of relying on `sort` and friends.
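Version sort is handy often enough that it also gets the short flag `-V`. A tiny demonstration, plus a sketch of a `-k` sort key (assuming the usual colon-separated `/etc/passwd` layout where field 3 is the UID):

```shell
# Plain lexicographic sort puts 1.10 before 1.2
printf '1.10\n1.2\n1.9\n' | sort
# 1.10
# 1.2
# 1.9

# Version sort compares the numeric components instead
printf '1.10\n1.2\n1.9\n' | sort -V
# 1.2
# 1.9
# 1.10

# A -k sort key: sort /etc/passwd numerically by its 3rd colon-separated field
sort -t: -k3,3n /etc/passwd | head -n 3
```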
My earlier directories are all empty, but just for completeness, here’s what I mean about `sort -h` with `du`:
$ du -h | sort -h
4.0K ./left/backup(1)
4.0K ./left/copy of copy of Documents
4.0K ./left/important-stuff
4.0K ./left/junkdrawer
4.0K ./right/backup(1)
4.0K ./right/important-stuff
4.0K ./right/top-secret
16K ./right
20K ./left
40K .
Et cetera
The GNU Coreutils manual is full of little tidbits that you might not run into elsewhere. And maybe that’s fine since most people don’t need to make a permuted index anymore1. However, you might save yourself a lot of unnecessary text munging if you know that `ls -1` outputs only filenames, so you don’t need to use `cut`. Or that `split` and `csplit` can slice up a file for you, so you don’t need to run `sed` or use `grep` with a bunch of context lines.
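For instance, a quick sketch of `split` (the input file here is made up):

```shell
cd "$(mktemp -d)"
seq 1 10 > numbers.txt   # a made-up 10-line input file

# -l 4: at most 4 lines per piece; default output names are xaa, xab, xac, ...
split -l 4 numbers.txt

wc -l xa*   # xaa and xab get 4 lines each, xac the remaining 2
```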
Even though I hate on bash scripting quite vocally, I do find that shell proficiency is undeniably helpful in speeding up the mundane work we have to do on computers all the time. Packing your history full of one-liners puts them just a quick ctrl-r away.
-
But if you do need a permuted index then `ptx` apparently has you covered. Sorry, but learning about Key Word in Context indices was really interesting and I had to shoehorn a link in here. ↩︎