Spelunking into a Chamber of Horrors

Spoiler: there’s no actual subterranean exploration or cavern filled with toothy denizens of the deep in this one. “Extracting file types and appending them to file names” would, admittedly, be a more accurate description, but not nearly as much fun.

About a week ago a very nice lady handed me a very large folder full of data. I cannot – for a variety of reasons – divulge the identity of the lady (or the source, type and disposition of the data), but you should rest assured that the lady was nice, and the folder was… large. I’m not betraying any confidences or violating any laws when I tell you that the folder contained four thousand and ninety-six subdirectories, and that the entire sum of the files came to a hair shy of fifty thousand files, which in turn added up to many gigabytes. Almost all of the gigabytes, actually.

Now, that’s a lot of files, but in the grand scheme of things not what you’d call remarkable; no, what was remarkable was that they were files referenced as attachments from a large, hopelessly damaged and now-defunct MySQL database, and that they had names like:

2b82e7672d8cfb84d47fdcad0cdab4a66a27b247a689e429cc67d83499976c88

And that’s it. No suffix, no identifying file information at all. What, you might ask yourself, is that thing? Is it a PDF? A Word Doc? A GIF of a dog playing a piano? Well, I’d tell you while interrupting your internal monologue, there’s no way of knowing because that file lacks something useful like “.pdf” on the end of the thing. If you tried to open that file then your computer would cough discreetly and put up a helpful message to ask you what the hell you were doing, and would you please stop?

Fortunately, macOS contains within it a panoply of tools both common and obscure that can be used to root around inside these kinds of files, and with a little elementary scripting it’s not terribly difficult to put together a shell script that will make this whole process simple and transparent. Behold then, The Thing That I Wrote:

Enjoy the magenta, blue, green and silver content that I had no decision in but that I’m dubbing “accidentally festive.”

A word of explanation might not go amiss.

First, I needed a place to pull the files from and a place to put them when they were finished processing. I called these places “Attachments” and “Attachments-converted” respectively. Let’s put a pin in that for a couple of paragraphs, as there’s more to explore with that.

Secondly, I used a for loop to go through each item in the folder, and set suffixName as a variable that would pull the file type from the document. I used file -b to query the file type (the -b flag trims the file name and path from the front of the result) which returned something like this: PDF document, version 1.3, 1 pages.

So? Great! It’s a PDF, and – just like this example – every file returns something similar (i.e., the file type in all-caps followed by version information and size). The next move was to pipe that over to grep -Eo '[A-Z]+' which went into that result and pulled out anything that matched all-caps strings, and then sent that over to head -1 which trimmed everything except the first match from that string. The net effect of that series of operations was that as far as the script was concerned “PDF document, version 1.3, 1 pages” had been whittled down to just plain old “PDF”.

I could have let that be, but giant all-caps file suffixes are ugly and seem like they’re shouting all the time, so I piped that over to the translate (tr) command which turned the upper-case text into lower-case and thus turned PDF into pdf.
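As a minimal sketch of that pipeline, here is the sample description from above pushed through the same chain of commands. The variable name description and the character classes passed to tr are my assumptions; the post only says that tr did the lower-casing.

```shell
# Sample `file -b` output quoted in the post; no real file is needed here.
description='PDF document, version 1.3, 1 pages'

# Keep the first run of capital letters, then lower-case it.
suffixName=$(printf '%s\n' "$description" \
    | grep -Eo '[A-Z]+' \
    | head -1 \
    | tr '[:upper:]' '[:lower:]')

echo "$suffixName"   # prints: pdf
```

The trick leans on the description starting with an all-caps word. A description such as “ASCII text” would come out as ascii, so it is worth eyeballing the results for anything that isn’t a sensible suffix.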

Finally, I used the mv command to move each file into the “Attachments-converted” directory, renaming it as the original filename (“$f”) followed by a period and the suffixName variable, which turned:

2b82e7672d8cfb84d47fdcad0cdab4a66a27b247a689e429cc67d83499976c88

into:

2b82e7672d8cfb84d47fdcad0cdab4a66a27b247a689e429cc67d83499976c88.pdf
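Put together, the steps described above might look something like the sketch below. This is a reconstruction from the description, not the original script: the function name add_suffixes, the src and dst variables, the use of basename and the fallback for files with no all-caps match are all assumptions of mine.

```shell
#!/bin/sh
# Hypothetical reconstruction of the steps described in the post.
# Only suffixName and $f are names taken from the text; the rest is assumed.
add_suffixes() {
    src="$1"   # folder of extensionless files, e.g. Attachments
    dst="$2"   # destination folder, e.g. Attachments-converted
    mkdir -p "$dst"

    for f in "$src"/*; do
        [ -f "$f" ] || continue   # skip directories and an unmatched glob

        # "PDF document, version 1.3, 1 pages" -> "PDF" -> "pdf"
        suffixName=$(file -b "$f" | grep -Eo '[A-Z]+' | head -1 | tr '[:upper:]' '[:lower:]')

        if [ -n "$suffixName" ]; then
            mv "$f" "$dst/$(basename "$f").$suffixName"
        else
            # No all-caps word found: move the file without inventing a suffix.
            mv "$f" "$dst/$(basename "$f")"
        fi
    done
}
```

Called as add_suffixes Attachments Attachments-converted, it would walk the first folder and leave the renamed files in the second.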

This is probably the point to pull the pin out of the bit I mentioned earlier – notably, how to extract fifty thousand files from four thousand and ninety-six directories. There is, a wiser and more knowledgeable friend of mine informs me, a very easy and simple way of doing all of this with Ruby, but I’m terrible at Ruby and his solution – while elegant – was far beyond my miserable skill set.

Note: If you do not have a genius-level programming savant as a close personal friend then I strongly recommend you remedy that error as soon as possible. They’re essential tools, and all you have to do is put gin in one end and wait around for complex solutions to appear.

So, I may know next to nothing about Ruby, but I do know a fair deal about Shortcuts and Automator, and Automator excels at this kind of arbitrary donkey work, thus:

The speed at which this entire operation was conducted was borderline remarkable. Automator sorted and moved all of the files in around ten seconds, and the shell script I threw together had everything examined, renamed and sorted inside about thirty more seconds. The most difficult part of the procedure was trying to work out how to stick it all together – which chiefly involved me staring out of the window with my customary vacant expression – but it’s pretty impressive how the tools that are built into macOS can be fitted together to solve otherwise frustrating problems…
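For anyone who would rather stay in the Terminal, the gathering step could probably be done with find instead. To be clear, this is not the Automator workflow described above, just a rough shell stand-in; the function name flatten_into and its arguments are made up for illustration, and it assumes the destination folder sits outside the source folder.

```shell
# Hypothetical shell alternative to the Automator step: move every regular
# file found anywhere under src into the single folder dst.
flatten_into() {
    src="$1"   # top-level folder holding the subdirectories
    dst="$2"   # flat destination folder; keep it outside src
    mkdir -p "$dst"

    # mv -n declines to overwrite, so a duplicate name stays where it was
    # instead of clobbering a file that has already been moved.
    find "$src" -type f -exec mv -n {} "$dst"/ \;
}
```

Something like flatten_into path/to/the-big-folder Attachments would then leave everything in one place, ready for the renaming pass.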
