Corrupt PDFs

As a digital/media asset manager, I find corrupt files to be the bane of my existence.

To find corrupt PDF files programmatically, I landed on the following, which uses pdftotext from xpdf.

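for f in *.pdf; do
  # Try to extract pages 1-2; a non-zero exit code means pdftotext hit a problem.
  pdftotext -q -f 1 -l 2 "$f" "$f.txt"
  err=$?
  # Test for 3 (locked) first; otherwise the generic non-zero branch swallows it.
  if [ $err -eq 3 ]; then mv "$f" "_locked_$f"
  elif [ $err -ne 0 ]; then mv "$f" "_failed_$f"
  else rm "$f.txt"; fi
done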
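The exit-code handling can also be pulled out into a small helper, which makes the mapping easy to check without moving any files. This is only a sketch: the name classify_pdf_status is hypothetical, and it assumes the exit codes listed in the xpdf pdftotext man page (0 for success, 3 for a permissions error, anything else for some other failure).

```shell
# Map a pdftotext exit code to a label.
# Assumption: 0 = no error, 3 = PDF permissions error, any other value = failure.
classify_pdf_status() {
  case "$1" in
    0) echo ok ;;
    3) echo locked ;;
    *) echo failed ;;
  esac
}
```

Inside the loop it could be called as classify_pdf_status "$err" to log a status instead of renaming the file right away.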

Kick off multiple instances of a script

I needed to launch about 100 instances of a script all at once to scrape some web data. This was the loop I used:

for f in *.txt; do python inStock.py "$f" & done

Note:

  • A semicolon runs the scripts one after another, each waiting for the previous one to finish:
    • for f in *.txt; do foobar.py "$f"; done
  • An ampersand sends each script to the background, so the loop starts the next one without waiting:
    • for f in *.txt; do foobar.py "$f" & done
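With the ampersand form, the loop returns as soon as the last job has been started, not when the jobs have finished. A minimal sketch of waiting for them, using the shell's built-in wait; run_all is a hypothetical helper name, and its first argument stands in for whatever script you are launching.

```shell
# Start one background job per argument, then block until all of them exit.
# Usage: run_all CMD ARG...   (CMD is a placeholder for your own script)
run_all() {
  cmd=$1
  shift
  for arg in "$@"; do
    "$cmd" "$arg" &
  done
  wait   # with no arguments, waits for every background job started by this shell
}
```

For example, run_all ./foobar.py *.txt would start every job and only return once all have finished. If 100 simultaneous processes turn out to be too many, xargs -P (present in GNU and BSD xargs, though not in POSIX) is one way to cap how many run at once.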

 

Set Environment Variables

When working with login credentials, server addresses, etc., you can keep them out of git and away from other viewers by saving them as local environment variables.

Steps:

  1. Create a .bash_profile if it does not exist yet.
  2. Edit the .bash_profile to add or change the variables.
  3. Load the .bash_profile into the current shell.

In a terminal/console:

touch ~/.bash_profile
vim ~/.bash_profile
source ~/.bash_profile

To set a new variable in the .bash_profile:

#aws
export AWS_KEY="foobarkey123"
export AWS_SECRET="foobarsecret123"

To use environment variables in Python:

import os
awsKey = os.environ['AWS_KEY']
awsSecret = os.environ['AWS_SECRET']
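On the shell side, it can help to fail early when a variable was never exported, for instance because the .bash_profile was not sourced in the current session. A minimal sketch, assuming the variable names passed in are trusted (they go through eval); require_env is a hypothetical helper name.

```shell
# Return non-zero and name the first variable that is unset or empty.
# Assumption: the names passed in are trusted, since they are expanded via eval.
require_env() {
  for name in "$@"; do
    eval "val=\${$name-}"   # indirect lookup: val gets the value of the variable called $name
    if [ -z "$val" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

A script could then start with require_env AWS_KEY AWS_SECRET || exit 1 before launching the Python code.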

Remove duplicate files by MD5

I recently needed to delete an unknown number of duplicate files from a batch of ~10,000 images. The filenames were all different, but the MD5 hash was known and all of the duplicates shared it.

md5 -r *.jpg | grep "37d1b8f9d6f02f31cmb192a28b96cade" | awk '{ print $2 }' | xargs rm
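The one-liner above relies on md5 -r, which is the macOS/BSD tool, and on filenames without spaces (awk's $2 and xargs both split on whitespace). A more defensive sketch follows; file_md5 and rm_by_md5 are hypothetical helper names, and it assumes either md5sum (GNU coreutils) or md5 -q (macOS/BSD) is installed.

```shell
# Print only the MD5 hash of one file.
# Assumption: md5sum (GNU) or md5 -q (macOS/BSD) is available.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

# Delete every regular file in ARGS whose hash equals the first argument.
rm_by_md5() {
  target=$1
  shift
  for f in "$@"; do
    [ -f "$f" ] || continue
    if [ "$(file_md5 "$f")" = "$target" ]; then
      rm -- "$f"
    fi
  done
}
```

It would be called as rm_by_md5 "<known hash>" *.jpg. Hashing one file at a time is slower than a single md5 call over the whole batch, but each filename stays intact as a single argument.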

Split a text file in half

I often need to split long text files into smaller chunks, and the splits have to fall on line boundaries.

Here's my one-liner solution:

f="filename"; s=$(wc -l "$f" | awk '{print $1}'); \
h=$(echo "scale=0;" $((s/2+1)) | bc -q); \
split -l "$h" "$f" "output_"; for file in output_*; \
do mv "$file" "$file.txt"; done

Notes:

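  • bc can handle floating-point numbers; scale=0 makes its divisions return whole numbers. Here the bash arithmetic has already produced an integer, so bc simply passes it through.

  • Bash arithmetic goes inside double parentheses, $(( ... )), and works on integers only.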
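Because shell arithmetic is integer-only, the halfway point can also be computed without bc. The sketch below is a variant, not the one-liner itself: it rounds up with (n + 1) / 2, so a 10-line file splits 5/5 where s/2+1 would give 6/4. The names half_lines and split_in_half are hypothetical.

```shell
# Number of lines for the first chunk: half of N, rounded up.
half_lines() {
  echo $(( ($1 + 1) / 2 ))
}

# Split FILE into two chunks named PREFIXaa and PREFIXab.
split_in_half() {
  file=$1
  prefix=$2
  total=$(wc -l < "$file")
  split -l "$(half_lines "$total")" "$file" "$prefix"
}
```

Calling split_in_half filename output_ leaves output_aa and output_ab, which can then be renamed with the same mv loop as above.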