pdftk: copying PDF files without comments and annotations

I have many PDF files which contain comments and annotations made with Adobe Acrobat Reader. Deleting the comments from each file by hand would take many hours.
Does PDFtk provide a command to copy these files with the comments and annotations removed?

You can do this with:
cpdf -remove-annotations in.pdf -o out.pdf
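If many files need processing, the same command drops into a shell loop; a minimal sketch, assuming the cleaned copies should land in a (hypothetical) out/ subdirectory:
mkdir -p out
for f in *.pdf; do
cpdf -remove-annotations "$f" -o "out/$f"
done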

One helpful solution (setting the C locale keeps sed from choking on the binary bytes in the PDF stream):
$ export LC_CTYPE=C LANG=C
$ pdftk in.pdf output - uncompress | sed '/^\/Annots/d' | pdftk - output out.pdf compress
The resulting out.pdf has no comments or annotations.
Use bash to process a whole directory on macOS:
export LC_CTYPE=C LANG=C
paperList=papers.txt
rm -f "${paperList}"
ls > "${paperList}"
saveDir=../temp_without_annon
mkdir -p "${saveDir}"
while IFS= read -r line
do
pdftk "${line}" output - uncompress | sed '/^\/Annots/d' | pdftk - output "${saveDir}/${line}" compress
done < "${paperList}"
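Note that parsing ls output breaks on names with spaces; a glob-based sketch of the same loop, assuming the PDFs sit in the current directory:
export LC_CTYPE=C LANG=C
saveDir=../temp_without_annon
mkdir -p "$saveDir"
for f in *.pdf; do
pdftk "$f" output - uncompress | sed '/^\/Annots/d' | pdftk - output "$saveDir/$f" compress
done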
References
How to install pdftk on Mac OS X
https://stackoverflow.com/a/49614525/5046896

How to find (and rename) file and folder names with octal characters

I accidentally copied files with the wrong encoding, so instead of UTF-8 the file and folder names appear with octal escape sequences, e.g. they are called L\366ten.txt instead of Löten.txt. I would like to (at least) find all affected files and folders; ideally I would be able to rename the files automatically (so \366 to ö and so on). If changing the encoding is an option, that's of course okay, too. A Bash solution would be best, but I am open to using Python or something similar.
I tried identifying the files/folders using grep/find, but sadly without any luck.
The quick and dirty oneliner (a more robust variant follows the proof below):
for file in $(find . -regextype posix-extended -regex ".*[\][0-9]{3}.*"); \
do \
OLD_NAME=$(basename "$file"); \
NEW_NAME=$(echo "$OLD_NAME" | \
sed 's/\\337/ß/g' | \
sed 's/\\344/ä/g' | \
sed 's/\\366/ö/g' | \
sed 's/\\374/ü/g'); \
mv "$file" "$(dirname "$file")/$NEW_NAME"; \
done
Proof:
$ touch 'W\344rme.txt' 'L\366ten.txt' 'l\366tf\344hige.txt'
$ ls
'L\366ten.txt' 'l\366tf\344hige.txt' 'W\344rme.txt'
$ copy_paste_oneliner_here
$ ls
Löten.txt lötfähige.txt Wärme.txt
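Since $(find ...) word-splits on whitespace, names containing spaces still break; a more robust sketch of the same rename, assuming GNU find and bash:
find . -regextype posix-extended -regex '.*[\][0-9]{3}.*' -print0 |
while IFS= read -r -d '' file; do
new_name=$(basename "$file" | sed -e 's/\\337/ß/g' -e 's/\\344/ä/g' -e 's/\\366/ö/g' -e 's/\\374/ü/g')
mv "$file" "$(dirname "$file")/$new_name"
done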
UPDATE:
@rt87 If I understood your comment correctly, it's possible to emulate your weird filenames:
$ touch $(echo "Löten.txt" | iconv -f UTF-8 -t ISO-8859-1)
So now we have a file whose name is incorrectly encoded for a UTF-8 locale - L�ten.txt. In the terminal you can see:
$ ls
'L'$'\366''ten.txt'
Thus, you can get back your files with another oneliner:
for file in *.*; do mv "$file" "$(echo "$file" | iconv -f ISO-8859-1 -t UTF-8)"; done
In our test example we get:
$ ls
Löten.txt
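If the convmv utility is available, it automates exactly this recursive rename; it only prints the planned renames until --notest is added:
convmv -f ISO-8859-1 -t UTF-8 -r .           # dry run: list planned renames
convmv -f ISO-8859-1 -t UTF-8 -r --notest .  # actually rename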

SHA1 checksum of files in large directory and output to text file

I need to write the SHA1 checksum of each file to an output file. Several of the files are larger than 8 GB, if that makes a difference, and the entire directory contains ~28K files.
This is what I need the output file to look like:
SHA1(Windows printed document-1.pdf)= 1c1e2844be6e9ddd995941388b98c12a8b7a1e8d
SHA1(Windows printed document.pdf)= 4ea8a157c5d8d0fc9c38aa6312d120ab425900a0
SHA1(checklist.chk)= c5f9e078578925ef3de1e6075b0777d75296bb8f
...
This is the code so far:
openssl dgst -sha1 * > ~/Desktop/output.txt
This code works great for small files or smallish directories, but it fails with an "argument list too long" error when I try to put it into production.
Try hashdeep -c sha1 -r "$DIR", or find "$DIR" -type f -print0 | xargs -0 sha1sum > output.txt. Note that -type f must come before -print0, otherwise find prints every name before testing it; also, sha1sum's output format differs from openssl's.
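To keep the exact SHA1(file)= format from the question while staying under the ARG_MAX limit, xargs can batch the file list into several openssl invocations; a sketch:
find . -type f -print0 | xargs -0 openssl dgst -sha1 > ~/Desktop/output.txt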

combining gunzip and tar commands in Solaris and AIX

I am running the below command to untar a file in Solaris and AIX:
# gunzip /opt/myfile.tar.gz | tar -xvf-
but I'm getting this error:
tar: Unexpected end-of-file while reading from the storage media.
What do I need to fix?
Why should this work? By default, gunzip unpacks the file in place and replaces the compressed file with the uncompressed one; you didn't specify the option necessary to send the uncompressed data stream to stdout. So the tar command receives nothing through the pipe to process, and you get the error message you saw.
This will work:
gunzip -c ../myfile.tar.gz | tar -xfv -
This command line was tested on Solaris 11.3 ... older variants of Solaris may need a different ordering of the option letters, such as
gunzip -c ../myfile.tar.gz | tar -xvf -
I think something like this should work, but I don't have a Solaris system to test it...
gzip -dc /opt/myfile.tar.gz | tar xvf -
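If GNU tar is available (often installed as gtar on Solaris), the decompression can be folded into tar itself with -z:
gtar -xzvf /opt/myfile.tar.gz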

sed edit file in place

I am trying to find out if it is possible to edit a file in a single sed command without manually streaming the edited content into a new file and then renaming the new file to the original file name.
I tried the -i option but my Solaris system said that -i is an illegal option. Is there a different way?
The -i option streams the edited content into a new file and then renames it behind the scenes, anyway.
Example:
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' filename
while on macOS you need:
sed -i '' 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' filename
On a system where sed does not have the ability to edit files in place, I think the better solution would be to use perl:
perl -pi -e 's/foo/bar/g' file.txt
Although this does create a temporary file, it replaces the original because an empty in-place suffix/extension has been supplied.
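To keep a backup with perl, attach an extension to -i, just as with GNU sed:
perl -pi.bak -e 's/foo/bar/g' file.txt   # original kept as file.txt.bak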
Note that on OS X you might get errors like "invalid command code" when running this command. To fix the issue, try
sed -i '' -e "s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g" <file>
This is because on the OS X version of sed, the -i option expects an extension argument, so your command string is parsed as the extension argument and the file path is then interpreted as the command code. Source: https://stackoverflow.com/a/19457213
The following works fine on my Mac:
sed -i.bak 's/foo/bar/g' sample
This replaces foo with bar in the file sample, saving a backup of the original as sample.bak.
For editing in place without a backup, pass the empty suffix as a separate argument (writing -i'' is identical to plain -i once the shell strips the quotes):
sed -i '' 's/foo/bar/g' sample
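For scripts that must run on both flavors, a portable sketch is to detect GNU sed (which, unlike the BSD version, accepts --version) and pick the right -i form:
if sed --version >/dev/null 2>&1; then
sed -i 's/foo/bar/g' sample      # GNU sed
else
sed -i '' 's/foo/bar/g' sample   # BSD/macOS sed
fi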
One thing to note: sed cannot write files on its own, as its sole purpose is to act as an editor on the "stream" (i.e. pipelines of stdin, stdout, stderr, and other >&n buffers, sockets and the like). With this in mind, you can use another command, tee, to write the output back to the file. Another option is to create a patch by piping the content into diff.
Tee method (caution: tee truncates the file as soon as the pipeline starts, so this relies on sed having already read its input and can lose data on larger files):
sed '/regex/' <file> | tee <file>
Patch method
sed '/regex/' <file> | diff -p <file> /dev/stdin | patch
UPDATE:
Also note that patch does not need to be told which file to change, as it reads that from the first line of the diff output:
$ echo foobar | tee fubar
$ sed 's/oo/u/' fubar | diff -p fubar /dev/stdin
*** fubar 2014-03-15 18:06:09.000000000 -0500
--- /dev/stdin 2014-03-15 18:06:41.000000000 -0500
***************
*** 1 ****
! foobar
--- 1 ----
! fubar
$ sed 's/oo/u/' fubar | diff -p fubar /dev/stdin | patch
patching file fubar
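If the moreutils package is installed, its sponge command absorbs all of its input before writing, which avoids the risk with the tee method that the file is truncated before sed has finished reading it:
sed 's/foo/bar/g' file | sponge file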
Versions of sed that support the -i option for editing a file in place write to a temporary file and then rename the file.
Alternatively, you can just use ed. For example, to change all occurrences of foo to bar in the file file.txt, you can do:
echo ',s/foo/bar/g; w' | tr \; '\012' | ed -s file.txt
Syntax is similar to sed, but certainly not exactly the same.
Even if you don't have a -i supporting sed, you can easily write a script to do the work for you. Instead of sed -i 's/foo/bar/g' file, you could do inline file sed 's/foo/bar/g'. Such a script is trivial to write. For example:
#!/bin/sh
IN=$1
shift
trap 'rm -f "$tmp"' 0
tmp=$( mktemp )
<"$IN" "$#" >"$tmp" && cat "$tmp" > "$IN" # preserve hard links
should be adequate for most uses.
You could use vi:
vi -c '%s/foo/bar/g' -c 'wq' my.txt
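A non-interactive variant of the same idea is ex, the line-editor mode of vi; note that if the pattern does not occur, the substitute command fails and ex may stop before writing:
ex -s -c '%s/foo/bar/g' -c 'wq' my.txt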
sed supports in-place editing. From man sed:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied)
Example:
Let's say you have a file hello.txt with the text:
hello world!
If you want to keep a backup of the old file, use:
sed -i.bak 's/hello/bonjour/' hello.txt
You will end up with two files: hello.txt with the content:
bonjour world!
and hello.txt.bak with the old content.
If you don't want to keep a copy, just don't pass the extension parameter.
If you are replacing the same number of characters, and after carefully reading "In-place" editing of files...
You can also use the redirection operator <> to open the file to read and write:
sed 's/foo/bar/g' file 1<> file
See it live:
$ cat file
hello
i am here # see "here"
$ sed 's/here/away/' file 1<> file # Run the `sed` command
$ cat file
hello
i am away # this line is changed now
From Bash Reference Manual → 3.6.10 Opening File Descriptors for Reading and Writing:
The redirection operator
[n]<>word
causes the file whose name is the expansion of word to be opened for
both reading and writing on file descriptor n, or on file descriptor 0
if n is not specified. If the file does not exist, it is created.
Like Moneypenny said in Skyfall: "Sometimes the old ways are best."
Kincade said something similar later on.
$ printf ',s/false/true/g\nw\n' | ed {YourFileHere}
Happy editing in place.
Added '\nw\n' to write the file. Apologies for the delay in answering.
You didn't specify what shell you are using, but with zsh you could use the =( ) construct to achieve this. Something along the lines of:
cp =(sed ... file; sync) file
=( ) is similar to >( ) but creates a temporary file which is automatically deleted when cp terminates.
cp file.txt file.tmp && sed 's/foo/bar/g' < file.tmp > file.txt
This should preserve all hard links, since the output is directed back into the existing file to overwrite its contents, and it avoids any need for a special version of sed. (With mv instead of cp, the redirection would create a brand-new file.txt and break the hard links.)
To resolve this issue on a Mac I had to install the GNU versions of some Unix tools with Homebrew, following this.
brew install grep
==> Caveats
All commands have been installed with the prefix "g".
If you need to use these commands with their normal names, you
can add a "gnubin" directory to your PATH from your bashrc like:
PATH="/usr/local/opt/grep/libexec/gnubin:$PATH"
Call gsed instead of sed (the gsed binary comes from the gnu-sed formula). The Mac default sed doesn't like how grep -rl displays file names with ./ prepended.
~/my-dir/configs$ grep -rl Promise . | xargs sed -i 's/Promise/Bluebird/g'
sed: 1: "./test_config.js": invalid command code .
I also had to use xargs -I{} sed -i 's/Promise/Bluebird/g' {} for files with a space in the name.
Very good examples. I had to edit many files in place, and the -i option seems to be the only reasonable solution when used within a find command. Here is the script that adds "version:" in front of the first line of each file:
find . -name pkg.json -print -exec sed -i '.bak' '1 s/^/version: /' {} \;
In case the strings you want to replace contain '/', you can use '?' as the delimiter instead, e.g. to replace '/usr/local/bin/python' with '/usr/bin/python3' in all *.py files:
find . -name \*.py -exec sed -i 's?/usr/local/bin/python?/usr/bin/python3?g' {} \;

Performing grep operation in tar files without extracting

I have a list of files which contain a particular pattern, but those files have been tarred. Now I want to search for the pattern in the tar file and learn which files contain the pattern, without extracting them.
Any idea...?
The tar command has a -O switch to extract your files to standard output, so you can pipe that output to grep/awk:
tar xvf test.tar -O | awk '/pattern/{print}'
tar xvf test.tar -O | grep "pattern"
e.g. to return the file name once the pattern is found:
tar tf myarchive.tar | while read -r FILE
do
if tar xf myarchive.tar "$FILE" -O | grep "pattern"; then
echo "found pattern in : $FILE"
fi
done
The zgrep command searches compressed files directly; note, though, that run on a tarball it greps the raw tar stream, so it reports matches but not which archived file they came from. For example:
zgrep "mypattern" *.gz
http://linux.about.com/library/cmd/blcmdl1_zgrep.htm
GNU tar has --to-command. With it you can have tar pipe each file from the archive into the given command. For the case where you just want the lines that match, that command can be a simple grep. To know the filenames you need to take advantage of tar setting certain variables in the command's environment; for example,
tar xaf thing.tar.xz --to-command="awk -e '/thing.to.match/ {print ENVIRON[\"TAR_FILENAME\"] \":\", \$0}'"
Because I find myself using this often, I have this:
#!/bin/sh
set -eu
if [ $# -lt 2 ]; then
echo "Usage: $(basename "$0") <pattern> <tarfile>"
exit 1
fi
if [ -t 1 ]; then
h="$(tput setf 4)"
m="$(tput setf 5)"
f="$(tput sgr0)"
else
h=""
m=""
f=""
fi
tar xaf "$2" --to-command="awk -e '/$1/{gsub(\"$1\", \"$m&$f\"); print \"$h\" ENVIRON[\"TAR_FILENAME\"] \"$f:\", \$0}'"
This can be done with tar --to-command and grep --label:
tar xaf archive.tar.gz --to-command 'egrep -Hn --label="$TAR_FILENAME" your_pattern_here || true'
--label gives grep the filename
-H tells grep to display the filename, and -n the line number
|| true because otherwise grep will exit with an error if the pattern is not found, and tar will complain about that.
xaf means to extract, and to automagically decompress based on the file extension
--to-command has tar pass each file in the tarfile to a separate invocation of grep, and sets various environment variables with info about the file. See the manpage for more info.
Pretty heavily based on Chipaca's answer (and Daniel H's comment), but this should be a bit easier to use and just uses tar and grep.
Python's tarfile module, along with TarFile.extractfile(), will let you inspect the tarball's contents without extracting anything to disk.
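A minimal sketch of that approach, here wrapped in a shell command; the archive name and pattern are placeholders:
python3 - archive.tar pattern <<'EOF'
import re, sys, tarfile
archive, pattern = sys.argv[1], sys.argv[2].encode()
with tarfile.open(archive) as tar:            # mode "r:*" auto-detects compression
    for member in tar:
        if not member.isfile():
            continue
        fobj = tar.extractfile(member)        # file-like object; nothing touches the disk
        if fobj is not None and any(re.search(pattern, line) for line in fobj):
            print(member.name)
EOF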
The easiest way is probably to use avfs. I've used this before for such tasks.
Basically, the syntax is:
avfsd ~/.avfs # Sets up an avfs virtual filesystem
rgrep pattern ~/.avfs/path/to/file.tar#/
/path/to/file.tar is the path to the actual tar file.
Prepending ~/.avfs/ (the mount point) and appending # lets avfs expose the tar file as a directory.
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
For example:
ugrep -z PATTERN archive.tgz
This greps each of the archived files to display PATTERN matches with the archived filenames. Archived filenames are shown in braces to distinguish them from ordinary filenames. Everything else is the same as grep (ugrep has the same options and produces the same output). For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
Tarballs (.tar.gz/.tgz, .tar.bz2/.tbz, .tar.xz/.txz, .tar.lzma/.tlz) are searched as well as .zip archives.
You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
This should be much faster than iterating over each file and printing it to stdout, especially for compressed TARs.
Here is a simple comparison benchmark:
function checkFilesWithRatarmount()
{
local pattern=$1
local archive=$2
ratarmount "$archive" "$archive.mountpoint"
'grep' -r -l "$pattern" "$archive.mountpoint/"
}
function checkEachFileViaStdOut()
{
local pattern=$1
local archive=$2
tar --list --file "$archive" | while read -r file; do
if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
echo "Found pattern in: $file"
fi
done
}
function createSampleTar()
{
for i in $( seq 40 ); do
head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
done
tar -czf "$1" [0-9]*.dat
}
createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
Results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression   Ratarmount    Bash loop over tar -O
none          0.31 +- 0.01  0.55 +- 0.02
gzip          1.1 +- 0.1    13.5 +- 0.1
bzip2         1.2 +- 0.1    97.8 +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long, but they already show the problem: the more files there are, the longer tar -O takes to seek to the requested file. And for compressed archives, the loop gets quadratically slower as the archive grows, because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.
