pdf2image: how to remove the '0001' in jpg file names? (Solved) - pdf2image

My goal is to convert a multi-page pdf file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being stored in memory.
In Python 3.11:
from pdf2image import convert_from_path
poppler_path = r".\poppler-22.12.0\Library\bin"
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           poppler_path=poppler_path, paths_only=True)
pdf2image generates files with the following names
'test_0001-1.jpg',
'test_0001-2.jpg',
etc
Problem:
I would like the files to have names without the '_0001-' infix (e.g. 'test1.jpg').
The only way so far seems to be to call convert_from_path WITHOUT output_folder and then
save each image with image.save(). But that way the images are first stored in memory, which can easily add up to many megabytes.
Is it possible to change the way pdf2image generates the file names when saving images directly to files?

I'm not sure whether Poppler already has parameters to customize the generated file names, but you can always do this:
- Run the command in an empty directory (e.g. in tempfile.TemporaryDirectory())
- After the command finishes, list the contents of the directory and store the result in a list
- Iterate over the list with a regex that matches the numbers, and build a dict for the mapping (integer to file name)
- At this point you are free to rename the files to whatever you like, or to process them.
The benefit of this solution is that it's neutral, robust, and works for many similar scenarios.
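A minimal sketch of those steps (the function names are hypothetical; convert_and_rename assumes pdf2image and Poppler are available at runtime, the rename mapping itself is pure):

```python
import os
import re
import shutil
import tempfile

def rename_map(filenames, prefix="test"):
    """Map pdf2image output names like 'test_0001-12.jpg' to 'test12.jpg'."""
    page_re = re.compile(r"-(\d+)\.jpg$")  # trailing page number
    mapping = {}
    for name in filenames:
        m = page_re.search(name)
        if m:
            mapping[name] = f"{prefix}{int(m.group(1))}.jpg"
    return mapping

def convert_and_rename(pdf_path, out_dir, prefix="test"):
    # Hypothetical wrapper: convert into a temp dir, then move/rename.
    from pdf2image import convert_from_path
    with tempfile.TemporaryDirectory() as tmp:
        convert_from_path(pdf_path, output_folder=tmp,
                          output_file=prefix, fmt="jpeg", paths_only=True)
        for old, new in rename_map(os.listdir(tmp), prefix).items():
            shutil.move(os.path.join(tmp, old), os.path.join(out_dir, new))
```

Because the images are written straight into the temporary directory, nothing is held in memory beyond one page at a time.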

Hi, have a look at the pdf2image codebase, in the file generators.py.
The names come from counter_generator(prefix="", suffix="", padding_goal=4);
at line 41 you have:
# threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
I think you need to play with zfill() in the yield line:
The Python string zfill() method fills the string with zeroes on its left until it reaches a given width; this is called padding. If the string starts with a sign character (+ or -), the zeroes are added after the sign character rather than before it.
zfill() does not pad at all if the length of the string is already greater than the requested width.
Note: zfill() works the same as rjust() with '0' assigned to the fillchar parameter.
https://www.tutorialspoint.com/python/string_zfill.htm
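The zfill() behavior described above can be seen directly:

```python
# zfill pads on the left with zeroes up to the requested width
print("7".zfill(4))       # 0007
print("-7".zfill(4))      # -007  (zeroes go after the sign)
print("12345".zfill(4))   # 12345 (already wider than 4: unchanged)
print("7".rjust(4, "0"))  # 0007  (rjust with '0' fillchar matches zfill)
```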

Just use the Poppler utilities directly (or xpdf's pdftopng): simply call them via a shell, adding other options such as -r 200 for resolutions other than the default 150.
I recommend PNG for better image fidelity; however, if you want .jpg, replace "-png" below with "-jpg" (the direct answer as asked would be pdftoppm -jpg -f 1 -l 9 -sep "" test.pdf "test"), but do follow the enhancement below for file sorting. Windows file sorting needs leading zeros, otherwise the sort order in a zip or folder is 1, 10, 11, ..., 2, 20, ..., which is often undesirable.
"path to bin\pdftoppm" -png "path to \in.pdf" "name"
Result =
name-1.png
name-2.png etc.
Adding padding digits is limited compared to other apps, so if you want "name-01.png" you need to output only pages 1-9 as
\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"
then for pages 10 onward (say, for a file of up to 99 pages) use the default; it will only use the page numbers that actually exist:
\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"
Thus for 12 pages this produces only -10, -11 and -12, as required.
Likewise, for up to 9999 pages you need 4 calls. If you don't want the "-", simply delete it. For a different output directory, adjust the output name accordingly.
set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"
"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"
In the 12-page example above, the worst case is that the last calls reply:
Wrong page range given: the first page (100) can not be after the last page (12) (and the same for 1000). Those warnings can be ignored.
Those 4 lines could go in a Windows (or other OS) batch/script file that accepts arguments (for SendTo or drag-and-drop); then from the system or from Python simply call pdf2png.bat input.pdf for each file, and in that simple case the output lands in the same directory.
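The same four zero-padded range calls can also be driven from Python via subprocess (a sketch; the pdftoppm path and output name are placeholders for your own):

```python
import subprocess

def range_commands(pdftoppm, pdf_path, name, dpi=200):
    """Build one pdftoppm call per digit-width so every page number
    comes out padded to 4 digits (name-0001.png ... name-9999.png)."""
    ranges = [(1, 9, "00", "0"), (10, 99, "0", "0"),
              (100, 999, "", "0"), (1000, 9999, "", "")]
    return [[pdftoppm, "-png", "-r", str(dpi),
             "-f", str(first), "-l", str(last), "-sep", sep,
             pdf_path, f"{name}-{pad}"]
            for first, last, pad, sep in ranges]

def pdf_to_pngs(pdftoppm, pdf_path, name):
    for cmd in range_commands(pdftoppm, pdf_path, name):
        # out-of-range calls only print a warning, so don't check=True
        subprocess.run(cmd, check=False)
```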

Related

ZSH shell script loops MANY times?

So I want to convert these blogposts to PDF using wkhtmltopdf
On MacOS in automator I set up a WorkFlow: GET SPECIFIED TEXT > EXTRACT URLS FROM TEXT > RUN SHELL SCRIPT (shell: /bin/zsh, pass input: as arguments)
#!/bin/zsh
# url example: "https://www.somewordpressblog.com/2022/12/16/some-post-title/"
i=1
for url in "$@"
do
title1=${url[(ws:/:)6]} # split with delimiter / sixth part
title2=${(U)title1//-/ } # replace all hyphens with spaces and capitalize
title3="$i. - " # prefix add numbers
title4=".pdf" # suffix add .PDF extension
title5="${title3}${title2}${title4}" # join everything
/usr/local/bin/wkhtmltopdf --disable-javascript --print-media-type $url $title5
((i+=1))
done
The files got downloaded, but for a test with only 2 URLs there was about 2 minutes of waiting, and the RESULTS from the shell script showed me 84 items!
I count 14 DONEs in the wkhtmltopdf output.
What is wrong with this loop? Do I need to implement some kind of wait for the loop to continue, and how?
Any code suggestions are welcome as well; it's my first day with zsh.

How do facetiles in Apple Photos correspond to RKFace.modelId?

I have been digging through the Apple Photos macOS app for a couple weekends now and I am stuck. I am hoping the smart people at StackOverflow can figure this out.
What I don't know:
How are new hex directories determined, and how do they correspond to RKFace.modelId? Perhaps RKFace.modelId mod 16, or mod 256?
After a while, the facetile hex value no longer corresponds to the RKFace.modelId. For example, RKFace.modelId 61047 should be facetile_ee77.jpeg. The correct facetile, however, is face/20/01/facetile_1209b.jpeg; hex value 1209b is decimal 73883, for which I have no RKFace.modelId.
Things I know:
Apple Photos leverages deep learning networks to detect and crop faces out of your imported photos. It saves a cropped JPEG of each detected face into your photo library at resources/media/face/00/00/facetile_1.jpeg.
A record corresponding to this facetile is inserted into RKFace, where the RKFace.modelId integer is the decimal value of the hex tail of the filename; a standard dec-to-hex converter derives the correct values.
Each subdirectory (for example /00/00) holds a maximum of 256 facetiles before a new directory is started. The directory names are also in hex format, for example 3e, 3f.
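The dec-to-hex step is a one-liner in Python, using the modelId 61047 example from above:

```python
model_id = 61047
tile = format(model_id, "x")            # hex tail of the filename
print(f"facetile_{tile}.jpeg")          # facetile_ee77.jpeg
print(int("1209b", 16))                 # 73883, back from hex to decimal
```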
While trying to render photo mosaics, I stumbled upon that issue, too...
Then I was lucky to find both a master image and the corresponding facetile,
allowing me to grep around, searching for the decimal and hex equivalent of the numbers embedded in the filenames.
This is what I came up with (assuming you are searching for someone named NAME):
SELECT printf('%04x', mr.modelId) AS tileId
FROM RKModelResource mr, RKFace f, RKPerson p
WHERE f.modelId = mr.attachedModelId
  AND f.personId = p.modelId
  AND p.displayName = 'NAME'
This select prints out all RKModelResource.modelIds in hex, used to name the corresponding facetiles you were searching for. All that is needed now is the complete path to the facetile.
So, a complete bash script to copy all those facetiles of a person (to a local folder out in the current directory) could be:
#!/bin/bash
set -eEu
NAME=$1   # person to search for, passed as first argument
PHOTOS_PATH=$HOME/Pictures/Photos\ Library.photoslibrary
DB_PATH=$PHOTOS_PATH/database/photos.db
echo "$NAME"
mkdir -p "out/$NAME"
TILES=( $(sqlite3 "$DB_PATH" "SELECT printf('%04x', mr.modelId) AS tileId FROM RKModelResource mr, RKFace f, RKPerson p WHERE f.modelId = mr.attachedModelId AND f.personId = p.modelId AND p.displayName='$NAME'") )
for TILE in "${TILES[@]}"; do
    FOLDER=${TILE:0:2}
    SOURCE="$PHOTOS_PATH/resources/media/face/$FOLDER/00/facetile_$TILE.jpeg"
    [[ -e "$SOURCE" ]] || continue
    TARGET="out/$NAME/$TILE.jpeg"
    [[ -e "$TARGET" ]] && continue
    cp "$SOURCE" "$TARGET" || :
done

How to match ID in column in unix?

I am fully aware that similar questions may have been posted, but after searching it seems that the details of our questions are different (or at least I did not manage to find a solution that can be adopted in my case).
I currently have two files: "messyFile" and "wantedID". "messyFile" is of size 80,000,000 x 2,500, whereas "wantedID" is of size 1 x 462. On the 253rd line of "messyFile", there are 2500 IDs. However, all I want is the 462 IDs in the file "wantedID". Assuming that the 462 IDs are a subset of the 2500 IDs, how can I process the file "messyFile" such that it only contains information about the 462 IDs (i.e. of size 80,000,000 x 462)?
Thank you so much for your patience!
ps: Sorry for the confusion. But yeah, the question can be boiled down to something like this. In the 1st row of "File#1", there are 10 IDs. In the 1st row of "File#2", there are 3 IDs ("File#2" consists of only 1 line). The 3 IDs are a subset of the 10 IDs. Now, I hope to process "File#1" so that it contains only information about the 3 IDs listed in "File#2".
ps2: "messyFile" is a vcf file, whereas "wantedID" can be a text file (I said "can be" because it is small, so I can make almost any type for it)
ps3: "File#1" should look something like this:
sample#1 sample#2 sample#3 sample#4 sample#5
0 1 0 0 1
1 1 2 0 2
"File#2" should look something like this:
sample#2 sample#4 sample#5
Desired output should look like this:
sample#2 sample#4 sample#5
1 0 1
1 0 2
For parsing VCF format, use bcftools:
http://samtools.github.io/bcftools/bcftools.html
Specifically for your task see the view command:
http://samtools.github.io/bcftools/bcftools.html#view
Example:
bcftools view -Ov -S 462sample.list -r chr:pos -o subset.vcf superset.vcf
You will need to get the position of the SNP to specify chr:pos above.
You can do this using DbSNP:
http://www.ncbi.nlm.nih.gov/SNP/index.html
Just make sure to match the genome build to the one used in the VCF file.
You can also use plink:
https://www.cog-genomics.org/plink2
But, PLINK is finicky about duplicated SNPs and other things, so it may complain unless you address these issues.
I've done what you are attempting in the past using the awk programming language. For your sanity, I recommend using one of the above tools :)
OK, I have no idea what a vcf file is, but if the File#1 and File#2 samples you gave are files containing tab-separated columns, this will work:
declare -a data=( $(head -1 data.txt) )
declare -a header=( $(head -1 header.txt) )
declare fields
declare -i count
for i in "${header[@]}" ; do
    count=0
    for j in "${data[@]}" ; do
        count+=1
        if [ "$i" == "$j" ] ; then
            fields=$fields,$count
        fi
    done
done
cut -f ${fields:1} data.txt
If they aren't tab separated values perhaps it can be amended for the actual data format.
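If the inputs really are whitespace-separated text like the File#1/File#2 example above, the same header-matching idea is a few lines of Python (a plain-text sketch, not VCF-aware; for real VCFs use bcftools as recommended):

```python
def subset_columns(lines, wanted_header):
    """Keep only the columns of `lines` whose names in the first
    (header) line appear in the whitespace-separated `wanted_header`."""
    header = lines[0].split()
    idx = [header.index(name) for name in wanted_header.split()]
    return [" ".join(row.split()[i] for i in idx) for row in lines]
```

For a 80,000,000-line file you would stream line by line rather than hold the list in memory, but the column-index logic is the same.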

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated, however the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM?) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check your available disk space with:
df -g
(some systems don't support -g, try -m (megs) -k (kiloB) )
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
the help says:
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure whether this means you can combine a bunch of -T /dr1 -T /dr2 ... to get to your 10GB*sortFactor of space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go to whatever dir you are using for sort and watch the activity of the temporary files used for sorting.
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise
There are some possible solutions:
1 - Use any text processing language (perl, awk) to extract each line, save the line number and a hash of that line, and then compare the hashes
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file, but with some criteria? Supposing all your lines begin with letters:
- break your original_file into smaller files: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.

Can I determine if the terminal interprets the C1 control codes?

ISO/IEC 2022 defines the C0 and C1 control codes. The C0 set are the familiar codes between 0x00 and 0x1f in ASCII, ISO-8859-1 and UTF-8 (e.g. ESC, CR, LF).
Some VT100 terminal emulators (e.g. screen(1), PuTTY) support the C1 set, too. These are the values between 0x80 and 0x9f (so, for example, 0x84 moves the cursor down a line).
I am displaying user-supplied input. I do not wish the user input to be able to alter the terminal state (e.g. move the cursor). I am currently filtering out the character codes in the C0 set; however, I would like to conditionally filter out the C1 set too, if the terminal will interpret them as control codes.
Is there a way of getting this information from a database like termcap?
The only way to do it that I can think of is using C1 requests and testing the return value:
$ echo `echo -en "\x9bc"`
^[[?1;2c
$ echo `echo -e "\x9b5n"`
^[[0n
$ echo `echo -e "\x9b6n"`
^[[39;1R
$ echo `echo -e "\x9b0x" `
^[[2;1;1;112;112;1;0x
The above ones are:
CSI c Primary DA; request Device Attributes
CSI 5 n DSR; Device Status Report
CSI 6 n CPR; Cursor Position Report
CSI 0 x DECREQTPARM; Request Terminal Parameters
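A hedged sketch of that probe in Python (assumes stdin/stdout are the same POSIX tty; the reply-shape helper is pure, and "no reply before the timeout" is taken to mean the 8-bit CSI was not interpreted):

```python
import os
import select
import sys
import termios
import tty

def looks_like_csi_reply(buf: bytes) -> bool:
    """True if buf begins with a CSI introducer: 7-bit ESC [ or 8-bit 0x9b."""
    return buf.startswith(b"\x1b[") or buf.startswith(b"\x9b")

def c1_answered(timeout=0.5):
    """Send an 8-bit CSI DSR (0x9b '5n') and wait briefly for a reply."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)                              # read the reply unbuffered
        os.write(sys.stdout.fileno(), b"\x9b5n")    # raw 8-bit CSI + DSR
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            return False                            # no reply: C1 likely ignored
        return looks_like_csi_reply(os.read(fd, 32))
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)
```

Note that on a UTF-8 terminal the raw 0x9b byte may simply be treated as malformed input, which is exactly the "does not interpret C1" case the probe is meant to detect.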
The terminfo/termcap that ESR maintains (link) has a couple of these requests in user strings 7 and 9 (user7/u7, user9/u9):
# INTERPRETATION OF USER CAPABILITIES
#
# The System V Release 4 and XPG4 terminfo format defines ten string
# capabilities for use by applications, .... In this file, we use
# certain of these capabilities to describe functions which are not covered
# by terminfo. The mapping is as follows:
#
# u9 terminal enquire string (equiv. to ANSI/ECMA-48 DA)
# u8 terminal answerback description
# u7 cursor position request (equiv. to VT100/ANSI/ECMA-48 DSR 6)
# u6 cursor position report (equiv. to ANSI/ECMA-48 CPR)
#
# The terminal enquire string (u9) should elicit an answerback response
# from the terminal. Common values for u9 will be ^E (on older ASCII
# terminals) or \E[c (on newer VT100/ANSI/ECMA-48-compatible terminals).
#
# The cursor position request (u7) string should elicit a cursor position
# report. A typical value (for VT100 terminals) is \E[6n.
#
# The terminal answerback description (u8) must consist of an expected
# answerback string. The string may contain the following scanf(3)-like
# escapes:
#
# %c Accept any character
# %[...] Accept any number of characters in the given set
#
# The cursor position report (u6) string must contain two scanf(3)-style
# %d format elements. The first of these must correspond to the Y coordinate
# and the second to the X coordinate. If the string contains the sequence %i, it is
# taken as an instruction to decrement each value after reading it (this is
# the inverse sense from the cup string). The typical CPR value is
# \E[%i%d;%dR (on VT100/ANSI/ECMA-48-compatible terminals).
#
# These capabilities are used by tack(1m), the terminfo action checker
# (distributed with ncurses 5.0).
Example:
$ echo `tput u7`
^[[39;1R
$ echo `tput u9`
^[[?1;2c
Of course, if you only want to prevent display corruption, you can use the approach less takes, and let the user switch between displaying and not displaying control characters (the -r and -R options in less). Also, if you know your output charset, the ISO-8859 charsets have the C1 range reserved for control codes (so they have no printable characters in that range).
Actually, PuTTY does not appear to support C1 controls.
The usual way of testing this feature is with vttest, which provides menu entries for changing the input- and output- separately to use 8-bit controls. PuTTY fails the sanity-check for each of those menu entries, and if the check is disabled, the result confirms that PuTTY does not honor those controls.
I don't think there's a straightforward way to query whether the terminal supports them. You can try nasty hacky workarounds (like print them and then query the cursor position) but I really don't recommend anything along these lines.
I think you could just filter out these C1 codes unconditionally. Unicode declares the U+0080.. U+009F range as control characters anyway, I don't think you should ever use them for anything different.
(Note: you used the example 0x84 for cursor down. It's in fact U+0084 encoded in whichever encoding the terminal uses, e.g. 0xC2 0x84 for UTF-8.)
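A sketch of that unconditional filter in Python, operating on decoded text so that U+0084 is caught regardless of how the terminal's encoding represents it (the kept-whitespace set is an assumption; widen it if your display needs more):

```python
def strip_controls(text: str) -> str:
    """Drop C0 (U+0000-U+001F), DEL (U+007F), and C1 (U+0080-U+009F)
    controls, keeping only the whitespace a plain display needs."""
    keep = {"\t", "\n", "\r"}
    return "".join(
        ch for ch in text
        if ch in keep or not (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F)
    )
```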
Doing it 100% automatically is challenging at best. Many, if not most, Unix interfaces are smart (xterms and whatnot), but you don't actually know whether you're connected to an ASR33 or a PC running MSDOS.
You could try some of the terminal interrogation escape sequences and time out if there is no reply. But then you might have to fall back and perhaps ask the user what kind of terminal they are using.