Makefile pattern rules and wildcard - how to pass two parameters from target to recipe - gnu-make

I'm trying to pass parameters through the file names that a phony target lists as prerequisites to a Stata do-file, and I'm wondering if there is a way around the restriction of a single pattern per rule in my situation.
In the phony target:
test: file1_2005_2010.dta file1_2004_2008.dta
In another makefile I now want to parse out these start and end years; in theory this would work by passing 2005_2010 and 2004_2008 to the Stata do-file:
file1_%.dta:
	cd path/folder && YEAR=$* $(STATA) dofile.do
But the problem is that some of the prerequisites contain only the start year and have to be made dynamically, so % can only stand for 2005 in this case:
file1_2005_2010.dta: file2_2005.dta
	cd path/folder && $(STATA) dofile.do

file1_%_2010.dta: file2_%.dta
	cd path/folder && YEAR=$* $(STATA) dofile.do
I don't necessarily need 2010 to match any prerequisite filenames; it just needs to be passed to the recipe. Using a wildcard for 2010 (file1_%_*.dta) doesn't work either if the target doesn't already exist.
Is there any way around these two restrictions?

This will first extract the end-year part from the target name and then prefix it with the start year matched by the pattern. Walking through it step by step.
To get the target filename we use the automatic variable $@:
YEAR=$@
# Year is file1_2005_2010.dta
Remove the suffix .dta:
YEAR=$(basename $@)
# Year is file1_2005_2010
Exchange _ with whitespace to create words:
YEAR=$(subst _, ,$(basename $@))
# Year is "file1 2005 2010"
Extract the last word:
YEAR=$(lastword $(subst _, ,$(basename $@)))
# Year is 2010
Finally, prefix it with the start year:
YEAR=$*_$(lastword $(subst _, ,$(basename $@)))
# Year is 2005_2010
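Putting it all together in a pattern rule, a sketch (the paths, the $(STATA) variable, and the hard-coded end year are taken from the question):

file1_%_2010.dta: file2_%.dta
	cd path/folder && YEAR=$*_$(lastword $(subst _, ,$(basename $@))) $(STATA) dofile.do

For the target file1_2005_2010.dta the stem is 2005, so this pulls in file2_2005.dta as the prerequisite and passes YEAR=2005_2010 to the recipe.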

Related

pdf2image: how to remove the '0001' in jpg file names? (Solved)

My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being stored in memory.
In Python 3.11:
from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           poppler_path=poppler_path, paths_only=True)
pdf2image generates files with the following names
'test_0001-1.jpg',
'test_0001-2.jpg',
etc.
Problem:
I would like the files to have names without the '_0001-' part (e.g. 'test1.jpg').
The only way so far seems to be to use convert_from_path WITHOUT output_folder and then save each image with image.save. But that way the images are first stored in memory, which can easily add up to many megabytes.
Is it possible to change the way pdf2image generates the file names when saving images directly to files?
I'm not familiar enough with Poppler to say whether it already has parameters to customize the generated file names, but you can always do this:
Run the command in an empty directory (e.g. in tempfile.TemporaryDirectory())
After the command finishes, list the contents of the directory and store the result in a list
Iterate over the list with a regex that matches the numbers, and create a dict for the mapping (integer to file name)
At this point you are free to rename the files to whatever you like, or to process them; see the sketch after this list.
The benefit of this solution is that it's neutral, robust and works for many similar scenarios.
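A minimal sketch of those steps, assuming the same test.pdf and Poppler path as in the question (the regex and the target naming scheme are illustrative):

import re
import shutil
import tempfile

from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"

with tempfile.TemporaryDirectory() as tmpdir:
    # Step 1: write the pages straight to disk in an empty directory.
    paths = convert_from_path("test.pdf", output_folder=tmpdir,
                              output_file="test", fmt="jpeg",
                              paths_only=True, poppler_path=poppler_path)
    # Steps 2-3: match the trailing page number in each generated name
    # and build the integer-to-file mapping.
    pages = {}
    for path in paths:
        match = re.search(r"-(\d+)\.jpg$", path)
        if match:
            pages[int(match.group(1))] = path
    # Step 4: rename to the desired scheme, e.g. test1.jpg, test2.jpg, ...
    for page, path in pages.items():
        shutil.move(path, f"test{page}.jpg")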
Hi, have a look at the pdf2image codebase, in the file generators.py.
I got mine from def counter_generator(prefix="", suffix="", padding_goal=4):
at line 41 you have:
....
# threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
....
I think you need to play with zfill() in the yield line:
The Python string zfill() method fills a string with zeroes on its left until it reaches a given width; this is also called padding. If the string starts with a sign character (+ or -), the zeroes are added after the sign character rather than before.
The zfill() method does not pad the string if its length is already greater than the requested width.
Note: zfill() works like the rjust() method with '0' assigned to rjust()'s fillchar parameter.
https://www.tutorialspoint.com/python/string_zfill.htm
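A quick illustration of that behaviour in the REPL:

>>> "7".zfill(4)
'0007'
>>> "-7".zfill(4)
'-007'
>>> "12345".zfill(4)
'12345'
>>> "7".rjust(4, "0")
'0007'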
Just use the Poppler utilities directly (or xpdf's pdftopng); simply call them via a shell, adding other options such as -r 200 for resolutions other than the default 150.
I recommend PNG for better image fidelity, but if you want .jpg, replace "-png" below with "-jpg" (the direct answer as asked would be pdftoppm -jpg -f 1 -l 9 -sep "" test.pdf "test"); do follow the enhancement below for file sorting, though. Windows file sorting needs leading zeros, otherwise the order in a zip or folder is 1,10,11...2,20..., which is often undesirable.
"path to bin\pdftoppm" -png "path to \in.pdf" "name"
Result =
name-1.png
name-2.png etc.
Adding digits is limited compared to other apps, so if you want "name-01.png" you need to output only pages 1-9 as
\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"
then for pages 10 onward (say, for a file with up to 99 pages) use the defaults (it will only use the page numbers that are actually available):
\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"
Thus for 12 pages this would produce only -10, -11 and -12, as required.
Likewise, for up to 9999 pages you need 4 calls; if you don't want the "-", simply delete it. For a different output directory, adjust the output name accordingly.
set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"
"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"
In the 12-page example above, the worst case is that the last calls reply
Wrong page range given: the first page (100) can not be after the last page (12). (and the same for 1000); those warnings can thus be ignored.
Those 4 lines could live in a Windows (or other OS) batch/script file that accepts arguments (for SendTo or drag and drop); then, from the system shell or from Python, you simply call pdf2png.bat input.pdf for each file, and in that simple case the output lands in the same directory; see the sketch below.
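A minimal sketch of that last step from Python (pdf2png.bat is the hypothetical batch file holding the four calls above; Windows only):

import subprocess
from pathlib import Path

# Call the batch file once per PDF in the current directory; in this
# simple case the PNG pages land next to each input file.
for pdf in Path(".").glob("*.pdf"):
    subprocess.run(["pdf2png.bat", str(pdf)], check=True)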

ZSH shell script loops MANY times?

So I want to convert these blogposts to PDF using wkhtmltopdf
On macOS, in Automator, I set up a workflow: Get Specified Text > Extract URLs From Text > Run Shell Script (shell: /bin/zsh, pass input: as arguments)
#!/bin/zsh
# url example: "https://www.somewordpressblog.com/2022/12/16/some-post-title/"
i=1
for url in "$@"
do
  title1=${url[(ws:/:)6]}               # split on delimiter / and take the sixth part
  title2=${(U)title1//-/ }              # replace all hyphens with spaces and capitalize
  title3="$i. - "                       # prefix with a running number
  title4=".pdf"                         # suffix with the .pdf extension
  title5="${title3}${title2}${title4}"  # join everything
  /usr/local/bin/wkhtmltopdf --disable-javascript --print-media-type $url $title5
  ((i+=1))
done
The files got downloaded, but for a test with only 2 URLs there was about 2 minutes of waiting, and the Results of the Run Shell Script action showed me 84 items!!
I am counting 14 "Done"s in the wkhtmltopdf output.
What is wrong with this loop? Do I need to implement something that waits for the loop to continue, or something? And how?
Any code suggestions welcome as well, first day with ZSH...

Snakemake specify a new wildcard in a new rule

I have input files:
Bob_1.fastq.gz
Bob_2.fastq.gz
Bob_3.fastq.gz
Bob_4.fastq.gz
Ron_1.fastq.gz
Ron_2.fastq.gz
Ron_3.fastq.gz
Ron_4.fastq.gz
I am running demultiplexing and trimming steps in one snakefile, like this:
workdir: "/path/to/dir/"

(SAMPLES,) = glob_wildcards('/path/to/dir/raw/{sample}.fastq.gz')

rule all:
    input:
        expand("demultiplex/{sample}.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}.trimmed.fastq.gz", sample=SAMPLES)

rule sabre:
    input:
        infile="/path/to/dir/raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        unknownfile=temp("demultiplex/unknown_barcode_{sample}.fastq.gz"),
    shell:
        """
        /Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} -u {output.unknownfile}
        """

rule trimmomatic_se:
    input:
        r="{sample}.fastq.gz"
    output:
        r="trimmed/{sample}.trimmed.fastq.gz"
    threads: 10
    shell:
        """java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"""
The demultiplex output files are like this:
Bob_1_CL1.fastq.gz.... Bob_1_CL345.fastq.gz
Bob_2_CL1.fastq.gz.... Bob_1_CL248.fastq.gz
Ron_1_dad1.fastq.gz... Ron_1_dad67.fastq.gz
and so on
So, if I do not specify the demultiplex output files, the program creates them by itself. My problem is how to specify/introduce a new wildcard from the output of the previous rule in the next trimming step, as the wildcards now differ from the initial sample names.
Wildcards just need to be consistent within a rule, not across the workflow. The issue here is that you have a rule generating outputs that are unknown in advance and that you need to process further. For that you need to use checkpoints.
Read through the second block of code about aggregating in the Snakemake documentation. Your checkpoint will be demultiplexing, and if you don't have any other steps, all will be your aggregate step that calls checkpoints.demultiplex.get. If you search for checkpoint on Stack Overflow you will find lots of examples; it's a hard feature to use at first! A sketch follows below.
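A minimal sketch of that checkpoint pattern, adapted from the aggregation example in the Snakemake documentation (the directory layout, the {unit} wildcard, and the placeholder shell commands are illustrative, not taken from the question):

import os

checkpoint demultiplex:
    input:
        "raw/{sample}.fastq.gz"
    output:
        # One directory per sample; its contents are unknown until runtime.
        directory("demultiplex/{sample}")
    shell:
        "run_demultiplexing {input} {output}"  # placeholder for the sabre call

rule trim_unit:
    input:
        "demultiplex/{sample}/{unit}.fastq.gz"
    output:
        "trimmed/{sample}_{unit}.trimmed.fastq.gz"
    shell:
        "run_trimmomatic {input} {output}"  # placeholder for the trimmomatic call

def trimmed_units(wildcards):
    # Only evaluated after the checkpoint has run, so the directory exists
    # and the {unit} wildcard can be globbed from its actual contents.
    ckpt_dir = checkpoints.demultiplex.get(sample=wildcards.sample).output[0]
    units = glob_wildcards(os.path.join(ckpt_dir, "{unit}.fastq.gz")).unit
    return expand("trimmed/{sample}_{unit}.trimmed.fastq.gz",
                  sample=wildcards.sample, unit=units)

rule aggregate:
    input:
        trimmed_units
    output:
        touch("aggregated/{sample}.done")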

Name for GNU Make $(var:=suffix) syntax

Evidently, GNU Make supports the syntax $(var:=suffix), which as far as I can tell does the same thing as $(addsuffix suffix,$(var)), except that in the := form the suffix can contain a comma without resorting to a variable.
What is this form of expansion called?
Evidently it operates on whitespace-delimited words, producing a new string without modifying the original variable.
This file
# Makefile
words=cat dog mouse triangle
$(info $(words:=.ext))
$(info $(words:=.ext))

all:
	@true
produces the following when run:
$ make
cat.ext dog.ext mouse.ext triangle.ext
cat.ext dog.ext mouse.ext triangle.ext
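For what it's worth, the GNU Make manual calls this form a substitution reference. A short sketch of the related spellings (the objects variable here is illustrative); all three lines print foo.o bar.o baz.o:

objects = foo.c bar.c baz.c
$(info $(objects:.c=.o))                # substitution reference, plain suffix form
$(info $(objects:%.c=%.o))              # the same, in pattern form
$(info $(patsubst %.c,%.o,$(objects)))  # equivalent patsubst function call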

how to specify wildcards in a filename for amazon EMR job

If I run an EMR job and specify wildcards in the directory path it all works fine,
e.g.: s3n://mybucket/*/*/fileName.gz --- picks all files named fileName.gz under subdirectories of mybucket.
However, when I specify wildcards in the fileName, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal part of the fileName instead of as a wildcard,
e.g.: s3n://mybucket/Dir1/fileName.*.gz
gives back an error that no matches were found for fileName.*.gz in that directory.
How do we specify wildcards in the filename for an Amazon EMR job?
Just went through this myself. It is very useful to pass NON-globbed wildcard expressions from the start script to Spark/PySpark, because the distribution mechanism inside the Spark program can be efficient when presented with something like this; note the globbing at both the directory level and the filename level:
df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')
Not to mention, of course, that almost all the time you want globbing to occur on the remote resource, not in your local launch environment.
The trick is to ensure that the initial shell variable does not get globbed when created and is also protected when presented to aws emr add-steps. Here is a simple launch script that assumes a cluster has been created. To show it can be done, we also escape newlines to make the args easier to see. Be careful, however, NOT to re-introduce extra whitespace when doing this!
# Use single quotes to stop globbing at the var level:
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'

# DO NOT add a trailing slash to the output_uri. S3 will
# automatically create subdirs under that, e.g.
#   --output_uri s3://$SRC_BUCKET/V4_t
# will be created and populated with many part-0000-... files.
# If you are not renaming or deleting the output_uri for each run,
# make sure your spark program uses overwrite mode for dataframe output, e.g.
#   dfx.write.mode("overwrite").json(output_uri)

# Careful to protect the DATA_URI arg by wrapping it in single quotes:
aws emr add-steps \
--cluster-id j-3CDMYEF3NJGHR \
--steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
