ZSH shell script loops MANY times? - zsh

So I want to convert these blog posts to PDF using wkhtmltopdf.
On macOS, in Automator, I set up a workflow: GET SPECIFIED TEXT > EXTRACT URLS FROM TEXT > RUN SHELL SCRIPT (shell: /bin/zsh, pass input: as arguments)
#!/bin/zsh
# url example: "https://www.somewordpressblog.com/2022/12/16/some-post-title/"
i=1
for url in "$@"
do
    title1=${url[(ws:/:)6]} # split on the delimiter / and take the sixth part
    title2=${(U)title1//-/ } # replace all hyphens with spaces and convert to uppercase
    title3="$i. - " # prefix: add the running number
    title4=".pdf" # suffix: add the .pdf extension
    title5="${title3}${title2}${title4}" # join everything
    /usr/local/bin/wkhtmltopdf --disable-javascript --print-media-type $url $title5
    ((i+=1))
done
The files got downloaded, but for a test with only 2 URLs there was about a 2-minute wait, and the results from the shell script showed me 84 items!
I am counting 14 "Done" lines in the wkhtmltopdf output.
What is wrong with this loop? Do I need to implement something to make the loop wait before it continues? And how?
Any code suggestions are welcome as well; this is my first day with zsh.

Related

pdf2image: how to remove the '0001' in jpg file names? (Solved)

My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being stored in memory.
In Python 3.11:
from pdf2image import convert_from_path
poppler_path = r".\poppler-22.12.0\Library\bin"
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           poppler_path=poppler_path, paths_only=True)
pdf2image generates files with the following names
'test_0001-1.jpg',
'test_0001-2.jpg',
etc
Problem:
I would like the files to have names without the suffix '_0001-' (e.g. 'test1.jpg').
The only way so far seems to be to use convert_from_path WITHOUT output_folder and then
save each image with image.save(). But this way the images are stored in memory first, which can easily add up to a lot of megabytes.
Is it possible to change the way pdf2image generates the file names when saving images directly to files?
I don't know whether Poppler has parameters to customize the generated file names, but you can always do this:
1. Run the command in an empty directory (e.g. one created with tempfile.TemporaryDirectory()).
2. After the command finishes, list the contents of the directory and store the result in a list.
3. Iterate over the list with a regex that matches the numbers, and create a dict that maps each integer to its file name.
4. At this point you are free to rename the files to whatever you like, or to process them.
The benefit of this solution is that it's neutral, robust and works for many similar scenarios.
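A minimal Python sketch of those steps, assuming the files follow the 'test_0001-N.jpg' pattern shown in the question. The helper name rename_pages and the regex are illustrative and not part of pdf2image; in real use the source directory would be the temporary folder passed as output_folder to convert_from_path.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Captures the page number at the end of names like "test_0001-12.jpg".
PAGE_RE = re.compile(r"-(\d+)\.jpg$")


def rename_pages(src_dir, dest_dir, prefix="test"):
    """Move '<anything>-<page>.jpg' files from src_dir to dest_dir as '<prefix><page>.jpg'."""
    pages = {}
    for path in Path(src_dir).iterdir():
        match = PAGE_RE.search(path.name)
        if match:
            pages[int(match.group(1))] = path  # map page number -> file
    renamed = []
    for number in sorted(pages):
        target = Path(dest_dir) / f"{prefix}{number}.jpg"
        shutil.move(str(pages[number]), str(target))
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    # Stand-in for convert_from_path(..., output_folder=work, paths_only=True):
    # create empty files with the names pdf2image produced in the question.
    with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as out:
        for page in (1, 2, 3):
            (Path(work) / f"test_0001-{page}.jpg").touch()
        print([p.name for p in rename_pages(work, out)])
```

Because the images are only moved on disk, they are never loaded into memory, which was the concern in the question.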
Hi, have a look at the file generators.py in your codebase.
In my case the '0001' comes from def counter_generator(prefix="", suffix="", padding_goal=4).
At line 41 you have:
....
#threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
....
I think you need to adjust the zfill() call in the yield line:
The Python String zfill() method is used to fill the string with zeroes on its left until it reaches a certain width; else called Padding. If the prefix of this string is a sign character (+ or -), the zeroes are added after the sign character rather than before.
The Python String zfill() method does not fill the string if the length of the string is greater than the total width of the string after padding.
Note: The zfill() method works similar to the rjust() method if we assign '0' to the fillchar parameter of the rjust() method.
https://www.tutorialspoint.com/python/string_zfill.htm
Just use the Poppler utilities directly (or xpdf's pdftopng) by calling them from a shell. Add other options, such as -r 200, if you want a resolution other than the default 150.
I recommend PNG for better image fidelity; if you want .jpg, replace "-png" below with "-jpg". The direct answer to the question as asked would be pdftoppm -jpg -f 1 -l 9 -sep "" test.pdf "test", but do follow the enhancement below for file sorting. Windows file sorting needs leading zeros, otherwise the order in a zip or folder is 1,10,11...2,20..., which is often undesirable.
"path to bin\pdftoppm" -png "path to \in.pdf" "name"
Result =
name-1.png
name-2.png etc.
Padding options are limited compared to other apps, so if you want "name-01.png" you need to output only pages 1-9, as
\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"
Then, for pages 10 and up (say, for a file of up to 99 pages), use the default separator; only the page numbers that actually exist are written
\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"
Thus, for 12 pages this produces only -10, -11 and -12, as required.
Likewise, for up to 9999 pages you need 4 calls. If you don't want the "-", simply delete it. For a different output directory, adjust the output prefix accordingly.
set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"
"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"
In the 12-page example above, the worst case is that the last calls reply with
Wrong page range given: the first page (100) can not be after the last page (12).
and the same for 1000. Those warnings can be ignored.
Those 4 lines could go in a Windows batch file or other OS script (for SendTo or drag and drop) that accepts arguments. Then simply run pdf2png.bat input.pdf for each file, from the system or from Python; in that simple case the output lands in the same directory.

zsh: redirect only standard error to /dev/null

I want to use something like pdfs=$(echo *.pdf) and drop the error message that comes in case of no files present. But the docs have only examples where both outputs are redirected combined.
Standard error is file descriptor 2, so you can redirect it on its own if you are actually running a command that you expect to write to standard error:
pdfs=$(echo *.pdf 2> /dev/null)
However, don't write code like in your example. A flat string cannot usefully store an arbitrary list of file names, because you can't distinguish between filename delimiters and valid characters in a filename. Instead, use an array which doesn't require any separate commands (and thus any need to redirect standard error):
pdfs=( *.pdf(N) ) # You can drop the (N) if you already have NULL_GLOB enabled
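The same list-instead-of-string idea, sketched in Python rather than zsh for comparison. The function name list_pdfs is made up for this illustration; glob.glob simply returns an empty list when nothing matches, so there is no error message to suppress.

```python
import glob
import os
import tempfile


def list_pdfs(directory="."):
    """Return the PDF file names in directory as a list; empty if none match."""
    pattern = os.path.join(glob.escape(directory), "*.pdf")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(list_pdfs(d))  # no PDFs yet, and no error is raised
        for name in ("my report.pdf", "notes.pdf"):
            open(os.path.join(d, name), "w").close()
        print(list_pdfs(d))  # the name containing a space stays one element
```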

How to create a new output file in R if a file with that name already exists?

I am trying to run an R-script file using windows task scheduler that runs it every two hours. What I am trying to do is gather some tweets through Twitter API and run a sentiment analysis that produces two graphs and saves it in a directory. The problem is, when the script is run again it replaces the already existing files with that name in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time, as no file with that name existed in the directory yet. The problem is that I want the R script to run every other hour, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times with Google Chrome.
I'd just time-stamp the file name.
> filename = paste("output-",now(),sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
    if(!file.exists(prefix)){return(prefix)}
    i=1
    repeat {
        f = paste(prefix,i,sep=".")
        if(!file.exists(f)){return(f)}
        i=i+1
    }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file though - if both processes check at the same time there's a window of opportunity where both processes don't see "foo.2" and think they can both create it. The same thing will happen with timestamps if you have two processes trying to write new files at the same time.
Both these issues can be resolved by generating a random UUID and pasting that on the filename, otherwise you need something that's atomic at the operating system level.
But for a job that runs every two hours I reckon a timestamp down to minutes is probably enough.
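For the "atomic at the operating system level" option, here is a minimal sketch in Python rather than R. The name create_next is hypothetical; the sketch relies on the exclusive-creation mode "x" of open(), which fails if the file already exists, so checking and creating happen in one step instead of two.

```python
import os
import tempfile


def create_next(prefix):
    """Create and return the first unused name among prefix, prefix.1, prefix.2, ..."""
    candidate, i = prefix, 0
    while True:
        try:
            # Mode "x" raises FileExistsError if the file is already there,
            # so two processes cannot both claim the same name.
            with open(candidate, "x"):
                return candidate
        except FileExistsError:
            i += 1
            candidate = f"{prefix}.{i}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        base = os.path.join(d, "foo")
        for _ in range(3):
            print(os.path.basename(create_next(base)))
```

The returned file already exists (empty) when the function returns, so the caller writes into it rather than creating it again.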
See ?files for file manipulation functions. You can check if file exists with file.exists, and then either rename the existing file, or create a different name for the new one.

how to specify wildcards in a filename for amazon EMR job

If I run an EMR job and specify wildcards in the directory path, it all works fine,
e.g. s3n://mybucket/*/*/*/fileName.gz picks up all files named fileName.gz under the subdirectories of mybucket.
However, when I specify wildcards in the file name, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal part of the file name instead of as a wildcard,
e.g. s3n://mybucket/Dir1/fileName.*.gz
gives back an error that no matches were found for fileName.*.gz in that directory.
How do we specify wildcards in the file name for an Amazon EMR job?
Just went through this myself. It is very useful to pass NON-globbed wildcard expressions from the start script to spark/pyspark, because the distribution mechanism inside the Spark program can be efficient when presented with something like this; note the globbing at both directory level and file name level:
df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')
Not to mention of course that almost all the time you want globbing to occur on the remote resource, not your local launch environment.
The trick is to ensure that the initial shell variable does not get globbed when created and is also protected when presented to aws emr add-steps. Here is a simple launch script that assumes a cluster has been created. To show it can be done, we also escape newlines to make it easier to see the args. Be careful, however, NOT to re-introduce extra whitespace when doing this!
# Use single quotes to stop globbing at the var level:
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'
# DO NOT add trailing slash to the output_uri. S3 will
# automatically create subdirs under that. e.g.
# --output_uri s3://$SRC_BUCKET/V4_t
# will be created and populated with many part-0000-... files.
# If you are not renaming or deleting the output_uri for each run,
# make sure your spark program uses overwrite mode for dataframe output e.g.
# dfx.write.mode("overwrite").json(output_uri)
# Careful to protect the DATA_URI arg by wrapping it single quotes:
aws emr add-steps \
--cluster-id j-3CDMYEF3NJGHR \
--steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
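As an alternative sketch, the same step can be launched from Python, where passing the arguments as a list (no shell) means the local machine never gets a chance to glob the pattern. The function names and the my-bucket URIs are placeholders for illustration; the step fields mirror the launch script above.

```python
import subprocess


def build_add_steps_argv(cluster_id, script_uri, data_uri, output_uri):
    """Build the argument list for 'aws emr add-steps' without any shell expansion."""
    steps = (
        "Type=Spark,Name=myAnalytics,ActionOnFailure=CONTINUE,"
        # The data URI keeps its single quotes, as in the shell version.
        f"Args=[{script_uri},--game_data,'{data_uri}',--output_uri,{output_uri}]"
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", steps]


def submit(argv):
    """Run the command; needs the AWS CLI and credentials, so it is not called here."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    argv = build_add_steps_argv(
        "j-3CDMYEF3NJGHR",
        "s3://my-bucket/blunders.py",  # placeholder bucket
        "s3://my-bucket/archive/*/2014/7/G.*.json.bz2",
        "s3://my-bucket/V4_t",
    )
    print(argv)
```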

Printing hard copies of code

I have to hand in a software project that requires either a paper or .pdf copy of all the code included.
One solution I have considered is grouping classes by context and doing a cat *.extension > out.txt to provide all the code, then by catting the final text files I should have a single text file that has classes grouped by context. This is not an ideal solution; there will be no page breaks.
Another idea I had was a shell script that injects LaTeX page breaks between the files to be joined; this would be more acceptable, although I'm not too adept at scripting or LaTeX.
Are there any tools that will do this for me?
Take a look at enscript (or nenscript), which will convert to PostScript, render in columns, add headers/footers and perform syntax highlighting. If you want to print code in a presentable fashion, this works very nicely.
e.g. here's my setting (within a zsh function)
# -2 = 2 columns
# -G = fancy header
# -E = syntax filter
# -r = rotated (landscape)
# syntax is picked up from .enscriptrc / .enscript dir
enscript -2GrE $*
For a quick solution, see a2ps, followed by ps2pdf. For a nicer, more complex solution I would go for a simple script that puts each file in a LaTeX listings environment and combines the result.
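A rough sketch of that last idea in Python: generate one LaTeX document that pulls in each source file with \lstinputlisting from the listings package and starts a new page after every file. The function name, the preamble settings and the example paths are assumptions for illustration, and file names containing braces would need extra care.

```python
from pathlib import Path

PREAMBLE = r"""\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{listings}
\lstset{basicstyle=\ttfamily\small, breaklines=true, numbers=left}
\begin{document}
"""


def build_listing_document(paths):
    """Return LaTeX source that prints each file as a listing on its own pages."""
    parts = [PREAMBLE]
    for path in paths:
        name = Path(path).as_posix()
        # \detokenize lets names with underscores appear in the heading as-is.
        parts.append(r"\section*{\texttt{\detokenize{" + name + "}}}\n")
        parts.append(r"\lstinputlisting{" + name + "}\n")
        parts.append("\\clearpage\n")  # page break between files
    parts.append("\\end{document}\n")
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical file names; replace with the project's real sources.
    print(build_listing_document(["src/main_window.py", "src/model.py"]))
```

Writing the returned string to a .tex file and running pdflatex on it, from the directory the paths are relative to, should give a single PDF with one listing per file.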

Resources