How to efficiently convert a large pdf with many pages into individual (high-res) jpgs with node (in the backend) using (for example) graphicsmagick?

I would like to use node (in my backend) to convert a large PDF with many (hundreds!) of pages to individual jpgs to store them in a database for further purposes.
For this I have chosen the npm package "gm" which uses "graphicsmagick" in the background.
I have encountered several big issues. For example, node seems to be unable to "digest" a large number of pages at a time. Since "gm" is asynchronous, it does not wait for one page to finish but starts converting all pages almost instantly, which "freezes" my node application, i.e., it keeps running indefinitely and never produces any pages. If I limit the number of pages to, say, 20, it works perfectly.
I could not find any documentation for "gm" or "graphicsmagick" providing "best practices" for converting (large) pdfs.
The 2 most relevant questions that I have are:
a) Is there a way to tell "graphicsmagick" to produce an individual jpg file for each pdf page? "imagemagick", for example, does this "out of the box". To be more specific:
convert -density 300 test.pdf test.jpg
would produce files like "test-0.jpg", "test-1.jpg", "test-2.jpg", and so on while
gm convert -density 300 test.pdf test.jpg
only produces one jpg file (the first page of the pdf).
b) Is there a way with "gm" to reuse the same "Buffer" to produce jpg images? I assume that calling "gm" with a big Buffer (of >100 MB) hundreds of times is not the best way to do it.
Here is the code that I am using right now:
import gm from 'gm';
import fs from 'fs';
// Create "Buffer" to be used by "gm"
const buf = fs.readFileSync('test.pdf');
// Identify number of pages in pdf
gm(buf, 'test.pdf').identify((err: any, value: gm.ImageInfo) => {
  if (err) {
    console.log(err);
  } else {
    // Format contains one entry per page, so the split length gives the page count
    const actualArray: string[] = value.Format.toString().split(',');
    let numPages: number = actualArray.length;
    // Loop through all pages and produce desired output
    for (let currentPage: number = 0; currentPage < numPages; currentPage++) {
      gm(buf, `test.pdf[${currentPage}]`)
        .density(150, 150)
        .quality(90)
        .write(`test${currentPage}.jpg`, (err: any) => {
          if (err) console.log(err);
        });
    }
  }
});
This approach:
does not work with large pdfs (at least not on my machine)
is very slow (presumably because it calls "gm" hundreds of times almost instantly, each time with a Buffer of >100 MB)
Is there a "best practice" approach to do this right? Any hint will be highly appreciated!

For a commercially free, open-source task you need to avoid tools that depend on licensed Ghostscript PDF handling in the background, such as ImageMagick, GraphicsMagick, etc.
If it's for personal use, then consider Ghostscript's sister MuTool; it is generally the fastest method, see "What is the fastest way to convert pdf to jpg image?"
So the best FOSS workhorse for this task is Poppler, and the means to convert a PDF into image pages is pdftoppm, which has many output formats, including two types of jpg. However, I recommend considering PNG as the preferable output for documents; any difference in file size is more than compensated for by the clarity of the pixels.
For OCR, use ppm
For documents / line art, use png
For photos, use standard jpeg
-png : generate a PNG file
-jpeg : generate a JPEG file
-jpegcmyk : generate a CMYK JPEG file
-jpegopt : jpeg options, with format <opt1>=<val1>[,<optN>=<valN>]*
Typical Windows command line:
"bin\pdftoppm.exe" -png -r %resolution% "%filename.pdf%" "%output/rootname%"

You can try the pdfimages utility (from the Poppler or Xpdf project) to extract the original images from the pdf.

Related

R jsonlite fromJSON very slow on large files--is this expected?

I am trying to figure out if the time elapsed using fromJSON that I'm looking at is typical or a red flag, because it's multiple orders of magnitude slower than I have grown to expect from R in any situation.
I'm using files from these API endpoints (i.e., I get JSON in, I do not have control over the form of the JSON before it is imported, it is theoretically the same every time), namely the default cards and all cards bulk files. I use download.file() to retrieve them and then load the saved files with fromJSON.
The ~281 MB default cards file takes between 65 and 95 seconds to load with fromJSON. The ~1.5 GB all cards file, which is similarly structured because both are sets of Card objects, takes between 25 and 45 minutes. (I am using R 4.2.1 in RStudio on Windows 11 with an SSD and 64 GB of RAM, if that matters; peak memory usage tops out around ~12 GB.)
Is that in the realm of reasonable expectation for files of this size/complexity, including the nonlinear increase in processing time, or should I be treating it as a red flag? I have no basis of comparison for JSON as opposed to CSV and don't mean to be disrespectful if the answer is in fact that this is as fast as it gets; I just didn't want to assume that was normal.
To reproduce/see on your own machine:
Get https://api.scryfall.com/bulk-data directly from that url to get the download links for "default" and "all cards" data
download.file() the JSON for "default" and "all" cards
Try fromJSON() on the resulting files (I am using default settings and only entering that filepath)
(Optional) The numbers I'm citing are from benchmarking with tictoc.
(Same query with different framing posted on the jsonlite issue tracker here.)
I am open to suggestions either for making this go faster or for alternative download/import methods that would be less of a drag. This is in the context of a scripted ETL process that should ultimately be able to run as a cron job without human interaction.

Why does R raster::writeRaster() generate a pic which can't be shown in Win10?

I read my hyperspectral (.raw) file and combine three bands into "gai_out_r". Then I write the output as follows:
writeRaster(gai_out_r, filepath, format="GTiff")
Finally I get gai_out_r.tif.
But why can't Win10 display this small tif, while it does display the tif that I output the same way from ENVI (save image as > tif)?
(The two tiffs, as displayed by Win10, were shown in the attached screenshots.)
Default Windows image-viewing applications don't support hyperspectral images. Since you are just reading and combining 3 bands from your .raw file, the resulting image will be a hyperspectral image. You need separate dedicated software to view hypercubes, or you can view it using spectral-python (SPy).
In SPy, using envi.save_image will save it as an ENVI-type file only. To save it as an RGB image file (readable in Windows) we need to use other methods.
You are using writeRaster to write to a GTiff (GeoTIFF) format file. To write to a standard tif file you can use the tiff method. With writeRaster you could also write to a PNG instead:
writeRaster(gai_out_r, "gai.png")
Cause of the issue:
I had a similar issue and recognised that the exported .tif files had a different bit depth than .tif images I could open. The images could not be displayed using common applications, although they were not broken and I could open them in R or QGIS. Hence, the values were coded in a way Windows would not expect.
When you type ?writeRaster() you will find that there are various options when it comes to saving a .tif (or other format) using the raster::writeRaster() function. Click on the links therein to get to the dataType {raster} help site and you'll find there are various integer types to choose from.
Solution (write a Windows-readable GeoTIFF):
I set the following options to make the resulting .tif file readable (note the datatype option):
writeRaster(raster, filename = "/path/to/your/output.tif",
            format = "GTiff", datatype = "INT1U")
Note:
I realised your post is from 2 and a half years ago... Anyways, may this answer help others who encounter this problem.

OpenCPU and multi-page plots

I'm trying to capture a multi-plot pdf from a function. In R, this gives me a three page PDF:
pdf(file='test.pdf', onefile=TRUE)
lapply(1:3, 'plot')
dev.off()
Using OpenCPU:
$ curl http://localhost:6977/ocpu/library/base/R/lapply -H 'Content-Type: application/json' -d '{"X":[1,2,3], "FUN":"plot"}'
/ocpu/tmp/x0dc3dad0/R/.val
/ocpu/tmp/x0dc3dad0/graphics/1
/ocpu/tmp/x0dc3dad0/graphics/2
/ocpu/tmp/x0dc3dad0/graphics/3
/ocpu/tmp/x0dc3dad0/stdout
/ocpu/tmp/x0dc3dad0/source
/ocpu/tmp/x0dc3dad0/console
/ocpu/tmp/x0dc3dad0/info
I can get any of the individual pages as a single-page PDF file, but not as one combined file.
Two possible work-arounds, not without their share of problems:
Use par(mfrow), layout(), or a similar mechanism, though this will create a monster image in the end (I'm dealing with more than three images in my code).
Use tempfile, create an Rmd file on-the-fly, return the filename in the session (have not tested this yet), and use OpenCPU's processing of Rmd files. Unfortunately, this now uses LaTeX's geometries and page numbering (workarounds exist for this).
Are there other ways to do this?
Good question. OpenCPU captures graphics using evaluate which stores each graphic individually. The API itself doesn't support combining multiple graphics within a single file. I would personally do this sort of PDF post processing in the application layer (i.e. with non-R tools), but perhaps it would be useful to support this in the API.
Some suggestions:
Any file that your R function/script saves to the working directory (i.e. getwd()) will also become available through the API. So one thing you could do is manually create your combined pdf file in your R code, save it to the working directory, and then download it through OpenCPU.
Graphics are actually recordedPlot objects, and besides png, pdf and svg, you can also retrieve the graphic as rds or rda. So you could write an R function that downloads the recordedPlot object from the API and then prints it. Not sure if that would be helpful in your use case.

Is there any way we can find out whether a PDF file is compressed or not?

We are using iText PDF to compress PDFs, but the issue is that we only want to compress the files that were compressed before being uploaded to our site; if the files are uploaded without compression, we would like to leave those as they are.
So to do that, we need to identify whether the PDF is compressed or not. I am wondering: is there any way we can identify whether a PDF is compressed or not, using iText PDF or some other tool?
I have tried to Google it but couldn't find an appropriate answer.
Kindly let me know if you have any idea.
Thanks
There are several types of compression you can get in a PDF. Data for objects can be compressed and objects can be compressed into object streams.
I voted Mark's answer up because he's right: you won't get an answer if you're not more specific. I'll add my own answer with some extra information.
In PDF 1.0, a PDF file consisted of a mix of ASCII characters for the PDF syntax and binary code for objects such as images. A page stream would contain visible PDF operators and operands, for instance:
56.7 748.5 m
136.2 748.5 l
S
This code tells you that a line has to be drawn (S) between the coordinate (x = 56.7; y = 748.5) (because that's where the cursor is moved to with the m operator) and the coordinate (x = 136.2; y = 748.5) (because a path was constructed using the l operator that adds a line).
Starting with PDF 1.2, one could start using filters for such content streams (page content streams, form XObjects). In most cases, you'll discover a /Filter entry with value /FlateDecode in the stream dictionary. You'll hardly find any "modern" PDFs of which the contents aren't compressed.
Up until PDF 1.5, all indirect objects in a PDF document, as well as the cross-reference table, were stored in ASCII in a PDF file. Starting with PDF 1.5, specific types of objects can be stored in an object stream. The cross-reference table can also be compressed into a stream. iText's PdfReader has an isNewXrefType() method to check if this is the case. Maybe that's what you're looking for. Maybe you have PDFs that need to be read by software that isn't able to read PDFs of this type, but... you're not telling us.
Maybe we're completely misinterpreting the question. Maybe you want to know if you're receiving an actual PDF or a zip file with a PDF. Or maybe you want to really data-mine the different filters used inside the PDF. In short: your question isn't very clear, and I hope this answer explains why you should clarify.

R: dev.copy2pdf, multiple graphic devices to a single file, how to append to file?

I have a script that makes barplots, and opens a new window when 6 barplots have been written to the screen and keeps opening new graphic devices whenever necessary.
Depending on the input, this leaves me with a potentially large number of opened windows (graphics devices) which I would like to write to a single PDF file.
Considering my Perl background, I decided to iterate over the different graphics devices, printing them out one by one. I would like to keep appending to a single PDF file, but I do not know how to do this, or if this is even possible. I would like to avoid looping in R. :)
The code I use:
for (i in 1:length(dev.list()))
{
  dev.set(which = dev.list()[i])
  dev.copy2pdf(device = quartz, file = "/Users/Tim/Desktop/R/Filename.pdf")
}
However, this does not work, as it overwrites the file each time. Is there an append function in R, like there is in Perl, which would allow me to keep adding pages to the existing pdf file?
Or is there a way to capture the information in a graphics window in an object, keep adding new graphics devices to this object, and finally print the whole thing to a file?
Other possible solutions I thought about:
writing different pdf files, combining them after creation (perhaps even possible in R, with the right libraries installed?)
copying the information in all different windows to one big graphic device and then print this to a pdf file.
Quick comments:
use the onefile=TRUE argument, which gets passed through to pdf(); see the help pages for dev.copy2pdf and pdf
as a general rule, you may find it easier to open the devices directly; again see help(pdf)
So in sum, add onefile=TRUE to your call and you should be fine, but consider using pdf() directly.
To further elaborate on the possibility of appending to a pdf: although multiple graphs can easily be put into one file, it turns out that it is impossible, or at least not simple, to really append to a pdf once it has been finished by dev.off() - see here.
I generate many separate pages and then join them with something like system('pdfjam pages.pdf -o output.pdf').
