Converting PDF to a collection of images on the server using GhostScript - asp.net

These are the steps I am trying to achieve:
1. Upload a PDF document to the server.
2. Convert the PDF document to a set of images using GhostScript (every page is converted to an image).
3. Send the collection of images back to the client.
So far, I am interested in #2.
First, I downloaded both gswin32c.exe and gsdll32.dll and managed to manually convert a PDF to a collection of images (I opened cmd and ran the command below):
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf
Then I thought, I'll put gswin32c.exe and gsdll32.dll into ClientBin of my web project, and run the .exe via a Process.
System.Diagnostics.Process process1 = new System.Diagnostics.Process();
process1.StartInfo.WorkingDirectory = Request.MapPath("~/");
process1.StartInfo.FileName = Request.MapPath("ClientBin/gswin32c.exe");
process1.StartInfo.Arguments = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf";
process1.Start();
Unfortunately, nothing was output in ClientBin. Anyone got an idea why? Any recommendation will be highly appreciated.

I've tried your code and it seems to be working fine. I would recommend checking the following things:
verify that your somepdf.pdf is in the working folder of the gs process, or specify the full path to the file in the command line. It would also be useful to see Ghostscript's output by doing something like this:
....
process1.StartInfo.RedirectStandardOutput = true;
process1.StartInfo.UseShellExecute = false;
process1.Start();
// read output
string output = process1.StandardOutput.ReadToEnd();
...
process1.WaitForExit();
...
if gs can't find your file, you will get an "Error: /undefinedfilename in (somepdf.pdf)" in the output stream.
another possibility is that you proceed with your script without waiting for the gs process to finish and generate the resulting image_N.jpg files. Adding process1.WaitForExit() should solve that.
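For reference, the same run-capture-and-wait pattern can be sketched in Python with the standard library (a sketch only; the executable name and PDF path are placeholders, and `subprocess.run` waits for the process by default):

```python
import subprocess

def build_gs_args(exe, pdf_path, out_pattern="image_%d.jpg", dpi=150):
    # assemble the same command line as above, as a list so that
    # paths containing spaces need no manual quoting
    return [
        exe, "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=jpeg", "-r{}".format(dpi),
        "-dTextAlphaBits=4", "-dGraphicsAlphaBits=4",
        "-dMaxStripSize=8192",
        "-sOutputFile={}".format(out_pattern),
        pdf_path,
    ]

def run_ghostscript(exe, pdf_path):
    # capture stdout/stderr and wait for exit before touching the
    # output images, mirroring the RedirectStandardOutput advice
    result = subprocess.run(build_gs_args(exe, pdf_path),
                            capture_output=True, text=True)
    if "undefinedfilename" in result.stdout + result.stderr:
        raise FileNotFoundError(pdf_path)
    return result.returncode
```

The list form also sidesteps shell-quoting bugs that are easy to hit when the arguments are pasted together into one string.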

Related

Running an R script in AWS

Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is not pure Python, obviously? Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running some R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell script magic calls (the !commands).
Magic commands aren't unique to this platform (you can use them in Jupyter), but this particular code is meant to be run on their platform, in what seems like a fairly convoluted way of running R scripts as processing jobs.
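To make the distinction concrete, here is what the URI line and one of the ! lines would look like as plain Python outside a notebook (a sketch; the account ID and region below are placeholders, not values from the post):

```python
import subprocess

# placeholder values; in the notebook these come from boto3/STS
account_id = "123456789012"
region = "us-east-1"
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"

# same string formatting as in the notebook cell
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)

# outside Jupyter, a "!" magic line becomes an ordinary subprocess call
push_cmd = ["docker", "push", processing_repository_uri]
# subprocess.run(push_cmd, check=True)  # would actually push the image
```

So the ! lines are shell commands (AWS CLI and Docker), and the surrounding code is ordinary Python; only the mixing of the two is notebook-specific.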
However, the only thing you really need to focus on is the R script, and the final two cell blocks. The instruction at the top (don't change this line) creates a file (preprocessing.R) which gets executed later, and then you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell. This refers to the image being created in the R code. As for other output types (e.g. text), you'll have to look up the necessary Python package (PIL is an image manipulation package) and adapt accordingly.
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv
csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)
!aws s3 cp {csv_in_s3} .
with open(csv_key) as file:
    dat = list(csv.reader(file))
display(dat)
So now you should have an idea of how the two different output types generated by the R script example are handled, and from there you can try to adapt your own R code based on what it outputs.

Check if write_xlsx has correctly executed and the file has been written

I am using a code like this:
writexl::write_xlsx(
  x = list(
    "Sheet" = df
  ),
  path = paste0("User/", name, "_", date, ".xlsx")
)
The code should write the .xlsx file into the User folder inside the working directory.
I need the console to display a message if the write operation does not succeed.
For example, if the object name contains a "/", Windows will not allow the file to be written, but no message appears. The same happens when Windows blocks the write because of a permission problem.
I have tried tryCatch without success.
Thanks
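The failing case can at least be made visible before the write is attempted. As a language-neutral illustration (a Python sketch only; the question is about R's tryCatch, and the folder and naming scheme below just mirror the paste0 call), the idea is to check the name for path separators up front and raise a clear error:

```python
import os

def safe_xlsx_path(folder, name, date):
    # a "/" inside `name` would be interpreted as a sub-directory,
    # which is why the write fails silently on Windows
    if "/" in name or "\\" in name:
        raise ValueError("invalid character in name: {!r}".format(name))
    return os.path.join(folder, "{}_{}.xlsx".format(name, date))
```

The same validate-then-write shape can be expressed in R by checking `grepl("/", name)` before calling write_xlsx, with tryCatch reserved for errors you cannot predict, such as permission problems.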

Download file smaller than its real size

I'm trying to download all the comics (png) from xkcd.com, and here's my code to do the download job:
imageFile = open(os.path.join('XKCD', os.path.basename(comicLink)), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()
And the downloaded file is 6388 bytes and cannot be opened, while the real file from the link is 27.6 KB.
I've already tested my code line by line in shell, so I'm pretty sure I get the right link and right file.
I just don't understand why the png downloaded by my code is smaller.
Also I tried to search why this is happening but without helpful information.
Thanks.
Okay, since you are using requests, here is a function that will let you download a file given a URL:
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename
Link to the documentation -> http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
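A truncated download like the 6388-byte file in the question can also be detected after the fact by comparing the bytes written against the response's Content-Length header, when the server sends one. The comparison itself is just this (a sketch; the header value would come from `r.headers.get('content-length')`):

```python
def download_complete(content_length_header, bytes_written):
    # Content-Length arrives as a string header and may be absent
    # (e.g. for chunked responses), in which case we cannot check
    if content_length_header is None:
        return True
    return int(content_length_header) == bytes_written
```

Raising an error when this returns False turns a silently corrupt png into an immediate, debuggable failure.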

Reading a gpx file into Shiny from a Dropbox account

I have a Shiny app that accesses data from a Dropbox account. I used the instructions at https://github.com/karthik/rdrop2/blob/master/README.md to be able to read in csv data with no problem, i.e. using the drop_read_csv command from the rdrop2 package after doing the authentication step.
e.g.
my_data<-drop_read_csv("ProjectFolder/DataSI.csv")
My next problem however is that there are going to be a lot of gpx track files uploaded to the dropbox that I want the app to be able to read in. I have tried using:
gpx.files <- drop_search('gpx', path = "ProjectFolder/gpx_files")
trk.tmp <- vector("list", dim(gpx.files)[1])
for(i in 1:dim(gpx.files)[1]){
  trk.tmp[[i]] <- readOGR(gpx.files$path[i], layer = "tracks")
}
But no luck. At the readOGR step, I get:
Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, :
Cannot open data source
Hopefully someone can help.
My problem was I hadn't specified the dropbox path properly. I have used the drop_read_csv code and made a drop_readOGR version:
drop_readOGR <- function(my.file, dest = tempdir()){
  localfile = paste0(dest, "/", basename(my.file))
  drop_get(my.file, local_file = localfile, overwrite = TRUE)
  readOGR(localfile, layer = "tracks")
}
So now I can just use what I was doing before except I have changed the line in the loop to call the new function.
gpx.files <- drop_search('gpx', path = "ProjectFolder/gpx_files")
trk.tmp <- vector("list", dim(gpx.files)[1])
for(i in 1:dim(gpx.files)[1]){
  trk.tmp[[i]] <- drop_readOGR(gpx.files$path[i])
}
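The fix generalises beyond rdrop2: copy the remote file to a local temp path first, then hand the local path to the parser, since parsers like readOGR want a real file on disk rather than a remote path. A Python sketch of that shape (`fetch` and `read` are hypothetical callables standing in for drop_get and readOGR):

```python
import os
import tempfile

def fetch_then_read(remote_path, fetch, read):
    # download to a fresh temp directory, then parse the local copy;
    # the remote path is only used to derive the local file name
    local = os.path.join(tempfile.mkdtemp(), os.path.basename(remote_path))
    fetch(remote_path, local)
    return read(local)
```

Keeping the download and the parse as separate steps also makes each failure mode (network vs. file format) easy to tell apart.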

Converting Docx to image using Docx4j and PdfBox causes OutOfMemoryError

I'm converting the first page of a docx file to an image in two steps using docx4j and pdfbox, but I'm currently getting an OutOfMemoryError every time.
I've been able to determine that the exception is thrown on the very last step of this process, while the convertToImage method is being called. However, I've been using the second step of this method to convert PDFs for some time now without issue, so I am at a loss as to what might be the cause, unless perhaps docx4j is encoding the PDF in a way which I have not yet tested, or the output is corrupt.
I've tried replacing the ByteArrayOutputStream with a FileOutputStream, and the PDF seems to render correctly and is not any larger than I would expect.
This is the code I am using:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
((org.docx4j.convert.out.pdf.viaXSLFO.Conversion)c).setSaveFO(File.createTempFile("fonts", ".fo"));
ByteArrayOutputStream os = new ByteArrayOutputStream();
c.output(os, new PdfSettings());
byte[] bytes = os.toByteArray();
os.close();
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument document = PDDocument.load(is);
PDPage page = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);
is.close();
document.close();
Edit
To give more context on this situation, this code is being run in a grails web-application. I have tried several different variants of this code, including nulling out everything once no longer needed, using FileInputStream and FileOutputStream to try to conserve more physical memory and inspect the output of docx4j and pdfbox, each of which seem to work correctly.
I'm using docx4j 2.8.1 and pdfbox 0.7.3, I have also tried pdf-renderer but I still get an OutOfMemoryError. My suspicions are that docx4j is using too much memory but does not produce the error until the pdf to image conversion.
I would gladly accept an alternative way of converting a docx file to a PDF, or directly to an image, as an answer; however, I am currently trying to replace jodconverter, which has been problematic to run on a server.
I'm part of the XDocReport team.
We recently developed a little webapp deployed on CloudBees (http://xdocreport-converter.opensagres.cloudbees.net/) that shows the behaviour of the converters.
You can easily compare the behaviour and the performance of docx4j and XDocReport for PDF and HTML conversion.
Source code can be found here :
https://github.com/pascalleclercq/xdocreport-demo (REST-Service-Converter-WebApplication subfolder).
and here :
https://github.com/pascalleclercq/xdocreport/blob/master/remoting/fr.opensagres.xdocreport.remoting.converter.server/src/main/java/fr/opensagres/xdocreport/remoting/converter/server/ConverterResourceImpl.java
The first numbers I get show that XDocReport is roughly 10 times faster at generating a PDF than Docx4J.
Feedback is welcome.
Glorious success at last! I replaced docx4j with XDocReport, and the document converts to a PDF in no time at all. However, there seem to be some issues with some documents, but I expect this is due to the OS they were created on, and it may be solved by using:
PDFViaITextOptions options = PDFViaITextOptions.create().fontEncoding("windows-1250");
using the appropriate OS encoding instead of just:
PDFViaITextOptions options = PDFViaITextOptions.create();
which defaults to the current OS.
This is the code I now use to convert from DOCX to PDF:
FileInputStream in = new FileInputStream(file);
XWPFDocument document = new XWPFDocument(in);
PDFViaITextOptions options = PDFViaITextOptions.create();
ByteArrayOutputStream out = new ByteArrayOutputStream();
XWPF2PDFViaITextConverter.getInstance().convert(document, out, options);
byte[] bytes = out.toByteArray();
out.close();
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument pdfDocument = PDDocument.load(is);
PDPage page = (PDPage) pdfDocument.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);
is.close();
pdfDocument.close();
return image;
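One note on the memory profile of both code listings in this thread: the ByteArrayOutputStream → byte[] → ByteArrayInputStream round trip holds at least two full copies of the PDF in memory at once, which matters when the heap is already tight. Spooling through a temp file and consuming it in chunks avoids the extra full copy; a Python sketch of that pattern (illustrative only, not the docx4j fix itself):

```python
import hashlib
import os
import tempfile

def spool_and_hash(data, chunk=8192):
    # write to a temp file, then consume it chunk by chunk; unlike the
    # in-memory round trip, no second full copy of `data` is built
    fd, path = tempfile.mkstemp()
    digest = hashlib.sha256()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "rb") as f:
            while True:
                piece = f.read(chunk)
                if not piece:
                    break
                digest.update(piece)
        return digest.hexdigest()
    finally:
        os.remove(path)
```

In the Java code, the equivalent would be writing the conversion output to a temp file and handing PDDocument a stream over that file instead of a byte array.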
