Converting Docx to image using Docx4j and PdfBox causes OutOfMemoryError

I'm converting the first page of a docx file to an image in two steps using docx4j and PDFBox, but I get an OutOfMemoryError every time.
I've determined that the exception is thrown at the very last step of the process, while convertToImage is being called. However, I have been using that second step to convert PDFs for some time without issue, so I am at a loss as to the cause, unless docx4j is encoding the PDF in a way I have not yet tested, or the PDF is corrupt.
I've tried replacing the ByteArrayOutputStream with a FileOutputStream; the PDF seems to render correctly and is not any larger than I would expect.
This is the code I am using:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
((org.docx4j.convert.out.pdf.viaXSLFO.Conversion)c).setSaveFO(File.createTempFile("fonts", ".fo"));
ByteArrayOutputStream os = new ByteArrayOutputStream();
c.output(os, new PdfSettings());
byte[] bytes = os.toByteArray();
os.close();
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument document = PDDocument.load(is);
PDPage page = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);
is.close();
document.close();
Edit
To give more context: this code runs in a Grails web application. I have tried several variants of it, including nulling out everything once it is no longer needed and using FileInputStream and FileOutputStream, both to conserve memory and to inspect the output of docx4j and PDFBox, each of which seems to work correctly.
I'm using docx4j 2.8.1 and PDFBox 0.7.3. I have also tried pdf-renderer, but I still get an OutOfMemoryError. My suspicion is that docx4j is using too much memory, but that the error does not surface until the PDF-to-image conversion.
I would gladly accept an alternative way of converting a docx file to a PDF, or directly to an image, as an answer. Note, however, that I am trying to replace JODConverter, which has been problematic to run on a server.
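One way to check the suspicion that docx4j, rather than the rendering step, is what fills the heap is to log heap usage between the two steps. A minimal sketch using only the JDK; HeapProbe and its methods are hypothetical names chosen for illustration, not part of docx4j or PDFBox:

```java
final class HeapProbe {
    private static final long MB = 1024L * 1024L;

    /** Returns the heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats used heap against the maximum heap for a labelled checkpoint. */
    static String report(String label) {
        long usedMb = usedHeapBytes() / MB;
        long maxMb = Runtime.getRuntime().maxMemory() / MB;
        return String.format("%s: %d MB used of %d MB max", label, usedMb, maxMb);
    }
}
```

Logging HeapProbe.report("after docx4j") right after c.output(...) and HeapProbe.report("after PDFBox load") right after PDDocument.load(...) would show which step accounts for the growth. The figures are approximate, since garbage that has not yet been collected is counted as used.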

I'm part of the XDocReport team.
We recently developed a small web app, deployed on CloudBees (http://xdocreport-converter.opensagres.cloudbees.net/), that demonstrates the behaviour of the converters.
You can easily compare the behaviour and performance of docx4j and XDocReport for PDF and HTML conversion.
Source code can be found here:
https://github.com/pascalleclercq/xdocreport-demo (REST-Service-Converter-WebApplication subfolder).
and here:
https://github.com/pascalleclercq/xdocreport/blob/master/remoting/fr.opensagres.xdocreport.remoting.converter.server/src/main/java/fr/opensagres/xdocreport/remoting/converter/server/ConverterResourceImpl.java
The first numbers I get suggest that XDocReport is roughly 10 times faster than docx4j at generating a PDF.
Feedback is welcome.

Glorious success at last! I replaced docx4j with XDocReport and the document converts to a PDF in no time at all. However, there seem to be issues with some documents. I expect this is due to the OS they were created on, and that it may be solved by using:
PDFViaITextOptions options = PDFViaITextOptions.create().fontEncoding("windows-1250");
with the encoding appropriate to that OS, instead of just:
PDFViaITextOptions options = PDFViaITextOptions.create();
which defaults to the current OS.
This is the code I now use to convert from DOCX to PDF and then render the first page as an image:
// Step 1: convert the DOCX to an in-memory PDF with XDocReport.
FileInputStream in = new FileInputStream(file);
XWPFDocument docx = new XWPFDocument(in);
PDFViaITextOptions options = PDFViaITextOptions.create();
ByteArrayOutputStream out = new ByteArrayOutputStream();
XWPF2PDFViaITextConverter.getInstance().convert(docx, out, options);
in.close();
byte[] bytes = out.toByteArray();
out.close();
// Step 2: load the PDF with PDFBox and render the first page at 96 dpi.
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument pdf = PDDocument.load(is);
PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);
is.close();
pdf.close();
return image;
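If the returned image then has to be stored or sent to a browser, it needs to be encoded first. A minimal sketch, assuming PNG output is wanted and using only the JDK's ImageIO; PageImageWriter and toPng are hypothetical names for illustration and are not part of XDocReport or PDFBox:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

final class PageImageWriter {
    /** Encodes a rendered page as PNG and returns the encoded bytes. */
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer for the format is registered.
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }
}
```

In a web application the resulting byte array could be written to the response with the content type image/png.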

Related

Custom decoders for PNG files

I've downloaded an image from a user manual (see attachment) and need to transform it. When I tried to load it with the following code, I got the exception: "Image cannot be loaded. Available decoders:\r\n - JPEG : JpegDecoder\r\n - PNG : PngDecoder\r\n - GIF : GifDecoder\r\n - BMP : BmpDecoder\r\n".
Is it possible to apply a custom decoder, and where can I find one?
using (var originalImage = new MemoryStream(...))
using (var image = Image.Load<Rgba32>(originalImage))
{
}
The linked image can be decoded using ImageSharp. As @tocsoft says in the comments, it is likely that you forgot to reset your input stream position.
Here are the two images. We loaded the second and flipped it vertically using the following code:
using (var image = Image.Load(Path.Combine(inPath, "-7bH2hfA.png")))
{
image.Mutate(x => x.Flip(FlipMode.Vertical));
image.Save(Path.Combine(outPath, "-7bH2hfA-flipped.png"));
}
Your input image:
Our flipped output image:
EDIT
When I originally tested the image I used right-click save, which gave me a valid PNG. I have since used the direct download from Dropbox, which yields the original file.
It's not a PNG! It is, in fact, a WebP file.
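A mismatch like this between file extension and actual format can be caught by inspecting the first bytes of the file. The sketch below is written in Java, like the examples at the top of this page, rather than in C#, and is not ImageSharp code. ImageSignature and sniff are hypothetical names; the byte patterns are the standard PNG signature and the RIFF/WEBP container header:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ImageSignature {
    // The fixed 8-byte signature that starts every PNG file.
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Guesses the format from the leading bytes of a file; 12 bytes are enough. */
    static String sniff(byte[] header) {
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), PNG_MAGIC)) {
            return "png";
        }
        // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
        if (header.length >= 12
                && "RIFF".equals(new String(header, 0, 4, StandardCharsets.US_ASCII))
                && "WEBP".equals(new String(header, 8, 4, StandardCharsets.US_ASCII))) {
            return "webp";
        }
        return "unknown";
    }
}
```

Passing the first 12 bytes of the downloaded file to sniff would report "webp" for a file like the one described above, whatever its extension says.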

EPPlus 'System.OutOfMemoryException'

I am trying to open a 38 MB Excel file using EPPlus v4.0. I am able to pass it to the ExcelPackage variable, but when I try to get the workbook from that variable, it throws a 'System.OutOfMemoryException'.
Here's my code:
Dim temppath = Path.GetTempPath()
Dim filenamestr As String = Path.GetFileNameWithoutExtension(Path.GetRandomFileName())
Dim tempfilename As String = Path.Combine(temppath, filenamestr + ".xlsx")
fileUploadExcel.SaveAs(tempfilename)
Dim XLPack = New ExcelPackage(File.OpenRead(tempfilename))
GC.Collect()
If File.Exists(tempfilename) Then
File.Delete(tempfilename)
End If
Dim xlWorkbook As ExcelWorkbook = XLPack.Workbook 'the error shows here
I'm stuck. Any help would really be appreciated. Thanks in advance.
You are probably hitting the RAM limit, as that is a big file. If you have the option to compile to 64-bit, you might be able to solve the problem:
https://stackoverflow.com/a/29912563/1324284
But if you can only compile to x86, there is not a whole lot you can do with EPPlus. You will have to either use a different library or build the XML files for Excel yourself:
https://stackoverflow.com/a/26802061/1324284
Essential XlsIO is an option for loading large Excel files using .NET.
The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify (less than 1 million US Dollars in revenue). The community license is the full product with no limitations or watermarks.
Note: I work for Syncfusion.

Download generated excel file

Given the following code:
Microsoft.Office.Interop.Excel.Application excelFile = new Microsoft.Office.Interop.Excel.Application();
excelFile.Visible = false;
Workbook wb = excelFile.Workbooks.Add(XlWBATemplate.xlWBATWorksheet);
Worksheet sheet1 = wb.ActiveSheet as Worksheet;
sheet1.Name = "Test";
sheet1.Cells[1, 1] = "Test";
string fileName = Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + "\\tickets.xlsx";
wb.SaveAs(Filename: fileName, FileFormat: XlFileFormat.xlOpenXMLWorkbook, AccessMode: XlSaveAsAccessMode.xlNoChange);
wb.Close();
excelFile.UserControl = true;
excelFile.Quit();
This generates an Excel file and saves it to the desktop. What do I have to change to ask for a save location?
Using Excel on the server is not supported and opens a whole can of worms. In particular, there is a high risk that Excel pops up a dialog that cannot be dismissed, because no one sees the server desktop. Excel is also very slow, which creates a critical bottleneck. Finally, debugging this is nearly impossible: the solution works, but it will never work well.
The solution: use a library like EPPlus, which can read and write xlsx files easily, is faster to develop with, is orders of magnitude faster at building the file, and is free. There are other libraries that can read xls files, if needed.
Prior to setting the filename, you could open a SaveFileDialog.

Working with Bitmap? Giving error?

In a web application, I am using Bitmap to find the width and height of an image. When I run the code, it gives the error: Parameter is not valid.
Bitmap bmp = new Bitmap(Server.MapPath("./Images/" + ds.Tables[0].Rows[0]["image"].ToString()));
I am getting this error; can you help me? Thank you.
The error says that the image you are saving has invalid image data (state), or that the filename used to save the image is already in use.
You probably just need to use this:
System.Drawing.Image bmp = System.Drawing.Image.FromFile(Server.MapPath("YourPathHere"));

Converting PDF to a collection of images on the server using GhostScript

These are the steps I am trying to achieve:
1. Upload a PDF document on the server.
2. Convert the PDF document to a set of images using GhostScript (every page is converted to an image).
3. Send the collection of images back to the client.
So far, I am interested in #2.
First, I downloaded both gswin32c.exe and gsdll32.dll and managed to manually convert a PDF to a collection of images (I opened cmd and ran the command below):
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf
Then I thought, I'll put gswin32c.exe and gsdll32.dll into ClientBin of my web project, and run the .exe via a Process.
System.Diagnostics.Process process1 = new System.Diagnostics.Process();
process1.StartInfo.WorkingDirectory = Request.MapPath("~/");
process1.StartInfo.FileName = Request.MapPath("ClientBin/gswin32c.exe");
process1.StartInfo.Arguments = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf";
process1.Start();
Unfortunately, nothing was output in ClientBin. Does anyone have an idea why? Any recommendation would be highly appreciated.
I've tried your code and it seems to be working fine. I would recommend checking the following things:
Verify that somepdf.pdf is in the working folder of the gs process, or specify the full path to the file on the command line. It would also be useful to see Ghostscript's output by doing something like this:
....
process1.StartInfo.RedirectStandardOutput = true;
process1.StartInfo.UseShellExecute = false;
process1.Start();
// read output
string output = process1.StandardOutput.ReadToEnd();
...
process1.WaitForExit();
...
If gs can't find your file, you will get an "Error: /undefinedfilename in (somepdf.pdf)" in the output stream.
Another possibility is that your script proceeds without waiting for the gs process to finish and generate the resulting image_N.jpg files. I guess adding process1.WaitForExit() should solve that.
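The same two checks, capturing the tool's output and waiting for it to exit, carry over to other platforms. A minimal sketch in Java, the language of the examples at the top of this page, rather than C#. CommandRunner and Result are hypothetical names; only standard ProcessBuilder calls are used, and the command to run is supplied by the caller:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class CommandRunner {
    /** Exit code plus everything the process wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs a command, captures its combined output, and waits for it to finish. */
    static Result run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        // Read the output before waiting, so a full pipe buffer cannot stall the child.
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        return new Result(exitCode, output);
    }
}
```

Called with the Ghostscript executable and its arguments as list elements, the returned output would contain any message such as the /undefinedfilename error mentioned above, and the output files would only be looked for after waitFor has returned.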

Resources