how to preserve the look and feel when converting HTML to PDF - asp.net

I have been using iTextSharp to do an HTML to PDF conversion. Overall it works fairly well, but it doesn't seem to like most of the formatting.
Bold, italic, and underline all work; however, none of the font sizes, styles or other information is respected, so the export doesn't look much like the HTML that was used to create it.
Does anyone know how to either
fix the way iTextSharp exports (below is a sample of my code),
or know of a different product that provides this functionality and will not break the bank?
This is my code:
//Do the PDF thing
Document document = new Document(PageSize.A4);
using (Stream output = new FileStream(Server.MapPath(relDownloadDoc), FileMode.Create, FileAccess.Write, FileShare.None))
using (Stream htmlStream = new FileStream(Server.MapPath(relProcessingDoc), FileMode.Open, FileAccess.Read, FileShare.Read))
using (XmlTextReader reader = new XmlTextReader(htmlStream))
{
    reader.WhitespaceHandling = WhitespaceHandling.None;
    // Bind the writer to the output stream before opening the document
    PdfWriter.GetInstance(document, output);
    document.Open();
    HtmlParser.Parse(document, reader);
    document.Close();
}

Try WKHTMLTOPDF.
It's an open source tool built on the WebKit rendering engine. Both are free.
We've set up a small tutorial here
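To give an idea of how little glue code the command-line route needs, here is a minimal sketch that shells out to the wkhtmltopdf binary. It is written in Python purely for illustration (the same call works from .NET via Process.Start), it assumes wkhtmltopdf is installed and on the PATH, and the function names are made up for this example.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(source, output_pdf, page_size="A4"):
    """Return the argument list for one run; source may be a file path or a URL."""
    return ["wkhtmltopdf", "--page-size", page_size, source, output_pdf]


def html_to_pdf(source, output_pdf, page_size="A4"):
    """Run wkhtmltopdf; raises if the binary is missing or the conversion fails."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf is not on PATH")
    subprocess.run(build_wkhtmltopdf_command(source, output_pdf, page_size), check=True)
```

Because the page is rendered by a browser engine, the CSS (font sizes, styles and so on) is applied the way a browser would apply it, which is the part the question is struggling with.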

From Convert HTML + CSS to PDF with PHP? I found out about Prince XML, which has clients for lots of languages including the .Net platform.
It is an exceptional converter, though commercial and not cheap. There is a Google Tech Talk about it. Allegedly, Google uses it for Google Docs. Its rendering engine also passed the Acid2 test.
If you want high-quality HTML to PDF conversion and are willing to spend the ~$3800 for a server license then look no further. Frankly I think the cost in time of getting anything else to do what Prince does will quickly outstrip the cost involved. Developer time is expensive.

I have used pd4ml for a few things. It seems to work pretty well.
Here is a list of the HTML tags/attributes that pd4ml supports: http://pd4ml.com/html.htm

ActivePDF is $375 for a single server license, and does an excellent job. We've used it in client projects before and it's been great.
http://www.activepdf.com/products/serverproducts/webgrabber/index.cfm
EDIT: Never mind, it depends on another one of their products that costs $1,400. I thought it would come in cheaper than some of the other suggestions. A few more minutes of research came up with the following alternatives:
Under $500:
http://www.websupergoo.com/abcpdf-1.htm (You'll need the professional edition to keep as much formatting as possible).

Related

Converting Docx to PDF using .Net Core [Open Source]

I'm looking for an open source plugin which can convert Word (docx / doc) to PDF without Microsoft.Office.Interop. There are questions asked about this, but no solution is provided, or I didn't find any.
Any suggestions or references will be much appreciated!
You could do this using the Aspose.Words library; however, it is not open source (a license is required and costs money): https://blog.aspose.com/2020/01/02/convert-word-doc-docx-to-pdf-in-csharp-net-core/
On our project we needed to keep the formatting as close as possible to the original, but none of the plugins we tried came close.
We opted for I Love Pdf utilities.
Word to PDF
They have a well-documented API for several languages (including .NET) and it works great.
You can process 250 files for free every month, and if you need more, it's not that expensive.
Hope this helps

PDF to HTML or similar

I'm building an application to view PDFs through a browser on mobile devices without the need for a plugin. I tried ImageMagick and Ghostscript to convert the pages to images, but the files are far too large and the text becomes unclear. I see websites offering a service that converts PDFs into HTML and does a decent job, but I can't find an example of how this is accomplished. Any help is much appreciated. Thanks!
EDIT: I seem to have read the question backwards. In this case it might be best to parse through the PDF and then format some HTML based on what you find. I believe the javapdf option is capable of this, but I haven't used any of these so I am not sure. If worse comes to worst and you can't find software to disassemble a PDF, you might be able to write your own disassembler in Java or PHP by reading the PDF specification. Best of luck!
http://www.adobe.com/devnet/pdf/pdf_reference.html - PDF Specification (Adobe Modified Version, because they are most popular you may want to support their extensions)
-- OLD -- These websites probably write their own proprietary software to do the trick. If you are truly interested in this undertaking, I would suggest parsing the HTML to get the data and style information and using it to format some sort of PDF writer APIs. A quick Google search yields the following: -- END OLD --
http://www.cutepdf.com/Solutions/
http://ruby-pdf.rubyforge.org/pdf-writer/doc/index.html
http://asprise.com/product/javapdf/
If you are looking at converting PDF to HTML and planning to run the conversion on a server, then you can try pdftohtml. It is a program packaged as part of poppler-utils. I do not know how the program accomplishes it.
I was googling and came across the link below, explaining how scribd.com implements conversion.
http://coding.scribd.com/2010/06/01/the-perils-of-stacking/

Programmatically generating editable Word docs from ASP.NET?

The purpose is to generate proposal documents that can manually be edited in Word after the fact, but before sending them out to the customers.
Much proposal content would be drawn from existing HTML website content (backing CMS) and also some custom (non-HTML) injection for certain scenarios. Of course the conditional logic could go into server-side ASP.NET to vary the content appropriately.
I'm open to 3rd-party tools if raw manipulation of the Word API is arduous. In fact a good 3rd party tool might be the answer.
Use the Aspose Words component for .Net.
Aspose Words Component Link
The component natively understands the Microsoft Word file format without having to install any Microsoft Office products on your application environment. You can then start from an existing word template or programatically build up an entire Microsoft Word document from scratch. The Word object model then allows you to export to doc / docx etc and save as a native Word file to wherever you required.
They have plenty of demos set up on their website.
I've not used any third-party tools before, as I've only ever written Office automation applications for PCs which already have Office installed.
Creating documents from scratch, or basing them on a template, is quite straightforward. With templates, you can define bookmarks and mail-merge fields to make finding and replacing document elements easier.
Here's a few things that you may find useful:
Named and Optional Arguments
The Word object model is reasonably easy to work with. VB.NET used to be easier to work with than C#: as the Office automation APIs were originally written with VB in mind, you could take advantage of optional parameters. In earlier versions of C#, you had to specify every argument in API calls, which was quite tedious. I understand that this has changed in Visual C# 2010:
How to: Use Named and Optional Arguments in Office Programming (C# Programming Guide)
http://msdn.microsoft.com/en-us/library/dd264738.aspx
Tutorials
I found these tutorials quite handy:
Automating Office Programs with VB.NET
http://www.xtremevbtalk.com/showthread.php?t=160433
VB.NET Office Automation FAQ
http://www.xtremevbtalk.com/showthread.php?t=160459
Understanding the Word Object Model from a .NET Developer's Perspective
http://msdn.microsoft.com/en-us/library/aa192495%28office.11%29.aspx
Early and Late binding
One point worth mentioning: late-binding is normally recommended against, but it can be very useful if you don't know what version of Office will be deployed on the application's host. Early-binding tends to operate faster, and has the advantage of intellisense in your IDE:
Using early binding and late binding in Automation
http://support.microsoft.com/kb/245115
Early vs. Late Binding
http://word.mvps.org/faqs/interdev/earlyvslatebinding.htm
Search and Replace
One thing to be aware of is that the find and replacement objects may not work as you would expect. Rather than searching the whole document, it searches just the main text. If you have text frames in the document, these will be ignored. Instead, you have to loop through all the StoryRanges, and search the content of each. Here's what I do in VB.NET to search the main text story and text frames:
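' wdMainTextStory, wdTextFrameStory, wdFindContinue and wdReplaceAll are Word enum
' constants (WdStoryType, WdFindWrap, WdReplace); with late binding they must be defined by hand.
Private Sub FindReplaceAll(ByVal objDoc As Object, ByVal strFind As String, ByVal strReplacement As String)
    Dim rngStory As Object
    For Each rngStory In objDoc.StoryRanges
        ' Each story can be a chain of linked ranges, so follow NextStoryRange to the end
        Do
            If rngStory.StoryType = wdMainTextStory Or rngStory.StoryType = wdTextFrameStory Then
                With rngStory.Find
                    .Text = strFind
                    .Replacement.Text = strReplacement
                    .Wrap = wdFindContinue
                    .Execute(Replace:=wdReplaceAll)
                End With
            End If
            rngStory = rngStory.NextStoryRange
        Loop Until rngStory Is Nothing
    Next rngStory
End Sub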
StoryRanges Collection Object
http://msdn.microsoft.com/en-us/library/bb178940%28office.12%29.aspx
I have a long history with document generation and mail merge. In the old days we used Office COM extensively, even in server-side (ASP) applications. Over the years we learned that this approach caused many problems, and today I always advocate against using Office COM (Word automation) in almost any scenario.
With the Microsoft’s introduction of Open XML SDK we managed to create a solid mail-merge component that was many times faster and much more robust than the solution(s) with Office COM. In my experience Open XML SDK allows a developer to create a solid solution, but it takes a lot of effort and time to make it useful and robust.
There are several good document generation/processing libraries on the market. We later ended up purchasing one and in my opinion creating your own solution (based on Open XML SDK or Office COM) simply never pays off.
Currently we are using Docentric Toolkit which is a general purpose document processing library and even better template-based/mail-merge toolkit for .NET. It allows template design in MS Word and then populating them with application data and producing final documents in different formats.
You can look into using XSL to generate some WordML.
This technique is definitely convoluted but gives you a lot of power over your layout.
You don't need any 3rd party controls to create a Word document. From 2007 and onward Word can read html as a word document. You simply save any web page with the ".doc" extension and Word will sort it out.
Simply create your web page with whatever formatting you want then save it with a .doc extension.
I used HttpWebRequest to call the URL (with parameters) of my page, then used WebResponse and Stream to get my page into a buffer, then StreamReader and StreamWriter to save it to an actual document. I've then got my own custom function to download the file.
If anyone wants my code let me know
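For anyone who wants to see the shape of that trick without the ASP.NET plumbing, here is a rough sketch in Python (used only for illustration; the answer above did this with HttpWebRequest and StreamWriter). The function name and the sample markup are made up for this example, and fetching the page over HTTP is left out.

```python
from pathlib import Path


def save_html_as_doc(html, path):
    """Write HTML markup to a file named *.doc so that Word will open and convert it."""
    target = Path(path)
    if target.suffix.lower() != ".doc":
        raise ValueError("expected a .doc file name")
    # Declare the charset in the markup as well, so Word decodes the text correctly
    target.write_text(html, encoding="utf-8")
    return target


page = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><h1>Proposal</h1><p>Draft text.</p></body></html>"
)
```

Calling save_html_as_doc(page, "proposal.doc") produces a file that is still plain HTML on disk; Word only converts it when it is opened and re-saved.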

Replace text in Word Document via ASP.NET

How can I replace a string/word in a Word Document via ASP.NET? I just need to replace a couple words in the document, so I would like to stay AWAY from 3rd party plugins & interop. I would like to do this by opening the file and replacing the text.
The following attempts were made:
I created a StreamReader and Writer to read the file but I think that I am reading and writing in the wrong format. I think that Word Documents are stored in binary?? If word documents are binary, how would I read and write the file in binary?
Dim template As String = Request.MapPath("documentName.doc")
If File.Exists(template) Then
    Dim sr As New StreamReader(template)
    Dim content As String = sr.ReadToEnd()
    sr.Close()
    Dim sw As New StreamWriter(template)
    content = content.Replace("# T O D A Y S D A T E", Date.Now.ToString("MM/dd/yyyy"))
    sw.Write(content)
    sw.Close()
Else
The Word binary format is proprietary to Microsoft. The specification for reading it is complex, and it will take you ages to learn the document structure and the internal bit and byte layout. I really don't think you will save yourself any time going down this path, so consider the options below:
Use Open XML
Automate Word
Use third party library like Aspose
Use RTF rather than Doc. You can then look for specific RTF tag with your text and replace it with another set of RTF text block. This is probably the simplest for what you want to do if RTF is an acceptable format.
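To illustrate the RTF option: because RTF is plain text, the replacement is an ordinary string replace, as long as the new value is escaped for RTF. The sketch below is in Python purely for illustration, the #TODAYSDATE# marker is a hypothetical placeholder, and it assumes the marker sits in the file as one unbroken run (Word sometimes splits text across formatting groups, in which case a plain search will not find it).

```python
import struct


def rtf_escape(text):
    """Escape text for insertion into RTF source."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)        # RTF control characters
        elif ch == "\n":
            out.append("\\line ")        # raw newlines are ignored by RTF readers
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: \uN? with N as a signed 16-bit UTF-16 code unit
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u%d?" % unit)
    return "".join(out)


def replace_in_rtf(rtf_source, placeholder, value):
    """Replace every occurrence of the placeholder with the escaped value."""
    return rtf_source.replace(placeholder, rtf_escape(value))
```

Read the template as text, call replace_in_rtf(template_text, "#TODAYSDATE#", "01/02/2024"), and write the result to a new file rather than overwriting the template.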
From personal experience, automating Word isn't as bad as it sounds. It is really not suitable for a high-volume server environment, but for smaller loads it works well, provided you write your code carefully to manage the application object and handle exceptions.
EDITED: Corrected my initial comment about an NDA. That was the case when I worked on this back in 2005/6, and I didn't realize Microsoft had since decided to publish the specification.
Lots of choices:
Some of them expensive (Aspose)
Some of them hard (binary formats)
Some of them require Interop (VSTO)
or newer formats (Open XML)
Some of them not mentioned yet, like running Word on the server and just writing to that (not recommended by MSFT, but probably your only real choice for a) cheap, b) simple)
OfficeWriter.
If word documents are binary, how would I read and write the file in binary?
They are, and that's why you should use a third party library to program against them.
I would like to stay AWAY from 3rd party plugins & interop
This requirement makes the task extremely hard. If your documents are in the "old Word format" (.doc), I will almost say that you are out of luck. If you can use Word 2007 documents (.docx) instead, you should be able to solve the problem by unzipping the file (it's essentially a ZIP archive), do search/replace in contained XML files and zip the document up again.
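The unzip / search-replace / re-zip idea for .docx can be sketched in a few lines. This is Python used only for illustration (in .NET, System.IO.Packaging or ZipArchive plays the same role), and it rests on two assumptions: the main body lives in word/document.xml encoded as UTF-8, and the text being replaced is stored as one unbroken run. Word often splits a phrase across several runs, and then a plain replace will miss it.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def replace_in_docx(docx_bytes, old, new, part="word/document.xml"):
    """Return a new .docx (as bytes) with `old` replaced by `new` in one XML part."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == part:
                xml = data.decode("utf-8")
                # Escape both strings so &, < and > match the XML on disk
                xml = xml.replace(escape(old), escape(new))
                data = xml.encode("utf-8")
            # Every other part (styles, images, relationships) is copied unchanged
            dst.writestr(item, data)
    return out_buf.getvalue()
```

Usage is along the lines of reading the template with open(path, "rb"), calling replace_in_docx(data, "#NAME#", "Alice"), and writing the returned bytes to a new file; the marker and value here are just examples.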
See also: Generating a Word Document with C#
You could perform Word automation on the server to do it easily, but that route is fraught with danger. Automation is not designed to run server side, and you will find it regularly hangs when Word pops up a prompt or confirmation box waiting for input that nobody can see.
You have to make a trade off, use Word automation and accept it may hang pretty regularly (anything from daily to weekly), or buy a third party solution. I use Aspose and it has solved a lot of problems.

HTML to RTF Converter for .NET [closed]

I've already seen lots of posts on the site for RTF to HTML and some other posts talking about some HTML to RTF converters, but I'm really trying to get a full breakdown of what is considered the most widely used commercial product, open source product or if people recommend going home grown. Apologies if you consider this a duplicate question, but I'm trying to create a product matrix to see what is the most viable for our application. I also think this would be helpful for others.
The converter would be used in an ASP.NET 2.0 application (we're upgrading to 3.5 shortly but still sticking with WebForms) using SQLServer 2005 (soon 2008) as the DB.
From reading a few posts, SautinSoft appears to be popular as a commercial component. Are there other commercial components that you'd recommend for converting HTML to RTF? Price does matter, but even if it's a little on the expensive side, please list it.
For open source, I read that OpenOffice.org can be run as a service so that it can convert files. However, this appears to be only Java based. I imagine, I'd need some kind of interop to use this? What .NET open source components, if any, are out there for converting HTML to RTF?
For home grown, is an XSLT the way to go with XHTML? If so, what component do you recommend for generating XHTML? Otherwise, what other home grown avenues do you recommend?
Also, please note that I currently don't care so much about RTF to HTML. If a commercial component offers this and the price is still the same, fine, otherwise please don't mention it.
For what it's worth, and in no particular order.
A while ago I wanted to export to RTF and then import from RTF, the RTF in question being manipulated by MS Word.
The first problem is that RTF is not an open standard. It is an internal MS standard, and therefore they alter it as and when they like and do not generally worry about compatibility. Currently the versions of RTF are 1.3 to 1.9 and they are all different. Internally they use twips for measurement, just for good measure.
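For anyone who has not met twips before: a twip is 1/20 of a point, so there are 1440 to the inch, and RTF uses them for things like margins and indents. Font sizes are a separate oddity, since the \fs control word takes half-points. A tiny sketch of the arithmetic (Python for illustration only; the helper names are made up):

```python
TWIPS_PER_POINT = 20
TWIPS_PER_INCH = 1440  # 72 points per inch * 20 twips per point


def points_to_twips(points):
    """Convert a length in points to twips."""
    return round(points * TWIPS_PER_POINT)


def inches_to_twips(inches):
    """Convert a length in inches to twips (e.g. for margin values)."""
    return round(inches * TWIPS_PER_INCH)


def font_size_control(points):
    """Build the \\fs control word; RTF font sizes are given in half-points."""
    return "\\fs%d" % round(points * 2)
```

So a one-inch left margin comes out as 1440 twips, and 12 pt text is written as \fs24.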
I bought the O'Reilly pocket book on the subject which helped and read a lot of the MS documentation which is good, but there is a lot of it and lots for each version.
Because of the way RTF is coded, using regex to manipulate it is incredibly hard work and needs careful handling and concentration to test and get working. I used a Mac editor that had built-in regex support, so I could steadily test each section and build it into the code.
Because of the number of versions there is also a lot of incompatibility between them, but there is a lot of commonality too, and in the end it was reasonably achievable to get where I wanted (after about a week's reading and a week's coding), producing a really simple version.
I never found a commercial solution; I had to have a free one because of budget, which cut a lot out. If you do buy, take great care in choosing one to make sure it does what you want and has support.
I don't think it matters where you are coming from (HTML/XML/XHTML; I was converting CSV formats); it is the RTF that is the hard part.
I am not sure if I would advise DIY or buy. Probably on balance DIY, but your own circumstances will dictate that.
Edit: One thing: going from content to RTF is easier than vice versa.
BTW, I'm not criticising MS for the RTF versions; hey, it's theirs and proprietary, so they can do what they like.
I would recommend doing it yourself, as the task is not really that complex. Firstly, the easiest way to convert one XML format into another XML format is with an XSLT. Converting XML documents in C# is super easy.
Here is a good MSDN blog post to get you started. Mike even mentions that it was easier to do this by hand than to deal with a third party.
link
Actually, I already answered this question here. Guess that makes this a duplicate.
I just came across this WYSIWYG rich text editor (RTE) for the web that also has an HTML to RTF converter, Cute Editor for .NET. Does anyone have any experience with this component? My main experience for web based RTEs have been CKEditor (fckEditor) and TinyMCE but as far as I can tell CKEditor and TinyMCE do not have HTML to RTF converters built in.
Since I'm required to implement some mailmerge capabilities with rich-text formatting on a Web application, I thought it'd be nice to share my experiences.
Basically, I explored two alternatives:
using Google Docs API to leverage Google Docs capabilities
using XSLT, as shown on this essay
Google Docs API works well. Problem is, when you upload an HTML document with page breaks, like this:
<p style="page-break-before:always;display:none;"/>
and ask Google to convert the doc in RTF, you lose all breaks, which does not fit my requirements. However, if page breaks aren't an issue for you, you might check this solution out.
The XSLT solution works... sort of.
It works if you reference the MSXML3 COM object directly, bypassing the System.Xml classes. Otherwise I couldn't make it work. Moreover, it seems to honor only basic formatting and tags, disregarding text color, size and the like. However, it honors page breaks. :-)
Here's a quick library I wrote, using tidy.net to force HTML to XHTML conversion. Hope it helps.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ADDS.Mailmerge
{
    public class XHTML2RTF
    {
        MSXML2.FreeThreadedDOMDocument _xslDoc;
        MSXML2.FreeThreadedDOMDocument _xmlDoc;
        MSXML2.IXSLProcessor _xslProcessor;
        MSXML2.XSLTemplate _xslTemplate;

        // Lazily created singleton, guarded by a lock
        static XHTML2RTF instance = null;
        static readonly object padlock = new object();

        XHTML2RTF()
        {
            _xslDoc = new MSXML2.FreeThreadedDOMDocument();
            //XSLData.xhtml2rtf is a resource file
            // containing XSL for transformation
            // I got XSL from here:
            // http://www.codeproject.com/KB/HTML/XHTML2RTF.aspx
            _xslDoc.loadXML(XSLData.xhtml2rtf);
            _xmlDoc = new MSXML2.FreeThreadedDOMDocument();
            _xslTemplate = new MSXML2.XSLTemplate();
            _xslTemplate.stylesheet = _xslDoc;
            _xslProcessor = _xslTemplate.createProcessor();
        }

        public string ConvertToRTF(string xhtmlData)
        {
            try
            {
                string sXhtml = "";

                // Run the input through Tidy so the XSLT receives well-formed XHTML
                TidyNet.Tidy tidy = new TidyNet.Tidy();
                tidy.Options.XmlOut = true;
                tidy.Options.Xhtml = true;
                using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(xhtmlData)))
                using (MemoryStream sw = new MemoryStream())
                {
                    TidyNet.TidyMessageCollection messages = new TidyNet.TidyMessageCollection();
                    tidy.Parse(ms, sw, messages);
                    sXhtml = Encoding.UTF8.GetString(sw.ToArray());
                }

                // Apply the XHTML-to-RTF stylesheet
                _xmlDoc.loadXML(sXhtml);
                _xslProcessor.input = _xmlDoc;
                _xslProcessor.transform();
                return _xslProcessor.output.ToString();
            }
            catch (Exception exc)
            {
                throw new Exception("Error in xhtml conversion. ", exc);
            }
        }

        public static XHTML2RTF Instance
        {
            get
            {
                lock (padlock)
                {
                    if (instance == null)
                    {
                        instance = new XHTML2RTF();
                    }
                    return instance;
                }
            }
        }
    }
}

Resources