HTML to RTF Converter for .NET [closed]


I've already seen lots of posts on the site for RTF to HTML and some other posts talking about some HTML to RTF converters, but I'm really trying to get a full breakdown of what is considered the most widely used commercial product, open source product or if people recommend going home grown. Apologies if you consider this a duplicate question, but I'm trying to create a product matrix to see what is the most viable for our application. I also think this would be helpful for others.
The converter would be used in an ASP.NET 2.0 application (we're upgrading to 3.5 shortly but still sticking with WebForms) using SQLServer 2005 (soon 2008) as the DB.
From reading a few posts, SautinSoft appears to be popular as a commercial component. Are there other commercial components that you'd recommend for converting HTML to RTF? Price does matter, but even if it's a little on the expensive side, please list it.
For open source, I read that OpenOffice.org can be run as a service so that it can convert files. However, this appears to be only Java based. I imagine, I'd need some kind of interop to use this? What .NET open source components, if any, are out there for converting HTML to RTF?
For home grown, is an XSLT the way to go with XHTML? If so, what component do you recommend for generating XHTML? Otherwise, what other home grown avenues do you recommend?
Also, please note that I currently don't care so much about RTF to HTML. If a commercial component offers this and the price is still the same, fine, otherwise please don't mention it.

For what it's worth, and in no particular order.
A while ago I wanted to export to RTF and then import from RTF, with the RTF in question being manipulated by MS Word.
The first problem is that RTF is not an open standard. It is an internal MS standard, and therefore they alter it as and when they like and do not generally worry about compatibility. Currently the versions of RTF are 1.3 to 1.9 and they are all different. Internally they use twips (1/20 of a point) for measurement, just for good measure.
I bought the O'Reilly pocket book on the subject, which helped, and read a lot of the MS documentation, which is good, but there is a lot of it and lots for each version.
Because of the way RTF is coded, using regex to manipulate it is incredibly hard work and needs careful handling and concentration to test and get working. I used a Mac editor that had built-in regex so I could steadily test each section and build it into the code.
Because of the number of versions there is also a lot of incompatibility between versions, but there is a lot of commonality, and in the end it was reasonably hard/easy to get where I wanted (after about a week's reading and a week's coding), producing a really simple version.
I never found a commercial solution, but I had to have a free one because of budget, so that cut a lot out. Take great care in choosing one to make sure it does what you want and has support.
I don't know how much of this applies where you are coming from (HTML/XML/XHTML); I was converting CSV formats into RTF.
I am not sure whether I would advise DIY or buy. Probably on balance DIY, but your own circumstances will dictate that.
Edit: One thing: going from content to RTF is easier than vice versa.
BTW not criticising MS for the RTF versions, hey it's theirs and proprietary so they can do what they like.
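To make the CSV-to-RTF direction described above concrete, here is a minimal sketch in C#. It is not the answerer's code: the class and method names are hypothetical, the CSV split is deliberately naive (no quoted fields), and only a tiny subset of RTF is emitted. It shows the three things that tend to matter first: escaping the control characters, emitting non-ASCII text as \uN? escapes, and working in twips (20 per point, 1440 per inch).

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class RtfSketch
{
    // RTF page measurements are in twips: 1 point = 20 twips, 1 inch = 1440 twips.
    public static int PointsToTwips(double points)
    {
        return (int)Math.Round(points * 20);
    }

    // Escape the three RTF control characters and emit non-ASCII
    // characters as \uN? (N is a signed 16-bit decimal, ? is the fallback).
    public static string Escape(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);
            else if (c > 127)
                sb.Append("\\u").Append(unchecked((short)c)).Append('?');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // One paragraph per CSV line, cells separated by tabs.
    // Assumption: cells contain no quoted commas.
    public static string FromCsvLines(string[] lines)
    {
        var sb = new StringBuilder(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}");
        sb.Append(@"\margl").Append(PointsToTwips(72)); // 1 inch left margin
        sb.Append(@"\fs24 ");                           // \fs is in half-points: 12pt
        foreach (string line in lines)
        {
            string[] cells = line.Split(',');
            for (int i = 0; i < cells.Length; i++)
            {
                if (i > 0) sb.Append(@"\tab ");
                sb.Append(Escape(cells[i]));
            }
            sb.Append(@"\par ");
        }
        return sb.Append('}').ToString();
    }
}
```

Writing the result of RtfSketch.FromCsvLines to a file with an .rtf extension should give something Word can open; anything beyond plain text and tabs (tables, styles, images) needs considerably more of the specification.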

I would recommend doing it yourself as the task is not really that complex. Firstly, the easiest way to convert one XML format into another XML format is with an XSLT. Converting XML documents in C# is super easy.
Here is a good MSDN blog post to get you started. Mike even mentions that it was easier to do this by hand than to deal with a third party.
link
Actually, I already answered this question here. Guess that makes this a duplicate.
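As a rough illustration of the XSLT route, here is a minimal sketch using XslCompiledTransform. The stylesheet is a toy written for this example (it is not the one from the blog post) and handles only p, b and i. It assumes the input is well-formed XHTML without a DOCTYPE and without the XHTML namespace, and it does not RTF-escape the text, so treat it as a starting point rather than a converter.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, for illustration only.
public static class XsltRtfSketch
{
    // Toy stylesheet: text output, three element rules, everything else
    // falls through to the built-in rules (which copy text content).
    const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">{\rtf1\ansi <xsl:apply-templates/>}</xsl:template>
  <xsl:template match=""p""><xsl:apply-templates/>\par </xsl:template>
  <xsl:template match=""b"">{\b <xsl:apply-templates/>}</xsl:template>
  <xsl:template match=""i"">{\i <xsl:apply-templates/>}</xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xhtml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xslReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, hence the null argument list.
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling XsltRtfSketch.Transform with a fragment such as <html><body><p>Hello <b>world</b></p></body></html> produces an RTF group with the bold run wrapped in {\b ...}. Real-world HTML would first need tidying into XHTML, and the match patterns would need a namespace prefix if the input declares the XHTML namespace.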

I just came across this WYSIWYG rich text editor (RTE) for the web that also has an HTML to RTF converter, Cute Editor for .NET. Does anyone have any experience with this component? My main experience with web based RTEs has been with CKEditor (fckEditor) and TinyMCE, but as far as I can tell CKEditor and TinyMCE do not have HTML to RTF converters built in.

Since I'm required to implement some mailmerge capabilities with rich-text formatting on a Web application, I thought it'd be nice to share my experiences.
Basically, I explored two alternatives:
using Google Docs API to leverage Google Docs capabilities
using XSLT, as shown on this essay
Google Docs API works well. Problem is, when you upload an HTML document with page breaks, like this:
<p style="page-break-before:always;display:none;"/>
and ask Google to convert the doc in RTF, you lose all breaks, which does not fit my requirements. However, if page breaks aren't an issue for you, you might check this solution out.
The XSLT solution works... sort of.
It works if you reference the MSXML3 COM object directly, bypassing the System.Xml classes. Otherwise I couldn't make it work. Moreover, it seems to honor only basic formatting and tags, disregarding text color, size and the like. However, it honors page breaks. :-)
Here's a quick library I wrote, using tidy.net to force HTML to XHTML conversion. Hope it helps.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ADDS.Mailmerge
{
    public class XHTML2RTF
    {
        MSXML2.FreeThreadedDOMDocument _xslDoc;
        MSXML2.FreeThreadedDOMDocument _xmlDoc;
        MSXML2.IXSLProcessor _xslProcessor;
        MSXML2.XSLTemplate _xslTemplate;
        static XHTML2RTF instance = null;
        static readonly object padlock = new object();

        XHTML2RTF()
        {
            _xslDoc = new MSXML2.FreeThreadedDOMDocument();
            //XSLData.xhtml2rtf is a resource file
            // containing XSL for transformation
            // I got XSL from here:
            // http://www.codeproject.com/KB/HTML/XHTML2RTF.aspx
            _xslDoc.loadXML(XSLData.xhtml2rtf);
            _xmlDoc = new MSXML2.FreeThreadedDOMDocument();
            _xslTemplate = new MSXML2.XSLTemplate();
            _xslTemplate.stylesheet = _xslDoc;
            _xslProcessor = _xslTemplate.createProcessor();
        }

        public string ConvertToRTF(string xhtmlData)
        {
            try
            {
                string sXhtml = "";
                TidyNet.Tidy tidy = new TidyNet.Tidy();
                tidy.Options.XmlOut = true;
                tidy.Options.Xhtml = true;
                using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(xhtmlData)))
                {
                    StringBuilder sb = new StringBuilder();
                    using (MemoryStream sw = new MemoryStream())
                    {
                        TidyNet.TidyMessageCollection messages = new TidyNet.TidyMessageCollection();
                        tidy.Parse(ms, sw, messages);
                        sXhtml = Encoding.UTF8.GetString(sw.ToArray());
                    }
                }
                _xmlDoc.loadXML(sXhtml);
                _xslProcessor.input = _xmlDoc;
                _xslProcessor.transform();
                return _xslProcessor.output.ToString();
            }
            catch (Exception exc)
            {
                throw new Exception("Error in xhtml conversion. ", exc);
            }
        }

        public static XHTML2RTF Instance
        {
            get
            {
                lock (padlock)
                {
                    if (instance == null)
                    {
                        instance = new XHTML2RTF();
                    }
                    return instance;
                }
            }
        }
    }
}

Related

Possibility to modify or extend code in D365FO to suppress thrown error

Original class function creates an SQL query and executes it.
Since there is a syntax error in the query, it throws an error. What's the correct way to fix this? A class extension does not work, because CoC executes the complete original function.
originalFunction(..)
{
    createSomeSQLQueryWithSyntaxErrorInIt();
    executeQuery();
}
The class in question is ReqDemPlanMissingForecastFiller. In method insertMissingDatesForecastEntries a direct SQL statement string is generated. The date variable nonFrozenForecastStartDate is added to the string, but is not escaped correctly as it seems.
If the SQL statement is executed, a syntax error occurs. If the statement is fixed, it can be executed e.g. in SQL Server Management Studio (SSMS).

In this specific case, based on your comments, you may be able to sidestep.
Create a new class ReqDemPlanMissingForecastFiller_Fix extending ReqDemPlanMissingForecastFiller then copy/paste the erroneous function and correct the mistake.
Create an extension class and change the newParameters static function.
[ExtensionOf(classStr(ReqDemPlanMissingForecastFiller))]
final class ReqDemPlanMissingForecastFiller_Extention
{
    public static ReqDemPlanMissingForecastFiller newParameters(
        ReqDemPlanCreateForecastDataContract _dataContract,
        ReqDemPlanAllocationKeyFilterTmp _allocationKeyFilter,
        ReqDemPlanTaskLoggerInterface _logger = null)
    {
        ReqDemPlanMissingForecastFiller filler = next newParameters(_dataContract, _allocationKeyFilter, _logger);
        filler = new ReqDemPlanMissingForecastFiller_Fix(); //Throw away previous value
        filler.parmDataContract(_dataContract);
        filler.parmAttributeManager(_dataContract.attributeManager());
        filler.parmAllocationKeyFilter(_allocationKeyFilter);
        filler.parmLogger(_logger);
        filler.init();
        return filler;
    }
}
Code above was based on AX 2012 code. Stupid solution to a stupid problem.
It goes almost without saying that you should report the problem to Microsoft.

@Jan B. Kjeldsen's answer describes how the specific case can be solved without involving Microsoft.
Since overlayering is no longer possible, the solution involves copying a fair bit of standard code. This brings its own risks, because future changes by Microsoft for that code are not reflected in the copied code.
Though it cannot always be avoided, other options should be evaluated first:
As @Jan B. Kjeldsen mentioned, errors in the standard code should be reported to Microsoft (see Get support for Finance and Operations apps or Lifecycle Services (LCS)). This enables them to fix the error.
Pro: No further work needed.
Con: Microsoft may decline the fix or take a long time to implement it.
If unlike in this specific case the issue is not a downright error, but a lack of extension options, an extensibility request can be created with Microsoft. They will then add an extension option.
Pro: No further work needed.
Con: Microsoft may decline the extensibility request or take a long time to implement it.
For both errors as well as missing extension options, Microsoft also offers the Community Driven Engineering program (CDE). This enables you to develop changes in the standard code directly via a special Microsoft hosted repository where the standard code is not locked for changes.
Pro: Most flexible and fastest of all options involving Microsoft.
Con: You have to do the work yourself. Microsoft may decline the change. It can still take some time before the change is available in a GA version.
You can also consider a hybrid approach: For a quick solution, copy standard code and customize it as required. But also report an error, create an extensibility request or fix it yourself in the CDE program. When the change is available in standard code, you can then remove the copied code again.

grabbing data from a ASP.NET webForm

I'm fairly new to web development and have never done any screen-scraping or web-crawling before, but yesterday a friend of mine asked me if I would be able to grab some data from this website, which is not mine, nor his, but the data is publicly available, even for download.
The problem with the data is that it's available only as one file per date or company, rather than one file for multiple dates or companies, which involves a lot of tedious 'clicking through' the calendar, so he thought it would be nice if I could create some app that grabs all the data with one click and outputs it in one single file or something similar.
The website uses an ASPX WebForm with __doPostBack to retrieve the data for different dates; even the links to download the data in XSL aren't the usual "href=…" links, they are, I assume, references for some ASP script…
To be honest the only thing I tried was PHP cURL, which didn't work, but since I tried cURL for the first time, I don't even know if it didn't work because it is not possible with cURL, or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, but not in ASP, though I wouldn't mind learning something new.
So my question is..
Is it at all possible to grab the data from a website like this? and if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
the website, again, is here http://extranet.net4gas.cz/capacity_ee.aspx
Thanks

C# has a nice WebClient class to do the job:
// Create web client.
WebClient client = new WebClient();
// Download string.
string value = client.DownloadString("http://www.microsoft.com/");
Once you have the page HTML in a string, you use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression to give a hint:
Regex regex = new Regex(@"\d+");
Match match = regex.Match("hello here 10 values");
if (match.Success)
{
    Console.WriteLine(match.Value);
}
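Since the page in question relies on __doPostBack, a plain GET only returns the first view. In WebForms, __doPostBack(target, argument) fills the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form together with the other hidden fields such as __VIEWSTATE. The sketch below, which uses hypothetical class and method names and was not tested against the site in the question, extracts the hidden fields from a downloaded page with regular expressions and builds the set of form fields for such a postback. The regexes assume double-quoted attributes, which is what WebForms normally emits.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PostBackSketch
{
    static readonly Regex HiddenInput = new Regex(
        @"<input[^>]*type=""hidden""[^>]*>", RegexOptions.IgnoreCase);
    static readonly Regex NameAttr = new Regex(
        @"\sname=""([^""]*)""", RegexOptions.IgnoreCase);
    static readonly Regex ValueAttr = new Regex(
        @"\svalue=""([^""]*)""", RegexOptions.IgnoreCase);

    // Collects every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION)
    // and sets the two fields that __doPostBack would have filled in.
    public static Dictionary<string, string> BuildPostBackFields(
        string html, string eventTarget, string eventArgument)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match input in HiddenInput.Matches(html))
        {
            Match name = NameAttr.Match(input.Value);
            if (!name.Success) continue;

            Match value = ValueAttr.Match(input.Value);
            fields[name.Groups[1].Value] = value.Success
                ? WebUtility.HtmlDecode(value.Groups[1].Value)
                : "";
        }
        fields["__EVENTTARGET"] = eventTarget;
        fields["__EVENTARGUMENT"] = eventArgument;
        return fields;
    }
}
```

The target and argument values can be read from the javascript:__doPostBack('...','...') links in the page source. The resulting dictionary could then be sent back to the same URL as a form POST, for example with HttpClient and FormUrlEncodedContent, and the response scraped in the same way.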

Marosko, as you said, the data on the website is open to the public, so for sure you can scrape data out of it. The goal now is to cut down the manual clicking through dates and scraping data out of each one. I personally don't have much idea about how cURL would work, but I am sure it will involve a lot of coding. I would rather suggest you automate the entire process using some automation tool, like a software application. Try Automation Anywhere; I bought it a few months back for some data extraction purpose and it worked very well. It is automated and you can check the screen scraping capabilities it shows. It's my favorite :)
Charles

Programmatically generating editable Word docs from ASP.NET?

The purpose is to generate proposal documents that can manually be edited in Word after the fact, but before sending them out to the customers.
Much proposal content would be drawn from existing HTML website content (backing CMS) and also some custom (non-HTML) injection for certain scenarios. Of course the conditional logic could go into server-side ASP.NET to vary the content appropriately.
I'm open to 3rd-party tools if raw manipulation of the Word API is arduous. In fact a good 3rd party tool might be the answer.

Use the Aspose Words component for .Net.
Aspose Words Component Link
The component natively understands the Microsoft Word file format without having to install any Microsoft Office products on your application environment. You can then start from an existing Word template or programmatically build up an entire Microsoft Word document from scratch. The Word object model then allows you to export to doc / docx etc. and save as a native Word file wherever you require.
They have plenty of demos set up on their website.

I've not used any third-party tools before, as I've only ever written Office automation applications for PCs which already have Office installed.
Creating documents from scratch, or basing them on a template, is quite straightforward. With templates, you can define bookmarks and mail-merge fields to make finding and replacing document elements easier.
Here's a few things that you may find useful:
Named and Optional Arguments
The Word object model is reasonably easy to work with. VB.NET used to be easier to work with than C#: as the Office automation APIs were originally written with VB in mind, you could take advantage of optional parameters. In earlier versions of C#, you had to specify every argument in API calls, which was quite tedious. I understand that this has changed in Visual C# 2010:
How to: Use Named and Optional Arguments in Office Programming (C# Programming Guide)
http://msdn.microsoft.com/en-us/library/dd264738.aspx
Tutorials
I found these tutorials quite handy:
Automating Office Programs with VB.NET
http://www.xtremevbtalk.com/showthread.php?t=160433
VB.NET Office Automation FAQ
http://www.xtremevbtalk.com/showthread.php?t=160459
Understanding the Word Object Model from a .NET Developer's Perspective
http://msdn.microsoft.com/en-us/library/aa192495%28office.11%29.aspx
Early and Late binding
One point worth mentioning: late-binding is normally recommended against, but it can be very useful if you don't know what version of Office will be deployed on the application's host. Early-binding tends to operate faster, and has the advantage of intellisense in your IDE:
Using early binding and late binding in Automation
http://support.microsoft.com/kb/245115
Early vs. Late Binding
http://word.mvps.org/faqs/interdev/earlyvslatebinding.htm
Search and Replace
One thing to be aware of is that the Find and Replacement objects may not work as you would expect. Rather than searching the whole document, they search just the main text. If you have text frames in the document, these will be ignored. Instead, you have to loop through all the StoryRanges, and search the content of each. Here's what I do in VB.NET to search the main text story and text frames:
Private Sub FindReplaceAll(ByVal objDoc As Object, ByVal strFind As String, ByVal strReplacement As String)
    Dim rngStory As Object
    For Each rngStory In objDoc.StoryRanges
        Do
            If rngStory.StoryType = wdMainTextStory Or rngStory.StoryType = wdTextFrameStory Then
                With rngStory.Find
                    .Text = strFind
                    .Replacement.Text = strReplacement
                    .Wrap = wdFindContinue
                    .Execute(Replace:=wdReplaceAll)
                End With
            End If
            rngStory = rngStory.NextStoryRange
        Loop Until rngStory Is Nothing
    Next rngStory
End Sub
StoryRanges Collection Object
http://msdn.microsoft.com/en-us/library/bb178940%28office.12%29.aspx

I have a long history regarding document generation and mail merge. In the old days we were using Office COM extensively, even in server side (ASP) applications. Over the years we have learnt that this approach was causing many problems, and today I’m always advocating against using Office COM (Word automation) in almost any scenario.
With the Microsoft’s introduction of Open XML SDK we managed to create a solid mail-merge component that was many times faster and much more robust than the solution(s) with Office COM. In my experience Open XML SDK allows a developer to create a solid solution, but it takes a lot of effort and time to make it useful and robust.
There are several good document generation/processing libraries on the market. We later ended up purchasing one and in my opinion creating your own solution (based on Open XML SDK or Office COM) simply never pays off.
Currently we are using Docentric Toolkit which is a general purpose document processing library and even better template-based/mail-merge toolkit for .NET. It allows template design in MS Word and then populating them with application data and producing final documents in different formats.

You can look into using XSL to generate some WordML.
This technique is definitely convoluted but gives you a lot of power in your layout.

You don't need any 3rd party controls to create a Word document. From 2007 onward, Word can read HTML as a Word document. You simply save any web page with the ".doc" extension and Word will sort it out.
Simply create your web page with whatever formatting you want, then save it with a .doc extension.
I used HttpWebRequest to call the URL (with parameters) of my page, then used WebResponse and Stream to get my page into a buffer, then StreamReader and StreamWriter to save it to an actual document. I've then got my own custom function to download the file.
If anyone wants my code let me know
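The answerer's code is not shown, so the following is only a guess at the shape of the final step: once the page's HTML is in a string (however it was fetched), it is written to a file with a .doc extension. The class and method names are hypothetical. Encoding.UTF8 is used so that File.WriteAllText emits a byte order mark, which is assumed here to help Word pick the right encoding.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class HtmlAsDocSketch
{
    // Writes HTML markup unchanged into a file that Word is expected
    // to open as a document because of its .doc extension.
    public static string SaveAsDoc(string html, string directory, string baseName)
    {
        string path = Path.Combine(directory, baseName + ".doc");
        File.WriteAllText(path, html, Encoding.UTF8);
        return path;
    }
}
```

For example, HtmlAsDocSketch.SaveAsDoc(html, Path.GetTempPath(), "proposal") returns the full path of the written file, which a download handler could then stream to the browser.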

Is it OK to use WPF assemblies in a web app?

I have an ASP.NET MVC 2 app targeting .NET 4 that needs to be able to resize images on the fly and write them to the response.
I have code that does this and it works. I am using System.Drawing.dll.
However, I want to enhance my code so that not only am I resizing the image, but I am dropping it from 24bpp down to 4bit grayscale. I could not, for the life of me, find code on how to do this with System.Drawing.dll.
But I did find a bunch of WPF stuff. This is my working/sample code (runs in LinqPad).
// Load the original 24 bit image
var bitmapImage = new BitmapImage();
bitmapImage.BeginInit();
bitmapImage.UriSource = new Uri(@"C:\Temp\Resized\18_appa2_015.png", UriKind.Absolute);
//bitmapImage.DecodePixelWidth = 600;
bitmapImage.EndInit();
// Create the destination image
var formatConvertedBitmap = new FormatConvertedBitmap();
formatConvertedBitmap.BeginInit();
formatConvertedBitmap.Source = bitmapImage;
formatConvertedBitmap.DestinationFormat = PixelFormats.Gray4;
formatConvertedBitmap.EndInit();
// Encode and dump the image to disk
var encoder = new PngBitmapEncoder();
encoder.Frames.Add(BitmapFrame.Create(formatConvertedBitmap));
using (var fileStream = File.Create(@"C:\Temp\Resized\18_appa2_015_s2.png"))
{
    encoder.Save(fileStream);
}
It uses System.Xaml.dll, WindowsBase.dll, PresentationCore.dll, and PresentationFramework.dll. The namespaces used are: System.Windows.Controls, System.Windows.Media, and System.Windows.Media.Imaging.
Is there any problem using these namespaces in my web application? It doesn't seem right.
If anyone knows how to drop the bit depth without all this WPF stuff (which I barely understand, BTW) I would be thrilled to see that too.

No problem. You can easily use WPF for image manipulation from within an ASP.NET web site. I've used WPF behind the scenes within a web site several times before, and it works great.
The one issue I did run into is that many parts of WPF insist the calling threads be STA threads. If your web site uses MTA threads instead you will get an error telling you that WPF needs STA threads. To fix this, use the STAThreadPool class I posted in this answer.

My understanding (and I can't find a citation right now) is that this is officially not supported. However, in practice, it seems to work pretty well and you would not be alone in using these libraries in a Web app. Also, even if you can find a way of doing this with System.Drawing, I believe that officially that's not supported in the Web environment either -- though it is more widely used in that environment than WPF, which gives you an extra level of reassurance.

how to preserve the look and feel when converting HTML to PDF

I have been using iTextSharp to do an HTML to PDF conversion. Overall it works fairly well, but it doesn't seem to like most of the formatting.
Bold, Italic, and Underline are all working, however, none of the font sizes, styles or other information is respected, therefore the export doesn't look much at all like the HTML that was used to create the format.
Does anyone know how to either
fix the way the iTextSharp exports (below is a sample of my code)
Or know of a different product that is out there that provides this functionality, and will not break the bank?
This is my code:
//Do the PDF thing
Document document = new Document(PageSize.A4);
using (Stream output = new FileStream(Server.MapPath(relDownloadDoc), FileMode.Create, FileAccess.Write, FileShare.None))
using (Stream htmlStream = new FileStream(Server.MapPath(relProcessingDoc), FileMode.Open, FileAccess.Read, FileShare.Read))
using (XmlTextReader reader = new XmlTextReader(htmlStream))
{
    reader.WhitespaceHandling = WhitespaceHandling.None;
    PdfWriter.GetInstance(document, output);
    document.Open();
    Console.ReadLine();
    HtmlParser.Parse(document, reader);
    document.Close();
}

Try WKHTMLTOPDF.
It's an open source tool built on the WebKit rendering engine. Both are free.
We've set a small tutorial here

From Convert HTML + CSS to PDF with PHP? I found out about Prince XML, which has clients for lots of languages including the .Net platform.
It is an exceptional converter, though commercial and not cheap. There is a Google Tech Talk about it. Allegedly, Google uses it for Google Docs. Its rendering engine also passed the Acid2 test.
If you want high-quality HTML to PDF conversion and are willing to spend the ~$3800 for a server license then look no further. Frankly I think the cost in time of getting anything else to do what Prince does will quickly outstrip the cost involved. Developer time is expensive.

I have used pd4ml for a few things. It seems to work pretty well.
Here is a list html tags/attributes that pd4ml supports: http://pd4ml.com/html.htm

ActivePDF is $375 for a single server license, and does an excellent job. We've used it in client projects before and it's been great.
http://www.activepdf.com/products/serverproducts/webgrabber/index.cfm
EDIT: Nevermind, it depends on another one of their products that costs $1,400. Thought it would roll in cheaper than some of the other suggestions. A few more minutes of research came up with the following alternatives:
Under $500:
http://www.websupergoo.com/abcpdf-1.htm (You'll need the professional edition to keep as much formatting as possible).
