In my ASP.NET website I have an HTML page that contains some words in superscript; for example, the HTML representation is as below...
ABC<sup>def</sup>
When it is rendered in the browser it appears like this: ABCdef
I have an export feature which exports the HTML document to a Word document. When I export, it comes out as ABC<sup>def</sup> in the Word doc, with the literal tags showing.
I have been trying to fix it using some kind of string replacement, e.g. html = html.Replace("<sup>", "\"");
but that doesn't help.
Can anybody please tell me how I can make it appear as ABCdef in my Word document too, using ASP.NET?
This is an older question but I've been trying to do the same thing in reverse. The problem is with how Word formats and defines its super and subscripts.
There are no tags that Word uses to define super- and subscripts; it uses relative positioning and font-size declarations to generate symbols that look like super- and subscripts.
Unfortunately, there is no reliable regex to turn Word's positioning into super/subscripts, and there is no way to regex a document with those tags so that Word recognizes them.
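For the original direction (HTML into a Word document), one workaround is to replace the tags with inline-styled spans before export, since Word's HTML importer generally honors vertical-align and font-size. A sketch, not guaranteed for every Word version; the class and method names here are made up:

```csharp
using System;
using System.Text.RegularExpressions;

class SupToWordStyle
{
    // Replace <sup>/<sub> tags with inline-styled spans that Word's
    // HTML importer renders as raised/lowered small text.
    public static string ConvertForWord(string html)
    {
        html = Regex.Replace(html, @"<sup>(.*?)</sup>",
            "<span style=\"vertical-align:super;font-size:smaller;\">$1</span>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        html = Regex.Replace(html, @"<sub>(.*?)</sub>",
            "<span style=\"vertical-align:sub;font-size:smaller;\">$1</span>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return html;
    }

    static void Main()
    {
        // ABC<span style="vertical-align:super;font-size:smaller;">def</span>
        Console.WriteLine(ConvertForWord("ABC<sup>def</sup>"));
    }
}
```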
I am working on a project in ASP.NET MVC which involves printing various information that the user entered previously.
I use TinyMCE for comment-type fields.
As we know, TinyMCE allows you to format text using different fonts, adding bold words, etc.
The string it produces will contain HTML tags that record the style chosen for the word or phrase.
I use iText7 to produce PDFs; when I print the fields I mentioned above, all the HTML tags are shown literally, for example:
<p> Hello World! </p>
Is there a way to tell iText7 that, when these tags are present, it must use an associated style for that tag?
I created "Hello World" here and used bold, italic and underline. Copy the source code via "Tools" and just use the following code to convert it via iText7:
String htmlString = "<p><span style=\"text-decoration: underline;\"><em><strong>Hello World</strong></em></span></p>";
HtmlConverter.convertToPdf(htmlString, new PdfWriter(destinationFolder + "test.pdf"));
The resulting PDF:
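To answer the per-tag styling part of the question: pdfHTML reads CSS much like a browser does, so tag-to-style mappings can be supplied in a <style> block prepended to the HTML. A sketch in C# (the .NET flavor of the API, since the question is ASP.NET MVC; the particular styles are made up):

```csharp
using System;

class TagStyles
{
    // Prepend a stylesheet mapping tags to styles; pdfHTML applies CSS
    // from a <style> block just like a browser would.
    public static string WithStyles(string body)
    {
        string css = "<style>"
                   + "p { font-family: Helvetica; font-size: 12pt; }"
                   + "strong { color: #003366; }"
                   + "em { letter-spacing: 1px; }"
                   + "</style>";
        return css + body;
    }

    static void Main()
    {
        string html = WithStyles("<p><em><strong>Hello World!</strong></em></p>");
        // Then convert exactly as before (iText7 pdfHTML for .NET):
        // HtmlConverter.ConvertToPdf(html, new PdfWriter("test.pdf"));
        Console.WriteLine(html);
    }
}
```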
I'm creating a PDF document using HtmlRenderer.PdfSharp library. Back-end receives html from front-end and creates PDF using GeneratePdf() method.
PdfGenerator.GeneratePdf(html, PdfSharp.PageSize.A4);
The process works fine when standard Latin characters are used in the html. I tried passing a UTF-8 test file as input in the front-end, and some of the characters weren't rendered properly, as seen in the attached image. When I bypassed the front-end by hard-coding the html value, the result was the same, so I assume the problem lies in the library.
Is there a way to properly render those characters with this library? I especially care about math symbols such as ∮ or ℝ ⊂ ℂ.
I've found the answer thanks to https://stackoverflow.com/a/59377248
Setting font-family in html element containing math symbols to "Segoe UI Symbol" allowed PdfSharp to render those symbols properly.
Character Map is a useful tool for finding fonts that include the symbols you want.
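For example, assuming the symbols sit in ordinary markup, the fix looks like this:

```html
<!-- Only the element containing the math symbols needs the symbol font -->
<p>
  Let <span style="font-family: 'Segoe UI Symbol';">ℝ ⊂ ℂ</span> and consider
  the contour integral <span style="font-family: 'Segoe UI Symbol';">∮</span>.
</p>
```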
I'm trying to use R Markdown to create a PDF document, and I'm having problems using certain HTML tags. For example, the R Markdown document
---
output: pdf_document
---
<pre>
code1
</pre>
<code>
code2
</code>
<pre><code>
code3
</code></pre>
compiles to give
code2
when the desired output is
code1
code2
code3
with some nice formatting for code3. But if I compile to HTML (output: html_document instead of output: pdf_document in the metadata), the problem goes away.
I'm compiling with TexShop on a Mac using the engine below.
#!/bin/bash
/Library/Frameworks/R.framework/Versions/Current/Resources/bin/Rscript -e "rmarkdown::render(\"$1\", encoding='UTF-8')"
I suspect that I'm not allowed to use certain HTML tags when I compile to a PDF, but I haven't been able to find any guidelines on this.
It is important to remember that the PDF format is not HTML and knows nothing of HTML tags. When a document is converted to PDF, each piece of the document needs to be converted to its corresponding PDF entity. Therefore, when you introduce non-standard raw HTML into your document, the converter can easily be confused.
Of course, how the converter works under the hood can affect the output as well. For example, if the tool you are using converts the Markdown to HTML and then converts that HTML to PDF, the raw HTML may have a better chance of being mapped properly. However, if the tool goes straight from a parse tree (a list of tokens) to the output format, it may not know anything about the raw HTML (unless it is also an HTML parser). The point is that using raw HTML adds another potential layer of failure when converting to PDF. My suggestion would be to avoid it if at all possible when you intend to convert to PDF (remember, Markdown was originally intended to output HTML only).
As it turns out, Markdown already offers a way (or two; depending on which implementation you are using) to mark up code blocks: indented code blocks (and possibly fenced code blocks). Interestingly, the HTML they output is the same as the raw HTML that you have found to work. Perhaps that should provide a clue that the other two possibilities you tried are not valid.
In fact, the HTML Spec is pretty clear that code blocks must be wrapped in <pre><code> tags. The <pre> tag is a block level tag, so it does not need to be wrapped in any parent tags. However, the <pre> tag does not identify its contents as being "code". Therefore, it should never be assumed that it contains "code" itself. On the other hand, the <code> tag is not a block level tag. It must be wrapped by a block level tag (like <pre> or <p>...). And the <code> tag is the only tag which marks content as being "code". Therefore, the only valid way to mark up a code block in HTML is to wrap it in <pre><code> tags. As it turns out, when you do that, it works. Therefore, my conclusion is that the converter is being confused by invalid HTML and failing (as it should).
So, in conclusion, either use native Markdown methods for marking up code or, if you must use raw HTML, stick to valid HTML.
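For instance, here is a sketch of the native-Markdown route: the document from the question written without raw HTML, using indented code blocks (fenced code blocks also work in implementations that support them):

```markdown
---
output: pdf_document
---

A four-space indent marks a code block in Markdown. It becomes
<pre><code> in HTML output and a verbatim block in PDF output:

    code1

Intervening text keeps the next block separate:

    code2
```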
The MediaWiki Extension:RSS (http://www.mediawiki.org/wiki/Extension:RSS) uses the plainlinks class to present the RSS feed link. I have tried all manner of searching, including trying to edit the MediaWiki:Rss-feed template to force the link to be presented in non-bold format.
Has anyone used this extension and can tell me how to change the fonts in the RSS link?
Thanks
As far as I can understand your question, you should be able to remove the boldface formatting from the RSS item titles by editing the page MediaWiki:Rss-item (not MediaWiki:Rss-feed) on your wiki.
What you need to do is two things:
remove the string ''' (MediaWiki markup for bold text) from either side of the title, and
remove the ; (MediaWiki markup for a list definition, which is also bolded by the default style sheet) from the beginning of the line.
That is, change the default content of the page:
; '''<span class='plainlinks'>[{{{link}}} {{{title}}}]</span>'''
: {{{description}}}
: {{{author}}} {{{date}}}<!-- don't use newline here -->
to this:
<span class='plainlinks'>[{{{link}}} {{{title}}}]</span>
: {{{description}}}
: {{{author}}} {{{date}}}<!-- don't use newline here -->
I have taken over a code base and I have to read in these HTML files that were generated, I think, by Microsoft Word, so they have all kinds of wacky inline formatting.
Is there any way to parse out all of the bad inline formatting and just get the text from this stream? I basically want a programmatic purifier so I can then apply some sensible CSS.
You should use HTML Tidy - it's ubiquitous when it comes to cleaning up HTML. There's an article on DevX that describes how to do it from .NET.
In the end I just wrote a small class that did a bunch of find-and-replaces. Not pretty, but it worked.
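A regex-based pass along those lines might look like the sketch below. It is only a sketch - real Word output has many more variations, and an HTML parser (Tidy, or a library such as HtmlAgilityPack) is more robust - and the class name is made up:

```csharp
using System;
using System.Text.RegularExpressions;

class WordHtmlCleaner
{
    // Strip Word's inline clutter (comments, Office-namespace tags,
    // style/class/lang attributes) while keeping the text and basic
    // tags, so sensible CSS can be applied afterwards.
    public static string Clean(string html)
    {
        // HTML comments, including Word's conditional <!--[if ...]--> blocks
        html = Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);
        // Office namespace tags such as <o:p>
        html = Regex.Replace(html, @"</?o:\w+[^>]*>", "", RegexOptions.IgnoreCase);
        // Inline formatting attributes, quoted or unquoted
        html = Regex.Replace(html,
            @"\s+(?:style|class|lang)=(""[^""]*""|'[^']*'|[^\s>]+)", "",
            RegexOptions.IgnoreCase);
        return html;
    }

    static void Main()
    {
        string wordHtml = "<p class=MsoNormal style='margin:0cm'>Hello<o:p></o:p></p>";
        Console.WriteLine(Clean(wordHtml)); // <p>Hello</p>
    }
}
```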