programmatically remove all html and inline formatting - asp.net

I have taken over a code base and I have to read in these html files that were generated by Microsoft Word, I think so it has all kinds of whacky inline formatting.
is there anyway to parse out all of the bad inline formatting and just get the text from this stream. I basically want a purifier programmatically so I can then apply some sensible css

You should use HTML Tidy - it's uniquitous when it comes to cleansing HTML. There's an article on DevX that describes how to do it from .NET.

in the end i just wrote a small class that did a bunch of find and replaces. not pretty but it worked.

Related

How to elegantly modify html to inject html element after x-th paragraph on the server side?

I need to modify html coming from external file (server side) before I render it and inject a quote 'component' like this:
This component needs to be injected after 2nd paragraph and I'm planning to use htmlagillity pack. Any examples? Is HtmlNode.InsertAfter() method good choice once I found third paragraph which should be trivial.
Another question is would it be possible to inject sitecore placeholder or even usercontrol that is going to render my quote instead of pure html? I feel it should be but not sure what would be good approach.
Thanks
I can suggest two possible approaches here:
1) Use snippets with some customisation. Snippets allow users to insert pre-defined chunks of HTML into a RTE field. You could have a pre-defined piece of HTML which might have some identifier to indicate it should use custom processing (I would suggest some data-xxx style attribute which would not conflict with any CSS or JavaScript). Then you could create a new renderField pipeline processor which would detect the data-xxx attribute within the content of a rich text field - you would use HtmlAgilityPack for this and then replace that snippet with the contents of your server-side file.
-or-
2) Split your text content into two separate chunks and have two instances of a "HtmlText" rendering within the placeholder, with a rendering for your quote text between them in the same placeholder.
I would advise that having a rule to insert text after the second paragraph would be quite 'brittle' as this would be very reliant on content editors setting the rich text field contents in quite a precise way e.g. to always ensure two or more paragraphs and to always break text with paragraphs - they might decide to use a load of line breaks instead to split their text. That said if you did do this, you would create a new renderField pipeline processor.

Syntax highlighting in jekyll using redcarpet

I'm trying to get code highlighting to work for a simple blog built with jekyll. I want to be able to do code highlighting within posts written in markdown so I enabled redcarpet as markup language. This works all fine, the code gets formatted in <pre></pre> tags and all the various elements of the code get corresponding classes. e.g.
<span class="n">function</span>
<span class="n">saySomething</span>
<span class="p">()</span>
<span class="p">{</span>
etc.
This is awesome but this doesn't give us of the actual highlighting (color) yet. So I suppose there must be some css ready to copy and paste which actually does the styling of the different code elements. Or am I missing something completely?
I looked into some code highlighting libraries like prettify or prism but these do their own formatting with javascript in the browser. But since redcarpet already does the heavy lifting work of formatting the code it is not necessary doing it again.
Any hints?
You need some CSS magic. Use this one or pick one from here.
You can create the CSS with the highlighter itself
rougify style > rouge.css
Or
coderay stylesheet > coderay.css
I like to share the solution as I faced and it took much time to get rid of this issue. Default syntax highlighting is very poor in Jekyll. Like David said, You really need some CSS magic. Check this gist to solve the syntax highlighting problem.

How do I script html that is not well formed to be well formed using classic asp and vbscript?

I am trying to parse some html to switch out values of various element attributes. I decided that the most reliable way to parse the html was to use an xml parser (msxml.)
The problem is that the html I'm trying to parse contains attribute like:
<param name="flashvars" value="autoplay=false&brand=embed&cid=97%2Ftest&locale=en_US"/>
Which causes the xml parser to blow up. I figured out that I need to server.htmlencode() the value attribute in order for the xml parser to load it properly. How do I approach this?
I feel like the problem is a vicious circle. I couldn't use regex's because html is not regular enough, and now I can't use xml parsers because the html isn't "well formed"
help. How do I approach this issue? I want to be able to change attribute values with a vbscript.
Is your HTML well formed? If so you could simply use an XML DomDocument. Use XPath to find the attributes you want to replace.
You can actually use JScript serverside as well in ASP, whicdh might give you access to HTMLDom libraries you could use.
You should probably have a look at one of the libraries for cleaning up HTML, something like HTML Tidy http://www.w3.org/People/Raggett/tidy/
Your main problem is you need to do a replace on the ampersands, they need to be & in well formed XML/XHTML.

How to get a newline in JSF (plain text)?

I am using JSF to generate text and need newlines to make the text easier to read. I have an HTML version which works great, I hacked it together using <br/> (I'm not proud of that, but it works).
I would like to do the same for the plain text version such as inserting \n.
I am doing something like this:
<customTagLibrary:customTag>
<h:outputText value="Exception"/><br/><br/>
...
</customTagLibrary:customTag>
Instead of the <br/>, I want \n. What is the best way to do that?
Please keep in mind that I'm NOT using this to generate content that will be sent to the browser. This will be used to create email messages or (plain-text) attachments in emails.
Thanks,
Walter
If you use Facelets to render HTML, this did the trick for me:
<h:outputText value="
" />
Why not simply wrap it in a HTML <pre> tag?
The h: prefix means html. So if you don't want html, don't use h: tags. Create your own tags or at least renderers for h: tags and let them output \n.
But my personal opinion is that it's better to use another templating technology for emails.
I'm assuming that your template XML strips whitespace. Unfortunately, EL doesn't let you express newlines in string literals, but you could bind to a string that did (<h:outputText value="#{applicationScope.foo.newline}" />). However, since you want to serve multiple markups, this would be a less than ideal approach.
To share JSF templates between different content types, you could 1) remove all markup specific tags from the template and 2) provide RenderKits which would provide a Renderer appropriate for the current markup. This would be the way to serve content using JSF's model-view-presenter design.
You may have to make some decisions about how you handle markup-specific attributes. The default render kit is geared towards rendering the HTML concrete components. Exactly what you do depends on your goals.
I am going to simply write a newline tag. It will detect whether it should output a or a \n. In my tag library, it would look like this:
<content:newline/>
Walter

Are there any tools out there to compare the structure of 2 web pages?

I receive HTML pages from our creative team, and then use those to build aspx pages. One challenge I frequently face is getting the HTML I spit out to match theirs exactly. I almost always end up screwing up the nesting of <div>s between my page and the master pages.
Does anyone know of a tool that will help in this situation -- something that will compare 2 pages and output the structural differences? I can't use a standard diff tool, because IDs change from what I receive from creative, text replaces lorem ipsum, etc..
You can use HTMLTidy to convert the HTML to well-formed XML so you can use XML Diff, as Gulzar suggested.
tidy -asxml index.html
If out output XML compliant HTML. Or at least translate your HTML product into XML compliancy, you at least could then XSL your output to remove the content and id tags. Apply the same transformation to their html, and then compare.
I was thinking on lines of XML Diff since HTML can be represented as an XML Document.
The challenge with HTML is that it might not be always well formed. Found one more here showing how to use XMLDiff class.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP vesions available).
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
winmerge is a good visual diff program

Resources