How can I strip HTML and Word-formatted text from a text box?

I have a multiline text box and need to stop users from entering HTML or scripts, or pasting formatted content from Word or any other word processor.
But I need to allow bullets in the input.
I thought it would be a simple thing to do, since it looks like a common problem,
but I could not find a good solution on the web. Please help.
I am using the Telerik toolkit as well.

If you need to strip out HTML then HTML Agility Pack is your friend. It will deal with all manner of malformed HTML. As a bonus, it is included in Sitecore already.
If you want something with a friendlier syntax, consider CsQuery or Fizzler, both of which give you a jQuery-style syntax from within C#.
If you need to build a whitelist, take a look at this post on how to add one:
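public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    // Remove the node itself if its name is not whitelisted (text nodes are named "#text").
    if (!pWhiteList.Contains(pNode.Name)) { pNode.Remove(); return; }
    // ToList() copies each collection so items can be removed while iterating.
    pNode.Attributes.Where(att => !pWhiteList.Contains(att.Name))
        .ToList().ForEach(att => att.Remove());
    pNode.ChildNodes.ToList()
        .ForEach(child => RemoveNotInWhiteList(child, pWhiteList));
}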
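For readers who want to see the whitelist idea end to end, here is a minimal, language-neutral sketch in Python using only the standard library's html.parser. It is not the HTML Agility Pack API. The names ALLOWED_TAGS, DROP_CONTENT, WhitelistSanitizer and sanitize are made up for illustration, and it assumes that "bullets" means ul/ol/li elements. All attributes are dropped, and the contents of script and style elements are discarded.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: bullets are expressed as <ul>/<ol>/<li>; adjust as needed.
ALLOWED_TAGS = {"ul", "ol", "li"}
# The text inside these elements is dropped, not just the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    pasted = '<p class="MsoNormal">Hi <b>there</b></p><ul><li style="x">One</li></ul>'
    print(sanitize(pasted))
```

Because handle_comment is not overridden, HTML comments are dropped as well. The text inside non-whitelisted tags such as p or b is kept; only the markup around it is removed.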

You could create a Validation rule, I reckon (in /sitecore/System/Settings/Validation Rules). Put the allowed HTML tags in a whitelist somewhere (possibly a Sitecore item) and run through that whitelist when validating. If any other HTML tags appear in the field, mark it invalid.
This doesn't stop users from putting the tags in, but it will stop the item from being published.
You could even create a custom item:saved event handler which strips out all HTML tags apart from the whitelisted ones. Again, it doesn't stop users from putting the HTML tags in, but they will be removed as soon as the item is saved. Going a step further, I think it would also be possible to use the Rules Engine for this. This article by John West shows how to use the Rules Engine to modify item names, but you could modify it to read specific text boxes.
None of these options stops users from entering HTML. The validation rule blocks publishing, while the save handler and the Rules Engine approach remove the tags automatically when the item is saved.
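The validation idea boils down to "collect the tag names in the field and compare them with the whitelist". A minimal sketch of that check in Python (standard library only) follows. It is not Sitecore code; TagCollector, disallowed_tags and is_valid are hypothetical names, and the default whitelist of ul/ol/li is an assumption.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every element that appears in the input."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/>.
        self.tags.add(tag)


def disallowed_tags(html_text, allowed):
    """Return the set of tag names that are present but not whitelisted."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags - set(allowed)


def is_valid(html_text, allowed=("ul", "ol", "li")):
    """A field is valid when it contains whitelisted tags only."""
    return not disallowed_tags(html_text, allowed)


if __name__ == "__main__":
    print(disallowed_tags("<ul><li>a</li></ul><b>x</b>", {"ul", "li"}))
```

Returning the offending tag names, rather than a bare true/false, makes it easy to show the editor which tags caused the validation to fail.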


When using apoc.load.html, is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at:
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
In Neo4j I call the apoc.load.html procedure as shown here, and you can see it has no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like the following image, with all of these span class names: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there's no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping either to keep the <span> tags or, failing that, to extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with 3 different classes, depending on whether you're using the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.
Okay, two years later I revisited this question and found a solution. I'll keep it short.
The APOC procedure apoc.load.html uses the scraping library jsoup, which is not a full-fledged browser. When it visits a page it reads the HTML sent by the server but ignores any JavaScript. As a result, if a page uses JavaScript to insert content, or even just to format it, jsoup will miss the HTML that the JavaScript would have generated had it run.
So I have just tried out the service at It's simple to use. You send it a URL; it fetches that page itself, executes the page's JavaScript as it does so, and returns the final result as static HTML.
So if I just call with apoc.load.html, then jsoup will simply ask for the HTML and this time it will get the fully rendered HTML. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by JavaScript, so if we ask for its span tags without pre-rendering, nothing is returned.
CALL apoc.load.html("", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the website, we get a bunch of span tags and their content.
CALL apoc.load.html("", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
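To make the goal concrete, "class name per text segment" amounts to pairing each span's class attribute with its text. Below is a small stand-alone sketch of that pairing in Python (standard library only). It is not what APOC or jsoup do internally. SpanCollector and span_segments are hypothetical names, the sample markup is made up, and the sketch assumes the spans are not nested.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (class attribute, text) for each non-nested <span>."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self.in_span = False
        self.current_class = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and not self.in_span:
            self.in_span = True
            self.current_class = dict(attrs).get("class")
            self.text = []

    def handle_data(self, data):
        if self.in_span:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.segments.append((self.current_class, "".join(self.text)))
            self.in_span = False


def span_segments(html_text):
    collector = SpanCollector()
    collector.feed(html_text)
    collector.close()
    return collector.segments


if __name__ == "__main__":
    # Made-up fragment in the style of the rendered page.
    rendered = '<pre><span class="cm-atom">age</span>: <span class="cm-string">38</span></pre>'
    print(span_segments(rendered))
```

Text that sits between spans, such as the ": " delimiter in the sample, is ignored by this sketch.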

How to elegantly modify html to inject html element after x-th paragraph on the server side?

I need to modify HTML coming from an external file (server side) before I render it, and inject a quote 'component' like this:
This component needs to be injected after the 2nd paragraph, and I'm planning to use HtmlAgilityPack. Any examples? Is the HtmlNode.InsertAfter() method a good choice once I have found that paragraph, which should be trivial?
Another question: would it be possible to inject a Sitecore placeholder, or even a user control that renders my quote, instead of pure HTML? I feel it should be, but I'm not sure what a good approach would be.
I can suggest two possible approaches here:
1) Use snippets with some customisation. Snippets allow users to insert pre-defined chunks of HTML into a RTE field. You could have a pre-defined piece of HTML which might have some identifier to indicate it should use custom processing (I would suggest some data-xxx style attribute which would not conflict with any CSS or JavaScript). Then you could create a new renderField pipeline processor which would detect the data-xxx attribute within the content of a rich text field - you would use HtmlAgilityPack for this and then replace that snippet with the contents of your server-side file.
2) Split your text content into two separate chunks and have two instances of a "HtmlText" rendering within the placeholder, with a rendering for your quote text between them in the same placeholder.
I would advise that having a rule to insert text after the second paragraph would be quite 'brittle' as this would be very reliant on content editors setting the rich text field contents in quite a precise way e.g. to always ensure two or more paragraphs and to always break text with paragraphs - they might decide to use a load of line breaks instead to split their text. That said if you did do this, you would create a new renderField pipeline processor.
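For reference, the "insert after the n-th paragraph" step itself can be sketched as follows. This is a language-neutral illustration in Python using the standard library's html.parser, not HtmlAgilityPack or a renderField processor. ParagraphInjector and inject_after_paragraph are hypothetical names. The sketch counts every closing p tag (nested ones included), re-emits end tags in lower case, and leaves the input untouched when it has fewer than n paragraphs, which is exactly the brittle case described above.

```python
from html.parser import HTMLParser


class ParagraphInjector(HTMLParser):
    """Re-emit the input and add a snippet after the n-th closing </p>."""

    def __init__(self, snippet, after_nth):
        # convert_charrefs=False so entities pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.snippet = snippet
        self.after_nth = after_nth
        self.closed_paragraphs = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag == "p":
            self.closed_paragraphs += 1
            if self.closed_paragraphs == self.after_nth:
                self.out.append(self.snippet)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def inject_after_paragraph(html_text, snippet, n=2):
    parser = ParagraphInjector(snippet, n)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    body = "<p>One</p><p>Two</p><p>Three</p>"
    print(inject_after_paragraph(body, "<blockquote>Quote</blockquote>", 2))
```

Content that uses line breaks instead of paragraphs never reaches the count, so a caller may want to check whether the snippet was actually inserted and fall back to appending it at the end.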

Parsing page data into sidebar - WordPress

What would be the proper procedure for accessing the current page html data and picking up all of a certain tag and throwing them into the sidebar as links?
I'm not sure of your proficiency with PHP, but I'll give you an overview of what you'd probably want to do.
First, you need the HTML. I'm assuming you're running this on a page (in a page.php or single.php file, or similar), which means you have access to the global variable $post, which holds the page's content. To access it you can use the helper function get_the_content(), which returns the HTML being displayed.
Next you need to parse through this to get the h2 tags. A simple regex can handle this, something like <h2[^>]*>(.*?)</h2>. The non-greedy (.*?) keeps two headings on the same line from being matched as one. It's important to remember that this regex is very picky, so format your HTML correctly: no multiline h2s.
So now you have the html, and have parsed it with a regex to get the h2s. Now you need to generate the list from the results, and prepend it to the top of the content of the page. There are a ton of ways to do this, the easiest being just running the code in the right spot in the template file.
Of course there are probably better ways of doing this. I'd recommend looking at, say, a FAQ plugin (if that's what this is for), building the lists manually (as this system can break), or possibly using a custom post type. But for your question, that's how I'd do it.
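The extract-and-list step described above can be sketched like this. It is written in Python rather than PHP purely for illustration and uses no WordPress functions. build_heading_links and slugify are hypothetical names, and the #slug link targets only work if the headings carry matching id attributes, which is an assumption.

```python
import re

# Non-greedy, single-line match, as in the answer above: multiline h2s are missed.
H2_PATTERN = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE)


def slugify(text):
    """Lower-case the text and join runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_heading_links(content):
    """Return a <ul> of links, one per h2 found in the content."""
    items = []
    for inner in H2_PATTERN.findall(content):
        title = re.sub(r"<[^>]+>", "", inner).strip()  # drop inline tags
        items.append(f'<li><a href="#{slugify(title)}">{title}</a></li>')
    return "<ul>" + "".join(items) + "</ul>" if items else ""


if __name__ == "__main__":
    page = "<h2>First Part</h2><p>text</p><h2 class='a'>Second <em>Part</em></h2>"
    print(build_heading_links(page))
```

In PHP the same pattern would go through preg_match_all, with the resulting list echoed in the sidebar template.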

Can we replace the <add text> labels in SiteEdit 2012 (on Tridion 2011)?

Right now I've been implementing User Interface 2012 and after some hurdles it works just fine. I've been looking to optimise the usability of any UI-editable fields, and have run into a related challenge.
Within a component there are several fields that are not mandatory, and as such should not be displayed when they are empty. As soon as an editor enters UI and selects the component holding said fields, several labels such as <add text> and <add internal link to component media> appear.
I am looking to change these labels to something more descriptive of their content, because additional html will be added to the page when a field is not empty.
For example (using Razor Mediator):
@if (Component.Fields.location != null) {
    <span class="row">
        <span>@RenderComponentField("location", 0)</span>
    </span>
} else {
    <tcdl:ComponentField name="location"></tcdl:ComponentField>
}
When the location field is empty, it just says <add text>. I would like to change that to <Add location to event>.
I've tried putting something between the tcdl-tags, but they display even when not editing in UI2012. I've been searching the SDL Live content sites but I cannot find any reference to it. Anyone have an idea?
There is no supported way to customize the placeholder text of an empty field. But you could try to write an extension which overrides the following method:
This method is responsible for setting up the placeholder text.
I was looking for the same thing when I was checking this, but AFAIK it is not easily doable. I dug a little deeper and found that the labels are part of the resource file Tridion.Web.UI.Editors.SiteEdit.Strings.resx (EmptyTextField). I did not pursue the option of fiddling with this because it would be neither supported nor documented, and on top of that I still would not have the flexibility of adding my own text for each field.
Back to your question: I was tossing around an idea (not necessarily an answer to your question) and want to share it here so the experts can provide some valuable suggestions. I did not try this option (it felt like too much work); it is on my long to-do list and might have some drawbacks as well.
Create schema fields with "default values" (e.g. "Add location to event"). The default text will be displayed in your UI.
Write your templates so that they check whether the schema field value is the same as the default:
@if (Component.Fields.location.value == [compare with the schema field definition's default value]) {
    // Note: I could not find a straightforward API for this, but I am assuming it should be there.
    @RenderComponentField("location", 0)
} else {
    <span class="row">
        <span>@RenderComponentField("location", 0)</span>
    </span>
}
Perform the above check only when the target type is UI-enabled, since we do not want to display the default text on the live target etc.
Also, posting a Tridion Idea as an enhancement request would be great. I will do it in the next few days if none exists already.
I like the approach as it'd be a quick way to give authors instructions at the field level. We typically use the description field to provide this type of help in the CME.
For inline editing, content types (SDL Live Content - login required) is another option since they define schema (and prototype component), template, instructions, and "save-to" context. You can offer dummy text that authors replace.
Add sample content and/or instructions (Lorem Ipsum) in the prototype component.
Add additional instructions in the content type description.
Select storage location other than the prototype component's folder.
Let us know how it goes. :-)

Seeing HTML code in tooltips in

I am parsing web service XML and populating a treeview in I'm trying to display one of the XML node attributes as a tooltip, but that attribute sometimes has HTML tags in it. I know there seem to be some custom tooltip controls out there, but I don't have the time or the experience to play with those yet. Is there no easy way to remove such code or translate it into its textual equivalent? I know I can replace br tags with Environment.NewLine, but I don't want to have to do this for every conceivable HTML tag that might be embedded in the content!
The HTML Agility Pack is an HTML parser that can read HTML fragments. Load the fragment with it and then read the InnerText property of the top node. The result will be a textual version of the HTML.
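The same "parse it, then keep only the text" idea can be sketched without the HTML Agility Pack. The following Python example (standard library only) is an illustration, not the InnerText implementation. TextExtractor, html_to_text and BLOCK_END_TAGS are hypothetical names, and turning br and the end of p/div/li into newlines is an added assumption to keep tooltips readable.

```python
from html.parser import HTMLParser

# Assumption: these closing tags should end a line in the plain-text output.
BLOCK_END_TAGS = {"p", "div", "li"}


class TextExtractor(HTMLParser):
    """Keep the text of a fragment and discard the markup."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_END_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return "".join(extractor.parts).strip()


if __name__ == "__main__":
    print(html_to_text("Line one<br>Line <b>two</b> &amp; more"))
```

Any tag not listed simply disappears from the output, so there is no need to handle every conceivable tag individually.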
