Issues with using CDATA to validate RSS feeds? - rss

I need to make an RSS feed for my site. The issue is that the content has been imported and contains inline styles and other markup. I've looked at various methods, but I can't get it all removed, and some of it stops my feed from validating.
One workaround that seems to work is this:
<![CDATA[ <description>My Content here </description> ]]>
From what I've read, this stops the content from being parsed as XML, which is why it validates OK. I've looked in a few readers and it seems fine, but is there a risk or downside to this method? I don't really understand the implications, so I'd appreciate any advice or info on tests I could perform.
Thanks

This is a perfectly reasonable approach, although you should note that you should use this:
<description><![CDATA[My Content here]]></description>
...rather than:
<![CDATA[ <description>My Content here </description> ]]>
...as the <description> element is part of the RSS specification, so should be properly present in the RSS, rather than being escaped as text.
If you're going to include non-RSS content (typically HTML) in your title and description, especially if it's user-generated content that might contain a variety of markup or invalid markup, marking the whole content as character data like this is definitely the way to go.
RSS readers typically expect and cope happily with HTML stored as CDATA in the description element, whereas the XML parsers they use (and anything else parsing your RSS) will likely be quite sensitive to the malformed XML that might be created by including HTML tags, unexpected entities or even just a single "<" in the <description> text without the escaping.
Also, use whatever method your XML library provides to insert the content as CDATA, rather than manually wrapping it with <![CDATA[ and ]]>; that way the edge cases (what happens if the content includes ]]>?) are handled for you.
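To illustrate the ]]> edge case, here is a minimal Python sketch using the standard library's xml.dom.minidom. The helper names add_cdata_text and build_description are made up for this example, and splitting the text around each ]]> is one common workaround, not necessarily what your own XML library does.

```python
from xml.dom.minidom import Document


def add_cdata_text(doc, parent, text):
    """Append text to parent as CDATA, splitting around any ']]>'.

    Hypothetical helper for this sketch. A CDATA section cannot contain
    ']]>', so the text is cut inside each occurrence and written as
    several adjacent CDATA sections.
    """
    parts = text.split("]]>")
    for index, part in enumerate(parts):
        if index > 0:
            part = ">" + part      # second half of a split ']]>'
        if index < len(parts) - 1:
            part = part + "]]"     # first half of a split ']]>'
        parent.appendChild(doc.createCDATASection(part))


def build_description(html):
    """Return a serialized <description> element holding html as CDATA."""
    doc = Document()
    description = doc.createElement("description")
    add_cdata_text(doc, description, html)
    return description.toxml()


if __name__ == "__main__":
    print(build_description('<p style="color:red">1 < 2 & "]]>" is fine</p>'))
```

A parser reading the result joins the adjacent CDATA sections back into the original text, so readers should see the content unchanged.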

Related

Parsing page data into sidebar - wordpress

What would be the proper procedure for accessing the current page html data and picking up all of a certain tag and throwing them into the sidebar as links?
I'm not sure of your proficiency with PHP, but I'll give you an overview of what you'd probably want to do.
First, you need the HTML. I'm assuming you're running this on a page (in a page.php file or single.php file, or similar), which means that you have access to the global variable $post, which holds the page's content. To access it you can use the helper function get_the_content(), which returns the HTML being displayed.
Next you need to parse through this to get the h2 tags. A simple regex can handle this, something like <h2[^>]*>(.*?)</h2>. The non-greedy (.*?) keeps two headings on the same line from being captured as one match. This regex is still very picky, so format your HTML correctly and avoid multiline h2s.
So now you have the html, and have parsed it with a regex to get the h2s. Now you need to generate the list from the results, and prepend it to the top of the content of the page. There are a ton of ways to do this, the easiest being just running the code in the right spot in the template file.
Of course there are probably better ways of doing this, I'd recommend you look at say a FAQ plugin (if that's what this is for), or do the lists manually (as this system can be broken), or possibly use a custom post type; but for your question, that's how I'd do it.
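The steps above can be sketched as follows. This is Python rather than PHP, purely to show the regex and list-building logic. The function names, the slug format, and the assumption that each heading has a matching id to link to are all hypothetical and not part of WordPress.

```python
import html
import re

# Non-greedy match for the inner HTML of each <h2>; DOTALL tolerates newlines.
H2_PATTERN = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_headings(page_html):
    """Return the plain-text titles of all h2 elements, in document order."""
    titles = []
    for inner in H2_PATTERN.findall(page_html):
        text = TAG_PATTERN.sub("", inner)  # drop nested tags such as <em>
        titles.append(html.unescape(text).strip())
    return titles


def slugify(title):
    """Turn a title into an anchor name (assumed id scheme, for illustration)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def heading_links(page_html):
    """Build a <ul> of links, one per h2, suitable for a sidebar."""
    items = [
        '<li><a href="#%s">%s</a></li>' % (slugify(title), html.escape(title))
        for title in extract_headings(page_html)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(heading_links("<h2>First Part</h2><p>text</p><h2>Second Part</h2>"))
```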

Symfony2 translation of long texts

I started playing with Symfony2 a few weeks ago.
It seems a very powerful framework, but there are some things I still cannot understand.
In documentation I see that i18n (i.e. Translations) is managed by the Translator service. If I correctly understood, the main way to have a website translated is to put the collection of messages I want to translate inside the different files messages.XX.yml (XX=en,fr,it,etc...), one for each language.
This could be perfect for short texts, which possibly do not include any HTML markup. But how do you deal with long text? For instance, how can I manage the translation of a Terms Of Service or an About page?
I guess I should include different templates for each locale I want to use. Am I right?
Thanks for your help!
You can put long texts, including HTML tags, in a .yml translation file. Put your Terms of Service text in the messages.xx.yml file like this:
TermsOfServiceText: >
  <p>Here goes my Terms of service code</p>
  <p>It can be put in several lines and <strong>can include html tags!</strong></p>
  <p>It can also include links</p>
  <p>Just make sure that you put '>' sign after your translation keyword like in the first line of this example code
  and start your message in next line with double space indentation</p>
Now, in your twig template call translation with this:
{{ 'TermsOfServiceText'|trans|raw }}
raw is used to skip escaping html tags.
I don't think that different templates would be a good solution, but feel free to choose what you prefer. I'd go with https://github.com/stof/StofDoctrineExtensionsBundle, in particular its Translatable behaviour.

How do I script html that is not well formed to be well formed using classic asp and vbscript?

I am trying to parse some html to switch out values of various element attributes. I decided that the most reliable way to parse the html was to use an xml parser (msxml.)
The problem is that the HTML I'm trying to parse contains attributes like:
<param name="flashvars" value="autoplay=false&brand=embed&cid=97%2Ftest&locale=en_US"/>
Which causes the xml parser to blow up. I figured out that I need to server.htmlencode() the value attribute in order for the xml parser to load it properly. How do I approach this?
I feel like the problem is a vicious circle. I couldn't use regexes because HTML is not regular enough, and now I can't use XML parsers because the HTML isn't "well formed".
Help! How do I approach this issue? I want to be able to change attribute values with VBScript.
Is your HTML well formed? If so you could simply use an XML DomDocument. Use XPath to find the attributes you want to replace.
You can actually use JScript server-side in ASP as well, which might give you access to HTML DOM libraries you could use.
You should probably have a look at one of the libraries for cleaning up HTML, something like HTML Tidy http://www.w3.org/People/Raggett/tidy/
Your main problem is that you need to do a replace on the ampersands: they need to be &amp; in well-formed XML/XHTML.
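A minimal sketch of that ampersand replacement, written in Python rather than VBScript to show the logic. The regex leaves numeric references and anything shaped like a named entity alone, and the function names are hypothetical. Note that HTML-only named entities such as &nbsp; would still be rejected by an XML parser, so this only covers the bare-ampersand case from the question.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does NOT start something shaped like an entity reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup):
    """Replace stray '&' with '&amp;' without double-escaping references."""
    return BARE_AMP.sub("&amp;", markup)


def replace_in_value(markup, old, new):
    """Parse one element, edit its 'value' attribute, and serialize it again."""
    element = ET.fromstring(escape_bare_ampersands(markup))
    element.set("value", element.get("value").replace(old, new))
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    raw = ('<param name="flashvars" '
           'value="autoplay=false&brand=embed&cid=97%2Ftest&locale=en_US"/>')
    print(replace_in_value(raw, "autoplay=false", "autoplay=true"))
```

The parser hands back the attribute with plain ampersands, and the serializer escapes them again on output, so the result stays well formed.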

Images in RSS feed

Whenever I see images in an RSS feed, they are embedded in CDATA, rather than surrounded by tags.
In my feed, I would like the images to show up without doing that.
Whether in the browser, or a feed reader (Bloglines) or through FeedBurner, the following structure does not show images, although it is valid RSS. Does anyone have experience with this?
<item>
<category>Viewbook</category>
<title>Widget</title>
<description>Learn more about our widgets.</description>
<link>http://www.widget.com/Default.aspx</link>
<image>
<url>http://www.widget.com/images/thumb.gif</url>
<title>Widget</title>
<link>http://www.widget.com/Default.aspx</link>
<description>Learn more about our widgets.</description>
</image>
</item>
On Colonol Sponsz' hint, I researched:
There's no image tag for items, only for the channel. So you have to do it via the CDATA tag.
For completeness: In RSS 2.0, you CAN have a single enclosure inside an item, which per the spec. can be for a single image. However I understand that support among feed aggregators varies. More typically this is used for things like podcasts. The RSS 2.0 standard states:
<enclosure> is an optional sub-element of <item>.
It has three required attributes. url says where the enclosure is located, length says how big it is in bytes, and type says what its type is, a standard MIME type.
The url must be an http url.
Note that you must include the size of the item, along with the URL and mime type.
However, as others indicated, including the picture(s) in CDATA is much more common.
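For reference, an item with an image enclosure could be generated along these lines. This is a sketch using Python's xml.etree.ElementTree; the function name and the byte size in the example are made-up values, and the URLs are the ones from the question.

```python
import xml.etree.ElementTree as ET


def item_with_image_enclosure(title, link, image_url, size_bytes, mime_type):
    """Build an RSS <item> whose image is attached as an <enclosure>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # All three attributes are required by the RSS 2.0 specification.
    ET.SubElement(item, "enclosure", {
        "url": image_url,
        "length": str(size_bytes),  # size of the file in bytes
        "type": mime_type,          # standard MIME type
    })
    return item


if __name__ == "__main__":
    item = item_with_image_enclosure(
        "Widget",
        "http://www.widget.com/Default.aspx",
        "http://www.widget.com/images/thumb.gif",
        2048,  # hypothetical file size
        "image/gif",
    )
    print(ET.tostring(item, encoding="unicode"))
```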
I believe you can use <media:content ....> elements, which most RSS readers support well; it is working flawlessly for us on MailChimp (RSS-to-email newsletter).
See http://kb.mailchimp.com/article/how-can-i-format-the-image-content-in-my-rss-to-email-campaigns
EDIT: Here's a live link: https://blog.mailchimp.com/rss-to-email-enhancement-for-publishers/
You can use the media:content element (spec) within item.
Make sure you declare the MRSS (Media RSS) namespace (the xmlns:media attribute, below) for this element, if it is not declared for the whole RSS feed, as it won't validate otherwise. (E.g., out-of-the-box WordPress.)
<media:content
xmlns:media="http://search.yahoo.com/mrss/"
url="http://www.widget.com/images/thumb.gif"
medium="image"
type="image/gif"
width="150"
height="150" />
This may or may not display as you'd like; you'd have to experiment. Embedding in content is in that way simpler, though this route helps with things like MailChimp integration (h/t this answer) or other custom solutions.
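If you generate the feed programmatically, the namespace handling might look like the following sketch with Python's xml.etree.ElementTree. The helper name add_media_image is hypothetical; the namespace URI and attribute names are the ones shown above.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"
# Ask ElementTree to serialize this namespace with the "media" prefix.
ET.register_namespace("media", MEDIA_NS)


def add_media_image(item, url, mime_type, width, height):
    """Attach a <media:content> image element to an existing <item>."""
    return ET.SubElement(item, "{%s}content" % MEDIA_NS, {
        "url": url,
        "medium": "image",
        "type": mime_type,
        "width": str(width),
        "height": str(height),
    })


if __name__ == "__main__":
    item = ET.Element("item")
    ET.SubElement(item, "title").text = "Widget"
    add_media_image(item, "http://www.widget.com/images/thumb.gif",
                    "image/gif", 150, 150)
    print(ET.tostring(item, encoding="unicode"))
```

When serialized this way, the xmlns:media declaration is emitted on the outermost element being written, rather than on media:content itself.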
An example implementation for WordPress is in my answer here.
Use, e.g.:
<enclosure url="http://www.scripting.com/mp3s/weatherReportSuite.mp3" length="12216320" type="audio/mpeg" />
It works with a separate tag, as you said. The problem lies in version 2.0 of the specification.
I know that some feed readers suppress images for bandwidth reasons.
Source: RSS specification 2.0 via Wikipedia

Are there any tools out there to compare the structure of 2 web pages?

I receive HTML pages from our creative team, and then use those to build aspx pages. One challenge I frequently face is getting the HTML I spit out to match theirs exactly. I almost always end up screwing up the nesting of <div>s between my page and the master pages.
Does anyone know of a tool that will help in this situation, something that will compare 2 pages and output the structural differences? I can't use a standard diff tool, because IDs change from what I receive from creative, text replaces lorem ipsum, etc.
You can use HTMLTidy to convert the HTML to well-formed XML so you can use XML Diff, as Gulzar suggested.
tidy -asxml index.html
If you output XML-compliant HTML, or at least convert your HTML output into XML-compliant form, you could then use XSL to strip the content and id attributes from it. Apply the same transformation to their HTML, and then compare.
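The same idea (strip content and ids, then compare what is left) can be sketched without XSL. The following Python example uses the standard library's html.parser and difflib; the class and function names are hypothetical, and it compares only tag names and nesting depth, which is a simplification of a real structural diff.

```python
import difflib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "param", "source", "track",
                       "wbr"})


class SkeletonParser(HTMLParser):
    """Record one indented line per start tag; ignore text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def skeleton(page_html):
    """Return the tag outline of a page as a list of indented tag names."""
    parser = SkeletonParser()
    parser.feed(page_html)
    parser.close()
    return parser.lines


def structural_diff(mine, theirs):
    """Return unified-diff lines between the outlines of two pages."""
    return list(difflib.unified_diff(skeleton(mine), skeleton(theirs),
                                     "mine", "theirs", lineterm=""))


if __name__ == "__main__":
    mine = '<div id="a"><div><p>Lorem ipsum</p></div></div>'
    theirs = '<div id="x"><div></div><p>Real text</p></div>'
    print("\n".join(structural_diff(mine, theirs)))
```

Because ids and text never reach the outline, two pages that differ only in those respects produce an empty diff, while a misplaced closing div shows up as a change in indentation.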
I was thinking on lines of XML Diff since HTML can be represented as an XML Document.
The challenge with HTML is that it might not be always well formed. Found one more here showing how to use XMLDiff class.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
WinMerge is a good visual diff program.

Resources