Does anyone know how to reformat the Pinterest RSS feed? I use http://www.pinterest.com/username/feed.rss, but it is badly formatted, so I can't parse it with rssinclude.com.
Related
I am interested in collecting a large corpus of text from various websites. The result will contain a lot of HTML. Is there an easy way to strip the HTML so that I am left with only the strings of words, which I can then analyse?
I don't mind paying, but I prefer free and fast tools.
I have had a look, and it seems you can either do this manually with packages like Beautiful Soup in Python or use paid services like import.io to clean the data automatically as the scraping occurs.
But are there better tools available for stripping HTML from raw text?
I have used Jsoup in my project to extract text from websites; it is simple to use. I have also used HtmlUnit to click buttons on a website in order to load more data.
Ruby and the nokogiri gem (library) are probably a good place to start. You mentioned Python but did not tag it, so I assume you are not set on Python.
Crawling around websites, following links and getting all the text is fairly straightforward; nokogiri has a .text method that does this. In all probability you will want to do a little hand coding for each site to refine what you get. I'm parsing music listing sites and am averaging around 20 lines of unique code per site.
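For readers who do stay with Python, the same "give me all the text" idea can be sketched with nothing but the standard library. This is only an illustration, not nokogiri's .text or Beautiful Soup: the names TextExtractor and html_to_text are made up for this example, and the choice to skip script and style contents is an assumption about what counts as "words".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # assumption: these never hold corpus text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, one space between text nodes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello &amp; <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

Joining text nodes with a space is a simplification: a word split by inline markup would come out as two words, so per-site tweaks like the ones described above are still likely to be needed.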
I should also mention that you should first check whether the site offers some type of XML/RSS feed; these are a lot easier to process than the web content. nokogiri can help you with this as well.
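To show why a feed is easier to process than page HTML, here is a minimal Python sketch that reads the items out of an RSS 2.0 document with the standard library's xml.etree.ElementTree. The sample feed and the read_items helper are invented for illustration; a real feed would first have to be downloaded and may use namespaces or extra elements this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <item>
      <title>First gig</title>
      <link>http://example.com/gigs/1</link>
      <description>Doors at 8pm</description>
    </item>
    <item>
      <title>Second gig</title>
      <link>http://example.com/gigs/2</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_RSS):
    print(entry["title"], entry["link"])
```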
Is it possible to remove the wiki markup from the RSS feed and show only the article content?
I am using various templates, such as infoboxes, and when people click the RSS link it shows all the template markup and other code that they don't really care about. I have been trying to find a good tutorial or some help on how to accomplish this.
As Dereckson says, no, it's not possible. Feeds are just an alternate way to consume recent changes.
The ability to consume recent changes in parsed format is essentially equivalent to the feature request for visual diffs (HTML diffs). It will be possible at some point with Parsoid.
I'm making a normal RSS feed for my website. I need to include simple HTML formatting in the description, e.g. paragraphs, line breaks, lists, etc. To do this I need to wrap the description content in CDATA.
The issue is that when I validate my feed, the content of the CDATA section is ignored. So although the feed validates, I don't actually know whether everything is OK.
How can I find out what markup is likely to be read correctly by the various RSS readers?
Can I use whatever markup I would happily put in a website? How about inline styles? Or is it more like designing HTML emails? Thanks
RSS files are XML-formatted plain text; I think that's the only standard you can rely upon.
I think most syndicators only appear to handle HTML in RSS, because they simply download the linked article when you select the headline.
If you're looking to embed rich content, you may well be better off investigating Atom instead of RSS.
Have a look at this Stack Overflow question: Which is better for encoding HTML for RSS?
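On the CDATA point, a small Python sketch may help explain why a validator has nothing to say about the wrapped content: to an XML parser, a CDATA section and entity-escaped text are two spellings of the same character data. The item strings and the helper names description_of and build_item are made up for this example, and it says nothing about which HTML tags a given reader will actually render.

```python
import xml.etree.ElementTree as ET

# The same description, once wrapped in CDATA and once entity-escaped.
CDATA_ITEM = (
    "<item><description><![CDATA[<p>Hello <b>world</b></p>]]>"
    "</description></item>"
)
ESCAPED_ITEM = (
    "<item><description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;"
    "</description></item>"
)


def description_of(item_xml):
    """Return the description text exactly as an XML parser sees it."""
    return ET.fromstring(item_xml).findtext("description")


def build_item(html):
    """Serialize an item; ElementTree escapes the HTML instead of using CDATA."""
    item = ET.Element("item")
    ET.SubElement(item, "description").text = html
    return ET.tostring(item, encoding="unicode")


print(description_of(CDATA_ITEM) == description_of(ESCAPED_ITEM))
print(build_item("<p>Hello <b>world</b></p>"))
```

Either way the parser hands the reader an opaque string of HTML, so whether the markup displays well is up to each reader rather than to the XML or the validator.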
I'm trying to incorporate a Google News feed in my website (using the built-in SimplePie functionality of WordPress).
However, the default feed gets rendered in a strange table structure. Sure enough, when I inspect the feed XML, I see that Google News puts a whole bunch of table HTML in its 'description' element, complete with embedded styles, etc. (see this example), essentially dictating how the feed must be displayed and not allowing for any effective CSS-based customization.
This seems really dumb. Can anyone help explain what is going on, or at least agree with me that this is just a terrible feed architecture?
Feeds often include HTML tags, as many (most?) readers will handle and use them, and that way the RSS provider can have some nice-looking output in the reader, as you've guessed. (I prefer flagging it as CDATA unless it's proper XHTML, as it's not valid XML/RSS otherwise.) It's perhaps not in the original spirit of RSS, but the Google feed is just an extreme example of common practice. As for your problem, does strip_htmltags help (simplepie.org/wiki/reference/simplepie/strip_htmltags)?
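To illustrate the general idea of stripping presentational markup from a description, here is a rough Python sketch that keeps only an allowlist of tags and drops everything else, including style attributes. It is not SimplePie's implementation, the names AllowlistCleaner and clean_description are invented here, the allowlist is an arbitrary assumption, and it is not a security sanitizer (it does not vet href values, for instance).

```python
from html import escape
from html.parser import HTMLParser


class AllowlistCleaner(HTMLParser):
    """Re-emit only allowed tags and attributes; keep all text."""

    ALLOWED = {"a", "p", "br", "b", "i", "em", "strong", "ul", "ol", "li"}
    VOID = {"br"}            # allowed tags that take no closing tag
    KEEP_ATTRS = {"href"}    # every other attribute (e.g. style) is dropped

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ALLOWED:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs
                if name in self.KEEP_ATTRS
            )
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.ALLOWED and tag not in self.VOID:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text was unescaped by the parser, so escape it again on output.
        self.out.append(escape(data, quote=False))


def clean_description(html):
    cleaner = AllowlistCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


messy = (
    '<table style="border:0"><tr><td><font size="-1">'
    '<a href="http://example.com/a" style="color:red">Headline</a>'
    "<br>Summary &amp; more</font></td></tr></table>"
)
print(clean_description(messy))
```

Once the table and font wrappers are gone, the remaining markup can be styled with the site's own CSS.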
Feedity creates feed addresses for arbitrary web pages, and I would like to build an application like it.
How did they implement it?
This looks a little like YQL, which can be used for something similar. Given that HTML can be XML, and RSS feeds are XML as well, this should not be too difficult to implement. If I were to approach a custom implementation of this, I would probably attempt the following:
Pull in the HTML from the requested URL
Cleanse the HTML so it could be converted to XML (or use something like the HTML Agility Pack)
Use XSLT to translate the XML document into an RSS feed based on a set of rules (that extract links, etc.)
All of that having been said, if I could use something like YQL instead, I would definitely do that, as there can be a lot of pitfalls in a custom implementation (bad HTML, changing URLs, defining rules, caching, etc.).
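The steps above can be sketched in Python, with two substitutions that should be stated plainly: a tolerant HTML parser stands in for the cleanse-to-XML step, and ordinary code stands in for XSLT. The single extraction rule used here (every link with an href and some text becomes an item) is an assumption for the example, as are the names LinkCollector and page_to_rss. Fetching is left out; the HTML is passed in as a string, and nothing here is known about how Feedity itself works.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the link currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((self._href, title))
            self._href = None


def page_to_rss(html, page_url, feed_title):
    """Build an RSS 2.0 document with one item per link found in the page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Links found on " + page_url
    for href, title in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the page address.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")


html = (
    '<html><body><a href="/post/1">First <b>post</b></a>'
    '<p>noise<a name="x">anchor</a>'
    '<a href="http://other.example/2">Second</a></body></html>'
)
print(page_to_rss(html, "http://example.com/blog/", "Example blog"))
```

A real service would still have to deal with the pitfalls listed above, such as per-site rules for which links count as entries, caching, and pages whose structure changes.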