Rss is valid, but Self reference doesn't match document location - rss

This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.
line 3, column 88: Self reference doesn't match document location!
<atom:link href="http://localhost/blog/rss" rel="self" type="application/rss+xml" />
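The warning means the href of the rel="self" link differs from the URL the feed was actually fetched from. A minimal sketch of that comparison in Python follows. The function names, the sample feed and the plain string comparison are my own assumptions; the real validator may normalize URLs before comparing.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ATOM_NS = "http://www.w3.org/2005/Atom"

def self_href(feed_xml: str) -> Optional[str]:
    """Return the href of the first atom:link with rel="self", or None."""
    root = ET.fromstring(feed_xml)
    for link in root.iter(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "self":
            return link.get("href")
    return None

def self_reference_matches(feed_xml: str, fetched_url: str) -> bool:
    """True when the declared self link equals the URL the feed was fetched from.

    Assumption: exact string equality, with no URL normalization.
    """
    return self_href(feed_xml) == fetched_url

# Hypothetical feed that still advertises a localhost address.
SAMPLE = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <atom:link href="http://localhost/blog/rss" rel="self" type="application/rss+xml" />
  </channel>
</rss>"""
```

A feed served from a public address while still declaring http://localhost/blog/rss would fail this check, which is the situation the validator is pointing at.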

Further to my previous comment, I did some research. I added the "Broken Link Checker" plugin to my WordPress blog, and it found a couple of broken links to Google Play (whose URLs had since changed). After I fixed the broken links, the feed validated. The error mentioned in my previous post didn't occur when I checked the feed with the W3C validator's text input method.
Even feeder.co accepted the feed without any problem.
Posting so others may find it useful.
Thanks.

Related

Scraping: Check if a wiki page is a person page

I have been trying to scrape all the biography wiki pages for weeks. The problem is that I can't find a way to distinguish a page about a person from a page about something else.
For instance the following pages:
view-source:https://en.wikipedia.org/wiki/Albert_Einstein
view-source:https://en.wikipedia.org/wiki/Spider
look pretty similar regarding their HTML code. I am sure there must be a keyword allowing you to know if the page is related to a person.
Has anyone faced the same problem?
Thanks in advance =)
I'm not sure there is a definite way to tell, but you could build up a list of indicators that suggest the page is about a person and then match on these.
For example, on the Albert Einstein page there are "Born" and "Died" entries in the right-hand pane. If these are present, we can be pretty sure that the article is about a person (although if you look for "Died" you'll probably only get dead people). These labels aren't consistent, however, so you would need to match against one or more of them to build up confidence that the article is indeed about a person. For example, https://en.wikipedia.org/wiki/Lionel_Messi doesn't contain the "Born" header but it does contain "Date of birth".
Alternatively, you could do some natural language parsing to try to figure out whether the main text on the page is talking about a person. Lots of mentions of "he" or "she" probably mean the article is about a person.
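Both ideas can be sketched in Python roughly like this. The label set, the function names and the choice to look only at table header cells are my assumptions; real pages will need a longer label list and some threshold tuning.

```python
import re
from html.parser import HTMLParser

# Assumed indicator labels; the answer above mentions these three.
PERSON_LABELS = {"born", "died", "date of birth"}

class HeaderCellCollector(HTMLParser):
    """Collects the normalized text of every <th> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_th = False
        self._buf = []
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "th" and self._in_th:
            self._in_th = False
            # Collapse whitespace and lowercase so "Date of  birth" still matches.
            self.labels.append(" ".join("".join(self._buf).split()).lower())

    def handle_data(self, data):
        if self._in_th:
            self._buf.append(data)

def person_score(html: str) -> int:
    """Count how many known person-indicator labels appear as table headers."""
    parser = HeaderCellCollector()
    parser.feed(html)
    return len(PERSON_LABELS & set(parser.labels))

def pronoun_ratio(text: str) -> float:
    """Fraction of words that are 'he' or 'she' (a crude second signal)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in ("he", "she") for w in words) / len(words)
```

A page would then be treated as a person page when person_score is at least 1, or when pronoun_ratio is above some cutoff you pick by testing on known pages.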

What is the “link” element in ATOM-feeds?

Could someone please help me understand what the “link” tags are used for within an ATOM feed?
Do they point to a physical resource, or just like an identifier?
What is the difference between link URLs in the beginning and for each “entry” block?
Is it compulsory to have this linkURL?
Any information regarding this would be much appreciated!
I have provided an example snippet of code below.
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.example.com/happycats.xml" />
  <updated>2008-08-11T02:15:01Z</updated>
  <!-- Example of a full entry. -->
  <entry>
    <title>Heathcliff</title>
    <link href="http://publisher.example.com/happycat25.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
    <content>
      What a happy cat. Full content goes here.
    </content>
  </entry>
</feed>
Atom is a syndication format that can be used by applications employing RESTful communication through hypermedia. It's very good for publishing feeds, not only for blogs but also in distributed applications (for example, for publishing events to other parts of a system), to take advantage of the benefits of HTTP (caching, scalability, etc.) and the decoupling involved in using REST.
The rel values on <link> elements in Atom are called link relations, and they can indicate a number of things to the consumer of the feed:
rel="self" normally indicates that the current element (in your case, the feed itself) represents an actual resource, and this is the URI for that resource
rel="via" can identify the original source of the information in the feed or the entry within the feed
rel="alternate" specifies a link to an alternative representation of the same resource (feed or entry)
rel="enclosure" can mean that the linked to resource is intended to be downloaded and cached, as it may be large
rel="related" indicates the link is related to the current feed or entry in some way
A provider of ATOM could also specify their own reasons for a link to appear, and provide a custom rel value
By providing links to related resources in this way you can decouple systems: the only URI the consuming system needs to know about is one entry point, and from then on other actions are offered to the consumer via these link relations. The links effectively tell the consumer what they can do next, either taking actions on or retrieving data for the entry they are related to.
A great book I can recommend for REST which goes into depth about Atom is REST in Practice by Jim Webber, Savas Parastatidis and Ian Robinson.
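To see the link relations concretely, here is a small Python sketch that lists the (rel, href) pairs at feed level and per entry. It uses the snippet from the question with the Atom namespace declared and the feed element closed so that it parses; the function name is mine. Per RFC 4287, a link with no rel attribute is interpreted as rel="alternate", which is what the entry-level link in the example is.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def link_relations(element):
    """Return (rel, href) pairs for the direct <link> children of a feed or entry."""
    # A missing rel attribute defaults to "alternate" (RFC 4287).
    return [(link.get("rel", "alternate"), link.get("href"))
            for link in element.findall(f"{ATOM}link")]

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.example.com/happycats.xml" />
  <updated>2008-08-11T02:15:01Z</updated>
  <entry>
    <title>Heathcliff</title>
    <link href="http://publisher.example.com/happycat25.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
  </entry>
</feed>"""

root = ET.fromstring(FEED)
feed_links = link_relations(root)  # links describing the feed itself
entry_links = [link_relations(e) for e in root.findall(f"{ATOM}entry")]  # links per entry
```

So the link at the top describes the feed document, while the link inside each entry points at that entry's own resource.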

Aggregating from various sources

It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether the match is greater than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Normalize case (e.g. lowercase everything)
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
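The four steps above can be sketched in Python as follows. This is my own reconstruction, not the code I actually ran: the class and function names are made up, and it uses the standard library HTML parser instead of regexes. MD5 is only used as a fingerprint here, not for anything security-related.

```python
import hashlib
from html.parser import HTMLParser

class LinkPreservingStripper(HTMLParser):
    """Drops every tag except <a>, keeping the href so links affect the hash."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f'<a href="{href}">')

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)

def content_fingerprint(html: str) -> str:
    """Steps 1-4: strip tags except links, drop whitespace, lowercase, MD5."""
    stripper = LinkPreservingStripper()
    stripper.feed(html)
    stripper.close()
    text = "".join(stripper.parts)
    text = "".join(text.split())  # remove all whitespace
    text = text.lower()           # note: this lowercases the href as well
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Two items are then treated as duplicates when their fingerprints are equal, so the same words linked to different URLs stay distinct.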
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded (I think): &amp;lt;
But it is not. It is encoded as &lt;
But so too are HTML tags: &lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.

How should I sanitize urls so people don't put 漢字 or á or other things in them?

How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removes the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called slugifying. There's no fixed mechanism for doing it; every framework handles it in its own way.
Yes, I would sanitize/remove. Otherwise the URL will either be inconsistent or look ugly when encoded.
In Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
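A minimal slugify sketch, written in Python for brevity (the question is about Java, where java.text.Normalizer offers a comparable decomposition step). The function name and the exact rules, lowercase ASCII with hyphens, are my assumptions and not a standard.

```python
import re
import unicodedata

def slugify(title: str) -> str:
    """Turn a title into a lowercase ASCII slug for use in a URL path."""
    # Decompose accented letters (á becomes a + combining accent),
    # then drop everything that is not ASCII.
    ascii_text = (unicodedata.normalize("NFKD", title)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")

# Different titles can collapse to the same slug once characters are stripped.
collision = slugify("漢字 café") == slugify("café")
```

Here an á or é survives as a plain letter, while 漢字 has no ASCII decomposition and disappears entirely, so a title made only of such characters gives an empty slug. That is the collision risk mentioned above, and it is why the slug alone should not be used as the unique key.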
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

merge rss feeds

I want to merge multiple rss feeds into a single feed, removing any duplicates. Specifically, I'm interested in merging the feeds for the tags I'm interested in.
[A quick search turned up some promising links, which I don't have time to visit at the moment]
Broadly speaking, the ideal would be a reader that would list all the available tags on the site and toggle them on and off, allowing me to explore what's available, keep track of questions I've visited, new answers on interesting feeds, etc, etc . . . though I don't suppose such a thing exists right now.
As I randomly explore the site and see questions I think are interesting, I inevitably find "oh yes, that one looked interesting a couple days ago when I read it the first time, and hasn't been updated since". It would be much nicer if my machine would keep track of such details for me :)
Update: You can now use "and", "or", and "not" to combine multiple tags into a single feed: Tags AND Tags OR Tags
Update: You can now use Filters to watch tags across one or multiple sites: Improved Tag Sets
Have you heard of Yahoo's Pipes? It's an interactive feed aggregator and manipulator. There is a list of 'hot pipes' to subscribe to, and you can create your own (Yahoo account required).
I played with it during the beta back in the day and had a blast. It's really fun and easy to aggregate different feeds, and you can add logic or filters to the "pipes". You can even do more than just RSS, like importing images from Flickr.
I created the stackoverflow tag feeds pipe. You can list your tags of choice in the text box and it will combine them into a single feed with all the unique posts. It escapes '#' and '+' characters for you.
Alternatively, you can use the pipe's RSS feed by appending your URL-encoded tags separated by '+'s:
http://pipes.yahoo.com/pipes/pipe.run?_id=uP22vN923RG_c71O1ZzWFw&_render=rss&tags=.net+c%23+powershell
Unfortunately, though, this seems to strip out the content of the posts. The content is visible in the debug view, but the output only contains the post title.
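If you build the tags parameter by hand, each tag has to be percent-encoded before the tags are joined with '+'. A small Python sketch of that step (the helper name is made up; it only builds the parameter value and does not call the pipe):

```python
from urllib.parse import quote

def encode_tags(tags):
    """Percent-encode each tag, then join the tags with '+' separators."""
    # safe="" makes quote() encode '#' and '+' inside a tag as %23 and %2B.
    return "+".join(quote(tag, safe="") for tag in tags)

tags_param = encode_tags([".net", "c#", "powershell"])
```

For the three tags above this gives the same value as in the example URL, .net+c%23+powershell.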
[Thanks to everyone for suggesting Yahoo Pipes! Had heard of it before, but never tried it until now :-]
SimplePie is a PHP library that supports merging RSS feeds into one combined feed. I don't believe it does dupe checking out-of-the-box, but I found it trivial to write a little function to eliminate duplicate content via their GUIDs.
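The GUID-based de-duplication is simple enough to sketch. This version is in Python with the standard library, not SimplePie's API; the function name and the fallback to the item link when no guid is present are my assumptions.

```python
import xml.etree.ElementTree as ET

def merge_items(feed_documents):
    """Merge the <item> elements of several RSS 2.0 documents, skipping repeated GUIDs."""
    seen = set()
    merged = []
    for xml_text in feed_documents:
        channel = ET.fromstring(xml_text).find("channel")
        if channel is None:
            continue  # not an RSS 2.0 document, nothing to merge
        for item in channel.findall("item"):
            # Assumption: fall back to the link when an item has no <guid>.
            key = item.findtext("guid") or item.findtext("link")
            if key is None:
                merged.append(item)  # nothing to compare on, keep it
            elif key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```

The first occurrence of each GUID wins, so the order of the input feeds decides which copy of a duplicated post is kept.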
Here is an article on Merge Multiple RSS Feeds Into One with Yahoo! Pipes + FeedBurner.
Another option is Feed Rinse, but they have a paid version as well as the free version.
Additionally:
I have heard good things about AideRSS.
Yahoo Pipes?
23 minutes later:
Aww, I got answer-sniped by @Bernie Perez. Oh well :)
In the latest Podcast, Jeff and Joel talked about the RSS feeds for tags, and Joel noted that there is only the current ability to do AND on tags, not OR.
Jeff suggested that this would be included at some stage in the future.
I think that you should request this on uservoice, or vote for it if it is already there.
