Consolidating RSS feed in a reader - rss

By the RSS 2.0 specification, link, title and description are required elements. In reality though, any of those three can be missing. I read data from multiple feeds and want to display them in a uniform manner. How can I consolidate the data?

To simplify Really Simple Syndication, you can build these fields in the resulting object/table:
link - Several elements can contain a link. Besides <link> itself, there is <guid>. If isPermaLink="true" (the default when the attribute is omitted), it is a good link. If it is not a permalink, it may still be a link, but it may lead nowhere. There can also be one or more <enclosure> elements; however, they link to files or streams, not web pages.
title - If there is no <title>, take a piece of <description>, but remove any HTML from it first.
description - If <description> isn't present, leave it empty.
guid - If it's not present, select the first available combination from these:
link-<pubDate>, link-title, link, title-<pubDate>, title, <pubDate>
Be aware that a generated guid is not guaranteed to be unique.
pubDate - If you must show some date and it's not present, generate one upon saving.
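The fallback rules above can be sketched roughly as follows. This is only an illustration: the dictionary keys ("guid_is_permalink", etc.), the 80-character title cut-off and the helper names are assumptions, and the HTML stripping is a crude regex, not a real parser.

```python
import html
import re


def strip_html(text: str) -> str:
    """Remove tags with a crude regex and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def normalize_item(raw: dict, fetched_at: str) -> dict:
    """Build a uniform record from one parsed <item>.

    `raw` is assumed to hold element texts under the keys "link", "guid",
    "guid_is_permalink" (bool), "title", "description" and "pubDate";
    any of them may be missing.
    """
    link = raw.get("link")
    # Fall back to the guid only when it is declared (or defaults to) a permalink.
    if not link and raw.get("guid") and raw.get("guid_is_permalink", True):
        link = raw["guid"]

    description = raw.get("description") or ""
    # Assumed cut-off of 80 characters for a title taken from the description.
    title = raw.get("title") or strip_html(description)[:80]

    pub_date = raw.get("pubDate")
    guid = raw.get("guid")
    if not guid:
        # First available combination, in the order listed above.
        candidates = [
            (link, pub_date), (link, title), (link,),
            (title, pub_date), (title,), (pub_date,),
        ]
        for parts in candidates:
            if all(parts):
                guid = "-".join(parts)
                break

    return {
        "link": link or "",
        "title": title,
        "description": description,
        "guid": guid or "",
        # The generated date is used for display only, never for the guid.
        "pubDate": pub_date or fetched_at,
    }
```

Note that the generated date is deliberately kept out of the guid, otherwise the same item would get a new identity on every fetch.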

Related

How to detect updates in podcast feeds?

I have a large set of podcast feed URLs which I'm periodically polling to check for updates. I'm really struggling to find a robust way to detect if a feed has changed that doesn't have any false positives. I'd like to be able to detect not just if there is a new episode, but also if an existing episode was updated.
RSS and Atom feeds provide pubDate, lastBuildDate or updated elements. However, I'm finding these frequently misused: the feed inserts the current date and time into these fields on each request. This makes them difficult to rely on to detect changes.
My next thought was to strip all date information from the podcasts, then MD5 hash the feed contents. I can then compare the feed hashes to detect changes to the feeds.
This seems to work for about 90% of the cases. However, there are still hundreds of podcasts that insert dynamic data into their feeds.
One podcast has the following as their podcast cover art:
http://erikglassman.hipcast.com/albumart/1000.1439649026.jpg
Where 1439649026 is what I assume is a timestamp. This second number changes with each request of their feed.
This is starting to seem like a losing battle. If I can't reliably trust the date fields of a podcast feed, and if some percentage of podcasts insert dynamic data into their feed text, how can I reliably detect changes to a feed in a robust way?
Everything you say is true, so it's not a good idea to try to detect changes at the feed level. Instead, look for them at the item level.
That generally works. If it doesn't, the feed can't be used by anyone, so the source of the feed is likely to have fixed any problem. That's why I think it works so well.
I've been writing feed readers for as long as they have existed. My current product is called River4. It's available as open source under the MIT License, so you can use it as example code for this and other issues.
This is where it checks if an item is new:
https://github.com/scripting/river4/blob/master/river4.js#L1411
That might move around as the code changes, so look for a routine called getItemGuid. It shows you how to get a value that uniquely identifies the item. I use this code for my podcatcher, http://podcatch.com/, and it seems to catch the new items, and doesn't get false positives.
Hope this helps! :-)
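For readers who want the general shape of item-level detection without digging through the repository, here is a minimal sketch. It is not River4's getItemGuid; the field names, the fallback key and the fingerprint used to spot updated episodes are all assumptions made for illustration.

```python
import hashlib


def item_key(item: dict) -> str:
    """Return a value that identifies an item: its guid if present, else a fallback."""
    if item.get("guid"):
        return item["guid"]
    # Assumed fallback when the feed has no guid.
    return "|".join(item.get(k) or "" for k in ("link", "title", "pubDate"))


def item_fingerprint(item: dict) -> str:
    """Hash only the fields that describe the episode; feed-level data is ignored."""
    fields = ("title", "description", "enclosure_url")  # hypothetical field names
    payload = "\x1f".join(item.get(k) or "" for k in fields)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_items(seen: dict, items: list) -> tuple:
    """Compare freshly fetched items against `seen` (key -> fingerprint).

    Returns (new_keys, updated_keys) and updates `seen` in place.
    """
    new_keys, updated_keys = [], []
    for item in items:
        key, fingerprint = item_key(item), item_fingerprint(item)
        if key not in seen:
            new_keys.append(key)
        elif seen[key] != fingerprint:
            updated_keys.append(key)
        seen[key] = fingerprint
    return new_keys, updated_keys
```

Because dates, cover art URLs and other channel-level fields never enter the fingerprint, a feed that rewrites them on every request should not trigger a change by itself.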

How to remove/hide all Google Analytics data associated with a specific page?

For about a week, Google Analytics was erroneously reporting page views for a few request URIs, severely skewing my data. I have read that there is no way to remove data once it is reported. If this is the case, is there a way to simply hide this data from the view?
I have tried a number of things (such as creating global filters, view filters, etc.) to no avail. Using segments also doesn't work, because apparently you can only filter out visits/users (whereas my goal is to filter out page views associated with a specific page). At this point, I feel like I must be going about it the totally wrong way...
Below is a screenshot of the Behavior > Overview section. The page views I want to remove are #1, #2, and #5.
Alex, unfortunately, there is nothing you can do about the historical data.
However, you can use a simple filter to exclude pages you don't want to see (the filter field above the report table, not filters related to account/profiles) -- see the attached screenshot below.
Make sure you select exclude and then pick the Page dimension. The easiest way would be to use regular expressions, like:
(a|b|c)
This one would remove any pages that contain either "a", or "b" or "c".
The expression would probably be a bit more complicated in your case, and I suggest using a tool like RegEx Hero (free, online) to build it. I am not sure whether the pages you would like to remove from the reports have anything in common, but regular expressions can do quite a lot :).
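If you want to try a pattern offline before pasting it into the filter field, something like the sketch below may help. The request URIs are made up, and it uses Python's re module, whose syntax may differ in details from what the report filter accepts.

```python
import re

# Hypothetical request URIs standing in for the pages to hide.
exclude = re.compile(r"(/tmp-landing|/test/|\?preview=)")

pages = ["/", "/pricing", "/tmp-landing", "/test/checkout", "/blog?preview=1"]

# Keep every page whose URI does not contain any of the alternatives.
kept = [page for page in pages if not exclude.search(page)]
print(kept)
```

Like the (a|b|c) example, the pattern matches anywhere in the URI, so keep each alternative specific enough that it doesn't hide pages you still want to see.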
One last thing -- be aware there is a slight difference between segments and (table) filters. If you use segments for the page dimension, you end up with ALL the pages that were seen during a visit that included the page you set in the segment. Might be a bit confusing, but see this article for a detailed explanation.

Make search URL search engine friendly: hash -> what?

I am developing a flight search engine for a customer, and currently the URLs look as follows (ad = destination airport, ao = origin airport, dates and number of passengers are not specified here):
http://example.com/#ad=S%C3%A3o+Paulo+-+Todos+os+aeroportos+(SAO),+Brasil&ao=Recife+-+Guararapes+Intl+(REC),+Brasil
My customer wants to make search pages more search engine friendly (SEO). The idea is that Brazilians who use, e.g., Google to look for flights from, say, SAO to REC should have a higher chance of finding that particular flight search engine.
The first step is probably replacing the fragment identifier (#) by a query string (?). The server then dynamically generates nice text content that can be viewed without JavaScript (search results would still be loaded via XHR). In my opinion, that makes a lot of sense.
Now, to make the URLs more search engine friendly:
(A) My customer proposes adding additional keywords into the URL, something like:
http://example.com?flights+to+Porto+Alegre&S%C3%A3o+Paulo+-+Todos+os+aeroportos+(SAO),+Brasil&ao=Recife+-+Guararapes+Intl+(REC),+Brasil
(B) I propose adding a slug instead, which can easily be internationalized, and which is good to read also for humans. Example:
http://example.com/pt_BR?ad=REC&ao=SAO/voos_de_Sao_Paulo_para_Recife
(C) Or, perhaps without a slug (but - due to parsability - only for a limited parameter set, which has the disadvantage of limiting sharing of URLs by users):
http://example.com/pt_BR/voos_de_Sao_Paulo_(SAO)_para_Recife_(REC)
What do you suggest? Any examples of good URLs for similar use cases?
That all being said: I understand that links from highly ranked pages are still the most important ranking measure. In the end, I wonder if all that complexity really is worth the effort. When I look at Google's own search pages, they are rather simple. For example, there is no summary of the search query in an H1 tag, which is what my customer wants to have. Of course, Google doesn't search itself...
Don't use _ (underscore) to delimit words. Google interprets hello_world as one word but hello-world as two words.
Don't put your human-readable keywords in the query string (after the ?). Instead, make it a normal URL: http://example.com/pt_BR/search/voos-de-Sao-Paulo-(SAO)-para-Recife-(REC)
I would go for something like: http://example.com/pt_BR/2012-10-28/voos-de-Sao-Paulo-(SAO)-para-Recife-(REC)
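A rough sketch of how such a URL could be generated is shown here. The function names are hypothetical, and the "voos-de ... para" wording is hard-coded for pt_BR; a real site would presumably pick the phrase per locale.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Fold accents to ASCII and join words with hyphens (not underscores)."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-")


def flight_url(locale: str, date: str, origin: str, origin_code: str,
               dest: str, dest_code: str) -> str:
    """Build a path-based search URL with the keywords outside the query string."""
    slug = f"voos-de-{slugify(origin)}-({origin_code})-para-{slugify(dest)}-({dest_code})"
    return f"http://example.com/{locale}/{date}/{slug}"


print(flight_url("pt_BR", "2012-10-28", "São Paulo", "SAO", "Recife", "REC"))
```

Keeping the airport codes in the slug means the server can still parse the origin and destination back out of the path.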

How to provide multiple search functionality in website?

I am developing a web application in which I have the following types of search functionality:
Normal search: the user enters a keyword to search the records.
Popular: this is not really a kind of search; it displays the popular records on the website, as Digg and other social bookmarking sites do.
Recent: this displays the records most recently added to my website.
City Search: here I present city names to the user, like "Delhi", "Mumbai", etc. When the user clicks one of these links, all records from that particular city are displayed.
Tag Search: same as city search, but with tag links. When the user clicks a tag, all records marked with that tag are displayed.
Alphabet Search: same as city and tag search, but with links for letters like "A", "B", etc. When the user clicks a letter, all records starting with that letter are displayed.
Now, my problem is that I have to provide the searches listed above, but I cannot decide how to structure them. My first idea is a single page (result.aspx) that displays the records for every search and uses the query string to work out which search the user is running and what data to display. For example, if I search for the city delhi and the tag delhi-hotels, the URLs would be:
For City: www.example.com/result.aspx?search_type=city&city_name=delhi
For Tags: www.example.com/result.aspx?search_type=tag&tag_name=delhi-hotels
For Normal Search: www.example.com/result.aspx?search_type=normal&q=delhi+hotels+and+bar&filter=hotlsOnly
Now, I feel that the idea of using a single page for all searches is messy. So I thought of a cleaner idea: using a separate page for each type of search, as follows:
For City: www.example.com/city.aspx?name=delhi
For Tags: www.example.com/tag.aspx?name=delhi-hotels
For Normal Search: www.example.com/result.aspx?q=delhi+hotels+and+bar&filter=hotlsOnly
For Recent: www.example.com/recent.aspx
For Popular: www.example.com/popular.aspx
My new idea is cleaner, and it tells the user specifically which page is for what; it also gives them an idea of where they are and what records they are seeing. But the new idea has one problem: if I have to change anything in the search result display, I have to make the change in every page one by one. I have thought of a solution for this problem too: putting a user control inside a Repeater control and passing my values one by one to the user control, which renders the HTML for each record.
Everything is fine with the new idea, but I still cannot decide which idea to go with. Can anyone give me their thoughts on this problem?
I want to implement an idea that is easy to maintain, SEO friendly (gives my website a good ranking) and user-friendly (easy for users to use and understand).
Thanks.
One thing to mention on the SEO front:
As a lot of the "results" pages will be linking through to the same content, there are a couple of advantages to appearing* to have different URLs for these pages:
Some search engines get cross if you appear to have duplicate content on the site, or if there's the possibility of almost infinite lists.
Analysing traffic flow.
So for point 1, as an example, you'll notice that SO has numerous ways of finding questions, including:
On the home page
Through /questions
Through /tags
Through /unanswered
Through /feeds
Through /search
If you take a look at the robots.txt for SO, you'll see that spiders are not allowed to visit (among other things):
Disallow: /tags
Disallow: /unanswered
Disallow: /search
Disallow: /feeds
Disallow: /questions/tagged
So the search engine should only find one route to the content rather than three or four.
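To see what such rules let a crawler reach, you can feed them to Python's standard urllib.robotparser. This is just a sketch: the "User-agent: *" line and the example.com host are assumptions added so the snippet is self-contained, and real crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The Disallow lines quoted above, under an assumed catch-all user agent.
rules = """\
User-agent: *
Disallow: /tags
Disallow: /unanswered
Disallow: /search
Disallow: /feeds
Disallow: /questions/tagged
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", "http://example.com" + path)


for path in ("/questions/123", "/questions/tagged/python", "/tags/python"):
    print(path, allowed(path))
```

The question itself stays reachable through /questions, while the tag listings that would lead to the same content are blocked.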
Having them all go through the same page doesn't allow you to filter like this. Ideally you want the search engine to index the list of Cities and Tags, but you only need it to index the actual details once - say from the A to Z list.
For point 2, when analysing your site traffic, it will be a lot easier to see how people are using your site if the URLs are meaningful, and the results aren't hidden in the form header - many decent stats packages allow you to report on query string values, or if you have "nice" URLs, this is even easier. Having this sort of information will also make selling advertising easier if that's what you're interested in.
Finally, as I mentioned in the comments to other responses, users may well want to bookmark a particular search - having the query baked into the URL one way or another (query strings or rewritten URL) is the simplest way to allow this.
*I say "appearing" because as others have pointed out, URL rewriting would enable this without actually having different pages on the server.
There are a few issues that need to be addressed to properly answer your question:
You do not necessarily need to redirect to the Result page before being able to process the data. The page or control that contains the search interface on submitting could process the submitted search parameters (and type of search) and initiate a call to the database or intermediary webservice that supplies the search result. You could then use a single Results page to display the retrieved data.
If you must pass the submitted search parameters via querystring to the result page, then you would be much better off using a single Result page which parses these parameters and displays the result conditionally.
Most users do not rely on the url/querystring information in the browser's address bar to identify their current location in a website. You should have something more visually indicative (such as a Breadcrumbs control or header labels) to indicate current location. Also, as you mentioned, the maintainability issue is quite significant here.
I would definitely not recommend the second option (using separate result pages for each kind of search). If you are concerned about SEO, use URL rewriting to construct URL "slugs" to create more intuitive paths.
I would stick with the original result.aspx result page. My reasoning for this from a user point of view is that the actual URL itself communicates little information. You would be better off creating visual cues on the page that states stuff like "Search for X in Category Y with Tags Z".
As for coding and maintenance, since everything is so similar besides the category it would be wise to just keep it in one tight little package. Breaking it out as you proposed with your second idea just complicates something that doesn't need to be complicated.
Ditch the query strings and use URL rewriting to handle your "sections". This is much better for SEO and clearer from a bookmark/user readability standpoint.
City: www.example.com/city/delhi/
Tag: www.example.com/tag/delhi-hotels/
Recent: www.example.com/recent/
Popular: www.example.com/popular/
Regular search can just go to www.example.com/search.aspx or something.
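The rewriting itself would be configured in your web server or framework, but the mapping boils down to something like this sketch (written in Python purely for illustration; the route table and the parameter names are assumptions):

```python
import re
from typing import Optional

# Friendly path patterns and the search type each one stands for.
ROUTES = [
    (re.compile(r"^/city/(?P<name>[^/]+)/?$"), "city"),
    (re.compile(r"^/tag/(?P<name>[^/]+)/?$"), "tag"),
    (re.compile(r"^/recent/?$"), "recent"),
    (re.compile(r"^/popular/?$"), "popular"),
]


def resolve(path: str) -> Optional[dict]:
    """Translate a friendly path into the parameters a single results page expects."""
    for pattern, search_type in ROUTES:
        match = pattern.match(path)
        if match:
            return {"search_type": search_type, **match.groupdict()}
    return None


print(resolve("/city/delhi/"))
print(resolve("/tag/delhi-hotels/"))
```

This way the user sees one clean URL per section while a single page still renders every result list, so the display code lives in one place.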

Using Yahoo! Pipes

Have you used pipes.yahoo.com to quickly and easily do... anything? I've recently created a quick mashup of StackOverflow tags (via rss) so that I can browse through new questions in fields I like to follow.
This has been around for some time, but I've just recently revisited it and I'm completely impressed with its ease of use. It's almost to the point where I could set up a pipe and then give a client privileges to go in and edit feed sources... and I didn't have to write more than a few lines of code.
So, what other practical uses can you think of for pipes?
It's nice for aggregating feeds, yes, but the other handy thing to do is filtering the feeds. A while back, I created a feed for Digg (before Digg fell into the Fark pit of despair). I didn't care about the overwhelming Apple and Ubuntu news, so I filtered those keywords out of Technology, which I then combined with Science and World & Business feeds.
Anyway, you can do a lot more than just combine things. If you wanted to be smart about it, you could set up per-subfeed and whole-feed filters to give granular or over-arching filtering abilities as the news changes and you get bored with one topic or another.
The one thing I have really used Y! Pipes for (rather than just playing around with it) is to clean up item titles, merge and finally de-dupe the feeds I got from querying multiple blog search engines with the same search term. This is something I’ve done in several very different contexts, eg. for my own ego surfing, in another case for the planet site set up by some conference’s organisers to keep an eye on their conference’s buzz, etc. Highly recommended.
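For anyone wondering what a clean, merge, filter and de-dupe pipeline like the ones described above amounts to outside of Pipes, here is a rough equivalent. It is only a sketch: items are assumed to be plain dictionaries with "title" and "link" keys, and the keyword filter is a simple substring test.

```python
def clean_title(title: str) -> str:
    """Normalize a title so near-identical copies compare equal."""
    return " ".join(title.split()).lower()


def merge_feeds(feeds: list, blocked: tuple = ()) -> list:
    """Merge lists of items, drop blocked keywords, and de-dupe by link or cleaned title."""
    seen, merged = set(), []
    for feed in feeds:
        for item in feed:
            title = clean_title(item.get("title") or "")
            # Filter: skip items whose title mentions a blocked keyword.
            if any(word.lower() in title for word in blocked):
                continue
            # De-dupe: prefer the link as identity, fall back to the cleaned title.
            key = item.get("link") or title
            if key in seen:
                continue
            seen.add(key)
            merged.append(item)
    return merged


tech = [{"title": "Ubuntu release", "link": "u1"}, {"title": "Mars rover", "link": "m1"}]
science = [{"title": "Mars  Rover", "link": "m1"}, {"title": "New telescope", "link": "t1"}]
print([item["title"] for item in merge_feeds([tech, science], blocked=("ubuntu", "apple"))])
```

The first copy of a duplicate wins, so the order in which the source feeds are listed decides which version of an item survives.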
You can do tons of things with pipes. For example for sites like digg or reddit, you can make one to bypass the site and go directly to the linked article (rewriting the RSS).
I also like to filter webcomics' feeds to keep just the comics, and then mix them all into a single feed.
I've taken the liberty of copying your pipe and rearranging it a bit so that it's easier to add and remove tags:
Yahoo Pipe: StackOverflow Merge Tags
Tags are now listed in a string builder, so to add a tag you just have to hit the + button on the string builder and type in the tag preceded by a slash.
Well, pipes are real fast and useful.
Other effective uses might be:
1) combine many feeds into one, then sort, filter and translate it.
2) geocode your favorite feeds and browse the items on an interactive map.
3) power widgets/badges on your web site.
4) grab the output of any Pipes as RSS, JSON, KML, and other formats.
This is by no means a comprehensive list.
One of my favorite things to do with Yahoo! Pipes is to aggregate multiple craigslist feeds into a single feed. You can make a feed out of any category or search criteria on craigslist. I live in a university town and am always on the lookout for tickets to sporting events, for example. I have a half-dozen craigslist searches all being combined into a single feed via Yahoo! Pipes. This works a lot better for me than simply monitoring the entire "Tickets" category; filters out most of the tickets I am not interested in. Yes, this is another aggregating feeds example, but the craigslist usage is quite valuable with the ability to aggregate feeds that are themselves based upon searches.
I've used Pipes to translate blogs into English. I would have liked to use it to fetch the full text for blogs which only provide a summary of the content in the feed, but unfortunately they don't provide any input which fetches the content from a parameterizable source :-(.
Just stumbled on this while looking for ways to connect Excel to Pipes. A bit necromancer-ish, but here goes.
One thing I've done is take an HTML page (science data) which has links to tons of CSV files for a bunch of Army Corps measurement stations. Each station has a big table of data files, all organized individually by month and year. I use YQL to parse out and organize the links to the individual CSV files in a way that Pipes can read them. Then, I use that as input into a Pipe, which has a user input for "Station" and "Date."
Using this, I can go to the Pipes page, type in those values and get the values only for a specific station and date, rather than have to find the station on a website, find the year and month in a big table, click the link, open the CSV file, and find the values for a day within that month's worth of data. I can even change the pipe to specify the hour, and the parameter, and then get a single value returned.
Now, I wish I could figure out how to program Excel so that I can use "=yahoo_function(station, datetime)" to place that value automatically into a cell given the values of other columns!
