Aggregating from various sources

Aggregating from various sources - rss

It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates. I thought about comparing the headlines and checking if there is a matching bigger than, say, 50%. Is that a good practice though?
Please add any other things (problems, suggestions, whatever) I might not have considered.

Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded: (I think)&<;
But it is not. It is encoded <
But so too are HTML tags! :<p>
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.

You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get it as JSON an manipulate it with Javascript, or process it server-side.

Related

Making a tree of Wikipedia links

I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0
but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?

The English Wikipedia has approximately 1.05 billion internal links. Considering the list=alllinks module has a limit of 500 links per request, it's not realistic to get all links from the API.
Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, containing information about the links themselves, and very likely also the page dump, for mapping page ids to page titles.

I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.

How Does an RSS Feed Work?

How's it going?
I've found a lot of more detailed answers relating to specific problems relating to RSS feeds, but I can't really figure out how you USE one, basically.
Could someone explain?
I see the RSS feed icon at the top of a lot of Wordpress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers

Before telling you what RSS, let me describe you a common problem that many people have.
Say there is a bunch of sites that you really like and it's sort of a
daily routine for you to go thru them. They may be a news site, your
friend's blog, but also craigslist bcause you're currently looking for
a new house and maybe a weather site to know how late you should stay
at work :)
The first thing you do when you get to work, is open your web browser
and these sites in new tabs. It's not particularly cumbersome because
there are just 4 sites. But think about it: maybe there is a new blog
that you start to like and ho, these cartoons are really funny. Maybe
there is also a bit of financial info that you're interested in and
the pictures that your brother is posting to Flickr every couple day:
they just had a new baby! Also, as you're trying to buy a house, you'd
love a little raise and you've figured that your boss really likes it
when you tell her that you've read about your company in the news or
when you tell her about a new competing product... There is also
StackOverflow. You're desperately trying to get this "expert" badge
and boost up your reputation: this may help with your boss too or even
when you're looking for a new job.
Opening all these tabs is starting to take a toll and you keep
forgetting an important one. You're also slowly getting tired of the
different reading experience that all these sites have: small fonts,
large fonts, ads all over...etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a Reader, mostly because it lets your read more things online. Often times you'll see them called "RSS reader", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user should probably not know about it, but that's what the readers depend on. In an ideal world, when you're on site you like, you should just hit "follow" on a button like this one and then you'd be redirected to your reader of choice. Later when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
RSS is great because it's content agnostic. It can be used to represent a lot of different things (as described in the little story) and decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua-franca.

I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss

An XML file is all the content of a page, with no markup. The XML represents the data in its rawest, most descriptive form. Many readers can interpret XML sources from a variety of places, and format all of the data in its own unique way.

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as tweet so something like - "Firefox is cool" is the blog post and becomes #Firefox is cool http://myshortu.rl/dhsgeh on Twitter.
So far the way i see it is i need a database table with the words i want to convert to hashtags. I'd have to parse out the title and compare the words to those in the db and add on the pound sign. Is the best way to use a db table? Or can I do it with an in memory collection or keep the words in web.config? Thanks....

The decision on whether to use a database or file (such as web.config) might depend on whether you want to write code that allows you to maintain the list. e.g. Add, Modify, Remove. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything, a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for earch result, the "details" link, the position within the results etc. It is not known whether the results page contains any of the data at all, and whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I try to to are aly greatly appreciated.
Thanks, Simon

I doubt that there exist a silver-bullet algorithm that without any training will just work on any arbitrary search query output.
However, this task can be solved and is actually solved in many applications, but with different approach. First you have to define general structure of single search result item based on what you actually going to do with it (it could be name, date, link, description snippet, etc.), and then write number of html parsers that will extract necessary necessary fields from search result output of particular web sites.
I know it is not super sexy solution, but it probably the only one that works. And it is not rocket science. Writing parsers is actually extremly simple, you can make dozen per day. If you will look into html source of search result, you will notice that output results are typically very structured and marked with specific div sections or class atributes, so it is very easy to find it in the document. You dont have even use any complicated HTML parsing library for that, something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is actually a post text with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
Once you go at large scale with your retrieval applicaiton, you will find out that there is not that big variety of different search engines on different sites, and you will be able to re-use already created parsers for sties using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change design from time to time. If your application is going to live for some time, you will need to include into your parsers some logic that will check validity of their results and notify you every time search output has changed and is not compatible anymore with your parser. Then you will have to modify particular parser or write new one.
Hope this helps.

merge rss feeds

I want to merge multiple rss feeds into a single feed, removing any duplicates. Specifically, I'm interested in merging the feeds for the tags I'm interested in.
[A quick search turned up some promising links, which I don't have time to visit at the moment]
Broadly speaking, the ideal would be a reader that would list all the available tags on the site and toggle them on and off, allowing me to explore what's available, keep track of questions I've visited, new answers on interesting feeds, etc, etc . . . though I don't suppose such a things exists right now.
As I randomly explore the site and see questions I think are interesting, I inevitably find "oh yes, that one looked interesting a couple days ago when I read it the first time, and hasn't been updated since". It would be much nicer if my machine would keep track of such deails for me :)
Update: You can now use "and", "or", and "not" to combine multiple tags into a single feed: Tags AND Tags OR Tags
Update: You can now use Filters to watch tags across one or multiple sites: Improved Tag Stes

Have you heard of Yahoo's Pipes.
Its an interactive feed aggregator and
manipulator. List of 'hot pipes' to
subscribe to, and ability to create
your own (yahoo account required).
I played with it during beta back in the day, however I had a blast. Its really fun and easy to aggregate different feeds and you can add logic or filters to the "pipes". You can even do more then just RSS like import images from flickr.

I create a the stackoverflow tag feeds pipe. You can list your tags of choice into the text box and it will combine them into a single feed with all the unique posts. It escapes '#' and '+' characters for you.
Alternatively, you can use the pipe's rss feed by appending your html-encoded tags separated by '+'s:
http://pipes.yahoo.com/pipes/pipe.run?_id=uP22vN923RG_c71O1ZzWFw&_render=rss&tags=.net+c%23+powershell
Unfortunatley, though, this seems to strip out the content of the posts. The content is visible in the debug view, but the output only contains the post title.
[Thanks to everyone for suggesting Yahoo Pipes! Had heard of it before, but never tried it until now :-]

SimplePie is a PHP library that supports merging RSS feeds into one combined feed. I don't believe it does dupe checking out-of-the-box, but I found it trivial to write a little function to eliminate duplicate content via their GUIDs.

Here is an article on Merge Multiple RSS Feeds Into One with Yahoo! Pipes + FeedBurner.
Another option is Feed Rinse, but they have a paid version as well as the free version.
Additionally:
I have heard good things about AideRss

Yahoo Pipes?
23 minutes later:
Aww, I got answer-sniped by #Bernie Perez. Oh well :)

In the latest Podcast, Jeff and Joel talked about the RSS feeds for tags, and Joel noted that there is only the current ability to do AND on tags, not OR.
Jeff suggested that this would be included at some stage in the future.
I think that you should request this on uservoice, or vote for it if it is already there.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex