Making a tree of Wikipedia links - graph

I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0
but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?

The English Wikipedia has approximately 1.05 billion internal links. Considering the list=alllinks module has a limit of 500 links per request, it's not realistic to get all links from the API.
Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, containing information about the links themselves, and very likely also the page dump, for mapping page ids to page titles.
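If you only need the links for a limited set of pages, the API route is still workable. The catch with generator=alllinks is that it enumerates link targets rather than walking article pages; for "every page and its links" the usual pattern is generator=allpages combined with prop=links, following the continuation token until it runs out. A rough sketch (standard MediaWiki API parameters; the requests library, the batch size and the cut-off are arbitrary choices for the sketch):

    # Sketch: walk articles with generator=allpages and collect their outgoing
    # links via prop=links, following API continuation. Fine for small subsets;
    # as noted above, doing this for all of English Wikipedia is impractical.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def iter_page_links(batch_size=50):
        params = {
            "action": "query",
            "format": "json",
            "generator": "allpages",
            "gapnamespace": 0,
            "gaplimit": batch_size,
            "prop": "links",
            "plnamespace": 0,
            "pllimit": "max",
        }
        while True:
            data = requests.get(API, params=params, timeout=30).json()
            for page in data.get("query", {}).get("pages", {}).values():
                for link in page.get("links", []):
                    yield page["title"], link["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])   # carry gapcontinue/plcontinue forward

    # Print the first few edges of the link graph.
    for i, edge in enumerate(iter_page_links()):
        print(edge)
        if i >= 20:
            break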

I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.

Related

How Does an RSS Feed Work?

How's it going?
I've found a lot of detailed answers about specific problems with RSS feeds, but I can't really figure out how you actually USE one.
Could someone explain?
I see the RSS feed icon at the top of a lot of Wordpress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers
Before telling you what RSS is, let me describe a common problem that many people have.
Say there is a bunch of sites that you really like, and it's sort of a daily routine for you to go through them. They might be a news site, your friend's blog, but also craigslist because you're currently looking for a new house, and maybe a weather site to know how late you should stay at work :)
The first thing you do when you get to work is open your web browser and load these sites in new tabs. It's not particularly cumbersome because there are just four sites. But think about it: maybe there is a new blog that you're starting to like, and oh, these cartoons are really funny. Maybe there is also a bit of financial info that you're interested in, and the pictures your brother posts to Flickr every couple of days: they just had a new baby! Also, as you're trying to buy a house, you'd love a little raise, and you've figured out that your boss really likes it when you tell her that you've read about your company in the news or when you tell her about a new competing product... There is also StackOverflow. You're desperately trying to get this "expert" badge and boost your reputation: this may help with your boss too, or even when you're looking for a new job.
Opening all these tabs is starting to take a toll, and you keep forgetting an important one. You're also slowly getting tired of the different reading experience that all these sites have: small fonts, large fonts, ads all over, etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a reader, mostly because it lets you read more things online. Oftentimes you'll see them called "RSS readers", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user should probably not have to know about it, but that's what the readers depend on. In an ideal world, when you're on a site you like, you should just hit a "follow" button and be redirected to your reader of choice. Later, when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
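For a concrete picture, here is a minimal, made-up feed and how little code it takes to pull those fields out (Python standard library; the sample content is invented for illustration):

    # Parse a tiny RSS document and read the fields described above:
    # title, link, guid (unique identifier) and description per entry.
    import xml.etree.ElementTree as ET

    SAMPLE_FEED = """<rss version="2.0">
      <channel>
        <title>Example blog</title>
        <link>https://example.com/</link>
        <item>
          <title>First post</title>
          <link>https://example.com/first-post</link>
          <guid>https://example.com/first-post</guid>
          <description>A short summary of the story.</description>
          <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
        </item>
      </channel>
    </rss>"""

    root = ET.fromstring(SAMPLE_FEED)
    for item in root.iter("item"):
        print(item.findtext("title"), "->", item.findtext("link"))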
RSS is great because it's content-agnostic. It can be used to represent a lot of different things (as described in the little story) and it decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua franca.
I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss
An XML feed contains all the content of a page with none of the presentation: the data in its rawest, most descriptive form. Many readers can interpret XML sources from a variety of places and format all of that data in their own unique way.

Is there a plugin or any way to automatically graph wiki content pages?

I have a DokuWiki with some content (average to small in size and depth), and I would like to automatically generate a GraphViz or Freeplane graph, or any other form of easy-to-grasp visualisation of my content.
Why? Because the wiki tends to become less and less effective when it comes to searching and organizing its content. As a user I have no good way to get a clear idea of the wiki's structure, which is why, more and more often, topics are not written and found where they are supposed to be.
How to generate graphical sitemap of large website is what I have found so far, but because my wiki is not that big, it would be quicker for me to just make a graph manually. And because the main topics are not updated or extended that often (about 10 extensions a month, tops), it would not be that hard to keep it up to date manually.
However, I would like to avoid manual tasks, at least in the future.
So is there a plugin or any other good way to graph the contents, either:
1. starting on the landing page and following the internal wiki links, or
2. using the namespace sitemap?
Either one would be nice; option 1 interests me a bit more because it reflects the paths a user could take when just calling the wiki start page. I am grateful for any help, thanks.
I wrote a simple tool to do just that; the graph can then be analyzed in Gephi. Have a look at this blog post: http://www.splitbrain.org/blog/2010-08/02-graphing_dokuwiki_help_needed
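If you would rather roll your own, here is a rough sketch of option 1 from the question: crawl the wiki starting at the landing page, follow internal links, and write a GraphViz DOT file (which Gephi can also import). The start URL and the doku.php?id=... link pattern are assumptions about a typical DokuWiki install, so adjust both to your setup:

    # Breadth-first crawl of internal wiki links, emitted as a DOT graph.
    import re
    import urllib.parse
    import urllib.request
    from collections import deque

    START = "https://wiki.example.org/doku.php?id=start"     # hypothetical wiki
    LINK_RE = re.compile(r'href="(/doku\.php\?id=[^"#]+)"')  # internal links only

    def crawl(start, limit=200):
        seen, edges, queue = {start}, [], deque([start])
        while queue and len(seen) < limit:
            page = queue.popleft()
            try:
                html = urllib.request.urlopen(page, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            for href in LINK_RE.findall(html):
                target = urllib.parse.urljoin(page, href)
                edges.append((page, target))
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return edges

    with open("wiki.dot", "w") as fh:
        fh.write("digraph wiki {\n")
        for src, dst in crawl(START):
            fh.write(f'  "{src}" -> "{dst}";\n')
        fh.write("}\n")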

Aggregating from various sources

It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide RSS feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether there is a match greater than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links. (I started with regex and got burned; I eventually moved to custom parsing to remove tags.)
2. Strip out all whitespace.
3. Lowercase everything (case-desensitize).
4. Hash all of that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
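A rough sketch of those four steps in Python, links preserved as described (this is an illustration, not the original code; the regex tag-stripping here is the simple version that eventually needs replacing with real parsing for messy feeds):

    import hashlib
    import re

    def fingerprint(html):
        # Keep the link targets so "this sucks" pointing at different URLs
        # does not collapse into one fingerprint.
        links = re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE)
        text = re.sub(r"<[^>]+>", "", html)           # strip the tags themselves
        text = re.sub(r"\s+", "", text)               # strip all whitespace
        normalised = (text + "".join(links)).lower()  # case-desensitize
        return hashlib.md5(normalised.encode("utf-8")).hexdigest()  # hash with MD5

    # Same words, different link targets: the fingerprints differ.
    a = fingerprint('Yes, <a href="http://one.example">this sucks</a>')
    b = fingerprint('Yes, <a href="http://two.example">this sucks</a>')
    assert a != b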
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded as (I think) &amp;lt;, but it is not: it is encoded as &lt;. And so are HTML tags: &lt;p&gt;.
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
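A tiny illustration of that single-encoding behaviour, with a made-up description string:

    # Both the stray "<" and the real <p> tag arrive encoded exactly once,
    # so one round of unescaping turns them all back into markup that your
    # tag-stripping code then has to handle.
    import html

    description = "Price &lt; 100 &lt;p&gt;details inside&lt;/p&gt;"
    print(html.unescape(description))   # Price < 100 <p>details inside</p>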
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.

I want to create an RSS feed that is customizable

I want to create a dropdown of RSS feeds so users can pick and choose the feeds they want, and a custom feed would be created. Is this possible using straight-up HTML and JavaScript, or do I need a server technology? There are 7 separate feeds, so the possible combinations are 7! - far too many for me to individually code into if statements and separate feeds. Is there a program that will generate the possible feeds for me automatically after I update one of them? Then I could just upload the updated XML files.
Right. So I set up my xml files, say I have one for birthdays, one for deaths, and one for mid life crises. So that is three xml files with three separate links for rss feeds. Now what I want is for people to be able to check off the ones to which they wish to subscribe rather than hitting each one separately. So I would have a form with three checkboxes and a submit button. I could do this with javascript by having 6 separate xml feeds, one for each possible combination. But if I have 4 feeds then I need to set up 24 feeds, and 5 would be 120 possible feed combinations.
So the question becomes: is there some software or library that will handle this computation for me and crank out RSS mixes/blends, similar to what some RSS mixing software seems to do? The problem with the services and software I have seen is that they provide blending for people subscribing to feeds, but not for providers. I can see in my head how easily this could be done programmatically, even though it would spit out a lot of XML and HTML/JavaScript.
I guess another way about it would be for them to sign up for multiple feeds simultaneously but I'm not sure if that can be done.
If I am making no sense I apologize. I have never seen this done so it might not be possible. I am just going to go with the page with a bunch of RSS links.
Thanks for everyone's responses. I appreciate it.
Just because there are 7 options doesn't mean you need to write 7! if statements. You only need to check if each one of the options is set, and output something appropriately.
So, yes, you need to do this server side. And it's not at all difficult.
Where are you stuck, specifically? Your question is missing a few details.
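For what it's worth, here is a minimal sketch of that server-side idea: one feed file per topic (the birthdays/deaths/mid-life-crises split from the question) and one function that assembles the output from whichever boxes were ticked, so there is no combinatorial explosion of files. The file names and function are illustrative, not tied to any particular framework:

    # Merge the <item> elements of the selected source feeds into one channel.
    import xml.etree.ElementTree as ET

    AVAILABLE = {
        "birthdays": "birthdays.xml",
        "deaths": "deaths.xml",
        "midlife": "midlife_crises.xml",
    }

    def build_feed(selected):
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = "Custom feed"
        for name in selected:
            path = AVAILABLE.get(name)
            if not path:
                continue                      # ignore unknown checkbox values
            source = ET.parse(path).getroot()
            for item in source.iter("item"):  # copy items from the chosen feed
                channel.append(item)
        return ET.tostring(rss, encoding="unicode")

    # e.g. the request came in as ?feed=birthdays&feed=deaths
    # print(build_feed({"birthdays", "deaths"}))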

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of this data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a regexp, but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that will just work on arbitrary search output without any training.
However, this task can be solved, and actually is solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be a name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search output of particular web sites.
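As a sketch of what that "general structure" might look like in practice (the field names here are just one reasonable choice, not something prescribed):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SearchResult:
        position: int                  # rank within the results page
        title: str
        link: str                      # the "details" URL
        snippet: Optional[str] = None  # description text, if the site shows one
        date: Optional[str] = None

    # Every site-specific parser returns a list of SearchResult objects, so the
    # downstream meaning-extraction program only ever sees this one shape.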
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science: writing parsers is actually extremely simple, and you can make a dozen per day. If you look at the HTML source of a search results page, you will notice that the results are typically very structured and marked up with specific div sections or class attributes, so they are very easy to find in the document. You don't even have to use any complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
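To make that concrete, here is a small sketch of that kind of targeted extraction with only the Python standard library; it is slightly more careful than a plain grep in that it tracks nested divs so an inner </div> does not end the capture early. The post-text class name is just the example above:

    from html.parser import HTMLParser

    class PostTextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0        # > 0 while inside the target div
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag == "div":
                if self.depth:
                    self.depth += 1
                elif ("class", "post-text") in attrs:
                    self.depth = 1

        def handle_endtag(self, tag):
            if tag == "div" and self.depth:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth:
                self.chunks.append(data)

        def text(self):
            return " ".join(" ".join(self.chunks).split())  # collapse whitespace

    parser = PostTextExtractor()
    parser.feed('<html><div class="post-text"><p>Hello <b>world</b></p></div></html>')
    print(parser.text())   # Hello world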
Once you go to a larger scale with your retrieval application, you will find that there is not that big a variety of search engines across different sites, and you will be able to reuse already-written parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for a while, you will need to include some logic in your parsers that checks the validity of their results and notifies you whenever the search output has changed and is no longer compatible with your parser. Then you will have to modify that particular parser or write a new one.
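The self-test does not need to be elaborate; a couple of sanity checks on whatever your parsers return (shaped here like the SearchResult sketch above, which is an assumption of this example) already catch most layout changes:

    # Complain loudly when a parser's output suggests the site layout changed.
    def check_results(results, query):
        if not results:
            raise RuntimeError(f"parser returned nothing for {query!r}; layout may have changed")
        for r in results:
            if not r.title or not r.link.startswith("http"):
                raise RuntimeError(f"malformed result {r!r}; the parser needs updating")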
Hope this helps.

Resources