Why does identifiers.org forward to addresses that do not contain linked data?

For example, if I want to link to a dbSNP identifier: rs936126, I link using identifiers.org to this URI: http://identifiers.org/dbsnp/rs936126, which then forwards me to this address: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs936126.
As far as I can tell, that page is really bad 1990s-esque HTML, not linked data or any machine-readable format. It seems like a mistake that identifiers.org doesn't forward to a URI that exposes the data in a linked-data format?
Can someone help educate me about this issue? How does one go about finding the right URIs to use as subjects (or predicates, for that matter) that describe and actually RESOLVE to specific biological entities?
Note, I do NOT want ontology terms, which simply define what a given property/attribute name means. Rather, I want a document that describes a specific biological entity: say, a particular variant, gene, or chromosome.
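One way to probe this yourself is HTTP content negotiation: ask the resolver explicitly for RDF and see what comes back. A minimal sketch in Python using the requests library (whether a machine-readable format is actually returned depends entirely on what the provider serves):

    import requests

    # Ask the resolver for RDF instead of HTML and follow its redirects.
    resp = requests.get(
        "http://identifiers.org/dbsnp/rs936126",
        headers={"Accept": "application/rdf+xml"},
        allow_redirects=True,
    )
    print(resp.status_code, resp.headers.get("Content-Type"))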

Related

Practical usage for linked data

I've been reading about linked data and I think I understand the basics of publishing it, but I'm trying to find real-world, practical (and best-practice) uses for linked data. Many books and online tutorials talk a lot about RDF and SPARQL, but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard), what should I do? Crawl the BBC wildlife page for RDF and save the contents to my own triplestore, query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC), or take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also raises the question: can you programmatically add linked data? I imagine you would have to crawl the BBC wildlife pages unless they provide an index of all the content.
If I wanted to add extra information, such as locations for these animals (http://www.geonames.org/2950159/berlin.html), what is considered the best approach? owl:habitat (a made-up predicate) Berlin, and curl the RDF for Berlin from the geonames site?
I imagine that linking to the original author is the best way, because your data can then be kept up to date, which, judging from these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local copy.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to find these is DBpedia. For the snow leopard, the URI you can use is http://dbpedia.org/resource/Snow_leopard (the /page/ variant of that URI is its HTML rendering). As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. First, you can directly query a SPARQL endpoint on the web where there might be some data; the BBC had one for music (I'm not sure if they do for other information), and DBpedia can be queried using snorql. Second, you can retrieve the data you need from these endpoints and load it into your triplestore using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triplestore, you will need the SERVICE feature of SPARQL. The second approach protects you from being unable to run your queries when a publicly available endpoint is down for maintenance.
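For instance, querying DBpedia from Python might look like the following sketch, using the SPARQLWrapper library (the endpoint URL and query are illustrative):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Fetch a few property/value pairs describing the snow leopard.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?p ?o WHERE {
            <http://dbpedia.org/resource/Snow_leopard> ?p ?o .
        } LIMIT 20
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])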
To programmatically add the data to your triplestore, you can use one of the existing libraries. In Python, RDFLib is useful for such applications.
To enrich your data with data sourced from elsewhere, there are again two approaches. The standard way is to use existing vocabularies: you'd look for a habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontology is found to contain the property (which is unlikely in this case), one needs to create a new ontology.
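With RDFLib, adding such a statement might look like this sketch (the habitat predicate URI is a placeholder for whatever vocabulary you settle on):

    from rdflib import Graph, Namespace, URIRef

    DBR = Namespace("http://dbpedia.org/resource/")
    GEO = Namespace("http://sws.geonames.org/")
    HABITAT = URIRef("http://example.org/vocab/habitat")  # placeholder predicate

    g = Graph()
    # Berlin's GeoNames resource; 2950159 is taken from the question's link.
    g.add((DBR["Snow_leopard"], HABITAT, GEO["2950159/"]))
    print(g.serialize(format="turtle"))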
If you want to keep your information current, then it makes sense to run your queries periodically. Using something such as DBpedia Live is useful in this regard.

R geographic address validation

I am trying to calculate physical distances between geographic locations (addresses) with the ggmap/mapdist function in R. Apart from the uncomfortable fact that Google Maps allows only 2,500 queries per session, I have to cope with misspelled or otherwise imperfect "addresses". The most typical problem is that the address strings have extra information appended (floor, door, etc.), and it is very hard to detect any pattern in these additions that a regular expression could exploit.
My goal is:
Check whether the address string is recognizable to Google Maps;
If not, find a way to truncate it to an acceptable form, perhaps by removing words from the string one by one.
Has anybody coped with this kind of problem?
Thanks.
There are a couple of factors running into each other here. One is the misspellings and other complexities in the addresses; the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved; they can be found with a simple Google search. You can then investigate each to see whether it matches your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up at a minimum. Doing that will enable you to use any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.
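As a rough illustration of the truncation idea from the question, here is a minimal Python sketch using the geopy library; Nominatim stands in for Google Maps, the sample address is invented, and a real run should respect the geocoder's rate limits:

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="address-cleanup-sketch")

    def geocode_with_truncation(address):
        """Drop trailing tokens (floor, door, etc.) until the geocoder matches."""
        words = address.split()
        while words:
            candidate = " ".join(words)
            location = geolocator.geocode(candidate)
            if location is not None:
                return candidate, (location.latitude, location.longitude)
            words = words[:-1]  # remove the last token and retry
        return None, None

    cleaned, coords = geocode_with_truncation(
        "1600 Amphitheatre Parkway Mountain View CA floor 2 door B"
    )
    print(cleaned, coords)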

Routing Engine Using OpenStreetMap Data

As part of my academic project, I have to build a routing engine based on data supplied from OSM. I have looked at the data model of OSM and I'm all fine with it. However, I'm having trouble converting an OSM XML file into a graph structure (nodes and edges) that I can use to apply search algorithms (Dijkstra, A* etc.) on. I would like the graph to be stored in memory to allow fast read/write.
So can anyone shed some light on how this can be done, suggest techniques, or provide pointers for further research?
Please note that I'm not allowed to re-use existing routing engines as this would defeat the purpose of doing the project.
All you need to do is:
create a node for every <node> item
every <way> entry is a sequenced list of <nd> items, each of which is a backreference to a node. So for each <way>, you iterate pairwise through its <nd>s and create an arc between the two nodes referenced.
You can do this in one pass using a streaming XML parser, since the XML data defines all the nodes before the ways.
The data doesn't intrinsically include distances, so you need to calculate them from the lat/lon of each node. You should also take account of the road type (highway=*) and the access info (access=*) in your routing, and you probably also want to ignore ways that are not traversable (e.g. waterway=stream), but that's all down to your specific situation.
http://wiki.openstreetmap.org/wiki/Elements
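Here is a minimal Python sketch of the whole pipeline: a streaming parse that builds the graph, plus a plain Dijkstra over it (the file name and the highway-only filter are placeholders to adapt):

    import heapq
    import math
    import xml.etree.ElementTree as ET

    def haversine(lat1, lon1, lat2, lon2):
        """Great-circle distance in metres between two lat/lon points."""
        r = 6371000.0
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = (math.sin(dp / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(dl / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    nodes = {}  # node id -> (lat, lon)
    graph = {}  # node id -> list of (neighbour id, edge length in metres)

    # One streaming pass works because OSM files list <node>s before <way>s.
    for _, elem in ET.iterparse("map.osm"):
        if elem.tag == "node":
            nodes[elem.get("id")] = (float(elem.get("lat")), float(elem.get("lon")))
            elem.clear()
        elif elem.tag == "way":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "highway" in tags:  # crude routability filter; refine per use case
                refs = [nd.get("ref") for nd in elem.findall("nd")]
                for a, b in zip(refs, refs[1:]):
                    if a in nodes and b in nodes:
                        d = haversine(*nodes[a], *nodes[b])
                        graph.setdefault(a, []).append((b, d))
                        graph.setdefault(b, []).append((a, d))  # assumes two-way
            elem.clear()

    def dijkstra(source, target):
        """Shortest path over the arcs built above; returns (distance, path)."""
        dist = {source: 0.0}
        prev = {}
        pq = [(0.0, source)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == target:
                path = [u]
                while u in prev:
                    u = prev[u]
                    path.append(u)
                return d, path[::-1]
            if d > dist.get(u, float("inf")):
                continue
            for v, w in graph.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    prev[v] = u
                    heapq.heappush(pq, (d + w, v))
        return float("inf"), []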

Which URL describes the resource the best?

/competitions/1/clubs/5/players
/players/search?club_id=5
/players?club_id=5
When should I use a first-class URL for a resource, and when should I use a nested URL?
Update 1
Thanks for the answers so far. I'll try to clarify things a little further.
Competition and Club have a many-to-many relationship: clubs can participate in multiple competitions. I guess that makes Club a first-class entity, so the way to access a club would be, for instance:
/clubs/33
But I also need to be able to access clubs that participate in a specific competition, so I need something like this too:
/competitions/2/clubs
But someone mentioned it isn't recommended to make a resource accessible via multiple URIs. Doesn't this violate that?
Also, I presume a URI like this would not be preferable:
/competitions/2/clubs/33/players/5
But rather use this:
/clubs/33/players/5
Club has a one-to-many relationship with Player.
/competitions/1/clubs/5/players
As a URI identifies a single resource, I would say the general rule is that if it is an object, it gets a 'first-class URL'.
I tend to use query parameters only when limiting/filtering lists, for example, /competitions/1/clubs/5/players?gender=MALE.
I use path elements when the relation "feels" tree- or directory-like (a club has players: /clubs/berlin/players). Parameters are more like "tags"; I often use them for search filters (e.g. defenders of the 'berlin' club older than 22: /clubs/berlin/players?position=defender&age=22).
I design URL structure by 'domain importance': the most basic concepts should go at the root. If possible, don't go too deep into the URL structure; I try not to duplicate or create alias collections that represent identical resources (that costs double maintenance in code and documentation).
Generally, putting /clubs at the root feels more natural: /clubs/{club_id}/players
I would only expose players through /competitions/{comp_id}/clubs/{club_id}/players if that set of players differs from /clubs/{club_id}/players, e.g. during a competition several players are suspended or didn't make the match squad.
What do you mean by /competitions? Is it a tournament or a single match? If it is a single match between two clubs, maybe use home and away domain concepts: /competitions/{comp_id}/home-club and /competitions/{comp_id}/away-club.
Update-1 Answer
Here are my thoughts on your updated question:
I guess /competitions/2/clubs is a subset of /clubs; not every club competes in every competition. The two resources are different, so two URLs are fine.
Thinking about it again, /competitions/2/clubs/33/players/5 should also be fine (though it is important to avoid duplication in the server code; see the sketch below). This URL should even be mandatory when the returned resource is a subset of /clubs/33/players (e.g. some players are injured, or the team-size limit has been reached for a specific competition).
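One way to serve both URL forms without duplicating logic is to route them to a single lookup function. A minimal Flask sketch (the data and names are hypothetical):

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Toy in-memory data; a real application would query a database.
    PLAYERS = {33: ["Alice", "Bob", "Carol"]}
    SQUADS = {(2, 33): ["Alice", "Bob"]}  # (competition_id, club_id) -> squad

    def load_players(club_id, competition_id=None):
        """Single source of truth: the competition, when given, narrows the roster."""
        if competition_id is not None:
            return SQUADS.get((competition_id, club_id), [])
        return PLAYERS.get(club_id, [])

    @app.route("/clubs/<int:club_id>/players")
    def club_players(club_id):
        return jsonify(load_players(club_id))

    @app.route("/competitions/<int:comp_id>/clubs/<int:club_id>/players")
    def competition_club_players(comp_id, club_id):
        return jsonify(load_players(club_id, competition_id=comp_id))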
I wouldn't put the ID numbers in the URL. They mean something only to those who actually know what they stand for; to everyone else they are meaningless numbers.
You should always choose descriptive, relevant words for your URL, because the URL helps convey information about the linked resource.
Instead of using meaningless ID numbers, choose a unique name representing the name of the team or the competition, for example
/competitions/worldcup/clubs/usa/players
But if you really need to send that kind of opaque data in the URL, then I would prefer to see it in a query string.
Use only meaningful text for the URL.

Is information a subset of data?

I apologize, as I don't know whether this is more of a math question that belongs on MathOverflow or a computer science question that belongs here.
That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?
The words data, information, and knowledge are value-laden concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing. In the field of information theory they have no meaning at all, because there all three are the same thing: a collection of "information" (in the information-theoretic sense).
Yet they are useful, in context, to summarize the general nature of an information set, as loosely explained below.
Information is obtained (or sometimes induced) from data, but it can be richer, as well as cleaner (some values have been corrected) and "simpler" (some irrelevant data has been removed). So in the set-theory sense, information is not a subset of data but a separate set, one that typically intersects somewhat with the data but can also have elements of its own.
Knowledge (sometimes called insight) is yet another level up: it is based on information and likewise is not a [set-theory] subset of information. Indeed, knowledge typically doesn't refer directly to information elements but rather tells a "meta story" about the information/data.
The unfounded idea that, along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones probably stems from the fact that there is [typically] a reduction in the volume of [IT-sense] information. But qualitatively this information is different, hence no real [set-theory] subset relationship.
Example:
Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example the price info may sometimes be expressed in a text string with 1/32th of a dollar precision, in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also the field which indicate, say, the buyer ID, or seller ID may include typos, and hence point to the wrong seller/buyer. etc.
A spreadsheet made from the above is ... Information
Various processes were applied to the data:
- cleaning/correcting various values
- cross-referencing (for example, looking up associated codes, such as adding a column with the actual name of the individual/company next to the Buyer ID column)
- merging, when duplicate records pertaining to the same event (but, say, from different sources) are used to corroborate each other and are combined into a single record
- aggregating: for example, summing the transaction values for a given stock (rather than showing all the individual transactions)
All this (and then some) turned the data into Information, i.e. a body of [IT-sense] information that is easily usable, where one can quickly find some "data", such as, say, the opening and closing rates for IBM stock on June 8th, 2009.
Note that while the result is more convenient to use, in part more exact/precise, and also boiled down, there is no real [IT-sense] information in there that couldn't be located or computed from the original by relatively simple (if painstaking) processes.
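To make that data-to-information step concrete, here is a small Python sketch with pandas; all of the columns, codes, and values are invented for illustration:

    import pandas as pd

    # Hypothetical raw feed: a text price in 32nds and a coded buyer field.
    raw = pd.DataFrame({
        "stock": ["IBM", "IBM", "AAPL"],
        "price_text": ["104 16/32", "104 24/32", "98 8/32"],
        "buyer_id": [17, 17, 42],
        "value": [1000.0, 2500.0, 800.0],
    })
    buyers = pd.DataFrame({"buyer_id": [17, 42],
                           "buyer_name": ["Acme Fund", "Foo LLC"]})

    def to_decimal(text):
        """Normalize a 'dollars n/32' string to a decimal price (cleaning)."""
        dollars, frac = text.split()
        num, den = frac.split("/")
        return int(dollars) + int(num) / int(den)

    info = raw.assign(price=raw["price_text"].map(to_decimal))  # cleaning
    info = info.merge(buyers, on="buyer_id")                    # cross-referencing
    totals = info.groupby("stock")["value"].sum()               # aggregating
    print(totals)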
A financial analyst's report may contain ... Knowledge
For example, the report may indicate [bogus example] that whenever the price of oil goes past a certain threshold, the value of gold starts declining but then quickly spikes again, around the time the prices of coffee and tea stabilize. This particular insight constitutes knowledge. It may have been hidden in the data all along, but only became apparent when someone applied some fancy statistical analysis, and/or a human expert was needed to find or confirm such patterns.
By the way, in the information-theory sense of the word, "data", "information", and "knowledge" all contain [IT-sense] information.
One could possibly get on the slippery slope of stating that "as we go up the chain, the entropy decreases", but that is only loosely true, because:
- the entropy decrease is not directly or systematically tied to "usefulness for humans" (a typical example: a zipped text file has less entropy yet is no fun to read)
- there is effectively a loss of information (in addition to the entropy loss); for example, when data is aggregated, the [IT-sense] information about individual records gets lost
- there is, particularly in the case of Information -> Knowledge, a change in the level of abstraction
A final point (if I haven't confused everybody yet...) is that the data -> information -> knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] information.
ewernli, in a comment below, provides the example of the spell checker: when the focus is on English orthography, the most insightful paper from a Wall Street genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain).
Similarly, a linguist using thousands of newspaper articles, which typically (we can hope...) contain at least some insight/knowledge (in the general sense), may consider these articles raw data that will help him/her automatically create a French-German lexicon (this would be information); and as he works on the project, he may discover a systematic semantic shift in the use of common words between the two languages, and hence gain insight into the two distinct cultures.
Define information and data first, very carefully.
What is information and what is data depends heavily on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the ISP it's just data to be passed on.
Sometimes just adding the right context changes data to information.
So, to answer your question: no, information is not a subset of data. It could be at least any of the following:
A superset, when you add context
A subset, as in a needle-in-a-haystack situation
A function of the data, e.g. a digest
There are probably more situations.
This is how I see it...
Data is dirty and raw. You'll probably have too much of it.
... Jason ... 27 ... Denton ...
Information is the data you need, organised and meaningful.
Jason.age=27
Jason.city=Denton
Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.
information is an enhancement of data:
data is inert
information is actionable
note that information without data is merely an opinion ;-)
Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.
