I've been reading about linked data and I think I understand the basics of publishing linked data, but I'm trying to find real-world, practical (and best-practice) usage for linked data. Many books and online tutorials talk a lot about RDF and SPARQL, but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard), what should I do? Crawl the BBC wildlife page for RDF and save the contents to my own triplestore, query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC), or take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also raises the question: can you programmatically add linked data? I imagine you would have to crawl the BBC wildlife pages unless they provide an index of all the content.
If I wanted to add extra information, such as a location for these animals (http://www.geonames.org/2950159/berlin.html), what is considered the best approach? Something like owl:habitat (a made-up predicate) pointing to Brazil, and then curl the RDF for Brazil from the GeoNames site?
I imagine that linking to the original author is the best way, because your data can then be kept up to date; judging from these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), that is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local copy of that RDF.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to find this is DBpedia. For the snow leopard, one URI that you can use is http://dbpedia.org/page/Snow_leopard. As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. Firstly, you can directly query a SPARQL endpoint on the web where there might be some data. The BBC had one for music; I'm not sure whether they do for other information. DBpedia can be queried using SNORQL. Secondly, you can retrieve the data you need from these endpoints and load it into your triple store using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triple store, you will need to use the SERVICE feature of SPARQL. The second approach protects you from being unable to execute your queries when a publicly available endpoint is down for maintenance.
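As a rough sketch of the second approach, the update below pulls the DBpedia triples about the snow leopard into a Named Graph in your own store via SERVICE. The endpoint URL, the graph IRI, and the assumption that your store accepts SPARQL 1.1 updates over HTTP (sent here with SPARQLWrapper) are all placeholders to adapt.

```python
# Sketch only: assumes your triple store exposes a writable SPARQL Update
# endpoint at http://localhost:8890/sparql and supports SERVICE in updates.
from SPARQLWrapper import SPARQLWrapper, POST

store = SPARQLWrapper("http://localhost:8890/sparql")  # your store's update endpoint
store.setMethod(POST)
store.setQuery("""
PREFIX dbr: <http://dbpedia.org/resource/>
INSERT {
  GRAPH <http://example.org/animals> { dbr:Snow_leopard ?p ?o }
}
WHERE {
  SERVICE <https://dbpedia.org/sparql> { dbr:Snow_leopard ?p ?o }
}
""")
store.query()  # copies the remote triples into your local Named Graph
```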
To programmatically add the data to your triplestore, you can use one of the predesigned libraries. In Python, RDFlib is useful for such applications.
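For example, a minimal RDFlib sketch that dereferences a Linked Data URI and caches its triples locally (the output file name is just illustrative) could look like this:

```python
# Minimal sketch, assuming RDFlib is installed: dereference a Linked Data URI,
# parse whatever RDF comes back, and keep the triples in a local graph that
# you can later serialize or push into your triple store.
from rdflib import Graph, URIRef

g = Graph()
g.parse("http://dbpedia.org/resource/Snow_leopard")  # RDFlib content-negotiates for RDF

snow_leopard = URIRef("http://dbpedia.org/resource/Snow_leopard")
for p, o in g.predicate_objects(subject=snow_leopard):
    print(p, o)

g.serialize(destination="snow_leopard.ttl", format="turtle")  # local cache; file name is arbitrary
```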
To enrich the data with data sourced from elsewhere, there are again two approaches. The standard way of doing it is to use existing vocabularies. So, you'd have to look for a habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontologies are found to contain the property (which is unlikely in this case), one needs to create a new ontology.
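As an illustration of inserting such a statement with RDFlib: the habitat property below is a placeholder (as noted above), the local animal URI is invented, and the GeoNames URI is the Linked Data form of the Berlin page referenced in the question.

```python
# Illustration only: ex:habitat is a placeholder predicate and the local
# animal URI is made up for the example.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/vocab/")  # placeholder vocabulary

g = Graph()
g.add((DBR.Snow_leopard, EX.habitat, URIRef("http://sws.geonames.org/2950159/")))
# Link your local resource to the external one rather than copying it wholesale
g.add((URIRef("http://example.org/animals/snow-leopard"), OWL.sameAs, DBR.Snow_leopard))

print(g.serialize(format="turtle"))
```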
If you want to keep your information current, then it makes sense to periodically run your queries. Using something such as DBpedia Live is useful in this regard.
I'm new to the Docker world. I want to query an ontology locally, and I have already configured virtuoso-sparql-endpoint-quickstart. It works, and my endpoint is http://localhost:8890/sparql.
Now I want to query my own ontology (not DBpedia). Can I still use the same endpoint? How can I add my ontology to Virtuoso?
Please note that an ontology is a vocabulary used to describe one or more classes of entities. The descriptions themselves are typically referred to as instance data, and queries are usually run over such instance data. (There are a few ontologies used to describe ontologies, and these descriptions are also instance data, and queries might be made against them.)
There are a number of ways to load data into Virtuoso. The most useful for most people is the Bulk Load facility. For most purposes, you'll want to load your data into one or more distinct Named Graphs, such that queries can be scoped to one, some, or all of those Named Graphs.
Any and all queries can be made against the same http://localhost:8890/sparql endpoint. Results will vary depending on the Named Graphs identified in your query.
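For instance, here is a minimal sketch of querying your own data at that endpoint, assuming you loaded it into a Named Graph called <http://example.org/my-ontology> (the graph IRI is whatever you chose at load time):

```python
# Minimal sketch using SPARQLWrapper; the graph IRI is an assumption.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
SELECT ?s ?p ?o
FROM <http://example.org/my-ontology>
WHERE { ?s ?p ?o }
LIMIT 25
""")

for b in endpoint.query().convert()["results"]["bindings"]:
    print(b["s"]["value"], b["p"]["value"], b["o"]["value"])
```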
I've been trying to figure out how to best model data for a complex feed in Cloud Firestore without returning unnecessary documents.
Here's the challenge --
Content is created for specific topics, for example: Architecture, Bridges, Dams, Roads, etc. The topic options can expand to include as many as needed at any time. This means it is a growing and evolving list.
When the content is created it is also tagged to specific industries. For example, I may want to create a post in Architecture and I want it to be seen within the Construction, Steel, and Concrete industries.
Here is where the tricky part comes in. If I am a person interested in the Steel and Construction industries, I would like to have a feed that includes posts from both of those industries with the specific topics of Bridges and Dams. Since it's a feed the results will need to be in time order. How would I possibly create this feed?
I've considered these options:
Query for each individual topic selected that includes tags for Steel and Construction, then aggregate and sort the results. The problem I have with this one is that it can return too many posts, which means I'm reading documents unnecessarily. If I select 5 topics within a specific time range, that's 5 queries, which is OK. However, each can return any number of results, which is problematic. I could add a limit, but then I run the risk of posts being omitted from topics even though they fall within the time range.
Create a post "index" table in Cloud SQL and perform queries on it to get the post IDs, then retrieve the Firestore documents as needed. The question then is, why not just use Cloud SQL (MySQL)? Well, it's a scaling, cost, and maintenance issue. The whole point of Firestore is not having to worry so much about DBAs, load, and scale.
I've not been able to come up with any other ideas and am hoping someone has dealt with such a challenge and can shed some light on the matter. Perhaps Firestore is just completely the wrong solution and I'm trying to fit a square peg into a round hole, but it seems like a workable solution should be possible.
The ideal structure is to have a separate node for posts; then, for each post, give it a reference to its parent category, e.g. Steel or Construction, and also store a timestamp on each post. If you think the database will be too massive for Firebase's queries, you can connect your Firebase database to Elasticsearch and do the search from there.
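For what it's worth, here is a rough sketch of the per-topic fan-out the question describes, written against the google-cloud-firestore Python client. The collection and field names (posts, topic, industries, created_at) are assumptions, and each topic/industry/timestamp combination will need a composite index.

```python
# Rough sketch; collection and field names are assumptions for illustration.
from google.cloud import firestore

db = firestore.Client()

def build_feed(topics, industry, page_size=20):
    """One query per selected topic, filtered by one industry, merged client-side."""
    posts = []
    for topic in topics:
        query = (
            db.collection("posts")
            .where("topic", "==", topic)
            .where("industries", "array_contains", industry)
            .order_by("created_at", direction=firestore.Query.DESCENDING)
            .limit(page_size)
        )
        for doc in query.stream():
            post = doc.to_dict()
            post["id"] = doc.id
            posts.append(post)
    # Merge the per-topic pages into a single time-ordered feed. Filtering on
    # several industries at once would mean either more queries per topic or
    # Firestore's array-contains-any operator.
    posts.sort(key=lambda p: p["created_at"], reverse=True)
    return posts[:page_size]

feed = build_feed(["Bridges", "Dams"], "Steel")
```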
I hope it helps.
I'm building a web application using multiple APIs which are part of the Microsoft Cognitive Services bundle.
The API returns detected objects such as person, man, fly, kite, etc. I require a list of all possible objects that the API is capable of detecting, and also the hierarchy (if available).
I need this information for database normalization purposes. Is there any documentation that I am missing?
There are thousands of objects to detect, and their list is not available publicly.
That being said, the image categories are publicly available in the documentation: "Computer Vision can categorize an image broadly or specifically, using the list of 86 categories in the following diagram."
If you generally need a list of objects to use, you can turn to publicly available object datasets, including the following (arranged from oldest to newest):
COIL100
SFU
SOIL-47
ETHZ Toys
NORB
Caltech 101
PASCAL VOC
GRAZ-02
ALOI
LabelMe
Tiny Images
CIFAR10 and CIFAR100
ImageNet
BOSS
Office
BigBIRD
MS-COCO
iLab-20M
CURE-OR
However, it is recommended to normalize your database based on the JSON you receive from the API. For example, you already know that you are going to receive objects when calling Detect Objects, and categories when calling Analyze Image, so you can work with that!
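As a hedged sketch of that idea, the helpers below flatten the response shapes documented for Detect Objects and Analyze Image into rows you could normalize into tables; the field names ("objects", "object", "confidence", "parent", "categories", "name", "score") should be double-checked against the current docs.

```python
# Sketch only: field names follow the documented Detect Objects / Analyze
# Image response shapes and should be verified against the current docs.
def extract_objects(response: dict):
    """Flatten detected objects plus their parent chain (hierarchy, if present)."""
    rows = []
    for obj in response.get("objects", []):
        chain = []
        parent = obj.get("parent")
        while parent:                      # walk up the hierarchy if one is returned
            chain.append(parent["object"])
            parent = parent.get("parent")
        rows.append({
            "object": obj["object"],
            "confidence": obj["confidence"],
            "parents": chain,              # e.g. ["mammal", "animal"]
        })
    return rows

def extract_categories(response: dict):
    """Pull the broad category labels from an Analyze Image response."""
    return [(c["name"], c["score"]) for c in response.get("categories", [])]
```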
I am working on building a database of timing and address information for restaurants, extracted from multiple websites. Since information for the same restaurant may be present on multiple websites, the database will contain some nearly duplicate copies.
The number of restaurants is large, say 100,000, so checking every new entry against all existing records for a nearly similar name amounts to on the order of 100,000^2 comparisons overall. Is there a more efficient approach than that? Thank you.
Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.
The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.
I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.
One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.
How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.
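As a rough sketch of that idea, using difflib from the standard library as a stand-in for a proper comparator such as Levenshtein or q-grams:

```python
# Sketch: index restaurant names by token so a new record is compared only
# against candidates sharing at least one token, not against all 100,000 records.
from collections import defaultdict
from difflib import SequenceMatcher

token_index = defaultdict(set)   # token -> ids of records containing it
records = {}                     # record id -> name

def tokenize(name):
    return [t for t in name.lower().replace(",", " ").replace("'", "").split() if t]

def add_record(record_id, name):
    records[record_id] = name
    for token in tokenize(name):
        token_index[token].add(record_id)

def find_near_duplicates(name, threshold=0.8):
    candidates = set()
    for token in tokenize(name):
        candidates |= token_index[token]
    return [rid for rid in candidates
            if SequenceMatcher(None, name.lower(), records[rid].lower()).ratio() >= threshold]

add_record(1, "Joe's Pizza, Main Street")
print(find_near_duplicates("Joes Pizza Main St"))   # candidate lookup, then detailed comparison
```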
Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.
I am currently looking into using Lucene.NET for powering the search functionality on a web application I am working on. However, the search functionality I am implementing not only needs to do full text searches, but also needs to rank the results by proximity to a specified address.
Can Lucene.NET handle this requirement? Or do I need to implement some way of grouping hits into different locations (e.g., less than 5 miles, less than 10 miles, etc.) first, then use Lucene.NET to rank the items within those groups? Or is there a completely different way that I am overlooking?
You can implement a custom scorer to rank the results in order of distance, but you must filter the results first to be efficient. You can use the bounding-box method, filtering the results to a square of 20 miles around your address, and after that apply the ranking.
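The idea itself is independent of Lucene; here is a rough sketch of the geometry in Python (in Lucene.NET, the box bounds would become range filters on indexed lat/lng fields and the distance would feed the custom scorer):

```python
# Language-agnostic sketch: compute a bounding box around the search point,
# filter candidates to it, then rank the survivors by great-circle distance.
from math import asin, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lng1, lat2, lng2):
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def bounding_box(lat, lng, radius_miles):
    dlat = degrees(radius_miles / EARTH_RADIUS_MILES)
    dlng = degrees(radius_miles / (EARTH_RADIUS_MILES * cos(radians(lat))))
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng

def rank_by_distance(hits, lat, lng, radius_miles=20):
    """hits: iterable of dicts with 'lat' and 'lng' keys (e.g. your filtered search results)."""
    min_lat, max_lat, min_lng, max_lng = bounding_box(lat, lng, radius_miles)
    in_box = [h for h in hits
              if min_lat <= h["lat"] <= max_lat and min_lng <= h["lng"] <= max_lng]
    return sorted(in_box, key=lambda h: haversine_miles(lat, lng, h["lat"], h["lng"]))
```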
If I remember correctly, the Lucene in Action book has an example of a distance-relevance algorithm. It's for Java Lucene, but the API is the same, so you can translate it easily to C# or VB.NET.
What you are looking for is called spatial search. I'm not sure if there are extensions to Lucene.Net to do this, but you could take a look at NHibernate Spatial. Other than that, these queries are often done within the database; at least PostgreSQL, MySQL, and SQL Server 2008 have spatial query capabilities.
After some additional research, I think I may have found my answer. I will use Lucene.NET to filter the search results down by other factors, then use the geocoded information from Google or Yahoo to sort the results by distance.