Trouble accessing RDF data from Sponger - Virtuoso

I am currently working on a project which uses Virtuoso Sponger. I have been having multiple issues, and I have referenced a lot of material before asking these questions. Since I am new to Virtuoso, please be patient with me.
I cannot seem to access RDF data using this format as given on the Sponger page — http://{virtuoso-host}/about/data/{format}/{URIscheme}/{authority}/{local-path}
I tried it both on linkeddata.uriburner.com and on a personal server I host with Virtuoso installed.
I wrote this in the address bar —
http://linkeddata.uriburner.com/about/data/xml/http://www.bbc.co.uk/music/artists/ed2ac1e9-d51d-4eff-a2c2-85e81abd6360%01artist
— and got this error —
Error HTTP/1.1 404 File not found
The requested URL was not found
URI = '/about/data/xml/http:/www.bbc.co.uk/music/artists/ed2ac1e9-d51d-4eff-a2c2-85e81abd6360artist'
When I try the HTML —
http://{virtuoso-host}/about/html/{URIscheme}/{authority}/{local-path}
— option of the Browser Input method, I get much less data output from my server than from linkeddata.uriburner.com. How can I correct this?
My main objective is to get RDF data from social media and information sites, and store it in a database to be searched locally. For example, the BBC has info on Bob Marley; so does Wikipedia. I want to get structured data from both of them, remove redundant data, and add new data so that a single object is created. I wish to query this data from the database.
How would I store this data in the database when using the Browser Input method?
Also, let's say this data gets stored under graphs (I saw the link in the Virtuoso Conductor -> LinkedData -> Graph); how do I query it then?

Shrivansh,
There are many issues here, so I am going to provide a broad answer.
The Sponger is going to transform a Web Resource into RDF-based Linked Data. The transformed data ends up in a Virtuoso-hosted RDF Document, which is identified by a Named Graph IRI.
Given a Web Resource URL —
http://www.slideshare.net/kleinerperkins/internet-trends-v1
— you could construct an extract, transform, and load (ETL) service URL as —
http://linkeddata.uriburner.com/about/html/http/www.slideshare.net/kleinerperkins/internet-trends-v1
The results of the above are as follows:
Basic Entity Description Page (note the alternative document type links in the page footer) — http://linkeddata.uriburner.com/about/html/http/www.slideshare.net/kleinerperkins/internet-trends-v1
Faceted Browsing-oriented Entity Description Page (again note the alternative document type links in the footer) — http://linkeddata.uriburner.com/c/9DH6GNQ6
Named Graph IRI — http://www.slideshare.net/kleinerperkins/internet-trends-v1
SPARQL Query Results Page — http://linkeddata.uriburner.com/c/9DJ563FL
SPARQL Query Definition (so you can see the query source code) — http://linkeddata.uriburner.com/c/9BL763CG
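If you want to script this instead of typing it into the address bar, here is a minimal sketch (Python with the requests library) of how the /about/data/{format}/{URIscheme}/{authority}/{local-path} pattern can be assembled from an ordinary URL. The key point behind the 404 in the question is that the scheme goes in as a bare path segment ("http"), not as "http://". The host and BBC URL below are just the examples from this thread.

import requests

def sponger_data_url(virtuoso_host, source_url, fmt="xml"):
    # Drop the '://' separator: 'http://www.bbc.co.uk/...' becomes
    # 'http/www.bbc.co.uk/...' inside the Sponger service URL.
    scheme, rest = source_url.split("://", 1)
    return "http://%s/about/data/%s/%s/%s" % (virtuoso_host, fmt, scheme, rest)

url = sponger_data_url(
    "linkeddata.uriburner.com",
    "http://www.bbc.co.uk/music/artists/ed2ac1e9-d51d-4eff-a2c2-85e81abd6360",
)
resp = requests.get(url)          # the first request triggers the Sponger ETL
print(resp.status_code)
print(resp.text[:500])            # start of the RDF/XML entity description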
When using a local Virtuoso Sponger instance, note the following:
You have to install Sponger Cartridges for target data sources (e.g., Slideshare, LinkedIn, Facebook, Twitter, etc.)
The live URIBurner.com instance has many cartridges and meta cartridges installed and configured, so you will see more results there than you get locally (unless you also install and enable all the same cartridges on your local instance)
A list of available cartridges can be found in the Virtuoso Sponger documentation.
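On the "how do I query it" part of the question: once the Sponger has loaded the transformed triples, they sit in the Named Graph shown above, and you can query that graph through Virtuoso's /sparql endpoint. A rough sketch in Python (requests); the endpoint, graph IRI, and result format here are assumptions based on the URIBurner example and may differ on a local instance:

import requests

SPARQL_ENDPOINT = "http://linkeddata.uriburner.com/sparql"   # or http://{virtuoso-host}/sparql
NAMED_GRAPH = "http://www.slideshare.net/kleinerperkins/internet-trends-v1"

query = """
SELECT ?s ?p ?o
FROM <%s>
WHERE { ?s ?p ?o }
LIMIT 50
""" % NAMED_GRAPH

resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "application/sparql-results+json"},
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])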

Related

Parsing FB-Purity's Firefox idb (Indexed Database API) object_data blob from Linux bash

From a Linux bash script, I want to read the structured data stored by a particular Firefox add-on called FB-Purity.
I have found a folder called .mozilla/firefox/b8eab5j0.default/storage/default/moz-extension+++37a9788c-671d-4cae-ba5c-fbdb8788499a^userContextId=4294967295/ that contains a .metadata file containing the string moz-extension://37a9788c-671d-4cae-ba5c-fbdb8788499a, a URL which, when opened in Firefox, shows the add-on's details, so I am pretty sure this folder belongs to the add-on.
That folder contains an idb directory, which sounds like the Indexed Database API, a W3C standard apparently used since last year by Firefox to store add-on data.
The idb folder only contains an empty folder and an SQLite file.
The SQLite file, unfortunately, does not contain much structured application data in readable form, but the object_data table contains a 95KB blob which probably contains the real structured data:
INSERT INTO `object_data` VALUES (1,'0pmegsjfoetupsf.742612367',NULL,NULL,
X'e08b0d0403000101c0f1ffe5a201000400ffff7b00220032003100380035003000320022003a002
2005300610074006f0072007500200055007205105861006e00690022002c00220036003100350036
[... 95KB ...]
00780022007d00000000000000');
Question: Any clue what this blob's format is? How to extract it (using command line or any library or Linux tool) to JSON or any other readable format?
Well, I had a fun day today figuring this out and ended up creating a Python tool that can read the data from these IndexedDB database files and print it (and maybe more at some point): moz-idb-edit
To answer the technical parts of the question first:
Both the key (name) and the data (value) use a Mozilla-proprietary format whose only documentation appears to be its source code at this time.
The keys use a special just-for-this use-case encoding whose rough description is available in mozilla-central/dom/indexedDB/Key.cpp – the file also contains the only known implementation. Its unique selling point appears to be the fact that it is relatively compact while being compatible with all the possible index types websites may throw at you as well as being in the correct binary sorting order by default.
The values are stored using SpiderMonkey's internal StructuredClone representation, which is also used when moving values between processes in the browser. Again, there are no docs to speak of, but one can read the source code, which fortunately is quite easy to understand. Before being added to the database, however, the generated binary is compressed on the fly using Google's Snappy compression, which “does not aim for maximum compression [but instead …] aims for very high speeds and reasonable compression” – probably not a bad idea considering that we're dealing with wasteful web content here.
To locate the correct indexedDB file for an extension's local storage data, one needs to resolve the extension's static ID to a so-called “internal UUID” whose value is different in every browser profile instance (to make tracking based on installed add-ons a lot harder). The mapping table for this is stored as a pref (“extensions.webextensions.uuids”) in prefs.js. The IDB path then is ${MOZ_PROFILE}/storage/default/moz-extension+++${EXT_UUID}^userContextId=4294967295/idb/3647222921wleabcEoxlt-eengsairo.sqlite
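For illustration, here is a rough Python 3 sketch of those lookup steps: resolve the internal UUID from prefs.js, build the IDB path, and pull the raw object_data blob out of the SQLite file. The extension ID below is a placeholder, and the final Snappy step is only a hint (chunked values will not decompress this way); fully decoding the StructuredClone format is exactly what moz-idb-edit implements.

import json, re, sqlite3
from pathlib import Path

MOZ_PROFILE = Path.home() / ".mozilla/firefox/b8eab5j0.default"   # your profile directory
EXT_ID = "extension@example.org"                                  # placeholder: the add-on's static ID

# 1. static ID -> internal UUID, stored as an escaped JSON string in prefs.js
prefs = (MOZ_PROFILE / "prefs.js").read_text()
raw = re.search(r'user_pref\("extensions\.webextensions\.uuids",\s*"(.*)"\);', prefs).group(1)
ext_uuid = json.loads(raw.replace('\\"', '"'))[EXT_ID]

# 2. build the IDB path and grab one row; the 5-column layout matches the
#    INSERT shown in the question (the key is column 2, the blob is the last column)
idb = (MOZ_PROFILE / "storage/default"
       / ("moz-extension+++%s^userContextId=4294967295" % ext_uuid)
       / "idb/3647222921wleabcEoxlt-eengsairo.sqlite")
row = sqlite3.connect(str(idb)).execute("SELECT * FROM object_data LIMIT 1").fetchone()
key, blob = row[1], row[-1]

# 3. the value is Snappy-compressed StructuredClone data; a plain decompress
#    only works for small values -- use moz-idb-edit for anything real
try:
    import snappy                      # pip install python-snappy
    print(snappy.decompress(blob)[:200])
except Exception as exc:
    print("raw blob of %d bytes (decode with moz-idb-edit): %s" % (len(blob), exc))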
For all practical intents and purposes you can read the value of a single storage key of any extension by downloading the project mentioned above. Basic usage is:
$ ./moz-idb-edit --extension "${EXT_ID}" --profile "${MOZ_PROFILE}" "${STORAGE_KEY}"
Where ${EXT_ID} is the extension's static ID (check its manifest.json file or look in about:support#extensions-tbody if you're unsure), ${MOZ_PROFILE} is the Firefox profile directory (also shown in about:support), and ${STORAGE_KEY} is the name of the key you'd like to query (unfortunately, querying all keys is not supported yet).
Writing data is not currently supported either.
I'll update this answer as I implement more features (or drop me an issue on the project page!).

Practical usage for linked data

I've been reading about linked data and I think I understand the basics of publishing linked data, but I'm trying to find real-world practical (and best-practice) usage for linked data. Many books and online tutorials talk a lot about RDF and SPARQL, but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard), what should I do? Crawl the BBC wildlife page for RDF and save the contents to my own triplestore, query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC), or take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also raises the question: can you programmatically add linked data? I imagine you would have to crawl the BBC wildlife page unless they provide an index of all the content.
If I wanted to add extra information such as a location for these animals (http://www.geonames.org/2950159/berlin.html), again, what is considered the best approach? owl:habitat (a fake predicate) Berlin? And curl the RDF for Berlin from the GeoNames site?
I imagine that linking to the original author is the best way because your data can then be kept up to date, which, judging from these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local RDF copy.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to find this is DBpedia. For Snow Leopard, one URI that you can use is http://dbpedia.org/page/Snow_leopard. As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. First, you can directly query a SPARQL endpoint on the web where there might be some data. The BBC had one for music; I'm not sure if they have endpoints for other information. DBpedia can be queried using snorql. Second, you can retrieve the data you need from these endpoints and load it into your triple store using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triple store, you will need to use the SERVICE feature of SPARQL. The second approach protects you from being unable to execute your queries when a publicly available endpoint is down for maintenance.
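As a concrete illustration of the second approach, here is a sketch of a SPARQL 1.1 federated INSERT that copies the DBpedia description of the snow leopard into a local graph, submitted over the SPARQL 1.1 protocol with Python's requests. The local update endpoint URL and target graph IRI are placeholders; your triple store's endpoint and authentication requirements will differ.

import requests

LOCAL_UPDATE_ENDPOINT = "http://localhost:3030/animals/update"   # placeholder endpoint

update = """
PREFIX dbr: <http://dbpedia.org/resource/>
INSERT { GRAPH <http://example.org/animals> { dbr:Snow_leopard ?p ?o } }
WHERE  { SERVICE <http://dbpedia.org/sparql> { dbr:Snow_leopard ?p ?o } }
"""

# SPARQL 1.1 protocol: updates are POSTed as the 'update' form parameter
resp = requests.post(LOCAL_UPDATE_ENDPOINT, data={"update": update})
resp.raise_for_status()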
To programmatically add data to your triplestore, you can use one of the existing libraries. In Python, RDFlib is useful for such applications.
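A minimal RDFlib sketch along those lines, assuming only that the rdflib package is installed and that DBpedia's per-resource Turtle dump is reachable at the URL shown:

import rdflib

g = rdflib.Graph()
# parse DBpedia's per-resource dump straight into a local graph
g.parse("http://dbpedia.org/data/Snow_leopard.ttl", format="turtle")
print(len(g), "triples loaded")

# query the local copy
q = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Snow_leopard> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""
for row in g.query(q):
    print(row.label)

# persist it locally (or push it into a real triple store instead)
g.serialize(destination="snow_leopard.ttl", format="turtle")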
To enrich the data with that sourced from elsewhere, there can again be two approaches. The standard way of doing it is using existing vocabularies. So, you'd have to look for the habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontologies are found to contain the property (which is unlikely in this case), one needs to create a new ontology.
If you want to keep your information current, then it makes sense to periodically run your queries. Using something such as DBpedia Live is useful in this regard.

How to update metadata using content indexes in WebCenter Content

I need to create a program which can search a document and fill in its metadata from the document (e.g., a candidate's resume), such as user experience, user skills, location, etc.
For this I would like to use Oracle's indexing mechanism (Oracle Text Search), because it indexes all the data in the document. When it indexes the document, I would like to first update my metadata fields from the indexed data, and then have the content server update its indexes. Can anyone help me understand how the indexer works and which event I can trap to make modifications for updating my metadata?
I need to update the metadata because the requirements are:
Extensive choices for search filter criteria (that search within resumes and not just form keywords):
- Boolean search across multiple parameters
- Search on skills, years of experience, particular company, education qualification, geo/location, and submission date of the profile
- Search on who referred, name, team, BU, etc.
- Result window with an adequate number of results and filters
- Predefined resume filter criteria to assist screening when a candidate applies on the job portal
You are looking at this problem from the wrong end. The indexer (Oracle Text Search) is a powerful and complex tool embedded inside the workings of the database. What you are suggesting is to interpret the results of text indexing and use them as metadata for your content - if I am not mistaken? Oracle Text generates huge amounts of data and literally "chops" up documents word for word. Making meaningful metadata from this would be a huge task.
Instead, you should be looking at capturing the metadata as close to the source as possible. This could be done (if you are using MS Office) with Word VBScript when the user saves to the repository or filesystem. I believe you can fully manipulate the metadata in a document at save time.
You will of course need to install the Oracle WebCenter Content Desktop Integration suite.
Look into Oracle WebCenter Capture. WebCenter Capture can scan a document and allows metadata to be automatically tagged on the document. WebCenter Capture integrates with WebCenter Content (WCC) and allows you to directly checkin scanned documents to WebCenter Content.
http://www.oracle.com/technetwork/middleware/webcenter/content/index-090596.html

Scraping BRfares for train fares

I am looking for advice. The following website
http://brfares.com/#home
provides fare information for UK train lines. I would like to use it to build a database of travel costs for season tickets from different locations. I have never done this kind of thing before, but I have experience with Python/Bash scripting and some HTML.
Viewing the source code for a typical query, the actual fare information is not displayed in index.html. Can anyone provide a pointer as to how to go about scraping (a new word for me) the information?
This is the URL for the query: http://brfares.com/querysimple?orig=SUY&dest=0415&rlc=
The response is a JSON object.
First you need to build a lookup table of all destination codes. You can use the following link to do that: http://brfares.com/ac_loc?term=. Do it for all the letters in the alphabet and then parse the results into a unique list.
Then you take the codes in pairs, execute the JSON query, parse the returned JSON, and feed the data into a database.
Now you can do whatever you want with that database.
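A rough Python sketch of that workflow (requests plus the built-in sqlite3 module). The JSON field names used for the location lookup ("code", "name") are assumptions, since the API is not documented; inspect a real response in your browser's network tab and adjust accordingly.

import json
import sqlite3
import string
import requests

# 1. build a lookup table of location codes, one letter at a time
codes = {}
for letter in string.ascii_lowercase:
    resp = requests.get("http://brfares.com/ac_loc", params={"term": letter})
    for entry in resp.json():
        codes[entry["code"]] = entry["name"]      # assumed field names

# 2. query fares for a pair of codes and store the raw JSON
con = sqlite3.connect("fares.db")
con.execute("CREATE TABLE IF NOT EXISTS fares (orig TEXT, dest TEXT, raw TEXT)")

def fetch_fares(orig, dest):
    resp = requests.get(
        "http://brfares.com/querysimple",
        params={"orig": orig, "dest": dest, "rlc": ""},
    )
    con.execute(
        "INSERT INTO fares VALUES (?, ?, ?)", (orig, dest, json.dumps(resp.json()))
    )
    con.commit()

fetch_fares("SUY", "0415")   # the example pair from the question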

How to Access Data in ZODB

I have a Plone site that has a lot of data in it, and I would like to query the database for usage statistics; i.e., how many calendars have more than one entry, how many blogs per group have entries after a given date, etc.
I want to run the script from the command line... something like so:
bin/instance [script name]
I've been googling for a while now but can't find out how to do this.
Also, can anybody provide some help on how to get user-specific information, such as last login time and items created?
Thanks!
Eric
In general, you can query the portal_catalog to locate content by searching various indexes. See http://plone.org/documentation/manual/developer-manual/indexing-and-searching/querying-the-catalog and http://docs.zope.org/zope2/zope2book/SearchingZCatalog.html for an introduction to the catalog.
In some cases the built-in indexes will allow you to do the query you want. In other cases you may need to write some Python to narrow down the results after doing an initial catalog query.
If you put your querying code in a file called foo.py, you can run it via:
bin/instance run foo.py
Within foo.py, you can refer to the root of the database as 'app'. The catalog would then be found at app.site.portal_catalog, where 'site' is the id of your Plone site.
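For example, a foo.py along these lines counts recently created items via the catalog. The 'site' id, the 'News Item' portal_type, and the cutoff date are assumptions to adapt; portal_type and created are standard Plone catalog indexes.

from DateTime import DateTime

site = app.site                      # 'app' is provided by "bin/instance run"; use your Plone site's id
catalog = site.portal_catalog

cutoff = DateTime("2013/01/01")
brains = catalog(portal_type="News Item",
                 created={"query": cutoff, "range": "min"})

print("%d News Items created since %s" % (len(brains), cutoff.Date()))
for brain in brains[:10]:
    print(brain.getPath(), brain.Title)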
Finding information about users happens via a separate API (for the Pluggable Auth Service). I'd suggest asking a separate question about that.
