I want to extract information from a large website and generate an ontology from it, something that can be processed with description logic.
What data structure is advisable for the extracted html data?
My ideas so far:
- Use Data Frames, Table Structures
- Sets and Relations (sets package and good relations)
- Graphs
In the end I want to export the data and process it with predicate logic (or description logic) in another programming language.
I want to use R to extract information from HTML pages. But as I understand it, there is no direct support in R (or its packages) for predicate logic or RDF/OWL.
So I need to do the extraction, use some data structure in the process, and export the data.
Example Data:
SomeDocument rdf:type PDFDocument
PDFDocument rdfs:subClassOf Document
SomeDocument isUsedAt DepartmentA
DepartmentA rdf:type Department
PersonA rdf:type Person
PersonA headOf DepartmentA
PersonA hasName "John"
Where the instance data is "SomeDocument", "DepartmentA" and "PersonA".
If it makes sense, I would also like some sort of reasoning (but probably not in R):
AccessedOften(SomeDocument) => ImportantDocument(SomeDocument)
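As a rough illustration of the "sets and relations" idea, the example data can be held as a set of (subject, predicate, object) tuples and exported one statement per line. This is only a sketch in Python, not R: the function names are made up for illustration, unary predicates such as AccessedOften are assumed to be modelled as rdf:type classes, and a real RDF export would need full URIs or prefix declarations.

```python
# Each statement is a (subject, predicate, object) tuple; a set avoids duplicates.
TRIPLES = {
    ("SomeDocument", "rdf:type", "PDFDocument"),
    ("PDFDocument", "rdfs:subClassOf", "Document"),
    ("SomeDocument", "isUsedAt", "DepartmentA"),
    ("DepartmentA", "rdf:type", "Department"),
    ("PersonA", "rdf:type", "Person"),
    ("PersonA", "headOf", "DepartmentA"),
    ("PersonA", "hasName", '"John"'),
}


def apply_class_rule(triples, if_class, then_class):
    """Rule if_class(x) => then_class(x), with classes stored as rdf:type (an assumption)."""
    inferred = {
        (s, "rdf:type", then_class)
        for s, p, o in triples
        if p == "rdf:type" and o == if_class
    }
    return set(triples) | inferred


def to_lines(triples):
    """Serialize as one 'subject predicate object .' line per statement, sorted."""
    return [f"{s} {p} {o} ." for s, p, o in sorted(triples)]
```

Calling `apply_class_rule(facts, "AccessedOften", "ImportantDocument")` on a fact set that contains `("SomeDocument", "rdf:type", "AccessedOften")` adds the corresponding ImportantDocument statement; `to_lines` gives text that another tool could read after prefixes are added.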
The most important question is: what does your website data look like? For instance, if it already contains RDFa, you can use an RDFa distiller to get the RDF out; simple, done. Then you could load the RDF into a triple store. You could augment the website's data by creating your own ontology and querying it with SPARQL; if your ontology declares classes equivalent to those in the data you found on the website, you are golden. Many triple stores can be queried as SPARQL endpoints via URLs alone and return results as XML, so even if R has no SPARQL or OWL ontology packages per se, that doesn't mean you can't query the data at all.
If a lot of pages need to be downloaded, I would use wget to fetch them. To process the files I would use a Perl script to transform the data into a more readable format, e.g. comma-separated values. Then I would turn to some programming language to combine the data in the way you describe; however, I would not choose R for this.
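To give an idea of what "returned as XML" means in practice, here is a minimal Python sketch that turns a SPARQL XML result document into plain rows. The sample document and the variable names in it are invented for illustration; only the namespace is the standard one for SPARQL query results. The same parsing could be done with any XML library, in R or elsewhere.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SPARQL Query Results XML Format.
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Invented sample response, standing in for what an endpoint would return.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="doc"/><variable name="name"/></head>
  <results>
    <result>
      <binding name="doc"><uri>http://example.org/SomeDocument</uri></binding>
      <binding name="name"><literal>John</literal></binding>
    </result>
  </results>
</sparql>"""


def parse_sparql_xml(xml_text):
    """Return one dict per result row, mapping variable name to its text value."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SPARQL_NS):
        row = {}
        for binding in result.findall("sr:binding", SPARQL_NS):
            value = binding[0]  # first child: <uri>, <literal> or <bnode>
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows
```

The resulting list of dicts maps naturally onto a table or data frame, which is one way to get endpoint data into an environment without native SPARQL support.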
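The "transform to comma-separated" step could look roughly like the following. This is a Python sketch rather than Perl, and it assumes that the interesting data are the links on each downloaded page; the class and function names are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_csv(html_text):
    """Return CSV text with a header row and one row per link found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["href", "text"])
    writer.writerows(parser.links)
    return out.getvalue()
```

Each downloaded file would be read and passed through `links_to_csv`; the CSV rows can then be combined into statements in whatever language is used for the next step.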
For a scoping study/systematic literature review I would like a package that generates a reference list as a .ris file directly from publisher databases such as Wiley, PubMed, ScienceDirect, Web of Science and JSTOR.
Is there a package (or a workaround with an API) that can "output" all listed resources of a database query as a file / dataframe in R?
I have read about "refwork" and "revtools" so far, but they seem to need .ris data up front. I am looking for something that generates this file for me, so that I do not have to do it manually (which means ticking results page by page and exporting them).
I need to recognize complex chemical names from a scanned document (PDF). They contain special characters and are laid out in a table. I also have an Excel document that contains ALL possible names (I would say rows, because there are no combinations) that I may encounter during scanning. Is there a way to create ligatures (so that FineReader will recognize an entire row instead of dissecting it into separate characters)? I tried creating a user dictionary, but FineReader does not treat an entry as one row.
The only way to create ligatures is to use "user pattern training". In FineReader, go to Tools -> Options -> Read tab (changes slightly depending on FR version) and enable User pattern training. During training extend your box to include several combined characters, thus creating a ligature.
Recognizing formulas with this method is tough but may be possible.
I have done this many times in my work at www.wisetrend.com. I am a former ABBYY support employee and current integrator and OCR consulting specialist. I will be glad to help if you need more specific assistance.
There is a list of proper names of stars here: https://www.wikidata.org/wiki/Q1433418
How can I query this in the Wikidata Query Service so that all individual names of stars are listed, along with other data in the list, such as Constellation?
In other words, how do I get at the members of the list? "Instance of" doesn't seem to work.
The confusion here comes from the fact that List of proper names of stars (Q1433418) is an item that centralizes the links to the Wikipedia pages playing this role in the different Wikipedia editions, but it doesn't play any meaningful role in Wikidata itself: no item in Wikidata is an instance of (P31) List of proper names of stars (Q1433418).
You would have more luck looking for items that are an instance of (P31) star (Q523), or an instance of something that is a subclass of (P279) star, a pattern that you will find in many of the SPARQL query examples: ?star wdt:P31/wdt:P279* wd:Q523 .
That could give this query (json version).
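For readers without the link, a query built on that pattern might look like the sketch below, written as a Python string together with the URL that requests JSON output. It is not necessarily the query referred to above: it lists stars in general rather than only those with proper names, the OPTIONAL constellation (P59) clause and the LIMIT are my own additions, and the helper name is made up.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Stars (instances of star or of any subclass of star), with constellation if known.
STAR_QUERY = """
SELECT ?star ?starLabel ?constellationLabel WHERE {
  ?star wdt:P31/wdt:P279* wd:Q523 .
  OPTIONAL { ?star wdt:P59 ?constellation . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query.strip(), "format": "json"})
```

Fetching `build_query_url(STAR_QUERY)` with any HTTP client should return the bindings as JSON; without the LIMIT the property path can be slow, so it is worth keeping while experimenting.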
And if you're into JS, you can parse the JSON result with this function I wrote: wdk.simplifySparqlResults
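If you are not into JS, the same idea takes only a few lines elsewhere. The following Python sketch flattens the standard SPARQL JSON result layout into plain rows; it is a rough analogue, not a port of wdk.simplifySparqlResults, and may differ from it in behavior. The function name is made up.

```python
def simplify_sparql_results(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Each cell looks like {"type": "...", "value": "..."}; keep only the value.
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows
```

Variables that are unbound in a row (for example a star without a constellation) are simply absent from that row's dict.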
I would not take official names of stars from there. Wikipedia is one of the most useful resources for getting first-hand, somewhat organised information on any topic. It is irreplaceable for this, and it would be a great mess not to have it. However, the information is vulnerable to vandalism and clumsy editing.
The IAU started an effort this year to establish (the only) official proper names of stars. I would use this as the reference. The list is also stored in a text file which is easy to retrieve by a program, and it is updated as the committee accepts more star names. It is here:
http://www.pas.rochester.edu/~emamajek/WGSN/IAU-CSN.txt
In fact, as you can see, the file is laid out in a format ready for use by software applications. It was made to meet needs like yours.
I've been reading about linked data and I think I understand the basics of publishing linked data, but I'm trying to find real-world, practical (and best-practice) usage for linked data. Many books and online tutorials talk a lot about RDF and SPARQL but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard) what should I do? Crawl the BBC wildlife page, for RDF, and save the contents to my own triplestore or query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC) or do I take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also raises the question: can you add linked data programmatically? I imagine you would have to crawl the BBC wildlife page unless they provide an index of all the content.
If I wanted to add extra information such as location for these animals (http://www.geonames.org/2950159/berlin.html) again what is considered the best approach? owl:habitat (fake predicate) Brazil? and curl the RDF for Brazil from the geonames site?
I imagine that linking to the original author is the best way because your data can then be kept up to date, which, according to these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local RDF.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to look is DBpedia. For Snow Leopard, one URI that you can use is http://dbpedia.org/page/Snow_leopard. As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. Firstly, you can directly query a SPARQL endpoint on the web where there might be some data. The BBC had one for music; I'm not sure if they do for other information. DBpedia can be queried using snorql. Secondly, you can retrieve the data you need from these endpoints and load it into your triple store using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triple store, you will need to use the SERVICE feature of SPARQL. The second approach protects you from being unable to execute your queries when a publicly available endpoint is down for maintenance.
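As a sketch of the second approach, the update below copies every statement about one resource from a remote endpoint into the local store by combining INSERT with SERVICE. It is built as a Python string only to keep the example self-contained; the helper name is made up, https://dbpedia.org/sparql is assumed as the remote endpoint, and whether SERVICE is allowed inside an update depends on the triple store.

```python
def build_copy_update(endpoint, resource):
    """Return a SPARQL 1.1 Update that copies all triples about `resource`
    from the remote `endpoint` into the store that executes the update."""
    return (
        f"INSERT {{ <{resource}> ?p ?o }}\n"
        f"WHERE {{\n"
        f"  SERVICE <{endpoint}> {{ <{resource}> ?p ?o }}\n"
        f"}}"
    )


# Assumed endpoint and resource URI, used here purely as an example.
UPDATE = build_copy_update(
    "https://dbpedia.org/sparql",
    "http://dbpedia.org/resource/Snow_leopard",
)
```

The resulting text would be sent to the update endpoint of your own triple store; after that, queries about the resource run locally and no longer depend on the remote endpoint being up.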
To programmatically add the data to your triplestore, you can use one of the predesigned libraries. In Python, RDFlib is useful for such applications.
To enrich the data with that sourced from elsewhere, there can again be two approaches. The standard way of doing it is using existing vocabularies. So, you'd have to look for the habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontologies are found to contain the property (which is unlikely in this case), one needs to create a new ontology.
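To make the statement above concrete, here is a small Python sketch that expands prefixed names into full URIs and prints the statement in N-Triples form. The `ex:` vocabulary is hypothetical (a stand-in for whichever ontology ends up providing the habitat property), and the GeoNames namespace and the use of the numeric id 2950159 for Berlin are assumptions on my part.

```python
# Prefix table; "ex" is a hypothetical vocabulary used only for illustration.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/vocab#",
    "geonames": "https://sws.geonames.org/",  # assumed GeoNames URI namespace
}


def expand(name, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URI wrapped in angle brackets."""
    prefix, local = name.split(":", 1)
    return f"<{prefixes[prefix]}{local}>"


def statement(subject, predicate, obj):
    """Return one N-Triples line for three prefixed names."""
    return f"{expand(subject)} {expand(predicate)} {expand(obj)} ."
```

For example, `statement("dbpedia:Snow_leopard", "ex:habitat", "geonames:2950159/")` yields a line that can be loaded into a triple store or placed inside an INSERT DATA block.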
If you want to keep your information current, then it makes sense to run your queries periodically. Something such as DBpedia Live is useful in this regard.
I have a database named Team which has 40 tables. How can I connect to that database and refer to a particular table without using an SQL query, using only R data structures?
I am not sure what you mean by "How can I connect to that database and refer to particular table without using sqlquerry".
I am not aware of a way to "see" DB tables as R data frames, arrays or the like without first importing the tuples through some sort of (SQL) query. This seems to be the most practical way to use R with DB data (it avoids the hassle of exporting the tables as .csv files first and re-reading them in R).
There are a couple of ways to import data from a DB into R so that the result of a query becomes an R data structure (ideally including proper type conversion).
Here is a short guide on how to do that with SQL-R
A similar brief introduction to the DBI family
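To illustrate the general idea outside R: a small helper can hide the query so that each table is afterwards referred to by name as an ordinary data structure. The sketch below uses Python with SQLite purely because it is self-contained; the function name is made up, and the R packages mentioned above have their own functions for the same job.

```python
import sqlite3


def load_tables(conn):
    """Return {table_name: list of row dicts} for every table in the database."""
    conn.row_factory = sqlite3.Row
    names = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    tables = {}
    for name in names:
        # Table names cannot be bound as parameters; here they come from the
        # database catalog itself, not from user input.
        rows = conn.execute(f'SELECT * FROM "{name}"').fetchall()
        tables[name] = [dict(row) for row in rows]
    return tables
```

A query is still executed under the hood, but after `tables = load_tables(conn)` a particular table is just `tables["players"]`, which is about as close as one gets to referring to tables without writing SQL each time.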