Creating links to ontology nodes - uri

Let's say that, abstracting from any particular language, we have some ontology made of triples (e.g. subject (S) - predicate (P) - object (O)).
Now if, for some reason, I want to annotate any of these triples (nodes), then I'd like to keep links to them that I can use in web documents.
Here are some conditions:
1) Such a link must be in the form of one line of text
2) Such link should be easily parseable both by machine and person
3) Sections of such links should be delimited
4) Such link must be easy to grep, which IMO means they should be wrapped in some distinct letters or characters to make them easy to regex from any web or other document
5) Such link can be used in URL pathnames or query strings, thus has to comply with URL syntax
6) Characters used in such a link must not be reserved for URL pathnames, query strings or hashes (e.g. not "/", ";", "?", "#")
My ideas so far were as follows:
a) Start and end such link with some distinct, constant set of letters, e.g. STK_....._OVRFLW
b) Separate sections with dashes "-", e.g. Subject-Predicate-Object
So it would look like:
STK_S1234-P123-O1234_OVRFLW
Do you have better ideas?
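A minimal sketch of how the proposed format could be built and grepped, assuming the STK_ / _OVRFLW wrappers and dash separators from the example above. The function names, and the assumption that each section is a letter followed by digits, are illustrative only:

```python
import re

# Assumed format from the example: STK_<S>-<P>-<O>_OVRFLW,
# where each section is S, P or O followed by digits.
LINK_RE = re.compile(r"STK_(S\d+)-(P\d+)-(O\d+)_OVRFLW")


def make_link(subject: str, predicate: str, obj: str) -> str:
    """Build a one-line link from three section identifiers."""
    return f"STK_{subject}-{predicate}-{obj}_OVRFLW"


def find_links(text: str) -> list:
    """Return a (subject, predicate, object) tuple for every link in text."""
    return LINK_RE.findall(text)


if __name__ == "__main__":
    doc = "See STK_S1234-P123-O1234_OVRFLW and also STK_S1-P2-O3_OVRFLW."
    print(find_links(doc))
```

Letters, digits, "-" and "_" are all unreserved characters in URLs, so a token of this shape can sit in a path or query string without escaping.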

I'm with @msalvadores on this one - this seems to be a classic use of semantic web / linked data (albeit in a rather complex form), and your example seems to be more related to URI design than anything else.
This is dealt with extensively in the semantic web literature; there are also JavaScript libraries for querying RDF through SPARQL - it just makes more sense to stick with the standard.
To link to a triple, the standard method is to use reification - essentially naming a triple (to keep with the triple model, it ends up creating 4 triples, but I would consider it the "correct" method in this situation). There is also the "named graph" method, which isn't a standard, but probably has more widespread adoption.
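As a rough illustration of reification, here is a sketch in which plain Python tuples stand in for triples. The example.org URIs and the function name are placeholders, not anything from the question; only the rdf: vocabulary terms are standard:

```python
# Standard RDF namespace; rdf:Statement, rdf:subject, rdf:predicate and
# rdf:object are the reification vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_uri, subject, predicate, obj):
    """Return the four triples that name a single (s, p, o) statement."""
    return [
        (statement_uri, RDF + "type", RDF + "Statement"),
        (statement_uri, RDF + "subject", subject),
        (statement_uri, RDF + "predicate", predicate),
        (statement_uri, RDF + "object", obj),
    ]


if __name__ == "__main__":
    # Hypothetical URIs, for illustration only.
    for triple in reify(
        "http://example.org/statement/S1234_P123_O123",
        "http://example.org/S1234",
        "http://example.org/P123",
        "http://example.org/O123",
    ):
        print(triple)
```

The statement URI is then the thing you link to and annotate.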
1) The link will be one line of text.
2) It will be easily machine parsable; to make it human parsable, it might be necessary to give some thought to URI design.
3) Delimitation is, once again, a matter of URI design.
4) Easy grepping: URI design.
5) URL syntax: tick.
6) No "/", ";", "?", "#": I would try to incorporate it into a URL instead of pushing it out.
I would consider www.stackoverflow.com/statement/S1234_P123_O123, where S1234 etc. are unique labels (I don't necessarily agree with human readable uris, but I guess they'll have to stay until humans don't have to read uris). The beautiful thing is that it should dereference and give a nice human vs machine readable representation
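A small sketch of how such a statement URL could be taken apart again, assuming an https scheme, a final path segment of exactly three labels, and "_" as the delimiter; the function name is hypothetical:

```python
from urllib.parse import urlsplit


def parse_statement_url(url):
    """Split the last path segment of a statement URL into (S, P, O) labels."""
    path = urlsplit(url).path
    label = path.rstrip("/").rsplit("/", 1)[-1]
    parts = label.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three '_'-separated labels, got {label!r}")
    return tuple(parts)


if __name__ == "__main__":
    print(parse_statement_url(
        "https://www.stackoverflow.com/statement/S1234_P123_O123"
    ))
```

This only works if the labels themselves never contain "_", which is one more thing the URI design has to guarantee.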

Related

How to add customized tokens into solr to change the indexing token behaviour

It's a Drupal site with Solr for search. Mainly, I am not satisfied with the current search results for Chinese. The tokenizer breaks the words into small pieces, and most of them are reasonable. But it still makes mistakes by failing to treat some strings as valid tokens, either breaking them into pieces or not breaking them at all.
Assuming I am writing Chinese now: big data analysis is one word which shouldn't be broken. So my search on it should find it. Also I want people to find AI and big data analysis training as the first hit when they search the exact phrase AI and big data analysis training.
So I want a way to intervene in, or compensate for, the current tokenization to make the search smarter.
Maybe there is a file in Solr that allows me to write these tokens down manually and relate them to certain phrases? Then Solr could use it as a reference every time it indexes.
There are different steps you can take to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
big data analysis is one word which shouldn't be broken. So my search on it should find it. -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax[1] query parser with phrase boosts at various levels to boost subsequent tokens or phrases (pf, pf2, pf3, ..., ps, ps2, ps3, ...)
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
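A hedged sketch of what such an edismax request could look like. The parameter names (defType, q, qf, pf, pf2, pf3, ps) are standard edismax parameters, but the field name "content" and the boost values are assumptions chosen for illustration; a Drupal-generated schema will use its own field names:

```python
from urllib.parse import urlencode


def edismax_params(user_query, field="content"):
    """Return edismax parameters that reward documents containing the
    query terms as a phrase (pf), as word pairs (pf2) and as triples (pf3)."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field,           # field(s) searched for individual terms
        "pf": f"{field}^10",   # boost when the whole query matches as a phrase
        "pf2": f"{field}^5",   # boost for adjacent word pairs
        "pf3": f"{field}^7",   # boost for adjacent word triples
        "ps": "0",             # slop for pf: 0 means the terms must be adjacent
    }


if __name__ == "__main__":
    # The resulting string would be appended to the core's /select URL.
    print(urlencode(edismax_params("AI and big data analysis training")))
```

With boosts like these, a document containing the exact phrase should outrank one that merely contains the same tokens scattered around, without changing the tokenizer.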

How to create a ligature from a user dictionary in ABBYY FineReader?

I need to recognize complex chemical names from a scanned document (PDF). They contain special characters and are written in a table format. I also have an Excel document that contains ALL possible names (I would say rows, because there are no combinations) that I may encounter during scanning. Is there a way to create ligatures (so that FineReader will recognize an entire row instead of dissecting it into separate characters)? I tried creating a user dictionary, but FineReader does not treat an entry as one row.
The only way to create ligatures is to use "user pattern training". In FineReader, go to Tools -> Options -> Read tab (changes slightly depending on FR version) and enable User pattern training. During training extend your box to include several combined characters, thus creating a ligature.
Formula recognition using this method is tough but may be possible.
I have done this many times in my work at www.wisetrend.com. I am a former ABBYY support employee and current integrator and OCR consulting specialist. I will be glad to help if you need more specific assistance.

REST resources with a triple as a parameter

When needing to create a URL that takes a finite set of parameters, where all of said parameters are semantically the same "level", what is the current consensus around the use of delimiters within URLs? Here's an example:
/myresource/thing1,thing2,thing3
/myresource/thing2,thing1
/myresource/thing1;thing2;thing3
/myresource/thing1;thing3
That is to say, the parameter here could be a single, a pair or a triple. They can be specified in any order because they are not a logical tree, and thing2 is not a subordinate resource of thing1, so doing something like this seems "wrong":
/myresources/thing1/thing2/thing3
This bothers me because it implies a tree-like relationship between the elements of the triple, and that is not the case (despite many HTTP frameworks seemingly pushing this, wrongly in my view). In addition, using a query string doesn't feel right as this is not a search operation, it is a known triple in a very finite space - there's nothing to query or search, so to speak.
I suppose the other option would be to make it a POST request and supply a body that details the parts of the triple being supplied. This doesn't give me warm fuzzies though, for some reason.
How have others handled this? Delimiters seem clean to me, and communicate the intended semantics of the resource, but I know there are folks who would take a different view, and I was looking to understand the experiences of others who've had similar use cases.
Since any value can be missing and values can appear in any order, how would you know which value is for which parameter (if that matters)?
I would have used the query string for GET, or the payload for POST.
Use query parameters
/path/to/the/resource?key1=value1&key2=value2&key3=value3
or matrix parameters
/path/to/the/resource;key1=value1;key2=value2;key3=value3
Without a proper example, I'm not sure exactly about your needs.
However, a little known fact is that any HTTP parameter can have multiple values. It is the way to go when you have a set of objects (see GoogleMaps static API for an example).
/path/to/the/resource?things=thing1&things=thing2&things=thing3
Then you can use the same API for single, pairs, triples (and more).
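A minimal sketch of reading such a repeated parameter on the server side, using only the standard library. The parameter name "things" comes from the example above; the function name is hypothetical. Returning a frozenset reflects the question's point that order does not matter:

```python
from urllib.parse import parse_qs, urlsplit


def things_from_url(url):
    """Collect every value of the repeated 'things' parameter as a set."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of all the values it appeared with.
    return frozenset(parse_qs(query).get("things", []))


if __name__ == "__main__":
    a = things_from_url("/path/to/the/resource?things=thing1&things=thing2")
    b = things_from_url("/path/to/the/resource?things=thing2&things=thing1")
    print(a == b)  # same resource regardless of parameter order
```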

Constructing a Windows Search query

We have a website in which we will be using the Windows Search feature to allow users to search the pages and documents of the site. I would like to make the search as intuitive as possible, given that most users are already familiar with google-style search syntax. However, using Windows Search seems to present two problems.
If I use the FREETEXT() predicate, then users can enter certain google-style syntax options, such as double quotes for exact phrase matching or use the minus sign to exclude a certain word. These are features I consider necessary. However, the FREETEXT() predicate seems to demand that every search term appear somewhere in the page / document in order for it to be returned in the results.
If I use the CONTAINS() predicate, then users can enter search terms using boolean operators, and they can execute wildcard searches using the * character. However, all search terms must be joined by one of the logical operators or enclosed in double quotation marks.
What I would like is a combination of the two. Users should be able to search for exact phrases using double quotations marks, exclude words using the minus sign, but also have anything not enclosed in quotation marks be subject to wildcard matching (e.g. searching for civ would return documents containing the words civil or civility or civilization).
How could I go about implementing this?
I followed some of the instructions at http://www.codeproject.com/Articles/21142/How-to-Use-Windows-Vista-Search-API-from-a-WPF-App to create the Interop.SearchAPI.dll assembly for .NET. I then used the ISearchQueryHelper.GenerateSQLFromUserQuery() method to build the SQL command.
The generated SQL uses the CONTAINS() predicate, but it builds the CONTAINS() predicate numerous times with different combinations of the search terms, including wild cards. This allows the user to enter exact phrases using double quotation marks, exclude words using the minus sign, and perform automatic wildcard matching as I mentioned in the original question.

RESTifying URLs

At work here, we have a box serving XML feeds to business partners. Requests for our feeds are customized by specifying query string parameters and values. Some of these parameters are required, but many are not.
For example, we require all requests to specify a GUID to identify the partner, and a request can be either a "get latest" or a "search" action:
For a search: http://services.null.ext/?id=[GUID]&q=[Search Keywords]
Latest data in category: http://services.null.ext/?id=[GUID]&category=[ID]
Structuring a RESTful URL scheme for these parameters is easy:
Search: http://services.null.ext/[GUID]/search/[Keywords]
Latest: http://services.null.ext/[GUID]/latest/category/[ID]
But how should we handle the dozen or so optional parameters we have? Many of these are mutually exclusive, and many are required in combinations. Very quickly, the number of possible paths becomes overwhelmingly complex.
What are some recommended practices for how to map URLs with complex query strings to friendlier /REST/ful/paths?
(I'm interested in conventions, schemes, patterns, etc. Not specific technologies to implement URL-rewriting on a web server or in a framework.)
You should leave optional query parameters in the Query string. There is no "rule" in REST that says there cannot be a query string. Actually, it's quite the opposite. The query string should be used to alter the view of the representation you are transferring back to the client.
Stick to "Entities with Representable State" for your URL path components. Category seems OK, but what exactly is it that you are feeding over XML? Posts? Catalog Items? Parts?
I think a much better REST taxonomy would look like this (assuming the content of your XML feed is an "article"):
http://myhost.com/PARTNERGUID/articles/latest?criteria1=value1&criteria2=value2
http://myhost.com/PARTNERGUID/articles/search?criteria1=value1&criteria2=value2
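URLs of this shape could be generated along the following lines. This is only a sketch: the host and the criteria names come from the example above, while the function name and the rule of dropping unset options are assumptions:

```python
from urllib.parse import quote, urlencode

ACTIONS = {"latest", "search"}


def feed_url(partner_guid, action, **optional):
    """Put the entity in the path and the optional criteria in the query string."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    base = f"http://myhost.com/{quote(partner_guid, safe='')}/articles/{action}"
    # Drop options the caller left unset; sort for a stable, cacheable URL.
    params = sorted((k, v) for k, v in optional.items() if v is not None)
    query = urlencode(params)
    return f"{base}?{query}" if query else base


if __name__ == "__main__":
    print(feed_url("PARTNERGUID", "latest"))
    print(feed_url("PARTNERGUID", "search", criteria1="value1", criteria2="value2"))
```

The path stays the same however many optional parameters are added, which avoids the combinatorial explosion of paths described in the question.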
If you're not thinking about the entities you are representing while building your REST structure, you're not doing REST. You're doing something else.
Take a look at this article on REST best practices. It's old, but it may help.
Parameters with values? One option is the query string. Using it is not inherently non-RESTful. Another option is to use semicolons; Tim Berners-Lee talks about them, and they might just fit the bill, allowing the URL to make sense without massively long paths.

Resources