How to query elements from a list of items in Wikidata?

There is a list of proper names of stars here: https://www.wikidata.org/wiki/Q1433418
How can I query this in the Wikidata Query Service so that all individual star names are listed, along with other data in the list, such as Constellation?
In other words, how do I get at the members of the list? "Instance of" doesn't seem to work.

The confusion comes from the fact that List of proper names of stars (Q1433418) is an item that centralizes links to the Wikipedia pages playing this role in the different Wikipedia editions. It doesn't play any meaningful role in Wikidata itself: no item is an instance of (P31) List of proper names of stars (Q1433418).
You would have more luck looking for items that are an instance of (P31) star (Q523), or an instance of something that is a subclass of (P279) star, a pattern you will find in many of the SPARQL query examples: ?star wdt:P31/wdt:P279* wd:Q523 .
That could give this query (json version).
And if you're into JS, you can parse the JSON result with this function I wrote: wdk.simplifySparqlResults
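For those not into JS, here is a minimal Python sketch of the same idea. It only builds the query text and flattens a response in the standard SPARQL JSON result format; it does not call the endpoint. Assumptions on my part: P59 as the "constellation" property, the label service for English labels, and the LIMIT clause. simplify_sparql_results is a hypothetical helper written for this example (it is not the wdk function), and the sample response is made up.

```python
# Public Wikidata Query Service endpoint (shown for reference; not called here).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of star (Q523) or of any of its subclasses,
# with an optional constellation. Assumption: P59 is "constellation".
STAR_QUERY = """
SELECT ?star ?starLabel ?constellationLabel WHERE {
  ?star wdt:P31/wdt:P279* wd:Q523 .
  OPTIONAL { ?star wdt:P59 ?constellation . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""


def simplify_sparql_results(response):
    """Flatten a SPARQL JSON response into a list of plain dicts."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # Variables left unbound by OPTIONAL are simply absent from the binding.
        rows.append(
            {var: binding[var]["value"] if var in binding else None for var in variables}
        )
    return rows


# Made-up sample in the SPARQL JSON result shape (the entity URIs are placeholders).
SAMPLE_RESPONSE = {
    "head": {"vars": ["star", "starLabel", "constellationLabel"]},
    "results": {
        "bindings": [
            {
                "star": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_1"},
                "starLabel": {"type": "literal", "value": "Sirius"},
                "constellationLabel": {"type": "literal", "value": "Canis Major"},
            },
            {
                "star": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_2"},
                "starLabel": {"type": "literal", "value": "Some unnamed star"},
            },
        ]
    },
}

for row in simplify_sparql_results(SAMPLE_RESPONSE):
    print(row["starLabel"], "|", row["constellationLabel"])
```

To run it for real you would send STAR_QUERY to the endpoint with a JSON Accept header and pass the decoded response to the helper.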

I would not take official names of stars from there. Wikipedia is one of the most useful resources for getting first-hand, somewhat organised information on any topic. It is irreplaceable for this, and not having it would be a great loss. However, its content is vulnerable to vandalism and clumsy edits.
For the (only) official proper names of stars, the IAU started an effort this year, and I would use that as the reference. The list is also stored in a text file that is easy to retrieve programmatically, and it is updated as the Committee accepts more star names. It is here:
http://www.pas.rochester.edu/~emamajek/WGSN/IAU-CSN.txt
As you can see, the file is laid out in a format ready for use by software applications. It was made to meet needs like yours.

How to add customized tokens into solr to change the indexing token behaviour

It's a Drupal site with Solr for search. My main complaint is the quality of the search results for Chinese text. The tokenizer breaks the text into what it takes to be words. Most of the splits are reasonable, but it still makes mistakes: sometimes it breaks a valid token into pieces, and sometimes it fails to break where it should.
Imagine I am writing Chinese now: "big data analysis" is one word which shouldn't be broken, so a search for it should find it. I also want people to get "AI and big data analysis training" as the first hit when they search for the exact phrase "AI and big data analysis training".
So I want a way to intervene in, or compensate for, the current tokenization to make the search smarter.
Maybe there is a file in Solr that lets me write these tokens down manually and relate them to certain phrases, so that Solr can use it as a reference every time it indexes?
There are different steps you can take to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
"big data analysis is one word which shouldn't be broken. So my search on it should find it." -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax[1] query parser with phrase boosts at various levels to boost consecutive tokens or phrases (pf, pf2, pf3, ... ps, ps2, ps3, ...).
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
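To make point 2 concrete, here is a rough Python sketch of the request parameters for such a query. The parameter names (defType, q, qf, pf, pf2, pf3, ps) are the edismax ones from the guide above; the field names title and body, the boost values, and the core URL are assumptions for illustration, and a Drupal-generated schema will use different field names.

```python
from urllib.parse import urlencode

# Hypothetical core URL; replace with your own Solr core.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_edismax_params(user_query):
    """Return edismax parameters that boost documents containing the query as a phrase."""
    return {
        "defType": "edismax",
        "q": user_query,
        # Fields searched term by term (hypothetical field names and boosts).
        "qf": "title^2 body",
        # Boost documents where the whole query appears as one phrase.
        "pf": "title^10 body^5",
        # Boost documents where pairs / triples of adjacent query terms appear together.
        "pf2": "title^4 body^2",
        "pf3": "title^6 body^3",
        # Phrase slop for pf: how many positions terms may move and still count as a phrase.
        "ps": "1",
    }


params = build_edismax_params("AI and big data analysis training")
print(SOLR_SELECT_URL + "?" + urlencode(params))
```

With this, a document containing the exact phrase should rank above documents that merely contain the individual tokens, whatever the tokenizer did to the phrase.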

How to create a ligature from a user dictionary in ABBYY FineReader?

I need to recognize complex chemical names from a scanned document (PDF). They contain special characters and are written in a table format. I also have an Excel document that contains ALL possible names (I would say rows, because there are no combinations) that I may encounter during scanning. Is there a way to create ligatures, so that FineReader will recognize an entire row instead of dissecting it into separate characters? I tried creating a user dictionary, but FineReader does not treat an entry as one row.
The only way to create ligatures is to use "user pattern training". In FineReader, go to Tools -> Options -> Read tab (changes slightly depending on FR version) and enable User pattern training. During training extend your box to include several combined characters, thus creating a ligature.
Recognizing formulas with this method is tough, but it may be possible.
I have done this many times in my work at www.wisetrend.com. I am a former ABBYY support employee and current integrator and OCR consulting specialist. I will be glad to help if you need more specific assistance.

Which URL describes the resource the best?

/competitions/1/clubs/5/players
/players/search?club_id=5
/players?club_id=5
When should I use a first-class URL for a resource, and when should I use a nested URL?
Update 1
Thanks for the answers so far. I'll try to clarify things a little further.
Competition and Club have a many-to-many relationship. Clubs can participate in multiple competitions. I guess that would make Club a first class entity, so the way to access a club would be for instance:
/clubs/33
But I also need to be able to access clubs that participate in a specific competition, so I need something like this too:
/competitions/2/clubs
But someone mentioned that it isn't advisable to make a resource accessible via multiple URIs. Doesn't this violate that?
Also, I presume a URI like this would not be preferable:
/competitions/2/clubs/33/players/5
But rather use this:
/clubs/33/players/5
Club has a one-to-many relationship with Player.
/competitions/1/clubs/5/players
As a URI is the identifier of a single resource, I would say the general rule is that if it is an object, it gets a 'first-class URL'.
I tend to use query parameters only when limiting/filtering lists, for example, /competitions/1/clubs/5/players?gender=MALE.
I use path elements if the relation "feels" tree-like or directory-like (a club has players: /clubs/berlin/players). Parameters are more like "tags"; I often use them for search filters (e.g. defenders of the 'berlin' club older than 22: /clubs/berlin/players?position=defender&age=22).
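A small Python sketch of that split between path and parameters: the path picks the collection, the query string only narrows it. Everything here is made up for illustration: the player data, the list_players helper, and the parameter name older_than, which I use instead of age=22 so that the "older than" meaning is explicit.

```python
from urllib.parse import urlsplit, parse_qs

# Made-up sample data for illustration only.
PLAYERS = [
    {"name": "Player A", "club": "berlin", "position": "defender", "age": 25},
    {"name": "Player B", "club": "berlin", "position": "defender", "age": 21},
    {"name": "Player C", "club": "berlin", "position": "striker", "age": 30},
    {"name": "Player D", "club": "munich", "position": "defender", "age": 28},
]


def list_players(url):
    """Resolve /clubs/{club_id}/players, then apply optional query-string filters."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3 or segments[0] != "clubs" or segments[2] != "players":
        raise ValueError("expected /clubs/{club_id}/players")

    # The path identifies the resource: the players of one club.
    club_id = segments[1]
    players = [p for p in PLAYERS if p["club"] == club_id]

    # The query parameters only filter that list.
    filters = parse_qs(parts.query)
    if "position" in filters:
        players = [p for p in players if p["position"] == filters["position"][0]]
    if "older_than" in filters:
        limit = int(filters["older_than"][0])
        players = [p for p in players if p["age"] > limit]
    return [p["name"] for p in players]


print(list_players("/clubs/berlin/players"))
print(list_players("/clubs/berlin/players?position=defender&older_than=22"))
```

Dropping every parameter still leaves a valid URL for the full list, which is a decent test of whether something belongs in the path or in the query.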
I design the URL structure by 'domain importance': the most basic concepts go at the root. If possible, don't nest the URL structure too deeply. I also try not to duplicate or alias collections that represent identical resources (that costs double maintenance in code and documentation).
Generally putting /clubs as root feels more natural: /clubs/{club_id}/players
I would only expose players through /competitions/{comp_id}/clubs/{club_id}/players if that set of players differs from /clubs/{club_id}/players, e.g. because several players are suspended during the competition or didn't make the match squad.
What do you mean by /competitions? Is it a tournament or a single match? If it is a single match between two clubs, maybe use the home and away domain concepts: /competitions/{comp_id}/home-club and /competitions/{comp_id}/away-club.
Update-1 Answer
Here are my thoughts on the question in your update:
I guess /competitions/2/clubs is a subset of /clubs, since not every club competes in every competition. The two resources are therefore different, so two URLs are fine.
Thinking about it again, /competitions/2/clubs/33/players/5 should also be fine (but it is important to avoid duplication in the server code). This URL should even be mandatory when the returned resource is a subset of /clubs/33/players (e.g. players are injured, or the team-size limit for a specific competition has been hit).
I wouldn't put the ID numbers in the URL. They mean something only to those who actually know what they stand for; for everyone else they are meaningless numbers.
You should always choose descriptive, relevant words for your URL, because the URL helps convey information about the linked resource.
Instead of using meaningless ID numbers, choose a unique name representing the name of the team or the competition, for example
/competitions/worldcup/clubs/usa/players
But if you really need to send that kind of anonymous data in the URL, then I would prefer to see it in the query string.
Use only meaningful text for the URL.

Name matching dictionary for finding first and last name variants

I have an application that will store and track visitors. These visitors are created in the system by schedulers (users) as needed when they set up a visit. The problem is that most of the time the only important unique identifiers of a visitor are as follows:
First Name
Last Name
Company Name
The risk of duplicate records existing for the same person is inherent: a scheduler may enter a new visitor record instead of searching the system for somebody already existing under that name.
When somebody enters a visitor with the same name as an existing one, I display a warning dialog with various suggestions of who this person COULD be, but even that is not good enough.
I could enter 'Jim Jones' and this person may exist in the system as 'James Jones' or 'Jimmy Jones'. I see there are name recognition software packages available, but they are expensive and certainly heavier than what I am looking for.
Would anybody know where to find a free or open source dictionary file that I can programmatically access to find potential name variants? Software or an online service would be nice, but even just a data dump or a simple text file might do.
I know even this will not prevent duplicate visitor records. I am just trying to keep them to a minimum, so it is not a critical feature.
Check out the Moby project (http://icon.shef.ac.uk/Moby/mwords.html) for common first and last names. You can precompute similar names using tools like Metaphone and Soundex and use that to identify potential matches. You also mention company names, which are a bit harder to manage since they can be made up of lots of things. For that, maybe check out the 12-dicts word list (http://wordlist.sourceforge.net/): the 2+2lemma list in that package groups multiple forms that share a common root, which can be used in conjunction with a similar-spelling solution to improve results.
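As a rough sketch of the precomputation idea, the Python below implements classic American Soundex and groups known names by their code, so that a newly entered name can be checked against the group it falls into. The helper names (soundex, build_index, similar_names) and the sample names are mine, not from any particular library, and a real system would load the names from a list such as the ones above.

```python
from collections import defaultdict

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


def build_index(names):
    """Precompute Soundex code -> list of known names."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


def similar_names(name, index):
    """Known names sharing the Soundex code of `name`, excluding the name itself."""
    return [n for n in index.get(soundex(name), []) if n.lower() != name.lower()]


index = build_index(["James", "Jimmy", "Jones", "Jonas", "Robert", "Rupert"])
print(similar_names("Jim", index))
print(similar_names("Robert", index))
```

Keep in mind that this only catches names that sound alike. Jim and Jimmy share a code, but Jim and James do not, so nickname pairs like that would still need a separate lookup table.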

Find Unique Words in One or More Columns?

I'm looking at implementing tags in my ASP.NET website. After looking at several algorithms, I'm leaning towards having a couple of database columns that contain one or more tag words. I will then use full-text search to locate rows with specified tags.
All of this seems pretty straightforward except for one thing: I need to be able to generate a list of available tags for the user to select from.
I know I can write a C# program to build the list of available tags, and then run it once every week or so, but I was just wondering if there's any SQL-method for doing stuff like this more efficiently.
Also, I can't help but notice that the words will be extracted anyway as part of building the full-text index. I don't suppose there's any way to access that information?
This isn't how I'd choose to structure this but to answer the actual question...
In SQL Server 2008 you can query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table-valued functions to get the information that you want.
Why not use a separate table for tags, with a many-to-many relationship to the table of tagged items?
I mean something like this:
--Articles
ArticleId
Text
--Tags
TagId
Name
--TagsToArticles
ArticleRef
TagRef
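To show how the list of available tags falls out of this layout, here is a small self-contained Python sketch. It uses an in-memory SQLite database purely so it can run anywhere; that is my substitution for the demo, not SQL Server. The table and column names follow the outline above, while the column types, the sample rows, and the two helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles (ArticleId INTEGER PRIMARY KEY, Text TEXT NOT NULL);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE TagsToArticles (
    ArticleRef INTEGER NOT NULL REFERENCES Articles(ArticleId),
    TagRef     INTEGER NOT NULL REFERENCES Tags(TagId),
    PRIMARY KEY (ArticleRef, TagRef)
);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO Articles VALUES (?, ?)",
                 [(1, "Full-text search basics"), (2, "Tagging in a web app")])
conn.executemany("INSERT INTO Tags VALUES (?, ?)",
                 [(1, "sql"), (2, "asp.net"), (3, "unused-tag")])
conn.executemany("INSERT INTO TagsToArticles VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2)])


def available_tags(connection):
    """Names of tags attached to at least one article, in alphabetical order."""
    rows = connection.execute("""
        SELECT DISTINCT t.Name
        FROM Tags t
        JOIN TagsToArticles ta ON ta.TagRef = t.TagId
        ORDER BY t.Name
    """)
    return [name for (name,) in rows]


def articles_with_tag(connection, tag_name):
    """Ids of the articles carrying the given tag."""
    rows = connection.execute("""
        SELECT ta.ArticleRef
        FROM TagsToArticles ta
        JOIN Tags t ON t.TagId = ta.TagRef
        WHERE t.Name = ?
        ORDER BY ta.ArticleRef
    """, (tag_name,))
    return [article_id for (article_id,) in rows]


print(available_tags(conn))
print(articles_with_tag(conn, "sql"))
```

The list of tags becomes a plain query that is always current, so there is no weekly batch job and no need to dig the words out of the full-text index.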
