Finding all entity names from deprecated freebase - freebase

I'm training a few Machine learning models that represent words as vectors, using freebase as training data. Since the API has been deprecated, I'm working with raw freebase dump, which is now a list of 3.1 billion triples, containing more than 500 million distinct entities (subject/object), and I'd like to reduce this number.
I would like to remove all triples which simply denote names of subjects so that only triples containing MIDs remain. However, I've found multiple possible predicates that define the 'name' of an entity.
i) common.notable_for.display_name
ii) type.object.name
iii) /rdf-schema#label
I have 3 questions :
a) Is there any difference between the above predicates?
b) Are there any additional predicates which also describe the names of entities?
c) Apart from the triple where a name is defined, does the name ever appear in other triples, instead of the MID?
Thank you for your help!

You should only concentrate on the type.object.name that's the schema property holding the topic's name.
The /rdf-schema#label is equalization, it is not part of the freebase schema.
The common.notable_for.display_name description is: "Localized/gender appropriate display name for the notable object.", it is also a property within a CVT (compound value type) and it holds different type of information: "of all types that a topic has, what't it most "important". As far as I remember "Larry Page" was an "entrepreneur". So you don't need this property. Concentrate on the TON type.object.name.

Related

Using shacl to validate a property that has at most one value in its properties

I'm trying to create a shacl based on the ontology that my organization is developing (in dutch): https://wegenenverkeer.data.vlaanderen.be/
The objects described have attributes (properties), that have a specified datatype. The datatype can a primitive (like string or decimal) or complex, which means the property will have properties itself (nested properties). For example: an asset object A will have an attribute assetId which is a complex datatype DtcIdentificator, which consists of two properties itself. I have succesfully created a shacl that validates objects by creating multiple shapes and nesting them.
I now run into the problem of what we call union datatypes. These are a special kind of complex datatypes. They are still nested datatypes: the attribute with the union datatypes will have multiple properties but only exactly zero or one of those properties may have a value. If the attribute has 2 properties with values, it is invalid. How can I create such a constraint in shacl?
Example (in dutch): https://wegenenverkeer.data.vlaanderen.be/doc/implementatiemodel/union-datatypes/#Afmeting%20verkeersbord
A traffic sign (Verkeersbord, see https://wegenenverkeer.data.vlaanderen.be/doc/implementatiemodel/signalisatie/#Verkeersbord) can have a property afmeting (size) of the datatype DtuAfmetingVerkeersbord.
If an asset A of this type would exist, I could define its size as (in dotnotation):
A.afmeting.rond.waarde = 700
-or-
A.afmeting.driehoekig.waarde = 400
Both are valid ways of using the afmeting property, however, if they are both used for the same object, this becomes invalid, as only one property of A.afmeting may have a value.
I have tried using the union constraint in shacl, but soon found out that that has nothing to do with what we call "union datatypes"
I think the reason you are struggling is because this kind of problem is usually modelled differently. Basically you have different types of Traffic signs and these signs can have measurements. With the model as you described, A.afmeting.rond.waarde captures 2 ideas using 1 property: (a) the type and (b) the size. From your question, this seems to be the intend. However, this is usually not how this kind of problem is addressed.
A more intuitive design is for Traffic sign to have 2 different properties: (a) type and (b) a measurement. The Traffic sign types are achthoekig, driehoekig, etc. Then you can use SHACL to check that a traffic sign has either both or no properties for a traffic sign.

Riak: searchable list of maps (with CRDTs)

I have a use-case which consist in storing several maps under an attribute. It's the equivalent JSON list:
[
{"tel": "+33123456789", "type": "work"},
{"tel": "+33600000000", "type": "mobile"},
{"tel": "+79001234567", "type": "mobile"}
]
Also, I'd like to have it searchable at the object level. For instance:
- search entries that have a mobile number
- search entries that have a mobile number and whose number starts with the string "+7"
Since Riak doesn't support sets of maps (but only sets of registers), I was wondering if there is a trick to achieve it. So far I've had 2 ideas
Map of maps
The idea is to generate (randomly?) keys for the objects, and store each object of the list in a parent map whose keys are the ones generated for this only purpose to have a key.
It seems to me that it doesn't allow to search the object (maps inside the parent map) because Riak Solr search requires the full path to the attribute. One cannot simply write the following Solr query: phone_numbers.*.tel:+7*. Also composed search (eg. search entries that have a mobile number and whose number starts with the string "+7") seem hard to achieve.
Sets with simulated multi-valued attributes
This solution consists in using a set and insert all the values of the object as a single string, with separators between them. Yes, it's a hack. An attribute value would look like: $tel:+79001234567$type:mobile$ with : as the attribute name-value separator and $ as the attribute separator.
Search could be feasible using the * wildcard (ugly, but still), except the problem with escaping the separators: what if the type attribute includes a :? Are they some separators that are accepted by Riak and would not be acceptable in a string (I'm thinking of control characters)?
In a nutshell
I'm looking for a solution whatever the hackiness of it, as long as it works. Any idea is welcome.

XML for Analysis (XML/A) format of member names?

I have two different XML/A providers, Mondrian and icCube. The tuples for a time dimension contain the unique name for the member, but the format of the member name is different:
Mondrian:
<UName>[Time].[2004].[QTR2].[Apr]</UName>
<Caption>Apr</Caption>
[Time] is the name of the hierarchy
[2004] is the name of the ancestor at the Year level
[QTR2] is the name of the ancestor at the Quarter level
[Apr] is the local name of the member at the Month level
icCube:
<UName>[Time].[Calendar].[Month].&[Jun 2010]</UName>
<Caption>Jun 2010</Caption>
[Time] is the name of the dimension
[Calendar] is the the name of the
hierarchy
[Month] is the name of the level
[Jun 2010] is the name of
the month member.
(I don't know why the ampersands are there)
My question is, is there any recommended, preferably standard way to figure out how the member names are formatted?
The reason I want to know this is when I render the result in a Pivot table, the captions for the members will usually end up as labels on the headers of the pivot table. But since the captions may not be unique, it is desirable to also produce labels of the "ancestor" members, because together they do identify the member uniquely.
In my example, I could use the parts of the member unique name to do this, but in ic cube not,since the member u name is structured differently.
I have 2 questions:
1) How can I tell beforehand what format the XML/A provider will use to identify the members?
2) What would be the recommended way in ic cube to produce the labels for the ancestor members?
UPDATE:
Luc Boudreau informed me that the ampersand indicates "key notation" - it designates the member key rather than its name. Thanks Luc!
The meaning of unique names in MDX is a string that guarantees that it defines a unique MDX entity when parsed. There is no possible collision with another MDX entity. The way to write it depends on the XMLA provider. Even though it's 'unique' there are multiple ways creating it, each server chooses its way.
Never mind, a query written in one server will work in another as both 'unique' names are correctly parsed.
& amp; stands for &
Our advise, the client code should not rely on the format of the unique names.
That being said, if you need parent "names", you should retrieve them explicitly using the "Parent" function and/or as a calculated measure retrieving the name/caption property.
Hope that helps.

How does one convert string to number in JDOQL?

I have a JDOQL/DataNucleus storage layer which stores values that can have multiple primitive types in a varchar field. Some of them are numeric, and I need to compare (</>/...) them with numeric constants. How does one achieve that? I was trying to use e.g. (java.lang.)Long.parse on the field or value (e.g. java.lang.Long.parseLong(field) > java.lang.Long.parseLong(string_param)), supplying a parameter of type long against string field, etc. but it doesn't work. In fact, I very rarely get any errors, for various combinations it would return all values or no values for no easily discernible reasons.
Is there documentation for this?
Clarification: the field is of string type (actually a string collection from which I do a get). For some subset of values they may store ints, e.g. "3" string, and I need to do e.g. value >= 2 filters.
I tried using casts, but not much, they do produce errors, let me investigate some more
JDO has a well documented set of methods that are valid for use with JDOQL, upon which DataNucleus JDO adds some additional ones and allows users to add on support for others as per
http://www.datanucleus.org/products/accessplatform_3_3/jdo/jdoql.html#methods
then you also can use JDOQL casts (on the same page as that link).

DBpedia : Get list of Chinese universities and their adresses to populate google map?

I'm trying to get list of Chinese universities and their adresses. The minimum being the City/Town name. I will use these addresses to populate a googlemap, fiddle here.
I saw interesting code such as:
SELECT ?resource ?value
WHERE {
?resource a <http://dbpedia.org/class/yago/CitiesAndTownsInDenmark> .
?resource <http://dbpedia.org/property/populationTotal> ?value .
FILTER (?value > 100000)
}
ORDER BY ?resource ?value
Since CitiesAndTownsInChina doesn't work,
1. Where to find the exact name of the class I'am targeting ? and
2. Where to find dbpedia's operators manual ?
Note: I'am a very active user on Wikipedia, I'am well aware of all the data available there, but the dbpedia ontology/syntaxe/keywords is quite hard to get.
Personal note: queries on http://dbpedia.org/snorql/ , http://dbpedia.org/sparql/ , http://querybuilder.dbpedia.org/
(Expanding on my reply to How to find cities with more than X population in a certain country)
CitiesAndTownsInDenmark exists because people use the category http://en.wikipedia.org/wiki/Category:Cities_and_towns_in_Denmark in wikipedia. Wikipedia categories are pretty loose and as a result there's a lot of variation in style, so even if a useful category exists the name may not be guessable.
In addition categories are maintained manually, and may not be consistently applied.
A good place to start is looking at the data. Visiting http://dbpedia.org/page/Beijing I see yago:MetropolitanAreasOfChina which seems promising, but if you follow that link you'll see it's not well populated.
As a consequence avoid relying on the existence of such categories and directly querying for populated places in a country. This information comes from wikipedia infoboxes, and they're much more consistent than categories. Taking Beijing as an exemplar again I found:
select ?s {
?s a <http://dbpedia.org/ontology/PopulatedPlace> ;
<http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/China>
}
(The relevant properties and values for my query were found by copying link location in the Beijing page)
with the result:
"http://dbpedia.org/resource/Hulunbuir"
"http://dbpedia.org/resource/Guangzhou"
"http://dbpedia.org/resource/Chongqing"
"http://dbpedia.org/resource/Kuqa_County"
"http://dbpedia.org/resource/Changzhou"
... nearly 3000 results ...
You'll notice that position is encoded multiple times (geo:lat and long, georss:point, various dbpprop:latd longd things), and there seem to be two values excitingly. You can either simply deal with the multiple values in whichever format you prefer, or try picking just one using GROUP BY and SAMPLE.
As for a manual, almost everything I know of are academic papers, and not very useful. However the data is reasonably self documenting.
for your first question:
you can see possible classes by querying one member of your intended set of entities (ex: Shanghai).
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type WHERE {
<http://dbpedia.org/resource/Shanghai> rdf:type ?type.
FILTER regex(str(?type), ".*China", "i").
} LIMIT 100
which gives this result:
dbpedia:class/yago/MetropolitanAreasOfChina [http]
dbpedia:class/yago/PortCitiesAndTownsInChina [http]
dbpedia:class/yago/MunicipalitiesOfThePeople'sRepuBlicOfChina [http]
dbpedia:class/yago/PopulatedCoastalPlacesInChina [http]
they are CamelCase versions of the categories that you will find at the bottom of wikipedia pages. I was fooled for a while by the erroneous capitalization of RepuBlic and finally saw that it contains only 4 cities, so it is of limited use for you.
so I would propose to go with #user205512 answer and get the cities by linking 2 properties.
for your second question:
I would advice you to search/ask on http://answers.semanticweb.com

Resources