How to query to get the list of instance counts for every Freebase type?

I want to annotate a corpus using Freebase types. But almost every instance in Freebase has several types, so I decided to choose the most common type as the instance's type. Is there a way to get a list of the instance count for each type? I found this query, but it doesn't seem right because the result only has about 400 types, and I think there are far more types than that.
[{
"id": null,
"name": null,
"type": "/freebase/type_profile",
"/freebase/type_profile/instance_count": []
}]

I question the premise, but let's talk about that at the end after answering your question.
That's (close to) the correct query. When I ask for the count by adding "return" : "count", I get 17,972, which sounds about right. Perhaps your query framework is adding a "limit" : 400 somehow?
Since you want the most common, why don't we modify the query to sort them. Due to a quirk in the sorting, nulls sort last (or first in our reversed sort), so we'll also add a qualifier to filter them out. We could use >0, but since presumably you aren't interested in low frequency types, let's use >1000 instead.
The final query looks like this:
[{
"id": null,
"name": null,
"type": "/freebase/type_profile",
"instance_count>": 1000,
"instance_count": null,
"sort": "-instance_count"
}]
which will return an ordered list of 849 types sorted in descending order by instance count.
You'll probably want to do a little hand curation of the resulting list to eliminate things like /common/topic, /common/document, /book/isbn, /book/pagination, etc. Mediator types won't also have /common/topic, so you could filter on that first (but depending on the kinds of things in your corpus, they may all be topics, i.e. entities, to start with).
Now back to the premise that most frequent == best. Depending on your application, you may actually want more specific (which usually means lower frequency) types, rather than broader, high frequency types. For example, Deceased Person rather than Person, or Politician, Author, or Athlete, in preference to Person. You may want to consider using the least frequent type that is still used at least some threshold number of times. The other thing that you may want to do is blacklist non-commons types (i.e. types rooted at /base/... or /user/...), which haven't been as carefully curated.
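To make the "least frequent type above a threshold" idea concrete, here is a minimal Python sketch. It assumes you have already loaded the query results into a dict mapping type id to instance count; the function name, the exclusion set, and the default threshold are illustrative choices, not anything Freebase provides.

```python
from typing import Dict, Iterable, Optional

# Hypothetical exclusion list, seeded with the types mentioned above.
EXCLUDED_TYPES = {"/common/topic", "/common/document", "/book/isbn", "/book/pagination"}
# Non-commons namespaces that have not been as carefully curated.
NON_COMMONS_PREFIXES = ("/base/", "/user/")


def pick_type(instance_types: Iterable[str],
              instance_counts: Dict[str, int],
              min_count: int = 1000) -> Optional[str]:
    """Return the most specific (least frequent) acceptable type, or None."""
    candidates = [
        t for t in instance_types
        if t not in EXCLUDED_TYPES
        and not t.startswith(NON_COMMONS_PREFIXES)
        and instance_counts.get(t, 0) > min_count
    ]
    if not candidates:
        return None
    # Lowest count wins; ties are broken by type id so the result is stable.
    return min(candidates, key=lambda t: (instance_counts[t], t))
```

Swapping `min` for `max` gives the original "most common type" behaviour, so it is easy to compare both strategies on the same corpus.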
EDIT - word of warning:
Those counts were last updated in 2012. That should be fine for an exercise like this where you just want a rough ordering, but if you need current stats, you'll need to either count occurrences in the Freebase data dump or figure out the separate Stats API which I'm not sure is public/documented http://freebase-site.googlecode.com/svn/trunk/www/lib/queries/stats.sjs

Related

Riak: searchable list of maps (with CRDTs)

I have a use case that consists of storing several maps under one attribute. It's the equivalent of this JSON list:
[
{"tel": "+33123456789", "type": "work"},
{"tel": "+33600000000", "type": "mobile"},
{"tel": "+79001234567", "type": "mobile"}
]
Also, I'd like to have it searchable at the object level. For instance:
- search entries that have a mobile number
- search entries that have a mobile number and whose number starts with the string "+7"
Since Riak doesn't support sets of maps (only sets of registers), I was wondering if there is a trick to achieve it. So far I've had 2 ideas:
Map of maps
The idea is to generate (randomly?) keys for the objects, and store each object of the list in a parent map whose keys exist only so that each object has a key.
It seems to me that this doesn't allow searching the objects (the maps inside the parent map), because Riak Solr search requires the full path to the attribute. One cannot simply write the following Solr query: phone_numbers.*.tel:+7*. Composed searches (e.g. search entries that have a mobile number and whose number starts with the string "+7") also seem hard to achieve.
Sets with simulated multi-valued attributes
This solution consists of using a set and inserting all the values of the object as a single string, with separators between them. Yes, it's a hack. An attribute value would look like: $tel:+79001234567$type:mobile$ with : as the attribute name-value separator and $ as the attribute separator.
Search could be feasible using the * wildcard (ugly, but still), except for the problem of escaping the separators: what if the type attribute includes a :? Are there separators that are accepted by Riak and would not be acceptable in a string (I'm thinking of control characters)?
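The separator encoding described above can be sketched in Python as follows. Backslash escaping is just one possible answer to the "what if the value contains a :" problem; it is an assumption of this sketch, not something Riak prescribes, and whether Solr wildcard queries behave well on the escaped strings is untested here. The function names are hypothetical.

```python
from typing import Dict

FIELD_SEP = "$"   # separates attributes
KV_SEP = ":"      # separates an attribute name from its value
ESCAPE = "\\"     # assumed escape character for this sketch


def _escape(text: str) -> str:
    # Escape the escape character first so it is not doubled twice.
    return (text.replace(ESCAPE, ESCAPE + ESCAPE)
                .replace(FIELD_SEP, ESCAPE + FIELD_SEP)
                .replace(KV_SEP, ESCAPE + KV_SEP))


def encode_entry(entry: Dict[str, str]) -> str:
    """Flatten one map into a single string suitable for a set member."""
    fields = (_escape(k) + KV_SEP + _escape(v) for k, v in sorted(entry.items()))
    return FIELD_SEP + FIELD_SEP.join(fields) + FIELD_SEP


def decode_entry(encoded: str) -> Dict[str, str]:
    """Inverse of encode_entry."""
    result: Dict[str, str] = {}
    key, buf = None, []
    chars = iter(encoded)
    for ch in chars:
        if ch == ESCAPE:
            buf.append(next(chars))      # take the escaped character literally
        elif ch == KV_SEP and key is None:
            key, buf = "".join(buf), []  # name finished, value starts
        elif ch == FIELD_SEP:
            if key is not None:
                result[key] = "".join(buf)
            key, buf = None, []
        else:
            buf.append(ch)
    return result
```

Sorting the keys keeps the encoding deterministic, so the same map always produces the same set member.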
In a nutshell
I'm looking for a solution whatever the hackiness of it, as long as it works. Any idea is welcome.

Finding all entity names from deprecated freebase

I'm training a few Machine learning models that represent words as vectors, using freebase as training data. Since the API has been deprecated, I'm working with raw freebase dump, which is now a list of 3.1 billion triples, containing more than 500 million distinct entities (subject/object), and I'd like to reduce this number.
I would like to remove all triples which simply denote names of subjects so that only triples containing MIDs remain. However, I've found multiple possible predicates that define the 'name' of an entity.
i) common.notable_for.display_name
ii) type.object.name
iii) /rdf-schema#label
I have 3 questions :
a) Is there any difference between the above predicates?
b) Are there any additional predicates which also describe the names of entities?
c) Apart from the triple where a name is defined, does the name ever appear in other triples, instead of the MID?
Thank you for your help!
You should concentrate only on type.object.name; that's the schema property holding the topic's name.
The /rdf-schema#label predicate is an equivalent that comes from RDF Schema; it is not part of the Freebase schema.
The description of common.notable_for.display_name is: "Localized/gender appropriate display name for the notable object." It is a property within a CVT (compound value type) and it holds a different kind of information: of all the types that a topic has, which is its most "important" one. As far as I remember, "Larry Page" was an "entrepreneur". So you don't need this property. Concentrate on the TON, type.object.name.
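For the filtering step itself, here is a minimal Python sketch. It assumes the dump is tab-separated with full URIs in angle brackets (e.g. `<http://rdf.freebase.com/ns/type.object.name>`) and that MIDs appear as `<http://rdf.freebase.com/ns/m.…>`; check a few lines of your copy of the dump before relying on these constants. The function names are hypothetical.

```python
from typing import Iterable, Iterator

# Assumed URI layout of the Freebase RDF dump; verify against your file.
NS = "http://rdf.freebase.com/ns/"
NAME_PREDICATE = "<" + NS + "type.object.name>"
MID_PREFIX = "<" + NS + "m."


def keep_triple(line: str) -> bool:
    """Keep only MID-to-MID triples, dropping name triples explicitly."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return False  # malformed or empty line
    subject, predicate, obj = parts[:3]
    if predicate == NAME_PREDICATE:
        return False
    return subject.startswith(MID_PREFIX) and obj.startswith(MID_PREFIX)


def filter_triples(lines: Iterable[str]) -> Iterator[str]:
    """Lazily filter an iterable of dump lines (e.g. an open file)."""
    return (line for line in lines if keep_triple(line))
```

Because literal objects never start with the MID prefix, requiring a MID on both sides also drops label and display-name triples without listing those predicates one by one.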

Adding a new user to neo4j

A total Neo4j noob here.
I'd like to create a graph to store a set of users. A typical user looks like this:
CREATE
(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),
(node_2 {Address:"Newyork",CountryCode:"US"}),
(node_3 {Location:"Hidden"}),
(node_4 {phoneNumber:11111}),
(node_5 {InternetEmailAddress:"peter#peterland.com"})
Now the problem is:
Every time I execute this, I add 5 more nodes.
I know I need to use a unique key, but all the examples I saw use a unique key for one specific node. So how can I make sure a user doesn't get added if it already exists? (I can use the email address as the unique key.)
How do I update the nodes if some changes occur? For example, after a week I want to update the graph to contain the following instead of the previous one (no duplicates):
CREATE(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),(node_2 {Address:"Newyork",CountryCode:"US"}),(node_3 {Location:"public"}),(node_4 {phoneNumber:11111}),(node_5 {InternetEmailAddress:"peter#peterland.com"}),(node_6 {status:"Jailed"})
(NOTE: the new update changed the location to "public" and added a new node for Peter.)
Seeing as you had a load of nodes anyway.
Some of the data you have modelled as Nodes are probably properties, as the other answer suggests; some are possibly correctly modelled as Nodes; and one could probably form the relationship, or part of it.
Location public/hidden can be modelled in one of three ways: as a property on the Person, as a property on the relationship between the Person and the Location, or as the relationship type. To understand that, you first need to have a relationship.
Your address at the moment is another Node, I think this is correct, but possibly you would want two nodes, related something like this:
(s:State)-[:IN_COUNTRY]-(c:Country)
YMMV, and clearly that's a US-centric model, but you can extend it easily enough.
Now you could create Peter with a LIVES_IN relationship:
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN]->(s), (s)-[:IN_COUNTRY]->(c)
For speed you are better off modelling two relationship types, which could be LIVES_IN_PUBLIC and LIVES_IN_HIDDEN. That means that to perform the update you want above, you have to delete the one and create the other. However, if speed is not of the essence, it is also common to use properties on the relationship.
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c)
So your complete Q&A:
// Initial creation
CREATE (p:Person {fullName:"Peter Parker",firstName:"peter",familyName:"parker", phoneNumber:11111, internetEmailAddress:"peter#peterland.com"}),
(s:State {name:"New York"}), (c:Country {code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c);
// Update a week later, run as a separate statement
MATCH (p:Person {internetEmailAddress:"peter#peterland.com"})-[li:LIVES_IN]->()
SET li.public = true, p.status = "jailed"
When adding other People you probably do not want to recreate States and Countries, rather you want to match them, and possibly Merge them, but we'll stick to Create.
MATCH (s:State{name:"New York"})
CREATE (p:Person{name:"John Smith", internetEmailAddress:"john#google.com"})-[:LIVES_IN{public:false}]->(s)
John Smith now implicitly lives in the US too as you can follow the relationship through the State Node.
Treatise complete.
I think you're modeling your data incorrectly here - you're setting up each property of the person as a separate node, which is not a good idea. You don't have any linkages between those nodes, so with this data pattern, later on you won't be able to tell what Peter Parker's address is. You're also not using node labels, which I think could really help here.
The quick answer to your question about updating nodes is that you have to MATCH them, then use SET to modify a property. So if you had a person, you might do this:
MATCH (p:Person { FullName: "Peter Parker" })
SET p.Address = "123 Fake Street"
RETURN p;
But notice I'm making assumptions about the way your data is structured. Taking the same data you provided, this might be a better way of creating it:
CREATE (node_1:Person {FullName:"Peter Parker",
FirstName:"peter",
FamilyName:"parker",
Address:"Newyork",CountryCode:"US",
Location:"Hidden",
phoneNumber:11111,
InternetEmailAddress:"peter#peterland.com"});
The difference with this suggestion is that I'm putting all the properties into a single node (instead of one property per node) and I'm applying the Person label to the node.
If you structured the data like this, then the update query I provided would work. Structuring the data like you have it, it's not possible to update Peter Parker's address, because there's no relationship between your node_1 and node_2
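On the "don't add the user again if it already exists" part of the question: Cypher's MERGE matches a node by the given key and creates it only when no match exists, so keying it on the email address avoids duplicates. Below is a minimal Python sketch that only builds the statement and its parameters; the helper name and the choice of internetEmailAddress as the key are assumptions taken from the question, and actually running it (for example through a Neo4j driver session) is left out.

```python
from typing import Any, Dict, Tuple

# MERGE finds the Person with this email or creates it; SET += then adds or
# overwrites the listed properties without touching any others.
UPSERT_PERSON = (
    "MERGE (p:Person {internetEmailAddress: $email}) "
    "SET p += $props "
    "RETURN p"
)


def build_upsert(email: str, props: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return the Cypher statement and the parameter map for one user."""
    return UPSERT_PERSON, {"email": email, "props": dict(props)}


# Running this twice with the same email would update the node, not duplicate it.
query, params = build_upsert(
    "peter#peterland.com",
    {"fullName": "Peter Parker", "status": "Jailed"},
)
```

Passing values as parameters rather than formatting them into the string also avoids quoting problems in names and addresses.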

How to extract Companies per country from Freebase

I am new to Freebase. I am trying to extract all companies per country (the country of the headquarters). The simplest approach I could think of was to list them all and filter by country, as in this test:
[{
"name": null,
"type": "/organization/organization",
"/location/location/containedby": "Japan",
"limit": 4
}]
The problem is that I get schools as well. Unlike DBpedia, which has a class called "Company", Freebase has no obvious type for companies, so it is not clear how to distinguish them. I thought /organization/organization would do, but it is too general, and there is also the Business domain.
Why not use /business/business_operation or /business/consumer_company or some other more appropriate type if /organization/organization is too broad?
A bigger issue with your query is that it is only going to find entities contained directly in Japan, not those contained in all locations contained by Japan (e.g. prefectures, cities, etc.). You may want to investigate using the Freebase Search API instead of MQL, since I think it will compute the closure for you (or do radius searches). Alternatively, you'll probably need to run a few variations of your query with different levels of location nesting.
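The "few variations with different levels of nesting" could be generated along these lines. This is a rough Python sketch that only builds the MQL structures; it reuses the properties from the question and the /business/business_operation type suggested above, and the function name and depth parameter are made up for illustration.

```python
import json
from typing import Any, Dict, List

CONTAINED_BY = "/location/location/containedby"


def containment_query(type_id: str, place: str, depth: int,
                      limit: int = 4) -> List[Dict[str, Any]]:
    """Build an MQL query whose containedby constraint is nested `depth` times."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    constraint: Any = place
    # Each extra level wraps the constraint in another containedby hop.
    for _ in range(depth - 1):
        constraint = {CONTAINED_BY: constraint}
    return [{
        "name": None,
        "type": type_id,
        CONTAINED_BY: constraint,
        "limit": limit,
    }]


# One query per nesting level: directly in Japan, one hop away, two hops away.
queries = [containment_query("/business/business_operation", "Japan", d)
           for d in (1, 2, 3)]
serialized = [json.dumps(q) for q in queries]
```

You would then run each query and merge the results, removing duplicates by id or name.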
Here are some example search queries/filters:
https://developers.google.com/freebase/v1/search-output
restaurants near SF Ferry building - filter=(all type:restaurant (within radius:1000ft lon:-122.39 lat:37.7955))
https://developers.google.com/freebase/v1/search-cookbook
Japanese volcanos - filter: (all category:volcano (any part_of:japan))

How to limit freebase results by release date?

How can I limit the retrieved results by release date, when I am trying to get a list of books in a time interval?
I have tried the following query, but it doesn't seem to work (it still retrieves books from after 1900):
https://www.googleapis.com/freebase/v1/mqlread?query= [{"type":"/book/book","id":null,"name":null,"/book/written_work/author": [],"/book/written_work/date_of_first_publication>":"1900","/book/book/genre": [],"/book/written_work/subjects":[],"limit":10,"key": {"namespace":"/wikipedia/en_id","value":null}}]&indent=1
Also how do you select that you want the release date returned, when you use the above query?
As user007 mentioned, you've got the sign of your comparison wrong. To get the actual date returned, just ask for it the same way you would normally, by including the property name with no value:
[{"type":"/book/book",
"id":null,
"name":null,
"/book/written_work/author": [],
"/book/written_work/date_of_first_publication<":"1900",
"/book/written_work/date_of_first_publication": null,
"/book/book/genre": [],
"/book/written_work/subjects":[],
"limit":10,
"key": {"namespace":"/wikipedia/en_id","value":null}}]
Note that the date_of_first_publication field is likely to be substantially less well populated than the publication dates of the editions, so you may want to consider changing your query to look for books which have at least one edition published before 1900.
I've done a substantial amount of work with the Freebase book data, so feel free to message me if you need more help.
Change date_of_first_publication>":"1900" to date_of_first_publication<":"1900" so that it no longer "retrieves books from after 1900".
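If you build the request in code rather than by hand, letting a library do the URL encoding avoids problems with the spaces, quotes and "<" in the query. Here is a minimal Python sketch that uses the endpoint from the question (which may no longer respond); the function name is hypothetical, and only a subset of the properties from the query above is included to keep it short.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the question; treat its availability as an assumption.
MQLREAD_URL = "https://www.googleapis.com/freebase/v1/mqlread"


def books_before_url(year: str, limit: int = 10) -> str:
    """Return an mqlread URL for books first published before `year`."""
    query = [{
        "type": "/book/book",
        "id": None,
        "name": None,
        # "<" restricts the results; the bare property asks for the value back.
        "/book/written_work/date_of_first_publication<": year,
        "/book/written_work/date_of_first_publication": None,
        "limit": limit,
    }]
    return MQLREAD_URL + "?" + urlencode({"query": json.dumps(query), "indent": 1})


url = books_before_url("1900")
```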

Resources