How to extract Companies per country from Freebase - freebase

I am new to Freebase. I am trying to extract all companies per country (the country where the headquarters is located). The simplest approach I could think of was to list them all and filter by country, as in this test query:
[{
"name": null,
"type": "/organization/organization",
"/location/location/containedby": "Japan",
"limit": 4
}]
The problem is that I get schools as well. Unlike DBpedia, which has a class called "Company", Freebase has no obvious type for companies, so how can one distinguish them? I thought /organization/organization would do, but it is too general, and there is also the Business domain.

Why not use /business/business_operation or /business/consumer_company or some other more appropriate type if /organization/organization is too broad?
A bigger issue with your query is that it is only going to find entities contained directly in Japan, not those contained in all locations contained by Japan (e.g. prefectures, cities, etc). You may want to investigate using the Freebase Search API instead of MQL, since I think it will compute the closure for you (or do radius searches). Alternatively, you'll probably need to run a few variations of your query with different levels of location nesting.
Here are some example search queries/filters:
https://developers.google.com/freebase/v1/search-output
restaurants near SF Ferry building - filter=(all type:restaurant (within radius:1000ft lon:-122.39 lat:37.7955))
https://developers.google.com/freebase/v1/search-cookbook
Japanese volcanos - filter: (all category:volcano (any part_of:japan))
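The "few variations with different levels of location nesting" idea can be sketched as a small generator for MQL queries. This is only a sketch: the helper name containedby_query is made up for illustration, the nested-subquery form is my assumption about how to express each containment level, and /business/business_operation is just one of the candidate types mentioned above.

```python
import json


def containedby_query(country, depth, org_type="/business/business_operation", limit=4):
    """Build an MQL query for entities `depth` containment levels below `country`.

    depth=1 means "contained directly by the country", depth=2 means
    "contained by something that is contained by the country", and so on.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Innermost constraint: the outermost container is the country itself.
    constraint = [{"name": country}]
    # Wrap the constraint once per additional level of nesting.
    for _ in range(depth - 1):
        constraint = [{"/location/location/containedby": constraint}]
    return [{
        "name": None,
        "type": org_type,
        "/location/location/containedby": constraint,
        "limit": limit,
    }]


# One query per nesting level; each would be sent to the MQL read service separately.
queries = [containedby_query("Japan", depth) for depth in (1, 2, 3)]
print(json.dumps(queries[1], indent=2))
```

You would then merge the results of the individual queries and drop duplicates by id or name.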

Related

How to classify Wikidata items?

I am trying to classify items into the main categories supported by Wikidata:
Generic, Person, Organization, Events, Works, Terms, Place, Others.
These categories are listed here:
https://www.wikidata.org/wiki/Wikidata:List_of_properties
I could not find a property that specifies the main category. I looked into
the P31 "instance of" property and P279 "subclass of" but they are not what I need.
For example for "IBM" the P31 returns "public company" and "software house" and for "Swiss International Air Lines" it returns "airline".
So I cannot tell that they are both organizations.
Is there a way to do this?
One option would be to check the properties of an item, so
if an item has the P21 "sex or gender" then it's a human (or animal).
But I don't think that is stable since no property is mandatory.
I'm using the Wikidata Toolkit for my queries.
Wikidata used to have a "main type" property, but it was deleted in favour of "instance of" (P31) and a more flexible schema.
You can see lots of archived discussion about the main type at https://www.wikidata.org/wiki/Property_talk:P107
You probably want to take a look at the SPARQL endpoint at http://query.wikidata.org
Q4830453 is business enterprise / company.
To find all items that are a company or a subclass of company just do:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?item
WHERE {
?item wdt:P31/wdt:P279* wd:Q4830453
}
The query takes a little time; there are currently 150k results.
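To make clear what the path wdt:P31/wdt:P279* does (one "instance of" hop followed by any number of "subclass of" hops), here is a rough in-memory sketch of the same traversal. The function name is made up, and the subclass edges in the toy data are assumptions for illustration rather than a copy of the real Wikidata class tree; only the P31 values for IBM and Swiss International Air Lines come from the question.

```python
def is_instance_of(item, target, instance_of, subclass_of):
    """Return True if `item` is an instance of `target` or of any
    (transitive) subclass of `target`.

    instance_of maps an item to its direct classes (like P31);
    subclass_of maps a class to its direct superclasses (like P279).
    """
    seen = set()
    stack = list(instance_of.get(item, ()))  # the single P31 hop
    while stack:
        cls = stack.pop()
        if cls == target:
            return True
        if cls in seen:
            continue  # guard against cycles in the class graph
        seen.add(cls)
        stack.extend(subclass_of.get(cls, ()))  # zero or more P279 hops
    return False


# Toy data: the subclass edges below are illustrative assumptions.
instance_of = {
    "IBM": ["public company", "software house"],
    "Swiss International Air Lines": ["airline"],
    "Douglas Adams": ["human"],
}
subclass_of = {
    "public company": ["company"],
    "software house": ["company"],
    "airline": ["company"],
    "company": ["organization"],
}

for name in instance_of:
    print(name, is_instance_of(name, "organization", instance_of, subclass_of))
```

Running the same check against each of your main categories (Person, Organization, Place, and so on) gives a classification without relying on a single "main type" property.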

How would you find the most specific "filter" that matches a document? (determining which market segment a user fits in)

Imagine you have actions setup for when a user is from a certain demographic/market segment. The filters work a bit like a graph, matching for country, region, platform, operating system, and browser.
By default, you will match any value (if you specify US, you match all users from the US regardless of region, platform, OS, or browser).
If you specify multiple values for any property of the filter, they work like an OR (the user can have any of the values you specified). For the filter to match, every property must either have at least one matching value or be empty (accept all), which is essentially an AND operation.
So we can have:
Segment #1:
Countries: United States, Canada
Segment #2:
Countries: United States
Regions: New York
Platform: Tablets
Segment #3
Countries: United States
Browser: Chrome
Segment #4
Countries: United States
Segment #5
Match all (all filters left empty)
Scenario #1
User from Canada on his Tablet
Result: Segment #1
Scenario #2
User from New York, United States visits from Google Chrome on his Tablet.
Result: Segment #2, because the filter more specifically matches the user (matches country, region, and platform)
Scenario #3
User from Texas visits from his desktop
Result: Segment #4, tie with segment #1 is resolved because Segment #4 only matches United States and is therefore more specific
Work so far
I was thinking I could take each segment and load it up into a graph database that looks something like this
Country -> Region -> Platform --> OS -> Browser -> Segment
Each node either has a value (ex: United States, Chrome, Firefox, etc) and relationships that link it to any node below it in the tree (Country -> Browser is okay, Browser -> Country is not) or is null ("match all").
Each relationship (represented by ->) would also store a weight used to resolve ties. Relationships from a catch-all node get the max weight as they will always lose to a more specific filter.
Example database (numbers on the lines are the weight, lower weight becomes the preferred path)
Potential query
So now I need a query (maybe neo4j can do this?) that does the following:
Find the top level country node with the same value as the user or null
Go through each relationship (sorted by weight in ascending order)
Find the longest path, ties go to the node connected by a relationship with the lowest weight (if the tie is between a relationship to a null/catch-all node, the null node loses)
Continue this loop until we find a segment #
I'm sorry for the long post, it's hard to explain what I'm getting at via text.
What I'm looking for
Am I on the right path to solving this problem?
Are there better ways to go about this?
What would be the best way to store these relationships (graph database?)
How can I build a query that does what I want?
tl;dr: Need a decent/performant way of finding the longest/most specific path in a graph-like data structure. Comments requesting clarification or with any related information/documentation/projects/reading are very welcome.
With Neo4j, you can store properties in a relationship, example:
(u1:User{name:"foo"})-[:FRIEND_WITH{since : "2015/01/01"}]->(u2:User{name:"bar"})
I think you should store country nodes this way:
(usa:Country{name: "USA", other attributes...})
So you can find every single country by matching with Country label, and then filter with the name property to get the one you're looking for.
Same for the cities, you can do a simple relationship to store every city :
(usa:Country{ name: "USA"})-[:CONTAINS_CITY]->(n:City{name: "New York", other attributes...})
and then you can add platform etc after the city.
To match a segment related to a certain country, you can do it this way (example for Scenario #1):
Match (c:Country{name : "Canada"})-[*1..2]->(p:Platform{name : "Tablet"})-[*1..]->(s:Segment) return s
Then you can create your segments as nodes and create relationships between them. The only problem may be in this case:
User1 has a Tablet in Canada
User2 has a Tablet in Canada using Chrome
In this case, because of the variable-depth match on the relationship ([*1..]), User1 can end up in the same segment as User2. The solution is to create intermediate nodes with default values for cases where, for example, you don't have browser information.
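If the number of segments is modest, the matching rules from the question can also be checked without a graph database at all. The following is a minimal sketch of that alternative, not the Neo4j approach above: each segment is a dictionary of allowed values per property, an empty or missing property means "accept all", and specificity is scored as (number of constrained properties, then fewer allowed values). That scoring rule is my reading of the three scenarios, and all names are made up for illustration.

```python
def matches(segment_filter, user):
    """A filter matches when every constrained property contains the user's value."""
    return all(
        user.get(prop) in allowed
        for prop, allowed in segment_filter.items()
        if allowed  # an empty set means "accept all"
    )


def specificity(segment_filter):
    """More constrained properties wins; ties go to the filter with fewer allowed values."""
    constrained = [allowed for allowed in segment_filter.values() if allowed]
    return (len(constrained), -sum(len(allowed) for allowed in constrained))


def best_segment(segments, user):
    """Return the name of the most specific matching segment, or None."""
    matching = [name for name, f in segments.items() if matches(f, user)]
    if not matching:
        return None
    return max(matching, key=lambda name: specificity(segments[name]))


SEGMENTS = {
    "Segment #1": {"country": {"United States", "Canada"}},
    "Segment #2": {"country": {"United States"}, "region": {"New York"}, "platform": {"Tablet"}},
    "Segment #3": {"country": {"United States"}, "browser": {"Chrome"}},
    "Segment #4": {"country": {"United States"}},
    "Segment #5": {},  # match all
}

texas_desktop = {"country": "United States", "region": "Texas", "platform": "Desktop"}
print(best_segment(SEGMENTS, texas_desktop))
```

This is a linear scan over all segments per user, so it trades query-time cost for simplicity; an index keyed on country would be the obvious next step if the list grows.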

How to query to get the list of instance count of every freebase types?

I want to annotate a corpus using Freebase types, but almost every instance in Freebase has several types, so I decided to use the most common type as each instance's type. Is there a way to get a list of the instance counts per type? I found this query, but it doesn't seem right because the result only has about 400 types, and I think there are far more types than that.
[{
"id": null,
"name": null,
"type": "/freebase/type_profile",
"/freebase/type_profile/instance_count": []
}]
I question the premise, but let's talk about that at the end after answering your question.
That's (close to) the correct query. When I ask for the count by adding "return" : "count", I get 17,972, which sounds about right. Perhaps your query framework is adding a "limit" : 400 somehow?
Since you want the most common, why don't we modify the query to sort them. Due to a quirk in the sorting, nulls sort last (or first in our reversed sort), so we'll also add a qualifier to filter them out. We could use >0, but since presumably you aren't interested in low frequency types, let's use >1000 instead.
The final query looks like this:
[{
"id": null,
"name": null,
"type": "/freebase/type_profile",
"instance_count>": 1000,
"instance_count": null,
"sort": "-instance_count"
}]
which will return an ordered list of 849 types sorted in descending order by instance count.
You'll probably want to do a little hand curation of the resulting list to eliminate things like /common/topic, /common/document, /book/isbn, /book/pagination, etc. Mediator types don't also carry /common/topic, so you could filter on that first (but depending on the types of things in your corpus, they may all be topics, i.e. entities, to start with).
Now back to the premise that most frequent == best. Depending on your application, you may actually want more specific (which usually means lower frequency) types, rather than broader, high frequency types. For example, Deceased Person rather than Person, or Politician, Author, or Athlete, in preference to Person. You may want to consider using the least frequent type that is used at least some threshold number of times. The other thing that you may want to do is blacklist non-commons types (i.e. types rooted at /base/... or /user/...) which haven't been as carefully curated.
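As a rough sketch of the "least frequent type above a threshold" idea, something like the following could be applied to each entity once you have the instance counts. The function name, the threshold, and the counts in the example are all made up for illustration; real counts would come from the query above or from the data dump.

```python
def pick_type(entity_types, instance_counts, min_count=1000,
              blocked_prefixes=("/base/", "/user/"),
              blocked_types=frozenset({"/common/topic"})):
    """Pick the least frequent type that still has at least `min_count` instances.

    Types under the blocked prefixes (non-commons) and explicitly blocked
    types are ignored. Returns None when no type qualifies.
    """
    candidates = [
        t for t in entity_types
        if t not in blocked_types
        and not t.startswith(blocked_prefixes)
        and instance_counts.get(t, 0) >= min_count
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: instance_counts[t])


# Made-up counts, for illustration only.
counts = {
    "/common/topic": 40_000_000,
    "/people/person": 3_000_000,
    "/people/deceased_person": 1_000_000,
    "/government/politician": 100_000,
    "/base/example/rare_thing": 5_000,  # hypothetical non-commons type
}
types_of_entity = [
    "/common/topic",
    "/people/person",
    "/people/deceased_person",
    "/government/politician",
    "/base/example/rare_thing",
]
print(pick_type(types_of_entity, counts))
```

Raising min_count pushes the choice toward broader types, so it is worth trying a few thresholds against a sample of your corpus.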
EDIT - word of warning:
Those counts were last updated in 2012. That should be fine for an exercise like this where you just want a rough ordering, but if you need current stats, you'll need to either count occurrences in the Freebase data dump or figure out the separate Stats API which I'm not sure is public/documented http://freebase-site.googlecode.com/svn/trunk/www/lib/queries/stats.sjs

How does Google Maps decide when to use a specific icon?

I am using the Google Maps Places library to do a search for nearby hospitals, but it returns results that aren't necessarily hospitals (but have 'hospital' as one of their types). However, I've noticed that actual hospitals have a hospital icon on the map, so Google must somehow know which establishments are actually hospitals. Does anyone know if the public has access to this data?
This is the icon I'm referring to: https://www.dropbox.com/s/1jfqcayxavjhlyi/Screenshot%202015-03-17%2017.20.19.png?dl=0
Example of request I'm making:
var request = {
location: self.location,
radius: 20000,
types: ['hospital'],
keyword: 'hospital'
};
Example result that isn't a hospital:
{"geometry":{"location":{"k":44.815958,"D":-68.808244}},
"icon":"http://maps.gstatic.com/mapfiles/place_api/icons/generic_business-71.png","id":"de6e60bd70b90ba4cb86afe149a60169553607f1",
"name":"Penobscot Community Health Center",
"opening_hours":{"open_now":true,"weekday_text":[]},
"photos":[{"height":320,"html_attributions":[],"width":320}],"place_id":"ChIJj--4INRKrkwRN0z2XkoJtVU",
"rating":3.1,
"reference":"CoQBdAAAADmf3YA0659efzMbCSPOK6SZttkfus7aWBDhrZZyX63Szl256BRcpz81LH6rIuONldYv256tsN7Zv-N6ZkOkJadlD2VS01bs7C4ierKvGUMyJOJu657xL5MvidF3Tgs9iejeJcXsxjDJYOwtN3m3sbfClfWYVnnIL4hMLYV8P9TnEhBurfJv_30CAG2wp1V73POVGhR-7fz1mCdh4OYWSa3Pw0mPupckoQ",
"scope":"GOOGLE",
"types":["hospital","pharmacy","store","health","establishment"],
"vicinity":"1012 Union Street, Bangor",
"html_attributions":[]}
My guess is there are a couple of ways to get around this. You might remove the keyword argument from the request, since it acts like a free-text search term rather than a specific match on a type of location, which is what the types field does.
You may want to be careful about your radius value choice.
Next, if you do a search on Google Maps in general you'll get a broad assortment of results. Do you need every result to be an actual hospital or can you do your own filtering afterwards?
If you do your own filtering it looks like type information and even icons are embedded in the result JSON. You might see if there's a distinguishing characteristic between the types of results you want and filter by that. Otherwise, any additional graphical data would not be accessible via the API.
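A minimal sketch of that kind of post-filtering, written in Python for brevity (the same logic ports directly to the JavaScript callback). It assumes, without any confirmation from the API documentation, that genuine hospitals come back with something other than the generic business icon, as the example result above suggests. The second place and its icon URL are hypothetical.

```python
def looks_like_hospital(place):
    """Heuristic: typed as a hospital AND not shown with the generic business icon."""
    types = place.get("types", [])
    icon = place.get("icon", "")
    return "hospital" in types and "generic_business" not in icon


results = [
    {   # trimmed copy of the example result from the question
        "name": "Penobscot Community Health Center",
        "icon": "http://maps.gstatic.com/mapfiles/place_api/icons/generic_business-71.png",
        "types": ["hospital", "pharmacy", "store", "health", "establishment"],
    },
    {   # hypothetical result with a hypothetical icon URL
        "name": "Example General Hospital",
        "icon": "https://example.invalid/icons/hospital.png",
        "types": ["hospital", "health", "establishment"],
    },
]

hospitals = [place["name"] for place in results if looks_like_hospital(place)]
print(hospitals)
```

Check the heuristic against a handful of known hospitals in your area before relying on it, since the icon assignment is not something the API promises.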

DBpedia: Get list of Chinese universities and their addresses to populate Google Map?

I'm trying to get a list of Chinese universities and their addresses, the minimum being the city/town name. I will use these addresses to populate a Google Map, fiddle here.
I saw interesting code such as:
SELECT ?resource ?value
WHERE {
?resource a <http://dbpedia.org/class/yago/CitiesAndTownsInDenmark> .
?resource <http://dbpedia.org/property/populationTotal> ?value .
FILTER (?value > 100000)
}
ORDER BY ?resource ?value
Since CitiesAndTownsInChina doesn't work,
1. Where can I find the exact name of the class I'm targeting? and
2. Where can I find DBpedia's operator manual?
Note: I'm a very active user on Wikipedia and well aware of all the data available there, but the DBpedia ontology/syntax/keywords are quite hard to get.
Personal note: queries on http://dbpedia.org/snorql/ , http://dbpedia.org/sparql/ , http://querybuilder.dbpedia.org/
(Expanding on my reply to How to find cities with more than X population in a certain country)
CitiesAndTownsInDenmark exists because people use the category http://en.wikipedia.org/wiki/Category:Cities_and_towns_in_Denmark in wikipedia. Wikipedia categories are pretty loose and as a result there's a lot of variation in style, so even if a useful category exists the name may not be guessable.
In addition categories are maintained manually, and may not be consistently applied.
A good place to start is looking at the data. Visiting http://dbpedia.org/page/Beijing I see yago:MetropolitanAreasOfChina which seems promising, but if you follow that link you'll see it's not well populated.
Consequently, avoid relying on the existence of such categories and instead query directly for populated places in a country. This information comes from Wikipedia infoboxes, which are much more consistent than categories. Taking Beijing as an exemplar again, I found:
select ?s {
?s a <http://dbpedia.org/ontology/PopulatedPlace> ;
<http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/China>
}
(The relevant properties and values for my query were found by copying link location in the Beijing page)
with the result:
"http://dbpedia.org/resource/Hulunbuir"
"http://dbpedia.org/resource/Guangzhou"
"http://dbpedia.org/resource/Chongqing"
"http://dbpedia.org/resource/Kuqa_County"
"http://dbpedia.org/resource/Changzhou"
... nearly 3000 results ...
You'll notice that position is encoded multiple times (geo:lat and long, georss:point, various dbpprop:latd longd things), and, excitingly, there seem to be two values. You can either simply deal with the multiple values in whichever format you prefer, or try picking just one using GROUP BY and SAMPLE.
As for a manual, almost everything I know of is an academic paper, and not very useful. However, the data is reasonably self-documenting.
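If you would rather handle the duplicate positions on the client side than with GROUP BY and SAMPLE, a rough sketch looks like this. It assumes you have already flattened the SPARQL results into (resource, latitude, longitude) rows; the helper name and the coordinate values are made up for illustration.

```python
def one_position_per_resource(rows):
    """Keep the first (lat, lon) pair seen for each resource and ignore the rest."""
    positions = {}
    for resource, lat, lon in rows:
        # setdefault only stores the pair if the resource is not present yet
        positions.setdefault(resource, (lat, lon))
    return positions


# Illustrative rows: the same resource appears twice with slightly different values.
rows = [
    ("http://dbpedia.org/resource/Beijing", 39.90, 116.40),
    ("http://dbpedia.org/resource/Beijing", 39.91, 116.39),
    ("http://dbpedia.org/resource/Guangzhou", 23.13, 113.26),
]

for resource, (lat, lon) in one_position_per_resource(rows).items():
    print(resource, lat, lon)
```

Like SAMPLE, this picks an arbitrary value per resource, which is usually good enough for placing a marker on a map.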
For your first question:
You can see the possible classes by querying one member of your intended set of entities (e.g. Shanghai).
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type WHERE {
<http://dbpedia.org/resource/Shanghai> rdf:type ?type.
FILTER regex(str(?type), ".*China", "i").
} LIMIT 100
which gives this result:
dbpedia:class/yago/MetropolitanAreasOfChina [http]
dbpedia:class/yago/PortCitiesAndTownsInChina [http]
dbpedia:class/yago/MunicipalitiesOfThePeople'sRepuBlicOfChina [http]
dbpedia:class/yago/PopulatedCoastalPlacesInChina [http]
They are CamelCase versions of the categories that you will find at the bottom of Wikipedia pages. I was fooled for a while by the erroneous capitalization of RepuBlic, and finally saw that the class contains only 4 cities, so it is of limited use to you.
So I would propose going with @user205512's answer and getting the cities by linking the 2 properties.
For your second question:
I would advise you to search/ask on http://answers.semanticweb.com

Resources