Degrees of separation using Freebase and MQL? - freebase

I've been examining Freebase and some queries like this:
Retrieving/storing All related Actors in Freebase
and I've come up with the idea of finding the degrees of separation between chosen actors. Is there a way to make MQL find a link (via other actors who co-starred in a movie) between, for example, 'John Wayne' and 'Daniel Craig'?

I may be a little late to answer, but recently I came across an excellent tutorial from GraphLab that seems to do exactly what you're describing. Check out:
http://graphlab.com/learn/notebooks/graph_analytics_movies.html

Here's a thread from the Freebase mailing list that discusses an app that did Six Degrees of Kevin Bacon. It's actually pretty difficult to do this with the MQL API, because once you get a couple of degrees apart the queries start to time out.
I'd recommend that you download the Freebase data dump and calculate the shortest paths between every pair of actors offline. You can then save all those paths and query them quickly from your app. If you use grep to filter the data dumps down to just the relationships between actors and movies, the data should be small enough to run the whole search in memory on a desktop/laptop.
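To give a concrete idea of the offline approach, here is a minimal Python sketch, assuming you've already grepped the dump down to a tab-separated file of actor/movie pairs (the file name and format below are made up):

    from collections import defaultdict, deque

    def load_graph(path):
        """Build a bipartite actor<->movie adjacency map from a pre-filtered,
        tab-separated file of actor<TAB>movie lines (a hypothetical format)."""
        adj = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                actor, movie = line.rstrip("\n").split("\t")
                adj[("actor", actor)].add(("movie", movie))
                adj[("movie", movie)].add(("actor", actor))
        return adj

    def degrees_of_separation(adj, start_actor, target_actor):
        """Breadth-first search; actor -> movie -> actor counts as one degree."""
        start, target = ("actor", start_actor), ("actor", target_actor)
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == target:
                return dist // 2   # two bipartite hops = one degree of separation
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None                # no co-starring chain found

    # graph = load_graph("actor_movie_pairs.tsv")
    # print(degrees_of_separation(graph, "John Wayne", "Daniel Craig"))

With the graph held in memory, each lookup is a plain BFS, so there is no query timeout to worry about.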

Related

Gremlin: Shortest logistical route between A and B while respecting schedules + other constraints

Preface
I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
Due to the sheer amount there is to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll know which sections deserve extra attention.
I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.
Question
Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?
Use-Case / Detail
I hope to answer the question:
Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?
The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".
What are the conditions?
Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (e.g., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ departs BOSton and then arrives in DALlas at 13:15). Consequently, when there are connections, I need to respect arrival and departure times plus some sort of buffer (i.e., a path on which a traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat()/until() loop when either limit is hit?)
I apologize because I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.
The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅
There are a number of ways you might model this. One way I have seen used effectively is to essentially have two graphs. The first just knows about routes; you use that one to find ways to get from A to Z in x hops. Then, using the results from the first search, you look in the second graph, which tracks actual flights, for flights within the time constraints you need to impose. So there is really the data modeling question and then the query-writing part. Obviously the data model should enable the queries to be as efficient as possible.
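To make the two-step idea concrete, here is a rough plain-Python sketch (not Gremlin) of route enumeration followed by schedule validation; the data structures, airport codes, and the 60-minute connection buffer are all illustrative assumptions:

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Graph 1: routes only (which airports connect to which).
    routes = defaultdict(set)
    routes["BOS"].add("DAL")
    routes["DAL"].add("AUS")

    # Graph 2: concrete flights per route, as (departure, arrival) pairs.
    flights = {
        ("BOS", "DAL"): [(datetime(2022, 11, 9, 8, 40), datetime(2022, 11, 9, 13, 15))],
        ("DAL", "AUS"): [(datetime(2022, 11, 9, 14, 30), datetime(2022, 11, 9, 15, 40))],
    }

    MIN_CONNECTION = timedelta(minutes=60)  # assumed buffer between legs

    def candidate_routes(origin, dest, max_legs):
        """Step 1: enumerate simple paths in the route graph, up to max_legs legs."""
        stack = [[origin]]
        while stack:
            path = stack.pop()
            if path[-1] == dest:
                yield path
                continue
            if len(path) - 1 >= max_legs:
                continue
            for nxt in routes[path[-1]]:
                if nxt not in path:
                    stack.append(path + [nxt])

    def route_is_flyable(path, earliest_departure):
        """Step 2: check the flight graph; every leg must depart at or after
        the previous arrival plus the connection buffer."""
        ready = earliest_departure
        for a, b in zip(path, path[1:]):
            options = [f for f in flights.get((a, b), []) if f[0] >= ready]
            if not options:
                return False
            ready = min(f[1] for f in options) + MIN_CONNECTION  # earliest arrival + buffer
        return True

    # True/False answer, as the question requires:
    start = datetime(2022, 11, 9, 6, 0)
    print(any(route_is_flyable(p, start)
              for p in candidate_routes("BOS", "AUS", max_legs=3)))

In a real Neptune setup both steps would be Gremlin traversals over the two graphs, but the shape of the logic is the same.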
There are a couple of useful blog posts related to your question. They mention Neo4j but are really quite generic and mainly focus on the data modeling aspects of your question.
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
I would focus on the data model, and once you have that, focus on the Gremlin queries. Amazon Neptune also now supports openCypher as an alternative property graph query language.
If you already have a data model worked out and can share a sample, I'm happy to update the answer with an example query or two.

Firebase queries return too many documents in complex data model

I've been trying to figure out how to best model data for a complex feed in Cloud Firestore without returning unnecessary documents.
Here's the challenge --
Content is created for specific topics, for example: Architecture, Bridges, Dams, Roads, etc. The topic options can expand to include as many as needed at any time, so it is a growing and evolving list.
When the content is created it is also tagged to specific industries. For example, I may want to create a post in Architecture and I want it to be seen within the Construction, Steel, and Concrete industries.
Here is where the tricky part comes in. If I am a person interested in the Steel and Construction industries, I would like to have a feed that includes posts from both of those industries with the specific topics of Bridges and Dams. Since it's a feed the results will need to be in time order. How would I possibly create this feed?
I've considered these options:
Query for each individual topic selected, including tags for Steel and Construction, then aggregate and sort the results. The problem I have with this one is that it can return too many posts, which means I'm reading documents unnecessarily. If I select 5 topics within a specific time range, that's 5 queries, which is OK. However, each can return any number of results, which is problematic. I could add a limit, but then I run the risk of posts being omitted from topics even though they fall within the time range.
Create a post "index" table in Cloud SQL and query it to get the post IDs, then retrieve the Firestore documents as needed. Then the question is, why not just use Cloud MySQL? Well, it's a scaling, cost, and maintenance issue. The whole point of Firestore is not having to worry so much about DBAs, load, and scale.
I've not been able to come up with any other ideas and am hoping someone has dealt with such a challenge and can shed some light on the matter. Perhaps Firestore is just completely the wrong solution and I'm trying to fit a square peg into a round hole, but it seems like a workable solution can be found.
The perfect structure is to have a separate node for posts; then for each post you give it a reference to its parent categories, e.g. Steel and Construction. Have them also arranged with timestamps. If you think that the database will be too massive for Firebase's queries, you can connect your Firebase database to Elasticsearch and do the search from there.
I hope it helps.
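For comparison, option 1 from the question (one query per selected topic, merged client-side) might be sketched like this with the Python Firestore client; the collection and field names are assumptions about your schema, each filter combination will likely need a composite index, and array_contains_any accepts at most 10 values:

    from google.cloud import firestore

    db = firestore.Client()

    def feed(topics, industries, start, end, page_size=20):
        """Fan out one query per selected topic, then merge and sort client-side.
        'posts', 'topic', 'industries', and 'created' are hypothetical names."""
        merged = []
        for topic in topics:
            query = (
                db.collection("posts")
                .where("topic", "==", topic)
                .where("industries", "array_contains_any", industries)
                .where("created", ">=", start)
                .where("created", "<=", end)
                .order_by("created", direction=firestore.Query.DESCENDING)
                .limit(page_size)   # per-topic cap; see the caveat in option 1
            )
            merged.extend(query.stream())
        merged.sort(key=lambda doc: doc.get("created"), reverse=True)
        return merged[:page_size]

The per-topic limit keeps reads bounded, at the cost described in the question: a very active topic can push older posts of quieter topics out of the merged page.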

Aggregating and deduplicationg information extracted from multiple web sites

I am working on building a database of timing and address information for restaurants, extracted from multiple web sites. Since information for the same restaurant may be present on multiple web sites, the database will contain some nearly duplicate copies.
The number of restaurants is large, say 100000, so a naive approach needs on the order of 100000^2 comparisons to check whether restaurant information with a nearly similar name is already present. Is there a more efficient approach than that? Thank you.
Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.
The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.
I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.
One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.
How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.
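A toy version of that token-lookup idea in Python (the normalization and the 0.85 similarity threshold are arbitrary choices for illustration):

    from collections import defaultdict
    from difflib import SequenceMatcher

    index = defaultdict(set)   # name token -> ids of restaurants containing it
    records = {}               # id -> normalized name

    def normalize(name):
        return " ".join(name.lower().split())

    def add_restaurant(rid, name, threshold=0.85):
        """Return ids of likely duplicates, then index the new record."""
        norm = normalize(name)
        candidates = set()
        for token in norm.split():          # candidate lookup via shared tokens
            candidates |= index[token]
        duplicates = [
            cid for cid in candidates
            if SequenceMatcher(None, norm, records[cid]).ratio() >= threshold
        ]
        records[rid] = norm
        for token in norm.split():
            index[token].add(rid)
        return duplicates

    print(add_restaurant(1, "Joe's Pizza"))    # []
    print(add_restaurant(2, "Joes Pizza"))     # [1]

Only records sharing at least one token are compared in detail, which is what keeps the work far below n^2; real tools use cleaner blocking keys and better comparators (Levenshtein, q-grams) than this sketch.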
Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.

Freebase co-types search timeout

I'm trying to find Freebase co-types; that is, given a type, you find 'compatible' types:
Suppose we start with /people/person: a person might be a musician (/music/group_member), but not a music album (/music/album). I don't know whether Freebase has something like OWL's 'disjointWith' between types; anyway, in the MQL cookbook they suggest using this trick.
The query in the example gets all instances of a given type, then gets all the types of those instances and makes them unique. This is clever, but the query times out. Is there another way? A static list/result is fine for me, I don't need the live query; I think the result would be the same.
EDIT:
The Incompatible Types base seems to be useful and similar to disjointWith; it may also be used with the suggest...
Thanks!
luca
Freebase doesn't have the concept of disjointWith at the graph or schema level. The Incompatible Types base that you found is a manually curated version of that which may be used in a future version of the UI, but isn't today.
If you want to find all co-types, as they exist in the graph today, you can do that using the query you mentioned, but you're probably better off using the data dumps. I'd also consider establishing a frequency threshold to eliminate low frequency co-types so that you filter out mistakes and other noise.
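To illustrate the data-dump approach with a frequency threshold, a sketch along these lines would work, assuming you have already extracted a tab-separated topic/type mapping from the dump (the input format here is made up):

    from collections import Counter, defaultdict

    def cotype_frequencies(path, seed_type="/people/person", min_share=0.01):
        """Count how often other types co-occur with seed_type.
        Assumes a pre-extracted TSV of topic_id<TAB>type_id lines (hypothetical)."""
        types_by_topic = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                topic, typ = line.rstrip("\n").split("\t")
                types_by_topic[topic].add(typ)

        counts = Counter()
        instances = 0
        for types in types_by_topic.values():
            if seed_type in types:
                instances += 1
                counts.update(types)

        # Keep only co-types above the frequency threshold to drop noise.
        return sorted(
            ((t, c / instances) for t, c in counts.items() if c / instances >= min_share),
            key=lambda x: -x[1],
        )

    # for cotype, share in cotype_frequencies("topic_types.tsv"):
    #     print(f"{cotype} ({share:.2%})")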
As Tom mentioned, there are some thresholds that you might want to consider to filter out some of the less notable or experimental types that users have created in Freebase.
To give you an idea of what the data looks like I've run the query for all co-types for /people/person.
/common/topic (100.00%)
/book/author (23.88%)
/people/deceased_person (18.68%)
/sports/pro_athlete (12.49%)
/film/actor (9.72%)
/music/artist (7.60%)
/music/group_member (4.98%)
/soccer/football_player (4.53%)
/government/politician (3.92%)
/olympics/olympic_athlete (2.91%)
See the full list here.
You can also experiment with Freebase Co-Types using this app that I built (although it is prone to the same timeouts that you experienced).

Which technology is best suited to store and query a huge readonly graph?

I have a huge directed graph: it consists of 1.6 million nodes and 30 million edges. I want users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have the graph stored in a PostgreSQL database, but that solution is neither very efficient nor elegant: I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like Neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the Python interface is not mature yet). Both of them seem better suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these GraphDBs seem too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500MB to 1GB) in memory. The clients could then query the server process and traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open-source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, an SQL-like language, ACID transactions, and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friend" is followed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations, where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second Berkeley DB-based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
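For the "server process holding the whole graph in memory" idea from the question, a bare-bones Python sketch might look like the following; note it returns a single shortest path (tracking all shortest paths needs multiple predecessors per node), and for 30 million edges a compact array-based (CSR-style) layout would use far less memory than Python sets:

    from collections import defaultdict, deque

    class GraphServer:
        """Read-only directed graph held in memory; shortest connections are found
        with a BFS that treats incoming and outgoing edges alike."""

        def __init__(self, edges):
            self.adj = defaultdict(set)
            for src, dst in edges:
                self.adj[src].add(dst)   # outgoing edge
                self.adj[dst].add(src)   # incoming edge, stored as a reverse link

        def shortest_path(self, start, goal):
            if start == goal:
                return [start]
            prev = {start: None}
            queue = deque([start])
            while queue:
                node = queue.popleft()
                for nxt in self.adj[node]:
                    if nxt not in prev:
                        prev[nxt] = node
                        if nxt == goal:
                            path = [goal]
                            while prev[path[-1]] is not None:
                                path.append(prev[path[-1]])
                            return path[::-1]
                        queue.append(nxt)
            return None   # no connection between the two nodes

    # g = GraphServer([(1, 2), (2, 3), (4, 3)])
    # print(g.shortest_path(1, 4))   # [1, 2, 3, 4]

Wrapping something like this in a small HTTP or RPC service would give the web interface millisecond-level lookups without any database round trips.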
Correct me if I'm wrong, but since each node is just a list of the linked nodes, it seems to me that a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.

Resources