How can I check whether a Freebase MID is a schema in freebase or not? - freebase

I want to filter out all the triples that have a schema node as their head or tail, so I need to check whether a given MID is part of the schema or not. Can somebody show me how to do this?
Thank you so much..

Objects or nodes in the graph which are part of the type system have a /type/object/type of /type/type. Other schema components such as domains (groups of types) and properties also have their own types (/type/domain and /type/property, respectively).
If what you really want is all entities, however, you're probably better off looking for objects which have the type /common/topic. This will exclude not only components of the schema, but also mediator nodes (aka CVTs) like https://www.freebase.com/m/0kpxvh?links= which are used to describe complex relationships.
Whether an inclusive or exclusive filtering strategy would work best really depends on what you're trying to accomplish, which I'm not sure I fully understand.
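To make the two strategies concrete, here is a minimal Python sketch. It assumes you have already extracted each MID's /type/object/type values into a dictionary; the function names, the MIDs and the mediator type ID are made up for illustration and are not part of any Freebase API.

```python
# Type IDs named in the answer above.
SCHEMA_TYPES = {"/type/type", "/type/domain", "/type/property"}
TOPIC_TYPE = "/common/topic"


def is_schema(mid, types_by_mid):
    """True if the MID carries one of the schema types."""
    return bool(types_by_mid.get(mid, set()) & SCHEMA_TYPES)


def is_topic(mid, types_by_mid):
    """True if the MID is typed as /common/topic."""
    return TOPIC_TYPE in types_by_mid.get(mid, set())


def drop_schema_triples(triples, types_by_mid):
    """Exclusive strategy: drop triples whose head or tail is a schema node."""
    return [
        (h, r, t)
        for h, r, t in triples
        if not is_schema(h, types_by_mid) and not is_schema(t, types_by_mid)
    ]


def keep_topic_triples(triples, types_by_mid):
    """Inclusive strategy: keep only triples whose head and tail are topics."""
    return [
        (h, r, t)
        for h, r, t in triples
        if is_topic(h, types_by_mid) and is_topic(t, types_by_mid)
    ]


# Made-up MIDs and one made-up mediator type, purely for illustration.
types_by_mid = {
    "/m/topic_a": {"/common/topic", "/people/person"},
    "/m/topic_b": {"/common/topic"},
    "/m/schema_x": {"/type/type"},
    "/m/cvt_y": {"/hypothetical/mediator_type"},
}
triples = [
    ("/m/topic_a", "rel_1", "/m/topic_b"),
    ("/m/topic_a", "rel_2", "/m/schema_x"),
    ("/m/topic_a", "rel_3", "/m/cvt_y"),
]

without_schema = drop_schema_triples(triples, types_by_mid)
topics_only = keep_topic_triples(triples, types_by_mid)
```

The exclusive filter keeps the triple that points at the mediator node, while the inclusive filter drops it, which is the practical difference between the two approaches.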

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can do that too. In some cases you may find a label such as DIRECTED-IN-2005 is a useful shortcut to avoid checking both a label and a property on an edge.
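As a rough illustration of the label-versus-property trade-off, here is a small in-memory sketch in Python. It is not Gremlin and not how Neptune stores data; the TinyGraph class and the node IDs are hypothetical and only show that a combined label lets you pick edges by label alone, without inspecting properties.

```python
from collections import defaultdict


class TinyGraph:
    """Toy adjacency store keyed by (source, edge label)."""

    def __init__(self):
        # (source, label) -> list of (target, properties)
        self._out = defaultdict(list)

    def add_edge(self, source, label, target, **properties):
        self._out[(source, label)].append((target, properties))

    def out(self, source, label, **where):
        """Follow edges with the given label, optionally filtering on properties."""
        edges = self._out.get((source, label), [])
        return [
            target
            for target, props in edges
            if all(props.get(key) == value for key, value in where.items())
        ]


g = TinyGraph()
g.add_edge("person:1", "DIRECTED", "movie:a", year=2005)
g.add_edge("person:1", "DIRECTED", "movie:b", year=2010)
g.add_edge("person:1", "ACTED_IN", "movie:b", year=2010)
# The shortcut label from the answer: the year is folded into the label.
g.add_edge("person:1", "DIRECTED-IN-2005", "movie:a")

# Label plus property check: every DIRECTED edge is inspected.
by_property = g.out("person:1", "DIRECTED", year=2005)
# Label only: just the edges under the combined label are touched.
by_label = g.out("person:1", "DIRECTED-IN-2005")
```

Both lookups return the same movie; the second one simply touches fewer edges, at the cost of maintaining an extra label per year.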

Modeling time in graph databases

I read a section in the Neo4j documentation about how to make queries that depend on time more efficient:
One way to model time-specific data and relationships is by including
data in the relationship type. Because Neo4j is optimized specifically
for traversing relationships between entities, you can often improve
query performance by specifying a date as the relationship type and
only traversing particular dated relationships.
But with this technique you have to repeat the same thing every time you want to make a time-based query more efficient. For example, if you want to query the posts created by a specific user on a specific date, you have to add (similarly to AirportDay) something like UserDay.
My question is: is there a way to model time universally in your graph, so that time becomes the main entry point for querying events and activities in the graph?
There's no answer to modelling time universally in your graph. It depends on your use cases.
The example in your post is one way to optimise non-performant queries that traverse too many relationships of the same type from a node.
You could also store time as a property on the node, and index it.
And then there's the option of a timetree https://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
To summarise, it depends on your use cases; there is usually no need to optimise prematurely.
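For intuition about the "store time as a property and index it" option, here is a minimal Python sketch of what such an index does conceptually: keep the timestamps sorted and answer range queries by binary search. The TimeIndex class and the post IDs are hypothetical; this is not Neo4j code.

```python
import bisect
from datetime import date


class TimeIndex:
    """Toy sorted index from a date to node IDs."""

    def __init__(self):
        self._keys = []  # dates, kept sorted
        self._ids = []   # node IDs, parallel to _keys

    def add(self, when, node_id):
        position = bisect.bisect_right(self._keys, when)
        self._keys.insert(position, when)
        self._ids.insert(position, node_id)

    def between(self, start, end):
        """Node IDs whose date falls in [start, end], both ends inclusive."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_right(self._keys, end)
        return self._ids[low:high]


index = TimeIndex()
index.add(date(2020, 1, 5), "post:1")
index.add(date(2020, 1, 1), "post:0")
index.add(date(2020, 2, 1), "post:2")

january_posts = index.between(date(2020, 1, 1), date(2020, 1, 31))
```

A single index like this serves any node type that carries a date, whereas the dated-relationship approach needs a new structure (UserDay, AirportDay, ...) for each kind of query.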

Weight in cts:collection-query

I want to perform a weighted search with cts:collection-query. Is there any way to do this?
Specifically, I want to fetch documents from a collection and give them different weights, similar to what we can do with cts:element-range-query.
cts:collection-query does not have any scoring options, unlike cts:element-range-query. A document either matches a collection query or it doesn't.
One option for you is to move the information you're currently modeling with collections into elements (or JSON properties) within the documents; then you'll be able to use cts:element-range-query.
You haven't specified what kind of information you're using the collections for; it's hard to picture typical collection names benefitting from this approach. Some more detail might make that more clear.
If documents in some collections are "better" (should score higher) than ones in other collections, and those valuations are static, you could set the document quality based on the collections it belongs to. Not exactly the same, but perhaps that accomplishes the goal.
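The mapping from collections to a static quality value could look something like the Python sketch below. The collection names and weights are invented, and taking the maximum (rather than, say, the sum) is just one possible choice; the resulting integer is what you would then store as the document's quality in MarkLogic.

```python
# Hypothetical collection names and weights, purely for illustration.
COLLECTION_QUALITY = {"curated": 10, "reviewed": 5, "raw": 0}


def quality_for(collections, weights=COLLECTION_QUALITY, default=0):
    """Pick a static quality value from the collections a document belongs to.

    Uses the highest weight among the document's collections; unknown
    collections and documents without collections fall back to `default`.
    """
    return max((weights.get(name, default) for name in collections), default=default)


best = quality_for(["raw", "curated"])
unknown = quality_for(["some-other-collection"])
```

Because the value is computed once per document, it would need to be recomputed whenever a document's collection membership changes.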

Neo4j Design: Property vs "Node & Relationship"

I have a node type with a string property that will very often have the same value, e.g. millions of nodes with only 5 possible values for that string. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall, the Node + Relationship approach looks more appealing: only a single relationship reference needs to be maintained each time rather than a property string, and you don't need to scan an additional index that has to be maintained on the property (memory and performance would intuitively be in favor of this approach).
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became how to access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of these special nodes so you can jump right into them in your START statement (this is what we do), or you need to do a search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.
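To compare options (a) and (b) from the question outside of Neo4j, here is a plain-Python sketch. The dictionaries stand in for the node store and the names are hypothetical; it only illustrates that the property approach inspects every node, while the value-node approach starts from one node and follows its references.

```python
STATES = ["STARTED", "IN_PROGRESS", "SUBMITTED", "COMPLETED"]

# Option (a): the state is a property repeated on every node.
nodes = {node_id: {"state": STATES[node_id % len(STATES)]} for node_id in range(8)}


def find_by_property(nodes, state):
    """Scan all nodes and compare the property value (like a WHERE filter)."""
    return [node_id for node_id, props in nodes.items() if props["state"] == state]


# Option (b): one "special" node per state, each referencing its members.
state_nodes = {state: [] for state in STATES}
for node_id, props in nodes.items():
    state_nodes[props["state"]].append(node_id)


def find_by_state_node(state_nodes, state):
    """Jump to the state node and read its references (like an extra MATCH)."""
    return state_nodes.get(state, [])


scanned = find_by_property(nodes, "STARTED")
traversed = find_by_state_node(state_nodes, "STARTED")
```

The `state_nodes` dictionary plays the role of the "constants reference" mentioned above: it is what lets you land on the special node directly.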

How is representing all information in Nodes vs Attributes affect storage, computations?

When using graph databases (in my case Neo4j), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the Database from different nodes until 4 depths and examining the information through connected nodes or attributes (based on which approach it is).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since the relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index on these values instead of having to filter on property values or index the property and pay for an expensive index lookup.
The first one is much better since you're querying on entities such as Stanford, and that entity is related to many person nodes. My opinion is that modeling as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
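Here is a small Python sketch of the node-based model for the example query. The names, years and dictionary layout are invented for illustration; the point is that the school is an entity you can start from, and the year is an attribute used only to trim the result.

```python
# FRIEND_OF edges, keyed by person.
friends = {"John": {"Alice", "Bob", "Carol"}}

# ATTENDED edges seen from the school node, with the year as an edge attribute.
attended = {"Stanford": {"Alice": 2010, "Carol": 2008, "Dave": 2010}}


def friends_who_attended(person, school, year=None):
    """Friends of `person` connected to `school`, optionally trimmed by year."""
    alumni = attended.get(school, {})
    matches = friends.get(person, set()) & alumni.keys()
    if year is not None:
        matches = {name for name in matches if alumni[name] == year}
    return sorted(matches)


stanford_friends = friends_who_attended("John", "Stanford")
stanford_friends_2010 = friends_who_attended("John", "Stanford", year=2010)
```

If the school were only an attribute on each person, answering "everyone who went to Stanford" would mean checking every person rather than reading the school node's edges.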
