what is the similarity function and how is the query transformed with spread activation techniques - information-retrieval

We have an information retrieval project in which we should build a search engine for a website we made the first semester.
Anyway I'm evaluating existing models when I stumbled upon spread activation.
I understand that there is a graph in which nodes are documents and links are the edges between them, I also understand the graph is weighted.
Some papers said nodes can be terms,citation,etc. What I didn't find is the similarity function and the query.
Where is the query in this graph ? and how is the similarity computed ?
I understand there are many models that use spread activation, I think one or two example models would be enough, I want to understand how is spread activation applied on the query in particular.

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges but edge labels can also. In some cases you may find a label such as DIRECTED-IN-2005 is a useful short cut to avoid checking a label and a property on an edge.

Modeling time in graph databases

I read in Neo4j documentation a section about how to make queries that depends on time more efficient:
One way to model time-specific data and relationships is by including
data in the relationship type. Because Neo4j is optimized specifically
for traversing relationships between entities, you can often improve
query performance by specifying a date as the relationship type and
only traversing particular dated relationships.
But I was wondering, using this technique you will have to repeat the same things any time you want to make the time-based-queries more efficient. For example if you want to query the posts created by specific user at specific date you have to add (similarly to AirportDay) something like UserDay.
My question is there a possible way to model time universally in your graph?, so that time become the main entry-point to query events and activities in the graph.
There's no answer to modelling time universally in your graph. It depends on your use cases.
The example in your post is one way to optimise non-performant queries that traverse too many relationships of the same type from a node.
You could also store time as a property on the node, and index it.
And then there's the option of a timetree https://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
To summarise, it depends on your use cases- usually no need to prematurely optimise.

Gremlin: how to get all of the graph structure surrounding a single vertex into a subgraph

I would like to get all of the graph structure surrounding a single vertex into a subgraph.
The TinkerPop documentation shows how to do this for a fixed number of traversal steps.
My question: are there any recipes for getting the entire surrounding structure (which may include cycles) without knowing ahead of time how many traversal steps are needed?
I have updated this question for the benefit of anyone who might land on this question, here is a gremlin statement that will capture an arbitrary graph structure that surrounds a vertex with id='xxx'
g.V('xxx').repeat(out().simplePath()).until(outE().count().is(eq(0))).path()
-- this incorporates Stephen Mallete's suggestion to use simplePath within the repeat step.
That example uses repeat()...times() but you could easily replace times() with until() and not know the number of steps ahead of time. See more information in the Reference Documentation on how repeat() works to see how to express different types of flow control and traversal stop conditions.

number of 'graphs' in a ArangoDB database

I am exploring Arangodb and I am not sure I understand correctly how to use the graph concept in ArangoDb.
Let's say I am modelling a social network. Should I create a single graph for the whole social network or should I create a graph for every person and its connections ?
I've got the feeling I should use a single graph... But is there any performance/fonctionality issue related to that choice ?
Maybe the underlying question is this: should I consider the graph concept in arangodb as a technical or as a business-related concept ?
Thanks
You should use not use a graph per person. The first quick answer would be to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user?", "find the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, like for example in the shortest path example.
Coming back to the original question: You should start by looking at the queries you will have about your data. You then want to make it, so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.

How is representing all information in Nodes vs Attributes affect storage, computations?

While using Graph Databases(my case Neo4j), we can represent the same information many ways. Making each entity a Node and connecting all entities through relationships or just adding the entities to attribute list of a Node.diff
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the Database from different nodes until 4 depths and examining the information through connected nodes or attributes (based on which approach it is).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of Storage, computations
Normally,
properties are loaded lazily, and are more expensive to hold in cache, especially strings. Nodes and Relationships are most effective for traversal, especially since the relationships types are stored together with the relatoinship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the reoccurring proeprties as nodes connecting to the entities, thus using the graph itself to index on these values, instead of having to revert to filter on property values or index the property with an expensive index lookup.
The first one is much better since you're querying on entities such as Stanford- and that entity is related to many person nodes. My opinion that modeling as nodes is more intuitive and easier to query on. "Find all persons who went to Stanford" would not be very easy to do in your second model as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity use them to filter results from the query e.g. Who are friends of John who went to Stanford in the year 2010. In this case, the year attribute would just be used to trim the results. Depends on your use case- if year is really important and drives a lot of queries or is used to represent a timeline, you could even model the year as a node attached to Stanford.

Resources