Edge Collection vs. Graph

There's one thing I don't get in ArangoDB:
What's the difference between an edge collection and a graph? In which cases should I choose which?

Graphs in ArangoDB are built on top of documents and edges.
Edge collections have automatic indexes on _from and _to, allowing efficient retrieval of any connected documents. Because data are still stored in regular (document and edge) collections, you can use these collections in non-graph queries, too.
Graphs add some functionality (i.e. query methods, traversals) on top of the data. You can have multiple graphs in ArangoDB. Think of a "graph" as a means of grouping parts or all of your data and making them accessible in queries.

This is an edge:
{
  "_id": "edges/328701573688",
  "_from": "nodes/150194180348",
  "_to": "nodes/328668871224",
  "_rev": "3680146597",
  "_key": "328701573688",
  "type": "includes"
}
This is a document:
{
  "_id": "nodes/328668871224",
  "_rev": "3610088613",
  "_key": "328668871224",
  "name": "Gold-edged Gem",
  "type": "species"
}
As you can see, there is no fundamental difference: they are both documents. Edge collections are only useful when you are using Arango for its graph database capabilities.
As I understand it, the point of setting the type of the collection to "edge" is to tell Arango that it should ensure all documents stored in there minimally have _from and _to attributes, so that each document can serve its function as a connector between two other documents.
Once you have a document collection whose documents are connected by a bunch of edge documents in an edge collection... now you have a graph.
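As a minimal arangosh sketch (the collection and graph names here are assumptions, not from the question), creating the pieces looks like this:
// arangosh: create a document collection, an edge collection,
// and a named graph that ties them together
var graphModule = require("@arangodb/general-graph");
db._create("nodes");                 // document collection
db._createEdgeCollection("edges");   // enforces _from/_to on every document
var g = graphModule._create("speciesGraph", [
  graphModule._relation("edges", ["nodes"], ["nodes"])
]);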

Related

How should we model many-to-many relationships in DynamoDB when aiming for single-table design

Quick high-level background:
DynamoDB with single table design
OpenSearch for full text search
DynamoDB Stream which indexes into OpenSearch on DynamoDB Create/Update/Delete via Lambda
The single-table design approach has been working well for us so far, but we also haven't really had many-to-many relationships to deal with. However, a new relationship we recently needed to account for is Tags for Entry objects:
interface Entries {
  readonly id: string
  readonly title: string
  readonly tags: Tag[]
}
interface Tags {
  readonly id: string
  readonly name: string
}
We want to try to stick to a single query/read to retrieve a list of Entries or a single Entry, but we also want a good balance against the cost of managing updates.
A few ways we've considered storing the data:
Store all tag data in the Entry
{
  "id": "asdf1234",
  "title": "Entry Title",
  "tags": [
    {
      "id": "1234asdf",
      "name": "stack"
    },
    {
      "id": "4321hjkl",
      "name": "over"
    },
    {
      "id": "7657gdfg",
      "name": "flow"
    }
  ]
}
This approach makes reads easy, but updates become a pain - anytime a tag is updated, we would need to find all Entries that reference that tag and then update each of them.
Store only the tag ids in the Entry
{
  "id": "asdf1234",
  "title": "Entry Title",
  "tags": ["1234asdf", "4321hjkl", "7657gdfg"]
}
With this approach, no updates would be required when a Tag is updated, but now we have to do multiple reads to return the full data - we would need to query each Tag by id before returning the full content to the client.
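For illustration, the per-Tag lookups can at least be batched into one round trip. A minimal sketch with the AWS SDK v3 document client (the table name AppTable and the pk/sk key scheme below are assumptions, not our actual model):
import { DynamoDBClient } from "@aws-sdk/client-dynamodb"
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb"

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Resolve up to 100 tag ids in one request (BatchGetItem's limit)
async function getTagsForEntry(tagIds: string[]) {
  const result = await client.send(new BatchGetCommand({
    RequestItems: {
      AppTable: {
        Keys: tagIds.map((id) => ({ pk: `TAG#${id}`, sk: `TAG#${id}` })),
      },
    },
  }))
  return result.Responses?.AppTable ?? []
}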
Store only the tag ids in the Entry but use OpenSearch to query and get data
This option, similar to the one above, would store only the tag ids on the Entry, but the Entry document indexed on the search side would include all Tag data, added in our stream lambda. Updates on a Tag would still require us to query and update each affected Entry individually (in search) - the question is whether it's more cost-effective to just do that in DynamoDB.
This scenario presents an interesting uni-directional flow:
writes go straight to DynamoDB
DynamoDB stream -> Lambda - transform the data => index in OpenSearch (sketched below)
reads are exclusively done via OpenSearch
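A rough sketch of that stream lambda (the handler shape is standard; the unmarshalling cast, indexIntoOpenSearch, and the reuse of getTagsForEntry from above are assumptions):
import { DynamoDBStreamEvent } from "aws-lambda"
import { unmarshall } from "@aws-sdk/util-dynamodb"

declare function getTagsForEntry(tagIds: string[]): Promise<unknown[]> // defined above
declare function indexIntoOpenSearch(doc: unknown): Promise<void>      // hypothetical helper

export const handler = async (event: DynamoDBStreamEvent) => {
  for (const record of event.Records) {
    const image = record.dynamodb?.NewImage
    if (!image) continue // e.g. REMOVE events carry no NewImage
    const entry = unmarshall(image as any)
    // Denormalize: resolve tag ids to full Tag objects before indexing
    const tags = await getTagsForEntry(entry.tags)
    await indexIntoOpenSearch({ ...entry, tags })
  }
}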
The overall question is: how do applications using NoSQL with single-table design handle these many-to-many scenarios? Is using the uni-directional flow stated above a good idea / worth it?
Things to consider:
our application leans more heavily on the read side
our application will also utilize search capability quite heavily
Tag updates will be infrequent

Conditionally update all documents in all CosmosDB partitions

I have a CosmosDB container that has (simplified) documents like this:
{
  id,
  city,
  propertyA,
  propertyB
}
The partition key is 'city'.
Usually this works well. But there has been a bug in our code, and now I have to update a lot of documents across a lot of different partitions.
I made a conditional update document like so:
{
  "conditions": "from c where propertyA = 1",
  "operations": [
    {
      "op": "set",
      "path": "/propertyB",
      "value": true
    }
  ]
}
I send this document to CosmosDB with the REST API.
Because I want to update all documents in all partitions that satisfy the condition, I set the x-ms-documentdb-query-enablecrosspartition header to 'True'.
But I still need to supply a partition key in the x-ms-documentdb-partitionkey header.
Is there a way to use the REST API to update all the documents for which the condition is true, whatever the partition key is?
The REST API does provide the ability to programmatically create, query, and delete databases, document collections, and documents. To change a document partially, you can use the PATCH HTTP method.
However, you will still need to supply a partition key in the x-ms-documentdb-partitionkey header: if a document was created under a partition key, you must provide that partition key in the header for per-document operations.
For more information on the partition key requirement, see the documentation:
https://learn.microsoft.com/en-us/rest/api/cosmos-db/patch-a-document
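In practice this means a cross-partition query to find the affected documents, followed by one patch per document with its own partition key. A sketch using the JavaScript SDK (@azure/cosmos), which wraps the same PATCH REST call; the endpoint, key, and database/container names are placeholders:
import { CosmosClient } from "@azure/cosmos"

const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" })
const container = client.database("<db>").container("<container>")

async function fixDocuments() {
  // Cross-partition query to find every affected document and its partition key
  const { resources } = await container.items
    .query("SELECT c.id, c.city FROM c WHERE c.propertyA = 1")
    .fetchAll()

  for (const doc of resources) {
    // Each patch targets a single document within its own partition
    await container.item(doc.id, doc.city).patch([
      { op: "set", path: "/propertyB", value: true },
    ])
  }
}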

ArangoDB AQL: effectively traversing from start vertex to end vertex and finding the connections between them

I'm very new to the graph concept and ArangoDB. I plan to use both of them in a project related to communication analysis. I have set up the data to fit the need in ArangoDB, with one document collection named object and one edge collection named object_routing.
In my object collection, the data structure is as follows:
{
  "img": "assets/img/default_message.png",
  "label": "some label",
  "obj_id": "45a92a7344ee4f758841b5466c010ed9",
  "type": "message"
}
...
{
  "img": "assets/img/default_person.png",
  "label": "some label",
  "obj_id": "45a92a7344ee4f758841b5466c01111",
  "type": "user"
}
In my object_routing collection, the data structure is as follows:
{
  "message_id": "no_data",
  "source": "45a92a7344ee4f758841b5466c010ed9",
  "target": "45a92a7344ee4f758841b5466c01111",
  "type": "has_contacted"
}
with _from: "object/45a92a7344ee4f758841b5466c010ed9" and _to: "object/45a92a7344ee4f758841b5466c01111".
The object collection holds 23k documents and object_routing holds 127k.
My question is: how can I effectively traverse from the start vertex through to the end vertex, so that I can presumably get all the connected vertices, their edges, their children, and so on, until there is nothing left to traverse?
I'm afraid my question is not clear enough and my understanding of the graph concept may not be in the right direction, so please bear with me.
Note: the BFS algorithm is not an option because that is not what I need. If possible, I would like to get the longest path. My current ArangoDB version is 3.1.7, running on a cluster with 1 coordinator and 3 DB servers.
It is worth trying a few queries to get a feel for how AQL traversals work, but maybe start with this example from the AQL Traversal documentation page:
FOR v, e, p IN 1..10 OUTBOUND 'object/45a92a7344ee4f758841b5466c010ed9' GRAPH 'insert_my_graph_name'
  LET last_vertex_in_path = LAST(p.vertices)
  FILTER last_vertex_in_path.obj_id == '45a92a7344ee4f758841b5466c01111'
  RETURN p
This sample query will look at all outbound edges in your graph called insert_my_graph_name starting from the vertex with an _id of object/45a92a7344ee4f758841b5466c010ed9.
The query is then set up to return three variables for every path found:
v contains the vertex the traversal has currently reached (the last vertex of the path)
e contains the edge that led to that vertex
p contains the full path that was found
A path consists of vertices connected to each other by edges.
If you want to explore the variables, try this version of the query:
FOR v, e, p IN 1..10 OUTBOUND 'object/45a92a7344ee4f758841b5466c010ed9' GRAPH 'insert_my_graph_name'
  RETURN {
    vertices: v,
    edges: e,
    paths: p
  }
What is nice is that AQL returns this information in JSON format, in arrays and such.
When a path is returned, it is stored as a document with two attributes, edges and vertices, where the edges attribute is an array of edge documents the path went down, and the vertices attribute is an array of vertex documents.
The interesting thing about the vertices array is that the order of array elements is important. The first document in the vertices array is the starting vertex, and the last document is the ending vertex.
So in the example query above, because your query is set up as an OUTBOUND query, your starting vertex will always be the FIRST element of the array stored at p.vertices, and the end of the path will always be the LAST element of that array.
It doesn't matter how many vertices are traversed in your path, that rule still works.
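For the data above, a returned path document might look like this (abridged; the attribute values shown are illustrative):
{
  "vertices": [
    { "_id": "object/45a92a7344ee4f758841b5466c010ed9", "type": "message" },
    { "_id": "object/45a92a7344ee4f758841b5466c01111", "type": "user" }
  ],
  "edges": [
    {
      "_from": "object/45a92a7344ee4f758841b5466c010ed9",
      "_to": "object/45a92a7344ee4f758841b5466c01111",
      "type": "has_contacted"
    }
  ]
}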
If your query were an INBOUND traversal, the logic stays the same: FIRST(p.vertices) will still be the starting vertex you specified in your query, and LAST(p.vertices) will be the terminating vertex of that inbound path.
So, back to your use case: if you want to filter all OUTBOUND paths from your starting vertex down to a specific vertex, you can add the LET last_vertex_in_path = LAST(p.vertices) declaration to set a reference to the last vertex in the path provided.
Then you can easily provide a FILTER that references this variable, and then filter on any attribute of that terminating vertex. You could filter on the last_vertex_in_path._id or last_vertex_in_path.obj_id or any other parameter of that final vertex document.
Play with it and practice some, but once you see that a graph traversal query only provides you with these three key variables, v, e, and p, and that these aren't anything special - just vertices, edges, and arrays of them - you can do some pretty powerful filtering.
You could put filters on properties of any of the vertices, edges, or path positions to do some pretty flexible filtering and aggregation of the results it sends through.
Also have a look at the traversal options; they can be useful.
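For example - a sketch assuming the same graph name as before - OPTIONS can keep a traversal from revisiting vertices within a path, which helps in cyclic graphs:
FOR v, e, p IN 1..10 OUTBOUND 'object/45a92a7344ee4f758841b5466c010ed9' GRAPH 'insert_my_graph_name'
  OPTIONS { uniqueVertices: 'path' }
  RETURN p.vertices[*].obj_id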
To get started, just make sure you have your documents and edges loaded, and that you've created a graph with those document and edge collections in it.
And yes.. you can have many document and edge collections in a single graph, even sharing document/edge collections over multiple graphs if that suits your use cases.
Have fun!

Normalized many-to-many schema for client-side data store

I am developing the browser front end of a social network application. It has lots of relational data, with one-to-many (1:m) and mostly many-to-many (m:m) relationships, as in the list below.
I want to use Flux data flow architecture in the application. I am using Vuex.js with Vue.js.
As expressed in the Redux.js docs, it is better to have a flat, normalized store state shape for various reasons when used with React, and I think that is the case with Vue.js also.
posts have categories (m:m)
posts have tags (m:m)
post has comments (1:m)
posts have hashtags in them (m:m) // or users create hashtags
posts have mentions in them (m:m) // or users create mentions of users
users like posts (m:m)
users follow users, posts, post categories etc. (m:m)
users favorite posts (m:m)
etc.
I will need to show post feeds with all of their related data from other entities, like users, comments, categories, and tags. For this, holding the "many" side's data inside the "one" side (the parent), as with a 1:m relation, seems OK for the usual querying that composes posts, even though the relation is actually many-to-many. However, I will also need to query the store state in the inverse direction, for example getting the posts with a certain category or tag.
In that case, it is not as easy as doing so for posts. I need a relation entity that holds the id pairs of the two connected data entities, just like a join table or association table in an RDBMS, for ease of access and updating, to avoid deep digging into state, and also to avoid unnecessary re-renders (that requirement is React or Vue.js and GUI specific).
How can I achieve this relatively easily and effectively, e.g. as one does for 1:m relations?
Pursuant to your last comment, I'll present the data structure I currently use for this circumstance:
Tag
{
  "type": "tag",
  "id": "tag1",
  "name": "Tag One"
}
Tag To Post
{
  "id": "someId",
  "type": "postTag",
  "tagId": "tag1",
  "postId": "post1"
}
Post
{
  "id": "post1",
  "name": "Post 1"
}
I found that having each side of the M:M store the relationship ids potentially produces orphans. Managing these ids in two places means replicating steps and an increase in cognitive overhead, as every function managing the M:M happens in two places rather than one. Additionally, the relationship itself may need to contain data; where would that data go?
M:M Without Entity
{
  "id": "post1",
  "name": "Post 1",
  "tagIds": [
    {id: "tag1", extraAttribute: false} //this is awkward
  ]
}
Tag To Post - Additional Attributes
{
  "id": "someId",
  "extraAttribute": false,
  "postId": "post1",
  "type": "postTag",
  "tagId": "tag1"
}
There are additional options available to speed up extracting tags with minor elbow grease.
Post
{
  "id": "post1",
  "name": "Post 1",
  "tagIds": ["tag1", "tag4"]
}
Hypothetically, a post would not have more than 20 tags, making this a generally negligible storage cost to reduce lookups. I have found no urgent need for this so far with a database of 10,000 relationships.
Ease of access and updating
A 1:m relation is an object directly pointing at what it wants. An m:m relation is two different entities pointing at their relationship. Model that relationship as its own entity and centralise the logic, as in the getter sketch below.
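A minimal Vuex getter sketch, assuming a normalized state shape with entity maps keyed by id (the state layout and getter names here are assumptions):
// Assumed state shape:
// state.posts:    { post1: { id: "post1", name: "Post 1" }, ... }
// state.tags:     { tag1:  { id: "tag1", name: "Tag One" }, ... }
// state.postTags: { someId: { id: "someId", tagId: "tag1", postId: "post1" }, ... }
const getters = {
  // Inverse lookup: all posts carrying a given tag. Scan the join
  // entities once, then resolve each postId against the posts map.
  postsByTag: (state) => (tagId) =>
    Object.values(state.postTags)
      .filter((pt) => pt.tagId === tagId)
      .map((pt) => state.posts[pt.postId]),

  // Forward lookup: all tags attached to a given post.
  tagsByPost: (state) => (postId) =>
    Object.values(state.postTags)
      .filter((pt) => pt.postId === postId)
      .map((pt) => state.tags[pt.tagId]),
}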
Rerenders
If your application renders long lists of data (hundreds or thousands of rows), we recommend using a technique known as "windowing". This technique only renders a small subset of your rows at any given time, and can dramatically reduce the time it takes to re-render the components as well as the number of DOM nodes created.
https://reactjs.org/docs/optimizing-performance.html#virtualize-long-lists
I feel solutions may be use-case specific and subject to broader opinions. I ran into this issue utilising couch/pouch with Vuex and a very large table of 20,000 entries. Again, in production the issues were not extremely noticeable. Results will always vary.
A couple of things I try here:
Load partial data sets: in-file (non-reactive) vs in memory (loaded in Vuex)
Sort, paginate, search in-file and load results

Regarding understanding the necessity of Graph Creation in ArangoDB

I don't understand the necessity of creating graphs in ArangoDB.
For example, consider the AQL queries below:
// Paths between 2 vertices
FOR p IN TRAVERSAL(person, knows, "person/person3", "outbound", {
    paths: true, filterVertices: [{_id: "person/person2"}],
    vertexFilterMethod: ["exclude"]})
  RETURN p.path.vertices[*].name

// All connected vertices for a given vertex
FOR p IN PATHS(person, knows, "outbound")
  FILTER p.source._id == "person/person5"
  RETURN p.vertices[*].name
The above two queries are clearly related to graphs... but you do not need to create a graph to make them work.
Why and when should I create a graph?
What advantages will I get if I create a graph?
Creating or registering a 'graph' in ArangoDB is optional. Its purpose is to maintain graph consistency during modifications.
You can use Document features and combinations of graph traversals on the collections without referencing the graph.
However, one main purpose of the graph definition is to be used while modifying edges or vertices. A vertex document in a vertex collection may be referenced from several edge documents in several edge collections, with these edge collections in turn belonging to several graphs.
When you remove a vertex via a graph API, all these graph definitions are consulted to find the edge collections that may contain edges pointing to that particular vertex collection. Subsequently, all matching edges in all candidate edge collections are found and removed. All of this is done with transactional safety.
By doing this, the mentioned graph consistency can be maintained: you don't end up with dangling edges pointing to a previously removed vertex.
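As an illustration - an arangosh sketch where the graph name 'knows_graph' is an assumption - removing a vertex through the general-graph module also removes every edge that references it:
// arangosh: remove a vertex via the graph, not via the raw collection
var graphModule = require("@arangodb/general-graph");
var graph = graphModule._graph("knows_graph");
// Removes the vertex AND all connected edges, transactionally:
graph.person.remove("person/person5");
// A plain collection remove would leave dangling edges behind:
// db.person.remove("person/person5");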
Please note that you should rather use pattern-matching traversals. One could rephrase
FOR p IN TRAVERSAL(person, knows, "person/person3", "outbound", {
    paths: true, filterVertices: [{_id: "person/person2"}],
    vertexFilterMethod: ["exclude"]})
  RETURN p.path.vertices[*].name
like this, using the more modern pattern matching:
FOR v, e, p IN 1..20 OUTBOUND "person/person3" knows
  FILTER v._id != "person/person2"
  RETURN p.vertices[*].name
