How do I configure OrientDB ETL to import an edge list with attributes - graph

I have a CSV which contains an edge list, one edge per row. It looks like this:
id1, id2, attr1, attr2, attrX, attrY, attrZ
From this, I want to be able to create (or update) the following, per row:
Vertex A of class X, with id1 and attribute attr1
Vertex B of class X, with id2 and attribute attr2
Edge A->B with edge attributes attrX, attrY, attrZ
This is the configuration file I'm feeding to oetl.sh (using OrientDB 2.2 beta2):
{
  "source": { "file": { "path": "/data/sample/test.csv" } },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": {} },
    { "merge": { "joinFieldName": "id1", "lookup": "X.id" } },
    { "vertex": { "class": "X", "skipDuplicates": true } },
    { "edge": {
        "unresolvedLinkAction": "WARNING",
        "class": "EdgeTypeClass",
        "joinFieldName": "id2",
        "lookup": "X.id",
        "edgeFields": { "attrX": "${input.attrX}", "attrY": "${input.attrY}", "attrZ": "${input.attrZ}" }
      }
    },
    { "field": { "fieldNames": [ "id1", "id2", "attr1", "attr2", "attrX", "attrY", "attrZ" ], "operation": "remove" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/test2",
      "dbType": "graph"
    }
  }
}
The sample data I used to run the test is as follows:
10,11,"A","B",100,200,1
11,12,"B","C",110,201,5
12,14,"C","D",90,250,10
14,13,"D","E",105,210,3
When I run the oetl.sh script with the given configuration and sample data, it creates 4 vertices instead of 5 and no edges. There are no attributes on the vertices at all.
So these are the questions:
Is there a way in the vertex clause to specify vertex attributes/fields the same way that one can do for edges (i.e. edgeFields)? The documentation doesn't mention anything about it but it seems odd that you wouldn't be able to do it.
Rather than relying on the edge to create the outbound vertex, should I instead be creating two vertices explicitly and if so how do I specify that in the configuration file? When I try to add two "vertex" clauses it only seems to pick up the last one as the "current" vertex.
It's possible that the specific edge (id1 -> id2) already exists. Is it possible to only update the edge attributes in this case?
My sinking feeling is that, given the complexity and the number of things I'm trying to pack into this, it will be simpler to write my own ETL (e.g. using the Java API) instead of relying on oetl, but I was hoping I'd be able to avoid doing that, if only because a configuration file is more maintainable.
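One direction I've been considering for the target vertex, though I haven't verified it against the 2.2 documentation: if the edge transformer accepts a targetVertexFields map analogous to edgeFields, the id2 vertex could be created with its attribute in the same step. Roughly (the CREATE action and the targetVertexFields option are assumptions on my part):
{ "edge" : {
    "class" : "EdgeTypeClass",
    "joinFieldName" : "id2",
    "lookup" : "X.id",
    "unresolvedLinkAction" : "CREATE",
    "targetVertexFields" : { "attr2" : "${input.attr2}" },
    "edgeFields" : { "attrX" : "${input.attrX}", "attrY" : "${input.attrY}", "attrZ" : "${input.attrZ}" }
  }
}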

Related

Terraform: Add item to a DynamoDB Table

What is the correct way to add tuple and key-value pair items to a DynamoDB table via Terraform?
I am trying like this:
resource "aws_dynamodb_table_item" "item" {
table_name = aws_dynamodb_table.dynamodb-table.name
hash_key = aws_dynamodb_table.dynamodb-table.hash_key
for_each = {
"0" = {
location = "Madrid"
coordinates = [["lat", "40.49"], ["lng", "-3.56"]]
visible = false
destinations = [0, 4]
}
}
item = <<ITEM
{
"id": { "N": "${each.key}"},
"location": {"S" : "${each.value.location}"},
"visible": {"B" : "${each.value.visible}"},
"destinations": {"L" : [{"N": "${each.value.destinations}"}]
}
ITEM
}
And I am getting the message:
each.value.destinations is tuple with 2 elements
│
│ Cannot include the given value in a string template: string required.
I also have no clue on how to add the coordinates variable.
Thanks!
The list should be something like this:
"destinations": {"L": [{ "N" : 1 }, { "N" : 2 }]}
You are trying to pass
"destinations": {"L": [{ "N" : [0,4] }]}
Also, you are missing the closing } of the destinations key.
TLDR: I think the problem here is that you are trying to put L(N) - i.e. a list of numeric values, while your current Terraform code tries to put all the destinations into one N/number.
Instead of:
[{"N": "${each.value.destinations}"}]
you need to iterate over destinations and build a {"N": ...} entry for each one.
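In Terraform, that iteration could look something like this (a sketch: a for expression builds the list of {"N": ...} objects and jsonencode renders it, with each number coerced to the string form the DynamoDB JSON format uses):
"destinations": { "L": ${jsonencode([for d in each.value.destinations : { N = tostring(d) }])} }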
"destinations": {"NS": ${jsonencode(each.value.destinations)}}
Did the trick!
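Putting it together, the whole heredoc might end up looking roughly like this (a sketch, not tested: it assumes visible should be a DynamoDB BOOL rather than B/binary, stores coordinates as a map keyed by lat/lng, and coerces the destination numbers to strings as the DynamoDB JSON format represents them):
item = <<ITEM
{
  "id": { "N": "${each.key}" },
  "location": { "S": "${each.value.location}" },
  "visible": { "BOOL": ${each.value.visible} },
  "destinations": { "NS": ${jsonencode([for d in each.value.destinations : tostring(d)])} },
  "coordinates": { "M": {
    "lat": { "N": "${each.value.coordinates[0][1]}" },
    "lng": { "N": "${each.value.coordinates[1][1]}" }
  } }
}
ITEM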

Gremlin group by vertex property and get sum of other properties in the same vertex

We have vertices that store various jobs, with their types and counts as properties. I have to group by the status and sum their counts. I tried the following query, which works for one property (receiveCount):
g.V().hasLabel("Jobs").has("Type",within("A","B","C")).group().by("Type").by(fold().match(__.as("p").unfold().values("receiveCount").sum().as("totalRec")).select("totalRec")).next()
I want to do the same for 10 more properties like successCount, FailedCount, etc. Is there a better way to express that?
You could use the cap() step, like this:
g.V().has("name","marko").out("knows").groupCount("a").by("name").group("b").by("name").by(values("age").sum()).cap("a","b")
And the result would be:
"data": [
{
"a": {
"vadas": 1,
"josh": 1
},
"b": {
"vadas": [
27.0
],
"josh": [
32.0
]
}
}
]
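Applied to the Jobs example from the question, the same side-effect pattern could be extended with one group()/by() pair per counter and a single cap() at the end (a sketch, assuming receiveCount, successCount and failedCount are numeric properties present on every matched vertex):
g.V().hasLabel("Jobs").has("Type", within("A","B","C")).
  group("received").by("Type").by(values("receiveCount").sum()).
  group("succeeded").by("Type").by(values("successCount").sum()).
  group("failed").by("Type").by(values("failedCount").sum()).
  cap("received", "succeeded", "failed")
Each group(...) here is a named side effect, so the traversal keeps streaming the same vertices into the next group() step, and cap() returns a single map containing all of them.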

DSE Graph Loader mapping edges

I have to map data from JSON files to DSE.
Everything is working just fine, but I didn't find any documentation on how to map edges that connect different kinds of nodes but share the same label.
Example :
[A:Car] -- [OWNER] --> [B:Person]
[C:Car] -- [OWNER] --> [D:Company]
I've tried different approaches; finally, I added a custom field that explicitly describes the class of the nodes:
Data sample
// Nodes
{"id":"A","label":"Car"}
{"id":"B","label":"Person"}
{"id":"C","label":"Car"}
{"id":"D","label":"Company"}
// Edges
{"out":"A","label":"OWNER","in":"B", "outLabel":"Car","inLabel":"Person"}
{"out":"C","label":"OWNER","in":"D", "outLabel":"Car","inLabel":"Company"}
Here is the mapping script
load(nodesInput).asVertices {
labelField "label"
key "id"
}
load(edgesInput).asEdges {
label "OWNER"
outV "out", {
key "id"
label "Car"
}
inV "in", {
key "id"
labelField "inLabel" // <-- this declaration seems to fail
}
}
Any ideas?
I believe you could accomplish the above with something like the following.
load(edgesInput).asEdges {
label "OWNER"
outV "out", {
key "id"
label "Car"
}
inV "in", {
key "id"
label it["inLabel"]
}
}
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/dgl/dglMapScript.html
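If the out side needs to vary too (the sample data already carries an outLabel field), the same per-record lookup should presumably work there as well, though I haven't verified it:
load(edgesInput).asEdges {
  label "OWNER"
  outV "out", {
    key "id"
    label it["outLabel"]
  }
  inV "in", {
    key "id"
    label it["inLabel"]
  }
}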

Query to get exact matches of an Elastic field with multiple values in an array

I want to write a query in Elastic that applies a filter based on values I have in an array (in my R program). Essentially, the query should:
Match a time range (time field in Elastic)
Match the "trackId" field in Elastic to any value in the array oth_usr
Return 2 fields - "trackId" and "propertyId"
I have the following primitive version of the query, but I do not know how to use the oth_usr array in the query (part 2 above).
query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}',start_date,end_date)
view_list <- elastic::Search(index = "organised_recent",type = "PROPERTY_VIEW",size = 10000000,
body=query, fields = c("trackId", "propertyId"))$hits$hits
You need to add a terms query and embed it as well as the range one into a bool/must query. Try updating your query like this:
terms <- paste(sprintf("\"%s\"", oth_usr), collapse=", ")
query <- sprintf('{"query":{"bool":{"must":[{"terms": {"trackId": [%s]}},{"range": {"time": {"gte": "%s","lte": "%s"}}}]}}}',terms,start_date,end_date)
I'm not fluent in R syntax, but this is a raw JSON query that works.
It checks whether your time field matches the given range (start_time and end_time) and whether one of your terms exactly matches trackId.
It returns only the trackId and propertyId fields, as per your request:
POST /indice/_search
{
"_source": {
"include": [
"trackId",
"propertyId"
]
},
"query": {
"bool": {
"must": [
{
"range": {
"time": {
"gte": "start_time",
"lte": "end_time"
}
}
},
{
"terms": {
"trackId": [
"terms"
]
}
}
]
}
}
}
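Translated back into the sprintf style from the question (a sketch; it assumes oth_usr, start_date and end_date are already defined, and it moves the field selection into _source instead of the fields argument):
terms <- paste(sprintf('"%s"', oth_usr), collapse = ", ")
query <- sprintf('{
  "_source": { "include": ["trackId", "propertyId"] },
  "query": {
    "bool": {
      "must": [
        { "range": { "time": { "gte": "%s", "lte": "%s" } } },
        { "terms": { "trackId": [%s] } }
      ]
    }
  }
}', start_date, end_date, terms)

view_list <- elastic::Search(index = "organised_recent", type = "PROPERTY_VIEW",
                             size = 10000000, body = query)$hits$hits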

ArangoDB: traverse only edges within a time range

I am experimenting with time based versioning.
I have created a vertex that is connected to other vertices with this edge on one side:
{
"_id": "edges/426647569364",
"_key": "426647569364",
"_rev": "426647569364",
"_from": "nodes/426640688084",
"_to": "nodes/426629284820",
"valid_from": "1385787600000",
"valid_till": "9007199254740991"
}
And this edge on the other:
{
"_id": "edges/426679485396",
"_key": "426679485396",
"_rev": "426845488084",
"_from": "nodes/426675749844",
"_to": "nodes/426629284820",
"valid_from": "1322629200000",
"valid_till": "1417323600000"
}
The valid_till value in the first edge is the value of Number.MAX_SAFE_INTEGER.
I looked at custom visitors a little, and it looks like they're focused on filtering vertices rather than edges.
How can I restrict my traversal to edges with a valid_till value between new Date().getTime() and Number.MAX_SAFE_INTEGER?
You can use the followEdges attribute in a traversal.
followEdges can optionally be a JavaScript function for filtering edges. It will be invoked for each edge in the traversal:
var expandFilter = function (config, vertex, edge, path) {
  return (edge.valid_till >= new Date().getTime() &&
          edge.valid_till <= Number.MAX_SAFE_INTEGER);
};
require("org/arangodb/aql/functions").register("my::expandFilter", expandFilter);
It can then be used in a traversal like a regular custom filter by specifying it in the followEdges attribute of the traversal options, e.g.:
LET options = {
followEdges: 'my::expandFilter'
}
FOR doc IN TRAVERSAL(nodes, edges, 'nodes/startNode', 'inbound', options)
RETURN doc.vertex
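On ArangoDB 2.8 and later, the same restriction can also be expressed inline with the native traversal syntax instead of a registered filter function (a sketch; it assumes the edge collection is called edges and uses TO_NUMBER() because valid_till is stored as a string in the documents above):
FOR v, e IN 1..1 INBOUND 'nodes/startNode' edges
  FILTER TO_NUMBER(e.valid_till) >= DATE_NOW()
     AND TO_NUMBER(e.valid_till) <= 9007199254740991
  RETURN v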
