Gremlin group by vertex property and get the sum of other properties in the same vertex

We have vertices that store various jobs, with their types and counts as properties. I have to group by the job type and sum the counts. I tried the following query, which works for one property (receiveCount):
g.V().hasLabel("Jobs").has("Type",within("A","B","C")).group().by("Type").by(fold().match(__.as("p").unfold().values("receiveCount").sum().as("totalRec")).select("totalRec")).next()
I want to add 10 more properties like successCount, FailedCount, etc. Is there a better way to do that?

You could use the cap() step, like this:
g.V().has("name","marko").out("knows").groupCount("a").by("name").group("b").by("name").by(values("age").sum()).cap("a","b")
And the result would be:
"data": [
{
"a": {
"vadas": 1,
"josh": 1
},
"b": {
"vadas": [
27.0
],
"josh": [
32.0
]
}
}
]
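For the original question of summing ten or more count properties per Type in one pass, a hedged sketch that builds on the asker's fold() approach is shown below; the totalSucc/totalFail labels and the failedCount property spelling are assumptions taken from the question, not verified against the data:
g.V().hasLabel("Jobs").has("Type", within("A","B","C")).
  group().by("Type").
    by(fold().
      project("totalRec", "totalSucc", "totalFail").
        by(unfold().values("receiveCount").sum()).
        by(unfold().values("successCount").sum()).
        by(unfold().values("failedCount").sum())).
  next()
Each extra property is just one more label in project() plus one more by(unfold().values(...).sum()) clause, which avoids repeating a match() per property.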

Related

Great Expectations - Result validation for row_count and column_freshness

I would like to validate results for row count and column freshness on some data in AWS. I am using a check_config.json file to configure the checks, and Terraform to create a Glue job that runs the checks and writes the result to DynamoDB. The result in DynamoDB is not elaborate, and I would like it to be more specific about the exact results obtained before a check is marked as pass or fail. For example, I would like to see when the table was last modified (column freshness) and the number of rows obtained from the count (expect_row_count).
Below is the current result in DynamoDB:
Below is the JSON config:
{
  "table": "table1",
  "checks": [
    {
      "check": "custom_expect_column_to_be_fresh",
      "parameters": {
        "columns": [
          "column1"
        ],
        "strftime_format": "%Y-%m-%d",
        "threshold_days": 0,
        "threshold_hours": 10
      }
    },
    {
      "check": "expect_table_row_count_to_be_between",
      "result_format": "COMPLETE",
      "include_config": "True",
      "parameters": {
        "min_value": 1,
        "max_value": 100000
      },
      "alarm": {
        "threshold": 100,
        "period": 3600
      }
    }
  ]
}
I was expecting a more elaborate result showing how many rows were obtained before the row count check is marked as a failure, and I also want to see the last table modification timestamp before the column freshness check is marked as a failure.
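As a side note, with "result_format": "COMPLETE" the validation result dictionary produced by Great Expectations already contains the observed values (for example, the actual row count appears under observed_value). Below is a minimal sketch of writing those details to DynamoDB, assuming the Glue job holds the result in a validation_result dict; the check_results table name is hypothetical:
import boto3

# hypothetical DynamoDB table for the detailed check output
results_table = boto3.resource("dynamodb").Table("check_results")

for res in validation_result["results"]:  # one entry per expectation
    results_table.put_item(Item={
        "table": "table1",
        "check": res["expectation_config"]["expectation_type"],
        "success": res["success"],
        # e.g. the actual row count for expect_table_row_count_to_be_between
        "observed_value": str(res["result"].get("observed_value")),
    })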

Azure Cosmos Graph nest edge vertex in a vertex property

I have two vertices:
1) Vertex 1: { id: 1, name: "john" }
2) Vertex 2: { id: 2, name: "mary" }
There is an edge from 1 to 2 named “children”.
Is it possible to return vertex 2 nested inside vertex 1 using Gremlin, like this?
{
  id: 1,
  name: "john",
  children: { id: 2, name: "mary" }
}
Thank you!
My solution, with the amazing help of #noam621:
g.V(1)
.union( valueMap(true),
project('children').by( coalesce( out('children').valueMap(true).fold() , constant([]))),
project('parents').by( coalesce( out('parents').valueMap(true).fold() , constant([])))
)
.unfold().group().by(keys).by(select(values))
It returns the following object:
{
  id: 1,
  name: [ "john" ],
  children: [ { id: 2, name: [ "mary" ] } ],
  parents: []
}
union() with project() is the key to merging all the objects into one object.
valueMap(true).fold() is fundamental for getting all the objects on the other side of the edge, and coalesce() provides a default value when the edge doesn't return any vertex.
Due to some Azure Cosmos Gremlin limitations, it is only possible to get values as arrays.
So I finalized the object formatting in my application code. It's OK for now.
Thank you!
You can do it by using the project step for both vertices:
g.V(1).project('id', 'name', 'children').
by(id).
by('name').
by(out('children').
project('id', 'name').by(id).
by('name'))
example:
https://gremlify.com/3j
query with valueMap:
g.V(1).union(
valueMap().
with(WithOptions.tokens).by(unfold()),
project('children').
by(out('children').
valueMap().
with(WithOptions.tokens).by(unfold()))
).unfold().
group().by(keys).
by(select(values))
If valueMap().with(WithOptions.tokens) is not supported in Cosmos, use valueMap(true) instead.

How do I configure OrientDB ETL to import an edge list with attributes

I have a CSV which contains an edge list, one edge per row. It looks like this:
id1, id2, attr1, attr2, attrX, attrY, attrZ
From this, I want to be able to create (or update) the following, per row:
Vertex A of class X, with id1 and attribute attr1
Vertex B of class X, with id2 and attribute attr2
Edge A->B with edge attributes attrX, attrY, attrZ
This is the configuration file I'm feeding to oetl.sh (using OrientDB 2.2 beta2):
{
  "source": { "file": { "path": "/data/sample/test.csv" } },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": {} },
    { "merge": { "joinFieldName": "id1", "lookup": "X.id" } },
    { "vertex": { "class": "X", "skipDuplicates": true } },
    { "edge": {
        "unresolvedLinkAction": "WARNING",
        "class": "EdgeTypeClass",
        "joinFieldName": "id2",
        "lookup": "X.id",
        "edgeFields": { "attrX": "${input.attrX}", "attrY": "${input.attrY}", "attrZ": "${input.attrZ}" }
      }
    },
    { "field": { "fieldNames": [ "id1", "id2", "attr1", "attr2", "attrX", "attrY", "attrZ" ], "operation": "remove" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/test2",
      "dbType": "graph"
    }
  }
}
The sample data I used to run the test is as follows:
10,11,"A","B",100,200,1
11,12,"B","C",110,201,5
12,14,"C","D",90,250,10
14,13,"D","E",105,210,3
When I run the oetl.sh script with the given configuration and sample data, it creates 4 vertices instead of 5 and no edges. There are no attributes on the vertices at all.
So these are the questions:
Is there a way in the vertex clause to specify vertex attributes/fields the same way that one can do for edges (i.e. edgeFields)? The documentation doesn't mention anything about it but it seems odd that you wouldn't be able to do it.
Rather than relying on the edge to create the outbound vertex, should I instead be creating two vertices explicitly and if so how do I specify that in the configuration file? When I try to add two "vertex" clauses it only seems to pick up the last one as the "current" vertex.
It's possible that the specific edge (id1 -> id2) already exists. Is it possible to only update the edge attributes in this case?
My sinking feeling is that, given the complexity and the number of things I'm trying to pack into this, it will be simpler to write my own ETL (e.g. using the Java API) instead of relying on oetl, but I was hoping I'd be able to avoid doing that, if only because it's more maintainable.

Combine orderByPriority with equalTo

I have a dataset like this:
[
  {
    "projectId": "fdsFDSFaSdA",
    "teamId": "ASDasdASDsada"
    ...
  },
  {
    "projectId": "DSF432afdsf",
    "teamId": "fdsASfsdasdd"
    ...
  },
  ...
]
I need to select objects from this list, sometimes by projectId and sometimes by teamId.
I know that this is possible like so:
ref.orderByChild('key').equalTo(value)
But the problem is that I need to order by priority.
I see that equalTo() takes two parameters:
equalTo(value, [key])
I tried like so but it doesn't work:
ref.orderByPriority().equalTo(value, 'key')
How can I make it work?
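One hedged workaround sketch, since a Realtime Database query accepts only a single ordering: filter server-side by the child and rebuild the priority ordering client-side. Numeric priorities and the someProjectId variable are assumptions here, not part of the original question:
ref.orderByChild('projectId').equalTo(someProjectId).once('value', function (snapshot) {
  var items = [];
  snapshot.forEach(function (child) {
    // DataSnapshot.getPriority() exposes each child's priority for local sorting
    items.push({ key: child.key, priority: child.getPriority(), value: child.val() });
  });
  items.sort(function (a, b) { return a.priority - b.priority; }); // assumes numeric priorities
  console.log(items);
});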

Query to get exact matches of an Elastic field with multiple values in an array

I want to write a query in Elastic that applies a filter based on values I have in an array (in my R program). Essentially, the query:
Matches a time range (time field in Elastic)
Matches "trackId" field in Elastic to any value in array oth_usr
Return 2 fields - "trackId", "propertyId"
I have the following primitive version of the query but do not know how to use the oth_usr array in a query (part 2 above).
query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}',start_date,end_date)
view_list <- elastic::Search(index = "organised_recent",type = "PROPERTY_VIEW",size = 10000000,
body=query, fields = c("trackId", "propertyId"))$hits$hits
You need to add a terms query and embed it, together with the range query, inside a bool/must query. Try updating your query like this:
terms <- paste(sprintf("\"%s\"", oth_usr), collapse=", ")
query <- sprintf('{"query":{"bool":{"must":[{"terms": {"trackId": [%s]}},{"range": {"time": {"gte": "%s","lte": "%s"}}}]}}}',terms,start_date,end_date)
I'm not fluent in R syntax, but this is a raw JSON query that works.
It checks whether your time field matches the given range (start_time and end_time) and whether one of your terms exactly matches trackId.
It returns only the trackId and propertyId fields, as per your request:
POST /indice/_search
{
  "_source": {
    "include": [
      "trackId",
      "propertyId"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "time": {
              "gte": "start_time",
              "lte": "end_time"
            }
          }
        },
        {
          "terms": {
            "trackId": [
              "terms"
            ]
          }
        }
      ]
    }
  }
}
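If you want to send that raw JSON from R, here is a sketch that reuses the elastic::Search call from the question; the _source filter and the sprintf substitutions are untested assumptions:
# build a quoted, comma-separated list from the oth_usr vector
terms <- paste(sprintf('"%s"', oth_usr), collapse = ", ")

# raw bool/must query: range on time plus exact terms on trackId,
# returning only trackId and propertyId via _source filtering
query <- sprintf('{
  "_source": { "include": ["trackId", "propertyId"] },
  "query": {
    "bool": {
      "must": [
        { "range": { "time": { "gte": "%s", "lte": "%s" } } },
        { "terms": { "trackId": [%s] } }
      ]
    }
  }
}', start_date, end_date, terms)

view_list <- elastic::Search(index = "organised_recent", type = "PROPERTY_VIEW",
                             size = 10000000, body = query)$hits$hits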
