Gremlin union query takes a long time to execute against Neptune DB - gremlin

The following Gremlin query takes a long time to run when it uses the union() step. It is a common search query that is used from many pages.
It works fine on entities with fewer nodes. However, when it searches entities with a larger number of nodes, it times out from the UI (after 30 seconds).
I have to use containing() because the search runs as the user types (after 4 characters).
I am using an AWS Neptune database with Gremlin-Python.
It works for entities with 30k nodes.
It times out for entities with 200k+ nodes.
When I run these clauses separately they work fine, so it looks like union() is what takes a long time.
g.V().hasLabel("org").union(
        has(T.id, containing("{searchtext}")),
        has("name", containing("{searchtext}")),
        has("tin", containing("{searchtext}")),
     where(in_("orgcode").has(T.id, containing("{searchtext}".upper())))).limit(10).dedup().project("id","name").by(__.id()).by("name").toList()
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V().hasLabel("org").union(
has(T.id, containing("0000804415")),
has("name", containing("0000804415")),
has("tin", containing("0000804415")),
inE("orgcode").has(T.id, containing("0000804415"))).limit(10).dedup().project("id","name").by(__.id()).by("name")
Original Traversal
==================
[GraphStep(vertex,[]), HasStep([~label.eq(org)]), UnionStep([[HasStep([~id.containing(0000804415)]), EndStep], [HasStep([name.containing(0000804415)]), EndStep], [HasStep([tin.containing(0000804415)]), EndStep], [VertexStep(IN,[orgcode],edge), HasStep([~id.containing(0000804415)]), EndStep]]), RangeGlobalStep(0,10), DedupGlobalStep, ProjectStep([id, name],[[IdStep], value(name)])]
Optimized Traversal
===================
Neptune steps:
[
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
PatternNode[(?1, <~label>, ?2=<org>, <~>) . project ?1 .], {indexTime=0, joinTime=166, numSearches=1}
}, annotations={path=[Vertex(?1):GraphStep], joinStats=true, optimizationTime=0, maxVarId=15, chunkSize=10, executionTime=16936}
},
NeptuneUnionStep {
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
FilterByP(?1: containing(0000804415)) .
}, annotations={initialValues={?1=null}, executionTime=16936, path=[Vertex(?1):GraphStep], chunkSize=10, optimizationTime=0, maxVarId=15, joinStats=true}
},
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
PatternNode[(?1, <name>, ?9, ?) . project ask . FilterByP(?9: containing(0000804415)) .], {indexTime=102, joinTime=4304, numSearches=205662}
}, annotations={initialValues={?1=null}, executionTime=16936, path=[Vertex(?1):GraphStep], chunkSize=10, optimizationTime=0, maxVarId=15, joinStats=true}
},
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
PatternNode[(?1, <tin>, ?10, ?) . project ask . FilterByP(?10: containing(0000804415)) .], {indexTime=93, joinTime=4215, numSearches=205662}
}, annotations={initialValues={?1=null}, executionTime=16935, path=[Vertex(?1):GraphStep], chunkSize=10, optimizationTime=0, maxVarId=15, joinStats=true}
},
NeptuneGraphQueryStep(Edge) {
JoinGroupNode {
PatternNode[(?11, ?13=<orgcode>, ?1, ?14) . project ?1,?14 . IsEdgeIdFilter(?14) . FilterByP(?14: containing(0000804415)) .], {indexTime=331, joinTime=4847, numSearches=20567}
}, annotations={initialValues={?1=null}, executionTime=16935, path=[Vertex(?1):GraphStep, Edge(?14):VertexStep], chunkSize=10, optimizationTime=0, maxVarId=15, joinStats=true}
}
},
NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: RangeGlobalStep(0,10),
Neptune steps:
[
NeptuneMemoryTrackerStep
]
+ not converted into Neptune steps: DedupGlobalStep,ProjectStep([id, name],[[IdStep], value(name)]),
WARNING: >> [RangeGlobalStep(0,10), DedupGlobalStep] << (or one of the children for each step) is not supported natively yet
Physical Pipeline
=================
NeptuneGraphQueryStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(10)
|-- DynamicJoinOp(PatternNode[(?1, <~label>, ?2=<org>, <~>) . project ?1 .])
NeptuneUnionStep
|-- BindingSetQueue
|-- JoinGroupOp
|-- FilterOp(FilterByP(?1: containing(0000804415)) .)
|-- BindingSetQueue
|-- JoinGroupOp
|-- SpoolerOp(10)
|-- DynamicJoinOp(PatternNode[(?1, <name>, ?9, ?) . project ask . FilterByP(?9: containing(0000804415)) .])
|-- BindingSetQueue
|-- JoinGroupOp
|-- SpoolerOp(10)
|-- DynamicJoinOp(PatternNode[(?1, <tin>, ?10, ?) . project ask . FilterByP(?10: containing(0000804415)) .])
|-- BindingSetQueue
|-- JoinGroupOp
|-- SpoolerOp(10)
|-- DynamicJoinOp(PatternNode[(?11, ?13=<orgcode>, ?1, ?14) . project ?1,?14 . IsEdgeIdFilter(?14) . FilterByP(?14: containing(0000804415)) .])
Runtime (ms)
============
Query Execution: 16936.285
Serialization: 0.085
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(Vertex) 205662 205662 282.539 1.67
NeptuneUnionStep([[NeptuneGraphQueryStep(Vertex... 16653.466 98.33
NeptuneTraverserConverterStep 0.023 0.00
RangeGlobalStep(0,10) 0.003 0.00
NeptuneMemoryTrackerStep 0.006 0.00
DedupGlobalStep 0.004 0.00
ProjectStep([id, name],[[IdStep], value(name)]) 0.004 0.00
>TOTAL - - 16936.048 -
Predicates
==========
# of predicates: 201
Results
=======
Count: 0
Output: []
Response serializer: application/vnd.gremlin-v3.0+json
Response size (bytes): 216
Index Operations
================
Query execution:
# of statement index ops: 431892
# of unique statement index ops: 431892
Duplication ratio: 1.0
# of terms materialized: 910040
Serialization:
# of statement index ops: 0
# of terms materialized: 0
%%gremlin

First, you can get more detail on how the query executes on Neptune by using Neptune's Gremlin query explainer (versus the default TinkerPop explain() or profile() steps). To do this, use %%gremlin profile when executing the query in a cell.
On to the query...
It's likely not the union() that is causing the issue here, but the use of containing(). Neptune does not maintain a substring index of string properties within the database, so using any of the Gremlin text predicates [1] will incur some sort of partial scan of all properties with the property key that you're using for those filters. The profile in the question is consistent with this: the name and tin patterns each report numSearches=205662, the same as the number of org vertices produced by the first step.
If you need to run these types of queries on a regular basis, it would be beneficial to use Neptune's full-text search integration with OpenSearch [2].
[1] https://tinkerpop.apache.org/docs/current/reference/#a-note-on-predicates
[2] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search-cfn-create.html
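To illustrate the general point about indexes, here is a minimal Python sketch. It is not Neptune's storage engine; the row data and the function names (build_exact_index, exact_lookup, containing_scan) are made up for this example. It only shows why an index keyed on whole property values can answer an exact match directly, while a substring match has to inspect every value.

```python
from collections import defaultdict


def build_exact_index(rows):
    """Map each full property value to the vertex ids that carry it."""
    index = defaultdict(list)
    for vertex_id, name in rows:
        index[name].append(vertex_id)
    return index


def exact_lookup(index, name):
    """One keyed lookup, independent of how many rows exist."""
    return index.get(name, [])


def containing_scan(rows, text):
    """A substring match cannot use the value index: every row is checked."""
    return [vertex_id for vertex_id, name in rows if text in name]


# Hypothetical sample data standing in for org vertices.
rows = [("org-1", "acme corp"), ("org-2", "beta acme"), ("org-3", "zenith")]
index = build_exact_index(rows)

print(exact_lookup(index, "acme corp"))
print(containing_scan(rows, "acme"))
```

With 200k+ vertices the scan is repeated for each property key in the union, which is the kind of work a dedicated text index such as OpenSearch is meant to take over.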

Related

How to extrapolate values in one AWS CLI output with values from two separate CLI outputs as input files?

I am trying to build an audit/compliance report from IAM Identity Center. We need a list of groups and their respective group members. At current count we have 1,500+ users and 700+ groups across 120 accounts in AWS.
There isn't an API command that produces this data directly, so I'm putting a few commands together to extract the groups to files in CloudShell. Then I need to cross-reference them and put everything into a CSV that the auditors can filter in Excel.
Retrieve UserName and UserID - store in UserIds.json
aws identitystore list-users --identity-store-id d-123456789 | jq '.Users[] | {Name: .UserName, ID: .UserId}' > UserIds.json
Retrieve Groups and GroupIDs - store in GroupsID.json
aws identitystore list-groups --identity-store-id d-123456789 | jq '.Groups[] | {GroupName: .DisplayName, ID: .GroupId}' > GroupsID.json
Retrieve list of All Users per Group - store in GroupsMembers.json
result=$(aws identitystore list-groups --identity-store-id d-123456789 | jq -r '.Groups[].GroupId')
for val in $result; do
aws identitystore list-group-memberships --identity-store-id d-123456789 --group-id "$val" |
jq '.GroupMemberships[] | {GroupID: .GroupId, Member: .MemberId.UserId}' >> GroupsMembers.json
done
Example output from UserIds.json:
{
"Name": "first.last@example.com",
"ID": "123456789-9876543210-ABCD-4321-1234"
}
{
"Name": "last.first@example.com",
"ID": "12345678-4321-1234-2233-9876543210"
}
Example output from GroupsID.json:
{
"GroupName": "sso-aws-zone-role-CloudCoreOps",
"ID": "123456789-55668877-1234-5522-2255-987654321"
}
{
"GroupName": "sso-aws-zone-role-CloudCoreRO",
"ID": "1234567890-11224455-2255-5522-1343-9876543210"
}
Example Output from GroupsMembers.json:
{
"GroupID": "123456789-55668877-1234-5522-2255-987654321",
"Member": "123456789-9876543210-ABCD-4321-1234"
}
{
"GroupID": "1234567890-11224455-2255-5522-1343-9876543210",
"Member": "12345678-4321-1234-2233-9876543210"
}
Now I just need to correlate the files. I have read that jq can be used like sed, so I should be able to replace the key values in GroupsMembers.json: first replace each GroupID with the matching GroupName from GroupsID.json, then replace each Member with the user name whose ID matches in UserIds.json.
I think this can be done in a loop, but I want to learn not only how to do this, but also the best way.
It should be doable with INDEX and JOIN in a two-level nesting:
jq --slurpfile users UserIds.json --slurpfile groups GroupsID.json '
JOIN($groups | INDEX(.ID);
JOIN($users | INDEX(.ID); .; .Member; add);
.GroupID; add) | {Name, GroupName}
' GroupsMembers.json
{
"Name": "first.last@example.com",
"GroupName": "sso-aws-zone-role-CloudCoreOps"
}
{
"Name": "last.first@example.com",
"GroupName": "sso-aws-zone-role-CloudCoreRO"
}
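Since the end goal in the question is a CSV for the auditors, the same join can also be sketched in Python. This is a minimal sketch, not part of the jq answer: the helper names (read_json_stream, join_memberships, to_csv) and the inline sample strings are assumptions for illustration, and the key names (Name, ID, GroupName, GroupID, Member) are taken from the example files above.

```python
import csv
import io
import json


def read_json_stream(text):
    """Parse concatenated JSON objects, the shape jq writes without --slurp."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def join_memberships(users, groups, memberships):
    """Replace ids with names; fall back to the raw id when no match exists."""
    user_names = {u["ID"]: u["Name"] for u in users}
    group_names = {g["ID"]: g["GroupName"] for g in groups}
    return [
        {
            "GroupName": group_names.get(m["GroupID"], m["GroupID"]),
            "Name": user_names.get(m["Member"], m["Member"]),
        }
        for m in memberships
    ]


def to_csv(rows):
    """Render the joined rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["GroupName", "Name"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Inline samples; in practice these would be the contents of the three files.
users = read_json_stream('{"Name": "first.last@example.com", "ID": "u-1"}\n'
                         '{"Name": "last.first@example.com", "ID": "u-2"}\n')
groups = read_json_stream('{"GroupName": "sso-aws-zone-role-CloudCoreOps", "ID": "g-1"}')
memberships = read_json_stream('{"GroupID": "g-1", "Member": "u-1"}\n'
                               '{"GroupID": "g-1", "Member": "u-2"}')

rows = join_memberships(users, groups, memberships)
print(to_csv(rows))
```

Building the two lookup dictionaries once keeps the join linear in the number of memberships, which should be comfortable at the sizes mentioned in the question.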

Modifying Kusto to get the logs output

I have the KQL query below which, when run in Log Analytics, gives me the right result. I have now moved my logs to a storage account and created an ADX external table to query the same logs using Kusto. However, I am finding it difficult to query them: the same query won't work and needs some modification. I would appreciate advice on what changes I should make to the existing query to get the same result.
In Log Analytics this works:
"AzureDiagnostics
| where Category == 'kube-audit'
| where TimeGenerated between (datetime("$querystart") .. datetime("$queryend"))
| where (strlen(log_s) >= 32000
and not(log_s contains \"aksService\")
and not(log_s contains \"system:serviceaccount:crossplane-system:crossplane\")
and not(log_s contains \"system:serviceaccount:elastic-system:elastic-operator\")
and not(log_s contains \"system:serviceaccount:internal-services:cert-manager-cainjector\")
and not(log_s contains \"system:serviceaccount:internal-services:spinnaker\")
and not(log_s contains \"system:serviceaccount:kube-system:daemon-set-controller\")
and not(log_s contains \"system:serviceaccount:kube-system:deployment-controller\")
and not(log_s contains \"system:serviceaccount:kube-system:endpoint-controller\")
and not(log_s contains \"system:serviceaccount:kube-system:node-controller\")
and not(log_s contains \"system:serviceaccount:kube-system:replicaset-controller\")
and not(log_s contains \"system:serviceaccount:kube-system:statefulset-controller\"))
or strlen(log_s) < 32000
| extend op = parse_json(log_s)
| where not(tostring(op.verb) in (\"list\", \"get\", \"watch\"))
| where not(tostring(op.user.username) hasprefix \"system:\")
| where not(tostring(op.user.username) in (\"hcpService\", \"aksService\", \"aksProblemDetector\", \"readinessChecker\", \"nodeclient\", \"masterclient\"))
| where substring(tostring(op.responseStatus.code), 0, 1) == \"2\"
| where not(tostring(op.requestURI) in (\"/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\"))
| extend user = op.user.username
| extend decision = tostring(parse_json(tostring(op.annotations)).[\"authorization.k8s.io/decision\"])
| extend requestURI = tostring(op.requestURI)
| extend name = tostring(parse_json(tostring(op.objectRef)).name)
| extend namespace = tostring(parse_json(tostring(op.objectRef)).namespace)
| extend verb = tostring(op.verb)
| project TimeGenerated, SubscriptionId, ResourceId, namespace, name, requestURI, verb, decision, ['user']
| order by TimeGenerated asc"
and the output in Log Analytics for query
AzureDiagnostics
| where Category == 'kube-audit'
After exporting to a storage account and creating an external table in ADX over it, I don't see the same schema. The result I get from the ADX external table for kube-audit looks something like this:
"operationName": Microsoft.ContainerService/managedClusters/diagnosticLogs/Read,
"category": kube-audit,
"ccpNamespace": 5c40f,
"resourceId": /SUBSCRIPTIONS/53AEB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV,
"properties": {
"log": "{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Request\",\"auditID\":\"d80ca0b72-75eaf\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/apps/v1/namespaces/events/deployments/api/scale\",\"verb\":\"get\",\"user\":{\"username\":\"system:serviceaccount:kube-system:horizontal-pod-autoscaler\",\"uid\":\"d5d7-ba1cfb172033\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:kube-system\",\"system:authenticated\"]},\"sourceIPs\":[\"100.11.11.0\"],\"userAgent\":\"kube-controller-manager/v1.22.6 (linux/amd64) kubernetes/0795921/system:serviceaccount:kube-system:horizontal-pod-autoscaler\",\"objectRef\":{\"resource\":\"deployments\",\"namespace\":\"events\",\"name\":\"api\",\"apiGroup\":\"apps\",\"apiVersion\":\"v1\",\"subresource\":\"scale\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:44:59.985416Z\",\"stageTimestamp\":\"2022-05-23T13:45:00.002107Z\",\"annotations\":{\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"system:controller:horizontal-pod-autoscaler\\\" of ClusterRole \\\"system:controller:horizontal-pod-autoscaler\\\" to ServiceAccount \\\"horizontal-pod-autoscaler/cfxyz\\\"\"}}\n",
"stream": "stdout",
"pod": "kube-apiserver-7d-q6v"
},
"time": 2022-05-23T13:45:00Z,
"Cloud": AzureCloud,
"Environment": prod,
"UnderlayClass": hcp-underlay,
"UnderlayName": hcp-underlay-norteurope-cx-624,
External table schema:
"TableName": logsKube,
"Schema": operationName:string,category:string,ccpNamespace:string,resourceId:string,properties:dynamic,['time']:datetime,Cloud:string,Environment:string,UnderlayClass:string,UnderlayName:string,
"DatabaseName": logsstorage,
"Folder": ,
"DocString": ,
How can I run the above query in ADX to get the result?
Create the external table manually, using the original column names (see the documentation page "Create and alter Azure Storage external tables").
It should be something like this:
.create-or-alter external table logsKube (TenantId:string,TimeGenerated:datetime,ResourceId:string,Category:string,ResourceGroup:string,SubscriptionId:string,ResourceProvider:string,Resource:string,ResourceType:string,OperationName:string,ResultType:string,CorrelationId:string,ResultDescription:string,Tenant_g:string,JobId_g:string,RunbookName_s:string,StreamType_s:string,Caller_s:string,requestUri_s:string,Level:string,DurationMs:string,CallerIPAddress:string,OperationVersion:string,ResultSignature:string,id_s:string,status_s:string,LogicalServerName_s:string,Message:string,clientInfo_s:string,httpStatusCode_d:string,identity_claim_appid_g:string,identity_claim_http_schemas_microsoft_com_identity_claims_objectidentifier_g:string,userAgent_s:string,ruleName_s:string,identity_claim_http_schemas_xmlsoap_org_ws_2005_05_identity_claims_upn_s:string,systemId_g:string,isAccessPolicyMatch_b:string,EventName_s:string,httpMethod_s:string,subnetId_s:string,type_s:string,instanceId_s:string,macAddress_s:string,vnetResourceGuid_g:string,direction_s:string,subnetPrefix_s:string,primaryIPv4Address_s:string,conditions_sourcePortRange_s:string,priority_d:string,conditions_destinationPortRange_s:string,conditions_destinationIP_s:string,conditions_None_s:string,conditions_sourceIP_s:string,httpVersion_s:string,matchedConnections_d:string,startTime_t:string,endTime_t:string,DatabaseName_s:string,clientIP_s:string,host_s:string,requestQuery_s:string,sslEnabled_s:string,clientPort_d:string,httpStatus_d:string,receivedBytes_d:string,sentBytes_d:string,timeTaken_d:string,resultDescription_ErrorJobs_s:string,resultDescription_ChildJobs_s:string,identity_claim_http_schemas_microsoft_com_identity_claims_scope_s:string,workflowId_s:string,resource_location_s:string,resource_workflowId_g:string,resource_resourceGroupName_s:string,resource_subscriptionId_g:string,resource_runId_s:string,resource_workflowName_s:string,_schema_s:string,correlation_clientTrackingId_s:string,properties_sku_Family_s
:string,properties_sku_Name_s:string,properties_tenantId_g:string,properties_enabledForDeployment_b:string,code_s:string,resultDescription_Summary_MachineId_s:string,resultDescription_Summary_ScheduleName_s:string,resultDescription_Summary_Status_s:string,resultDescription_Summary_StatusDescription_s:string,resultDescription_Summary_MachineName_s:string,resultDescription_Summary_TotalUpdatesInstalled_d:string,resultDescription_Summary_RebootRequired_b:string,resultDescription_Summary_TotalUpdatesFailed_d:string,resultDescription_Summary_InstallPercentage_d:string,resultDescription_Summary_StartDateTimeUtc_t:string,resource_triggerName_s:string,resultDescription_Summary_InitialRequiredUpdatesCount_d:string,properties_enabledForTemplateDeployment_b:string,resultDescription_Summary_EndDateTimeUtc_s:string,resultDescription_Summary_DurationInMinutes_s:string,resource_originRunId_s:string,properties_enabledForDiskEncryption_b:string,resource_actionName_s:string,correlation_actionTrackingId_g:string,resultDescription_Summary_EndDateTimeUtc_t:string,resultDescription_Summary_DurationInMinutes_d:string,conditions_protocols_s:string,identity_claim_ipaddr_s:string,ElasticPoolName_s:string,identity_claim_http_schemas_microsoft_com_claims_authnmethodsreferences_s:string,RunOn_s:string,query_hash_s:string,SourceSystem:string,MG:string,ManagementGroupName:string,Computer:string,RawData:string,certificatePolicyProperties_certificateProperties_subject_s:string,certificatePolicyProperties_certificateProperties_validityInMonths_d:string,certificatePolicyProperties_keyProperties_type_s:string,certificatePolicyProperties_keyProperties_size_d:string,certificatePolicyProperties_keyProperties_export_b:string,certificatePolicyProperties_secretProperties_type_s:string,certificatePolicyProperties_certificateIssuerProperties_name_s:string,error_state_d:string,location_s:string,Tenant_s:string,RecoveryJobDestination_s:string,RecoveryJobRPLocation_s:string,RecoveryLocationType_s:string,upstream
SourcePort_s:string,ProtectedContainerOSType_s:string,ProtectedContainerOSVersion_s:string,GatewayManagerVersion_s:string,targetResources_CertificateName_s:string,displayResourceId_s:string,executionClusterType_s:string,clientResponseTime_d:string,targetResources_NodeConfigurationName_s:string,targetResources_NodeId_g:string,targetResources_CredentialId_g:string,targetResources_CredentialName_s:string,targetResources_DscConfigurationName_s:string,targetResources_VariableId_g:string,targetResources_VariableName_s:string,targetResources_RunbookId_g:string,targetResources_RunbookName_s:string,targetResources_ModuleId_g:string,targetResources_ModuleName_s:string,targetResources_ScheduleId_g:string,targetResources_ScheduleName_s:string,clientInfo_TenantId_g:string,clientInfo_Issuer_s:string,clientInfo_ObjectId_g:string,clientInfo_AppId_g:string,targetResources_JobScheduleId_g:string,targetResources_JobName_s:string,clientInfo_IpAddress_s:string,clientInfo_PrincipalName_s:string,clientInfo_ClientRequestId_g:string,targetResources_Resource_s:string,targetResources_JobId_g:string,targetResources_JobName_g:string,clusterType_s:string,identity_claim_upn_s:string,DataCenterName_s:string,identity_claim_scp_s:string,identity_claim_unique_name_s:string,identity_claim_amr_s:string,identity_claim_oid_g:string,identity_claim_home_oid_g:string,removedAccessPolicy_Permissions_storage_s:string,replicationHealthErrors_s:string,eventGridEventProperties_topic_s:string,eventGridEventProperties_subject_s:string,eventGridEventProperties_eventType_s:string,eventGridEventProperties_eventTime_t:string,eventGridEventProperties_data_Id_s:string,eventGridEventProperties_data_VaultName_s:string,eventGridEventProperties_data_ObjectType_s:string,eventGridEventProperties_data_ObjectName_s:string,eventGridEventProperties_data_Version_s:string,eventGridEventProperties_dataVersion_s:string,properties_networkAcls_bypass_s:string,properties_networkAcls_defaultAction_s:string,properties_softDeleteRetentionI
nDays_d:string,error_number_d:string,Severity:string,user_defined_b:string,state_d:string,PolicyUniqueId_s:string,ProtectedContainerName_g:string,identity_claim_http_schemas_xmlsoap_org_ws_2005_05_identity_claims_name_s:string,retryHistory_s:string,network_s:string,nexthop_s:string,locprf_s:string,weight_s:string,path_s:string,addressfamily_s:string,ClientOperationId_g:string,CorrelationRequestId_g:string,Region_s:string,ScaleUnit_s:string,ActivityId_g:string,EventTimeString_s:string,EventProperties_s:string,SKU_s:string,virtual_core_count_s:string,avg_cpu_percent_s:string,reserved_storage_mb_s:string,storage_space_used_mb_s:string,io_requests_s:string,io_bytes_read_s:string,io_bytes_written_s:string,timeOfOccurence_t:string,eventType_s:string,description_s:string,healthErrors_s:string,logId_g:string,removedAccessPolicy_TenantId_g:string,removedAccessPolicy_ObjectId_g:string,removedAccessPolicy_Permissions_keys_s:string,removedAccessPolicy_Permissions_secrets_s:string,removedAccessPolicy_Permissions_certificates_s:string,addedAccessPolicy_TenantId_g:string,addedAccessPolicy_ObjectId_g:string,addedAccessPolicy_Permissions_keys_s:string,addedAccessPolicy_Permissions_secrets_s:string,addedAccessPolicy_Permissions_certificates_s:string,addedAccessPolicy_Permissions_storage_s:string,properties_enableSoftDelete_b:string,JobOperationSubType_s:string,DataTransferredInMB_s:string,ProtectedInstanceCount_s:string,StorageConsumedInMBs_s:string,StorageType_s:string,StorageName_s:string,OldestRecoveryPointTime_s:string,OldestRecoveryPointLocation_s:string,LatestRecoveryPointTime_s:string,LatestRecoveryPointLocation_s:string,BackupItemFrontEndSize_s:string,StorageUniqueId_s:string,AlertConsolidationStatus_s:string,CountOfAlertsConsolidated_s:string,AlertRaisedOn_s:string,AlertCode_s:string,RecommendedAction_s:string,AlertUniqueId_s:string,AlertType_s:string,AlertStatus_s:string,AlertOccurrenceDateTime_s:string,AlertSeverity_s:string,TelemetryProperties_s:string,AdHocOrScheduledJob
_s:string,affectedResourceId_s:string,JobUniqueId_g:string,JobOperation_s:string,JobStatus_s:string,JobFailureCode_s:string,JobStartDateTime_s:string,JobDurationInSecs_s:string,RecoveryJobRPDateTime_s:string,affectedResourceName_s:string,affectedResourceId_g:string,affectedResourceType_s:string,logId_d:string,DeploymentUnit_s:string,CloudStorageInBytes_s:string,ProtectedInstances_s:string,trustedService_s:string,OptionName_s:string,OptionDesiredState_s:string,OptionActualState_s:string,OptionDisableReason_s:string,IsDisabledBySystem_d:string,DatabaseDesiredMode_s:string,DatabaseActualMode_s:string,RegisteredContainerId_s:string,ProtectedServerType_s:string,ProtectedServerFriendlyName_s:string,BackupManagementServerUniqueId_s:string,BackupItemId_s:string,ProtectedServerName_s:string,ProtectionState_s:string,ProtectedServerUniqueId_s:string,exec_type_d:string,wait_category_s:string,total_query_wait_time_ms_d:string,max_query_wait_time_ms_d:string,is_parameterizable_s:string,statement_type_s:string,statement_key_hash_s:string,query_param_type_d:string,interval_start_time_d:string,interval_end_time_d:string,logical_io_writes_d:string,max_logical_io_writes_d:string,physical_io_reads_d:string,max_physical_io_reads_d:string,logical_io_reads_d:string,max_logical_io_reads_d:string,execution_type_d:string,count_executions_d:string,cpu_time_d:string,max_cpu_time_d:string,dop_d:string,max_dop_d:string,rowcount_d:string,max_rowcount_d:string,query_max_used_memory_d:string,max_query_max_used_memory_d:string,duration_d:string,max_duration_d:string,num_physical_io_reads_d:string,max_num_physical_io_reads_d:string,log_bytes_used_d:string,max_log_bytes_used_d:string,query_id_d:string,plan_id_d:string,query_plan_hash_s:string,statement_sql_handle_s:string,tags_displayName_s:string,error_code_s:string,error_message_s:string,start_utc_date_t:string,end_utc_date_t:string,wait_type_s:string,delta_max_wait_time_ms_d:string,delta_signal_wait_time_ms_d:string,delta_wait_time_ms_d:string,delt
a_waiting_tasks_count_d:string,LogBackupFrequency_s:string,LogBackupRetentionDuration_s:string,PolicyTimeZone_s:string,PolicyName_s:string,BackupFrequency_s:string,BackupTimes_s:string,BackupDaysOfTheWeek_s:string,DailyRetentionDuration_s:string,DailyRetentionTimes_s:string,ProtectedContainerFriendlyName_s:string,ProtectedContainerWorkloadType_s:string,ProtectedContainerName_s:string,ProtectedContainerProtectionState_s:string,ProtectedContainerLocation_s:string,ProtectedContainerType_s:string,listenerName_s:string,backendPoolName_s:string,backendSettingName_s:string,originalRequestUriWithArgs_s:string,transactionId_g:string,sslCipher_s:string,sslProtocol_s:string,sslClientVerify_s:string,sslClientCertificateFingerprint_s:string,sslClientCertificateIssuerName_s:string,serverRouted_s:string,serverStatus_s:string,serverResponseLatency_s:string,originalHost_s:string,EndpointName_s:string,Status_s:string,NodeId_g:string,NodeName_s:string,NodeComplianceStatus_s:string,DscReportId_g:string,DscReportStatus_s:string,LastSeenTime_t:string,ReportStartTime_t:string,ReportEndTime_t:string,ConfigurationMode_s:string,HostName_s:string,NumberOfResources_d:string,IPAddress:string,DscResourceId_s:string,DscResourceName_s:string,DscResourceStatus_s:string,DscModuleName_s:string,DscModuleVersion_s:string,DscConfigurationName_s:string,DscResourceDuration_d:string,ErrorCode_s:string,ErrorMessage_s:string,BackupItemProtectionState_s:string,BackupItemAppVersion_s:string,BackupItemUniqueId_s:string,BackupItemName_s:string,BackupItemFriendlyName_s:string,BackupItemType_s:string,BackupManagementType_s:string,ProtectedContainerUniqueId_s:string,PolicyUniqueId_g:string,timeStamp_t:string,lastRecoveryPoint_t:string,latestAppConsistentRecoveryPoint_t:string,replicatingDisksCount_d:string,uploadRPOInSeconds_d:string,uploadRPOUpdateTime_t:string,processedRPOInSeconds_d:string,processedRPOUpdateTime_t:string,EventId_d:string,VaultUniqueId_s:string,VaultName_s:string,AzureDataCenter_s:string,VaultTag
s_s:string,ResourceGroupName_s:string,StorageReplicationType_s:string,SchemaVersion_s:string,State_s:string,InstanceName_s:string,Value_s:string,ProviderName_s:string,TaskName_s:string,agentVersion_s:string,recoveryRegion_s:string,multiVmGroupId_g:string,multiVmGroupName_s:string,multiVmGroupCreateOption_s:string,recoveryNetworkId_s:string,lastHeartbeat_t:string,multiVmSyncStatus_s:string,targetVmNicDetails_s:string,recoveryServicesProviderId_g:string,replicationHealth_s:string,failoverHealth_s:string,name_s:string,id_g:string,primaryFabricName_s:string,recoveryFabricName_s:string,primaryFabricType_s:string,recoveryFabricType_s:string,primaryContainerName_s:string,recoveryContainerName_s:string,protectionState_s:string,activeLocation_s:string,policyName_s:string,replicationProviderName_s:string,osFamily_s:string,initialReplicationProgressPercentage_d:string,itemType_s:string,failoverHealthErrors_s:string,rpoInSeconds_d:string,lastRpoCalculatedTime_t:string,version_s:string,attrs_s:string,containerID_s:string,ccpNamespace_s:string,log_s:string,stream_s:string,pod_s:string,Cloud_s:string,Environment_s:string,UnderlayClass_s:string,UnderlayName_s:string,msg_s:string,AdditionalFields:string,Type:string,_ResourceId:string)
kind=storage
dataformat=csv
(
h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey'
)
with (includeHeaders=all)
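Judging only from the sample record in the question, the exported data seems to hold the same information under different names: properties.log in place of log_s, time in place of TimeGenerated, category in place of Category, and resourceId in place of ResourceId. The Python sketch below spells out that assumed correspondence on a shortened copy of the sample record. The function name flatten_kube_audit is hypothetical, and the mapping is an inference from the question, not something the answer above confirms.

```python
import json


def flatten_kube_audit(record):
    """Map one exported record to the column names used by the original query."""
    log_text = record["properties"]["log"]
    op = json.loads(log_text)  # plays the role of parse_json(log_s)
    object_ref = op.get("objectRef", {})
    return {
        "TimeGenerated": record["time"],
        "ResourceId": record["resourceId"],
        "Category": record["category"],
        "log_s": log_text,
        "user": op.get("user", {}).get("username"),
        "verb": op.get("verb"),
        "requestURI": op.get("requestURI"),
        "namespace": object_ref.get("namespace"),
        "name": object_ref.get("name"),
        "decision": op.get("annotations", {}).get("authorization.k8s.io/decision"),
    }


# Shortened version of the record shown in the question.
sample = {
    "category": "kube-audit",
    "resourceId": "/SUBSCRIPTIONS/53AEB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/"
                  "MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV",
    "time": "2022-05-23T13:45:00Z",
    "properties": {
        "log": json.dumps({
            "verb": "get",
            "requestURI": "/apis/apps/v1/namespaces/events/deployments/api/scale",
            "user": {"username": "system:serviceaccount:kube-system:horizontal-pod-autoscaler"},
            "objectRef": {"namespace": "events", "name": "api"},
            "responseStatus": {"code": 200},
            "annotations": {"authorization.k8s.io/decision": "allow"},
        }),
    },
}

row = flatten_kube_audit(sample)
print(row["user"], row["verb"], row["decision"])
```

Note that the sample record has no separate SubscriptionId field; in this data it only appears as part of resourceId.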

Compare json files but ignore values

I would like to compare two JSON files and report differences, but I am interested in keys only, not values. For example, the "json-diff" between the following two files (the real ones are of course much more complicated):
{
"http": {
"https": true,
"swagger": {
"enabled": false
},
"scalingFactors": [0.1, 0.2]
}
}
{
"http": {
"https": true,
"swagger": {
"enabled": true
},
"scalingFactors": [0.1, 0.1],
"test": true
}
}
should report that there is a missing key:
http.test
but
should not report that the following keys have different values:
http.swagger.enabled
http.scalingFactors
I looked at the jq tool but I am not sure how to ignore values.
Ignoring potential complications having to do with arrays, looking at the "symmetric difference" of the sets of paths to scalars would make sense. As a starting point, you could thus consider:
jq -c '
[paths(scalars)] as $f1
| [input | paths(scalars)] as $f2
| ($f1 - $f2) + ($f2 - $f1)' file1.json file2.json
You might want to stringify the paths, but then again, it might be wise to avoid doing so if the mapping to the strings is not invertible.
If arrays are present, you might want to compare the paths while ignoring the array indices:
def p: [paths(scalars) | map(select(type=="string"))] | unique;
p as $f1
| (input | p) as $f2
| ($f1 - $f2) + ($f2 - $f1)
| .[]
The last line ensures that the result is a (possibly empty) stream, the point being that this makes it easy to check the return code to determine whether any difference was detected: simply use the -e command-line option. If there are no differences, the return code will then be 4.
One way to check if the stream is empty would be to use the -4
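For comparison, the second jq program above (paths to scalars with array indices dropped, then a symmetric difference) can be sketched in Python. This is an illustrative equivalent rather than a translation of jq's exact semantics; the function names key_paths and key_diff are made up here. Paths are joined with "." only for display, which, as noted above, is not invertible if a key itself contains a dot.

```python
def key_paths(node, prefix=()):
    """Collect key paths that lead to scalars, skipping array indices."""
    if isinstance(node, dict):
        paths = set()
        for key, value in node.items():
            paths |= key_paths(value, prefix + (key,))
        return paths
    if isinstance(node, list):
        paths = set()
        for item in node:
            # Array elements share the path of the array itself.
            paths |= key_paths(item, prefix)
        return paths
    return {prefix}


def key_diff(left, right):
    """Symmetric difference of the two path sets, as dotted strings."""
    return sorted(".".join(path) for path in key_paths(left) ^ key_paths(right))


file1 = {"http": {"https": True, "swagger": {"enabled": False},
                  "scalingFactors": [0.1, 0.2]}}
file2 = {"http": {"https": True, "swagger": {"enabled": True},
                  "scalingFactors": [0.1, 0.1], "test": True}}

print(key_diff(file1, file2))  # ['http.test']
```

An empty result list means the two documents have the same key structure, regardless of their values.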

Optimize NeptuneDB Gremlin query

vehicles --> accounts --> organizations <-- users
We have the above graph structure, where vehicles, accounts, organizations and users are vertex labels and the arrows indicate the edge direction.
Consider the following number of vertices:
organizations = 1
accounts per organization = 2
vehicles per account = 5000
users per organization = 100
Our requirement is, given two vertex IDs, to find the set of all users and vehicles that satisfy the above graph.
For example, given vertex1 = accounts:1 and vertex2 = organizations:1, find the set of users and vehicles that belong to these two vertices.
We have the following query:
g.V('accounts:1').outE().otherV().hasId('organizations:1')
.V('accounts:1').inE().otherV().as('B')
.V('organizations:1').inE().otherV().as('A')
.select('A', 'B')
While this works, the query takes ~3.5 seconds to complete. We know that there are going to be 500000 traversers for this query.
Is there a better way to do this?
Thanks for the help
Edit #1: Attaching the query's profile API response
Optimized Traversal
===================
Neptune steps:
[
NeptuneGraphQueryStep(VertexId)#[A, B] {
JoinGroupNode {
JoinGroupNode {
PatternNode[(?1=<accounts:1>, <lifestate>, "ACTIVE", ?) . project ?1 .], {estimatedCardinality=1799504, expectedTotalOutput=1, indexTime=0, joinTime=1, numSearches=1, actualTotalOutput=1}
PatternNode[(?1, ?5, ?3=<organizations:1>, ?6) . project ?1,?6,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=102, expectedTotalOutput=1, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=1}
PatternNode[(?6, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=1, indexTime=0, joinTime=2, numSearches=1, actualTotalOutput=1}
PatternNode[(?3, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1, indexTime=0, joinTime=2, numSearches=1, actualTotalOutput=1}
}, finishers=[dedup(?3)]
PatternNode[(?8=<organizations:1>, <~label>, ?9, <~>) . project distinct ?8 .], {estimatedCardinality=INFINITY, expectedTotalOutput=1, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=1}
PatternNode[(?8, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=1}
PatternNode[(?10, ?12, ?8, ?13) . project ?8,?13,?10 . IsEdgeIdFilter(?13) .], {estimatedCardinality=INFINITY, expectedTotalOutput=102, indexTime=0, joinTime=1, numSearches=1, actualTotalOutput=102}
PatternNode[(?13, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=102, indexTime=0, joinTime=128, numSearches=102, actualTotalOutput=102}
PatternNode[(?13, <role>, "admin", ?) . project ask .], {estimatedCardinality=113376, expectedTotalOutput=100, indexTime=1, joinTime=6, numSearches=102, actualTotalOutput=100}
PatternNode[(?10, <~label>, ?14=<users>, <~>) . project ask .], {estimatedCardinality=2326404, expectedTotalOutput=100, indexTime=0, joinTime=128, numSearches=100, actualTotalOutput=100}
PatternNode[(?10, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=100, indexTime=0, joinTime=83, numSearches=100, actualTotalOutput=100}
PatternNode[(?10, <~label>, ?15=<users>, <~>) . project ?10 .], {estimatedCardinality=2326404, expectedTotalOutput=100, indexTime=1, joinTime=1, numSearches=1, actualTotalOutput=100}
PatternNode[(?16=<accounts:1>, <~label>, ?17, <~>) . project distinct ?16 .], {estimatedCardinality=INFINITY, expectedTotalOutput=100, indexTime=0, joinTime=1, numSearches=1, actualTotalOutput=100}
PatternNode[(?16, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=100, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=100}
PatternNode[(?18, ?20, ?16, ?21) . project ?16,?21,?18 . IsEdgeIdFilter(?21) .], {estimatedCardinality=INFINITY, expectedTotalOutput=1000, indexTime=0, joinTime=119, numSearches=1, actualTotalOutput=500000}
PatternNode[(?21, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=1000, indexTime=194, joinTime=142, numSearches=5000, actualTotalOutput=500000}
PatternNode[(?18, <~label>, ?22=<vehicles>, <~>) . project ask .], {estimatedCardinality=238260, expectedTotalOutput=1000, indexTime=183, joinTime=499, numSearches=5000, actualTotalOutput=500000}
PatternNode[(?18, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1000, indexTime=193, joinTime=858, numSearches=5000, actualTotalOutput=500000}
PatternNode[(?18, <~label>, ?23=<vehicles>, <~>) . project ?18 .], {estimatedCardinality=238260, indexTime=360, joinTime=1372, numSearches=500}
}, annotations={path=[Vertex(?1):GraphStep, Edge(?6):VertexStep, Vertex(?3):EdgeOtherVertexStep, Vertex(?8):GraphStep, Edge(?13):VertexStep, Vertex(?10):EdgeOtherVertexStep, VertexId(?10):IdStep#[A], Vertex(?16):GraphStep, Edge(?21):VertexStep, Vertex(?18):EdgeOtherVertexStep, VertexId(?18):IdStep#[B]], joinStats=true, optimizationTime=329, maxVarId=24, executionTime=6279}
},
NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: [SelectStep(last,[A, B])]
WARNING: >> SelectStep(last,[A, B]) << (or one of its children) is not supported natively yet
Physical Pipeline
=================
NeptuneGraphQueryStep#[A, B]
|-- StartOp
|-- JoinGroupOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<accounts:1>, <lifestate>, "ACTIVE", ?) . project ?1 .], {estimatedCardinality=1799504, expectedTotalOutput=1})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1, ?5, ?3=<organizations:1>, ?6) . project ?1,?6,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=102, expectedTotalOutput=1})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?6, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=1})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1})
|-- FilterOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?8=<organizations:1>, <~label>, ?9, <~>) . project distinct ?8 .], {estimatedCardinality=INFINITY, expectedTotalOutput=1})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?8, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?10, ?12, ?8, ?13) . project ?8,?13,?10 . IsEdgeIdFilter(?13) .], {estimatedCardinality=INFINITY, expectedTotalOutput=102})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?13, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=102})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?13, <role>, "admin", ?) . project ask .], {estimatedCardinality=113376, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?10, <~label>, ?14=<users>, <~>) . project ask .], {estimatedCardinality=2326404, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?10, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?10, <~label>, ?15=<users>, <~>) . project ?10 .], {estimatedCardinality=2326404, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?16=<accounts:1>, <~label>, ?17, <~>) . project distinct ?16 .], {estimatedCardinality=INFINITY, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?16, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=100})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?18, ?20, ?16, ?21) . project ?16,?21,?18 . IsEdgeIdFilter(?21) .], {estimatedCardinality=INFINITY, expectedTotalOutput=1000})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?21, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=3341886, expectedTotalOutput=1000})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?18, <~label>, ?22=<vehicles>, <~>) . project ask .], {estimatedCardinality=238260, expectedTotalOutput=1000})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?18, <lifestate>, "ACTIVE", ?) . project ask .], {estimatedCardinality=1799504, expectedTotalOutput=1000})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?18, <~label>, ?23=<vehicles>, <~>) . project ?18 .], {estimatedCardinality=238260})
Runtime (ms)
============
Query Execution: 6283.262
Serialization: 2120.104
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(VertexId)#[A, obje... 500000 500000 2502.636 41.43
NeptuneTraverserConverterStep 500000 500000 2580.098 42.71
SelectStep(last,[A, B]) 500000 500000 958.328 15.86
>TOTAL - - 6041.062 -
Predicates
==========
# of predicates: 37
WARNING: reverse traversal with no edge label(s) - .in() / .both() may impact query performance
Results
=======
Count: 500000
Output: <Removed for space>
Response serializer: application/vnd.gremlin-v3.0+gryo
Response size (bytes): 64,000,045
Index Operations
================
Query execution:
# of statement index ops: 15915
# of unique statement index ops: 15915
Duplication ratio: 1.0
# of terms materialized: 0
Serialization:
# of statement index ops: 0
# of terms materialized: 0
If possible, always provide edge labels on traversal steps like in() and out(). Also, you do not need inE().otherV() unless you need data from the edge; in() will suffice. As a first step I would try:
g.V('accounts:1').out(<labels>).hasId('organizations:1')
.V('accounts:1').in(<labels>).as('B')
.V('organizations:1').in(<labels>).as('A')
.select('A', 'B')
Here <labels> stands for one or more edge labels, so a step would take the form in('works-with','knows').
Using edge labels, especially on the in steps, can help a lot in some cases, so I would start there. There are other rewrites that can be tried afterwards.
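The 500,000 results in the profile come from select('A', 'B') pairing every user with every vehicle. The sketch below reproduces that arithmetic in plain Python, using the counts from the question. The IDs such as users:0 and vehicles:0 are made up for illustration. The comparison with returning the two sets separately is only an assumption about what the client might need, not something the answer above proposes.

```python
from itertools import product

# Counts taken from the question.
USERS_PER_ORGANIZATION = 100
VEHICLES_PER_ACCOUNT = 5000

# Hypothetical vertex IDs, used only to illustrate the pairing.
users = [f"users:{i}" for i in range(USERS_PER_ORGANIZATION)]
vehicles = [f"vehicles:{i}" for i in range(VEHICLES_PER_ACCOUNT)]

# select('A', 'B') yields one row per (user, vehicle) combination.
paired_rows = sum(1 for _ in product(users, vehicles))

# Returning each set on its own would need one row per vertex.
separate_rows = len(users) + len(vehicles)

print(paired_rows)    # 500000
print(separate_rows)  # 5100
```

If the client only needs to know which users and which vehicles are involved, rather than every pairing, the smaller result would also shrink the serialization step, which the profile reports at about 2.1 seconds.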

How can I define local dependencies between roles in a collection in ansible?

I have a question about dependencies between roles in a collection.
In general, I want to know whether it is possible to define dependencies between roles in a collection as local dependencies, for example through a relative path.
I would like to implement these scenarios:
1. roleB depends on roleA
2. the default scenario of roleC should use roleA in prepare.yml to set up the environment
or
the default scenario of roleC should use roleA in converge.yml
I would like to get these dependencies as local dependencies.
For case 2, I tried to use a requirements.yml file with the appropriate entry in molecule.yml:
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  .. ...
provisioner:
  name: ansible
  # env:
  #   ANSIBLE_ROLES_PATH: "../../roles"
  playbooks:
    prepare: prepare.yml
  config_options:
    defaults:
      remote_user: ansible
dependency:
  name: galaxy
  options:
    ignore-certs: True
    ignore-errors: True
    requirements-file: requirements.yml
verifier:
  name: ansible
But unfortunately I can't solve the error:
ERROR Using /etc/ansible/ansible.cfg as config file
Starting galaxy role install process
- downloading role from file://../../tool-box
[ERROR]: failed to download the file: <urlopen error [Errno 2] No such file or directory: '/../tool-box'>
[WARNING]: - tool-box was NOT installed successfully.
ERROR! - you can use --ignore-errors to skip failed roles and finish processing the list.
Structure of the collection with roles:
mynamespace
|-- mycollection
    |-- roles
        |-- roleA
        |   |-- molecule
        |       |-- default
        |-- roleB
        |   |-- molecule
        |       |-- default
        |-- roleC
            |-- molecule
                |-- default
Thank you.
Update:
I opened a feature request in ansible/galaxy, because I don't think this functionality exists:
https://github.com/ansible/galaxy/issues/2719

Resources