Great Expectations - Result validation for row_count and column_freshness - amazon-dynamodb

I would like to validate row count and column freshness results for some data on AWS. The checks are configured in a check_config.json file, and I use Terraform to create a Glue job that runs the checks and writes the result to DynamoDB. The result in DynamoDB is not elaborate, and I would like it to be more specific about the exact values observed before a check is marked as fail or pass. For example, I would like to see when the table was last modified (column freshness) and the number of rows obtained from the count (expect_row_count).
Below is the current result in DynamoDB:
Below is the JSON in check_config.json:
{
  "table": "table1",
  "checks": [
    {
      "check": "custom_expect_column_to_be_fresh",
      "parameters": {
        "columns": [
          "column1"
        ],
        "strftime_format": "%Y-%m-%d",
        "threshold_days": 0,
        "threshold_hours": 10
      }
    },
    {
      "check": "expect_table_row_count_to_be_between",
      "result_format": "COMPLETE",
      "include_config": true,
      "parameters": {
        "min_value": 1,
        "max_value": 100000
      },
      "alarm": {
        "threshold": 100,
        "period": 3600
      }
    }
  ]
}
I was expecting a more elaborate result showing how many rows were obtained before the row-count check is marked as a failure, and I also want to see the table's last-modified timestamp before the column-freshness check marks a failure.
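For what it's worth, Great Expectations already computes the observed value when result_format is BASIC, SUMMARY, or COMPLETE; the usual fix is for the Glue script to copy result["observed_value"] into the DynamoDB item instead of only the pass/fail flag. A minimal sketch, assuming a Great Expectations Validator named validator and a results table named dq_results (both names hypothetical):

import boto3

# Run the check with a verbose result format so the observed value is included.
res = validator.expect_table_row_count_to_be_between(
    min_value=1, max_value=100000, result_format="COMPLETE"
)

item = {
    "table_name": "table1",
    "check": "expect_table_row_count_to_be_between",
    "success": res.success,
    # Great Expectations reports the actual row count here:
    "observed_row_count": res.result.get("observed_value"),
}

# "dq_results" is a hypothetical results table.
boto3.resource("dynamodb").Table("dq_results").put_item(Item=item)

For the freshness check, the last-modified timestamp will only appear if custom_expect_column_to_be_fresh itself places it in the result dict it returns (for example as observed_value); Great Expectations only surfaces what the expectation reports.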

Related

Create index on nested array value with dynamodb

I have the following data stored in a DynamoDB table called elo-history.
{
  "gameId": "chess",
  "guildId": "abc123",
  "id": "c3c640e2d8b76b034605d8835a03bef8",
  "recordedAt": 1621095861673,
  "results": [
    {
      "oldEloRating": null,
      "newEloRating": 2010,
      "place": 1,
      "playerIds": [
        "abc1"
      ]
    },
    {
      "oldEloRating": null,
      "newEloRating": 1990,
      "place": 2,
      "playerIds": [
        "abc2"
      ]
    }
  ],
  "versus": "1v1"
}
I have 2 indexes, guildId-recordedAt-index and gameId-recordedAt-index. These allow me to query on those fields.
I am trying to add another index for results[].playerIds[]. I want to be able to query for records with playerId=abc1 and have those sorted just like guildId and gameId. Does DynamoDB support something like this? Do I need to restructure the data, or save it in two different formats, to support this type of query?
Something like this.
New table called player-elo-history in addition to the elo-history table. This would store the list of games by playerId
{
  "id": "abc1",
  "gameId": "chess",
  "guildId": "abc123",
  "recordedAt": 1621095861673,
  "results": [
    [
      {
        "oldEloRating": null,
        "newEloRating": 2010,
        "place": 1,
        "playerIds": [
          "abc1"
        ]
      },
      {
        "oldEloRating": null,
        "newEloRating": 1990,
        "place": 2,
        "playerIds": [
          "abc2"
        ]
      }
    ]
  ]
}
{
  "id": "abc2",
  "gameId": "chess",
  "guildId": "abc123",
  "recordedAt": 1621095861673,
  "results": [
    [
      {
        "oldEloRating": null,
        "newEloRating": 2010,
        "place": 1,
        "playerIds": [
          "abc1"
        ]
      },
      {
        "oldEloRating": null,
        "newEloRating": 1990,
        "place": 2,
        "playerIds": [
          "abc2"
        ]
      }
    ]
  ]
}
It looks like you're modeling the one-to-many relationship between Games and Results using a complex attribute (e.g. a list of objects) on the Game item. This is a completely valid approach to modeling one-to-many relationships and is best used when 1) the results data doesn't change (or doesn't change often) and 2) you don't have any access patterns around Results.
Since it sounds like you do have access patterns around Results, you'd be better off storing your Results in their own items.
For example, you might consider modeling results in the user partition with PK=USER#user_id SK=RESULT#game_id. This would allow you to fetch results by User ID (QUERY where PK=USER#user_id and SK begins_with RESULT). Alternatively, you could model results with PK=RESULT#game_id SK=USER#user_id and create a GSI that swaps the PK and SK, which would allow you to group results by User.
I don't know the specifics around your access patterns, but can say that you'll need to move results into their own items if you want to support access patterns around game results.
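A minimal boto3 sketch of that layout, assuming a hypothetical table named elo with generic PK/SK attributes; appending the recordedAt timestamp to the sort key is my addition so a player's results within a game also sort by time:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("elo")  # hypothetical table name

# One Result item per player, written alongside the Game item.
table.put_item(Item={
    "PK": "USER#abc1",
    "SK": "RESULT#chess#1621095861673",  # timestamp appended for chronological ordering
    "gameId": "chess",
    "guildId": "abc123",
    "newEloRating": 2010,
    "place": 1,
})

# Access pattern: all results for a player, newest first.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("USER#abc1")
    & Key("SK").begins_with("RESULT#"),
    ScanIndexForward=False,
)
for item in resp["Items"]:
    print(item["SK"], item["newEloRating"])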

Cosmos DB query strings in array retain grouping and trim the value?

Say we have two documents in my collection:
{
  "id": "111",
  "linkedId": [
    "ABC:123",
    "ABC:456"
  ]
}
{
  "id": "222",
  "linkedId": [
    "DEF:321",
    "DEF:654"
  ]
}
What query can I run to get a result that will look like this?
{
  [
    "123",
    "456"
  ]
},
{
  [
    "321",
    "654"
  ]
}
I have tried
SELECT c.linkedId FROM c
But this keeps "linkedId" as the property name in the result set. I also tried LEFT, but it doesn't trim the first 4 characters of the string.
Then I tried
SELECT value cc FROM cc In c.linkedId
But this loses the grouping.
Any idea?
Since the elements are plain strings, not JSON objects, I suggest using a UDF in your Cosmos DB SQL query.
UDF:
function userDefinedFunction(arr) {
    var returnArr = [];
    for (var i = 0; i < arr.length; i++) {
        // Drop the 4-character "ABC:"/"DEF:" prefix and keep the rest.
        returnArr.push(arr[i].substring(4));
    }
    return returnArr;
}
SQL:
SELECT value udf.test(c.linkedId) FROM c
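If it helps, here is a sketch of registering and calling that UDF from Python with the azure-cosmos SDK; the endpoint, key, database, and container names are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

# Register the UDF once; afterwards it is callable as udf.test in SQL.
container.scripts.create_user_defined_function({
    "id": "test",
    "body": """function userDefinedFunction(arr) {
        var returnArr = [];
        for (var i = 0; i < arr.length; i++) {
            returnArr.push(arr[i].substring(4));  // drop the 4-character prefix
        }
        return returnArr;
    }""",
})

for ids in container.query_items(
    query="SELECT VALUE udf.test(c.linkedId) FROM c",
    enable_cross_partition_query=True,
):
    print(ids)  # e.g. ['123', '456'], then ['321', '654']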

Gremlin group by vertex property and get sum other properties in the same vertex

We have vertices that store various jobs, with their types and counts as properties. I have to group by the type and sum the counts. I tried the following query, which works for one property (receiveCount):
g.V().hasLabel("Jobs").has("Type",within("A","B","C")).group().by("Type").by(fold().match(__.as("p").unfold().values("receiveCount").sum().as("totalRec")).select("totalRec")).next()
I want to add 10 more properties like successCount, failedCount, etc. Is there a better way to do that?
You could use the cap() step, like this:
g.V().has("name","marko").out("knows").groupCount("a").by("name").group("b").by("name").by(values("age").sum()).cap("a","b")
And the result would be:
"data": [
{
"a": {
"vadas": 1,
"josh": 1
},
"b": {
"vadas": [
27.0
],
"josh": [
32.0
]
}
}
]
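If ten-plus counters make the cap() labels unwieldy, another common TinkerPop pattern is fold().project() with one by() per property. A sketch in gremlin-python, where the endpoint is a placeholder and the property list is illustrative:

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

g = traversal().withRemote(
    DriverRemoteConnection("ws://localhost:8182/gremlin", "g")  # placeholder endpoint
)

# Fold each Type group into a list once, then project one named sum per counter.
totals = (
    g.V().hasLabel("Jobs").has("Type", P.within("A", "B", "C"))
    .group().by("Type")
    .by(
        __.fold()
        .project("receiveCount", "successCount", "failedCount")
        .by(__.unfold().values("receiveCount").sum_())
        .by(__.unfold().values("successCount").sum_())
        .by(__.unfold().values("failedCount").sum_())
    )
    .next()
)
print(totals)  # e.g. {'A': {'receiveCount': 10, 'successCount': 7, 'failedCount': 3}, ...}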

Multiple range keys in couchdb views

I've been searching for a solution for a few hours without success...
I just want to express this SQL request in CouchDB with a view:
select * from database where (id >= 3000000 AND id <= 3999999) AND gyro_y >= 1000
I tried this:
function(doc) {
    if (doc.id && doc.Gyro_y) {
        emit([doc.id, doc.Gyro_y], null);
    }
}
Here is my document (record in couchdb):
{
  "_id": "f97968bee9674259c75b89658b09f93c",
  "_rev": "3-4e2cce33e562ae502d6416e0796fcad1",
  "id": "30000002",
  "DateHeure": "2016-06-16T02:08:00Z",
  "Latitude": 1000,
  "Longitude": 1000,
  "Gyro_x": -242,
  "Gyro_y": 183,
  "Gyro_z": -156,
  "Accel_x": -404,
  "Accel_y": -2424,
  "Accel_z": -14588
}
I then do an HTTP request like so:
http://localhost:5984/arduino/_design/filter/_view/bygyroy?startkey=["3000000",1000]&endkey=["3999999",9999999]&include_docs=true
I get this as an answer:
{
  total_rows: 10,
  offset: 8,
  rows: [{
    id: "f97968bee9674259c75b89658b09f93c",
    key: [
      "01000002",
      183
    ],
    value: null,
    doc: {
      _id: "f97968bee9674259c75b89658b09f93c",
      _rev: "3-4e2cce33e562ae502d6416e0796fcad1",
      id: "30000002",
      DateHeure: "2016-06-16T02:08:00Z",
      Latitude: 1000,
      Longitude: 1000,
      Gyro_x: -242,
      Gyro_y: 183,
      Gyro_z: -156,
      Accel_x: -404,
      Accel_y: -2424,
      Accel_z: -14588
    }
  }]
}
So it's working for the id, but it's not working for the second key, Gyro_y.
Thanks for your help.
When you specify arrays as your start/end keys, the results are filtered in a "cascade". In other words, it moves from left to right, and only if something was matched by the previous key, will it be matched by the next key.
In this case, you'll only find Gyro_y >= 1000 when that document also matches the first condition of 3000000 <= id <= 3999999.
Your SQL example does not translate exactly to what you are doing in CouchDB. In SQL, it'll find both conditions and then find the intersection amongst your resulting rows. I would read up on view collation to understand these inner-workings of CouchDB.
To solve your problem right now, I would simply switch the order you are emitting your keys. By putting the Gyro_y value first, you should get the results you've described.
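A sketch of the swapped-key query from Python; it assumes a hypothetical second view whose map function does emit([doc.Gyro_y, doc.id], null) instead:

import json
import requests

# Hypothetical view emitting [doc.Gyro_y, doc.id]; Gyro_y is now the primary filter.
url = "http://localhost:5984/arduino/_design/filter/_view/bygyroy_swapped"
params = {
    "startkey": json.dumps([1000, "3000000"]),
    "endkey": json.dumps([9999999, "3999999"]),
    "include_docs": "true",
}
rows = requests.get(url, params=params).json()["rows"]
for row in rows:
    # Collation is still left-to-right, so the id component only applies at the
    # Gyro_y boundaries; filter client-side if you need a strict intersection.
    print(row["doc"]["id"], row["doc"]["Gyro_y"])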

Query to get exact matches of Elastic field with multiple values in array

I want to write a query in Elastic that applies a filter based on values I have in an array (in my R program). Essentially, the query:
Matches a time range (time field in Elastic)
Matches "trackId" field in Elastic to any value in array oth_usr
Return 2 fields - "trackId", "propertyId"
I have the following primitive version of the query but do not know how to use the oth_usr array in a query (part 2 above).
query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}', start_date, end_date)
view_list <- elastic::Search(index = "organised_recent", type = "PROPERTY_VIEW", size = 10000000,
                             body = query, fields = c("trackId", "propertyId"))$hits$hits
You need to add a terms query and embed it as well as the range one into a bool/must query. Try updating your query like this:
terms <- paste(sprintf("\"%s\"", oth_usr), collapse=", ")
query <- sprintf('{"query":{"bool":{"must":[{"terms": {"trackId": [%s]}},{"range": {"time": {"gte": "%s","lte": "%s"}}}]}}}',terms,start_date,end_date)
I'm not fluent in R syntax, but this is the raw JSON query that works.
It checks whether your time field falls within the given range (start_time and end_time) and whether one of your terms exactly matches trackId.
It returns only the trackId and propertyId fields, as per your request:
POST /indice/_search
{
  "_source": {
    "include": [
      "trackId",
      "propertyId"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "time": {
              "gte": "start_time",
              "lte": "end_time"
            }
          }
        },
        {
          "terms": {
            "trackId": [
              "terms"
            ]
          }
        }
      ]
    }
  }
}
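For comparison, the same query sent from Python with elasticsearch-py; the endpoint, dates, and track IDs are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

oth_usr = ["user_a", "user_b"]                     # placeholder track IDs
start_date, end_date = "2021-01-01", "2021-01-31"  # placeholder range

body = {
    "_source": {"include": ["trackId", "propertyId"]},
    "query": {
        "bool": {
            "must": [
                {"range": {"time": {"gte": start_date, "lte": end_date}}},
                # terms = exact match against any value in the array
                {"terms": {"trackId": oth_usr}},
            ]
        }
    },
}

hits = es.search(index="organised_recent", body=body, size=10000)["hits"]["hits"]
for hit in hits:
    print(hit["_source"])  # only trackId and propertyId, per the _source filter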
