Cleanup Cosmos collection by document type - azure-cosmosdb

I have a collection (1B records) that I need to clean up.
Schema:
// <pk> - item Id
// <type> - literal enum, e.g. Type1|Type2|Type3
{
  "partKey": "<pk>",
  "type": "<type>"
}
I need to delete all documents where type = Type2.
I can't execute DELETE ... WHERE c.type = 'Type2', as that is not supported.
I can't use a stored procedure, because the collection is partitioned.
I'd prefer not to use the SDK.
What is the best way to clean up the collection by the specified condition?

I created the following test data in my collection:
[
  { "partKey": "1", "type": "1" },
  { "partKey": "5", "type": "4" },
  { "partKey": "2", "type": "2" },
  { "partKey": "3", "type": "2" },
  { "partKey": "4", "type": "2" }
]
Then create a data flow in ADF. Both the source and sink datasets are your Cosmos DB collection.
1. Check the Include system columns option in the source settings.
2. Create an Alter Row transformation to delete the documents. Check the Allow delete option in the sink settings and enter your partition key; a sketch of the delete condition is shown below.
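The delete condition itself is an ordinary data flow expression. With the test data above (where the unwanted documents have type "2"), a minimal Alter Row rule might be:
Delete if: type == '2'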
Result: the three documents with type "2" are deleted; only the documents with types "1" and "4" remain.

Related

Indexing the partition key in Azure Cosmos DB

Suppose I've the following data in my container:
{
  "id": "1DBF704E-1623-4844-DC86-EFA729A5C048",
  "firstName": "Wylie",
  "lastName": "Ramsey",
  "country": "AZ",
  "city": "Tucson"
}
I use the field "id" as the item id and the field "country" as the partition key. When I query on a specific partition key:
SELECT * FROM c WHERE c.country = "AZ"
(get all the people in "AZ")
Should I add "country" as an index or I will get it by default, since I declered "country" as my partition key?
Is there a diference when using the SDK (meaning: adding the new PartitionKey("AZ") option and then sending the query as mentioned above)?
I created a collection with 50,000 records and disabled indexing on all properties.
Indexing policy:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [],  // include nothing
  "excludedPaths": [
    { "path": "/\"_etag\"/?" },
    { "path": "/*" }    // exclude all paths
  ]
}
Querying by id cost 2.85 RUs.
Querying by PartitionKey cost 580 RUs.
Indexing policy with PartitionKey (country) added:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    { "path": "/country/*" }
  ],
  "excludedPaths": [
    { "path": "/\"_etag\"/?" },
    { "path": "/*" }  // exclude all paths
  ]
}
Adding an index on the PartitionKey brought it down to 2.83 RUs.
So the answer is yes: if you have disabled the default indexing policy and need to query by partition key, then you should add an index for it.
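If you manage the container with the JavaScript SDK, a minimal sketch of applying such a policy could look like the following (the database name 'db' and container name 'people' are placeholders):
const { CosmosClient } = require('@azure/cosmos');

async function indexPartitionKey() {
  const client = new CosmosClient({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY
  });
  const container = client.database('db').container('people');

  // Read the current container definition, add the /country index, write it back
  const { resource: definition } = await container.read();
  definition.indexingPolicy = {
    indexingMode: 'consistent',
    automatic: true,
    includedPaths: [{ path: '/country/*' }],
    excludedPaths: [{ path: '/"_etag"/?' }, { path: '/*' }]
  };
  await container.replace(definition);
}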
In my opinion, it's good practice to query with the partition key in the Cosmos DB SQL API; here's the official doc related to it.
By the way, the Cosmos DB SQL API indexes all properties by default. If you'd like to override the default setting and customise the indexing policy, this doc provides more details.
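As for the SDK part of the question: in the JavaScript SDK you can pass the partition key as a query option, which scopes the query to a single partition rather than fanning out. A rough sketch, assuming a container variable as above:
async function getPeopleInAZ(container) {
  // Restrict the query to the "AZ" partition instead of a cross-partition scan
  const { resources } = await container.items
    .query('SELECT * FROM c WHERE c.country = "AZ"', { partitionKey: 'AZ' })
    .fetchAll();
  return resources;
}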

Updating a DynamoDB item with a new attribute that's a list

So I have an item in DynamoDB:
{
  "id": "1",
  "foo": "bar"
}
And I want to add a new attribute to said item, the attribute being:
{
  "newAttr": [{ "bar": 1 }, { "bar": 2 }, { "bar": 3 }]
}
Now I'm doing this in JS using the AWS-SDK like so:
client.update({
  ExpressionAttributeNames: { '#newAttr': 'newAttr' },
  ExpressionAttributeValues: { ':newAttr': newAttr },
  Key: { "id": "1" },
  TableName: "foo",
  UpdateExpression: "SET #newAttr = :newAttr"
}, callback)
However, I get an error:
ExpressionAttributeValues contains invalid value: One or more parameter values were invalid: An AttributeValue may not contain an empty string for key :newAttr
If I JSON.stringify(newAttr) it works, but I don't want this new attribute to be a string; I want it to be a list.
So what am I doing wrong?
Try ADD instead of SET:
client.update({
  ExpressionAttributeNames: { '#newAttr': 'newAttr' },
  ExpressionAttributeValues: { ':newAttr': newAttr },
  Key: { "id": "1" },
  TableName: "foo",
  // ADD takes the form "ADD path value" - no equals sign
  // (note that ADD only supports number and set types)
  UpdateExpression: "ADD #newAttr :newAttr"
}, callback)
I ran into these issues in the beginning too. You are likely wasting time writing all the low-level code to interact with DynamoDB.
Use an ODM (ORM-like library) for document storage that handles these hurdles for you:
https://github.com/clarkie/dynogels
is what I have used, and it will make your life much easier.
Empty strings are not allowed
You cannot store empty strings in DynamoDB at the root level; you can have them in nested objects.
Allowed data types in DynamoDB:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html
Hope this helps.
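If empty strings inside newAttr are the culprit, one workaround is to strip them before calling update. A rough sketch (removeEmptyStrings is a hypothetical helper, not part of the AWS SDK):
// Recursively drop empty-string values so DynamoDB accepts the payload
function removeEmptyStrings(value) {
  if (Array.isArray(value)) {
    return value.map(removeEmptyStrings);
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      if (v !== '') out[k] = removeEmptyStrings(v);
    }
    return out;
  }
  return value;
}

client.update({
  ExpressionAttributeNames: { '#newAttr': 'newAttr' },
  ExpressionAttributeValues: { ':newAttr': removeEmptyStrings(newAttr) },
  Key: { "id": "1" },
  TableName: "foo",
  UpdateExpression: "SET #newAttr = :newAttr"
}, callback)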

Redux + Normalizr : Adding and deleting normalized entities in Redux state

I have an API response with a lot of nested entities. I use normalizr to keep the Redux state as flat as possible. For example, the API response looks like this:
{
  "id": 1,
  "docs": [
    { "id": 1, "name": "IMG_0289.JPG" },
    { "id": 2, "name": "IMG_0223.JPG" }
  ],
  "tags": [
    { "id": "1", "name": "tag1" },
    { "id": "2", "name": "tag2" }
  ]
}
This response is normalized with the schema given below:
const OpeningSchema = new schema.Entity('openings', {
  tags: [new schema.Entity('tags')],
  docs: [new schema.Entity('docs')]
});
and below is how it looks afterwards:
{
  result: "1",
  entities: {
    "openings": {
      "1": { "id": 1, "docs": [1, 2], "tags": [1, 2] }
    },
    "docs": {
      "1": { id: "1", "name": "IMG_0289.JPG" },
      "2": { id: "2", "name": "IMG_0223.JPG" }
    },
    "tags": {
      "1": { "id": 1, "name": "tag1" },
      "2": { "id": 2, "name": "tag2" }
    }
  }
}
The redux state now looks something like below:
state = {
  "opening": {
    id: 1,
    tags: [1, 2],
    docs: [1, 2]
  },
  "tags": [
    { "id": 1, "name": "tag1" },
    { "id": 2, "name": "tag2" }
  ],
  "docs": [
    { "id": 1, "name": "IMG_0289.JPG" },
    { "id": 2, "name": "IMG_0223.JPG" }
  ]
}
Now if I dispatch an action to add a tag, a tag object is added to state.tags, but state.opening.tags is not updated. The same happens when deleting a tag.
I keep opening, tags, and docs in three different reducers.
This is an inconsistency in the state. I can think of the following ways to keep it consistent:
1. Dispatch an action to update tags, listen for it in both the tags reducer and the opening reducer, and update the tags in both places.
2. The PATCH request that updates an opening's tags returns the opening; I can dispatch an action that normalizes the response and sets tags, opening, etc. consistently.
What is the right way to do this? Shouldn't the entities observe changes to related entities and update themselves? Or are there other patterns for handling such actions?
First, to summarise how normalizr works: normalizr flattens a nested API response into the entities defined by your schemas. So, when you made your initial GET openings request, normalizr flattened the response and created your Redux entities: openings, docs, and tags.
Your suggestions are viable, but I find normalizr's real benefit in separating API data from UI state, so I don't update the data in the Redux store myself. All my API data are kept in entities and are never altered by me; they are vanilla back-end data. All I do is issue a GET after every state-changing API operation and normalise the response (with a small exception for the DELETE case, expanded on below). A middleware deals with such cases, so you should use one if you aren't already. I wrote my own, but I know redux-promise-middleware is quite popular.
In your data set above, when you add a new tag, I assume you make an API POST, which updates the back-end. Then you should do another GET openings, which will update the entities for openings and all its nested schemas.
When you delete a tag, e.g. tags[2], then upon sending the DELETE request to the back-end you should nullify the deleted object in your entities state, i.e. entities.tags[2] = null, before making the GET openings again to update your normalizr entities.
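For illustration, a minimal sketch of that pattern in an entities reducer (the action types OPENINGS_FETCHED and TAG_DELETED are hypothetical names):
// Entities reducer: replace entities wholesale from each normalized GET,
// and nullify a tag locally when a DELETE succeeds.
const initialState = { openings: {}, docs: {}, tags: {} };

function entities(state = initialState, action) {
  switch (action.type) {
    case 'OPENINGS_FETCHED':
      // payload is normalize(response, OpeningSchema).entities
      return { ...state, ...action.payload.entities };
    case 'TAG_DELETED':
      // payload.id is the deleted tag's id
      return {
        ...state,
        tags: { ...state.tags, [action.payload.id]: null }
      };
    default:
      return state;
  }
}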

JMS Serializer nested objects policy

I have a question similar to JMS Serializer serialize object in object with different view, but I can't get it to work like the accepted answer.
I have a User model that has many Reviews, but each Review's owner is another User. My serialization policy outputs the following:
{
  "id": "1",
  "name": "John Doe",
  "reviews": [
    {
      "id": "1",
      "rate": "5",
      "evaluator": {
        "id": "2",
        "name": "Alice",
        "reviews": [...]
      }
    }, ...
  ]
}
The behavior makes sense, since the associated Review owner model is the same as the parent model and therefore uses the same serialization policy. But how can I define a custom serialization policy for the nested model so that it outputs the following:
{
  "id": "1",
  "name": "John Doe",
  "reviews": [
    {
      "id": "1",
      "rate": "5",
      "evaluator": "Alice"
    }, ...
  ]
}

Google Cloud Datastore avoid entity overhead when fetching

Is it possible to avoid the entity overhead when fetching data from Cloud Datastore? Ideally I'd love to have SQL-like results: simple arrays/objects with key:value pairs instead of
{
  "batch": {
    "entityResultType": "FULL",
    "entityResults": [
      {
        "entity": {
          "key": {
            "partitionId": { "datasetId": "e~******************" },
            "path": [
              { "kind": "Answer", "id": "*******************" }
            ]
          },
          "properties": {
            "value": { "stringValue": "12", "indexed": false },
            "question": { "integerValue": "120" }
          }
        }
      }
    ],
    "endCursor": "********************************************",
    "moreResults": "MORE_RESULTS_AFTER_LIMIT",
    "skippedResults": null
  }
}
which has just too much overhead for me (I plan to run queries over thousands of entities). I couldn't find in the docs whether that's possible.
You can use projection queries to query for specific entity properties.
In your example, you could use a projection query to avoid returning the entity's key in the query results.
The other fields are part of the API's data representation and are always returned; however, many of them (e.g. endCursor) are returned only once per query batch, so the overhead is low when there are many results.
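For example, a rough sketch with the Node.js client. Note that only indexed properties can be projected; in the sample result above, "value" has "indexed": false, so only "question" is projectable here:
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function fetchProjected() {
  // select() turns this into a projection query that returns only the
  // named properties as plain objects, not full entities.
  const query = datastore.createQuery('Answer').select(['question']);
  const [rows] = await datastore.runQuery(query);
  return rows; // e.g. [{ question: 120 }, ...]
}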
