DataBrew recipes can be written in JSON for transformations that will be reused across multiple datasets.
This is an example I copied from the DataBrew Developer Guide to join two datasets:
```json
{
  "Action": {
    "Operation": "JOIN",
    "Parameters": {
      "joinKeys": "[{\"key\":\"assembly_session\",\"value\":\"assembly_session\"},{\"key\":\"state_code\",\"value\":\"state_code\"}]",
      "joinType": "INNER_JOIN",
      "leftColumns": "[\"year\",\"assembly_session\",\"state_code\",\"state_name\",\"all_votes\",\"yes_votes\",\"no_votes\",\"abstain\",\"idealpoint_estimate\",\"affinityscore_usa\",\"affinityscore_russia\",\"affinityscore_china\",\"affinityscore_india\",\"affinityscore_brazil\",\"affinityscore_israel\"]",
      "rightColumns": "[\"assembly_session\",\"vote_id\",\"resolution\",\"state_code\",\"state_name\",\"member\",\"vote\"]",
      "secondInputLocation": "s3://databrew-public-datasets-us-east-1/votes.csv",
      "secondaryDatasetName": "votes"
    }
  }
}
```
Is it possible to select all columns with a * inside "leftColumns", or anything close to that?
I've tried adding only * but it doesn't work.
I will apply the same transformations to multiple tables, and this functionality would work great if I could select everything on a left join without needing to specify all the columns.
I'm running a database with Sequelize and SQLite, and I use soft deletes to basically archive the data.
I'm aware that with .findAll({ paranoid: false }) I can find all rows, including the soft-deleted ones. However, I would like to find ONLY the soft-deleted ones.
Is there any way to achieve this? Or is there perhaps a way to do "set operations" on two result sets, like finding the relative complement of one in the other?
For this, you can add the following condition to the where clause:
deletedAt: { [Op.not]: null }
For example:
const { Op } = require('sequelize'); // Op is needed for the [Op.not] operator

const projects = await db.Project.findAndCountAll({
  paranoid: false,                    // include soft-deleted rows in the search...
  order: [['createdAt', 'DESC']],
  where: { employer_id: null, deletedAt: { [Op.not]: null } }, // ...but keep only the ones that were deleted
  limit: parseInt(size),
  offset: (page - 1) * parseInt(size),
});
I'm using the Cosmos DB Data Migration Tool to migrate data between environments. During my migration, I need to update the hostname of the website in the data. I was able to do this pretty easily for the top-level fields with a query like this:
SELECT Farms["name"], Farms["farmerInfo"], REPLACE(Farms["websiteLink"], "thiswebsite", "newHostName") AS websiteLink FROM Farms
My Cosmos DB data is structured like this (the data is just for the example):
{
  "name": "Red's Farm",
  "websiteLink": "www.thiswebsite.com/goats/",
  "farmerInfo": {
    "name": "Bob",
    "websiteLink": "www.thiswebsite.com/goats/",
    "hasGoats": true,
    "numGoats": 17
  }
}
I don't actually need to modify any of the top-level data. The data I need to modify is within the "farmerInfo" object. I've tried a few things but have had no luck. How can I replace a string in this nested object using the SQL API?
I want the data to look like this after the migration:
{
  "name": "Red's Farm",
  "websiteLink": "www.thiswebsite.com/goats/",
  "farmerInfo": {
    "name": "Bob",
    "websiteLink": "www.newHostName.com/goats/", <--- Updated data
    "hasGoats": true,
    "numGoats": 17
  }
}
You can use a SELECT statement inside your SELECT statement to build up the sub-objects. For example:
SELECT
c.name,
c.websiteLink,
(
SELECT
c.farmerInfo.name,
REPLACE(c.farmerInfo.websiteLink, "thiswebsite", "newHostName") AS websiteLink
) AS farmerInfo
FROM c
I have a table with PK (String) and SK (Integer), e.g.:
PK_id SK_version Data
-------------------------------------------------------
c3d4cfc8-8985-4e5... 1 First version
c3d4cfc8-8985-4e5... 2 Second version
I can do a conditional insert to ensure we don't overwrite an existing PK/SK pair using a ConditionExpression (in the Go SDK):
putWriteItem := dynamodb.Put{
    TableName:           "example_table",
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version)"),
}
However, I would also like to ensure that SK_version is always consecutive, but I don't know how to write the expression. In pseudo-code:
putWriteItem := dynamodb.Put{
    TableName:           "example_table",
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version) **AND attribute_exists(SK_version = :SK_prev_version)**"),
}
Can someone advise how I can write this?
In SQL I'd do something like:
INSERT INTO example_table (PK_id, SK_version, Data)
SELECT {pk}, {sk}, {data}
WHERE NOT EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk}
)
AND EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk} - 1
)
Thanks
A condition check applies to a single item; it cannot span multiple items. In other words, you simply need multiple condition checks. DynamoDB has the transactWriteItems API, which performs multiple condition checks along with writes/deletes. The code below is in Node.js.
const previousVersionCheck = {
  TableName: 'example_table',
  Key: {
    PK_id: 'prev_pk_id',
    SK_version: 'prev_sk_version'
  },
  ConditionExpression: 'attribute_exists(PK_id)'
}

const newVersionPut = {
  TableName: 'example_table',
  Item: {
    // your item data
  },
  ConditionExpression: 'attribute_not_exists(PK_id)'
}

await documentClient.transactWrite({
  TransactItems: [
    { ConditionCheck: previousVersionCheck },
    { Put: newVersionPut }
  ]
}).promise()
The transaction has two operations: one validates the previous version, and the other is a conditional write. If either condition check fails, the whole transaction fails.
You are hitting your head on some of the differences between a SQL and a NoSQL database. DynamoDB is, of course, a NoSQL database. It does not, out of the box, support optimistic locking. I see two straightforward options:
Use a software layer to give you locking on your DynamoDB table. This may or may not be feasible depending on how often your table is updated. How fast 'versions' are generated and the maximum time your application can be gated on the lock will likely tell you whether this can work for you. I am not familiar with Go, but the Java API supports this. Again, this isn't a built-in feature of DynamoDB. If there is no Go equivalent, you could use the technique described in the link to 'lock' the table for updates. Generally speaking, locking a NoSQL DB isn't a typical pattern, as it isn't what such databases were created for (part of which is achieving large scale on unstructured documents so many consumers can access them quickly at once).
Stop using an incrementing value to guarantee uniqueness. Incrementing counters are typically frowned upon in DynamoDB, partly due to the lack of intrinsic support for them and partly because of how DynamoDB shards data: you don't want a lot of similarity between keys. Using a UUID will solve the uniqueness problem, but if you are porting an existing application that means changing the code that creates the ID and the code that reads it (perhaps adding a creation-time field so you can tell which version is newest, or prepending or appending an epoch time to the UUID to do the same). Here is a pertinent link to an SO question explaining why to use UUIDs instead of incrementing integers.
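For illustration, here is a minimal Go sketch of that UUID-plus-timestamp idea; the helper name and the key format are just assumptions for the example, not part of any SDK, and it assumes the github.com/google/uuid package:

package main

import (
	"fmt"
	"time"

	"github.com/google/uuid"
)

// newVersionKey (hypothetical helper) builds a sort key that is globally
// unique thanks to the UUID, but still sorts roughly by creation time
// because the zero-padded epoch seconds come first.
func newVersionKey() string {
	return fmt.Sprintf("%010d#%s", time.Now().Unix(), uuid.NewString())
}

func main() {
	fmt.Println(newVersionKey()) // e.g. "1700000000#6f1c7c2e-..."
}

With a key like this you can query the partition in descending key order and take the first item to find the latest version, instead of relying on a consecutive integer.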
Based on Hung Tran's answer, here is a Go example:
checkItem := dynamodb.TransactWriteItem{
    ConditionCheck: &dynamodb.ConditionCheck{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_exists(pk_id) AND attribute_exists(version)"),
        Key:                 map[string]*dynamodb.AttributeValue{"pk_id": {S: id}, "version": {N: prevVer}},
    },
}

putItem := dynamodb.TransactWriteItem{
    Put: &dynamodb.Put{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_not_exists(pk_id) AND attribute_not_exists(version)"),
        Item:                data,
    },
}

writeItems := []*dynamodb.TransactWriteItem{&checkItem, &putItem}
_, err := db.TransactWriteItems(&dynamodb.TransactWriteItemsInput{TransactItems: writeItems})
if err != nil {
    // A TransactionCanceledException means one of the condition checks failed,
    // e.g. the previous version does not exist or this version was already written.
    panic(err)
}
I'm trying to build a graph from a CSV file, but I'm not able to add additional relationships to existing nodes.
My actual code is:
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'my_file.csv' AS line
MERGE (p:Title { title: line[0]})
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH line WHERE line[2] IS NOT NULL
MERGE (b:Author {name: line[2]})
MERGE (b)-[:COLABORATE_IN]->(p) //not working
RETURN line[2]
It should be simple. It creates the nodes and the first relationships fine, but for line[2] it only creates the relationships for new nodes. What could I do?
Thanks
Everything that is not piped through the WITH clause is not available to the next part of the query:
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH line WHERE line[2] IS NOT NULL
// p is no more available here
Just add the p identifier to make it available in the remaining part of the query:
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'my_file.csv' AS line
MERGE (p:Title { title: line[0]})
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH p, line
WHERE line[2] IS NOT NULL
MERGE (b:Author {name: line[2]})
MERGE (b)-[:COLABORATE_IN]->(p)
RETURN line[2]
I am looking for a general solution to a problem with CouchDB views.
For example, take a view result like this:
{"total_rows":4,"offset":0,"rows":[
{"id":"1","key":["imported","1"],"value":null},
{"id":"2","key":["imported","2"],"value":null},
{"id":"3","key":["imported","3"],"value":null},
{"id":"4","key":["mapped","4"],"value":null},
{"id":"5,"key":["mapped","5"],"value":null}
]
1) If I want to select only "imported" documents I would use this:
view?startkey=["imported"]&endkey=["imported",{}]
2) If I want to select all imported documents with an id higher than 2:
view?startkey=["imported",2]&endkey=["imported",{}]
3) If I want to select all imported documents with an id between 2 and 4:
view?startkey=["imported",2]&endkey=["imported",4]
My question is: how can I select all rows with an id between 2 and 4, regardless of category?
You can try to extend the other answer's solution by prepending the keys with a kind of "emit index" flag, like this:
map: function (doc) {
emit ([0, doc.number, doc.category]); // direct order
emit ([1, doc.category, doc.number]); // reverse order
}
so you will be able to request them with
view?startkey=[0, 2]&endkey=[0, 4, {}]
or
view?startkey=[1, 'imported', 2]&endkey=[1, 'imported', 4]
But two separate views would be better anyway.
I ran into the same problem a little while ago, so I'll explain my solution. Inside any map function you can have multiple emit() calls. A map function in your case might look like this:
function(doc) {
emit([doc.number, doc.category], null);
emit([doc.category, doc.number], null);
}
You can also use ?include_docs=true to get the documents back from any of your queries. Then your query to get rows 2 to 4 back would be:
view?startkey=[2]&endkey=[4,{}]
You can view the sorting rules at CouchDB View Collation.