I have to design a DynamoDB table for a Monitoring System.
The attributes present for each item are:
name (Monitor Name: String)
et (milliseconds epoch: Number)
owner (Owner of the monitor: String)
Below are the queries that I need to perform:
Retrieve all monitors:
1.1 That ran today
1.2 In et (milliseconds epoch) range
Retrieve by owner and name
Retrieve by owner
Retrieve by owner, name, and in et (milliseconds epoch) range
I came up with the following selection of keys:
Partition Key: Owner#MonitorName
Sort Key: ET (milliseconds epoch)
GSI Partition Key: ET (milliseconds epoch)
How I think queries would be executed:
Query 1:
Query 1.1 et = :val
Query 1.2 et between :val1 and :val2
Query 2: Owner#MonitorName = :val
Query 3: Owner#MonitorName begins_with(:val)
Query 4: (Owner#MonitorName = :val) and (et between :val1 and :val2)
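For concreteness, here is roughly how I imagine Query 2 and Query 4 looking with boto3 (just a sketch; the table name, the pk attribute holding Owner#MonitorName, and the example values are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table; "pk" holds Owner#MonitorName and "et" is the numeric sort key.
table = boto3.resource("dynamodb").Table("monitors")

# Query 2: retrieve by owner and name.
by_owner_and_name = table.query(
    KeyConditionExpression=Key("pk").eq("alice#disk-usage-monitor")
)["Items"]

# Query 4: retrieve by owner and name, restricted to an epoch-millisecond range.
by_owner_name_and_range = table.query(
    KeyConditionExpression=Key("pk").eq("alice#disk-usage-monitor")
    & Key("et").between(1700000000000, 1700086400000)
)["Items"]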
Is this design efficient? What are the drawbacks for the design? How can it be improved?
Related
I have a DynamoDB table, and I need to find the records that fall within a given date range.
So here is my table structure:
{
"Id":"String",
"Name":"String",
"CrawledAt":"String"
}
In this table, the Id and CrawledAt fields are used as keys (Id is the partition key). I also created a local secondary index on the CrawledAt field, named "CrawledAt-index".
Most articles on querying use Id together with CrawledAt, but in my case I don't know the Id; I only need to retrieve records for a particular date range.
Here is the code I have tried
request = {
    "TableName": "sflnd00001-test",
    "IndexName": "CrawledAt-index",
    "ConsistentRead": False,
    "ProjectionExpression": "Name",
    "KeyConditionExpression": "CrawledAt between :v_start and :v_end",
    "ExpressionAttributeValues": {
        ":v_start": {"S": "2020-01-31T00:00:00.000Z"},
        ":v_end": {"S": "2025-11-31T00:00:00.000Z"},
    },
}
response = table.query(**request)
It's returning this error
"An error occurred (ValidationException) when calling the Query operation: Invalid KeyConditionExpression: Incorrect operand type for operator or function; operator or function: BETWEEN, operand type: M",
Can someone please tell me how to find the data set within a given date range without providing the primary key?
You cannot do a between or use any other function on a partition key; you must always provide the entire partition key value.
For your use case, your GSI partition key should be a single value, and CrawledAt should be the sort key.
{
"Id":"String",
"Name":"String",
"CrawledAt":"String",
"GsiPk": "Number"
}
"KeyConditionExpression":
"GsiPk = 1 AND CrawledAt between :v_start and :v_end"
This would then allow you to retrieve all the data in the table between two dates. But be aware of the caveat of doing this: using a single value for the GSI partition key is not scalable and caps write throughput for that partition at approximately 1000 WCU.
If you need more scale, you can assign a random number from 0 to n-1 to GsiPk to increase the number of partitions, which then requires you to make n queries and merge the results to collect all the data.
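A sketch of that sharded fan-out with boto3 (the shard count, table name, and index name are assumptions; pagination via LastEvaluatedKey is omitted for brevity):

import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10  # assumed: each item is written with GsiPk = random value in [0, NUM_SHARDS)
table = boto3.resource("dynamodb").Table("sflnd00001-test")

def query_date_range(start: str, end: str) -> list:
    items = []
    for shard in range(NUM_SHARDS):
        # One query per shard; results are merged client-side.
        response = table.query(
            IndexName="GsiPk-CrawledAt-index",
            KeyConditionExpression=Key("GsiPk").eq(shard)
            & Key("CrawledAt").between(start, end),
        )
        items.extend(response["Items"])
    return items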
Alternatively, you can Scan the table and use a FilterExpression, which is also not a scalable solution:
aws dynamodb scan \
    --table-name MusicCollection \
    --filter-expression "#ts between :a and :b" \
    --expression-attribute-names file://expression-attribute-names.json \
    --expression-attribute-values file://expression-attribute-values.json
(where expression-attribute-names.json maps #ts to your timestamp attribute, since timestamp is a DynamoDB reserved word)
I have a data dump from the Plaid API in DynamoDB. Each transaction has transaction_id, pending (bool), and pending_transaction_id (basically an FK to the older pending transaction it replaces).
{
"account_id": "acct1", // partition key
"transaction_id": "txn100", // sort key
"category_id": "22001000",
"pending": false,
"pending_transaction_id": "txn1",
"amount": 500,
},
{
"account_id": "acct1",
"transaction_id": "txn1",
"category_id": "22001000",
"pending": true,
"pending_transaction_id": null,
"amount": 500,
},
Is it possible to query in a single query only pending transactions that don't have a permanent replacement yet?
In other words, if it was relational DB it would be along the lines
select * from txn where pending = true and transaction_id not in (select pending_transaction_id from txn where pending_transaction_id is not null) (or whatever flavor of CTE or left join you prefer).
How do I do this in dynamo db in a single query?
We can have a GSI here to solve this problem.
PK (pending) | SK (pending_transaction_id) | ..
false        | txn1                        | ..
true         | null                        | ..
We can then query the GSI by its partition key and get our records.
Points to consider/observe:
1. Since the SK is null for the pending = true record, that record is not written to the GSI (sparse index behavior). This works for us, as we don't need those records.
2. We can include pending = true records in our GSI if required; however, that means storing a placeholder "NULL" attribute value.
The advantage I see with this GSI (considering only point 1) is that we duplicate into the index only the records we actually need for our query.
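A sketch of how this might be queried with boto3 (the table and index names are assumptions; note that DynamoDB key attributes must be of type S, N, or B, so pending would need to be stored as a string such as "false" to serve as a GSI key):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("transactions")  # hypothetical table name

# Items with pending = "false" carry, in pending_transaction_id, the id of the
# pending transaction they replaced, so this query returns every pending
# transaction id that already has a permanent replacement.
response = table.query(
    IndexName="pending-pending_transaction_id-index",  # hypothetical GSI name
    KeyConditionExpression=Key("pending").eq("false"),
)
replaced_ids = {item["pending_transaction_id"] for item in response["Items"]}
# Pending (pending = true) transactions whose transaction_id is not in
# replaced_ids are the ones still awaiting a permanent replacement.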
I wrote a query like the one below. I am able to retrieve data between fromtime and totime. My problem is that there are 30 records for every minute. I would like help getting only the first record for every hour, i.e. 24 records per day, and I need this for 30 days.
var config = new QueryRequest
{
    TableName = "dfgfdgdfg",
    KeyConditionExpression = "id = :id AND plctime BETWEEN :fromtime AND :totime",
    // Placeholder names must match those used in the key condition expression.
    ExpressionAttributeValues = new Dictionary<string, AttributeValue> {
        { ":id", new AttributeValue { S = id } },
        { ":fromtime", new AttributeValue { S = fromtime } },
        { ":totime", new AttributeValue { S = totime } }
    },
};
return await _dynamoClient.QueryAsync(config);
In addition to storing your record as is, you could consider inserting another record that looks like this :
{
    pk : "DailyMarker_" + DateTime.Now.ToString("yyyyMMdd"),    // partition key
    sk : "HourlyMarker_" + DateTime.Now.ToString("yyyyMMddHH"), // range key
    record: <your entire record>
}
pk and sk would be of the form DailyMarker_20191121 and HourlyMarker_2019112101. Basically, the part after the underscore acts as a date/time stamp with only the granularity you are interested in.
While inserting a marker record, you can add a precondition check which, if it fails, prevents the insertion from taking place (see PutItem -> ConditionExpression). This operation throws an exception with most SDKs if the condition evaluates to false, so you want to handle that exception.
At this point only the first record per hour is being inserted for each PK/SK combination, and all SKs for one day end up under the same PK.
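A minimal sketch of that conditional insert (boto3; the table name and the pk/sk/record attribute names are assumptions):

import datetime

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("markers-table")  # hypothetical table name

def put_hourly_marker(record: dict) -> None:
    now = datetime.datetime.utcnow()
    try:
        table.put_item(
            Item={
                "pk": "DailyMarker_" + now.strftime("%Y%m%d"),
                "sk": "HourlyMarker_" + now.strftime("%Y%m%d%H"),
                "record": record,
            },
            # Only the first write in a given hour succeeds; later writes find
            # an item with the same pk/sk already present and fail the condition.
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # only swallow the expected "marker already exists" failure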
To query for different ranges, you will have to perform some calculations in your application code to determine the start and end buckets (pk and sk) that you want to query. While you will need to make one call per pk you are interested in, the range key can be queried using range conditions.
You could also switch the pk to be monthly instead of daily, which would reduce the number of PKs to query while increasing the potential for imbalanced keys (a.k.a. hot keys).
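And a sketch of the bucket calculation for querying a range of hours (same assumed names; one Query per daily pk, with the hourly sk range applied to each):

import datetime

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("markers-table")  # hypothetical table name

def hourly_markers(start: datetime.datetime, end: datetime.datetime) -> list:
    sk_low = "HourlyMarker_" + start.strftime("%Y%m%d%H")
    sk_high = "HourlyMarker_" + end.strftime("%Y%m%d%H")
    items, day = [], start.date()
    while day <= end.date():
        # One call per daily pk; the sk between condition trims the edges.
        response = table.query(
            KeyConditionExpression=Key("pk").eq("DailyMarker_" + day.strftime("%Y%m%d"))
            & Key("sk").between(sk_low, sk_high),
        )
        items.extend(response["Items"])
        day += datetime.timedelta(days=1)
    return items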
Our team is currently exploring the ways to encrypt PII data on the field level within BigQuery and we found out the following way to encrypt/decrypt using Crypto-JS:
#standardSQL
CREATE TEMPORARY FUNCTION encrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
let key = CryptoJS.enc.Utf8.parse("<key>");
let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
let _encrypt = CryptoJS.AES.encrypt(_text, key, options);
return _encrypt.toString();
""" OPTIONS (library="gs://path/to/Crypto-JS/crypto-js.js");
CREATE TEMPORARY FUNCTION decrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
let key = CryptoJS.enc.Utf8.parse("<key>");
let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
let _decrypt = CryptoJS.AES.decrypt(_text, key, options).toString(CryptoJS.enc.Utf8);
return _decrypt;
""" OPTIONS (library="gs://path/to/Crypto-JS/crypto-js.js");
-- query to encrypt fields
SELECT
<fields>, encrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
-- query to decrypt fields
SELECT
<fields>, decrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
I am trying to benchmark the performance of AES-CBC encryption and decryption using the Crypto-JS library in BigQuery before deploying it to production. We found that, compared to a regular query, the total time to encrypt and decrypt grows steeply as the number of records increases; however, as the volume of data grows, the per-record processing time improves.
As there is no documentation available on this, could someone from the community suggest better approaches, query optimizations, or best practices for field-level encryption and decryption within BigQuery?
BigQuery now supports encryption functions. From the documentation, here is a self-contained example that creates some keysets and uses them to encrypt data. In practice, you would want to store the keysets in a real table so that you can later use them to decrypt the ciphertext.
WITH CustomerKeysets AS (
SELECT 1 AS customer_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset UNION ALL
SELECT 2, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') UNION ALL
SELECT 3, KEYS.NEW_KEYSET('AEAD_AES_GCM_256')
), PlaintextCustomerData AS (
SELECT 1 AS customer_id, 'elephant' AS favorite_animal UNION ALL
SELECT 2, 'walrus' UNION ALL
SELECT 3, 'leopard'
)
SELECT
pcd.customer_id,
AEAD.ENCRYPT(
(SELECT keyset
FROM CustomerKeysets AS ck
WHERE ck.customer_id = pcd.customer_id),
pcd.favorite_animal,
CAST(pcd.customer_id AS STRING)
) AS encrypted_animal
FROM PlaintextCustomerData AS pcd;
Edit: if you want to decrypt using AES-CBC with PKCS padding (it's not clear what kind of padding you are using in your example) you can use the KEYS.ADD_KEY_FROM_RAW_BYTES function to create a keyset, then call AEAD.DECRYPT_STRING or AEAD.DECRYPT_BYTES. For example:
SELECT
AEAD.DECRYPT_STRING(
KEYS.ADD_KEY_FROM_RAW_BYTES(b'', 'AES_CBC_PKCS', b'1234567890123456'),
FROM_HEX('deed2a88e73dccaa30a9e6e296f62be27db30db16f76d3f42c85d31db3f46376'),
'')
This returns abcdef. The IV is expected to be the first 16 bytes of the ciphertext.
I'm familiar with MySQL and am starting to use Amazon DynamoDB for a new project.
Assume I have a MySQL table like this:
CREATE TABLE foo (
id CHAR(64) NOT NULL,
scheduledDelivery DATETIME NOT NULL,
-- ...other columns...
PRIMARY KEY(id),
INDEX schedIndex (scheduledDelivery)
);
Note the secondary index schedIndex, which is supposed to speed up the following query (which is executed periodically):
SELECT *
FROM foo
WHERE scheduledDelivery <= NOW()
ORDER BY scheduledDelivery ASC
LIMIT 100;
That is: Take the 100 oldest items that are due to be delivered.
With DynamoDB I can use the id column as primary partition key.
However, I don't understand how I can avoid full-table scans in DynamoDB. When adding a secondary index I must always specify a "partition key". However, (in MySQL words) I see these problems:
the scheduledDelivery column is not unique, so it can't be used as a partition key itself AFAIK
adding id as a unique partition key and using scheduledDelivery as the "sort key" sounds like an (id, scheduledDelivery) secondary index to me, which makes that index practically useless
I understand that MySQL and DynamoDB require different approaches, so what would be an appropriate solution in this case?
It's not possible to avoid a full table scan with this kind of query.
However, you may be able to disguise it as a Query operation, which would allow you to sort the results (not possible with a Scan).
You must first create a GSI. Let's name it scheduled_delivery-index.
We will specify our index's partition key to be an attribute named fixed_val, and our sort key to be scheduled_delivery.
fixed_val will contain any value you want, but it must always be that value, and you must know it from the client side. For the sake of this example, let's say that fixed_val will always be 1.
GSI keys do not have to be unique, so don't worry if there are two duplicated scheduled_delivery values.
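Creating that GSI could look like this (a boto3 sketch; the attribute types, the throughput values, and the assumption that scheduled_delivery is stored as epoch milliseconds are mine, and ProvisionedThroughput can be omitted for on-demand tables):

import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="foo",
    AttributeDefinitions=[
        {"AttributeName": "fixed_val", "AttributeType": "N"},
        # assumed: epoch milliseconds, so it compares correctly against Date.now() below
        {"AttributeName": "scheduled_delivery", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "scheduled_delivery-index",
                "KeySchema": [
                    {"AttributeName": "fixed_val", "KeyType": "HASH"},
                    {"AttributeName": "scheduled_delivery", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)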
You would query the table like this:
var now = Date.now();
//...
{
TableName: "foo",
IndexName: "scheduled_delivery-index",
ExpressionAttributeNames: {
"#f": "fixed_value",
"#d": "scheduled_delivery"
},
ExpressionAttributeValues: {
":f": 1,
":d": now
},
KeyConditionExpression: "#f = :f and #d <= :d",
ScanIndexForward: true, // ascending: oldest scheduled_delivery first
Limit: 100              // matches the LIMIT 100 from the original SQL
}