How do DynamoDB + DAX work with time series? - amazon-dynamodb-dax

I wonder how DAX works with time-series data. I want to insert some data every minute, add a TTL to remove it after 14 days, and get the last 3 hours of data after each insert:
insert 1KB each minute
expire after 14 days
after each insert read data for the last 3 hours
3 hours is 180 minutes, so most of the time I need the last 180 items. Sometimes data stops coming for a while, so there may be fewer than 180 items.
So there are 20,160 items, roughly 19 MB of data, over 14 days. How much of the DAX cache will I use while fetching the last 3 hours of data every minute? Will it be 19 MB or 180 KB?
let params = {
  TableName: 'prod_server_data',
  KeyConditionExpression: 's = :server_id and t between :time_from and :time_to',
  ExpressionAttributeValues: {
    ':server_id': serverId, // string
    ':time_from': from,     // timestamp
    ':time_to': to,         // timestamp
  },
  ReturnConsumedCapacity: 'TOTAL',
  ScanIndexForward: false,
  Limit: 1440, // 24h * 60 = 1440, 1 check every 1 min
};
const queryResult = await dynamo.query(params).promise();

DAX caches items and queries separately, and the query cache stores the entire response, keyed by the parameters. In this case, set the query TTL to 1 minute, and make sure that :time_from and :time_to only have 1 minute resolution.
If you only call query once per minute, then you won't see much benefit from DAX (since it will have to go to DynamoDB every time to refresh).
If you call query multiple times per minute but only expect the data to update every minute (i.e. repeatedly refreshing a dashboard), there will only be 1 call to DynamoDB every minute to refresh, and all other requests will be served from the cache.
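For example, a minimal sketch of that rounding idea (here dax stands in for a DAX-backed DocumentClient, and floorToMinute, serverId and the 180-item Limit are assumptions for this sketch, not from the original query): truncating both bounds to whole minutes means every query issued within the same minute produces an identical cache key, so repeats can be served from the DAX query cache.
// Sketch: truncate both bounds to whole minutes so every query issued within
// the same minute produces an identical query-cache key.
const floorToMinute = (epochSeconds) => epochSeconds - (epochSeconds % 60); // hypothetical helper

const serverId = 'server-123'; // placeholder id
const to = floorToMinute(Math.floor(Date.now() / 1000));
const from = to - 3 * 60 * 60; // last 3 hours

const queryResult = await dax.query({
  TableName: 'prod_server_data',
  KeyConditionExpression: 's = :server_id and t between :time_from and :time_to',
  ExpressionAttributeValues: { ':server_id': serverId, ':time_from': from, ':time_to': to },
  ScanIndexForward: false,
  Limit: 180, // at most one item per minute over 3 hours
}).promise();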

Related

How fast is counting documents in Cloud Firestore?

Last year Firestore introduced count queries, which allow you to retrieve the number of results in a query/collection without actually reading the individual documents.
The documentation for this count feature mentions:
Aggregation queries rely on the existing index configuration that your queries already use, and scale proportionally to the number of index entries scanned. This means that aggregations of small- to medium-sized data sets perform within 20-40 ms, though latency increases with the number of items counted.
And:
If a count() aggregation cannot resolve within 60 seconds, it returns a DEADLINE_EXCEEDED error.
How many documents can Firestore actually count within that 1 minute timeout?
I created some collections with many documents in a test database, and then ran COUNT() queries against that.
The code to generate the minimal documents through the Node.js Admin SDK:
import { initializeApp } from "firebase-admin/app";
import { getFirestore, FieldValue } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();
const col = db.collection("10m");
let count = 0;
const writer = db.bulkWriter();
while (count++ < 10_000_000) {
  if (count % 1000 === 0) await writer.flush();
  writer.create(col.doc(), {
    index: count,
    createdAt: FieldValue.serverTimestamp()
  });
}
await writer.close();
Then I counted them with:
for (const name of ["1k", "10k", "1m", "10m"]) {
  const start = Date.now();
  const result = await getCountFromServer(collection(db, name));
  console.log(`Collection '${name}' contains ${result.data().count} docs (counting took ${Date.now() - start}ms)`);
}
And the results I got were:
count        ms
1,000        120
10,000       236
100,000      401
1,000,000    1,814
10,000,000   16,565
I ran some additional tests with limits and conditions, and the results were always in line with the above for the number of results that were counted. So for example, counting 10% of the collection with 10m documents took about 1½ to 2 seconds.
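For reference, a filtered count with the Web SDK would look something like this (the where clause on the index field and the 1,000,000 cutoff are just assumptions for this sketch):
import { getFirestore, collection, query, where, getCountFromServer } from "firebase/firestore";

// Sketch: count only documents matching a condition (~10% of the 10m collection).
const db = getFirestore();
const filtered = query(collection(db, "10m"), where("index", "<=", 1_000_000));
const snapshot = await getCountFromServer(filtered);
console.log(`Matched ${snapshot.data().count} docs`);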
So based on this, you can count up to around 40m documents before you reach the 60 second timeout. Honestly, given that you're charged 1 document read for every up to 1,000 documents counted, you'll probably want to switch over to stored counters well before that.

How to check if the device's time is between two times in Flutter from Firebase/Firestore?

In my Firestore project, I have documents in a collection containing data for shops, with fields like shopName, shopAddress, startTime (e.g. 10 AM) and closeTime (e.g. 10 PM). (All strings for now.)
When the user is browsing the app, I retrieve the shop data from Firestore and display it in the app. Now I want to show that a shop is closed when the device's time is not between the shop's startTime and closeTime. How do I achieve this?
So far I can get the device's current time using the Dart package intl with this code:
print("${DateFormat('j').format(DateTime.now())}");
It gives output as follows:
I/flutter (14877): 6 PM
This is in DateFormat, and the values stored in Firestore are strings. I don't know how to compare them. Do let me know if I have to change the data types in Firestore too.
Thank you.
I think if you use the 24-hour time format and convert startTime, closeTime and actualTime to int or double (if the shop closes at 20:30/8:30 PM, for example), then you can easily compare them with an if statement. On the Firebase side, the string format is fine.
For example, you can build a map and iterate over it, checking whether actualTime is greater than or equal to startTime and less than closeTime.
I have never tried this code, but I think it is going to work:
Map map = {'1am': 1, '2am': 2, '3am': 3, ... , '11pm': 23};
map.entries.forEach((e) {
  if (e.key == actualTime) {
    if (e.value >= startTime && e.value < closeTime) {
      print('Open');
    } else {
      print('Closed');
    }
  }
});
By the way, I think you should use UTC, because if you change the time zone on your device, your app is going to show that the shop is closed when in fact it is open; you are just in a different time zone. You can easily do this with:
var now = DateTime.now().toUtc();
Maybe you can create a hash map like this:
hashMap=['12 AM', '1 AM', '2 AM', ... , '11 PM', '12 AM'];
After that you can get the positions of startTime, closeTime and actualTime, and see if the actualTime is between start and close times positions.
Let me know if you want me to give you a code example.

Daily_schedule triggered runs and backfill runs have different date partition

I have a @daily_schedule triggered daily at 3 minutes past 12 am.
When triggered by the scheduled tick at '2021-02-16 00:03:00'
The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'
While if triggered via backfill for partition '2021-02-16'
The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'
Why does the scheduled tick fill the partition of the day before? Is there an option to use the datetime of execution instead (without using a cron @schedule)? This discrepancy is confusing when I perform queries using the timestamp for exact dates.
P.S. I have tested that both scheduled runs and backfill runs have the same timezone.
@solid()
def test_solid(_, date):
    _.log.info(f"Input date: {date}")


@pipeline()
def test_pipeline():
    test_solid()


@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),
    # should_execute=four_hourly_fitler
)
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }
Sorry for the trouble here - the underlying assumption that the system is making here is that for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that it's not a given that all schedules will want this behavior, and this is good feedback that we should make this use case easier.
It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:
from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector


def partition_run_config(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }


test_partition_set = PartitionSetDefinition(
    name="test_partition_set",
    pipeline_name="test_pipeline",
    partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
    run_config_fn_for_partition=partition_run_config,
)

test_schedule_daily = (
    test_partition_set.create_schedule_definition(
        "test_schedule_daily",
        "3 0 * * *",
        execution_timezone="Asia/Singapore",
        partition_selector=create_offset_partition_selector(lambda d: d.subtract(minutes=3)),
    )
)
This is pretty similar to @daily_schedule's implementation, it just uses a different function for mapping the schedule execution time to a partition (subtracting 3 minutes instead of 3 minutes and 1 day - that's the create_offset_partition_selector part).
I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!
Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.
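As a rough sketch of what that looks like once partition_days_offset is available (START_DATE is a placeholder in the same role as in the snippets above; the rest of the decorator arguments are reused from them):
from datetime import datetime, time

from dagster import daily_schedule

START_DATE = datetime(2021, 2, 1)  # placeholder start date

# With partition_days_offset=0, the run executed on 2021-02-16 00:03
# fills the 2021-02-16 partition instead of 2021-02-15.
@daily_schedule(
    pipeline_name="test_pipeline",
    start_date=START_DATE,
    execution_time=time(0, 3),
    execution_timezone="Asia/Singapore",
    partition_days_offset=0,
)
def test_schedule_daily(date):
    return {
        "solids": {
            "test_solid": {"inputs": {"date": {"value": date.strftime("%Y-%m-%d %X")}}}
        }
    }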

How to set different time slots for a single HTTPS request in JMeter

I have an HTTP request where I need to send random time values within a given time frame.
Request looks like this:
http://domain/api1?&mt=getEmp&punchTime=1590678744
I have pre-defined shifts:
Morning: 0900 to 1400
Evening: 1400 to 1900
Night: 1900 to 2300
The expectation is that a random epoch time value within the pre-defined shift time slot should be used as the punch time for each request.
I don't want to split the request into separate requests per shift.
Could anyone please help me to achieve this with JMeter?
You can calculate a random timestamp in the given range using a suitable JSR223 Test Element and the Groovy language.
Example code which produces a random time between hours 9 and 14 of the current day and stores it in the randomMorningTime JMeter variable would be something like:
def calendar = Calendar.getInstance()
def morningStart = new GregorianCalendar(calendar.get(Calendar.YEAR), calendar.get(Calendar.MONTH), calendar.get(Calendar.DAY_OF_MONTH), 9, 00).getTimeInMillis()
def morningEnd = new GregorianCalendar(calendar.get(Calendar.YEAR), calendar.get(Calendar.MONTH), calendar.get(Calendar.DAY_OF_MONTH), 14, 00).getTimeInMillis()
def randomMorningTime = org.apache.commons.lang3.RandomUtils.nextLong(morningStart, morningEnd)
def timestamp = ( randomMorningTime / 1000).round() as String
log.info('Random morning time for current day: ' + new Date(randomMorningTime))
log.info('Associated timestamp: ' + timestamp)
vars.put('randomMorningTime', timestamp)
You will be able to access the generated value as ${randomMorningTime} where required.
See JavaDoc on GregorianCalendar class for more information
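Since you don't want to split the request per shift, a rough generalization of the same idea could pick one of the pre-defined shifts first and then a random epoch second inside it (the shift table and the punchTime variable name are assumptions for this sketch, not part of the original answer):
// Sketch: choose a shift at random, then generate a random epoch timestamp
// (in seconds) inside that shift and store it in the punchTime variable.
def shifts = [[9, 14], [14, 19], [19, 23]]   // morning, evening, night (start/end hours)
def shift = shifts[org.apache.commons.lang3.RandomUtils.nextInt(0, shifts.size())]

def cal = Calendar.getInstance()
def shiftStart = new GregorianCalendar(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH), shift[0], 0).getTimeInMillis()
def shiftEnd = new GregorianCalendar(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH), shift[1], 0).getTimeInMillis()

def punchTime = (org.apache.commons.lang3.RandomUtils.nextLong(shiftStart, shiftEnd) / 1000).longValue() as String
vars.put('punchTime', punchTime)
The generated value could then be referenced as ${punchTime} in the request path.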

CosmosDB query on date range + index

I have a Cosmos DB collection whose size is around 100 GB.
I successfully created a nice partition key; I have around 4,600 partitions on 70M records, but I still need to query on two datetime fields that are stored as strings, not in an epoch format.
Example json:
"someField1": "UNKNOWN",
"someField2": "DATA",
"endDate": 7014541201,
"startDate": 7054864502,
"someField3": "0",
"someField3": "0",
I notice that when I do select * from tbl and when I do select * from tbl where startDate > {someDate} AND endDate < {someDate1}, the latency difference is around 1 s, so this filtering does not decrease my latency.
Is it better to store dates as numbers? Does Cosmos have better performance on epoch range queries?
I am using the SQL API.
Also, when I try to add Hash indexes on startDate and endDate, it basically converts each of them into two indexes.
Example:
"path": "/startDate/?",
"indexes": [
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
},
This is converted to:
"path": "/startDate/?",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
Is that normal behaviour, or is it related to my data?
Thanks.
I checked the query metrics, and for 4k records the query to Cosmos DB executes in 100 ms. I would like to ask: is it normal behaviour that
var option = new FeedOptions { PartitionKey = new PartitionKey(partitionKey), MaxItemCount = -1 };
var query = client.CreateDocumentQuery<MyModel>(collectionLink, option)
    .Where(tl => tl.StartDate >= DateTimeToUnixTimestamp(startDate) && tl.EndDate <= DateTimeToUnixTimestamp(endDate))
    .AsEnumerable().ToList();
this query returns 10k results (in Postman it's around 9 MB) in 10-12 s? This partition contains around 50k records.
Retrieved Document Count : 12,356
Retrieved Document Size : 12,963,709 bytes
Output Document Count : 3,633
Output Document Size : 3,819,608 bytes
Index Utilization : 29.00 %
Total Query Execution Time : 264.31 milliseconds
Query Compilation Time : 0.12 milliseconds
Logical Plan Build Time : 0.07 milliseconds
Physical Plan Build Time : 0.06 milliseconds
Query Optimization Time : 0.01 milliseconds
Index Lookup Time : 51.10 milliseconds
Document Load Time : 140.51 milliseconds
Runtime Execution Times
Query Engine Times : 55.61 milliseconds
System Function Execution Time : 0.00 milliseconds
User-defined Function Execution Time : 0.00 milliseconds
Document Write Time : 10.56 milliseconds
Client Side Metrics
Retry Count : 0
Request Charge : 904.73 RUs
I am from the CosmosDB engineering team.
Since your collection has 70M records, I assume that the observed latency is only on the first roundtrip (or first page) of results. Note that the observed latency can also be improved by tweaking FeedOptions.MaxDegreeOfParallelism to -1 when executing the query.
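For instance, a minimal sketch of that tweak applied to the query you posted (only the MaxDegreeOfParallelism property is new; everything else is from your snippet):
var option = new FeedOptions
{
    PartitionKey = new PartitionKey(partitionKey),
    MaxItemCount = -1,
    // Let the SDK fetch result pages in parallel.
    MaxDegreeOfParallelism = -1
};

var query = client.CreateDocumentQuery<MyModel>(collectionLink, option)
    .Where(tl => tl.StartDate >= DateTimeToUnixTimestamp(startDate)
              && tl.EndDate <= DateTimeToUnixTimestamp(endDate))
    .AsEnumerable().ToList();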
Regarding the difference between the two queries themselves, please note that SELECT * without a filter is a full-scan query, which is probably a bit faster to return its first results than the query with two filters, since the latter does a little more work on the local indexes across all the partitions; that may explain the observed latency.
Regarding your other question, we no longer support the Hash indexing policy on new collections. Please see here: https://learn.microsoft.com/en-us/azure/cosmos-db/index-types#index-kind . We automatically convert Hash indexes to Range with full precision.
You may also fetch QueryMetrics for your query and analyze the results to figure out why you have latency. Details are here: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-query-metrics#query-execution-metrics
