I need to run a query that joins 5 large tables on user_id and filters on proc_date.
I plan to partition on proc_date and range-partition (5 ranges) on user_id to improve query performance. I also keep a primary index on proc_date and user_id.
But how can I run the query for just one user_id partition at a time? I want to restrict the query so that it joins only the first partition (on user_id) of every table.
The reason is that once the query completes for the first partition, I can send the output to the next process. While that process is running, I can run the query for the 2nd partition.
Could anyone please suggest a way to achieve this?
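One possible way to drive this is to issue the same join once per user_id range, with an explicit range predicate on every table so the database can skip the other partitions. The Python sketch below only generates the SQL text. The table names t1..t5, the range boundaries, and the assumption that every table is filtered on the same proc_date are all hypothetical; the boundaries would have to mirror the ranges declared in the real partitioning expression, and real code should use bind parameters rather than string formatting.

```python
from datetime import date

# Hypothetical user_id boundaries; they must mirror the 5 ranges declared
# in each table's partitioning expression.
USER_ID_RANGES = [
    (1, 1_000_000),
    (1_000_001, 2_000_000),
    (2_000_001, 3_000_000),
    (3_000_001, 4_000_000),
    (4_000_001, 5_000_000),
]
TABLES = ["t1", "t2", "t3", "t4", "t5"]  # hypothetical table names


def build_partition_query(tables, lo, hi, proc_date):
    """Build the join for a single user_id range.

    The range and date predicates are repeated for every table so that
    each table can be restricted to one partition.
    """
    first = tables[0]
    day = f"DATE '{proc_date.isoformat()}'"
    joins = [f"JOIN {t} ON {t}.user_id = {first}.user_id" for t in tables[1:]]
    filters = []
    for t in tables:
        filters.append(f"{t}.proc_date = {day}")
        filters.append(f"{t}.user_id BETWEEN {lo} AND {hi}")
    return (
        f"SELECT * FROM {first}\n"
        + "\n".join(joins)
        + "\nWHERE "
        + "\n  AND ".join(filters)
    )


def partition_queries(proc_date, tables=TABLES, ranges=USER_ID_RANGES):
    """Return one query per user_id range, in range order."""
    return [build_partition_query(tables, lo, hi, proc_date) for lo, hi in ranges]
```

Each generated statement can be run and its output handed to the next process before the following range is started.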
Related
I have a CosmosDB collection with id field and a partition key ManagerName.
I run two queries:
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624' and c.ManagerName ='Darin Jast2'
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624'
In Data Explorer the RU charges look strange: the first query costs 3.070 RUs and the second costs 2.9 RUs, almost every time I run the two queries.
That is strange to me because, from what I have read, when the partition key is in the WHERE clause the query runs against a single partition.
The stranger thing is that when I run
SELECT * FROM c
where c.ManagerName ='Darin Jast2'
I get 2.9 RUs; in fact, filtering on any single field gives the same number. It seems to be related to the number of WHERE conditions rather than to whether the partition key is included.
Can someone explain what is going on here and why I am getting these results? Does this have something to do with indexing, the size of the collection, or the number of partitions?
All the resources I found on Cosmos DB say you should include the partition key in your query and, if you can, run single-partition queries.
In a DynamoDB table I would like to query for all items where an attribute's value matches one of a set of values. For example, my table has a current_status attribute, and I would like all items that have either a 'NEW' or an 'ASSIGNED' value.
If I add a GSI on the current_status attribute, it looks like I have to do this in two queries. Or should I do a scan instead?
Scan is not recommended in DynamoDB. Use it only when there is no other option and you have a fairly small amount of data.
You need to use a GSI here. Putting current_status in the PK of the GSI would result in a hot partition issue.
The right solution is to put a random number in the PK of the GSI, ranging from 0 to N-1, where N is the number of partitions, and to put the status in the SK of the GSI, along with a timestamp or some other unique information to keep each PK-SK pair unique. When you want to query by current_status, execute N queries in parallel, one for each PK value from 0 to N-1, with the condition SK begins_with current_status. N should be chosen based on the amount of data you have. If the data in each row is less than 4 KB, this parallel query operation would consume N read units without the hot partition issue. The links below provide more detail:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html
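The sharding scheme described above can be sketched in Python as follows. This is an illustration under assumptions, not a definitive implementation: the attribute names gsi_pk and gsi_sk, the "#" separator, and NUM_SHARDS are hypothetical, and run_query is an injected callable so the sketch runs without AWS access. With boto3 it could be something like `lambda req: client.query(**req)["Items"]`, which ignores pagination.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical N; size it to the amount of data


def shard_keys(status, created_at, item_id, num_shards=NUM_SHARDS):
    """GSI key attributes to store on an item when it is written."""
    return {
        # Random shard number spreads writes across GSI partitions.
        "gsi_pk": {"N": str(random.randrange(num_shards))},
        # Status first so begins_with can match it; the rest keeps the pair unique.
        "gsi_sk": {"S": f"{status}#{created_at}#{item_id}"},
    }


def build_shard_query(table_name, index_name, shard, status):
    """Query parameters for one shard, in the low-level DynamoDB format."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "gsi_pk = :pk AND begins_with(gsi_sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"N": str(shard)},
            # Trailing separator stops 'NEW' from also matching e.g. 'NEWER'.
            ":prefix": {"S": f"{status}#"},
        },
    }


def query_all_shards(run_query, table_name, index_name, status,
                     num_shards=NUM_SHARDS):
    """Run one query per shard in parallel and merge the results."""
    requests = [
        build_shard_query(table_name, index_name, shard, status)
        for shard in range(num_shards)
    ]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        pages = list(pool.map(run_query, requests))
    return [item for page in pages for item in page]
```

To fetch both 'NEW' and 'ASSIGNED' items, query_all_shards would be called once per status, so 2N queries in total.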
I would like to cluster the raw tables of Firebase event data in BQ, but without reprocessing or creating additional tables (to keep costs to a minimum).
The main idea is to find a way to have the tables clustered when they are created from the intraday table.
I tried creating empty tables with a predefined schema (the same as the previous events tables), but partitioned by the _partition_time column (NULL partition) and clustered by the event_name column.
After Firebase inserts all the data from the intraday table, event_name still shows as the clustering field in the table's Details tab, but queries do not get any cheaper.
Is there another solution, or a way to make this work?
Thanks in advance.
/edit:
Our table's Details tab shows:
[screenshot: detail tab of table]
After running this query:
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
WHERE event_name = 'screen_view'
the result is:
[screenshot: how query processed whole table]
So there is no cost reduction.
But if I manually create the same table clustered by event_name with:
CREATE TABLE `aaaa.aaaa.events_20181222`
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_name
AS
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
then the same query from the first image, applied to the created table, processes only 5 MB, so clustering really works.
Team,
I have a DynamoDB table with a hash key (userid) and a sort key (ages). If we want to retrieve, for each hash key (userid), the item with the smallest age, what would the query and filter expression be?
Thanks!
I don't think you can do it in a single query. You would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table that has just a hash key (userID). This table will contain one record with the smallest age for each user. To achieve that, every time you update the main table, also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that. The update can be done either by the application itself or by an AWS Lambda function listening to the DynamoDB stream. Now if you need the smallest age for each user, you still do a full table scan of the second table, but this scan will only read relevant records, so it will be optimal.
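A minimal Python sketch of the conditional update described above. It only builds the parameters for a low-level UpdateItem call and does not contact AWS. The table name MinAges and the attribute names UserId and MinAge are assumptions for illustration, as is the string type of the user id. With boto3 the parameters would be passed as `client.update_item(**params)`, and a failed condition raises ConditionalCheckFailedException, which here simply means the stored age is already smaller or equal.

```python
def build_min_age_update(table_name, user_id, new_age):
    """UpdateItem parameters that store new_age only if it is a new minimum."""
    return {
        "TableName": table_name,
        "Key": {"UserId": {"S": user_id}},  # hypothetical key attribute
        "UpdateExpression": "SET MinAge = :age",
        # Write when no minimum is stored yet, or the stored one is larger.
        "ConditionExpression": "attribute_not_exists(MinAge) OR MinAge > :age",
        "ExpressionAttributeValues": {":age": {"N": str(new_age)}},
    }


def condition_holds(existing_item, new_age):
    """Local model of the ConditionExpression above, for illustration only."""
    if existing_item is None or "MinAge" not in existing_item:
        return True
    return existing_item["MinAge"] > new_age
```

Because the comparison happens inside DynamoDB as part of the write, two concurrent writers cannot overwrite a smaller age with a larger one.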
There are two ways to achieve that:
If you don't need this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. With this you can write SQL expressions using joins and GROUP BY operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is to use DynamoDB streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert in the original table, you can query the minimum age for the affected user and store it in the MinAges table.
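A rough Python sketch of such a stream consumer, written as a Lambda-style handler. The record layout (Records, dynamodb, Keys) follows the DynamoDB Streams event format; the key attribute name userid comes from the question and its string type is an assumption. The two callables get_smallest_age and store_smallest_age are hypothetical helpers that would wrap the actual query against the main table and the write to MinAges.

```python
def affected_user_ids(event):
    """Distinct user ids touched by a batch of stream records, in first-seen order."""
    seen = []
    for record in event.get("Records", []):
        # Keys holds the primary key of the changed item (INSERT, MODIFY or REMOVE).
        user_id = record["dynamodb"]["Keys"]["userid"]["S"]
        if user_id not in seen:
            seen.append(user_id)
    return seen


def make_handler(get_smallest_age, store_smallest_age):
    """Build a handler(event, context) around the two injected helpers."""
    def handler(event, context):
        user_ids = affected_user_ids(event)
        for user_id in user_ids:
            # Recompute from the main table so deletes are handled as well.
            store_smallest_age(user_id, get_smallest_age(user_id))
        return len(user_ids)
    return handler
```

Recomputing per affected user, rather than trusting the changed record alone, keeps the MinAges table correct when the item holding the current minimum is deleted.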
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not have native transaction support, it's dangerous to run code like that, because you may end up with inconsistent data. You can use the DynamoDB transactions library, but these transactions are expensive. If you use streams instead, you will have consistent data at a very low price.
You can do it using ScanIndexForward
// Query a single hash key and return only its first item in sort-key order.
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
        .withHashKeyValues(requestEntity)
        .withConsistentRead(false);
queryExpression.setIndexName(IndexName); // only if you are querying an index
// true = ascending sort key, so the smallest age comes first; use false for the largest
queryExpression.setScanIndexForward(true);
queryExpression.setLimit(1);
I just started figuring out DynamoDB.
I have a simple table that has a date attribute (e.g. 20160101) as HASH and a created_at attribute (e.g. 20160101185332) as RANGE.
I'd like to get the latest N items from the table.
First, the SCAN command does not have a ScanIndexForward option, so I don't think it's possible with SCAN.
Next, the QUERY command. It seems to work if I repeat the QUERY command several times until I have enough items (because I don't know how many items share the same key value). For example, I can query using today's date first and then repeat for the day before if the result does not contain enough items.
How can I do the job more efficiently? Or can I query without a KEY value?
As you described your table, you can't do it more efficiently, and you can't query DynamoDB without a KEY (hash) value.
Look at the answer here:
dynamodb get earliest inserted distinct values from a table
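For reference, the day-by-day approach from the question can be sketched in Python like this. It is a sketch under assumptions: the date attribute is taken to be stored as a string, max_days_back is a hypothetical safety bound, and run_query is an injected callable so the example runs without AWS access. With boto3 it could be `lambda req: client.query(**req)["Items"]`, ignoring pagination.

```python
from datetime import date, timedelta


def build_day_query(table_name, day, limit):
    """Query parameters for the newest `limit` items of one date partition."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#d = :day",
        # Alias the attribute name in case it collides with a reserved word.
        "ExpressionAttributeNames": {"#d": "date"},
        "ExpressionAttributeValues": {":day": {"S": day.strftime("%Y%m%d")}},
        "ScanIndexForward": False,  # descending created_at, newest first
        "Limit": limit,
    }


def latest_items(run_query, table_name, n, start_day, max_days_back=30):
    """Walk backwards one day at a time until n items are collected."""
    items = []
    day = start_day
    for _ in range(max_days_back):
        if len(items) >= n:
            break
        # Only ask for as many items as are still missing.
        items.extend(run_query(build_day_query(table_name, day, n - len(items))))
        day -= timedelta(days=1)
    return items[:n]
```

Each call is still a keyed query on one date, so the cost stays proportional to the items actually returned plus one small request per empty day.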