What is a solid DynamoDB access pattern for storing data from a bunch of receipts of identical format? I would use SQL for maximum flexibility on more advanced analytics, but as a learning exercise I want to see how far one can go with DynamoDB here. For starters I'd like to query for aggregate overall and per-product spending for a given time range, track product price history, sort receipts by total, and things along those lines. But I also want it to be as flexible as possible for future queries I haven't thought of yet. Would something like this, plus some GSIs, work?
-------------------------------------------------------------------------------------------------------------
| pk          | sk                     | unit $ | qty | total $ | receipt total | items                     |
-------------------------------------------------------------------------------------------------------------
| "product a" | "2021-01-01T12:00:00Z" | 2      | 2   | 4       |               |                           |
| "product b" | "2021-01-01T12:00:00Z" | 2      | 3   | 6       |               |                           |
| "receipt"   | "2021-01-01T12:00:00Z" |        |     |         | 10            | array of above item data  |
| "product a" | "2021-01-02T12:00:00Z" | 1.75   | 3   | 5.25    |               |                           |
| "product c" | "2021-01-02T12:00:00Z" | 2      | 2   | 4       |               |                           |
| "receipt"   | "2021-01-02T12:00:00Z" |        |     |         | 9.25          | array of above item data  |
-------------------------------------------------------------------------------------------------------------
You have to decide your access patterns first and build the DynamoDB design off of those, not the other way around. No one outside your team/product can tell you what your access patterns are; that depends entirely on your product's needs.
You have to ask: what pieces of information do you have, and what do you need to retrieve when you have them? Then decide which queries will be run most often and craft your PK/SK combinations around those. If you can't fit all your queries into just one or two pieces of information, you may want to set up an index - but indexes should be reserved for queries that run far less often.
If you need to, it's also accepted practice to store the same information twice - as two items in the table - because writes are easier/cheaper than multiple reads (a write is pretty much one WCU per item, while a query/scan can consume multiple RCUs even if you only need part of the result). Also, because indexes are replications of the table, there is a chance of reading stale data if you write and read too quickly, or read and write the same item in parallel calls.
Take the time now to sit down and consider everything your app will need to query DynamoDB for. The more you can figure out now, the better, and if you can set your PK to something that will almost always be available to the calling code, you will be in a much better state.
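To make that concrete, here is a minimal sketch (not a definitive implementation) of querying the model from the question with boto3, assuming the table is named receipts and that the "unit $" and "total $" columns are stored as attributes named unit_price and line_total. DynamoDB has no server-side SUM, so the aggregate is computed client-side:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("receipts")  # assumed table name

# Price history and spend for "product a" in January 2021: query the product's
# partition, constrained to a sort-key (ISO timestamp) range.
# (Pagination via LastEvaluatedKey is omitted for brevity.)
resp = table.query(
    KeyConditionExpression=Key("pk").eq("product a")
    & Key("sk").between("2021-01-01T00:00:00Z", "2021-01-31T23:59:59Z")
)
items = resp["Items"]

price_history = [(item["sk"], item["unit_price"]) for item in items]  # assumed attribute name
total_spend = sum(item["line_total"] for item in items)               # aggregate computed client-side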
How does one return a list of unique users from a DynamoDB table with the following (simplified) schema? Does it require a GSI? This is for an app with a small number of users, and I can think of ways that will work for my needs without creating a GSI (like scanning and filtering on SK, or creating a new item with a list of user ids inside). But what is the scalable solution?
---------------------------------------------------------
| pk      | sk                     | amount | balance |
---------------------------------------------------------
| "user1" | "2021-01-01T12:00:00Z" | 7      |         |
| "user1" | "2021-01-03T12:00:00Z" | 5      |         |
| "user2" | "2021-01-01T12:00:00Z" | 3      |         |
| "user2" | "2021-01-03T12:00:00Z" | 2      |         |
| "user1" | "user1"                |        | 12      |
| "user2" | "user2"                |        | 5       |
---------------------------------------------------------
Your data model isn't designed to fetch all unique users efficiently.
You certainly could use a scan operation and filter with your current data model, but that is inefficient.
If you want to fetch all users in a single query, you'll need to get all user information into a single partition. As you've identified, you could do this with a GSI. You could also re-organize your data model to accommodate this access pattern.
For example, you mentioned that the application has a small number of users. If the number of users is small enough, you could create a single partition that stores the list of all users (e.g. PK=USERS). If that item stays under the 400 KB item size limit, it may be a viable solution.
The idiomatic solution is to create a global secondary index.
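For illustration, a minimal sketch of that GSI approach with boto3, under the assumption that each per-user summary item (the ones with pk == sk == user id) also carries a constant attribute such as entity_type = "USER", and that a GSI named users-index uses entity_type as its partition key. A single query against that index then returns every unique user:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("balances")  # assumed table name

resp = table.query(
    IndexName="users-index",                           # assumed GSI name
    KeyConditionExpression=Key("entity_type").eq("USER"),
)
user_ids = [item["pk"] for item in resp["Items"]]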
I was planning to use a Dynamo table as a sort of replication log, so I have a table that looks like this:
+--------------+--------+--------+
| Sequence Num | Action | Thing |
+--------------+--------+--------+
| 0 | ADD | Thing1 |
| 1 | DEL | Thing1 |
| 2 | ADD | Thing2 |
+--------------+--------+--------+
Each of my processes keeps track of the last sequence number it read. Then on an interval it issues a Scan against the table with ExclusiveStartKey set to that sequence number. I assumed this would result in reading everything after that sequence, but instead I am seeing inconsistent results.
For example, given the table above, if I do a Scan(ExclusiveStartKey=1), I get zero results when I am expecting to see the 3rd row (seq=2).
I have a feeling it has to do with the internal hashing DynamoDB uses to partition the items and that I am misusing the ExclusiveStartKey option.
Is this the wrong tool for the job?
Alternatively, each process could issue a Query for seq+1 on each interval (looping as long as it finds something), which would consume the same read throughput, but would require N API calls instead of the roughly N / 1 MB pages I would get with a Scan.
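For reference, a rough sketch of that per-sequence polling alternative with boto3, assuming the table is named replication_log, its partition key is a numeric SequenceNum attribute with no sort key, and apply_change is a hypothetical handler for the ADD/DEL actions:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("replication_log")  # assumed table name

def poll(last_seen: int) -> int:
    """Issue a Query for seq+1 repeatedly until nothing newer is found."""
    seq = last_seen + 1
    while True:
        resp = table.query(KeyConditionExpression=Key("SequenceNum").eq(seq))
        if not resp["Items"]:
            return seq - 1            # caught up; report the new "last seen" sequence number
        for item in resp["Items"]:
            apply_change(item)        # hypothetical handler for ADD/DEL actions
        seq += 1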
A DynamoDB Scan does not proceed in sorted order by the hash key; items come back in the order of DynamoDB's internal partitioning. ExclusiveStartKey is meant to be the LastEvaluatedKey returned by a previous page of the same Scan, so it does not let you start from an arbitrary key of your choosing.
For this example table with the Sequence ID, what I want can be accomplished with a Kinesis stream.
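The answer above mentions a Kinesis stream; DynamoDB Streams is the closely related built-in change feed. As a rough sketch (the stream ARN below is a placeholder), reading each shard from its oldest available record with boto3's dynamodbstreams client looks roughly like this:

import boto3

streams = boto3.client("dynamodbstreams")
stream_arn = "arn:aws:dynamodb:region:account:table/replication_log/stream/label"  # placeholder

desc = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]
for shard in desc["Shards"]:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",   # start from the oldest record in the shard
    )["ShardIterator"]
    while iterator:
        page = streams.get_records(ShardIterator=iterator)
        for record in page["Records"]:
            print(record["eventName"], record["dynamodb"].get("NewImage"))
        if not page["Records"]:
            break                            # caught up for this demo; a real consumer keeps polling
        iterator = page.get("NextShardIterator")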
I have an Application Insights Analytics query that looks like this...
requests
| summarize count() by bin(duration, 1000)
| order by duration asc nulls last
...which gives me something like the following, showing the number of requests recorded in Application Insights, binned by duration into 1000 ms (one-second) buckets:
| duration | count_ |
| 0        | 1000   |
| 1000     | 500    |
| 2000     | 200    |
I would like to be able to add another column which shows the count of exceptions from all requests in each bin.
I understand that extend is used to add additional columns, but to do so I would have to reference the 'outer' expression to get the bin constraints, which I don't know how to do. Is this the best way to do this? Or am I better off trying to join the two tables together and then doing the summarize?
Thanks
As you suspected, extend will not help you much here. What you need is to run a join kind=leftouter on the operation IDs (leftouter is needed so you won't drop requests that did not have any exceptions):
requests
| join kind=leftouter (
    exceptions
    | summarize exceptionsCount = count() by operation_Id
) on operation_Id
| summarize count(), sum(exceptionsCount) by bin(duration, 1000)
| order by duration asc nulls last
I am a software engineer, but I am very new to databases, and I am trying to hack up a tool to show a demo.
I have an Apache server which serves a simple web page full of tables. Each row in the table has a proposal id and a link to a web page where the proposal is explained. So just two columns.
----------------------
| id | proposal |
|--------------------
| 1 | foo.html |
| 2 | bar.html |
----------------------
Now, I want to add a third column titled Comments where a user can leave comments.
------------------------------------------------
| id | proposal | Comments |
|-----------------------------------------------
| 1 | foo.html | x: great idea ! |
| | | y: +1 |
| 2 | bar.html | z: not for this release |
------------------------------------------------
I just want to quickly hack something up to show as a demo and get feedback. I am planning to use SQLite, create a table per id, and store the userid and comments in that table. People can add comments at the same time, so I am planning to use a lock when performing operations on the SQLite database. I am not worried about scaling; I just want to show it and get feedback. Are there any major flaws in this implementation?
There are similar questions, but I am looking for the simplest possible implementation.
A table per ID? Why would you want to do that? If you get a large number of proposals, the number of tables will get out of hand very quickly. You just need a single comments table with an id column to keep track of which proposal each comment belongs to; that keeps the number of tables at a sane figure.
The other drawback of using a table for each proposal is that you will not be able to use prepared statements for those, because table names cannot be bound as a parameter.
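A minimal sketch of that single-comments-table idea with Python's sqlite3 (the file name, table, and column names here are just illustrative):

import sqlite3

conn = sqlite3.connect("demo.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS comments (
           proposal_id INTEGER NOT NULL,   -- which proposal the comment belongs to
           userid      TEXT    NOT NULL,
           comment     TEXT    NOT NULL
       )"""
)

def add_comment(proposal_id, userid, comment):
    # SQLite serializes writes on its own, so an explicit application-level
    # lock is usually unnecessary for a small demo.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO comments (proposal_id, userid, comment) VALUES (?, ?, ?)",
            (proposal_id, userid, comment),
        )

add_comment(1, "x", "great idea !")
add_comment(1, "y", "+1")

Because there is only one table, the INSERT can be a prepared statement with bound parameters, which is exactly the point about prepared statements made above.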
Assuming the table name is 'a':
Add column
alter table a add column Comments text;
Insert comment
insert into a values (4, 'hello.html', 'New Comment');
You need to provide values for the other two columns along with the new comment.
I want to create a page in Drupal to report some basic forum information. I thought I'd use Views, but Views only lets you set one "entity" type per view, while forum topics are made up of both nodes and comments (i.e., topics and replies).
Ideally, I'd like a single view that lists all forum nodes and comments together in a single table (sorted by date), along with a total number of both combined, if possible. Is there a way to do that with Views?
Update: What I'm looking for is something like this:
-------------------------------------------------------
| User | Post | Type | Date |
-------------------------------------------------------
| amy | post text appears here | post | 1/5/01 |
| bob | comment text appears here | comment | 1/5/01 |
| amy | another comment here | comment | 1/5/01 |
| cid | another post appears here | post | 1/4/01 |
| dave | yet another comment here | comment | 1/4/01 |
-------------------------------------------------------
total posts + comments: 5
I'm not sure what you really want. You can either display nodes plus their number of comments, or display nodes and comments at the same level - but then there is no combined total because they are all separate rows. Or do you want to show each comment separately, together with the number of comments in its thread?
If the latter, that might not be trivial.
Basically, you could create a UNION SELECT query that queries both the node and the comment table. It could look like this:
(SELECT 'node' AS type, n.nid AS id, n.title AS title, nncs.comment_count AS comment_count, n.created AS timestamp
 FROM {node} n
 INNER JOIN {node_comment_statistics} nncs ON n.nid = nncs.nid)
UNION
(SELECT 'comment' AS type, c.cid AS id, c.subject AS title, cncs.comment_count AS comment_count, c.timestamp AS timestamp
 FROM {comments} c
 INNER JOIN {node_comment_statistics} cncs ON c.nid = cncs.nid)
ORDER BY timestamp DESC
LIMIT 10;
That will return a result containing: node/comment | id | title | comment_count | timestamp.
See http://dev.mysql.com/doc/refman/5.1/en/union.html for more information about UNION.
You can then theme that as a table.
Hints:
- If you need more data, either extend the query or use node_load/comment_load.
- You could also join {node} in the second query and use the node title instead of the comment subject.
- That query is going to be slow because the UNION will always force a filesort. It might actually be faster to execute two separate queries and then mangle them together in PHP if you have a large number of nodes/comments.
It turns out the Tracker 2 module provides enough of what I needed.