DynamoDB data model for process/transaction monitoring - amazon-dynamodb

I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There is a relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed, or that didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (there are fewer than 10 statuses).

Instead of using multiple rows for every status change, if you updated the same event row instead, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId') which would be derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help in preventing the index from becoming hot.
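For illustration, here is a minimal boto3 sketch of that write-sharding idea in Python. The table name (BatchEvents), GSI name (derivedStatusId-index) and shard count are assumptions, not details from the question:

# Hypothetical table "BatchEvents" keyed on (batchId, eventId) and a GSI
# "derivedStatusId-index" hashed on derivedStatusId -- placeholder names.
import random
import time

import boto3
from boto3.dynamodb.conditions import Key

SHARDS = 100  # fan-out factor; produces "2-00" .. "2-99"

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BatchEvents")

def put_event(batch_id, event_id, status_id, message):
    """Write an event whose derivedStatusId carries a random shard suffix."""
    table.put_item(Item={
        "batchId": batch_id,
        "eventId": event_id,
        "statusId": status_id,
        "derivedStatusId": f"{status_id}-{random.randrange(SHARDS):02d}",
        "timestamp": int(time.time() * 1000),
        "message": message,
    })

def unfinished_events(max_status):
    """Gather events with statusId < max_status by querying every shard of each status."""
    items = []
    for status in range(max_status):
        for shard in range(SHARDS):
            resp = table.query(
                IndexName="derivedStatusId-index",
                KeyConditionExpression=Key("derivedStatusId").eq(f"{status}-{shard:02d}"),
            )
            items.extend(resp["Items"])
    return items

Note that reading all unfinished events means querying every shard of every unfinished status; that is the trade-off for spreading the write load across the index.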
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well - which is a good property if events are expected to finish processing eventually, even if the items themselves stay around forever. This technique is called "Sparse Index" and is described in more detail here.
From your question, it seems like keeping a history of status changes is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row. DynamoDB supports list data types and also has a generous 400KB item limit, which may well allow you to capture all the desired history in the same record.
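Continuing the sketch above, here is one hedged way to combine the two ideas: append each status change to a history list on the same item and drop derivedStatusId once the event reaches an assumed "finished" status so it leaves the sparse index. The FINISHED_STATUS threshold and attribute names are assumptions:

import random
import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BatchEvents")  # same hypothetical table as above

SHARDS = 100
FINISHED_STATUS = 5  # assumed "done" status; anything below is unfinished

def record_status_change(batch_id, event_id, status_id, message):
    """Append the change to a 'history' list on the item and keep statusId current;
    drop derivedStatusId once finished so the item falls out of the sparse GSI."""
    entry = {"statusId": status_id,
             "timestamp": int(time.time() * 1000),
             "message": message}
    if status_id >= FINISHED_STATUS:
        update = ("SET statusId = :s, "
                  "history = list_append(if_not_exists(history, :empty), :h) "
                  "REMOVE derivedStatusId")
        values = {":s": status_id, ":h": [entry], ":empty": []}
    else:
        update = ("SET statusId = :s, derivedStatusId = :d, "
                  "history = list_append(if_not_exists(history, :empty), :h)")
        values = {":s": status_id, ":h": [entry], ":empty": [],
                  ":d": f"{status_id}-{random.randrange(SHARDS):02d}"}
    table.update_item(Key={"batchId": batch_id, "eventId": event_id},
                      UpdateExpression=update,
                      ExpressionAttributeValues=values)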

Related

Function of Rows, Rowsets in PeopleCode

I'm trying to get a better understanding of what Rows and Rowsets are used for in PeopleCode. I've read through PeopleBooks and still don't feel like I have a good understanding. I'm looking to get more understanding of these as they pertain to Application Engine programs. Perhaps walking through an example may help. Here are some specific questions I have:
I understand that Rowset, Row, Record, and Field objects are used to access component buffer data, but is this still the case for stand-alone Application Engine programs run via Process Scheduler?
What would be the need or advantage of using these as opposed to using SQL objects/functions (CreateSQL, SQLExec, etc.)? I often see AE programs where a rowset is instantiated with CreateRowset and populated using a .Fill method with a SQL WHERE clause, and I don't quite understand why a SQL object was not used instead.
I've seen in PeopleBooks that a Row object in a component scroll is a row; how does a component scroll relate to the row? I've seen references to rows having different scroll levels. Is this just a way of grouping and nesting related data?
After you have instantiated a rowset with CreateRowset, what are typical uses of it in the program afterwards? How would you perform logic (If, Then, Else, etc.) on data retrieved by the rowset, or use it to update data?
I appreciate any insight you can share.
You can still use Rowsets, Rows, Records and Fields in stand-alone Application Engines. Application Engines do not have component buffer data, as they are not running within the context of a component. Therefore, to use these objects you need to populate them using built-in methods like .Fill() on a rowset, or .SelectByKey() on a record.
The advantage of using rowsets over SQL is that it makes CRUD operations easier. There are built-in methods for selecting, updating, inserting and deleting. Additionally, you don't have to declare a large number of variables for multiple fields like you would with a SQL object. Another advantage is that when you do the fill, the data is read into memory, whereas if you looped through the SQL, the SQL cursor would be open longer. The rowset, row, record and field objects also have a lot of other useful methods, such as allowing you to ExecuteEdits (validation) or copy from one rowset/row/record to another.
This question is a bit less clear to me, but I'll try to explain. If you have a page, it has a level 0 row. That row can contain multiple level 1 rowsets, and each row under those can in turn contain level 2 rowsets:
                Level0
               /      \
         Level1        Level1
         /    \        /    \
    Level2  Level2  Level2  Level2
If one of your level 1 rowsets had 3 rows, then you would find 3 Row objects in the Rowset associated with that level 1. Not sure I explained this well enough to answer what you need; please clarify if I can provide more info.
Typically after I create a rowset, I loop through it, access the record on each row, and do some processing with it. In the example below, I loop through all locked accounts, prefix their description with LOCKED, and then update the database.
Local boolean &updateResult;
Local integer &i;
Local Record &lockedAccount;
Local Rowset &lockedAccounts;

&lockedAccounts = CreateRowset(Record.PSOPRDEFN);
&lockedAccounts.Fill("WHERE ACCTLOCK = 1");

For &i = 1 To &lockedAccounts.ActiveRowCount
   &lockedAccount = &lockedAccounts(&i).PSOPRDEFN;
   If Left(&lockedAccount.OPRDEFNDESCR.Value, 6) <> "LOCKED" Then
      &lockedAccount.OPRDEFNDESCR.Value = "LOCKED " | &lockedAccount.OPRDEFNDESCR.Value;
      &updateResult = &lockedAccount.Update();
      If Not &updateResult Then
         /* Error handle failed update */
      End-If;
   End-If;
End-For;

How do I keep track of the most recent time an item was read in DynamoDB?

I have a use case where I want to always know, and be able to look up, DynamoDB items by their last read time. What is the easiest way to do this? (I would prefer not to use any other services.)
You can look up items by their last read time in DynamoDB by using a combination of the UpdateItem API, the Query API and a GSI.
Estimate the amount of time it will take your application to randomly read 1 MB worth of items from the DynamoDB table. Let's assume we are working with small items, each <= 1 KB; then, if the reads per second (RPS) on the table is 100, it will take about 10 seconds for your application to randomly read 1 MB of data.
Create a GSI that projects all attributes and is keyed on (PK=read_time_bucket, SK=read_time). The writes per second (WPS) on the GSI should equal the RPS of the base table in the case of small items.
Use the UpdateItem API with the following parameters to "read" each item; the UpdateItemResult will contain the item along with the updated last read time and bucket:
(
ReturnValues=ALL_NEW,
UpdateExpression="SET read_time_bucket = :bucket, read_time = :time",
ExpressionAttributeValues={
":bucket": <a partial timestamp indicating the 10-second-long bucket of time that corresponds to now>,
":time": <epoch millis>
}
)
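As a sketch, the same call in Python with boto3 might look like the following; the table name "Items", the key attribute "pk" and the bucket size are placeholders, not part of the original answer:

import time

import boto3

BUCKET_SECONDS = 10  # matches the 10-second bucket estimated above

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Items")

def read_item(pk):
    """'Read' an item through UpdateItem so its last read time is recorded,
    and get the full item back via ReturnValues=ALL_NEW."""
    now_ms = int(time.time() * 1000)
    bucket = (now_ms // 1000) // BUCKET_SECONDS * BUCKET_SECONDS  # start of the current bucket
    resp = table.update_item(
        Key={"pk": pk},
        UpdateExpression="SET read_time_bucket = :bucket, read_time = :time",
        ExpressionAttributeValues={":bucket": bucket, ":time": now_ms},
        ReturnValues="ALL_NEW",
    )
    return resp["Attributes"]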
You can use the Query API to look up items by last read time on the GSI, using key conditions on read_time_bucket and read_time.
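Continuing the sketch, a hypothetical query against a GSI named "read_time-index" could look like this:

from boto3.dynamodb.conditions import Key

def items_read_in_bucket(bucket, since_ms=0):
    """Return items whose last read time falls in 'bucket', newest first."""
    resp = table.query(
        IndexName="read_time-index",
        KeyConditionExpression=Key("read_time_bucket").eq(bucket)
        & Key("read_time").gte(since_ms),
        ScanIndexForward=False,  # newest reads first
    )
    return resp["Items"]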
You will need to adjust your time bucket size and throughput settings depending on item size and read/write patterns on the base table. If item size is prohibitively large, restrict the GSI projection to a subset of attributes (INCLUDE) or KEYS_ONLY.

Is this loop redundant?

The three tables of interest are:
Event, containing various details of, e.g., the Berlin marathon,
Result, containing various fields including user's race time and a FK to an Event, and
Goal, with a FK to the Event the user would like to run, a field for the time they'd like to run it in, and eventually a FK to the Race at which the user achieved their goal.
Obviously, the Event of the Race where the user achieved their goal has to be the Event of the Goal. But not all Goals have been achieved -- some may never be.
Is this bad design? Can anybody suggest a better way of modelling this problem? I'm using sqlite in a django project.
Your Event table is OK.
But your Goal table design mixes up the proposed event and the actual achieved event.
I think the Result table can be merged with the Goal table into a new Result table.
Since one user may want to run multiple events, your new Result table could look like this:
UserID | EventID | TimeProposed | ActualTimeUsed | Achieved
------ | ------- | ------------ | -------------- | --------
1      | 1       | 1 hour       | 1.1 hour       | No
1      | 2       | 1.5 hour     | 1.2 hour       | Yes
So the loop you mentioned is removed, since each row has only one event. (UserID and EventID remain FKs to the other two tables.)
The Achieved column can be updated using a query that checks whether ActualTimeUsed <= TimeProposed.
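For illustration, a rough Django sketch of that merged table might look like the following; the model and field names are placeholders rather than anything from the original project:

from django.conf import settings
from django.db import models

class Event(models.Model):
    name = models.CharField(max_length=200)
    date = models.DateField()

class Result(models.Model):  # replaces the separate Goal + Result tables
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    event = models.ForeignKey(Event, on_delete=models.CASCADE)
    time_proposed = models.DurationField()                           # the goal time
    actual_time_used = models.DurationField(null=True, blank=True)   # null until the race is run
    achieved = models.BooleanField(default=False)

    def update_achieved(self):
        """Set 'achieved' by comparing the two times, as suggested above."""
        if self.actual_time_used is not None:
            self.achieved = self.actual_time_used <= self.time_proposed
            self.save(update_fields=["achieved"])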

Is the performance of ref.endAt().limit(n).on(...) similar for .priority and name?

I am currently using ref.endAt().limit(n).on(...) to get the 'last' n values.
All the .priority values are null, so the list is sorted by name, which is a zero-padded timestamp.
It seems that if I also set the .priority of each item to the timestamp, it would take more storage. Does it?
Regardless of whether or not it takes more storage, is there a significant performance difference when retrieving the last n sorted items if .priority is all null (so the name sort is used) versus if the .priority values are all unique and the priority sort is used?
I am currently designing for it to work well with around 10,000 items in a list. Is .priority or name sort better when a list gets over 1,000,000 items?
What about using ref.startAt(null, timeStart).endAt(null, timeEnd).on(...)?
I could profile, but how would I know whether server load or network delays are or are not affecting the results?
There should be no meaningful performance difference between using priority or key names to sort items. Firebase first looks at priority to sort items and, if it doesn't exist, sorts items by key name. There might be a performance gain from using priority instead of key name, but I expect it to be very small.

Handling SortOrder fields in SQL Server

In a specific table I have a SortOrder integer field that tells my page in which order to display the data. There are sets of data within this field (based on a CategoryID field), and each set has its own ordering. Users can add/remove/update records in this table.
My question is: what is the best way to manage this SortOrder field? I would like to "reseed" it every time a record is deleted or updated. Is this something I should be using a trigger for, or should my code handle the reseeding?
What I used to do is use only odd numbers in the SortOrder field, so upon changing the order I would add or subtract 3 from the current value of the modified item and then reseed (renumber the items again using odd indexes). I also used to reseed after every insert or delete.
All you really have to worry about is swapping the SortOrder values of any two rows. New entries go to the end, and I'm sure you've got a mechanism by which the user can change the order. An order change (move up or down) really is just a swap with a neighboring row. All you really care about is that the rows sort properly; don't let a mathematical sense of aesthetics drive you into creating something overly complex. (You'll end up with holes in your sequence after deletes are made, but that's OK. It's an internal sequence marker used for ORDER BY; the numbers don't need to be made contiguous.)
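As an illustration of the swap-only approach, here is a small Python sketch using the sqlite3 module for brevity; the table and column names (Items, ItemID, CategoryID, SortOrder) are assumptions, and the same statements would apply to SQL Server through whatever data access layer you use:

import sqlite3

def swap_sort_order(conn, item_a, item_b):
    """Swap the SortOrder values of two rows; no reseeding needed."""
    cur = conn.cursor()
    (order_a,) = cur.execute(
        "SELECT SortOrder FROM Items WHERE ItemID = ?", (item_a,)).fetchone()
    (order_b,) = cur.execute(
        "SELECT SortOrder FROM Items WHERE ItemID = ?", (item_b,)).fetchone()
    cur.execute("UPDATE Items SET SortOrder = ? WHERE ItemID = ?", (order_b, item_a))
    cur.execute("UPDATE Items SET SortOrder = ? WHERE ItemID = ?", (order_a, item_b))
    conn.commit()

def append_item(conn, category_id, name):
    """New entries just take MAX(SortOrder) + 1 within their category; gaps are fine."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO Items (CategoryID, Name, SortOrder) "
        "SELECT ?, ?, COALESCE(MAX(SortOrder), 0) + 1 FROM Items WHERE CategoryID = ?",
        (category_id, name, category_id),
    )
    conn.commit()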
