Separate tables vs map lists - DynamoDB

I need your help. I am quite new to databases.
I'm trying to set up a table in DynamoDB to store info about TV shows. It seems pretty simple and straightforward, but I am not sure if what I am doing is correct.
So far I have this structure. I am trying to fit everything about the TV shows into one table. Seasons and episodes are contained within a list of maps within a list of maps.
Is this too much layering?
Would this present a problem in the future where some items are huge?
Should I separate some of these lists of maps to another table?
Shows table

Ideally, you should not put a potentially unbounded list in a single row in DynamoDB because you could end up running into the item size limit of 400 KB. Also, if you were to read or write one episode of one show, you would consume capacity as if you were reading or writing all the episodes in the show.
Take a look at the adjacency list pattern. It’s a good choice because it will allow you to easily find the seasons in a show and the episodes in a season. You can also take a look at this slide deck. Part of the way through, it talks about hierarchical data, which is exactly what you’re dealing with.
If you can provide more information about your query patterns, I can give you more guidance on how to model your data in the table.
Update (2018-11-26)
Based on your comments, it sounds like you should use composite keys to establish hierarchical 1-N relationships.
By using a composite sort key of DataType:ItemId where ItemId is a different format depending on the data type, you have a lot of flexibility.
This approach will allow you to easily get the seasons in the show, get all episodes in all seasons, get all episodes in a particular season, or even get all episodes between season 1, episode 5 and season 2 episode 5.
hash_key | sort_key | data
----------|-----------------|----------------------------
SHOW_1234 | SHOW:SHOW_1234 | {name:"Some TV Show", ...
SHOW_1234 | SEASON:SE_01 | {descr:"In this season, the main character...
SHOW_1234 | EPISODE:S01_E01 | {...
SHOW_1234 | EPISODE:S01_E02 | {...
Here are the various key condition expressions for the queries I mentioned:
hash_key = "SHOW_1234" and sort_key begins_with("SEASON:") – gets all seasons
hash_key = "SHOW_1234" and sort_key begins_with("EPISODE:") – gets all episodes in all season
hash_key = "SHOW_1234" and sort_key begins_with("EPISODE:S02_") – gets all episodes in season 2
hash_key = "SHOW_1234" and sort_key between "EPISODE:S01_E5" and "EPISODE:S02_E5" – gets all episodes between season 1, episode 5 and season 2 episode 5

Related

Invisible graphs cause report to slow

I have a report with a parameter where the end user chooses a practice name that corresponds to a group of people. Most of these groups have fewer than 10 people, but a small number of them have as many as 150. When there are more than 15 people in a given group, they want separate graphs, each with no more than 15 people. So for most of the groups, we only need one graph. For a few, we need a lot of graphs.
Behind the scenes, I created a graph for each multiple of 15 people, and set them to only be visible if there are actually that many people in the group. This does what I need it to, but it makes the report super slow. As best I can tell, when an end user runs the report it's still somehow rendering the hidden graphs behind the scenes and slowing it all to heck. (I did find this link, which I think suggests this is a known bug.)
I need to have one report where the end user selects the practice name, so I can't make two reports, "My practice is normal" and "My practice is ginormous". I thought maybe I could make a conditional sub-report split into those two reports based on the practice name parameter, but that doesn't appear to be possible; you can play around with visibility but I'm guessing that will still cause the invisible graph rendering problem and not help my speed.
Are there any other cool tips I can try to speed up my report, or is this just a case of too many graphs spoiling the broth?
The easiest way would be to generate a group number for every 15 people and then use a list control to repeat the chart for each group.
Here's a very quick example of this in action. I just used some sample data from one of the Adventure Works sample databases.
Here's my query that returns every person in each selected department. Note that I have commented out the DECLAREs as these were just in there for testing.
--DECLARE @Department varchar(50) = ''
--DECLARE @chartMax int = 5
SELECT
    GroupName, v.Department, v.FirstName, v.LastName
    , ChartGroup = (ROW_NUMBER() OVER(PARTITION BY Department ORDER BY LastName, FirstName) - 1) / @chartMax -- calc which chart number the person belongs to
    , Salary = ((ABS(CHECKSUM(NewId())) % 100) * 500) + (ABS(CHECKSUM(NewId())) % 1000) + 10000 -- just a random number to plot
FROM [HumanResources].[vEmployeeDepartment] v
WHERE Department IN (@Department)
ORDER BY Department
The key bit is the ChartGroup column
ChartGroup = (ROW_NUMBER() OVER(PARTITION BY Department ORDER BY LastName, FirstName) - 1) / @chartMax
This will give the first 5 rows in each department a ChartGroup of 0, the next 5 a ChartGroup of 1, and so on. I used 5 rather than 15 just so it's easier to demo.
Here are the dataset results.
Now, in your report, add a List and set its dataset property to the dataset containing your main data (the query above in my case).
Now edit the 'Details' row group properties and add a grouping by Practice and ChartGroup (Department and ChartGroup in this example).
In the list box's textbox, right-click then insert a chart.
Set the chart up as required; in my example, I used Salary as the values on a pie chart and the employee names as the labels.
Here's the final design ..
Note that I set the department as a multi-value parameter and also set the number of persons per chart (chartMax) as a report parameter.
When I preview the report I get this for 'Engineering' which has 6 employees
Sales has 18 employees so we get this
... and so on; it will generate a new chart for every 15 people, or part thereof.

What is the best strategy for optimizing aggregation queries over a long period of time/dataset in Azure Data Explorer (Kusto)?

I have some queries that look at aggregated data over a long period of time (180 days) for data that is per second (example query below). The table's hot cache is 31 days, so the queries can take over a minute to return, and this is not acceptable for the dashboards I want to display them on. What would be the recommended optimization strategies? My thoughts so far are to either use an update policy to push the data for these tags into a separate table with a hot cache of 180 days, or to use a materialized view.
raw_table
| where TimeStamp between (now(-180d) .. now()) and TagName in ("Tag1","Tag2")
| extend Date = startofday(TimeStamp)
| summarize Value1=max(Value) by Date,TagName
| summarize Value1=sum(Value1) by Date
| project TagName="AggregatedData", Date, Value=Value1
My thoughts so far are to either use an update policy to push the data for these tags into a separate table with a hot cache of 180 days, or to use a materialized view.
Both options you mentioned are appropriate (even a combination of both, if required).
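For example, a materialized view for the daily max-per-tag aggregation could be created with a management command roughly like the one below, sketched here in Python with the azure-kusto-data package; the cluster URI, database name, view name, and authentication method are placeholders/assumptions:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI and database; Azure CLI auth is just one option
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"
)
client = KustoClient(kcsb)

# Materialized view that keeps the daily max per tag, so the 180-day dashboard
# query only has to read a few hundred pre-aggregated rows per tag
create_view = """
.create materialized-view DailyMaxByTag on table raw_table
{
    raw_table
    | summarize Value1 = max(Value) by TagName, Date = bin(TimeStamp, 1d)
}
"""
client.execute_mgmt("MyDatabase", create_view)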

BigQuery to Data Studio: Show reliable COUNT DISTINCT regardless of the selected period

In my BigQuery project I store event data integrated from Firebase. The granularity and dimensionality are such that trying to present raw data in Data Studio quickly makes the report become VERY slow (1-2 min per page/interaction).
I then started to think how I could create pre-aggregated tables in BigQuery to speed everything up, but quickly realised COUNT DISTINCT metrics would be a problem with this approach.
Let me explain:
SELECT user, date
FROM UNNEST([
    STRUCT("Adam" AS user, "20190923" AS date),
    ("Bob", "20190923"),
    ("Carl", "20190923"),
    ("Adam", "20190924"),
    ("Bob", "20190924"),
    ("Adam", "20190925"),
    ("Carl", "20190925"),
    ("Bob", "20190926")
]) AS website_visits;
+------+----------+
| User | Date |
+------+----------+
| Adam | 20190923 |
| Bob | 20190923 |
| Carl | 20190923 |
| Adam | 20190924 |
| Bob | 20190924 |
| Adam | 20190925 |
| Carl | 20190925 |
| Bob | 20190926 |
+------+----------+
The above is a table of website visits.
Clearly, creating a pre-aggregated table like
SELECT date, COUNT(DISTINCT user) FROM website_visits GROUP BY date
has the limitation that the count cannot be aggregated further (much less dynamically) to get a total, as doing a SUM would return 8 unique users, which is not correct; there are only 3 unique users.
In BigQuery, this is fixed by using HLL_COUNT, which despite the approximation works ok for me.
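(To make the problem concrete, here is a small Python sketch of why per-day distinct counts can't simply be summed while mergeable structures can; plain sets stand in for BigQuery's HLL sketches:)

# Per-day sets of visitors, taken from the example table above
daily_users = {
    "20190923": {"Adam", "Bob", "Carl"},
    "20190924": {"Adam", "Bob"},
    "20190925": {"Adam", "Carl"},
    "20190926": {"Bob"},
}

# Summing per-day distinct counts double-counts returning visitors
print(sum(len(users) for users in daily_users.values()))  # 8 -- wrong

# Merging the underlying sets first, then counting, is correct
print(len(set.union(*daily_users.values())))  # 3 -- correct

# HLL_COUNT.INIT / HLL_COUNT.MERGE play the same roles in BigQuery, storing a
# compact, mergeable sketch per day instead of the full set of users.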
Now to the big question:
How can I do the same so that the result is displayable in Data Studio?
HLL_COUNT.EXTRACT is not available as a function there, and in the report I always have to keep in mind that the date range is set by the user however they like, so it's not possible to store a pre-aggregated result for ALL cases...
EDIT 1: APPROX_COUNT_DISTINCT
As per the answer from Bobbylank, I tried to use APPROX_COUNT_DISTINCT.
However, I found that this just seems to move the issue down the line. My fault for not explaining the full picture.
Despite the performance being acceptable, it does not seem possible to blend a data source with this calculated metric.
Example: After displaying the amount of unique users in the selected period (which now works), I'm also trying to display Average Revenue Per User (ARPU) in Data Studio like Firebase does.
To do this, I have to SUM(REVENUE) / APPROX_COUNT_DISTINCT(USER)
Clearly, REVENUE works ok with pre-aggregation and is available in the raw data. I tried then to blend the raw data with a table containing just user visits. However APPROX_COUNT_DISTINCT can't be used in the blended data definition as calculated metrics are not allowed.
Even when I use the USER field as a metric with Count Distinct aggregation, which returns the correct figures when showing revenue and user count separately, the problem becomes aggregation as soon as I try to divide them (applying SUM or AVG to the field basically yields AVG(REVENUE/USERS) for each day).
I also then tried to store REVENUE directly in the visits table, but was reminded by Data Studio that I can't mix dimensions and metrics in a calculated field.
APPROX_COUNT_DISTINCT might be more performance friendly for you?
https://support.google.com/datastudio/answer/9189108?hl=en
Otherwise, the only way I can think of would be to pre-calculate several metrics (e.g. unique users on that day, 7-day cumulative, 14-day, etc.), as your customers require, for each single day.
Or you could provide a two-page report with both of these methods, with the caveat that the first can be used over any time period but will be much slower?

Dynamodb data model for process/transaction monitoring

I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There are a relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (fewer than 10 statuses).
Instead of using multiple rows for every status change, if you updated the same event row instead, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId') which would be derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help in preventing the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well, which may be a good property if events are expected to finish processing eventually, even if their records stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
From your question, it seems like keeping a status history is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row. DynamoDB supports list data types and also has a generous 400 KB item limit, which may just allow you to capture all the desired historical information in the same record.
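As a rough sketch of the calculated-value / sparse-index idea with boto3 in Python (the table name "BatchEvents", the batchId/eventId key schema, and the GSI on derivedStatusId are assumptions):

import random
import boto3

# Assumed table keyed on batchId (hash) and eventId (range), with a GSI on derivedStatusId
table = boto3.resource("dynamodb").Table("BatchEvents")

def record_status(batch_id, event_id, status_id, timestamp, message):
    # Append a random two-digit suffix so the GSI hash key fans out over
    # 100 values per status and no single statusId becomes a hot partition
    table.put_item(Item={
        "batchId": batch_id,
        "eventId": event_id,
        "statusId": status_id,
        "timestamp": timestamp,
        "message": message,
        "derivedStatusId": f"{status_id}-{random.randint(0, 99):02d}",
    })

def mark_finished(batch_id, event_id):
    # Removing derivedStatusId drops the item from the sparse GSI, so the
    # index only ever contains events that are still being processed
    table.update_item(
        Key={"batchId": batch_id, "eventId": event_id},
        UpdateExpression="REMOVE derivedStatusId",
    )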

Missing values in google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910

I am working with the practice repository in preparation for doing upcoming work with a large enterprise client using BQ. The repository link is: google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910
I have 3 questions in relation to the sample repository and a query that was run (please see below for the query that motivated them):
1) What is the difference between customDimensions.index, customDimensions.value and hits.customDimensions.index, hits.customDimensions.value?
2) If a single hit has multiple custom dimensions/metrics how is that returned/queried? I only see single dimensions matching at the hit level in the sample data.
3) There are no custom metric values passed in the example data; what will those values look like?
Here is the query that motivated the previous 3 questions:
SELECT hits.page.pagePath AS urls,
hits.time,
customDimensions.index,
customDimensions.value,
hits.customMetrics.index,
hits.customMetrics.value,
trafficSource.medium,
hits.customVariables.index,
hits.customVariables.customVarName,
hits.customVariables.customVarValue
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
Every record in that table represents one Google Analytics session. BigQuery has the concept of nested fields, and that's how individual hits are defined: they are nested into the hits record.
Answering your questions:
1) customDimensions.index and customDimensions.value are the index and value for user- or session-scoped custom dimensions. hits.customDimensions.index and hits.customDimensions.value are custom dimensions set at the hit scope level. The scope is defined when you create the custom dimension through the GA interface. Indexes are integers from 1 to 20 (as defined in the Admin section), and value is the string passed as the value for that custom dimension. More info about Custom Dimensions/Metrics
2) Both hits and hits.customDimensions are REPEATED RECORDs in BigQuery. So in essence every row in that BQ table looks like this:
|- date
|- (....)
+- hits
   |- time
   +- customDimensions
      |- index
      |- value
But when you query the data, it is FLATTENed by default. Because it's flattened, if a single hit has multiple custom dimensions and metrics, it will show up as multiple rows, one for each.
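(Illustrated with a small Python sketch using made-up values; each nested custom dimension within a hit becomes its own output row once flattened:)

# One session row with nested, repeated fields, mirroring the schema sketch above
session = {
    "date": "20130910",
    "hits": [
        {
            "time": 1000,
            "customDimensions": [
                {"index": 1, "value": "foo"},
                {"index": 2, "value": "bar"},
            ],
        },
    ],
}

# Flattening (what the legacy SQL query does implicitly) yields one row per
# hit-level custom dimension
flat = [
    (session["date"], hit["time"], cd["index"], cd["value"])
    for hit in session["hits"]
    for cd in hit["customDimensions"]
]
print(flat)  # [('20130910', 1000, 1, 'foo'), ('20130910', 1000, 2, 'bar')]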
3) They should be the same as customDimensions, but the values are INTEGERs instead of STRINGs.
For a simpler and more educational dataset I suggest that you create a brand new BQ table and load the data provided on this developer document page.
PS: Tell my good friends at Cardinal Path that Eduardo said Hello!
