We deployed an Azure TSI Preview environment ingesting messages from an IoT Hub.
We are wondering what the best practice is when events are generated by different types of devices with no intersection of properties.
Consider, as an example, messages coming from a device of type A:
{
"timestamp" : "2019-02-25T01:08:00Z",
"devicetype" : "a",
"windspeed" : 10,
"airpressure" : 101300
}
and messages coming from a device of type B:
{
"timestamp" : "2019-02-25T01:09:00Z",
"devicetype" : "b",
"temperature" : 26.5,
"humidity" : 22.5
}
where timestamp is the property used as the source timestamp and devicetype is the property used as the Time Series ID.
Following the docs, and checking the resulting events in the explorer, the output looks like:
timestamp            | devicetype | windspeed | airpressure | temperature | humidity
2019-02-25T01:08:00Z | a          | 10        | 101300      |             |
2019-02-25T01:09:00Z | b          |           |             | 26.5        | 22.5
In practice, we have devices of different types that will never share any properties. Therefore,
Are we going to get the same degree of performance in terms of query speed and storage (blob) allocation?
Are we wasting space?
Is there a better way to organize the events?
What if we change the properties and introduce a common field?
Thanks :)
Thank you for your interest in TSI. I am a Sr. Product Manager on the team.
I believe that the way you have organized your events is fine. You can also optimize by using something like propertyType to reduce the number of columns. For example:
{
"deviceType" : "a",
"propertyType": "humidity",
"value": 22.5
}
This would create a table structure in your environment as follows:
propertyType | value
humidity     | 22.5
temperature  | 26.5
The right structure depends on the number of properties you have in your environment. With the structure you posted in the thread you can query properties directly. With my approach, however, you will need to apply a filter in the query (which may not give you the most optimal query performance if you have too many properties).
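For illustration only, a rough sketch of what such a filtered call could look like with the TSI Preview Query API's getEvents, assuming the TSX filter syntax (the time range and Time Series ID here are placeholders):
{
  "getEvents": {
    "timeSeriesId": ["a"],
    "searchSpan": { "from": "2019-02-25T00:00:00Z", "to": "2019-02-26T00:00:00Z" },
    "filter": { "tsx": "$event.propertyType.String = 'humidity'" }
  }
}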
I hope this clarifies your questions. Please let me know if I can answer any more questions.
Related
I have some queries that look at aggregated data over a long period of time (180 days) for data that arrives per second (example query below). The table's hot cache is 31 days, so the queries can take over a minute to return, which is not acceptable for the dashboards I want to display them on. What would be the recommended optimization strategies? My thoughts so far are either to use an update policy to push the data for these tags into a separate table with a hot cache of 180 days, or to use a materialized view.
raw_table
| where TimeStamp between (now(-180d) .. now()) and TagName in ("Tag1","Tag2")
| extend Date = startofday(TimeStamp)
| summarize Value1=max(Value) by Date,TagName
| summarize Value1=sum(Value1) by Date
| project TagName="AggregatedData", Date, Value=Value1
Both options you mentioned are appropriate (even a combination of both, if required).
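As a minimal sketch of the materialized-view option (the view name and bin size are placeholders; the view keeps one daily max per tag, so the dashboard query only scans pre-aggregated rows):
.create materialized-view DailyTagMax on table raw_table
{
    raw_table
    | summarize Value1 = max(Value) by TagName, Date = bin(TimeStamp, 1d)
}
// give the view its own, longer hot cache
.alter materialized-view DailyTagMax policy caching hot = 180d
// the dashboard query now reads ~180 rows per tag instead of per-second data
DailyTagMax
| where Date between (now(-180d) .. now()) and TagName in ("Tag1", "Tag2")
| summarize Value1 = sum(Value1) by Date
| project TagName = "AggregatedData", Date, Value = Value1
An update policy would work the same way, with the summarize query pushing rows into a destination table that has a 180-day hot cache.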
Azure Application Insights uses the Kusto Query Language (KQL) that is clearly quite powerful, but I cannot seem to get it to aggregate nested data.
The best way to explain this is through an example. My actual situation uses different data but has the same problem I will explain here. Using the Azure Data Explorer sample data, there is a StormEvents table that has a State property, a StormSummary property, and many more. The StormSummary property is JSON that looks like:
{"TotalDamages":0,"StartTime":"2007-09-18T20:00:00.0000000Z",...}
I can do a query such as:
StormEvents
| project State, StormSummary.TotalDamages
That gives me a nice tabular result. However, what I really want is to aggregate the total damages for each state, so I want something like:
StormEvents
| project State, sum(StormSummary.TotalDamages)
Unfortunately, the above query fails with:
Function 'sum' cannot be invoked in current context.
My end goal is to render this in a pie chart to show total damages for each state, but I can't get the sum of the damages. I'm using App Insights to create data with the same problem as this. Maybe if I structure my data differently it would help. I am using Track Event and providing a number as a property on the event. I could use a Metric instead, but the documentation indicates I should use an Event since I am not aggregating the metric myself.
As a point of reference, the following works if I do a count of the records by state, but I want a sum of the total damages by state.
StormEvents
| summarize Count=count() by State
| render piechart
instead of this:
StormEvents
| project State, sum(StormSummary.TotalDamages)
you could try this:
StormEvents
| summarize sum(tolong(StormSummary.TotalDamages)) by State
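And since the end goal is a pie chart, the same query can feed render directly (the column alias just gives a nicer legend):
StormEvents
| summarize TotalDamages = sum(tolong(StormSummary.TotalDamages)) by State
| render piechart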
In my BigQuery project I store event data integrated from Firebase. The granularity and dimensionality are such that trying to present raw data in Data Studio quickly makes the report VERY slow (1-2 min per page/interaction).
I then started to think how I could create pre-aggregated tables in BigQuery to speed everything up, but quickly realised COUNT DISTINCT metrics would be a problem with this approach.
Let me explain:
SELECT user, date
FROM UNNEST([
STRUCT("Adam" AS user, "20190923" AS date),
("Bob", "20190923"),
("Carl", "20190923"),
("Adam", "20190924"),
("Bob", "20190924"),
("Adam", "20190925"),
("Carl", "20190925"),
("Bob", "20190926")
]) AS website_visits;
+------+----------+
| User | Date |
+------+----------+
| Adam | 20190923 |
| Bob | 20190923 |
| Carl | 20190923 |
| Adam | 20190924 |
| Bob | 20190924 |
| Adam | 20190925 |
| Carl | 20190925 |
| Bob | 20190926 |
+------+----------+
The above is a table of website visits.
Clearly, creating a pre-aggregated table like
SELECT date, COUNT(DISTINCT user) FROM website_visits GROUP BY date
has the limitation that the count cannot be aggregated further (let alone dynamically) to get a total: summing the daily counts would return 8 unique users, which is not correct, as there are only 3 unique users.
In BigQuery this is fixed by using HLL_COUNT, which, despite the approximation, works fine for me.
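As a sketch of that approach (assuming website_visits is materialized as a table; the dataset name is a placeholder), you can store one HLL sketch per day and then merge the sketches over any date range without double counting:
-- pre-aggregate: one HLL sketch of distinct users per day
CREATE OR REPLACE TABLE my_dataset.daily_visits AS
SELECT date, HLL_COUNT.INIT(user) AS user_sketch
FROM my_dataset.website_visits
GROUP BY date;
-- re-aggregate over an arbitrary date range
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users
FROM my_dataset.daily_visits
WHERE date BETWEEN "20190923" AND "20190926";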
Now to the big question:
How can I do the same so that the result is displayable in Data Studio?
HLL_COUNT.EXTRACT is not available as a function there, and in the report the date range is set by the user however they like, so it's not possible to store a pre-aggregated result for ALL cases...
EDIT 1: APPROX_COUNT_DISTINCT
As per the answer from Bobbylank, I tried to use APPROX_COUNT_DISTINCT.
However, I found that this just seems to move the issue down the line. My fault for not explaining the full picture.
Although performance is acceptable, it does not seem possible to blend a data source with this calculated metric.
Example: After displaying the amount of unique users in the selected period (which now works), I'm also trying to display Average Revenue Per User (ARPU) in Data Studio like Firebase does.
To do this, I have to compute SUM(REVENUE) / APPROX_COUNT_DISTINCT(USER).
Clearly, REVENUE works fine with pre-aggregation and is available in the raw data. I then tried to blend the raw data with a table containing just user visits. However, APPROX_COUNT_DISTINCT can't be used in the blended data definition, as calculated metrics are not allowed there.
Even when I use the USER field as a metric with a Count Distinct aggregation, which returns the correct figures when showing revenue and user count separately, dividing them brings the aggregation problem back: applying SUM or AVG to the calculated field basically yields AVG(REVENUE/USERS) for each day.
I then also tried to store REVENUE directly in the visits table, but was reminded by Data Studio that I can't mix dimensions and metrics in a calculated field.
APPROX_COUNT_DISTINCT might be more performance friendly for you?
https://support.google.com/datastudio/answer/9189108?hl=en
Otherwise, the only way I can think of would be to pre-calculate the several metrics your customers require (e.g. unique users on that day, 7-day cumulative, 14-day, etc.) for each single day.
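A sketch of that pre-calculation for the 7-day case, assuming a daily_visits table of HLL sketches as in the HLL_COUNT.INIT example above (dates stored as YYYYMMDD strings, dataset name a placeholder):
-- for each day, merge the sketches of the previous 7 days
SELECT
  d.date,
  HLL_COUNT.MERGE(w.user_sketch) AS users_7d
FROM my_dataset.daily_visits AS d
CROSS JOIN my_dataset.daily_visits AS w
WHERE PARSE_DATE('%Y%m%d', w.date)
      BETWEEN DATE_SUB(PARSE_DATE('%Y%m%d', d.date), INTERVAL 6 DAY)
          AND PARSE_DATE('%Y%m%d', d.date)
GROUP BY d.date;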
Or you could provide a two-page report using both of these methods, with the caveat that the first can be used over any time period but will be much slower.
I need your help. I am quite new to databases.
I'm trying to set up a table in DynamoDB to store info about TV shows. It seems pretty simple and straightforward, but I am not sure if what I am doing is correct.
So far I have this structure. I am trying to fit everything about the TV shows into one table. Seasons and episodes are contained within a list of maps within a list of maps.
Is this too much layering?
Would this present a problem in the future where some items are huge?
Should I separate some of these lists of maps to another table?
(screenshot: Shows table)
Ideally, you should not put a potentially unbounded list in a single item in DynamoDB, because you could end up running into the item size limit of 400 KB. Also, if you were to read or write one episode of one show, you would consume capacity as if you were reading or writing all the episodes in the show.
Take a look at the adjacency list pattern. It’s a good choice because it will allow you to easily find the seasons in a show and the episodes in a season. You can also take a look at this slide deck. Part of the way through, it talks about hierarchical data, which is exactly what you’re dealing with.
If you can provide more information about your query patterns, I can give you more guidance on how to model your data in the table.
Update (2018-11-26)
Based on your comments, it sounds like you should use composite keys to establish hierarchical 1-N relationships.
By using a composite sort key of DataType:ItemId where ItemId is a different format depending on the data type, you have a lot of flexibility.
This approach will allow you to easily get the seasons in the show, get all episodes in all seasons, get all episodes in a particular season, or even get all episodes between season 1, episode 5 and season 2 episode 5.
hash_key | sort_key | data
----------|-----------------|----------------------------
SHOW_1234 | SHOW:SHOW_1234 | {name:"Some TV Show", ...
SHOW_1234 | SEASON:SE_01 | {descr:"In this season, the main character...
SHOW_1234 | EPISODE:S01_E01 | {...
SHOW_1234 | EPISODE:S01_E02 | {...
Here are the various key condition expressions for the queries I mentioned:
hash_key = "SHOW_1234" and sort_key begins_with("SEASON:") – gets all seasons
hash_key = "SHOW_1234" and sort_key begins_with("EPISODE:") – gets all episodes in all season
hash_key = "SHOW_1234" and sort_key begins_with("EPISODE:S02_") – gets all episodes in season 2
hash_key = "SHOW_1234" and sort_key between "EPISODE:S01_E5" and "EPISODE:S02_E5" – gets all episodes between season 1, episode 5 and season 2 episode 5
I have been working on an automation project where I have to write Cucumber tests for a search filter. The search filter works dynamically, with nested parameters: each subsequent parameter is populated based on the previous one. E.g. on selecting "Subscribers", the next parameters in the dropdown are "Name", "City", "Network". Likewise, on selecting "Service Desk", the parameters in the subsequent dropdown are "Status", "Ticket no.", "Assignee". I am using a Scenario Outline as below:
Scenario Outline: As a user, I can search records
Given I am on search page
When I search on "<category>" and "<nestedfilter>"
Then I see records having "<category>" category
Examples:
|category |nestedfilter|
|Subscribers |Name |
|Subscribers |City |
|Subscribers |Network |
|Service Desk|Status |
|Service Desk|Ticket no. |
|Service Desk|Assignee |
The filter could be more complex, as there could be more nested filters based on previous nested filters.
All I need to know is whether there is a more efficient way to handle this problem, for example by passing a data table to the step definition, which I am not too sure about.
Thanks
If you really need the order of your items to be preserved, use a data table instead of a scenario outline.
A scenario outline is shorthand notation for multiple scenarios. The execution order of those scenarios is not guaranteed, or at least it would be a mistake to assume a specific execution order. The order of the items in a data table will not change if you use a List as the argument, which is therefore a lot safer in your case.
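As a rough sketch (the step wording is illustrative), the outline could be collapsed into a single scenario that passes the combinations as one data table, which the step definition then receives as a list:
Scenario: Search records by category and nested filter
  Given I am on search page
  When I search using the following filters:
    | category     | nestedfilter |
    | Subscribers  | Name         |
    | Subscribers  | City         |
    | Subscribers  | Network      |
    | Service Desk | Status       |
    | Service Desk | Ticket no.   |
    | Service Desk | Assignee     |
  Then I see records having the selected category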
A common mistake with Cucumber is to use Scenario Outline and example tables to do some sort of semi-exhaustive testing. This tends to hide lots of interesting things about the functionality being developed.
I would start writing single features for the searches you are working with and explore what those searches are and why they are important. So if we start with your first one we get ...
Note: all of the following assumes a background step Given I am searching
When I search on subscribers and name
Then I should see records for subscribers
and with the second one
When I search on subscribers and city
Then I should see records for subscribers
Now it becomes clear that there is a serious flaw in these scenarios, as both scenarios are looking for the same result.
So what you are actually testing is that
The subscribers search has name and city filters
A subscriber search should return subscriber results
Now you can refactor and get
When I do a subscriber search
Then I should see city, name, network filters
When I do a subscriber search
Then I should only see subscriber results
Note: this is already much more efficient, as you have reduced the number of scenarios from 3 to 2, and reduced the number of searches you have to do from 3 to 1.
Now I have no idea if this is what you want to do, but this is what your current scenario is doing. However because you are using an Outline and Example tables you can't see this.
The fact that you have a drop-down and nested filters is an implementation detail, which describes how the user is trying to achieve what they want to achieve.
If you think of what you're trying to do as examples of how the system behaves, rather than tests, it might be easier. You're not looking for something exhaustive. You also want your scenarios to be specific, so that you're illustrating them with realistic data and concrete examples. If you would commonly have some typical data available, that's a perfect thing to set up using Background.
So for instance, I might have scenarios like:
Background:
Given I have subscribers
| Name | City | Network | Status | etc.
| Bob | Rome | ABC | Alive | ...
| Sam | Berlin | ABC | Dead | ...
| Sue | Berlin | DEF | Dead | ...
| Ann | Berlin | DEF | Alive | ...
| Jon | London | DEF | Dead | ...
Scenario: First level search
Given I'm on the search page
When I search for Subscribers who are in Rome
Then I should see Bob
But not Sue or Jon.
Scenario: Second level search
Given I'm on the search page
When I search for Subscribers in Berlin on the ABC network
Then I should see Sam
But not Sue or Ann
etc.
The full-system scenarios should be just enough to understand what's going on. Don't use BDD for regression. It can help with that, but scenarios will rapidly become slow and unmaintainable if you try to cover every case. Delegate to integration and unit tests where appropriate (see "the testing pyramid").