Applying calculations on large data set - bigdata

I'm currently optimizing our data warehouse and the processes that use it, and I'm looking for some suggestions.
The problem is that I'm not sure how to handle the calculations on retrieved data.
To make things clearer, suppose we have the following data structure:
id : 1
param: static_value
param2: static_value
Suppose we have about 50 million entries with this structure.
Also assume that we query this data set about 30 times per minute, and every query returns at least 10k entries.
So, in short, we have these stats:
Data set: 50 million entries.
Access frequency: 30 / min.
Resulting data size: ~10k results
On every query, I have to go through every entry in the resulting data set and apply some calculations to it, which produce a field (for example param3) with a dynamic value. For example:
Query 2 (2k results) and one of its entries:
id : 2
param: static_value_2
param2: static_value_2
param3: dynamic_value_2
Query 3 (10k results) and one of its entries:
id : 3
param: static_value_3
param2: static_value_3
param3: dynamic_value_3
And so on.
The problem is that I can't prepare the value of the field param3 before the query runs, because the calculation uses many dynamic values.
Main question:
Are there any guidelines, practices, or technologies for optimizing this kind of problem or implementing this kind of solution?
Thanks for any information.
Update 1:
The field "param3" is calculated on every query for every entry in the result set. This means the calculated value is not kept in any storage; it is just computed on every query. I can't store it as a static value because it is dynamic and depends on many variables.
I guess it's not good practice to have such an implementation?
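To make the setup concrete, here is a minimal Python sketch of the pattern described above: the static fields come back from the query, and param3 is derived per entry from values that are only known at query time. The names `QueryContext`, `compute_param3`, `enrich`, and the formula itself are hypothetical placeholders; the real calculation is not given in the question.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class QueryContext:
    """Values known only at query time (hypothetical)."""
    multiplier: float
    offset: float


def compute_param3(entry: Dict[str, float], ctx: QueryContext) -> float:
    # Placeholder formula; the real calculation is not given in the question.
    return entry["param"] * ctx.multiplier + ctx.offset


def enrich(entries: Iterable[Dict[str, float]], ctx: QueryContext) -> Iterator[Dict[str, float]]:
    # Yield enriched entries one at a time instead of building the whole list up front.
    for entry in entries:
        yield {**entry, "param3": compute_param3(entry, ctx)}


rows = [{"id": 1, "param": 2.0, "param2": 5.0}, {"id": 2, "param": 3.0, "param2": 7.0}]
enriched = list(enrich(rows, QueryContext(multiplier=10.0, offset=1.0)))
```

Keeping the per-entry function pure, as here, leaves room to parallelize or vectorize it later, since nothing is written back to the stored data.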

Related

Slow query on table | WHERE x | ORDER by timestamp | DISTINCT a,b,c,d | TAKE 20 when table large

We are experiencing a sudden performance drop with a query structured like this:
table(tablename)
| where MeasurementName in ('ActiveJobId')
and MachineId == machineId
and SourceTimestamp <= from
and isnotnull( Value)
| order by SourceTimestamp desc
| distinct SourceTimestamp, MeasurementName, tostring(Value), SourceTimestampUtc
| take rows
tablename, machineId, from, and rows are all query parameters. rows is typically "20". The Value column is of type "dynamic".
The table contains 240 Million entries, with about 64,000 matching the WHERE criteria. The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
The query runs smoothly on the Staging database system, but its performance started to degrade on the Dev system, possibly because of the larger amount of data.
If we remove the distinct clause, or move it after the TAKE clause, the query completes very fast (<1s). The data contains about 5-10% duplicate entries.
To our understanding the query should be performed like this:
Prepare a filter for the source table, start at a specific datetime range
Order desc: walk backwards
Walk down the table and stop when you got 20 distinct rows
Judging by the time it sometimes takes, it looks almost as if ADX walks down the whole table, performs a distinct, and only then takes the topmost 20 rows.
The problem persists if we swap | order and | distinct around.
The problem disappears if we move | distinct to the end of the query, but then we often receive 1-2 items less than required.
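The difference between the two orderings can be sketched in plain Python. This only illustrates the intended semantics (stop as soon as 20 distinct rows are found) and why taking first and deduplicating afterwards can return fewer rows; it says nothing about how ADX actually executes the query. The function names and sample rows are made up for illustration.

```python
from typing import Hashable, Iterable, List


def first_n_distinct(rows: Iterable[Hashable], n: int) -> List[Hashable]:
    """Walk rows in order and stop once n distinct rows are collected."""
    if n <= 0:
        return []
    seen = set()
    result = []
    for row in rows:
        if row in seen:
            continue
        seen.add(row)
        result.append(row)
        if len(result) == n:
            break  # early exit: the remaining rows are never read
    return result


def take_then_distinct(rows: Iterable[Hashable], n: int) -> List[Hashable]:
    """Take the first n rows, then drop duplicates among them."""
    taken = []
    for i, row in enumerate(rows):
        if i == n:
            break
        taken.append(row)
    return list(dict.fromkeys(taken))  # keeps first-seen order


# Already sorted by timestamp descending; (timestamp, value) tuples with one duplicate.
rows = [(5, "a"), (5, "a"), (4, "b"), (3, "c"), (2, "d")]
early = first_n_distinct(rows, 3)
late = take_then_distinct(rows, 3)
```

Here `early` holds three rows while `late` holds only two, which matches the "1-2 items less than required" effect when the data contains duplicates.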
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
Regarding "The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.": this part of the description doesn't match the filter in your query (and SourceTimestamp <= from). Did you mean to use >= instead of <=?
Regarding "Is there a logical error we make, can this query be rewritten, or are there better options at hand?": if you can't eliminate the duplicates upstream, you can consider setting up a materialized view that performs the deduplication, then query the view directly instead of the raw data. Also see "Handle duplicate data".

Why is filtering on extent_tags() slow?

Why is the following command slow (5 minutes)?
mytable | where extent_tags() contains "20210613" | count
I know this is not the best way to get a count; I could have used .show table extents and simply calculated sum(RowCount) with the summarize operator. But I am just testing. Ideally ADX should be able to search tags across extents and get counts, since it is only a metadata search, and once it finds the correct extent, the row count is already stored as part of the extent metadata anyway. So why should it take 5 minutes? By the way, the extent(s) I am interested in have the following tags:
drop-by:20210613
ingest-by:20210613
There is a datetime field in the table which I could have used to filter too, which is what ADX generally recommends. I can guess the reason: the min and max of every datetime field in the table are stored in every extent of the table. But similarly, the tags are stored in every extent too. So which method is more efficient, filtering on a datetime field if available or tags?
a. You're correct that using .show table T extents where tags contains 'string' | ... would be much more efficient.
b. As mentioned in the documentation: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/extenttagsfunction
"Filtering on the value of extent_tags() performs best when one of the following string operators is used: has, has_cs, !has, !has_cs."
c. Regarding "which method is more efficient, filtering on a datetime field if available or tags?":
The former, especially when your filter is on a substring and not on the full content of the tag. Tags are a non-indexed metadata property of shards, not an indexed data column. Also see: https://yonileibowitz.github.io/blog-posts/datetime-columns.html

How to query the last 7 days of data in DynamoDB

I have my DynamoDB table as follows:
HashKey(Date), RangeKey(timestamp)
The DB stores the data by day (hash key) and timestamp (range key).
Now I want to query the data of the last 7 days.
Can I do this in one query, or do I need to call the DB 7 times, once for each day? The order of the data does not matter. Can someone suggest an efficient query to do that?
I think you have a few options here.
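BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables. You identify requested items by primary key. You could specify all 7 primary keys and fire off a single request. Note that with a composite key, BatchGetItem needs the full primary key (hash key and range key) of every item, so this only works if you already know the timestamps.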
7 calls to DynamoDB. Not ideal, but it'd get the job done.
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 02/08/21T00:00:00 - 02/14/21T23:59:59)
2021-02-15 (represents the week of 02/15/21T00:00:00 - 02/21/21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
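As a rough illustration of the key handling in the options above, the following Python sketch computes the 7 daily hash keys and a week-level truncated timestamp. It assumes hash keys are stored as YYYY-MM-DD strings and that weeks start on Monday; the function names `last_n_day_keys` and `week_bucket` are hypothetical, and no DynamoDB call is made here.

```python
from datetime import date, datetime, timedelta
from typing import List


def last_n_day_keys(today: date, n: int = 7) -> List[str]:
    """Hash keys for the last n days, newest first (assumes YYYY-MM-DD keys)."""
    return [(today - timedelta(days=i)).isoformat() for i in range(n)]


def week_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the Monday of its week (assumed week start)."""
    monday = ts.date() - timedelta(days=ts.weekday())
    return monday.isoformat()


keys = last_n_day_keys(date(2021, 2, 14))
bucket = week_bucket(datetime(2021, 2, 10, 15, 30, 0))
```

One thing to keep in mind with a weekly bucket: a rolling "last 7 days" window usually spans two calendar weeks, so it would likely need two queries against the index plus a filter on the timestamp.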
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.

How to achieve an idempotent Lambda function?

I have a pipeline like this -
table 1 (DynamoDB) -> AWS Lambda -> table 2 (DynamoDB)
Whenever an update happens in table 1, the Lambda gets triggered. The Lambda batch-reads (1000 records) from table 1, then performs a batch compute to come up with the list of records that need to be updated in table 2. Table 2 maintains the count of certain events happening in table 1.
The problem is that if we send the same batch of records twice, it will increment the count in table 2 twice.
I'm considering this because if one of the Lambda functions fails (the number of running Lambdas is 1:1 with the number of partitions in DynamoDB) after it has already performed some of the write operations, it will resend the last batch read.
One way to avoid this is to store the sequence numbers of the records we have already computed in table 2, so that whenever we update we can check whether a record was already computed. But we need to bound the size of that list, otherwise we will get performance issues, and it is unclear what size it should be.
What would be the right approach to handle this kind of issue?
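The sequence-number idea described above could look roughly like the following Python sketch. It is an in-memory stand-in for table 2, not DynamoDB code: the class name `IdempotentCounter`, the `max_seen` bound, and the record shape `(sequence_number, event_key)` are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict, Iterable, Tuple


class IdempotentCounter:
    """In-memory stand-in for table 2: event counts plus a bounded set of seen sequence numbers."""

    def __init__(self, max_seen: int = 10_000) -> None:
        self.counts: Dict[str, int] = {}
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        # Hypothetical bound; it must be larger than the biggest batch that can be replayed.
        self._max_seen = max_seen

    def apply_batch(self, records: Iterable[Tuple[str, str]]) -> int:
        """Apply (sequence_number, event_key) records; return how many were new."""
        applied = 0
        for seq, event_key in records:
            if seq in self._seen:
                continue  # already counted, e.g. a batch resent after an outage
            self.counts[event_key] = self.counts.get(event_key, 0) + 1
            self._seen[seq] = None
            if len(self._seen) > self._max_seen:
                self._seen.popitem(last=False)  # evict the oldest sequence number
            applied += 1
        return applied


counter = IdempotentCounter(max_seen=1000)
batch = [("seq-1", "login"), ("seq-2", "login"), ("seq-3", "logout")]
first = counter.apply_batch(batch)
second = counter.apply_batch(batch)  # replay of the same batch changes nothing
```

In this sketch the check and the increment happen in one process, so they cannot interleave. Against a real table the two would need to be made atomic, for example with a conditional write or a transaction, and a replay that reaches further back than `max_seen` records would still be counted twice.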

Dealing with Indexer timeout when importing documentdb into Azure Search

I'm trying to import a rather large (~200M docs) documentdb into Azure Search, but I'm finding the indexer times out after ~24hrs. When the indexer restarts, it starts again from the beginning, rather than from where it got to, meaning I can't get more than ~40M docs into the search index. The data source has a highwater mark set like this:
var source = new DataSource();
source.Name = DataSourceName;
source.Type = DataSourceType.DocumentDb;
source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
serviceClient.DataSources.Create(source);
The highwater mark appears to work correctly when testing on a small db.
Should the highwater mark be respected when the indexer fails like this, and if not how can I index such a large data set?
The reason the indexer is not making incremental progress even while timing out after 24 hours (the 24 hour execution time limit is expected) is that a user-specified query (QueryString argument passed to the DataContainer constructor) is used. With a user-specified query, we cannot guarantee and therefore cannot assume that the query response stream of documents will be ordered by the _ts column, which is a necessary assumption to support incremental progress.
So, if a custom query isn't required for your scenario, consider not using it.
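Why the ordering assumption matters can be shown with a small simulation. This is a simplified model written for illustration, not the actual Azure Search indexer logic: documents are reduced to unique `_ts` integers, a "timeout" is modeled as a budget of documents per run, and the function names are made up.

```python
from typing import List, Set, Tuple


def run_with_timeout(ts_stream: List[int], budget: int) -> Tuple[Set[int], int]:
    """Index at most `budget` documents; return the indexed _ts values and the high-water mark."""
    indexed = set(ts_stream[:budget])
    high_water_mark = max(indexed)
    return indexed, high_water_mark


def resume(ts_stream: List[int], high_water_mark: int) -> Set[int]:
    """A later run only reads documents with _ts above the stored mark."""
    return {ts for ts in ts_stream if ts > high_water_mark}


def missed_after_resume(ts_stream: List[int], budget: int) -> Set[int]:
    """Documents that neither the first run nor the resumed run would index."""
    indexed, mark = run_with_timeout(ts_stream, budget)
    return set(ts_stream) - indexed - resume(ts_stream, mark)


ordered = [1, 2, 3, 4, 5, 6]      # response stream sorted by _ts
unordered = [5, 1, 6, 2, 3, 4]    # arbitrary order, as a custom query may return
missed_ordered = missed_after_resume(ordered, budget=3)
missed_unordered = missed_after_resume(unordered, budget=3)
```

With the ordered stream nothing is lost when resuming from the mark. With the unordered stream, documents 2, 3, and 4 would be skipped, which is why a checkpoint cannot safely be recorded in that case.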
Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the Datasource.Container.Query parameter to provide a DocumentDB query that partitions your data using a WHERE filter. That way, each of the indexers will have less work to do and, with sufficient partitioning, will fit under the 24 hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing the indexing throughput and decreasing the overall time to index your entire dataset.
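A possible way to generate the per-partition queries is sketched below in Python. It assumes the documents carry a numeric field that can be range-partitioned; the field name `bucket` and the helper `partition_queries` are hypothetical, and each resulting string would still have to be passed to its own datasource definition.

```python
from typing import List


def partition_queries(field: str, lower: int, upper: int, parts: int) -> List[str]:
    """Split [lower, upper) into at most `parts` contiguous ranges, one WHERE-filtered query each."""
    if parts < 1 or upper <= lower:
        raise ValueError("need parts >= 1 and upper > lower")
    step = -(-(upper - lower) // parts)  # ceiling division
    queries = []
    start = lower
    while start < upper:
        end = min(start + step, upper)
        queries.append(f"SELECT * FROM c WHERE c.{field} >= {start} AND c.{field} < {end}")
        start = end
    return queries


# "bucket" is a hypothetical numeric field used only for this example.
queries = partition_queries("bucket", 0, 10, 4)
```

The ranges are half-open so that every document falls into exactly one partition and no two indexers process the same document.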
