Incremental Data Load in Azure Kusto

I'm trying to do a delta load from Azure DevOps to an Azure Kusto cluster using ADF. As Kusto doesn't have an option to update or delete specific records, I have followed the steps below to implement an incremental data load.
Data is loaded as a full load into the main table (TableA) the first time.
I then get the max modified date from TableA and load only the latest modified data into TableStg in Kusto, using the filter ChangedDate >= (maxmodifieddate - 1 day).
I create a temp table (TableTemp) in Kusto to merge the modified/new data (from TableStg) and the existing data into a single table, and then replace the main table (TableA) with the temp table using the KQL below.
.set-or-replace TableTemp with (distributed=true) <|
    set notruncation;
    let updated = TableStg | union TableA | summarize maxdate = max(ChangedDate) by WorkItemId;
    let mergeupdate = (TableStg | union TableA)
        | distinct *
        | join kind=inner updated on $left.WorkItemId == $right.WorkItemId
        | where ChangedDate == maxdate;
    mergeupdate | project-away WorkItemId1, maxdate
.drop table TableA ifexists ;
.rename tables TableA=TableTemp;
With a large number of records, this query fails with the memory error below.
"error": {
"code": "LimitsExceeded",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.DataNode.Exceptions.KustoQueryRunawayException",
"#message": "HResult: 0x80DA0001, source: (Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'bad allocation', details: ''))"
Is there any way to optimize this query and achieve the delta load?
Is there any other way to implement incremental load in Azure Kusto?

You have a couple of other (simpler) options:
You may be able to use an update policy, if you can guarantee that each ADF ingestion happens only after the previous one has completed.
You can also use a materialized view to apply the "last updated" logic, as sketched below.
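For example, a minimal materialized-view sketch using the table and column names from your KQL (TableA, WorkItemId, ChangedDate; the view name TableA_Latest is just an illustration, adjust everything to your actual schema):

.create materialized-view with (backfill=true) TableA_Latest on table TableA
{
    TableA
    | summarize arg_max(ChangedDate, *) by WorkItemId
}

With this in place, ADF can simply append the delta rows into TableA, and queries against TableA_Latest always return the most recently changed row per WorkItemId, without the drop/rename cycle.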

Related

Truncate and Load a Kusto table instead of a Materialized view so that it can be used for continuous export

We have a scenario where some reference data is being ingested into a Kusto table (~1000 rows).
To handle the data duplication caused by the daily data load (Kusto always appends), we have created a materialized view (MV) on top of the table to summarize the data and return the latest rows based on ingestion_time(), so that querying the MV always returns the latest reference data.
Our next ask is to export this formatted data to a storage container using Kusto continuous data export (please refer to the MS docs); however, it seems we can't use a materialized view to set up a continuous export.
So, looking at options: is there any way we can create a truncate-and-load table instead of a materialized view in Kusto, so that we don't have duplicate records in the table and it can be used for continuous export?
.create async materialized-view with (backfill=true) products_vw on table products
{
products
| extend d=parse_json(record)
| extend
createdBy=tostring(d.createdBy),
createdDate = tostring(d.createdDate),
product_id=tostring(d.id),
product_name=tostring(d.name),
ingest_time=ingestion_time()
| project
ingest_time,
createdBy,
createdDate,
product_id,
product_name
| summarize arg_max(ingest_time, *) by product_id
}
You can use an Azure Logic App or Microsoft Flow to run the applicable export command to an external table backed by Azure Storage at any given time interval. The query can simply refer to the materialized view, for example:
.export to table ExternalBlob <| Your_MV
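For completeness, here is a minimal sketch of what the external table behind that command might look like (the name ExternalBlob comes from the example above; the column list, storage URL, and credentials are placeholders to replace with your own):

.create external table ExternalBlob (product_id: string, product_name: string, createdBy: string, createdDate: string, ingest_time: datetime)
kind=storage
dataformat=csv
(
    h@'https://<storageaccount>.blob.core.windows.net/<container>;<account-key-or-SAS>'
)

The Logic App or Flow then only needs to issue the .export command above on the desired schedule.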

How to insert/ingest the current timestamp into a Kusto table

I am trying to insert the current datetime into a table that has datetime as the datatype, using the following query:
.ingest inline into table NoARR_Rollout_Status_Dummie <| #'datetime(2021-06-11)',Sam,Chay,Yes
Table was created using the following query:
.create table NoARR_Rollout_Status_Dummie ( Timestamp:datetime, Datacenter:string, Name:string, SurName:string, IsEmployee:string)
But when I try to view the data in the table, I cannot see the Timestamp column being filled. Is there anything I am missing?
The .ingest inline command parses the input (after the <|) as a CSV payload; therefore, you cannot include variables or expressions in it.
An alternative to what you're trying to do would be to use the .set-or-append command, e.g.:
.set-or-append NoARR_Rollout_Status_Dummie <|
print Timestamp = datetime(2021-06-11),
      Name = 'Sam',
      SurName = 'Chay',
      IsEmployee = 'Yes'
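Since the goal is the current timestamp, the same approach works with now() in place of the literal datetime (a small variation on the command above, not a different mechanism):

.set-or-append NoARR_Rollout_Status_Dummie <|
print Timestamp = now(),
      Name = 'Sam',
      SurName = 'Chay',
      IsEmployee = 'Yes'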
NOTE, however, that ingesting a single record or a few records in a single command is not recommended for production scenarios, as it creates very small data shards and could negatively impact performance.
For queued ingestion, larger bulks are recommended: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/api/netfx/kusto-ingest-best-practices#optimizing-for-throughput
Otherwise, see whether your use case meets the recommendations for streaming ingestion: https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-streaming

MariaDB - Inserting historical data into a system versioned (temporal) table

I have some tables in MariaDB whose changes I have been tracking with a separate "changelog" table that is updated every time a record changes. However, I have recently learned about temporal (system-versioned) tables in MariaDB, and I would like to switch to that method, as it is a much more elegant way of tracking changes. I'm wondering, however, if there is a way to transfer my "changelog" table over to the new system-versioned tables.
So I was hoping I could somehow insert new rows with the specified values for the table, while also specifying the row_start and row_end columns, without triggering the table to create another historical row... is this possible? I tried just doing an "insert into (id, row_start, row_end, etc) values (x, y, z)", but that results in an unknown column "row_start" error.
Old question, but starting with 10.11, MariaDB allows direct insertion of historical data using a command-line option or system variable.
https://mariadb.com/kb/en/system-versioned-tables/#system_versioning_insert_history
system_versioning_insert_history
Description: Allows direct inserts into ROW_START and ROW_END columns if secure_timestamp allows changing timestamp.
Commandline: --system-versioning-insert-history[={0|1}]
Scope: Global, Session
Dynamic: Yes
Type: Boolean
Default Value: OFF
Introduced: MariaDB 10.11.0

Is there a way to clone a table in Kusto?

Is there a way to clone a table in Kusto exactly, so that it has all the extents of the original table? Even if it's not possible to retain extents, is there at least a performant way to copy a table to a new table? I tried the following:
.set new_table <| existing_table;
It was running forever and got a timeout error. Is there a way to copy so that the Kusto engine recognizes this is just a dumb copy, so that instead of going through the Kusto engine it simply does a blob copy on the back end and points the new table at the copied blobs, bypassing the whole Kusto processing route?
1. Copying the schema and data of one table to another is possible using the command you mentioned (another option to copy the data is to export its content into cloud storage, then ingest the resulting storage artifacts using Kusto's ingestion API or a tool that uses it, e.g. LightIngest or ADF; see the export sketch after the example below).
Of course, if the source table has a lot of data, then you would want to split this command into multiple ones, each dealing with a subset of the source data (which you can 'partition', for example, by time).
Below is just one example (it obviously depends on how much data you have in the source table):
.set-or-append [async] new_table <| existing_table | where ingestion_time() > X and ingestion_time() < X + 1h
.set-or-append [async] new_table <| existing_table | where ingestion_time() >= X+1h and ingestion_time() < X + 2h
...
Note that async is optional and is there to avoid a potential client-side timeout (by default after 10 minutes). The command itself continues to run on the backend for up to a non-configurable timeout of 60 minutes (though it's strongly advised to avoid such long-running commands, e.g. by performing the "partitioning" mentioned above).
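For the export-then-ingest alternative mentioned in point 1, the export side could look roughly like this (a sketch only; the storage URL, credentials, and namePrefix are placeholders, and the exported artifacts would then be ingested into new_table with LightIngest, ADF, or the ingestion API):

.export async compressed to csv (
    h@'https://<storageaccount>.blob.core.windows.net/<container>;<account-key-or-SAS>'
) with (namePrefix="existing_table_export", includeHeaders="all")
<| existing_table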
2. To your other question: There's no option to copy data between tables without re-ingesting the data (an extent / data shard currently can't belong to more than 1 table).
3. If you need to "duplicate" data being ingested into table T1 continuously into table T2, and both T1 and T2 are in the same database, you can achieve that using an update policy, as sketched below.
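A minimal sketch of such an update policy (assuming T2 already exists with the same schema as T1; the function name CopyFromT1 is just an illustration):

.create-or-alter function CopyFromT1() { T1 }

.alter table T2 policy update @'[{"IsEnabled": true, "Source": "T1", "Query": "CopyFromT1()", "IsTransactional": false}]'

Every batch ingested into T1 is then automatically run through CopyFromT1() (here a simple pass-through) and the result is appended to T2.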

UPSERT / INSERT / UPDATE from Databricks to Cosmos

Currently we are using Azure Databricks as the transformation layer, and the transformed data is loaded into Cosmos DB through the connector.
Scenario:
We have 2 files as source files.
The 1st file contains name, age.
The 2nd file contains name, state, country.
In Cosmos, I have created a collection using id as the partition key.
In Databricks, I am loading these 2 files as DataFrames and creating a temp table to query the content.
I am querying the content from the first file [ select name as id, name, age from file ] and loading it into the Cosmos collection.
From the second file, I am using [ select name as id, state, country ] and loading into the same collection, expecting the content from the second file to be inserted into the same document based on the id field.
The issue here is that when I load the content from the second file, the attribute 'age' from the first file gets deleted, and only id, name, state, country are seen in the Cosmos document. This is happening because I am using UPSERT in Databricks to load to Cosmos.
When I change the UPSERT to INSERT or UPDATE, it throws an error which says 'Resource with id already exists'.
Databricks Connection to Cosmos:
val configMap = Map(
  "Endpoint" -> {"https://"},
  "Masterkey" -> {""},
  "Database" -> {"ods"},
  "Collection" -> {"tval"},
  "preferredRegions" -> {"West US"},
  "upsert" -> {"true"})
val config = com.microsoft.azure.cosmosdb.spark.config.Config(configMap)
Is there a way to insert the attributes from the second file without deleting the attribute which is already present? I am not using a JOIN operation, as the use case doesn't fit it.
From a vague memory of doing this, you need to set the id attribute on your data frame so that it matches between the two datasets.
If you omit this field, Cosmos generates a new record, which is what is happening for you.
So if df1 & df2 have id=1 on the first record, then the first will insert it and the second will update it.
But if they really are the same record, then joining them in Spark will be far more efficient.
