Is there a way to clone a table in Kusto? - azure-data-explorer

Is there a way to clone a table in Kusto exactly, so that it has all the extents of the original table? Even if it's not possible to retain the extents, is there at least a performant way to copy a table to a new table? I tried the following:
.set new_table <| existing_table;
It ran forever and eventually hit a timeout error. Is there a way to copy so that the Kusto engine recognizes this is just a dumb copy, so that instead of going through the engine it simply copies the blobs in the back-end and points the new table at them, bypassing the whole Kusto processing route?

1. Copying the schema and data of one table to another is possible using the command you mentioned (another option is to export the table's content to cloud storage, then ingest the resulting storage artifacts using Kusto's ingestion API or a tool that uses it, e.g. LightIngest or ADF).
Of course, if the source table has a lot of data, then you would want to split this command into multiple ones, each dealing with a subset of the source data (which you can 'partition', for example, by time).
Below is just one example (it obviously depends on how much data you have in the source table):
.set-or-append [async] new_table <| existing_table | where ingestion_time() > X and ingestion_time() < X + 1h
.set-or-append [async] new_table <| existing_table | where ingestion_time() >= X+1h and ingestion_time() < X + 2h
...
Note that async is optional; it avoids the potential client-side timeout (10 minutes by default). The command itself continues to run on the backend, up to a non-configurable timeout of 60 minutes (though it's strongly advised to avoid such long-running commands, e.g. by performing the "partitioning" mentioned above).
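For the export-to-storage alternative mentioned in point 1, a minimal sketch could look like the following (the storage container URI and account key are placeholders, and the exact properties depend on your scenario):
.export async compressed to csv (
    h@"https://mystorageaccount.blob.core.windows.net/backup;<storage-account-key>"
) with (
    namePrefix = "existing_table_export",
    includeHeaders = "all"
) <| existing_table
The resulting blobs can then be ingested into the new table with LightIngest, ADF, or the ingestion API.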
2. To your other question: There's no option to copy data between tables without re-ingesting the data (an extent / data shard currently can't belong to more than 1 table).
3. If you need to continuously "duplicate" data being ingested into table T1 into table T2, and both T1 and T2 are in the same database, you can achieve that using an update policy (see the sketch below).
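As a hedged sketch (the table names T1/T2 and the function CopyFromT1 are placeholders; T2's schema must match the query output), the update policy could be set up along these lines:
// function whose output matches T2's schema
.create-or-alter function CopyFromT1() { T1 }
// run the function over every new ingestion into T1 and append the result to T2
.alter table T2 policy update @'[{"IsEnabled": true, "Source": "T1", "Query": "CopyFromT1()", "IsTransactional": false}]'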

Related

Apply multiple functions to update table using kusto

I want to produce/update the output of a table using several functions, because each function will create separate columns. For me it would be relatively practical to write several functions for this.
To update a table using one function is documented in the documentation. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/updatepolicy
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/alter-table-update-policy-command
But this case isn't covered. Is it even possible to do this? If so, how?
Is this the right way to do it? .set-or-append TABLE_NAME <| FUNCTION1 <| FUNCTION2 <| FUNCTION3
You can chain update policies as much as you need (as long as it does not create a circular reference), this means that Table B can have an update policy that runs a function over Table A and Table C can have an update policy that runs a function over Table B.
If you don't need the intermediate tables, you can set their retention policy to 0 days; this means that no data will actually be retained in these tables (see the sketch below).
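For illustration only (TableA/TableB/TableC and the enrichment functions are hypothetical names), the chain plus a zero-day retention on the intermediate table could look like:
// TableB is populated from TableA, TableC from TableB
.alter table TableB policy update @'[{"IsEnabled": true, "Source": "TableA", "Query": "EnrichA()", "IsTransactional": false}]'
.alter table TableC policy update @'[{"IsEnabled": true, "Source": "TableB", "Query": "EnrichB()", "IsTransactional": false}]'
// don't retain data in the intermediate table
.alter-merge table TableB policy retention softdelete = 0d recoverability = disabled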

Azure data explorer update record

I am new to Azure Data Explorer, and I am wondering how you can update a record in Azure Data Explorer using the Microsoft .NET SDK in C#.
The Microsoft documentation is really poor
Can we update a row, or can we only replace it?
You can use soft-delete to delete the original record, and then append/ingest the updated record.
Please note that this won't be atomic, meaning that if someone queries the table between the soft-delete and the append operations, they will see neither the old record nor the updated one.
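As a rough sketch of the two steps (MyTable, the Id value, and the column names are hypothetical, and the appended row must match the table's schema):
// soft-delete the original record
.delete table MyTable records <| MyTable | where Id == 42
// append the corrected version of the record
.set-or-append MyTable <| print Id = 42, Name = "CorrectedValue"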
there is no record "update" mechanism in Azure Data Explorer, even the 'soft delete' removes and replaces the row. this is useful for one-off scenarios, and may not be worth implementing in another language since it should not be used frequently. as the soft delete documentation says, if you plan to update data often, materialize may be a better option.
materialize is a bit more work and abstract, generally being worth the effort if you have a very large table that relies on metadata information like ingestion_time to make sense of records.
in smaller tables (say, less than a gig) i recommend the simple approach of replacing the table with an updated version of itself (just make sure that if you do rely on fields like ingestion_time, you update the schema and extend that data as a field for later use).
You will need to query the entire table, implement logic for isolating only the row(s) of interest (while retaining all the others), and perform an extend to modify the value. Then replace (do not append) the entire table.
For example:
.set-or-replace MyTable1 <|
MyTable1
| extend IncorrectColumn = iif(IncorrectColumn == "incorrectValue", "CorrectValue", IncorrectColumn)
Alternatively, you can produce the unchanged rows and the updated rows as two tabular results, and perform a union on them to form the final table.
.set-or-replace MyTable1 <|
let updatedRows =
MyTable1
| where Column1 = "IncorrectValue"
| extend Column1 = "CorrectValue";
let nonUpdatedRows =
MyTable1
| where Column1 = "CorrectValue";
updatedRows
| union nonUpdatedRows
I prefer to write to a temp table, double-check the data quality, and then replace the final table. This is particularly useful if you're working in batches and want to minimize the risk of data loss if there's a failure halfway through your batches.
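A minimal sketch of that flow, assuming a hypothetical staging table named MyTable1_Staging:
// stage the corrected data first
.set-or-replace MyTable1_Staging <|
MyTable1
| extend IncorrectColumn = iif(IncorrectColumn == "incorrectValue", "CorrectValue", IncorrectColumn)
// after validating MyTable1_Staging, swap the corrected data into the real table
.set-or-replace MyTable1 <| MyTable1_Staging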

'distributed=true' property doesn't seem to work with ingest from query

I am performing an ingest-from-query in the following manner:
.append async mytable with(distributed=true) <| myquery
Since this is using 'async', I got an OperationId to track the progress. So when I issue a .show operations command against the OperationId, I get 2 rows in the result set. The 'State' column value for both rows was 'InProgress'. The 'NodeId' column value for one of the rows was blank, whereas for the other row it was KENGINE000001. My cluster has 10+ worker nodes. Should I be getting ~10 rows as a result of this command, since I am using the distributed=true option? And my data load is heavy, so it's really a candidate for distributed ingestion. So either this property is not working, or I am not interpreting its usage correctly?
Should I be getting ~10 rows as a result of this command, since I am using the distributed=true option?
No
So either this property is not working or I am not interpreting its usage correctly?
Likely the latter, or a false expectation about the output of .show operations; see above.
You can track the state/status of an async command using .show operations <operation_id>
If it doesn't reach a final state ("Completed","Failed","Throttled", etc.) after 1hr, that's unexpected - and you should open a support ticket for that.
Regardless, it's ill-advised to attempt to ingest a lot of data (multi-GBs or more) using a single command, even if it's distributed.
If that's what you're attempting to do, you should consider splitting your ingestion into multiple commands, each handling a subset of the data.
see the 'remarks' section here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-ingestion/ingest-from-query
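As a sketch of both points (the Timestamp column and the operation id GUID are placeholders), each smaller command can be issued asynchronously and then tracked individually:
.append async mytable with(distributed=true) <| myquery | where Timestamp >= datetime(2024-01-01) and Timestamp < datetime(2024-01-02)
.show operations 01234567-89ab-cdef-0123-456789abcdef
| project OperationId, State, Status, StartedOn, LastUpdatedOn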

How do I make multiple insertions into a SQLite DB from UIPath

I have an Excel spreadsheet with multiple entries that I want to insert into an SQLite DB from UIPath. How do I do this?
You could do it one of two ways. For both methods, you will need to use the Excel Read Range activity to read the spreadsheet into a DataTable.
Scenario 1: You could iterate over the table in a For Each Row loop, line by line, converting each row to SQL and using an Execute Non-Query activity. This is too slow, and if you like O notation, this is an O(n) solution.
Scenario 2: You could upload the entire table (as long as it's compatible with the DB table) to the database.
You will need the Database > Insert activity.
You will need to provide the DB Connection (which I explain how to create in another post).
Then enter the SQLite database table you want to insert into, in quotes.
And then, in the last field, enter the DataTable that you have created or pulled from another resource.
The output will be an integer (the number of affected records).
In O notation, this is an O(1) solution, at least from our coding perspective.

mysqldump data loss after restoration

I have tried to dump a database db1 of about 40 GB into an SQL file using mysqldump from system A, with InnoDB as the default storage engine, and tried to restore it on another system B. Both have InnoDB as the default storage engine and the same MySQL version. I checked for table corruption on system A using a table status check and was not able to find any. I used the query below to calculate the table size and number of rows per table for db1 on both system A and system B, and found that there was about 6 GB of data loss on db1 of system B.
SELECT table_schema,
       SUM(data_length + index_length)/1024/1024 AS total_mb,
       SUM(data_length)/1024/1024 AS data_mb,
       SUM(index_length)/1024/1024 AS index_mb,
       COUNT(*) AS tables,
       CURDATE() AS today
FROM information_schema.tables
GROUP BY table_schema
ORDER BY 2 DESC;
Can we rely on information_schema for calculating the exact number of rows and the exact table size (data_length + index_length) when InnoDB is the default storage engine? Why has a dump using mysqldump resulted in significant data loss on restoration to system B?
InnoDB isn't able to give an exact row count through its table statistics (the values reported in information_schema.tables and SHOW TABLE STATUS are estimates). When you request a record count for an InnoDB table this way, you will notice that the count fluctuates.
For more information I would like to refer you to the MySQL developer page for InnoDB
http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html
Restrictions on InnoDB Tables
ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by doing eight random dives to each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE may produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.
MySQL uses index cardinality estimates only in join optimization. If some join is not optimized in the right way, you can try using ANALYZE TABLE. In the few cases that ANALYZE TABLE does not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set the max_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See Section 5.1.4, “Server System Variables”, and Section C.5.6, “Optimizer-Related Issues”.
SHOW TABLE STATUS does not give accurate statistics on InnoDB tables, except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 14.2.12.1, “InnoDB Performance Tuning Tips”.
The best solution to check if you have any data loss, is to compare the contents of your database.
mysqldump --skip-comments --skip-extended-insert -u root -p dbName1 > file1.sql
mysqldump --skip-comments --skip-extended-insert -u root -p dbName2 > file2.sql
diff file1.sql file2.sql
See this topic for more information.
Another advantage of this solution is that you can see where you have the differences.
