Removing duplicated rows in the most efficient way - azure-data-explorer

I am wondering what the best way is to remove duplicate rows while keeping the rest of the data within the row.
For instance, let's say I have two rows that have the same ID, and I want to remove one of them and keep just a single copy of that row, deduplicating based on that field (or fields).
Online searches show that the answer is probably using | summarize arg_max(TimeField,*) by ID. However, since KQL is a columnar database, this operation is "heavy" by design, and will increase the processing time, probably significantly.
I was wondering whether there is any way around this, or a more efficient approach?
Thank you!
I tried to remove entire duplicate rows using arg_max, but given the nature of the function, it makes the query time out.

"It is making the query to timeout."
The default timeout is 4 minutes.
You can increase it up to an hour.
2.
"Since KQL is a columnar database, this operation is "heavy" by design."
This might be a "heavy" operation due to the cardinality of the IDs.
You might need to use summarize with hint.strategy=shuffle instead of the default algorithm.
It has nothing to do with columnar.
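To make both points concrete, here is a minimal Python sketch using the azure-kusto-data client: it raises the server-side timeout above the 4-minute default and adds the shuffle hint to the summarize. TimeField and ID are the column names from the question; the cluster URL, database, and table names are placeholders I made up.

    from datetime import timedelta
    from azure.kusto.data import (
        KustoClient,
        KustoConnectionStringBuilder,
        ClientRequestProperties,
    )

    # Cluster URL, database, and table are placeholders, not from the question.
    cluster = "https://<yourcluster>.<region>.kusto.windows.net"
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
    client = KustoClient(kcsb)

    # Raise the server-side timeout above the 4-minute default (1 hour is the max).
    props = ClientRequestProperties()
    props.set_option(
        ClientRequestProperties.request_timeout_option_name, timedelta(hours=1)
    )

    # Same dedup query as in the question, with the shuffle strategy hint added.
    query = """
    MyTable
    | summarize hint.strategy=shuffle arg_max(TimeField, *) by ID
    """

    response = client.execute("MyDatabase", query, props)
    for row in response.primary_results[0]:
        print(row)

The same timeout and hint can of course be set from whatever client you already use; the hint.strategy=shuffle part is what changes the aggregation strategy.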

Related

DynamoDB top item per partition

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of a given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different from that for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol and the range_key Moment, and we believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table or the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moment and if the insertion is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
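As a rough illustration of the write path, a minimal Python (boto3) sketch might look like this. The two table names come from the answer; the record() helper, the condition expression, and the error handling are my assumptions, not the answer's exact code.

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    current = dynamodb.Table("stocks-current")        # hash key: symbol, no range key
    historical = dynamodb.Table("stocks-historical")  # hash key: symbol, range key: moment

    def record(symbol, moment, value):
        # Note: boto3's resource layer rejects Python floats; pass value as Decimal/int/str.
        # Every data point always goes into the historical table.
        historical.put_item(Item={"symbol": symbol, "moment": moment, "value": value})

        # Only overwrite the "current" row if this point is newer than what is
        # already stored, so out-of-order arrivals cannot clobber the latest value.
        try:
            current.put_item(
                Item={"symbol": symbol, "moment": moment, "value": value},
                ConditionExpression="attribute_not_exists(symbol) OR moment < :m",
                ExpressionAttributeValues={":m": moment},
            )
        except ClientError as e:
            # A failed condition just means an older point arrived late; ignore it.
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise

If the two writes must succeed or fail together, the same pair of puts can be issued through the TransactWriteItems API instead of two separate calls, as the answer notes.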

How to get nulls at the top of database query results?

I am running a query over a huge amount of data. It is important that I get rows with null values in a specific column at the beginning of the results. I want to avoid using ORDER BY because it would certainly cause a big performance hit. Is there any other way to achieve this?

Unix Remove Duplicate uniq vs Sybase - ignore_dup_key

I have a file with 20 Million records. It has 30% duplicate values.
We thought of implementing two approaches.
1. Writing a shell script to remove the duplicates; the file will be uploaded to a Unix box.
2. Creating a table in Sybase with an ignore_dup_key index and BCPing the file as-is into the table, so that the table eliminates the duplicates.
I have read that as the duplicate percentage increases, ignore_dup_key hurts performance.
How does the Unix uniq method perform? Which one is applicable here?
Inputs are welcome!
Doing a BCP into a table with an ignore-dup-key unique index should be fastest, not least because it is much easier and simpler to implement.
Here is why: ultimately, in either scenario you end up inserting a set of rows into the database table and building an index for those inserted rows. That amount of work is equal for both cases.
Now, the BCP method uses the existing index to identify and discard duplicate keys. This is handled quite efficiently inside ASE, as the row is discarded before being inserted. The number of duplicates does NOT affect this efficiency in case you only want to discard the duplicates (whoever said that was incorrectly informed).
If you did this duplicate-filtering outside ASE, you'd need to figure out a sorting method that discards records based on the uniqueness of only a part of the record (the key). That's less trivial than it sounds and also requires system resources to perform the sort. Those resources are better spent doing the sort (i.e. the index creation) inside ASE, which you already had to do anyway for the rows being finally inserted.
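For illustration, the "outside ASE" route might look like the following minimal Python sketch, which keeps only the first record seen for each key. The tab delimiter, the key being the first field, and the file names are assumptions; also note that holding roughly 20 million distinct keys in memory can take several GB of RAM.

    KEY_FIELD = 0  # assume the key is the first tab-separated field

    seen = set()
    with open("input.dat") as src, open("deduped.dat", "w") as dst:
        for line in src:
            key = line.rstrip("\n").split("\t")[KEY_FIELD]
            if key not in seen:  # keep only the first occurrence of each key
                seen.add(key)
                dst.write(line)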
Regardless, the BCP method is much more convenient than external sorting since it requires less work (fewer steps) from you. That's probably an even more important consideration.
For further reading, my book "Tips, Tricks & Recipes for Sybase ASE" has a few sections dedicated to ignore_dup_key.
Without testing both approaches, you cannot say for sure which is faster. But the Sybase approach will more probably be faster, since databases are optimized to parallelize your workload.

"SELECT * FROM..." VS "SELECT ID FROM..." Performance [duplicate]

This question already has answers here:
select * vs select column
(12 answers)
Closed 9 years ago.
As someone who is fairly new to many things SQL (I don't use it much), I'm sure there is an answer to this question out there, but I don't know what to search for to find it, so I apologize.
Question: if I have a bunch of rows in a database table with many columns but only need to get back the IDs, which is faster, or are they the same speed?
SELECT * FROM...
vs
SELECT ID FROM...
You asked about performance in particular vs. all the other reasons to avoid SELECT *: so it is performance to which I will limit my answer.
On my system, SQL Profiler initially indicated less CPU overhead for the ID-only query, but with the small number of rows involved, each query took the same amount of time.
I think this was really only due to the ID-only query being run first, though. On re-running them (in the opposite order), they took equally little CPU overhead.
With extremely high column and row counts, or extremely wide rows, there may be a perceptible difference in the database engine, but there was nothing glaring here.
Where you will really see the difference is in sending the result set back across the network! The ID-only result set will typically be much smaller of course - i.e. less to send back.
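If you want to see the result-set-size effect for yourself, here is a rough, self-contained Python sketch using an in-memory SQLite table. It is not SQL Server, so it only illustrates the data-volume point (not Profiler output or covering indexes), and the table shape and row count are made up.

    import sqlite3
    import time

    # Build a wide-ish table with 200,000 rows of padded text columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT)")
    conn.executemany(
        "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
        (("x" * 200, "y" * 200, "z" * 200) for _ in range(200_000)),
    )

    # Compare fetching only the IDs vs. fetching every column.
    for sql in ("SELECT id FROM t", "SELECT * FROM t"):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        print(f"{sql}: {len(rows)} rows in {time.perf_counter() - start:.3f}s")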
Never use * to return all columns in a table; it's lazy. You should only extract the data you need.
So: SELECT field FROM ... is the faster option.
There are several reasons you should never (never ever) use SELECT * in production code:
1. Since you're not giving your database any hints as to what you want, it first needs to check the table's definition to determine the columns of that table. That lookup costs some time: not much in a single query, but it adds up over time.
2. In SQL Server (not sure about other databases), if you need only a subset of columns, there's always a chance that a non-clustered index covers that request (i.e. contains all the columns needed). With SELECT *, you give up on that possibility right from the start. In the covered case, the data would be retrieved from the index pages (if those contain all the necessary columns), so disk I/O and memory overhead would be much lower than with a SELECT * query.
Long story short: selecting only the columns you need will generally be faster, because less data has to be read and sent back. This is a best practice that you should adopt very early on.
For the second part, you should probably post a separate question instead of piggybacking off this one. That makes it easier to distinguish what you are asking about.

Split large sqlite table by sessionid field

I am relatively new to sql(ite), and I'm learning as I go while working on a new project.
We have millions of transaction rows in one "data" table, one of the fields being a "sessionid" field.
Since I want to concentrate on in-session activity for now, I primarily need to look only at transactions from the same sessions.
My intuition is that it would be a lot faster if I split the data by session into many single-session tables, rather than always querying for a single sessionid and then proceeding. My question: is that correct? Will it make a difference?
Even if not: could you help me out and tell me how I could split the rows of the one "data" table into many session-specific tables, keeping the rows themselves the same? Plus one table that maps each sessionid to its table?
Thanks!
A friend just told me that splitting into tables would be extremely inflexible, and that I should instead add an index on the sessionid column so single sessions can be accessed faster. Any thoughts on that, and how best to do it?
First of all, have you hit any specific performance bottleneck with it so far? If so, please describe it.
Having one table per session will probably speed up lookups, and index maintenance on INSERTs.
SQLite doesn't impose a limit on the number of tables, so you should be okay.
Another solution, which is easier to maintain, is to create one table per day or week.
Depending on how long your sessions last, this may or may not be feasible.
Related: https://stackoverflow.com/a/811862/89771
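Regarding the indexing alternative mentioned in the question's edit, a minimal Python sqlite3 sketch might look like this. The table and column names come from the question; the database file name and the example session id are assumptions.

    import sqlite3

    conn = sqlite3.connect("transactions.db")

    # Index sessionid once; SQLite can then find a session's rows
    # without scanning the whole "data" table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_data_sessionid ON data(sessionid)")
    conn.commit()

    # Pull all transactions for one session.
    rows = conn.execute(
        "SELECT * FROM data WHERE sessionid = ?", ("some-session-id",)
    ).fetchall()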
