I run a query over a huge amount of data. It is important that rows with NULL values in a specific column come at the beginning of the results. I want to avoid ORDER BY because it will certainly cause a big performance hit. Is there any other way to achieve this?
I am wondering what the best way is to remove duplicate rows while keeping the other data within the row.
For instance, let's say I have two rows that have the same ID, and I want to remove one of them, keeping just a single copy of the row based on that field (or fields).
Online searches show that the answer is probably using | summarize arg_max(TimeField,*) by ID. However, since KQL is a columnar database, this operation is "heavy" by design, and will increase the processing time, probably significantly.
I was wondering whether there is a way around it, or a more efficient approach?
Thank you!
I tried to remove entire duplicate rows using arg_max, but given the nature of the function, it makes the query time out.
1. "It makes the query time out."
The default timeout is 4 minutes.
You can increase it up to an hour.
2. "Since KQL is a columnar database, this operation is 'heavy' by design."
This might be a "heavy" operation due to the cardinality of the IDs.
You might need to use summarize with hint.strategy=shuffle instead of the default strategy, e.g. | summarize hint.strategy=shuffle arg_max(TimeField, *) by ID.
It has nothing to do with the columnar storage format.
We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different than for AMZN.
The three attributes are just { Symbol, Moment, Value }. We could make the hash key Symbol and the range key Moment, and we believe that would handle the first query easily and efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
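In SQL terms, that linked answer is the standard greatest-n-per-group pattern. A minimal sketch with Python's sqlite3 and made-up sample data (table and column names follow the { Symbol, Moment, Value } example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (Symbol TEXT, Moment INTEGER, Value REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("TSLA", 1, 100.0), ("TSLA", 3, 110.0), ("AMZN", 2, 90.0)],
)

# Join each row against the per-symbol maximum Moment.
latest = conn.execute("""
    SELECT s.Symbol, s.Moment, s.Value
    FROM stocks AS s
    JOIN (SELECT Symbol, MAX(Moment) AS Moment
          FROM stocks GROUP BY Symbol) AS m
      ON s.Symbol = m.Symbol AND s.Moment = m.Moment
    ORDER BY s.Symbol
""").fetchall()
print(latest)  # [('AMZN', 2, 90.0), ('TSLA', 3, 110.0)]
```

This is exactly the shape of query that has no direct, efficient DynamoDB equivalent, which is what the rest of the question is about.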
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep it cached. That could be in a separate table, or in the same table with extra info such as a column IsLatestForMachineKey (effectively a bool). On every insert, you would grab the row where IsLatestForMachineKey = 1, compare the Moments, and if the inserted record is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out-of-order records.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
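To make the write path concrete, here is a hedged sketch of the request such a dual write might send, in the low-level DynamoDB item format. The table names stocks-current and stocks-historical follow the scheme above; the helper function name and attribute names are made up:

```python
def build_transact_write(symbol, moment, value,
                         current_table="stocks-current",
                         historical_table="stocks-historical"):
    """Build a TransactWriteItems request that records one reading in
    both tables (boto3 would send it via transact_write_items)."""
    item = {
        "symbol": {"S": symbol},
        "moment": {"N": str(moment)},
        "value": {"N": str(value)},
    }
    return {
        "TransactItems": [
            # Append-only history: plain Put, keyed on (symbol, moment).
            {"Put": {"TableName": historical_table, "Item": dict(item)}},
            # Latest value: only overwrite if this reading is newer than
            # what is stored (or the symbol is not present yet).
            {"Put": {
                "TableName": current_table,
                "Item": dict(item),
                "ConditionExpression":
                    "attribute_not_exists(moment) OR moment < :m",
                "ExpressionAttributeValues": {":m": {"N": str(moment)}},
            }},
        ]
    }

# e.g. boto3.client("dynamodb").transact_write_items(
#          **build_transact_write("TSLA", 1700000000, 242.8))
```

One caveat worth knowing: TransactWriteItems is all-or-nothing, so if the condition on stocks-current fails for a late reading, the historical Put is cancelled too. If late data must still land in stocks-historical, issue the two writes separately and put the condition only on the stocks-current write.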
I am using WebSQL to store data in a PhoneGap application. One of the tables has a lot of data, say 2,000 to 10,000 rows. So when I read from this table, with just a simple SELECT statement, it is very slow. I then debugged and found that as the size of the table increases, performance decreases exponentially. I read somewhere that to get performance you have to divide the table into smaller chunks; is that possible, and how?
One idea is to look for something to group the rows by and consider breaking them into separate tables based on some common category, instead of one shared table for everything.
I would also consider fine tuning the queries to make sure they are optimal for the given table.
Make sure you're not just running a simple SELECT query without a WHERE clause to limit the result set.
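WebSQL is backed by SQLite, so the usual SQLite advice applies. A minimal sketch (in Python's sqlite3, with an invented table standing in for your big one) of the two cheap wins: index the filter column and page with WHERE/LIMIT rather than reading everything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings"
             " (id INTEGER PRIMARY KEY, category TEXT, value REAL)")
conn.executemany("INSERT INTO readings (category, value) VALUES (?, ?)",
                 [("a" if i % 2 else "b", float(i)) for i in range(10000)])

# An index on the filter column avoids a full-table scan...
conn.execute("CREATE INDEX idx_readings_category ON readings (category)")

# ...and WHERE plus LIMIT/OFFSET reads one small page at a time
# instead of pulling all 10,000 rows into memory at once.
page = conn.execute(
    "SELECT id, value FROM readings"
    " WHERE category = ? ORDER BY id LIMIT 50 OFFSET 0",
    ("a",),
).fetchall()
print(len(page))  # 50
```

The same SQL runs unchanged through WebSQL's executeSql; only the surrounding API differs.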
I have a piece of software which takes in a database and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names and building an INSERT statement for each. It then takes the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (sqlite, amusingly enough, is OK with declaring all columns NUMERIC, even if they actually end up containing text data). Once this is done, it executes the INSERTs (in a single transaction; otherwise it would take ages).
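A minimal sketch of that pipeline, assuming each input file is a block of name=value lines (the sample data and table name are invented):

```python
import sqlite3

# Invented stand-ins for the input files.
files = [
    "name=foo\nx=12\ny=23.4",
    "name=bar\nx=7\nz=0.5",
]

rows = [dict(line.split("=", 1) for line in text.splitlines() if line)
        for text in files]

# Union of all variable names seen across files (the sort | uniq step).
columns = sorted({k for row in rows for k in row})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (%s)"
             % ", ".join("%s NUMERIC" % c for c in columns))

# All INSERTs in a single transaction, as in the original script.
with conn:
    for row in rows:
        cols = sorted(row)
        conn.execute("INSERT INTO results (%s) VALUES (%s)"
                     % (", ".join(cols), ", ".join("?" for _ in cols)),
                     [row[c] for c in cols])

out = conn.execute("SELECT name, x FROM results ORDER BY name").fetchall()
print(out)  # [('bar', 7), ('foo', 12)]
```

Note how NUMERIC affinity quietly converts '12' and '7' to integers while leaving 'foo' and 'bar' as text, exactly the behavior mentioned above.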
To improve performance, I added a basic index on each column. However, this increases the database size fairly significantly and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be filtered or sorted on.
The second type should have an index, because it will commonly be filtered by.
The third type probably doesn't need an index, because it will only be used as output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filter queries of the form WHERE foo IN (5) AND bar IN (12, 14, 15)? Note that I don't know which columns the user will pick, beyond the fact that it will be a type-2 column.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O. Tables with fewer than ten rows should not be indexed, because all the data fits into a single page anyway; an index would just force SQLite to read another page for the index.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
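A quick way to check what the planner actually does, sketched with Python's sqlite3 (table and index names invented): EXPLAIN QUERY PLAN shows whether a composite index is picked up for the IN-filter query from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (foo INTEGER, bar INTEGER, val REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(i % 20, i % 30, float(i)) for i in range(3000)])

# One composite index covering both filter columns.
conn.execute("CREATE INDEX idx_foo_bar ON results (foo, bar)")
conn.execute("ANALYZE")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT val FROM results WHERE foo IN (5) AND bar IN (12, 14, 15)"
).fetchall()
for row in plan:
    print(row[-1])  # the detail column names the index if it is used
```

If the detail line reads SCAN rather than SEARCH ... USING INDEX, the planner rejected the index, which is the signal to revisit selectivity or re-run ANALYZE.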
When performing a SQLite query, does the size of the returned data set affect how long the query takes? Let's assume for this question that I don't actually access any of the data in the result; I just want to know whether the query itself takes longer. Let's also assume that I am simply selecting all rows and have no WHERE or ORDER BY clauses.
For example, say I have two tables, A and B. Table A has a million rows and table B has 10 rows, and both tables have the same number and types of columns. Will selecting all rows in table A take longer than selecting all rows in table B?
This is a follow-up to my question How does a cursor refer to deleted rows?. I am guessing that if, during the query, SQLite makes a copy of the data, then queries that return large data sets may take longer, unless there is an optimization that only copies the result data if the underlying data changes while the query is still alive?
Depending on some details, yes, a query may take different amounts of time.
Example: I have a table with some 20k entries. I do a GLOB search that must check every row, with a LIMIT. If the LIMIT is met, the query can stop early. If not, it must go through the entire table (or JOIN). So searches with more results than the LIMIT return more quickly than searches with only a few results.
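That early-exit behavior is easy to observe by counting how many rows SQLite actually examines. A sketch with Python's sqlite3, using user-defined probe functions in place of GLOB (all names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("row%d" % i,) for i in range(20000)])

calls = {"n": 0}  # counts rows the WHERE clause was evaluated on

def always(s):
    calls["n"] += 1
    return 1  # every row matches

def never(s):
    calls["n"] += 1
    return 0  # no row matches

conn.create_function("always_match", 1, always)
conn.create_function("never_match", 1, never)

# Every row matches: the scan stops as soon as the LIMIT is met.
conn.execute("SELECT s FROM t WHERE always_match(s) LIMIT 10").fetchall()
hits = calls["n"]  # 10

# No row matches: the whole table is examined before giving up.
calls["n"] = 0
conn.execute("SELECT s FROM t WHERE never_match(s) LIMIT 10").fetchall()
misses = calls["n"]  # 20000

print(hits, misses)
```

Ten rows examined in the first case versus the full 20,000 in the second, which is why the "too many matches" query returns faster.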
If the query must run through the same amount of data, I don't expect a significant difference between a smaller and a larger number of selected rows. There will, of course, be some I/O cost for returning more rows.