Currently I use table.query to get items matching a partition key, sorted by sort key. The new requirement is to handle batch queries: a couple of hundred partition keys to match, with the results ideally still sorted by sort key within each partition key. I found BatchGetItem, which can handle up to 100 items per request, but it looks like it does no sorting. Is one item here one row in DDB, or all rows under one partition key?
From a performance (query speed) and price perspective, which one should I use? And do I have to sort the result myself if I use BatchGetItem? Ideally I'd like a solution that is fast, cost-effective, and returns results sorted by sort key within each partition key, but the first two are the top priority and I can do the sorting myself if I have to. Thanks
Query() is cheaper...
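BatchGetItem() runs as individual GetItem() calls, each costing at least 1 RCU for a strongly consistent read (assuming your item is no larger than 4 KB). One item here is one row, not a whole partition.
Let's say your item is 100 bytes: Query() sums the sizes of the items it returns before rounding up to 4 KB, so it can return 40 of them for 1 RCU, whereas returning 40 via BatchGetItem() will cost 40 RCU.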
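To make the arithmetic concrete, here is a minimal sketch of the two rounding rules, assuming strongly consistent reads where one RCU covers up to 4 KB (eventually consistent reads cost half as much). The function names and the 100-byte item size are made up for illustration.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one strongly consistent RCU covers up to 4 KB


def query_rcu(item_sizes):
    """Query: item sizes are summed first, then rounded up to 4 KB blocks."""
    return max(1, math.ceil(sum(item_sizes) / RCU_BLOCK_BYTES))


def batch_get_rcu(item_sizes):
    """BatchGetItem: every item is rounded up to 4 KB on its own."""
    return sum(max(1, math.ceil(size / RCU_BLOCK_BYTES)) for size in item_sizes)


items = [100] * 40  # 40 items of 100 bytes each (hypothetical sizes)
print(query_rcu(items))      # 1
print(batch_get_rcu(items))  # 40
```

The gap shrinks as items approach 4 KB each, since per-item rounding then wastes less.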
In a DynamoDB table I would like to query by selecting all items where an attribute's value matches one of a set of values. For example, my table has a current_status attribute, so I would like all items that have either a 'NEW' or an 'ASSIGNED' value.
If I put a GSI on the current_status attribute, it looks like I have to do this in two queries? Or do a scan instead?
DynamoDB does not recommend using Scan. Use it only when there is no other option and you have a fairly small amount of data.
You need to use a GSI here. Putting current_status alone in the PK of the GSI would result in a hot partition issue.
The right solution is to put a random number in the PK of the GSI, ranging from 0..N, where N is the number of partitions, and to put the status in the SK of the GSI, along with a timestamp or some other unique information to keep each PK-SK pair unique. When you want to query by current_status, execute N queries in parallel, one per PK value in 0..N, each with SK begins_with current_status. N should be decided based on the amount of data you have. If the data returned by each query is under 4 KB, this parallel query operation consumes about N read units, without the hot partition issue. The links below provide the details:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html
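A rough sketch of the write-sharding idea, with the fan-out logic kept separate from the actual DynamoDB call. The attribute names `gsi_pk` and `gsi_sk`, the shard count, and the `#` separator are all assumptions for illustration; `query_shard` is a caller-supplied function, which with boto3 would typically wrap `table.query` using `Key("gsi_pk").eq(shard) & Key("gsi_sk").begins_with(prefix)`. An in-memory list stands in for the index here so the example runs on its own.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_SHARDS = 4  # hypothetical shard count; size it to your data volume


def gsi_keys(status, timestamp, item_id):
    """Build the GSI key pair written with each item (names are hypothetical)."""
    return {
        "gsi_pk": random.randrange(N_SHARDS),         # random shard 0..N_SHARDS-1
        "gsi_sk": f"{status}#{timestamp}#{item_id}",  # status first, so begins_with works
    }


def query_status(query_shard, status, n_shards=N_SHARDS):
    """Run one query per shard in parallel and merge the results."""
    prefix = status + "#"  # the separator keeps "NEW" from matching "NEWER"
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        pages = pool.map(lambda shard: query_shard(shard, prefix), range(n_shards))
        return [item for page in pages for item in page]


# In-memory stand-in for the index, only to exercise the fan-out logic.
fake_index = [
    gsi_keys(status, ts, item_id)
    for item_id, (status, ts) in enumerate(
        [("NEW", 1), ("ASSIGNED", 2), ("DONE", 3), ("NEW", 4)]
    )
]


def fake_query(shard, prefix):
    return [k for k in fake_index
            if k["gsi_pk"] == shard and k["gsi_sk"].startswith(prefix)]


# Two status values still mean two rounds of N queries.
wanted = query_status(fake_query, "NEW") + query_status(fake_query, "ASSIGNED")
print(len(wanted))  # 3
```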
My problem is that my queries are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
timestamp TEXT,
name TEXT,
result float
)
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This table holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal timestamps, but the queries will be done at equal names (i.e. given a name, I want to get its time series), like:
SELECT timestamp,result FROM results WHERE name='some_name';
The way I'm doing things now is to have no indexes, calculate all results, then create an index on name: CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index should make queries on name really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that even though I've indexed the names, the rows are still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,name1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows in disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal SQLite tables have, as a primary key, a 64-bit integer (known as rowid and a few other aliases). That determines the order in which rows are stored in a B*-tree (which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key, which is then used to place rows in a B-tree. So if every row's (name, timestamp) pair is unique, that's a possibility that will leave all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
timestamp TEXT
, name TEXT
, result REAL
, PRIMARY KEY (name, timestamp)
) WITHOUT ROWID
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
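As a small, hedged illustration using Python's built-in sqlite3 module, the following creates the WITHOUT ROWID table, inserts rows in timestamp-major order as in the question, and checks that a lookup by name is answered from the primary key. The sample timestamps and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        timestamp TEXT,
        name TEXT,
        result REAL,
        PRIMARY KEY (name, timestamp)
    ) WITHOUT ROWID
""")

# Insert in timestamp-major order, the way the question describes.
rows = [(f"2020-01-01T00:00:{t:02d}", f"name{n}", t * 0.5)
        for t in range(3) for n in range(1, 4)]
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

# Time series for one name; rows for a name sit next to each other in the B-tree.
series = conn.execute(
    "SELECT timestamp, result FROM results WHERE name = ? ORDER BY timestamp",
    ("name2",),
).fetchall()

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT timestamp, result FROM results WHERE name = ?",
    ("name2",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # 4th column holds the plan detail

print(series)
print(plan_text)  # should mention the PRIMARY KEY rather than a full scan
```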
Some pragmas worth looking into to tune things:
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.
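For reference, a sketch of how those pragmas could be set from Python's sqlite3 module. The sizes are arbitrary placeholders, not recommendations, and the temporary database path is only there to make the example self-contained.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "results.db")  # hypothetical location
conn = sqlite3.connect(db_path)

# A negative cache_size is a size in KiB, so this asks for about 64 MiB of page cache.
conn.execute("PRAGMA cache_size = -65536")
# Request up to 256 MiB of memory-mapped I/O; the build may cap or ignore this.
conn.execute("PRAGMA mmap_size = 268435456").fetchall()
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]

conn.execute("CREATE TABLE results (timestamp TEXT, name TEXT, result REAL)")
conn.execute("CREATE INDEX index_name ON results (name)")
conn.commit()

# Let SQLite refresh its planner statistics if it thinks that would help.
conn.execute("PRAGMA optimize").fetchall()
conn.close()

print(cache_size)
```

Both cache_size and mmap_size apply per connection, so they need to be set again each time the database is opened.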
I have an sqlite database with a table that logs electric power values over time, i.e. there is a timestamp column and one for the associated power value.
With a value coming in roughly every second, this table grows significantly over time. Which is why I want to thin out old values, for example by replacing all 60 values in a minute with their average.
I know how to query for the average.
I know how to insert the query's result back into the table.
But how do I delete the original values without also deleting the newly inserted average value (which has a timestamp within the same range)?
Note that I would like to perform the operation entirely inside sqlite query language, i.e. without storing for example row ids in the C code that is executing the queries.
The easiest way would be to use a temporary table:
BEGIN;
CREATE TEMP TABLE Averages AS
SELECT MIN(Timestamp), AVG(Value)
FROM MyTable
WHERE (old)
GROUP BY (minute);
DELETE FROM MyTable WHERE (old);
INSERT INTO MyTable(Timestamp, Value) SELECT * FROM Averages;
DROP TABLE Averages;
COMMIT;
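Here is one way the `(old)` and `(minute)` placeholders might be filled in, run through Python's sqlite3 module. It assumes Timestamp holds Unix seconds as an integer, so `Timestamp / 60` is the minute bucket, and that "old" means earlier than a cutoff; both are assumptions about the schema, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Timestamp INTEGER, Value REAL)")
# Three minutes of one-per-second samples; Timestamp is Unix seconds (assumption).
conn.executemany("INSERT INTO MyTable VALUES (?, ?)",
                 [(t, float(t)) for t in range(180)])
conn.commit()

cutoff = 120  # hypothetical: everything before this timestamp counts as "old"

conn.executescript(f"""
    BEGIN;
    CREATE TEMP TABLE Averages AS
        SELECT MIN(Timestamp) AS Timestamp, AVG(Value) AS Value
        FROM MyTable
        WHERE Timestamp < {int(cutoff)}
        GROUP BY Timestamp / 60;          -- integer division = minute bucket
    DELETE FROM MyTable WHERE Timestamp < {int(cutoff)};
    INSERT INTO MyTable (Timestamp, Value) SELECT Timestamp, Value FROM Averages;
    DROP TABLE Averages;
    COMMIT;
""")

thinned = conn.execute(
    "SELECT Timestamp, Value FROM MyTable ORDER BY Timestamp").fetchall()
print(len(thinned))  # 62: two per-minute averages plus 60 untouched recent rows
print(thinned[:2])   # [(0, 29.5), (60, 89.5)]
```

Because the averages are computed before the DELETE and inserted after it, the new rows are never caught by the delete, and the whole thing stays inside one transaction.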
This might be a beginner's question, but when testing my SQLite database, I found that when I delete a row, the row id keeps incrementing when I insert a new row and doesn't reuse, for instance, the row id of a deleted row. So what will happen if the row id reaches its maximum value while there are fewer rows in the table?
This is documented:
If the table has previously held a row with the largest possible ROWID, then new INSERTs are not allowed and any attempt to insert a new row will fail with an SQLITE_FULL error.
If you omit the AUTOINCREMENT keyword, IDs will still autoincrement, but can be reused if you delete the last row or if the values overflow:
If the largest ROWID is equal to the largest possible integer (9223372036854775807) then the database engine starts picking positive candidate ROWIDs at random until it finds one that is not previously used.
When you use an auto-incrementing row number, you have to keep an eye on the largest value. If the number of rows approaches that limit, you have to use a bigger data type. Usually, though, an integer key never gets that far, because a database designer should keep an eye on normalization.
If the number of rows does get that big, you are really stuck with the queries: they will take a huge amount of time. SQLite is mainly useful on low-end devices, which are not well suited to handling big data.