Performance issue with primary key - SQLite

I am populating a medium-sized table (60GB, 500 million rows). The process completes reasonably fast if the table has no primary key (~1 hour using bulk insert), but it takes ~10 times longer if I create that table with the primary key. I assume this is because it takes time to verify the uniqueness constraint and also update the index at each insert.
I thought a good workaround would be to add the primary key later, since indexing a table that is already populated should be much faster than incremental indexing. But SQLite doesn't seem to have an option to add a primary key after the table is created (not sure why?).
I guess I could skip the primary key entirely and instead add a unique index after the table is populated. Is there any disadvantage to that?
Or any better solution recommended?

From a purely technical point of view, a unique index has exactly the same effect as a primary key. (In SQLite, some primary keys allow NULLs for backwards compatibility.)
The only difference is that the primary key constraint does not show up in the table definition itself, which might be a bad thing for documentation purposes.
Also see Is CREATE UNIQUE INDEX or INTEGER PRIMARY KEY more performant in SQLite.

Run the bulk insert inside a transaction and you'll avoid quite a few things that slow inserts down.
I just found this, which is a great write-up on how to speed things up in SQLite:
Improve INSERT-per-second performance of SQLite?
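
For illustration, here is a minimal sketch of the combined approach using Python's built-in sqlite3 module; the table, column, and index names are made up, and the data source is a stand-in:

    import sqlite3

    conn = sqlite3.connect("data.db")
    cur = conn.cursor()

    # Create the table without a primary key; uniqueness is enforced later.
    cur.execute("CREATE TABLE items (id INTEGER, payload TEXT)")

    # Wrap the whole bulk insert in one transaction so SQLite does not
    # commit (and fsync) after every statement.
    with conn:
        cur.executemany(
            "INSERT INTO items VALUES (?, ?)",
            ((i, f"row {i}") for i in range(1_000_000)),  # stand-in data source
        )

    # Build the unique index once the table is populated; a single bulk
    # sort is much cheaper than maintaining the index on every insert.
    cur.execute("CREATE UNIQUE INDEX idx_items_id ON items(id)")
    conn.close()

A side benefit: if a duplicate slipped in during the load, the CREATE UNIQUE INDEX will fail, which doubles as an integrity check.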

Related

Dynamodb using partition key in a global secondary index

New to DynamoDB. I have the partition key group_id and the sort key groupid_storeid_sortk.
I want to set up an additional access pattern with group_id and store_addrss_sortk.
Will using the partition key again in the secondary index have any impact on performance, or would it be better to create a new attribute as the secondary key, even though that would duplicate data?
Thank you
It’s fine to use the same partition key attribute again as the PK for the GSI. No problem there.
For the future: You may want to watch some videos on single-table design and start using PK/SK as generic names since you might want to overload what’s inside them for different items. And then you might want GSI1PK/GSI1SK as the GSI keys.
That’s a style thing when you aim for some optimizations single-table design can bring.
An index is simply another table that you don't have to manage yourself. When you create an index, the service (DynamoDB, for example) creates a new table for you and manages the synchronization of the data between the tables.
In DynamoDB you have two types of secondary indexes, global and local. If you use the same partition key, you can use either option. However, you have to define a local secondary index (LSI) when you create the table; you can't add it later. Only global secondary indexes (GSIs) can be added after the table is created. You can read more about this in the DynamoDB documentation.
Regarding performance, you need to consider the cost (read/write capacity) on top of the usual time considerations. Check whether you are writing a lot to the table, not only reading a lot. Based on that, you can carefully plan the projection of the data into the new index. Remember that writes are about 10 times more expensive and slower than reads. You can read more about projection best practices here.
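
As a rough sketch (not a drop-in solution), adding such a GSI after the fact could look like this with boto3; the table name and index name here are hypothetical:

    import boto3

    client = boto3.client("dynamodb")

    # Add a GSI that reuses group_id as its partition key and a different
    # attribute as its sort key. GSIs can be added after table creation;
    # LSIs cannot.
    client.update_table(
        TableName="groups",  # hypothetical table name
        AttributeDefinitions=[
            {"AttributeName": "group_id", "AttributeType": "S"},
            {"AttributeName": "store_addrss_sortk", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "group_id-store_addrss_sortk-index",
                    "KeySchema": [
                        {"AttributeName": "group_id", "KeyType": "HASH"},
                        {"AttributeName": "store_addrss_sortk", "KeyType": "RANGE"},
                    ],
                    # KEYS_ONLY keeps write costs down; project more
                    # attributes only if your reads need them.
                    "Projection": {"ProjectionType": "KEYS_ONLY"},
                    # Provisioned-capacity tables also need a
                    # ProvisionedThroughput entry here.
                }
            }
        ],
    )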

Does ClickHouse support quick retrieval of any column?

I tried to use ClickHouse to store 4 billion rows, deployed on a single machine with a 48-core CPU, 256 GB of memory, and a mechanical hard disk.
My data has ten columns, and I want to quickly search any column through SQL statements, such as:
select * from table where key='mykeyword'; or select * from table where school='Yale';
I use ORDER BY to establish a sorting key: ORDER BY (key, school, ...)
But when I search, only the first field in the ORDER BY (key) performs well. Searching on the other fields is very slow or even runs out of memory (the memory allocation is already large enough).
So I ask the experts: does ClickHouse support high-performance searching on each column, with per-column indexes similar to MySQL? I also tried creating a secondary index for each column, but performance did not improve.
You should try to understand how sparse primary indexes work,
and how exactly the right ORDER BY clause in CREATE TABLE helps your query performance.
ClickHouse will never work the same way as MySQL.
Try to use PRIMARY KEY and ORDER BY in the CREATE TABLE statement,
and put fields with low value cardinality first in the PRIMARY KEY.
Don't select everything:
SELECT * ...
is really an antipattern.
Moreover, a secondary data skipping index may help you (but I'm not sure):
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes
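
A sketch of what that could look like, sent through the clickhouse-driver Python client; the table name, the city column, and the index settings are illustrative assumptions, not a tested schema:

    from clickhouse_driver import Client

    client = Client("localhost")

    # Put the low-cardinality column (school) before the high-cardinality
    # one (key) in the sorting key, so the sparse primary index can prune
    # granules for more of your queries.
    client.execute("""
        CREATE TABLE mytable (
            key    String,
            school String,
            city   String
            -- ... plus the other seven columns
        )
        ENGINE = MergeTree
        PRIMARY KEY (school, key)
        ORDER BY (school, key)
    """)

    # A data skipping index for a column outside the sorting key; whether
    # it helps depends on how values cluster within granules.
    client.execute("""
        ALTER TABLE mytable
        ADD INDEX city_bf city TYPE bloom_filter GRANULARITY 4
    """)

    # Select only the columns you need instead of SELECT *.
    rows = client.execute("SELECT key, city FROM mytable WHERE school = 'Yale'")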

Question about a unique primary key vs unique primary index

I searched many different topics here on Stack Overflow related to this question, but I can't seem to find an answer that makes sense to me. I come from a MS SQL Server background, where just about every table has a primary key and many more have foreign keys. In Teradata, I understand there are primary keys and also unique primary indexes. It is my understanding that values for a unique primary index in Teradata can't be NULL. If that is the case, why even use primary keys at all? Is it just to enforce parent-child relationships in the table structures, as in other RDBMSs, such as a sales order header table (PK) and a sales order detail table (FK)? Maybe a good use for a UPI without a PK would be a table that has no relationships to other tables, like a "reporting" table?
Just want to make sure I understand correctly. Thank you for the help.

How do I query DynamoDB when I want to consider the sort key but not the partition key?

I can't figure out how to do this in DynamoDB.
I have a table with data something like this:
ID    Updated     other fields...
1200  2017-12-11  ...
1201  2018-02-05  ...
1205  2018-01-05  ...
1206  2018-01-11  ...
1210  2018-02-15  ...
1212  2018-02-10  ...
The partition key is 'ID' and I have a sort key of 'Updated'.
I want to retrieve the records where Updated is greater than "2018-02-01", say.
I can't query on just 'Updated' alone, it complains with Query condition missed key schema element: ID. I understand what that means, but I'm not sure how to do this properly.
I've tried adding various indexes and then querying on the index, including having only the 'Updated' field as the partition key, but then I can't query for a range of values, only for an exact match on the partition key.
So, how do I query across multiple partitions for a condition?
I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?
Use a scan
Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will this cost? Don't just dismiss scans - do the maths.
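For example, a paginated scan filtered on Updated might look like this with boto3 (the table name is hypothetical; note that a FilterExpression trims the results but you still pay RCUs for every item read):

    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table("mytable")  # hypothetical

    # Paginate through the scan; the filter reduces what comes back, not
    # the read capacity consumed.
    items, kwargs = [], {"FilterExpression": Attr("Updated").gt("2018-02-01")}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]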
Archive data
If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.
Partition by date
There are various strategies you can use to improve your table performance if you really want to use a query. For example you could have a partition key of YYYY-MM and sort key of datetime (down to nanosecond). That way you can retrieve whole months of data in one query, whilst still being able to sort for specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
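A sketch of what querying such a scheme could look like with boto3, assuming hypothetical YearMonth and Datetime attributes (pagination omitted):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("mytable")  # hypothetical

    # One query per month partition that can contain matches; the
    # application merges the result pages itself.
    results = []
    for month in ("2018-02", "2018-03"):
        resp = table.query(
            KeyConditionExpression=Key("YearMonth").eq(month)
            & Key("Datetime").gt("2018-02-01T00:00:00")
        )
        results.extend(resp["Items"])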
Nice problem, not so nice solution! :)
• You cannot do a query without a condition on the partition key.
• You need the Updated column to be a sort key, either in the table "schema" or in an index. If it is no longer a sort key, you won't be able to efficiently query for Updated > VALUE.
So you need a constant partition key and Updated as the sort key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated
Of course, you'll lose some scalability, because your entire index will live in one partition, but using a KEYS_ONLY projection should give you enough room.
Should you really need more scalability, consider having PK values like C0, C1, ..., Cn, iterate through queries for each partition key, then merge the results (divide and conquer).
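A sketch of querying that index with boto3 (the index name and the "ALL" constant are assumptions, and pagination is omitted):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("mytable")  # hypothetical

    # Every item carries ConstantColumn = "ALL", so a single query on the
    # GSI covers the whole table, ordered and filtered by Updated.
    resp = table.query(
        IndexName="ConstantColumn-Updated-index",
        KeyConditionExpression=Key("ConstantColumn").eq("ALL")
        & Key("Updated").gt("2018-02-01"),
    )
    for item in resp["Items"]:
        print(item["ID"], item["Updated"])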
I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?
Your query will be more complex to write, as you might have to issue multiple queries covering more than one partition to fill your result page.
But as you pointed out, this is cheaper than performing a full table scan.

Query a range of primary keys in dynamodb

I want to make sure I get this right.
Based on what I've read so far, you can NOT query a range of primary keys in DynamoDB.
For example, if you have a numeric primary key such as your customers' phone numbers, you cannot get items with primary keys larger than 3010000000, or between 3010000000 and 3020000000.
To make it clear, I am not talking about the range key; my question is about the primary key itself.
So if this is true, there are lots of use cases - like items between dates, or users registered after some point - that require table scans.
Is this correct?
EDIT: OK, one solution that comes to mind would be to use only one dummy hash_key for the primary key and insert the real keys (like the phone numbers above) as range keys. Does this work?
Yes, you cannot get a range of hash_key values with DynamoDB. But this does not mean you are stuck with your use case.
Let's take the 'dates' use case and say you are building a logging application. You are likely to get lots of records each day.
If you use the day as the hash_key, you can put the full timestamp as the range_key. This way, you can split your query into day-sized chunks and get what you want.
Of course, to get optimal results you will need to know your queries well. For example, what is the typical range? With DynamoDB, as with other key:value stores, you usually model your data with the queries in mind, unlike SQL, where you model with only the data in mind.
Of course, if your items span a larger or shorter range, just adapt this scheme.
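A minimal sketch of that chunked query with boto3, assuming a hypothetical logs table with day as the hash_key and ts as the range_key:

    from datetime import date, timedelta

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("logs")  # hypothetical table

    def query_days(start: date, end: date):
        """Fetch all records between start and end, one day partition at a time."""
        items = []
        day = start
        while day <= end:
            # Boundary days could further constrain ts with Key("ts").between(...).
            resp = table.query(KeyConditionExpression=Key("day").eq(day.isoformat()))
            items.extend(resp["Items"])
            day += timedelta(days=1)
        return items

    records = query_days(date(2018, 2, 1), date(2018, 2, 15))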
Concerning the "all under the same dummy hash_key" idea: it sounds like a terrible one. Sorry. I am not a hundred percent sure how it really works, but I know DynamoDB does some sharding across so-called partitions. I believe 1 hash_key <=> 1 partition. Moreover, if you read the documentation closely, you'll notice that the provisioned throughput is split evenly between the partitions, so each partition is only allocated a fraction of what you pay for.
Without modifying the keys of your primary DynamoDB table, you can add a GSI with a constant partition key and your primary table's partition key as its sort key.
This will enable you to query on the index's sort key and use the resulting partition keys to get the data you're looking for.
