Does clickhouse support quick retrieval of any column? - bigdata

I tried to use ClickHouse to store 4 billion rows, deployed on a single machine with a 48-core CPU, 256 GB of memory, and a mechanical hard disk.
My data has ten columns, and I want to search quickly on any column with SQL statements such as:
select * from table where key='mykeyword'; or select * from table where school='Yale';
I defined the sort key with ORDER BY (key, school, ...).
But only searches on the first field of the sort key (key) are fast. Searches on the other fields are very slow or even run out of memory (the memory limit is already set high).
So I would like to ask the experts: does ClickHouse support high-performance search on every column through indexes, similar to MySQL? I also tried to create a secondary index for each column with INDEX, but the performance did not improve.

You should try to understand how sparse primary indexes work
and how exactly the right ORDER BY clause in CREATE TABLE helps your query performance.
ClickHouse will never work the same way as MySQL.
Try to use both PRIMARY KEY and ORDER BY in the CREATE TABLE statement,
and put fields with low cardinality first in the PRIMARY KEY.
Don't try to select all columns with
SELECT * ...
it really is an antipattern.
Moreover, a secondary data skipping index may help you (but I'm not sure):
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes
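To see why only the first sort-key column is fast, here is a toy model of a sparse index in Python. It is a sketch under simplifying assumptions, not ClickHouse's actual implementation: rows are sorted by (key, school), split into fixed-size granules, and the index keeps only the first row of each granule. The granule size, the sample data, and the function names are made up for illustration.

```python
import bisect

GRANULE = 4  # assumption: tiny granule for the demo (ClickHouse defaults to 8192 rows)

# Rows sorted by the sort key (key, school), as ORDER BY (key, school) would store them.
rows = sorted(
    (f"k{i:03d}", school)
    for i, school in enumerate(["Yale", "MIT", "Harvard", "Stanford"] * 8)
)

# Sparse index: one mark per granule, holding that granule's first sort-key value.
marks = [rows[i] for i in range(0, len(rows), GRANULE)]


def granules_for_key(key):
    """Granules that may contain `key`, found by binary search on the marks."""
    first_keys = [m[0] for m in marks]
    lo = max(bisect.bisect_left(first_keys, key) - 1, 0)
    hi = bisect.bisect_right(first_keys, key)
    return list(range(lo, hi))


def granules_for_school(school):
    """With a high-cardinality first column the marks say nothing useful
    about `school`, so every granule has to be read."""
    return list(range(len(marks)))


print(granules_for_key("k010"))          # a single granule
print(len(granules_for_school("Yale")))  # all granules
```

In this model a filter on key touches one granule out of eight, while a filter on school touches all of them, which matches the behaviour described in the question.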

Related

Filtering results from ClickHouse using values from dictionaries

I'm a little unfamiliar with ClickHouse and am still learning it by trial and error. I've got a question about it.
Talking about the star schema of data representation, with dimensions and facts. Currently, I keep everything in PostgreSQL, but OLAP queries with aggregations are starting to show bad timings, so I'm going to move some fact tables to ClickHouse. Initial tests of CH show incredible performance; however, in real life the queries should include joins to dimension tables from PostgreSQL. I know I can connect them as dictionaries.
Question: I found that using dictionaries I can make requests similar to LEFT JOINs in a good old RDBMS, i.e. values from the result set can be joined with the corresponding values from the dictionary. But can they be filtered by some restrictions on dictionary keys (as in an INNER JOIN)? For example, in PostgreSQL I have a table users (id, name, ...) and in ClickHouse I have a table visits (user_id, source, medium, session_time, timestamp, ...) with metrics about their visits to the site. Can I make a query to CH to fetch aggregated metrics (number of daily visits for a given date range) for users whose name matches some condition (LIKE "EVE%" for example)?
It sounds like the ODBC table function is what you're looking for. ClickHouse has a bunch of table functions which work like Postgres foreign tables. The setup is similar to dictionaries, but you gain the traditional JOIN behavior. It currently doesn't show up in the official documentation. You can refer to this: https://github.com/yandex/ClickHouse/blob/master/dbms/tests/integration/test_odbc_interaction/test.py#L84 . And in the near future (this year), ClickHouse will support the standard JOIN statement.
The dictionary will basically replace the value first. As I understand it your dictionary would be based off your users table.
Here is an example. Hopefully I am understanding your question.
select dictGetString('accountidmap', 'domain', tuple(toString(account_id))) AS domain, sum(session) as sessions from session_distributed where date = '2018-10-15' and like(domain, '%cats%') group by domain
This is a real query on our database, so if there is something you want to try or confirm, let me know.
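The idea of looking a value up first and then filtering on it can be mimicked in plain Python. This is only a sketch of the semantics, not ClickHouse code: the users mapping stands in for a dictionary built from the users table, and the sample rows and the daily_visits helper are hypothetical.

```python
from collections import Counter

# Hypothetical dimension data, standing in for a dictionary built from `users`.
users = {1: "EVELYN", 2: "BOB", 3: "EVE"}

# Hypothetical fact rows from `visits`: (user_id, date).
visits = [
    (1, "2018-10-15"), (2, "2018-10-15"), (3, "2018-10-15"),
    (1, "2018-10-16"), (4, "2018-10-16"),
]


def daily_visits(prefix):
    """Count visits per day for users whose looked-up name starts with `prefix`."""
    counts = Counter()
    for user_id, day in visits:
        name = users.get(user_id, "")  # missing keys fall back to a default value
        if name.startswith(prefix):    # plays the role of: name LIKE 'EVE%'
            counts[day] += 1
    return dict(counts)


print(daily_visits("EVE"))
```

Rows whose key is missing from the mapping, or whose name does not match, drop out of the result, which is the INNER JOIN style restriction the question asks about.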

Performance issue with primary key

I am populating a medium-sized table (60GB, 500 million rows). The process completes reasonably fast if the table has no primary key (~1 hour using bulk insert), but it takes ~10 times longer if I create that table with the primary key. I assume this is because it takes time to verify the uniqueness constraint and also update the index at each insert.
I thought a good workaround would be to add the primary key later, since indexation on the table that's already populated should be much faster compared to incremental indexation. But sqlite doesn't seem to have the option to add primary key after the table is created (not sure why?).
I guess I could just not use a primary key at all, and instead just add a unique index after the table is populated. Is there any disadvantage to that?
Or any better solution recommended?
From a purely technical point of view, a unique index has exactly the same effect as a primary key. (In SQLite, some primary keys allow NULLs for backwards compatibility.)
The only difference is that, with a separate unique index, the key constraint does not show up in the table definition itself, which might be a bad thing for documentation purposes.
Also see "Is CREATE UNIQUE INDEX or INTEGER PRIMARY KEY more performant in SQLite".
Run the bulk insert inside a transaction and you'll avoid quite a few things that slow inserts down.
I just found this which is a great write up on how to speed things up in sqlite3.
Improve INSERT-per-second performance of SQLite?
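The two suggestions above (load inside one transaction, add the unique index afterwards) can be combined. Here is a minimal sketch using Python's built-in sqlite3 module; the table name, column names, row count, and in-memory database are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
conn.execute("CREATE TABLE items (id INTEGER, payload TEXT)")  # no PRIMARY KEY yet

rows = ((i, f"row-{i}") for i in range(100_000))

# `with conn` wraps the whole bulk insert in a single transaction.
with conn:
    conn.executemany("INSERT INTO items (id, payload) VALUES (?, ?)", rows)

# Build the unique index once, after all the data is loaded.
conn.execute("CREATE UNIQUE INDEX items_id_uq ON items (id)")

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)
```

Once the index exists, inserting a duplicate id raises sqlite3.IntegrityError, so uniqueness is still enforced for later writes.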

Retrieve all items with a column beginning with specified text on DynamoDB

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?
The use case you described is not directly supported by DynamoDB's Query operation today: DynamoDB typically requires you to specify a hash key and then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for use cases such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as the hash key and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter) for each bucket_id value (0 to 99) and use BEGINS_WITH on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.
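The scatter-gather flow described above can be sketched in plain Python. This is an in-memory stand-in rather than DynamoDB code: the gsi dictionary plays the role of the global secondary index, and put_item, query_bucket, and autocomplete are hypothetical helper names. In a real application each query_bucket call would be a DynamoDB Query against the index, probably issued in parallel.

```python
import bisect
import random

NUM_BUCKETS = 100  # bucket_id ranges over 0..99, as in the answer above

# Stand-in for the GSI: one list of (Name, Id) per bucket_id, kept sorted by Name.
gsi = {b: [] for b in range(NUM_BUCKETS)}


def put_item(item_id, name):
    """Assign a random bucket_id on write and keep the bucket sorted by Name."""
    bucket_id = random.randrange(NUM_BUCKETS)
    bisect.insort(gsi[bucket_id], (name, item_id))


def query_bucket(bucket_id, prefix):
    """Emulates one query: hash key = bucket_id, range key begins with prefix."""
    items = gsi[bucket_id]
    start = bisect.bisect_left(items, (prefix,))
    result = []
    for name, item_id in items[start:]:
        if not name.startswith(prefix):
            break
        result.append((name, item_id))
    return result


def autocomplete(prefix):
    """Scatter one query per bucket, then gather and re-sort the results."""
    hits = []
    for bucket_id in range(NUM_BUCKETS):
        hits.extend(query_bucket(bucket_id, prefix))
    return sorted(hits)


for i, n in enumerate(["alice", "albert", "bob", "alfred", "carol"]):
    put_item(i, n)
print(autocomplete("al"))
```

The result does not depend on which bucket each item landed in, because the gather step merges and re-sorts all per-bucket responses.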
The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search all values, which doesn't work well. The "simplest" solution to this is to have a known hash key; then you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the "Understand Partition Behavior" documentation.
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.

How can I improve performance while altering a large mysql table?

I have 600 million records in a table and I am not able to add a column to it; every time I try, it times out.
Suppose your MySQL database has a giant table with 600 million rows. Any schema operation, such as adding a unique key, altering a column, or even adding one more column, is a very cumbersome process that takes hours and sometimes ends in a server timeout. To overcome that, one has to come up with a very good migration plan, one of which I am jotting down below.
1) Suppose there is a table Orig_X to which I have to add a new column colNew with a default value of 0.
2) A dummy table Dummy_X is created, which is a replica of Orig_X except that it has the new column colNew.
3) Data is inserted from Orig_X into Dummy_X with the following settings.
4) Autocommit is set to zero, so that data is not committed after each insert statement, which would hinder performance.
5) Binary logging is set to zero, so that no data is written to the binary logs.
6) After the data has been inserted, both settings are set back to one.
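SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
-- colNew is filled with its default value 0
INSERT INTO Dummy_X (col1, col2, col3, colNew)
SELECT col1, col2, col3, 0 FROM Orig_X;
-- commit before changing sql_log_bin, which cannot be modified inside a transaction
COMMIT;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;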
7) Now the primary key can be created, with the newly added column as part of it.
8) All the unique keys can now be created.
9) We can check the status of the server by issuing the following command:
SHOW MASTER STATUS
10) It's also helpful to issue FLUSH LOGS so MySQL will clear the old logs.
11) To boost performance when running similar queries, one should have the query cache turned on (on MySQL versions that still have it).
SHOW VARIABLES LIKE 'have_query_cache';
query_cache_type = 1
Above were the steps of the migration strategy for the large table; below I am writing some steps to improve the performance of the database and its queries.
1) Remove any unnecessary indexes on the table, and pay particular attention to UNIQUE indexes, as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
2) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them all at once after the data is loaded, InnoDB is able to apply a pre-sort and bulk load process, which is faster and typically results in more compact indexes.
3) More memory can actually help in performance optimization. If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly on your server).
4) Normally the indexes on your table are updated after every insert. That's some heavy lifting for your database, but when your queries are wrapped inside a transaction, the work is committed once for the entire bulk rather than after each statement, which saves a lot of work.
5) Older MySQL servers often have query caching enabled (the query cache was removed in MySQL 8.0). It's one of the most effective methods of improving performance that is quietly handled by the database engine. When the same query is executed multiple times, the result is fetched from the cache, which is quite fast.
6) Using the EXPLAIN keyword can give you insight on what MySQL is doing to execute your query. This can help you spot the bottlenecks and other problems with your query or table structures. The results of an EXPLAIN query will show you which indexes are being utilized, how the table is being scanned and sorted etc...
7) If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
8) In every table have an id column that is the PRIMARY KEY, AUTO_INCREMENT and one of the flavors of INT. Also preferably UNSIGNED, since the value cannot be negative.
9) Even if you have a users table that has a unique username field, do not make that your primary key. VARCHAR fields as primary keys are slower. And you will have a better structure in your code by referring to all users by their ids internally.
10) Normally when you perform a query from a script, it will wait for the execution of that query to finish before it can continue. You can change that by using unbuffered queries. This saves a considerable amount of memory with SQL queries that produce large result sets, and you can start working on the result set immediately after the first row has been retrieved as you don't have to wait until the complete SQL query has been performed.
11) With database engines, disk is perhaps the most significant bottleneck. Keeping things smaller and more compact is usually helpful in terms of performance, to reduce the amount of disk transfer.
12) The two main storage engines in MySQL are MyISAM and InnoDB. Each has its own pros and cons. MyISAM is good for read-heavy applications, but it doesn't scale very well when there are a lot of writes. Even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is very fast at calculating SELECT COUNT(*) types of queries. InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications. But it supports row-based locking, which scales better. It also supports some more advanced features such as transactions.

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way is to retrieve chunks of data from a large table. Currently, we use the fact that the table has an autoincrement PK, so we get chunks of 1000 rows until it starts to return nothing. Kind of like (in pseudo language):
i = 0
while (true)
{
    // half-open range, so consecutive chunks do not overlap at the boundary row
    SELECT col1, col2, col3 FROM mytable WHERE pk >= i AND pk < i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need to SELECT COUNT(*), which would be a big performance killer, and we just move up the pk values until we stop getting any results. This has its con: if we have a hole greater than 20,000 values somewhere in the table, it will stop indexing, assuming it has reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns would be:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which would be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.
EDIT: added few explanations as per question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically it forms a correct Lucene Document from this moment on). After that we can reindex things in batches (either manually or by hand), by providing relevant period ranges. This, to a certain extent, applies to Lucene version changes, too.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT(table_name) to get the last generated id, and use that instead of MAX() or Count() - this method should blow the other two away :)
