ClickHouse Extremely Slow - OLAP

I'm attempting to benchmark ClickHouse against Postgres for API use.
The dataset is ~40M records:
SELECT
name,
web_url,
image_url,
visitor_id,
web_id
FROM analytics WHERE event_id = (
SELECT uuid
FROM interaction_logs
WHERE path_id = '12bc4ca0ed14ds3a8ab5d2061c2e551a18a936f14d' )
ORDER BY visitor_id ASC LIMIT 0, 10
However, this query takes approximately 2 s to run, which is far too slow for production under any user load. Is this normal? If not, what can I do to improve it?
The server is an AWS c5a.8xlarge instance (32 vCPU / 64 GB RAM).
I've tried setting max_threads and the maximum memory per query; neither had any impact on performance.
The actual table creation query:
CREATE TABLE analytics(
`uuid` UUID,
...
)
ENGINE=MergeTree()
ORDER BY uuid;
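With ORDER BY uuid, the WHERE event_id = ... filter cannot use the primary index, so every query scans all ~40M rows. One thing worth trying is making the filtered column the leading sorting-key column - a sketch only (the event_id/visitor_id columns and their types are assumptions based on the query, not the real schema):

```sql
-- Sketch: lead the sorting key with event_id so the primary index can
-- prune granules for the WHERE event_id = ... filter.
CREATE TABLE analytics(
    `uuid` UUID,
    `event_id` UUID,        -- assumed type
    `visitor_id` UInt64,    -- assumed type
    ...
)
ENGINE = MergeTree()
ORDER BY (event_id, visitor_id);
```

Putting visitor_id second also lets the ORDER BY visitor_id ... LIMIT in the query read rows in a useful order within each event_id.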

Related

How to design a Cosmos DB to do efficient queries on a non-partition key as well

I am new to Cosmos DB and facing an issue designing my database.
I have data with a structure similar to the one below:
{
"userId": "64_CHAR_ID",
"gpId": "34_CHAR_ID"
... Other data
}
Currently my database is partitioned on userId, since so far all queries were by userId. Now I want to query by gpId when the userId is not known, so it ends up as a cross-partition query that takes a lot of wall-clock time (more than 5 minutes) and RUs (more than 3,000).
The query I am using is:
SELECT * FROM c WHERE c.gpId='SOME_GPID'
According to the Microsoft documentation we should avoid cross-partition queries when the dataset is large, and in my case the dataset is quite large (~80 GB).
So what would be a better design/strategy to query the data by gpId in Cosmos DB? My requirement is to query by gpId in near real time.
Note: the current limit is set to 500,000 RU/s, with autoscale enabled.
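A common pattern for this situation (an assumption about the workload, not something stated above) is to maintain a second container partitioned on /gpId, kept in sync with the main container via the change feed. The lookup then becomes a single-partition query:

```sql
-- Against a hypothetical second container (say, dataByGpId) whose
-- partition key is /gpId, this same query is routed to one partition
-- instead of fanning out across all of them:
SELECT * FROM c WHERE c.gpId = 'SOME_GPID'
```

The trade-off is extra storage and write RUs for the duplicate container, in exchange for predictable, low-RU point reads by gpId.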

Cloud Datastore: 503 Service Unavailable with 1000+/s concurrent transactions

When trying to save entities within a transaction to Datastore at a rate of ~1,000 transactions per second or more, Datastore consistently returns 503 Service Unavailable until the load backs off to a lower rate.
I'm using the python-datastore client library in a web service to save millions of unique entities to Datastore (NOT Firestore in Datastore mode).
I've tried using the recommended "500/50/5" rule to gradually ramp up to 1000 operations per second and more, but Datastore consistently peaks at the same level irrespective of how gradually the load is increased.
I've also observed that the same Datastore transaction operations are perfectly sustained at 750 operations per second without issues.
My understanding is that Datastore can handle millions of operations - does this also apply to transactional operations?
Are there any limits or constraints on transaction call volume?
Any suggestions or feedback as to how to tackle this issue would be greatly appreciated!
Here's a sample data model for an "Offer" Kind that's written to Datastore. "id", a UUID, is the entity's key.
{
"id": "a0cf7d66-5fab-495f-a73c-617570628fd6",
"loyalty_id": "191200101829",
"status": "eligible",
"ce_promotion_id": "6452",
"hybris_promotion_id": null,
"offer_promotion_id": "47032",
"ce_campaign_id": "0382",
"promotion_type": "offer",
"display_order": 1,
"activation_date": null,
"deactivation_date": null,
"expiration_date": "2021-04-12T08:00:00Z",
"scheduled_expiration_date": "2021-04-13T07:56:00Z",
"redemption_date": null,
"created_date": "2021-04-11T19:15:28.067053Z",
"update_date": "2021-04-11T19:15:28.067083Z"
}
I also have 3 composite indexes:
loyalty_id ASC status ASC expiration_date DESC
status ASC scheduled_expiration_date DESC
loyalty_id ASC expiration_date DESC
If the date properties increase or decrease monotonically, that would create a hotspot. To increase throughput, you would need to make the property non-monotonic when it's written to the database. Here's an approach from the docs:
For instance, if you want to query for entries by timestamp but only
need to return results for a single user at a time, you could prefix
the timestamp with the user id and index that new property instead.
This would still permit queries and ordered results for that user, but
the presence of the user id would ensure the index itself is well
sharded.
Another approach is to leave the property as is but turn off all indexes on that property (including the built-in one), except composite indexes that put something well-distributed in front of it. From your example, these indexes might fit that model if loyalty_id is fairly random:
loyalty_id ASC status ASC expiration_date DESC
loyalty_id ASC expiration_date DESC
Note: Indexing order matters. It's important here that the monotonic property come last in the composite index.
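As a concrete illustration of the quoted doc passage, suppose a hypothetical composite property user_ts that stores "<user_id>#<timestamp>" (both the property name and the key format are made up for this example). In GQL, the per-user, time-ordered query might look like:

```sql
-- user_ts is a hypothetical property written as '<user_id>#<RFC3339 timestamp>',
-- e.g. 'u123#2021-04-11T19:15:28Z'. Prefixing with the (well-distributed)
-- user id keeps the index sharded while preserving per-user time ordering.
SELECT * FROM Offer
WHERE user_ts >= 'u123#2021-04-01T00:00:00Z'
  AND user_ts <  'u123#2021-05-01T00:00:00Z'
ORDER BY user_ts DESC
```

Because every index entry starts with a user id rather than a bare timestamp, writes no longer pile onto a single "latest timestamp" tablet.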

NHibernate Query slows other queries

I'm writing a program that uses two database queries via NHibernate. The first is a large SELECT with two joins (the big SELECT query) returning about 50,000 records; it takes about 30 seconds. The program then iterates over those 50,000 records and runs a query against each one - a small COUNT query.
There are two interesting things, though:
If I run the small COUNT query before the big SELECT, the COUNT query takes about 10 ms, but if I run it after the big SELECT query it takes 8-9 seconds. Furthermore, if I reduce the complexity of the big SELECT query, I also reduce the execution time of the COUNT query afterwards.
If I run the big SELECT query in SQL Server Management Studio it takes 1 second, but from the ASP.NET application it takes 30 seconds.
So there are two main questions: why is the query taking so long to execute in code when it's so fast in SSMS, and why is the big SELECT query affecting the small COUNT queries afterwards?
I know there are many possible answers to this problem, but I have googled a lot and this is what I have tried:
Setting the SET parameters of the ASP.NET application and SSMS to the same values, to avoid different query plans
Clearing the SSMS cache to rule out caching as the reason for the good SSMS result - same 1-second result after the cache clear
The big SELECT query:
var subjects = Query
.FetchMany(x => x.Registrations)
.FetchMany(x => x.Aliases)
.Where(x => x.InvalidationDate == null)
.ToList();
The small COUNT query:
Query.Count(x => debtorIRNs.Contains(x.DebtorIRN.CodIRN) && x.CurrentAmount > 0 && !x.ArchivationDate.HasValue && x.InvalidationDate == null);
As it turned out, the above-mentioned FetchMany calls were unavoidable for the program, so I couldn't just skip them. The first significant improvement I achieved was turning off the application's logging (as mentioned, the code above is just a fragment); performance without logs was about 50% faster, but it still took a considerable amount of time. So I decided to avoid NHibernate for this query and wrote a plain SQL query read through a data reader, which I then parsed into my objects. I was able to reduce the execution time from 2.5 days (50,000 * 4 s -> number of small queries * former execution time of one small query) to 8 minutes.
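For comparison, the 50,000 per-record COUNTs can usually be collapsed into one set-based query. The table and column names below are guesses reconstructed from the C# snippet, not the real schema:

```sql
-- Hypothetical set-based replacement for the per-record COUNT:
-- one round trip computes the open-debt count for every debtor IRN at once.
SELECT d.CodIRN, COUNT(*) AS open_debts
FROM Debts d
WHERE d.CurrentAmount > 0
  AND d.ArchivationDate IS NULL
  AND d.InvalidationDate IS NULL
GROUP BY d.CodIRN;
```

The application can then look up each record's count from the grouped result in memory instead of issuing 50,000 separate queries.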

Improving SQLite Query Performance

I have run the following query in SQLite and SQL Server. On SQLite the query has never finished running - I have let it sit for hours and it still continues to run. On SQL Server it takes a little less than a minute. The table has several hundred thousand records. Is there a way to improve the performance of the query in SQLite?
update tmp_tbl
set prior_symbol = (select o.symbol
from options o
where o.underlying_ticker = tmp_tbl.underlying_ticker
and o.option_type = tmp_tbl.option_type
and o.expiration = tmp_tbl.expiration
and o.strike = (select max(o2.strike)
from options o2
where o2.underlying_ticker = tmp_tbl.underlying_ticker
and o2.option_type = tmp_tbl.option_type
and o2.expiration = tmp_tbl.expiration
and o2.strike < tmp_tbl.strike));
Update: I was able to get what I needed done using some Python code, handling the data mapping outside of SQL. However, I am puzzled by the performance difference between SQLite and SQL Server - I was expecting SQLite to be much faster.
When I ran the above query initially, neither table had any indexes other than a standard primary key, id, which is unrelated to the data. I created two indexes as follows:
create index options_table_index on options(underlying_ticker, option_type, expiration, strike);
and:
create index tmp_tbl_index on tmp_tbl(underlying_ticker, option_type, expiration, strike);
But that didn't help. The query still runs without producing any output - I let it go for nearly 40 minutes.
The table definition for tmp_tbl is:
create table tmp_tbl(id integer primary key,
symbol text,
underlying_ticker text,
option_type text,
strike real,
expiration text,
mid real,
prior_symbol real,
prior_premium real,
ratio real,
error_flag bit);
The definition of options table is similar but with a few more fields.
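Given the composite indexes above, one rewrite worth trying replaces the MAX() subquery with ORDER BY ... LIMIT 1, which SQLite can satisfy by walking the options index backwards instead of re-computing the maximum for every row. A sketch with the same semantics as the original (assuming strike is unique within each ticker/type/expiration group):

```sql
UPDATE tmp_tbl
SET prior_symbol = (SELECT o.symbol
                    FROM options o
                    WHERE o.underlying_ticker = tmp_tbl.underlying_ticker
                      AND o.option_type = tmp_tbl.option_type
                      AND o.expiration = tmp_tbl.expiration
                      AND o.strike < tmp_tbl.strike
                    ORDER BY o.strike DESC
                    LIMIT 1);
```

This turns each per-row lookup into a single descending index seek on options(underlying_ticker, option_type, expiration, strike), rather than the nested scan implied by the inner max() subquery.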

SQLite slow but barely using machine resources

I have a 500MB sqlite database of about 5 million rows with the following schema:
CREATE TABLE my_table (
id1 VARCHAR(12) NOT NULL,
id2 VARCHAR(3) NOT NULL,
date DATE NOT NULL,
val1 NUMERIC,
val2 NUMERIC,
val3 NUMERIC,
val4 NUMERIC,
val5 INTEGER,
PRIMARY KEY (id1, id2, date)
);
I am trying to run:
SELECT count(ROWID) FROM my_table
The query has now been running for several minutes which seems excessive to me. I am aware that sqlite is not optimized for count(*)-type queries.
I could accept this if my machine at least appeared to be hard at work. However, my CPU load hovers around 0-1%, and "Disk Delta Total Bytes" in Process Explorer is about 500,000.
Any idea if this can be sped up?
You should have an index on any field you query on, e.g. create index tags_index on tags(tag);. Then the query should be faster. Secondly, try normalizing your table and testing without the index, and compare the results.
In most cases, count(*) would be faster than count(rowid).
If you have a (non-partial) index, the row count can be computed faster from that index, because less data needs to be loaded from disk.
In this case, the primary key constraint already has created such an index.
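You can check which index the count actually uses with EXPLAIN QUERY PLAN (the exact output wording varies between SQLite versions):

```sql
EXPLAIN QUERY PLAN SELECT count(*) FROM my_table;
-- Typically reports a scan over a covering index, e.g.:
--   SCAN TABLE my_table USING COVERING INDEX sqlite_autoindex_my_table_1
```

sqlite_autoindex_my_table_1 is the implicit index SQLite creates for the (id1, id2, date) primary key constraint; if the plan instead shows a plain table scan, the index is not being used for the count.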
I would look at your disk I/O if I were you; I would guess it is quite high. Considering the size of your database, some data must be read from disk, which makes the disk the bottleneck.
Two ideas from my rudimentary knowledge of SQLite.
Idea 1: If memory is not a problem in your case and your application is launched once and runs several queries, I would try increasing the amount of cache used (there's a cache_size pragma available). After some googling I found this link about SQLite tweaking: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
Idea 2: I would try an auto-incremented primary key (on a single column) and tweak the query to SELECT COUNT(DISTINCT row_id) FROM my_table;. This could force the counting to run only on what's contained in the index.