Query against a materialized view with a uniqExact column fails with an out-of-memory error (memory limit exceeded)

I have a materialized view with the following structure:
CREATE MATERIALIZED VIEW events_daily
ENGINE = AggregatingMergeTree()
ORDER BY (
owner_id, user_id, event_type_id, event_day, field1, field2
)
AS SELECT
toStartOfDay(event_datetime) as event_day,
owner_id,
user_id,
event_type_id,
field1,
field2,
countState() as count,
uniqExactState(message_id, field1, field2) as unique_count
FROM raw_events
GROUP BY owner_id, user_id, event_type_id, event_day;
When I try to perform a SELECT for a user with a large number of records, I get a memory limit error:
Memory limit (for query) exceeded: would use 9.42 GiB (attempt to allocate chunk of 134217728 bytes), maximum: 9.31 GiB: While executing AggregatingTransform.
The SELECT query I'm trying to execute:
SELECT event_day,
event_type_id,
countMerge(count) as count,
uniqExactMerge(unique_count) as unique_count
FROM events_daily WHERE owner_id = xxx AND event_day >= '2022-04-05 00:00:00' AND event_day <= '2022-05-05 23:00:00'
GROUP BY owner_id, event_type_id, event_day
ORDER BY event_day, event_type_id
If I narrow the date condition (e.g. querying one week of data instead of one month), it works. It also works (and quite fast) if I remove uniqExactMerge from the SELECT clause.
So is there a way to make a query with uniqExactMerge() work on a heavy set of data? Or should I alter the whole architecture in some way?

If my memory is correct, by default a query can use up to 10 GB of RAM. You can increase this value: https://clickhouse.com/docs/en/operations/settings/query-complexity/#settings_max_memory_usage
You should also consider that uniqExact is a memory-intensive function. Try a lighter, approximate alternative such as uniq: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniq/
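If approximate counts are acceptable, a sketch of the same view built on the lighter uniq family might look like the following (this is an illustration, not the asker's exact schema: field1 and field2 are added to the GROUP BY since they appear unaggregated in the SELECT). At read time, uniqMerge replaces uniqExactMerge:

```sql
-- Sketch only: uniq() keeps a small, bounded approximate-counting state
-- instead of uniqExact's exact-but-heavy hash set.
CREATE MATERIALIZED VIEW events_daily
ENGINE = AggregatingMergeTree()
ORDER BY (owner_id, user_id, event_type_id, event_day, field1, field2)
AS SELECT
    toStartOfDay(event_datetime) AS event_day,
    owner_id,
    user_id,
    event_type_id,
    field1,
    field2,
    countState() AS count,
    uniqState(message_id, field1, field2) AS unique_count
FROM raw_events
GROUP BY owner_id, user_id, event_type_id, event_day, field1, field2;

-- Read side: uniqMerge instead of uniqExactMerge.
SELECT event_day,
       event_type_id,
       countMerge(count) AS count,
       uniqMerge(unique_count) AS unique_count
FROM events_daily
WHERE owner_id = xxx
  AND event_day BETWEEN '2022-04-05 00:00:00' AND '2022-05-05 23:00:00'
GROUP BY owner_id, event_type_id, event_day
ORDER BY event_day, event_type_id;
```

If the counts must stay exact, the other levers are raising the limit for just this query (SETTINGS max_memory_usage = ...) or letting the aggregation spill to disk via max_bytes_before_external_group_by; both are documented ClickHouse settings.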

Related

How to write Azure Cosmos DB SQL-API count distinct query without non-deterministic results

My Aim
I would like to count the number of distinct values of FileName for Azure Cosmos DB documents like the following in a single partition, using the SQL API.
{
"id": "some uuid",
"FileName": "file-1.txt",
"PartitionKeyField": "some key",
... other fields ...
}
My Test
I have uploaded 533,956 documents with 500,000 different FileName values, i.e. 33,956 documents have duplicate FileName (other fields are different). These are all uploaded with the same PartitionKeyField.
(I can only reproduce the behaviour below for 100,000s of documents).
I would like to count the number of distinct FileName values - so hope to get back 500,000.
Attempt 0 - Sanity Check
If I run the following query:
SELECT DISTINCT c.FileName
FROM c
WHERE c.PartitionKeyField = 'some key'
This returns 500,000 documents as expected.
Attempt 1
However, I don't need all the documents, I just need the count, so I try to run the following query
SELECT VALUE COUNT(1)
FROM (
SELECT DISTINCT c.FileName
FROM c
WHERE c.PartitionKeyField = 'some key'
) c2
But this gives 533,956 - i.e. it's as if DISTINCT has not been applied.
Attempt 2
Next I tried the following redundant GROUP BY, in an attempt to force the count to work:
SELECT c2.PartitionKeyField, COUNT(1)
FROM (
SELECT DISTINCT c.FileName
FROM c
WHERE c.PartitionKeyField = 'some key'
) c2
GROUP BY c2.PartitionKeyField
The result returned by this depends on how many RUs are allocated to the collection, e.g.:
Returns 500,007 at 9900 RUs
Returns 500,175 at 5000 RUs
Returns 500,441 at 3000 RUs
Returns 500,812 at 1000 RUs
Returns 501,406 at 400 RUs
Also, the above values are averages, e.g. for 9900 RUs results of 500,009 and 500,006 were also returned.
Questions
Is it possible to write the required "count" query in a deterministic way that doesn't depend on the number of RUs? (other than retrieving all documents as in Attempt 0?)
Why does increasing the number of RUs change the result of the query in Attempt 2?
Please try this SQL:
SELECT VALUE COUNT(c2)
FROM (
SELECT DISTINCT c.FileName
FROM c
WHERE c.PartitionKeyField = 'some key'
) c2

MS SQL Server paging with total rows returned

I'm trying to get an idea of the fastest way to do server-side paging using SQL Server 2012 with large data-sets whilst returning the EXACT total record count.
THAT IS THE EXACT QUESTION
I say this because the more I research this the more people seem to go off on tangents sort of missing the question itself. I know all about estimating total records, Indexing and hardware updates for performance but that is not what I'm asking.
Currently I'm working with the 'OFFSET x ROWS FETCH NEXT y ROWS ONLY' and I use currently the 'COUNT(*) OVER() AS TotalRows' in my query to attempt to use one query rather than two.
I have had to slightly doctor my queries to accommodate 'COUNT(*) OVER()' when they are DISTINCT in nature.
What are people's experiences with this, or do you use a different method altogether?
It would also be interesting to know what's best to add to the example to test the actual time it takes without any caching, compiling, etc. I have had a small play with this, but no matter which way I try I very rarely get the same results for the same query twice.
I'm guessing this is because of other things the OS is doing, hard drive speeds and caching.
NOTE: I have now added some timing, and in my tests the second approach takes nearly twice as long as the first.
EXAMPLE:
DECLARE @RowCount INT
SET @RowCount = 0
DECLARE @TestTable TABLE
(
Value1 INT, Value2 INT, Value3 INT
)
-- The above table contains no relevant data and no indexes etc on purpose
-- as that's not really the point
SET NOCOUNT ON
WHILE @RowCount < 200000
BEGIN
INSERT
INTO @TestTable
( Value1, Value2, Value3)
VALUES
(
ABS(CHECKSUM(NewId())) % 10,
ABS(CHECKSUM(NewId())) % 1000,
8
)
SET @RowCount = @RowCount + 1
END
SET NOCOUNT OFF
-- The following WONT work for DISTINCT Because the COUNT(*) OVER() is
-- calculated first.
--
-- SELECT DISTINCT Value1, Value2, Value3, COUNT(*) OVER() AS TotalRows
-- FROM #TestTable
CHECKPOINT;
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
DECLARE @StartTime DATETIME
DECLARE @EndTime DATETIME
SET STATISTICS IO ON
SET STATISTICS TIME ON
------------------------------------------------------------------------------------
SET @StartTime = GETDATE()
-- So we have the following using one query
--
SELECT Value1,
Value2,
Value3,
COUNT(*) OVER() AS TotalRows
FROM
(
SELECT DISTINCT Value1, Value2, Value3
FROM #TestTable
-- INNER JOIN ...
-- WHERE ...
) AS foo
ORDER BY Value1
OFFSET (200) ROWS FETCH NEXT (100) ROWS ONLY;
SELECT @EndTime = GETDATE()
SELECT DATEDIFF(ms, @StartTime, @EndTime) AS [Duration in milliseconds]
------------------------------------------------------------------------------------
SET @StartTime = GETDATE()
-- And this using TWO queries
--
-- Query ONE
SELECT DISTINCT Value1, Value2, Value3
FROM #TestTable
-- INNER JOIN ...
-- WHERE ...
ORDER BY Value1
OFFSET (200) ROWS FETCH NEXT (100) ROWS ONLY;
--Query TWO
SELECT COUNT(*) AS TotalRows
FROM (SELECT DISTINCT Value1, Value2, Value3
FROM @TestTable) AS Foo
SELECT @EndTime = GETDATE()
SELECT DATEDIFF(ms, @StartTime, @EndTime) AS [Duration in milliseconds]
------------------------------------------------------------------------------------
SET STATISTICS TIME OFF
SET STATISTICS IO OFF

SQLite Group By Limit

I have a web service that generates radio station playlists and I'm trying to ensure that playlists never have tracks from the same artist more than n times.
So for example (unless it is Mandatory Metallica --haha) then no artist should ever dominate any 8 hour programming segment.
Today we use a query similar to this which generates smaller randomized playlists out of existing very large playlists:
SELECT FilePath FROM vwPlaylistTracks
WHERE Owner='{0}' COLLATE NOCASE AND
Playlist='{1}' COLLATE NOCASE
ORDER BY RANDOM()
LIMIT {2};
Someone then has to manually review the playlists and do some manual editing if the same artist appears consecutively or more than the desired limit.
Supposing the producer wants to ensure that no artist appears more than twice in the span of the playlist generated in this query (and assuming there is an artist field in the vwPlaylistTracks view; which there is) is GROUP BY the correct way to accomplish this?
I've been messing around with the view trying to accomplish this, but the following query only ever returns one track per artist.
SELECT
a.Name as 'Artist',
f.parentPath || '\' || f.fileName as 'FilePath',
p.name as 'Playlist',
u.username as 'Owner'
FROM mp3_file f,
mp3_track t,
mp3_artist a,
mp3_playlist_track pt,
mp3_playlist p,
mp3_user u
WHERE f.file_id = t.track_id
AND t.artist_id = a.artist_id
AND t.track_id = pt.track_id
AND pt.playlist_id = p.playlist_id
AND p.user_id = u.user_id
--AND p.Name = 'Alternative Rock'
GROUP BY a.Name
--HAVING Count(a.Name) < 3
--ORDER BY RANDOM()
--LIMIT 50;
GROUP BY creates exactly one result record for each distinct value in the grouped column, so this is not what you want.
You have to count any previous records with the same artist, which is not easy because the random ordering is not stable.
However, this is possible with a temporary table, which is ordered by its rowid:
CREATE TEMPORARY TABLE RandomTracks AS
SELECT a.Name as Artist, parentPath, name, username
FROM ...
WHERE ...
ORDER BY RANDOM();
CREATE INDEX RandomTracks_Artist on RandomTracks(Artist);
SELECT *
FROM RandomTracks AS r1
WHERE -- filter out if there are any two previous records with the same artist
(SELECT COUNT(*)
FROM RandomTracks AS r2
WHERE r2.Artist = r1.Artist
AND r2.rowid < r1.rowid
) < 2
AND -- filter out if the directly previous record has the same artist
r1.Artist IS NOT (SELECT Artist
FROM RandomTracks AS r3
WHERE r3.rowid = r1.rowid - 1)
LIMIT 50;
DROP TABLE RandomTracks;
It might be easier and faster to just read the entire playlist and to filter and reorder it in your code.
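On SQLite 3.25 or newer (which added window functions), a rough alternative to the temporary table is to number each artist's tracks in a random order and keep at most two. This is a sketch assuming vwPlaylistTracks exposes an Artist column; note it caps each artist's total appearances but, unlike the temp-table approach above, does not prevent two tracks by the same artist playing back to back:

```sql
SELECT FilePath
FROM (
    SELECT FilePath,
           ROW_NUMBER() OVER (PARTITION BY Artist ORDER BY RANDOM()) AS nth
    FROM vwPlaylistTracks
    WHERE Owner = '{0}' COLLATE NOCASE
      AND Playlist = '{1}' COLLATE NOCASE
)
WHERE nth <= 2          -- keep at most two tracks per artist
ORDER BY RANDOM()       -- shuffle whatever survived the per-artist cap
LIMIT {2};
```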

Strange SQLite behavior: Not returning results on simple queries

OK, so I have a basic table called "ledger"; it contains fields of various types: integers, varchar, etc.
In my program, I used to use a query with no "where" predicate to collect all of the rows, which of course works fine. But... I changed my code to allow selecting one row at a time using "where acctno = x" (where x is the account number I want to select at the time), and now the query returns no rows.
I thought this must be a bug in the client library for my programming language, so I tested it in the SQLite command-line client - and it still doesn't work!
I am relatively new to SQLite, but I have been using Oracle, MS SQL Server, etc. for years and never seen this type of issue before.
Other things I can tell you:
* Queries using other integer fields also don't work
* Queries on char fields work
* Querying it as a string (with the account number in quotes) still doesn't work. (I thought maybe the numbers were inadvertently stored as strings.)
* Accessing rows by rowid works fine - which is why I can edit the database with GUI tools with no noticeable problem.
Examples:
Query with no WHERE (works fine):
1|0|0|JPY|8|Paid-In Capital|C|X|0|X|0|0||||0|0|0|
0|0|0|JPY|11|Root Account|P|X|0|X|0|0|SYSTEM|20121209|150000|0|0|0|
3|0|0|JPY|13|Mitsubishi Bank Futsuu|A|X|0|X|0|0|SYSTEM|20121209|150000|0|0|0|
4|0|0|JPY|14|Japan Post Bank|A|X|0|X|0|0|SYSTEM|20121209|150000|0|0|0|
...
Query with WHERE clause: (no results)
sqlite> select * from ledger where acctno=1;
sqlite>
putting quotes around the 1 above changes nothing.
Interestingly enough, "select * from ledger where acctno > 1" returns results! However, since it returns ALL rows, it's not terribly useful.
I'm sure someone will ask about the table structure, so here goes:
sqlite> .schema ledger
CREATE TABLE "LEDGER" (
"ACCTNO" integer(10,0) NOT NULL,
"drbal" integer(20,0) NOT NULL,
"crbal" integer(20,0) NOT NULL,
"CURRKEY" char(3,0) NOT NULL,
"TEXTKEY" integer(10,0),
"TEXT" VARCHAR(64,0),
"ACCTYPECD" CHAR(1,0) NOT NULL,
"ACCSTCD" CHAR(1,0),
"PACCTNO" number(10,0) NOT NULL,
"CATCD" number(10,0),
"TRANSNO" number(10,0) NOT NULL,
"extrefno" number(10,0),
"UPDATEUSER" VARCHAR(32,0),
"UPDATEDATE" text(8,0),
"UPDATETIME" TEXT(6,0),
"PAYEECD" number(10,0) NOT NULL,
"drbal2" number(10,0) NOT NULL,
"crbal2" number(10,0) NOT NULL,
"delind" boolean,
PRIMARY KEY("ACCTNO"),
CONSTRAINT "fk_curr" FOREIGN KEY ("CURRKEY") REFERENCES "CURRENCY" ("CURRKEY") ON DELETE RESTRICT ON UPDATE CASCADE
);
The strangest thing is that I have other similar tables where this works fine!
sqlite> select * from journalhdr where transno=13;
13|Test transaction ATM Withdrawel 20130213|20130223||20130223||
TransNo in that table is also integer(10,0) NOT NULL - this is what makes me think it is something to do with the values.
Another clue is that the sort order seems to be based on ascii, not numeric:
sqlite> select * from ledger order by acctno;
0|0|0|JPY|11|Root Account|P|X|0|X|0|0|SYSTEM|20121209|150000|0|0|0|
1|0|0|JPY|8|Paid-In Capital|C|X|0|X|0|0||||0|0|0|
10|0|0|USD|20|Sallie Mae|L|X|0|X|0|0|SYSTEM|20121209|153900|0|0|0|
21|0|0|USD|21|Skrill|A|X|0|X|0|0|SYSTEM|20121209|154000|0|0|0|
22|0|0|USD|22|AES|L|X|0|X|0|0|SYSTEM|20121209|154200|0|0|0|
23|0|0|JPY|23|Marui|L|X|0|X|0|0|SYSTEM|20121209|154400|0|0|0|
24|0|0|JPY|24|Amex JP|L|X|0|X|0|0|SYSTEM|20121209|154500|0|0|0|
3|0|0|JPY|13|Mitsubishi Bank Futsuu|A|X|0|X|0|0|SYSTEM|20121209|150000|0|0|0|
Of course the sort order on journalhdr (where the select works properly) is numeric.
Solved! (sort-of)
The data can be fixed like this:
sqlite> update ledger set acctno = 23 where rowid = 13;
sqlite> select * from ledger where acctno = 25;
25|0|0|JPY|0|Test|L|X|0|X|0|0|SYSTEM|20130224|132500|0|0|0|
Still, if it was stored as a string, that leaves a few questions:
1. Why couldn't I select it as a string using the quotes?
2. How did it get stored as a string, since it is a valid integer?
3. How would you normally go about detecting this problem, besides noticing bizarre symptoms?
Although the data would normally be entered by my program, some of it was created by hand using Navicat, so I assume the problem must lie there.
You are a victim of SQLite's dynamic typing.
Even though SQLite defines a system of type affinity, which sets some rules on how input strings or numbers are converted to actual internal values, it does NOT prevent software using prepared statements from explicitly setting any type (and data value) for a column (and this can be different per row!).
This can be shown by this simple example:
CREATE TABLE ledger (acctno INTEGER, name VARCHAR(16));
INSERT INTO ledger VALUES(1, 'John'); -- INTEGER '1'
INSERT INTO ledger VALUES(2 || X'00', 'Zack'); -- BLOB '2\0'
I have inserted second row not as INTEGER, but as binary string containing embedded zero byte. This reproduces your issue exactly, see this SQLFiddle, step by step. You can also execute these commands in sqlite3, you will get the same result.
Below is Perl script that also reproduces this issue
This script creates just 2 rows with acctno having values of integer 1 for first, and "2\0" for second row. "2\0" means string consisting of 2 bytes: first is digit 2, and second is 0 (zero) byte.
Of course, it is very difficult to visually tell "2\0" from just "2", but this is what script below demonstrates:
#!/usr/bin/perl -w
use strict;
use warnings;
use DBI qw(:sql_types);
my $dbh = DBI->connect("dbi:SQLite:test.db") or die DBI::errstr();
$dbh->do("DROP TABLE IF EXISTS ledger");
$dbh->do("CREATE TABLE ledger (acctno INTEGER, name VARCHAR(16))");
my $sth = $dbh->prepare(
"INSERT INTO ledger (acctno, name) VALUES (?, ?)");
$sth->bind_param(1, "1", SQL_INTEGER);
$sth->bind_param(2, "John");
$sth->execute();
$sth->bind_param(1, "2\0", SQL_BLOB);
$sth->bind_param(2, "Zack");
$sth->execute();
$sth = $dbh->prepare(
"SELECT count(*) FROM ledger WHERE acctno = ?");
$sth->bind_param(1, "1");
$sth->execute();
my ($num1) = $sth->fetchrow_array();
print "Number of rows matching id '1' is $num1\n";
$sth->bind_param(1, "2");
$sth->execute();
my ($num2) = $sth->fetchrow_array();
print "Number of rows matching id '2' is $num2\n";
$sth->bind_param(1, "2\0", SQL_BLOB);
$sth->execute();
my ($num3) = $sth->fetchrow_array();
print "Number of rows matching id '2<0>' is $num3\n";
Output of this script is:
Number of rows matching id '1' is 1
Number of rows matching id '2' is 0
Number of rows matching id '2<0>' is 1
If you were to look at the resultant table using any SQLite tool (including sqlite3), it would print 2 for the second row - they all get confused by the trailing 0 byte inside a BLOB when it gets coerced to a string or number.
Note that I had to use custom param binding to coerce type to BLOB and permit null bytes stored:
$sth->bind_param(1, "2\0", SQL_BLOB);
Long story short, either one of your client programs or a client tool like Navicat screwed it up.
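As a way of answering question 3 (detection), SQLite's typeof() function reports the type actually stored in each row, so stray TEXT/BLOB values hiding in an INTEGER column can be found and, with care, repaired. A sketch against the ledger table above:

```sql
-- Detection: list rows whose acctno is not stored as an integer.
SELECT rowid, acctno, typeof(acctno)
FROM ledger
WHERE typeof(acctno) <> 'integer';

-- Repair (inspect the SELECT output first!): CAST takes the leading
-- numeric prefix of the value, so a BLOB like '2\0' becomes the integer 2.
UPDATE ledger
SET acctno = CAST(acctno AS INTEGER)
WHERE typeof(acctno) <> 'integer';
```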

Is it possible to use WHERE clause in same query as PARTITION BY?

I need to write SQL that keeps only the minimum 5 records for each identifiable record in a table. For this, I use PARTITION BY and delete all records where the value returned is greater than 5. When I attempt to use the WHERE clause in the same query as the PARTITION BY statement, I get the error "Ordered Analytical Functions not allowed in WHERE Clause". So, in order to get it to work, I have to use three subqueries. My SQL looks like this:
delete mydb.mytable where (field1,field2) in
(
select field1,field2 from
(
select field1,field2,
Rank() over
(
partition BY field1
order by field1,field2
) n
from mydb.mytable
) x
where n > 5
)
The innermost subquery just returns the raw data. Since I can't use WHERE there, I wrapped it with a subquery, the purpose of which is to 1) use WHERE to get records greater than 5 in rank and 2) select only field1 and field2. The reason why I select only those two fields is so that I can use the IN statement for deleting those records in the outermost query.
It works, but it appears a bit cumbersome. I'd like to consolidate the inner two subqueries into a single subquery. Is this possible?
Sounds like you need the QUALIFY clause, which is to ordered analytical (window) functions what HAVING is to aggregate functions. Below is my take on what you are trying to accomplish.
Please do not run this SQL directly against your production data without first testing it.
/* Physical Delete */
DELETE TGT
FROM MyDB.MyTable TGT
INNER JOIN
(SELECT Field1
, Field2
FROM MyDB.MyTable
QUALIFY ROW_NUMBER() OVER (PARTITION BY Field1 ORDER BY Field1, Field2)
> 5
) SRC
ON TGT.Field1 = SRC.Field1
AND TGT.Field2 = SRC.Field2
/* Logical Delete */
UPDATE TGT
FROM MyDB.MyTable TGT
,
(SELECT Field1
, Field2
FROM MyDB.MyTable
QUALIFY ROW_NUMBER() OVER (PARTITION BY Field1 ORDER BY Field1, Field2)
> 5
) SRC
SET Deleted = 'Y'
/* RecordExpireDate = Date - 1 */
WHERE TGT.Field1 = SRC.Field1
AND TGT.Field2 = SRC.Field2
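To answer the literal question (collapsing the two inner subqueries into one), QUALIFY can simply replace the wrapper subquery whose only job was to filter on the rank. A sketch in the shape of the original DELETE, keeping the original RANK() call:

```sql
DELETE mydb.mytable WHERE (field1, field2) IN
(
    SELECT field1, field2
    FROM mydb.mytable
    QUALIFY RANK() OVER (PARTITION BY field1 ORDER BY field1, field2) > 5
);
```

This goes from three levels of nesting to one, because QUALIFY filters on the window function's result directly, which WHERE cannot do.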
