Table with latin1 collation slow query, fast with utf8mb4 - why? - mariadb

Table A has 25k rows with a dozen columns, about 8 MB of data in total, set to latin1.
Table B has 2,000 rows with two dozen columns, about 5 MB of data in total, set to utf8mb3.
Doing an inner join between the two, the overall query time is 1.3 seconds. If I switch table A to utf8mb4, the same query takes 0.05 seconds.
Why would there be such a massive difference in query time just because of the collation/charset?
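For reference, a minimal sketch of how the scenario might be reproduced and inspected; the table and column names are invented for illustration, and running EXPLAIN before and after the charset switch is just one way to see whether the join plan changes:

CREATE TABLE a (id INT PRIMARY KEY, code VARCHAR(64), INDEX (code)) CHARACTER SET latin1;
CREATE TABLE b (id INT PRIMARY KEY, code VARCHAR(64), INDEX (code)) CHARACTER SET utf8mb3;

-- Slow case: table A still latin1
EXPLAIN SELECT * FROM a JOIN b ON a.code = b.code;

-- Switch table A's charset, as described in the question, then compare plans
ALTER TABLE a CONVERT TO CHARACTER SET utf8mb4;
EXPLAIN SELECT * FROM a JOIN b ON a.code = b.code;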

Related

Slow query on table | WHERE x | ORDER by timestamp | DISTINCT a,b,c,d | TAKE 20 when table large

We are experiencing a sudden performance drop with a query structured like this:
table(tablename)
| where MeasurementName in ('ActiveJobId')
and MachineId == machineId
and SourceTimestamp <= from
and isnotnull(Value)
| order by SourceTimestamp desc
| distinct SourceTimestamp, MeasurementName, tostring(Value), SourceTimestampUtc
| take rows
tablename, machineId, from and rows are all query parameters; rows is typically "20". The Value column is of type "dynamic".
The table contains 240 million entries, with about 64,000 matching the WHERE criteria. The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
The query runs smoothly in the Staging database system, but started to degrade in performance on the Dev system, possibly because of the increased amount of data.
If we remove the distinct clause, or move it behind the take clause, the query completes very quickly (<1 s). The data contains about 5-10% duplicate entries.
To our understanding, the query should be executed like this:
Prepare a filter for the source table, starting at a specific datetime range
Order descending: walk backwards
Walk down the table and stop once you have 20 distinct rows
From the time it sometimes takes, it looks almost as if ADX walks down the whole table, performs the distinct, and only then takes the topmost 20 rows.
The problem persists if we swap | order and | distinct around.
The problem disappears if we move | distinct to the end of the query, but then we often receive 1-2 items fewer than required.
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
This part of the description doesn't match the filter in your query: and SourceTimestamp <= from - did you mean to use >= instead of <= ?
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
If you can't eliminate the duplicates upstream, you can consider setting up a materialized view that performs the deduplication, then querying the view directly instead of the raw data. Also see Handle duplicate data.
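As a rough illustration of that suggestion, a deduplicating materialized view might look like the sketch below; the view name, the source table name, and the set of key columns are assumptions, since they depend on what actually defines a duplicate in your data:

.create materialized-view DedupedMeasurements on table RawMeasurements
{
    RawMeasurements
    | summarize take_any(*) by MachineId, MeasurementName, SourceTimestamp
}

The original query could then be pointed at DedupedMeasurements instead of the raw table, with the distinct step no longer needed.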

Distinct Query is not properly working in Cosmos DB

I have about 100 thousand records in Cosmos DB. I want to get the distinct records by some property. I am using a stored procedure to achieve this and set the page size to -1 to get the maximum number of records. When I run a query without DISTINCT, I get about 19 thousand records. If I run the DISTINCT query, it gives me distinct records, but the DISTINCT is applied within those 19 thousand records instead of the entire 100 thousand records.
Below is the query I have used:
SELECT r.[[FieldName]] FROM r -> returns 19,000 records with duplicates
SELECT DISTINCT r.[[FieldName]] FROM r -> returns distinct records (about 5,000), deduplicated from the above 19,000 records instead of the full 100 thousand records

Discrepancies in Bytes Processed from a historical table vs ga_sessions_ historical tables

If I extract the full data from all existing ga_sessions_ or Firebase tables, the Bytes Processed figure is 4.5 GB.
If I save the previous query into a destination table and then extract the full data from that table, the Bytes Processed figure is 217 GB.
Both tables have the same table size. Why this discrepancy?
UPDATE:
My standardSQL query:
SELECT _TABLE_SUFFIX AS Date,
  user_dim.app_info.app_instance_id,
  user_dim.app_info.app_version,
  user_dim.geo_info.city,
  user_properties.key,
  event.name
FROM `project.dataset.app_events_*`,
  UNNEST(user_dim.user_properties) AS user_properties,
  UNNEST(event_dim) AS event
returns 4.5 GB. If I save this result as a table (called historical_data) and then compose this query:
SELECT *
FROM `project.dataset.historical_data`
then it returns 217 GB.
I think this is because of the double cross join - for each cross-joined row you now get a redundant set of the fields below:
_TABLE_SUFFIX AS Date,
user_dim.app_info.app_instance_id,
user_dim.app_info.app_version,
user_dim.geo_info.city
so even though the original table was 4.5 GB in size, the result grew to 217 GB.
That makes sense to me - this is something that happens with big data: the result can explode to an enormous size if you are not careful enough.
And, by the way, check the number of rows in the original table vs. the output table.
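Following that last suggestion, a quick way to compare the row counts (using the same table names as above) might be:

-- Rows in the original wildcard tables
SELECT COUNT(*) AS source_rows
FROM `project.dataset.app_events_*`;

-- Rows in the materialized result of the query with the two UNNESTs
SELECT COUNT(*) AS historical_rows
FROM `project.dataset.historical_data`;

If the second count is much larger, the UNNEST fan-out explains both the row explosion and the extra bytes processed.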

how to select first 200 rows in oracle without full table scan

I need to select the first 200 rows in my database without a full table scan. If I scan the full table it takes too much time because my table contains 160 million records. I am using Oracle 11g.
Do you really need to avoid a full table scan in this case? I expect
SELECT * FROM table WHERE ROWNUM <= 200;
runs pretty fast and starts returning results immediately despite the FTS, even with a table containing millions of rows.

space consumption of null columns in sqlite db

Say I have a column that is only rarely used by records in my SQLite db (for the rest of the records, the value is NULL). Would those NULL columns consume an amount of space comparable to the columns not existing at all?
In my test program, NULL values consumed one byte per row. If the average row size in your table is above 100 bytes, then yes, it's comparable to the column being nonexistent.
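That answer refers to a private test program; a rough way to reproduce this kind of measurement in the sqlite3 shell could look like the sketch below (the table and column names are invented for illustration):

-- Table where one column is almost always NULL
CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT, rare_col TEXT);

-- Insert many rows that leave rare_col NULL
WITH RECURSIVE seq(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 100000)
INSERT INTO t (payload, rare_col)
SELECT 'some typical payload text', NULL FROM seq;

-- Database size is page_count * page_size; compare against the same
-- schema without rare_col to see the per-row overhead of the NULLs
PRAGMA page_count;
PRAGMA page_size;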
