Query performance: 'LEFT JOIN ... IS NULL' vs 'NOT EXISTS (SELECT ...)' in MariaDB

I have a question about a query that I want to execute, but I don't know which version is best in terms of performance. I need to get all the words, excluding the words that have a related row in the wordfilter table.
Both queries return the correct output, but maybe there is a better solution for this. I have almost no knowledge of query plans; I'm trying to understand them now.
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word, stocknews.word.language
FROM stocknews.word
WHERE NOT EXISTS (SELECT word_id FROM stocknews.wordfilter WHERE stocknews.word.id = word_id)
AND user_id = 1
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type  | table      | type  | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
|  1 | PRIMARY      | word       | ref   | user_id       | user_id | 4       | const |  843 | Using where |
|  2 | MATERIALIZED | wordfilter | index | PRIMARY       | PRIMARY | 756     |       |   16 | Using index |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
Against
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word, stocknews.word.language
FROM stocknews.word
LEFT JOIN stocknews.wordfilter ON stocknews.word.id = stocknews.wordfilter.word_id
WHERE stocknews.wordfilter.word_id IS NULL AND user_id = 1
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
| id | select_type | table      | type | possible_keys | key     | key_len | ref     | rows | Extra                                |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
|  1 | SIMPLE      | word       | ref  | user_id       | user_id | 4       | const   |  843 |                                      |
|  1 | SIMPLE      | wordfilter | ref  | PRIMARY       | PRIMARY | 4       | word.id |    1 | Using where; Using index; Not exists |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
Any help is welcome! An explanation would be nice.
Edit:
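For anyone reproducing this: such Handler numbers can be collected by resetting the session counters, running the query, and then reading the counters back. A minimal sketch using stock MariaDB commands:
FLUSH STATUS;
-- run the query under test here (query 1 or query 2 from above)
SHOW SESSION STATUS LIKE 'Handler%';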
For query 1:
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 1044  |
| Handler_read_last          | 0     |
| Handler_read_next          | 859   |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 215   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
25 rows in set (0.00 sec)
For query 2:
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 844   |
| Handler_read_last          | 0     |
| Handler_read_next          | 843   |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

It seems to be a close race between the two formulations. (Some other example may show a clearer winner.)
From the Handler values: Query 1 did more read_key operations, and some writing (which goes along with MATERIALIZED). The other numbers were about the same. So I conclude that Query 1 is slower -- although possibly not enough slower to make much difference.
I vote for LEFT JOIN as the better query pattern (in this case).
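A possible further experiment (untested here, and the index name is just illustrative): give word a covering index, so the LEFT JOIN version can read everything it needs -- user_id for the filter, id for the join, word and language for the output -- from the index alone:
ALTER TABLE stocknews.word ADD INDEX covering_idx (user_id, id, word, language);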

Related

Change column type and convert the existing values from string to integer in MariaDB

I have a table named employees:
MariaDB [company]> describe employees;
+----------------+-------------+------+-----+---------+-------+
| Field          | Type        | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| employee_id    | char(10)    | NO   |     | NULL    |       |
| first_name     | varchar(20) | NO   |     | NULL    |       |
| last_name      | varchar(20) | NO   |     | NULL    |       |
| email          | varchar(60) | NO   |     | NULL    |       |
| phone_number   | char(14)    | NO   |     | NULL    |       |
| hire_date      | date        | NO   |     | NULL    |       |
| job_id         | int(11)     | NO   |     | NULL    |       |
| salary         | varchar(30) | NO   |     | NULL    |       |
| commission_pct | char(10)    | NO   |     | NULL    |       |
| manager_id     | char(10)    | NO   |     | NULL    |       |
| department_id  | char(10)    | NO   |     | NULL    |       |
+----------------+-------------+------+-----+---------+-------+
MariaDB [company]> select * from employees;
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| employee_id | first_name  | last_name   | email              | phone_number  | hire_date  | job_id | salary   | commission_pct | manager_id | department_id |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| 100         | Steven      | King        | sking#gmail.com    | 515.123.4567  | 2003-06-17 |      1 | 24000.00 | 0.00           | 0          | 90            |
| 101         | Neena       | Kochhar     | nkochhar#gmail.com | 515.123.4568  | 2005-09-21 |      2 | 17000.00 | 0.00           | 100        | 90            |
| 102         | Lex         | Wow         | Lwow#gmail.com     | 515.123.4569  | 2001-01-13 |      2 | 17000.00 | 0.00           | 100        | 9             |
| 103         | Alexander   | Hunold      | ahunold#gmail.com  | 590.423.4567  | 2006-01-03 |      3 | 9000.00  | 0.00           | 102        | 60            |
| 104         | Bruce       | Ernst       | bernst#gmail.com   | 590.423.4568  | 2007-05-21 |      3 | 6000.00  | 0.00           | 103        | 60            |
| 105         | David       | Austin      | daustin#gmail.com  | 590.423.4569  | 2005-06-25 |      3 | 4800.00  | 0.00           | 103        | 60            |
| 106         | Valli       | Pataballa   | vpatabal#gmail.com | 590.423.4560  | 2006-02-05 |      3 | 4800.00  | 0.00           | 103        | 60            |
| 107         | Diana       | Lorentz     | dlorentz#gmail.com | 590.423.5567  | 2007-02-07 |      3 | 4200.00  | 0.00           | 103        | 60            |
| 108         | Nancy       | Greenberg   | ngreenbe#gmail.com | 515.124.4569  | 2002-08-17 |      4 | 12008.00 | 0.00           | 101        | 100           |
| 109         | Daniel      | Faviet      | dfaviet#gmail.com  | 515.124.4169  | 2002-08-16 |      5 | 9000.00  | 0.00           | 108        | 100           |
| 110         | John        | Chen        | jchen#gmail.com    | 515.124.4269  | 2005-09-28 |      5 | 8200.00  | 0.00           | 108        | 100           |
| 111         | Ismael      | Sciarra     | isciarra#gmail.com | 515.124.4369  | 2005-09-30 |      5 | 7700.00  | 0.00           | 108        | 100           |
| 112         | Jose        | Urman       | jurman#gmail.com   | 515.124.4469  | 2006-03-07 |      5 | 7800.00  | 0.00           | 108        | 100           |
| 113         | Luis        | Popp        | lpopp#gmail.com    | 515.124.4567  | 2007-12-07 |      5 | 6900.00  | 0.00           | 108        | 100           |
| 114         | Den         | Raphaely    | drapheal#gmail.com | 515.127.4561  | 2002-12-07 |      6 | 11000.00 | 0.00           | 100        | 30            |
| 115         | Alexander   | Khoo        | akhoo#gmail.com    | 515.127.4562  | 2003-05-18 |      7 | 3100.00  | 0.00           | 114        | 30            |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
I wanted to change the salary column from string to integer, so I ran this command:
MariaDB [company]> alter table employees modify column salary int;
ERROR 1292 (22007): Truncated incorrect INTEGER value: '24000.00'
As you can see, it gave me a truncation error. I found some previous questions that showed how to use convert() and trim(), but those didn't actually answer my question.
The SQL code and data can be found here: https://0x0.st/oYoB.com_5zfu
I tested this on MySQL and it worked fine, so it is apparently an issue only with MariaDB.
The problem is that a string like '24000.00' is not an integer. Integers don't have a decimal place. So in strict mode, the implicit type conversion fails.
I was able to work around this by running this update first:
update employees set salary = round(salary);
The column is still a string, but '24000.00' has been changed to '24000' (with no decimal point character or following digits).
Then you can alter the data type, and implicit type conversion to integer works:
alter table employees modify column salary int;
See demonstration using MariaDB 10.6:
https://dbfiddle.uk/V6LrEMKt
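If you would rather let the server truncate than clean the data first, an alternative sketch (assuming you accept silently losing the fractional digits) is to relax strict mode for the session, so the implicit conversion produces a warning instead of an error:
SET SESSION sql_mode = '';
alter table employees modify column salary int;
SET SESSION sql_mode = DEFAULT;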
P.S.: You misspelled the column name "commission_pct" as "comission_pct" in your sample DDL, and I had to edit that to test. In the future, please use one of the db fiddle sites to share samples, because they will test your code.

Index behavior in select not using the key

This is my database table llamados:
+-------------+----------------------+------+-----+---------+-------+
| Field       | Type                 | Null | Key | Default | Extra |
+-------------+----------------------+------+-----+---------+-------+
| prospecto   | int(10) unsigned     | NO   | PRI | NULL    |       |
| bis         | tinyint(2) unsigned  | NO   | PRI | NULL    |       |
| id_usr      | smallint(4) unsigned | NO   | MUL | NULL    |       |
| fecha       | date                 | NO   |     | NULL    |       |
| respuesta   | tinyint(3) unsigned  | NO   |     | NULL    |       |
| descripcion | char(255)            | NO   |     | NULL    |       |
| hora_inicio | time                 | NO   |     | NULL    |       |
| hora_fin    | time                 | NO   |     | NULL    |       |
+-------------+----------------------+------+-----+---------+-------+
And these are the indexes:
+---------------------+--------------+-------------+
| Key_name            | Seq_in_index | Column_name |
+---------------------+--------------+-------------+
| PRIMARY             |            1 | prospecto   |
| PRIMARY             |            2 | bis         |
| prospecto_fecha_idx |            1 | prospecto   |
| prospecto_fecha_idx |            2 | fecha       |
| prospecto_fecha_idx |            3 | hora_inicio |
| usr_idx             |            1 | id_usr      |
+---------------------+--------------+-------------+
I am testing this select statement:
explain EXTENDED select * from llamados where id_usr = 2;
+------+-------------+----------+------+---------------+------+---------+------+------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+------+----------+-------------+
|    1 | SIMPLE      | llamados | ALL  | usr_idx       | NULL | NULL    | NULL |   37 |    62.16 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+------+----------+-------------+
I don't understand why it is scanning ALL the table and not using usr_idx to filter the records.
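For what it's worth: with only 37 rows in the table, the optimizer may estimate a full scan as cheaper than an index lookup followed by row fetches, so this plan is not necessarily wrong. One way to compare the plans is to force the index (standard MariaDB index-hint syntax, shown here only as a diagnostic sketch):
explain EXTENDED select * from llamados FORCE INDEX (usr_idx) where id_usr = 2;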

Make a simple clustering manually in R

I am trying to make a simple clustering manually (without using any clustering algorithm) based on the distance between the points. I used the Pearson correlation to calculate the distance:
c <- round(cor(t(df)), digits = 2)
d <- as.dist(1 - c)
I want to cluster all points that have a correlation greater than a certain threshold, for example 0.7. How could I cluster these data points in R?
The first rows and columns of my data frame look like this (there are in total 188 rows and 31 columns):
|     | A1  | A2  | A3  | A4  | A5  |
| --- | --- | --- | --- | --- | --- |
| U00 | 0   | 0   | 0   | 0   | 0   |
| U01 | 0   | 0   | 84  | 0   | 0   |
| U02 | 0   | 1   | 0   | 0   | 0   |
| U03 | 0   | 0   | 0   | 0   | 0   |
| U04 | 0   | 0   | 0   | 0   | 0   |
| U05 | 0   | 0   | 0   | 0   | 0   |
| U06 | 0   | 0   | 0   | 0   | 0   |
and the dist:
|     | U00  | U01  | U02  | U03  | U04  | U05  | U06  |
| --- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| U01 | 0.05 |      |      |      |      |      |      |
| U02 | 1.04 | 1.05 |      |      |      |      |      |
| U03 | 1.04 | 1.04 | 0.92 |      |      |      |      |
| U04 | 1.04 | 1.04 | 0.92 | 0.00 |      |      |      |
| U05 | 1.04 | 1.04 | 0.92 | 0.00 | 0.00 |      |      |
| U06 | 1.04 | 1.04 | 0.92 | 0.00 | 0.00 | 0.00 |      |
In the end I would like to have an extra column in my data frame with the number of the cluster. Thank you in advance!
Things like this can be done using the igraph package. Note that the question correlates rows, so the matrix is built from t(df) here, and the logical adjacency matrix is converted to 0/1 before building the graph:
library(igraph)

threshold <- 0.7
adj <- abs(cor(t(df))) > threshold      # TRUE where two rows correlate strongly
g <- graph_from_adjacency_matrix(adj * 1, mode = "undirected", diag = FALSE)
cl <- components(g)$membership          # named vector: row name -> cluster id
split(names(cl), cl)                    # list of row names per cluster
Note: I took the absolute correlation; you can just remove abs if you only want positive correlations.
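To also get the extra cluster column the question asks for, index the membership vector by the data frame's row names (this assumes the vertex names are rownames(df), which is what cor(t(df)) produces):
df$cluster <- components(g)$membership[rownames(df)]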

How do I take out data from an event for multiple parameters with value of one parameter being the same in the event

Take for example,
event_dim.name = "Start_Level"
event_dim.params.key = "Chapter_Name"
event_dim.params.value.string_value = "chapter_1" (or "chapter_2" or "chapter_3" and so on)
event_dim.params.key = "Level"
event_dim.params.value.int_value = 1 or 2 or 3 or 4 and so on
event_dim.params.key = "Opening_Balance"
event_dim.params.value = 1000 or 1200 or 300 and so on
How do I take out the data if I want to:
- Look at unique users who've played "Level" only for event_dim.params.string_value = "chapter_1" (meaning for levels in Chapter 1)
- Look at the "Opening_Balance" per "Level" only the levels in the chapter where event_dim.params.key = "Chapter_Name" and event_dim.params.value.string_value = "chapter_2"
Currently I am trying to grab the data as below, but I don't think it is giving me the proper data. I am trying to pull level data for users who installed the game between particular dates (through first_open) and from a particular source:
SELECT
  COUNT(DISTINCT(app_instance)),
  event_value.int_value
FROM (
  SELECT
    user_dim.app_info.app_instance_id AS app_instance,
    event.name AS event,
    (
      SELECT
        user_prop.value.value.int_value
      FROM
        UNNEST(user_dim.user_properties) AS user_prop
      WHERE
        user_prop.key = 'first_open_time') AS first_open,
    params.key AS event_param,
    params.value AS event_value
  FROM
    `app_package.app_events_*`,
    UNNEST(event_dim) AS event,
    UNNEST(event.params) AS params
  WHERE
    event.name = "start_level"
    AND user_dim.traffic_source.user_acquired_source = "source"
    AND params.key != 'firebase_event_origin'
    AND params.key != 'firebase_screen_class'
    AND params.key != 'firebase_screen_id' )
WHERE
  event_param = "Level"
  AND (first_open >= 1516579200000 AND first_open <= 1516924800000)
GROUP BY
  event_value.int_value
However, I am not able to segregate events specific to chapter_name = "chapter_1" within the event. (I don't know how to do it, unfortunately, hence the question.)
Update: (Some additional information added as requested by Mikhail)
Sample Input events would be as follows:
+-----------------+-------------+-----------------+--------------+-----------+
| app_instance_id | event_name  | param_key       | string_value | int_value |
+-----------------+-------------+-----------------+--------------+-----------+
| 100001          | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 1         |
|                 |             | opening_balance | null         | 2000      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 2500      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 2750      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 3         |
|                 |             | opening_balance | null         | 3000      |
|                 | start_level | chapter_name    | chapter_2    | null      |
|                 |             | level           | null         | 1         |
|                 |             | opening_balance | null         | 3100      |
|                 | start_level | chapter_name    | chapter_2    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 3500      |
|                 | start_level | chapter_name    | chapter_2    | null      |
|                 |             | level           | null         | 3         |
|                 |             | opening_balance | null         | 3800      |
| 100002          | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 1         |
|                 |             | opening_balance | null         | 2000      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 2250      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 2400      |
|                 | start_level | chapter_name    | chapter_1    | null      |
|                 |             | level           | null         | 3         |
|                 |             | opening_balance | null         | 2800      |
|                 | start_level | chapter_name    | chapter_2    | null      |
|                 |             | level           | null         | 1         |
|                 |             | opening_balance | null         | 3000      |
|                 | start_level | chapter_name    | chapter_2    | null      |
|                 |             | level           | null         | 2         |
|                 |             | opening_balance | null         | 3200      |
+-----------------+-------------+-----------------+--------------+-----------+
Output required is as follows:
+-----------+-------+--------------+-------------------+---------------+
| Chapter   | Level | Unique Users | Total Level Start | Avg. Open Bal |
+-----------+-------+--------------+-------------------+---------------+
| chapter_1 | 1     | 2            | 2                 | 2000          |
| chapter_1 | 2     | 2            | 3                 | 2383          |
| chapter_1 | 3     | 2            | 3                 | 2850          |
| chapter_2 | 1     | 2            | 2                 | 3050          |
| chapter_2 | 2     | 2            | 2                 | 3350          |
| chapter_2 | 3     | 1            | 1                 | 3800          |
+-----------+-------+--------------+-------------------+---------------+
For anyone who is looking for an answer to this question, you can try the below standard SQL query (note that the outer AVG must reference the inner alias open_bal):
SELECT
  chapter,
  level,
  count(distinct id) as Unique_Users,
  count(id) as Level_start,
  avg(open_bal) as Avg_Open_Bal
FROM (
  SELECT
    user_dim.app_info.app_instance_id AS id,
    event.date,
    event.name,
    (SELECT value.string_value FROM UNNEST(event.params) WHERE key = "chapter_name") AS chapter,
    (SELECT value.int_value FROM UNNEST(event.params) WHERE key = "level") AS level,
    (SELECT value.int_value FROM UNNEST(event.params) WHERE key = "opening_coin_balance") AS open_bal
  FROM
    `<table_name>`,
    UNNEST(event_dim) AS event
  WHERE
    event.name = "start_level"
)
GROUP BY
  chapter,
  level
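If you only need a single chapter (the first bullet in the question), the same shape works with a filter on the extracted chapter column in the outer query. A sketch, reusing the <table_name> placeholder from above:
SELECT
  chapter,
  level,
  count(distinct id) as Unique_Users
FROM (
  SELECT
    user_dim.app_info.app_instance_id AS id,
    (SELECT value.string_value FROM UNNEST(event.params) WHERE key = "chapter_name") AS chapter,
    (SELECT value.int_value FROM UNNEST(event.params) WHERE key = "level") AS level
  FROM
    `<table_name>`,
    UNNEST(event_dim) AS event
  WHERE
    event.name = "start_level"
)
WHERE
  chapter = "chapter_1"
GROUP BY
  chapter,
  level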

Split data based on grouping column

I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up in one side of the split or another. For example:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234       | 1    | Foo                | 1    |
| 5678       | 0    | Bar                | 1    |
| 9101112    | 1    | Quack              | 1    |
| 13141516   | 1    | Meep               | 1    |
| 1234       | 0    | Boop               | 2    |
| 5678       | 0    | Baa                | 2    |
| 9101112    | 0    | Bleat              | 2    |
| 13141516   | 1    | Maaaa              | 2    |
| 1234       | 0    | Foo                | 3    |
| 5678       | 0    | Bar                | 3    |
| 9101112    | 1    | Quack              | 3    |
| 13141516   | 1    | Meep               | 3    |
| 1234       | 1    | Boop               | 4    |
| 5678       | 1    | Baa                | 4    |
| 9101112    | 0    | Bleat              | 4    |
| 13141516   | 1    | Maaaa              | 4    |
+------------+------+--------------------+------+
Acceptable output from that if I chose, say, a 50/50 split and to be grouped based on the Student ID column would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234       | 1    | Foo                | 1    |
| 1234       | 0    | Boop               | 2    |
| 1234       | 0    | Foo                | 3    |
| 1234       | 1    | Boop               | 4    |
| 9101112    | 1    | Quack              | 1    |
| 9101112    | 0    | Bleat              | 2    |
| 9101112    | 1    | Quack              | 3    |
| 9101112    | 0    | Bleat              | 4    |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678       | 0    | Bar                | 1    |
| 5678       | 0    | Baa                | 2    |
| 5678       | 0    | Bar                | 3    |
| 5678       | 1    | Baa                | 4    |
| 13141516   | 1    | Meep               | 1    |
| 13141516   | 1    | Maaaa              | 2    |
| 13141516   | 1    | Meep               | 3    |
| 13141516   | 1    | Maaaa              | 4    |
+------------+------+--------------------+------+
Now, from what I can tell, this is basically the opposite of a stratified split, where it would get a random sample with every student represented on both sides.
I would prefer an Azure ML function that did this, but I think that's unlikely, so is there an R function or library that gives this kind of functionality? All I could find were questions about stratification, which obviously don't help me much.
You can use the following command (mutate() and n_distinct() come from dplyr; the column name needs backticks because it contains a space, and the factor index maps the non-consecutive student IDs onto the random fold vector):
library(dplyr)
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(`Student ID`)))[as.integer(factor(`Student ID`))])
It returns the original data frame with a new column that indicates the fold the student is in. If you want more folds, just adjust the '1:2' part.
I've tried the 'sample unique' approach, but it did not always work for me in the past.
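To materialize the two output datasets shown in the question, split on the new column afterwards (parts, half_a, and half_b are just illustrative names):
parts <- split(data.fold, data.fold$fold)  # list of two data frames, one per fold
half_a <- parts[[1]]
half_b <- parts[[2]]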
