Here's my data:
+--------+------------+---------+------------+----------------+
| UserID | VisitDate  | VisitID | PurchaseID | LastPurchaseID |
+--------+------------+---------+------------+----------------+
| 1234   | 2014-10-03 | 1       | 4a75       | 4a75           |
| 1234   | 2014-10-06 | 2       |            | 4a75           |
| 1234   | 2014-10-07 | 3       | b305       | b305           |
| 1234   | 2014-10-08 | 4       |            | b305           |
| 1234   | 2014-10-09 | 5       |            | b305           |
| 1234   | 2014-10-10 | 6       | b305       | b305           |
| 1234   | 2014-10-10 | 7       |            | b305           |
| 1234   | 2014-10-15 | 8       |            | b305           |
+--------+------------+---------+------------+----------------+
I don't have LastPurchaseID; that column is what I want to generate.
I figure I have to use window functions, but I'm not sure how to get it to keep the most recent non-null value, even if the most recent non-null value is many rows ago.
For example, I've tried something like:
SELECT UserID,
       VisitDate,
       VisitID,
       PurchaseID,
       LAG(TRIM(PurchaseID)) IGNORE NULLS
           OVER (ORDER BY UserID, VisitDate) AS LastPurchaseID
FROM TheTable;
but this only returns:
+--------+------------+---------+------------+----------------+
| UserID | VisitDate  | VisitID | PurchaseID | LastPurchaseID |
+--------+------------+---------+------------+----------------+
| 1234   | 2014-10-03 | 1       | 4a75       | 4a75           |
| 1234   | 2014-10-06 | 2       |            | 4a75           |
| 1234   | 2014-10-07 | 3       | b305       | b305           |
| 1234   | 2014-10-08 | 4       |            | b305           |
| 1234   | 2014-10-09 | 5       |            |                |
| 1234   | 2014-10-10 | 6       | b305       | b305           |
| 1234   | 2014-10-10 | 7       |            | b305           |
| 1234   | 2014-10-15 | 8       |            |                |
+--------+------------+---------+------------+----------------+
Is there any way to use a window function to say "keep the most recent value; if it is null, assume it hasn't changed from the previous non-null value"?
I eventually got it, sorry about that. For anyone else in this somewhat unique situation, this is what was happening:
Since PurchaseID was a string in my case, I wasn't considering the case where PurchaseID was an empty string (or just a space, which TRIM() turned into an empty string), which is not NULL.
I have since fixed the job that inserts into the table to prevent this from occurring, and also changed the LastPurchaseID logic to the following:
SELECT LAG(CASE WHEN LENGTH(TRIM(PurchaseID)) = 0 THEN NULL
                ELSE TRIM(PurchaseID) END)
           IGNORE NULLS OVER (ORDER BY UserID, VisitDate) AS LastPurchaseID
FROM TheTable;
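As an aside, since the fix just maps empty strings to NULL, the CASE can be folded into the standard NULLIF function, if your dialect supports it; a sketch of the same query:

SELECT UserID,
       VisitDate,
       VisitID,
       PurchaseID,
       -- NULLIF(x, '') returns NULL when x = '', so IGNORE NULLS skips blanks too
       LAG(NULLIF(TRIM(PurchaseID), '')) IGNORE NULLS
           OVER (ORDER BY UserID, VisitDate) AS LastPurchaseID
FROM TheTable;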
I have a table named employees:
MariaDB [company]> describe employees;
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| employee_id | char(10) | NO | | NULL | |
| first_name | varchar(20) | NO | | NULL | |
| last_name | varchar(20) | NO | | NULL | |
| email | varchar(60) | NO | | NULL | |
| phone_number | char(14) | NO | | NULL | |
| hire_date | date | NO | | NULL | |
| job_id | int(11) | NO | | NULL | |
| salary | varchar(30) | NO | | NULL | |
| commission_pct | char(10) | NO | | NULL | |
| manager_id | char(10) | NO | | NULL | |
| department_id | char(10) | NO | | NULL | |
+----------------+-------------+------+-----+---------+-------+
MariaDB [company]> select * from employees;
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| employee_id | first_name | last_name | email | phone_number | hire_date | job_id | salary | commission_pct | manager_id | department_id |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| 100 | Steven | King | sking#gmail.com | 515.123.4567 | 2003-06-17 | 1 | 24000.00 | 0.00 | 0 | 90 |
| 101 | Neena | Kochhar | nkochhar#gmail.com | 515.123.4568 | 2005-09-21 | 2 | 17000.00 | 0.00 | 100 | 90 |
| 102 | Lex | Wow | Lwow#gmail.com | 515.123.4569 | 2001-01-13 | 2 | 17000.00 | 0.00 | 100 | 9 |
| 103 | Alexander | Hunold | ahunold#gmail.com | 590.423.4567 | 2006-01-03 | 3 | 9000.00 | 0.00 | 102 | 60 |
| 104 | Bruce | Ernst | bernst#gmail.com | 590.423.4568 | 2007-05-21 | 3 | 6000.00 | 0.00 | 103 | 60 |
| 105 | David | Austin | daustin#gmail.com | 590.423.4569 | 2005-06-25 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 106 | Valli | Pataballa | vpatabal#gmail.com | 590.423.4560 | 2006-02-05 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 107 | Diana | Lorentz | dlorentz#gmail.com | 590.423.5567 | 2007-02-07 | 3 | 4200.00 | 0.00 | 103 | 60 |
| 108 | Nancy | Greenberg | ngreenbe#gmail.com | 515.124.4569 | 2002-08-17 | 4 | 12008.00 | 0.00 | 101 | 100 |
| 109 | Daniel | Faviet | dfaviet#gmail.com | 515.124.4169 | 2002-08-16 | 5 | 9000.00 | 0.00 | 108 | 100 |
| 110 | John | Chen | jchen#gmail.com | 515.124.4269 | 2005-09-28 | 5 | 8200.00 | 0.00 | 108 | 100 |
| 111 | Ismael | Sciarra | isciarra#gmail.com | 515.124.4369 | 2005-09-30 | 5 | 7700.00 | 0.00 | 108 | 100 |
| 112 | Jose | Urman | jurman#gmail.com | 515.124.4469 | 2006-03-07 | 5 | 7800.00 | 0.00 | 108 | 100 |
| 113 | Luis | Popp | lpopp#gmail.com | 515.124.4567 | 2007-12-07 | 5 | 6900.00 | 0.00 | 108 | 100 |
| 114 | Den | Raphaely | drapheal#gmail.com | 515.127.4561 | 2002-12-07 | 6 | 11000.00 | 0.00 | 100 | 30 |
| 115 | Alexander | Khoo | akhoo#gmail.com | 515.127.4562 | 2003-05-18 | 7 | 3100.00 | 0.00 | 114 | 30 |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
I wanted to change the salary column from string to integer, so I ran this command:
MariaDB [company]> alter table employees modify column salary int;
ERROR 1292 (22007): Truncated incorrect INTEGER value: '24000.00'
As you can see, it gave me a truncation error. I found some previous questions that showed how to use CONVERT() and TRIM(), but those didn't actually answer my question.
The SQL code and data can be found here: https://0x0.st/oYoB.com_5zfu
I tested this on MySQL and it worked fine. So it is apparently an issue only with MariaDB.
The problem is that a string like '24000.00' is not an integer. Integers don't have a decimal place. So in strict mode, the implicit type conversion fails.
I was able to work around this by running this update first:
update employees set salary = round(salary);
The column is still a string, but '24000.00' has been changed to '24000' (with no decimal point character or following digits).
Then you can alter the data type, and implicit type conversion to integer works:
alter table employees modify column salary int;
See a demonstration using MariaDB 10.6:
https://dbfiddle.uk/V6LrEMKt
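As an alternative (a sketch I have not run against your exact data, and assuming every salary value has the form '24000.00'), you can step the type through DECIMAL so that each conversion is lossless and strict mode has nothing to reject:

ALTER TABLE employees MODIFY COLUMN salary DECIMAL(10,2);  -- '24000.00' parses cleanly as DECIMAL
ALTER TABLE employees MODIFY COLUMN salary INT;            -- .00 fractions are exact, so narrowing loses nothing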
P.S.: You misspelled the column name "commission_pct" as "comission_pct" in your sample DDL, and I had to edit that to test. In the future, please use one of the db fiddle sites to share samples, because they will test your code.
I have a source table like the one below:
+---------+--------+---------+------+
| ID      | SEQ_NO | UNIT_ID | D_ID |
+---------+--------+---------+------+
| 7979092 | 1      | 99      | 759  |
| 7979092 | 2      | -1      | 869  |
| 7979092 | 3      | -1      | 927  |
| 7979092 | 4      | -1      | 812  |
| 7979092 | 5      | 99      | 900  |
| 7979092 | 6      | 99      | 891  |
| 7979092 | 7      | -1      | 785  |
| 7979092 | 8      | -1      | 762  |
| 7979092 | 9      | -1      | 923  |
+---------+--------+---------+------+
I have to merge rows when consecutive rows have the same UNIT_ID, taking MAX(D_ID) as the rows are consolidated. The expected output is:
+---------+---------+------+
| ID      | UNIT_ID | D_ID |
+---------+---------+------+
| 7979092 | 99      | 759  |
| 7979092 | -1      | 927  |
| 7979092 | 99      | 900  |
| 7979092 | -1      | 923  |
+---------+---------+------+
I have tried to find a solution using Teradata ordered analytical functions, but did not find one. I am using Teradata 16.
Thank you.
This logic is a bit quirky; it's based on two sequences created by different sort orders:
SELECT
   ID
   ,UNIT_ID
   ,Max(D_ID) AS D_ID
FROM
 (
   SELECT
      ID
      ,SEQ_NO
      ,UNIT_ID
      ,D_ID
      -- assign the same value to consecutive UNIT_IDs
      ,SEQ_NO -
       Row_Number()
       Over(PARTITION BY ID, UNIT_ID
            ORDER BY SEQ_NO) AS grp
   FROM tab
 ) AS dt
GROUP BY 1,2,grp
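To see why this works, trace the derived values for the sample data: rn is the Row_Number() within each (ID, UNIT_ID) partition, and grp = SEQ_NO - rn stays constant across a consecutive run of the same UNIT_ID (the classic gaps-and-islands trick):

+--------+---------+----+-----+
| SEQ_NO | UNIT_ID | rn | grp |
+--------+---------+----+-----+
| 1      | 99      | 1  | 0   |
| 2      | -1      | 1  | 1   |
| 3      | -1      | 2  | 1   |
| 4      | -1      | 3  | 1   |
| 5      | 99      | 2  | 3   |
| 6      | 99      | 3  | 3   |
| 7      | -1      | 4  | 3   |
| 8      | -1      | 5  | 3   |
| 9      | -1      | 6  | 3   |
+--------+---------+----+-----+

The two runs that both end up with grp = 3 stay separate because UNIT_ID is also part of the GROUP BY, so Max(D_ID) returns 759, 927, 900 and 923, matching the expected output.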
You can use RESET WHEN to dynamically create groups within the window. Here's one way to do it:
select ID, UNIT_ID,
max(D_ID) over(
partition by ID order by SEQ_NO
reset when UNIT_ID <> UNIT_ID_prev -- Create new group for new value
) as D_ID
from (
select ID, SEQ_NO, UNIT_ID, D_ID,
lag(UNIT_ID) over(partition by ID order by SEQ_NO) as UNIT_ID_prev -- Previous value
from MY_TABLE
) src
qualify row_number() over(
partition by ID order by SEQ_NO
reset when UNIT_ID <> UNIT_ID_prev -- Match original max() window
) = 1 -- One row per group (similar to DISTINCT)
I have two QBO matrices:
Matrix A has 1 input and 1 output,
Matrix B has 3 inputs and 1 output.
When I navigate to Matrix A, I see something like this:
| Client | Price  |
| ------ | ------ |
|        | 120.00 |
| A,B,C  | 100.00 |
| D,E    | 90.00  |
| F      | 95.00  |
When I enter E into the Client filter, the Matrix sorts as I would expect, with the first row 'valid' and the others crossed out:
| Client | Price  |
| ------ | ------ |
| D,E    | 90.00  |
|        | 120.00 |
| A,B,C  | 100.00 |
| F      | 95.00  |
Matrix B looks like this:
| Client    | State | Investor | Price  |
| --------- | ----- | -------- | ------ |
|           | CA    | NOT(2,3) | 100.00 |
| San Diego | CA    |          | 110.00 |
|           | FL    |          | 95.00  |
| Miami     | FL    | 3        | 105.00 |
When I enter 'Miami' into the Client column filter, the sorting appears like this, with all rows crossed out:
| Client    | State | Investor | Price  |
| --------- | ----- | -------- | ------ |
|           | CA    | NOT(2,3) | 100.00 |
|           | FL    |          | 95.00  |
| San Diego | CA    |          | 110.00 |
| Miami     | FL    | 3        | 105.00 |
Why don't I see the row with Miami at the top?
In your second example, the row with Miami also requires State = 'FL', so that row is not a match. To get the Miami row to match, you must enter both Client = 'Miami' and State = 'FL'.
When the Matrix is evaluated, it calculates the following:
for each row of the matrix:
    for each input of the matrix:
        if the row has a value that equals your input, that's a 'match'
        if the row has a value and you provided no input, that's a 'mismatch'
        if the row has a NOT(value) and you provided no input, that's a 'match'
sort by
    weight descending,
    then by number of mismatches,
    then by number of matches
For the Matrix B output:
| Client    | State | Investor | Price  |
| --------- | ----- | -------- | ------ |
|           | CA    | NOT(2,3) | 100.00 |  Mismatch = 1, Match = 1
|           | FL    |          | 95.00  |  Mismatch = 1, Match = 0
| San Diego | CA    |          | 110.00 |  Mismatch = 2, Match = 1
| Miami     | FL    | 3        | 105.00 |  Mismatch = 2, Match = 1
In short:
The 'Miami' line requires 3 inputs to match, and 2 of them did not, so it sorts to the bottom.
The first line (CA) actually matches your 'empty' Investor input because it only cares that the Investor is NOT 2 or 3, and supplying no investor satisfies that requirement!
I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up on one side of the split or the other. For example:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234       | 1    | Foo                | 1    |
| 5678       | 0    | Bar                | 1    |
| 9101112    | 1    | Quack              | 1    |
| 13141516   | 1    | Meep               | 1    |
| 1234       | 0    | Boop               | 2    |
| 5678       | 0    | Baa                | 2    |
| 9101112    | 0    | Bleat              | 2    |
| 13141516   | 1    | Maaaa              | 2    |
| 1234       | 0    | Foo                | 3    |
| 5678       | 0    | Bar                | 3    |
| 9101112    | 1    | Quack              | 3    |
| 13141516   | 1    | Meep               | 3    |
| 1234       | 1    | Boop               | 4    |
| 5678       | 1    | Baa                | 4    |
| 9101112    | 0    | Bleat              | 4    |
| 13141516   | 1    | Maaaa              | 4    |
+------------+------+--------------------+------+
Acceptable output from that, if I chose, say, a 50/50 split grouped on the Student ID column, would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234       | 1    | Foo                | 1    |
| 1234       | 0    | Boop               | 2    |
| 1234       | 0    | Foo                | 3    |
| 1234       | 1    | Boop               | 4    |
| 9101112    | 1    | Quack              | 1    |
| 9101112    | 0    | Bleat              | 2    |
| 9101112    | 1    | Quack              | 3    |
| 9101112    | 0    | Bleat              | 4    |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678       | 0    | Bar                | 1    |
| 5678       | 0    | Baa                | 2    |
| 5678       | 0    | Bar                | 3    |
| 5678       | 1    | Baa                | 4    |
| 13141516   | 1    | Meep               | 1    |
| 13141516   | 1    | Maaaa              | 2    |
| 13141516   | 1    | Meep               | 3    |
| 13141516   | 1    | Maaaa              | 4    |
+------------+------+--------------------+------+
Now, from what I can tell, this is basically the opposite of a stratified split, which would take a random sample with every student represented on both sides.
I would prefer an Azure ML module that did this, but I suspect one doesn't exist, so is there an R function or library that provides this kind of functionality? All I could find were questions about stratification, which obviously don't help me much.
You can use the following command:
library(dplyr)  # assuming df is the data frame shown above
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(`Student ID`)))[as.integer(factor(`Student ID`))])
It returns the original dataframe with a new column that indicates which fold each student is in. If you want more folds, just adjust the '1:2' part.
I've tried the 'sample the unique values' approach, but it did not always work for me in the past.
I have a question about a query that I want to execute, but I don't know which version gives the best performance. I need to get all the words except those that have a relation in the wordfilter table.
Both queries produce the right output, but maybe there is a better solution. I have almost no knowledge of query plans; I'm trying to understand them now.
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word,
       stocknews.word.language
FROM stocknews.word
WHERE NOT EXISTS (SELECT word_id
                  FROM stocknews.wordfilter
                  WHERE stocknews.word.id = word_id)
  AND user_id = 1
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | word | ref | user_id | user_id | 4 | const | 843 | Using where |
| 2 | MATERIALIZED | wordfilter | index | PRIMARY | PRIMARY | 756 | | 16 | Using index |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
Against
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word,
       stocknews.word.language
FROM stocknews.word
LEFT JOIN stocknews.wordfilter ON stocknews.word.id = stocknews.wordfilter.word_id
WHERE stocknews.wordfilter.word_id IS NULL
  AND user_id = 1
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
| 1 | SIMPLE | word | ref | user_id | user_id | 4 | const | 843 | |
| 1 | SIMPLE | wordfilter | ref | PRIMARY | PRIMARY | 4 | word.id | 1 | Using where; Using index; Not exists |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
Any help is welcome! An explanation would be nice.
Edit:
For query 1:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 0 |
| Handler_icp_attempts | 0 |
| Handler_icp_match | 0 |
| Handler_mrr_init | 0 |
| Handler_mrr_key_refills | 0 |
| Handler_mrr_rowid_refills | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1044 |
| Handler_read_last | 0 |
| Handler_read_next | 859 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_deleted | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_tmp_update | 0 |
| Handler_tmp_write | 215 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+-------+
25 rows in set (0.00 sec)
For query 2:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 0 |
| Handler_icp_attempts | 0 |
| Handler_icp_match | 0 |
| Handler_mrr_init | 0 |
| Handler_mrr_key_refills | 0 |
| Handler_mrr_rowid_refills | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 0 |
| Handler_read_key | 844 |
| Handler_read_last | 0 |
| Handler_read_next | 843 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_deleted | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_tmp_update | 0 |
| Handler_tmp_write | 0 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+-------+
It seems to be a close race between the two formulations. (Some other example may show a clearer winner.)
From the HANDLER values: Query 1 did more read_keys, and some writing (which goes along with MATERIALIZED). The other numbers were about the same. So, I conclude that Query 1 is slower -- although possibly not enough slower to make much difference.
I vote for LEFT JOIN as the better query pattern (in this case).
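For anyone who wants to reproduce this kind of comparison, the usual pattern for collecting per-query Handler counters is to reset the session counters, run the candidate query, then read them back:

FLUSH STATUS;                         -- reset the session's Handler counters
SELECT ...;                           -- run the query under test here
SHOW SESSION STATUS LIKE 'Handler%';  -- see what the query actually did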