Snowflake, Python/Jupyter analysis - jupyter-notebook

I am new to Snowflake, and I am running a query to get a couple of days' data - this returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter, and the kernel restarts/dies before the query ends. Even if it got into Jupyter, I doubt I could analyze the data in any reasonable timeframe (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!

Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing value, note this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a column, and it knows how many null values there are. If you are loading data into a table, it might make more sense to not load empty strings and use null instead; then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
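Putting those together, you can profile several columns in one pass over the table. A rough sketch (my_table, col_a and col_b are just placeholder names for your real table and columns):
-- one scan of the table; count_if catches empty strings, count(*) - count(col) catches nulls
select count(*)                as total_rows,
       count(*) - count(col_a) as col_a_nulls,
       count_if(col_a = '')    as col_a_empty_strings,
       count(*) - count(col_b) as col_b_nulls,
       count_if(col_b = '')    as col_b_empty_strings
from my_table;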
Also of note, if you have two tables in Snowflake you can compare them via MINUS, aka:
select * from table_1
minus
select * from table_2
This is useful for finding missing rows; you do have to do it in both directions.
You can also HASH rows, or hash the whole table via HASH_AGG.
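For a quick whole-table comparison, something like this should work (a sketch, assuming both tables have the same column layout; if the two hashes match, the row sets are identical):
-- order-insensitive aggregate hash over every row and column
select hash_agg(*) from table_1;
select hash_agg(*) from table_2;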
But normally when looking for missing data, you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so then the above, and COUNT DISTINCT-like commands, come in useful.

Related

Pipe "|" symbol in MariaDB EXPLAIN

I ran an EXPLAIN on a slow (2 minutes to return 2 sorted results) query in MariaDB, and some of the returned columns contain multiple values separated by a "|" symbol.
When using a better index (same query running in 20ms), EXPLAIN returns similar values but separated by a comma.
I spent the last hour looking for any kind of reference online, both in the MariaDB and MySQL documentation (since I'm not sure it's MariaDB-specific), but nothing relevant came up - not even a SO question.
Do you know what the "|" symbol means in this context? Considering the time difference vs the comma-separated result, it feels like a combinatory operator, but adding "combinatory" or "exponential" as a Google search keyword didn't provide any additional insight.
EXPLAIN EXTENDED followed by SHOW WARNINGS didn't provide any additional insight either.
Returned field examples:
TYPE: ref|filter
KEY: key1|key2
KEY_LEN: 9|9
Rows: 2 (0%)
Extra: Using where; Using rowid filter
Thank you for any input!
EDIT: for additional context, here's the hibernate-generated query that produces the result above:
select * from things this_
left outer join rel_tab rt_ on this_.id = rt_.thing_id
left outer join tab2 t2_ on this_.id = t2_.thing_id
where this_.filter1 = 123 and this_.filter2 = 456 and this_.filter3 = 1
order by this_.id desc limit 20;
I also updated the explain plan result above with the filter selectivity.

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers and conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
  auto_limit = TRUE,
  query = "select org_id,
           COUNT(*)
           from table
           GROUP BY org_id
           ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument (a MySQL query) and various others which aren't important right now.
sample data looks like this:
org_id, inserted_at, lead_converted_at
1 10/17/2021 2021-01-27T03:39:03
2 10/18/2021 2021-01-28T03:39:03
1 10/17/2021 2021-01-28T03:39:03
3 10/19/2021 2021-01-29T03:39:03
2 10/18/2021 2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is only available pre-aggregation (such as the number of leads per month per org - in the sample above, the aggregation removes the ability to see more than one instance of org_id 1, for example) from a dataset that needs to be aggregated in order to be accessed in the first place. Maybe I just don't understand this enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that each batch fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
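Putting the pieces together, the whole thing might look roughly like this (a sketch only, not tested against DOMO; the column names follow your sample data, date_format is the MySQL way to truncate to month, and it assumes inserted_at is a real date column - if it is stored as text like 10/17/2021 you would need str_to_date first):
select org_id,
       date_format(inserted_at, '%Y-%m') as lead_month,
       count(*) as lead_count,
       sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count,
       sum(case when lead_converted_at is not null then 1 else 0 end) / count(*) as conversion_rate
from table
group by org_id, date_format(inserted_at, '%Y-%m')
order by org_id, lead_month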

MariaDB update query taking a long time

I'm currently having some problems with our MySQL replication. We're using a master-master setup for failover purposes.
The replication itself is working and I believe it's set up right. But we're having trouble with some queries that take an excruciating amount of time to execute.
Example:
| 166 | database | Connect | 35 | updating | update xx set xx = 'xx' where xx = 'xx' and xx = 'xx' | 0.000 |
These update queries sometimes take 20-30+ seconds to complete, and because of that the replication starts lagging behind; within a day it will be behind by a couple of hours. The strange part is that it will eventually catch up with the other master.
The table is around 100 million rows and about 70 GB in size. On the master where the queries are executed, they take less than a second.
Both configurations, mysql and server, are near identical and we tried optimizing the table and queries, but no luck so far.
Any recommendations we could try to solve this? Let me know if I can provide you with any more information.
Using:
MariaDB 10.1.35 -
CentOS 7.5.1804
The key aspect of this is how many rows you are updating:
If the percentage is low (less than 5% of the rows), then an index can help.
Otherwise, if you are updating a large number of rows (greater than 5%), a full table scan will be optimal. If you have millions of rows, this will be slow. Maybe partitioning the table could help, but I would say you have little chance of improving it.
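If you are not sure which case you are in, you can measure it directly before deciding (a sketch, using the same placeholder names as the example below):
-- roughly what fraction of the table does the UPDATE's WHERE clause match?
select count(*) from my_table where col1 = 'xx' and col2 = 'yy';
select count(*) from my_table;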
I'm going to assume you are updating a small percentage of rows, so you can use an index. Look at the condition in the WHERE statement. If it looks like this:
WHERE col1 = 'xx' and col2 = 'yy'
Then, an index on those columns will make your query faster. Specifically:
create index ix1 on my_table (col1, col2);
Depending on the selectivity of your columns the flipped index could be faster:
create index ix2 on my_table (col2, col1);
You'll need to try which one is better for your specific case.
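One way to compare them is to look at the plan the optimizer actually picks once the indexes exist (a sketch; EXPLAIN on UPDATE statements is available in MariaDB 10.0+, and col3 here is just a stand-in for whatever column the real UPDATE sets):
-- check which index the optimizer chooses for the update
explain update my_table set col3 = 'zz' where col1 = 'xx' and col2 = 'yy';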

How to Query for Recent Rows in SQLITE3

I'm using SQLite3 and trying to query for recent rows. So I'm having SQLite3 insert a unix timestamp into each row with strftime('%s','now'). My Table looks like this:
CREATE TABLE test(id INTEGER PRIMARY KEY, time);
INSERT INTO test (time) VALUES (strftime('%s','now')); --Repeated
SELECT * FROM test;
1|1516816522
2|1516816634
3|1516816646 --etc lots of rows
Now I want to query for only recent entries, for example, I'm trying to get all rows with a time within the last hour. I'm trying the following SQL query:
SELECT * FROM test WHERE time > strftime('%s','now')-60*60;
However, that always returns all rows regardless of the value in the time column. I really don't know what's going on.
Also, if I put WHERE time > strftime('%s','now') it'll return nothing (which is expected) but if I put WHERE time > strftime('%s','now')-1 then it'll return everything. I don't know why.
Here's one more example:
sqlite> SELECT *, strftime('%s','now')-1 AS window FROM test WHERE time > window;
1|1516816522|1516817482
2|1516816634|1516817482
3|1516816646|1516817482
It seems that SQLite3 thinks the values in the middle column are greater than the values in the right column!?
This isn't at all what I expect. Can someone please tell me what's going on? Thanks!
The purpose of strftime() is to format values, so it returns a string.
When you do arithmetic with its return value (strftime('%s','now')-60*60), the database converts it into a number. Your time column, however, holds strings, and in SQLite any number compares as less than any string - which is why the comparison matches every row.
You must ensure that both values in a comparison have the same data type.
The best way to do this is to store numbers in the table:
INSERT INTO test (time)
VALUES (CAST(strftime('%s','now') AS MAKE_THIS_A_NUMBER_PLEASE));
(Or just declare the column type as something with numeric affinity.)
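For example, a minimal version of your schema with numeric affinity (a sketch; with the column declared INTEGER, the strings produced by strftime() are stored as numbers, so the comparison is number vs. number and behaves as expected):
CREATE TABLE test(id INTEGER PRIMARY KEY, time INTEGER);
INSERT INTO test (time) VALUES (strftime('%s','now'));
-- rows from the last hour
SELECT * FROM test WHERE time > strftime('%s','now') - 60*60;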

sqlite subqueries with group_concat as columns in select statements

I have two tables: one contains a list of items, called watch_list, with some important attributes, and the other is just a list of prices, called price_history. What I would like to do is group together 10 of the lowest prices into a single column with a group_concat operation, and then create a row with item attributes from watch_list along with the 10 lowest prices for each item in watch_list.
First I tried joins, but then I realized that the operations were happening in the wrong order, so there was no way I could get the desired result with a join operation. Then I tried the obvious thing and just queried price_history for every row in watch_list and glued everything together in the host environment, which worked but seemed very inefficient.
Now I have the following query, which looks like it should work but isn't giving me the results that I want. I would like to know what is wrong with the following statement:
select w.asin,w.title,
(select group_concat(lowest_used_price) from price_history as p
where p.asin=w.asin limit 10)
as lowest_used
from watch_list as w
Basically I want the limit operation to happen before group_concat does anything but I can't think of a sql statement that will do that.
Figured it out, as somebody once said "All problems in computer science can be solved by another level of indirection." and in this case an extra select subquery did the trick:
select w.asin,w.title,
(select group_concat(lowest_used_price)
from (select lowest_used_price from price_history as p
where p.asin=w.asin limit 10)) as lowest_used
from watch_list as w
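One caveat worth checking: without an ORDER BY, the LIMIT 10 in the inner query returns an arbitrary ten prices, not necessarily the ten lowest. If you specifically want the lowest, something like this should do it (the same query as above, just with an ORDER BY added):
select w.asin, w.title,
       (select group_concat(lowest_used_price)
        from (select lowest_used_price from price_history as p
              where p.asin = w.asin
              order by lowest_used_price
              limit 10)) as lowest_used
from watch_list as w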
