I am using the R package sqldf and am having trouble finding the number of days between two datetime variables. The variables ledger_entry_created_at and created_at are Unix epochs, and when I try to subtract them after casting to julianday, I get back a vector of NAs.
I've taken a look at this previous question and didn't find it useful, since my answer needs to be given in SQL for reasons that are outside the scope of this question.
If anyone could help me figure out a way to do this inside sqldf I would be grateful.
EDIT:
SELECT strftime('%Y-%m-%d %H:%M:%S', l.created_at, 'unixepoch') ledger_entry_created_at,
l.ledger_entry_id, l.account_id, l.amount, a.user_id, u.created_at
FROM ledger l
LEFT JOIN accounts a
ON l.account_id = a.account_id
LEFT JOIN users u
ON a.user_id = u.user_id
This answer is trivial, but if you already have two UNIX timestamps, and you want to find out how many days have elapsed between them, you can simply take the difference in seconds (their original unit), and convert to days, e.g.
SELECT
(l.created_at - u.created_at) / (3600*24) AS diff
-- and maybe other columns here
FROM ledger l
LEFT JOIN accounts a
ON l.account_id = a.account_id
LEFT JOIN users u
ON a.user_id = u.user_id;
I don't know why your current approach is failing, as the timestamps you have in the screen capture should be valid inputs to SQLite's julianday function. But, again, you may not need such a complicated route to get the result you want.
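For reference, a julianday-based version would need the 'unixepoch' modifier so SQLite treats the raw numbers as epoch seconds; an untested sketch against the same schema:
SELECT julianday(l.created_at, 'unixepoch') - julianday(u.created_at, 'unixepoch') AS diff_days
FROM ledger l
LEFT JOIN accounts a ON l.account_id = a.account_id
LEFT JOIN users u ON a.user_id = u.user_id;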
I am new to Snowflake, and running a query to get a couple of days' data - this returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter - and the kernel restarts/dies before the query ends. Even if it got into Jupyter - I doubt I could analyze the data in any reasonable timeframe (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake - and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say "missing value", note that this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a table, and it knows how many NULLs are in each column. If you are loading data into a table, it might make more sense not to load empty strings and use NULL instead; then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
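Combining these, one pass over the table can audit several columns at once; a sketch with assumed table and column names:
select count(*)                 as total_rows,
       count(*) - count(col_a)  as null_col_a,
       count_if(col_b = '')     as empty_col_b
from my_table;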
Also of note, if you have two tables in Snowflake you can compare them via MINUS,
aka
select * from table_1
minus
select * from table_2
This is useful for finding missing rows; you do have to do it in both directions.
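For the reverse direction (rows present only in table_2), just swap the operands:
select * from table_2
minus
select * from table_1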
Then you can HASH rows, or hash the whole table via HASH_AGG.
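A sketch of the HASH_AGG comparison, assuming both tables share the same column layout:
-- equal results imply identical contents, independent of row order
select hash_agg(*) from table_1;
select hash_agg(*) from table_2;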
But normally when looking for missing data, you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so the above, and COUNT DISTINCT-like commands, come in useful.
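For the duplicate hunt, a sketch over an assumed key column:
select key_col, count(*) as occurrences
from my_table
group by key_col
having count(*) > 1;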
I'm generating a table which will in turn be used to format several different statistics and graphs.
Some columns of this table are the result of subqueries which use a nearly identical structure. My query works, but it is very inefficient even in a simplified example like the following one.
SELECT
o.order,
o.date,
c.clienttype,
o.producttype,
(SELECT date FROM orders_interactions oi LEFT JOIN categories ct ON ct.order=oi.order AND category=3 WHERE oi.order=o.order) as completiondate,
(SELECT amount FROM orders_interactions oi LEFT JOIN categories ct ON ct.order=oi.order AND category=3 WHERE oi.order=o.order) as amount,
DATEDIFF((SELECT date FROM orders_interactions oi LEFT JOIN categories ct ON ct.order=oi.order AND category=3 WHERE oi.order=o.order), o.date) as elapseddays
FROM orders o
LEFT JOIN clients c ON c.idClient=o.idClient
This being a simplified example of a much more complex query, I would like to know the recommended approaches for a query like this one, taking into account query time and readability.
As the example shows, I had to repeat a subquery (the one with date) just to calculate a datediff, since I cannot directly reference the column 'completiondate'.
Thank you
You can try a left join.
SELECT o.order,
o.date,
o.producttype,
oi.date completiondate,
oi.amount,
datediff(oi.date, o.date) elapseddays
FROM orders o
LEFT JOIN orders_interactions oi
ON oi.order = o.order
AND oi.category = 3;
That doesn't necessarily perform better, but there are good chances it will. For performance, an index on orders_interactions (order, category) might help in any case.
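That index, as a sketch (order is a reserved word in MySQL, so it needs backquotes):
create index idx_oi_order_category on orders_interactions (`order`, category);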
Whether you consider it more readable is up to you, but at least it's less repetitive. (Which doesn't necessarily translate to better performance; just because an expression is repeated in a query doesn't mean it is repeatedly calculated.)
It seems I might have found the answer.
In my opinion, it improves readability quite a bit, and in my real usage scenario, both profile and execution plans are way more efficient, and results are returned in less than 1/3 of the time.
My answer relies on using a SELECT inside the LEFT JOIN, hence using a subquery as the JOIN's 'input'.
SELECT
o.order,
o.date,
c.clienttype,
o.producttype,
tmp.date,
tmp.amount,
DATEDIFF(tmp.date,o.date) as elapseddays
FROM orders o
LEFT JOIN clients c ON c.idClient=o.idClient
LEFT JOIN (
    SELECT order, date, amount
    FROM orders_interactions oi
    LEFT JOIN categories ct ON ct.order = oi.order AND category = 3
) AS tmp ON tmp.order = o.order
The idea for this answer, and the explanation of how and why it works, came from this post: Mysql Reference subquery result in parent where clause
I have a table that includes a 'LastUpdated' column that is generated when the row is inserted using Sqlite's datetime('now') function.
How do I write a Select statement that finds all rows with 'LastUpdated' more than 100 days old?
I think it's a variant of:
SELECT * FROM Table WHERE (DATETIME('Now')-100 Days) > LastUpdated
But I'm unsure of:
a) How to specify the 100 Days?
b) Whether I can actually compare datetimes like this or if I first have to convert DATETIME('Now') to a string?
c) DATETIME('Now') results in UTC time, correct? I think so from my reading of the documentation, but it was a little confusing...
Figured it out--I didn't see all the handy modifiers at the bottom of the SQLite Datetime Documentation.
There are a bunch of helpful examples there demonstrating addition/subtraction of any datetime unit (years, months, hours, seconds, etc.).
SELECT * FROM Table WHERE DATETIME('Now','-100 Days') > LastUpdated
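Regarding (b) and (c): DATETIME() returns ISO-8601 text ('YYYY-MM-DD HH:MM:SS'), which sorts lexicographically in chronological order, so the string comparison above works without any conversion, and 'now' is indeed evaluated in UTC; add the 'localtime' modifier if you want local time:
SELECT DATETIME('now');              -- UTC
SELECT DATETIME('now', 'localtime'); -- local time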
I am good with SQL and I naturally use the sqldf package.
However, it is useful to know the native R way to achieve various SQL commands.
On a data frame column, how can I achieve a count similar to the one in the last command below?
library(sqldf)      # for sqldf()
library(reshape2)   # provides the tips dataset
head(tips,3)
sqldf("select count(distinct day) from tips")
OK, I got a little better now and can answer my own question.
d <- table(tips$day)
and then count how many entries that table has with dim(d).
I have two tables: one contains a list of items, called watch_list, with some important attributes, and the other is just a list of prices, called price_history. What I would like to do is group together the 10 lowest prices into a single column with a group_concat operation, and then create a row with the item attributes from watch_list along with the 10 lowest prices for each item in watch_list.
First I tried joins, but then I realized that the operations were happening in the wrong order, so there was no way I could get the desired result with a join operation. Then I tried the obvious thing and just queried price_history for every row in watch_list and glued everything together in the host environment, which worked but seemed very inefficient. Now I have the following query, which looks like it should work but is not giving me the results that I want. I would like to know what is wrong with the following statement:
select w.asin,w.title,
(select group_concat(lowest_used_price) from price_history as p
where p.asin=w.asin limit 10)
as lowest_used
from watch_list as w
Basically I want the limit operation to happen before group_concat does anything, but I can't think of an SQL statement that will do that.
Figured it out. As somebody once said, "All problems in computer science can be solved by another level of indirection," and in this case an extra select subquery did the trick:
select w.asin, w.title,
       (select group_concat(lowest_used_price)
        from (select lowest_used_price
              from price_history as p
              where p.asin = w.asin
              order by lowest_used_price  -- sort first so the limit keeps the 10 lowest
              limit 10)) as lowest_used
from watch_list as w
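This works because the inner derived table is evaluated first, so the order by and limit have already reduced the rows to the ten lowest prices before group_concat aggregates them into one value.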