I have a SQLite database and want to write a query that groups records whose DateTime values are within 60 minutes of each other. The hard part is that the grouping is cumulative: each record only needs to be within 60 minutes of the previous one, so three records with DateTimes 2019-12-14 15:40:00, 2019-12-14 15:56:00 and 2019-12-14 16:55:00 all fall into one group. The Hands table and the desired query output below illustrate the requirement.
Database Table "Hands"
ID DateTime Result
1 2019-12-14 15:40:00 -100
2 2019-12-14 15:56:00 1000
3 2019-12-14 16:55:00 -2000
4 2012-01-12 12:00:00 400
5 2016-10-01 21:00:00 900
6 2016-10-01 20:55:00 1000
Desired output of query
StartTime Count Result
2019-12-14 15:40:00 3 -1100
2012-01-12 12:00:00 1 400
2016-10-01 20:55:00 2 1900
You can use some window functions to indicate at which record a new group should start (because of a datetime difference with the previous that is 60 minutes or larger), and then to turn that information into a unique group number. Finally you can group by that group number and perform the aggregation functions on it:
with base as (
    select DateTime, Result,
           coalesce(cast((
               julianday(DateTime) - julianday(
                   lag(DateTime) over (order by DateTime)
               )
           ) * 24 >= 1 as integer), 1) as firstInGroup
    from Hands
), step as (
    select DateTime, Result,
           sum(firstInGroup) over (
               order by DateTime rows
               between unbounded preceding and current row) as grp
    from base
)
select min(DateTime) DateTime,
       count(*) Count,
       sum(Result) Result
from step
group by grp;
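The same grouping logic can be sketched outside the database. This is a minimal illustration in plain Python, not part of the original answer: the helper name group_sessions and the HANDS list are assumptions made for the example, and the rows are copied from the Hands table above. A new group starts whenever the gap to the previous row is 60 minutes or more, mirroring the firstInGroup flag in the query.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Rows copied from the "Hands" table above: (ID, DateTime, Result).
HANDS = [
    (1, "2019-12-14 15:40:00", -100),
    (2, "2019-12-14 15:56:00", 1000),
    (3, "2019-12-14 16:55:00", -2000),
    (4, "2012-01-12 12:00:00", 400),
    (5, "2016-10-01 21:00:00", 900),
    (6, "2016-10-01 20:55:00", 1000),
]


def group_sessions(rows, gap=timedelta(minutes=60)):
    """Return (start_time, count, total) for each chain of rows less than `gap` apart.

    Hypothetical helper: sorts by time, then opens a new group whenever the
    distance to the previous row is `gap` or more.
    """
    ordered = sorted((datetime.strptime(ts, FMT), result) for _, ts, result in rows)
    sessions = []
    previous = None
    for ts, result in ordered:
        if previous is None or ts - previous >= gap:
            sessions.append([ts, 0, 0])  # [start, count, running total]
        sessions[-1][1] += 1
        sessions[-1][2] += result
        previous = ts
    return [(start.strftime(FMT), count, total) for start, count, total in sessions]


sessions = group_sessions(HANDS)
for session in sessions:
    print(session)
```

The groups come out ordered by start time, so the order differs from the desired output shown in the question, but the start times, counts and sums are the same.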
I have a df in R with numerous records in the format below, with 'arrival_time' values spanning a 12-hour period.
id arrival_time wait_time_value
1 2020-02-20 12:02:00 10
2 2020-02-20 12:04:00 5
99900 2020-02-20 23:47:00 8
10000 2020-02-20 23:59:00 21
I would like to create a new df that has a row for each 15 minute slot of the arrival time period and the wait_time_value of the record with the earliest arrival time in that slot. So, in the above example, the first and last row of the new df would look like:
id period_start wait_time_value
1 2020-02-20 12:00:00 10
48 2020-02-20 23:45:00 8
I have used the code below to get the mean wait time of all records in each 15-minute range, but I'm not sure how to select the value of the earliest record instead.
df$period_start <- align.time(df$arrival_time- 899, n = 60*15)
avgwait_df <- aggregate(wait_time_value ~ period_start, df, mean)
In pandas, use DataFrame.resample with first(), drop the NaN rows produced for empty slots, and convert the result back to a DataFrame:
df['arrival_time'] = pd.to_datetime(df['arrival_time'])
df = (df.resample('15Min', on='arrival_time')['wait_time_value']
.first()
.dropna()
.reset_index(name='wait_time_value'))
print (df)
arrival_time wait_time_value
0 2020-02-20 12:00:00 10.0
1 2020-02-20 23:45:00 8.0
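Using dplyr (note that min(wait_time_value) would return the smallest wait time in the slot, not the wait time of the earliest arrival, so pick the value at which.min(arrival_time)):
library(dplyr)
df %>%
  group_by(period_start) %>%
  summarise(wait_time_value = wait_time_value[which.min(arrival_time)])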
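For readers who want to see the slot logic without pandas or dplyr, here is a minimal standard-library sketch. The names floor_to_slot and first_wait_per_slot are hypothetical, the rows are the four shown in the question, and the flooring only works for slot widths that divide 60 minutes evenly.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (id, arrival_time, wait_time_value) rows copied from the question.
ROWS = [
    (1, "2020-02-20 12:02:00", 10),
    (2, "2020-02-20 12:04:00", 5),
    (99900, "2020-02-20 23:47:00", 8),
    (10000, "2020-02-20 23:59:00", 21),
]


def floor_to_slot(ts, minutes=15):
    """Round a datetime down to the start of its slot (assumes 60 % minutes == 0)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def first_wait_per_slot(rows, minutes=15):
    """Return (slot_start, wait_time) using the earliest arrival in each slot."""
    earliest = {}
    for _, arrival, wait in rows:
        ts = datetime.strptime(arrival, FMT)
        slot = floor_to_slot(ts, minutes)
        # Keep the row only if it arrived before the one stored for this slot.
        if slot not in earliest or ts < earliest[slot][0]:
            earliest[slot] = (ts, wait)
    return [(slot.strftime(FMT), wait) for slot, (_, wait) in sorted(earliest.items())]


slots = first_wait_per_slot(ROWS)
for slot in slots:
    print(slot)
```

Slots with no arrivals are simply absent from the result, which matches the effect of dropna() in the pandas answer.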
The data table looks like the following:
ID DATE
1 2020-12-31 10:10:00
2 2020-12-31 20:30:00
3 2020-12-31 20:50:00
4 2021-01-02 17:10:00
5 2021-01-02 17:20:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
And I would like to query only the last entry per hour per day, and to get a result like:
ID DATE
1 2020-12-31 10:10:00
3 2020-12-31 20:50:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
I tried to look for hourly query and found the following
strftime('%H', " + DATE + ", '+1 hours')
However, I am not sure how to use it properly (e.g. with GROUP BY? And then how do I make sure it takes the latest entry of the hour?), so it would be great to have some help here!
You can do it with ROW_NUMBER() window function:
SELECT ID, DATE
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY strftime('%Y%m%d%H', DATE) ORDER BY DATE DESC) rn
    FROM tablename
)
WHERE rn = 1
ORDER BY ID
Instead of strftime('%Y%m%d%H', DATE) you could also use substr(DATE, 1, 13).
For versions of SQLite previous to 3.25.0 which do not support window functions you can do it with NOT EXISTS:
SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
    SELECT 1
    FROM tablename t2
    WHERE strftime('%Y%m%d%H', t2.DATE) = strftime('%Y%m%d%H', t1.DATE)
      AND t2.DATE > t1.DATE
)
Results:
> ID | DATE
> -: | :------------------
> 1 | 2020-12-31 10:10:00
> 3 | 2020-12-31 20:50:00
> 6 | 2021-01-02 17:30:00
> 7 | 2021-01-03 23:10:00
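To try the NOT EXISTS variant locally, a small sketch with Python's built-in sqlite3 module is shown below. It loads the seven rows from the question into an in-memory database; the hour bucket uses the substr(DATE, 1, 13) alternative mentioned above, and the explicit ORDER BY is an addition made here so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (ID INTEGER PRIMARY KEY, DATE TEXT)")
conn.executemany(
    "INSERT INTO tablename (ID, DATE) VALUES (?, ?)",
    [
        (1, "2020-12-31 10:10:00"),
        (2, "2020-12-31 20:30:00"),
        (3, "2020-12-31 20:50:00"),
        (4, "2021-01-02 17:10:00"),
        (5, "2021-01-02 17:20:00"),
        (6, "2021-01-02 17:30:00"),
        (7, "2021-01-03 23:10:00"),
    ],
)

# Keep a row only if no later row exists in the same day-and-hour bucket.
# substr(DATE, 1, 13) yields 'YYYY-MM-DD HH', which identifies that bucket.
QUERY = """
SELECT t1.ID, t1.DATE
FROM tablename t1
WHERE NOT EXISTS (
    SELECT 1
    FROM tablename t2
    WHERE substr(t2.DATE, 1, 13) = substr(t1.DATE, 1, 13)
      AND t2.DATE > t1.DATE
)
ORDER BY t1.ID
"""

last_per_hour = conn.execute(QUERY).fetchall()
for row in last_per_hour:
    print(row)
```

Because this form uses no window functions, it should behave the same on SQLite builds older than 3.25.0.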
I am writing a query on this table to get the sum of size for all the directories, group by directory where date is yesterday. I am getting no output from the below query.
test.id test.path test.size test.date
1 this/is/the/path1/fil.txt 232.24 2019-06-01
2 this/is/the/path2/test.txt 324.0 2016-06-01
3 this/is/the/path3/index.txt 12.3 2017-05-01
4 this/is/the/path4/test2.txt 134.0 2019-03-23
5 this/is/the/path1/files.json 2.23 2018-07-23
6 this/is/the/path1/code.java 1.34 2014-03-23
7 this/is/the/path2/data.csv 23.42 2016-06-23
8 this/is/the/path3/test.html 1.33 2018-09-23
9 this/is/the/path4/prog.js 6.356 2019-06-23
4 this/is/the/path4/test2.txt 134.0 2019-04-23
SELECT regexp_replace(path,'[^/]+$',''), sum(cast(size as decimal))
from test WHERE date > date_sub(current_date, 1) GROUP BY path,size;
You must not group by size, only by regexp_replace(path,'[^/]+$','').
Also, since you want only yesterday's rows, why do you use WHERE date > '2019%'?
You can get yesterday's date with date_sub(current_date, 1):
select
regexp_replace(path,'[^/]+$',''),
sum(cast(size as decimal))
from test
where date = date_sub(current_date, 1)
group by regexp_replace(path,'[^/]+$','');
You probably want WHERE date >= '2019-01-01'. Using % in matching strings, for example your 2019%, only works with LIKE, not inequality matching.
The example you gave looks like you want all rows in calendar year 2019.
For yesterday, you want
WHERE date >= DATE_SUB(current_date, 1)
AND date < current_date
This works even if your date column contains timestamps.
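The two points made in these answers (group by the directory only, and filter on a half-open range covering yesterday) can be illustrated with a short Python sketch. The function name size_by_directory is hypothetical, the rows are copied from the question, and "today" is pinned to 2019-06-24 purely so the example is reproducible.

```python
import re
from collections import defaultdict
from datetime import date, timedelta

# (path, size, date) rows copied from the question's table.
ROWS = [
    ("this/is/the/path1/fil.txt", 232.24, "2019-06-01"),
    ("this/is/the/path2/test.txt", 324.0, "2016-06-01"),
    ("this/is/the/path3/index.txt", 12.3, "2017-05-01"),
    ("this/is/the/path4/test2.txt", 134.0, "2019-03-23"),
    ("this/is/the/path1/files.json", 2.23, "2018-07-23"),
    ("this/is/the/path1/code.java", 1.34, "2014-03-23"),
    ("this/is/the/path2/data.csv", 23.42, "2016-06-23"),
    ("this/is/the/path3/test.html", 1.33, "2018-09-23"),
    ("this/is/the/path4/prog.js", 6.356, "2019-06-23"),
    ("this/is/the/path4/test2.txt", 134.0, "2019-04-23"),
]


def size_by_directory(rows, today):
    """Sum sizes per directory for rows dated yesterday (relative to `today`)."""
    yesterday = today - timedelta(days=1)
    totals = defaultdict(float)
    for path, size, day in rows:
        # Half-open range: yesterday <= date < today.
        if yesterday <= date.fromisoformat(day) < today:
            # Same idea as regexp_replace(path, '[^/]+$', ''): strip the file name.
            directory = re.sub(r"[^/]+$", "", path)
            totals[directory] += size
    return dict(totals)


# Assumption for the example: pretend the query runs on 2019-06-24.
totals = size_by_directory(ROWS, today=date(2019, 6, 24))
print(totals)
```

With real data the pinned date would be replaced by date.today(); if no row is dated yesterday, the result is an empty dict, which is also a plausible reason the original query returned nothing.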
I have a table (pay_period) as follows:
pay_period
period_id list_id start_date end_date price
1 100 2017-01-01 2017-08-31 100
2 100 2017-09-01 2017-12-31 110
3 101 2017-01-01 2017-08-31 75
Now I have list_id, checkin_date, checkout_date
list_id 100
checkin_date 2017-08-25
checkout_date 2017-09-10
I need to calculate the price of a list for the period from checkin date to checkout date.
therefore the calculation is supposed to be
7 * 100 + 10 * 110
I am thinking to do it with a for loop, if there is any other better way to do it, can you please suggest?
You have to check whether checkin_date and checkout_date fall within the same period_id.
1.1 If yes, multiply the price by the number of days.
1.2 If no, count the days from checkin_date until the end of the first period and multiply by the corresponding price, then do the same for the days from the beginning of the next period up to checkout_date.
Note: I guess there may be more than 2 prices per list_id, for example:
period_id list_id start_date end_date price
1 100 2017-01-01 2017-04-30 100
2 100 2017-05-01 2017-09-30 110
3 100 2017-10-01 2017-12-31 120
4 101 2017-01-01 2017-08-31 75
and the calculation period to be:
list_id 100
checkin_date 2017-03-01
checkout_date 2017-11-10
In this case, yes, the solution would be to have a CURSOR where to keep the prices for list_id and periods; loop through it and compare the checkin_date and checkout_date with each record.
Best,
Mikcutu.
You can do the following for much cleaner code. Although it could be done purely in SQL, I am using a function to make the code easier to understand.
Create a generic function that returns the number of overlapping days between 2 date ranges.
CREATE OR REPLACE FUNCTION fn_count_range
( p_start_date1 IN DATE,
p_end_date1 IN DATE,
p_start_date2 IN DATE,
p_end_date2 IN DATE ) RETURN NUMBER AS
v_days NUMBER;
BEGIN
IF p_end_date1 < p_start_date1 OR p_end_date2 < p_start_date2 THEN
RETURN 0;
END IF;
SELECT COUNT(*) INTO v_days
FROM (
(SELECT p_start_date1 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date1 - p_start_date1 + 1 ) INTERSECT
(SELECT p_start_date2 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date2 - p_start_date2 + 1 ) );
RETURN v_days;
END;
/
Now, your query to calculate the total price is simplified.
WITH lists ( list_id,
checkin_date,
checkout_date) AS
( SELECT 100,
TO_DATE('2017-08-25','YYYY-MM-DD'),
TO_DATE('2017-09-10','YYYY-MM-DD')
FROM dual) --Not required if you have a lists table.
SELECT l.list_id,
SUM(fn_count_range(start_date,end_date,checkin_date,checkout_date) * price) total_price
FROM pay_period p
JOIN lists l ON p.list_id = l.list_id
GROUP BY l.list_id;
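The overlap count can also be computed arithmetically instead of generating and intersecting day lists. The following Python sketch shows that idea under the same inclusive-day convention as the question (7 * 100 + 10 * 110); overlap_days, total_price and PAY_PERIODS are names invented for this example, with the rows taken from the pay_period table above.

```python
from datetime import date


def overlap_days(start1, end1, start2, end2):
    """Number of days shared by two inclusive date ranges (0 if none)."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max((earliest_end - latest_start).days + 1, 0)


# (period_id, list_id, start_date, end_date, price) from the pay_period table.
PAY_PERIODS = [
    (1, 100, date(2017, 1, 1), date(2017, 8, 31), 100),
    (2, 100, date(2017, 9, 1), date(2017, 12, 31), 110),
    (3, 101, date(2017, 1, 1), date(2017, 8, 31), 75),
]


def total_price(list_id, checkin, checkout, periods=PAY_PERIODS):
    """Sum price * overlapping days over every pay period of `list_id`."""
    return sum(
        overlap_days(start, end, checkin, checkout) * price
        for _, period_list_id, start, end, price in periods
        if period_list_id == list_id
    )


total = total_price(100, date(2017, 8, 25), date(2017, 9, 10))
print(total)  # 7 * 100 + 10 * 110 = 1800
```

Because every period of the list is visited, the same function covers stays that span more than two price periods without any special casing.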
I have a table that looks something like this:
timestamp value person
===============================================
2010-01-12 00:00:00 33 emp1
2010-01-12 11:00:00 22 emp1
2010-01-12 09:00:00 16 emp2
2010-01-12 08:00:00 16 emp2
2010-01-12 12:12:00 45 emp3
2010-01-12 13:44:00 64 emp4
2010-01-12 06:00:00 33 emp1
2010-01-12 15:00:00 12 emp5
I wanted to find the maximum value associated with each person. The obvious query was:
select person,max(value) from table group by person
Now I wanted to include the timestamp associated with each max(value). I could not select the timestamp column in the above query because it does not appear in the GROUP BY clause. So I wrote this instead:
select x.timestamp,x.value,x.person from table as x,
(select person,max(value) as maxvalue from table group by person order by maxvalue
desc) as y
where x.person = y.person
and x.value = y.maxvalue
This works -- to an extent. I now see:
timestamp value person
===============================================
2010-01-12 13:44:00 64 emp4
2010-01-12 12:12:00 45 emp3
2010-01-12 06:00:00 33 emp1
2010-01-12 00:00:00 33 emp1
2010-01-12 08:00:00 16 emp2
2010-01-12 09:00:00 16 emp2
2010-01-12 15:00:00 12 emp5
The problem is that I now get every entry for emp1 and emp2 that shares the same max(value).
Suppose among emp1 and emp2, I only want to see the entry with the latest timestamp. IOW, I want this:
timestamp value person
===============================================
2010-01-12 13:44:00 64 emp4
2010-01-12 12:12:00 45 emp3
2010-01-12 06:00:00 33 emp1
2010-01-12 09:00:00 16 emp2
2010-01-12 15:00:00 12 emp5
What kind of query would I have to write? Is it possible to extend the nested query I wrote to achieve what I want, or does one have to rewrite everything from scratch?
In case it's important: because I am using SQLite, timestamps are actually stored as Julian days. I use the datetime() function to convert them back to a string representation in every query.
You were almost there:
SELECT max(x.timestamp) AS timestamp, x.value, x.person
     , y.max_value, y.ct_value, y.avg_value
FROM   table AS x
JOIN  (
    SELECT person
         , max(value) as max_value
         , count(value) as ct_value
         , avg(value) as avg_value
    FROM   table
    GROUP  BY person
    ) AS y ON (x.person, x.value) = (y.person, y.max_value)
GROUP  BY x.person, x.value, y.max_value, y.ct_value, y.avg_value
-- ORDER BY x.person, x.value
You cannot compute max(x.timestamp) in the same nested query, because you don't want the absolute maximum per person, but the one accompanying the maximum value. So you have to aggregate another time on the next query level.
Compute max(x.timestamp) before you convert it to its string representation. Your string format would sort correctly, too, but aggregating on the raw value should perform better.
Note how I transformed your cross join with where conditions to an [inner] join with a (simplified) join condition. Does the same, just more like the canonical way of the SQL standard and more readable.
All of this could be done in one query level with window functions (max() and first_value()), which are implemented in all the bigger RDBMS. SQLite supports them from version 3.25.0 and MySQL from 8.0; older versions of both do not.
Edit
Included additional aggregates after request in comment.
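A reduced version of the join can be tried with Python's built-in sqlite3 module. This sketch makes a few assumptions that are not in the original answer: the table is called readings (table itself is a reserved word), timestamps are stored as ISO strings rather than Julian days, the extra aggregates are left out, and the row-value comparison is written as two plain AND conditions so it also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "readings" is a hypothetical table name standing in for the question's table.
conn.execute("CREATE TABLE readings (timestamp TEXT, value INTEGER, person TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2010-01-12 00:00:00", 33, "emp1"),
        ("2010-01-12 11:00:00", 22, "emp1"),
        ("2010-01-12 09:00:00", 16, "emp2"),
        ("2010-01-12 08:00:00", 16, "emp2"),
        ("2010-01-12 12:12:00", 45, "emp3"),
        ("2010-01-12 13:44:00", 64, "emp4"),
        ("2010-01-12 06:00:00", 33, "emp1"),
        ("2010-01-12 15:00:00", 12, "emp5"),
    ],
)

# Inner query: maximum value per person.
# Outer query: among rows that hit that maximum, keep the latest timestamp.
QUERY = """
SELECT max(x.timestamp) AS timestamp, x.value, x.person
FROM readings AS x
JOIN (
    SELECT person, max(value) AS max_value
    FROM readings
    GROUP BY person
) AS y ON x.person = y.person AND x.value = y.max_value
GROUP BY x.person, x.value
ORDER BY x.value DESC
"""

latest_max = conn.execute(QUERY).fetchall()
for row in latest_max:
    print(row)
```

The ORDER BY is only there to reproduce the ordering of the desired output in the question; it plays no part in breaking the ties.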