Include non-aggregate column in group by clause (with a slight wrinkle) - SQLite

I have a table that looks something like this:
timestamp            value  person
==================================
2010-01-12 00:00:00  33     emp1
2010-01-12 11:00:00  22     emp1
2010-01-12 09:00:00  16     emp2
2010-01-12 08:00:00  16     emp2
2010-01-12 12:12:00  45     emp3
2010-01-12 13:44:00  64     emp4
2010-01-12 06:00:00  33     emp1
2010-01-12 15:00:00  12     emp5
I wanted to find the maximum value associated with each person. The obvious query was:
select person,max(value) from table group by person
Now I wanted to include the timestamp associated with each max(value). I could not simply add the timestamp column to the above query because, as everyone knows, a non-aggregated column has to appear in the group by clause. So I wrote this instead:
select x.timestamp, x.value, x.person
from table as x,
     (select person, max(value) as maxvalue
      from table
      group by person
      order by maxvalue desc) as y
where x.person = y.person
  and x.value = y.maxvalue
This works -- to an extent. I now see:
timestamp            value  person
==================================
2010-01-12 13:44:00  64     emp4
2010-01-12 12:12:00  45     emp3
2010-01-12 06:00:00  33     emp1
2010-01-12 00:00:00  33     emp1
2010-01-12 08:00:00  16     emp2
2010-01-12 09:00:00  16     emp2
2010-01-12 15:00:00  12     emp5
The problem is that I now get all of the entries for emp1 and emp2 that end up with the same max(value).
Suppose that, among those tied rows for emp1 and emp2, I only want to see the entry with the latest timestamp. In other words, I want this:
timestamp            value  person
==================================
2010-01-12 13:44:00  64     emp4
2010-01-12 12:12:00  45     emp3
2010-01-12 06:00:00  33     emp1
2010-01-12 09:00:00  16     emp2
2010-01-12 15:00:00  12     emp5
What kind of query would I have to write? Is it possible to extend the nested query I wrote to achieve what I want, or does one have to rewrite everything from scratch?
If it's important: because I am using SQLite, timestamps are actually stored as Julian days, and I use the datetime() function to convert them back to a string representation in every query.

You were almost there:
SELECT max(x.timestamp) AS timestamp, x.value, x.person
     , y.max_value, y.ct_value, y.avg_value
FROM   table AS x
JOIN  (
   SELECT person
        , max(value)   AS max_value
        , count(value) AS ct_value
        , avg(value)   AS avg_value
   FROM   table
   GROUP  BY person
   ) AS y ON (x.person, x.value) = (y.person, y.max_value)
GROUP  BY x.person, x.value, y.max_value, y.ct_value, y.avg_value
-- ORDER BY x.person, x.value
You cannot compute max(x.timestamp) in the same nested query, because you don't want the absolute maximum timestamp per person, but the one accompanying the maximum value. So you have to aggregate a second time at the next query level.
Compute max(x.timestamp) before you convert it to its string representation. Your string format would sort correctly, too, but aggregating the raw Julian day numbers should perform better.
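A minimal sketch of that order of operations, using tbl in place of the question's table name (table is a reserved word) and assuming the timestamp column holds Julian day numbers as described:
SELECT datetime(max(x.timestamp)) AS ts_text, -- format once, after aggregating the raw numbers
       x.value, x.person
FROM   tbl AS x
JOIN  (
   SELECT person, max(value) AS max_value
   FROM   tbl
   GROUP  BY person
   ) AS y ON x.person = y.person AND x.value = y.max_value
GROUP  BY x.person, x.value;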
Note how I transformed your cross join with WHERE conditions into an [inner] join with a (simplified) join condition. It does the same thing, just in the more canonical style of the SQL standard, and it is more readable.
All of this could be done in one query level with window functions (max() and first_value()), which all the bigger RDBMSs implement by now: MySQL since 8.0 and SQLite since 3.25.0.
Edit: included additional aggregates after a request in the comments.
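For reference, a single-level sketch with those window functions on SQLite 3.25.0 or later, again using tbl as a stand-in for the question's table name; first_value() takes the timestamp of the top-ranked row per person, with the latest timestamp breaking ties among equal values:
SELECT DISTINCT
       first_value(timestamp) OVER w AS timestamp, -- timestamp of the top-ranked row
       max(value)             OVER w AS value,     -- per-person maximum
       person
FROM   tbl
WINDOW w AS (PARTITION BY person ORDER BY value DESC, timestamp DESC)
ORDER  BY value DESC;
DISTINCT collapses the per-row window results to one row per person, since every row of a partition sees the same first_value() and max().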

Related

How to query SQLite data hourly?

The data table looks like the following:
ID  DATE
1   2020-12-31 10:10:00
2   2020-12-31 20:30:00
3   2020-12-31 20:50:00
4   2021-01-02 17:10:00
5   2021-01-02 17:20:00
6   2021-01-02 17:30:00
7   2021-01-03 23:10:00
..
And I would like to query only the last entry per hour per day, with the result looking like:
ID  DATE
1   2020-12-31 10:10:00
3   2020-12-31 20:50:00
6   2021-01-02 17:30:00
7   2021-01-03 23:10:00
..
I tried to look for an hourly query and found the following:
strftime('%H', " + DATE + ", '+1 hours')
However, I am not sure how to use it properly (e.g. with GROUP BY? And then how to ensure it takes the latest entry of the hour?), so it would be great to have some help here!
You can do it with the ROW_NUMBER() window function:
SELECT ID, DATE
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY strftime('%Y%m%d%H', DATE) ORDER BY DATE DESC) rn
  FROM tablename
)
WHERE rn = 1
ORDER BY ID
Instead of strftime('%Y%m%d%H', DATE) you could also use substr(DATE, 1, 13).
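For instance, here is the same query keyed on the first 13 characters ('YYYY-MM-DD HH'), which identify the day and hour in this timestamp format:
SELECT ID, DATE
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY substr(DATE, 1, 13) ORDER BY DATE DESC) rn
  FROM tablename
)
WHERE rn = 1
ORDER BY ID;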
For versions of SQLite prior to 3.25.0, which do not support window functions, you can do it with NOT EXISTS:
SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
  SELECT 1
  FROM tablename t2
  WHERE strftime('%Y%m%d%H', t2.DATE) = strftime('%Y%m%d%H', t1.DATE)
    AND t2.DATE > t1.DATE
)
Results:
ID  DATE
1   2020-12-31 10:10:00
3   2020-12-31 20:50:00
6   2021-01-02 17:30:00
7   2021-01-03 23:10:00

Creating a SQLite query

I have a SQLite database, and I want to create a query that groups records whose DateTimes are within 60 minutes of each other. The hard part is that the comparison is cumulative: if we have three records with DateTimes 2019-12-14 15:40:00, 2019-12-14 15:56:00 and 2019-12-14 16:55:00, they would all fall into one group. Please see the Hands table and the desired output of the query below to help you understand the requirement.
Database Table "Hands"
ID  DateTime             Result
1   2019-12-14 15:40:00  -100
2   2019-12-14 15:56:00  1000
3   2019-12-14 16:55:00  -2000
4   2012-01-12 12:00:00  400
5   2016-10-01 21:00:00  900
6   2016-10-01 20:55:00  1000
Desired output of query
StartTime            Count  Result
2019-12-14 15:40:00  3      -1100
2012-01-12 12:00:00  1      400
2016-10-01 20:55:00  2      1900
You can use window functions to flag the records at which a new group should start (because the datetime difference with the previous record is 60 minutes or more), and then turn those flags into a running group number. Finally, you can group by that group number and apply the aggregate functions:
with base as (
  -- flag rows that start a new group: gap to the previous row is >= 1 hour
  select DateTime, Result,
         coalesce(cast((
           julianday(DateTime) - julianday(
             lag(DateTime) over (order by DateTime)
           )
         ) * 24 >= 1 as integer), 1) as firstInGroup
  from Hands
), step as (
  -- a running sum of the flags yields a group number
  select DateTime, Result,
         sum(firstInGroup) over (
           order by DateTime
           rows between unbounded preceding and current row) as grp
  from base
)
select min(DateTime) as StartTime,
       count(*)      as Count,
       sum(Result)   as Result
from step
group by grp;

Hive: No results are displayed from the query

I am writing a query on this table to get the sum of size for all the directories, grouped by directory, where the date is yesterday. I am getting no output from the query below.
test.id  test.path                     test.size  test.date
1        this/is/the/path1/fil.txt    232.24     2019-06-01
2        this/is/the/path2/test.txt   324.0      2016-06-01
3        this/is/the/path3/index.txt  12.3       2017-05-01
4        this/is/the/path4/test2.txt  134.0      2019-03-23
5        this/is/the/path1/files.json 2.23       2018-07-23
6        this/is/the/path1/code.java  1.34       2014-03-23
7        this/is/the/path2/data.csv   23.42      2016-06-23
8        this/is/the/path3/test.html  1.33       2018-09-23
9        this/is/the/path4/prog.js    6.356      2019-06-23
4        this/is/the/path4/test2.txt  134.0      2019-04-23
SELECT regexp_replace(path, '[^/]+$', ''), sum(cast(size as decimal))
FROM test
WHERE date > '2019%'
GROUP BY path, size;
You must not group by size, only by regexp_replace(path, '[^/]+$', '').
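For example, regexp_replace('this/is/the/path1/fil.txt', '[^/]+$', '') returns 'this/is/the/path1/', so all files in the same directory end up in the same group.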
Also, since you want only yesterday's rows, why do you use WHERE date > '2019%'?
You can get yesterday's date with date_sub(current_date, 1):
select regexp_replace(path, '[^/]+$', ''),
       sum(cast(size as decimal))
from test
where date = date_sub(current_date, 1)
group by regexp_replace(path, '[^/]+$', '');
You probably want WHERE date >= '2019-01-01'. Using % in match patterns, for example your 2019%, only works with LIKE, not with inequality comparisons.
The example you gave looks like you want all rows in calendar year 2019.
For yesterday, you want:
WHERE date >= DATE_SUB(current_date, 1)
  AND date < current_date
This works even if your date column contains timestamps.

Time to failure variable based off start and end timestamps in R

I have two data sets. Data set 1 contains timestamps at 15-minute intervals, starting at 2009-08-18 18:15:00 and ending 2012-11-09 22:30:00, with measurements taken at those times. Data set 2 has start and end timestamps for faults occurring in a factory. There are 6 fault types; each fault's start and end times also fall on 15-minute boundaries, although a fault can last longer than one interval, and they all fall between 2009-08-18 18:15:00 and 2012-11-09 22:30:00 as well.
I am trying to create a time-to-failure variable for the faults, where -i would indicate that the next fault is i intervals (of 15 minutes) away, and i would indicate that the fault started i intervals ago. For example,
DataSet1
Timestamp            Sensor 1
2009-09-04 10:00:00  30
2009-09-04 10:30:00  40
2009-09-04 10:45:00  33
2009-09-04 11:00:00  23
2009-09-04 11:15:00  24
2009-09-04 11:30:00  42
DataSet 2
Start Time      End Time        Fault Type
09/04/09 10:45  9/4/2009 11:15  1
09/04/09 21:45  9/4/2009 22:00  1
09/04/09 23:00  9/4/2009 23:15  1
09/05/09 10:45  9/5/2009 11:15  1
09/05/09 21:30  9/5/2009 23:15  1
09/08/09 10:45  9/8/2009 12:30  1
So what I want to end up with is the following time-to-failure variable (TTF1), and then to repeat the process for faults 2-6:
Timestamp            Sensor 1  TTF1
2009-09-04 10:00:00  30        -3
2009-09-04 10:30:00  40        -1
2009-09-04 10:45:00  33        0
2009-09-04 11:00:00  23        1
2009-09-04 11:15:00  24        2
2009-09-04 11:30:00  42        -41
I know I can use the sqldf function to separate out each fault type, but I have no clue where to even begin creating the time-to-fault variable. I'm very stuck; any help would be greatly appreciated!
You can use the difftime() function from base R to get the time difference between two timestamps:
(z <- Sys.time() - 3600)   # a timestamp one hour in the past
Sys.time() - z             # just over 3600 seconds
as.difftime(c("0:3:20", "11:23:15"))                     # parsed as H:M:S
as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M")  # 3rd gives NA
(z <- as.difftime(c(0, 30, 60), units = "mins"))
as.numeric(z, units = "secs")   # convert to seconds
as.numeric(z, units = "hours")  # convert to hours
format(z)
I would recommend setting units = "mins". You can convert the result to character, strip out any non-numeric data with gsub, then change the class with as.numeric. Finally, just divide by 15 to get the 15-minute units you want. You can use floor to round the result if needed.
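For example, the first reading above (2009-09-04 10:00:00) is 45 minutes before the 10:45 fault start, and -45 / 15 = -3, matching the first TTF1 value.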

Pandas increment time series by one minute

I want to add a minute to the values in STA_STD to get a regular 5-minute time series, if the value in that column contains "23:59:00". Adding one minute should also change the date to the next day, 00:00 hours.
My code is here:
dat = pd.read_csv("temp.csv")
if (dat['STA_STD'].str.contains("23:59:00")):
    dat['STA_STD_NEW'] = pd.to_datetime(dat[dat['STA_STD'].str.contains("23:59:00")]['STA_STD']) + datetime.timedelta(minutes=1)
else:
    dat['STA_STD_NEW'] = dat['STA_STD']
And this gives me the error below:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The pandas documentation talks about the same error.
What is the procedure to iterate through all rows and increment the value by one minute if it contains "23:59:00"?
Please advise.
Two things:
You can't use if/else like that to evaluate multiple values at the same time (you would need to iterate over the values and do the if/else for each value separately). Using boolean indexing is much better in this case.
str.contains does not work with datetimes, but you can, for example, check whether the time part of the datetime values equals datetime.time(23, 59).
A small example:
In [2]: dat = pd.DataFrame({'STA_STD':pd.date_range('2012-01-01 23:50', periods=10, freq='1min')})
In [3]: dat['STA_STD_NEW'] = dat['STA_STD']
In [4]: dat.loc[dat['STA_STD'].dt.time == datetime.time(23,59), 'STA_STD_NEW'] += datetime.timedelta(minutes=1)
In [5]: dat
Out[5]:
              STA_STD         STA_STD_NEW
0 2012-01-01 23:50:00 2012-01-01 23:50:00
1 2012-01-01 23:51:00 2012-01-01 23:51:00
2 2012-01-01 23:52:00 2012-01-01 23:52:00
3 2012-01-01 23:53:00 2012-01-01 23:53:00
4 2012-01-01 23:54:00 2012-01-01 23:54:00
5 2012-01-01 23:55:00 2012-01-01 23:55:00
6 2012-01-01 23:56:00 2012-01-01 23:56:00
7 2012-01-01 23:57:00 2012-01-01 23:57:00
8 2012-01-01 23:58:00 2012-01-01 23:58:00
9 2012-01-01 23:59:00 2012-01-02 00:00:00 <-- increment of 1 minute
Using the dt.time approach requires pandas >= 0.15.
