Finding specific within-group sequences in R

I am working on a data set containing the results of repeated tests (expressed as positive (1) / negative (0)) on individuals over time; the number of tests per individual is not necessarily the same.
Below is a df reproducing what my dataset looks like:
id<-c(rep("a", time=5), rep("b", time=5), rep("c",time=7))
date<-as.Date(c("2018-03-01","2018-04-01","2018-06-01","2018-08-01","2018-10-01","2017-03-01","2017-04-01","2018-02-01","2018-11-01","2018-12-01","2016-05-11","2017-10-01","2018-03-01","2018-03-21","2018-4-01","2018-07-01","2018-08-01"))
test<-c(1,1,0,1,0,0,1,0,1,1,0,0,1,0,0,1,0)
df<-data.frame(id, test, date)
df
id test date
a 1 2018-03-01
a 1 2018-04-01
a 0 2018-06-01
a 1 2018-08-01
a 0 2018-10-01
b 0 2017-03-01
b 1 2017-04-01
b 0 2018-02-01
b 1 2018-11-01
b 1 2018-12-01
c 0 2016-05-11
c 0 2017-10-01
c 1 2018-03-01
c 0 2018-03-21
c 0 2018-04-01
c 1 2018-07-01
c 0 2018-08-01
What I am trying to do is create a new column 'Var' indicating whether any of the following sets of results:
match1 <- c(1, 1, 1, 1)
match2 <- c(1, 1, 1, 0)
match3 <- c(0, 1, 1, 1)
match4 <- c(1, 0, 1, 1)
match5 <- c(1, 1, 0, 1)
is observed in the result set of each individual. Ideally this would result in:
id test date Var
a 1 2018-03-01 case
a 1 2018-04-01 case
a 0 2018-06-01 case
a 1 2018-08-01 case
a 0 2018-10-01 case
b 0 2017-03-01 case
b 1 2017-04-01 case
b 0 2018-02-01 case
b 1 2018-11-01 case
b 1 2018-12-01 case
c 0 2016-05-11 non-case
c 0 2017-10-01 non-case
c 1 2018-03-01 non-case
c 0 2018-03-21 non-case
c 0 2018-04-01 non-case
c 1 2018-07-01 non-case
c 0 2018-08-01 non-case
because the sequence (1,1,0,1) is observed within the result set of 'a' and (1,0,1,1) within 'b', while none of the target sequences is observed in 'c'.
Apologies for not posting any attempt, but I am really stuck on this!
best regards,
Matteo
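
One possible approach (a sketch, not an accepted answer from the thread): collapse each individual's ordered results into a string of 0/1 characters and test it against the target sequences, pasted into a single regular-expression alternation. The dplyr pipeline below assumes the df built above:

library(dplyr)

# Each target sequence (match1..match5) collapsed to a string and
# combined into one regex alternation: "1111|1110|0111|1011|1101".
pattern <- paste(c("1111", "1110", "0111", "1011", "1101"), collapse = "|")

df %>%
  arrange(id, date) %>%                      # results must be in time order
  group_by(id) %>%
  mutate(Var = ifelse(grepl(pattern, paste(test, collapse = "")),
                      "case", "non-case")) %>%
  ungroup()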

Related

Create a binary variable based on the first appearance of another (date) variable

Is it possible to create a binary variable based on the first appearance of another (date) variable?
For my thesis I am trying to create a variable that captures the number of first-time forecasts issued and revised during the month, divided by the number of forecasts at month-end, for a firm in a given year. For convenience, I would like to separate the first-time forecasts issued and revised into different columns.
Example data
library(data.table)
dt <- data.table(
  analyst = rep(1:2, 10),
  id = rep(1:5, 4),
  year = rep(as.Date(c('2009-12-31', '2009-12-31', '2010-12-31', '2010-12-31'),
                     format = '%Y-%m-%d'), 5),
  fdate = as.Date(c('2009-07-31', '2009-02-26', '2010-01-31', '2010-05-15',
                    '2009-06-30', '2009-10-08', '2010-07-31', '2010-11-30',
                    '2009-01-31', '2009-06-26', '2010-05-03', '2010-04-13',
                    '2009-10-30', '2009-11-02', '2010-03-28', '2010-10-14',
                    '2009-02-17', '2009-09-14', '2010-08-02', '2010-10-03'),
                  format = '%Y-%m-%d')
)
To create the variable, I used the following steps:
First, identifying the issuance of the first-time forecasts for a given year (for firms by analysts) with the following code:
dt2 <- setkey(setDT(dt), id, year, analyst)[order(fdate), .SD[1L], by = list(id, year)]
However, this generates a table with only the first-time forecast by id, year and analyst. Secondly, I give the first-time forecasts the value 1 with:
dt3 <- print(dt2[, first := 1L])
Third, combine the two data.tables:
dt4 <- dt3[dt, on = c('id', 'year', 'analyst', 'fdate')]
Fourth, I replace the NAs with 0:
dt4[is.na(dt4)] <- 0
Fifth, I create the revised binary variable:
dt4$rev <- ifelse(dt4$first == 0, "1", "0")
Last, I sum the number of first-time and revised forecasts for every month for a firm.
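For concreteness, that last step might look like the following (a sketch, not from the original post; the 'month' helper column is an assumption derived from fdate):

# Tally first-time and revised forecasts per firm (id) and month,
# using the dt4 produced by the steps above.
dt4[, month := format(fdate, "%Y-%m")]
dt4[, .(n_first = sum(first), n_rev = sum(rev == "1")), by = .(id, month)]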
Is there a more elegant way of creating this variable, so I can learn more about R/data.table? I have tried to incorporate the dcast function, based on the answers from:
R data.table - categorical values in one column to binary values in multiple columns
How to programmatically create binary columns based on a categorical variable in data.table?
Data table dcast column headings
However, it doesn't work out for me.
Current result, based on the previous mentioned steps:
id year analyst fdate first rev
1 2009-12-31 1 2009-07-31 1 0
1 2009-12-31 2 2009-10-08 0 1
1 2010-12-31 1 2010-05-03 1 0
1 2010-12-31 2 2010-10-14 0 1
2 2009-12-31 1 2009-02-17 1 0
2 2009-12-31 2 2009-02-26 0 1
2 2010-12-31 1 2010-07-31 0 1
2 2010-12-31 2 2010-04-13 1 0
3 2009-12-31 1 2009-10-30 0 1
3 2009-12-31 2 2009-09-14 1 0
3 2010-12-31 1 2010-01-31 1 0
3 2010-12-31 2 2010-11-30 0 1
4 2009-12-31 1 2009-01-31 1 0
4 2009-12-31 2 2009-11-02 0 1
4 2010-12-31 1 2010-08-02 0 1
4 2010-12-31 2 2010-05-15 1 0
5 2009-12-31 1 2009-06-30 0 1
5 2009-12-31 2 2009-06-26 1 0
5 2010-12-31 1 2010-03-28 1 0
5 2010-12-31 2 2010-10-03 0 1
We can replace the ifelse and the base R methods. Create 'first' as 0, then join with 'dt2' on the columns from the post and assign 1 to 'first' for the matching rows; finally, negate (!) 'first' and convert it back to integer with unary + (or as.integer) to create 'rev':
dt[, first := 0][dt2, first := 1, on = .(id, year, analyst, fdate)]
dt[, rev := +(!first)][]
# analyst id year fdate first rev
# 1: 1 1 2009-12-31 2009-07-31 1 0
# 2: 2 1 2009-12-31 2009-10-08 0 1
# 3: 1 1 2010-12-31 2010-05-03 1 0
# 4: 2 1 2010-12-31 2010-10-14 0 1
# 5: 1 2 2009-12-31 2009-02-17 1 0
# 6: 2 2 2009-12-31 2009-02-26 0 1
# 7: 1 2 2010-12-31 2010-07-31 0 1
# 8: 2 2 2010-12-31 2010-04-13 1 0
# 9: 1 3 2009-12-31 2009-10-30 0 1
#10: 2 3 2009-12-31 2009-09-14 1 0
#11: 1 3 2010-12-31 2010-01-31 1 0
#12: 2 3 2010-12-31 2010-11-30 0 1
#13: 1 4 2009-12-31 2009-01-31 1 0
#14: 2 4 2009-12-31 2009-11-02 0 1
#15: 1 4 2010-12-31 2010-08-02 0 1
#16: 2 4 2010-12-31 2010-05-15 1 0
#17: 1 5 2009-12-31 2009-06-30 0 1
#18: 2 5 2009-12-31 2009-06-26 1 0
#19: 1 5 2010-12-31 2010-03-28 1 0
#20: 2 5 2010-12-31 2010-10-03 0 1

Conditionally update rows and then group

Let me start by providing my sample dataset:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
I'd like to, for each ID, and within that group for each Code, check whether End is later than the Start of the next row (i.e. df$End[i] > df$Start[i+1]) and, if so, update the next row's Start to End and recompute its End (which is Start + Days). The results should thus be:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-14 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Afterwards, if for an ID and a Code the difference between df$End[i] and df$Start[i+1] is at most 7 days, I would like to combine the rows, using the smallest df$Start and the largest df$End for this subset, giving:
ID Start Code End Days
1 2016-03-01 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Since my dataset is over 100M rows, I'd like a fast solution. Unfortunately I am pretty new to dplyr, so help is highly appreciated!
Update: a larger example:
ID Start Code End Days
1 2012-04-01 A 2012-04-07 7
1 2016-03-01 B 2016-03-15 15
1 2016-03-01 B 2016-05-29 90
1 2016-06-01 B 2016-08-29 90
1 2016-09-01 B 2016-11-29 90
1 2016-12-01 B 2017-02-28 90
1 2017-03-01 B 2017-05-09 90
1 2017-08-01 B 2017-10-29 90
1 2017-12-01 B 2018-02-28 90
2 2016-04-01 B 2016-04-14 14
This results in:
ID Start Code End
1 2012-04-01 A 2012-04-07
1 2016-03-01 B 2017-02-28
1 2017-03-01 B 2017-05-29
1 2018-08-01 B 2017-12-05
2 2016-04-01 B 2016-04-14
Where I would expect rows 2 and 3 to be combined.
For the first step I tried:
library(dplyr)
grouped_df <-
  df %>%
  group_by(ID, Code) %>%
  mutate_at(vars(Start, End), funs(as.Date)) %>%
  mutate(new_start = as.Date(ifelse(lag(End) > Start, lag(End), Start),
                             origin = "1970-01-01")) %>%
  mutate(new_stop = new_start + Days)
However, once a new_stop has been computed for a row, the next row should compare its Start against that new_stop rather than against the original End (and against new_start rather than Start).
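
For the second (combining) step on its own, one possible sketch (not an accepted answer; it assumes Start and End are already Date columns) is to start a new episode whenever the gap to the previous End exceeds 7 days and then collapse each episode:

library(dplyr)

df %>%
  arrange(ID, Code, Start) %>%
  group_by(ID, Code) %>%
  # gap in days to the previous row's End; 0 for the first row of a group
  mutate(gap = as.numeric(Start - lag(End, default = first(Start))),
         episode = cumsum(gap > 7)) %>%
  group_by(ID, Code, episode) %>%
  summarise(Start = min(Start), End = max(End), Days = first(Days)) %>%
  ungroup()

Note that the first (sequential) update step is order-dependent and does not vectorise as easily; the sketch above only covers the merge.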

How can I group values by hour and count the cumulative totals in other columns

I have a data frame that is aggregated per minute (where one row represents one minute in YYYY-MM-DD HH:MM:SS format).
I want to group each minute value into their respective hour values/bins.
I have also extracted the hour value from the date field into another column in order to group the data more easily (YYYY-MM-DD HH).
I have looked at several approaches/answers where people recommend using lubridate/dplyr/anytime but no approach seems to have worked completely for me.
My data frame:
> df
date hour available busy
1 2018-03-01 01:00:00 2018-03-01 01:00:00 1 1
2 2018-03-01 01:01:00 2018-03-01 01:00:00 1 1
3 2018-03-01 01:02:00 2018-03-01 01:00:00 1 1
4 2018-03-01 01:03:00 2018-03-01 01:00:00 1 1
5 2018-03-01 01:04:00 2018-03-01 01:00:00 1 1
6 2018-03-01 01:05:00 2018-03-01 01:00:00 1 1
...
7907 2018-03-14 00:54:00 2018-03-14 1 0
7908 2018-03-14 00:55:00 2018-03-14 1 0
7909 2018-03-14 00:56:00 2018-03-14 2 0
7910 2018-03-14 00:57:00 2018-03-14 1 0
7911 2018-03-14 00:58:00 2018-03-14 1 0
7912 2018-03-14 00:59:00 2018-03-14 1 0
I want to group everything by hour for each date (I don't mind if I use the hour column or whether the values are grouped by the HH value in the date column) and list the CUMULATIVE number of available and busy for each hour group.
My desired output df will look like this (note that these are dummy values and not the actual values):
date available busy
1 2018-03-01 01:00:00 1 6
2 2018-03-01 02:00:00 2 11
3 2018-03-01 03:00:00 10 8
...
450 2018-03-14 08:00:00 11 1
451 2018-03-14 09:00:00 24 19
452 2018-03-14 10:00:00 12 4
Here's the dplyr code to do that:
library(dplyr)
library(lubridate)
df2 <- df %>%
  group_by(hour) %>%
  summarize(
    available = sum(available),
    busy = sum(busy)
  ) %>%
  ungroup()
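The pre-built hour column is what the grouping keys on. If it did not already exist, an equivalent column could be derived from date (a sketch, assuming date is POSIXct; floor_date is lubridate's truncation helper, and this line is not part of the original answer):

# Truncate each per-minute timestamp down to its hourly bin.
df <- df %>%
  mutate(hour = floor_date(date, unit = "hour"))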

R: data.table aggregate using external grouping vector

I have data
library(data.table)
dt <- data.table(time = as.POSIXct(c("2018-01-01 01:01:00", "2018-01-01 01:05:00",
                                     "2018-01-01 01:01:00")),
                 y = c(1, 10, 9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[, list(sum = sum(y), count = .N), by = "time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[, list(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))), by = time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
, .(sum = sum(y, na.rm = TRUE), count= sum(!is.na(y))),
by = time]
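
An equivalent one-step idiom (an aside, not part of the original answer) uses by = .EACHI, which evaluates j once per row of the join table, so the padding and the aggregation happen in a single call:

dt[J(times), .(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))),
   on = "time", by = .EACHI]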

SQLITE - Flatten a key-value table into columns [duplicate]

I have a table in SQLite called param_vals_breaches that looks like the following:
id param queue date_time param_val breach_count
1 c a 2013-01-01 00:00:00 188 7
2 c b 2013-01-01 00:00:00 156 8
3 c c 2013-01-01 00:00:00 100 2
4 d a 2013-01-01 00:00:00 657 0
5 d b 2013-01-01 00:00:00 23 6
6 d c 2013-01-01 00:00:00 230 12
7 c a 2013-01-01 01:00:00 100 0
8 c b 2013-01-01 01:00:00 143 9
9 c c 2013-01-01 01:00:00 12 2
10 d a 2013-01-01 01:00:00 0 1
11 d b 2013-01-01 01:00:00 29 5
12 d c 2013-01-01 01:00:00 22 14
13 c a 2013-01-01 02:00:00 188 7
14 c b 2013-01-01 02:00:00 156 8
15 c c 2013-01-01 02:00:00 100 2
16 d a 2013-01-01 02:00:00 657 0
17 d b 2013-01-01 02:00:00 23 6
18 d c 2013-01-01 02:00:00 230 12
I want to write a query that will show me a particular queue (e.g. "a") with the average param_val and breach_count for each param on an hour by hour basis. So transposing the data to get something that looks like this:
Results for Queue A
Hour 0 Hour 0 Hour 1 Hour 1 Hour 2 Hour 2
param avg_param_val avg_breach_count avg_param_val avg_breach_count avg_param_val avg_breach_count
c xxx xxx xxx xxx xxx xxx
d xxx xxx xxx xxx xxx xxx
Is this possible? I'm not sure how to go about it. Thanks!
SQLite does not have a PIVOT function, but you can use an aggregate function with a CASE expression to turn the rows into columns:
select param,
       avg(case when time = '00' then param_val end) AvgHour0Val,
       avg(case when time = '00' then breach_count end) AvgHour0Count,
       avg(case when time = '01' then param_val end) AvgHour1Val,
       avg(case when time = '01' then breach_count end) AvgHour1Count,
       avg(case when time = '02' then param_val end) AvgHour2Val,
       avg(case when time = '02' then breach_count end) AvgHour2Count
from
(
  select param,
         strftime('%H', date_time) time,
         param_val,
         breach_count
  from param_vals_breaches
  where queue = 'a'
) src
group by param;
