SQLite - Flatten a key-value table into columns [duplicate]

I have a table in SQLite called param_vals_breaches that looks like the following:
id param queue date_time param_val breach_count
1 c a 2013-01-01 00:00:00 188 7
2 c b 2013-01-01 00:00:00 156 8
3 c c 2013-01-01 00:00:00 100 2
4 d a 2013-01-01 00:00:00 657 0
5 d b 2013-01-01 00:00:00 23 6
6 d c 2013-01-01 00:00:00 230 12
7 c a 2013-01-01 01:00:00 100 0
8 c b 2013-01-01 01:00:00 143 9
9 c c 2013-01-01 01:00:00 12 2
10 d a 2013-01-01 01:00:00 0 1
11 d b 2013-01-01 01:00:00 29 5
12 d c 2013-01-01 01:00:00 22 14
13 c a 2013-01-01 02:00:00 188 7
14 c b 2013-01-01 02:00:00 156 8
15 c c 2013-01-01 02:00:00 100 2
16 d a 2013-01-01 02:00:00 657 0
17 d b 2013-01-01 02:00:00 23 6
18 d c 2013-01-01 02:00:00 230 12
I want to write a query that will show me a particular queue (e.g. "a") with the average param_val and breach_count for each param on an hour by hour basis. So transposing the data to get something that looks like this:
Results for Queue A
Hour 0 Hour 0 Hour 1 Hour 1 Hour 2 Hour 2
param avg_param_val avg_breach_count avg_param_val avg_breach_count avg_param_val avg_breach_count
c xxx xxx xxx xxx xxx xxx
d xxx xxx xxx xxx xxx xxx
Is this possible? I'm not sure how to go about it. Thanks!

SQLite does not have a PIVOT function, but you can use an aggregate function with a CASE expression to turn the rows into columns:
select param,
       avg(case when time = '00' then param_val end)    AvgHour0Val,
       avg(case when time = '00' then breach_count end) AvgHour0Count,
       avg(case when time = '01' then param_val end)    AvgHour1Val,
       avg(case when time = '01' then breach_count end) AvgHour1Count,
       avg(case when time = '02' then param_val end)    AvgHour2Val,
       avg(case when time = '02' then breach_count end) AvgHour2Count
from
(
  select param,
         strftime('%H', date_time) time,
         param_val,
         breach_count
  from param_vals_breaches
  where queue = 'a'
) src
group by param;

Related

Different results for a certain group if a filter is applied in a user function vs. no filter

I have a data set which consists of multiple describing variables (type, origin, etc.) and timestamps per observation of an item. The timestamps specify different states reached by the item.
The items are parts of different system units, which are identified by an ID. A unit must consist of at least one item, but there is no upper limit.
A much simplified example data set is:
library(tidyverse)
set.seed(10)
data <- data.frame(ID = paste(c(rep(2010, times = 250), rep(2011, times = 200), rep(2012, times = 300)),
                              "_",
                              sprintf("%02d", sample(50, replace = TRUE, size = 750))),
                   year = c(rep(2010, times = 250), rep(2011, times = 200), rep(2012, times = 300)),
                   Cat = sample(c("A", "B", "C", "D", "E"), replace = TRUE, size = 750)) %>%
  mutate(Time_1 = strptime(paste(year, "-01-01 01:", sprintf("%02d", rep(59, times = 750)), ":", sample(2, replace = TRUE, size = 750), sep = ""), format = "%Y-%m-%d %H:%M:%S"),
         Time_2 = strptime(paste(year, "-01-01 02:", sprintf("%02d", sample(0, replace = TRUE, size = 750)), ":", sample(59, replace = TRUE, size = 750), sep = ""), format = "%Y-%m-%d %H:%M:%S"))
Some items are considered relevant and some items are considered irrelevant. The following lookup-table gives information about that:
lookup <- data.frame(Cat = c("A", "B", "C", "D", "E"),
                     Relevant = c(TRUE, FALSE, FALSE, FALSE, TRUE))
df <- data %>%
  left_join(lookup)
I wrote a function which does the following:
1. Applies a filter if one is passed as an argument
2. Adds a column with the smallest timestamp per unit, considering only relevant items
3. Checks whether the time difference between Time_1 and the smallest timestamp per unit matches a condition
4. Returns the result as a summary
For the example data set the function would look like:
foo <- function(data, Filter = FALSE) {
  new_column <- data %>%
    filter(Relevant) %>%
    group_by(ID) %>%
    slice_min(order_by = Time_2, with_ties = FALSE) %>%
    select(ID, Time_2) %>%
    rename("Time_2_mod" = "Time_2") %>%
    ungroup()
  Obj <- data %>%
    {if (isFALSE(Filter)) . else filter(., eval(rlang::parse_expr(Filter)))} %>%
    left_join(new_column) %>%
    mutate(check = Time_2_mod - Time_1 < 90) %>%
    group_by(year) %>%
    summarise(count_checked = sum(check, na.rm = TRUE))
  return(Obj)
}
Now, the output without filter is:
> foo(df)
Joining, by = "ID"
# A tibble: 3 x 2
year count_checked
* <dbl> <int>
1 2010 175
2 2011 149
3 2012 245
With a filter applied, I get a different result for some of the years:
> foo(df,Filter="year==2010")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2010 232
> foo(df,Filter="year==2011")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2011 173
> foo(df,Filter="year==2012")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2012 245
Why is that?
It took me quite a while to figure it out, and I got to the answer by deconstructing the function step by step. The key was to look at the result of the difference in mutate(check = Time_2_mod - Time_1 < 90). So I changed it to mutate(check = Time_2_mod - Time_1), followed by head(10) to only see the first 10 rows.
The function now is:
foo_2 <- function(data, Filter = FALSE) {
  new_column <- data %>%
    filter(Relevant) %>%
    group_by(ID) %>%
    slice_min(order_by = Time_2, with_ties = FALSE) %>%
    select(ID, Time_2) %>%
    rename("Time_2_mod" = "Time_2") %>%
    ungroup()
  Obj <- data %>%
    {if (isFALSE(Filter)) . else filter(., eval(rlang::parse_expr(Filter)))} %>%
    left_join(new_column) %>%
    mutate(check = Time_2_mod - Time_1) %>%
    head(10)
  return(Obj)
}
Result without filtering:
> foo_2(df)
Joining, by = "ID"
ID year Cat Time_1 Time_2 Relevant Time_2_mod check
1 2010 _ 43 2010 C 2010-01-01 01:59:02 2010-01-01 02:00:40 FALSE <NA> NA secs
2 2010 _ 09 2010 B 2010-01-01 01:59:02 2010-01-01 02:00:49 FALSE 2010-01-01 02:00:18 76 secs
3 2010 _ 10 2010 E 2010-01-01 01:59:02 2010-01-01 02:00:55 TRUE 2010-01-01 02:00:49 107 secs
4 2010 _ 48 2010 B 2010-01-01 01:59:02 2010-01-01 02:00:47 FALSE 2010-01-01 02:00:13 71 secs
5 2010 _ 12 2010 C 2010-01-01 01:59:02 2010-01-01 02:00:54 FALSE 2010-01-01 02:00:03 61 secs
6 2010 _ 08 2010 D 2010-01-01 01:59:01 2010-01-01 02:00:45 FALSE 2010-01-01 02:00:19 78 secs
7 2010 _ 39 2010 C 2010-01-01 01:59:01 2010-01-01 02:00:24 FALSE 2010-01-01 02:00:21 80 secs
8 2010 _ 19 2010 E 2010-01-01 01:59:01 2010-01-01 02:00:21 TRUE 2010-01-01 02:00:21 80 secs
9 2010 _ 24 2010 B 2010-01-01 01:59:01 2010-01-01 02:00:51 FALSE <NA> NA secs
10 2010 _ 15 2010 D 2010-01-01 01:59:01 2010-01-01 02:00:43 FALSE 2010-01-01 02:00:02 61 secs
And with the filters applied:
> foo_2(df,Filter="year==2010")
Joining, by = "ID"
ID year Cat Time_1 Time_2 Relevant Time_2_mod
1 2010 _ 43 2010 C 2010-01-01 01:59:02 2010-01-01 02:00:40 FALSE <NA>
2 2010 _ 09 2010 B 2010-01-01 01:59:02 2010-01-01 02:00:49 FALSE 2010-01-01 02:00:18
3 2010 _ 10 2010 E 2010-01-01 01:59:02 2010-01-01 02:00:55 TRUE 2010-01-01 02:00:49
4 2010 _ 48 2010 B 2010-01-01 01:59:02 2010-01-01 02:00:47 FALSE 2010-01-01 02:00:13
5 2010 _ 12 2010 C 2010-01-01 01:59:02 2010-01-01 02:00:54 FALSE 2010-01-01 02:00:03
6 2010 _ 08 2010 D 2010-01-01 01:59:01 2010-01-01 02:00:45 FALSE 2010-01-01 02:00:19
7 2010 _ 39 2010 C 2010-01-01 01:59:01 2010-01-01 02:00:24 FALSE 2010-01-01 02:00:21
8 2010 _ 19 2010 E 2010-01-01 01:59:01 2010-01-01 02:00:21 TRUE 2010-01-01 02:00:21
9 2010 _ 24 2010 B 2010-01-01 01:59:01 2010-01-01 02:00:51 FALSE <NA>
10 2010 _ 15 2010 D 2010-01-01 01:59:01 2010-01-01 02:00:43 FALSE 2010-01-01 02:00:02
check
1 NA mins
2 1.266667 mins
3 1.783333 mins
4 1.183333 mins
5 1.016667 mins
6 1.300000 mins
7 1.333333 mins
8 1.333333 mins
9 NA mins
10 1.016667 mins
> foo_2(df,Filter="year==2011")
Joining, by = "ID"
ID year Cat Time_1 Time_2 Relevant Time_2_mod
1 2011 _ 38 2011 D 2011-01-01 01:59:01 2011-01-01 02:00:17 FALSE 2011-01-01 02:00:23
2 2011 _ 25 2011 B 2011-01-01 01:59:02 2011-01-01 02:00:27 FALSE 2011-01-01 02:00:07
3 2011 _ 22 2011 C 2011-01-01 01:59:01 2011-01-01 02:00:58 FALSE 2011-01-01 02:00:20
4 2011 _ 11 2011 C 2011-01-01 01:59:01 2011-01-01 02:00:57 FALSE 2011-01-01 02:00:36
5 2011 _ 35 2011 C 2011-01-01 01:59:01 2011-01-01 02:00:03 FALSE 2011-01-01 02:00:41
6 2011 _ 29 2011 A 2011-01-01 01:59:01 2011-01-01 02:00:44 TRUE 2011-01-01 02:00:04
7 2011 _ 28 2011 A 2011-01-01 01:59:01 2011-01-01 02:00:25 TRUE 2011-01-01 02:00:13
8 2011 _ 34 2011 A 2011-01-01 01:59:01 2011-01-01 02:00:46 TRUE 2011-01-01 02:00:03
9 2011 _ 01 2011 B 2011-01-01 01:59:01 2011-01-01 02:00:43 FALSE 2011-01-01 02:00:19
10 2011 _ 08 2011 E 2011-01-01 01:59:01 2011-01-01 02:00:07 TRUE 2011-01-01 02:00:07
check
1 1.366667 mins
2 1.083333 mins
3 1.316667 mins
4 1.583333 mins
5 1.666667 mins
6 1.050000 mins
7 1.200000 mins
8 1.033333 mins
9 1.300000 mins
10 1.100000 mins
> foo_2(df,Filter="year==2012")
Joining, by = "ID"
ID year Cat Time_1 Time_2 Relevant Time_2_mod check
1 2012 _ 15 2012 E 2012-01-01 01:59:02 2012-01-01 02:00:29 TRUE 2012-01-01 02:00:26 84 secs
2 2012 _ 18 2012 A 2012-01-01 01:59:01 2012-01-01 02:00:01 TRUE 2012-01-01 02:00:01 60 secs
3 2012 _ 08 2012 A 2012-01-01 01:59:02 2012-01-01 02:00:21 TRUE 2012-01-01 02:00:21 79 secs
4 2012 _ 10 2012 D 2012-01-01 01:59:02 2012-01-01 02:00:23 FALSE 2012-01-01 02:00:10 68 secs
5 2012 _ 03 2012 C 2012-01-01 01:59:02 2012-01-01 02:00:20 FALSE 2012-01-01 02:00:12 70 secs
6 2012 _ 02 2012 B 2012-01-01 01:59:02 2012-01-01 02:00:25 FALSE 2012-01-01 02:00:42 100 secs
7 2012 _ 46 2012 C 2012-01-01 01:59:01 2012-01-01 02:00:23 FALSE 2012-01-01 02:00:34 93 secs
8 2012 _ 22 2012 E 2012-01-01 01:59:01 2012-01-01 02:00:45 TRUE 2012-01-01 02:00:24 83 secs
9 2012 _ 27 2012 D 2012-01-01 01:59:01 2012-01-01 02:00:56 FALSE 2012-01-01 02:00:27 86 secs
10 2012 _ 35 2012 D 2012-01-01 01:59:02 2012-01-01 02:00:32 FALSE 2012-01-01 02:00:24 82 secs
You will now easily recognize that the units in the last column check seem to depend on the year filtered. Which is only half the truth. The difference of two date-times is a difftime, and if no unit is given, one will be chosen "auto"matically:
a suitable set of units is chosen, the largest possible (excluding "weeks") in which all the absolute differences are greater than one.
(Source: ?difftime)
So that is what produced the inconsistency in the results. It was not the filter, but the fact that the difftimes for the years 2010 and 2011 were all at least one minute, and therefore minutes were used as the unit, while in 2012 there was at least one difftime of less than one minute, so seconds were used as the unit.
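The pitfall is easy to reproduce in isolation. A minimal sketch (t0 is just an illustrative timestamp, not from the question):
t0 <- as.POSIXct("2020-01-01 00:00:00")
(t0 + 100) - t0                              # Time difference of 1.666667 mins
((t0 + 100) - t0) < 90                       # TRUE: compares 1.67 < 90, although 100 secs > 90 secs
difftime(t0 + 100, t0, units = "secs") < 90  # FALSE, as intended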
To fix that behaviour, mutate(check = Time_2_mod - Time_1 < 90) should be changed to mutate(check = difftime(Time_2_mod, Time_1, units = "secs") < 90):
foo_3 <- function(data, Filter = FALSE) {
  new_column <- data %>%
    filter(Relevant) %>%
    group_by(ID) %>%
    slice_min(order_by = Time_2, with_ties = FALSE) %>%
    select(ID, Time_2) %>%
    rename("Time_2_mod" = "Time_2") %>%
    ungroup()
  Obj <- data %>%
    {if (isFALSE(Filter)) . else filter(., eval(rlang::parse_expr(Filter)))} %>%
    left_join(new_column) %>%
    mutate(check = difftime(Time_2_mod, Time_1, units = "secs") < 90) %>%
    group_by(year) %>%
    summarise(count_checked = sum(check, na.rm = TRUE))
  return(Obj)
}
This returns consistent results with and without filtering.
> foo_3(df)
Joining, by = "ID"
# A tibble: 3 x 2
year count_checked
* <dbl> <int>
1 2010 175
2 2011 149
3 2012 245
> foo_3(df,Filter="year==2010")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2010 175
> foo_3(df,Filter="year==2011")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2011 149
> foo_3(df,Filter="year==2012")
Joining, by = "ID"
# A tibble: 1 x 2
year count_checked
* <dbl> <int>
1 2012 245

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                  Datetime = ymd_hms(c("2016-08-21 00:00:00", "2016-08-24 08:00:00",
                                       "2016-08-23 12:00:00", "2016-08-29 03:00:00",
                                       "2016-08-27 23:00:00", "2016-09-02 02:00:00",
                                       "2016-09-01 12:00:00", "2016-09-09 04:00:00",
                                       "2016-09-01 12:00:00", "2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate, for each row, the number of hours (Hours_since_beginning) since the first time that the individual was detected.
I would expect something like this (it can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_beginning
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get the min Datetime by ID
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))
# join with df1 and compute the time difference
df1 <- df1 %>%
  left_join(min_datetime_id) %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
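The join can also be skipped entirely; an equivalent one-step sketch with a grouped mutate, assuming dplyr (or the tidyverse) is loaded:
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()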

How to group time by every n minutes in R

I have a dataframe with a lot of time series:
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1
I want to group the time series into 6-minute intervals and count the frequency of A and B:
1 0:06 A 2
2 0:06 B 2
3 0:12 A 1
4 0:12 B 1
5 0:18 A 1
6 0:24 A 1
7 0:24 B 1
8 0:18 A 1
9 0:30 A 1
Also, the class of the time series is character. What should I do?
Here's an approach: convert the times to POSIXct, cut them into 6-minute intervals, then count.
First, you need to specify the year, month, day, hour, minute, and seconds of your data. This will help with scaling it to larger datasets.
library(tidyverse)
library(lubridate)
# sample data
d <- data.frame(t = paste0("2019-06-02 ",
                           c("0:03", "0:06", "0:09", "0:12", "0:15",
                             "0:18", "0:21", "0:24", "0:27", "0:30"),
                           ":00"),
                g = c("A", "A", "B", "B", "B"))
d$t <- ymd_hms(d$t) # convert to POSIXct with `lubridate::ymd_hms()`
If you check the class of your new date column, you will see it is "POSIXct".
> class(d$t)
[1] "POSIXct" "POSIXt"
Now that the data is in "POSIXct", you can cut it into minute-based intervals! We will add this new grouping factor as a new column called tc.
d$tc <- cut(d$t, breaks = "6 min")
d
t g tc
1 2019-06-02 00:03:00 A 2019-06-02 00:03:00
2 2019-06-02 00:06:00 A 2019-06-02 00:03:00
3 2019-06-02 00:09:00 B 2019-06-02 00:09:00
4 2019-06-02 00:12:00 B 2019-06-02 00:09:00
5 2019-06-02 00:15:00 B 2019-06-02 00:15:00
6 2019-06-02 00:18:00 A 2019-06-02 00:15:00
7 2019-06-02 00:21:00 A 2019-06-02 00:21:00
8 2019-06-02 00:24:00 B 2019-06-02 00:21:00
9 2019-06-02 00:27:00 B 2019-06-02 00:27:00
10 2019-06-02 00:30:00 B 2019-06-02 00:27:00
Now you can group_by this new interval (tc) and your grouping column (g), and count the frequency of occurrences. Getting the frequency of observations in a group is a fairly common operation, so dplyr provides count for this:
count(d, g, tc)
# A tibble: 7 x 3
g tc n
<fct> <fct> <int>
1 A 2019-06-02 00:03:00 2
2 A 2019-06-02 00:15:00 1
3 A 2019-06-02 00:21:00 1
4 B 2019-06-02 00:09:00 2
5 B 2019-06-02 00:15:00 1
6 B 2019-06-02 00:21:00 1
7 B 2019-06-02 00:27:00 2
If you run ?dplyr::count() in the console, you'll see that count(d, g, tc) is simply a wrapper for group_by(d, g, tc) %>% summarise(n = n()).
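As an aside, a similar binning can be sketched with lubridate's floor_date(). Note that, unlike cut() above, whose intervals start at the earliest observation (00:03), floor_date() aligns the bins to the clock (00:00, 00:06, ...), so the counts can differ:
d$tc2 <- floor_date(d$t, "6 minutes")  # bins aligned to 00:00, 00:06, ...
count(d, g, tc2)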
According to the sample dataset, the time series is given as times of day, i.e., without dates.
The data.table package has the ITime class, which is a time-of-day class stored as the integer number of seconds in the day. With data.table, we can use a rolling join to map times to the upper limit of the 6-minute intervals (right-closed intervals):
library(data.table)
# coerce from character to class ITime
setDT(ts)[, time := as.ITime(time)]
# create sequence of breaks
breaks <- as.ITime(seq(as.ITime("0:00"), as.ITime("23:59:59"), as.ITime("0:06")))
# rolling join and aggregate
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = -Inf, .(x.breaks, group)
][, .N, by = .(upper = x.breaks, group)]
which returns
upper group N
1: 00:06:00 B 2
2: 00:06:00 A 2
3: 00:12:00 A 1
4: 00:12:00 B 1
5: 00:18:00 B 1
6: 00:18:00 A 1
7: 00:24:00 A 1
8: 00:24:00 B 1
9: 00:30:00 A 1
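Since ITime is stored as seconds of the day, the same right-closed mapping can also be sketched with plain integer arithmetic, reusing the breaks vector from above and skipping the join entirely:
# 360 secs = 6 min; ceiling() maps each time to the index of its upper break
ts[, upper := breaks[ceiling(as.integer(time) / 360) + 1L]]
ts[, .N, by = .(upper, group)]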
Addendum
If the direction of the rolling join is changed (roll = +Inf instead of roll = -Inf), we get left-closed intervals:
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = +Inf, .(x.breaks, group)
][, .N, by = .(lower = x.breaks, group)]
which changes the result significantly:
lower group N
1: 00:00:00 B 2
2: 00:00:00 A 2
3: 00:06:00 A 1
4: 00:06:00 B 1
5: 00:12:00 B 1
6: 00:18:00 A 2
7: 00:18:00 B 1
8: 00:30:00 A 1
Data
library(data.table)
ts <- fread("
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1"
, header = FALSE
, col.names = c("rn", "time", "group", "value"))

Conditionally update rows and then group

Let me start by providing my sample dataset:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
I'd like to, for each ID, and within this group for each Code, check if End is larger than Start in the next row (if df$End[i] > df$Start[i+1]) and, if so, update Start of the next row to End and recompute End (which is Start + Days) for that row i+1. The results should thus be:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-14 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Afterwards, if for an ID and a Code the difference df$End[i] - df$Start[i+1] <= 7, I would like to combine the rows, using the smallest df$Start and the largest df$End for this subset, making:
ID Start Code End Days
1 2016-03-01 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Since my dataset is over 100M rows, I'd like a fast solution. Unfortunately I am pretty new to dplyr, so help is highly appreciated!
Update: a larger example:
ID Start Code End Days
1 2012-04-01 A 2012-04-07 7
1 2016-03-01 B 2016-03-15 15
1 2016-03-01 B 2016-05-29 90
1 2016-06-01 B 2016-08-29 90
1 2016-09-01 B 2016-11-29 90
1 2016-12-01 B 2017-02-28 90
1 2017-03-01 B 2017-05-09 90
1 2017-08-01 B 2017-10-29 90
1 2017-12-01 B 2018-02-28 90
2 2016-04-01 B 2016-04-14 14
This results in:
ID Start Code End
1 2012-04-01 A 2012-04-07
1 2016-03-01 B 2017-02-28
1 2017-03-01 B 2017-05-29
1 2018-08-01 B 2017-12-05
2 2016-04-01 B 2016-04-14
Where I would expect rows 2 and 3 to be combined.
For the first step I tried:
grouped_df <- df %>%
  group_by(ID, Code) %>%
  mutate_at(vars(Start, End), funs(as.Date)) %>%
  mutate(new_start = as.Date(ifelse(lag(End) > Start, lag(End), Start), origin = "1970-01-01")) %>%
  mutate(new_end = new_start + Days)
However, once a new_end has been computed, we should from then on compare new_end (and not End) with new_start (and not Start).
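The sequential first step can be sketched with an explicit per-group pass. This is a correctness-first sketch, not a definitive solution: fix_overlaps is an illustrative helper, rows are assumed sorted by Start within ID and Code, and for 100M rows a data.table or vectorized formulation would be preferable:
library(dplyr)

# illustrative helper: push a row's Start to the previous row's End when
# they overlap, then recompute End from the shifted Start
fix_overlaps <- function(g) {
  for (i in seq_len(nrow(g))[-1]) {
    if (g$End[i - 1] > g$Start[i]) {
      g$Start[i] <- g$End[i - 1]
      g$End[i]   <- g$Start[i] + g$Days[i]
    }
  }
  g
}

df %>%
  mutate(across(c(Start, End), as.Date)) %>%
  arrange(ID, Code, Start) %>%
  group_by(ID, Code) %>%
  group_modify(~ fix_overlaps(.x)) %>%
  ungroup()
The second step could then tag runs whose gap exceeds 7 days, e.g. run = cumsum(!is.na(lag(End)) & Start - lag(End) > 7) within each ID/Code group, and summarise each run with min(Start) and max(End).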

Finding within-groups specific sequences

I am working on a data set including the results of repeated tests (results expressed as positive (1) / negative (0)) on individuals over time; the number of tests per individual is not necessarily the same.
Below is a df reproducing what my dataset looks like:
id <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 7))
date <- as.Date(c("2018-03-01", "2018-04-01", "2018-06-01", "2018-08-01", "2018-10-01",
                  "2017-03-01", "2017-04-01", "2018-02-01", "2018-11-01", "2018-12-01",
                  "2016-05-11", "2017-10-01", "2018-03-01", "2018-03-21", "2018-04-01",
                  "2018-07-01", "2018-08-01"))
test <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0)
df <- data.frame(id, test, date)
df
id test date
a 1 2018-03-01
a 1 2018-04-01
a 0 2018-06-01
a 1 2018-08-01
a 0 2018-10-01
b 0 2017-03-01
b 1 2017-04-01
b 0 2018-02-01
b 1 2018-11-01
b 1 2018-12-01
c 0 2016-05-11
c 0 2017-10-01
c 1 2018-03-01
c 0 2018-03-21
c 0 2018-04-01
c 1 2018-07-01
c 0 2018-08-01
What I am trying to do is create a new column 'Var' indicating whether any of the following sets of results:
match1 <- c(1, 1, 1, 1)
match2 <- c(1, 1, 1, 0)
match3 <- c(0, 1, 1, 1)
match4 <- c(1, 0, 1, 1)
match5 <- c(1, 1, 0, 1)
is observed in the result set of each individual. Ideally this would result in:
id test date Var
a 1 2018-03-01 case
a 1 2018-04-01 case
a 0 2018-06-01 case
a 1 2018-08-01 case
a 0 2018-10-01 case
b 0 2017-03-01 case
b 1 2017-04-01 case
b 0 2018-02-01 case
b 1 2018-11-01 case
b 1 2018-12-01 case
c 0 2016-05-11 non-case
c 0 2017-10-01 non-case
c 1 2018-03-01 non-case
c 0 2018-03-21 non-case
c 0 2018-04-01 non-case
c 1 2018-07-01 non-case
c 0 2018-08-01 non-case
because the sequence (1,1,0,1) is observed within the result set of 'a' and (1,0,1,1) in 'b', while none of the target sequences is observed in 'c'.
Apologies for not posting any attempt, but I am really stuck on this!
Best regards,
Matteo
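One hedged sketch: order each individual's results by date, collapse them into a string, and match the five target sequences as substrings (assumes dplyr is loaded; patterns is an illustrative name):
library(dplyr)

# the five target sequences as strings, in the same order as match1..match5
patterns <- c("1111", "1110", "0111", "1011", "1101")

df <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(Var = if (grepl(paste(patterns, collapse = "|"),
                         paste(test, collapse = "")))
                 "case" else "non-case") %>%
  ungroup()
For 'a' the collapsed results are "11010", which contains "1101", so every row of 'a' is flagged "case"; 'c' collapses to "0010010", which matches none of the patterns.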
