I want to keep one observation per ID for every 30-day period. To do this, I want to create a variable that flags which observations are kept by the filter (1) and which are dropped (0).
Example
id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021
Expected result
id date count
1 3/1/2021 1
1 4/1/2021 0
1 5/1/2021 0
1 6/1/2021 0
1 2/2/2021 0
1 3/2/2021 1
1 5/2/2021 0
1 7/2/2021 0
1 9/2/2021 0
1 11/2/2021 0
1 13/2/2021 0
1 16/3/2021 1
2 5/1/2021 1
2 31/10/2021 1
2 9/1/2021 0
2 6/2/2021 1
2 1/6/2021 1
3 1/1/2021 1
3 1/6/2021 1
3 31/12/2021 1
4 5/5/2021 1
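Here is a data.table approach. It bins each id's dates into fixed 30-day windows counted from that id's first date and flags the first observation in each window. Note that 2/2/2021 for id 1 is exactly 30 days after 3/1/2021, so it opens a new window here; the output below therefore differs from the expected result on that row and on 3/2/2021.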
library(data.table)
# sort by id, then by date
setkey(DT, id, date)
# create groups
DT[, group := rleid((as.numeric(date - date[1])) %/% 30), by = .(id)][]
# create count column
DT[, count := ifelse(!group == shift(group, type = "lag", fill = 0), 1, 0), by = .(id)][]
# id date group count
# 1: 1 2021-01-03 1 1
# 2: 1 2021-01-04 1 0
# 3: 1 2021-01-05 1 0
# 4: 1 2021-01-06 1 0
# 5: 1 2021-02-02 2 1
# 6: 1 2021-02-03 2 0
# 7: 1 2021-02-05 2 0
# 8: 1 2021-02-07 2 0
# 9: 1 2021-02-09 2 0
#10: 1 2021-02-11 2 0
#11: 1 2021-02-13 2 0
#12: 1 2021-03-16 3 1
#13: 2 2021-01-05 1 1
#14: 2 2021-01-09 1 0
#15: 2 2021-02-06 2 1
#16: 2 2021-06-01 3 1
#17: 2 2021-10-31 4 1
#18: 3 2021-01-01 1 1
#19: 3 2021-06-01 2 1
#20: 3 2021-12-31 3 1
#21: 4 2021-05-05 1 1
# id date group count
sample data used
DT <- fread("id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021")
# set date as actual date
DT[, date := as.Date(date, "%d/%m/%Y")]
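The expected result in the question reads more like a rolling rule than fixed bins: an observation is kept only when it falls more than 30 days after the last kept observation of the same id. A minimal sketch of that reading, assuming a hypothetical helper keep_every() (not part of data.table) and a small subset of the dates:

```r
library(data.table)

# Small subset of the sample data (ids 1 and 2 only)
DT <- data.table(
  id   = c(1, 1, 1, 1, 1, 2, 2, 2),
  date = as.Date(c("2021-01-03", "2021-01-04", "2021-02-02", "2021-02-03",
                   "2021-03-16", "2021-01-05", "2021-01-09", "2021-02-06"))
)
setkey(DT, id, date)

# Hypothetical helper: flag a date with 1 when it lies more than `gap` days
# after the most recently flagged date, otherwise 0. Expects sorted dates.
keep_every <- function(d, gap = 30) {
  keep <- integer(length(d))
  last_idx <- 0L
  for (i in seq_along(d)) {
    if (last_idx == 0L || as.numeric(d[i] - d[last_idx]) > gap) {
      keep[i] <- 1L
      last_idx <- i
    }
  }
  keep
}

DT[, count := keep_every(date), by = id]
DT
```

The loop is sequential on purpose: whether a row is kept depends on which earlier row was kept, which a fixed-bin integer division cannot express.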
Related
I am hoping to add a column to a data.table that tells me the number of days since the opposite (or other) event occurred.
The dataset I have looks like the following:
date event id obs_since_event_1 obs_since_event_2
2000-07-06 2 1 NA NA
2000-07-07 1 1 NA 1
2000-07-09 0 1 1 2
2000-07-10 0 1 2 3
2000-07-15 2 1 3 4
2000-07-16 1 1 4 1
2000-07-20 0 1 1 2
2000-07-21 1 1 2 3
2000-07-06 1 2 NA NA
2000-07-07 2 2 1 NA
2000-07-15 0 2 2 1
2000-07-16 0 2 3 2
2000-07-17 2 2 4 3
2000-07-18 1 2 5 1
I am hoping to add a column called days_since_opposite, which records the number of days since the opposite event occurred (the opposite events being 1 and 2). I already have the number of observations since either an event 1 or an event 2 occurred. Now I need to work out a conditional that works in data.table and gives me the corresponding values in the final column.
date event id obs_since_event_1 obs_since_event_2 days_since_opposite
2000-07-06 2 1 NA NA NA
2000-07-07 1 1 NA 1 NA
2000-07-09 0 1 1 2 NA
2000-07-10 0 1 2 3 NA
2000-07-15 2 1 3 4 3
2000-07-16 1 1 4 1 1
2000-07-20 0 1 1 2 NA
2000-07-21 1 1 2 3 3
I hope this is clear. I also have different ids to reckon with but not sure if it impacts the results.
I tried something along the following lines but it did not work:
data[,days_since_opposite:=ifelse(event==1,obs_since_event_2,ifelse(event==2,obs_since_event_1,0)),]
Thanks in advance
DATA
Input = (
' date event id obs_since_event_1 obs_since_event_2
2000-07-06 2 1 NA NA
2000-07-07 1 1 NA 1
2000-07-09 0 1 1 2
2000-07-10 0 1 2 3
2000-07-15 2 1 3 4
2000-07-16 1 1 4 1
2000-07-20 0 1 1 2
2000-07-21 1 1 2 3
2000-07-06 1 2 NA NA
2000-07-07 2 2 1 NA
2000-07-15 0 2 2 1
2000-07-16 0 2 3 2
2000-07-17 2 2 4 3
2000-07-18 1 2 5 1')
df = read.table(textConnection(Input), header = TRUE)
Here is an option:
#identify the opposite event
DT[, oppev := c(0L, 2L, 1L)[event + 1L]]
#for event 1 and 2, perform non-equi join to find the prev opp event
DT[event %in% c(1L, 2L), days_since_opposite := DT[.SD,
on=.(id, event=oppev, date<date), mult="last", as.integer(i.date - x.date)]]
output:
date event id oppev days_since_opposite
1: 2000-07-06 2 1 1 NA
2: 2000-07-07 1 1 2 1
3: 2000-07-09 0 1 0 NA
4: 2000-07-10 0 1 0 NA
5: 2000-07-15 2 1 1 8
6: 2000-07-16 1 1 2 1
7: 2000-07-20 0 1 0 NA
8: 2000-07-21 1 1 2 6
9: 2000-07-06 1 2 2 NA
10: 2000-07-07 2 2 1 1
11: 2000-07-15 0 2 0 NA
12: 2000-07-16 0 2 0 NA
13: 2000-07-17 2 2 1 11
14: 2000-07-18 1 2 2 1
data:
library(data.table)
DT <- fread("date event id
2000-07-06 2 1
2000-07-07 1 1
2000-07-09 0 1
2000-07-10 0 1
2000-07-15 2 1
2000-07-16 1 1
2000-07-20 0 1
2000-07-21 1 1
2000-07-06 1 2
2000-07-07 2 2
2000-07-15 0 2
2000-07-16 0 2
2000-07-17 2 2
2000-07-18 1 2")[, date := as.IDate(date, format="%Y-%m-%d")]
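The first line of the answer relies on positional indexing into a small lookup vector: event codes 0, 1, 2 are shifted by one to give positions 1, 2, 3, and the vector c(0L, 2L, 1L) stores the opposite code at each position. A minimal sketch of just that step, using a made-up event vector:

```r
# Made-up event codes: 0 = no event, 1 and 2 are the two opposite events
event <- c(2L, 1L, 0L, 0L, 2L)

# Position event + 1 in the lookup vector holds the opposite event code
opposite_lookup <- c(0L, 2L, 1L)
oppev <- opposite_lookup[event + 1L]

oppev
```

Event 0 maps to itself, so rows without an event never match an event 1 or 2 row in the subsequent join.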
I am wondering how to calculate the number of observations since the same type of event, and also the number of observations since any other type of event. I also have ids in my data.table.
To illustrate, please see below. I am trying to do this in R using data.table, but with little success.
What I have is a datatable as follows:
date event id
2000-07-06 2 1
2000-07-07 1 1
2000-07-09 0 1
2000-07-10 0 1
2000-07-15 2 1
2000-07-16 1 1
2000-07-20 0 1
2000-07-21 1 1
2000-07-06 1 2
2000-07-07 2 2
2000-07-15 0 2
2000-07-16 0 2
2000-07-17 2 2
2000-07-18 1 2
and what I would like to have is something like this:
date event id obs_since_event_1 obs_since_event_2
2000-07-06 2 1 NA NA
2000-07-07 1 1 NA 1
2000-07-09 0 1 1 2
2000-07-10 0 1 2 3
2000-07-15 2 1 3 4
2000-07-16 1 1 4 1
2000-07-20 0 1 1 2
2000-07-21 1 1 2 3
2000-07-06 1 2 NA NA
2000-07-07 2 2 1 NA
2000-07-15 0 2 2 1
2000-07-16 0 2 3 2
2000-07-17 2 2 4 3
2000-07-18 1 2 5 1
The two events are mutually exclusive, that is, they cannot take place on the same observed day. Hope to hear some good advice. All the best.
Here's a way to do this using dplyr and non-standard evaluation:
library(dplyr)
apply_fun <- function(df, value) {
col <- paste0('obs_since_event_', value)
df %>%
group_by(id) %>%
group_by(temp = lag(cumsum(event == value), default = 0), .add = TRUE) %>%
mutate(!!col := row_number()) %>%
ungroup() %>%
mutate(!!col := replace(!!sym(col), temp == 0, NA)) %>%
select(-temp)
}
df <- apply_fun(df, 1)
df <- apply_fun(df, 2)
df
# A tibble: 14 x 5
# date event id obs_since_event_1 obs_since_event_2
# <fct> <int> <int> <int> <int>
# 1 2000-07-06 2 1 NA NA
# 2 2000-07-07 1 1 NA 1
# 3 2000-07-09 0 1 1 2
# 4 2000-07-10 0 1 2 3
# 5 2000-07-15 2 1 3 4
# 6 2000-07-16 1 1 4 1
# 7 2000-07-20 0 1 1 2
# 8 2000-07-21 1 1 2 3
# 9 2000-07-06 1 2 NA NA
#10 2000-07-07 2 2 1 NA
#11 2000-07-15 0 2 2 1
#12 2000-07-16 0 2 3 2
#13 2000-07-17 2 2 4 3
#14 2000-07-18 1 2 5 1
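Since the question asks about data.table, the same idea can be sketched there as well: shift the running count of the event by one row, number the rows within each shifted run, and blank out the rows before the first occurrence. The helper name obs_since() is an assumption made for this sketch, and only id 1 of the sample data is used:

```r
library(data.table)

# id 1 from the sample data
DT <- data.table(id = rep(1L, 8),
                 event = c(2L, 1L, 0L, 0L, 2L, 1L, 0L, 1L))

# Hypothetical helper: number of observations since the last `value` event
obs_since <- function(event, value) {
  # running count of the event, lagged so a run starts on the row AFTER the event
  grp <- shift(cumsum(event == value), fill = 0L)
  # rowid() numbers rows within each run; run 0 precedes the first event
  replace(rowid(grp), grp == 0L, NA_integer_)
}

DT[, `:=`(obs_since_event_1 = obs_since(event, 1L),
          obs_since_event_2 = obs_since(event, 2L)), by = id]
DT
```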
I have a data table in the below format :
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
From this data table I would like to update all the NA in between the two values in c2 as below:
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 10
1 1 10
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 11
1 1 11
1 1 11
2 1 NA
2 1 12
2 1 12
2 1 12
2 1 12
You can do it using a for loop and which():
df=data.frame(id = c(rep(1,12)),c2 = c(NA,NA,10,NA,NA,10, NA,NA,11,NA,11,NA))
Find unique values of c2:
vals=unique(df[which(!is.na(df$c2)),'c2'])
Loop through unique values and replace observations between their first and last appearance:
for(i in vals){
df[min(which(df$c2==i)):max(which(df$c2==i)),'c2']=i
}
Besides David's approach, which works directly with row indices, there is another data.table approach that uses a non-equi join:
# coerce to data.table
setDT(DT)[
# append unique row id
, rn := .I][
# non-equi join on row ids
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][
# remove row id column
, rn := NULL][]
id c1 c2
1: 1 1 NA
2: 1 1 NA
3: 1 1 10
4: 1 1 10
5: 1 1 10
6: 1 1 10
7: 1 1 NA
8: 1 1 NA
9: 1 1 11
10: 1 1 11
11: 1 1 11
12: 1 1 11
13: 2 1 NA
14: 2 1 12
15: 2 1 12
16: 2 1 12
17: 2 1 12
Caveat
The expression
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2]
returns the row id ranges for each unique value of c2
c2 rmin rmax
1: 10 3 6
2: 11 9 12
3: 12 14 17
There is an implicit assumption that the row id ranges do not overlap, that is, each "gap" must be associated with a unique c2 value. This affects the other solutions as well.
Improved solution using rleid()
The code can be improved to handle cases where the above mentioned assumption is violated.
Using rleid(), we can distinguish different gaps even if they have the same c2 value. For instance, for the second sample data set:
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)]
rleid c2 rmin rmax
1: 1 10 3 6
2: 2 11 9 12
3: 3 12 14 17
4: 4 10 20 23
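To see why grouping by run id rather than by value matters, here is a small sketch on a made-up vector in which 10 appears in two separate runs. Grouping by value gives one range for 10 that spans everything in between, while grouping by rleid() gives one range per run:

```r
library(data.table)

# Made-up data: the value 10 occurs in two separate runs
x <- data.table(rn = 1:8, c2 = c(10, 10, 11, 11, 12, 12, 10, 10))

# One range per distinct value: the range for 10 covers rows 1 to 8
by_value <- x[, .(rmin = min(rn), rmax = max(rn)), by = c2]

# One range per run: the two runs of 10 stay separate
by_run <- x[, .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)]

by_value
by_run
```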
The complete code:
setDT(DT2)[, rn := .I][
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][, rn := NULL][]
id c1 c2
1: 1 1 NA
2: 1 1 NA
3: 1 1 10
4: 1 1 10
5: 1 1 10
6: 1 1 10
7: 1 1 NA
8: 1 1 NA
9: 1 1 11
10: 1 1 11
11: 1 1 11
12: 1 1 11
13: 2 1 NA
14: 2 1 12
15: 2 1 12
16: 2 1 12
17: 2 1 12
18: 2 1 NA
19: 2 1 NA
20: 2 1 10
21: 2 1 10
22: 2 1 10
23: 2 1 10
24: 2 1 NA
25: 2 1 NA
id c1 c2
Data
library(data.table)
DT <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12")
Expanded data set (note the repeated appearance of c2 == 10):
DT2 <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA")
We can make use of the fact that, for the rows we want to fill, filling up yields the same result as filling down:
library(tidyverse)
df %>%
mutate(filled_down = c2, filled_up = c2) %>%
fill(filled_down, .direction="down") %>%
fill(filled_up, .direction="up") %>%
mutate(c2 = ifelse(filled_down == filled_up, filled_down, c2)) %>%
select(-filled_down, -filled_up)
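The same fill-down/fill-up comparison can be sketched without tidyr, assuming data.table's nafill() is available (it handles numeric and integer vectors). The vector below is the c2 column for id 1 from the question:

```r
library(data.table)

c2 <- c(NA, NA, 10, NA, NA, 10, NA, NA, 11, NA, NA, 11)

down <- nafill(c2, type = "locf")  # last observation carried forward
up   <- nafill(c2, type = "nocb")  # next observation carried backward

# Fill only where both directions agree; otherwise keep the original value
agree  <- !is.na(down) & !is.na(up) & down == up
filled <- ifelse(agree, down, c2)

filled
```

Note that this also fills a gap between two separate runs that happen to share the same value, and that it would need to be applied per id to avoid filling across id boundaries.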
dfin <-
ID SEQ GRP C1 C2 C3 T1 T2 T3
1 1 1 0 5 8 0 1 2
1 2 1 5 10 15 5 6 7
2 1 2 20 25 30 0 1 2
C1 is the concentration (CONC) at T1 (TIME) and so on. This is what I want as an output:
dfout <-
ID SEQ GRP CONC TIME
1 1 1 0 0
1 1 1 5 1
1 1 1 8 2
1 2 1 5 5
1 2 1 10 6
1 2 1 15 7
2 1 2 20 0
2 1 2 25 1
2 1 2 30 2
The real dfin has many more Cx and Tx columns, where x is the index of the concentration reading.
You can do this with data.table::melt, with its capability of melting the table into multiple columns based on the columns pattern:
library(data.table)
melt(
setDT(df),
id.vars=c("ID", "SEQ", "GRP"),
# columns starting with C and with T are melted into two separate value columns
measure.vars=patterns("^C", "^T"),
value.name=c('CONC', 'TIME')
)[order(ID, SEQ)][, variable := NULL][]
# ID SEQ GRP CONC TIME
#1: 1 1 1 0 0
#2: 1 1 1 5 1
#3: 1 1 1 8 2
#4: 1 2 1 5 5
#5: 1 2 1 10 6
#6: 1 2 1 15 7
#7: 2 1 2 20 0
#8: 2 1 2 25 1
#9: 2 1 2 30 2
Or, if the value column names follow the pattern [CT][0-9], you can use reshape from base R with sep="", which splits the value column names at the letter/digit boundary because of this default setting (from ?reshape):
split = if (sep == "") {
list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
list(regexp = sep, include = FALSE, fixed = TRUE)}
The split is only applied when v.names is not supplied. If v.names is given together with a plain vector for varying, reshape assumes the columns are ordered C1, T1, C2, T2, ... and, with the column order shown here (C1, C2, C3, T1, T2, T3), it mixes concentrations and times. So let reshape derive the names and rename them afterwards:
long <- reshape(df, varying=-(1:3), idvar=c("ID", "SEQ", "GRP"),
direction="long", sep="")
# the value columns come back as C and T, plus a "time" index column (1, 2, 3)
names(long)[names(long) == "C"] <- "CONC"
names(long)[names(long) == "T"] <- "TIME"
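If you prefer to keep v.names, one way to avoid relying on column order is to pass varying as a list with one character vector per output column. A minimal base R sketch, rebuilding dfin from the question as a plain data.frame:

```r
dfin <- data.frame(
  ID = c(1, 1, 2), SEQ = c(1, 2, 1), GRP = c(1, 1, 2),
  C1 = c(0, 5, 20), C2 = c(5, 10, 25), C3 = c(8, 15, 30),
  T1 = c(0, 5, 0),  T2 = c(1, 6, 1),   T3 = c(2, 7, 2)
)

# Each list element names the wide columns that feed one long column
long <- reshape(
  dfin,
  varying   = list(paste0("C", 1:3), paste0("T", 1:3)),
  v.names   = c("CONC", "TIME"),
  idvar     = c("ID", "SEQ", "GRP"),
  direction = "long"
)

# Order as in the desired output and drop the helper index column "time"
long <- long[order(long$ID, long$SEQ, long$time),
             c("ID", "SEQ", "GRP", "CONC", "TIME")]
rownames(long) <- NULL
long
```

For a wider table, the two vectors could be built with grep("^C", names(dfin), value = TRUE) and the matching call for "^T", assuming no id column starts with those letters.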
I am trying to use conditional statements to obtain some variables in a data table. Here's some simplified data, the code and the results:
> dt
id trial bet outcome
1: 11 1 1 6
2: 11 2 456 2
3: 11 3 3456 3
4: 11 4 456 6
5: 12 1 34 6
6: 12 2 3456 2
7: 12 3 12 4
8: 12 4 123 2
dt1=dt[,list(
nbet=nchar(bet),
if (nchar(bet)>2.5) riskybet=1 else riskybet=0,
if (grepl(outcome,bet)==TRUE) win=1 else win=0),
by='id,trial']
> dt1
id trial nbet V2 V3
1: 11 1 1 0 0
2: 11 2 3 1 0
3: 11 3 4 1 1
4: 11 4 3 1 1
5: 12 1 2 0 0
6: 12 2 4 1 0
7: 12 3 2 0 0
8: 12 4 3 1 1
The conditional statements are working as they should but without the assigned variable names 'riskybet' and 'win', i.e. they appear as V2 and V3. What am I doing wrong?
You are assigning values to variables "inside" the if/else-statement. Try this:
dt1=dt[,list(
nbet=nchar(bet),
riskybet = if (nchar(bet)>2.5) 1 else 0,
win = if (grepl(outcome, bet)) 1 else 0),
by='id,trial']
id trial nbet riskybet win
1: 11 1 1 0 0
2: 11 2 3 1 0
3: 11 3 4 1 1
4: 11 4 3 1 1
5: 12 1 2 0 0
6: 12 2 4 1 0
7: 12 3 2 0 0
8: 12 4 3 1 1
Alternatively you could also use ifelse instead of the traditional if-else.
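A sketch of the ifelse variant, which drops the by-row grouping. One assumption to flag: grepl() is not vectorised over its pattern argument, so the win column pairs each outcome with its bet through mapply() here. The data is rebuilt from the printed table in the question:

```r
library(data.table)

dt <- data.table(
  id      = c(11, 11, 11, 11, 12, 12, 12, 12),
  trial   = rep(1:4, 2),
  bet     = c(1, 456, 3456, 456, 34, 3456, 12, 123),
  outcome = c(6, 2, 3, 6, 6, 2, 4, 2)
)

dt1 <- dt[, .(
  id, trial,
  nbet     = nchar(bet),
  # ifelse() is vectorised, so no grouping by row is needed
  riskybet = ifelse(nchar(bet) > 2.5, 1, 0),
  # test each outcome digit against its own bet
  win      = ifelse(mapply(grepl, as.character(outcome), as.character(bet),
                           USE.NAMES = FALSE), 1, 0)
)]
dt1
```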