add column with data in rows before & after a condition - r

With inspiration from this post, I came up with a solution to add a column with 1's 2 rows before and after y==5. Like this:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
DT[DT[, rn := .I][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1, xx:=1]
My problem is that if the condition is met too close to the top, it will throw a warning and not work properly:
DT$y[2] <- 5
DT[DT[, rn := .I][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1, xx:=1]
Warning message:
In .I - .I[y == 5] :
longer object length is not a multiple of shorter object length
I suspect that the index will give numbers less than 1, but I can't figure it out. Any suggestions for altering the code so that it works under all conditions?
Wanted output:
grp y exclude xx
1: a 1 0 1
2: a 5 0 1
3: a 3 0 1
4: a 4 1 1
5: a 5 0 1
6: a 7 1 1
7: a 8 0 1
8: a 9 0 0
9: a 10 0 0
10: b 1 0 0
11: b 2 0 0
12: b 3 0 1
13: b 4 1 1
14: b 5 0 1
15: b 6 1 1
16: b 7 1 1
17: b 8 0 0
18: b 9 0 0
19: b 10 0 0
20: c 5 1 1
21: d 1 0 0

Here is another alternative using shift
val <- 5L
DT[, xx := as.integer(
Reduce(`|`, c(shift(y==val, 0L:2L, fill=FALSE), shift(y==val, 1L:2L, FALSE, "lead")))
),
by=.(grp)]
DT
Incorporating Jaap's comment, using data.table version >= 1.12.0, code is shortened to
DT[, xx := +(Reduce(`|`, shift(y==val, -2L:2L, FALSE))), by=grp]
output:
grp y exclude xx
1: a 1 0 1
2: a 5 0 1
3: a 3 0 1
4: a 4 1 1
5: a 5 0 1
6: a 7 1 1
7: a 8 0 1
8: a 9 0 0
9: a 10 0 0
10: b 1 0 0
11: b 2 0 0
12: b 3 0 1
13: b 4 1 1
14: b 5 0 1
15: b 6 1 1
16: b 7 1 1
17: b 8 0 0
18: b 9 0 0
19: b 10 0 0
20: c 5 1 1
21: d 1 0 0
grp y exclude xx
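For what it's worth, the warning in the question does not appear to come from indices below 1. After DT$y[2] <- 5, group a contains two rows with y == 5, so .I[y == 5] has length two and is recycled against the nine row numbers of the group. The following is only a sketch of a variant that tolerates several matches per group by comparing each row position with every hit; the smaller table and the helper name hits are made up for illustration.

```r
library(data.table)

# Reduced version of the question's data: group a has y == 5 twice
DT <- data.table(grp = rep(c("a", "b"), c(9, 10)),
                 y   = c(1, 5, 3, 4, 5, 7, 8, 9, 10, 1:10))

DT[, xx := {
  hits <- which(y == 5)                      # positions of all matches in this group
  as.integer(vapply(seq_len(.N),
                    function(i) any(abs(i - hits) <= 2L),  # within 2 rows of any hit?
                    logical(1)))
}, by = grp]
DT
```

If a group has no match at all, hits is empty, any() returns FALSE, and every row of that group gets 0.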

Related

Conditionally replace value in a single row or replace value of following rows with values from previous row group in R

I have a huge data.table with over 20'000 rows, with a column t for each time point and a column id for each customer. I am looking for a way to replace, for each customer id, the values of y at t=5:8 with the values of y at t=3 and t=4.
The data set below is a short version of my data set:
Dt=data.table(
t=rep(1:8, times=3),
y=c(0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0),
id=rep(1:3, each=8))
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 1 1
7: 7 1 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 0 2
15: 7 1 2
16: 8 0 2
17: 1 0 3
18: 2 1 3
19: 3 1 3
20: 4 1 3
21: 5 0 3
22: 6 1 3
23: 7 0 3
24: 8 0 3
In the end it should look like this:
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 0 1
7: 7 0 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 1 2
15: 7 0 2
16: 8 1 2
17: 1 0 3
18: 2 1 3
19: 3 1 3
20: 4 1 3
21: 5 1 3
22: 6 1 3
23: 7 1 3
24: 8 1 3
Do you maybe have an idea how I could solve this? I thought of doing 2 for loops with the range of t and customer id, but I imagine that for this dataset it would take too long.
Thank you in advance!
Your data does not exactly match what is displayed in your post (particularly the rows where t is 3 and 4 within id 3). You could try replace as in the following approach, though I am not sure how efficient it is, since replace builds a modified copy of y before := assigns it.
library(data.table)
Dt[ , y := replace(y, t %in% 5:8, y[t %in% 3:4]), by = id]
Dt
Output
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 0 1
7: 7 0 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 1 2
15: 7 0 2
16: 8 1 2
17: 1 0 3
18: 2 1 3
19: 3 0 3
20: 4 0 3
21: 5 0 3
22: 6 0 3
23: 7 0 3
24: 8 0 3
t y id
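To see why two source values are enough to fill four target rows, here is a small base R sketch on a single customer. The vector names yy and tt are made up for illustration; the point is that replace recycles the shorter replacement vector over the selected positions.

```r
tt <- 1:8
yy <- c(0, 1, 0, 1, 0, 1, 1, 0)   # values at t = 3, 4 are 0, 1

# Positions 5..8 receive the values from t = 3, 4, recycled to 0, 1, 0, 1
out <- replace(yy, tt %in% 5:8, yy[tt %in% 3:4])
out
```

This only works cleanly when the number of target rows is a multiple of the number of source rows within every id.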

R Data Table Assign Subset of Rows and Columns with Zero

I'm trying to explode a data table into a time series by populating future time steps with values of zero. The starting data table has the following structure. Values for V1 and V2 can be thought of as values for the first time step.
dt <- data.table(ID = c(1,2,3), V1 = c(1,2,3), V2 = c(4,5,6))
ID V1 V2
1: 1 1 4
2: 2 2 5
3: 3 3 6
What I want to get to is a data table like this
ID year V1 V2
1: 1 1 1 4
2: 1 2 0 0
3: 1 3 0 0
4: 1 4 0 0
5: 1 5 0 0
6: 2 1 2 5
7: 2 2 0 0
8: 2 3 0 0
9: 2 4 0 0
10: 2 5 0 0
11: 3 1 3 6
12: 3 2 0 0
13: 3 3 0 0
14: 3 4 0 0
15: 3 5 0 0
I've exploded the original data table and appended the year column with the following
dt <- dt[, .(year=1:5), by=ID][dt, on=ID, allow.cartesian=T]
ID year V1 V2
1: 1 1 1 4
2: 1 2 1 4
3: 1 3 1 4
4: 1 4 1 4
5: 1 5 1 4
6: 2 1 2 5
7: 2 2 2 5
8: 2 3 2 5
9: 2 4 2 5
10: 2 5 2 5
11: 3 1 3 6
12: 3 2 3 6
13: 3 3 3 6
14: 3 4 3 6
15: 3 5 3 6
Any ideas on how to populate columns V1 and V2 with zeros for year!=1 would be much appreciated. I also need to avoid spelling out the V1 and V2 column names as the actual data table I'm working with has 58 columns.
I got an error with that last step, but if you have a more recent version of data.table that behaves differently, then by all means just:
dt[year != 1, V1 := 0] # logical condition in the 'i' position
dt[year != 1, V2 := 0] # data.table assign in the 'j' position
Ooops. Didn't read to the end. Will see if I can test a range of columns.
Ranges can be constructed on the LHS of data.table.[ assignment operator (:=):
> dt2[year != 1, paste0("V", 1:2) := 0 ]
> dt2
ID V1 V2 year
1: 1 1 4 1
2: 1 0 0 2
3: 1 0 0 3
4: 1 0 0 4
5: 1 0 0 5
6: 2 2 5 1
7: 2 0 0 2
8: 2 0 0 3
9: 2 0 0 4
10: 2 0 0 5
11: 3 3 6 1
12: 3 0 0 2
13: 3 0 0 3
14: 3 0 0 4
15: 3 0 0 5
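Since the question asks to avoid spelling out 58 column names, one possible variation is to build the column vector from names() instead of paste0. This is a sketch under the assumption that every column other than ID and year should be zeroed; the names long and cols are made up, and the rows are expanded with rep() rather than the join from the question.

```r
library(data.table)

dt <- data.table(ID = c(1, 2, 3), V1 = c(1, 2, 3), V2 = c(4, 5, 6))

# Repeat each row five times and number the copies as years 1..5
long <- dt[rep(seq_len(nrow(dt)), each = 5L)]
long[, year := rep(1:5, times = nrow(dt))]

# All value columns, without naming them one by one
cols <- setdiff(names(long), c("ID", "year"))

# Parentheses make := treat cols as a vector of column names
long[year != 1L, (cols) := 0]
long
```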
You can use tidyr::complete -
library(dplyr)
library(tidyr)
dt %>%
mutate(year = 1) %>%
complete(ID, year = 1:5, fill = list(V1 = 0, V2 = 0))
# ID year V1 V2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 4
# 2 1 2 0 0
# 3 1 3 0 0
# 4 1 4 0 0
# 5 1 5 0 0
# 6 2 1 2 5
# 7 2 2 0 0
# 8 2 3 0 0
# 9 2 4 0 0
#10 2 5 0 0
#11 3 1 3 6
#12 3 2 0 0
#13 3 3 0 0
#14 3 4 0 0
#15 3 5 0 0

Identify jumps from 0 to 1 using data.table

I have the following data.table with the variable of interest x. I would like to create another variable that indicates a jump in x from exactly 0 to 1, meaning the variable has been 0 until a certain year and 1 in all years following after that. This should be done by id_d.
Is there an easy data.table way for doing this?
Original data.table:
fullDat <- data.table(id_d = rep(letters[1:3], each=12),
year=rep(1:12, 3),
x = c(rep(0, 5), rep(1, 7), 0,1,0,1,2,2,4, rep(5,5), 1, rep(0, 3), rep(1, 8)))
id_d year x
1: a 1 0
2: a 2 0
3: a 3 0
4: a 4 0
5: a 5 0
6: a 6 1
7: a 7 1
8: a 8 1
9: a 9 1
10: a 10 1
11: a 11 1
12: a 12 1
13: b 1 0
14: b 2 1
15: b 3 0
16: b 4 1
17: b 5 2
18: b 6 2
19: b 7 4
20: b 8 5
21: b 9 5
22: b 10 5
23: b 11 5
24: b 12 5
25: c 1 1
26: c 2 0
27: c 3 0
28: c 4 0
29: c 5 1
30: c 6 1
31: c 7 1
32: c 8 1
33: c 9 1
34: c 10 1
35: c 11 1
36: c 12 1
id_d year x
What the result should look like:
id_d year x jump
1: a 1 0 0
2: a 2 0 0
3: a 3 0 0
4: a 4 0 0
5: a 5 0 0
6: a 6 1 1
7: a 7 1 0
8: a 8 1 0
9: a 9 1 0
10: a 10 1 0
11: a 11 1 0
12: a 12 1 0
13: b 1 0 0
14: b 2 1 0
15: b 3 0 0
16: b 4 1 0
17: b 5 2 0
18: b 6 2 0
19: b 7 4 0
20: b 8 5 0
21: b 9 5 0
22: b 10 5 0
23: b 11 5 0
24: b 12 5 0
25: c 1 1 0
26: c 2 0 0
27: c 3 0 0
28: c 4 0 0
29: c 5 1 0
30: c 6 1 0
31: c 7 1 0
32: c 8 1 0
33: c 9 1 0
34: c 10 1 0
35: c 11 1 0
36: c 12 1 0
id_d year x jump
the variable has been 0 until a certain year and 1 in all years following
# find rows to assign one
wDT = fullDat[, .(year = year[with(rle(x),
if (identical(values, c(0, 1))) first(lengths) + 1L
else 0L
)]), by=id_d]
# initialize to zero
fullDat[, jump := 0L ]
# update join to assign ones
fullDat[wDT, on=.(id_d, year), jump := 1L ]
It is not necessary to make the intermediate table wDT; writing the full code for it into the final statement would work too. In fact, it could all be in one line if wanted, something like...
DT[, x := 0L][code_for_wDT, on=on_cols, x := 1L]
Alternately, instead of a join, just use the row numbers from .I:
# find rows to assign one
w = fullDat[, with(rle(x), .I[
if (identical(values, c(0, 1))) first(lengths) + 1L
else 0L
]), by=id_d]$V1
# initialize to zero
fullDat[, jump := 0L ]
# update to assign ones
fullDat[w, jump := 1L ]
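As a rough illustration of what the rle test above is checking, here is the same idea on plain vectors in base R. The helper jump_pos is hypothetical and not part of the answer; it returns the position of the first 1 when the run-length values are exactly 0 followed by 1, and 0 otherwise.

```r
# Hypothetical helper: position of the 0 -> 1 jump, or 0 if the pattern does not hold
jump_pos <- function(x) {
  r <- rle(x)
  if (identical(r$values, c(0, 1))) r$lengths[1] + 1L else 0L
}

p_a <- jump_pos(c(0, 0, 0, 1, 1))   # zeros then ones: jump at position 4
p_b <- jump_pos(c(0, 1, 0, 1, 2))   # other values and back-and-forth: no jump
p_c <- jump_pos(c(1, 0, 0, 1, 1))   # starts at 1: no jump
c(p_a, p_b, p_c)
```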
We can do
fullDat[, jump := {i1 <- which.max(x)
if(all(x[i1:.N]==1)) replace(rep(0, .N), i1, 1) else 0},
id_d]
fullDat
# id_d year x jump
# 1: a 1 0 0
# 2: a 2 0 0
# 3: a 3 0 0
# 4: a 4 0 0
# 5: a 5 0 0
# 6: a 6 1 1
# 7: a 7 1 0
# 8: a 8 1 0
# 9: a 9 1 0
#10: a 10 1 0
#11: a 11 1 0
#12: a 12 1 0
#13: b 1 0 0
#14: b 2 1 0
#15: b 3 0 0
#16: b 4 1 0
#17: b 5 2 0
#18: b 6 2 0
#19: b 7 4 0
#20: b 8 5 0
#21: b 9 5 0
#22: b 10 5 0
#23: b 11 5 0
#24: b 12 5 0
#25: c 1 1 0
#26: c 2 0 0
#27: c 3 0 0
#28: c 4 0 0
#29: c 5 1 0
#30: c 6 1 0
#31: c 7 1 0
#32: c 8 1 0
#33: c 9 1 0
#34: c 10 1 0
#35: c 11 1 0
#36: c 12 1 0
Or a slightly more compact option is
fullDat[, jump := if(all(cumsum(diff(x)) %in% c(0,1))) c(0, diff(x)) else 0 ,id_d]
fullDat[, jump := (cumsum(x==0)==(1:.N - 1L)) & (rev(cumsum(rev(x==1))) == .N:1), id_d]
How this works:
cumsum(x==0) == (1:.N - 1L) checks that the number of zeros up to and including this row is equal to the number of previous rows
rev(cumsum(rev(x==1))) == .N:1 checks that the number of ones, counting from the final row backwards (down to and including this row), is equal to the number of rows from here to the end
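The two checks can be traced on a single short vector in base R. This is only an illustrative sketch; n stands in for .N, and the intermediate names zeros_before and ones_after are made up.

```r
x <- c(0, 0, 0, 1, 1)
n <- length(x)

# TRUE where every earlier row is a zero (and this row is not)
zeros_before <- cumsum(x == 0) == (seq_len(n) - 1L)

# TRUE where this row and every later row is a one
ones_after <- rev(cumsum(rev(x == 1))) == (n:1)

jump <- zeros_before & ones_after
which(jump)
```

Only the row where both conditions hold at once, the first 1 after an unbroken run of zeros, is flagged.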

R data.table conditional (min/max) aggregation

I'm relatively new to R and I have a question regarding how to do conditional aggregation using data.tables (or other methods) while still accessing the table columns by reference. There was an answer to a similar question here but it takes a long time on my data and takes a lot of memory. Here is some toy data:
t <- data.table(User=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
Obs=c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
Flag=c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))
Which looks like this:
User Obs Flag
1: 1 1 0
2: 1 2 1
3: 1 3 0
4: 1 4 1
5: 1 5 0
6: 2 1 0
7: 2 2 1
8: 2 3 0
9: 2 4 0
10: 3 1 1
11: 3 2 0
12: 3 3 0
13: 3 4 0
14: 3 5 1
15: 3 6 0
What I would like to do with this is to get, by user, the maximum observation less than or equal to the current observation where the flag is 1. The output should look like this:
User Obs Flag min.max
1: 1 1 0 NA
2: 1 2 1 2
3: 1 3 0 2
4: 1 4 1 4
5: 1 5 0 4
6: 2 1 0 NA
7: 2 2 1 2
8: 2 3 0 2
9: 2 4 0 2
10: 3 1 1 1
11: 3 2 0 1
12: 3 3 0 1
13: 3 4 0 1
14: 3 5 1 5
15: 3 6 0 5
Any help would be greatly appreciated!
t[, max := Obs[Flag == 1], by = .(User, cumsum(diff(c(0, Flag)) == 1))]
t
# User Obs Flag max
# 1: 1 1 0 NA
# 2: 1 2 1 2
# 3: 1 3 0 2
# 4: 1 4 1 4
# 5: 1 5 0 4
# 6: 2 1 0 NA
# 7: 2 2 1 2
# 8: 2 3 0 2
# 9: 2 4 0 2
#10: 3 1 1 1
#11: 3 2 0 1
#12: 3 3 0 1
#13: 3 4 0 1
#14: 3 5 1 5
#15: 3 6 0 5
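A different way to get the same column, sketched here as an alternative rather than as the answer's method, is a running maximum over the flagged observations. It assumes Obs is strictly positive and increasing within each user, so that 0 can serve as a "no flag seen yet" marker; the names d and last_flag are made up.

```r
library(data.table)

d <- data.table(User = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
                Obs  = c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
                Flag = c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))

d[, last_flag := {
  m <- cummax(Obs * (Flag == 1))   # unflagged rows contribute 0
  replace(m, m == 0, NA)           # 0 means no flagged row so far
}, by = User]
d
```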

R data.table update errors

Some more problems I'm having with old data.table code related to this:
R: number rows that match >= other row within group
The data looks like this, with a different ID for each person, IDSEQ is the sequence of each admission for this person and TAG for diabetes medication (TAG=1 hypoglycaemic agent and TAG=2 insulin).
ID IDSEQ TAG
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 1 9 0
10 1 10 0
11 2 1 0
12 2 2 0
13 2 3 0
14 2 4 1
15 2 5 0
16 2 6 0
17 2 7 0
18 2 8 2
19 2 9 0
20 2 10 0
# recreate this data with
df <- data.frame(ID=c(rep(1,10),rep(2,10)),
IDSEQ=c(1:10,1:10),
TAG=c(rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0))
Exercise: Create two new index sequence variables, COND1 using TAG=1 as the index record, and COND2 using TAG=2 as the index record. Write your syntax so that records in a block prior to the index record are numbered with a '0'.
a) TAG=1 (seems to still work)
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND1:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND1:=1:.N,by=ID]
# previously
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND1:=1:.N,by=ID]
b) TAG=2 no longer gives the correct result; the update is not linked on both ID and IDSEQ.
tmp <- df[which(df$TAG == 2),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND2:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND2:=1:.N,by=ID]
# previously worked with
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND2:=1:.N,by=ID]
The overall result should look like this
ID IDSEQ TAG COND1 COND2
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 0 0 0
6 1 6 0 0 0
7 1 7 0 0 0
8 1 8 1 1 0
9 1 9 0 2 0
10 1 10 0 3 0
11 2 1 0 0 0
12 2 2 0 0 0
13 2 3 0 0 0
14 2 4 1 1 0
15 2 5 0 2 0
16 2 6 0 3 0
17 2 7 0 4 0
18 2 8 2 5 1
19 2 9 0 6 2
20 2 10 0 7 3
# recreate this data with
data.frame(ID=c(rep(1,10),rep(2,10)),
IDSEQ=c(1:10,1:10),
TAG=c(rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0),
COND1=c(rep(0,7),1,2,3,0,0,0,1,2,3,4,5,6,7),
COND2=c(rep(0,17),1,2,3))
data.table Version 1.9.4, R version 3.1.1
Here's one way using data.table:
dt <- as.data.table(df)  # df as defined in the question
dt[, `:=`(count1 = cumsum(cumsum(TAG == 1L)),
count2 = cumsum(cumsum(TAG == 2L))
), by=ID]
# ID IDSEQ TAG count1 count2
# 1: 1 1 0 0 0
# 2: 1 2 0 0 0
# 3: 1 3 0 0 0
# 4: 1 4 0 0 0
# 5: 1 5 0 0 0
# 6: 1 6 0 0 0
# 7: 1 7 0 0 0
# 8: 1 8 1 1 0
# 9: 1 9 0 2 0
# 10: 1 10 0 3 0
# 11: 2 1 0 0 0
# 12: 2 2 0 0 0
# 13: 2 3 0 0 0
# 14: 2 4 1 1 0
# 15: 2 5 0 2 0
# 16: 2 6 0 3 0
# 17: 2 7 0 4 0
# 18: 2 8 2 5 1
# 19: 2 9 0 6 2
# 20: 2 10 0 7 3
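The nested cumsum can be followed step by step on the TAG values of a single ID in base R. This sketch assumes, as in the example data, at most one index record per ID; the names tag, after_index and counter are made up.

```r
tag <- c(0, 0, 0, 1, 0, 0, 0, 2, 0, 0)   # TAG values for ID 2

# Inner cumsum: 0 before the index record, 1 from the index record onward
after_index <- cumsum(tag == 1L)

# Outer cumsum: turns that 0/1 step into a running counter 1, 2, 3, ...
counter <- cumsum(after_index)
counter
```

With a second TAG=1 record in the same ID, the inner cumsum would step to 2 and the counter would start advancing by two, so this shortcut depends on that assumption.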
The corrected last line of your example:
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ], by=.EACHI]$V1, COND2:=1:.N, by=ID]
Alternatively, you could also alter the default behavior, although I wouldn't recommend it due to compatibility issues.
options(datatable.old.bywithoutby=TRUE)
Some background information:
The selection criteria for TAG==2 results in
DT[DT1,.I[IDSEQ >= i.IDSEQ]]
[1] 8 9 10
These are the correct row numbers for ID==2 within the result of the subset/join, not within DT itself, so the update lands on rows 8 to 10 of DT, which belong to ID==1.
You would see the same problem if there were, e.g., an ID==0 without any TAG==1.
df <- data.frame(ID=c(0, rep(1,10),rep(2,10)),
IDSEQ=c(1, 1:10,1:10),
TAG=c(0, rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0))
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND1:=0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND1:=1:.N, by=ID]
DT[c(1,2, 7:10),]
ID IDSEQ TAG COND1
1: 0 1 0 0
2: 1 1 0 0
3: 1 6 0 0
4: 1 7 0 1
5: 1 8 1 2
6: 1 9 0 3
data.table handles this situation (by-without-by) differently since version 1.9.4. When loading data.table it states
> library(data.table)
data.table 1.9.4 For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
Therefore you need to state explicitly, by adding by=.EACHI, that j (the second part of the statement) should be evaluated not once for the whole subset defined in the first part, but separately for each row of i.
More information can be found in the data.table FAQ 1.13 and 1.14 on page 5 and 6.
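Putting the pieces together, the corrected TAG=2 case might look like the sketch below. It uses on = "ID" instead of setkey, seq_len(.N) instead of 1:.N, and the made-up name w for the row numbers; these are choices made for this illustration, not part of the original code.

```r
library(data.table)

df <- data.frame(ID    = c(rep(1, 10), rep(2, 10)),
                 IDSEQ = c(1:10, 1:10),
                 TAG   = c(rep(0, 7), 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0))
DT <- as.data.table(df)

# First TAG == 2 record per ID (only ID 2 has one)
DT1 <- DT[TAG == 2, .(IDSEQ = min(IDSEQ)), by = ID]

# by = .EACHI evaluates j once per row of DT1, so .I holds row numbers of DT
w <- DT[DT1, .I[IDSEQ >= i.IDSEQ], on = "ID", by = .EACHI]$V1

DT[, COND2 := 0L]
DT[w, COND2 := seq_len(.N), by = ID]
DT
```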
