I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 1
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 1
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 1
36 d 2005 1
37 d 2006 0
38 d 2007 0
39 d 2008 0
40 d 2009 0
The unit of analysis is ID. Each ID appears once per year in the data. The Variable_of_Interest column records the state of each ID, which is 0 in some years and 1 in others.
I want to create an additional column that counts the years before and after the change (defined as going from 0 to 1) in Variable_of_Interest, while ignoring changes from 1 to 0 (as seen when the ID is equal to "d").
Any code that can help me achieve this solution would be greatly appreciated!
Preferably, I'd like the data to look like this:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 1 3
18 b 2007 1 4
19 b 2008 1 5
20 b 2009 1 6
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 1 3
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 1 2
36 d 2005 1 3
37 d 2006 0 NA
38 d 2007 0 NA
39 d 2008 0 NA
40 d 2009 0 NA
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 7),
rep(0,6), rep(1, 4),
rep(0,2), rep(1, 4), rep(0,4))
df <- data.frame(ID, Year, Variable_of_Interest)
Thank you for your help!
We could create a function:
library(dplyr)
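get_sequence <- function(x) {
  # position of the first 0 -> 1 change
  inds <- which(x == 1 & lag(x) == 0)[1]
  # position of each year relative to that change (0 at the change itself)
  vals <- seq_along(x) - inds
  # position of the first 1 -> 0 change, if any
  inds <- which(x == 0 & lag(x) == 1)[1]
  # everything from that point on is set to NA
  if (!is.na(inds)) vals[inds:length(x)] <- NA
  return(vals)
}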
and apply it to each ID:
df %>% group_by(ID) %>% mutate(Solution = get_sequence(Variable_of_Interest))
# ID Year Variable_of_Interest Solution
#1 a 2000 0 -5
#2 a 2001 0 -4
#3 a 2002 0 -3
#4 a 2003 0 -2
#5 a 2004 0 -1
#6 a 2005 1 0
#7 a 2006 1 1
#8 a 2007 1 2
#9 a 2008 1 3
#10 a 2009 1 4
#11 b 2000 0 -3
#...
#...
#33 d 2002 1 0
#34 d 2003 1 1
#35 d 2004 1 2
#36 d 2005 1 3
#37 d 2006 0 NA
#38 d 2007 0 NA
#39 d 2008 0 NA
#40 d 2009 0 NA
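To see what get_sequence does step by step, here is a minimal base R trace on the values for ID "d". It is only a sketch: prev stands in for dplyr's lag(), and the names x, prev, first_up and first_down are made up for illustration.

```r
# Variable_of_Interest for ID "d", in year order
x <- c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)

# previous value of x (NA for the first year); base R stand-in for dplyr::lag()
prev <- c(NA, head(x, -1))

# first 0 -> 1 change; which() skips any NA produced by the first position
first_up <- which(x == 1 & prev == 0)[1]

# distance of every year from that change
vals <- seq_along(x) - first_up

# first 1 -> 0 change, if there is one; blank out everything from there on
first_down <- which(x == 0 & prev == 1)[1]
if (!is.na(first_down)) vals[first_down:length(x)] <- NA

vals
```

For this input, vals runs from -2 to 3 and is followed by four NAs, which matches the Solution column requested for ID "d".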
The data looks like this:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another column, seq, that counts consecutive non-zero values of Treat within each Grp, keeping the order of Year. I think the hard part is that when Treat turns to 0, seq should be 0 (or some other placeholder), and the count should restart when Treat becomes non-zero again. An example of the final data frame is shown below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
With rleid from data.table, you can do:
library(dplyr)
df %>%
group_by(Grp, grp = data.table::rleid(Treat)) %>%
mutate(seq = row_number() * as.integer(Treat)) %>%
ungroup %>%
select(-grp)
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
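If adding data.table is not an option, the same run-based counting can be sketched in base R with rle() and sequence(). This is an alternative technique, not the code above; the helper name run_count is hypothetical, and the sketch assumes the rows are already sorted by Year within each Grp.

```r
# same data as in the question
df <- data.frame(Grp = rep(c("A", "B"), each = 10),
                 Year = rep(2001:2010, 2),
                 Treat = as.character(c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
                                        1, 1, 1, 0, 0, 0, 1, 1, 1, 0)))

# hypothetical helper: number the rows inside each run, then zero out runs of 0
run_count <- function(z) {
  sequence(rle(z)$lengths) * z
}

# ave() applies run_count within each Grp and keeps the original row order
df$seq <- ave(as.integer(df$Treat), df$Grp, FUN = run_count)
df
```

rle() finds the lengths of consecutive equal values, sequence() counts 1, 2, ... inside each of them, and multiplying by the 0/1 value resets the count to 0 wherever Treat is 0.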
I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 0
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 0
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 0
36 d 2005 1
37 d 2006 1
38 d 2007 1
39 d 2008 1
40 d 2009 1
The unit of analysis is in the ID column. The IDs repeat across each year in the data.
The Variable_of_Interest column records the state of each ID, which is 0 in some years and 1 in others.
I want to create an additional column containing sequences of numbers that count the years before and after each change (defined as going from 0 to 1) in Variable_of_Interest.
The code must account for repeated changes (defined as going from 0 to 1), such as ID b in 2002-2003 and 2006-2007.
NAs can be assigned to 0 values that are not followed by a change back to 1, such as the 0 for "c" in 2009.
The result should look like this:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 0 -1
18 b 2007 1 0
19 b 2008 1 1
20 b 2009 1 2
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 0 NA
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 0 -1
36 d 2005 1 0
37 d 2006 1 1
38 d 2007 1 2
39 d 2008 1 3
40 d 2009 1 4
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
df <- data.frame(ID, Year, Variable_of_Interest)
Thank you so much for your help!
Another option using data.table:
library(data.table)
#identify runs (DF is built in the data section below)
setDT(DF)[, ri := rleid(voi)]
#generate the desired output depending on whether VOI is 1 or 0
DF[, soln := if (voi[1L]==1L) seq.int(.N) - 1L else -rev(seq.int(.N)), .(ID, ri)]
#replace trailing 0 with NA
DF[, soln := if(voi[.N]==0L) replace(soln, ri==ri[.N], NA_integer_) else soln, ID]
data:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
DF <- data.frame(ID, Year, voi=Variable_of_Interest)
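The three steps in the data.table code (identify runs, number each run, blank out a trailing run of zeros) can also be traced on a single ID in base R. The following is a rough sketch under that reading, not the answer's own code; run_solution is a hypothetical helper name.

```r
# hypothetical helper: apply the three steps to one ID's vector of 0/1 values
run_solution <- function(v) {
  r <- rle(v)                                   # step 1: identify runs
  out <- unlist(lapply(seq_along(r$lengths), function(i) {
    n <- r$lengths[i]
    # step 2: runs of 1 count up from 0, runs of 0 count up to -1
    if (r$values[i] == 1) seq_len(n) - 1L else -rev(seq_len(n))
  }))
  k <- length(r$lengths)
  # step 3: a trailing run of 0 never reaches a change, so it becomes NA
  if (r$values[k] == 0) out[(length(v) - r$lengths[k] + 1):length(v)] <- NA
  out
}

b <- c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1)   # ID "b" from the question
cc <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)  # ID "c" from the question
run_solution(b)
run_solution(cc)
```

Note that under this logic an ID whose values are all 0 comes out entirely NA, since its only run is also its trailing run.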
Here is a dplyr option:
library(dplyr)
df %>%
group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
mutate(Solution = case_when(Variable_of_Interest == 0 ~ rev(-1:(-n())), Variable_of_Interest == 1 ~ 0:(n() - 1))) %>%
group_by(ID) %>%
mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
select(-idx)
Output (shown with the helper column idx still included):
ID Year Variable_of_Interest idx Solution
1 a 2000 0 0 -5
2 a 2001 0 0 -4
3 a 2002 0 0 -3
4 a 2003 0 0 -2
5 a 2004 0 0 -1
6 a 2005 1 1 0
7 a 2006 1 1 1
8 a 2007 1 1 2
9 a 2008 1 1 3
10 a 2009 1 1 4
11 b 2000 0 2 -3
12 b 2001 0 2 -2
13 b 2002 0 2 -1
14 b 2003 1 3 0
15 b 2004 1 3 1
16 b 2005 1 3 2
17 b 2006 0 4 -1
18 b 2007 1 5 0
19 b 2008 1 5 1
20 b 2009 1 5 2
21 c 2000 0 6 -6
22 c 2001 0 6 -5
23 c 2002 0 6 -4
24 c 2003 0 6 -3
25 c 2004 0 6 -2
26 c 2005 0 6 -1
27 c 2006 1 7 0
28 c 2007 1 7 1
29 c 2008 1 7 2
30 c 2009 0 8 NA
31 d 2000 0 8 -2
32 d 2001 0 8 -1
33 d 2002 1 9 0
34 d 2003 1 9 1
35 d 2004 0 10 -1
36 d 2005 1 11 0
37 d 2006 1 11 1
38 d 2007 1 11 2
39 d 2008 1 11 3
40 d 2009 1 11 4
I have a dataframe that looks like this one
state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0
I want to group it by the state identifier and, for each state, compute the number of treated observations (treat) whose date lies between start and end.
In other words, I want to get the following:
state start end date treat result
1 1999 2000 2001 1 0
1 1998 2000 2001 1 0
1 2000 2003 NA 0 2
2 2001 2002 NA 0 0
2 2002 2004 2003 1 1
2 2003 2004 2005 1 0
3 2002 2004 2006 1 0
3 2003 2004 NA 0 0
3 2005 2008 NA 0 1
For instance, result in the first row is equal to 0 because within state = 1 there is no date between 1999 and 2000. On the other hand, result in the last row is equal to 1 because within state 3 there is one treated unit whose date lies between 2005 and 2008 (namely date = 2006 in the 7th row).
Thank you very much for your help.
You can split by state and combine two outer calls with & to test whether each date lies between start and end, then sum treat over the matching dates.
x$result <- unlist(lapply(split(x, x$state), function(y) {
tt <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
tt[is.na(tt)] <- TRUE
apply(tt, 1, function(z) sum(y$treat[z]))
}))
x
# state start end date treat result
#1 1 1999 2000 2001 1 0
#2 1 1998 2000 2001 1 0
#3 1 2000 2003 NA 0 2
#4 2 2001 2002 NA 0 0
#5 2 2002 2004 2003 1 1
#6 2 2003 2004 2005 1 0
#7 3 2002 2004 2006 1 0
#8 3 2003 2004 NA 0 0
#9 3 2005 2007 NA 0 1
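To make the outer step concrete, here is a small sketch on the three rows of state 1 only. The names y, inside and result are chosen for illustration. NA comparisons are set to FALSE here, a small variation on the TRUE used above; both give the same sums on this data because the rows with a missing date have treat = 0.

```r
# the rows of state 1 from the question
y <- data.frame(start = c(1999, 1998, 2000),
                end   = c(2000, 2000, 2003),
                date  = c(2001, 2001, NA),
                treat = c(1, 1, 0))

# inside[i, j] is TRUE when date j lies strictly between start i and end i
inside <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")

# comparisons against a missing date are NA; treat them as "not inside"
inside[is.na(inside)] <- FALSE

# for each row i, add up treat over the dates that fall inside its interval
result <- apply(inside, 1, function(z) sum(y$treat[z]))
result
```

Only the third interval (2000 to 2003) contains the two dates equal to 2001, so the third row gets 2 and the first two rows get 0.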
Or you can take the part describing treat per state and date, merge it with the part describing state, start and end, and sum the matching treat.
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
tt <- merge(x[,c("state", "start", "end")], tt)
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
aggregate(treat ~ start + end + state, tt, sum)
# start end state treat
#1 1998 2000 1 0
#2 1999 2000 1 0
#3 2000 2003 1 2
#4 2001 2002 2 0
#5 2002 2004 2 1
#6 2003 2004 2 0
#7 2002 2004 3 0
#8 2003 2004 3 0
#9 2005 2007 3 1
This gives your numbers though it repeats them on every row:
library(tidyverse)
df %>% group_by(state) %>%
mutate(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 9 x 6
#> # Groups: state [3]
#> state start end date treat result
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1999 2000 2001 1 2
#> 2 1 1998 2000 2001 1 2
#> 3 1 2000 2003 NA 0 2
#> 4 2 2001 2002 NA 0 1
#> 5 2 2002 2004 2003 1 1
#> 6 2 2003 2004 2005 1 1
#> 7 3 2002 2004 2006 1 1
#> 8 3 2003 2004 NA 0 1
#> 9 3 2005 2007 NA 0 1
If you just want one number per group, summarize might be a better option:
df %>% group_by(state) %>%
summarize(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 3 x 2
#> state result
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 3 1
I have this kind of data:
df <- data.frame(year=c(1999,1999,1999,2000,2000,2001,2011,2011,2011,2011), class=c("A","B","C","A","C","A","B","C","D","E"),
n=c(10,20,30,12,15,40,50,55,60,5), occurs=c(0,1,3,4,2,0,0,11,12,2))
> df
year class n occurs
1 1999 A 10 0
2 1999 B 20 1
3 1999 C 30 3
4 2000 A 12 4
5 2000 C 15 2
6 2001 A 40 0
7 2011 B 50 0
8 2011 C 55 11
9 2011 D 60 12
10 2011 E 5 2
I would like to expand this data like this:
year class n occurs
1 1999 A 1 0
1 1999 A 2 0
1 1999 A 3 0
...
1 1999 A 10 0
2 1999 B 0 0
2 1999 B 1 0
2 1999 B 2 0
...
2 1999 B 20 1
3 1999 C 1 1
3 1999 C 1 1
3 1999 C 1 0
3 1999 C 1 0
.. the rest of occurs is seq of zeros...because `n-occurs` = 27 zeros and seq of 3x `1`.
I want to repeat each row n times, as given by column n, with occurs expanded into a 0/1 flag. For example, if occurs is 5 and n is 10, there will be 10 rows (with the same year and class), of which 5 have occurs = 1 and 5 have occurs = 0.
EDIT: Please note that the new occurs sequence (only 0s and 1s) has n - occurs zeros and occurs ones.
Consider do.call and lapply, building each expanded block with the data.frame() constructor and constructing occurs along the way:
df_List <- lapply(seq(nrow(df)), FUN=function(d){
occ <- c(rep(1, df$occurs[[d]]), rep(0, df$n[[d]]-df$occurs[[d]]))
data.frame(year=df$year[[d]], class=df$class[[d]], n=seq(df$n[[d]]), occurs=occ)
})
finaldf <- do.call(rbind, df_List)
head(finaldf, 20)
# year class n occurs
# 1 1999 A 1 0
# 2 1999 A 2 0
# 3 1999 A 3 0
# 4 1999 A 4 0
# 5 1999 A 5 0
# 6 1999 A 6 0
# 7 1999 A 7 0
# 8 1999 A 8 0
# 9 1999 A 9 0
# 10 1999 A 10 0
# 11 1999 B 1 1
# 12 1999 B 2 0
# 13 1999 B 3 0
# 14 1999 B 4 0
# 15 1999 B 5 0
# 16 1999 B 6 0
# 17 1999 B 7 0
# 18 1999 B 8 0
# 19 1999 B 9 0
# 20 1999 B 10 0
Here is a base R method that is closely related to the linked post mentioned in my comment above. That answer provides the method for generating the first two columns of the data.frame.
dat <- data.frame(df[1:2][rep(1:nrow(df), df$n),],
n=sequence(df$n),
occurs=unlist(mapply(function(x, y) rep(0:1, c(x-y, y)), df$n, df$occurs)))
Here, the first 2 columns are generated using that answer. n is generated using sequence, and occurs uses mapply and rep, returning a vector with unlist. This puts the 1s at the end. You could use 1:0 to put the 1s at the beginning or feed the resulting vector to sample within mapply to get a random ordering of 1s and 0s.
We can check that the data.frame has the proper number of rows:
nrow(dat) == sum(df$n)
[1] TRUE
The first 15 observations:
head(dat, 15)
year class n occurs
1 1999 A 1 0
1.1 1999 A 2 0
1.2 1999 A 3 0
1.3 1999 A 4 0
1.4 1999 A 5 0
1.5 1999 A 6 0
1.6 1999 A 7 0
1.7 1999 A 8 0
1.8 1999 A 9 0
1.9 1999 A 10 0
2 1999 B 1 0
2.1 1999 B 2 0
2.2 1999 B 3 0
2.3 1999 B 4 0
2.4 1999 B 5 0
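The building blocks used for n and occurs are easier to see on a single toy row. The sketch below assumes n = 5 and occurs = 2, values picked only for illustration. Note that putting the 1s first means swapping the counts as well as using 1:0.

```r
n <- 5
occurs <- 2

# sequence() restarts the count for each element: 1 2 then 1 2 3
sequence(c(2, 3))

# zeros first, ones last (as in the answer above)
ones_last <- rep(0:1, c(n - occurs, occurs))

# ones first: reverse the values and the counts together
ones_first <- rep(1:0, c(occurs, n - occurs))

# random placement of the ones; the order changes from run to run
shuffled <- sample(rep(0:1, c(n - occurs, occurs)))

ones_last
ones_first
```

Whatever the ordering, each variant has length n and sums to occurs.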
Suppose I have the following data frame df:
id year y
1 1 1990 NA
2 1 1991 0
3 1 1992 0
4 1 1993 1
5 1 1994 NA
6 2 1990 0
7 2 1991 0
8 2 1992 0
9 2 1993 0
10 2 1994 0
11 3 1990 0
12 3 1991 0
13 3 1992 1
14 3 1993 NA
15 3 1994 NA
Code to create the df:
id<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
year<-c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1993,1994)
y<-c(NA,0,0,1,NA,0,0,0,0,0,0,0,1,NA,NA)
df<-data.frame(id,year,y)
I want to create the following vector t that measures the duration an observation has been at risk until an event occurs (y=1) or the last entry of an observation (equal to right censoring):
id year y t
1 1 1990 NA NA
2 1 1991 0 1
3 1 1992 0 2
4 1 1993 1 3
5 1 1994 NA NA
6 2 1990 0 1
7 2 1991 0 2
8 2 1992 0 3
9 2 1993 0 4
10 2 1994 0 5
11 3 1990 0 1
12 3 1991 0 2
13 3 1992 1 3
14 3 1993 NA NA
15 3 1994 NA NA
Any help is highly welcome!
Here's a possible data.table solution, which also updates your data set by reference:
library(data.table)
setDT(df)[!is.na(y), t := seq_len(.N), id][]
# id year y t
# 1: 1 1990 NA NA
# 2: 1 1991 0 1
# 3: 1 1992 0 2
# 4: 1 1993 1 3
# 5: 1 1994 NA NA
# 6: 2 1990 0 1
# 7: 2 1991 0 2
# 8: 2 1992 0 3
# 9: 2 1993 0 4
# 10: 2 1994 0 5
# 11: 3 1990 0 1
# 12: 3 1991 0 2
# 13: 3 1992 1 3
# 14: 3 1993 NA NA
# 15: 3 1994 NA NA
A base R option would be
df$t <- with(df, ave(!is.na(y), id, FUN=cumsum)*NA^is.na(y))
df
# id year y t
#1 1 1990 NA NA
#2 1 1991 0 1
#3 1 1992 0 2
#4 1 1993 1 3
#5 1 1994 NA NA
#6 2 1990 0 1
#7 2 1991 0 2
#8 2 1992 0 3
#9 2 1993 0 4
#10 2 1994 0 5
#11 3 1990 0 1
#12 3 1991 0 2
#13 3 1992 1 3
#14 3 1993 NA NA
#15 3 1994 NA NA
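The NA^is.na(y) part of the base R line is a compact masking trick: NA^TRUE is NA, while NA^FALSE is NA^0, which R evaluates to 1. A short sketch on the values of id 1 (the names mask and t_id1 are for illustration only):

```r
y <- c(NA, 0, 0, 1, NA)      # y for id 1 in the question

# 1 where y is observed, NA where y is missing
mask <- NA^is.na(y)

# running count of observed values, blanked out where y is missing
t_id1 <- cumsum(!is.na(y)) * mask

mask
t_id1
```

Multiplying by the mask leaves the counts unchanged for observed rows and turns them into NA for missing ones.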
Or using dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(t = replace(y, !is.na(y), seq_along(na.omit(y))))
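One thing to watch when numbering the non-missing values: seq() treats a single number as an upper bound rather than as a vector to index, so a group with exactly one observed y can get the wrong number of indices. The sketch below uses a made-up group with a single observed value; seq_along() always returns one index per element.

```r
one_value <- 0            # the only observed y in a hypothetical group

seq(one_value)            # counts from 1 to 0, so it returns two numbers
seq_along(one_value)      # one element in, one index out

# numbering the observed values of such a group with seq_along()
y <- c(NA, 0, NA)
t_small <- replace(y, !is.na(y), seq_along(y[!is.na(y)]))
t_small
```

With two or more observed values, seq() and seq_along() agree, which is why the difference does not show up on the example data.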
You can achieve this using the btscs() function from Dave Armstrong's package DAMisc.
library(DAMisc)
df <- btscs(df, "y", "year", "id")
That will spit out your original dataset along with a column 'spell' which is the number of time units since the last event.