The data looks like this:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another column, seq, that counts consecutive occurrences of Treat within each Grp, keeping the rows in Year order. The tricky part is that when Treat turns to 0, seq should be 0 (or some other placeholder), and the count should restart from 1 when Treat becomes non-zero again. An example of the final data frame is shown below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
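With data.table's rleid, you can do:
library(dplyr)
df %>%
  # rleid() gives every run of identical Treat values its own id (data.table must be installed)
  group_by(Grp, grp = data.table::rleid(Treat)) %>%
  # count rows within each run; multiplying by Treat zeroes out the runs where Treat is "0"
  mutate(seq = row_number() * as.integer(Treat)) %>%
  ungroup() %>%
  select(-grp)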
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
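If data.table is not available, the same run-counting idea can be sketched in base R. This is only a sketch: run_count is a helper name made up here, and it assumes the rows are already sorted by Year within each Grp.

```r
# Rebuild the example data from the question
df <- data.frame(
  Grp   = rep(c("A", "B"), each = 10),
  Year  = rep(2001:2010, times = 2),
  Treat = as.character(c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
                         1, 1, 1, 0, 0, 0, 1, 1, 1, 0))
)

# run_count() is a hypothetical helper, not part of any package
run_count <- function(treat) {
  # rle() finds runs of identical values; sequence() counts 1..n inside each run
  counts <- sequence(rle(treat)$lengths)
  # keep the count where Treat is non-zero, 0 elsewhere
  counts * (treat != 0)
}

# ave() applies run_count() within each Grp and keeps the original row order
df$seq <- ave(as.integer(df$Treat), df$Grp, FUN = run_count)
df
```

Converting Treat to integer before calling ave() matters, because ave() returns a vector of the same type as its first argument.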
Related
I would like to create a new variable called Var3 that combines the values of Year and Month from the row in which Var1 == 1. My data is grouped by ID (in long format). In cases without a 1 on Var1 in any row (e.g. ID 3) there should be NA's on Var3.
df <- read.table(text=
"ID Var1 Year Month
1 0 2008 2
1 0 2009 2
1 0 2010 2
1 0 2011 2
1 1 2013 2
1 0 2014 10
2 0 2008 2
2 0 2010 2
2 1 2011 2
2 0 2013 2
2 0 2015 11
3 0 2010 2
3 0 2011 2
3 0 2013 2
3 0 2015 11
3 0 2017 10", header=TRUE)
My expected outcome would look like this:
df <- read.table(text=
"ID Var1 Year Month Var3
1 0 2008 2 20132
1 0 2009 2 20132
1 0 2010 2 20132
1 0 2011 2 20132
1 1 2013 2 20132
1 0 2014 10 20132
2 0 2008 2 20112
2 0 2010 2 20112
2 1 2011 2 20112
2 0 2013 2 20112
2 0 2015 11 20112
3 0 2010 2 NA
3 0 2011 2 NA
3 0 2013 2 NA
3 0 2015 11 NA
3 0 2017 10 NA",header=TRUE)
I am trying to figure out how to solve this using dplyr. I am pretty new to the tidyverse, so any suggestions are more than welcome. I already figured out that I have to use group_by(ID) and probably mutate to create the new variable. Can anybody help me out?
One possible solution using dplyr is
df %>%
group_by(ID) %>%
mutate(Var3 = ifelse(Var1 == 1, paste0(Year, Month), NA)) %>%
mutate(Var3 = max(Var3, na.rm = TRUE))
The idea behind it is: first, paste Year and Month together on the rows where Var1 == 1 (all other rows get NA); then, within each group, spread the single non-NA value of Var3 to every row with a function such as max (min would work as well), dropping the NA values.
Output
# A tibble: 16 x 5
# Groups: ID [3]
ID Var1 Year Month Var3
<int> <int> <int> <int> <chr>
1 1 0 2008 2 20132
2 1 0 2009 2 20132
3 1 0 2010 2 20132
4 1 0 2011 2 20132
5 1 1 2013 2 20132
6 1 0 2014 10 20132
7 2 0 2008 2 20112
8 2 0 2010 2 20112
9 2 1 2011 2 20112
10 2 0 2013 2 20112
11 2 0 2015 11 20112
12 3 0 2010 2 NA
13 3 0 2011 2 NA
14 3 0 2013 2 NA
15 3 0 2015 11 NA
16 3 0 2017 10 NA
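Because at most one row per ID has Var1 == 1 in this data, the value can also be looked up directly with match(), which returns NA when nothing matches, so IDs without a flagged row get NA without going through max(). The base R sketch below assumes that at-most-one-flag structure; hit is a name introduced here for illustration.

```r
df <- read.table(text = "ID Var1 Year Month
1 0 2008 2
1 0 2009 2
1 0 2010 2
1 0 2011 2
1 1 2013 2
1 0 2014 10
2 0 2008 2
2 0 2010 2
2 1 2011 2
2 0 2013 2
2 0 2015 11
3 0 2010 2
3 0 2011 2
3 0 2013 2
3 0 2015 11
3 0 2017 10", header = TRUE)

# For each ID, find the row number of the first row with Var1 == 1 (NA if none)
hit <- ave(seq_len(nrow(df)), df$ID,
           FUN = function(rows) rows[match(1, df$Var1[rows])])

# Indexing with NA yields NA, so IDs without a flagged row get NA
df$Var3 <- paste0(df$Year, df$Month)[hit]
df
```

The same indexing expression, paste0(Year, Month)[match(1, Var1)], should also work inside group_by(ID) %>% mutate(Var3 = ...), since mutate recycles a single value across the group.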
I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 1
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 1
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 1
36 d 2005 1
37 d 2006 0
38 d 2007 0
39 d 2008 0
40 d 2009 0
The unit of analysis is ID. The IDs repeat across each year in the data. The Variable_of_Interest column records changes to the IDs: in some years it is 0 and in other years it is 1.
I want to create an additional column that codes each year relative to the change in Variable_of_Interest (defined as going from 0 to 1), counting the years before and after the change, while ignoring changes from 1 to 0 (as seen when the ID is equal to "d").
Any code that can help me achieve this solution would be greatly appreciated!
Preferably, I'd like the data to look like this:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 1 3
18 b 2007 1 4
19 b 2008 1 5
20 b 2009 1 6
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 1 3
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 1 2
36 d 2005 1 3
37 d 2006 0 NA
38 d 2007 0 NA
39 d 2008 0 NA
40 d 2009 0 NA
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 7),
rep(0,6), rep(1, 4),
rep(0,2), rep(1, 4), rep(0,4))
df <- data.frame(ID, Year, Variable_of_Interest)
Thank you for your help!
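We could create a function:
library(dplyr)
get_sequence <- function(x) {
  # position of the first 0 -> 1 switch
  inds <- which(x == 1 & lag(x) == 0)[1]
  # count rows relative to that switch (negative before, 0 at the switch)
  vals <- seq_along(x) - inds
  # position of the first 1 -> 0 switch, if any
  inds <- which(x == 0 & lag(x) == 1)[1]
  # everything from the reverse switch onward is set to NA
  if (!is.na(inds)) vals[inds:length(x)] <- NA
  return(vals)
}
and apply it for each ID:
df %>% group_by(ID) %>% mutate(Solution = get_sequence(Variable_of_Interest))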
# ID Year Variable_of_Interest Solution
#1 a 2000 0 -5
#2 a 2001 0 -4
#3 a 2002 0 -3
#4 a 2003 0 -2
#5 a 2004 0 -1
#6 a 2005 1 0
#7 a 2006 1 1
#8 a 2007 1 2
#9 a 2008 1 3
#10 a 2009 1 4
#11 b 2000 0 -3
#...
#...
#33 d 2002 1 0
#34 d 2003 1 1
#35 d 2004 1 2
#36 d 2005 1 3
#37 d 2006 0 NA
#38 d 2007 0 NA
#39 d 2008 0 NA
#40 d 2009 0 NA
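To check the logic on a single ID without loading dplyr, the function can be re-sketched in base R. The names get_sequence_base, prev, up and down are introduced here for illustration; prev stands in for dplyr's lag(x). The second call shows an edge case worth knowing about: an ID that never switches from 0 to 1 comes back as all NA.

```r
# Base R re-implementation of get_sequence(); prev plays the role of dplyr::lag(x)
get_sequence_base <- function(x) {
  prev <- c(NA, head(x, -1))
  up <- which(x == 1 & prev == 0)[1]      # first 0 -> 1 switch (NA if none)
  vals <- seq_along(x) - up               # rows relative to that switch
  down <- which(x == 0 & prev == 1)[1]    # first 1 -> 0 switch (NA if none)
  if (!is.na(down)) vals[down:length(x)] <- NA
  vals
}

d <- c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)      # ID "d" from the question
get_sequence_base(d)
# -2 -1  0  1  2  3 NA NA NA NA

never_switches <- rep(0, 10)
get_sequence_base(never_switches)
# all NA, because there is no 0 -> 1 switch to count from
```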
I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 0
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 0
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 0
36 d 2005 1
37 d 2006 1
38 d 2007 1
39 d 2008 1
40 d 2009 1
The unit of analysis is in the ID column. The IDs repeat across each year in the data.
The variable of interest column represents changes to the IDs, wherein some years the values are a 0 and other years they are a 1.
I want to create an additional column containing sequences of numbers that count the years before and after each change (defined as going from 0 to 1) in Variable_of_Interest.
The code must account for repeated changes from 0 to 1, such as ID b in 2002-2003 and 2006-2007.
NA can be assigned to 0 values that are never followed by a change back to 1, such as the 0 for "c" in 2009.
The data should then look like this:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 0 -1
18 b 2007 1 0
19 b 2008 1 1
20 b 2009 1 2
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 0 NA
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 0 -1
36 d 2005 1 0
37 d 2006 1 1
38 d 2007 1 2
39 d 2008 1 3
40 d 2009 1 4
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
df <- data.frame(ID, Year, Variable_of_Interest)
Thank you so much for your help!
Another option using data.table:
library(data.table)
#identify runs
setDT(DF)[, ri := rleid(voi)]
#generate the desired output depending on whether VOI is 1 or 0
DF[, soln := if (voi[1L]==1L) seq.int(.N) - 1L else -rev(seq.int(.N)), .(ID, ri)]
#replace trailing 0 with NA
DF[, soln := if(voi[.N]==0L) replace(soln, ri==ri[.N], NA_integer_) else soln, ID]
data:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
DF <- data.frame(ID, Year, voi=Variable_of_Interest)
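The three data.table steps above (identify runs, number each run, blank out a trailing run of 0) can be traced on a single ID with a small base R sketch. event_time is a hypothetical helper written for this illustration, not part of data.table.

```r
# event_time() is a hypothetical helper that mirrors the three data.table steps
event_time <- function(voi) {
  runs <- rle(voi)                      # step 1: identify runs
  per_run <- Map(function(len, val) {
    # step 2: runs of 1 count 0, 1, 2, ...; runs of 0 count ..., -2, -1
    if (val == 1) seq_len(len) - 1L else -rev(seq_len(len))
  }, runs$lengths, runs$values)
  last <- length(per_run)
  # step 3: a trailing run of 0 has no later switch to 1, so it becomes NA
  if (runs$values[last] == 0) per_run[[last]] <- rep(NA_integer_, runs$lengths[last])
  unlist(per_run)
}

id_b <- c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1)   # ID "b" from the question
event_time(id_b)
# -3 -2 -1  0  1  2 -1  0  1  2

id_c <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)   # ID "c" from the question
event_time(id_c)
# -6 -5 -4 -3 -2 -1  0  1  2 NA
```

Applied per ID, something like ave(DF$voi, DF$ID, FUN = event_time) should give the same column as soln.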
Here is a dplyr option:
library(dplyr)
df %>%
  # idx increases by one every time Variable_of_Interest changes value
  group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
  # runs of 0 count up to -1; runs of 1 count up from 0
  mutate(Solution = case_when(
    Variable_of_Interest == 0 ~ rev(-1:(-n())),
    Variable_of_Interest == 1 ~ 0:(n() - 1)
  )) %>%
  group_by(ID) %>%
  # a final run of 0 is never followed by a switch to 1, so set it to NA
  mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
  select(-idx)
Output (printed before the final select(-idx), so idx is still shown):
ID Year Variable_of_Interest idx Solution
1 a 2000 0 0 -5
2 a 2001 0 0 -4
3 a 2002 0 0 -3
4 a 2003 0 0 -2
5 a 2004 0 0 -1
6 a 2005 1 1 0
7 a 2006 1 1 1
8 a 2007 1 1 2
9 a 2008 1 1 3
10 a 2009 1 1 4
11 b 2000 0 2 -3
12 b 2001 0 2 -2
13 b 2002 0 2 -1
14 b 2003 1 3 0
15 b 2004 1 3 1
16 b 2005 1 3 2
17 b 2006 0 4 -1
18 b 2007 1 5 0
19 b 2008 1 5 1
20 b 2009 1 5 2
21 c 2000 0 6 -6
22 c 2001 0 6 -5
23 c 2002 0 6 -4
24 c 2003 0 6 -3
25 c 2004 0 6 -2
26 c 2005 0 6 -1
27 c 2006 1 7 0
28 c 2007 1 7 1
29 c 2008 1 7 2
30 c 2009 0 8 NA
31 d 2000 0 8 -2
32 d 2001 0 8 -1
33 d 2002 1 9 0
34 d 2003 1 9 1
35 d 2004 0 10 -1
36 d 2005 1 11 0
37 d 2006 1 11 1
38 d 2007 1 11 2
39 d 2008 1 11 3
40 d 2009 1 11 4
df_test <- data.frame(MONTH_NUM = c(7,7,8,8,8,10,11,12,1,2,3,4,4,5,5,5,5,NA)
, YEAR = c(2018,2018,2018,2018,2019,2019,2019,2019,2019,2018,2018,2019,2018,2018,2018,2018,2018,NA)
, Sys_Indicator = c(1,0,0,1,0,0,0,0,1,1,0,1,0,1,1,1,1,1)
, lbl_Indicator = c(1,1,1,1,0,1,0,0,1,1,0,1,1,1,1,1,1,0)
, Pk_Indicator=c(1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,1,1))
I want to find the cumulative sum of each indicator for each month+year combination. I'm currently using dplyr to achieve this, but I was wondering whether there is an easier way that covers every variable with Indicator in its name, so that each of them gets a cumulative sum.
df_test %>%
group_by(YEAR,MONTH_NUM) %>%
summarize(Sys_sum=sum(Sys_Indicator),lbl_Sum=sum(lbl_Indicator),Pk_Sum=sum(Pk_Indicator)) %>%
arrange(MONTH_NUM,YEAR) %>%
ungroup() %>%
mutate(Sys_cum=cumsum(Sys_sum),Cum_lbl=cumsum(lbl_Sum),Pk_sum=cumsum(Pk_Sum))
You could use the _at variants in dplyr to apply this to multiple columns:
library(dplyr)
df_test %>%
arrange(MONTH_NUM,YEAR) %>%
group_by(YEAR,MONTH_NUM) %>%
summarize_at(vars(ends_with('Indicator')), sum) %>%
ungroup() %>%
mutate_at(vars(ends_with('Indicator')), list(cs = ~cumsum(.)))
# YEAR MONTH_NUM Sys_Indicator lbl_Indicator Pk_Indicator Sys_Indicator_cs lbl_Indicator_cs Pk_Indicator_cs
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2018 2 1 1 1 1 1 1
# 2 2018 3 0 0 0 1 1 1
# 3 2018 4 0 1 0 1 2 1
# 4 2018 5 4 4 1 5 6 2
# 5 2018 7 1 2 1 6 8 3
# 6 2018 8 1 2 2 7 10 5
# 7 2019 1 1 1 1 8 11 6
# 8 2019 4 1 1 1 9 12 7
# 9 2019 8 0 0 0 9 12 7
#10 2019 10 0 1 1 9 13 8
#11 2019 11 0 0 0 9 13 8
#12 2019 12 0 0 0 9 13 8
#13 NA NA 1 0 1 10 13 9
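In dplyr 1.0.0 and later, the _at variants are superseded by across(), so the same pipeline can be sketched as below. The _cs suffix is kept from the answer above; res is a name introduced here for the result.

```r
library(dplyr)

df_test <- data.frame(
  MONTH_NUM = c(7, 7, 8, 8, 8, 10, 11, 12, 1, 2, 3, 4, 4, 5, 5, 5, 5, NA),
  YEAR = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019,
           2018, 2018, 2019, 2018, 2018, 2018, 2018, 2018, NA),
  Sys_Indicator = c(1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1),
  lbl_Indicator = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0),
  Pk_Indicator  = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1)
)

res <- df_test %>%
  group_by(YEAR, MONTH_NUM) %>%
  # sum every column whose name ends in "Indicator" within each year/month
  summarise(across(ends_with("Indicator"), sum), .groups = "drop") %>%
  # running total over the summarised rows, stored in new *_cs columns
  mutate(across(ends_with("Indicator"), cumsum, .names = "{.col}_cs"))
res
```

Using .groups = "drop" replaces the separate ungroup() call, and .names controls how the new columns are labelled.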
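I think I understand what you want. Here is a data.table approach. It takes the running total row by row, without first aggregating by month and year.
library(data.table)
# names of all columns containing "Indicator"
cols <- grep("Indicator", names(df_test), value = TRUE)
# add one <column>_cumsum running total per indicator column, by reference
setDT(df_test)[, paste0(cols, "_cumsum") := lapply(.SD, cumsum), .SDcols = cols]
df_test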
MONTH_NUM YEAR Sys_Indicator lbl_Indicator Pk_Indicator Sys_Indicator_cumsum lbl_Indicator_cumsum Pk_Indicator_cumsum
1: 7 2018 1 1 1 1 1 1
2: 7 2018 0 1 0 1 2 1
3: 8 2018 0 1 1 1 3 2
4: 8 2018 1 1 1 2 4 3
5: 8 2019 0 0 0 2 4 3
6: 10 2019 0 1 1 2 5 4
7: 11 2019 0 0 0 2 5 4
8: 12 2019 0 0 0 2 5 4
9: 1 2019 1 1 1 3 6 5
10: 2 2018 1 1 1 4 7 6
11: 3 2018 0 0 0 4 7 6
12: 4 2019 1 1 1 5 8 7
13: 4 2018 0 1 0 5 9 7
14: 5 2018 1 1 0 6 10 7
15: 5 2018 1 1 0 7 11 7
16: 5 2018 1 1 0 8 12 7
17: 5 2018 1 1 1 9 13 8
18: NA NA 1 0 1 10 13 9
I have a dataframe that looks like this one
state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0
I want to group it by the state identifier and, within each state, compute for every row the number of treated observations (treat = 1) whose date lies between that row's start and end.
In other words I want to get the following
state start end date treat result
1 1999 2000 2001 1 0
1 1998 2000 2001 1 0
1 2000 2003 NA 0 2
2 2001 2002 NA 0 0
2 2002 2004 2003 1 1
2 2003 2004 2005 1 0
3 2002 2004 2006 1 0
3 2003 2004 NA 0 0
3 2005 2007 NA 0 1
For instance, result in the first row is equal to 0 because within state = 1 there is no date between 1999 and 2000. On the other hand, result in the last row is equal to 1 because within state 3 there is one treated unit whose date lies between 2005 and 2007 (namely date = 2006 in the 7th row).
Thank you very much for your help.
Assuming the data frame is called x, you can split it by state, combine two outer() comparisons with & to test whether each date lies strictly between start and end, and then sum treat over the matching dates.
x$result <- unlist(lapply(split(x, x$state), function(y) {
tt <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
tt[is.na(tt)] <- TRUE
apply(tt, 1, function(z) sum(y$treat[z]))
}))
x
# state start end date treat result
#1 1 1999 2000 2001 1 0
#2 1 1998 2000 2001 1 0
#3 1 2000 2003 NA 0 2
#4 2 2001 2002 NA 0 0
#5 2 2002 2004 2003 1 1
#6 2 2003 2004 2005 1 0
#7 3 2002 2004 2006 1 0
#8 3 2003 2004 NA 0 0
#9 3 2005 2007 NA 0 1
Alternatively, take the part describing treat per state and date, merge it with the part describing state, start and end, and sum the matching treat values.
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
tt <- merge(x[,c("state", "start", "end")], tt)
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
aggregate(treat ~ start + end + state, tt, sum)
# start end state treat
#1 1998 2000 1 0
#2 1999 2000 1 0
#3 2000 2003 1 2
#4 2001 2002 2 0
#5 2002 2004 2 1
#6 2003 2004 2 0
#7 2002 2004 3 0
#8 2003 2004 3 0
#9 2005 2007 3 1
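The outer() approach above fills NA comparisons with TRUE, which is harmless here only because every row with a missing date has treat equal to 0. A slightly more defensive sketch treats a missing date as never falling inside a window. It assumes the data frame is called x and is sorted by state; count_inside is a name made up for this example.

```r
x <- read.table(text = "state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0", header = TRUE)

# count_inside() is a hypothetical helper working on the rows of one state
count_inside <- function(y) {
  # inside[i, j] is TRUE when date j lies strictly between start i and end i
  inside <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
  # assumption: a missing date never counts as inside a window
  inside[is.na(inside)] <- FALSE
  apply(inside, 1, function(z) sum(y$treat[z]))
}

# split() orders the pieces by state, so x must be sorted by state for this assignment
x$result <- unlist(lapply(split(x, x$state), count_inside), use.names = FALSE)
x$result
# 0 0 2 0 1 0 0 0 1
```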
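This computes a single count per state (treated rows whose date falls between the state's earliest start and latest end) and repeats it on every row, so it does not reproduce the row-by-row result shown in the question: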
library(tidyverse)
df %>% group_by(state) %>%
mutate(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 9 x 6
#> # Groups: state [3]
#> state start end date treat result
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1999 2000 2001 1 2
#> 2 1 1998 2000 2001 1 2
#> 3 1 2000 2003 NA 0 2
#> 4 2 2001 2002 NA 0 1
#> 5 2 2002 2004 2003 1 1
#> 6 2 2003 2004 2005 1 1
#> 7 3 2002 2004 2006 1 1
#> 8 3 2003 2004 NA 0 1
#> 9 3 2005 2007 NA 0 1
If you just want one number per group, summarize might be a better option:
df %>% group_by(state) %>%
summarize(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 3 x 2
#> state result
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 3 1