BTSCS data in R: create t

Suppose I have the following data frame df:
id year y
1 1 1990 NA
2 1 1991 0
3 1 1992 0
4 1 1993 1
5 1 1994 NA
6 2 1990 0
7 2 1991 0
8 2 1992 0
9 2 1993 0
10 2 1994 0
11 3 1990 0
12 3 1991 0
13 3 1992 1
14 3 1993 NA
15 3 1994 NA
Code to create the df:
id<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
year<-c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1993,1994)
y<-c(NA,0,0,1,NA,0,0,0,0,0,0,0,1,NA,NA)
df<-data.frame(id,year,y)
I want to create the following vector t, which measures how long a unit has been at risk, counting up until an event occurs (y=1) or until the unit's last non-missing entry (which amounts to right censoring):
id year y t
1 1 1990 NA NA
2 1 1991 0 1
3 1 1992 0 2
4 1 1993 1 3
5 1 1994 NA NA
6 2 1990 0 1
7 2 1991 0 2
8 2 1992 0 3
9 2 1993 0 4
10 2 1994 0 5
11 3 1990 0 1
12 3 1991 0 2
13 3 1992 1 3
14 3 1993 NA NA
15 3 1994 NA NA
Any help is highly welcome!

Here's a possible data.table solution, which also updates your data set by reference:
library(data.table)
setDT(df)[!is.na(y), t := seq_len(.N), id][]
# id year y t
# 1: 1 1990 NA NA
# 2: 1 1991 0 1
# 3: 1 1992 0 2
# 4: 1 1993 1 3
# 5: 1 1994 NA NA
# 6: 2 1990 0 1
# 7: 2 1991 0 2
# 8: 2 1992 0 3
# 9: 2 1993 0 4
# 10: 2 1994 0 5
# 11: 3 1990 0 1
# 12: 3 1991 0 2
# 13: 3 1992 1 3
# 14: 3 1993 NA NA
# 15: 3 1994 NA NA

A base R option would be
df$t <- with(df, ave(!is.na(y), id, FUN=cumsum)*NA^is.na(y))
df
# id year y t
#1 1 1990 NA NA
#2 1 1991 0 1
#3 1 1992 0 2
#4 1 1993 1 3
#5 1 1994 NA NA
#6 2 1990 0 1
#7 2 1991 0 2
#8 2 1992 0 3
#9 2 1993 0 4
#10 2 1994 0 5
#11 3 1990 0 1
#12 3 1991 0 2
#13 3 1992 1 3
#14 3 1993 NA NA
#15 3 1994 NA NA
Or using dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(t=replace(y, !is.na(y), seq_len(sum(!is.na(y)))))
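All three answers above rely on the same idea: within each id, number the non-missing values of y and leave the missing ones as NA. Here is a minimal base R sketch of that logic on the question's toy data. The helper name risk_time and the result name t_risk are made up for illustration.

```r
# Toy data copied from the question
id <- rep(1:3, each = 5)
y  <- c(NA, 0, 0, 1, NA,  0, 0, 0, 0, 0,  0, 0, 1, NA, NA)

# risk_time() is a hypothetical helper: number the non-missing entries
# 1, 2, ... and leave the missing entries as NA
risk_time <- function(v) {
  out <- rep(NA_real_, length(v))
  ok <- !is.na(v)
  out[ok] <- seq_len(sum(ok))
  out
}

# Apply the helper separately within each id
t_risk <- ave(y, id, FUN = risk_time)
t_risk
```

Note that this keeps counting across a gap: if an id had an NA between two observed values, the counter would continue rather than restart. That matches the answers above, but it may not be what you want for every panel.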

You can achieve this using the btscs() function from Dave Armstrong's DAMisc package.
library(DAMisc)
df <- btscs(df, "y", "year", "id")
That will spit out your original dataset along with a column 'spell' which is the number of time units since the last event.

Related

Replacing Grouped Data in R

I want to create a new column based on conditions
My data frame looks like:
Id Year Code Y
1 2009 0 0
1 2010 NA NA
2 2009 0 0
2 2010 NA NA
3 2009 1 1
3 2010 NA NA
4 2009 2 1
4 2010 NA NA
I need to replace values in my original Y variable in a way that returns 1 when the first year code for each individual is equal to 0 and the second year/code line is NA. The output that I'm looking for is:
Id Year Code Y
1 2009 0 0
1 2010 NA 1
2 2009 0 0
2 2010 NA 1
3 2009 1 1
3 2010 NA 0
4 2009 2 1
4 2010 NA 0
Thanks in advance!
This fills each NA in the Y column with the opposite of the non-missing Y value for the same Id (0 becomes 1 and 1 becomes 0).
library(dplyr)
df %>%
arrange(Id, Year) %>%
group_by(Id) %>%
mutate(Y = ifelse(is.na(Y), as.integer(!as.logical(na.omit(Y))), Y))
# Id Year Code Y
# <int> <int> <int> <int>
#1 1 2009 0 0
#2 1 2010 NA 1
#3 2 2009 0 0
#4 2 2010 NA 1
#5 3 2009 1 1
#6 3 2010 NA 0
#7 4 2009 2 1
#8 4 2010 NA 0
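For comparison, here is a base R sketch of the same flip. It assumes, as in the example data, that every Id has exactly one non-missing Y value. The names df2 and flip_na are hypothetical.

```r
# Example data from the question (df2 is a made-up name)
df2 <- data.frame(
  Id   = rep(1:4, each = 2),
  Year = rep(c(2009, 2010), times = 4),
  Code = c(0, NA, 0, NA, 1, NA, 2, NA),
  Y    = c(0, NA, 0, NA, 1, NA, 1, NA)
)

# flip_na() is a hypothetical helper: fill the NAs with 1 minus the
# first non-missing value of the group
flip_na <- function(v) {
  v[is.na(v)] <- 1 - v[!is.na(v)][1]
  v
}

# Apply the helper within each Id
df2$Y <- ave(df2$Y, df2$Id, FUN = flip_na)
df2
```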

Create sequence by condition in the case when condition changes

the data looks like:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another col seq to count the sequence of Treat by Grp, maintaining the sequence of Year. I think the hard part is that when Treat turns to 0, seq should be 0 or whatever, and the sequence of Treat should be re-counted when it turns back to non-zero again. An example of the final dataframe looks like below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
With data.table's rleid, you can do:
library(dplyr)
df %>%
group_by(Grp, grp = data.table::rleid(Treat)) %>%
mutate(seq = row_number() * as.integer(Treat)) %>%
ungroup %>%
select(-grp)
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
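If you would rather avoid the extra packages, the run-length idea can be sketched in base R with rle() and sequence(). This is only an illustration on numeric copies of the question's Treat values. The helper name count_runs is made up.

```r
# Numeric copies of the question's data
grp   <- rep(c("A", "B"), each = 10)
treat <- c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
           1, 1, 1, 0, 0, 0, 1, 1, 1, 0)

# count_runs() is a hypothetical helper: position within each run of
# identical values, set to 0 wherever the value itself is 0
count_runs <- function(v) {
  sequence(rle(v)$lengths) * (v != 0)
}

# Restart the counting separately for each group
seq_col <- ave(treat, grp, FUN = count_runs)
seq_col
```

Because ave() splits by group first, a run cannot spill over from the end of group A into the start of group B.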

Create a conditional timeline based on events in R

I have data where the 'Law' variable indicates changes in legislation, in different places ('Place'):
Person Place Year Law
1 A 1990 0
2 A 1991 1
3 A 1992 1
4 B 1990 0
5 B 1991 0
6 B 1992 1
7 B 1993 1
8 B 1993 1
9 B 1993 1
10 B 1992 1
Basically the law was implemented in place A in 1991 and remained in force for all subsequent time periods. It was implemented in place B in 1992 and remained in force, & so on.
I would like to create a new variable that takes on a value of 0 for the year the law was implemented, 1 for 1 year after, 2 for 2 years after, -1 for the year before, -2 for 2 years before, and so on.
I need the final dataframe to look like:
Person Place Year Law timeline
1 A 1990 0 -1
2 A 1991 1 0
3 A 1992 1 1
4 B 1990 0 -2
5 B 1991 0 -1
6 B 1992 1 0
7 B 1993 1 1
8 B 1993 1 2
9 B 1993 1 2
10 B 1992 1 1
I have tried:
library(dplyr)
df %>%
group_by(Place) %>%
arrange(Year) %>%
mutate(timeline = rank(Law))
but it's not working like I need. What am I doing wrong? Can I do this in dplyr or do I need to create a complex for loop?
You can subtract the index of the row where the law is first implemented from the row number:
df %>%
arrange(Year) %>%
group_by(Place) %>%
mutate(timeline = row_number() - which(diff(Law) == 1) - 1) %>%
arrange(Place)
# A tibble: 7 x 5
# Groups: Place [2]
# Person Place Year Law timeline
# <int> <fct> <int> <int> <dbl>
#1 1 A 1990 0 -1.
#2 2 A 1991 1 0.
#3 3 A 1992 1 1.
#4 4 B 1990 0 -2.
#5 5 B 1991 0 -1.
#6 6 B 1992 1 0.
#7 7 B 1993 1 1.
Using data.table:
library(data.table)
setDT(dat)[,timeline:=sequence(.N)-which.min(!Law),by=Place]
dat
Person Place Year Law timeline
1: 1 A 1990 0 -1
2: 2 A 1991 1 0
3: 3 A 1992 1 1
4: 4 B 1990 0 -2
5: 5 B 1991 0 -1
6: 6 B 1992 1 0
7: 7 B 1993 1 1
Using base R:
transform(dat,timeline=ave(Law,Place,FUN=function(x)1:length(x)-which.min(!x)))
Person Place Year Law timeline
1 1 A 1990 0 -1
2 2 A 1991 1 0
3 3 A 1992 1 1
4 4 B 1990 0 -2
5 5 B 1991 0 -1
6 6 B 1992 1 0
7 7 B 1993 1 1
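The answers above count row positions, so they assume one row per place and year. A possible alternative, sketched here under the assumption that the timeline should be measured in calendar years, is to subtract the first year in which Law equals 1 from Year. The name start is made up for illustration.

```r
# The 7-row version of the data used in the answers
dat <- data.frame(
  Place = c("A", "A", "A", "B", "B", "B", "B"),
  Year  = c(1990, 1991, 1992, 1990, 1991, 1992, 1993),
  Law   = c(0, 1, 1, 0, 0, 1, 1)
)

# First year with Law == 1 in each place (start is a hypothetical name)
in_force <- dat$Law == 1
start <- tapply(dat$Year[in_force], dat$Place[in_force], min)

# Years relative to implementation: negative before, 0 in the first year
dat$timeline <- dat$Year - as.vector(start[as.character(dat$Place)])
dat
```

With this version, rows that share a place and year get the same timeline value, and the row order does not matter.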

How to calculate a frequency (count) variable in R? [duplicate]

I've begun to gradually shift to R from Excel but I'm still having some difficulties with (relatively simple) calculations.
I want to create a frequency version of my variable x, let's call it "xfrequency".
Please, see the sample of my data below.
The desired variable xfrequency should basically count the number of x's during a certain period (country-year). In the sample data the observation period is from 1990 to 1995. So, by 1994 Canada had received 4 x's in total.
Perhaps there is a relevant function for this out there? Thanks!
country year x xfrequency
CAN 1990 1 1
CAN 1991 0 0
CAN 1992 1 2
CAN 1993 0 0
CAN 1994 2 4
CAN 1995 1 5
USA 1990 0 0
USA 1991 2 2
USA 1992 1 3
USA 1993 0 0
USA 1994 1 4
USA 1995 0 0
GER 1990 NA NA
GER 1991 1 1
GER 1992 0 0
GER 1993 1 2
GER 1994 2 4
GER 1995 1 5
Example with data.table, assuming your dataset is in a variable called data:
library(data.table)
setDT(data)
data[is.na(x),x := 0] # Replace NA with 0, because any sum involving NA is NA
data[, xfreq := cumsum(x), by=country]
Which gives:
country year x xfrequency xfreq
1: CAN 1990 1 1 1
2: CAN 1991 0 0 1
3: CAN 1992 1 2 2
4: CAN 1993 0 0 2
5: CAN 1994 2 4 4
6: CAN 1995 1 5 5
7: USA 1990 0 0 0
8: USA 1991 2 2 2
9: USA 1992 1 3 3
10: USA 1993 0 0 3
11: USA 1994 1 4 4
12: USA 1995 0 0 4
13: GER 1990 0 NA 0
14: GER 1991 1 1 1
15: GER 1992 0 0 1
16: GER 1993 1 2 2
17: GER 1994 2 4 4
18: GER 1995 1 5 5
This is not exactly your expected output, but according to your description, the xfreq column seems to be what you're looking for.
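To see why the NA has to be dealt with before calling cumsum, here is a small sketch using the GER values from the question. The names x and x0 are just for illustration.

```r
# x values for GER from the question; the first one is missing
x <- c(NA, 1, 0, 1, 2, 1)

# A leading NA propagates through the whole running sum
cumsum(x)

# Replacing NA with 0 first gives a usable running total
x0 <- replace(x, is.na(x), 0)
cumsum(x0)
```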
To get your exact output, we can reset the xfreq to 0 when x is 0:
> data[x==0,xfreq := 0]
> data
country year x xfrequency xfreq
1: CAN 1990 1 1 1
2: CAN 1991 0 0 0
3: CAN 1992 1 2 2
4: CAN 1993 0 0 0
5: CAN 1994 2 4 4
Or in one pass with a test:
data[, xfreq := ifelse(x==0,0L,cumsum(x)), by=country]
A base R alternative:
mydf <- transform(mydf, xfreq = ave(x, country, FUN = function(x) cumsum(ifelse(is.na(x), 0, x))))
mydf[mydf$x==0 | is.na(mydf$x), "xfreq"] <- 0
gives:
> mydf
country year x xfrequency xfreq
1 CAN 1990 1 1 1
2 CAN 1991 0 0 0
3 CAN 1992 1 2 2
4 CAN 1993 0 0 0
5 CAN 1994 2 4 4
6 CAN 1995 1 5 5
7 USA 1990 0 0 0
8 USA 1991 2 2 2
9 USA 1992 1 3 3
10 USA 1993 0 0 0
11 USA 1994 1 4 4
12 USA 1995 0 0 0
13 GER 1990 NA NA 0
14 GER 1991 1 1 1
15 GER 1992 0 0 0
16 GER 1993 1 2 2
17 GER 1994 2 4 4
18 GER 1995 1 5 5
You can use library(dplyr).
library(dplyr)
sum_data <- data %>% group_by(country) %>% summarise(xfrequency = sum(x, na.rm=TRUE))
This groups the data by country and sums x over all the periods given for that country, so it returns one total per country rather than a running count.

How to create time since last event in unbalanced panel data in R?

I have unbalanced panel data with a binary variable indicating if the event occurred or not. I want to control for time dependency, so I want to create a variable that indicates the number of years that have passed since the last event. The data is organized by dyad-year.
Here is a reproducible example, with a vector of what I am trying to achieve. Thanks!
id year onset time_since_event
1 1 1989 0 1
2 1 1990 0 2
3 1 1991 1 0
4 1 1992 0 1
5 1 1993 0 2
6 2 1989 0 1
7 2 1990 1 0
8 2 1991 0 1
9 2 1992 1 0
10 3 1991 0 1
11 3 1992 0 2
id <- c(1,1,1,1,1,2,2,2,2,3,3)
year <- c(1989,1990,1991,1992,1993,1989,1990,1991,1992,1991,1992)
onset <- c(0,0,1,0,0,0,1,0,1,0,0)
time_since_event<-c(1,2,0,1,2,1,0,1,0,1,2) #what I want to create
df <- data.frame(cbind(id, year, onset,time_since_event))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), then create a run-length id grouping variable ('ind') based on the 'onset' column using rleid. Grouped by the 'ind' and 'id' columns, we assign the 'time_since_event' column as the row sequence where 'onset' is not equal to 1. In the next step, we replace the 'NA' elements with 0.
library(data.table)#v1.9.6+
setDT(df)[, ind:=rleid(onset)][onset!=1, time_since_event:=1:.N ,
by = .(ind, id)][is.na(time_since_event), time_since_event:= 0]
df
# id year onset ind time_since_event
# 1: 1 1989 0 1 1
# 2: 1 1990 0 1 2
# 3: 1 1991 1 2 0
# 4: 1 1992 0 3 1
# 5: 1 1993 0 3 2
# 6: 2 1989 0 3 1
# 7: 2 1990 1 4 0
# 8: 2 1991 0 5 1
# 9: 2 1992 1 6 0
#10: 3 1991 0 7 1
#11: 3 1992 0 7 2
Or it can be made compact. Grouped by rleid(onset) and 'id' column, we negate the 'onset' (so that 0 become TRUE and 1 FALSE), multiply with row sequence (1:.N) and assign (:=) it as the 'time_since_event' column.
setDT(df)[,time_since_event := 1:.N *!onset, by = .(rleid(onset), id)]
df
# id year onset time_since_event
# 1: 1 1989 0 1
# 2: 1 1990 0 2
# 3: 1 1991 1 0
# 4: 1 1992 0 1
# 5: 1 1993 0 2
# 6: 2 1989 0 1
# 7: 2 1990 1 0
# 8: 2 1991 0 1
# 9: 2 1992 1 0
#10: 3 1991 0 1
#11: 3 1992 0 2
Or we can use dplyr. We group by 'id' and another variable created (by taking the difference of adjacent elements in 'onset' (diff), create a logical index (!=0) and cumsum the index). Within the mutate, we multiply the row sequence (row_number()) with the negated 'onset' (just like before), and remove the 'ind' column using select.
library(dplyr)
df %>%
group_by(id, ind= cumsum(c(TRUE, diff(onset)!=0))) %>%
mutate(time_since_event= (!onset) *row_number()) %>%
ungroup() %>%
select(-ind)
# id year onset time_since_event
# (dbl) (dbl) (dbl) (int)
#1 1 1989 0 1
#2 1 1990 0 2
#3 1 1991 1 0
#4 1 1992 0 1
#5 1 1993 0 2
#6 2 1989 0 1
#7 2 1990 1 0
#8 2 1991 0 1
#9 2 1992 1 0
#10 3 1991 0 1
#11 3 1992 0 2
Data:
df <- data.frame(id, year, onset)
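The same run-based counting can also be sketched in base R, without data.table or dplyr. This assumes the rows are already sorted by id and year, as in the example. The variable name run is made up for illustration.

```r
# Example data from the question
id    <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
onset <- c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0)

# Run id: increases by 1 every time onset changes value
run <- cumsum(c(TRUE, diff(onset) != 0))

# Within each id and run, count rows 1, 2, ... and set event rows to 0
time_since_event <- ave(onset, id, run,
                        FUN = function(x) seq_along(x) * (x == 0))
time_since_event
```

Grouping by id as well as run matters here: rows 4 to 6 share a run of zeros, but row 6 belongs to a different id, so its counter starts again at 1.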
