Create a conditional timeline based on events in R

I have data where the 'Law' variable indicates changes in legislation, in different places ('Place'):
Person Place Year Law
1 A 1990 0
2 A 1991 1
3 A 1992 1
4 B 1990 0
5 B 1991 0
6 B 1992 1
7 B 1993 1
8 B 1993 1
9 B 1993 1
10 B 1992 1
Basically the law was implemented in place A in 1991 and remained in force for all subsequent time periods. It was implemented in place B in 1992 and remained in force, & so on.
I would like to create a new variable that takes on a value of 0 for the year the law was implemented, 1 for 1 year after, 2 for 2 years after, -1 for the year before, -2 for 2 years before, and so on.
I need the final dataframe to look like:
Person Place Year Law timeline
1 A 1990 0 -1
2 A 1991 1 0
3 A 1992 1 1
4 B 1990 0 -2
5 B 1991 0 -1
6 B 1992 1 0
7 B 1993 1 1
8 B 1993 1 2
9 B 1993 1 2
10 B 1992 1 1
I have tried:
library(dplyr)
df %>%
group_by(Place) %>%
arrange(Year) %>%
mutate(timeline = rank(Law))
but it's not working like I need. What am I doing wrong? Can I do this in dplyr or do I need to create a complex for loop?

You can subtract from row_number() the index at which the Law is implemented:
df %>%
arrange(Year) %>%
group_by(Place) %>%
mutate(timeline = row_number() - which(diff(Law) == 1) - 1) %>%
arrange(Place)
# A tibble: 7 x 5
# Groups: Place [2]
# Person Place Year Law timeline
# <int> <fct> <int> <int> <dbl>
#1 1 A 1990 0 -1.
#2 2 A 1991 1 0.
#3 3 A 1992 1 1.
#4 4 B 1990 0 -2.
#5 5 B 1991 0 -1.
#6 6 B 1992 1 0.
#7 7 B 1993 1 1.
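The answer above counts row positions within each place, which matches the year distance only when every place has exactly one row per consecutive year. As a hedged alternative, here is a minimal base R sketch that computes the timeline directly from Year. It assumes the column names from the question, that every Place adopts the law at some point, and it uses a helper name (`first_year`) that is not in the original:

```r
# Toy data mirroring the first seven rows of the question
df <- data.frame(
  Person = 1:7,
  Place  = c("A", "A", "A", "B", "B", "B", "B"),
  Year   = c(1990, 1991, 1992, 1990, 1991, 1992, 1993),
  Law    = c(0, 1, 1, 0, 0, 1, 1)
)

# First year in which the law is in force, per place
# (assumption: every place has at least one row with Law == 1)
in_force <- df$Law == 1
first_year <- tapply(df$Year[in_force], df$Place[in_force], min)

# Event time = calendar distance from the adoption year
df$timeline <- df$Year - as.numeric(first_year[df$Place])
df
```

Because the result depends only on Year, repeated or missing years within a place do not shift the timeline.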

Using data.table:
library(data.table)
setDT(dat)[,timeline:=sequence(.N)-which.min(!Law),by=Place]
dat
Person Place Year Law timeline
1: 1 A 1990 0 -1
2: 2 A 1991 1 0
3: 3 A 1992 1 1
4: 4 B 1990 0 -2
5: 5 B 1991 0 -1
6: 6 B 1992 1 0
7: 7 B 1993 1 1
Using base R:
transform(dat,timeline=ave(Law,Place,FUN=function(x)1:length(x)-which.min(!x)))
Person Place Year Law timeline
1 1 A 1990 0 -1
2 2 A 1991 1 0
3 3 A 1992 1 1
4 4 B 1990 0 -2
5 5 B 1991 0 -1
6 6 B 1992 1 0
7 7 B 1993 1 1

Related

How to flag first change in a variable value between years, per group?

Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of observation within the same id-year just indicates different group members.
Sample data:
library(tidyverse)
sample <- tibble(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))
What I need is to flag the first change to code within a group, between years. Second changes do not matter. Missing codes ("") can be treated as NA but in any case should not affect flag. The following is the above tibble with a flag field as it should be:
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
I still have a looping mindset and I am trying to use vectorized dplyr to do what I need.
Any input would be greatly appreciated!
EDIT: thanks for pointing out the importance of year. The ids are arranged by year, as the ordering is important here, and all types per id per year need to have the same flag. So, in the edited row 15 the code is "", which would not warrant a change by itself, but since row 16 has a new code in the same year, both observations need to have their flag set to 1.
We can use data.table
library(data.table)
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
# id year type code flag
# 1: 1 2010 1 abc 0
# 2: 1 2010 2 abc 0
# 3: 1 2011 1 0
# 4: 1 2011 2 0
# 5: 1 2012 1 xyz 1
# 6: 1 2012 2 xyz 1
# 7: 2 2010 1 0
# 8: 2 2010 2 0
# 9: 2 2011 1 lmn 0
#10: 2 2011 2 0
#11: 2 2012 1 efg 1
#12: 2 2012 2 efg 1
#13: 3 2010 1 def 0
#14: 3 2010 2 def 0
#15: 3 2011 1 klm 1
#16: 3 2011 2 klm 1
#17: 3 2012 1 nop 1
#18: 3 2012 2 nop 1
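To see why `cummax(rl*(rl < 2))` flags only the first change, it can help to trace the expression on a single group. The sketch below is an illustration rather than part of the original answer: it uses a made-up vector of non-empty codes and emulates `rleid()` with base `rle()`.

```r
# Non-empty codes of one hypothetical group, already in year order
code <- c("abc", "abc", "xyz", "xyz", "abc")

# Base R stand-in for data.table::rleid(): a run counter starting at 1
runs <- rle(code)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

rl <- run_id - 1        # 0 in the first run, 1 in the second, 2 in the third
step <- rl * (rl < 2)   # zeroes out later runs, leaving 0 0 1 1 0
flag <- cummax(step)    # carries the first 1 forward, giving 0 0 1 1 1
flag
```

The `(rl < 2)` factor zeroes out every run after the second, and `cummax()` then keeps the flag at 1 once the first change has occurred.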
Update
If we need to include the 'year' as well,
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code, year)-1
cummax(rl*(rl < 2)) }, id]
A possible solution using dplyr (not sure it's the cleanest way, though):
sample %>%
group_by(id) %>%
#find first year per group where code exists
mutate(first_year = min(year[code != ""])) %>%
#gather all codes from first year (does not assume code is constant within year)
mutate(first_codes = list(code[year==first_year])) %>%
#if year is not first year & code not in first year codes & code not blank
mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>%
#drop created columns
select(-first_year, -first_codes) %>%
ungroup()
Output:
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 klm 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
A short solution with the data.table-package:
library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]
Or:
setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]
which gives the desired result:
> samp
id year type code flag
1: 1 2010 1 abc 0
2: 1 2010 2 abc 0
3: 1 2011 1 0
4: 1 2011 2 0
5: 1 2012 1 xyz 1
6: 1 2012 2 xyz 1
7: 2 2010 1 0
8: 2 2010 2 0
9: 2 2011 1 lmn 0
10: 2 2011 2 0
11: 2 2012 1 efg 1
12: 2 2012 2 efg 1
13: 3 2010 1 def 0
14: 3 2010 2 def 0
15: 3 2011 1 klm 1
16: 3 2011 2 klm 1
17: 3 2012 1 nop 1
18: 3 2012 2 nop 1
Or when the year is relevant as well:
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]
A possible base R alternative:
f <- function(x) {
x <- rle(x)$lengths
1 * (rep(seq_along(x), times=x) - 1 > 0)
}
samp$flag <- 0
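# note the comma after the row condition; ave() returns character here, so convert back to numeric
samp$flag[samp$code!=''] <- as.numeric(with(samp[samp$code!='', ], ave(as.character(code), id, FUN = f)))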
NOTE: it is better not to give your object the same name as functions.
Used data:
samp <- data.frame(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))

Dealing with ties using rank (R)

I'm trying to create dummy variable for whether a child is first born, and one for if the child is second born. My data looks something like this
ID MID CMOB CYRB
1 1 1 1991
2 1 7 1989
3 2 1 1985
4 2 11 1985
5 2 9 1994
6 3 4 1992
7 4 2 1992
8 4 10 1983
With ID = child ID, MID = mother ID, CMOB = month of birth and CYRB = year of birth.
For the first born dummy I tried using this:
Identifiers_age <- Identifiers_age %>% group_by(MPUBID) %>%
mutate(first = as.numeric(rank(CYRB) == 1))
But there doesn't seem to be a way of breaking ties by the rank of another column (in this case the desired column being CMOB). Whenever I try using the "ties.method" argument, it tells me the input must be a character vector.
Am I missing something here?
order might be more convenient to use here, from ?order:
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
# order() gives the sorting permutation; applying it twice turns that into ranks
mutate(first = as.numeric(order(order(CYRB, CMOB)) == 1))
Identifiers_age
#Source: local data frame [8 x 5]
#Groups: MID [4]
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <dbl>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
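One thing worth keeping in mind with this approach: `order()` returns the permutation that sorts the rows, not each row's rank. The two coincide for groups of one or two rows, or for groups that are already sorted, but can differ otherwise. The base R sketch below uses made-up values for one mother with three children (the vector names are illustrative) and applies `order()` twice to obtain tie-broken ranks:

```r
# Hypothetical birth years and months for three children of one mother
year  <- c(1994, 1985, 1985)
month <- c(9, 1, 11)

ord <- order(year, month)  # rows from oldest to youngest: 2 3 1
rnk <- order(ord)          # birth rank of each row: 3 1 2

ord == 1  # FALSE FALSE TRUE: marks where row 1 lands, not the first born
rnk == 1  # FALSE TRUE FALSE: marks the first-born child (row 2)
```

Here the tie in `year` between rows 2 and 3 is broken by `month`, and only `rnk` identifies row 2 as the first born.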
If we still want to use rank, we can convert 'CYRB' and 'CMOB' into a 'Date', apply rank on it, and then get the binary output based on the logical vector
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(as.Date(paste(CYRB, CMOB, 1,
sep="-"), "%Y-%m-%d"))==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
Or we can use arithmetic to do this with rank
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(CYRB + CMOB/12)==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1

How to calculate a frequency (count) variable in R? [duplicate]

This question already has answers here:
Calculate cumulative sum (cumsum) by group
I've begun to gradually shift to R from Excel but I'm still having some difficulties with (relatively simple) calculations.
I want to create a frequency version of my variable x, let's call it "xfrequency".
Please, see the sample of my data below.
The desired variable xfrequency should basically count the number of x's during a certain period (country-year). In the sample data the observation period is from 1990 to 1995. So, in 1994 Canada had received 4 x's in total.
Perhaps there is a relevant function for this out there? Thanks!
country year x xfrequency
CAN 1990 1 1
CAN 1991 0 0
CAN 1992 1 2
CAN 1993 0 0
CAN 1994 2 4
CAN 1995 1 5
USA 1990 0 0
USA 1991 2 2
USA 1992 1 3
USA 1993 0 0
USA 1994 1 4
USA 1995 0 0
GER 1990 NA NA
GER 1991 1 1
GER 1992 0 0
GER 1993 1 2
GER 1994 2 4
GER 1995 1 5
Example with data.table, assuming your dataset is in a variable called data:
library(data.table)
setDT(data)
data[is.na(x),x := 0] # Remove the NA as a sum of anything with NA is NA
data[, xfreq := cumsum(x), by=country]
Which gives:
country year x xfrequency xfreq
1: CAN 1990 1 1 1
2: CAN 1991 0 0 1
3: CAN 1992 1 2 2
4: CAN 1993 0 0 2
5: CAN 1994 2 4 4
6: CAN 1995 1 5 5
7: USA 1990 0 0 0
8: USA 1991 2 2 2
9: USA 1992 1 3 3
10: USA 1993 0 0 3
11: USA 1994 1 4 4
12: USA 1995 0 0 4
13: GER 1990 0 NA 0
14: GER 1991 1 1 1
15: GER 1992 0 0 1
16: GER 1993 1 2 2
17: GER 1994 2 4 4
18: GER 1995 1 5 5
This is not exactly your expected output, but according to the description you give, the xfreq column seems to be what you're looking for.
To get your exact output, we can reset the xfreq to 0 when x is 0:
> data[x==0,xfreq := 0]
> data
country year x xfrequency xfreq
1: CAN 1990 1 1 1
2: CAN 1991 0 0 0
3: CAN 1992 1 2 2
4: CAN 1993 0 0 0
5: CAN 1994 2 4 4
Or in one pass with a test:
data[, xfreq := ifelse(x==0,0L,cumsum(x)), by=country]
A base R alternative:
# treat NA as 0 so that the running sum of x is not interrupted
mydf <- transform(mydf, xfreq = ave(x, country, FUN = function(x) cumsum(replace(x, is.na(x), 0))))
mydf[mydf$x==0 | is.na(mydf$x), "xfreq"] <- 0
gives:
> mydf
country year x xfrequency xfreq
1 CAN 1990 1 1 1
2 CAN 1991 0 0 0
3 CAN 1992 1 2 2
4 CAN 1993 0 0 0
5 CAN 1994 2 4 4
6 CAN 1995 1 5 5
7 USA 1990 0 0 0
8 USA 1991 2 2 2
9 USA 1992 1 3 3
10 USA 1993 0 0 0
11 USA 1994 1 4 4
12 USA 1995 0 0 0
13 GER 1990 NA NA 0
14 GER 1991 1 1 1
15 GER 1992 0 0 0
16 GER 1993 1 2 2
17 GER 1994 2 4 4
18 GER 1995 1 5 5
You can use library(dplyr).
library(dplyr)
sum_data <- data %>% group_by(country) %>% summarise(xfrequency = sum(x, na.rm = TRUE))
This groups your data by country and sums x over all the periods given for that country, so it returns one total per country rather than a running count.

How to create time since last event in unbalanced panel data in R?

I have unbalanced panel data with a binary variable indicating if the event occurred or not. I want to control for time dependency, so I want to create a variable that indicates the number of years that have passed since the last event. The data is organized by dyad-year.
Here is a reproducible example, with a vector of what I am trying to achieve. Thanks!
id year onset time_since_event
1 1 1989 0 1
2 1 1990 0 2
3 1 1991 1 0
4 1 1992 0 1
5 1 1993 0 2
6 2 1989 0 1
7 2 1990 1 0
8 2 1991 0 1
9 2 1992 1 0
10 3 1991 0 1
11 3 1992 0 2
id <- c(1,1,1,1,1,2,2,2,2,3,3)
year <- c(1989,1990,1991,1992,1993,1989,1990,1991,1992,1991,1992)
onset <- c(0,0,1,0,0,0,1,0,1,0,0)
time_since_event<-c(1,2,0,1,2,1,0,1,0,1,2) #what I want to create
df <- data.frame(cbind(id, year, onset,time_since_event))
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), then create a run-length id grouping variable ('ind') based on the 'onset' column using rleid. Grouped by the 'ind' and 'id' columns, we assign the 'time_since_event' column as the row sequence where 'onset' is not equal to 1. In the next step, we replace the 'NA' elements with 0.
library(data.table)#v1.9.6+
setDT(df)[, ind:=rleid(onset)][onset!=1, time_since_event:=1:.N ,
by = .(ind, id)][is.na(time_since_event), time_since_event:= 0]
df
# id year onset ind time_since_event
# 1: 1 1989 0 1 1
# 2: 1 1990 0 1 2
# 3: 1 1991 1 2 0
# 4: 1 1992 0 3 1
# 5: 1 1993 0 3 2
# 6: 2 1989 0 3 1
# 7: 2 1990 1 4 0
# 8: 2 1991 0 5 1
# 9: 2 1992 1 6 0
#10: 3 1991 0 7 1
#11: 3 1992 0 7 2
Or it can be made compact. Grouped by rleid(onset) and 'id' column, we negate the 'onset' (so that 0 become TRUE and 1 FALSE), multiply with row sequence (1:.N) and assign (:=) it as the 'time_since_event' column.
setDT(df)[,time_since_event := 1:.N *!onset, by = .(rleid(onset), id)]
df
# id year onset time_since_event
# 1: 1 1989 0 1
# 2: 1 1990 0 2
# 3: 1 1991 1 0
# 4: 1 1992 0 1
# 5: 1 1993 0 2
# 6: 2 1989 0 1
# 7: 2 1990 1 0
# 8: 2 1991 0 1
# 9: 2 1992 1 0
#10: 3 1991 0 1
#11: 3 1992 0 2
Or we can use dplyr. We group by 'id' and another variable created (by taking the difference of adjacent elements in 'onset' (diff), create a logical index (!=0) and cumsum the index). Within the mutate, we multiply the row sequence (row_number()) with the negated 'onset' (just like before), and remove the 'ind' column using select.
library(dplyr)
df %>%
group_by(id, ind= cumsum(c(TRUE, diff(onset)!=0))) %>%
mutate(time_since_event= (!onset) *row_number()) %>%
ungroup() %>%
select(-ind)
# id year onset time_since_event
# (dbl) (dbl) (dbl) (int)
#1 1 1989 0 1
#2 1 1990 0 2
#3 1 1991 1 0
#4 1 1992 0 1
#5 1 1993 0 2
#6 2 1989 0 1
#7 2 1990 1 0
#8 2 1991 0 1
#9 2 1992 1 0
#10 3 1991 0 1
#11 3 1992 0 2
data
df <- data.frame(id, year, onset)
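For readers without data.table or dplyr, the same run-based idea can be sketched in base R. This is an illustration under the assumption that rows are already sorted by id and year. The names `new_spell` and `spell` are not from the original answers.

```r
id    <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
onset <- c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0)

# A new spell starts whenever onset changes or a new id begins
new_spell <- c(TRUE, diff(onset) != 0 | diff(id) != 0)
spell <- cumsum(new_spell)

# Position within each spell, forced to 0 in event years
time_since_event <- ave(onset, spell, FUN = seq_along) * (onset == 0)
time_since_event
```

Including `diff(id) != 0` in the spell definition keeps the counter from running on when one id ends and the next begins with the same onset value, as happens between ids 1 and 2 here.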

BTSCS data in R: create t

Suppose I have the following data frame df:
id year y
1 1 1990 NA
2 1 1991 0
3 1 1992 0
4 1 1993 1
5 1 1994 NA
6 2 1990 0
7 2 1991 0
8 2 1992 0
9 2 1993 0
10 2 1994 0
11 3 1990 0
12 3 1991 0
13 3 1992 1
14 3 1993 NA
15 3 1994 NA
Code to create the df:
id<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
year<-c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1993,1994)
y<-c(NA,0,0,1,NA,0,0,0,0,0,0,0,1,NA,NA)
df<-data.frame(id,year,y)
I want to create the following vector t that measures the duration an observation has been at risk until an event occurs (y=1) or the last entry of an observation (equal to right censoring):
id year y t
1 1 1990 NA NA
2 1 1991 0 1
3 1 1992 0 2
4 1 1993 1 3
5 1 1994 NA NA
6 2 1990 0 1
7 2 1991 0 2
8 2 1992 0 3
9 2 1993 0 4
10 2 1994 0 5
11 3 1990 0 1
12 3 1991 0 2
13 3 1992 1 3
14 3 1993 NA NA
15 3 1994 NA NA
Any help is highly welcome!
Here's a possible data.table solution which will also update your data set by reference
library(data.table)
setDT(df)[!is.na(y), t := seq_len(.N), id][]
# id year y t
# 1: 1 1990 NA NA
# 2: 1 1991 0 1
# 3: 1 1992 0 2
# 4: 1 1993 1 3
# 5: 1 1994 NA NA
# 6: 2 1990 0 1
# 7: 2 1991 0 2
# 8: 2 1992 0 3
# 9: 2 1993 0 4
# 10: 2 1994 0 5
# 11: 3 1990 0 1
# 12: 3 1991 0 2
# 13: 3 1992 1 3
# 14: 3 1993 NA NA
# 15: 3 1994 NA NA
A base R option would be
df$t <- with(df, ave(!is.na(y), id, FUN=cumsum)*NA^is.na(y))
df
# id year y t
#1 1 1990 NA NA
#2 1 1991 0 1
#3 1 1992 0 2
#4 1 1993 1 3
#5 1 1994 NA NA
#6 2 1990 0 1
#7 2 1991 0 2
#8 2 1992 0 3
#9 2 1993 0 4
#10 2 1994 0 5
#11 3 1990 0 1
#12 3 1991 0 2
#13 3 1992 1 3
#14 3 1993 NA NA
#15 3 1994 NA NA
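The `NA^is.na(y)` term in the base R line relies on R arithmetic: `NA^0` evaluates to 1, while `NA^1` stays NA. A minimal trace on the first id from the example (the intermediate names are illustrative) shows how it masks the running count:

```r
y <- c(NA, 0, 0, 1, NA)

observed <- !is.na(y)        # FALSE TRUE TRUE TRUE FALSE
mask <- NA^is.na(y)          # NA 1 1 1 NA
t <- cumsum(observed) * mask # running count, blanked where y is missing
t
```

Multiplying by the mask leaves the count untouched where y is observed and turns it into NA where y is missing.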
Or using dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(t=replace(y, !is.na(y), seq_along(na.omit(y))))
You can achieve this using the btscs() function from Dave Armstrong's DAMisc package.
df <- btscs(df, "y", "year", "id")
That will spit out your original dataset along with a column 'spell' which is the number of time units since the last event.

Resources