How to use an index within another index to locate a change in a variable - R

I have the following dataset.
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
My goal is to create a for-loop with two indices, one for id and the other for year, so that it runs through each id first and then, within each id, looks at the years in which the status changed. To record the changes, I want another variable that shows in which year the change happened.
For example, in the dataframe below the variable change records 0 for id 1001 in all three years. But for 1002, a change in status is recorded with 1 in year 2013. For 1005, status changes twice, in 2013 and 2016, which is why 1 is recorded twice. By the way, id is a character variable because the real data I am working on has alphanumeric ids.
id year status change
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
5 1002 2010 4 0
4 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
9 1005 2010 1 0
10 1005 2013 5 1
8 1005 2016 1 1
The actual dataframe has over 600k observations, and the loop takes a long time to run. I am open to faster solutions too.
My code is below:
df$change <- 0  # start at 0 so first years and single-row ids are covered
df$id <- as.character(df$id)
for (id in unique(df$id)) {
  tau <- df$year[df$id == id]
  if (length(tau) > 1) {
    for (j in 1:(length(tau) - 1)) {
      if (df$status[df$year == tau[j] & df$id == id] != df$status[df$year == tau[j + 1] & df$id == id]) {
        # flag only the later year, so an earlier 1 is never overwritten with 0
        df$change[df$year == tau[j + 1] & df$id == id] <- 1
      }
    }
  }
}

You could do:
Base R:
df |>
transform(change = ave(status, id, FUN = \(x)c(0, diff(x))!=0))
In tidyverse:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(change = c(0, diff(status)!=0))
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1

Does this yield the correct result?
library(dplyr)
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
df %>%
group_by(id) %>%
mutate(change = as.numeric(status != lag(status,
default = first(status))))
#> # A tibble: 10 x 4
#> id year status change
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1001 2010 2 0
#> 2 1001 2013 2 0
#> 3 1001 2016 2 0
#> 4 1002 2010 4 0
#> 5 1002 2013 3 1
#> 6 1003 2010 2 0
#> 7 1004 2016 1 0
#> 8 1005 2010 1 0
#> 9 1005 2013 5 1
#> 10 1005 2016 1 1
Note: lag() is called with default = first(status), so the first row of each id is compared with itself and gets 0 rather than NA; no separate NA replacement step is needed.

We can use ifelse with a logical comparison between status and lag(status). The key is the argument default = first(status), which eliminates common problems with NAs in the output.
df %>% group_by(id) %>%
mutate(change=ifelse(status==lag(status, default = first(status)), 0, 1))
# A tibble: 10 x 4
# Groups: id [5]
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1
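For a large table such as the 600k-row one mentioned in the question, the same comparison with the previous row can also be done without any grouping step. The following is only a sketch in base R, assuming the rows are already sorted by id and year and that there are at least two rows; same_id and status_changed are helper names introduced here for illustration.

```r
id <- c(1001, 1001, 1001, 1002, 1002, 1003, 1004, 1005, 1005, 1005)
year <- c(2010, 2013, 2016, 2013, 2010, 2010, 2016, 2016, 2010, 2013)
status <- c(2, 2, 2, 3, 4, 2, 1, 1, 1, 5)
df <- data.frame(id = as.character(id), year, status)
df <- df[order(df$id, df$year), ]  # the comparison below relies on this ordering

n <- nrow(df)
# TRUE when a row belongs to the same id as the row directly above it
same_id <- c(FALSE, df$id[-1] == df$id[-n])
# TRUE when the status differs from the row directly above it
status_changed <- c(FALSE, df$status[-1] != df$status[-n])
# a change only counts when it happens within one id
df$change <- as.integer(same_id & status_changed)
df
```

Because everything is a whole-vector comparison, there is no per-id loop at all, which is usually what makes this kind of approach quick on long tables.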

Related

Count the occurrences of accidents until the next accident

I have the following data frame and I would like to create the "OUTPUT_COLUMN".
Explanation of columns:
ID is the identification number of the policy
ID_REG_YEAR is the identification number per Registration Year
CALENDAR_YEAR is the year in which the policy has exposure
NUMBER_OF_RENEWALS is the number of times the policy has been renewed
ACCIDENT indicates whether an accident occurred
KEY TO THE DATASET: ID_REG_YEAR and CALENDAR_YEAR
Basically, if column NUMBER_OF_RENEWALS = 0 then OUTPUT_COLUMN = 100. Any row with no accident before it should contain 100 (e.g. rows 13, 16, 17). If an accident occurred, I would like to count the number of renewals since that accident, until the next accident.
ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1 A A_2015 2015 0 YES 100
2 A A_2015 2016 0 YES 100
3 A A_2016 2016 1 YES 0
4 A A_2016 2017 1 YES 0
5 A A_2017 2017 2 NO 1
6 A A_2017 2018 2 NO 1
7 A A_2018 2018 3 NO 2
8 A A_2018 2019 3 NO 2
9 A A_2019 2019 4 YES 0
10 A A_2019 2020 4 YES 0
11 B B_2015 2015 0 NO 100
12 B B_2015 2016 0 NO 100
13 B B_2016 2016 1 NO 100
14 C C_2013 2013 0 NO 100
15 C C_2013 2014 0 NO 100
16 C C_2014 2014 1 NO 100
17 C C_2014 2015 1 NO 100
18 C C_2015 2015 2 YES 0
19 C C_2015 2016 2 YES 0
20 C C_2016 2016 3 NO 1
21 C C_2016 2017 3 NO 1
22 C C_2017 2017 4 NO 2
23 C C_2017 2018 4 NO 2
24 C C_2018 2018 5 YES 0
25 C C_2018 2019 5 YES 0
26 C C_2019 2019 6 NO 1
27 C C_2019 2020 6 NO 1
28 C C_2020 2020 7 NO 2
Here is a dplyr solution. First, obtain a separate column for the registration year, which will be used to calculate renewals since the prior accident (this assumes the count is in years since the last accident). Then, after grouping by ID, create a column containing the year of the last accident. Using fill, this value is propagated downward. The final outcome column is set to 100 if there is no prior accident or NUMBER_OF_RENEWALS is zero, and otherwise to the registration year minus the last accident year.
library(dplyr)
library(tidyr)  # separate() and fill() come from tidyr
df %>%
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
  fill(LAST_ACCIDENT, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
  ))
Output
ID ID_REG REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN LAST_ACCIDENT OUTPUT_COLUMN_2
<chr> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
1 A A 2015 2015 0 YES 100 2015 100
2 A A 2015 2016 0 YES 100 2015 100
3 A A 2016 2016 1 YES 0 2016 0
4 A A 2016 2017 1 YES 0 2016 0
5 A A 2017 2017 2 NO 1 2016 1
6 A A 2017 2018 2 NO 1 2016 1
7 A A 2018 2018 3 NO 2 2016 2
8 A A 2018 2019 3 NO 2 2016 2
9 A A 2019 2019 4 YES 0 2019 0
10 A A 2019 2020 4 YES 0 2019 0
# … with 18 more rows
Note: If you want to use your policy number (NUMBER_OF_RENEWALS) and not go by the year, you can do something similar. Instead of adding a column with the last accident year, you can include the last accident policy. Then, your output column could reflect the policy number instead of year (to consider the possibility that one or more years could be skipped).
df %>%
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
  fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY
  ))
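The "carry the last accident forward" step in the renewal-based variant can also be sketched in base R with a running maximum. This is a rough illustration on a small made-up data frame, not the original data; it assumes rows are sorted within each ID so that NUMBER_OF_RENEWALS never decreases, and acc and last_acc are names introduced here.

```r
# small illustrative data: policy A has accidents at renewals 0 and 1, policy B has none
df <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B"),
  NUMBER_OF_RENEWALS = c(0, 1, 2, 3, 0, 1),
  ACCIDENT = c("YES", "YES", "NO", "NO", "NO", "NO")
)

# renewal number on accident rows, -Inf as a "no accident yet" marker elsewhere
acc <- ifelse(df$ACCIDENT == "YES", df$NUMBER_OF_RENEWALS, -Inf)
# running maximum within each ID = renewal number of the most recent accident
last_acc <- ave(acc, df$ID, FUN = cummax)

df$OUTPUT_COLUMN_2 <- ifelse(
  is.infinite(last_acc) | df$NUMBER_OF_RENEWALS == 0,
  100,
  df$NUMBER_OF_RENEWALS - last_acc
)
df
```

The running maximum plays the role of fill(.direction = "down") here; it only works because the renewal count is non-decreasing within a policy.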

How to modify a column based on a condition in a time series?

I have data on animal territories by month (1 = January etc.) for multiple individuals:
year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2
I want to add a column that has a 1 if two consecutive months exceed some value e.g. 10. One wrinkle is that my data can run over one year for a single id.
year month terr_size id new_col
2018 1 20 1 1
2018 2 30 1 1
2019 1 5 1 0
2019 2 10 1 0
2018 3 20 2 0
2018 5 25 2 1
2018 6 20 2 1
2018 7 20 2 1
2019 1 10 2 0
2019 2 5 2 0
2019 3 20 2 1
2019 4 30 2 1
This can be expressed compactly using a single left join in a single SQL statement.
Using the input shown in the Note at the end, perform a left self join using the indicated on condition and set new_col to 1 if for any original row both it and any matched rows have terr_size greater than or equal to 10. If there is no matched row then use coalesce to set new_col to 0.
library(sqldf)
sqldf("
select a.*,
coalesce(max(a.terr_size >= 10 and b.terr_size >= 10), 0)
new_col
from DF a
left join DF b on
a.id = b.id and
(12 * b.year + b.month = 12 * a.year + a.month + 1 or
12 * b.year + b.month = 12 * a.year + a.month - 1)
group by a.rowid")
giving:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 0
6 2018 5 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
Note
The input and output shown in the question are not consistent so to be clear we assumed this:
Lines <- "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 "
DF <- read.table(text = Lines, header = TRUE)
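The month index used in the join condition (12 * year + month) can be reused outside SQL as well. Below is a minimal base R sketch of the same lookup on the same assumed input; is_big, idx, key, prev_size and next_size are helper names introduced for this illustration, and the threshold is the same "greater than or equal to 10" used in the query.

```r
DF <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)

# consecutive calendar months differ by exactly 1 in this index, also across years
idx <- 12 * DF$year + DF$month
key <- paste(DF$id, idx)

# territory size of the same id one month earlier / later (NA if no such row)
prev_size <- DF$terr_size[match(paste(DF$id, idx - 1), key)]
next_size <- DF$terr_size[match(paste(DF$id, idx + 1), key)]

# a missing neighbouring month never counts as reaching the threshold
is_big <- function(x) !is.na(x) & x >= 10
DF$new_col <- as.integer(DF$terr_size >= 10 & (is_big(prev_size) | is_big(next_size)))
DF
```

Using match() on an id-plus-month key keeps the lookup vectorised and treats gaps in the monthly series as "no neighbour" rather than comparing against whatever row happens to be adjacent.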
Your data:
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 2 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 ", header = TRUE)
The idea is to create a date variable first.
Then you create two copies of your data, shifting the dates one month ahead in one and one month back in the other.
R is memory-efficient for this kind of operation: columns that are not modified are shared rather than copied, so each copy costs roughly one additional column instead of a whole duplicated dataframe.
Then you can join the new columns to the original dataframe.
You then apply the condition you need.
I created a magic_number variable for the threshold.
At the end, I selected only the original columns plus the one you needed.
library(dplyr)
library(lubridate)
# the threshold number
magic_number <- 10
# create date variable (first day of each month)
df <- df %>% mutate(date = make_date(year, month))
# [p]revious month: shift each record forward so it lines up with the following month
dfp <- df %>% transmute(id, date = date + months(1), terr_size_p = terr_size)
# [n]ext month: shift each record back so it lines up with the preceding month
dfn <- df %>% transmute(id, date = date - months(1), terr_size_n = terr_size)
# join by id and date
df <- df %>%
  left_join(dfp, by = c("id", "date")) %>%
  left_join(dfn, by = c("id", "date"))
# new_col is 1 when terr_size is over the threshold and so is at least one of
# the previous and next months; a missing neighbour counts as not over it.
# (any() would collapse the whole column to one value, so use a row-wise | here)
df <- df %>%
  mutate(new_col = as.numeric(terr_size > magic_number &
    (coalesce(terr_size_p > magic_number, FALSE) |
      coalesce(terr_size_n > magic_number, FALSE))))
# remove variables if there is no more use for them
df <- df %>% select(-terr_size_p, -terr_size_n, -date)
df
df
Result:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 2 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
(The result is not exactly the same because your initial data and expected results do not correspond at row 5)
This solution handles the December-January issue we talked about in the comments.
I'm not exactly sure what the rule is, because your output doesn't follow the rule you describe (e.g. lines 1 and 5 have no other month for comparison, yet you put a 1; line 6 is separated by 2 months; you put a 1 in line 11 whereas line 12 was <10).
I assumed the most complicated scenario, so you can remove the extra conditions you don't need:
You put a 1 if the territory size remained >10 for two consecutive months including this one (or for the first recorded month if it's >10) for each individual.
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)
Using dplyr and lag:
library(dplyr)
df %>% arrange(id, year, month) %>%
  dplyr::mutate(newcol = case_when(
    is.na(lag(month)) & terr_size > 10 ~ 1,
    lag(id) != id & terr_size > 10 ~ 1,
    id == lag(id) & year - lag(year) == 0 & month - lag(month) == 1 & terr_size > 10 & lag(terr_size) > 10 ~ 1,
    id == lag(id) & year - lag(year) == 1 & lag(month) - month == 11 & terr_size > 10 & lag(terr_size) > 10 ~ 1,
    TRUE ~ 0))
output:
year month terr_size id newcol
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 5 25 2 0
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 0
12 2019 4 30 2 1

how do I identify rows where an element appears for the first time?

I have the following data frame of student records. What I want is to identify students who joined a certain program in 2014 for the first time when they were in 9th grade.
names.first<-c('a','a','b','b','c','d')
names.last<-c('c','c','z','z','f','h')
year<-c(2014,2013,2014,2015,2015,2014)
grade<-c(9,8,9,10,10,10)
df<-data.frame(names.first,names.last,year,grade)
df
To do this, I have used the following statement to say that I want students where the program year==2014 and their grade ==9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can notice, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arranging by 'name' and 'year' to create the logical expression
library(dplyr)
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
For keeping the same order as in the input dataset, we can create a sequence column first and then do the arrange on the column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or using the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset
library(data.table)
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
Update
With the new example in the OP's post, we group by both 'names' columns
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
Using dplyr
library(dplyr)
df%>%group_by(names)%>%dplyr::mutate(Fc=as.numeric((year==2014&grade==9)&(min(year)==2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0
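The min(year) idea can also be written without any package. Here is a minimal base R sketch on the question's two-column name data; student and first_year are names introduced for this illustration, and pasting the two name columns into one key assumes that combination identifies a student.

```r
names.first <- c('a', 'a', 'b', 'b', 'c', 'd')
names.last <- c('c', 'c', 'z', 'z', 'f', 'h')
year <- c(2014, 2013, 2014, 2015, 2015, 2014)
grade <- c(9, 8, 9, 10, 10, 10)
df <- data.frame(names.first, names.last, year, grade)

# one key per student, built from both name columns
student <- paste(df$names.first, df$names.last)
# earliest program year of each student, repeated on every row of that student
first_year <- ave(df$year, student, FUN = min)

# 9th grader in 2014 whose earliest year in the program is also 2014
df$first.cohort <- as.integer(df$grade == 9 & df$year == 2014 & first_year == 2014)
df
```

Since ave() returns its result in the original row order, no sorting or re-sorting step is needed.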

Search in row for a certain value and report the date

Hello,
Here is my replicating example.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
k=c(1,1,1,0,1,1,1,"NA")
l=c("11/1/2012","7/1/2012","11/1/2010",0 ,"8/1/2012","6/1/2012","3/1/2012","NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j,k,l)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3","anytest","date")
I am aiming to search through each row and find the first test column that is equal to 1. The new column I am aiming to create is "anytest". This column is 1 if test1 or test2 or test3 equals 1. If none of them do, it equals 0. This ignores NA values: if test1 and test2 are NA but test3 equals 0, then anytest equals 0. I think I have made progress using this code:
anytestTRY = ifelse(rowSums(mydata[, c("test1", "test2", "test3")] == 1, na.rm = TRUE) > 0, 1, 0)
But now I am at a crossroads, because I am aiming to search through each row to find the first column of test1, test2 or test3 that equals 1 and then report the month and year for that test. So if test1 equals 0, test2 equals NA and test3 equals 1, I want the column I created called date to hold month3 and year3 in an analyzable date format. Thanks a million.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ ymd(paste0(.$year1,"-",.$month1,"-",1)), # get date based on first test that is 1
.$test2==1 ~ ymd(paste0(.$year2,"-",.$month2,"-",1)),
.$test3==1 ~ ymd(paste0(.$year3,"-",.$month3,"-",1)))),
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1, na.rm = TRUE) > 0 ~ "1", # create anytest column, ignoring NA tests
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest() # unnest data
which returns:
# # A tibble: 8 x 12
# id date anytest test1 month1 year1 test2 month2 year2 test3 month3 year3
# <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2012-11-01 1 1 11 2012 1 10 2014 1 2 2011
# 2 2 2011-07-01 1 1 7 2011 0 4 2012 0 12 2012
# 3 3 2010-11-01 1 0 9 2012 1 11 2010 1 12 2011
# 4 4 NA 0 0 9 2014 0 10 2012 0 6 2012
# 5 5 2012-08-01 1 0 5 2014 0 10 2013 1 8 2012
# 6 6 2011-06-01 1 NA NA NA 1 6 2011 0 11 2014
# 7 7 2012-03-01 1 0 7 2011 NA NA NA 1 3 2012
# 8 8 NA NA NA NA NA NA NA NA NA NA NA
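The "first test that equals 1" lookup can also be sketched with plain matrix indexing in base R. The example below uses a subset of the question's rows (ids 1, 3, 4, 6 and 8) typed as real numbers and real NA values instead of "NA" strings; hit, first_hit and pick are names introduced for this illustration.

```r
mydata <- data.frame(
  id = c(1, 3, 4, 6, 8),
  test1 = c(1, 0, 0, NA, NA), month1 = c(11, 9, 9, NA, NA), year1 = c(2012, 2012, 2014, NA, NA),
  test2 = c(1, 1, 0, 1, NA), month2 = c(10, 11, 10, 6, NA), year2 = c(2014, 2010, 2012, 2011, NA),
  test3 = c(1, 1, 0, 0, NA), month3 = c(2, 12, 6, 11, NA), year3 = c(2011, 2011, 2012, 2014, NA)
)

tests <- as.matrix(mydata[, c("test1", "test2", "test3")])
months <- as.matrix(mydata[, c("month1", "month2", "month3")])
years <- as.matrix(mydata[, c("year1", "year2", "year3")])

# TRUE where a test equals 1; a missing test never counts as a hit
hit <- !is.na(tests) & tests == 1
# column position (1, 2 or 3) of the first hit in each row, NA if there is none
first_hit <- unname(apply(hit, 1, function(r) which(r)[1]))

# 1 if any test is 1, 0 if none is, NA only when all three tests are missing
mydata$anytest <- ifelse(rowSums(!is.na(tests)) == 0, NA, as.integer(!is.na(first_hit)))

# pick month and year from the column of the first hit; rows without a hit give NA
pick <- cbind(seq_len(nrow(mydata)), first_hit)
mydata$date <- as.Date(paste(years[pick], months[pick], 1, sep = "-"), format = "%Y-%m-%d")
mydata[, c("id", "anytest", "date")]
```

Indexing a matrix with a two-column (row, column) matrix returns one value per row, which is what lets the month and year be taken from whichever test fired first.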

Create a new variable to epidemiological week

I have a data frame with a column week and another column year (87 weeks in total). I need to create a new column (weekseq) with a number that identifies each week sequentially from first to last. I don't know how to do this. Can someone help me?
Example:
id week month year yearweek weekseq
1 1 1 2014 2014/1
1 1 1 2013 2013/1
1 2 1 2014 2014/2
1 2 1 2013 2013/2
1 3 1 2014 2014/3
1 3 1 2013 2013/3
1 4 1 2014 2014/4
1 4 1 2013 2013/4
1 5 1 2014 2014/5
1 5 1 2013 2013/5
1 6 2 2014 2014/6
1 6 2 2013 2013/6
1 7 2 2014 2014/7
1 7 2 2013 2013/7
1 8 2 2014 2014/8
1 8 2 2013 2013/8
1 9 2 2014 2014/9
1 9 2 2013 2013/9
1 10 3 2014 2014/10
1 10 3 2013 2013/10
1 11 3 2014 2014/11
1 11 3 2013 2013/11
1 12 3 2014 2014/12
1 12 3 2013 2013/12
This solution requires the 'dplyr' and 'plyr' packages:
library(plyr)   # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
# Coerce into a tibble (tbl_df() is deprecated in current dplyr)
datatbl <- as_tibble(data)
# Arrange, giving more weight to year than week
datatbl <- arrange(datatbl, year, month, week)
# Create a new column that numbers the arranged rows sequentially within each id
seqtbl <- ddply(datatbl, .(id), transform, weekseq = seq_along(id))
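If the goal is one number per distinct year/week combination rather than per row, a sortable key plus match() is another option. This is only a sketch in base R on a few made-up rows in the question's layout; key is a name introduced here, and the approach assumes week numbers stay below 100.

```r
# a few illustrative rows in the same layout as the question
data <- data.frame(
  id = 1,
  week = rep(1:3, each = 2),
  year = rep(c(2014, 2013), times = 3)
)

# sortable year-week key, e.g. week 1 of 2014 becomes 201401
key <- data$year * 100 + data$week
# position of each key among the sorted distinct keys = sequential week number
data$weekseq <- match(key, sort(unique(key)))
data
```

Rows that share the same year and week get the same weekseq, and the original row order is left untouched.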
