Search in row for a certain value and report the date - r

Hi,
Here is my reproducible example.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
k=c(1,1,1,0,1,1,1,"NA")
l=c("11/1/2012","7/1/2012","11/1/2010",0 ,"8/1/2012","6/1/2012","3/1/2012","NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j,k,l)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3","anytest","date")
I want to search through each row and find the first test column that equals 1. The new column I want to create is "anytest". It is 1 if test1, test2, or test3 equals 1, and 0 if none of them do. NA values are ignored: if test1 and test2 are NA but test3 equals 0, then anytest equals 0. I think I have made progress with this code:
anytestTRY = ifelse(rowSums(mydata[, c("test1", "test2", "test3")] == 1, na.rm = TRUE) > 0, 1, 0)
But now I am stuck, because I also want to find, in each row, the first of test1, test2, or test3 that equals 1 and then report the month and year for that test. So if test1 equals 0, test2 is NA, and test3 equals 1, I want the date column to hold month3 and year3 in an analyzable date format. Thanks a million.

a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
  mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
  group_by(id) %>% # for each id
  nest() %>% # nest data
  mutate(date = map(data, ~case_when(.$test1==1 ~ ymd(paste0(.$year1,"-",.$month1,"-",1)), # get date based on first test that is 1
                                     .$test2==1 ~ ymd(paste0(.$year2,"-",.$month2,"-",1)),
                                     .$test3==1 ~ ymd(paste0(.$year3,"-",.$month3,"-",1)))),
         anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1) > 0 ~ "1", # create anytest column
                                                   sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
                                                   TRUE ~ "0")))) %>%
  unnest() # unnest data
which returns:
# # A tibble: 8 x 12
# id date anytest test1 month1 year1 test2 month2 year2 test3 month3 year3
# <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2012-11-01 1 1 11 2012 1 10 2014 1 2 2011
# 2 2 2011-07-01 1 1 7 2011 0 4 2012 0 12 2012
# 3 3 2010-11-01 1 0 9 2012 1 11 2010 1 12 2011
# 4 4 NA 0 0 9 2014 0 10 2012 0 6 2012
# 5 5 2012-08-01 1 0 5 2014 0 10 2013 1 8 2012
# 6 6 2011-06-01 0 NA NA NA 1 6 2011 0 11 2014
# 7 7 2012-03-01 0 0 7 2011 NA NA NA 1 3 2012
# 8 8 NA NA NA NA NA NA NA NA NA NA NA
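Note that in the output above, ids 6 and 7 get anytest 0 even though one of their tests equals 1. That happens because sum() over a vector containing NA returns NA, so the first case_when condition is skipped. A minimal base R sketch that ignores NA values, as the question asks, could look like this. The helper names (tests, hit, first_hit, and so on) are my own, and the data is retyped from the question with real NA values instead of "NA" strings.

```r
# Data from the question, with real NA values
mydata <- data.frame(
  id     = 1:8,
  test1  = c(1, 1, 0, 0, 0, NA, 0, NA),
  month1 = c(11, 7, 9, 9, 5, NA, 7, NA),
  year1  = c(2012, 2011, 2012, 2014, 2014, NA, 2011, NA),
  test2  = c(1, 0, 1, 0, 0, 1, NA, NA),
  month2 = c(10, 4, 11, 10, 10, 6, NA, NA),
  year2  = c(2014, 2012, 2010, 2012, 2013, 2011, NA, NA),
  test3  = c(1, 0, 1, 0, 1, 0, 1, NA),
  month3 = c(2, 12, 12, 6, 8, 11, 3, NA),
  year3  = c(2011, 2012, 2011, 2012, 2012, 2014, 2012, NA)
)

tests  <- as.matrix(mydata[c("test1", "test2", "test3")])
months <- as.matrix(mydata[c("month1", "month2", "month3")])
years  <- as.matrix(mydata[c("year1", "year2", "year3")])

hit         <- !is.na(tests) & tests == 1     # TRUE only for a non-missing 1
has_hit     <- rowSums(hit) > 0
all_missing <- rowSums(!is.na(tests)) == 0    # every test is NA

# 1 if any test is 1, 0 otherwise, NA only when all three tests are NA
mydata$anytest <- ifelse(all_missing, NA, as.integer(has_hit))

# Column index of the first hit in each row (meaningless when has_hit is FALSE)
first_hit <- max.col(hit * 1, ties.method = "first")
idx       <- cbind(seq_len(nrow(mydata)), first_hit)

# Build the date from the month/year that belong to the first hit
date_str    <- ifelse(has_hit, paste(years[idx], months[idx], 1, sep = "-"), NA)
mydata$date <- as.Date(date_str, format = "%Y-%m-%d")
```

With this version ids 6 and 7 get anytest 1, and the date column is a real Date that can be sorted or differenced.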

Related

How do I add a column indicating the years since a binary variable in R?

I thought this would be trivial, and I think it must be, but I am very tired and stuck at this problem at the moment.
Consider a df with two columns, one with a year, and the other with a binary variable indicating some event.
df <- data.frame(year = c(2000,2001,2002,2003,2004, 2005,2006,2007,2008,2010),
flag = c(0,0,0,1,0,0,0,1,0,0))
I want to create a third column that simply counts the years since the last flag and that resets when a new flag appears (NA before the first flag, 0 in a flag year, then 1, 2, 3, and so on).
I thought this code would do the job:
First, add a 0 as "year_since" for every year with a flag, then, if there was a flag in the previous year, add 1 to the value of the previous "year_since".
df <- df %>% mutate(year_since = ifelse(flag == 1, 0, NA)) %>%
mutate(year_since = ifelse(dplyr::lag(flag, n=1, order_by = "year") == 1 & is.na(year_since),
dplyr::lag(year_since, n=1, order_by = "year")+1, year_since))
However, this returns NA for every row that should be 1,2,3, and so on.
You could do
df %>%
group_by(group = cumsum(flag)) %>%
mutate(year_since = ifelse(group == 0, NA, seq(n()) - 1)) %>%
ungroup() %>%
select(-group)
#> # A tibble: 10 x 3
#> year flag year_since
#> <dbl> <dbl> <dbl>
#> 1 2000 0 NA
#> 2 2001 0 NA
#> 3 2002 0 NA
#> 4 2003 1 0
#> 5 2004 0 1
#> 6 2005 0 2
#> 7 2006 0 3
#> 8 2007 1 0
#> 9 2008 0 1
#> 10 2010 0 2
Created on 2022-09-16 with reprex v2.0.2
Using data.table
library(data.table)
setDT(df)[, year_since := (NA^!cummax(flag)) * rowid(cumsum(flag))-1]
-output
> df
year flag year_since
<num> <num> <num>
1: 2000 0 NA
2: 2001 0 NA
3: 2002 0 NA
4: 2003 1 0
5: 2004 0 1
6: 2005 0 2
7: 2006 0 3
8: 2007 1 0
9: 2008 0 1
10: 2010 0 2
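For anyone who wants to avoid extra packages, the same cumsum grouping idea can be sketched in base R with ave(). This is only an illustration and the helper names grp and pos are mine. Note that, like the answers above, year_since counts rows since the last flag; the second column, years_elapsed, is an assumed variant that uses the actual year difference, which matters here because 2009 is missing from the data.

```r
df <- data.frame(year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2010),
                 flag = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

grp <- cumsum(df$flag)                                 # increases by 1 at every flag
pos <- ave(seq_along(grp), grp, FUN = seq_along) - 1   # 0-based row position within each run

# Rows before the first flag (grp == 0) have no "last flag", so they stay NA
df$year_since <- ifelse(grp == 0, NA, pos)

# Variant: calendar years elapsed since the flag year of the current run
flag_year <- ave(df$year, grp, FUN = function(y) y[1])
df$years_elapsed <- ifelse(grp == 0, NA, df$year - flag_year)
```

The two columns agree everywhere except the last row, where 2010 is the second row after the 2007 flag but three calendar years later.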

R: Turning row data from one dataframe into column data by group in another

I have data in the following format:
ID  Age  Sex
1   29   M
2   32   F
3   18   F
4   89   M
5   45   M
and:
ID  subID  Type        Status  Year
1   3      Car         Y
1   11     Toyota      NULL    2011
1   23     Kia         NULL    2009
2   5      Car         N
3   2      Car         Y
3   4      Honda       NULL    2019
3   7      Fiat        NULL    2006
3   8      Mitsubishi  NULL    2020
4   1      Car         N
5   7      Car         Y
Each ID in the second table has a row specifying if they have a car, and additional rows stating the brand of car/s they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table as so.
ID  Age  Sex  Car?  Car.1   Car1.year  Car.2  Car2.year  Car.3       Car3.year
1   29   M    Y     Toyota  2011       Kia    2009       NULL        NULL
2   32   F    N     NULL    NULL       NULL   NULL       NULL        NULL
3   18   F    Y     Honda   2019       Fiat   2006       Mitsubishi  2020
4   89   M    N     NULL    NULL       NULL   NULL       NULL        NULL
5   45   M    Y     NULL    NULL       NULL   NULL       NULL        NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I try to join the tables together, I would have multiple rows for each ID which I want to avoid. The non-standard set up of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and create a for loop that slowly loops through each ID, searches the second dataframe if the person has a car and the car brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following code:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Car N NA
10 5 7 Car Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA
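If you would rather stay in base R, roughly the same steps (split off the "Car" status rows, number the brand rows within each ID, go wide, join back) can be sketched with ave(), reshape() and merge(). Treat this as an illustration only: the data is retyped from the question with NULL replaced by NA, and the column name HasCar and the helper column n are my own choices.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  Age = c(29, 32, 18, 89, 45),
                  Sex = c("M", "F", "F", "M", "M"))

df2 <- data.frame(
  ID     = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5),
  subID  = c(3, 11, 23, 5, 2, 4, 7, 8, 1, 7),
  Type   = c("Car", "Toyota", "Kia", "Car", "Car", "Honda", "Fiat", "Mitsubishi", "Car", "Car"),
  Status = c("Y", NA, NA, "N", "Y", NA, NA, NA, "N", "Y"),
  Year   = c(NA, 2011, 2009, NA, NA, 2019, 2006, 2020, NA, NA)
)

# One row per ID saying whether the person has a car
has_car <- df2[df2$Type == "Car", c("ID", "Status")]
names(has_car)[2] <- "HasCar"

# Brand rows only, numbered 1, 2, 3 within each ID
cars   <- df2[df2$Type != "Car", c("ID", "Type", "Year")]
cars$n <- ave(cars$ID, cars$ID, FUN = seq_along)

# Long to wide: one Type.k / Year.k pair per car
wide <- reshape(cars, idvar = "ID", timevar = "n", direction = "wide")

# Join everything back; all.x = TRUE keeps people without any brand rows
out <- merge(merge(df1, has_car, by = "ID"), wide, by = "ID", all.x = TRUE)
```

Because the brand rows are numbered before reshaping, each ID ends up on a single row, which avoids the duplicate rows a plain join would produce.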

How to use an index within another index to locate a change in a variable - R

I have the following dataset.
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
My goal is to create a for-loop with two indices, one for id and the other for year, so that it runs through the ids first and then, within each id, looks at the years in which there was a change in status. To record the changes with this loop, I want another variable that shows in which year the change happened.
For example, in the dataframe below the variable change records 0 for id 1001 in all three years. But for 1002, a change in status is recorded with 1 in year 2013. For 1005, status changes twice, in 2013 and 2016, which is why 1 is recorded twice. By the way, id is a character variable because the real data I am working on has alphanumeric ids.
id year status change
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
5 1002 2010 4 0
4 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
9 1005 2010 1 0
10 1005 2013 5 1
8 1005 2016 1 1
The actual dataframe has over 600k observations, and the loop takes a long time to run, so I am open to faster solutions too.
My code is below:
df$change <- NA
df$id <- as.character(df$id)
for (id in unique(df$id)) {
  tau <- df$year[df$id == id]
  if (length(tau) > 1) {
    for (j in 1:(length(tau) - 1)) {
      if (df$status[df$year == tau[j] & df$id == id] != df$status[df$year == tau[j + 1] & df$id == id]) {
        df$change[df$year == tau[j] & df$id == id] <- 0
        df$change[df$year == tau[j + 1] & df$id == id] <- 1
      } else {
        df$change[df$year == tau[j] & df$id == id] <- 0
        df$change[df$year == tau[j + 1] & df$id == id] <- 0
      }
    }
  }
}
You could do:
Base R:
df |>
transform(change = ave(status, id, FUN = \(x)c(0, diff(x))!=0))
In tidyverse:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(change = c(0, diff(status)!=0))
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1
Does this yield the correct result?
library(dplyr)
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
df %>%
group_by(id) %>%
mutate(change = as.numeric(status != lag(status,
default = first(status))))
#> # A tibble: 10 x 4
#> id year status change
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1001 2010 2 0
#> 2 1001 2013 2 0
#> 3 1001 2016 2 0
#> 4 1002 2010 4 0
#> 5 1002 2013 3 1
#> 6 1003 2010 2 0
#> 7 1004 2016 1 0
#> 8 1005 2010 1 0
#> 9 1005 2013 5 1
#> 10 1005 2016 1 1
Note: the default = first(status) argument to lag() compares the first row of each group with itself, so it yields 0 rather than NA and no separate NA-replacement step is needed.
We can use ifelse with a logical comparison between status and lag(status). The key is the argument default = first(status), which eliminates common problems with NAs in the output.
df %>% group_by(id) %>%
mutate(change=ifelse(status==lag(status, default = first(status)), 0, 1))
# A tibble: 10 x 4
# Groups: id [5]
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1
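To see why the default matters, here is a tiny base R illustration for a single id (the statuses of id 1005 in year order). It does not use dplyr; the previous value is built by hand, and the names prev_na and prev_first are just for this sketch.

```r
status <- c(1, 5, 1)   # id 1005 in 2010, 2013, 2016

# Previous value with NA for the first row (what a lag without a default gives)
prev_na <- c(NA, head(status, -1))

# Previous value where the first row is compared with itself
prev_first <- c(status[1], head(status, -1))

change_na    <- as.integer(status != prev_na)      # first element is NA
change_first <- as.integer(status != prev_first)   # first element is 0
```

With the NA version the first year of every id would need a separate replacement step, while the second version gives 0 directly.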

Calculating a ratio from two columns of data by parameters set in another column

I have monthly values in wide form and I'm trying to calculate the ratio of each monthly value to the baseline, but only for the months between the Start Date and End Date.
For example:
ID Start Date End Date Baseline 1/18 2/18 3/18 4/18 5/18 6/18 7/18 8/18
A 1/1/2018 5/1/2018 5 2 4 1 3 5 2 4 5
B 6/1/2018 8/1/2018 2 4 2 4 3 6 6 2 1
C 2/1/2018 3/1/2018 8 3 5 5 3 2 7 8 2
D 5/1/2015 7/1/2018 9 1 3 5 7 4 8 9 1
I would like the output to be:
ID  Start Date  End Date  Baseline  1/18  2/18   3/18   4/18  5/18  6/18  7/18  8/18
A   1/1/2018    5/1/2018  5         0.4   0.8    0.2    0.6   1
B   6/1/2018    8/1/2018  2                                         3     1     0.5
C   2/1/2018    3/1/2018  8               0.625  0.625
D   5/1/2015    7/1/2018  9                                   0.44  0.88  1
Thank you!
A very inelegant solution with dplyr and tidyr, which someone can probably build on:
library(dplyr)
library(tidyr)
library(stringr) # str_extract()
library(forcats) # as_factor(), used in the update below
sample <- sample %>% mutate_at(vars(5:12), funs(round(./Baseline, digits = 3))) ## perform the initial simple proportion calculation
sample <- sample %>% gather(5:12, key = "day", value = "value") %>%
rowwise() %>% ## allow for rowwise operations
mutate(value_temp = case_when(any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
as.numeric(str_extract(StartDate, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(EndDate, "^[:digit:]{1,2}(?=/)")))) == T ~ T,
TRUE ~ NA)) ## create a logical vector which indicates TRUE if the "day" is included in the range of days of StartDate and EndDate
sample$value[is.na(sample$value_temp)] <- NA ## sets values which aren't included in the vector of days to NA
sample$value_temp <- NULL ## remove the temp variable
sample <- sample %>% spread(day, value) ## spread to original df
> sample
# A tibble: 4 x 12
ID StartDate EndDate Baseline `1/18` `2/18` `3/18` `4/18` `5/18` `6/18` `7/18` `8/18`
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1/1/2018 5/1/2018 5 0.4 0.8 0.2 0.6 1 NA NA NA
2 B 6/1/2018 8/1/2018 2 NA NA NA NA NA 3 1 0.5
3 C 2/1/2018 3/1/2018 8 NA 0.625 0.625 NA NA NA NA NA
4 D 5/1/2015 7/1/2018 9 NA NA NA NA 0.444 0.889 1 NA
Update:
sample <- sample %>% mutate_at(vars(5:12), funs(round(./Baseline, digits = 3)))
sample <- sample %>% gather(5:12, key = "day", value = "value") %>%
rowwise() %>%
mutate(value_temp = case_when(any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
as.numeric(str_extract(Start_Date, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(End_Date, "^[:digit:]{1,2}(?=/)")))) == T &
any(grepl(as.numeric(str_extract(day, "[:digit:]{2}$")),
as.numeric(str_extract(Start_Date, "[:digit:]{2}$")):as.numeric(str_extract(End_Date, "[:digit:]{2}$")))) ~ T,
TRUE ~ NA))
sample$value[is.na(sample$value_temp)] <- NA
sample$value_temp <- NULL
sample$day <- sample$day %>% as_factor()
sample <- sample %>% spread(day, value)
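A possible alternative, sketched in base R, is to turn the month column names into real dates and compare them with the start and end dates directly instead of matching month and year digits with regular expressions. The name sample_df and the helper names are my own, and I am assuming each m/yy column stands for the first day of that month. Be aware that a true date comparison treats D's 5/1/2015 start literally, so D also gets ratios for 1/18 through 4/18, unlike the desired output in the question.

```r
sample_df <- data.frame(
  ID        = c("A", "B", "C", "D"),
  StartDate = c("1/1/2018", "6/1/2018", "2/1/2018", "5/1/2015"),
  EndDate   = c("5/1/2018", "8/1/2018", "3/1/2018", "7/1/2018"),
  Baseline  = c(5, 2, 8, 9),
  "1/18" = c(2, 4, 3, 1), "2/18" = c(4, 2, 5, 3),
  "3/18" = c(1, 4, 5, 5), "4/18" = c(3, 3, 3, 7),
  "5/18" = c(5, 6, 2, 4), "6/18" = c(2, 6, 7, 8),
  "7/18" = c(4, 2, 8, 9), "8/18" = c(5, 1, 2, 1),
  check.names = FALSE
)

# Month columns look like "m/yy"; read each as the first day of that month
month_cols <- grep("^[0-9]{1,2}/[0-9]{2}$", names(sample_df), value = TRUE)
col_dates  <- as.numeric(as.Date(paste0("1/", month_cols), format = "%d/%m/%y"))

start <- as.numeric(as.Date(sample_df$StartDate, format = "%m/%d/%Y"))
end   <- as.numeric(as.Date(sample_df$EndDate, format = "%m/%d/%Y"))

# inside[i, j] is TRUE when month j lies within row i's start/end window
inside <- outer(start, col_dates, "<=") & outer(end, col_dates, ">=")

# Divide every monthly value by its row's baseline, then blank out the rest
ratios <- round(as.matrix(sample_df[month_cols]) / sample_df$Baseline, 3)
ratios[!inside] <- NA
rownames(ratios) <- sample_df$ID
```

ratios is a plain numeric matrix with one row per ID; it can be bound back to the ID, date and Baseline columns with cbind() if a data frame is needed.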

how do I identify rows where an element appears for the first time?

I have the following data frame of student records. What I want is to identify students who joined a certain program in 2014 for the first time when they were in 9th grade.
names.first<-c('a','a','b','b','c','d')
names.last<-c('c','c','z','z','f','h')
year<-c(2014,2013,2014,2015,2015,2014)
grade<-c(9,8,9,10,10,10)
df<-data.frame(names.first,names.last,year,grade)
df
To do this, I have used the following statement to say that I want students where the program year==2014 and their grade ==9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can see, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like this:
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arranging by 'names' and 'year' to create the logical expression
library(dplyr)
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
To keep the same row order as the input dataset, we can create a sequence column first and then arrange on that column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or use the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset
library(data.table)
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
Update
With the new example in the OP's post, we group by both 'names' columns
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
Using dplyr
library(dplyr)
df%>%group_by(names)%>%dplyr::mutate(Fc=as.numeric((year==2014&grade==9)&(min(year)==2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0
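The same min(year) idea can also be sketched in base R with ave(), using the two-column names from the updated question. The helper names student and first_year are mine; pasting the two name columns into one key is just one simple way to group by both.

```r
names.first <- c("a", "a", "b", "b", "c", "d")
names.last  <- c("c", "c", "z", "z", "f", "h")
year  <- c(2014, 2013, 2014, 2015, 2015, 2014)
grade <- c(9, 8, 9, 10, 10, 10)
df <- data.frame(names.first, names.last, year, grade)

# One key per student, then the earliest program year for each student
student    <- paste(df$names.first, df$names.last)
first_year <- ave(df$year, student, FUN = min)

# 1 only for the 2014 ninth-grade row of students whose first year is 2014
df$first.cohort <- as.integer(df$year == 2014 & df$grade == 9 & first_year == 2014)
```

Since ave() returns values in the original row order, no sorting or re-sorting is needed here.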
