R dates comparison using loop - r
I have a dataframe with 1 date column (converted as.Date).
I am trying to write a loop that fills another column to "check" each death date against a fixed value (today's date).
fix_date= as.Date(2021-10-28)
for (i in 1:length(df$Death.date)) {
if (df$Death.date[i] < as.Date(fix_date)){
df$death_check[i]<-"good"
}
}
So for each row if Death.date < fix_date, fill death_check column with "good".
It is giving me this error code:
Error in if (new_possible_population$Death.date[i] <
as.Date(exploratory_date)) { : missing value where TRUE/FALSE
needed
Is this the correct way to write a loop over date values, or is there a better way than using loops for this?
You definitely want to use vectorised functions for this. Check out the dplyr package:
library(dplyr)
df %>%
mutate(death_check = case_when(Death.date < as.Date("2021-10-28") ~ "good"))
As you can see, I also added "" around the date; this is necessary. If your df$Death.date is not actually in Date format, you can convert it here as well.
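For reference, here is a minimal base R sketch of what goes wrong in the loop and how the check can be done without one. The df below is made-up sample data with one missing date (an assumption, since the original data isn't shown); an NA in the comparison is what makes if() stop with "missing value where TRUE/FALSE needed".

```r
# Without quotes, R evaluates the "date" as arithmetic: 2021 - 10 - 28
unquoted <- 2021-10-28            # a plain number (1983), not a date

fix_date <- as.Date("2021-10-28")

# Hypothetical sample data, including one missing date
df <- data.frame(Death.date = as.Date(c("2020-05-01", NA, "2022-01-15")))

# Comparing an NA date yields NA, which if() cannot handle
cmp <- df$Death.date < fix_date   # TRUE NA FALSE

# Vectorised check: "good" where the date is earlier, NA otherwise
df$death_check <- ifelse(!is.na(cmp) & cmp, "good", NA_character_)
```

The whole column is filled in one step, and rows with a missing date simply stay NA instead of raising an error.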
library(data.table)
df <- data.table(
Death.date = sample(seq(as.Date("2020-01-01"), by = "month", length.out = 25))
)
# just a TRUE for "good" which makes FALSE "bad"
# (Sys.Date() rather than Sys.time(), so a Date is compared with a Date)
df[, death_check_1 := Death.date < Sys.Date()]
# written "good"
df[Death.date < Sys.Date(), death_check_2 := "good"]
Here's another option using sapply and an ifelse:
# make df using Merijn's code
df <- data.frame(Death.date = sample(seq(as.Date("2020-01-01"),
by = "month",
length.out = 25)))
# set the date to check against
fix_date <- as.Date("2021-10-28")
# make the comparison, return "good" or NA
df$death_check <- sapply(df$Death.date, function(x) {
ifelse(x < fix_date, "good", NA)
})
df
Related
Is there a way to reference a returned dataframe's variable in R without an assignment?
I'm trying to see if I can do the following using only a single assignment and one line of code in R. This is how I wish it could work. The variable/column names I wish to select are 'Diet' and 'Time':
data.melted <- melt.data.frame(ChickWeight, measure.vars = 'weight', na.rm=T)[c(Diet == 1 | Diet == 4 & Time == 21)]
This is how I currently get it to work:
data.melted <- melt.data.frame(ChickWeight, measure.vars = 'weight', na.rm=T)
data.melted <- data.melted[c((data.melted$Diet == 1 | data.melted$Diet == 4) & data.melted$Time == 21),]
Is there a way to reference the object that gets returned from a function, like a reserved word, so that I could select the columns? As in:
melt.data.frame(ChickWeight, measure.vars = 'weight', na.rm=T)[ReserveWordForReturnedDF$Time,]
Thank you!
The pipe (%>%) was introduced to avoid creating such intermediate objects. You can do:
library(magrittr)
melt.data.frame(ChickWeight, measure.vars = 'weight', na.rm=TRUE) %>%
  dplyr::filter(Diet %in% c(1, 4) & Time == 21)
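If you'd rather stay in base R, subset() also lets you refer to the columns of the data frame it is given without a second assignment. This is only a sketch on the unmelted ChickWeight data (the melt step is left out, and chicks is just an illustrative name). It also shows why the parentheses around the Diet test matter, since & binds tighter than |.

```r
# ChickWeight ships with R; drop its extra classes to get a plain data frame
chicks <- as.data.frame(ChickWeight)

# Columns can be named directly inside subset(), no intermediate object needed
result <- subset(chicks, Diet %in% c(1, 4) & Time == 21)

# Without parentheses this means Diet == 1 | (Diet == 4 & Time == 21),
# which keeps every Diet 1 row regardless of Time
loose <- subset(chicks, Diet == 1 | Diet == 4 & Time == 21)
```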
how to insert sequential rows in data.table in R (Example given)?
df is a data.table and df_expected is the desired data.table. I want to add an hour column from 0 to 23, and the visits value should be filled with 0 for the newly added hours.
df<-data.table(customer=c("x","x","x","y","y"),location_id=c(1,1,1,2,3),hour=c(2,5,7,0,4),visits=c(40,50,60,70,80))
df_expected<-data.table(customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target and then uses a join to add in the visits information. The ifelse statement just helps us clean up the NAs from the merge. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
  customer = rep(unique(df$customer), each = 24),
  hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
  .(customer, hour, visits = ifelse(is.na(visits), 0, visits))
]
all.equal(df_expected, df_join)
Edit: This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[ , .("customer" = rep(unique(customer), each = 24L),
                  "hour" = rep(0L:23L, times = uniqueN(customer))),
              by = location_id]
df_join <- df[target, on = .NATURAL,
  .(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
  .(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
Here's a for-loop answer.
df_final <- data.table()
# hours run from 0 to 23; fill=TRUE because the filler rows have no location_id
for(i in 0:23){
  if(i %in% df[,hour]){
    a <- df[hour==i]
  }else{
    a <- data.table(customer="x", hour=i, visits=0)}
  df_final <- rbind(df_final, a, fill=TRUE)
}
df_final
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done).
df_final <- data.table()
for(j in unique(df[,customer])){
  for(i in 0:23){
    if(i %in% df[,hour]){
      if(df[hour==i,customer] %in% j){
        a <- df[hour==i]
      }else{
        a <- data.table(customer=j, hour=i, visits=0)
      }
    }else{
      a <- data.table(customer=j, hour=i, visits=0)
    }
    df_final <- rbind(df_final, a, fill=TRUE)
  }
}
df_final
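The same "build the full grid, then join" idea from the answers above can also be sketched in base R with expand.grid and merge. To keep it short, this assumes a simplified version of the data without the location_id column; grid and full are just illustrative names.

```r
# Simplified version of the question's data (location_id left out)
df <- data.frame(customer = c("x", "x", "x", "y", "y"),
                 hour     = c(2L, 5L, 7L, 0L, 4L),
                 visits   = c(40, 50, 60, 70, 80),
                 stringsAsFactors = FALSE)

# Every customer/hour combination that should exist in the result
grid <- expand.grid(customer = unique(df$customer),
                    hour     = 0:23,
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, pull in visits where they exist
full <- merge(grid, df, by = c("customer", "hour"), all.x = TRUE)

# Hours that were not in the original data get 0 visits
full$visits[is.na(full$visits)] <- 0
full <- full[order(full$customer, full$hour), ]
```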
Is there a more efficient way to fill extra column than a 'for' loop?
I have a data.table with about 100k rows. I am going to simplify this to only 3 columns because that is all that is relevant here.
dt <- data.table(indicator = c("x", "y"),
                 date1 = c("20190111", "20190212", "20190512", "20190723"),
                 date2 = c("20190105", "20190215", "20190616", "20190623"))
What I want to do is assign either date1 or date2 to a new column, 'final_date', depending on the indicator column. If indicator is "x", assign final_date as date1. If indicator is "y", assign final_date as date2.
I am able to do this with a "for" loop and if/else statements, but it takes a few minutes to complete with 100k rows.
for (row in 1:nrow(dt)) {
  if(dt$indicator[row] == "x") {
    dt$final_date[row] <- dt$date1[row]
  } else {
    dt$final_date[row] <- dt$date2[row]
  }
}
Is there any more efficient way to do this with data.table functionality or anything else?
With data.table, I would do something like this:
dt[, final_date := ifelse(indicator == "x", date1, date2)]
Really quick and simple! I suspect that with a large set of data it will be faster than dplyr as well as the solution you have, as data.table mutates in place rather than creating a copy of the data.
With the dplyr pipeline:
> dt%>%mutate(final_data=if_else(indicator=="x",date1,date2))
  indicator    date1    date2 final_data
1         x 20190111 20190105   20190111
2         y 20190212 20190215   20190215
3         x 20190512 20190616   20190512
4         y 20190723 20190623   20190623
Try this:
# necessary packages
library(dplyr)
library(data.table)
# reproduce your data
dt <- data.table(
  indicator = c("x", "y"),
  date1 = c("20190111", "20190212", "20190512", "20190723"),
  date2 = c("20190105", "20190215", "20190616", "20190623")
)
# create your variable final_date
dt[, final_date := case_when(indicator == "x" ~ date1,
                             TRUE ~ date2)]
Hope it helps
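For comparison, the same vectorised idea works in plain base R on an ordinary data frame, with no packages at all. This is just a sketch; the indicator column is written out in full here instead of relying on recycling.

```r
# Same sample data as in the question, as a plain data frame
dt <- data.frame(indicator = c("x", "y", "x", "y"),
                 date1 = c("20190111", "20190212", "20190512", "20190723"),
                 date2 = c("20190105", "20190215", "20190616", "20190623"),
                 stringsAsFactors = FALSE)

# One vectorised call replaces the whole row-by-row loop
dt$final_date <- ifelse(dt$indicator == "x", dt$date1, dt$date2)
```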
R - add column based on intervals in separate data frame
I have the following data frames:
DF <- data.frame(Time=c(1:20))
StartEnd <- data.frame(Start=c(2,6,14,19), End=c(4,10,17,20))
I want to add a column "Activity" to DF if the values in the Time column lie in between one of the intervals specified in the StartEnd dataframe.
I came up with the following:
mapply(FUN = function(Start,End) ifelse(DF$Time >= Start & DF$Time <= End, 1, 0),
       Start=StartEnd$Start, End=StartEnd$End)
This doesn't give me the output I want (it gives me a matrix with four columns), but I would like to get a vector that I can add to DF.
I guess the solution is easy but I'm not seeing it :) Thank you in advance.
EDIT: I'm sure I can use a loop but I'm wondering if there are more elegant solutions.
You can achieve this with
DF$Activity <- sapply(DF$Time, function(x) {
  ifelse(sum(ifelse(x >= StartEnd$Start & x <= StartEnd$End, 1, 0)), 1, 0)
})
I hope this helps!
If you're using the tidyverse, I think a good way to go would be with purrr::map2:
library(dplyr)
library(purrr)
# generate a sequence (n, n + 1, etc.) for each StartEnd row
# (map functions return a list; purrr::flatten_int or unlist can
# squash this down to a vector!)
activity_times = map2(StartEnd$Start, StartEnd$End, seq) %>% flatten_int
# then get a new DF column that is TRUE if Time is in activity_times
DF %>% mutate(active = Time %in% activity_times)
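Another base R possibility, as a rough sketch: the four-column matrix from the question is already most of the way there, since it only needs collapsing row-wise. Here outer() builds the same kind of Time-by-interval matrix, and rowSums() checks whether each Time falls in at least one interval (inside is just an illustrative name).

```r
DF <- data.frame(Time = 1:20)
StartEnd <- data.frame(Start = c(2, 6, 14, 19), End = c(4, 10, 17, 20))

# One row per Time, one column per interval: TRUE if Time lies in that interval
inside <- outer(DF$Time, StartEnd$Start, ">=") &
          outer(DF$Time, StartEnd$End, "<=")

# 1 if the Time falls in any interval, 0 otherwise
DF$Activity <- as.integer(rowSums(inside) > 0)
```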
change variable values based on preceding value
I have the following dataset:
df <- data.frame(subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
                 time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
                 performance = c(1,0,-1,-1,0,1,1,-1,0,0,0,1,1,1,-1,0,1,1,-1,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
What I would like to do is to change some of the entries in the performance variable. More specifically, if a "-1" entry is preceded by a "1", I want to change the "-1" to "0". However, this should be done within subjects only, not across subjects (the subjects have varying numbers of sessions).
So, this is what I'd like to have in the end:
df2 =data.frame(subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
                time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
                performance = c(1,0,-1,-1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
Does anyone have an idea how to do this? Thanks in advance!
S.
Using dplyr,
df %>%
  group_by(subject) %>%
  mutate(performance = replace(performance, which(performance + lag(performance)==0 & performance == -1), 0))
Here's a data.table approach, where I first create a flag column which is then used to subset the data and update the performance column by reference.
library(data.table)
dt <- as.data.table(df) # or setDT(df)
dt[, flag := performance == -1 & shift(performance, 1L) == 1, by = subject]
dt[(flag), performance := 0][, flag := NULL]
I chose to do it with an intermediate flag column because I expect that to perform very well for large data sets. If performance is not your concern, you could of course use ifelse or replace instead.
This is ugly, but should work:
dftest <- df
for (i in 2:nrow(dftest)) {
  if( dftest$performance[i] == -1 && dftest$performance[i - 1] == 1 ){
    if( dftest$subject[i] == dftest$subject[i - 1] ) {
      dftest$performance[i] <- 0
    }
  }
}
all.equal(df2, dftest) # ONE ERROR
This gives an error in line 29. Can you check whether your example df2 is correct here? If I understand the question correctly, df2$performance[29] should be 0?
A base R solution using by and sapply:
gr <- do.call(c, by(df, df$subject, function(x) {
  c(FALSE, unlist(sapply(1:length(x$performance), function(y)
    (x$performance[y] == -1) & (x$performance[y-1] == 1))))
}))
df[gr, 3] <- 0
cbind(df, df2)
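A slightly shorter base R variant, as a sketch only: ave() applies a function within each subject and returns the values in the original row order. The data here is a small made-up example rather than the full set from the question, and fix_one is a hypothetical helper name.

```r
# Small made-up example: two subjects
df <- data.frame(subject     = c(1, 1, 1, 1, 2, 2, 2),
                 performance = c(1, -1, -1, 1, -1, 1, -1))

# Within one subject: set -1 to 0 when the previous value was 1
fix_one <- function(p) {
  prev <- c(NA, head(p, -1))               # previous value, NA for the first row
  p[!is.na(prev) & prev == 1 & p == -1] <- 0
  p
}

# ave() runs fix_one separately for each subject
df$performance <- ave(df$performance, df$subject, FUN = fix_one)
```

The first row of subject 2 stays -1 even though the row above it is a 1, because that 1 belongs to subject 1.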