Fill dates and create new rows in dataframe - r

rptdate st
1 2/18/2017 2/12/2017
2 2/25/2017 2/19/2017
3 3/4/2017 2/26/2017
4 3/11/2017 3/5/2017
5 3/18/2017 3/12/2017
6 3/25/2017 3/19/2017
7 4/1/2017 3/26/2017
8 4/8/2017 4/2/2017
9 4/15/2017 4/9/2017
10 4/22/2017 4/16/2017
11 4/29/2017 4/23/2017
12 5/6/2017 4/30/2017
13 5/13/2017 5/7/2017
14 5/20/2017 5/14/2017
15 5/27/2017 5/21/2017
16 6/3/2017 5/28/2017
17 6/10/2017 6/4/2017
So basically rptdate is a bunch of Saturdays and st is each preceding Sunday.
I would like to reshape this dataframe (the data is in Date format) along these lines:
i = 1
j = 1
while (rptdate[i][j] >= st[i][j])
{add a new row where rptdate[i][j+1] = rptdate[i][j] and st[i][j+1] = st[i][j] + 1}
So basically, my desired new dataframe should be like this:
rptdate st
1 2/18/2017 2/12/2017
2/18/2017 2/13/2017
2/18/2017 2/14/2017
2/18/2017 2/15/2017
2/18/2017 2/16/2017
2/18/2017 2/17/2017
2/18/2017 2/18/2017
2 2/25/2017 2/19/2017
2/25/2017 2/20/2017
2/25/2017 2/21/2017
2/25/2017 2/22/2017
2/25/2017 2/23/2017
2/25/2017 2/24/2017
2/25/2017 2/25/2017
Thank you very much for your time.

Here is an idea via base R. You need to convert your variables to dates first. Then expand the data frame by repeating each row 7 times (one week per date). Generate the missing dates using seq and assign them to your st variable.
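Note that d2 below stands for the question's original two-column data frame; a minimal reconstruction (my assumption, since the answer does not show it) would be:
# hypothetical rebuild of the question's data (first three rows shown)
d2 <- data.frame(rptdate = c("2/18/2017", "2/25/2017", "3/4/2017"),
                 st = c("2/12/2017", "2/19/2017", "2/26/2017"),
                 stringsAsFactors = FALSE)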
d2[] <- lapply(d2, function(i) as.Date(i, format = '%m/%d/%Y'))
d3 <- d2[rep(row.names(d2), each = 7),]
d3$st<- do.call(c, Map(function(x, y)seq(x, y, by = 1), d2$st, d2$rptdate))
head(d3, 10)
# rptdate st
#1 2017-02-18 2017-02-12
#1.1 2017-02-18 2017-02-13
#1.2 2017-02-18 2017-02-14
#1.3 2017-02-18 2017-02-15
#1.4 2017-02-18 2017-02-16
#1.5 2017-02-18 2017-02-17
#1.6 2017-02-18 2017-02-18
#2 2017-02-25 2017-02-19
#2.1 2017-02-25 2017-02-20
#2.2 2017-02-25 2017-02-21
...

library(data.table)
dt <- data.table(V1 = as.Date(c("2/18/2017", "2/25/2017", "3/4/2017", "3/11/2017"), format = "%m/%d/%Y"),
                 V2 = as.Date(c("2/12/2017", "2/19/2017", "2/26/2017", "3/5/2017"), format = "%m/%d/%Y"))
for (i in 0:6) {
  dt[, paste0("column_", i) := V1 - i]
}
dt[, V2 := NULL]
temp <- melt(dt, id.vars = "V1")
setorder(temp, V1, value)
temp[, variable := NULL]
Note that V2 is eventually not needed, which is why it is dropped.
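A more compact variant (my sketch, not part of the answer above) skips the wide step and builds the day sequence per group directly; it assumes the original dt with V2 still present and one row per V1:
# expand each (V1, V2) pair into one row per day from V2 (Sunday) up to V1 (Saturday)
res <- dt[, .(st = seq(V2, V1, by = "day")), by = .(rptdate = V1)]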

Here is an example using functions from dplyr and lubridate. dt2 would be the final output.
# Create example data frame
dt <- read.table(text = "rptdate st
2/18/2017 2/12/2017
2/25/2017 2/19/2017
3/4/2017 2/26/2017
3/11/2017 3/5/2017
3/18/2017 3/12/2017
3/25/2017 3/19/2017
4/1/2017 3/26/2017
4/8/2017 4/2/2017
4/15/2017 4/9/2017
4/22/2017 4/16/2017
4/29/2017 4/23/2017
5/6/2017 4/30/2017
5/13/2017 5/7/2017
5/20/2017 5/14/2017
5/27/2017 5/21/2017
6/3/2017 5/28/2017
6/10/2017 6/4/2017",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(lubridate)
# Process the data
dt2 <- dt %>%
  mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
  rowwise() %>%
  do(data_frame(rptdate = rep(.$rptdate[1], 7),
                st = seq(.$st[1], .$rptdate[1], by = 1))) %>%
  mutate(rptdate = format(rptdate, "%m/%d/%Y"),
         st = format(st, "%m/%d/%Y"))
Or you can use the map2 and unnest functions from tidyverse.
# Load packages
library(tidyverse)
library(lubridate)
# Process the data
dt2 <- dt %>%
  mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
  mutate(st = map2(st, rptdate, seq, by = 1)) %>%
  unnest() %>%
  mutate(rptdate = format(rptdate, "%m/%d/%Y"),
         st = format(st, "%m/%d/%Y"))
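If your tidyr is version 1.0 or newer (an assumption about your setup), unnest() expects its columns to be named explicitly, so the same pipeline would read:
dt2 <- dt %>%
  mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
  mutate(st = map2(st, rptdate, seq, by = 1)) %>%
  unnest(cols = st) %>%  # explicit cols avoids the deprecation warning
  mutate(rptdate = format(rptdate, "%m/%d/%Y"),
         st = format(st, "%m/%d/%Y"))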

How to select rows where two dates are close to N-days apart for various number of N's efficiently?

Let's say I have the following data.table :
DT = structure(list(date = structure(c(17774, 16545, 15398, 17765,
17736, 16342, 15896, 17928, 16692, 18022), class = "Date"), exdate = structure(c(17809,
16549, 15605, 17781, 17746, 16361, 16060, 17977, 16724, 18033
), class = "Date"), price_at_entry = c(301.66, 205.27, 33.81,
321.64, 297.43, 245.26, 122.27, 312.21, 253.19, 255.34), strike_price = c(195,
212.5, 37, 255, 430, 120, 46, 320, 440, 245)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
DT[, `:=`(DTE = as.integer(difftime(exdate, date, unit = 'days')))]
date exdate price_at_entry strike_price DTE
1: 2018-08-31 2018-10-05 301.66 195.0 35
2: 2015-04-20 2015-04-24 205.27 212.5 4
3: 2012-02-28 2012-09-22 33.81 37.0 207
4: 2018-08-22 2018-09-07 321.64 255.0 16
5: 2018-07-24 2018-08-03 297.43 430.0 10
6: 2014-09-29 2014-10-18 245.26 120.0 19
7: 2013-07-10 2013-12-21 122.27 46.0 164
8: 2019-02-01 2019-03-22 312.21 320.0 49
9: 2015-09-14 2015-10-16 253.19 440.0 32
10: 2019-05-06 2019-05-17 255.34 245.0 11
I want to subset the data.table for rows whose DTE is within 10 units of various DTE_target values. My current solution is to use rbindlist and lapply to loop through the values of DTE_target, something like this:
rbindlist(
  lapply(c(7, 30, 60, 90), function(DTE_target) {
    DT[DT[, .I[abs(DTE - DTE_target) == min(abs(DTE - DTE_target)) &
                 abs(DTE - DTE_target) < 10], by = date]$V1][, DTE_target := DTE_target]
  })
)
date exdate price_at_entry strike_price DTE DTE_target
1: 2015-04-20 2015-04-24 205.27 212.5 4 7
2: 2018-08-22 2018-09-07 321.64 255.0 16 7
3: 2018-07-24 2018-08-03 297.43 430.0 10 7
4: 2019-05-06 2019-05-17 255.34 245.0 11 7
5: 2018-08-31 2018-10-05 301.66 195.0 35 30
6: 2015-09-14 2015-10-16 253.19 440.0 32 30
Is there a more data.table-like, efficient solution? I need to apply this process to potentially billions of rows. I am also open to a PostgreSQL solution. After obtaining the above result, I repeat a similar process using price_at_entry and strike_price (which in its current form introduces even more looping).
Maybe it's possible to use rolling joins, joining the data on itself using date and exdate as the keys with roll = 10? But I cannot seem to get a solution that makes sense.
Any help would be appreciated. Thanks!
EDIT:
I can't believe I missed this... Here is a potential solution that I need to keep exploring but seems to be very efficient.
DTE_target = c(7,14,30,60,90,120,150, 180, 210, 240, 270, 300)
# create a map of target DTEs with the +/- range
# (for some reason I have to duplicate the DTE column for the join to carry DTE_target)
DTE_table <- data.table(DTE = DTE_target,
                        DTE_low = DTE_target - 10,
                        DTE_high = DTE_target + 10,
                        DTE_target = DTE_target)
# map on nearest
DTE_table[DT, on = .(DTE), roll = "nearest"]
# subset on low/high range
DTE_table[DT, on = .(DTE), roll = "nearest"][DTE >= DTE_low & DTE <= DTE_high]
EDIT 2:
based on #Henrik's comment
DT[DTE_table, on = .(DTE >= DTE_low, DTE <= DTE_high), DTE_target := i.DTE_target]
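If I read it right, this is an update join: DT is modified by reference, every row whose DTE falls inside a [DTE_low, DTE_high] window gets that window's DTE_target, and rows matching no window keep NA. The in-range subset then falls out directly:
DT[!is.na(DTE_target)]  # rows within 10 units of some target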
For each DTE_target, find the DTE rows within a 10-unit range. This outputs a boolean array.
DT[, DTE := as.integer(difftime(exdate, date, unit = 'days')) ]
DTE_target <- c(7,30, 60, 90)
val = 10
bool_arr <- DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ]
Then loop through the array and find any row with TRUE. Use it to extract the rows from the original DT datatable.
selected_rows <- apply(bool_arr, 1, any)
DT[selected_rows, ]
Here is full code and output
library(data.table)
DTE_target <- c(7,30, 60, 90)
val = 10 # 10 units value
DT[apply(DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ], 1, any), ]
# date exdate price_at_entry strike_price DTE
#1: 2018-08-31 2018-10-05 301.66 195.0 35
#2: 2015-04-20 2015-04-24 205.27 212.5 4
#3: 2018-08-22 2018-09-07 321.64 255.0 16
#4: 2018-07-24 2018-08-03 297.43 430.0 10
#5: 2015-09-14 2015-10-16 253.19 440.0 32
#6: 2019-05-06 2019-05-17 255.34 245.0 11
Now use the filtered dataset to perform above function on other columns: price_at_entry and strike_price
Since you have a billion rows of data, you can split the data into chunks and apply the above function to each chunk to speed things up.
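A minimal sketch of that chunking idea (the chunk size and the split helper are my choices, not part of the answer):
chunk_size <- 1e6  # arbitrary; tune to available memory
idx_chunks <- split(seq_len(nrow(DT)), ceiling(seq_len(nrow(DT)) / chunk_size))
res <- rbindlist(lapply(idx_chunks, function(idx) {
  sub <- DT[idx]
  keep <- apply(sub[, lapply(DTE_target, function(x) abs(DTE - x) <= val)], 1, any)
  sub[keep]
}))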
Solution 2: using mutually non-exclusive target values (30 and 31)
DTE_target <- c(7,30, 31, 60, 90)
bool_arr <- DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ]
target_vals <- apply(bool_arr, 1, any)
dt_vals <- apply(bool_arr, 1, function(x) DTE_target[x])
rm(bool_arr) # remove bool_arr from memory to free up space
DT[target_vals, ][, `:=`(DTE_t = dt_vals[target_vals])][]
rm(target_vals)
rm(dt_vals)
# date exdate price_at_entry strike_price DTE DTE_t
#1: 2018-08-31 2018-10-05 301.66 195.0 35 30,31
#2: 2015-04-20 2015-04-24 205.27 212.5 4 7
#3: 2018-08-22 2018-09-07 321.64 255.0 16 7
#4: 2018-07-24 2018-08-03 297.43 430.0 10 7
#5: 2015-09-14 2015-10-16 253.19 440.0 32 30,31
#6: 2019-05-06 2019-05-17 255.34 245.0 11 7
Solution 3
Data:
library(data.table)
setDT(DT)
DT <- rbindlist(lapply(1:10^6, function(i) DT))
DTE_target <- c(7,30, 31, 60, 90)
val=10
Code
system.time({
  DT[, id := .I]
  DT[, DTE := as.integer(difftime(exdate, date, unit = 'days'))]
  DT[, DTE_t := paste(DTE_target[abs(DTE - DTE_target) <= val], collapse = ","), by = id]
  DT[, id := NULL]
})
#user system elapsed
#91.90 0.46 92.48
Output:
head(DT, 10)
# date exdate price_at_entry strike_price DTE DTE_t
# 1: 2018-08-31 2018-10-05 301.66 195.0 35 30,31
# 2: 2015-04-20 2015-04-24 205.27 212.5 4 7
# 3: 2012-02-28 2012-09-22 33.81 37.0 207
# 4: 2018-08-22 2018-09-07 321.64 255.0 16 7
# 5: 2018-07-24 2018-08-03 297.43 430.0 10 7
# 6: 2014-09-29 2014-10-18 245.26 120.0 19
# 7: 2013-07-10 2013-12-21 122.27 46.0 164
# 8: 2019-02-01 2019-03-22 312.21 320.0 49
# 9: 2015-09-14 2015-10-16 253.19 440.0 32 30,31
# 10: 2019-05-06 2019-05-17 255.34 245.0 11 7

How to check if a date range in my dataframe overlaps with any (ID-specific) range in another dataframe

I have 2 dataframes (DFs) that each contain identifiers and date ranges. In both DFs there can be numerous date ranges associated with each ID.
What I want to do is select the rows from the first DF (DF.A) for which there is an overlapping interval of any length, in the second DF (DF.B).
df.A <- data.frame("ID" = c(1,1,1,2,3,3),
"Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
"End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.frame("ID" = c(1,1,1,3,3),
"Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
"End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
Dataframe A:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
1 2019-06-10 2019-07-09
2 2017-01-01 2017-01-31
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
Dataframe B:
ID Start.B End.B
1 2019-01-01 2019-01-31
1 2019-02-01 2019-02-28
1 2019-03-01 2019-03-31
3 2015-06-01 2015-06-30
3 2015-07-01 2015-07-31
What I would like as my output is:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
I think I would be able to do this without a problem if I had a one-to-one match but, as I mentioned, in both DFs there are numerous observations for each ID. I've tried my hand at applying lubridate's interval but I'm struggling with how to look for overlaps while dealing with the added complexity of having to look up all corresponding IDs in DF.B for a potential match.
This is a very large dataset (>5 million observations in DF.A and >2 million in DF.B) so speed is crucial. Any recommendations to transform the data to make this operation as fast as possible would also be appreciated.
If helpful: For a given ID, DF.A can have observations that overlap with other observations in DF.A (e.g. ID 3 in the toy example above). Contrarily, there can be no overlaps between the DF.B intervals.
How about this?
library(data.table)
df.A <- data.table("ID" = c(1,1,1,2,3,3),
"Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
"End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.table("ID" = c(1,1,1,3,3),
"Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
"End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
And
DF <- merge(df.A, df.B, by = 'ID', allow.cartesian = TRUE)
DF$SEQ_DATE.A <- apply(DF[, c('Start.A', 'End.A'), with = FALSE], 1,
                       function(x) paste(x, collapse = ','))
DF$SEQ_DATE.A <- unlist(lapply(strsplit(DF$SEQ_DATE.A, ','), function(x) {
  out <- seq(as.Date(x[1]), as.Date(x[2]), by = 'day')
  paste(out, collapse = '|')
}))
DF$SEQ_DATE.B <- apply(DF[, c('Start.B', 'End.B'), with = FALSE], 1,
                       function(x) paste(x, collapse = ','))
DF$SEQ_DATE.B <- unlist(lapply(strsplit(DF$SEQ_DATE.B, ','), function(x) {
  out <- seq(as.Date(x[1]), as.Date(x[2]), by = 'day')
  paste(out, collapse = '|')
}))
DF$Result <- apply(DF[, c('SEQ_DATE.A', 'SEQ_DATE.B'), with = FALSE], 1,
                   function(x) grepl(x[1], x[2]))
And the result is shown below :
> DF[,-c('SEQ_DATE.A','SEQ_DATE.B'), with =F][Result == 'TRUE']
ID Start.A End.A Start.B End.B Result
1: 1 2019-01-01 2019-01-31 2019-01-01 2019-01-31 TRUE
2: 1 2019-03-15 2019-04-15 2019-03-01 2019-03-31 TRUE
3: 3 2015-05-10 2015-06-10 2015-06-01 2015-06-30 TRUE
4: 3 2015-05-15 2015-06-02 2015-06-01 2015-06-30 TRUE
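Given the row counts in the question (over 5 million in DF.A, over 2 million in DF.B), materializing one string per day will not scale. An alternative sketch of mine (not part of the answer above) uses data.table's interval overlap join, foverlaps():
library(data.table)
A <- copy(df.A)[, `:=`(Start.A = as.Date(Start.A), End.A = as.Date(End.A))]
B <- copy(df.B)[, `:=`(Start.B = as.Date(Start.B), End.B = as.Date(End.B))]
setkey(B, ID, Start.B, End.B)  # foverlaps() requires y to be keyed
hits <- foverlaps(A, B, by.x = c("ID", "Start.A", "End.A"), nomatch = 0L)
unique(hits[, .(ID, Start.A, End.A)])  # df.A rows overlapping any df.B interval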

Creating column of 0 and 1 based on inequalities of three date columns

I would like to create a column of 0s and 1s based on inequalities of three columns of dates.
The idea is the following. If event_date is before death_date or study_over, the column event should be == 1; if event_date occurs after death_date or study_over, event should be == 0. Both event_date and death_date may contain NAs.
set.seed(1337)
rand_dates <- Sys.Date() - 365:1
df <-
data.frame(
event_date = sample(rand_dates, 20),
death_date = sample(rand_dates, 20),
study_over = sample(rand_dates, 20)
)
My attempt was the following
eventR <- function(x, y, z) {
  if (is.na(y)) {
    ifelse(x <= z, 1, 0)
  } else if (y <= z) {
    ifelse(x < y, 1, 0)
  } else {
    ifelse(x <= z, 1, 0)
  }
}
I use it in the following manner
library(dplyr)
df[c(3, 5, 7), "event_date"] <- NA #there are some NA in .$event_date
df[c(3, 4, 6), "death_date"] <- NA #there are some NA in .$death_date
df %>%
  mutate(event = sapply(.$event_date, eventR, y = .$death_date, z = .$study_over))
##Error: wrong result size (400), expected 20 or 1
##In addition: There were 40 warnings (use warnings() to see them)
I can't figure out how to do this. Any suggestions?
This would seem to construct a binary column (with NAs where needed) where 1 indicates "event_date is before death_date or study_over" and 0 is used elsewhere. As already pointed out, your specification does not cover all cases:
df$event <- with(df, as.numeric( event_date < pmax( death_date , study_over) ) )
df
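If you would rather have rows with missing death_date compared against study_over alone (instead of yielding NA), pmax() takes an na.rm argument; a hedged tweak of the line above:
df$event <- with(df, as.numeric(event_date < pmax(death_date, study_over, na.rm = TRUE)))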
You can use pmap_dbl() from the purrr package instead of sapply. Note that sapply was passing the entire death_date and study_over vectors as y and z for every element of event_date, which is why the result had size 400 (20 x 20) instead of 20; pmap_dbl() iterates over the three columns in parallel.
library(dplyr)
library(purrr)
df %>% mutate(event = pmap_dbl(list(event_date, death_date, study_over), eventR))
event_date death_date study_over event
1 2016-10-20 2017-01-27 2016-12-16 1
2 2016-10-15 2016-12-12 2017-01-20 1
3 <NA> <NA> 2016-10-09 NA
4 2016-09-04 <NA> 2016-11-17 1
5 <NA> 2016-10-13 2016-06-09 NA
6 2016-07-21 <NA> 2016-04-26 0
7 <NA> 2017-02-21 2016-07-12 NA
8 2016-07-02 2017-02-08 2016-08-24 1
9 2016-06-19 2016-09-07 2016-04-11 0
10 2016-05-14 2017-03-13 2016-08-03 1
11 2017-03-06 2017-02-05 2017-02-28 0
12 2017-03-10 2016-04-28 2016-11-30 0
13 2017-01-10 2016-12-10 2016-10-27 0
14 2016-05-31 2016-06-12 2016-08-13 1
15 2017-03-03 2016-12-25 2016-12-20 0
16 2016-04-01 2016-11-03 2016-06-30 1
17 2017-02-26 2017-02-25 2016-05-12 0
18 2017-02-08 2016-12-08 2016-10-14 0
19 2016-07-19 2016-07-03 2016-09-22 0
20 2016-06-17 2016-06-06 2016-11-09 0
You might also be interested in the dplyr function, case_when() for handling many if else statements.
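For instance, eventR()'s branches map onto case_when() roughly like this (my sketch of the translation, not code from the answer):
df %>%
  mutate(event = case_when(
    is.na(event_date) ~ NA_real_,
    is.na(death_date) ~ as.numeric(event_date <= study_over),
    death_date <= study_over ~ as.numeric(event_date < death_date),
    TRUE ~ as.numeric(event_date <= study_over)
  ))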

R Calculate time difference between events

I have the following data:
Timestamp Tag OldValue NewValue
5/4/2015 8:45 MD_LL46001_BYP NORMAL MAN_BYP
5/7/2015 20:46 MD_LL46001_BYP MAN_BYP NORMAL
5/4/2015 2:58 MD_LL46101_BYP MAN_BYP NORMAL
5/7/2015 20:47 MD_LL47401_BYP MAN_BYP NORMAL
5/7/2015 20:47 MD_LL47401_BYP NORMAL MAN_BYP
5/7/2015 20:47 MD_LL47401_BYP MAN_BYP NORMAL
5/11/2015 1:37 MD_LL47401_BYP NORMAL MAN_BYP
5/11/2015 2:04 MD_LL47401_BYP MAN_BYP NORMAL
5/11/2015 11:03 MD_LL47401_BYP NORMAL MAN_BYP
5/11/2015 11:11 MD_LL47401_BYP MAN_BYP NORMAL
5/10/2015 13:16 MD_LXL21102_BYP NORMAL MAN_BYP
5/10/2015 19:42 MD_LXL21102_BYP MAN_BYP NORMAL
5/4/2015 11:40 MD_PL41201_BYP NORMAL MAN_BYP
5/4/2015 11:57 MD_PL41201_BYP MAN_BYP NORMAL
I'd like to calculate the time difference between events ("change from NORMAL to MAN_BYP" and "change from MAN_BYP to NORMAL") for the same tags. Is that possible to do in R? The output I'd like to see:
Tag Bypass put in Duration
MD_LL46001_BYP 5/4/2015 8:45 xxx minutes
MD_LL47401_BYP 5/7/2015 20:47 xx minutes
The events and tags are not necessarily consecutively placed in the table.
Welcome to SO! I'm deliberately using dplyr to refresh my skills, so be aware that some things may not be very effective or idiomatic. Anyway, here's the workflow:
1. Make proper datetime objects, encode correct status changes
2. Remove extra columns and sort by tag and time
3. Apply the described logic to each tag
Here's the code:
library(dplyr)
df <- read.table("clipboard")
names(df) <- c("date", "time", "tag", "status1", "status2")
df$datetime <- as.POSIXct(strptime(paste(df$date, df$time), "%m/%d/%Y %H:%M"))
df$status_change <- ifelse(df$status1 == "NORMAL" & df$status2 == "MAN_BYP", 1,
                           ifelse(df$status2 == "NORMAL" & df$status1 == "MAN_BYP", 2, NA))
df %>%
  select(datetime, tag, status_change) %>%
  arrange(tag, datetime) -> df2
df2 %>%
  group_by(tag) %>%
  do({
    k <- nrow(.)
    res <- rep(NA, k)
    ind <- as.logical(c(0, .$status_change[-1] - .$status_change[-k]))
    dmins <- c(NA, difftime(.$datetime[-1], .$datetime[-k], units = "mins"))
    res[ind] <- dmins[ind]
    data.frame(res)
  }) %>%
  (function(x) cbind(df2, x[, -1]))
Result:
datetime tag status_change res
1 2015-05-04 08:45:00 MD_LL46001_BYP 1 NA
2 2015-05-07 20:46:00 MD_LL46001_BYP 2 5041
3 2015-05-04 02:58:00 MD_LL46101_BYP 2 NA
4 2015-05-07 20:47:00 MD_LL47401_BYP 2 NA
5 2015-05-07 20:47:00 MD_LL47401_BYP 1 0
6 2015-05-07 20:47:00 MD_LL47401_BYP 2 0
7 2015-05-11 01:37:00 MD_LL47401_BYP 1 4610
8 2015-05-11 02:04:00 MD_LL47401_BYP 2 27
9 2015-05-11 11:03:00 MD_LL47401_BYP 1 539
10 2015-05-11 11:11:00 MD_LL47401_BYP 2 8
11 2015-05-10 13:16:00 MD_LXL21102_BYP 1 NA
12 2015-05-10 19:42:00 MD_LXL21102_BYP 2 386
13 2015-05-04 11:40:00 MD_PL41201_BYP 1 NA
14 2015-05-04 11:57:00 MD_PL41201_BYP 2 17
Notice that zero and NA do not have the same sense.
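To condense this into the Tag / "Bypass put in" / Duration layout the question asked for, one possible follow-up (my sketch, building on the sorted df2 from above):
df2 %>%
  group_by(tag) %>%
  mutate(put_in = lag(datetime),
         from_byp = lag(status_change) == 1 & status_change == 2,
         duration = as.numeric(difftime(datetime, put_in, units = "mins"))) %>%
  filter(from_byp) %>%  # keep only MAN_BYP -> NORMAL transitions
  select(tag, put_in, duration)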

Replace NAs in dates with another date

Data:
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", "NA", "2013-06-04", "2014-01-03", "NA", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
Expected Outcome:
DB1 <- data.frame(orderItemID = 1:10,
orderDate= c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", "2013-04-14", "2013-06-04", "2014-01-03", "2014-02-21", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
My question is similar to another one I posted, so don't be confused.
As you can see above, I have some missing values in the delivery dates, and I want to replace them with another date: the order date of that item plus the average delivery time in (full) days (2 days).
The average delivery time is computed from all samples that contain no missing values: (2 + 1 + 3 + 2 + 1 + 2 + 1 + 2) / 8 = 1.75 days, which rounds to 2.
So I want to replace each NA in deliveryDate with the order date + 2 days. When there's no NA, the date should stay the same.
I tried this already (with lubridate), but it's not working :(
DB1$deliveryDate[is.na(DB1$deliveryDate)] <- DB1$orderDate + days(2)
Can someone please help me?
First, convert the columns to Date objects:
DB1[, 2:3] <- lapply(DB1[, 2:3], as.Date)
Then, replace the NA elements:
DB1$deliveryDate[is.na(DB1$deliveryDate)] <-
  DB1$orderDate[is.na(DB1$deliveryDate)] +
  mean(difftime(DB1$orderDate, DB1$deliveryDate, units = "days"), na.rm = TRUE)
# orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-03-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
You can do:
DB1 <- cbind(DB1[1], as.data.frame(lapply(DB1[-1], as.Date)))  # convert the date columns first
days <- round(mean(DB1$deliveryDate - DB1$orderDate, na.rm = TRUE))
mask <- is.na(DB1$deliveryDate)
DB1$deliveryDate[mask] <- DB1$orderDate[mask] + days
# orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-04-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
I rearranged your data since it was not clean:
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-04-01", NA, "2013-06-04", "2014-01-03", NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
Assuming that you have entered your data like this (note that NAs are not enclosed in quotes so they are read as NAs and not "NA")...
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", NA, "2013-06-04", "2014-01-03", NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"),
stringsAsFactors = FALSE)
...and, per Nicola's answer, done this to get the formatting right...
DB1[,2:3]<-lapply(DB1[,2:3],as.Date)
...this also works:
library(lubridate)
DB1$deliveryDate <- with(DB1, as.Date(ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate), origin = "1970-01-01"))
Or you could use dplyr and pipe it:
library(lubridate)
library(dplyr)
DB2 <- DB1 %>%
  mutate(deliveryDate = ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate)) %>%
  mutate(deliveryDate = as.Date(.[, "deliveryDate"], origin = "1970-01-01"))
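Alternatively, once both columns are Date class, dplyr::coalesce() collapses this to a one-liner (my sketch, hard-coding the rounded 2-day mean):
DB2 <- DB1 %>%
  mutate(deliveryDate = coalesce(deliveryDate, orderDate + 2))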
