Replace NAs in dates with another date - r

Data:
DB1 <- data.frame(orderItemID = 1:10,
                  orderDate = c("2013-01-21", "2013-03-31", "2013-04-12", "2013-06-01", "2014-01-01",
                                "2014-02-19", "2014-02-27", "2014-10-02", "2014-10-31", "2014-11-21"),
                  deliveryDate = c("2013-01-23", "2013-03-01", "NA", "2013-06-04", "2014-01-03",
                                   "NA", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
Expected Outcome:
DB1 <- data.frame(orderItemID = 1:10,
                  orderDate = c("2013-01-21", "2013-03-31", "2013-04-12", "2013-06-01", "2014-01-01",
                                "2014-02-19", "2014-02-27", "2014-10-02", "2014-10-31", "2014-11-21"),
                  deliveryDate = c("2013-01-23", "2013-03-01", "2013-04-14", "2013-06-04", "2014-01-03",
                                   "2014-02-21", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
My question is similar to another one I posted, so don't be confused.
As you can see above, I have some missing values in the delivery dates, and I want to replace them with another date: the order date of the specific item plus the average delivery time in (full) days (2 days).
The average delivery time is the mean over all samples that contain no missing values: (2 + 1 + 3 + 2 + 1 + 2 + 1 + 2) days / 8 = 1.75, so 2 full days.
So I want to replace each NA in deliveryDate with the order date + 2 days; when there's no NA, the date should stay the same.
I tried this already (with lubridate), but it's not working:
DB1$deliveryDate[is.na(DB1$deliveryDate)] <- DB1$orderDate + days(2)
Can someone please help me?

First, convert the columns to Date objects:
DB1[,2:3]<-lapply(DB1[,2:3],as.Date)
Then, replace the NA elements:
DB1$deliveryDate[is.na(DB1$deliveryDate)] <-
  DB1$orderDate[is.na(DB1$deliveryDate)] +
  mean(difftime(DB1$orderDate, DB1$deliveryDate, units = "days"), na.rm = TRUE)
# orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-03-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
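Two caveats, added here rather than taken from the original answer: as.Date() only works on the deliveryDate column once the "NA" strings are real NA values (see the answers below for data entered that way), and difftime(orderDate, deliveryDate) is the negated delivery time; its mean only comes out positive on this sample because row 2 has a delivery date before its order date. With clean data you would take the difference the other way around, e.g.:
# Sketch for clean data (delivery on or after order): positive mean lead time,
# rounded to full days per the question, then added to the order dates.
avg_days <- round(mean(difftime(DB1$deliveryDate, DB1$orderDate, units = "days"), na.rm = TRUE))
DB1$deliveryDate[is.na(DB1$deliveryDate)] <- DB1$orderDate[is.na(DB1$deliveryDate)] + avg_days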

You can do:
DB1 = cbind(DB1$orderItemID, as.data.frame(lapply(DB1[-1], as.Date)))
days = round(mean(DB1$deliveryDate - DB1$orderDate, na.rm = TRUE))
mask = is.na(DB1$deliveryDate)
DB1$deliveryDate[mask] = DB1$orderDate[mask] + days
# DB1$orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-04-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
I rearranged your data since it was not clean:
DB1 <- data.frame(orderItemID = 1:10,
                  orderDate = c("2013-01-21", "2013-03-31", "2013-04-12", "2013-06-01", "2014-01-01",
                                "2014-02-19", "2014-02-27", "2014-10-02", "2014-10-31", "2014-11-21"),
                  deliveryDate = c("2013-01-23", "2013-04-01", NA, "2013-06-04", "2014-01-03",
                                   NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))

Assuming that you have entered your data like this (note that the NAs are not enclosed in quotes, so they are read as real NA values and not the string "NA")...
DB1 <- data.frame(orderItemID = 1:10,
                  orderDate = c("2013-01-21", "2013-03-31", "2013-04-12", "2013-06-01", "2014-01-01",
                                "2014-02-19", "2014-02-27", "2014-10-02", "2014-10-31", "2014-11-21"),
                  deliveryDate = c("2013-01-23", "2013-03-01", NA, "2013-06-04", "2014-01-03",
                                   NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"),
                  stringsAsFactors = FALSE)
...and, per Nicola's answer, done this to get the formatting right...
DB1[,2:3]<-lapply(DB1[,2:3],as.Date)
...this also works:
library(lubridate)
DB1$deliveryDate <- with(DB1, as.Date(ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate), origin = "1970-01-01"))
Or you could use dplyr and pipe it:
library(lubridate)
library(dplyr)
DB2 <- DB1 %>%
  mutate(deliveryDate = ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate)) %>%
  mutate(deliveryDate = as.Date(deliveryDate, origin = "1970-01-01"))
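A type-stable variant, added as a sketch (not part of the original answer): dplyr's if_else() keeps the Date class on both branches, so the numeric round-trip through origin is unnecessary, assuming both columns are already Date:
library(dplyr)
library(lubridate)
# if_else() enforces one type for both branches, so Dates stay Dates
DB2 <- DB1 %>%
  mutate(deliveryDate = if_else(is.na(deliveryDate), orderDate + days(2), deliveryDate))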

Related

Add column in dataframe based on 3 columns from another dataframe using R

I have 2 dataframes which are as follows:
Dataframe 1: traffic_df, which is hourly data.
Date_Time            Traffic
2020-03-09 06:00:00       10
2020-03-09 07:00:00       20
2020-03-10 07:00:00       20
2020-03-24 08:00:00       15
Dataframe 2: Alert.level
Start             End               Alert.level
10/03/2020 13:30  23/03/2020 13:30            2
23/03/2020 13:30  25/03/2020 23:59            3
I want to add a third column to traffic_df holding the associated Alert.level whenever Date_Time falls between the Start and End of an alert, so that the resulting dataframe looks like this:
Dataframe 1: traffic_df
Date_Time            Traffic  Alert.Level
2020-03-09 06:00:00       10
2020-03-09 07:00:00       20
2020-03-10 07:00:00       20            2
2020-03-24 08:00:00       15            3
Is there any way to do this without having to build a matching hourly dataframe and then join? I'm thinking it might be possible with the map function?
Code to produce the df:
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                             Traffic = c(10L, 20L, 20L, 15L)),
                        row.names = c(NA, -4L), class = "data.frame")
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
                         End = c("23/03/2020 13:30", "25/03/2020 23:59"),
                         Alert.level = c(2, 3))
You may try the fuzzyjoin package.
Data
library(dplyr)
library(lubridate)
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                             Traffic = c(10L, 20L, 20L, 15L)),
                        row.names = c(NA, -4L), class = "data.frame") %>%
  mutate(Date_Time = ymd_hms(Date_Time))
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
                         End = c("23/03/2020 13:30", "25/03/2020 23:59"),
                         Alert.level = c(2, 3)) %>%
  mutate(Start = dmy_hm(Start),
         End = dmy_hm(End))
Code
library(fuzzyjoin)
traffic_df %>%
  fuzzy_left_join(Alert.Level,
                  match_fun = list(`>=`, `<=`),
                  by = list(x = c("Date_Time", "Date_Time"),
                            y = c("Start", "End"))) %>%
  select(-Start, -End)
Output
In contrast to your expected output above, row three is not matched, because 07:00 is before the 13:30 start of the first alert interval.
Date_Time Traffic Alert.level
1 2020-03-09 06:00:00 10 NA
2 2020-03-09 07:00:00 20 NA
3 2020-03-10 07:00:00 20 NA
4 2020-03-24 08:00:00 15 3
Here is a solution using sqldf. Note that I copied the data.frame to a name with an underscore, since the dot is awkward in SQL.
library(sqldf)
Alert_level <- Alert.Level
sqldf("SELECT * FROM traffic_df
       LEFT JOIN Alert_level
       ON traffic_df.Date_Time BETWEEN Alert_level.Start AND Alert_level.End")
Output
Date_Time Traffic Start End Alert.level
1 2020-03-09 06:00:00 10 <NA> <NA> NA
2 2020-03-09 07:00:00 20 <NA> <NA> NA
3 2020-03-10 07:00:00 20 <NA> <NA> NA
4 2020-03-24 08:00:00 15 2020-03-23 13:30:00 2020-03-25 23:59:00 3
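For completeness, a sketch of the same range match as a data.table non-equi update join (an alternative added here, not taken from the answers; it assumes the POSIXct columns from the Data block above):
library(data.table)
tdt <- as.data.table(traffic_df)
adt <- as.data.table(Alert.Level)
# Non-equi update join: fill Alert.level where Date_Time falls in [Start, End]
tdt[adt, Alert.level := i.Alert.level,
    on = .(Date_Time >= Start, Date_Time <= End)]
tdt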
I like outer approaches in such cases. First, define a vectorized function FUN that checks whether a specific x lies inside an interval y. Put it in outer, which iterates each Date_Time over each Start/End interval of Alert.Level. This gives a matrix o that tells us which of the intervals is applicable (I use unname to avoid confusion). Then, in traffic_df, we create an NA column alert_lv (it should just have a different name than "Alert.Level"), subset it where rowSums(o) are positive, and put in the corresponding levels of Alert.Level.
FUN <- Vectorize(function(x, y) x >= y[1] & x < y[2])
(o <- unname(outer(traffic_df$Date_Time, Alert.Level[-3], FUN)))
# [,1] [,2] [,3] [,4]
# [1,] FALSE FALSE TRUE FALSE
# [2,] FALSE FALSE FALSE TRUE
w <- unlist(apply(o, 1, which))
traffic_df <- within(traffic_df, {
  alert_lv <- NA
  alert_lv[rowSums(o) > 0] <- Alert.Level[w, 3]
})
traffic_df
# Date_Time Traffic alert_lv
# 1 2020-03-09 06:00:00 10 NA
# 2 2020-03-09 07:00:00 20 NA
# 3 2020-03-10 07:00:00 20 2
# 4 2020-03-24 08:00:00 15 3
Note: to use this solution you first need the usual 'POSIXct' formats, so run this first:
traffic_df$Date_Time <- as.POSIXct(traffic_df$Date_Time)
Alert.Level[1:2] <- lapply(Alert.Level[1:2], strptime, format='%d/%m/%Y %H:%M')

Calculate the difference in time in days for character dates with blank spaces in one of the columns

I have the following data. Both columns are dates, and I have to take the difference in days. However, most of the values in one of the date columns are blank, so I have to return NA for those.
a b
02-07-2012
18-08-2012
13-08-2012
16-04-2012
26-04-2012
03-05-2012 12-05-2012
09-06-2012
30-05-2012
22-06-2012
05-07-2012
30-06-2012
09-05-2012
22-06-2012
02-07-2012
17-07-2012
17-08-2012
16-07-2012
01-08-2012
05-08-2012
17-08-2012
30-04-2012
05-07-2012
07-04-2012
27-04-2012
21-06-2012
03-07-2012
21-07-2012
24-04-2012
05-06-2012
03-07-2012
02-04-2012 01-06-2012
06-04-2012
15-04-2012
16-06-2012
01-08-2012
13-05-2012
09-07-2012
09-07-2012
18-04-2012
09-08-2012
10-04-2012
12-05-2012
04-04-2012
04-06-2012 04-06-2012
15-06-2012
02-07-2012
05-07-2012
21-08-2012
19-07-2012
06-08-2012
15-06-2012
06-04-2012
04-06-2012
23-07-2012
06-04-2012
12-04-2012 11-06-2012
24-05-2012
03-08-2012
04-05-2012 09-05-2012
07-05-2012
07-06-2012
06-07-2012
13-07-2012
26-07-2012
26-04-2012
22-06-2012
26-07-2012
12-04-2012
07-08-2012
27-06-2012
03-04-2012 02-06-2012
13-04-2012
28-07-2012
07-05-2012
29-06-2012
03-04-2012 02-06-2012
04-04-2012
04-04-2012 24-05-2012
04-04-2012
05-04-2012
07-04-2012
10-04-2012
11-04-2012
13-04-2012
13-04-2012
13-04-2012
13-04-2012
14-04-2012
14-04-2012
14-04-2012
18-04-2012
19-04-2012
21-04-2012
25-04-2012
25-04-2012
26-04-2012
26-04-2012
26-04-2012
27-04-2012
30-04-2012
04-06-2012
04-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012 16-07-2012
06-06-2012 29-06-2012
I tried the following but couldn't succeed:
date_strings[date_strings==""]<-NA # Replaced blank spaces with NA & removed them
head(date_strings)
newdata<-na.omit(date_strings)
str(newdata)
newdata$a<-as.Date(newdata$a,"%m%d%y")
newdata$b<-as.Date(newdata$b,"%m%d%y")
diff_in_days = difftime(newdata$a, newdata$b, units = "days") # days
Change the dates to Date class, which will turn the blanks into NA automatically, and then subtract using difftime.
date_strings[] <- lapply(date_strings, as.Date, format = '%d-%m-%Y')
date_strings$diff_in_days = difftime(date_strings$b, date_strings$a, units = "days")
date_strings
# a b diff_in_days
#1 2012-07-02 <NA> NA
#2 2012-08-18 <NA> NA
#3 2012-08-13 <NA> NA
#4 2012-04-16 <NA> NA
#5 2012-04-26 <NA> NA
#6 2012-05-03 2012-05-12 9
Or subtract directly:
date_strings$diff_in_days = date_strings$b - date_strings$a
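If a plain numeric column is preferred over a difftime, wrapping the difference in as.numeric (an extra step added here, not part of the answer) drops the units:
# Numeric day counts instead of a difftime in days
date_strings$diff_in_days <- as.numeric(date_strings$b - date_strings$a)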
data
date_strings <- structure(list(a = c("02-07-2012", "18-08-2012", "13-08-2012",
"16-04-2012", "26-04-2012", "03-05-2012"), b = c("", "", "",
"", "", "12-05-2012")), class = "data.frame", row.names = c(NA, -6L))
With the tidyverse, we can do
library(dplyr)
library(lubridate)
date_strings %>%
  mutate(across(everything(), dmy)) %>%
  mutate(diff_in_days = b - a)
# a b diff_in_days
#1 2012-07-02 <NA> NA days
#2 2012-08-18 <NA> NA days
#3 2012-08-13 <NA> NA days
#4 2012-04-16 <NA> NA days
#5 2012-04-26 <NA> NA days
#6 2012-05-03 2012-05-12 9 days
data
date_strings <- structure(list(a = c("02-07-2012", "18-08-2012", "13-08-2012",
"16-04-2012", "26-04-2012", "03-05-2012"), b = c("", "", "",
"", "", "12-05-2012")), class = "data.frame", row.names = c(NA, -6L))

How to check if a date range in my dataframe overlaps with any (ID-specific) range in another dataframe

I have 2 dataframes (DFs) that each contain identifiers and date ranges. In both DFs there can be numerous date ranges associated with each ID.
What I want to do is select the rows from the first DF (DF.A) for which there is an overlapping interval of any length, in the second DF (DF.B).
df.A <- data.frame("ID" = c(1, 1, 1, 2, 3, 3),
                   "Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
                   "End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.frame("ID" = c(1, 1, 1, 3, 3),
                   "Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
                   "End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
Dataframe A:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
1 2019-06-10 2019-07-09
2 2017-01-01 2017-01-31
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
Dataframe B:
ID Start.B End.B
1 2019-01-01 2019-01-31
1 2019-02-01 2019-02-28
1 2019-03-01 2019-03-31
3 2015-06-01 2015-06-30
3 2015-07-01 2015-07-31
What I would like as my output is:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
I think I would be able to do this without a problem if I had a one-to-one match but, as I mentioned, both DFs have numerous observations per ID. I've tried applying lubridate's interval, but I'm struggling with how to look for overlaps while also having to look up all corresponding IDs in DF.B for a potential match.
This is a very large dataset (>5 million observations in DF.A and >2 million in DF.B) so speed is crucial. Any recommendations to transform the data to make this operation as fast as possible would also be appreciated.
If helpful: For a given ID, DF.A can have observations that overlap with other observations in DF.A (e.g. ID 3 in the toy example above). Contrarily, there can be no overlaps between the DF.B intervals.
How about this?
library(data.table)
df.A <- data.table("ID" = c(1, 1, 1, 2, 3, 3),
                   "Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
                   "End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.table("ID" = c(1, 1, 1, 3, 3),
                   "Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
                   "End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
And
DF = merge(df.A, df.B, by = 'ID', allow.cartesian = TRUE)
# Expand each A interval into a '|'-separated string of its days
DF$SEQ_DATE.A = apply(DF[, c('Start.A', 'End.A'), with = FALSE], 1, function(x) paste(x, collapse = ','))
DF$SEQ_DATE.A = unlist(lapply(strsplit(DF$SEQ_DATE.A, ','), function(x) {
  out = seq(as.Date(x[1]), as.Date(x[2]), by = 'day')
  out = paste(out, collapse = '|')
  return(out)
}))
# Same for the B intervals
DF$SEQ_DATE.B = apply(DF[, c('Start.B', 'End.B'), with = FALSE], 1, function(x) paste(x, collapse = ','))
DF$SEQ_DATE.B = unlist(lapply(strsplit(DF$SEQ_DATE.B, ','), function(x) {
  out = seq(as.Date(x[1]), as.Date(x[2]), by = 'day')
  out = paste(out, collapse = '|')
  return(out)
}))
# '|' acts as regex alternation, so this is TRUE if any day of A occurs in B's day list
DF$Result = apply(DF[, c('SEQ_DATE.A', 'SEQ_DATE.B'), with = FALSE], 1, function(x) grepl(x[1], x[2]))
The result is shown below:
> DF[,-c('SEQ_DATE.A','SEQ_DATE.B'), with =F][Result == 'TRUE']
ID Start.A End.A Start.B End.B Result
1: 1 2019-01-01 2019-01-31 2019-01-01 2019-01-31 TRUE
2: 1 2019-03-15 2019-04-15 2019-03-01 2019-03-31 TRUE
3: 3 2015-05-10 2015-06-10 2015-06-01 2015-06-30 TRUE
4: 3 2015-05-15 2015-06-02 2015-06-01 2015-06-30 TRUE
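Given the question's emphasis on speed with millions of rows, here is a sketch of an overlap join with data.table::foverlaps, which avoids materializing day sequences or a cartesian merge (an addition, not part of the answer above; it assumes a reasonably recent data.table):
library(data.table)
A <- as.data.table(df.A)
B <- as.data.table(df.B)
A[, c("Start.A", "End.A") := lapply(.SD, as.Date), .SDcols = c("Start.A", "End.A")]
B[, c("Start.B", "End.B") := lapply(.SD, as.Date), .SDcols = c("Start.B", "End.B")]
# foverlaps() requires y to be keyed, with the interval as the last two key columns
setkey(B, ID, Start.B, End.B)
# Row indices of A having any same-ID overlapping interval in B
hits <- foverlaps(A, B,
                  by.x = c("ID", "Start.A", "End.A"),
                  type = "any", nomatch = NULL, which = TRUE)
A[sort(unique(hits$xid))]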

How do I check if a date is between two values in R?

I have a table that looks like this;
user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC
I have another table that has every week in 2018.
week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30
... ... ...
Assume the week_start is a Monday and week_end is a Sunday.
I'd like to do two things. I'd first like to join the week_id to the first table and then I'd like to assign a day to each of the timestamps. My output would look like this:
user_id timestamp week_id day_of_week
aa 2018-01-01 12:01 UTC 1 Monday
ab 2018-01-02 05:01 UTC 1 Tuesday
bb 2018-01-13 09:01 UTC 2 Friday
bc 2018-01-28 23:01 UTC 4 Friday
cc 2018-01-06 11:01 UTC 1 Saturday
In Excel I could easily do this with a vlookup. My main interest is to learn how to join tables in cases like this. For that reason, I won't accept answers that use the weekday function.
Here are both of the tables in a more accessible format.
user_id <- c("aa", "ab", "bb", "bc", "cc")
timestamp <- c("2018-01-01 12:01", "2018-01-01 05:01", "2018-06-01 09:01", "2018-03-03 23:01", "2018-01-02 11:01")
week_id <- seq(1,52)
week_start <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 7)
week_end <- week_start + 6
week_start <- week_start[1:52]
week_end <- week_end[1:52]
table1 <- data.frame(user_id, timestamp)
table2 <- data.frame(week_id, week_start, week_end)
Using SQL one can join two tables on a range like this. This seems the most elegant solution, expressing our intent directly, but we also provide some alternatives further below.
library(sqldf)
DF1$date <- as.Date(DF1$timestamp)
sqldf("select *
from DF1 a
left join DF2 b on date between week_start and week_end")
giving:
user_id timestamp date week_id week_start week_end
1 aa 2018-01-01 12:01:00 2018-01-01 1 2018-01-01 2018-01-07
2 ab 2018-01-01 05:01:00 2018-01-01 1 2018-01-01 2018-01-07
3 bb 2018-06-01 09:01:00 2018-06-01 NA <NA> <NA>
4 bc 2018-03-03 23:01:00 2018-03-04 NA <NA> <NA>
5 cc 2018-01-02 11:01:00 2018-01-02 1 2018-01-01 2018-01-07
dplyr
In a comment the poster asked whether it could be done in dplyr. It can't be done directly, since dplyr does not support complex joins, but a workaround is to do a full cross join of the two data frames, which gives rise to an nrow(DF1) * nrow(DF2) intermediate result, and then filter it down. dplyr does not directly support cross joins either, but we can simulate one by doing a full join on an identical dummy constant column appended to both data frames. Since we actually need a right join here to add back the unmatched rows, we do a final right join with the original DF1 data frame. Obviously this is entirely impractical for sufficiently large inputs, but for the small input here we can do it. If it were known that there is a match in DF2 for every row of DF1, the right_join at the end could be omitted.
DF1 %>%
  mutate(date = as.Date(timestamp), dummy = 1) %>%
  full_join(DF2 %>% mutate(dummy = 1)) %>%
  filter(date >= week_start & date <= week_end) %>%
  select(-dummy) %>%
  right_join(DF1)
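Since this was written, dplyr (version 1.1.0 and later) gained native non-equi joins via join_by(), so the cross-join workaround is no longer needed; a sketch assuming that version:
library(dplyr)  # >= 1.1.0 for join_by()
DF1 %>%
  mutate(date = as.Date(timestamp)) %>%
  left_join(DF2, by = join_by(between(date, week_start, week_end)))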
R Base
findix finds the index in DF2 corresponding to a date d. We then sapply it over the dates corresponding to rows of DF1 and put DF1 and the corresponding DF2 row together.
findix <- function(d) c(which(d >= DF2$week_start & d <= DF2$week_end), NA)[1]
cbind(DF1, DF2[sapply(as.Date(DF1$timestamp), findix), ])
Note
The input data in reproducible form used is:
Lines1 <- "user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC"
# insert the comma after the first field only, since the timestamps contain spaces
DF1 <- read.csv(text = gsub("(?m)^(\\S+)\\s+", "\\1,", Lines1, perl = TRUE), strip.white = TRUE)
DF1$timestamp <- as.POSIXct(DF1$timestamp)
Lines2 <- "week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30"
DF2 <- read.table(text = Lines2, header = TRUE)
DF2$week_start <- as.Date(DF2$week_start)
DF2$week_end <- as.Date(DF2$week_end)
This is a case for the fuzzyjoin package. With the match_fun argument we can specify a condition per column pair: in this case table1$date >= table2$week_start and table1$date <= table2$week_end.
library(fuzzyjoin)
library(lubridate)
library(dplyr)
table1$date <- as.Date(table1$timestamp)
fuzzy_left_join(table1, table2,
                by = c("date" = "week_start", "date" = "week_end"),
                match_fun = list(`>=`, `<=`)) %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  select(user_id, timestamp, week_id, day_of_week)
user_id timestamp week_id day_of_week
1 aa 2018-01-01 12:01 1 Mo
2 ab 2018-01-01 05:01 1 Mo
3 bb 2018-06-01 09:01 22 Fr
4 bc 2018-03-03 23:01 9 Sa
5 cc 2018-01-02 11:01 1 Di
I'm also being a smartass here, because I didn't use the weekday function but wday from the lubridate package (hence the German day labels from my locale in the output above).
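A sketch for pinning the labels to English regardless of the session locale, assuming a lubridate version whose wday() has the locale argument (on Windows the locale name may be "English" rather than "C"):
library(lubridate)
# Force English day abbreviations instead of the session's LC_TIME locale
wday(as.Date("2018-01-01"), label = TRUE, abbr = TRUE, locale = "C")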

dplyr window functions with order_by and with_order

BACKGROUND
dplyr has window functions. When you want to control the order in which a window function is applied, you can use order_by.
DATA
mydf <- data.frame(id = c("ana", "bob", "caroline",
"bob", "ana", "caroline"),
order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
"2015-01-01 19:20:00", "2015-01-01 09:07:00",
"2015-01-01 08:30:00", "2015-01-01 11:11:00"),
format = "%Y-%m-%d %H:%M:%S"),
value = runif(6, 10, 20),
stringsAsFactors = FALSE)
# id order value
#1 ana 2015-01-01 18:00:00 19.00659
#2 bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4 bob 2015-01-01 09:07:00 14.40996
#5 ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865
Suppose you want to use lag(); you can do the following.
arrange(mydf, id, order) %>%
group_by(id) %>%
mutate(check = lag(value))
# id order value check
#1 ana 2015-01-01 08:30:00 17.45165 NA
#2 ana 2015-01-01 18:00:00 19.00659 17.45165
#3 bob 2015-01-01 09:07:00 14.40996 NA
#4 bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865 NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865
However, you can avoid using arrange() with order_by().
group_by(mydf, id) %>%
mutate(check = lag(value, order_by = order))
# id order value check
#1 ana 2015-01-01 18:00:00 19.00659 17.45165
#2 bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4 bob 2015-01-01 09:07:00 14.40996 NA
#5 ana 2015-01-01 08:30:00 17.45165 NA
#6 caroline 2015-01-01 11:11:00 14.50865 NA
EXPERIMENT
I wanted to apply the same procedure to a case in which I wanted to assign row numbers to a new column. Using the sample data, you can do the following.
group_by(mydf, id) %>%
arrange(order) %>%
mutate(num = row_number())
# id order value num
#1 ana 2015-01-01 08:30:00 17.45165 1
#2 ana 2015-01-01 18:00:00 19.00659 2
#3 bob 2015-01-01 09:07:00 14.40996 1
#4 bob 2015-01-01 18:05:00 13.64010 2
#5 caroline 2015-01-01 11:11:00 14.50865 1
#6 caroline 2015-01-01 19:20:00 12.08506 2
Can we omit the arrange line? Going by the CRAN manual, I tried the following.
Neither attempt was successful.
### Not working
group_by(mydf, id) %>%
mutate(num = row_number(order_by = order))
### Not working
group_by(mydf, id) %>%
mutate(num = order_by(order, row_number()))
How can we achieve this?
I did not mean to answer this question myself, but I decided to share what I found, given that I have not seen many posts using order_by and particularly with_order. My answer is to use with_order() instead of order_by().
group_by(mydf, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))
# id order value num
#1 ana 2015-01-01 18:00:00 19.00659 2
#2 bob 2015-01-01 18:05:00 13.64010 2
#3 caroline 2015-01-01 19:20:00 12.08506 2
#4 bob 2015-01-01 09:07:00 14.40996 1
#5 ana 2015-01-01 08:30:00 17.45165 1
#6 caroline 2015-01-01 11:11:00 14.50865 1
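A shorter route, added here as an alternative (not from the original post): row_number(x) ranks by x, and with no ties in order that rank is exactly the row number in time order, so the with_order() wrapper can be skipped. The speed comparison below still refers to the arrange() and with_order() versions.
library(dplyr)
# row_number(order) ranks within each group by the order column;
# with distinct timestamps this matches the with_order() result above
group_by(mydf, id) %>%
  mutate(num = row_number(order))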
I wanted to see whether there would be any difference between the two approaches in terms of speed. They appear pretty similar in this case.
library(microbenchmark)
mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
"bob", "ana", "caroline"), times = 200000),
order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
length.out = 1200000),
value = runif(1200000, 10, 20),
stringsAsFactors = FALSE)
jazz1 <- function() {group_by(mydf2, id) %>%
arrange(order) %>%
mutate(num = row_number())}
jazz2 <- function() {group_by(mydf2, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))}
res <- microbenchmark(jazz1, jazz2, times = 1000000L)
res
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# jazz1 32 36 47.17647 38 47 12308 1e+06 a
# jazz2 32 36 47.02902 38 47 12402 1e+06 a
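One caveat on this benchmark, added as an observation: microbenchmark(jazz1, jazz2) times the evaluation of the bare symbols, not calls to the functions, which is why the results are in nanoseconds. Timing the pipelines themselves requires calling them, e.g. with far fewer iterations:
# Note the parentheses: the pipelines actually run this way
res <- microbenchmark(jazz1(), jazz2(), times = 100L)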
