Update two columns that are interdependent row-wise using data.table

I want to create a data.table with the departure and arrival times between bus stops. This is the format of my data.table. (reproducible dataset below)
    trip_id stop_sequence arrival_time departure_time travel_time
 1:       a             1     07:00:00       07:00:00    00:00:00
 2:       a             2     00:00:00       00:00:00    00:02:41
 3:       a             3     00:00:00       00:00:00    00:01:36
 4:       a             4     00:00:00       00:00:00    00:02:39
 5:       a             5     00:00:00       00:00:00    00:02:28
 6:       b             1     07:00:00       07:00:00    00:00:00
 7:       b             2     00:00:00       00:00:00    00:00:45
 8:       b             3     00:00:00       00:00:00    00:01:36
 9:       b             4     00:00:00       00:00:00    00:00:37
10:       b             5     00:00:00       00:00:00    00:03:00
Here is how it should work. The idea is that a vehicle travels following the stop sequence. In trip a, for example, it takes 00:02:41 for the vehicle to travel from stop 1 to stop 2. Given a fixed time of 40 seconds for passengers to enter/leave the vehicle at each stop, the bus would depart from stop 2 at 07:03:21.
The catch is that this is a row-wise iterative process between two columns. Intuitively, I would reach for a for loop with set() in data.table, but I couldn't get my head around it. Help?
reproducible dataset:
library(data.table)
library(chron)
dt <- structure(list(trip_id = c("a", "a", "a", "a", "a", "b", "b",
"b", "b", "b"), stop_sequence = c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), arrival_time = structure(c(0.291666666666667, 0,
0, 0, 0, 0.291666666666667, 0, 0, 0, 0), format = "h:m:s", class = "times"),
departure_time = structure(c(0.291666666666667, 0, 0, 0,
0, 0.291666666666667, 0, 0, 0, 0), format = "h:m:s", class = "times"),
travel_time = structure(c(0, 0.00186598685444013, 0.00110857958406301,
0.00183749407361369, 0.00171664297781446, 0, 0.000522388450578203,
0.00111473367541453, 0.000427755975518318, 0.00207918951573377
), format = "h:m:s", class = "times")), .Names = c("trip_id",
"stop_sequence", "arrival_time", "departure_time", "travel_time"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
))
Expected output (first four rows):
   trip_id stop_sequence arrival_time departure_time travel_time
1:       a             1     07:00:00       07:00:00    00:00:00
2:       a             2     07:02:41       07:03:21    00:02:41
3:       a             3     07:04:57       07:05:37    00:01:36
4:       a             4     07:08:16       07:08:56    00:02:39

I think it's possible to do this without looping: you can calculate departure_time with a cumulative sum, and once you have that, arrival_time is just departure_time minus 40 seconds:
dt2 <- copy(dt)
dt2[,c("arrival_time", "departure_time") := .(cumsum(arrival_time + ifelse(travel_time==0, 0, travel_time + times("00:00:40"))) - ifelse(travel_time == 0 , 0, times("00:00:40")),
cumsum(arrival_time + ifelse(travel_time==0, 0, travel_time + times("00:00:40")))),
by = trip_id]
dt2
#    trip_id stop_sequence arrival_time departure_time travel_time
# 1:       a             1     07:00:00       07:00:00    00:00:00
# 2:       a             2     07:02:41       07:03:21    00:02:41
# 3:       a             3     07:04:57       07:05:37    00:01:36
# 4:       a             4     07:08:16       07:08:56    00:02:39
# 5:       a             5     07:11:24       07:12:04    00:02:28
# 6:       b             1     07:00:00       07:00:00    00:00:00
# 7:       b             2     07:00:45       07:01:25    00:00:45
# 8:       b             3     07:03:01       07:03:41    00:01:36
# 9:       b             4     07:04:18       07:04:58    00:00:37
#10:       b             5     07:07:58       07:08:38    00:03:00
Alternatively, so you don't have to repeat the long cumsum for departure_time to get arrival_time, you could do:
dt2[,departure_time := cumsum(arrival_time + ifelse(travel_time==0, 0, travel_time + times("00:00:40"))), by = trip_id]
dt2[, arrival_time := departure_time - ifelse(travel_time == 0 , 0, times("00:00:40"))]
A third option posted by @eddi:
dt[, departure_time := arrival_time[1] + cumsum(travel_time) + (0:(.N-1))*times('00:00:40'), by = trip_id]
dt[, arrival_time := c(arrival_time[1], tail(departure_time, -1) - times('00:00:40')), by = trip_id]
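For completeness, the literal row-wise for/set() loop the question alludes to could look something like this (a sketch, assuming the dt from the dput above; dt3 and dwell are names introduced here):
dt3 <- copy(dt)            # start from a fresh copy of the original data
dwell <- times("00:00:40")
for (i in 2:nrow(dt3)) {
  if (dt3$trip_id[i] == dt3$trip_id[i - 1]) {
    # arrival = previous stop's departure + travel time to this stop
    set(dt3, i, "arrival_time", dt3$departure_time[i - 1] + dt3$travel_time[i])
    # departure = arrival + the fixed 40-second dwell
    set(dt3, i, "departure_time", dt3$arrival_time[i] + dwell)
  }
}
It gives the same result as the vectorised versions; set() keeps the per-row overhead low for this kind of dependent update.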

Related

R Merge two data frames based on nearest date and time match

I've seen various solutions for this question based on date only, but the time component is tripping me up. I have two data frames with POSIXct columns called 'datetime'. For DF1 that column has data rounded to the nearest hour. For DF2, the time component is not rounded to the nearest hour and can occur anytime. The dataframes look like this:
DF1
datetime               X   Y   Z
2020-09-01 03:00:00    1   3   4
2020-09-02 12:00:00   12   3   5
2020-09-02 22:00:00    4   9  19
2020-09-03 01:00:00    4  10   2
2020-09-04 06:00:00    4  12   1
2020-09-04 08:00:00   11  13  10
DF2
datetime               Var
2020-09-01 02:23:14    A
2020-09-01 03:12:09    B
2020-09-02 11:52:15    A
2020-09-02 12:15:44    B
2020-09-02 22:31:56    A
2020-09-02 21:38:05    B
2020-09-03 01:11:39    A
2020-09-03 00:59:33    B
2020-09-04 05:12:19    A
2020-09-04 06:07:09    B
2020-09-04 08:22:28    A
2020-09-04 07:50:17    B
What I want is to merge these two dataframes based on this column using the date and time that are closest in time to 'datetime' in DF1, so that it looks like this:
datetime               X   Y   Z  Var
2020-09-01 03:00:00    1   3   4  B
2020-09-02 12:00:00   12   3   5  A
2020-09-02 22:00:00    4   9  19  B
2020-09-03 01:00:00    4  10   2  B
2020-09-04 06:00:00    4  12   1  B
2020-09-04 08:00:00   11  13  10  B
Thank you!
Add helper columns for merging and grouping, merge on the date, and then use dplyr to keep the row closest in time within each group:
library(dplyr)
df1$tmp <- as.Date(df1$datetime)
df2$tmp <- as.Date(df2$datetime)
df1$grp <- 1:nrow(df1)
merge(df1, df2, "tmp") %>%
  group_by(grp) %>%
  slice(which.min(abs(difftime(datetime.x, datetime.y)))) %>%
  ungroup() %>%
  select(-c(tmp, grp, datetime.y))
# A tibble: 6 × 5
  datetime.x              X     Y     Z Var
  <chr>               <int> <int> <int> <chr>
1 2020-09-01 03:00:00     1     3     4 B
2 2020-09-02 12:00:00    12     3     5 A
3 2020-09-02 22:00:00     4     9    19 B
4 2020-09-03 01:00:00     4    10     2 B
5 2020-09-04 06:00:00     4    12     1 B
6 2020-09-04 08:00:00    11    13    10 B
Data
df1 <- structure(list(datetime = c("2020-09-01 03:00:00", "2020-09-02 12:00:00",
"2020-09-02 22:00:00", "2020-09-03 01:00:00", "2020-09-04 06:00:00",
"2020-09-04 08:00:00"), X = c(1L, 12L, 4L, 4L, 4L, 11L), Y = c(3L,
3L, 9L, 10L, 12L, 13L), Z = c(4L, 5L, 19L, 2L, 1L, 10L)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(datetime = c("2020-09-01 02:23:14", "2020-09-01 03:12:09",
"2020-09-02 11:52:15", "2020-09-02 12:15:44", "2020-09-02 22:31:56",
"2020-09-02 21:38:05", "2020-09-03 01:11:39", "2020-09-03 00:59:33",
"2020-09-04 05:12:19", "2020-09-04 06:07:09", "2020-09-04 08:22:28",
"2020-09-04 07:50:17"), Var = c("A", "B", "A", "B", "A", "B",
"A", "B", "A", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-12L))
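As an aside, data.table's rolling join can do the nearest-time match directly (a sketch, assuming the df1/df2 above; unlike the merge-by-day approach, it also matches across midnight):
library(data.table)
d1 <- as.data.table(df1)[, datetime := as.POSIXct(datetime)]
d2 <- as.data.table(df2)[, datetime := as.POSIXct(datetime)]
# for each row of d1, take the d2 row whose datetime is nearest
d2[d1, on = "datetime", roll = "nearest"]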

Imputing dates to empty cells for large dataset

I have a dataset that looks like below:
PPID  join_date   week  date        visit
A     2017-10-01  1     NA          0
A     2017-10-01  2     2017-10-08  2
A     2017-10-01  3     2017-10-15  1
A     2017-10-01  4     NA          0
B     2017-05-23  1     2017-05-21  4
B     2017-05-23  2     2017-05-28  2
B     2017-05-23  3     NA          0
week indicates the difference, in weeks, between date and the Sunday of the week of join_date (e.g. for participant B, the Sunday of the week of 2017-05-23 is 2017-05-21; thus participant B's week 1 starts on 2017-05-21, and week 2 starts on 2017-05-28).
My goal is to fill in date where it is currently NA, such that the output looks like below:
PPID  join_date   week  date        visit
A     2017-10-01  1     2017-10-01  0
A     2017-10-01  2     2017-10-08  2
A     2017-10-01  3     2017-10-15  1
A     2017-10-01  4     2017-10-22  0
B     2017-05-23  1     2017-05-21  4
B     2017-05-23  2     2017-05-28  2
B     2017-05-23  3     2017-06-04  0
The code I currently have is:
library(dplyr)
library(lubridate)
df2 <- df %>%
  group_by(PPID) %>%
  mutate(date = seq(unique(floor_date(as.Date(join_date), "weeks")),
                    unique(floor_date(as.Date(join_date), "weeks") + 7*(max(week)-1)),
                    by = "week"))
The problem with this approach is that I'm working with a large dataset (~8 million observations) and it takes forever to run! I read in some posts that all the date conversions/calculations (e.g. floor_date or as.Date) are what take so long, and I was wondering if there are ways to make my code more efficient.
Thanks!
How about simply
df2$date = floor_date(df2$join_date, 'week') + 7*(df2$week-1)
#   PPID  join_date week       date visit
# 1    A 2017-10-01    1 2017-10-01     0
# 2    A 2017-10-01    2 2017-10-08     2
# 3    A 2017-10-01    3 2017-10-15     1
# 4    A 2017-10-01    4 2017-10-22     0
# 5    B 2017-05-23    1 2017-05-21     4
# 6    B 2017-05-23    2 2017-05-28     2
# 7    B 2017-05-23    3 2017-06-04     0
Although this calculates floor_date for every row, it is vectorised rather than looping (as you did implicitly using by), so it should be fast enough for most purposes. If you need an even bigger speed-up, you could subset on is.na(df2$date) to only calculate the rows you need to impute.
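For example, a sketch of that subsetting idea (only the NA rows are recomputed):
# impute only the rows where date is missing
idx <- is.na(df2$date)
df2$date[idx] <- floor_date(df2$join_date[idx], "week") + 7*(df2$week[idx] - 1)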
Data:
df2 = structure(list(PPID = c("A", "A", "A", "A", "B", "B", "B"), join_date = structure(c(17440,
17440, 17440, 17440, 17309, 17309, 17309), class = "Date"), week = c(1L,
2L, 3L, 4L, 1L, 2L, 3L), date = structure(c(NA, 17447, 17454,
NA, 17307, 17314, NA), class = "Date"), visit = c(0L, 2L, 1L,
0L, 4L, 2L, 0L)), row.names = c(NA, -7L), class = "data.frame")

Efficient solution to (recursively) replace NAs with the mean of lags, by group

I need to replace NAs with the mean of previous three values, by group.
Once an NA is replaced, it serves as input for computing the mean for the next NA (if the next NA is within the next three months).
Here is an example:
id  date        value
1   2017-04-01  40
1   2017-05-01  40
1   2017-06-01  10
1   2017-07-01  NA
1   2017-08-01  NA
2   2014-01-01  27
2   2014-02-01  13
Data:
dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), date = structure(c(17257, 17287, 17318, 17348, 17379, 16071, 16102), class = "Date"), value = c(40, 40, 10, NA, NA, 27, 13)), row.names = c(1L, 2L, 3L, 4L, 5L, 8L, 9L), class = "data.frame")
The output should look like:
id  date        value
1   2017-04-01  40.00
1   2017-05-01  40.00
1   2017-06-01  10.00
1   2017-07-01  30.00
1   2017-08-01  26.66
2   2014-01-01  27.00
2   2014-02-01  13.00
where 26.66 = (30 + 10 + 40)/3
What is an efficient way to do this (i.e. to avoid for loops)?
The following uses base R only and does what you need.
sp <- split(dt, dt$id)
sp <- lapply(sp, function(DF){
  for (i in which(is.na(DF$value))) {
    tmp <- DF[seq_len(i - 1), ]
    DF$value[i] <- mean(tail(tmp$value, 3))
  }
  DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
#  id       date    value
#1  1 2017-04-01 40.00000
#2  1 2017-05-01 40.00000
#3  1 2017-06-01 10.00000
#4  1 2017-07-01 30.00000
#5  1 2017-08-01 26.66667
#6  2 2014-01-01 27.00000
#7  2 2014-02-01 13.00000
Define a roll function which takes the previous (up to three) values as a list plus the current value. It returns, as a list, the last two previous values together with the current value if the current value is not NA, or with the mean of the previous values if it is NA. Use that with Reduce, pick off the last element of each list in the result, and apply the whole thing to each group using ave.
roll <- function(prev, cur) {
  prev <- unlist(prev)
  list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}
reduce_roll <- function(x) {
  sapply(Reduce(roll, x[-1], init = x[1], accumulate = TRUE), tail, 1)
}
transform(dt, value = ave(value, id, FUN = reduce_roll))
giving:
   id       date    value
1   1 2017-04-01       40
2   1 2017-05-01       40
3   1 2017-06-01       10
4   1 2017-07-01       30
5   1 2017-08-01 26.66667
8   2 2014-01-01       27
9   2 2014-02-01       13
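If you prefer data.table, the same forward-recursive fill can run per group; a minimal sketch, assuming the dt from the question (fill_mean3 is a helper name introduced here):
library(data.table)
DT <- as.data.table(dt)
# replace each NA with the mean of up to three preceding values;
# earlier imputations feed into later ones within the group
fill_mean3 <- function(v) {
  for (i in which(is.na(v))) {
    v[i] <- mean(tail(v[seq_len(i - 1)], 3))
  }
  v
}
DT[, value := fill_mean3(value), by = id]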

Manipulating dates alongside consecutive results

I need some help working with consecutive results.
Here is my sample data:
df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909,
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645,
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA,
-16L))
The 'idno' represents individual people who had a test with 'result' on a given date ('samp_date').
From each individual person, I need to find the earliest consecutive 'Negatives' and return the date of the first 'negative' result. To return this date, the consecutive negatives must span >30 days with no 'positive' results.
The example answer for idno == 1 would be 2013-10-29, and 2015-11-06 for idno == 2.
I have tried using rle(as.character(df$result)) but have struggled to understand how to apply this to grouped data.
I would prefer an approach that uses dplyr or data.table.
Thanks for any help.
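For reference, rle() has no notion of groups, but data.table::rleid() does the same run-numbering and accepts several vectors at once, so runs never cross an idno boundary (a quick illustration, assuming the df above):
library(data.table)
rleid(df$idno, df$result)
# [1] 1 2 2 3 3 3 3 3 3 3 4 5 5 6 6 6
The data.table answer below builds directly on this.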
Similar to @MKR's answer, you can make a grouping variable and summarize in data.table:
library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]
# summarize by grouping var g = rleid(idno, result)
runDT = df[, .(
    start = first(samp_date),
    end   = last(samp_date),
    dur   = difftime(last(samp_date), first(samp_date), units = "days")
  ), by = .(idno, result, g = rleid(idno, result))]
#    idno   result g      start        end      dur
# 1:    1 Negative 1 2013-07-23 2013-07-23   0 days
# 2:    1 Positive 2 2013-08-21 2013-10-01  41 days
# 3:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 4:    2 Positive 4 2015-10-13 2015-10-13   0 days
# 5:    2 Negative 5 2015-11-06 2016-10-31 360 days
# 6:    2 Positive 6 2017-04-25 2017-09-20 148 days
# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30),
          on = .(idno, result, dur >= min_dur), mult = "first", which = TRUE]
# filter
runDT[w]
#    idno   result g      start        end      dur
# 1:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 2:    2 Negative 5 2015-11-06 2016-10-31 360 days
A dplyr-based solution: create a grouping variable for consecutive runs of the result column, then take the first row of the first group that meets the criteria:
library(dplyr)
df %>%
  mutate(samp_date = as.Date(samp_date)) %>%
  group_by(idno) %>%
  arrange(samp_date) %>%
  mutate(result_grp = cumsum(as.character(result) != lag(as.character(result), default = ""))) %>%
  group_by(idno, result_grp) %>%
  filter(result == "Negative" & (max(samp_date) - min(samp_date)) >= 30) %>%
  slice(1) %>%
  ungroup() %>%
  select(-result_grp)
# # A tibble: 2 x 3
#    idno result   samp_date
#   <dbl> <ord>    <date>
# 1  1.00 Negative 2013-10-29
# 2  2.00 Negative 2015-11-06
Another dplyr approach:
library(dplyr)
df %>%
  group_by(idno) %>%
  mutate(time_diff = ifelse(result == "Negative" & lead(result) == "Negative",
                            samp_date - lead(samp_date), 0),
         ConsNegDate = min(samp_date[which(abs(time_diff) > 30)]))
# A tibble: 16 x 5
# Groups: idno [2]
    idno result   samp_date  time_diff ConsNegDate
   <dbl> <ord>    <date>         <dbl> <date>
 1     1 Negative 2013-07-23         0 2013-10-29
 2     1 Positive 2013-08-21         0 2013-10-29
 3     1 Positive 2013-10-01         0 2013-10-29
 4     1 Negative 2013-10-29       -34 2013-10-29
 5     1 Negative 2013-12-02       -39 2013-10-29
 6     1 Negative 2014-01-10      -102 2013-10-29
 7     1 Negative 2014-04-22      -322 2013-10-29
 8     1 Negative 2015-03-10       -72 2013-10-29
 9     1 Negative 2015-05-21       -69 2013-10-29
10     1 Negative 2015-07-29        NA 2013-10-29
11     2 Positive 2015-10-13         0 2015-11-06
12     2 Negative 2015-11-06      -360 2015-11-06
13     2 Negative 2016-10-31         0 2015-11-06
14     2 Positive 2017-04-25         0 2015-11-06
15     2 Positive 2017-09-07         0 2015-11-06
16     2 Positive 2017-09-20         0 2015-11-06

Aggregate Dates to produce unique periods

I would like to be able to aggregate survey data collected over a range of days into unique periods. For example, I'd like to combine the first three dates (2015-03-17, 2015-03-23, 2015-03-26) to produce the period "March 2015". I will then use these combined dates to produce boxplots which show Average.Count for that period.
All up I would like to make 4 unique periods:
March 2015 (first 3 dates as per table below)
September 2015 (dates 4-5 as per table below)
March 2016 (dates 6-15 as per table below)
September 2016 (dates 16-23 as per table below)
Here are the dataset headings.
head(Survival.Pre.Harvest)
  Bay.Unique       Date Average.Count Total.Predators Time Previous.Average.Count
2          1 2015-03-17         346.9               2    0                     NA
3          1 2015-09-14         326.6               8  181                  346.9
4          1 2016-02-29         322.6               3  349                  326.6
7          2 2015-03-17         326.4               2    0                     NA
8          2 2015-09-14         288.8               4  181                  326.4
9          2 2016-02-29         271.4               6  349                  288.8
These are the unique dates within the dataset.
table(Survival.Pre.Harvest$Date)
2015-03-17 2015-03-23 2015-03-26 2015-09-14 2015-09-15 2016-02-24 2016-02-25 2016-02-26 2016-02-29
         9          3          1          9          3          4          6          6          5
2016-03-01 2016-03-02 2016-03-03 2016-03-04 2016-03-22 2016-03-23 2016-09-12 2016-09-13 2016-09-14
         3          6          3          6          6          2          6          6          4
2016-09-20 2016-09-22 2016-10-18 2016-10-19 2016-10-20
         7         10          4          3         14
Thanks in advance!
dput(head(Survival.Pre.Harvest))
structure(list(Bay.Unique = c(1, 1, 1, 2, 2, 2), Date = structure(c(16511,
16692, 16860, 16511, 16692, 16860), class = "Date"), Average.Count = c(346.9,
326.6, 322.6, 326.4, 288.8, 271.4), Total.Predators = c(2L, 8L,
3L, 2L, 4L, 6L), Time = c(0, 181, 349, 0, 181, 349), Previous.Average.Count = c(NA,
346.9, 326.6, NA, 326.4, 288.8)), .Names = c("Bay.Unique", "Date",
"Average.Count", "Total.Predators", "Time", "Previous.Average.Count"
), row.names = c(2L, 3L, 4L, 7L, 8L, 9L), class = "data.frame")
This should work:
library(lubridate)
library(ggplot2)
Survival.Pre.Harvest$Date <- ymd(Survival.Pre.Harvest$Date)
bks = ymd("2015-01-01", "2015-08-31", "2016-01-01", "2016-08-31", "2017-01-01")
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")
Survival.Pre.Harvest$yearmonth <- cut.Date(Survival.Pre.Harvest$Date, breaks = bks, labels = lbs)
ggplot(Survival.Pre.Harvest, aes(x=yearmonth, y=Average.Count)) + geom_boxplot()
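As a quick sanity check before plotting, you could cross-tabulate the raw dates against the new factor (a sketch, assuming the full dataset is loaded):
# every raw date should land in exactly one of the four periods
with(Survival.Pre.Harvest, table(Date, yearmonth))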
