I have scoured SO for a way to achieve what I need without luck so here it goes.
A while back I discovered the package dplyr and its potential. I am thinking this package can do what I want, I just don't know how. This is a small subset of my data, but should be representative of my problem.
dummy<-structure(list(time = structure(1:20, .Label = c("2015-03-25 12:24:00",
"2015-03-25 21:08:00", "2015-03-25 21:13:00", "2015-03-25 21:47:00",
"2015-03-26 03:08:00", "2015-04-01 20:30:00", "2015-04-01 20:34:00",
"2015-04-01 20:42:00", "2015-04-01 20:45:00", "2015-09-29 18:26:00",
"2015-09-29 19:11:00", "2015-09-29 21:21:00", "2015-09-29 22:03:00",
"2015-09-29 22:38:00", "2015-09-30 00:48:00", "2015-09-30 01:38:00",
"2015-09-30 01:41:00", "2015-09-30 01:45:00", "2015-09-30 01:47:00",
"2015-09-30 01:49:00"), class = "factor"), ID = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), station = c(1L, 1L, 1L, 2L, 3,
4L, 4L, 4L, 4L, 5L, 5L, 6L,
6L, 5, 5, 5L, 7, 7, 7L,
7)), .Names = c("time", "ID", "station"), class = "data.frame", row.names = c(NA,
-20L))
I wish to evaluate rows within the time column conditional on the ID and station columns. Specifically, I would like the function (dplyr?) to evaluate each time row and compare its time to the previous time (row-1) and the next time (row+1). If the time of the current row is within 1 hour of the time of the previous and/or next row, and the ID and station of the current row match those of that previous and/or next row, then I would like to add a 1 in a new column, otherwise a 0.
How would I achieve this using dplyr?
The expected outcome should be like this:
time ID station new.value
1 2015-03-25 12:24:00 1 1 0
2 2015-03-25 21:08:00 1 1 1
3 2015-03-25 21:13:00 1 1 1
4 2015-03-25 21:47:00 1 2 0
5 2015-03-26 03:08:00 1 3 0
6 2015-04-01 20:30:00 1 4 1
7 2015-04-01 20:34:00 1 4 1
8 2015-04-01 20:42:00 1 4 1
9 2015-04-01 20:45:00 1 4 1
10 2015-09-29 18:26:00 2 5 1
11 2015-09-29 19:11:00 2 5 1
12 2015-09-29 21:21:00 2 6 1
13 2015-09-29 22:03:00 2 6 1
14 2015-09-29 22:38:00 2 5 0
15 2015-09-30 00:48:00 2 5 1
16 2015-09-30 01:38:00 2 5 1
17 2015-09-30 01:41:00 2 7 1
18 2015-09-30 01:45:00 2 7 1
19 2015-09-30 01:47:00 2 7 1
20 2015-09-30 01:49:00 2 7 1
Here is an option using difftime with dplyr's mutate function. First, convert time from a factor to POSIXct so that it can be shifted and compared as a date-time. Then use group_by so that the comparison happens only within each unique combination of ID and station. difftime calculates the time difference; the units are set to hours for convenience. The lag and lead functions, also from the dplyr package, shift the selected column by one row so that each row sees the previous or next value. Combined with the vectorised difftime, this gives the time difference between the current row and the previous/next row. abs makes the result an absolute value, the condition < 1 checks that the difference is within an hour, and as.integer converts the logical values (TRUE or FALSE) to 1 or 0.
library(dplyr)
dummy %>%
  mutate(time = as.POSIXct(time)) %>%  # time is stored as a factor; convert it first
  group_by(ID, station) %>%
  mutate(new.value = as.integer(
    abs(difftime(time, lag(time, default = Inf), units = "hours")) < 1 |
    abs(difftime(time, lead(time, default = Inf), units = "hours")) < 1))
Source: local data frame [20 x 4]
Groups: ID, station [7]
time ID station new.value
(time) (int) (dbl) (int)
1 2015-03-25 12:24:00 1 1 0
2 2015-03-25 21:08:00 1 1 1
3 2015-03-25 21:13:00 1 1 1
4 2015-03-25 21:47:00 1 2 0
5 2015-03-26 03:08:00 1 3 0
6 2015-04-01 20:30:00 1 4 1
7 2015-04-01 20:34:00 1 4 1
8 2015-04-01 20:42:00 1 4 1
9 2015-04-01 20:45:00 1 4 1
10 2015-09-29 18:26:00 2 5 1
11 2015-09-29 19:11:00 2 5 1
12 2015-09-29 21:21:00 2 6 1
13 2015-09-29 22:03:00 2 6 1
14 2015-09-29 22:38:00 2 5 0
15 2015-09-30 00:48:00 2 5 1
16 2015-09-30 01:38:00 2 5 1
17 2015-09-30 01:41:00 2 7 1
18 2015-09-30 01:45:00 2 7 1
19 2015-09-30 01:47:00 2 7 1
20 2015-09-30 01:49:00 2 7 1
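To see the shift-and-compare step in isolation, here is a minimal sketch on three time stamps from a single ID/station group. It is not the answer's exact code: it assumes the default NA padding of lag and lead and treats a missing neighbour as "not within an hour" via coalesce. The names times, gap_prev, gap_next and near are made up for the illustration.

```r
library(dplyr)

# Three observations from one ID/station group (first three rows of the data)
times <- as.POSIXct(c("2015-03-25 12:24:00",
                      "2015-03-25 21:08:00",
                      "2015-03-25 21:13:00"), tz = "UTC")

# Absolute gap in hours to the previous and to the next row (NA at the edges)
gap_prev <- abs(as.numeric(difftime(times, lag(times),  units = "hours")))
gap_next <- abs(as.numeric(difftime(times, lead(times), units = "hours")))

# A missing neighbour counts as "not close"
near <- coalesce(gap_prev < 1, FALSE) | coalesce(gap_next < 1, FALSE)
as.integer(near)
# [1] 0 1 1
```

The first row has no neighbour within an hour, while rows two and three are five minutes apart, which matches rows 1 to 3 of the expected output.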
Psidom's answer is great -- here's a data.table approach.
library(data.table)
setDT(dummy)
# you do NOT want a factor for your time variable
dummy[, time := as.POSIXct(time) ]
dummy[, `:=`(lag_diff = c(Inf, diff(as.numeric(time))),
lead_diff = c(diff(as.numeric(time)), Inf)),
by = .(ID, station) ]
dummy[, new.value := as.integer(lag_diff < 3600 | lead_diff < 3600) ]
dummy
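The comment about factors is worth a small illustration. The sketch below (variable names are made up) shows that converting a factor straight to numeric yields the level codes rather than times, whereas going through POSIXct yields seconds that can be differenced.

```r
time_factor <- factor(c("2015-03-25 12:24:00", "2015-03-25 21:08:00"))

# Level codes, not times
as.numeric(time_factor)
# [1] 1 2

# Convert via character to POSIXct, then difference in seconds
time_posix <- as.POSIXct(as.character(time_factor), tz = "UTC")
diff(as.numeric(time_posix))
# [1] 31440
```

31440 seconds is 8 hours 44 minutes, the actual gap between the two observations.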
Another solution using R base functions (sapply and difftime):
n=nrow(dummy)
dummy$new.value=
as.numeric(sapply(1:n, function(i)
(i<n && (dummy[i,"ID"]==dummy[i+1,"ID"] && dummy[i,"station"]==dummy[i+1,"station"])
&& abs(as.numeric(difftime(dummy[i,"time"], dummy[i+1,"time"]), "hours"))<=1)
||
(i>1 && (dummy[i,"ID"]==dummy[i-1,"ID"] && dummy[i,"station"]==dummy[i-1,"station"])
&& abs(as.numeric(difftime(dummy[i,"time"], dummy[i-1,"time"]), "hours"))<=1)
))
# > dummy
# time ID station new.value
# 1 2015-03-25 12:24:00 1 1 0
# 2 2015-03-25 21:08:00 1 1 1
# 3 2015-03-25 21:13:00 1 1 1
# 4 2015-03-25 21:47:00 1 2 0
# 5 2015-03-26 03:08:00 1 3 0
# 6 2015-04-01 20:30:00 1 4 1
# 7 2015-04-01 20:34:00 1 4 1
# 8 2015-04-01 20:42:00 1 4 1
# 9 2015-04-01 20:45:00 1 4 1
# 10 2015-09-29 18:26:00 2 5 1
# 11 2015-09-29 19:11:00 2 5 1
# 12 2015-09-29 21:21:00 2 6 1
# 13 2015-09-29 22:03:00 2 6 1
# 14 2015-09-29 22:38:00 2 5 0
# 15 2015-09-30 00:48:00 2 5 1
# 16 2015-09-30 01:38:00 2 5 1
# 17 2015-09-30 01:41:00 2 7 1
# 18 2015-09-30 01:45:00 2 7 1
# 19 2015-09-30 01:47:00 2 7 1
# 20 2015-09-30 01:49:00 2 7 1
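The same rule can also be written without the row-by-row sapply. The following is only a sketch under the assumption that the rows are already in the desired order and that time has been converted to POSIXct; flag_close is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: 1 if the previous or next row has the same id and
# station and lies within max_hours, otherwise 0
flag_close <- function(time, id, station, max_hours = 1) {
  n <- length(time)
  secs <- as.numeric(time)
  # TRUE where row i and row i + 1 form a matching, close pair
  pair_ok <- id[-n] == id[-1] &
    station[-n] == station[-1] &
    abs(diff(secs)) <= max_hours * 3600
  close_next <- c(pair_ok, FALSE)   # row i matches row i + 1
  close_prev <- c(FALSE, pair_ok)   # row i matches row i - 1
  as.integer(close_next | close_prev)
}

time <- as.POSIXct(c("2015-03-25 12:24:00", "2015-03-25 21:08:00",
                     "2015-03-25 21:13:00", "2015-03-25 21:47:00"), tz = "UTC")
flag_close(time, id = c(1, 1, 1, 1), station = c(1, 1, 1, 2))
# [1] 0 1 1 0
```

Like the sapply version, this uses <= 1 hour rather than < 1 hour for the cut-off.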
I have a dataset containing changes in mean arterial blood pressure (MAP) over time from multiple participants. Here is an example dataframe:
df=structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Time = structure(1:14, .Label = c("11:02:00",
"11:03:00", "11:04:00", "11:05:00", "11:06:00", "11:07:00", "11:08:00",
"13:30:00", "13:31:00", "13:32:00", "13:33:00", "13:34:00", "13:35:00",
"13:36:00"), class = "factor"), MAP = c(90.27999878, 84.25, 74.81999969,
80.87000275, 99.38999939, 81.51000214, 71.51000214, 90.08999634,
88.75, 84.72000122, 83.86000061, 94.18000031, 98.54000092, 51
)), class = "data.frame", row.names = c(NA, -14L))
I have binned the data into groups: e.g. MAP 40-60, 60-80, 80-100 and added a unique flag (1, 2 or 3) in an additional column map_bin. This is my code so far:
library(dplyr)
#Mean Arterial Pressure
#Bin 1=40-60; Bin 2=60-80; Bin 3=80-100
map_bin=c("1","2","3")
output <- as_tibble(df) %>%
mutate(map_bin = case_when(
MAP >= 40 & MAP < 60 ~ map_bin[1],
MAP >= 60 & MAP < 80 ~ map_bin[2],
MAP >= 80 & MAP < 100 ~ map_bin[3]
))
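As an aside, the same half-open bins can be sketched with base R's cut. This is an alternative to the case_when call above, not a replacement the question asked for; map_values and map_bin_cut are made-up names.

```r
map_values <- c(51, 74.81999969, 90.27999878)

# right = FALSE gives the intervals [40, 60), [60, 80), [80, 100)
map_bin_cut <- as.character(cut(map_values,
                                breaks = c(40, 60, 80, 100),
                                labels = c("1", "2", "3"),
                                right = FALSE))
map_bin_cut
# [1] "1" "2" "3"
```

Values outside 40 to 100 become NA, just as they do when no case_when condition matches.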
For each ID I wish to calculate, in an additional column, the total time MAP is in each bin. I expect the following output:
ID     Time         MAP map_bin map_bin_dur
 1 11:02:00 90.27999878       3           5
 1 11:03:00 84.25             3           5
 1 11:04:00 74.81999969       2           2
 1 11:05:00 80.87000275       3           5
 1 11:06:00 99.38999939       3           5
 1 11:07:00 81.51000214       3           5
 1 11:08:00 71.51000214       2           2
 2 13:30:00 90.08999634       3           6
 2 13:31:00 88.75             3           6
 2 13:32:00 84.72000122       3           6
 2 13:33:00 83.86000061       3           6
 2 13:34:00 94.18000031       3           6
 2 13:35:00 98.54000092       3           6
 2 13:36:00 51                1           1
Where map_bin_dur is the time in minutes that MAP for each individual resided in each bin. e.g. ID 1 had a MAP in Bin 3 for 5 minutes in total.
If the Time column always advances in 1-minute steps, you can use add_count:
library(dplyr)
output <- output %>% add_count(ID, map_bin, name = 'map_bin_dur')
output
# ID Time MAP map_bin map_bin_dur
# <int> <fct> <dbl> <chr> <int>
# 1 1 11:02:00 90.3 3 5
# 2 1 11:03:00 84.2 3 5
# 3 1 11:04:00 74.8 2 2
# 4 1 11:05:00 80.9 3 5
# 5 1 11:06:00 99.4 3 5
# 6 1 11:07:00 81.5 3 5
# 7 1 11:08:00 71.5 2 2
# 8 2 13:30:00 90.1 3 6
# 9 2 13:31:00 88.8 3 6
#10 2 13:32:00 84.7 3 6
#11 2 13:33:00 83.9 3 6
#12 2 13:34:00 94.2 3 6
#13 2 13:35:00 98.5 3 6
#14 2 13:36:00 51 1 1
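For readers unfamiliar with add_count, here is a small sketch on a toy tibble (toy, with_add_count and with_mutate are made-up names) showing that it adds the per-group row count without collapsing rows, the same as group_by plus mutate(n()).

```r
library(dplyr)

toy <- tibble(ID = c(1L, 1L, 1L, 2L),
              map_bin = c("3", "2", "3", "3"))

# add_count keeps every row and appends the size of its (ID, map_bin) group
with_add_count <- toy %>%
  add_count(ID, map_bin, name = "map_bin_dur")

# Equivalent spelled out with group_by + mutate
with_mutate <- toy %>%
  group_by(ID, map_bin) %>%
  mutate(map_bin_dur = n()) %>%
  ungroup()

with_add_count$map_bin_dur
# [1] 2 1 2 1
```

The count equals a duration in minutes only because each row stands for one minute; with irregular sampling the row count would no longer be a time.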
I am trying to find an R function that can index groups iteratively, given a set of unevenly spaced dates, uneven group sizes, and by grouped cases. Here are example data:
> h
# A tibble: 20 x 2
ID date
<int> <date>
1 1 2021-01-07
2 1 2021-01-11
3 1 2021-01-15
4 1 2021-01-16
5 1 2021-01-21
6 1 2021-01-26
7 1 2021-02-04
8 1 2021-02-08
9 1 2021-02-13
10 1 2021-02-20
11 1 2021-02-23
12 1 2021-02-27
13 2 2021-01-05
14 2 2021-01-11
15 2 2021-02-02
16 2 2021-02-08
17 2 2021-02-08
18 2 2021-02-14
19 2 2021-02-17
20 2 2021-02-21
For each unique ID, I want to find the first date (chronologically) and create a group (i.e., group==1) for that case and any other rows within 7 days. For the next date after those 7 days, create a second group (i.e., group==2) for that case and any others within the following 7 days. Note: the next date is not necessarily exactly 7 days after the initial date. Repeat this process for the remaining cases to get the desired output:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
Using a rolling window function of 7 days will not work, as far as I can tell, as it will group the cases incorrectly. But I am wondering if a sort of custom rolling window function could be used? I would prefer a solution using dplyr, but other options would also work. Any help here is appreciated.
> dput(h)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(18634,
18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671, 18678,
18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672, 18675,
18679), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Define a function date1 that takes the start date of the previous row's group (prev) and the current row's date (x) and returns the start date of the current row's group, which must be one of the two arguments. Then, grouping by ID, use Reduce with accumulate = TRUE to apply it along the dates in each ID, and convert the result to a factor and then to an integer.
library(dplyr)
date1 <- function(prev, x) if (x > prev + 7) x else prev
h %>%
group_by(ID) %>%
mutate(group = as.integer(factor(Reduce(date1, date, acc = TRUE)))) %>%
ungroup
giving:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
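To make the intermediate result of Reduce visible, here is a sketch on plain day numbers instead of Date objects (the first six January dates of ID 1, written as day of month). The names group_start, days and starts are illustrative only.

```r
# Same rule as date1: start a new group once x is more than 7 days past the
# current group's start, otherwise keep the current start
group_start <- function(prev, x) if (x > prev + 7) x else prev

days <- c(7, 11, 15, 16, 21, 26)

# accumulate = TRUE returns the running group start for every element
starts <- Reduce(group_start, days, accumulate = TRUE)
starts
# [1]  7  7 15 15 15 26

# Each distinct start becomes one group index
as.integer(factor(starts))
# [1] 1 1 2 2 2 3
```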
For each ID group, create group as a vector of NAs. While some group elements are still NA, take the first date value where group is NA and add 0 and 7 days to it to make a range of dates. For any rows where date is in the calculated date range, set elements of group to 1 more than the current max value of group (or 0 if group is still all NA).
library(data.table)
df <- as.data.table(h)  # h is the tibble from the question
df[order(ID, date), {
group <- rep(NA_real_, .N)
while(any(is.na(group))){
group_range <- first(date[is.na(group)]) + c(0, 7)
group[date %between% group_range] <- 1 + max(fcoalesce(group, 0))
}
list(date, group)
}, by = ID]
# ID date group
# 1: 1 2021-01-07 1
# 2: 1 2021-01-11 1
# 3: 1 2021-01-15 2
# 4: 1 2021-01-16 2
# 5: 1 2021-01-21 2
# 6: 1 2021-01-26 3
# 7: 1 2021-02-04 4
# 8: 1 2021-02-08 4
# 9: 1 2021-02-13 5
# 10: 1 2021-02-20 5
# 11: 1 2021-02-23 6
# 12: 1 2021-02-27 6
# 13: 2 2021-01-05 1
# 14: 2 2021-01-11 1
# 15: 2 2021-02-02 2
# 16: 2 2021-02-08 2
# 17: 2 2021-02-08 2
# 18: 2 2021-02-14 3
# 19: 2 2021-02-17 3
# 20: 2 2021-02-21 3
Here's another version where I try to limit the computations. No idea if it's actually faster.
df[order(ID, date), {
group <- rep(NA_integer_, .N)
i <- 1L
g <- 1L
while(i <= .N){
group_range <- date[i] + c(0, 7)
chg <- date %between% group_range
group[chg] <- g
g <- g + 1L
i <- i + sum(chg)
}
list(date, group)
}, by = ID]
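The second loop can also be read as a standalone function. Below is a sketch in base R, assuming the dates of a single ID are already sorted; assign_groups and its window argument are hypothetical names introduced for the illustration.

```r
# Hypothetical helper: walk through sorted dates, opening a new group at the
# first date not yet covered and assigning everything within `window` days
assign_groups <- function(dates, window = 7) {
  stopifnot(all(diff(as.numeric(dates)) >= 0))  # requires sorted input
  group <- integer(length(dates))
  i <- 1L
  g <- 1L
  while (i <= length(dates)) {
    in_window <- dates >= dates[i] & dates <= dates[i] + window
    group[in_window] <- g
    g <- g + 1L
    i <- i + sum(in_window)  # jump to the first date after this window
  }
  group
}

# Dates of ID 2 from the question
d <- as.Date(c("2021-01-05", "2021-01-11", "2021-02-02", "2021-02-08",
               "2021-02-08", "2021-02-14", "2021-02-17", "2021-02-21"))
assign_groups(d)
# [1] 1 1 2 2 2 3 3 3
```

With dplyr this could be applied per ID via group_by(ID) and mutate(group = assign_groups(date)), provided the rows are arranged by date first.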
I am trying to create a new column that assigns a unique value to an observation (row) only if the observation occurs after a specific time following the last observation (see data frame).
Context:
I set up a camera trap to observe which species visit a particular plot; every visit by a species should get a unique visitID. The actual database is more complex, but this is the main problem I have.
library(lubridate)  # provides ymd_hm()
new.df <- data.frame(
  species = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  visit.time = c(seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'),
                 seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'))
)
> new.df
species visit.time
1 A 2015-01-01 00:00:00
2 A 2015-01-01 00:02:00
3 A 2015-01-01 00:04:00
4 A 2015-01-01 00:06:00
5 A 2015-01-01 00:08:00
6 A 2015-01-01 00:10:00
7 B 2015-01-01 00:00:00
8 B 2015-01-01 00:02:00
9 B 2015-01-01 00:04:00
10 B 2015-01-01 00:06:00
11 B 2015-01-01 00:08:00
12 B 2015-01-01 00:10:00
I would like to create a new column called "visitID" that records each species' visits. However, I only want to assign a unique number if the visit occurred at least 2 minutes after the previous recorded visit:
> new.df
species visit.time visitID
1 A 2015-01-01 00:00:00 1
2 A 2015-01-01 00:02:00 -
3 A 2015-01-01 00:04:00 2
4 A 2015-01-01 00:06:00 -
5 A 2015-01-01 00:08:00 3
6 A 2015-01-01 00:10:00 -
7 B 2015-01-01 00:00:00 1
8 B 2015-01-01 00:02:00 -
9 B 2015-01-01 00:04:00 2
10 B 2015-01-01 00:06:00 -
11 B 2015-01-01 00:08:00 3
12 B 2015-01-01 00:10:00 -
where - is just an NA
I would usually try dplyr::mutate with a conditional such as ifelse; the problem is that I do not know how to account for the time elapsed since the previous visit.
Please let me know if there are more details I could provide. Thanks!
From your desired output it seems you want a new ID when the time difference between the current and the last recorded visit that received a new ID exceeds 2 minutes. In that case, we could use a cumulative sum that resets at a certain threshold. I've used the function from this answer: dplyr / R cumulative sum with reset
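library(dplyr)
library(purrr)  # provides accumulate()
sum_reset_at <- function(thresh) {
  function(x) {
    # running sum that restarts from the current value once it has exceeded thresh
    accumulate(x, ~if_else(.x>thresh, .y, .x+.y))
  }
}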
new.df <- new.df %>%
group_by(species) %>% # group df by species
arrange(species, visit.time) %>% # sort the data
mutate(
time.elapsed = as.numeric(difftime(visit.time, lag(visit.time), units = "mins")), # calculate time difference in minutes
time.elapsed = ifelse(is.na(time.elapsed), 0, time.elapsed), # replace NAs at first entries with 0s
time.elapsed.cum = sum_reset_at(2)(time.elapsed), # build cumulative sum that resets once the value is greater (not greater or equal) to two
newID = ifelse(time.elapsed.cum > 2, TRUE, FALSE), # build logical vector that marks the position where a new ID starts
visitID = cumsum(newID) + 1, # generate visit IDs
visitID = replace(visitID, duplicated(visitID), NA) # keep only first entry of an id, replace rest with NA
)
Output:
> new.df
# A tibble: 12 x 6
# Groups: species [2]
species visit.time time.elapsed time.elapsed.cum newID visitID
<fct> <dttm> <dbl> <dbl> <lgl> <dbl>
1 A 2015-01-01 00:00:00 0 0 FALSE 1
2 A 2015-01-01 00:02:00 2 2 FALSE NA
3 A 2015-01-01 00:04:00 2 4 TRUE 2
4 A 2015-01-01 00:06:00 2 2 FALSE NA
5 A 2015-01-01 00:08:00 2 4 TRUE 3
6 A 2015-01-01 00:10:00 2 2 FALSE NA
7 B 2015-01-01 00:00:00 0 0 FALSE 1
8 B 2015-01-01 00:02:00 2 2 FALSE NA
9 B 2015-01-01 00:04:00 2 4 TRUE 2
10 B 2015-01-01 00:06:00 2 2 FALSE NA
11 B 2015-01-01 00:08:00 2 4 TRUE 3
12 B 2015-01-01 00:10:00 2 2 FALSE NA
So basically we are summing up the time differences until they exceed two minutes, then we reset the sum to zero. Where this cumsum is greater than two we need to add a new ID. We do this by adding a logical vector and building the cumsum of that vector (because TRUE = 1 and FALSE = 0). Lastly, we replace the duplicated IDs in the groups to get the output you specified. We can drop the columns you don't need:
> new.df %>% select(-c(time.elapsed, time.elapsed.cum, newID))
# A tibble: 12 x 3
# Groups: species [2]
species visit.time visitID
<fct> <dttm> <dbl>
1 A 2015-01-01 00:00:00 1
2 A 2015-01-01 00:02:00 NA
3 A 2015-01-01 00:04:00 2
4 A 2015-01-01 00:06:00 NA
5 A 2015-01-01 00:08:00 3
6 A 2015-01-01 00:10:00 NA
7 B 2015-01-01 00:00:00 1
8 B 2015-01-01 00:02:00 NA
9 B 2015-01-01 00:04:00 2
10 B 2015-01-01 00:06:00 NA
11 B 2015-01-01 00:08:00 3
12 B 2015-01-01 00:10:00 NA
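The resetting cumulative sum can be traced by hand on one species. The sketch below uses base R's Reduce in place of purrr's accumulate, purely as an illustration; sum_reset, elapsed and running are made-up names.

```r
# Running sum that restarts from the current value once the previous
# running total has exceeded thresh
sum_reset <- function(x, thresh) {
  Reduce(function(acc, y) if (acc > thresh) y else acc + y,
         x, accumulate = TRUE)
}

elapsed <- c(0, 2, 2, 2, 2, 2)   # minutes since the previous visit
running <- sum_reset(elapsed, 2)
running
# [1] 0 2 4 2 4 2

# A new visit starts wherever the running total exceeds the threshold
cumsum(running > 2) + 1
# [1] 1 1 2 2 3 3
```

Keeping only the first occurrence of each ID, as the answer does with duplicated, then gives 1, NA, 2, NA, 3, NA.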
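You can return the differences using diff(). Just make sure to prepend a 2 to each group of species, i.e. c(2, as.numeric(diff(visit.time), units = "mins")), so that the first visit for each species always gets an ID (diff() returns one element fewer than its input, so mutate() would throw an error otherwise). Asking for minutes explicitly matters because diff() on date-times chooses its units automatically, so dividing by 60 is only correct when the result happens to be in seconds.
The only criterion you've given for visitID is that the values for each species are unique, but not that they are consecutive, so I'll assume that 1 5 6 is just as valid as 1 2 3. This simplifies things quite a bit:
library(dplyr)
df %>%
  group_by(species) %>%
  mutate(tdiff = c(2, as.numeric(diff(visit.time), units = "mins")),  # minutes since previous visit
         visitID = seq_along(species),
         visitID = ifelse(tdiff >= 2, visitID, NA)
  )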
Which will return the following data frame:
# A tibble: 12 x 4
# Groups: species [2]
species visit.time tdiff visitID
<fct> <dttm> <dbl> <int>
1 A 2015-01-01 00:02:10 2 1
2 A 2015-01-01 00:03:00 0.833 NA
3 A 2015-01-01 00:03:10 0.167 NA
4 A 2015-01-01 00:04:00 0.833 NA
5 A 2015-01-01 00:07:40 3.67 5
6 A 2015-01-01 00:09:40 2 6
7 B 2015-01-01 00:00:40 2 1
8 B 2015-01-01 00:01:10 0.5 NA
9 B 2015-01-01 00:04:10 3 3
10 B 2015-01-01 00:05:40 1.5 NA
11 B 2015-01-01 00:09:40 4 5
12 B 2015-01-01 00:09:50 0.167 NA
Note that I've used a modified dataset because the differences between the times in the example you provide are all == 2.
Data:
df <- structure(list(species = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
visit.time = structure(c(1420070530, 1420070580, 1420070590,
1420070640, 1420070860, 1420070980, 1420070440, 1420070470,
1420070650, 1420070740, 1420070980, 1420070990), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-12L))
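One detail to watch when differencing date-times: diff() on POSIXct values picks its units from the size of the gaps. The sketch below (t_secs and t_mins are made-up vectors) shows two inputs that come back in different units, and how requesting minutes explicitly sidesteps the issue.

```r
t_secs <- as.POSIXct(c("2015-01-01 00:02:10", "2015-01-01 00:03:00"), tz = "UTC")
t_mins <- as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 00:02:00"), tz = "UTC")

units(diff(t_secs))   # gap of 50 seconds
# [1] "secs"
units(diff(t_mins))   # gap of 2 minutes
# [1] "mins"

# Converting with explicit units gives minutes in both cases
as.numeric(diff(t_secs), units = "mins")
as.numeric(diff(t_mins), units = "mins")
# [1] 2
```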
I have a while-loop function that puts 1s in "algorithm_column" based on the highest percentages in the "percent" column until a certain total percentage is reached (90% or so). The rows that are not taken into account get a 0 in "algorithm_column" (see: Create while loop function that takes next largest value untill condition is met).
Based on what the loop found, I want to show the min and max times of the column "timeinterval": the min is where the 1s start and the max is the last row with a 1; the 0s are out of scope. Finally, I want to create a time interval from these two.
So with the following data, I want to create another column, let's say "total_time", calculated from the min time 09:00 (where the 1s start in algorithm_column) to 11:15, which gives a time interval of 02:15 hours in the "total_time" column.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group respectively.
I got this function, but I'm a bit stuck if this is what I need.
test <- function(x) {
ind <- x[["algorithm$algorithm_column"]] == 0
Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE);
ind <- x[["algorithm$algorithm_column"]] == 1
Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE);
list(Mn, Mx) ## or return(list(Mn, Mx))
}
test(algorithm)
Here is a dplyr solution.
library(dplyr)
algorithm %>%
mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
filter(algorithm_column == 1) %>%
group_by(pc4, tmp) %>%
summarise(first = first(timeinterval),
last = last(timeinterval)) %>%
select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
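The summary stops at the first and last time of each run; the total_time the question asks for can be derived from those two values. A minimal sketch, assuming the times are "HH:MM:SS" strings that fall on the same day (to_time is a hypothetical helper):

```r
# Hypothetical helper: parse "HH:MM:SS" onto a common date in UTC
to_time <- function(x) as.POSIXct(as.character(x), format = "%H:%M:%S", tz = "UTC")

first_time <- "09:00:00"
last_time  <- "11:15:00"

total_time <- difftime(to_time(last_time), to_time(first_time), units = "hours")
as.numeric(total_time)
# [1] 2.25
```

2.25 hours corresponds to the 02:15 interval described in the question. The same expression could be placed in a mutate after the summarise step, using the first and last columns.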
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)
I am trying to calculate the end date for each row of a huge data frame, based on the number of weeks required for the job.
Desired outcome:
For example, with PO.Due.Date = 2019-01-01 and Duration.Weeks = 4, the new column End Date = 2019-01-29.
$ Quote Number : chr "Q000297" "Q000300" "Q000401" "Q000405" ...
$ Confidence.Level: num 0.1 0.1 0.1 0.1 0.1 0.6 0.2 0.2 0.6 0.1 ...
$ PO.Due.Date : Date, format: "2019-01-03" "2019-01-03" "2019-01-03" ...
$ Duration.Weeks : num 2 2 4 4 2 1 4 4 4 4 ...
mydf
Quote Number Confidence.Level PO.Due.Date Duration.Weeks
1 Q000297 0.10 2019-01-03 2
2 Q000300 0.10 2019-01-03 2
3 Q000401 0.10 2019-01-03 4
4 Q000405 0.10 2019-01-03 4
5 Q000464 0.10 2019-01-03 2
6 Q000465 0.60 0028-02-20 1
7 Q000479/1 0.20 2019-03-01 4
8 Q000480 0.20 2019-03-01 4
9 Q000481 0.60 2019-02-28 4
10 Q000494 0.10 2019-01-03 4
I thought I could convert it into year and week and then add them. However, that would lose the day-level detail we need in the end product.
mydf$week <- format(mydf$PO.Due.Date, format="%Y-%U")
I also tried something like the following, but I couldn't get it to work:
mydf %>%
mutate(PO.End.Date = colSums(PO.Due.Date + weeks(mydf$Duration.Weeks)))
One-liner in base R:
d$EndDate <- d$StartDate + 7 * d$DurationWeeks
> d
StartDate DurationWeeks EndDate
1 2019-01-03 2 2019-01-17
2 2019-01-03 2 2019-01-17
3 2019-01-03 4 2019-01-31
4 2019-01-03 4 2019-01-31
5 2019-01-03 2 2019-01-17
6 2019-02-20 1 2019-02-27
7 2019-03-01 4 2019-03-29
8 2019-03-01 4 2019-03-29
9 2019-02-28 4 2019-03-28
10 2019-01-03 4 2019-01-31
Your StartDate needs to be formatted as a proper date,
e.g., as.Date(d$StartDate, "%Y-%m-%d").
The data should be clean. 0028-02-20 (line 6) is not a proper date.
Data
d <- structure(list(StartDate = structure(c(17899, 17899, 17899, 17899, 17899, 17947, 17956, 17956, 17955, 17899), class = "Date"), DurationWeeks = c(2L, 2L, 4L, 4L, 2L, 1L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, -10L))
It seems your example has a typo? 4 weeks following 1 Jan 2019 is not 1 Feb 2019...?
If you add n to a date-object you get a new date-object n days later. So I would suggest:
mydf$PO.End.Date<-mydf$PO.Due.Date+mydf$Duration.Weeks*7
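A quick check of the arithmetic on two made-up rows (due, weeks and end_date are illustrative names), including the 2019-01-01 plus 4 weeks case from the question:

```r
due   <- as.Date(c("2019-01-01", "2019-01-03"))
weeks <- c(4, 2)

# Adding a number to a Date adds that many days
end_date <- due + 7 * weeks
end_date
# [1] "2019-01-29" "2019-01-17"
```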