I have a dataframe with a structure similar to this:
name  time_from            time_to              value
a     2020-01-01 00:00:00  2020-01-01 01:30:00  value1
a     2020-01-01 02:00:00  2020-01-01 02:30:00  value2
b     2020-01-01 00:00:00  2020-01-01 01:00:00  value3
I want to convert the dataframe to the following structure by stepping from the time_from timestamp to the time_to timestamp in 30-minute increments, while the name and value stay the same for every increment.
name  time                 value
a     2020-01-01 00:00:00  value1
a     2020-01-01 00:30:00  value1
a     2020-01-01 01:00:00  value1
a     2020-01-01 01:30:00  value1
a     2020-01-01 02:00:00  value2
a     2020-01-01 02:30:00  value2
b     2020-01-01 00:00:00  value3
b     2020-01-01 00:30:00  value3
b     2020-01-01 01:00:00  value3
Help and guidance would be greatly appreciated. Thank you.
Using seq.POSIXt in a by approach.
dat <- do.call(rbind, by(dat, dat[c('name', 'value')], function(x) {
setNames(
data.frame(x[1, 1], seq.POSIXt(x[1, 2], x[nrow(x), 3], by='30 min'), x[1, 4]),
c('name', 'time', 'value'))}))
dat
# name time value
# 1 a 2020-01-01 00:00:00 value1
# 2 a 2020-01-01 00:30:00 value1
# 3 a 2020-01-01 01:00:00 value1
# 4 a 2020-01-01 01:30:00 value1
# 5 a 2020-01-01 02:00:00 value2
# 6 a 2020-01-01 02:30:00 value2
# 7 b 2020-01-01 00:00:00 value3
# 8 b 2020-01-01 00:30:00 value3
# 9 b 2020-01-01 01:00:00 value3
Of course, the solution assumes the time columns are already correctly formatted as 'POSIXct'. Convert beforehand if they are not:
tcols <- c('time_from', 'time_to')
dat[tcols] <- lapply(dat[tcols], as.POSIXct)
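As a minimal sketch of that conversion, assuming the timestamps arrive as character strings in "%Y-%m-%d %H:%M:%S" form (the `raw` data frame and the choice of UTC are made up for illustration), passing the format and time zone explicitly avoids depending on the local settings:

```r
# Hypothetical input: timestamps stored as character strings
raw <- data.frame(
  time_from = c("2020-01-01 00:00:00", "2020-01-01 02:00:00"),
  time_to   = c("2020-01-01 01:30:00", "2020-01-01 02:30:00"),
  stringsAsFactors = FALSE
)

tcols <- c("time_from", "time_to")
# Extra arguments to lapply() are passed on to as.POSIXct()
raw[tcols] <- lapply(raw[tcols], as.POSIXct,
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Both columns are now date-times, so differences are meaningful
span_min <- as.numeric(difftime(raw$time_to, raw$time_from, units = "mins"))
```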
Data:
dat <- structure(list(name = c("a", "a", "b"), time_from = structure(c(1577833200,
1577840400, 1577833200), class = c("POSIXct", "POSIXt"), tzone = ""),
time_to = structure(c(1577838600, 1577842200, 1577836800), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c("value1", "value2", "value3"
)), row.names = c(NA, -3L), class = "data.frame")
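If the same name/value pair could appear in more than one row, a row-wise variant may be easier to reason about. The following is only a sketch, not the answer's method: it rebuilds the example data with an assumed UTC time zone and expands each row separately with seq().

```r
# Example data from the question, rebuilt with an explicit (assumed) time zone
dat <- data.frame(
  name      = c("a", "a", "b"),
  time_from = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 02:00:00",
                           "2020-01-01 00:00:00"), tz = "UTC"),
  time_to   = as.POSIXct(c("2020-01-01 01:30:00", "2020-01-01 02:30:00",
                           "2020-01-01 01:00:00"), tz = "UTC"),
  value     = c("value1", "value2", "value3"),
  stringsAsFactors = FALSE
)

# Expand every row into one row per 30-minute step, then stack the pieces
expanded <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(
    name  = dat$name[i],
    time  = seq(dat$time_from[i], dat$time_to[i], by = "30 min"),
    value = dat$value[i],
    stringsAsFactors = FALSE
  )
}))
```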
I have the following data:
# dput:
data <- structure(list(start = structure(c(1641193200, 1641189600, 1641218400,
1641189600, 1641222000, 1641222000, 1641222000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), end = structure(c(1641218400, 1641218400,
1641241800, 1641218400, 1641241800, 1641241800, 1641232800), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), group = c("A", "B", "C", "D", "E",
"F", "G")), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"))
data
# A tibble: 7 x 3
start end group
<dttm> <dttm> <chr>
1 2022-01-03 07:00:00 2022-01-03 14:00:00 A
2 2022-01-03 06:00:00 2022-01-03 14:00:00 B
3 2022-01-03 14:00:00 2022-01-03 20:30:00 C
4 2022-01-03 06:00:00 2022-01-03 14:00:00 D
5 2022-01-03 15:00:00 2022-01-03 20:30:00 E
6 2022-01-03 15:00:00 2022-01-03 20:30:00 F
7 2022-01-03 15:00:00 2022-01-03 18:00:00 G
I want to find the times at which only one group has an "active" time interval (start to end), that is, where it does not overlap with any other group.
I have already experimented with lubridate and its interval function, but had trouble comparing more than two intervals with each other.
Desired Output
The output should show that group C has the time interval from 14:00 to 15:00, which does not overlap with any other group.
You can check ivs::iv_locate_splits to see which time frame is occupied by which group:
library(ivs)
ivv <- iv(data$start, data$end)
iv_locate_splits(ivv)
key loc
1 [2022-01-03 06:00:00, 2022-01-03 07:00:00) 2, 4
2 [2022-01-03 07:00:00, 2022-01-03 08:00:00) 1, 2, 4
3 [2022-01-03 08:00:00, 2022-01-03 14:00:00) 1, 2, 4, 7
4 [2022-01-03 14:00:00, 2022-01-03 15:00:00) 3, 7
5 [2022-01-03 15:00:00, 2022-01-03 18:00:00) 3, 5, 6, 7
6 [2022-01-03 18:00:00, 2022-01-03 20:30:00) 3, 5, 6
Updated framework to get the desired outcome:
library(ivs)
#convert to iv format
ivv <- iv(data$start, data$end)
#Check the splits
spl <- iv_locate_splits(ivv)
#Flag the splits that are covered by exactly 1 group
single <- lengths(spl$loc) == 1
#Create the desired outcome: the split rows select the frames,
#the row numbers stored in loc select the groups
data.frame(frame = spl$key[single],
group = data$group[unlist(spl$loc[single])])
# frame group
#1 [2022-01-03 14:00:00, 2022-01-03 15:00:00) C
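For comparison, the same idea can be sketched in base R without ivs: cut the timeline at every start and end, then count how many rows cover each piece. This is a sketch under the assumption that intervals are half-open on the right, as in the ivs output; the names `pieces_from`, `pieces_to` and `covering` are made up for illustration.

```r
# Data from the question, rebuilt from the printed tibble (UTC)
data <- data.frame(
  start = as.POSIXct(paste("2022-01-03", c("07:00:00", "06:00:00", "14:00:00",
            "06:00:00", "15:00:00", "15:00:00", "15:00:00")), tz = "UTC"),
  end   = as.POSIXct(paste("2022-01-03", c("14:00:00", "14:00:00", "20:30:00",
            "14:00:00", "20:30:00", "20:30:00", "18:00:00")), tz = "UTC"),
  group = c("A", "B", "C", "D", "E", "F", "G"),
  stringsAsFactors = FALSE
)

# Every start or end is a potential change point
breaks      <- sort(unique(c(data$start, data$end)))
pieces_from <- head(breaks, -1)
pieces_to   <- tail(breaks, -1)

# For each piece, the rows whose interval covers it completely
covering <- lapply(seq_along(pieces_from), function(i) {
  which(data$start <= pieces_from[i] & data$end >= pieces_to[i])
})

# Keep the pieces covered by exactly one row
single <- lengths(covering) == 1
result <- data.frame(
  from  = pieces_from[single],
  to    = pieces_to[single],
  group = data$group[unlist(covering[single])]
)
```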
I have two datasets: one with values at specific time points for different IDs, and another with several time frames per ID. I want to check whether each time point in df1 falls within any of the time frames in df2 for the matching ID.
For example:
df1:
ID date time
1 2020-04-14 11:00:00
1 2020-04-14 18:00:00
1 2020-04-15 10:00:00
1 2020-04-15 20:00:00
1 2020-04-16 11:00:00
1 ...
2 ...
df2:
ID start end
1 2020-04-14 16:00:00 2020-04-14 20:00:00
1 2020-04-15 18:00:00 2020-04-16 13:00:00
2 ...
2
What I want:
df1_new:
ID date time mark
1 2020-04-14 11:00:00 0
1 2020-04-14 18:00:00 1
1 2020-04-15 10:00:00 0
1 2020-04-15 20:00:00 1
1 2020-04-16 11:00:00 1
1 ...
2 ...
Any help would be appreciated!
An option could be:
library(tidyverse)
library(lubridate)
#> date, intersect, setdiff, union
df_1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), date = c("14.04.2020",
"14.04.2020", "15.04.2020", "15.04.2020", "16.04.2020"), time = c("11:00:00",
"18:00:00", "10:00:00", "20:00:00", "11:00:00"), date_time = structure(c(1586862000,
1586887200, 1586944800, 1586980800, 1587034800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-5L))
df_2 <- structure(list(ID = c(1L, 1L), start = c("14.04.2020 16:00",
"15.04.2020 18:00"), end = c("14.04.2020 20:00", "16.04.2020 13:00"
)), class = "data.frame", row.names = c(NA, -2L))
df_22 <- df_2 %>%
mutate(across(c("start", "end"), dmy_hm)) %>%
group_nest(ID)
left_join(x = df_1, y = df_22, by = "ID") %>%
as_tibble() %>%
mutate(mark = map2_dbl(date_time, data, ~+any(.x %within% interval(.y$start, .y$end)))) %>%
select(-data)
#> # A tibble: 5 x 5
#> ID date time date_time mark
#> <int> <chr> <chr> <dttm> <dbl>
#> 1 1 14.04.2020 11:00:00 2020-04-14 11:00:00 0
#> 2 1 14.04.2020 18:00:00 2020-04-14 18:00:00 1
#> 3 1 15.04.2020 10:00:00 2020-04-15 10:00:00 0
#> 4 1 15.04.2020 20:00:00 2020-04-15 20:00:00 1
#> 5 1 16.04.2020 11:00:00 2020-04-16 11:00:00 1
Created on 2021-05-25 by the reprex package (v2.0.0)
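The same check can also be sketched in base R, without the nesting and join. This is only an illustration under the assumption that both interval ends are inclusive (as with %within%); the data frames below are rebuilt from the question with UTC assumed.

```r
# Time points and time frames from the question (UTC assumed)
df1 <- data.frame(
  ID = 1L,
  date_time = as.POSIXct(c("2020-04-14 11:00:00", "2020-04-14 18:00:00",
                           "2020-04-15 10:00:00", "2020-04-15 20:00:00",
                           "2020-04-16 11:00:00"), tz = "UTC")
)
df2 <- data.frame(
  ID = 1L,
  start = as.POSIXct(c("2020-04-14 16:00:00", "2020-04-15 18:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-04-14 20:00:00", "2020-04-16 13:00:00"), tz = "UTC")
)

# For each time point, look only at the time frames with the same ID
df1$mark <- vapply(seq_len(nrow(df1)), function(i) {
  frames <- df2[df2$ID == df1$ID[i], ]
  inside <- df1$date_time[i] >= frames$start & df1$date_time[i] <= frames$end
  as.integer(any(inside))
}, integer(1))
```

An ID with no time frames at all gets mark 0, because any() of an empty logical vector is FALSE.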
I have the following data frame:
df <- structure(list(ID = 1:4, col1.date = structure(c(1546188000,
1272294300, 1087908540, 1512241620), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), col2.date = structure(c(1546237740, 1272928800,
1087966800, 1512277200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
col3.date = structure(c(1546323000, 1272949200, 1088049600,
1512396000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
col1.result = c(1.31, 0.95, 3.3, 0.55), col2.result = c(1.19,
1.57, 1.6, 0.59), col3.result = c(0.97, 2.13, 1.1, 0.57)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I would like to have for each ID three rows and two columns: result and date.
This is what I have tried:
df_long <- df %>%
gather(v, value, col1.date:col3.result) %>%
separate(v, c("var", "col"))
However, the dates get converted to numeric.
What am I doing wrong?
Since you ultimately want to reshape several sets of columns at once, try pivot_longer, which is the "new way" as of tidyr 1.0.0. This answer is adapted directly from the example in the help page at ?pivot_longer:
df %>%
pivot_longer(
col1.date:col3.result,
names_to = c("set", ".value"),
names_pattern = "(.*)\\.(.*)"
)
# # A tibble: 12 x 4
# ID set date result
# <int> <chr> <dttm> <dbl>
# 1 1 col1 2018-12-30 16:40:00 1.31
# 2 1 col2 2018-12-31 06:29:00 1.19
# 3 1 col3 2019-01-01 06:10:00 0.97
# 4 2 col1 2010-04-26 15:05:00 0.95
# 5 2 col2 2010-05-03 23:20:00 1.57
# 6 2 col3 2010-05-04 05:00:00 2.13
# 7 3 col1 2004-06-22 12:49:00 3.3
# 8 3 col2 2004-06-23 05:00:00 1.6
# 9 3 col3 2004-06-24 04:00:00 1.1
# 10 4 col1 2017-12-02 19:07:00 0.55
# 11 4 col2 2017-12-03 05:00:00 0.59
# 12 4 col3 2017-12-04 14:00:00 0.570
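As for what went wrong in the original attempt: gather stacks the date and result columns into a single value column, which can only hold one type, so the date-time attributes are dropped and plain numbers remain. A base R sketch of the alternative, stacking one set at a time so that each output column keeps its own type, might look like this (the `wide` data frame is a two-set, two-ID subset of the question's data, used only for illustration):

```r
# Small subset of the question's data: two IDs, two sets (UTC)
wide <- data.frame(
  ID = 1:2,
  col1.date   = as.POSIXct(c("2018-12-30 16:40:00", "2010-04-26 15:05:00"), tz = "UTC"),
  col2.date   = as.POSIXct(c("2018-12-31 06:29:00", "2010-05-03 23:20:00"), tz = "UTC"),
  col1.result = c(1.31, 0.95),
  col2.result = c(1.19, 1.57)
)

sets <- c("col1", "col2")

# Build one block per set, keeping date and result in separate columns,
# then stack the blocks; rbind() preserves the POSIXct class
long <- do.call(rbind, lapply(sets, function(s) {
  data.frame(
    ID     = wide$ID,
    set    = s,
    date   = wide[[paste0(s, ".date")]],
    result = wide[[paste0(s, ".result")]],
    stringsAsFactors = FALSE
  )
}))
long <- long[order(long$ID, long$set), ]
```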
I have two data frames with timestamps (in as.POSIXct, format="%Y-%m-%d %H:%M:%S") as below.
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 1
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
df_ID0
ID DATETIME
A 2019-03-26 00:02:00
A 2019-03-26 04:55:00
A 2019-03-26 11:22:00
B 2019-04-02 20:43:00
B 2019-04-04 11:03:00
B 2019-04-06 03:12:00
For each row in df_ID1, I want to find the row in df_ID0 that has the same ID and whose DATETIME is "smaller than but closest to" the DATETIME in df_ID1.
For each matched pair, I then want to compare the TIMEDIFF in df_ID1 with the matched DATETIME in df_ID0: if TIMEDIFF in df_ID1 is greater than the DATETIME in df_ID0, change EV from 1 to 4 in df_ID1.
My desired result is
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
I've looked at how to compare timestamps and calculate time differences, and at how to change values based on criteria.
But I cannot find anything on selecting the "smaller than but closest to" timestamp, and I cannot figure out how to apply all this logic together.
Any help would be appreciated!
You can do this with a for loop (here df_1 and df_0 stand for df_ID1 and df_ID0), keeping in mind that if your actual data set is very large, the loop overhead will hurt performance.
for (i in seq_len(nrow(df_1))) {
sub <- subset(df_0, ID == df_1$ID[i]) # filter on ID
df_0_dt <- max(sub[sub$DATETIME < df_1$DATETIME[i],]$DATETIME) # Take max of those with DATETIME less than (ie less than but closest to)
if(df_0_dt < df_1$TIMEDIFF[i]){ # final condition
df_1[i, "EV"] <- 4
}
}
df_1
# A tibble: 3 x 4
ID DATETIME TIMEDIFF EV
<chr> <dttm> <dttm> <dbl>
1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
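If the row-by-row loop turns out to be too slow, one possible alternative is findInterval, which locates all lookups for an ID in a single call on the sorted reference times. This is a sketch only: the data is rebuilt from the question with UTC assumed, and left.open = TRUE is used so that the match is strictly smaller than the lookup time.

```r
df_ID0 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "B"),
  DATETIME = as.POSIXct(c("2019-03-26 00:02:00", "2019-03-26 04:55:00",
                          "2019-03-26 11:22:00", "2019-04-02 20:43:00",
                          "2019-04-04 11:03:00", "2019-04-06 03:12:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
df_ID1 <- data.frame(
  ID = c("A", "B", "B"),
  DATETIME = as.POSIXct(c("2019-03-26 06:13:00", "2019-04-03 08:00:00",
                          "2019-04-04 12:35:00"), tz = "UTC"),
  TIMEDIFF = as.POSIXct(c("2019-03-26 00:13:00", "2019-04-03 02:00:00",
                          "2019-04-04 06:35:00"), tz = "UTC"),
  EV = c(1, 1, 1),
  stringsAsFactors = FALSE
)

for (id in unique(df_ID1$ID)) {
  rows <- which(df_ID1$ID == id)
  ref  <- sort(df_ID0$DATETIME[df_ID0$ID == id])
  # Index of the latest reference time strictly before each DATETIME (0 if none)
  pos <- findInterval(as.numeric(df_ID1$DATETIME[rows]), as.numeric(ref),
                      left.open = TRUE)
  found <- pos > 0
  # Among rows with a match, keep those whose TIMEDIFF is later than the match
  later <- df_ID1$TIMEDIFF[rows][found] > ref[pos[found]]
  df_ID1$EV[rows[found][later]] <- 4
}
```

Rows with no earlier reference time are simply left unchanged, which is an assumption about the desired behaviour that the question does not spell out.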
One option, using nested mapply, is to first split df_ID1 and df_ID0 by ID. Then calculate the time difference between each value in df_ID1 and the values in df_ID0 with the same ID. Get the index of the "smaller than but closest to" value, store it in inds, and change EV to 4 if the corresponding TIMEDIFF value is greater than the matched DATETIME value.
df_ID1$EV[unlist(mapply(function(x, y) {
mapply(function(p, q) {
vals = as.numeric(difftime(p, y$DATETIME))
inds = which(vals == min(vals[vals > 0]))
q > y$DATETIME[inds]
}, x$DATETIME, x$TIMEDIFF)
}, split(df_ID1, df_ID1$ID), split(df_ID0, df_ID0$ID)))] <- 4
df_ID1
# ID DATETIME TIMEDIFF EV
#1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
#2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
#3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
data
df_ID0 <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553529720, 1553547300,
1553570520, 1554208980, 1554346980, 1554491520), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, -6L), class = "data.frame")
df_ID1 <- structure(list(ID = structure(c(1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553551980, 1554249600,
1554352500), class = c("POSIXct", "POSIXt"), tzone = ""), TIMEDIFF =
structure(c(1553530380,
1554228000, 1554330900), class = c("POSIXct", "POSIXt"), tzone = ""),
EV = c(1, 1, 1)), row.names = c(NA, -3L), class = "data.frame")
I know the following problem can be solved using Bioconductor's IRanges package, using reduce.
But since that function only accepts numeric input, and I am working with data.table anyway, I am wondering if the following can be achieved using data.table's foverlaps().
Sample data
structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B"
), subgroup = c(1, 1, 2, 2, 1, 1, 2, 2), start = structure(c(1514793600,
1514795400, 1514794200, 1514798100, 1514815200, 1514817000, 1514815800,
1514818800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end = structure(c(1514794500, 1514797200, 1514794800, 1514799000,
1514816100, 1514818800, 1514817600, 1514820600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# group subgroup start end
# 1: A 1 2018-01-01 08:00:00 2018-01-01 08:15:00
# 2: A 1 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2 2018-01-01 08:10:00 2018-01-01 08:20:00
# 4: A 2 2018-01-01 09:15:00 2018-01-01 09:30:00
# 5: B 1 2018-01-01 14:00:00 2018-01-01 14:15:00
# 6: B 1 2018-01-01 14:30:00 2018-01-01 15:00:00
# 7: B 2 2018-01-01 14:10:00 2018-01-01 14:40:00
# 8: B 2 2018-01-01 15:00:00 2018-01-01 15:30:00
Question
What I would like to achieve is to join/merge events (by group) when:
- a range (start - end) overlaps (or partially overlaps) another range
- the start of a range is the end of another range
Subgroups can be ignored.
As mentioned above, I know this can be done using Bioconductor's IRanges reduce, but I wonder if the same can be achieved using data.table. I can't shake the feeling that foverlaps should be able to tackle my problem, but I cannot figure out how...
Since I'm an intermediate R user but pretty much a novice in data.table, it's hard for me to 'read' some of the solutions already provided on Stack Overflow. So I'm not sure whether a similar question has already been asked and answered (if so, please be gentle ;-) )
Desired output
structure(list(group = c("A", "A", "A", "B"), start = structure(c(1514793600,
1514795400, 1514798100, 1514815200), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), end = structure(c(1514794800, 1514797200,
1514799000, 1514820600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
# group start end
# 1: A 2018-01-01 08:00:00 2018-01-01 08:20:00
# 2: A 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2018-01-01 09:15:00 2018-01-01 09:30:00
# 4: B 2018-01-01 14:00:00 2018-01-01 15:30:00
If you arrange by group and start (in that order) and drop the indx column at the end, this solution posted by David Arenburg works perfectly (with the sample data stored as df1): How to flatten/merge overlapping time periods in R
library(dplyr)
df1 %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(group, indx) %>%
summarise(start = first(start), end = max(end)) %>% # max(), in case an earlier range ends later than the last one
select(-indx)
group start end
<chr> <dttm> <dttm>
1 A 2018-01-01 08:00:00 2018-01-01 08:20:00
2 A 2018-01-01 08:30:00 2018-01-01 09:00:00
3 A 2018-01-01 09:15:00 2018-01-01 09:30:00
4 B 2018-01-01 14:00:00 2018-01-01 15:30:00
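The core of this approach is the running-maximum trick: after sorting by start, a new merged range begins whenever a start is later than every end seen so far. Below is a base R sketch of the same logic, not a data.table or foverlaps solution; `merge_ranges` is a made-up helper name, and the data is rebuilt from the question's printed table.

```r
df1 <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  start = as.POSIXct(paste("2018-01-01", c("08:00:00", "08:30:00", "08:10:00",
            "09:15:00", "14:00:00", "14:30:00", "14:10:00", "15:00:00")), tz = "UTC"),
  end   = as.POSIXct(paste("2018-01-01", c("08:15:00", "09:00:00", "08:20:00",
            "09:30:00", "14:15:00", "15:00:00", "14:40:00", "15:30:00")), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge overlapping or touching ranges within one group
merge_ranges <- function(d) {
  d <- d[order(d$start), ]
  s <- as.numeric(d$start)
  e <- as.numeric(d$end)
  # A new run starts when a start lies strictly after all previous ends
  run <- cumsum(c(TRUE, s[-1] > cummax(e)[-length(e)]))
  data.frame(
    group = d$group[1],
    start = d$start[!duplicated(run)],
    end   = as.POSIXct(as.numeric(tapply(e, run, max)),
                       origin = "1970-01-01", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

merged <- do.call(rbind, lapply(split(df1, df1$group), merge_ranges))
rownames(merged) <- NULL
```

Because the comparison is strict (>), a range that starts exactly where another ends is merged with it, which matches the second condition in the question.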