Expand rows of data frame date-time column with intervening date-times

I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
                                    "2018-01-13 01:00:00",
                                    "2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour between the minimum and maximum date-times is present, looking like this:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are most preferred.

@DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
                                    "2018-01-13 01:00:00",
                                    "2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 ** 2))
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
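If all you need is the hourly vector itself (which is presumably what that comment suggests), a minimal sketch using base seq() on POSIXct:
# sketch: hourly sequence between the minimum and maximum date-times
hours <- seq(min(dat$dt), max(dat$dt), by = "hour")
head(hours)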

Related

How to create groups based on a changing condition using a for loop in R?

I have a data frame and a vector that I want to compare with a column of my data frame, to assign groups based on the values that meet a condition. The problem is that these values are dynamic, so I need code that takes into account the different lengths this vector can take.
This is a minimal reproducible example of my data frame:
value <- c(rnorm(39, 5, 2))
Date <- seq(as.POSIXct('2021-01-18'), as.POSIXct('2021-10-15'), by = "7 days")
df <- data.frame(Date, value)
This is the vector I have to compare with the Date of the data frame
dates_tour <- as.POSIXct(c('2021-01-18', '2021-05-18', '2021-08-18', '2021-10-15'))
This creates the desired output
df <- df %>% mutate(tour = case_when(Date >= dates_tour[1] & Date <= dates_tour[2] ~ 1,
                                     Date > dates_tour[2] & Date <= dates_tour[3] ~ 2,
                                     Date > dates_tour[3] & Date <= dates_tour[4] ~ 3))
However, I don't want to do it like that, since this project needs to be updated frequently and the variable dates_tour changes in length.
So I would like to take that into account to create the tour variable
I tried to do it like this, but it doesn't work:
for (i in 1:length(dates_tour)) {
  df <- df %>% mutate(tour = case_when(Date >= dates_tour[i] & Date <= dates_tour[i + 1] ~ i))
}
You can use cut to bin a vector based on break points:
df %>%
  mutate(
    tour = cut(Date, breaks = dates_tour, labels = seq_along(dates_tour[-1]))
  )
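If the break points can change in number and you also want control over which side of each boundary a date falls on, a sketch with findInterval() (its intervals are left-closed, which differs slightly from the case_when() version above, so check the boundaries for your data):
# sketch: findInterval() adapts automatically to the length of dates_tour;
# rightmost.closed = TRUE keeps a Date equal to the last break inside the last group
df$tour <- findInterval(df$Date, dates_tour, rightmost.closed = TRUE)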
We can pair up consecutive elements of dates_tour (dropping the last element for the start column and the first for the end column) into a tibble, and then loop over its rows:
library(dplyr)
library(purrr)
keydat <- tibble(start = dates_tour[-length(dates_tour)],
                 end = dates_tour[-1])
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                              df$Date <= keydat$end[.x] ~ .y)) %>%
  invoke(coalesce, .)
Output:
> df
Date value tour
1 2021-01-18 00:00:00 7.874620 1
2 2021-01-25 00:00:00 9.704973 1
3 2021-02-01 00:00:00 5.898070 1
4 2021-02-08 00:00:00 3.287319 1
5 2021-02-15 00:00:00 5.488132 1
6 2021-02-22 00:00:00 4.425636 1
7 2021-03-01 00:00:00 6.244084 1
8 2021-03-08 00:00:00 5.528364 1
9 2021-03-15 01:00:00 7.954929 1
10 2021-03-22 01:00:00 4.691995 1
11 2021-03-29 01:00:00 5.943415 1
12 2021-04-05 01:00:00 5.316373 1
13 2021-04-12 01:00:00 5.182952 1
14 2021-04-19 01:00:00 3.330700 1
15 2021-04-26 01:00:00 7.461089 1
16 2021-05-03 01:00:00 4.338873 1
17 2021-05-10 01:00:00 5.768665 1
18 2021-05-17 01:00:00 3.574488 1
19 2021-05-24 01:00:00 5.106042 2
20 2021-05-31 01:00:00 2.828844 2
21 2021-06-07 01:00:00 4.616084 2
22 2021-06-14 01:00:00 7.234506 2
23 2021-06-21 01:00:00 4.760413 2
24 2021-06-28 01:00:00 7.020543 2
25 2021-07-05 01:00:00 7.403235 2
26 2021-07-12 01:00:00 6.368435 2
27 2021-07-19 01:00:00 3.527764 2
28 2021-07-26 01:00:00 5.254025 2
29 2021-08-02 01:00:00 5.676425 2
30 2021-08-09 01:00:00 3.783304 2
31 2021-08-16 01:00:00 6.310292 2
32 2021-08-23 01:00:00 2.938218 3
33 2021-08-30 01:00:00 5.101852 3
34 2021-09-06 01:00:00 3.765659 3
35 2021-09-13 01:00:00 5.489846 3
36 2021-09-20 01:00:00 4.174276 3
37 2021-09-27 01:00:00 7.348895 3
38 2021-10-04 01:00:00 5.103772 3
39 2021-10-11 01:00:00 4.941248 3
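As a side note, invoke() has since been superseded in purrr; the same combining step can be written with reduce() (a sketch, reusing keydat from above):
# sketch: collapse the list of per-interval results with reduce(coalesce)
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                              df$Date <= keydat$end[.x] ~ .y)) %>%
  reduce(coalesce)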

Aggregate a tibble based on consecutive values in a boolean column

I've got a fairly straight-forward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly timeseries dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
  group = 1:12,
  start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
  end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
  merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
  group = c(1, 4, 7, 9),
  start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
  end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant, I only care about the start and end columns from the result.
We can use dplyr to create a grouping key that increments every time the previous row's merge_with_next value is FALSE (i.e. a new group starts right after a group that should not be merged onward), then take the first value of group and start and the last value of end for each group.
library(dplyr)
df %>%
  group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  summarise(group = first(group),
            start = first(start),
            end = last(end)) %>%
  ungroup() %>%
  select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
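To see why the grouping works, this is what the temporary key evaluates to for the demo data (a new group starts right after every FALSE):
with(df, cumsum(!dplyr::lag(merge_with_next, default = TRUE)))
#> [1] 0 0 0 1 1 1 2 2 2 3 3 3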

How to generate a sequence of dates and times with a specific start date/time in R

I am looking to generate or complete a column of dates and times. I have a dataframe of four numeric columns and one POSIXct time column that looks like this:
CH_1 CH_2 CH_3 CH_4 date_time
1 -10096 -11940 -9340 -9972 2018-07-24 10:45:01
2 -10088 -11964 -9348 -9960 <NA>
3 -10084 -11940 -9332 -9956 <NA>
4 -10088 -11956 -9340 -9960 <NA>
5 -10084 -11944 -9332 -9976 <NA>
6 -10076 -11940 -9340 -9948 <NA>
7 -10088 -11956 -9352 -9960 <NA>
8 -10084 -11944 -9348 -9980 <NA>
9 -10076 -11964 -9348 -9976 <NA>
10 -10076 -11956 -9348 -9964 <NA>
I would like to sequentially generate dates and times for the date_time column, increasing by 1 second until the dataframe is filled (i.e. the next date/time should be 2018-07-24 10:45:02). This is meant to be reproducible for multiple datasets; the number of rows that need to be filled is not always known, but the start date/time will always be present in the first cell.
I know that the solution is likely within seq.Date (or similar), but the problem I have is that I won't always know the end date/time, which is what most examples I have found require. Any help would be appreciated!
Here's a tidyverse solution, using Zygmunt Zawadzki's example data:
library(lubridate)
library(tidyverse)
df %>% mutate(date_time = date_time[1] + seconds(row_number()-1))
Output:
date_time
1 2018-01-01 00:00:00
2 2018-01-01 00:00:01
3 2018-01-01 00:00:02
4 2018-01-01 00:00:03
5 2018-01-01 00:00:04
6 2018-01-01 00:00:05
7 2018-01-01 00:00:06
8 2018-01-01 00:00:07
9 2018-01-01 00:00:08
10 2018-01-01 00:00:09
11 2018-01-01 00:00:10
Data:
df <- data.frame(date_time = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
No need for lubridate, just base R code:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
startDate <- x[["date"]][1]
x[["date2"]] <- startDate + (seq_len(nrow(x)) - 1)
x
# date date2
# 1 2018-01-01 2018-01-01 00:00:00
# 2 <NA> 2018-01-01 00:00:01
# 3 <NA> 2018-01-01 00:00:02
# 4 <NA> 2018-01-01 00:00:03
# 5 <NA> 2018-01-01 00:00:04
# 6 <NA> 2018-01-01 00:00:05
# 7 <NA> 2018-01-01 00:00:06
# 8 <NA> 2018-01-01 00:00:07
# 9 <NA> 2018-01-01 00:00:08
# 10 <NA> 2018-01-01 00:00:09
# 11 <NA> 2018-01-01 00:00:10
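Equivalently, seq.POSIXt accepts length.out, so the whole column can be generated in one call without ever knowing the end time (a sketch on the same demo data):
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA, 10)))
# sketch: one-second steps, as many as there are rows
x$date2 <- seq(from = x$date[1], by = "1 sec", length.out = nrow(x))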

replacing missing values by group conditional on date/time

I want to group by person and date1 and fill in missing data for date2 and indicator by person and day, if the person's next observation occurs in the same day.
For instance, person 1 is missing date2 and indicator values for the second and third observations. As shown below, I want to replace these missing values with the next non-NA observation in the same day for this person: date2==2018-02-02 15:04:00 and indicator==1.
Note that for person 2, the last NA does not have a next observation in the same day, so it needs to remain NA.
Here is the data frame I have:
person date1 date2 indicator
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 <NA> NA
3 1 2018-02-02 14:00:00 <NA> NA
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 <NA> NA
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 <NA> NA
Here is the data frame I want:
person date1 date2 indicator
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 2018-02-02 15:04:00 1
3 1 2018-02-02 14:00:00 2018-02-02 15:04:00 1
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 2018-02-01 13:06:00 1
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 <NA> NA
Example:
library(tidyverse)
df.have <- data.frame(person = c(1, 1, 1, 1, 2, 2, 2, 2),
                      date1 = ymd_hms(c("2018-02-02 12:00:00",
                                        "2018-02-02 13:00:00",
                                        "2018-02-02 14:00:00",
                                        "2018-02-02 15:00:00",
                                        "2018-02-01 12:00:00",
                                        "2018-02-01 13:00:00",
                                        "2018-02-02 12:00:00",
                                        "2018-02-03 12:00:00")),
                      date2 = ymd_hms(c("2018-02-02 12:05:00",
                                        NA,
                                        NA,
                                        "2018-02-02 15:04:00",
                                        NA,
                                        "2018-02-01 13:06:00",
                                        "2018-02-02 12:03:00",
                                        NA)),
                      indicator = c(1, NA, NA, 1,
                                    NA, 1, 1, NA))
df.want <- data.frame(person = c(1, 1, 1, 1, 2, 2, 2, 2),
                      date1 = ymd_hms(c("2018-02-02 12:00:00",
                                        "2018-02-02 13:00:00",
                                        "2018-02-02 14:00:00",
                                        "2018-02-02 15:00:00",
                                        "2018-02-01 12:00:00",
                                        "2018-02-01 13:00:00",
                                        "2018-02-02 12:00:00",
                                        "2018-02-03 12:00:00")),
                      date2 = ymd_hms(c("2018-02-02 12:05:00",
                                        "2018-02-02 15:04:00",
                                        "2018-02-02 15:04:00",
                                        "2018-02-02 15:04:00",
                                        "2018-02-01 13:06:00",
                                        "2018-02-01 13:06:00",
                                        "2018-02-02 12:03:00",
                                        NA)),
                      indicator = c(1, 1, 1, 1,
                                    1, 1, 1, NA))
I can filter down to some of the replacement values, but this is still a good way off from where I want to get:
df.have %>%
  group_by(person, date(date1)) %>%
  arrange(person, date1) %>%
  filter(row_number() %in% c(n()))
You can do it like this (note that you also need lubridate as well as the tidyverse packages)...
df.want <- df.have %>%
  mutate(day = date(date1)) %>%                  # add a date variable for grouping
  group_by(day, person) %>%
  fill(date2, indicator, .direction = "up") %>%  # use tidyr's fill() to replace the NAs
  ungroup() %>%
  select(-day) %>%                               # remove the grouping variable
  arrange(person, date1)                         # restore the original order
df.want
# A tibble: 8 x 4
person date1 date2 indicator
<dbl> <dttm> <dttm> <dbl>
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 2018-02-02 15:04:00 1
3 1 2018-02-02 14:00:00 2018-02-02 15:04:00 1
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 2018-02-01 13:06:00 1
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 NA NA
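A slightly more compact sketch of the same idea: the day can be computed inside group_by() rather than in a separate mutate() (it still appears as a column, so it is dropped afterwards as before):
df.have %>%
  group_by(person, day = date(date1)) %>%
  fill(date2, indicator, .direction = "up") %>%
  ungroup() %>%
  select(-day) %>%
  arrange(person, date1)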

split date and time in different columns of dataframe in R

I have the following dataframe in R:
ID Date1 Date2
1 21-03-16 8:36 22-03-16 12:36
1 23-03-16 9:36 24-03-16 01:36
1 22-03-16 10:36 25-03-16 11:46
1 23-03-16 11:36 28-03-16 10:16
My desired dataframe is
ID Date1 Date1_time Date2 Date2_time
1 2016-03-21 08:36:00 2016-03-22 12:36:00
1 2016-03-23 09:36:00 2016-03-24 01:36:00
1 2016-03-22 10:36:00 2016-03-25 11:46:00
1 2016-03-23 11:36:00 2016-03-28 10:16:00
I can do this individually using strptime, like the following:
df$Date1 <- strptime(df$Date1, format='%d-%m-%y %H:%M')
df$Date1_time <- strftime(df$Date1 ,format="%H:%M:%S")
df$Date1 <- strptime(df$Date1, format='%Y-%m-%d')
But I have many date columns to convert like the above. How can I write a function in R that will do this?
You can do this with dplyr::mutate_at to operate on multiple columns. See select helpers for more info on efficiently specifying which columns to operate on.
Then you can use lubridate and hms for date and time functions.
library(dplyr)
library(lubridate)
library(hms)
df <- readr::read_csv(
'ID,Date1,Date2
1,"21-03-16 8:36","22-03-16 12:36"
1,"23-03-16 9:36","24-03-16 01:36"
1,"22-03-16 10:36","25-03-16 11:46"
1,"23-03-16 11:36","28-03-16 10:16"'
)
df
#> # A tibble: 4 x 3
#> ID Date1 Date2
#> <int> <chr> <chr>
#> 1 1 21-03-16 8:36 22-03-16 12:36
#> 2 1 23-03-16 9:36 24-03-16 01:36
#> 3 1 22-03-16 10:36 25-03-16 11:46
#> 4 1 23-03-16 11:36 28-03-16 10:16
df %>%
  mutate_at(vars(Date1, Date2), dmy_hm) %>%
  mutate_at(vars(Date1, Date2), funs("date" = date(.), "time" = as.hms(.))) %>%
  select(-Date1, -Date2)
#> # A tibble: 4 x 5
#> ID Date1_date Date2_date Date1_time Date2_time
#> <int> <date> <date> <time> <time>
#> 1 1 2016-03-21 2016-03-22 08:36:00 12:36:00
#> 2 1 2016-03-23 2016-03-24 09:36:00 01:36:00
#> 3 1 2016-03-22 2016-03-25 10:36:00 11:46:00
#> 4 1 2016-03-23 2016-03-28 11:36:00 10:16:00
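In more recent dplyr (>= 1.0), mutate_at() and funs() are superseded by across(); a sketch of the same transformation under that API (using base format() for the time columns rather than hms, which is an assumption on my part):
# sketch with across(); .names controls the new column names
df %>%
  mutate(across(c(Date1, Date2), dmy_hm)) %>%
  mutate(across(c(Date1, Date2),
                list(date = as.Date, time = ~ format(.x, "%H:%M:%S")),
                .names = "{.col}_{.fn}")) %>%
  select(-Date1, -Date2)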
Using dplyr for manipulation:
convertTime <- function(x) as.POSIXct(x, format = '%d-%m-%y %H:%M')
df %>%
  mutate_at(vars(Date1, Date2), convertTime) %>%
  group_by(ID) %>%
  mutate_all(funs("date" = as.Date(.), "time" = format(., "%H:%M:%S")))
# Source: local data frame [4 x 7]
# Groups: ID [1]
#
# ID Date1 Date2 Date1_date Date2_date Date1_time Date2_time
# <int> <dttm> <dttm> <date> <date> <chr> <chr>
# 1 1 2016-03-22 12:36:00 2016-03-22 12:36:00 2016-03-22 2016-03-22 12:36:00 12:36:00
# 2 1 2016-03-24 01:36:00 2016-03-24 01:36:00 2016-03-23 2016-03-23 01:36:00 01:36:00
# 3 1 2016-03-25 11:46:00 2016-03-25 11:46:00 2016-03-25 2016-03-25 11:46:00 11:46:00
# 4 1 2016-03-28 10:16:00 2016-03-28 10:16:00 2016-03-28 2016-03-28 10:16:00 10:16:00
I had the same problem; you can try this approach using strsplit, which may help:
x <- df$Date1
y <- t(as.data.frame(strsplit(as.character(x), ' ')))
row.names(y) <- NULL
# store the split data in new columns
df$date <- y[, 1]  # date column
df$time <- y[, 2]  # time column
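For splitting a character date-time column, tidyr::separate() is a tidier alternative to strsplit (a sketch, assuming Date1 is still a character column in "dd-mm-yy hh:mm" form):
library(tidyr)
# sketch: split on the space into separate date and time columns, keeping the original
df <- separate(df, Date1, into = c("Date1_date", "Date1_time"), sep = " ", remove = FALSE)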
