Missing datetime values in a time series: filling gaps by carrying the last value forward - datetime

I have an intraday time series, like this:
01/12/2022 08:00 4545
01/12/2022 08:01 85758
01/12/2022 08:03 87786
01/12/2022 08:04 456867
01/12/2022 08:06 4278528
01/12/2022 08:07 5682
01/12/2022 08:08 428
01/12/2022 08:09 5272
As you can see, my time series has gaps: for some minutes the table has missing rows. What I would like to do is fill in the missing timestamps and fill each missing value with the last value before it. I tried many things, but so far I couldn't manage to solve it properly. Any help would be welcome!
I already tried converting the datetime column to POSIXct format, but no luck with that. I was thinking of creating a vector of datetimes with seq, but I was not able to define the minimum and maximum values to get the range right: max(datetime) and min(datetime) only returned NA.

Data
data <-
structure(list(date = c("01/12/2022 08:00", "01/12/2022 08:01",
"01/12/2022 08:03", "01/12/2022 08:04", "01/12/2022 08:06", "01/12/2022 08:07",
"01/12/2022 08:08", "01/12/2022 08:09"),
value = c(4545L, 85758L,87786L, 456867L, 4278528L, 5682L, 428L, 5272L)),
class = "data.frame", row.names = c(NA,-8L))
Code
library(lubridate)
library(dplyr)
library(tidyr)
data <- data %>%
  mutate(date = dmy_hm(date))

data.frame(date = seq.POSIXt(from = min(data$date), to = max(data$date), by = "min")) %>%
  left_join(data) %>%
  fill(value, .direction = "down")
Output
Joining, by = "date"
date value
1 2022-12-01 08:00:00 4545
2 2022-12-01 08:01:00 85758
3 2022-12-01 08:02:00 85758
4 2022-12-01 08:03:00 87786
5 2022-12-01 08:04:00 456867
6 2022-12-01 08:05:00 456867
7 2022-12-01 08:06:00 4278528
8 2022-12-01 08:07:00 5682
9 2022-12-01 08:08:00 428
10 2022-12-01 08:09:00 5272
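An equivalent approach, sketched here assuming the same data as above, lets tidyr::complete() add the missing minutes before carrying values forward:
library(dplyr)
library(lubridate)
library(tidyr)

data %>%
  mutate(date = dmy_hm(date)) %>%
  complete(date = seq(min(date), max(date), by = "min")) %>%  # add the missing minutes as NA rows
  fill(value, .direction = "down")                            # carry the last observed value forward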

Related

Can you specify what space to separate columns by?

I am working with a data set called sleep with the following columns:
head(sleep)
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
1 1503960366 4/12/2016 12:00:00 AM 1 327 346
2 1503960366 4/13/2016 12:00:00 AM 2 384 407
3 1503960366 4/15/2016 12:00:00 AM 1 412 442
4 1503960366 4/16/2016 12:00:00 AM 2 340 367
I am trying to separate the SleepDay column into two columns named "Date" and "Time".
I used the separate function and was able to create the two columns below:
separate(weight_log, Date, into = c("Date", "Time"), sep = ' ')
Id Date Time WeightKg WeightPounds Fat BMI IsManualReport LogId
1 1503960366 5/2/2016 11:59:59 52.6 115.9631 22 22.65 True 1.462234e+12
2 1503960366 5/3/2016 11:59:59 52.6 115.9631 NA 22.65 True 1.462320e+12
3 1927972279 4/13/2016 1:08:52 133.5 294.3171 NA 47.54 False 1.460510e+12
I want to be able to keep the AM and PM next to the times, but with the function I used they seem to disappear, I assume because I am separating on a space. Is there any way to specify that I only want to split the column into two at the first space?
Edit: The data set sleep shown at the top is different from the data set I used the separate function on, which is weight_log, but the issue is the same.
data.frame(SleepDay = "4/12/2016 12:00:00 AM") %>%
separate(SleepDay, into = c("Date", "Time"), sep = " ", extra = "merge")
# Date Time
#1 4/12/2016 12:00:00 AM
If you are doing further analysis or visualization, I recommend converting the text into a datetime.
library(lubridate)
data.frame(SleepDay = "4/12/2016 12:05:00 AM") %>%
mutate(SleepDay = mdy_hms(SleepDay),
SleepDay_base = as.POSIXct(SleepDay),
date = as_date(SleepDay),
time_12 = format(SleepDay, "%I:%M %p"),
time_24 = format(SleepDay, "%H:%M"))
# SleepDay SleepDay_base date time_12 time_24
#1 2016-04-12 00:05:00 2016-04-12 00:05:00 2016-04-12 12:05 AM 00:05
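If you prefer base R, a minimal sketch of splitting on the first space only (assuming the column is called SleepDay, as in the sleep data above):
# everything before the first space is the date,
# everything after it (including AM/PM) is the time
sleep$Date <- sub(" .*$", "", sleep$SleepDay)
sleep$Time <- sub("^\\S+ ", "", sleep$SleepDay)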

How to merge date and time into one variable

I want to combine a date variable and a time variable into one variable, like this: 2012-05-02 07:30.
This code does the job, but I need to get the new combined variable into the data frame; this code only shows it in the console:
as.POSIXct(paste(data$Date, data$Time), format="%Y-%m-%d %H:%M")
This code is supposed to combine time and date, but seemingly doesn't: in the column "Combined" only the date appears.
data$Combined = as.POSIXct(paste0(data$Date,data$Time))
Here's the data
structure(list(Date = structure(c(17341, 18198, 17207, 17023,
17508, 17406, 18157, 17931, 17936, 18344), class = "Date"), Time = c("08:40",
"10:00", "22:10", "18:00", "08:00", "04:30", "20:00", "15:40",
"11:00", "07:00")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
We could use the ymd_hm function from the lubridate package:
library(lubridate)
df$Date_time <- ymd_hm(paste0(df$Date, df$Time))
Date Time Date_time
<date> <chr> <dttm>
1 2017-06-24 08:40 2017-06-24 08:40:00
2 2019-10-29 10:00 2019-10-29 10:00:00
3 2017-02-10 22:10 2017-02-10 22:10:00
4 2016-08-10 18:00 2016-08-10 18:00:00
5 2017-12-08 08:00 2017-12-08 08:00:00
6 2017-08-28 04:30 2017-08-28 04:30:00
7 2019-09-18 20:00 2019-09-18 20:00:00
8 2019-02-04 15:40 2019-02-04 15:40:00
9 2019-02-09 11:00 2019-02-09 11:00:00
10 2020-03-23 07:00 2020-03-23 07:00:00
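The base R approach from the question also works once the result is assigned back and paste() is used instead of paste0(); paste() inserts a space between date and time, whereas paste0() does not, which is why the paste0() version only recovered the date part:
df$Combined <- as.POSIXct(paste(df$Date, df$Time), format = "%Y-%m-%d %H:%M")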

For loops in R: feed the result of grep(pattern) into a for loop for multiple columns

I would like to calculate with lab values from certain (asynchronous) dates (120+). However, lubridate does not allow both ymd and ymd_hms in the same column. Hence, I wrote a working grep which looks up the dates without a time and appends 00:00:00 to those, so e.g. 01-12-1998 becomes 01-12-1998 00:00:00 (based on this query: lubridate converting midnight timestamp returns NA: how to fill missing timestamp).
Now I want to write a for loop which automatically recognizes the eligible columns (these might change in the future) and applies the time addition to each of them.
I couldn't find the right documentation to tie all the pieces below together. Would love to know where to find more info on this!
Data frame: Testset
ID Lab_date1 Lab_date2 Lab_date3 Lab_date4
76 18/1/1982 26/01/1990 20/06/1990 15/11/1990
183 18/10/1982 24/04/1989 27/04/1989 02/04/1991
27 1/11/1983 18/10/1982 01:01 13/04/1983 31/10/1984
84 12-1-1983 12-1-1983 00:00 21-4-1983 15:10 22-3-1984 00:00
28 13-10-1989 13-1-1989 12:00 13-11-1991 14:11 19-11-1991 00:00
120 1-10-1982 14-7-1982 00:00 26-8-1986 00:00 26-8-1986 00:00
The code for altering the dates, currently written for Lab_date1, is:
Testset$Lab_date1[grep("[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$",Testset$Lab_date1)] <- paste(
Testset$Lab_date1[grep("[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$",Testset$Lab_date1)],"00:00:00")
Also, I wrote a grep(pattern) call which returns the column numbers of the lab dates, i.e. 2:4. Can this result be fed into a for loop together with the code above?
dat_lab <- grep(pattern="Lab_date",
x=colnames(Testset))
I already tried this, but it didn't work
for(i in names(dat_lab)){
y <- dat_lab[i]
y[grep("[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$",y)] <- paste(
y[grep("[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$",y)],"00:00:00")
}
You can use parse_date_time from lubridate to parse datetimes in different formats.
library(dplyr)
Testset %>%
mutate(across(starts_with('Lab_date'),
lubridate::parse_date_time, c('dmY', 'dmY HM'))) -> Testset
Testset
# ID Lab_date1 Lab_date2 Lab_date3 Lab_date4
#1 76 1982-01-18 1990-01-26 00:00:00 1990-06-20 00:00:00 1990-11-15
#2 183 1982-10-18 1989-04-24 00:00:00 1989-04-27 00:00:00 1991-04-02
#3 27 1983-11-01 1982-10-18 01:01:00 1983-04-13 00:00:00 1984-10-31
#4 84 1983-01-12 1983-01-12 00:00:00 1983-04-21 15:10:00 1984-03-22
#5 28 1989-10-13 1989-01-13 12:00:00 1991-11-13 14:11:00 1991-11-19
#6 120 1982-10-01 1982-07-14 00:00:00 1986-08-26 00:00:00 1986-08-26
data
Testset <- structure(list(ID = c(76L, 183L, 27L, 84L, 28L, 120L), Lab_date1 = c("18/1/1982",
"18/10/1982", "1/11/1983", "12-1-1983", "13-10-1989", "1-10-1982"
), Lab_date2 = c("26/01/1990", "24/04/1989", "18/10/1982 01:01",
"12-1-1983 00:00", "13-1-1989 12:00", "14-7-1982 00:00"), Lab_date3 = c("20/06/1990",
"27/04/1989", "13/04/1983", "21-4-1983 15:10", "13-11-1991 14:11",
"26-8-1986 00:00"), Lab_date4 = c("15/11/1990", "02/04/1991",
"31/10/1984", "22-3-1984 00:00", "19-11-1991 00:00", "26-8-1986 00:00"
)), class = "data.frame", row.names = c(NA, -6L))
We can use anytime
library(anytime)
library(dplyr)
addFormats('%d/%m/%Y')
Testset %>%
mutate(across(starts_with('Lab_date'), anytime))
# ID Lab_date1 Lab_date2 Lab_date3 Lab_date4
#1 76 1982-01-18 1990-01-26 1990-06-20 1990-11-15
#2 183 1982-10-18 1989-04-24 1989-04-27 1991-04-02
#3 27 1983-01-11 1982-10-18 1983-04-13 1984-10-31
#4 84 1983-01-12 1983-01-12 1983-04-21 1984-03-22
#5 28 1989-10-13 1989-01-13 1991-11-13 1991-11-19
#6 120 1982-01-10 1982-07-14 1986-08-26 1986-08-26
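As for the loop in the question: dat_lab holds column positions, not names, so names(dat_lab) is NULL and the loop body never runs. A minimal sketch that iterates over the positions directly (using the same regex as in the question) would be:
for (i in dat_lab) {
  no_time <- grep("[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", Testset[[i]])      # rows without a time part
  Testset[[i]][no_time] <- paste(Testset[[i]][no_time], "00:00:00")     # append midnight to those rows
}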

Detect two consecutive days and subset/filter a certain time period among two consecutive dates in data.frame in R

I have a data.frame of weather info at this link:
https://www.dropbox.com/s/60p93cmhgdi93yd/weather%EF%BC%882%EF%BC%89.xlsx?dl=0
This weather info is recorded every 4-6 min (depending on the day). I want to extract a certain period of time spanning two consecutive days from the data.frame. For example, I would like to extract the period from 9:45 am on 2018-4-9 through 9:45 am on 2018-4-10, from 9:45 am on 2018-4-23 through 9:45 am on 2018-4-24, and so on.
I also created a fake data.frame as recommended, but my actual data.frame has 60+ groups of two consecutive days:
df1 <- data.frame(
datetime = seq(
as.POSIXct("2018-4-9 00:00"), as.POSIXct("2018-4-10 00:00"), by = "60 min"))
df2 <- data.frame(
datetime = seq(
as.POSIXct("2018-4-23 00:00"), as.POSIXct("2018-4-24 00:00"), by = "60 min"))
df3 <- data.frame(
datetime = seq(
as.POSIXct("2018-5-7 00:00"), as.POSIXct("2018-5-8 00:00"), by = "60 min"))
df <- rbind(df1, df2, df3)
I have thought of several ways to do this:
I can use the lubridate package to transform the time into numeric form, so I can define the window of values to extract. But I also need to group every two consecutive dates together to calculate the duration. I had some code like this:
daystart <- hm("0:0")
weather$date1 <- sort(as.Date(weather$Date))
a <- split(weather$date1,cumsum(c(TRUE,diff(weather$date1)>1)))
weather <- data.frame(weather,a)
#this does not work
weather <- weather %>%
group_by(group #the grouped consecutive days) %>%
mutate(dur = as.numeric(Time-daystart)) %>%
filter(dur > xxx & dur < xxxx)
#I was thinking to do it this way
The object a groups two consecutive days together but only returns each group ID once, so it cannot be combined with the weather data.frame (I guess this is the problem). Also, I am not sure how to calculate the duration for every pair of consecutive days, but I think it could be done once I can group the consecutive days together.
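Something along these lines is what I have in mind for attaching the group ID row by row (just a sketch, assuming the rows are in chronological order and the Date column is already of class Date):
library(dplyr)
weather <- weather %>%
  mutate(group = cumsum(c(TRUE, diff(Date) > 1)))  # start a new group whenever the date jumps by more than one day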
I also thought about using "filter" and "ifelse" to extract the time
weather <- weather %>%
filter(
if(diff(Date) <= 1){
Time <= trapstart
}
else{
NULL
}
)
Something like this, but it does not work (of course).
What I really want to build is something like this (this is not actual code):
weather <- weather %>%
filter(
if("these are two consecutive days"){
"9:45 of the first day < Time <= 9:45 the second day"
}
else{
NULL
}
)
The recording times in this data.frame are not consistent from day to day, so both the timestamps and the number of recorded data points differ between days.
Here is what I expect for the output (imagine I only have 5 records each day):
Date Time DateTime
4/9/2018 9:46 4/9/2018 9:46
4/9/2018 15:34 4/9/2018 15:34
4/9/2018 22:44 4/9/2018 22:44
4/10/2018 4:34 4/10/2018 4:34
4/10/2018 7:09 4/10/2018 7:09
4/10/2018 9:44 4/10/2018 9:44
4/23/2018 9:46 4/23/2018 9:46
4/23/2018 12:27 4/23/2018 12:27
4/23/2018 19:29 4/23/2018 19:29
4/24/2018 1:08 4/24/2018 1:08
4/24/2018 5:24 4/24/2018 5:24
4/24/2018 9:44 4/24/2018 9:44
5/7/2018 9:48 5/7/2018 9:48
5/7/2018 17:59 5/7/2018 17:59
5/8/2018 0:55 5/8/2018 0:55
5/8/2018 1:00 5/8/2018 1:00
5/8/2018 4:30 5/8/2018 4:30
5/8/2018 9:41 5/8/2018 9:41
I am not sure if I am stating my question in an understandable way, since this logic is messing with my brain right now... I would appreciate any suggestions and help! Also, feel free to ask me to clarify my question if it is not clear enough.
I would use dplyr::between in the following way.
First, let's generate some sample data (always best to explicitly include data instead of providing a link).
df <- data.frame(
datetime = seq(
as.POSIXct("2018-4-9 00:00"), as.POSIXct("2018-4-11 00:00"), by = "5 min"))
Then we can filter the data between "2018-4-9 9:45" and "2018-4-10 9:45" using dplyr::between
library(dplyr)
start <- as.POSIXct("2018-4-9 9:45")
end <- as.POSIXct("2018-4-10 9:45")
df %>% filter(between(datetime, start, end))
PS. Perhaps a typo, but I think the library you are referring to in your post is called lubridate, not lubricate.
Update
You can achieve filtering your source data by multiple (start, end) ranges using a non-equi join of your original data df and a dataframe that contains the different ranges.
Here's an example based on the sample data you give, and using fuzzyjoin::fuzzy_inner_join to do the non-equi join:
library(dplyr)
library(fuzzyjoin)
df_range <- data.frame(
start = as.POSIXct(c("2018-4-9 9:45", "2018-4-23 9:45")),
end = as.POSIXct(c("2018-4-10 9:45", "2018-4-24 9:45"))
)
df %>%
fuzzy_inner_join(
df_range,
by = c("datetime" = "start", "datetime" = "end"),
match_fun = list(`>=`, `<=`)) %>%
select(-start, -end)
# datetime
#1 2018-04-09 10:00:00
#2 2018-04-09 11:00:00
#3 2018-04-09 12:00:00
#4 2018-04-09 13:00:00
#5 2018-04-09 14:00:00
#6 2018-04-09 15:00:00
#7 2018-04-09 16:00:00
#8 2018-04-09 17:00:00
#9 2018-04-09 18:00:00
#10 2018-04-09 19:00:00
#11 2018-04-09 20:00:00
#12 2018-04-09 21:00:00
#13 2018-04-09 22:00:00
#14 2018-04-09 23:00:00
#15 2018-04-10 00:00:00
#16 2018-04-23 10:00:00
#17 2018-04-23 11:00:00
#18 2018-04-23 12:00:00
#19 2018-04-23 13:00:00
#20 2018-04-23 14:00:00
#21 2018-04-23 15:00:00
#22 2018-04-23 16:00:00
#23 2018-04-23 17:00:00
#24 2018-04-23 18:00:00
#25 2018-04-23 19:00:00
#26 2018-04-23 20:00:00
#27 2018-04-23 21:00:00
#28 2018-04-23 22:00:00
#29 2018-04-23 23:00:00
#30 2018-04-24 00:00:00
Sample data
df1 <- data.frame(
datetime = seq(
as.POSIXct("2018-4-9 00:00"), as.POSIXct("2018-4-10 00:00"), by = "60 min"))
df2 <- data.frame(
datetime = seq(
as.POSIXct("2018-4-23 00:00"), as.POSIXct("2018-4-24 00:00"), by = "60 min"))
df3 <- data.frame(
datetime = seq(
as.POSIXct("2018-5-7 00:00"), as.POSIXct("2018-5-8 00:00"), by = "60 min"))
df <- rbind(df1 , df2, df3)
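With dplyr 1.1.0 or later, the same non-equi join can be written without fuzzyjoin via join_by (a sketch, assuming the same df and df_range as above):
library(dplyr)
df %>%
  inner_join(df_range, by = join_by(between(datetime, start, end))) %>%  # keep rows whose datetime falls in any (start, end) range
  select(-start, -end)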

Extract maximum hourly value each day R

I have this data.frame:
Time a b c d
1 2015-01-01 00:00:00 863 1051 1899 25385
2 2015-01-01 01:00:00 920 1009 1658 24382
3 2015-01-01 02:00:00 1164 973 1371 22734
4 2015-01-01 03:00:00 1503 949 779 21286
5 2015-01-01 04:00:00 1826 953 720 20264
6 2015-01-01 05:00:00 2109 952 743 19905
...
Time a b c d
8756 2015-12-31 19:00:00 0 775 4957 28812
8757 2015-12-31 20:00:00 0 783 5615 29568
8758 2015-12-31 21:00:00 0 790 4838 28653
8759 2015-12-31 22:00:00 0 766 3841 27078
8760 2015-12-31 23:00:00 72 729 2179 24565
8761 2016-01-01 00:00:00 290 710 1612 23311
It represents every hour of every day for a year. I would like to extract one row per day, namely the row with the maximum value of d, so that at the end I obtain a 365x5 data.frame.
I have tried all the suggestions from "Extract the maximum value within each group in a dataframe" and also "Daily minimum values in R", but it still doesn't work.
Maybe it comes from the way I generate my time series?
library(lubridate)
start <- dmy_hms("1 Jan 2015 00:00:00")
end <- dmy_hms("01 Jan 2016 00:00:00")
time <- as.data.frame(seq(start, end, by="hours"))
Thanks for the help!
If we are aggregating by day, convert the 'Time' column to Date class (stripping the time component), group by that, and get the max of 'd'. In the OP's post the data.table syntax mixes mydf and df; assuming these are the same, we need
library(data.table)
setDT(mydf)[, .(d = max(d)), by = .(Day = as.Date(Time))]
Or using aggregate from base R
aggregate(d ~ Day, transform(mydf, Day = as.Date(Time)), FUN = max)
Or with tidyverse
library(tidyverse)
mydf %>%
group_by(Day = as.Date(Time)) %>%
summarise(d = max(d))
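The summarise approach keeps only d; if you want the whole row for each day's maximum (the 365x5 result asked for), a possible alternative (a sketch assuming dplyr >= 1.0.0 and a numeric d column) is slice_max:
library(dplyr)
mydf %>%
  group_by(Day = as.Date(Time)) %>%
  slice_max(d, n = 1, with_ties = FALSE) %>%  # keep the single row with the largest d for each day
  ungroup() %>%
  select(-Day)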
NOTE: Based on the OP's comments, columns 'a' to 'd' are factors. We need to convert them to numeric, either at the beginning or during processing:
mydf$d <- as.numeric(as.character(mydf$d))
For multiple columns
mydf[c('a', 'b', 'c', 'd')] <- lapply(mydf[c('a', 'b', 'c', 'd')], function(x)
as.numeric(as.character(x)))
data
mydf <- structure(list(Time = c("2015-01-01 00:00:00", "2015-01-01 01:00:00",
"2015-01-01 02:00:00", "2015-01-01 03:00:00", "2015-01-01 04:00:00",
"2015-01-01 05:00:00"), a = c(863L, 920L, 1164L, 1503L, 1826L,
2109L), b = c(1051L, 1009L, 973L, 949L, 953L, 952L), c = c(1899L,
1658L, 1371L, 779L, 720L, 743L), d = c(25385L, 24382L, 22734L,
21286L, 20264L, 19905L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
'max' doesn't work with factors. Hence convert the column for which you are finding the maximum (in your case, column d) into a numeric. Note that calling as.numeric directly on a factor returns the underlying level codes, so convert via as.character first.
Assuming your data set is in a data frame:
mydf$d <- as.numeric(as.character(mydf$d))
Thanks for your help! In the end I chose
do.call(rbind, lapply(split(test,test$time), function(x) {return(x[which.max(x$d),])}))
which gives me a 365x5 data.frame. All your suggestions were right; I just needed to change my time series like this:
time <- as.data.frame(rep(c(1:365), each = 24))
test <- cbind.data.frame(time, df, timebis)

Resources