I am wondering if I am missing something?!
I would like to know: is there a better/shorter way to get minutes from a datetime object:
Lessons studied so far:
Extract time (HMS) from lubridate date time object?
converting from hms to hour:minute in r, or rounding in minutes from hms
R: Convert hours:minutes:seconds
My tibble:
df <- structure(list(dttm = structure(c(-2209068000, -2209069200, -2209061520,
-2209064100, -2209065240), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
# A tibble: 5 x 1
dttm
<dttm>
1 1899-12-31 02:00:00
2 1899-12-31 01:40:00
3 1899-12-31 03:48:00
4 1899-12-31 03:05:00
5 1899-12-31 02:46:00
I would like to add a new column with minutes as integer:
My approach so far:
library(dplyr)
library(lubridate) # ymd_hms
library(hms) # as_hms
library(chron) # times
df %>%
mutate(dttm_min = as_hms(ymd_hms(dttm)),
dttm_min = 60*24*as.numeric(times(dttm_min)))
# A tibble: 5 x 2
dttm dttm_min
<dttm> <dbl>
1 1899-12-31 02:00:00 120
2 1899-12-31 01:40:00 100
3 1899-12-31 03:48:00 228
4 1899-12-31 03:05:00 185
5 1899-12-31 02:46:00 166
This gives me the result I want, but the operation needs three packages. Is there a more direct way?
Here is a base R way -
You can extract the time using format, change the date to '1970-01-01' (Since R datetime starts with '1970-01-01'), convert to numeric and divide the time by 60 to get the duration in minutes.
as.numeric(as.POSIXct(paste('1970-01-01', format(df$dttm, '%T')), tz = 'UTC'))/60
#[1] 120 100 228 185 166
Here are two ways.
Base R
with(df, as.integer(format(dttm, "%M")) + 60*as.integer(format(dttm, "%H")))
#[1] 120 100 228 185 166
Another base R option, using class "POSIXlt" as proposed here.
minute_of_day <- function(x){
y <- as.POSIXlt(x)
60*y$hour + y$min
}
minute_of_day(df$dttm)
#[1] 120 100 228 185 166
Package lubridate
lubridate::minute(df$dttm) + 60*lubridate::hour(df$dttm)
#[1] 120 100 228 185 166
If the package is loaded, this can be simplified, with the same output, to
library(lubridate)
minute(df$dttm) + 60*hour(df$dttm)
We can use
library(data.table)
as.numeric(as.ITime(format(df$dttm, '%T')))/60
[1] 120 100 228 185 166
For sake of completeness, the time of day (in minutes) can be calculated by taking the difftime() between the POSIXct datetime object and the beginning of the day, e.g.,
difftime(df$dttm, lubridate::floor_date(df$dttm, "day"), units = "min")
Time differences in mins
[1] 120 100 228 185 166
Besides base R only one other package is required.
According to help("difftime"), difftime() returns an object of class "difftime" with an attribute indicating the units.
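Since the question asks for the minutes as an integer, the difftime result can be converted with as.integer(). The sketch below is one possible base R variant, not part of the answers above: it rebuilds the question's timestamps under the name dttm, assumes they are in UTC, and uses trunc() instead of lubridate::floor_date(). The modular-arithmetic line ignores time zone offsets, so it is only valid for UTC data.

```r
# Same timestamps as in the question, assumed to be UTC
dttm <- as.POSIXct(c("1899-12-31 02:00:00", "1899-12-31 01:40:00",
                     "1899-12-31 03:48:00", "1899-12-31 03:05:00",
                     "1899-12-31 02:46:00"), tz = "UTC")

# Variant 1: difference to midnight, then drop the difftime class
min_diff <- as.integer(difftime(dttm, trunc(dttm, "days"), units = "mins"))

# Variant 2: seconds since the epoch modulo one day (UTC only)
min_mod <- as.integer((as.numeric(dttm) %% 86400) %/% 60)

min_diff
min_mod
```

Both variants return an integer vector, so the result can be assigned directly to a new column.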
I know a lot of questions have been asked on the same subject but I have not found an answer to this particular question, despite trying to adapt other codes to my problem.
My data frame "v1" has more than 300 thousand lines with the variable "Date" in the following format:
Date
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:45:00
First, I want to know whether all consecutive "Date" values are exactly 5 minutes apart. If not, I would like to find where the different intervals are.
Second, I intend to create a new column showing the length of each interval as a time stamp. For example, a column "time_int" containing "00:05:00", "00:05:00"...
Any help will be appreciated. Thank you in advance.
Here is an option to calculate the difference using lag. If you'd like, you could create another column showing hours with units = "hours".
library(tidyverse)
library(lubridate)
df <- data.frame(date = ymd_hms(c("2015-07-27 17:35:00",
"2015-07-27 17:40:00", "2015-07-27 17:49:00", "2015-07-27 19:49:00")))
df %>%
mutate(diff = date - lag(date),
diff_minutes = as.numeric(diff, units = "mins"),
time_int = format(.POSIXct(diff_minutes*60, "UTC"), "%H:%M:%S")) %>%
select(date, diff_minutes, time_int) %>%
# Filter the data for a range of minutes
filter(diff_minutes >= 5 & diff_minutes < 10)
# OUTPUT:
#> date diff_minutes time_int
#> 1 2015-07-27 17:40:00 5 00:05:00
#> 2 2015-07-27 17:49:00 9 00:09:00
Created on 2021-03-09 by the reprex package (v0.3.0)
Original Data
date
<S3: POSIXct>
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:49:00
2015-07-27 19:49:00
You can use rollapplyr to find the time difference between two consecutive rows. Then you can use which to find the rows where the time difference is not 5 minutes.
dt=read.table(text=text, header=TRUE)
library(lubridate)
library(dplyr)
library(zoo)
dt=mutate(dt, Date=ymd_hms(Date)) %>%
mutate(Dif=rollapplyr(Date, 2, function(x) {
# state the units explicitly so longer gaps are not reported in hours or days
return(difftime(x[2], x[1], units="mins"))
}, fill=NA))
dt
Date Dif
1 2015-07-27 17:35:00 NA
2 2015-07-27 17:40:00 5
3 2015-07-27 17:45:00 5
4 2015-07-27 17:49:00 4
dt[which(dt$Dif != as.difftime(5, units="mins")),]
Date Dif
4 2015-07-27 17:49:00 4
Lastly, to format the times in your desired format:
dt %>% mutate(DifString=format(.POSIXct(Dif*60, tz="GMT"), "%H:%M:%S"))
Date Dif DifString
1 2015-07-27 17:35:00 NA <NA>
2 2015-07-27 17:40:00 5 00:05:00
3 2015-07-27 17:45:00 5 00:05:00
4 2015-07-27 17:49:00 4 00:04:00
Data
text="Date
'2015-07-27 17:35:00'
'2015-07-27 17:40:00'
'2015-07-27 17:45:00'
'2015-07-27 17:49:00'"
dt=read.table(text=text, header=TRUE)
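For comparison, the same check can be sketched in base R with diff(), which avoids the extra packages. This is only an illustration on the four sample rows: the names gap_min, irregular and secs are made up here, the timestamps are assumed to be in UTC, and the sprintf() formatting is chosen so that gaps of 24 hours or more are not wrapped around.

```r
v1 <- data.frame(Date = as.POSIXct(c("2015-07-27 17:35:00", "2015-07-27 17:40:00",
                                     "2015-07-27 17:45:00", "2015-07-27 17:49:00"),
                                   tz = "UTC"))

# Gap to the previous row in minutes (NA for the first row)
gap_min <- c(NA, as.numeric(diff(v1$Date), units = "mins"))

# Rows whose gap differs from 5 minutes
irregular <- which(gap_min != 5)

# Format the gap as HH:MM:SS
secs <- as.integer(round(gap_min * 60))
v1$time_int <- sprintf("%02d:%02d:%02d",
                       secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
v1$time_int[is.na(secs)] <- NA

v1[irregular, ]
```

Because diff() is vectorised, this should also scale to a data frame with a few hundred thousand rows.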
I have a large dataset covering many years with several variables, but the ones I am interested in are wind speed (WS) and dateTime. I want to find the time of the maximum wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should illustrate my point. Note that the dateTime sequence here does not come out as hourly data, but it is enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
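# draw one value per time stamp; a hard-coded 1798 does not match the length of dateTime
WS <- sample(0:20,length(dateTime),rep=TRUE)
WD <- sample(0:390,length(dateTime),rep=TRUE)
Temp <- sample(0:40,length(dateTime),rep=TRUE)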
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just the date (without the time) to isolate each day. However, everything I have tried (aggregate, splitting, xts) has returned only a shortened data frame with date and WS. Aggregate was the only one that kept a time, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate) # hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if several hours share the same maximal wind speed (in this example: 15), only the first hour with max(WS) is shown as the result, even though the wind speed 15 was reached on that date at the hours 0, 3, 4, 21 and 22! So you might need more specific logic.
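One way to keep every tied time stamp rather than only the first is to return all rows that equal the daily maximum. The following is a base R sketch on a small made-up data set (not the sampled data above); the names day and peaks are chosen for illustration, and days consisting only of NA are skipped.

```r
# Small made-up example: the second day reaches its maximum twice
df <- data.frame(
  dateTime = as.POSIXct(c("2011-01-01 00:00:00", "2011-01-01 06:00:00",
                          "2011-01-01 12:00:00", "2011-01-02 00:00:00",
                          "2011-01-02 06:00:00", "2011-01-02 12:00:00"),
                        tz = "GMT"),
  WS = c(3, NA, 12, 15, 7, 15)
)

day <- as.Date(df$dateTime)

# For each day keep all rows whose WS equals that day's maximum
peaks <- do.call(rbind, lapply(split(df, day), function(d) {
  if (all(is.na(d$WS))) return(d[0, , drop = FALSE])  # no valid reading that day
  d[which(d$WS == max(d$WS, na.rm = TRUE)), , drop = FALSE]
}))
rownames(peaks) <- NULL

peaks
```

The result keeps the full dateTime of each maximum, so it can be joined to another data frame by time stamp.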
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
I have a very big dataframe in R, containing weather data with the following format.
valid temp
1 17/08/2014 00:20 14
2 17/08/2014 00:50 14
3 17/08/2014 01:20 13.5
4 17/08/2014 01:50 13
5 17/08/2014 02:20 12
6 17/08/2014 02:50 10
I would like to convert these sub-hourly data to hourly, like the following.
valid tmpc
1 2014-08-17 00:00:00 14
2 2014-08-17 01:00:00 13.75
3 2014-08-17 02:00:00 12.5
The class of df$valid is 'factor'. I have tried first converting them to Date through POSIXct, but it gives only NA values. I have also tried changing the system locale and still I get NAs.
We can do this in base R by converting to POSIXlt, set the minute to 0, convert it back to POSIXct and aggregate to get the mean of 'temp'
df1$valid <- strptime(df1$valid, "%d/%m/%Y %H:%M")
df1$valid$min <- 0
df1$valid <- as.POSIXct(df1$valid)
aggregate(temp~valid, df1, FUN = mean)
Option 1: The lubridate solution using ceiling_date or round_date. It is not clear from your data frame and expected result whether you want to round or to take the ceiling. For instance, in the first row you are rounding and in the third you are using the ceiling. Anyway, here is an example:
library(lubridate)
df <- data.frame(i = 1, valid= "17/08/2014 01:28", temp = 14)
df$valid <- dmy_hm(df$valid)
df$valid_round <- ceiling_date(df$valid , unit="hours")
Results:
i valid temp valid_round
1 1 2014-08-17 01:28:00 14 2014-08-17 02:00:00
Option 2: using the base functions. Use:
df$valid <- as.POSIXct(strptime(df$valid, "%d/%m/%Y %H:%M", tz ="UTC"))
and then round it.
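If the intent is to assign each reading to the hour in which it falls (00:20 and 00:50 both go to 00:00, as in the expected output of the question), truncating rather than rounding seems to fit. A minimal base R sketch under that assumption, using the six sample rows, an assumed UTC time zone and a helper column named hour:

```r
df1 <- data.frame(
  valid = factor(c("17/08/2014 00:20", "17/08/2014 00:50", "17/08/2014 01:20",
                   "17/08/2014 01:50", "17/08/2014 02:20", "17/08/2014 02:50")),
  temp = c(14, 14, 13.5, 13, 12, 10)
)

# Parse the factor via character, giving the day/month/year format explicitly
ts <- as.POSIXct(as.character(df1$valid), format = "%d/%m/%Y %H:%M", tz = "UTC")

# Truncate to the start of the hour and average within each hour
df1$hour <- as.POSIXct(trunc(ts, "hours"))
hourly <- aggregate(temp ~ hour, data = df1, FUN = mean)

hourly
```

Passing the format string explicitly is what avoids the NA values mentioned in the question, since the default parser expects year-month-day order.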
I am trying to convert a day-number within the year to month/day format.
With this df:
set.seed(123)
df1 <- data.frame(Year = rep(15,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df2 <- data.frame(Year = rep(16,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df <- rbind(df1, df2)
> head(df)
Year DayNum Hour
1 15 78 6
2 15 79 18
3 15 80 9
4 15 81 21
5 15 82 22
6 15 83 1
> tail(df)
Year DayNum Hour
195 16 172 22
196 16 173 11
197 16 174 9
198 16 175 15
199 16 176 3
200 16 177 13
which has 100 records for 2015 and 2016, how can I make a POSIXct date/time column?
While there are a number of related posts with a Julian date from a beginning origin (usually 1970-01-01), I could not find any posts with a day-number within a year and with a variable year (i.e. 2015 and 2016).
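The as.POSIXct function has an option to specify the origin date when converting a numeric offset to a date/time object. The offset is interpreted in seconds, so the day number ("Julian" day) has to be converted to seconds first:
#calculate the origin date based on the year column
df$origin<-as.Date(paste0("20", df$Year,"-01-01"))
#convert the day number and hour to a date/time object
#(day 1 is the origin itself; 86400 seconds per day, 3600 per hour)
as.POSIXct((df$DayNum - 1)*86400 + df$Hour*3600, origin=df$origin)
One may need to consider adding the timezone option for completeness:
as.POSIXct((df$DayNum - 1)*86400 + df$Hour*3600, origin=df$origin, tz="GMT")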
You might need something like this, use %j to specify the day of the year:
strptime(with(df, paste(Year, DayNum, Hour)), "%y %j %H")
# [1] "2015-03-19 06:00:00 EDT"
# [2] "2015-03-20 18:00:00 EDT"
# [3] "2015-03-21 09:00:00 EDT"
# [4] "2015-03-22 21:00:00 EDT"
# [5] "2015-03-23 22:00:00 EDT"
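Note that strptime() returns a POSIXlt object, while the question asks for a POSIXct column. A short sketch of the extra step, on three made-up rows that include a leap year; the column name dateTime and the fixed UTC time zone are assumptions added here:

```r
df <- data.frame(Year = c(15, 15, 16), DayNum = c(78, 79, 78), Hour = c(6, 18, 6))

# %y = two-digit year, %j = day of the year, %H = hour
lt <- strptime(paste(df$Year, df$DayNum, df$Hour), "%y %j %H", tz = "UTC")

# Store as POSIXct so it behaves like an ordinary data frame column
df$dateTime <- as.POSIXct(lt)

df
```

Day 78 falls on 19 March in 2015 but on 18 March in 2016, so the leap day is handled by the %j conversion itself.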
I have a set of time series data that has a start and stop time. Each event can last from few seconds to few days, I need to calculate the sum, in this example the total memory used, every hour of the jobs active at the time. Here is a sample of the data:
mem_used start_time stop_time
16 2015-10-24 17:24:41 2015-10-25 04:19:44
80 2015-10-24 17:24:51 2015-10-25 03:14:59
44 2015-10-24 17:25:27 2015-10-25 01:16:10
28 2015-10-24 17:25:43 2015-10-25 00:00:31
72 2015-10-24 17:30:23 2015-10-24 23:58:31
In this case it should give something like:
time total_mem
2015-10-24 17:00:00 240
2015-10-24 18:00:00 240
...
2015-10-25 00:00:00 168
2015-10-25 01:00:00 140
2015-10-25 02:00:00 96
2015-10-25 03:00:00 96
2015-10-25 04:00:00 16
I'm trying to do something with the aggregate function but I can not figure it out. Any ideas? Thanks.
Here's how I would do it, using lubridate.
First, make sure that your dates are in POSIXct format:
dat$start_time = as.POSIXct(dat$start_time, format = "%Y-%m-%d %H:%M:%S")
dat$stop_time = as.POSIXct(dat$stop_time, format = "%Y-%m-%d %H:%M:%S")
Then make an interval object with lubridate:
library(lubridate)
dat$interval <- interval(dat$start_time, dat$stop_time)
Now we can make a vector of times, replace these with your desired times:
z <- seq(from = dat$start_time[1], to = dat$stop_time[5], by = "hours")
And sum those where we have an overlap:
out <- data.frame(times = z,
mem_used = sapply(z, function(x) sum(dat$mem_used[x %within% dat$interval])))
times mem_used
1 2015-10-24 17:24:41 16
2 2015-10-24 18:24:41 240
3 2015-10-24 19:24:41 240
4 2015-10-24 20:24:41 240
5 2015-10-24 21:24:41 240
6 2015-10-24 22:24:41 240
7 2015-10-24 23:24:41 240
Here's the data used:
structure(list(mem_used = c(16L, 80L, 44L, 28L, 72L), start_time = structure(c(1445721881,
1445721891, 1445721927, 1445721943, 1445722223), class = c("POSIXct",
"POSIXt"), tzone = ""), stop_time = structure(c(1445761184, 1445757299,
1445750170, 1445745631, 1445745511), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("mem_used", "start_time", "stop_time"
), row.names = c(NA, -5L), class = "data.frame")
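The grid above starts at 17:24:41, so its rows are not aligned to full hours. A possible base R variant that uses an hour-aligned grid and counts a job in every hour it overlaps is sketched below. The time zone (UTC) and the overlap rule (a job counts for hour h if it starts before h + 1 hour and stops at or after h) are assumptions made for this illustration.

```r
dat <- data.frame(
  mem_used = c(16L, 80L, 44L, 28L, 72L),
  start_time = as.POSIXct(c("2015-10-24 17:24:41", "2015-10-24 17:24:51",
                            "2015-10-24 17:25:27", "2015-10-24 17:25:43",
                            "2015-10-24 17:30:23"), tz = "UTC"),
  stop_time = as.POSIXct(c("2015-10-25 04:19:44", "2015-10-25 03:14:59",
                           "2015-10-25 01:16:10", "2015-10-25 00:00:31",
                           "2015-10-24 23:58:31"), tz = "UTC")
)

# Hour-aligned grid from the first start to the last stop
hours <- seq(from = as.POSIXct(trunc(min(dat$start_time), "hours")),
             to   = as.POSIXct(trunc(max(dat$stop_time), "hours")),
             by   = "hour")

# Sum the memory of all jobs that overlap each one-hour bin
total_mem <- vapply(seq_along(hours), function(i) {
  h <- hours[i]
  active <- dat$start_time < h + 3600 & dat$stop_time >= h
  sum(dat$mem_used[active])
}, integer(1))

out <- data.frame(time = hours, total_mem = total_mem)
out
```

With this rule the totals match the figures requested in the question (240 from 17:00 to 23:00, then 168, 140, 96, 96 and 16).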
Here is another solution based on dplyr and lubridate.
Make sure first to have the data in the right format (e.g date in POSIXct)
library(dplyr)
library(lubridate)
glimpse(df)
## Observations: 5
## Variables: 3
## $ mem_used (int) 16, 80, 44, 28, 72
## $ start_time (time) 2015-10-24 17:24:41, 2015-10-24 17:24:51...
## $ end_time (time) 2015-10-25 04:19:44, 2015-10-25 03:14:59...
Then we will just keep the hour (removing minutes and seconds) since we want to aggregate per hour.
### Remove minutes and seconds
minute(df$start_time) <- 0
second(df$start_time) <- 0
minute(df$end_time) <- 0
second(df$end_time) <- 0
The most important step now, is to create a new data.frame with one row for each hour between start_time and end_time. For example, if on the first line of the original data.frame we have 5 hours between start_time and end_time, we will end with 5 rows and the value mem_used duplicated 5 times.
###
n <- nrow(df)
l <- lapply(1:n, function(i) {
date <- seq.POSIXt(df$start_time[i], df$end_time[i], by = "hour")
mem_used <- rep(df$mem_used[i], length(date))
data.frame(time = date, mem_used = mem_used)
})
df <- Reduce(rbind, l)
glimpse(df)
## Observations: 47
## Variables: 2
## $ time (time) 2015-10-24 17:00:00, 2015-10-24 18:00:00, ...
## $ mem_used (int) 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,...
Finally, we can now aggregate using dplyr or aggregate (or other similar functions)
df %>%
group_by(time) %>%
summarise(tot = sum(mem_used))
## time tot
## (time) (int)
## 1 2015-10-24 17:00:00 240
## 2 2015-10-24 18:00:00 240
## 3 2015-10-24 19:00:00 240
## 4 2015-10-24 20:00:00 240
## 5 2015-10-24 21:00:00 240
## 6 2015-10-24 22:00:00 240
## 7 2015-10-24 23:00:00 240
## 8 2015-10-25 00:00:00 168
## 9 2015-10-25 01:00:00 140
## 10 2015-10-25 02:00:00 96
## 11 2015-10-25 03:00:00 96
## 12 2015-10-25 04:00:00 16
## Or aggregate
aggregate(mem_used ~ time, FUN = sum, data = df)