Grouping a dataframe by time with "nice" breaks using dplyr

Intro:
I would like to aggregate some 5-minute data into 10-minute data. Specifically, I only want to aggregate on the 10-minute marks (00:10:00, 00:20:00, 00:30:00, etc.).
The code below almost achieves this, but the breaks land on the 5-minute marks instead of the 10-minute marks (00:05:00, 00:15:00, 00:25:00). I think cut() is using the first timestamp in the dataframe when determining the cutpoints.
Is there a way to achieve "nice" 10-minute breaks using cut {base} and group_by() {dplyr}? I would be okay with just removing the first row of data, but I really need the solution to handle many different files, each of which has a unique starting point.
Thanks in advance!
Example Code:
date <- c("2017-06-14 14:35:00", "2017-06-14 14:40:00", "2017-06-14 14:45:00", "2017-06-14 14:50:00")
co <- as.numeric(c(5.17,10.07,13.88,13.78))
no <- as.numeric(c(34.98,32.45,31.34,29.09))
no2 <- as.numeric(c(0.00,0.00,0.00,0.00))
o3 <- as.numeric(c(5.17,10.07,13.88,13.78))
data <- data.frame(date, co, no , no2, o3)
data$date <- strptime(data$date, format = "%Y-%m-%d %H:%M")
data$date <- as.POSIXct(data$date)
head(data)
library(dplyr)
data_10min <- data %>%
  group_by(date = cut(date, breaks = "10 min")) %>%
  summarize(co = mean(co), no = mean(no), no2 = mean(no2), o3 = mean(o3))
head(data_10min)
Desired Output:
2017-06-14 14:40:00
2017-06-14 14:50:00

Just adding 300 seconds to the date column inside group_by() gets the desired result: the shifted first timestamp then falls on a 10-minute mark, so the automatically chosen cutpoints do too.
library(dplyr)
data_10min <- data %>%
  group_by(date = cut(date + 300, breaks = "10 min")) %>%
  summarize(across(everything(), mean))
data_10min
The result:
# A tibble: 2 × 5
#   date                   co     no   no2    o3
#   <fct>               <dbl>  <dbl> <dbl> <dbl>
# 1 2017-06-14 14:40:00  7.62 33.715     0  7.62
# 2 2017-06-14 14:50:00 13.83 30.215     0 13.83
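An alternative that avoids the 300-second offset, if lubridate is available: ceiling_date() rounds each timestamp up to the next 10-minute boundary no matter where the series starts, so it handles files with arbitrary starting points. A sketch, assuming the same data frame as in the question:
library(lubridate)
library(dplyr)
data_10min <- data %>%
  group_by(date = ceiling_date(date, "10 mins")) %>%  # 14:35 -> 14:40; 14:40 stays 14:40
  summarize(across(co:o3, mean))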

Related

Calculating distances in a dataframe

I have a large dataframe with one column with time and a second column with speed measurements (km/h). Here is a short example of the data:
df <- data.frame(time = as.POSIXct(c("2019-04-01 13:55:18", "2019-04-01 14:03:18",
"2019-04-01 14:14:18", "2019-04-01 14:26:55",
"2019-04-01 14:46:55", "2019-04-01 15:01:55")),
speed = c(4.5, 6, 3.2, 5, 4, 2))
Is there any way to create a new data frame that gives the distance driven in each 20-minute interval from 2019-04-01 14:00:00 to 2019-04-01 15:00:00, assuming that the speed changes linearly between measurements? I was trying to find a solution with integrals, but was not sure if that is the correct way to do it. Thanks for the help!
Here is a solution using a combination of zoo::na.approx and dplyr functions.
library(zoo)
library(dplyr)
time_seq <- data.frame(time = seq(min(df$time), max(df$time), by = "sec"))
df <- merge(time_seq, df, all.x = TRUE)
df$speed <- na.approx(df$speed)  # fill the gaps by linear interpolation
df %>%
  filter(time >= "2019-04-01 14:00:00" & time < "2019-04-01 15:00:00") %>%
  mutate(km = speed / 3600) %>%  # km covered in each one-second step
  group_by(group = cut(time, breaks = "20 min")) %>%
  summarise(distance = sum(km))
Which gives:
# A tibble: 3 x 2
#   group               distance
#   <fct>                  <dbl>
# 1 2019-04-01 14:00:00     1.50
# 2 2019-04-01 14:20:00     1.54
# 3 2019-04-01 14:40:00     1.16
Explanation:
The first step is to create a one-second sequence of times (time_seq) spanning the data, so the speed can be interpolated between measurement points. The sequence is merged with the data frame and the resulting NAs are filled by linear interpolation with na.approx.
Then, using dplyr verbs, the data frame is filtered to the hour of interest and the 20-minute bins are created using cut. The final distance is the sum of the per-second distances within each 20-minute bin.
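Because na.approx interpolates linearly, the per-second sum is effectively a trapezoidal integral. As a cross-check, the total distance can be computed in closed form from the original six rows; this is a sketch with a hypothetical helper (trapezoid_km is not part of the answer above):
# Run this on the original df from the question, before the merge above.
trapezoid_km <- function(t, v) {
  dt <- as.numeric(diff(t), units = "hours")  # width of each interval, in hours
  sum(dt * (head(v, -1) + tail(v, -1)) / 2)   # area under the piecewise-linear speed curve
}
trapezoid_km(df$time, df$speed)  # total km over the whole record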

Calculate the mean for the same time on different dates

I have data showing a numeric amount of something measured a few seconds after every minute of every day, over a period of several days. Here is an example for two minutes on three days:
dat <- read.table(textConnection('
date_and_time amount
"2020-05-01 13:23:02" 8
"2020-05-01 13:24:06" 26
"2020-05-02 13:23:01" 5
"2020-05-02 13:24:01" 30
"2020-05-03 13:23:03" 6
"2020-05-03 13:24:02" 27
'), header = TRUE, colClasses=c("POSIXct", "numeric"))
For that data, I want to calculate the mean amount for each minute over all days. For the above sample data, the result would look like this:
time_of_day mean_amount
13:23:00 6.333333
13:24:00 27.66667
To get that result, I have converted the datetime objects to character strings, stripped the dates and the seconds from the strings, converted the strings to a factor, and calculated the means for each factor.
Is there a way to achieve that result with the datetime objects? That is, is there a function to calculate means over the same time of different dates?
If by datetime you mean POSIXct, then that class cannot represent times without a date; however, the chron times class can.
The following converts the date/time to a chron object, ch, then converts that to a times object, time_of_day, and truncates it to the minute. Finally we aggregate amount by that.
library(chron)
ch <- as.chron(format(dat$date_and_time))
time_of_day <- trunc(ch - dates(ch), "min")
ag <- aggregate(amount ~ time_of_day, dat, mean)
giving:
> ag
  time_of_day    amount
1    13:23:00  6.333333
2    13:24:00 27.666667
> str(ag)
'data.frame': 2 obs. of 2 variables:
 $ time_of_day: 'times' num  13:23:00 13:24:00
  ..- attr(*, "format")= chr "h:m:s"
 $ amount     : num  6.33 27.67
In base R:
sapply(split(dat$amount,format(dat$date_and_time, format='%H:%M')), mean)
   13:23     13:24
6.333333 27.666667
I used the format function to strip the dates and the seconds. You could compute the mean from that split in other ways as well, as in the sketch below.
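For instance, aggregate() returns a data frame instead of a named vector; a minimal sketch using the same format() trick (the column names here are my own choice, not from the answer above):
tod <- format(dat$date_and_time, "%H:%M")
aggregate(list(mean_amount = dat$amount), by = list(time_of_day = tod), FUN = mean)
#   time_of_day mean_amount
# 1       13:23    6.333333
# 2       13:24   27.666667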
The answer to your question is no. Objects of class POSIXct must have a date.
Here's an approach with lubridate and dplyr:
library(dplyr)
library(lubridate)
dat %>%
  mutate(hour = hour(date_and_time),
         minute = minute(date_and_time)) %>%
  group_by(hour, minute) %>%
  dplyr::summarise(mean_amount = mean(amount))
#    hour minute mean_amount
#   <int>  <int>       <dbl>
# 1    13     23        6.33
# 2    13     24       27.7
An additional solution:
library(tidyverse)
library(lubridate)
library(hms)
dat %>%
  mutate(time = floor_date(x = date_and_time, unit = "min") %>% hms::as_hms()) %>%
  group_by(time) %>%
  summarise(mean_amount = mean(amount))

How to create subsets of multiple date ranges in R

I have a data frame with dates and numbers called 'df'. I have another data frame with start and end dates called 'date_ranges'.
My goal is to filter/subset df so that it only keeps the dates between the start and end dates in each row of date_ranges. Here is my code so far:
df_date <- as.Date((as.Date('2010-01-01'):as.Date('2010-04-30')))
df_numbers <- c(1:120)
df <- data.frame(df_date, df_numbers)
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
date_ranges <- data.frame(start_dates, end_dates)
# Attempting to filter df by start and end dates
library(dplyr)
for (i in range(date_ranges$start_dates)){
  for (j in range(date_ranges$end_dates)){
    print(
      df %>%
        filter(between(df_date, i, j)))
  }
}
The first and third results of the nested loop are what I want, but not the second. The first and third give me the dates and values of df between their respective rows, but the second result spans from the earliest date to the latest date. How can I fix this loop to exclude the second result?
A tidyverse approach could be to create a sequence between start_dates and end_dates and join it with df to keep only the dates which lie in the ranges.
library(dplyr)
date_ranges %>%
  mutate(df_date = purrr::map2(start_dates, end_dates, seq, "day")) %>%
  tidyr::unnest(df_date) %>%
  select(-start_dates, -end_dates) %>%
  left_join(df, by = 'df_date')
# A tibble: 39 x 2
#    df_date    df_numbers
#    <date>          <int>
#  1 2010-01-06          6
#  2 2010-01-07          7
#  3 2010-01-08          8
#  4 2010-01-09          9
#  5 2010-01-10         10
#  6 2010-01-11         11
#  7 2010-01-12         12
#  8 2010-01-13         13
#  9 2010-01-14         14
# 10 2010-01-15         15
# … with 29 more rows
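With dplyr 1.1.0 or later, a non-equi join expresses the same subset without expanding the ranges day by day; a sketch (join_by() only exists in recent dplyr versions):
library(dplyr)
df %>%
  inner_join(date_ranges, by = join_by(between(df_date, start_dates, end_dates))) %>%
  select(df_date, df_numbers)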
You can also try looping through the indices:
for (i in seq_along(date_ranges$start_dates)){
  print(
    df %>%
      filter(between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i])))
}
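If one combined data frame is wanted rather than printed pieces, the same index-based idea works with lapply; a sketch, assuming dplyr is loaded as above:
subsets <- lapply(seq_along(date_ranges$start_dates), function(i)
  filter(df, between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i])))
bind_rows(subsets)  # 39 rows, matching the join-based answer above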
Base R solution:
# Your data creation can be simplified:
df <- data.frame(df_date = seq.Date(as.Date('2010-01-01', "%Y-%m-%d"),
                                    as.Date('2010-04-30', "%Y-%m-%d"), by = 1),
                 df_numbers = c(1:120))
# Store start and end date vectors to filter the data.frame:
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
# Subset the data to the rows whose dates fall inside any of the ranges:
in_range <- Reduce(`|`, Map(function(s, e) df$df_date >= s & df$df_date <= e,
                            start_dates, end_dates))
df[in_range, ]

Weekend dates within an interval in R

I'm trying to identify whether or not a weekend fell within an interval of dates. I've been able to identify if a specific date is a weekend, but not when trying to look at a range of dates. Is this possible? If so, please advise. TIA.
library(lubridate)
library(chron)
start.date <- c("1/1/2017", "2/1/2017")
end.date <- c("1/21/2017", "2/11/2017")
df <- data.frame(start.date, end.date)
df$start.date <- mdy(df$start.date)
df$end.date <- mdy(df$end.date)
df$interval.date <- interval(df$start.date, df$end.date)
df$weekend.exist <- ifelse(is.weekend(df$interval.date), 1, 0)
# Error in dts - floor(dts) :
# Arithmetic operators undefined for 'Interval' and 'Interval' classes:
# convert one to numeric or a matching time-span class.
Why not use a seq of dates rather than creating the interval? Like this:
df$weekend.exist <- sapply(1:nrow(df), function(i)
  as.numeric(any(is.weekend(seq(df$start.date[i], df$end.date[i], by = "day")))))
# [1] 1 1
library(dplyr)
df %>%
  group_by(start.date, end.date) %>%
  mutate(weekend.exist = as.numeric(any(is.weekend(seq(start.date, end.date, by = "day")))))
#   start.date end.date   weekend.exist
#   <date>     <date>             <dbl>
# 1 2017-01-01 2017-01-21             1
# 2 2017-02-01 2017-02-11             1
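An equivalent vectorized variant with purrr; a sketch, assuming chron is loaded for is.weekend():
library(purrr)
df$weekend.exist <- map2_dbl(df$start.date, df$end.date, function(s, e)
  as.numeric(any(is.weekend(seq(s, e, by = "day")))))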

How to select time range during weekdays and associated data on the next column

Here is an example of a data subset from the .csv files. There are three columns with no header. The first column is the date/time, the second column is the load [kW], and the third column is 1 = weekday, 0 = weekend/holiday.
9/9/2010 3:00 153.94 1
9/9/2010 3:15 148.46 1
I would like to write R code that selects the first and second columns within the time range 10:00 to 20:00 on all weekdays (when the third column is 1) within the month of September, and I do not know the best and most efficient way to code it.
Code:
dt <- read.csv("file", header = FALSE, sep = ",")
# Select the column with the weekday designation: weekday = 1, weekend or holiday = 0
y <- data.frame(dt[, 3])
# Select the columns with timestamps and loads
x <- data.frame(dt[, 1:2])
t <- data.frame(dt[, 1])
# Convert timestamps into a readable format
s <- strptime("9/1/2010 0:00", format = "%m/%d/%Y %H:%M")
e <- strptime("9/30/2010 23:45", format = "%m/%d/%Y %H:%M")
range <- seq(s, e, by = "min")
df <- data.frame(range)
The OP asks for the "best and most efficient way to code" this without showing any inefficient code, so @Justin is right. It seems that the OP is new to R (and it's officially the summer of love), so I'll give it a try with a solution (not sure about efficiency):
index <- c("9/9/2010 19:00", "9/9/2010 21:15", "10/9/2010 11:00", "3/10/2010 10:30")
index <- as.POSIXct(index, format = "%d/%m/%Y %H:%M")
set.seed(1)
Data <- data.frame(Date = index, load = rnorm(4, mean = 120, sd = 10), weeks = c(0, 1, 1, 1))
## Data
##                  Date   load weeks
## 1 2010-09-09 19:00:00 113.74     0
## 2 2010-09-09 21:15:00 121.84     1
## 3 2010-09-10 11:00:00 111.64     1
## 4 2010-10-03 10:30:00 135.95     1
cond <- expression(format(Date, "%H:%M") < "20:00" &
                   format(Date, "%H:%M") > "10:00" &
                   weeks == 1 &
                   format(Date, "%m") == "09")
subset(Data, eval(cond))
##                  Date   load weeks
## 3 2010-09-10 11:00:00 111.64     1
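The same condition also reads naturally as a dplyr filter; a sketch, assuming the Data frame built above:
library(dplyr)
Data %>%
  filter(weeks == 1,
         format(Date, "%m") == "09",
         format(Date, "%H:%M") > "10:00",
         format(Date, "%H:%M") < "20:00")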
