How to add a group column to an R data frame based on time ranges

I have a dataframe in R (thousands of rows) containing data like this.
"id","ts"
1,2010-11-11 06:00:00
2,2010-11-11 06:01:00
3,2010-11-11 06:02:00
4,2010-11-11 06:03:00
...
11,2010-11-11 06:10:00
12,2010-11-11 06:11:00
13,2010-11-11 06:12:00
14,2010-11-11 06:13:00
15,2010-11-11 06:14:00
16,2010-11-11 06:15:00
17,2010-11-11 10:00:00
18,2010-11-11 10:01:00
19,2010-11-11 10:02:00
20,2010-11-11 10:03:00
21,2010-11-11 10:04:00
22,2010-11-11 10:05:00
...
I have data like the above for many days (11 Nov 2010 - 15 Dec 2010). Each day, ideally, has timestamp data (as.POSIXct, tz = "UTC") in three time slots between the ranges given below. However, some days have data for one or two time slots only.
Slot1: 06:00:00 - 06:15:00
Slot2: 10:00:00 - 10:15:00
Slot3: 13:00:00 - 13:15:00
What I would like to do is add a group column (a continuous group number running through the 15 Dec 2010 data) based on the above three time ranges. The expected output is:
"id","ts","Group"
1,2010-11-11 06:00:00,1
2,2010-11-11 06:01:00,1
3,2010-11-11 06:02:00,1
4,2010-11-11 06:03:00,1
...
11,2010-11-11 06:10:00,1
12,2010-11-11 06:11:00,1
13,2010-11-11 06:12:00,1
14,2010-11-11 06:13:00,1
15,2010-11-11 06:14:00,1
16,2010-11-11 06:15:00,1
17,2010-11-11 10:00:00,2
18,2010-11-11 10:01:00,2
19,2010-11-11 10:02:00,2
20,2010-11-11 10:03:00,2
21,2010-11-11 10:04:00,2
22,2010-11-11 10:05:00,2
...
How can this be achieved in R?
Some reproducible sample data:
# Pass the time zone via tz; a "UTC" suffix inside the string is not parsed
start1 <- as.POSIXct("2010-11-11 06:00:00", tz = "UTC")
end1 <- as.POSIXct("2010-11-11 06:15:00", tz = "UTC")
start2 <- as.POSIXct("2010-11-11 10:00:00", tz = "UTC")
end2 <- as.POSIXct("2010-11-11 10:15:00", tz = "UTC")
start3 <- as.POSIXct("2010-11-11 13:00:00", tz = "UTC")
end3 <- as.POSIXct("2010-11-11 13:15:00", tz = "UTC")
ts1 <- data.frame(ts=seq.POSIXt(start1,end1, by = "min"))
ts2 <- data.frame(ts=seq.POSIXt(start2,end2, by = "min"))
ts3 <- data.frame(ts=seq.POSIXt(start3,end3, by = "min"))
ts <- data.frame(rbind(ts1,ts2,ts3))
id <- data.frame(id=seq.int(1,48,1))
dat <- data.frame(cbind(id,ts))

You can extract the hour and minute values from ts and use case_when to assign the Group number.
library(dplyr)
library(lubridate)
dat %>%
  arrange(ts) %>%
  mutate(hour = hour(ts),
         minute = minute(ts),
         date = as.Date(ts),
         # slot number within the day (NA for rows outside all three slots)
         Group = case_when(hour == 6 & minute <= 15 ~ 1L,
                           hour == 10 & minute <= 15 ~ 2L,
                           hour == 13 & minute <= 15 ~ 3L),
         # offset by 3 per day so that slot numbers never repeat across days
         Group = (as.integer(date - min(date)) * 3) + Group,
         # renumber consecutively so that missing slots leave no gaps
         Group = match(Group, unique(Group))) -> result
result
You can keep only the columns that you want using select, i.e. result %>% select(id, ts, Group).
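As an alternative sketch in base R (not the method used in the answer above), the group number can also be derived from gaps between consecutive timestamps, assuming the slots are always separated by more than 15 minutes. The sample data, the helper mk and the 15-minute threshold are assumptions made for illustration.

```r
# Hypothetical sample: day 1 has slots 1 and 2, day 2 has slot 3 only.
mk <- function(start) {
  seq(as.POSIXct(start, tz = "UTC"), by = "min", length.out = 16)
}
ts <- c(mk("2010-11-11 06:00:00"),
        mk("2010-11-11 10:00:00"),
        mk("2010-11-12 13:00:00"))
dat <- data.frame(id = seq_along(ts), ts = ts)

# Sort by time, then start a new group at the first row and wherever the
# gap to the previous timestamp exceeds 15 minutes (assumed threshold).
dat <- dat[order(dat$ts), ]
gap_secs <- diff(as.numeric(dat$ts))
dat$Group <- cumsum(c(TRUE, gap_secs > 15 * 60))

table(dat$Group)
```

This does not check that a timestamp actually falls inside one of the three slots, so it only suits data that is already restricted to those slots.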

Related

Converting mixed times into 24 hour format

I currently have a dataset with multiple different time formats (AM/PM, numeric, 24-hour format) and I'm trying to turn them all into 24-hour format. Is there a way to standardize mixed-format columns?
Current sample data
time
12:30 PM
03:00 PM
0.961469907
0.913622685
0.911423611
09:10 AM
18:00
Desired output
new_time
12:30:00
15:00:00
23:04:31
21:55:37
21:52:27
09:10:00
18:00:00
I know how to do them all individually (an example is below), but is there a way to do it all in one go? I have a large amount of data and can't go line by line.
#for numeric time
> library(chron)
> x <- c(0.961469907, 0.913622685, 0.911423611)
> times(x)
[1] 23:04:31 21:55:37 21:52:27
The decimal times are a pain, but we can parse them first, feed them back as character strings, and then use lubridate's parse_date_time to handle all the formats at once.
library(tidyverse)
library(chron)
# Create reproducible dataframe
df <-
  tibble::tibble(
    time = c(
      "12:30 PM",
      "03:00 PM",
      0.961469907,
      0.913622685,
      0.911423611,
      "09:10 AM",
      "18:00")
  )
# Parse times
df <-
  df %>%
  dplyr::mutate(
    time_chron = chron::times(as.numeric(time)),
    time_chron = if_else(
      is.na(time_chron),
      time,
      as.character(time_chron)),
    time_clean = lubridate::parse_date_time(
      x = time_chron,
      orders = c(
        "%I:%M %p", # HH:MM AM/PM 12 hour format
        "%H:%M:%S", # HH:MM:SS 24 hour format
        "%H:%M")), # HH:MM 24 hour format
    time_clean = hms::as_hms(time_clean)) %>%
  select(-time_chron)
Which gives us
> df
# A tibble: 7 × 2
time time_clean
<chr> <time>
1 12:30 PM 12:30:00
2 03:00 PM 15:00:00
3 0.961469907 23:04:31
4 0.913622685 21:55:37
5 0.911423611 21:52:27
6 09:10 AM 09:10:00
7 18:00 18:00:00
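For comparison, the same conversion can be sketched in base R without chron or lubridate. The helper name to_24h is hypothetical, and the sketch assumes that numeric entries are fractions of a day, that the session locale recognises "AM"/"PM" for %p, and that everything else is HH:MM in 24-hour format.

```r
# Hypothetical helper: convert mixed time strings to "HH:MM:SS" (24-hour).
to_24h <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA_character_, length(x))

  # 1. Fractions of a day, e.g. "0.961469907"; round to whole seconds
  #    first because format() truncates fractional seconds.
  num <- suppressWarnings(as.numeric(x))
  is_frac <- !is.na(num)
  secs <- round(num[is_frac] * 86400)
  out[is_frac] <- format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"),
                         "%H:%M:%S")

  # 2. 12-hour clock with AM/PM suffix, e.g. "03:00 PM".
  is_ampm <- grepl("[AP]M$", x, ignore.case = TRUE)
  out[is_ampm] <- format(strptime(x[is_ampm], "%I:%M %p", tz = "UTC"),
                         "%H:%M:%S")

  # 3. Everything else is assumed to be 24-hour "HH:MM".
  is_24 <- !is_frac & !is_ampm
  out[is_24] <- format(strptime(x[is_24], "%H:%M", tz = "UTC"), "%H:%M:%S")

  out
}

time <- c("12:30 PM", "03:00 PM", "0.961469907", "0.913622685",
          "0.911423611", "09:10 AM", "18:00")
new_time <- to_24h(time)
print(new_time)
```

Entries that match none of the three patterns come back as NA, which makes unparsed values easy to spot.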

Efficient Group Variable to Note When Values Fall Between Two Times

I have a dataset that contains start and end time stamps, as well as a performance percentage. I'd like to calculate group statistics over hourly blocks, e.g. "the average performance for the midnight hour was x%."
My question is if there is a more efficient way to do this than a series of ifelse() statements.
# some sample data
pre.starting <- data.frame(starting = format(seq.POSIXt(from =
as.POSIXct(Sys.Date()), to = as.POSIXct(Sys.Date()+1), by = "5 min"),
"%H:%M", tz="GMT"))
pre.ending <- data.frame(ending = pre.starting[seq(1, nrow(pre.starting),
2), ])
ending2 <- pre.ending[-c(1), ]
starting2 <- data.frame(pre.starting = pre.starting[!(pre.starting$starting
%in% pre.ending$ending),])
dataset <- data.frame(starting = starting2
, ending = ending2
, perct = rnorm(nrow(starting2), 0.5, 0.2))
For example, I could create hour blocks with code along the lines of the following:
dataset2 <- dataset %>%
mutate(hour = ifelse(starting >= 00:00 & ending < 01:00, 12
, ifelse(starting >= 01:00 & ending < 02:00, 1
, ifelse(starting >= 02:00 & ending < 03:00, 13)))
) %>%
group_by(hour) %>%
summarise(mean.perct = mean(perct, na.rm=T))
Is there a way to make this code more efficient, or improve beyond ifelse()?
We can convert the timestamps to POSIXct, cut the ending time into hourly intervals, and then take the mean for each hour.
library(dplyr)
dataset %>%
  mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
  group_by(ending_hour = cut(ending, breaks = "1 hour")) %>%
  summarise(mean.perct = mean(perct, na.rm = TRUE))
# ending_hour mean.perct
# <fct> <dbl>
# 1 2019-09-30 00:00:00 0.540
# 2 2019-09-30 01:00:00 0.450
# 3 2019-09-30 02:00:00 0.612
# 4 2019-09-30 03:00:00 0.470
# 5 2019-09-30 04:00:00 0.564
# 6 2019-09-30 05:00:00 0.437
# 7 2019-09-30 06:00:00 0.413
# 8 2019-09-30 07:00:00 0.397
# 9 2019-09-30 08:00:00 0.492
#10 2019-09-30 09:00:00 0.613
# … with 14 more rows
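Note that cut() produces date-and-hour blocks. If the data spans several dates and the goal is one average per hour of the day, a plain hour key may be more suitable. The following base R sketch illustrates that variant; the data and column values are made up for the example.

```r
# Hypothetical data: one reading every 10 minutes over three hours.
ending <- seq(as.POSIXct("2019-09-30 00:00:00", tz = "UTC"),
              by = "10 min", length.out = 18)
dataset <- data.frame(ending = ending,
                      perct = rep(c(0.2, 0.4, 0.6), each = 6))

# Hour-of-day key ("00" to "23"); readings from different dates that fall
# in the same clock hour share a key.
dataset$ending_hour <- format(dataset$ending, "%H")

# Mean performance per hour of the day.
hourly <- aggregate(perct ~ ending_hour, data = dataset, FUN = mean)
print(hourly)
```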

I want to assign "day" and "night" variables based on maximum duration inside and outside "08:00:00-20:00:00"

I'm trying to add a new variable to a date-time data set. I can assign "day" and "night" when an interval doesn't cross "08:00:00" or "20:00:00", but when it crosses either of these two time points I want to assign "day" or "night" based on where most of the time is spent: inside 08:00-20:00 (day) or outside it, 20:00-08:00 (night).
#Current input
pacman::p_load(pacman,lubridate,chron)
id<-c("m1","m1","m1","m2","m2","m2","m3","m4","m4")
x<-c("1998-01-03 10:00:00","1998-01-03 16:00:00","1998-01-03 19:20:00","1998-01-04 00:50:00","1998-01-06 11:20:00","1998-01-06 20:50:00","1998-01-06 22:00:00","1998-01-07 06:30:00","1998-01-07 07:50:00")
start<-as.POSIXct(x,"%Y-%m-%d %H:%M:%S",tz="UTC")
y<-c("1998-01-03 16:00:00","1998-01-03 19:20:00","1998-01-04 00:50:00","1998-01-06 11:20:00","1998-01-06 20:50:00","1998-01-06 22:00:00","1998-01-07 07:40:00","1998-01-07 07:50:00","1998-01-07 08:55:00")
end<-as.POSIXct(y,"%Y-%m-%d %H:%M:%S",tz="UTC")
mydata<-data.frame(id,start,end)
#Current output
df1 <- mydata %>%
mutate(start1 = as.POSIXct(sub("\\d+-\\d+-\\d+", Sys.Date(), start)),
end1 = as.POSIXct(sub("\\d+-\\d+-\\d+", Sys.Date(), end)),
day.night = case_when(start1 >= as.POSIXct('08:00:00', format = "%T") &
end1 >= as.POSIXct('08:00:00', format = "%T") &
end1 < as.POSIXct('20:00:00', format = "%T") ~ "day",
start1 >= as.POSIXct('20:00:00', format = "%T") &
(start1 < as.POSIXct('08:00:00', format = "%T") | end1 < as.POSIXct('23:00:00', format = "%T"))|
(start1 < as.POSIXct('08:00:00', format = "%T") & end1 < as.POSIXct('08:00:00', format = "%T")) ~ "night",
difftime(as.POSIXct('20:00:00', format = "%T"), start1) > difftime(end1, as.POSIXct('20:00:00', format = "%T")) ~ "day",
difftime(as.POSIXct('20:00:00', format = "%T"), start1) < difftime(end1, as.POSIXct('20:00:00', format = "%T")) ~ "night",
TRUE ~ "mixed"))
The current output misassigns any periods that cross 08:00 or 20:00,
i.e. row 3 should be "night" because 4 h 50 min are "night" and 40 min are "day";
row 4 should be "night" because 31 h 10 min are "night" and 27 h 20 min are "day".
#Current table
id start end start1 end1 day.night
1 m1 1998-01-03 10:00:00 1998-01-03 16:00:00 2019-09-03 10:00:00 2019-09-03 16:00:00 day
2 m1 1998-01-03 16:00:00 1998-01-03 19:20:00 2019-09-03 16:00:00 2019-09-03 19:20:00 day
3 m1 1998-01-03 19:20:00 1998-01-04 00:50:00 2019-09-03 19:20:00 2019-09-03 00:50:00 day
4 m2 1998-01-04 00:50:00 1998-01-06 11:20:00 2019-09-03 00:50:00 2019-09-03 11:20:00 day
5 m2 1998-01-06 11:20:00 1998-01-06 20:50:00 2019-09-03 11:20:00 2019-09-03 20:50:00 day
6 m2 1998-01-06 20:50:00 1998-01-06 22:00:00 2019-09-03 20:50:00 2019-09-03 22:00:00 night
7 m3 1998-01-06 22:00:00 1998-01-07 07:40:00 2019-09-03 22:00:00 2019-09-03 07:40:00 night
8 m4 1998-01-07 06:30:00 1998-01-07 07:50:00 2019-09-03 06:30:00 2019-09-03 07:50:00 night
9 m4 1998-01-07 07:50:00 1998-01-07 08:55:00 2019-09-03 07:50:00 2019-09-03 08:55:00 day
library(dplyr)
library(lubridate)
library(chron)
id<-c("m1","m1","m1","m2","m2","m2","m3","m4","m4")
x<-c("1998-01-03 10:00:00","1998-01-03 16:00:00","1998-01-03 19:20:00","1998-01-04 00:50:00","1998-01-06 11:20:00","1998-01-06 20:50:00","1998-01-06 22:00:00","1998-01-07 06:30:00","1998-01-07 07:50:00")
start<-as.POSIXct(x,"%Y-%m-%d %H:%M:%S",tz="UTC")
y<-c("1998-01-03 16:00:00","1998-01-03 19:20:00","1998-01-04 00:50:00","1998-01-06 11:20:00","1998-01-06 20:50:00","1998-01-06 22:00:00","1998-01-07 07:40:00","1998-01-07 07:50:00","1998-01-07 08:55:00")
end<-as.POSIXct(y,"%Y-%m-%d %H:%M:%S",tz="UTC")
mydata<-data.frame(id,start,end)
# Classify each interval as day or night
df1 <- mydata %>%
  mutate(i = interval(start, end),
         total_interval_length = time_length(i, unit = "hour")) %>%
  # Calculate daytime hours on first and last days
  mutate(first_day = floor_date(start, unit = "day"),
         last_day = floor_date(end, unit = "day")) %>%
  mutate(first_day_daytime =
           interval(update(first_day, hour = 8), update(first_day, hour = 20)),
         last_day_daytime =
           interval(update(last_day, hour = 8), update(last_day, hour = 20))) %>%
  mutate(first_day_overlap =
           coalesce(as.numeric(as.duration(intersect(first_day_daytime, i)), "hour"), 0),
         last_day_overlap =
           coalesce(as.numeric(as.duration(intersect(last_day_daytime, i)), "hour"), 0)
  ) %>%
  # Calculate total daytime hours
  # For rows of one date only, that is just first_day_overlap (or last_day_overlap since it's the same day)
  # For rows in multiple dates, it's the first_day_overlap plus last_day_overlap plus 12 hours for each day in between
  mutate(daytime_length =
           ifelse(first_day == last_day,
                  first_day_overlap,
                  first_day_overlap + last_day_overlap +
                    12 * (as.numeric(as.duration(interval(first_day, last_day)), "day") - 1))
  ) %>%
  # Assign day or night classification (ties count as "day")
  mutate(day_night = ifelse(daytime_length >= total_interval_length - daytime_length, "day", "night"))
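To sanity-check the daytime totals, the overlap can also be computed day by day in base R. This is only a sketch for a single interval: the function names daytime_hours and classify are hypothetical, all times are assumed to be UTC, and ties are counted as "day" as in the code above.

```r
# Hypothetical helper: hours of [start, end] that fall inside the daily
# window day_from:00 to day_to:00. Works on one interval at a time.
daytime_hours <- function(start, end, day_from = 8, day_to = 20) {
  days <- seq(as.Date(start, tz = "UTC"), as.Date(end, tz = "UTC"), by = "day")
  total_secs <- 0
  for (d in as.character(days)) {
    midnight <- as.POSIXct(paste(d, "00:00:00"), tz = "UTC")
    win_start <- as.numeric(midnight) + day_from * 3600
    win_end <- as.numeric(midnight) + day_to * 3600
    # Overlap of the interval with this day's window (negative means none).
    overlap <- min(as.numeric(end), win_end) - max(as.numeric(start), win_start)
    total_secs <- total_secs + max(0, overlap)
  }
  total_secs / 3600
}

# Hypothetical helper: label an interval by where most of its time lies.
classify <- function(start, end) {
  total <- as.numeric(difftime(end, start, units = "hours"))
  day <- daytime_hours(start, end)
  if (day >= total - day) "day" else "night"
}

# Rows 3 and 4 of the sample data.
s3 <- as.POSIXct("1998-01-03 19:20:00", tz = "UTC")
e3 <- as.POSIXct("1998-01-04 00:50:00", tz = "UTC")
s4 <- as.POSIXct("1998-01-04 00:50:00", tz = "UTC")
e4 <- as.POSIXct("1998-01-06 11:20:00", tz = "UTC")
print(c(daytime_hours(s3, e3), daytime_hours(s4, e4)))
print(c(classify(s3, e3), classify(s4, e4)))
```

Looping over every day is slower than the closed-form "12 hours per full day" shortcut, but it makes the per-day contribution easy to inspect.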

Interpolation over time

In a dataframe, I have wind speed data measured four times a day, at 00:00, 06:00, 12:00 and 18:00. To combine these with other data, I need to fill the time in between to a resolution of 15 minutes. I would like to fill the gaps by simple interpolation.
The following example produces two corresponding sample dataframes, df1 and df2, which need to be merged. In the resulting merged dataframe, the gaps between the 6-hourly values (where windspeed is NA) need to be filled by simple interpolation. My problem is how to merge both and how to do the actual interpolation between the given values.
First dataframe
Creation:
# create a corresponding sample data frame
df1 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015,10,1,0,0,0, tz = "GMT"),
    to = ISOdatetime(2015,10,14,23,59,0, tz= "GMT"),
    by = "6 hour"
  ),
  windspeed = abs(rnorm(14*4, 10, 4)) # abs() because windspeed should be positive
)
Resulting dataframe:
> # show the head of the dataframe
> head(df1)
date windspeed
1 2015-10-01 00:00:00 17.928217
2 2015-10-01 06:00:00 11.306025
3 2015-10-01 12:00:00 6.648131
4 2015-10-01 18:00:00 10.320146
5 2015-10-02 00:00:00 2.138559
6 2015-10-02 06:00:00 9.076344
Second dataframe
Creation:
# create a 2nd corresponding sample data frame
df2 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015,10,1,0,0,0, tz = "GMT"),
    to = ISOdatetime(2015,10,14,23,59,0, tz= "GMT"),
    by = "15 min"
  ),
  var = abs(rnorm(14*24*4, 300, 100))
)
Resulting dataframe:
> # show the head of the 2nd dataframe
> head(df2)
date var
1 2015-10-01 00:00:00 198.2657
2 2015-10-01 00:15:00 472.9041
3 2015-10-01 00:30:00 605.8776
4 2015-10-01 00:45:00 429.0949
5 2015-10-01 01:00:00 400.2390
6 2015-10-01 01:15:00 317.1503
Here is one solution.
First merge the two data frames, using all = TRUE to keep all rows:
df3 <- merge(df1, df2, all = TRUE)
Then use approx for the interpolation:
df3$windspeed <- approx(x = df1$date, y = df1$windspeed, xout = df2$date)$y
The only problem is that the values after the last windspeed measurement will be NA, because approx does not extrapolate by default (rule = 1); everything in between will be filled.
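If the trailing NA values are unwanted, approx accepts rule = 2, which repeats the nearest measured value outside the measured range. A small sketch with made-up wind speeds (the values 10, 16 and 4 are assumptions chosen so the result is easy to check by hand):

```r
# Three 6-hourly measurements (hypothetical values).
df1 <- data.frame(
  date = seq(as.POSIXct("2015-10-01 00:00:00", tz = "GMT"),
             by = "6 hour", length.out = 3),
  windspeed = c(10, 16, 4)
)
# 15-minute grid from 00:00 to 14:00, i.e. two hours past the last measurement.
df2 <- data.frame(
  date = seq(as.POSIXct("2015-10-01 00:00:00", tz = "GMT"),
             by = "15 min", length.out = 57)
)

# Linear interpolation on the numeric time axis; rule = 2 carries the last
# measured value forward instead of returning NA beyond 12:00.
interp <- approx(x = as.numeric(df1$date), y = df1$windspeed,
                 xout = as.numeric(df2$date), rule = 2)
df2$windspeed <- interp$y

print(df2[c(1, 13, 25, 37, 49, 57), ])
```

Whether holding the last value constant is acceptable depends on the data; dropping the trailing rows is the more conservative choice.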

Combining time series data with different resolution in R

I have read in and formatted my data set as shown below.
library(xts)
#Read data from file
x <- read.csv("data.dat", header=F)
x[is.na(x)] <- c(0) #If empty fill in zero
#Construct data frames
rawdata.h <- data.frame(x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8]) #Hourly data
rawdata.15min <- data.frame(x[,10]) #15 min data
#Convert time index to proper format
index.h <- as.POSIXct(strptime(x[,1], "%d.%m.%Y %H:%M"))
index.15min <- as.POSIXct(strptime(x[,9], "%d.%m.%Y %H:%M"))
#Set column names
names(rawdata.h) <- c("spot","RKup", "RKdown","RKcon","anm", "pp.stat","prod.h")
names(rawdata.15min) <- c("prod.15min")
#Convert data frames to time series objects
data.htemp <- xts(rawdata.h,order.by=index.h)
data.15mintemp <- xts(rawdata.15min,order.by=index.15min)
#Select desired subset period
data.h <- data.htemp["2013"]
data.15min <- data.15mintemp["2013"]
I want to combine the hourly data in data.h$prod.h with the 15-minute data in data.15min$prod.15min that falls in the same hour.
An example would be to take the average of the hourly value at time 2013-12-01 00:00-01:00 with the last 15 minute value in that same hour, i.e. the 15 minute value from time 2013-12-01 00:45-01:00. I'm looking for a flexible way to do this with an arbitrary hour.
Any suggestions?
Edit: Just to clarify further: I want to do something like this:
N <- NROW(data.h$prod.h)
for (i in 1:N){
prod.average[i] <- mean(data.h$prod.h[i] + #INSERT CODE THAT FINDS LAST 15 MIN IN HOUR i )
}
I found a solution to my problem by converting the 15-minute data into hourly data using the very useful .index* functions from the xts package, as shown below.
prod.new <- data.15min$prod.15min[.indexmin(data.15min$prod.15min) %in% c(45:59)]
This creates a new time series with only the values occurring in the 45-59 minute interval of each hour.
For those curious my data looked like this:
Original hourly series:
> data.h$prod.h[1:4]
2013-01-01 00:00:00 19.744
2013-01-01 01:00:00 27.866
2013-01-01 02:00:00 26.227
2013-01-01 03:00:00 16.013
Original 15 minute series:
> data.15min$prod.15min[1:8]
2013-09-30 00:00:00 16.4251
2013-09-30 00:15:00 18.4495
2013-09-30 00:30:00 7.2125
2013-09-30 00:45:00 12.1913
2013-09-30 01:00:00 12.4606
2013-09-30 01:15:00 12.7299
2013-09-30 01:30:00 12.9992
2013-09-30 01:45:00 26.7522
New series with only the last 15 minutes in each hour:
> prod.new[1:4]
2013-09-30 00:45:00 12.1913
2013-09-30 01:45:00 26.7522
2013-09-30 02:45:00 5.0332
2013-09-30 03:45:00 2.6974
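The remaining step, averaging each hourly value with the last 15-minute value of the same hour, could look roughly like this in base R with plain data frames instead of xts objects. The data frames hourly and min15 are small stand-ins built from the values shown in this thread, with both series placed on the same date for illustration.

```r
# Stand-in data on a common date (assumption for illustration).
hourly <- data.frame(
  time = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-01 01:00:00"), tz = "UTC"),
  prod.h = c(19.744, 27.866)
)
min15 <- data.frame(
  time = seq(as.POSIXct("2013-01-01 00:00:00", tz = "UTC"),
             by = "15 min", length.out = 8),
  prod.15min = c(16.4251, 18.4495, 7.2125, 12.1913,
                 12.4606, 12.7299, 12.9992, 26.7522)
)

# Keep only the readings stamped at minute 45 of each hour.
last_q <- min15[format(min15$time, "%M") == "45", ]
# Shift the timestamp back to the start of the hour so it can serve as join key.
last_q$time <- last_q$time - 45 * 60

# Join on the hour and average the two values row by row.
combined <- merge(hourly, last_q, by = "time")
combined$prod.average <- rowMeans(combined[, c("prod.h", "prod.15min")])
print(combined)
```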
Short answer
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
Long answer
Since you want to aggregate the 15-minute time series to a coarser resolution (30 minutes), you can use the dplyr package or any other package that implements "group by" operations.
For instance:
s = seq(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-02"), "15 min")
df = data.frame(time = s, value=1:97)
df is a time series with 97 rows and two columns.
head(df)
time value
1 2017-01-01 00:00:00 1
2 2017-01-01 00:15:00 2
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 4
5 2017-01-01 01:00:00 5
6 2017-01-01 01:15:00 6
The cut.POSIXt, group_by and summarise functions do the work:
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
t v
1 2017-01-01 00:00:00 1.5
2 2017-01-01 00:30:00 3.5
3 2017-01-01 01:00:00 5.5
4 2017-01-01 01:30:00 7.5
5 2017-01-01 02:00:00 9.5
6 2017-01-01 02:30:00 11.5
A more robust way is to convert the 15-minute values into hourly values by averaging them, and then do whatever operation you want.
### 15 Minutes Data
min15 <- structure(list(V1 = structure(1:8, .Label = c("2013-01-01 00:00:00",
"2013-01-01 00:15:00", "2013-01-01 00:30:00", "2013-01-01 00:45:00",
"2013-01-01 01:00:00", "2013-01-01 01:15:00", "2013-01-01 01:30:00",
"2013-01-01 01:45:00"), class = "factor"), V2 = c(16.4251, 18.4495,
7.2125, 12.1913, 12.4606, 12.7299, 12.9992, 26.7522)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -8L))
min15
### Hourly Data
hourly <- structure(list(V1 = structure(1:4, .Label = c("2013-01-01 00:00:00",
"2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00"
), class = "factor"), V2 = c(19.744, 27.866, 26.227, 16.013)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -4L))
hourly
### Convert 15min data into hourly data by taking average of 4 values
min15$V1 <- as.POSIXct(min15$V1,origin="1970-01-01 0:0:0")
min15 <- aggregate(. ~ cut(min15$V1,"60 min"),min15[setdiff(names(min15), "V1")],mean)
min15
names(min15) <- c("time","min15")
names(hourly) <- c("time","hourly")
### merge the corresponding values
combined <- merge(hourly,min15)
### average of hourly and 15min values
rowMeans(combined[,2:3])
