group a column by date with different formats - r

I have a dataset where one column holds date and time values. Every date has multiple entries. The first row for each date has a date value in the form 29MAY2018_00:00:00.000000, while the remaining rows for that date hold time ranges, e.g. 20:00 - 21:00. I want to sum the values in another column for each day.
The sample data has the following format
Date A
29MAY2018_00:00:00.000000
20:00 - 21:00 0.009
21:00 - 22:00 0.003
22:00 - 23:00 0.0003
23:00 - 00:00 0
30MAY2018_00:00:00.000000
00:00 - 01:00 -0.0016
01:00 - 02:00 -0.0012
02:00 - 03:00 -0.0002
03:00 - 04:00 -0.0023
04:00 - 05:00 0
05:00 - 06:00 -0.0005
20:00 - 21:00 -0.0042
21:00 - 22:00 -0.0035
22:00 - 23:00 -0.0026
23:00 - 00:00 -0.001
I have created a new column
data$C[data$A ==0 ] <- 0
data$C[data$A < 0 ] <- -1
data$C[data$A > 0 ] <- 1
I need to sum the column 'C' for every date.
The output should be
A B
29-MAY-2019 4
30-MAY-2019 -9
31-MAY-2019 3

One option is to create a grouping column from the rows where 'Date' holds a full datetime (these are the only rows containing letters). Then, for each group, take the first 'Date', convert it to Date class (with anydate from anytime), and sum the sign of 'A'.
library(tidyverse)
library(anytime)
data %>%
  group_by(grp = cumsum(str_detect(Date, "[A-Z]"))) %>%
  summarise(Date = anydate(first(Date)),
            B = sum(sign(A), na.rm = TRUE))
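For comparison, here is a minimal base R sketch of the same grouping idea, without tidyverse or anytime. The toy data frame and the fixed character positions used to pull the day, month and year out of the header rows are assumptions based on the sample in the question, so adjust them if your real data differs.

```r
# Toy data mimicking the layout in the question (values are illustrative).
data <- data.frame(
  Date = c("29MAY2018_00:00:00.000000", "20:00 - 21:00", "21:00 - 22:00",
           "30MAY2018_00:00:00.000000", "00:00 - 01:00", "01:00 - 02:00"),
  A = c(NA, 0.009, 0.003, NA, -0.0016, 0),
  stringsAsFactors = FALSE
)

# A header row is any row containing a letter (the month abbreviation).
is_header <- grepl("[A-Za-z]", data$Date)

# Every header row starts a new group; the time rows inherit its group number.
grp <- cumsum(is_header)

# Build the date from fixed positions: DDMMMYYYY.
# month.abb is always English, so this does not depend on the locale.
hdr <- data$Date[is_header]
day <- as.Date(sprintf("%s-%02d-%s",
                       substr(hdr, 6, 9),
                       match(substr(hdr, 3, 5), toupper(month.abb)),
                       substr(hdr, 1, 2)))

# Sum the sign of A per group, ignoring the NA on each header row.
B <- as.vector(tapply(sign(data$A), grp, sum, na.rm = TRUE))

result <- data.frame(Date = day, B = B)
result
```

The key step is cumsum() over the logical header indicator, which is the same trick the dplyr answer uses inside group_by().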

Related

Calculate midpoint for hh:mm variable over the span of two days with no date

I would like to calculate the midpoint between falling asleep and waking up. I'm struggling a bit because I have no date, just the time as hh:mm.
For example, if you go to bed at 2 a.m. and wake up at 10 a.m., the midpoint is 6 a.m.
Below is an example and I'd be glad for some help!
DF <- data.frame(time_start = c("23:45", "21:30", "22:00", "23:00", "00:30", "02:00"),
time_end = c("06:49", "06:30", "07:00", "09:00", "5:30", "10:00"))
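Parse the two time columns as periods with hm(), add one day to time_end whenever it is earlier than time_start (the sleep crossed midnight), average the two values in seconds, and format the result back to hh:mm.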
library(dplyr)
library(lubridate)
DF %>%
  mutate_all(hm) %>%
  mutate(time_end = time_end + (time_end < time_start) * days(1),
         midpoint = seconds_to_period(as.numeric(time_start + time_end) / 2)) %>%
  mutate_all(~sprintf("%02d:%02d", hour(.), minute(.)))
# time_start time_end midpoint
# 1 23:45 06:49 03:17
# 2 21:30 06:30 02:00
# 3 22:00 07:00 02:30
# 4 23:00 09:00 04:00
# 5 00:30 05:30 03:00
# 6 02:00 10:00 06:00
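If you would rather avoid the lubridate dependency, the same arithmetic can be sketched in base R by working in minutes since midnight. The helper names to_min() and midpoint() are made up for this illustration, and half minutes are rounded down, which is an assumption on my part.

```r
# Convert "hh:mm" (or "h:mm") strings to minutes since midnight.
to_min <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Midpoint between two clock times, assuming the span is shorter than 24 hours.
midpoint <- function(start, end) {
  s <- to_min(start)
  e <- to_min(end)
  # If the end is earlier than the start, the span crossed midnight.
  e <- ifelse(e < s, e + 1440, e)
  # Average, then wrap back into a single day.
  m <- ((s + e) / 2) %% 1440
  sprintf("%02d:%02d", as.integer(m %/% 60), as.integer(floor(m %% 60)))
}

midpoint(c("23:45", "21:30", "02:00"), c("06:49", "06:30", "10:00"))
# [1] "03:17" "02:00" "06:00"
```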

Subsetting data based on time in R

I have a set of traffic data with date and time columns, but I'm having trouble subsetting the data by specific times. Is there a way to properly subset data based on date and time ranges? Using filter or subset does not seem to work for me.
For example, I would like to extract data from 17/08/2019 to 19/08/2019 for the following time periods: 06:00 to 07:00, 08:30 to 10:00, 12:00 to 13:00, 17:30 to 19:00, 19:00 to 20:00 and 20:00 to 22:00. I appreciate everyone's advice!
Vehicle.No. Date Time Payment.Amount
SXX0001A 17/08/2019 00:01 1.25
SXX0002A 17/08/2019 00:21 5
SXX0003A 17/08/2019 00:31 0
SXX0004A 17/08/2019 02:01 3
SXX0005A 17/08/2019 03:01 2
SXX0006A 17/08/2019 18:01 1.25
.
.
.
SXX0007A 18/08/2019 00:01 1.25
SXX0008A 18/08/2019 02:01 1.25
SXX0009A 18/08/2019 19:01 1.25
SXX0010A 18/08/2019 20:01 1.25
.
.
.
SXX0006A 20/08/2019 02:01 1.25
SXX0006A 20/08/2019 03:01 3.25
SXX0006A 20/08/2019 01:01 5.25
SXX0006A 20/08/2019 12:01 0
SXX0006A 20/08/2019 14:01 1.25
.
.
.
The first thing is to make sure that your Date and Time variables are in date and time formats respectively. It is impossible to tell, from what you are providing, whether this is the case or whether those variables are characters or factors.
Let's assume that they are characters:
df <- read.table(
text =
"Vehicle.No. Date Time Payment.Amount
SXX0001A 17/08/2019 00:01 1.25
SXX0002A 17/08/2019 00:21 5
SXX0003A 17/08/2019 00:31 0
SXX0004A 17/08/2019 02:01 3
SXX0005A 17/08/2019 03:01 2
SXX0006A 17/08/2019 18:01 1.25
SXX0007A 18/08/2019 00:01 1.25
SXX0008A 18/08/2019 02:01 1.25
SXX0009A 18/08/2019 19:01 1.25
SXX0010A 18/08/2019 20:01 1.25
SXX0006A 20/08/2019 02:01 1.25
SXX0006A 20/08/2019 03:01 3.25
SXX0006A 20/08/2019 01:01 5.25
SXX0006A 20/08/2019 12:01 0
SXX0006A 20/08/2019 14:01 1.25",
stringsAsFactors = FALSE,
header = TRUE
)
str(df$Date)
chr [1:15] "17/08/2019" "17/08/2019" "17/08/2019" "17/08/2019" ...
str(df$Time)
chr [1:15] "00:01" "00:21" "00:31" "02:01" "03:01" "18:01" "00:01" "02:01" ...
Let's create 2 new variables (date and datetime) in date and datetime formats. I am creating a datetime variable rather than a time one because this will come in handy later. The package readr has great functions to parse vectors.
library(dplyr)
library(readr)
df <-
df %>%
mutate(
date = parse_date(Date, "%d/%m/%Y"),
datetime = parse_datetime(paste(Date, Time), "%d/%m/%Y %H:%M")
)
str(df$date)
Date[1:15], format: "2019-08-17" "2019-08-17" "2019-08-17" ...
str(df$datetime)
POSIXct[1:15], format: "2019-08-17 00:01:00" "2019-08-17 00:21:00" ...
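As a side note, the same two conversions can be sketched in base R with as.Date() and as.POSIXct(). The two-element vectors below are just a cut-down stand-in for the Date and Time columns, and tz = "UTC" is an assumption (parse_datetime() also defaults to UTC).

```r
# Stand-ins for the character columns df$Date and df$Time.
Date <- c("17/08/2019", "18/08/2019")
Time <- c("00:01", "19:01")

# Day/month/year strings to Date class.
date <- as.Date(Date, format = "%d/%m/%Y")

# Paste date and time together and parse as a datetime in UTC.
datetime <- as.POSIXct(paste(Date, Time), format = "%d/%m/%Y %H:%M", tz = "UTC")

str(date)
str(datetime)
```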
It is not clear to me how you want your output (do you want to filter the data that fit in any of the times you list? or do you want to filter for each date and time period separately?). Let's assume that you want all of the data that fit in any of the date and time periods you list.
Since we need to filter for the same time periods on several days, we will use purrr to avoid code repetition:
- create a list of filtered data frames (each element corresponding to one of the days of interest);
- create a function that filters the data for all the time periods of interest on a given day (this function uses the package lubridate);
- apply the function to each element of the list and output a data frame thanks to purrr::map_df(), then remove the variables date and datetime we had created (though maybe you should keep them and get rid of your Date and Time variables instead).
library(purrr)
library(lubridate)
ls <- list(
filter(df, date == "2019-08-17"),
filter(df, date == "2019-08-18"),
filter(df, date == "2019-08-19")
)
select_times <- function(df) {
df %>%
filter(
datetime %within% interval(paste(unique(df$date), "06:00:00"),
paste(unique(df$date), "07:00:00")) |
datetime %within% interval(paste(unique(df$date), "08:30:00"),
paste(unique(df$date), "10:00:00")) |
datetime %within% interval(paste(unique(df$date), "12:00:00"),
paste(unique(df$date), "13:00:00")) |
datetime %within% interval(paste(unique(df$date), "17:30:00"),
paste(unique(df$date), "22:00:00"))
)
}
map_df(ls, select_times) %>%
select(- date, - datetime)
Output:
Vehicle.No. Date Time Payment.Amount
1 SXX0006A 17/08/2019 18:01 1.25
2 SXX0009A 18/08/2019 19:01 1.25
3 SXX0010A 18/08/2019 20:01 1.25
This is the subset of your data for the time periods of interest during the days of interest.
For alternative solutions, you might want to look at the package xts. This post could be useful.
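Another way to think about it, sketched here in base R, is to test the date range and the time-of-day windows separately and combine the two logical vectors. The windows table and the to_min() helper are hypothetical names for this illustration, the sample rows are a cut-down version of the question's data, and the window bounds are treated as inclusive.

```r
# Cut-down sample of the traffic data (character columns, as assumed above).
df <- data.frame(
  Vehicle.No. = c("SXX0001A", "SXX0006A", "SXX0009A", "SXX0006A"),
  Date = c("17/08/2019", "17/08/2019", "18/08/2019", "20/08/2019"),
  Time = c("00:01", "18:01", "19:01", "12:01"),
  stringsAsFactors = FALSE
)

# Time-of-day windows of interest (the three evening windows merged into one).
windows <- data.frame(
  from = c("06:00", "08:30", "12:00", "17:30"),
  to   = c("07:00", "10:00", "13:00", "22:00"),
  stringsAsFactors = FALSE
)

# Convert zero-padded "HH:MM" strings to minutes since midnight.
to_min <- function(x) as.integer(substr(x, 1, 2)) * 60L + as.integer(substr(x, 4, 5))

d  <- as.Date(df$Date, format = "%d/%m/%Y")
tm <- to_min(df$Time)

# TRUE for rows whose date falls in the requested range.
in_days <- d >= as.Date("2019-08-17") & d <= as.Date("2019-08-19")

# TRUE for rows whose time falls inside at least one window.
in_window <- vapply(
  tm,
  function(t) any(t >= to_min(windows$from) & t <= to_min(windows$to)),
  logical(1)
)

subset_df <- df[in_days & in_window, ]
subset_df
```

Keeping the windows in a small table means adding or changing a period is a one-line edit rather than another interval() call.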

R, select rainfall events and calculate rainfall event total from time-series data

Here is what I am trying to make the code do:
- Identify unique rainfall "events" in the dataset. I want to start with an inter-event period of 6 dry hours between events.
- My plan of attack was to create a column containing a unique flag for each event. The event flag or ID could be the start datetime stamp of the event or just n+1 of the last identifier (1,1,1,1,2,2,2,2), etc. I'm having trouble creating this unique flag, because I need R to "look ahead" in the precip column to see whether it rains within the next 6 hours. If it does, it should create a flag.
- Finally, I'd like to get an output (similar to a pivot table) that sums the total precip in inches for each unique event, and also gives me the start and stop time and the total duration of the event.
EXAMPLE OUTPUT
Event ID  Precip (in)  Event Start       Event Stop        Time (hours)
1         0.07         10/6/2017 17:00   10/6/2017 22:00   6:00
2         0.01         10/7/2017 15:00   10/7/2017 15:00   1:00
3         0.15         10/10/2017 11:00  10/10/2017 13:00  3:00
CODE
library(zoo) # to get rollsum fxn
DF1 <- read.csv("U:/R_files/EOF_Rainfall_Stats_2017-18/Precip_DF1_Oct17toMay18.csv")
DF1$event <- NA
DF1$event[DF1$Precip_in > 0] = "1"
DF1$event[DF1$Precip_in == 0] = "0"
str(DF1)
DF1$event <- as.numeric(DF1$event)
str(DF1)
DF1$rollsum6 <- round(rollsum(DF1$event, k=6, fill=NA, align="right"),5)
DF1$eventID <- NA
DF1$eventID <- ifelse(DF1$rollsum6 >= 2 & DF1$event == 1, "flag", "NA")
RAW DATA
DateTime Precip_in
10/6/2017 13:00 0
10/6/2017 14:00 0
10/6/2017 15:00 0
10/6/2017 16:00 0
10/6/2017 17:00 0.04
10/6/2017 18:00 0
10/6/2017 19:00 0
10/6/2017 20:00 0
10/6/2017 21:00 0.01
10/6/2017 22:00 0.02
10/6/2017 23:00 0
10/7/2017 0:00 0
10/7/2017 1:00 0
10/7/2017 2:00 0
10/7/2017 3:00 0
10/7/2017 4:00 0
10/7/2017 5:00 0
10/7/2017 6:00 0
10/7/2017 7:00 0
10/7/2017 8:00 0
10/7/2017 9:00 0
10/7/2017 10:00 0
10/7/2017 11:00 0
10/7/2017 12:00 0
10/7/2017 13:00 0
10/7/2017 14:00 0
10/7/2017 15:00 0.01
If someone is still looking for a way to solve this question, here is my 'tidy' approach on it. I saved the data in a variable called data.
library(dplyr)
# Set data column as POSIXct, important for calculating duration afterwards
data <- data %>% mutate(DateTime = as.POSIXct(DateTime, format = '%m/%d/%Y %H:%M'))
flags <- data %>%
# Set a rain flag if there is rain registered on the gauge
mutate(rainflag = ifelse(Precip_in > 0, 1, 0)) %>%
# Create a column that contains the number of consecutive times there was rain or not.
# Use `rle`, which gives the length of each run of consecutive equal values, and `rep` to repeat that length for each row.
mutate(rainlength = rep(rle(rainflag)$lengths, rle(rainflag)$lengths)) %>%
# Set a flag for an event happening, when there is rain there is a rain event,
# when it is 0 but not for six consecutive times, it is still a rain event
mutate(
eventflag = ifelse(
rainflag == 1,
1,
ifelse(
rainflag == 0 & rainlength < 6,
1,
0
)
)
) %>%
# Correct for the case when the dataset starts with no rain for less than six consecutive times
# If within the first six rows there is no rain registered, then the event flag should change to 0
mutate(eventflag = ifelse(row_number() < 6 & rainflag == 0, 0, eventflag)) %>%
# Add an id to each event (rain or not), to group by on the pivot table
mutate(eventid = rep(seq(1,length(rle(eventflag)$lengths)), rle(eventflag)$lengths))
rain_pivot <- flags %>%
# Select only the rain events
filter(eventflag == 1) %>%
# Group by id
group_by(eventid) %>%
summarize(
precipitation = sum(Precip_in),
eventStart = first(DateTime),
eventEnd = last(DateTime)
) %>%
# Compute time difference as duration of event, add 1 hour, knowing that the timestamp is the time when the rain record ends
mutate(time = as.numeric(difftime(eventEnd,eventStart, units = 'h')) + 1)
rain_pivot
#> # A tibble: 2 x 5
#> eventid precipitation eventStart eventEnd time
#> <int> <dbl> <dttm> <dttm> <dbl>
#> 1 2 0.07 2017-10-06 17:00:00 2017-10-06 22:00:00 6
#> 2 4 0.01 2017-10-07 15:00:00 2017-10-07 15:00:00 1
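A shorter way to get the same events, sketched in base R, is to look only at the wet readings and start a new event whenever the gap to the previous wet reading contains six or more dry hours. This assumes strictly hourly data, and the rain data frame below is rebuilt from the raw data in the question; the names rain, wet and events are only for this illustration.

```r
# Hourly series from 2017-10-06 13:00 to 2017-10-07 15:00, as in the question.
rain <- data.frame(
  DateTime  = as.POSIXct("2017-10-06 13:00", tz = "UTC") + 3600 * (0:26),
  Precip_in = c(0, 0, 0, 0, 0.04, 0, 0, 0, 0.01, 0.02, rep(0, 16), 0.01)
)

# Keep only the hours with rain.
wet <- rain[rain$Precip_in > 0, ]

# Hours elapsed since the previous wet reading (Inf for the first one).
gap_h <- c(Inf, diff(as.numeric(wet$DateTime)) / 3600)

# With hourly data, the number of dry hours in between is gap_h - 1.
# Six or more dry hours start a new event.
wet$event <- cumsum(gap_h - 1 >= 6)

# One summary row per event.
events <- do.call(rbind, lapply(split(wet, wet$event), function(e) {
  data.frame(
    event     = e$event[1],
    precip_in = sum(e$Precip_in),
    start     = min(e$DateTime),
    end       = max(e$DateTime),
    # Add 1 hour because each timestamp marks the end of an hourly record.
    hours     = as.numeric(difftime(max(e$DateTime), min(e$DateTime), units = "hours")) + 1
  )
}))
events
```

Because only wet rows are considered, there is no need for the special case at the start of the dataset, and the event numbers come out as 1, 2, ... rather than 2, 4.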

Dataframe datetime value row filling

I have a CSV file that contains the following:
ts1<-read.table(header = TRUE, sep=",", text="
start, end, value
1,26/11/2014 13:00,26/11/2014 20:00,decreasing
2,26/11/2014 20:00,27/11/2014 09:00,increasing ")
I would like to transform the above dataframe into one in which each interval is expanded into hourly rows, each filled in with the value. The rows run from the start time up to the end time minus one hour, as follows:
date hour value
1 26/11/2014 13:00 decreasing
2 26/11/2014 14:00 decreasing
3 26/11/2014 15:00 decreasing
4 26/11/2014 16:00 decreasing
5 26/11/2014 17:00 decreasing
6 26/11/2014 18:00 decreasing
7 26/11/2014 19:00 decreasing
8 26/11/2014 20:00 increasing
9 26/11/2014 21:00 increasing
10 26/11/2014 22:00 increasing
11 26/11/2014 23:00 increasing
12 27/11/2014 00:00 increasing
13 27/11/2014 01:00 increasing
14 27/11/2014 02:00 increasing
15 27/11/2014 03:00 increasing
16 27/11/2014 04:00 increasing
17 27/11/2014 05:00 increasing
18 27/11/2014 06:00 increasing
19 27/11/2014 07:00 increasing
20 27/11/2014 08:00 increasing
I tried to start with separating the hours from the dates:
> t <- strftime(ts1$end, format="%H:%M:%S")
> t
[1] "00:00:00" "00:00:00"
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(ts1)) and group by row number (1:nrow(ts1)). Within each group, convert the 'start' and 'end' columns to datetime class (using dmy_hm from lubridate), build the sequence by '1 hour', drop the last element (head(..., -1)), format the result to the expected format, split it by space (tstrsplit), and concatenate it with the 'value' column. Then remove the 'rn' grouping column by assigning it to NULL. Finally, we can change the column names (if needed).
library(lubridate)
library(data.table)
res <- setDT(ts1)[,{st <- dmy_hm(start)
et <- dmy_hm(end)
c(tstrsplit(format(head(seq(st, et, by = "1 hour"),-1),
"%d/%m/%Y %H:%M"), "\\s+"), as.character(value))} ,
by = .(rn=1:nrow(ts1))
][, rn := NULL][]
setnames(res, c("date", "hour", "value"))[]
# date hour value
# 1: 26/11/2014 13:00 decreasing
# 2: 26/11/2014 14:00 decreasing
# 3: 26/11/2014 15:00 decreasing
# 4: 26/11/2014 16:00 decreasing
# 5: 26/11/2014 17:00 decreasing
# 6: 26/11/2014 18:00 decreasing
# 7: 26/11/2014 19:00 decreasing
# 8: 26/11/2014 20:00 increasing
# 9: 26/11/2014 21:00 increasing
#10: 26/11/2014 22:00 increasing
#11: 26/11/2014 23:00 increasing
#12: 27/11/2014 00:00 increasing
#13: 27/11/2014 01:00 increasing
#14: 27/11/2014 02:00 increasing
#15: 27/11/2014 03:00 increasing
#16: 27/11/2014 04:00 increasing
#17: 27/11/2014 05:00 increasing
#18: 27/11/2014 06:00 increasing
#19: 27/11/2014 07:00 increasing
#20: 27/11/2014 08:00 increasing
Here is a solution using lubridate and plyr. It processes each row of the data to make a sequence from the start to the end, and returns this with the value. Results from each row are combined into one data.frame. If you need to process the results further, you might be better off not separating the datetime into date and time.
library(plyr)
library(lubridate)
ts1$start <- dmy_hm(ts1$start)
ts1$end <- dmy_hm(ts1$end)
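adply(.data = ts1, .margins = 1, .fun = function(x){
  # Hourly sequence from start up to, but not including, end
  datetime <- head(seq(x$start, x$end, by = "hour"), -1)
  # Alternative that keeps a single datetime column:
  # data.frame(datetime, value = x$value)
  data.frame(date = as.Date(datetime), time = format(datetime, "%H:%M"), value = x$value)
})[, -(1:2)]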
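The same row-by-row expansion can also be sketched in base R alone, which may help if you want to see the steps without data.table or plyr. The data frame below retypes the two rows from the question, and tz = "UTC" is an assumption to keep the hourly sequence free of daylight-saving jumps.

```r
ts1 <- data.frame(
  start = c("26/11/2014 13:00", "26/11/2014 20:00"),
  end   = c("26/11/2014 20:00", "27/11/2014 09:00"),
  value = c("decreasing", "increasing"),
  stringsAsFactors = FALSE
)

# Parse both columns as datetimes.
st <- as.POSIXct(ts1$start, format = "%d/%m/%Y %H:%M", tz = "UTC")
et <- as.POSIXct(ts1$end,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# For each input row, build the hourly sequence and drop the end hour itself.
expanded <- do.call(rbind, lapply(seq_len(nrow(ts1)), function(i) {
  hrs <- head(seq(st[i], et[i], by = "hour"), -1)
  data.frame(
    date  = format(hrs, "%d/%m/%Y"),
    hour  = format(hrs, "%H:%M"),
    value = ts1$value[i],
    stringsAsFactors = FALSE
  )
}))

nrow(expanded)
# [1] 20
```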

How to find runs of a given length in a series of data?

I'm trying to study times in which flow was operating at a given level. I would like to find when flows were above a given level for 4 or more hours. How would I go about doing this?
Sample code:
Date<-format(seq(as.POSIXct("2014-01-01 01:00"), as.POSIXct("2015-01-01 00:00"), by="hour"), "%Y-%m-%d %H:%M", usetz = FALSE)
Flow<-runif(8760, 0, 2300)
IsHigh<- function(x ){
if (x < 1600) return(0)
if (1600 <= x) return(1)
}
isHighFlow = unlist(lapply(Flow, IsHigh))
df = data.frame(Date, Flow, isHighFlow )
I was asked to edit my question to show what I would like to see as output.
I would like to see a data frame such as the one below. The only issue is that hoursHighFlow is incorrect. I'm not sure how to fix the code to generate the correct hoursHighFlow.
temp <- df %>%
mutate(highFlowInterval = cumsum(isHighFlow==1)) %>%
group_by(highFlowInterval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)))
#Then join the two tables together.
temp2<-sqldf("SELECT *
FROM temp LEFT JOIN df
ON df.Date BETWEEN temp.minDate AND temp.maxDate")
I can then use subset to select the periods running at a high flow rate:
t<-subset(temp2,isHighFlow==1)
t<-subset(t, hoursHighFlow>=4)
Put it in a data.table:
require(data.table)
DT <- data.table(df)
Mark runs and lengths:
DT[,`:=`(r=.GRP,rlen=.N),by={r <- rle(isHighFlow);rep(1:length(r[[1]]),r$lengths)}]
Subset to long high-flow runs (4 or more hours):
DT[rlen >= 4L & isHighFlow == 1]
How it works:
New columns are created in the second argument of DT[i,j,by] with :=.
.GRP and .N are special variables for, respectively, the index and size of the by group.
A data.table can be subset simply with DT[i], unlike a data.frame.
Apart from subsetting, most of what works with a data.frame works the same on a data.table.
Here is a solution using the dplyr package:
df %>%
mutate(interval = cumsum(isHighFlow!=lag(isHighFlow, default = 0))) %>%
group_by(interval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)), isHighFlow = mean(isHighFlow)) %>%
filter(hoursHighFlow >= 4, isHighFlow == 1)
Result:
interval hoursHighFlow minDate maxDate isHighFlow
1 25 4 2014-01-03 07:00 2014-01-03 10:00 1
2 117 4 2014-01-12 01:00 2014-01-12 04:00 1
3 245 6 2014-01-23 13:00 2014-01-23 18:00 1
4 401 6 2014-02-07 03:00 2014-02-07 08:00 1
5 437 5 2014-02-11 02:00 2014-02-11 06:00 1
6 441 4 2014-02-11 21:00 2014-02-12 00:00 1
7 459 4 2014-02-13 09:00 2014-02-13 12:00 1
8 487 4 2014-02-16 03:00 2014-02-16 06:00 1
9 539 7 2014-02-21 08:00 2014-02-21 14:00 1
10 567 4 2014-02-24 11:00 2014-02-24 14:00 1
.. ... ... ... ... ...
As Frank notes, you could achieve the same result by using rle to set the intervals, replacing the mutate line with:
mutate(interval = rep(1:length(rle(df$isHighFlow)[[2]]),rle(df$isHighFlow)[[1]])) %>%
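To see what rle is doing in both answers, here is a small base R sketch on a fixed vector instead of random data. The flow values and the names is_high and runs are made up for the illustration; the 1600 threshold and the 4-hour minimum come from the question.

```r
# Small deterministic flow series (one value per hour).
flow <- c(100, 1700, 1800, 1650, 1900, 200, 1700, 1700, 300,
          1600, 1600, 1600, 1600, 1600)
is_high <- flow >= 1600

# Run-length encoding: lengths and values of consecutive equal entries.
r <- rle(is_high)

# Row index where each run ends and starts.
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep only runs that are high flow and last 4 or more hours.
keep <- r$values & r$lengths >= 4
runs <- data.frame(start = starts[keep], end = ends[keep], hours = r$lengths[keep])
runs
#   start end hours
# 1     2   5     4
# 2    10  14     5
```

The start and end columns are row indices, so on the real data you can look up the timestamps with df$Date[runs$start] and df$Date[runs$end].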

Resources