Filtering intraday data in R

I'm trying to filter intraday data to include only a certain period within each day. Is there a trick in some package to achieve this? Here is example data:
library(tibbletime)
library(lubridate)  # provides ymd_hms()
example <- tibble::tibble(
  date = seq(ymd_hms("2017-01-01 09:00:00"), ymd_hms("2017-01-02 20:00:00"), by = "min"),
  value = rep(1, 2101))
I would like to keep only 10:00:00 to 18:35:00 for each day, but I can't achieve this cleanly. My solution so far has been to create extra indicator columns and filter on them, but that hasn't worked well either.

You can use between() from data.table on the time of day formatted as a zero-padded string, which sorts in chronological order:
example[data.table::between(format(example$date, "%H:%M:%S"),
lower = "10:00:00",
upper = "18:35:00"), ]

library(tibbletime)
library(tidyverse)
library(lubridate)
example <- tibble(
  date = seq(ymd_hms("2017-01-01 09:00:00"), ymd_hms("2017-01-02 20:00:00"), by = "min"),
  value = rep(1, 2101))
example %>%
  # encode the time of day as the number HHMM, e.g. 18:05 becomes 1805 (seconds are ignored)
  mutate(time = hour(date) * 100 + minute(date)) %>%
  filter(time >= 1000 & time <= 1835) %>%
  select(-time)

This is pretty hacky but if you really want to stay in the tidyverse:
rng <- range((hms("10:00:00") %>% as_datetime()), (hms("18:35:00") %>% as_datetime()))
example %>%
separate(., date, into = c("date", "time"), sep = " ") %>%
mutate(
time = hms(time) %>% as_datetime(),
date = as_date(date)
) %>%
filter(time >= rng[1] & time <= rng[2]) %>%
separate(., time, into = c("useless", "time"), sep = " ") %>%
select(-useless)
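
As a sketch of another way to express the window from the question, the time of day can be converted to seconds since midnight and compared numerically, which keeps both endpoints and does not depend on string formatting. The helper name `secs_of_day` is made up for this illustration and is not part of lubridate or any other package.

```r
library(dplyr)
library(lubridate)

# Same shape as the example data in the question: one row per minute.
example <- tibble(
  date = seq(ymd_hms("2017-01-01 09:00:00"), ymd_hms("2017-01-02 20:00:00"), by = "min"),
  value = 1
)

# Hypothetical helper (not from any package): seconds elapsed since midnight.
secs_of_day <- function(x) hour(x) * 3600 + minute(x) * 60 + second(x)

# Keep rows whose time of day lies in [10:00:00, 18:35:00], endpoints included.
filtered <- example %>%
  filter(secs_of_day(date) >= 10 * 3600,
         secs_of_day(date) <= 18 * 3600 + 35 * 60)
```

Because the comparison is on whole seconds, a timestamp such as 18:35:30 would be excluded, which may or may not be what is wanted for data with sub-minute resolution.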

Related

Group data by year and filter by month in R

I have a list of data frames with daily streamflow data.
I want to estimate the maximum daily flow from June to November of every year for each data frame in the list. Each data frame holds the data for one station.
This is how the list of data frames looks:
and this is the code I am using:
#Peak mean daily flow summer and fall (June to November)
PeakflowSummerFall <- lapply(listDF,function(x){x %>% group_by(x %>% mutate(year = year(Date)))
%>% filter((x %>% mutate(month = month(Date)) >= 6) & (x %>% mutate(month = month(Date)) <= 11))
%>% summarise(max=max(DailyStreamflow, na.rm =TRUE))})
but I am having this error:
<error/dplyr_error>
Problem with `filter()` input `..1`.
x Input `..1` must be of size 1, not size 24601.
i Input `..1` is `&...`.
i The error occurred in group 1: Date = 1953-06-01, DailyStreamflow = 32, year = 1953.
Backtrace:
Run `rlang::last_trace()` to see the full context
Any solution to this problem?
#### This should provide you with enough
#### sample data for answerers to work with
# install.packages('purrr')  # if not already installed
library(purrr)
sample_dat <- listDF %>%
  head %>%
  map( ~ head(.x))
dput(sample_dat)
#### With that being said...
#### You should flatten the data frame...
#### It's easier to work with...
# install.packages('lubridate')  # if not already installed
library(dplyr)
library(lubridate)
listDF %>%
  bind_rows(.id = "station") %>%  # one data frame, with a column identifying the source list element
  filter(month(Date) > 5, month(Date) < 12) %>%
  group_by(station, year = year(Date)) %>%
  summarise(max_flow = max(DailyStreamflow, na.rm = TRUE), .groups = "drop") %>%
  split(.$station)
Given the posted image of the data structure, the following might work.
library(lubridate)
library(dplyr)
listDF %>%
purrr::map(function(x){
x %>%
filter(month(Date) >= 6 & month(Date) <= 11) %>%
group_by(year(Date)) %>%
summarise(Max = max(DailyStreamflow, na.rm = TRUE), .groups = "keep")
})
Test data creation code.
fun <- function(year, n){
d1 <- as.Date(paste(year, 1, 1, sep = "-"))
d2 <- as.Date(paste(year + 10, 12, 31, sep = "-"))
d <- seq(d1, d2, by = "day")
d <- sort(rep(sample(d, n, TRUE), length.out = n))
flow <- sample(10*n, n, TRUE)
data.frame(Date = d, DailyStreamflow = flow)
}
set.seed(2020)
listDF <- lapply(1:3, function(i) fun(c(1953, 1965, 1980)[i], c(24601, 13270, 17761)[i]))
str(listDF)
rm(fun)

Replace historical data of a data.frame with the most recent year's data in R?

I want to replace Jan 01 to Jun 25 of every year in FakeData with data from Ob2020 for the two variables (Level and Flow) of my data.frame. Here is what I have so far; I am looking for suggestions to achieve my goal.
library(tidyverse)
library(lubridate)
set.seed(1500)
FakeData <- data.frame(Date = seq(as.Date("2010-01-01"), to = as.Date("2018-12-31"), by = "days"),
Level = runif(3287, 0, 30), Flow = runif(3287, 1,10))
Ob2020 <- data.frame(Date = seq(as.Date("2020-01-01"), to = as.Date("2020-06-25"), by = "days"),
Level = runif(177, 0, 30), Flow = runif(177, 1,10))
Here's a way using dplyr and lubridate :
library(dplyr)
library(lubridate)
FakeData %>%
mutate(day = day(Date), month = month(Date)) %>%
left_join(Ob2020 %>%
mutate(day = day(Date), month = month(Date)),
by = c('day', 'month')) %>%
mutate(Level = coalesce(Level.y, Level.x),
Flow = coalesce(Flow.y, Flow.x)) %>%
select(Date = Date.x, Level, Flow)
If you don't mind a data.table solution, here is an update join:
library(data.table)
#extract day of month and month of the date
setDT(FakeData)[, c("day", "mth") := .(mday(Date), month(Date))]
setDT(Ob2020)[, c("day", "mth") := .(mday(Date), month(Date))]
#print to console to show old values
head(FakeData)
head(Ob2020)
cols <- c("Level", "Flow")
#Ob2020 only covers Jan 01 to Jun 25, so joining on (day, mth) updates exactly those dates
FakeData[Ob2020, on=.(day, mth),
(cols) := mget(paste0("i.", cols))]
#print to console to show new values
head(FakeData)

How to sum variable by month/year in R?

library(dplyr)
library(plotly)
library(lubridate)
googlesearch <- read.csv("multiTimeline.csv", header = FALSE, stringsAsFactors = FALSE)
googlesearch2 <- googlesearch [-1, ]
googlesearch2 <- googlesearch2 [-1, ]
colnames(googlesearch2)[1] <- 'Date'
colnames(googlesearch2)[2] <- 'NumberofSearch'
googlesearch2$Date <- as.Date(googlesearch2$Date)
googlesearch2 <- googlesearch2 %>%
filter(Date > "2015-01-04" & Date < "2018-05-27")
googlesearch3 <- googlesearch2 %>%
transform(googlesearch2$Date, Date = as.Date(as.character(Date), "%Y-%m-%d"))
googlesearch3 <- googlesearch2 %>%
mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
group_by(Date, yearMon = as.yearmon(Date, "%m-%d-%Y"))
googlesearch3$Date <- as.numeric(googlesearch3$NumberofSearch)
googlesearch3 <- googlesearch3 %>%
mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
group_by(Date, yearMon = as.yearmon(Date, "%m-%d-%Y")) %>%
summarise(NumberofSearch_sum = sum(NumberofSearch))
data <- tbl_df(googlesearch3)
data %>%
group_by(yearMon) %>%
summarise(NumberofSearch_mon = sum(NumberofSearch))
I know this is messy.
I'm getting this error and I don't know why. Adding the sample code.
Error in summarise_impl(.data, dots) :
Evaluation error: invalid 'type' (character) of argument.
Lacking a reproducible example, try replacing the last code chunk of your sample code with:
library(hablar)
data %>%
retype() %>%
group_by(yearMon) %>%
summarise(NumberofSearch_mon = sum(NumberofSearch))
Maybe it works :)
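
The error text is what sum() reports when it is handed a character vector, which is plausible here because reading the CSV with header = FALSE leaves the header rows in the data and can make every column character. A minimal sketch under that assumption, converting the column explicitly and grouping by month with lubridate::floor_date(); the small data frame below is invented for illustration and stands in for the real file.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for the CSV contents: both columns arrive as character.
searches <- data.frame(
  Date = c("2015-01-11", "2015-01-18", "2015-02-01", "2015-02-08"),
  NumberofSearch = c("10", "20", "5", "7"),
  stringsAsFactors = FALSE
)

monthly <- searches %>%
  mutate(Date = as.Date(Date),
         NumberofSearch = as.numeric(NumberofSearch),  # character to numeric before summing
         yearMon = floor_date(Date, unit = "month")) %>%  # first day of each month as the group key
  group_by(yearMon) %>%
  summarise(NumberofSearch_mon = sum(NumberofSearch))
```

Using floor_date() keeps the grouping key as a Date, so the result sorts chronologically and plots on a date axis without extra conversion.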

Use variable names in function in dplyr for sum and cumsum

dplyr programming question here. I'm trying to write a dplyr function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function does not have data from the logical column referenced by y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output in my test2 column is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
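
A likely cause of the first mismatch is operator precedence: in R, `!` binds more loosely than `==`, so `!!cumulative_y == TRUE` is parsed as `!!(cumulative_y == TRUE)` and the comparison is applied to the quosure rather than to the column. In the second attempt, `data[[cumulative_y]]` refers to the full column of the original data frame instead of the rows of the current group, so every group sums the entire column. A minimal sketch of one way around both, assuming a dplyr version that supports the `{{ }}` (embrace) operator:

```r
library(dplyr)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    # {{ }} unquotes in place, so the comparison applies to the column itself
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

# Same test data as in the question, with a seed added for reproducibility
set.seed(1)
my_df <- data.frame(
  x = sample(1:100, 10),
  y = sample(c(TRUE, FALSE), 10, replace = TRUE),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
)

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
```

With the original enquo() approach, wrapping the unquoted argument in parentheses, as in `(!!cumulative_y) == TRUE`, should address the same precedence problem.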

hourly sums with dplyr with zeros for empty hours

I have a dataset similar to the format of "my_data" below, where each line is a single count of an event. I want to obtain a summary of how many events happen in every hour. I would like to have every hour with no events be included with a 0 for its "hourly_total" value.
I can achieve this with dplyr as shown, but the empty hours are dropped instead of being set to 0.
Thank you!
set.seed(123)
library(dplyr)
library(lubridate)
latemail <- function(N, st="2012/01/01", et="2012/1/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
my_data <- tibble( fake_times = latemail(25),
count = 1)
my_data %>% group_by( rounded_hour = floor_date(fake_times, unit = "hour")) %>%
summarise( hourly_total = sum(count))
Assign your counts to an object
counts <- my_data %>% group_by( rounded_hour = floor_date(fake_times, unit = "hour")) %>%
summarise( hourly_total = sum(count))
Create a data frame with all the necessary hours
complete_data <- data.frame(rounded_hour = seq(floor_date(min(my_data$fake_times), unit = "hour"),
  floor_date(max(my_data$fake_times), unit = "hour"),
  by = "hour"))
Join to it and fill in the NAs.
complete_data %>%
  left_join(counts, by = "rounded_hour") %>%
  mutate(hourly_total = ifelse(is.na(hourly_total), 0, hourly_total))
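
The same join-and-fill idea can also be sketched with tidyr::complete(), which expands the hourly key to every value in a sequence and fills the missing totals in one step. The three timestamps below are invented for illustration and are not the data from the question.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Made-up events: two in the 00:00 hour, none at 01:00 or 02:00, one at 03:00.
events <- tibble(
  fake_times = ymd_hms(c("2012-01-01 00:15:00", "2012-01-01 00:45:00", "2012-01-01 03:10:00")),
  count = 1
)

hourly <- events %>%
  group_by(rounded_hour = floor_date(fake_times, unit = "hour")) %>%
  summarise(hourly_total = sum(count)) %>%
  # add a row for every hour between the first and last observed hour,
  # filling hours that had no events with 0
  complete(rounded_hour = seq(min(rounded_hour), max(rounded_hour), by = "hour"),
           fill = list(hourly_total = 0))
```

This only fills gaps between the first and last observed hour; to cover a fixed reporting window, the seq() endpoints would need to be supplied explicitly.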

Resources