aggregate data by 5min excluding max and min - r

I have a data frame like so:
Time <- seq.POSIXt(as.POSIXct("2017-11-14 00:01:00 CET"), as.POSIXct("2017-11-14 00:15:00 CET"), units = "minute", by=60)
A <- c(2,3,5,2,5,8,17,3,5,8,17,3,5,1,5)
B <- c(1,1,2,1,2,1,2,2,2,4,6,7,8,8,9)
DF <- data.frame(Time=Time, A=A, B=B)
and I want a "newDF" in which the data are aggregated in 5-minute bins, excluding, for each column, the max and min values before the aggregation.
Using dplyr I get to something like this:
library(dplyr)
library(lubridate)  # for floor_date()
DF$TimeStamp_round <- floor_date(DF$Time, unit = "5 minutes")
DF <- DF %>%
  group_by(TimeStamp_round) %>%
  mutate(TimeStamp_count = cur_group_id())
newDF <- DF %>%
  group_by(TimeStamp_count) %>%
  summarise(across(where(is.numeric), mean))
but I still can't exclude the max/min values before the summarise() call in newDF.
Note: I do not want to do it manually for each column, because the real DF has 350 columns.

We can remove the range (min and max) values before taking the mean, after grouping by 'TimeStamp_round':
library(dplyr)
DF %>%
  group_by(TimeStamp_round) %>%
  summarise(across(A:B, ~ mean(.[!. %in% range(.)])), .groups = 'drop')
Or, if there are more columns and the mean is wanted only for the numeric ones:
DF %>%
  select(-Time) %>%
  group_by(TimeStamp_round) %>%
  summarise(across(where(is.numeric),
                   ~ mean(.[!. %in% range(.)])), .groups = 'drop')
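Note that `!. %in% range(.)` drops every value equal to the group minimum or maximum, so tied values are removed as well. If only one minimum and one maximum should be dropped per group, a variant along these lines may be closer to the intent. This is a minimal sketch: `mean_drop_one_extreme` is a hypothetical helper name (not part of dplyr), and the vector is simply the first five values of A.

```r
# Hypothetical helper: drop exactly one minimum and one maximum, then average
mean_drop_one_extreme <- function(x) {
  x <- sort(x[!is.na(x)])
  if (length(x) <= 2) return(NA_real_)  # nothing left after dropping two values
  mean(x[-c(1, length(x))])
}

x <- c(2, 3, 5, 2, 5)                   # first five values of A

drop_all <- mean(x[!x %in% range(x)])   # removes both 2s and both 5s, leaving 3
drop_one <- mean_drop_one_extreme(x)    # removes one 2 and one 5, leaving 2, 3, 5
```

The helper can be passed in place of the lambda, e.g. `across(where(is.numeric), mean_drop_one_extreme)`.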

Related

Getting rid of NA values in R when trying to aggregate columns

I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
  group_by(location) %>%
  summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>% # or whatever
  summarise(Last_value_vacc = last(na.omit(people_vaccinated_per_hundred)))
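The reason this helps: `last()` returns whatever sits in the final position, including `NA`, whereas `na.omit()` strips the missing values first. A small sketch with made-up numbers (the `demo` data frame is hypothetical and only reuses the question's column names):

```r
library(dplyr)

# Hypothetical data: the final observation of each location is missing
demo <- data.frame(
  location = c("A", "A", "A", "B", "B"),
  date = as.Date("2021-01-01") + c(0, 1, 2, 0, 1),
  people_vaccinated_per_hundred = c(1.2, 3.4, NA, 0.5, NA)
)

res <- demo %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(
    last_raw   = last(people_vaccinated_per_hundred),          # NA if the final row is missing
    last_known = last(na.omit(people_vaccinated_per_hundred))  # last non-missing value
  )
```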

R rolling mean on a non-continuous time series

I want to make a rolling mean on the last X number of days. rollmean() does that using rows. Since I am using loggers that sometimes fail, and also the data were cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great. Except my data need to be grouped (in the example by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
# making some example data
library(dplyr)
# vector with days since the beginning of experiment
days <- 0:30
# random values
df1 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[1], 31, replace = TRUE)
)
df2 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[2], 31, replace = TRUE)
)
df <- full_join(df1, df2)
# how long should be the period for mean
time_period <- 10 # calculate for last 10 days
df_mean <- df %>%
  dplyr::mutate(
    # calculate rolling mean
    roll_mean = purrr::map_dbl(
      .x = days_since_beginning,
      .f = ~ df %>%
        # select only data for the last `time_period`
        dplyr::filter(days_since_beginning >= .x - time_period &
                        days_since_beginning <= .x) %>%
        purrr::pluck("value_to_used") %>%
        mean() %>%
        return()
    )
  )
This takes the mean over the last 10 days by treatment. The width argument includes a computation of how many rows back to use so that it corresponds to 10 days rather than 10 rows. This uses the fact that width can be a vector.
library(dplyr)
library(zoo)
df %>%
  group_by(treatment) %>%
  mutate(roll = rollapplyr(value_to_used,
                           seq_along(days_since_beginning) - findInterval(days_since_beginning - 10, days_since_beginning),
                           mean)) %>%
  ungroup()
Same colleague came up with his own solution:
df_mean <-
  df %>%
  dplyr::group_by(treatment) %>%
  tidyr::nest() %>%
  dplyr::mutate(
    data_with_mean = purrr::map(
      .x = data,
      .f = ~ {
        dataset <- .x
        dataset %>%
          dplyr::mutate(
            # calculate rolling mean
            roll_mean = purrr::map_dbl(
              .x = days_since_beginning,
              .f = ~ dataset %>%
                # select only data for the last `time_period`
                dplyr::filter(days_since_beginning >= .x - time_period &
                                days_since_beginning <= .x) %>%
                purrr::pluck("value_to_used") %>%
                mean() %>%
                return()
            )) %>%
          return()
      }
    )) %>%
  dplyr::select(-data) %>%
  tidyr::unnest(data_with_mean) %>%
  dplyr::ungroup()
I compared the results with G. Grothendieck's idea, and it only matches if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how the time_period is used, and I am confused about why it happens.
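The difference appears to come from how the window endpoints are counted. The `filter()` in the colleague's code keeps both endpoints (`>= .x - time_period` and `<= .x`), which spans `time_period + 1` daily values, while `seq_along(d) - findInterval(d - 10, d)` counts only the rows whose day is strictly greater than `d - 10`, i.e. 10 values. A quick base R check on the example's `days` vector (the `n_*` names are illustrative only):

```r
days <- 0:30
time_period <- 10

# Rows kept by the filter(): both endpoints are included
n_filter <- sapply(days, function(d) sum(days >= d - time_period & days <= d))

# Width passed to rollapplyr(): rows with day strictly greater than d - time_period
n_width <- seq_along(days) - findInterval(days - time_period, days)

# Same width computation, but with time_period + 1
n_width_plus1 <- seq_along(days) - findInterval(days - (time_period + 1), days)

# Window sizes at day 15 (position 16)
c(filter = n_filter[16], width = n_width[16], width_plus1 = n_width_plus1[16])
```

For these integer days the filter-based window and the `time_period + 1` width contain the same rows, which is consistent with the observation that the results only match with that offset.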

Manipulating data.frame while using cycles and storing values in a list

I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in this vector:
seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days') #.... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))
STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since '2020-03-02', one after another. Each KOKOKO and STAVPTF created for a given day should be saved as a separate data.frame, and all of them stored in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
             filter(DATE < .x) %>%
             summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function would make it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)
  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))
  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    tidyr::drop_na()
  return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example. That is easy to do with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO <-
  data.frame %>%
  arrange(DATE) %>% # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK, na.rm = TRUE), .groups = 'drop') %>% # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
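To see why `cumsum(CZK) - CZK` reproduces the `filter(DATE < ...)` totals, here is a small base R check. The `daily` data frame is hypothetical (one already-summarised row per date, sorted by date); it is only meant to illustrate the identity, not the real data.

```r
# Hypothetical daily totals, one row per DATE, sorted by date
daily <- data.frame(
  DATE = as.Date("2020-03-02") + 0:3,
  CZK  = c(100, 50, 25, 10)
)

# Cumulative sum excluding the current date
daily$before <- cumsum(daily$CZK) - daily$CZK

# The same totals via explicit filtering, one date at a time
check <- sapply(daily$DATE, function(d) sum(daily$CZK[daily$DATE < d]))
```

Both approaches give the running total of everything strictly before each date, but the cumulative sum does it in one pass instead of one filter per day.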
Even STAVPTF code can probably be rewritten without iterations. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE. Then this value is assigned to the next DATE:
STAVPTF <-
  data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>% # if necessary
  mutate(DATE = lead(DATE))

Calculate cumulative sum over time stamp with dplyr

I'm trying to calculate cumulative sums over a time span. Is there a way to calculate this in a single step? Any package recommendations?
activate_2019 <- activate_rate %>%
  filter(grepl("2019", join_day)) %>%
  summarize(proportion = sum(if_activate) / n())
activate_2020 <- activate_rate %>%
  filter(grepl("2019|2020", join_day)) %>%
  summarize(proportion = sum(if_activate) / n())
activate_2021 <- activate_rate %>%
  filter(grepl("2019|2020|2021", join_day)) %>%
  summarize(proportion = sum(if_activate) / n())
Here is one method with tidyverse:
1. Extract the unique years from the 'join_day' column.
2. Loop over those, and slice the rows of 'activate_rate' up to the last row whose 'join_day' matches the year being looped over.
3. Summarise by taking the mean of 'if_activate'.
4. Bind the outputs by column with map_dfc.
library(stringr)
library(dplyr)
library(purrr)
un1 <- str_extract_all(activate_rate$join_day, "\\d{4}") %>%
  unlist %>%
  unique %>%
  as.integer %>%
  sort
map_dfc(un1, ~ activate_rate %>%
          arrange(as.Date(join_day)) %>%
          slice(seq(max(grep(as.character(.x), join_day)))) %>%
          summarise(!!str_c("proportion", .x) := mean(if_activate)))
If I understand correctly, this should do the trick:
library(dplyr)
library(lubridate)  # for floor_date()
activate_rate %>%
  mutate(year = floor_date(as.Date(join_day), unit = "year")) %>%
  group_by(year) %>%
  summarise(proportion = sum(if_activate) / n())
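The three snippets in the question are cumulative (2019, then 2019 to 2020, then 2019 to 2021), whereas grouping by year gives one proportion per year. If the running proportion is what is wanted, one possible sketch is to take cumulative sums of the yearly counts. The `demo` data below is made up, and `join_day` is assumed to be a Date (or convertible with `as.Date`).

```r
library(dplyr)

# Hypothetical stand-in for activate_rate
demo <- data.frame(
  join_day = as.Date(c("2019-03-01", "2019-07-15", "2020-01-10",
                       "2020-05-05", "2021-02-02")),
  if_activate = c(1, 0, 1, 1, 0)
)

cum_prop <- demo %>%
  mutate(year = as.integer(format(join_day, "%Y"))) %>%
  group_by(year) %>%
  summarise(activated = sum(if_activate), total = n(), .groups = "drop") %>%
  arrange(year) %>%
  # running activations divided by running row count
  mutate(proportion = cumsum(activated) / cumsum(total))
```

Each row of `cum_prop` then covers all rows up to and including that year, which corresponds to the question's `grepl("2019|2020", ...)` style filters.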

dplyr summarise and then summarise_at in the same pipe

This question has come up before and there are some solutions but none that I could find for this specific case. e.g.
library(dplyr)
library(ggplot2)  # provides the diamonds dataset
my_diamonds <- diamonds %>%
  mutate(blah_var1 = rnorm(n()),
         blah_var2 = rnorm(n()),
         blah_var3 = rnorm(n()),
         blah_var4 = rnorm(n()),
         blah_var5 = rnorm(n()))
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table), .groups = 'drop') %>%
  summarise_at(vars(contains('blah')), mean)
I want a new df showing the max clarity, the min table, and the mean of each of the blah variables. The above returned an empty tibble. Based on some other SO posts, I tried using mutate and then summarise_at:
my_diamonds %>%
  group_by(cut) %>%
  mutate(MaxClarity = max(clarity),
         MinTable = min(table)) %>%
  summarise_at(vars(contains('blah')), mean)
This returns a tibble but only for the blah variables, MaxClarity and MinTable are missing.
Is there a way to combine summarise and summarise_at in the same dplyr chain?
The issue is that after the first call to summarise, only the grouping column ('cut') and the summarised columns ('MaxClarity' and 'MinTable') remain, so there are no 'blah' columns left for summarise_at to select. In addition, after the first summarise step the grouping is removed by .groups = 'drop'. With across, everything can be computed in a single summarise call:
library(dplyr) # version >= 1.0
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table),
            across(contains('blah'), ~ mean(.x, na.rm = TRUE)), .groups = 'drop')
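For a version that runs without ggplot2, the same pattern can be sketched on the built-in `mtcars` data. The column choices below are arbitrary stand-ins for 'clarity', 'table' and the 'blah' variables, not something taken from the question.

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    MaxHp = max(hp),                                   # plays the role of MaxClarity
    MinWt = min(wt),                                   # plays the role of MinTable
    across(c(drat, qsec), ~ mean(.x, na.rm = TRUE)),   # plays the role of the blah columns
    .groups = "drop"
  )
```

The result has one row per `cyl` group, with the named summaries and the `across()` means side by side.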

Resources