More efficient way of taking averages over multiple lists - r

I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is, for split 1, compute the min across the first 4 columns, and so on. I currently do this by mapping over the splits, using pivot_longer to rearrange the data, computing the statistics, and finally using pivot_wider to get the data back into its original form. This is quite slow.
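library(dplyr)
library(purrr)
library(tidyr)    # pivot_longer(), pivot_wider()
library(rsample)  # rolling_origin(), analysis(), assessment()
# iris ships with base R, so no extra loading is needed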
rolling_iris <- rsample::rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())

The code below extracts the analysis window of each split, so rolling_iris_dfs is a list of data frames. You can then iterate over each data frame and compute the statistics.
rolling_iris_dfs <- map(rolling_iris$splits, analysis)
rolling_iris_stats <- map(rolling_iris_dfs, ~ .x %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
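If the long/wide round trip is the bottleneck, one option is to skip reshaping altogether and treat the numeric columns as a matrix. The following is a minimal sketch in base R under a few assumptions: the windows are built by hand as a stand-in for calling analysis() on each split, and window_stats and scale_window are hypothetical helper names, not part of rsample.

```r
# Stand-in for map(rolling_iris$splits, analysis): rolling windows of 10 rows.
window_size <- 10
starts <- seq_len(nrow(iris) - window_size)
windows <- lapply(starts, function(i) iris[i:(i + window_size - 1), ])

# Hypothetical helper: pooled statistics over the four numeric columns.
window_stats <- function(df) {
  v <- as.vector(as.matrix(df[, 1:4]))
  c(min = min(v), max = max(v), mean = mean(v), sd = sd(v))
}

# One row of statistics per window.
stats <- t(vapply(windows, window_stats, numeric(4)))

# Hypothetical helper: min-max scale the numeric columns using the pooled range.
scale_window <- function(df) {
  v <- as.matrix(df[, 1:4])
  df[, 1:4] <- (v - min(v)) / (max(v) - min(v))
  df
}
scaled <- lapply(windows, scale_window)
```

Whether this is faster than the pivot-based version depends on the data, so it is worth timing both on the real windows.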

Related

R - mutate columns loops over mutated columns when specifying multiple columns to add

Following my recent question, I am stuck with a mutate() call that also operates on the columns it has just created.
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
mutate(
across(where(is.numeric), ~ (sum(.x)) , .names="{.col}_sum"),
across(where(is.numeric), ~ (mean(.x)) , .names="{.col}_average"),
across(where(is.numeric), ~ (sd(.x)) , .names="{.col}_SD")
#total = sum(.data[[names(sheet[,10])]], na.rm = TRUE)
)
This selects all numeric columns and applies a function to each one, writing the result to a new column. The problem is that, for a column whose sum was computed, I get colname_sum, colname_sum_average and colname_sum_average_SD. How can I avoid this duplication?
Thanks to @julian's answer, I can manage this with:
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(
where(is.numeric),
list(sum = sum, average = mean, sd = sd),
.names = "{.col}_{.fn}"
)
)
This brings up another problem, though: if I define my own function, the result is computed for each value instead of once per group:
testfunc <- function(.) {
. + 1000
}
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(
where(is.numeric),
list(sum = sum, average = mean, sd = sd, test = testfunc),
.names = "{.col}_{.fn}"
#total = sum(.data[[names(sheet[,10])]], na.rm = TRUE)
)
)
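The extra rows appear because testfunc returns one value per input value, while sum, mean and sd each collapse a group to a single value. A minimal sketch with iris, assuming the intent is one result per group; add_to_total is a hypothetical replacement that reduces the column before adding the constant:

```r
library(dplyr)

# Hypothetical summary function: collapses the group to a single value first.
add_to_total <- function(x) sum(x) + 1000

per_group <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      where(is.numeric),
      list(sum = sum, test = add_to_total),
      .names = "{.col}_{.fn}"
    )
  )
```

If a per-row result is really what is wanted, mutate() is the better fit than summarise().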
Use a list of functions inside the mutate
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
mutate(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}")
)
But it's better to use summarise instead of mutate
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}")
)
Example with the iris dataset
iris %>%
mutate(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}"))
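The duplication in the question comes from the order of evaluation: within one mutate(), each across() call sees the columns created by the calls before it. A small sketch on iris that contrasts the two forms; chained and single are hypothetical names used only for this comparison:

```r
library(dplyr)

# Separate across() calls: the second call also picks up the *_sum columns.
chained <- iris %>%
  group_by(Species) %>%
  mutate(
    across(where(is.numeric), sum, .names = "{.col}_sum"),
    across(where(is.numeric), mean, .names = "{.col}_average")
  )

# One across() call with a list of functions: only the original columns are used.
single <- iris %>%
  group_by(Species) %>%
  mutate(
    across(where(is.numeric), list(sum = sum, average = mean), .names = "{.col}_{.fn}")
  )

# Columns that exist only in the chained version.
setdiff(names(chained), names(single))
```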

Looping Multiple Variables in R

I am new to R. I have a data frame with firm level data such as revenue, profits and costs. I would need to loop through 3 variables - revenue, profit and costs over this code:
datagroup %>% group_by(treat) %>% summarise(n = n(), mean = mean(profit), std_error = sd(profit) / sqrt(n))
Basically, I would run the code for revenue and costs by replacing the variable profit. Could you assist? I tried for loops but to no avail.
We can do this in a loop with the column name as string, then convert it to symbol, evaluate (!!) and get the mean
library(tidyverse)
c("revenue", "costs") %>%
map(~ datagroup %>%
group_by(treat) %>%
summarise(n = n(),
!! str_c("mean_", .x) := mean(!! rlang::sym(.x)), # convert to symbol
!! str_c("std_error_", .x) := sd(!! rlang::sym(.x)) / sqrt(n)))
We can also do this with summarise_at
c("revenue", "costs") %>%
map(~ datagroup %>%
group_by(treat) %>%
group_by(n = n(), add = TRUE) %>%
summarise_at(vars(.x),
list(mean = ~ mean(.x),
std_error = ~ sd(.x)/sqrt(first(n)))))
The output will be a list of data.frames
Since you are new to R, consider base R for multiple aggregate functions on multiple numeric columns via a cbind + aggregate + do.call:
do.call(data.frame,
aggregate(cbind(revenue, cost, profit) ~ treat,
datagroup,
function(x) c(n = length(x),
mean = mean(x),
std_error = sd(x) / sqrt(length(x))
)
)
)
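With dplyr 1.0 or later, across() can also cover all three variables in a single summarise() call, which returns one data frame instead of a list. A sketch under assumptions: datagroup is simulated here, and the column names revenue, profit and costs are guesses at the real ones.

```r
library(dplyr)

# Simulated stand-in for the real firm-level data.
set.seed(1)
datagroup <- data.frame(
  treat   = rep(c(0, 1), each = 5),
  revenue = runif(10, 100, 200),
  profit  = runif(10, 10, 50),
  costs   = runif(10, 50, 150)
)

result <- datagroup %>%
  group_by(treat) %>%
  summarise(
    n = n(),
    across(
      c(revenue, profit, costs),
      list(mean = mean, std_error = ~ sd(.x) / sqrt(length(.x))),
      .names = "{.fn}_{.col}"
    )
  )
```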

Mutate within nested data frame

I would like to perform kmeans within groups and add to my data information about cluster number and center which an observation was assigned to (still, within groups so cluster 1 is not the same for group A and group B). I thought that I can pluck cluster assignment and centroid from kmeans and then maybe join these two with each other and finally, with original data. To do the former I wanted to add a row number to data frames with centers and then join by the number of cluster. But how can I add row number within nested data frames? The following code works well until the last, 'nested' mutate.
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(km_cluster = map(data, ~kmeans(.x, 3) %>% pluck('cluster')),
km_centers = map(data, ~kmeans(.x, 3) %>% pluck('centers') %>% mutate(cluster = row_number())))
@Luke.sonnet provided an answer that works well with map but, interestingly, not with map2; see below:
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = row_number())))
Any ideas how to solve the issue in that case? And equally important, what is the cause of such behaviour?
The problem is that pluck() is returning a matrix. Cast to a tibble first and number differently.
library(tidyverse)
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = seq_len(nrow(.)))))
Note that you can also use mutate(cluster = row_number(x)), but this gives different numbers because it ranks the values of x; a bare row_number() uses the rows of the parent data frame. Since kmeans orders the rows of the centers matrix by cluster number, I think the numbering in the main chunk is the correct one.
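The code above fits kmeans twice per group, once for the assignments and once for the centers. Because kmeans starts from randomly chosen centers, the two fits need not label the clusters the same way. A sketch that fits once per group and reads both pieces from the same object; centers_tbl is a hypothetical helper, and the numbering assumes that row i of the centers matrix belongs to cluster i:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
my_data <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  x = runif(100, 0, 10),
  y = runif(100, 0, 10)
)

# Hypothetical helper: centers as a data frame with an explicit cluster id.
centers_tbl <- function(km) {
  out <- as.data.frame(km$centers)
  out$cluster <- seq_len(nrow(out))
  out
}

fitted <- my_data %>%
  group_by(group) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    number     = sample(3:7, n()),
    km         = map2(data, number, ~ kmeans(.x, centers = .y)),  # one fit per group
    km_cluster = map(km, "cluster"),
    km_centers = map(km, centers_tbl)
  )
```

Building the cluster id outside of a nested mutate() also avoids any question of which data frame row_number() refers to.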

Use variable names in function in dplyr for sum and cumsum

A dplyr programming question. I am trying to write a function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following data frame, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frames for test and test2. The one produced by the function does not contain the counts from the logical column referenced by y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output in test2 is also wildly different; it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
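One likely cause of the first mismatch is operator precedence: ! binds more loosely than ==, so !!cumulative_y == TRUE is parsed as !!(cumulative_y == TRUE) rather than (!!cumulative_y) == TRUE. In the second attempt, data[[cumulative_y]] is the whole ungrouped column, so every date receives the overall count. A sketch using the {{ }} operator (assuming rlang 0.4.0 or later), which sidesteps both problems; the sample data is simulated:

```r
library(dplyr)

set.seed(1)
my_df <- data.frame(
  x    = sample(1:100, 20),
  y    = sample(c(TRUE, FALSE), 20, replace = TRUE),
  date = rep(seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1), each = 2)
)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(
      total   = n(),
      total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)
    ) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y,
                   minimum_date = "2018-01-03")
```

Keeping enquo() and writing (!!cumulative_y) == TRUE with explicit parentheses should behave the same way.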

summarise mean of a specific column in dplyr

I would like to summarise a grouped data.frame without knowing the name of the column. What I do know is that the feature is always the third column of this data.frame. Is that possible?
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),  # data_frame() is deprecated
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but only for the parameter 'perc' and not for 'temp' or any other. Something like
avg = mean(df[[3]])
doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(mean)  # funs() is deprecated; pass the function directly
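The attempt with mean(df[[3]]) ignores the grouping because df[[3]] is the full, ungrouped column. An alternative sketch looks up the name of the third column first and then refers to it through the .data pronoun, which does respect the groups; target is a hypothetical variable name:

```r
library(dplyr)

set.seed(1)
df <- tibble(
  date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
  group = rep(c("A", "B"), 3),
  temperature = runif(6, -10, 30),
  percipitation = runif(6, 0, 5)
)

# Name of the column at position 3, whatever it happens to be called.
target <- names(df)[3]

df1 <- df %>%
  group_by(group) %>%
  summarise(avg = mean(.data[[target]]))
```

This keeps the result column named avg regardless of which feature sits in the third position.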
