I am new to R. I have a data frame with firm-level data such as revenue, profits and costs. I need to loop over three variables (revenue, profit and costs) with this code:
datagroup %>% group_by(treat) %>% summarise(n = n(), mean = mean(profit), std_error = sd(profit) / sqrt(n))
Basically, I would run the code for revenue and costs by replacing the variable profit. Could you assist? I tried for loops but to no avail.
We can do this in a loop over the column names as strings: convert each name to a symbol, unquote it with !!, and compute the mean and standard error:
library(tidyverse)
c("revenue", "costs") %>%
map(~ datagroup %>%
group_by(treat) %>%
summarise(n = n(),
!! str_c("mean_", .x) := mean(!! rlang::sym(.x)), # convert to symbol
!! str_c("std_error_", .x) := sd(!! rlang::sym(.x)) / sqrt(n)))
We can also do this with summarise_at
c("revenue", "costs") %>%
map(~ datagroup %>%
group_by(treat) %>%
group_by(n = n(), add = TRUE) %>%
summarise_at(vars(.x),
list(mean = ~ mean(.x),
std_error = ~ sd(.x)/sqrt(first(n)))))
The output will be a list of data.frames
Since you are new to R, consider base R for multiple aggregate functions on multiple numeric columns via a cbind + aggregate + do.call:
do.call(data.frame,
aggregate(cbind(revenue, costs, profit) ~ treat,
datagroup,
function(x) c(n = length(x),
mean = mean(x),
std_error = sd(x) / sqrt(length(x))
)
)
)
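With dplyr 1.0.0 or later, across() can cover all three columns in a single summarise() call, without an explicit loop. The following is a minimal sketch, not the answerers' code: the datagroup data frame is a made-up stand-in, and only its column names (treat, revenue, profit, costs) are taken from the question.

```r
library(dplyr)

# Hypothetical stand-in for datagroup; only the column names come from the question
set.seed(1)
datagroup <- data.frame(
  treat   = rep(c(0, 1), each = 5),
  revenue = runif(10, 100, 200),
  profit  = runif(10, 10, 50),
  costs   = runif(10, 50, 150)
)

result <- datagroup %>%
  group_by(treat) %>%
  summarise(
    n = n(),
    # apply both functions to each of the three columns
    across(
      c(revenue, profit, costs),
      list(mean = mean, std_error = ~ sd(.x) / sqrt(length(.x))),
      .names = "{.fn}_{.col}"
    )
  )

print(result)
```

The result is one data frame with one row per treat group and columns such as mean_revenue and std_error_costs, rather than a list of data frames.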
Related
Following my recent question, I am stuck with a mutate call that operates on columns it has only just created.
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
mutate(
across(where(is.numeric), ~ (sum(.x)) , .names="{.col}_sum"),
across(where(is.numeric), ~ (mean(.x)) , .names="{.col}_average"),
across(where(is.numeric), ~ (sd(.x)) , .names="{.col}_SD")
#total = sum(.data[[names(sheet[,10])]], na.rm = TRUE)
)
This selects all numeric columns and applies a function to each, writing the result to a new column. The problem is that each later across() call also picks up the columns created by the earlier ones, so for a column I get colname_sum, colname_sum_average and colname_sum_average_SD. How can I avoid this duplication?
Thanks to julian's answer, I can manage this with:
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(
where(is.numeric),
list(sum = sum, average = mean, sd = sd),
.names = "{.col}_{.fn}"
)
)
However, this raises another problem: if I define my own function, it is applied to each individual value instead of once per group:
testfunc <- function(.) {
. + 1000
}
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(
where(is.numeric),
list(sum = sum, average = mean, sd = sd, test = testfunc),
.names = "{.col}_{.fn}"
#total = sum(.data[[names(sheet[,10])]], na.rm = TRUE)
)
)
Use a list of functions inside the mutate
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
mutate(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}")
)
But it's better to use summarise instead of mutate
sheet <- sheet %>%
group_by(across(all_of(.GlobalEnv$filter_list[[select_filter]]))) %>%
summarise(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}")
)
Example with the iris dataset:
iris %>%
mutate(
across(where(is.numeric), list(sum = sum, average = mean, sd = sd) , .names="{.col}_{.fn}"))
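The difference between the two verbs matters for user-defined functions. summarise() expects functions that reduce a column to one value per group, while an element-wise function such as testfunc returns one value per row and fits mutate() instead. The sketch below uses iris grouped by Species and assumes the goal for testfunc is one value per group, so it reduces with sum() first; that choice of sum() is an illustration, not something stated in the question.

```r
library(dplyr)

# Element-wise function mirroring testfunc from the question
testfunc <- function(x) x + 1000

# Reducing functions (one value per group) belong in summarise()
per_group <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      where(is.numeric),
      list(sum = sum, average = mean, sd = sd,
           test = ~ testfunc(sum(.x))),  # reduce first, then transform (assumed intent)
      .names = "{.col}_{.fn}"
    )
  )

# Element-wise functions (one value per row) belong in mutate()
per_row <- iris %>%
  group_by(Species) %>%
  mutate(across(where(is.numeric), testfunc, .names = "{.col}_test")) %>%
  ungroup()

dim(per_group)
dim(per_row)
```

Because all functions sit in a single across() call, the newly created columns are not selected again, which avoids names like colname_sum_average.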
I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is, in split 1 compute the min across the first 4 columns, and so on. I currently do this by mapping over the splits, using pivot_longer to rearrange the data, computing the statistics, and finally using pivot_wider to get the data back into its original form. This is quite slow.
library(dplyr)
library(purrr)
library(tidyr)   # pivot_longer(), pivot_wider()
library(rsample) # rolling_origin(), analysis(), assessment()
rolling_iris <- rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())
The code below extracts the analysis set of each split, so rolling_iris_dfs is a list of data frames. You can then iterate over each data frame and compute the statistics.
rolling_iris_dfs <- map(rolling_iris$splits, analysis)
rolling_iris_stats <- map(rolling_iris_dfs, ~ .x %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
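The pivoting round trip can probably be avoided by pooling the four numeric columns of each window into one vector. This is only a sketch under two assumptions: that the statistics are meant to be pooled over columns 1 to 4 of each analysis set, and that Scale_Me should be applied to each column within its own window. The names analysis_dfs, window_stats and scaled_dfs are made up for the example.

```r
library(dplyr)
library(purrr)
library(rsample)

rolling_iris <- rolling_origin(iris, initial = 10, assess = 1,
                               cumulative = FALSE, skip = 0)

Scale_Me <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# One analysis data frame per rolling window
analysis_dfs <- map(rolling_iris$splits, analysis)

# One-row summary per window, pooled over the four numeric columns
window_stats <- analysis_dfs %>%
  map(function(df) {
    v <- unlist(df[1:4], use.names = FALSE)
    tibble(min = min(v), max = max(v), mean = mean(v), sd = sd(v))
  }) %>%
  bind_rows(.id = "window")

# Scale each numeric column within its own window (assumed intent);
# a column that is constant inside a window would give NaN here
scaled_dfs <- map(analysis_dfs, ~ mutate(.x, across(1:4, Scale_Me)))

head(window_stats)
```

Whether this is actually faster on the real data would need to be measured, for example with system.time().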
The problem is to apply a function f to each group of a tibble. There is a simpler way to do this, but I would like to solve the problem using the group_map() function.
Data used: starwars of the dplyr package.
What I want is to get an average of the height variable for a grouped tibble considering the variables gender and species. I know the problem could be easily solved by doing:
starwars %>% group_by(gender, species) %>%
summarise(mean = mean(height, na.rm = TRUE))
However, my desire is to implement summarise(mean = mean(height, na.rm = TRUE)) in a function and send to group_map().
I tried to create the f() function that gets the data argument which is a tibble object with the previously defined groups. The second argument of the f() function would be ... so that I could pass the variables of interest from data to f().
f <- function(dados, ...){
dados %>% summarise(mean = mean(..., na.rm = TRUE))
}
starwars %>% group_by(gender, species) %>%
group_map(.tbl = ., .f = ~f(dados = .x), height)
Solutions:
func_1 <- function(dados, var, ...){
var_interesse <- enquo(var)
dots <- enquos(...)
# The dots could also be passed directly, as group_by(...)
dados %>% group_by(!!!dots) %>%
summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
starwars %>% func_1(var = height, gender, species)
or
func_2 <- function(dados, var){
var_interesse <- enquo(var)
#dots <- enquos(...)
dados %>% summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
agrupamento <- starwars %>% group_by(gender, species)
agrupamento %>%
group_map(.tbl = ., .f = ~func_2(dados = .x, var = height))
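The same idea can be written with the embrace operator {{ }} (rlang 0.4.0 or later), which combines enquo() and !! in one step. The sketch below also contrasts group_map(), which returns a list, with group_modify(), which returns a single tibble that keeps the grouping keys. The helper name mean_of is made up for this example.

```r
library(dplyr)

# Hypothetical helper: summarises one group; {{ }} forwards the column name
mean_of <- function(dados, var) {
  dados %>% summarise(media = mean({{ var }}, na.rm = TRUE))
}

agrupamento <- starwars %>% group_by(gender, species)

# group_map() returns a list with one tibble per group
por_lista <- agrupamento %>% group_map(~ mean_of(.x, height))

# group_modify() returns one grouped tibble with gender and species kept
por_tibble <- agrupamento %>% group_modify(~ mean_of(.x, height))

length(por_lista)
head(por_tibble)
```

group_modify() is usually the more convenient choice when the per-group results should end up in one table alongside their keys.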
dplyr programming question here. I am trying to write a dplyr function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following data frame, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that test and test2 are two different data frames. The output of the function does not contain the counts from the logical column referenced by y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the total_2 column in test2 is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
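Two things likely explain the mismatches. In the first attempt, !!cumulative_y == TRUE is parsed as !!(cumulative_y == TRUE) because of the low precedence of !, so the comparison is unquoted rather than the column; writing (!!cumulative_y) == TRUE avoids that. In the second attempt, data[[cumulative_y]] refers to the whole ungrouped column, so every date gets the overall count, which cumsum() then accumulates. A minimal sketch of a corrected version, using {{ }} (rlang 0.4.0 or later) and converting minimum_date with as.Date() as an added assumption:

```r
library(dplyr)

cumsum_df <- function(data, date_field, cumulative_y,
                      minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total),
           cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

# Test data as in the question, with a seed added for reproducibility
set.seed(42)
my_df <- data.frame(
  x = sample(1:100, 10),
  y = sample(c(TRUE, FALSE), 10, replace = TRUE),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
)

test <- my_df %>%
  group_by(date) %>%
  summarise(total = n(), total_2 = sum(y == TRUE, na.rm = TRUE)) %>%
  mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
  ungroup() %>%
  filter(date >= as.Date("2018-01-03"))

test2 <- cumsum_df(my_df, date, y, minimum_date = "2018-01-03")

all.equal(as.data.frame(test), as.data.frame(test2))
```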
New to (d)plyr, working through chaining, a basic question - for the hflights example, want to use one of these embedded vars to make a basic plot:
hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE)
) %>%
plot (Month, arr)
Returns:
Error in match.fun(panel) : object 'arr' not found
I can make this work going step by step, but can I get where I want to go somehow with %>%...
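plot() doesn't work that way: the pipe passes the data frame as the first argument, so Month and arr become extra arguments that are looked up outside the data. The closest you could get is the formula interface: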
library(dplyr)
library(hflights)
summary <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE)
)
summary %>%
plot(arr ~ Month, .)
Another alternative is to use ggvis, which is explicitly designed to work with pipes:
library(ggvis)
summary %>%
ggvis(~Month, ~arr)
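If base plot() with plain column names is preferred, with() or a braced expression keeps everything in one chain. This sketch uses the built-in airquality data as a stand-in for hflights so that it runs without extra packages; the names monthly and ozone are made up for the example.

```r
library(dplyr)

# airquality ships with R and stands in for hflights here
monthly <- airquality %>%
  group_by(Month) %>%
  summarise(ozone = mean(Ozone, na.rm = TRUE))

# with() evaluates plot() using the columns of the piped data frame
monthly %>% with(plot(Month, ozone))

# Braces stop %>% from also inserting the data as the first argument
monthly %>% { plot(.$Month, .$ozone) }
```

Both calls draw the same scatter plot; the formula version plot(ozone ~ Month, .) is equivalent.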