I have a dataframe that looks like this:
df = data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5), race = rep(1:2, 5))
I'm trying to write a function that takes a dataframe as its first argument, together with any number of arguments that represent column names in that dataframe, and uses these column names to perform operations on the dataframe. My function looks like this:
library(dplyr)
myFunction <- function(df, ...){
  columns <- list(...)
  for (i in 1:length(columns)){
    var <- enquo(columns[[i]])
    df <- df %>% group_by(!!var)
  }
  df2 = summarise(df, mean = mean(wt))
  return(df2)
}
I call the function as follows:
myFunction(df, race, gender)
However, I get the following error message:
Error in myFunction(df, race, gender) : object 'race' not found
The problem is that list(...) tries to evaluate race and gender, which only exist as columns inside the dataframe, and enquo(columns[[i]]) captures the expression columns[[i]] rather than a column name. Instead, we can capture the elements of ... as quosures with quos() and then do the evaluation by unquote-splicing them with !!!
myFunction <- function(dat, ...){
  columns <- quos(...)           # convert to quosures
  dat %>%
    group_by(!!!columns) %>%     # evaluate
    summarise(mean = mean(wt))
}
myFunction(df, race, gender)
# A tibble: 2 x 3
# Groups: race [?]
# race gender mean
# <int> <int> <dbl>
#1 1 1 75
#2 2 2 76
myFunction(df, race)
# A tibble: 2 x 2
# race mean
# <int> <dbl>
#1 1 75
#2 2 76
NOTE: In the OP's example, 'race' and 'gender' have identical values, so grouping by both gives the same result as grouping by either one. If we change one of them, we will see the difference:
df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
race = rep(1:2, each = 5))
myFunction(df, race, gender)
myFunction(df, race)
myFunction(df, gender)
If we decide to pass the arguments as quoted strings instead, we can make use of group_by_at:
myFunction2 <- function(df, ...) {
  columns <- c(...)
  df %>%
    group_by_at(columns) %>%
    summarise(mean = mean(wt))
}
myFunction2(df, "race", "gender")
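Note that group_by_at() has since been superseded. A minimal sketch of the same string-based interface with across(all_of()), assuming dplyr >= 1.0.0 is available (myFunction3 is just an illustrative name):
library(dplyr)
myFunction3 <- function(df, ...) {
  columns <- c(...)                          # column names as strings
  df %>%
    group_by(across(all_of(columns))) %>%    # group by every named column
    summarise(mean = mean(wt), .groups = "drop")
}
myFunction3(df, "race", "gender")
all_of() errors if a requested column is missing from the data, which makes typos in the column names easier to catch.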
Related
I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate the mean/SE per Time for each of var1, var2, ..., varN, and I want to do this programmatically for all variables, rather than one at a time, which would involve a lot of copy-pasting.
Section 8.2.3 of https://tidyeval.tidyverse.org/dplyr.html is close to what I want, but my code below fails:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
  print(.data)
  summary_vars <- enquos(...)
  print(summary_vars)
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  print(summary_vars)
  .data %>%
    group_by(time)
  summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
The original error (on my real data) was "Must group by variables found in .data.", referring to a column that isn't in the dummy "x" I generated for testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
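For reference, the immediate error in the code above comes from the missing %>% between group_by(time) and summarise(): without the pipe, !!!summary_vars is evaluated as ordinary R code, which is why R complains about !summary_vars. A sketch of the corrected tidy-eval version, assuming dplyr >= 0.8 for enquos() and rlang for expr(), called with bare column names rather than var("var1") (grouped_mean3b is just an illustrative name):
library(dplyr)
library(rlang)
grouped_mean3b <- function(.data, ...) {
  summary_vars <- enquos(...)                  # capture bare column names
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))            # build mean(<col>, na.rm = TRUE) calls
  })
  .data %>%
    group_by(time) %>%                         # the pipe missing in the original
    summarise(!!!summary_vars)                 # unquote-splice the list of calls
}
grouped_mean3b(x, var1, var2)
The summary columns keep auto-generated names here; naming the elements of summary_vars, or using across() as in the answer below, gives cleaner column names.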
Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
  vars <- c(...)
  .data %>%
    group_by(time) %>%
    summarise(across(all_of(vars), mean, .names = "{.col}mean"))  # names the results var1mean, var2mean, ...
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
Perhaps this is what you are looking for?
x %>%
  group_by(time) %>%
  summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
  rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
  merge(x)
With your data (from your question), the following is the output:
  time var1mean var2mean var1 var2
1    1        2        3    2    3
2    4        5        6    5    6
I am writing a function that computes the mean of a variable according to some grouping (g1 and g2). I would like the function to take care of the case when the user just wants to compute the mean across the groups, so the group argument will be empty.
I want a solution using tidyverse.
Suppose the following:
y = 1:4
g1 = c('a', 'a', 'b', 'b')
g2 = c(1,2,1,2)
MyData = data.frame(g1, g2, y)
MyFun = function(group){
  group_sym = syms(group)
  MyData %>%
    group_by(!!!group_sym) %>%
    summarise(mean = mean(y))
}
# this works well
MyFun(group = c('g1', 'g2'))
Now suppose I want the mean of y across all groups. I would like the function to be able to handle something like
MyFun(group = '')
or
MyFun(group = NULL)
So ideally I would like the group argument to be empty / NULL so that MyData would not be grouped. One solution could be to add a condition at the beginning of the function checking whether the argument is empty and, if so, calling summarise without group_by. But this is not elegant, and my real code is much longer than just a few lines.
Any idea?
1) Use {{...}} and use g1 in place of 'g1':
MyFun = function(group) {
  MyData %>%
    group_by({{group}}) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun(g1)
## # A tibble: 2 x 2
## g1 mean
## <fct> <dbl>
## 1 a 1.5
## 2 b 3.5
MyFun()
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 2.5
2) This approach uses 'g1' as in the question.
MyFun = function(group) {
  group <- if (missing(group)) 'All' else sym(group)
  MyData %>%
    group_by(!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun('g1')
## # A tibble: 2 x 2
## g1 mean
## <fct> <dbl>
## 1 a 1.5
## 2 b 3.5
MyFun()
## # A tibble: 1 x 2
## `"All"` mean
## <chr> <dbl>
## 1 All 2.5
3) This also works and gives the same output as (2).
MyFun = function(...) {
  group <- if (...length()) syms(c(...)) else 'All'   # c(...) so several names can be passed
  MyData %>%
    group_by(!!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun('g1')
MyFun()
A different approach consists of creating a fake group (named 'across_group') in the data when group is missing.
MyFun = function(group) {
  if (missing(group)) MyData$across_group = 1
  group <- if (missing(group)) syms('across_group') else syms(group)
  MyData %>%
    group_by(!!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun()
# A tibble: 1 x 2
across_group mean
<dbl> <dbl>
1 1 2.5
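A related pattern: if the groups are passed as a character vector, group_by(across(all_of(group))) handles the empty case without any branching, because selecting zero columns leaves the data ungrouped. A minimal sketch, assuming dplyr >= 1.0.0 (MyFun2 is just an illustrative name):
library(dplyr)
MyFun2 <- function(group = character(0)) {
  MyData %>%
    group_by(across(all_of(group))) %>%   # zero selected columns => no grouping
    summarise(mean = mean(y), .groups = "drop")
}
MyFun2(c("g1", "g2"))   # grouped means
MyFun2()                # overall mean, no special-casing needed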
I am trying to create a data frame of the column type and the unique values of each column.
I am able to get the column type in the desired data-frame format using map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class), but I am unable to get the unique values to become a data frame instead of a list.
Below is a small data frame and code that gets the unique values in a list, but not a data frame. Ideally, I could do this in one (map) function, but if I have to join them, it would not be a big deal.
df <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
library(tidyverse)
map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class)
map(df, unique)
When I try to use the same method on the map(df, unique) as on the map(df, class) I get the following error: Error: Argument 2 must be length 3, not 2 which is expected, but I am not sure how to get around it.
The number of unique values differs between the two columns, so they cannot be bound directly. You need to reduce each column's unique values to a single element.
df2 <- map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
> df2
# A tibble: 2 x 2
col_name col_unique
<chr> <chr>
1 v1 1,2,3
2 v2 a,b
We could use map_df and get the class and unique values from each column into one tibble. Since every column could have values of a different type, we need to bring them into one common class to bind the data together in one dataframe.
purrr::map_df(df,~tibble::tibble(class = class(.), value = as.character(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1
#2 numeric 2
#3 numeric 3
#4 factor a
#5 factor b
Or if you want to have only one value for every column, we could do
map_df(df, ~tibble(class = class(.), value = toString(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1, 2, 3
#2 factor a, b
Same in base R using lapply
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = as.character(unique(x)))))
and
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = toString(unique(x)))))
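If the column names themselves should appear in the result (they are not kept as a column in the variants above), purrr's .id argument can carry them along; a small sketch on the same df:
library(dplyr)
library(purrr)
map_df(df, ~tibble(class = class(.x), value = toString(unique(.x))),
       .id = "col_name")
# The .id column holds v1/v2; class shows numeric and factor (or character,
# depending on the R version's stringsAsFactors default).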
To address the OP's comment asking about enframe and unnest, I set up a benchmark.
set.seed(123)
df <- data.frame(v1 = sample(1:100000,10000000, replace = TRUE),
v2 = sample(c(letters,LETTERS),10000000, replace = TRUE))
library(tidyverse)
map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
#> # A tibble: 2 x 2
#> col_name col_unique
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j~
map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest()
#> # A tibble: 2 x 2
#> name value
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642,59~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j,c,~
microbenchmark::microbenchmark(
bind_gather = map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique) ,
frame_unnest = map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest() ,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> bind_gather 581.6403 594.6479 615.0841 612.9336 618.3057 697.6204 10
#> frame_unnest 568.6620 590.0003 604.2774 606.5676 624.8159 630.2372 10
It seems that enframe %>% unnest is slightly faster than using bind_rows %>% gather().
Does this work for you?
data.table::rbindlist(list(map(df, class), map(df, function(x) list(unique(x)))))
I am fairly new to R. I wrote the function below, which tries to summarise a dataframe based on a feature variable (passed to the function as 'variable') and a target variable (passed to the function as target_var). I also pass it a value (target_val) on which to filter.
The function falls over on the filter line (filter(target_var == target_val)). I think it has something to do with quo, quosure etc., but I can't figure out how to fix it. The following code should be ready to run - if you exclude the filter line it works; if you include the filter line it falls over.
library(dplyr)
target <- c('good', 'good', 'bad', 'good', 'good', 'bad')
var_1 <- c('debit_order', 'other', 'other', 'debit_order','debit_order','debit_order')
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
  df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
  variable_string <- deparse(substitute(variable))
  target_string <- deparse(substitute(target_var))
  temp_df1 <- dataframe %>%
    group_by_(variable_string, target_string) %>%
    summarise(cnt = n()) %>%
    group_by_(variable_string) %>%
    mutate(total = sum(cnt)) %>%
    mutate(rate = cnt / total) %>%
    filter(target_var == target_val)
  assign(df_name, temp_df1, envir = .GlobalEnv)
}
odds_by_var(dset, var_1, target, 'bad')
So I assume you want to filter by target being 'good' or 'bad'.
In my understanding, it is better to filter() before you group_by(), as you may otherwise lose the variables you want to filter on. I restructured your function a little:
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
  df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
  variable_string <- deparse(substitute(variable))
  target_string <- deparse(substitute(target_var))
  temp_df1 <- dataframe %>%
    group_by_(variable_string, target_string) %>%
    summarise(cnt = n()) %>%
    mutate(total = sum(cnt),
           rate = cnt / total)
  names(temp_df1) <- c(variable_string, "target", "cnt", "total", "rate")
  temp_df1 <- temp_df1[temp_df1$target == target_val, ]
  assign(df_name, temp_df1, envir = .GlobalEnv)
}
odds_by_var(dset, var_1, target, "bad")
result:
> odds_var_1
# A tibble: 2 x 5
# Groups: var_1 [2]
var_1 target cnt total rate
<chr> <chr> <int> <int> <dbl>
1 debit_order bad 1 4 0.25
2 other bad 1 2 0.5
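For completeness, the quosure route the question was aiming at also works with the curly-curly operator {{ }} (rlang >= 0.4.0), which avoids deparse()/substitute() and the deprecated group_by_(). A sketch (odds_by_var2 is just an illustrative name; it returns the result instead of assigning into the global environment):
library(dplyr)
odds_by_var2 <- function(dataframe, variable, target_var, target_val) {
  dataframe %>%
    count({{ variable }}, {{ target_var }}, name = "cnt") %>%   # counts per combination
    group_by({{ variable }}) %>%
    mutate(total = sum(cnt),
           rate  = cnt / total) %>%
    ungroup() %>%
    filter({{ target_var }} == target_val)   # the line that failed in the original
}
odds_by_var2(dset, var_1, target, "bad")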
I have a dataset which looks like this
df1 <- data.frame(
  age = rep(c("40-44", "45-49", "50-54", "55-59", "60-64"), 4),
  dep = rep(c("Dep1", "Dep2", "Dep3", "Dep4", "Dep5"), 4),
  ethnic_1 = c(rep("M", 4), rep("NM", 7), rep("P", 3), rep("A", 6)),
  ethnic_2 = c(rep("M", 8), rep("NM", 6), rep("P", 2), rep("A", 4)),
  gender = c(rep("M", 10), rep("F", 10))
)
What I want to do is compare the two ethnicity classifications in this dataframe, by creating and running the following function:
Comp_fun <- function(data, var1, ...) {
  group_var <- quos(...)
  var_quo <- enquo(var1)
  df <- data %>%
    group_by(!!!group_var) %>%
    summarise(n = n()) %>%
    spread(key = !!!var_quo, value = count)
  return(df)
}
eth_comp <- Comp_fun(df1, ethnic_1, ethnic_1, ethnic_2)
When I run this code, I get the following error message: Error: Invalid column specification.
What I want as output is a 4 x 4 table with the ethnic_1 categories along the horizontal and the ethnic_2 categories along the vertical, showing the counts where the two classifications match and where they don't.
I think I'm using quo/enquo incorrectly. Can anyone tell me where I'm going wrong?
There is no 'count' variable in the summarised data; it should be 'n'. Also, 'var_quo' is a single quosure, not a list of quosures, so it should be unquoted with !! rather than !!!:
library(tidyr)   # for spread()

Comp_fun <- function(data, var1, ...) {
  group_var <- quos(...)
  var_quo <- enquo(var1)
  data %>%
    group_by(!!!group_var) %>%
    summarise(n = n()) %>%
    spread(key = !!var_quo, value = n)
}
eth_comp <- Comp_fun(df1, ethnic_1, ethnic_1, ethnic_2)
eth_comp
# A tibble: 4 x 5
# ethnic_2 A M NM P
# <fct> <int> <int> <int> <int>
#1 A 4 NA NA NA
#2 M NA 4 4 NA
#3 NM NA NA 3 3
#4 P 2 NA NA NA
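As a side note, spread() has been superseded by pivot_wider() in tidyr >= 1.0.0; a sketch of the same comparison using count() plus pivot_wider() (Comp_fun2 is just an illustrative name):
library(dplyr)
library(tidyr)
Comp_fun2 <- function(data, var1, ...) {
  data %>%
    count(...) %>%                                          # counts per combination of the ... columns
    pivot_wider(names_from = {{ var1 }}, values_from = n)   # ethnic_1 categories become columns
}
Comp_fun2(df1, ethnic_1, ethnic_1, ethnic_2)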