I've been struggling with an issue that is quite similar to one raised in an earlier question, but somehow I can't translate the solution given there to my own problem.
I start off with making an example data frame:
test.df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))
str(test.df)
The following function should create a new data frame with the mean of a "statvar" based on groups of a "groupvar".
test.f <- function(df, groupvar, statvar) {
  df %>%
    group_by_(groupvar) %>%
    select_(statvar) %>%
    summarise_(
      avg = ~mean(statvar, na.rm = TRUE)
    )
}
test.f(df = test.df,
       groupvar = "col1",
       statvar = "col2")
What I would like this to return is a data frame with 2 calculated averages (one for all a values in col1 and one for all b values in col1). Instead I get this:
col1 avg
1 a NA
2 b NA
Warning messages:
1: In mean.default("col2", na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default("col2", na.rm = TRUE) :
argument is not numeric or logical: returning NA
I find this strange because I'm pretty sure col2 is numeric:
str(test.df)
'data.frame': 10 obs. of 2 variables:
$ col1: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ col2: num 0.4269 0.1928 0.7766 0.0865 0.1798 ...
library(lazyeval)
library(dplyr)
test.f <- function(df, groupvar, statvar) {
  df %>%   # use the function argument, not the global test.df
    group_by_(groupvar) %>%
    select_(statvar) %>%
    summarise_(
      avg = (~mean(statvar, na.rm = TRUE)) %>%
        interp(statvar = as.name(statvar))
    )
}
test.f(df = test.df,
       groupvar = "col1",
       statvar = "col2")
Your issue is that the string "col2" is substituted for statvar as-is, so the call becomes mean("col2", na.rm = TRUE), which is undefined for a character value. interp() substitutes the column name (as a symbol) instead.
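The warning makes the substitution visible: the literal string is what reaches mean(). You can reproduce it directly:
mean("col2", na.rm = TRUE)
# [1] NA
# Warning message:
# In mean.default("col2", na.rm = TRUE) :
#  argument is not numeric or logical: returning NA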
With the soon-to-be-released dplyr 0.6.0, new functionality can help. The new function UQ() (whose operator form is !!) unquotes what has been quoted. You are passing statvar as a string like "col2". dplyr has underscore variants, such as group_by_ and select_, that evaluate such strings directly, but for summarise_ the manipulation of the string can get ugly, as in the answer above. We can now use the regular summarise() and simply unquote the quoted variable name. For more on what 'unquoting the quoted' means, see the dplyr programming vignette. For now, this is available in the development version.
library(dplyr)
test.df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))
test.f <- function(df, groupvar, statvar) {
  q_statvar <- as.name(statvar)
  df %>%
    group_by_(groupvar) %>%
    select_(statvar) %>%
    summarise(
      avg = mean(!!q_statvar, na.rm = TRUE)
    )
}
test.f(df = test.df,
       groupvar = "col1",
       statvar = "col2")
# # A tibble: 2 × 2
# col1 avg
# <fctr> <dbl>
# 1 a 0.6473072
# 2 b 0.4282954
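For readers on a current dplyr release, where group_by_() and select_() are deprecated, the same function can be written with the .data pronoun and without the select step; a minimal sketch of my own, assuming dplyr >= 1.0:
test.f <- function(df, groupvar, statvar) {
  df %>%
    group_by(.data[[groupvar]]) %>%                        # look the string up as a column
    summarise(avg = mean(.data[[statvar]], na.rm = TRUE))
}
test.f(df = test.df, groupvar = "col1", statvar = "col2")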
Related
I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate the mean/SE per Time for each of var1, var2, ..., varN, and I want to do this programmatically for all variables rather than one at a time, which would involve a lot of copy-pasting.
Section 8.2.3 of https://tidyeval.tidyverse.org/dplyr.html is close to what I want, but my code below:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] = 4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
  print(.data)
  summary_vars <- enquos(...)
  print(summary_vars)
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  print(summary_vars)
  .data %>%
    group_by(time)
  summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
The original cause is the error "Must group by variables found in .data.": it looks for a column that isn't in the dummy "x" I generated for testing purposes. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
  vars <- c(...)
  .data %>%
    group_by(time) %>%
    summarise(across(all_of(vars), mean, .names = "{.col}mean"))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
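Since the question also asks for the SE, across() extends naturally to several summary functions at once via a named list; a sketch (the standard-error lambda here is my own, not part of the answer above):
grouped_mean_se <- function(.data, ...) {
  vars <- c(...)
  .data %>%
    group_by(time) %>%
    summarise(across(all_of(vars),
                     list(mean = ~mean(.x, na.rm = TRUE),
                          se   = ~sd(.x, na.rm = TRUE) / sqrt(sum(!is.na(.x)))),
                     .names = "{.col}{.fn}"))
}
grouped_mean_se(x, 'var1', 'var2')
# columns: time, var1mean, var1se, var2mean, var2se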
Perhaps this is what you are looking for?
x %>%
  group_by(time) %>%
  summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
  rename_at(vars(starts_with('var')), funs(paste0(., "mean"))) %>%
  merge(x)
With the data from your question, the following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
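On current dplyr releases, where summarise_at(), rename_at(), and funs() are deprecated or superseded, a rough equivalent of the pipeline above (my own translation, not the answerer's code):
library(dplyr)
x %>%
  group_by(time) %>%
  summarise(across(starts_with("var"), ~mean(.x, na.rm = TRUE),
                   .names = "{.col}mean")) %>%
  left_join(x, by = "time")   # reattach the original columns, like merge(x)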
I have a dataframe that looks like this
df = data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5), race = rep(1:2, 5))
I'm trying to write a function that takes a dataframe as its first argument, together with any number of arguments that represent column names in that dataframe, and uses those column names to perform operations on the dataframe. My function looks like this:
library(dplyr)
myFunction <- function(df, ...){
  columns <- list(...)
  for (i in 1:length(columns)){
    var <- enquo(columns[[i]])
    df <- df %>% group_by(!!var)
  }
  df2 <- summarise(df, mean = mean(wt))
  return(df2)
}
I call the function as follows:
myFunction(df, race, gender)
However, I get the following error message:
Error in myFunction(df, race, gender) : object 'race' not found
We can convert the elements of ... to quosures with quos() and then evaluate them with !!!:
myFunction <- function(dat, ...){
  columns <- quos(...)          # convert the ... arguments to quosures
  dat %>%
    group_by(!!! columns) %>%   # unquote-splice and evaluate
    summarise(mean = mean(wt))
}
myFunction(df, race, gender)
# A tibble: 2 x 3
# Groups: race [?]
# race gender mean
# <int> <int> <dbl>
#1 1 1 75
#2 2 2 76
myFunction(df, race)
# A tibble: 2 x 2
# race mean
# <int> <dbl>
#1 1 75
#2 2 76
NOTE: In the OP's example, 'race' and 'gender' are identical. If we change one of them, we will see the difference:
df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
race = rep(1:2, each = 5))
myFunction(df, race, gender)
myFunction(df, race)
myFunction(df, gender)
If we decide to pass the arguments as quoted strings, then we can make use of group_by_at
myFunction2 <- function(df, ...) {
  columns <- c(...)
  df %>%
    group_by_at(columns) %>%
    summarise(mean = mean(wt))
}
myFunction2(df, "race", "gender")
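group_by_at() has since been superseded; on dplyr >= 1.0 the same string-based interface is usually written with across(all_of(...)). A sketch of my own:
myFunction3 <- function(df, ...) {
  columns <- c(...)
  df %>%
    group_by(across(all_of(columns))) %>%   # strings -> grouping columns
    summarise(mean = mean(wt), .groups = "drop")
}
myFunction3(df, "race", "gender")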
I have a dataset which looks like this
df1 <- data.frame(
  age = rep(c("40-44", "45-49", "50-54", "55-59", "60-64"), 4),
  dep = rep(c("Dep1", "Dep2", "Dep3", "Dep4", "Dep5"), 4),
  ethnic_1 = c(rep("M", 4), rep("NM", 7), rep("P", 3), rep("A", 6)),
  ethnic_2 = c(rep("M", 8), rep("NM", 6), rep("P", 2), rep("A", 4)),
  gender = c(rep("M", 10), rep("F", 10))
)
What I want to do is get a comparison of the two ethnicity classifications in this dataframe, by creating and running the following function:
Comp_fun <- function(data, var1, ...) {
  group_var <- quos(...)
  var_quo <- enquo(var1)
  df <- data %>%
    group_by(!!! group_var) %>%
    summarise(n = n()) %>%
    spread(key = !!! var_quo, value = count)
  return(df)
}
eth_comp <- Comp_fun(df1, ethnic_1, ethnic_1, ethnic_2)
When I run this code, I get the following error message: Error: Invalid column specification.
What I want as output from this is a 4 x 4 table, showing the count of ethnic 1 along the horizontal, and the count of ethnic 2 along the vertical, and showing the numbers where they match, and where they don't.
I think I'm using quo/enquo incorrectly. Can anyone tell me where I'm going wrong?
There is no 'count' variable; it should be 'n'. Also, 'var_quo' is a single quosure, not a list of quosures, so it should be evaluated with !! rather than !!!:
library(dplyr)
library(tidyr)

Comp_fun <- function(data, var1, ...) {
  group_var <- quos(...)
  var_quo <- enquo(var1)
  data %>%
    group_by(!!! group_var) %>%
    summarise(n = n()) %>%
    spread(key = !! var_quo, value = n)
}
eth_comp <- Comp_fun(df1, ethnic_1, ethnic_1, ethnic_2)
eth_comp
# A tibble: 4 x 5
# ethnic_2 A M NM P
# <fct> <int> <int> <int> <int>
#1 A 4 NA NA NA
#2 M NA 4 4 NA
#3 NM NA NA 3 3
#4 P 2 NA NA NA
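spread() has since been superseded by tidyr::pivot_wider(). A sketch of the same idea on current tidyr/dplyr, with the signature simplified to two bare column names via {{ }} (my own translation, not the answerer's code):
library(dplyr)
library(tidyr)

Comp_fun2 <- function(data, var1, var2) {
  data %>%
    count({{ var1 }}, {{ var2 }}) %>%    # shorthand for group_by() + summarise(n = n())
    pivot_wider(names_from = {{ var1 }}, values_from = n)
}
Comp_fun2(df1, ethnic_1, ethnic_2)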
I am having trouble using several dplyr functions in one function, despite using the function variants.
Example
library(dplyr)
# Data:
mydf <- data.frame(
  var1 = factor(rep(1:24, each = 100)),
  var2 = runif(2400, min = -10, max = 125),
  var3 = runif(2400, min = 0, max = 2500),
  var4 = runif(2400, min = -10, max = 25)
)
# The function I want to build:
fx.average <- function(df, varlist) {
  # select some variables from a data frame
  df <- df %>% dplyr::select_(.dots = varlist)
  # group by a variable and then just calculate the mean
  df <- df %>% dplyr::group_by_(var1) %>% # added df here
    dplyr::summarise_each_(funs_(mean(., na.rm = TRUE)))
}
Now I am going to test the function with the following:
# Test function, Setup var-list
varlist0 <- c("var1", "var2", "var3")
fx.average(mydf, varlist0)
# Error in dplyr::group_by_(var1) : object 'var1' not found
# object 'var1' not found
# Manual example
mydf %>%
  dplyr::select(var1, var2, var3) %>%
  group_by(var1) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))
I'm not sure what goes wrong. From other questions it seems this should be solved by adding an underscore to the functions, since they are built for use inside other functions?
In the OP's code there are a few problems: var1 is passed to group_by_ as a bare name (NSE) rather than as a quoted string, and funs_ is used with summarise_each_ where summarise_each and funs work:
fx.average <- function(df, varlist) {
  df %>%
    dplyr::select_(.dots = varlist) %>%
    dplyr::group_by_(.dots = "var1") %>%
    dplyr::summarise_each(funs(mean(., na.rm = TRUE)))
}
fx.average(mydf, varlist0)
# A tibble: 24 × 3
# var1 var2 var3
# <fctr> <dbl> <dbl>
#1 1 55.13601 1141.021
#2 2 59.16508 1155.226
#3 3 59.64524 1245.043
#4 4 60.12310 1284.808
#5 5 57.65874 1221.771
#6 6 58.86611 1266.026
#7 7 66.13987 1303.927
#8 8 54.21595 1303.638
#9 9 63.84230 1280.380
#10 10 49.15238 1236.456
# ... with 14 more rows
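summarise_each() and the underscore variants have since been deprecated; a sketch of the same function on current dplyr (>= 1.0), my own translation:
fx.average2 <- function(df, varlist, groupvar = "var1") {
  df %>%
    dplyr::select(dplyr::all_of(varlist)) %>%
    dplyr::group_by(dplyr::across(dplyr::all_of(groupvar))) %>%
    dplyr::summarise(dplyr::across(dplyr::everything(),
                                   ~mean(.x, na.rm = TRUE)),
                     .groups = "drop")
}
fx.average2(mydf, varlist0)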
Is it possible to filter a data.frame for complete cases using dplyr? complete.cases() with a list of all variables works, of course. But that is (a) verbose when there are many variables and (b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5)
)
df %.%
  filter(complete.cases(x1, x2))
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that, of the solutions above, na.omit is much slower; but that has to be balanced against the fact that it returns the row indices of the omitted rows in the na.action attribute, whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
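If you need those indices later, they can be pulled back out with stats::na.action(); a minimal illustration:
res <- df %>% na.omit
na.action(res)   # named integer vector of the dropped row indices
# 3 4
# 3 4
# attr(,"class")
# [1] "omit"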
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.
This works for me:
df %>%
  filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
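For instance (a minimal sketch): because the dot refers to the data at that point in the chain, the filter still sees columns created upstream:
df %>%
  mutate(x3 = x1 * 2) %>%       # modify the data first
  filter(complete.cases(.))     # . is the mutated data frame, not the original df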
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
  na.omit = {df %>% na.omit},
  filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
  rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
  filter = {df %>% filter(complete.cases(.))},
  times = 20L,
  unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
  x1 = sample(1:10, n, replace = TRUE),
  x2 = sample(1:10, n, replace = TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
  df %>% filter(complete.cases(x1, x2)),
  df %>% na.omit(),
  df %>% (function(x) filter(x, complete.cases(x)))(),
  replications = 50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217
This is a short function which lets you specify columns (basically anything dplyr::select can understand) that should not have any NA values (modeled after pandas df.dropna()):
drop_na <- function(data, ...){
  if (missing(...)){
    f <- complete.cases(data)
  } else {
    f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
  }
  filter(data, f)
}
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
Try this:
df[complete.cases(df), ]   # output to console
Or even this:
df.complete <- df[complete.cases(df), ]   # assign to a new data.frame
The above commands check completeness across all the columns (variables) in your data.frame.
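If completeness should only be judged on some of the columns, the same base R idiom works on a column subset (a minimal sketch):
df[complete.cases(df[, c("x1", "x2")]), ]   # rows where x1 and x2 are non-missing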
Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, just by using magrittr::extract (an alias of [):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
The additional bonus is speed: this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt's microbenchmarks).
dplyr >= 1.0.4
if_any() and if_all() are available in newer versions of dplyr to apply across()-like syntax inside filter(). This can be useful if you have other variables in your dataframe that are not part of what you consider a complete case. For example, to keep only the rows that are non-missing in the columns starting with "x":
library(dplyr)
df = data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  y = c(NA, "A", "B", "C")
)

df %>%
  dplyr::filter(if_all(starts_with("x"), ~ !is.na(.)))
x1 x2 y
1 1 1 <NA>
2 2 2 A
For more information on these functions see this link.
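if_any() is the complement. For instance, to inspect the rows that would be dropped (those with at least one NA among the "x" columns), a minimal sketch:
df %>%
  dplyr::filter(if_any(starts_with("x"), is.na))
#   x1 x2 y
# 1  3 NA B
# 2 NA  5 C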