Using dplyr within a function, non-standard evaluation - r

I'm trying to get my head around non-standard evaluation (NSE) as used by dplyr, but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, and I can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group',
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move on to this more complex version (for which I was using this example for guidance), I need to get the evaluation working in the simple version, as that's the stumbling block. The call to dplyr::select() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance

The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
1. Remember that, just like the ~n() you already have, the call will have to start with a ~.
2. Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
3. Use lazyeval::interp to recreate that call, and check this by running only the interp to see what it is doing.
In this case my first attempt would probably be interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as described in vignette("nse").
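The check described above can be sketched as follows. This assumes the lazyeval package is installed; to.sum holds the same column name as in the question, and f_string and f_symbol are illustrative names only.

```r
library(lazyeval)

to.sum <- "eg1"

# Passing the string directly: the column name ends up quoted inside the call
f_string <- interp(~mean(x, na.rm = TRUE), x = to.sum)

# Converting to a symbol first: the call now refers to the variable eg1
f_symbol <- interp(~mean(x, na.rm = TRUE), x = as.name(to.sum))

print(f_string)
print(f_symbol)
```

Printing both formulas side by side makes the difference visible before either is handed to summarise_.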

Related

How can I incorporate NA removal into aggregate based on a custom function?

This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))
df <- data.frame(site = c('N','N','N','S','S','S'),
birds = c(NA,4,2,9,3,1),
worms = c(2,1,2,4,0,5))
means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (using the rest of the values to take the mean and ignoring the missing value). How can I exclude NAs in that same manner when using my custom se function?
The problem is that your se function has no na.rm argument, so the na.rm = T that aggregate passes along to FUN is rejected as an unused argument. If you add that argument to your function, it should work:
se <- function(x, na.rm = TRUE) {
sd(x, na.rm = na.rm)/sqrt(sum(!is.na(x)))
}
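A quick check of the revised function on the question's data frame, as a sketch. Because aggregate forwards extra arguments to FUN, na.rm = TRUE now reaches se.

```r
se <- function(x, na.rm = TRUE) {
  # Divide by the number of non-missing values, not length(x)
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}

df <- data.frame(site  = c("N", "N", "N", "S", "S", "S"),
                 birds = c(NA, 4, 2, 9, 3, 1),
                 worms = c(2, 1, 2, 4, 0, 5))

# na.rm = TRUE is passed through aggregate's ... to se
error <- aggregate(df[, 2:3], by = list(site = df$site), FUN = se, na.rm = TRUE)
print(error)
```

For site N, the birds value is computed from 4 and 2 only: their standard deviation is sqrt(2), and dividing by sqrt(2) gives a standard error of 1.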

R validate package. Create indicator object for arbitrary column

I'm new to R and I'm coming up against a problem that makes me very puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data is as expected.
Testing it out on some toy data, I found that the following code worked as expected:
library(validate)
I <- indicator(
cnt_misng = number_missing(x)
, sum = sum(x, na.rm = TRUE)
, min = min(x, na.rm = TRUE)
, mean = mean(x, na.rm = TRUE)
, max = max(x, na.rm = TRUE)
)
dat <- data.frame(x=1:4, y=c(NA,11,7,8), z=c(NA,2,0,NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
print(x)
I <- indicator(
cnt_misng = number_missing(eval(x))
, max = max(eval(x), na.rm = TRUE)
)
C <- confront(df, I)
return(C)
}
df <- data.frame(A=1:4, B=c(NA,11,7,8), C=c(NA,2,0,NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and have an identically formatted report for each one in the list. At this point, I'll probably give up on this package and find another way to more directly do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?

Avoiding intermediate dlply step when starting with a dataframe and ending with a dataframe

I am using plyr to perform a bootstrapping function on subsets of a dataset.
Because the boot function creates a list object, I am currently using dlply to store the output of the function, then ldply to pull out just the parts of the boot output that I want.
My example dataset is as follows:
dat = data.frame(x = rnorm(10, sd = 1),y = rnorm(10, sd = 1),z = rep(c("SppA", "SppB", "SppC", "SppD", "SppE"), 2),u = rep(c("SiteA", "SiteB"), 5))
The exact function isn't terribly important, but for the sake of reproducibility, here is the function I'm using:
boot_fun = function(x,i) {
i = sample(1:24, 24, replace = TRUE)
ts1 = mean(x[i,"x"])
ts2 = sample(x[i,"y"])
mean(ts1) - mean(ts2)
}
My plyr function is the following:
temp = dlply(dat, c("z", "u"), summarise, boot_object = boot(dat, boot_fun, R = 1000))
Since what I want out of the boot object is the mean and CI, I then perform the following plyr function:
temp2 = ldply(temp, summarise, mean = mean(boot$t), lowCI = quantile(boot$t, 0.025), highCI = quantile(boot$t, 0.975))
This works and accomplishes exactly what I want it to (although with an error about subsetting which doesn't seem to affect anything I care about), but I feel like there should be a way to skip the intermediate dlply step.
-edit- to clarify on what I'm trying to do if I didn't need to be splitting the groups
If I was manually splitting instead of using plyr, it would look something like the following:
temp = boot(dat[dat$z == "SppA" & dat$u == "SiteA",], boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
If I didn't care about the groups at all and just wanted to do this to the whole group it would be something like
temp = boot(dat, boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
I can't reproduce your example.
When I do temp = boot(dat, boot_fun, R = 1000), I get a WARNING:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = dat, statistic = boot_fun, R = 1000)
Bootstrap Statistics :
WARNING: All values of t1* are NA
I think your current code is pretty efficient, but if you're looking for other possibilities, you could try the tidyverse: 1) group_by the relevant columns, 2) nest the relevant data for bootstrapping, 3) run your bootstrap on the nested data, 4) isolate the statistics you want, then 5) return to a normal data frame:
library(boot)
library(tidyverse)
dat1 <- dat %>%
group_by(z,u) %>%
nest() %>%
mutate(data=map(data,~boot(.x, boot_fun, R=1000))) %>%
mutate(data=map(data,~data.frame(mean=mean(.x$t), lowCI=quantile(.x$t, 0.025), highCI=quantile(.x$t,0.975)))) %>%
unnest(data)
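The all-NA warning is consistent with boot_fun drawing its own indices from 1:24 on a 10-row data frame. boot() already supplies a vector of resampled row indices as the second argument of the statistic, so the function can use that directly. Below is a base R sketch of the same split, bootstrap, summarise idea. The data (dat2), the statistic (diff_means), and the choice to group by site only are assumptions made for illustration, not the asker's actual setup.

```r
library(boot)

set.seed(1)
# Hypothetical data: 20 rows, so each of the two sites has 10 rows to resample
dat2 <- data.frame(x = rnorm(20), y = rnorm(20),
                   u = rep(c("SiteA", "SiteB"), 10))

# Statistic that uses the indices boot() supplies instead of drawing its own
diff_means <- function(d, i) mean(d$x[i]) - mean(d$y[i])

# Split by site, bootstrap each piece, and keep only the summaries
boot_summary <- do.call(rbind, lapply(split(dat2, dat2$u), function(d) {
  b <- boot(d, diff_means, R = 1000)
  data.frame(u      = d$u[1],
             mean   = mean(b$t),
             lowCI  = unname(quantile(b$t, 0.025)),
             highCI = unname(quantile(b$t, 0.975)))
}))
print(boot_summary)
```

The result is a plain data frame with one row per site, so no intermediate list of boot objects needs to be stored.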

Am I using NSE and rlang correctly/reasonably?

I've been reading through programming with dplyr and trying to apply the ideas it describes in my work. I have something that works, but it's unclear to me whether I've done it in the "right" way. Is there something more elegant or concise I could be doing?
I have a tibble where rows are scenarios and columns relate to tests that were run in that scenario. There are two types of columns, those that store a test statistic that was computed in that scenario and those that store the degrees of freedom of that test.
So, here's a small, toy example of the type of data I have:
library(tidyverse)
set.seed(27599)
my_tbl <- data_frame(test1_stat = rnorm(12), test1_df = rep(x = c(1, 2, 3), times = 4),
test2_stat = rnorm(12), test2_df = rep(x = c(1, 2, 3, 4), times = 3))
I want to compute a summary of each test that will be based on both its stat and its df. My example here is that I want to compute the median stat for each group, where groups are defined by df. The groupings are not guaranteed to be the same across tests, nor are the number of groups even guaranteed to be the same.
So, here's what I've done:
get_test_median = function(df, test_name) {
stat_col_name <- paste0(test_name, '_stat')
df_col_name <- paste0(test_name, '_df')
median_col_name <- paste0(test_name, '_median')
df %>%
dplyr::group_by(rlang::UQ(rlang::sym(df_col_name))) %>%
dplyr::summarise(rlang::UQ(median_col_name) := median(x = rlang::UQ(rlang::sym(stat_col_name)), na.rm = TRUE))
}
my_tbl %>% get_test_median(test_name = 'test1')
my_tbl %>% get_test_median(test_name = 'test2')
This works. But is it how an experienced rlang user would do it? I am new to NSE, and a bit surprised to be using two nested rlang functions repeatedly (UQ(sym(.))).
I am happy using UQ rather than !!, just because I'm more comfortable with traditional function notation.
Based on the comments, I got rid of the namespace::function notation and now my function doesn't look so verbose:
get_test_median = function(df, test_name) {
stat_col_name <- paste0(test_name, '_stat')
df_col_name <- paste0(test_name, '_df')
median_col_name <- paste0(test_name, '_median')
df %>%
dplyr::group_by(UQ(sym(df_col_name))) %>%
dplyr::summarise(UQ(median_col_name) := median(x = UQ(sym(stat_col_name)), na.rm = TRUE))
}
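For comparison, here is one way the function above might be written without UQ(sym(.)). This is a sketch, assuming a recent dplyr (1.0 or later): across(all_of(...)) selects the grouping column from a string, .data[[...]] looks a column up by name, and a "{...}" string on the left of := names the output column. It is an alternative spelling, not necessarily the canonical one.

```r
library(dplyr)

set.seed(27599)
my_tbl <- tibble(test1_stat = rnorm(12), test1_df = rep(c(1, 2, 3), times = 4),
                 test2_stat = rnorm(12), test2_df = rep(c(1, 2, 3, 4), times = 3))

get_test_median <- function(df, test_name) {
  stat_col_name   <- paste0(test_name, "_stat")
  df_col_name     <- paste0(test_name, "_df")
  median_col_name <- paste0(test_name, "_median")

  df %>%
    # all_of() selects the grouping column by the name held in a string
    group_by(across(all_of(df_col_name))) %>%
    # .data[[...]] looks the column up by string; "{...}" := names the result
    summarise("{median_col_name}" := median(.data[[stat_col_name]], na.rm = TRUE))
}

res1 <- get_test_median(my_tbl, "test1")
res2 <- get_test_median(my_tbl, "test2")
print(res1)
print(res2)
```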

