dplyr mutate using standard evaluation [duplicate] - R

Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
    ## Summarise
    results <- summarise_(df,
                          n = ~n(),
                          mean = mean(~to.sum, na.rm = TRUE))
    return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
              rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, and I can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
                         group.by = 'group',
                         to.sum = c('eg1', 'eg2'),
                         ...){
    results <- list()
    ## Select columns
    df <- dplyr::select_(df, .dots = c(group.by, to.sum))
    ## Summarise overall
    results$all <- summarise_each(df,
                                  funs(n = ~n(),
                                       mean = mean(~to.sum, na.rm = TRUE)))
    ## Summarise by specified group
    results$by.group <- group_by_(df, ~to.group) %>%
        summarise_each(df,
                       funs(n = ~n(),
                            mean = mean(~to.sum, na.rm = TRUE)))
    return(results)
}
...but before I move on to this more complex version (for which I was using this example as guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block; the call to dplyr::select_() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance

The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
    ## Summarise
    results <- summarise_(df,
                          n = ~n(),
                          mean = lazyeval::interp(~mean(x, na.rm = TRUE),
                                                  x = as.name(to.sum)))
    return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would typically start by writing interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives us ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as taught in vignette("nse").
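For example, here is what that check looks like with the dummy data above (output shown approximately as comments):
## Check what interp builds before handing it to summarise_()
lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name('eg1'))
# ~mean(eg1, na.rm = TRUE)
## With that call in place, the corrected function returns the mean as expected
my_summarise(df = temp, to.sum = 'eg1')
#     n     mean
# 1 100 1.881721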

Related

apply a function across columns in R

Let's say I have a dataframe (df) in R:
df <- data.frame(x = rnorm(5, mean = 5), u = rnorm(5, mean = 5), y = rnorm(5, mean = 5), z = rnorm(5, mean = 5))
print(df)
I want to get the mean absolute difference (MAD) between the first column (x) and the other columns.
With this function, I can find the MAD between the first column and another (the second for example):
mad <- function(dat){
    abs(mean(dat[,1] - dat[,2], na.rm = TRUE))
}
mad(dat = df)
But I want to generalize the function to apply across all of the columns. Changing the function to something like this:
mad <- function(dat) {
    abs(mean(dat[,1] - dat[,2:4], na.rm = TRUE))
}
mad(dat = df)
does not work and returns this error: "argument is not numeric or logical: returning NA"
I was thinking of using apply() across the dataframe, as that seems to be the general advice that I've found on here. But I don't understand how to keep the first column constant and subtract the other columns from the first.
We can create the function with two arguments
mad <- function(x, y) abs(mean(x - y, na.rm = TRUE))
and use sapply/lapply to loop over the columns other than the first, applying the mad function to the first column of the data and each looped column:
sapply(df[-1], function(x) mad(df[,1], x))
# u y z
#0.003399429 0.991685267 0.710553411
Here is another option without defining mad function:
sapply(abs(df[-1] - df[["x"]]), mean, na.rm = TRUE)
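If you prefer to avoid the explicit loop entirely, the same result should also be obtainable in base R with colMeans; this is a small variant not in the original answer, so treat it as a sketch:
## Subtract x from every other column, take absolute values, then column means
colMeans(abs(df[-1] - df$x), na.rm = TRUE)
# returns one mean absolute difference per remaining column (u, y, z)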

How can I incorporate NA removal into aggregate based on a custom function?

This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))
df <- data.frame(site = c('N','N','N','S','S','S'),
                 birds = c(NA,4,2,9,3,1),
                 worms = c(2,1,2,4,0,5))
means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (using the rest of the values to take the mean and ignoring the missing value). How can I exclude NAs in that same manner when using my custom se function?
The problem is that you do not have an explicit argument for na.rm in your se function. If you add that to your function, it should work:
se <- function(x, na.rm = TRUE) {
    sd(x, na.rm = na.rm)/sqrt(sum(!is.na(x)))
}
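With na.rm handled inside se, the aggregate call for the standard errors should now work, with or without passing na.rm = T; for example (values shown approximately):
error <- aggregate(df[,2:3], list(site = df$site), FUN = se)
error
#   site    birds     worms
# 1    N 1.000000 0.3333333
# 2    S 2.403701 1.5275252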

R validate package. Create indicator object for arbitrary column

I'm new to R and I'm coming up against a problem that makes me very puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data is as expected.
Testing it out on some toy data, I found that while the following code worked as expected:
library(validate)
I <- indicator(
    cnt_misng = number_missing(x)
    , sum = sum(x, na.rm = TRUE)
    , min = min(x, na.rm = TRUE)
    , mean = mean(x, na.rm = TRUE)
    , max = max(x, na.rm = TRUE)
)
dat <- data.frame(x=1:4, y=c(NA,11,7,8), z=c(NA,2,0,NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
    print(x)
    I <- indicator(
        cnt_misng = number_missing(eval(x))
        , max = max(eval(x), na.rm = TRUE)
    )
    C <- confront(df, I)
    return(C)
}
df <- data.frame(A=1:4, B=c(NA,11,7,8), C=c(NA,2,0,NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and have an identically formatted report for each one in the list. At this point, I'll probably give up on this package and find another way to more directly do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?
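For what it's worth, one possible workaround is to build the indicator expressions as text for the requested column and evaluate the resulting call, which sidesteps the expression capture inside indicator(). This is only a sketch of the idea (the reworked check_values below is hypothetical and not tested against every validate version):
library(validate)
check_values <- function(data, col){
    ## Build the indicator call as text for the requested column, then evaluate it
    code <- sprintf(
        "indicator(cnt_misng = number_missing(%s), max = max(%s, na.rm = TRUE))",
        col, col)
    I <- eval(parse(text = code))
    confront(data, I)   # confront the data passed in, not the global df
}
df <- data.frame(A = 1:4, B = c(NA,11,7,8), C = c(NA,2,0,NA))
values(check_values(df, 'B'))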

Avoiding intermediate dlply step when starting with a dataframe and ending with a dataframe

I am using plyr to perform a bootstrapping function on subsets of a dataset.
Because the boot function creates a list object, I am currently using dlply to store the output of the function, then ldply to pull out just the parts of the boot output that I want.
My example dataset is as follows:
dat = data.frame(x = rnorm(10, sd = 1),
                 y = rnorm(10, sd = 1),
                 z = rep(c("SppA", "SppB", "SppC", "SppD", "SppE"), 2),
                 u = rep(c("SiteA", "SiteB"), 5))
The exact function isn't terribly important, but for the sake of reproducibility, here is the function I'm using:
boot_fun = function(x, i) {
    i = sample(1:24, 24, replace = TRUE)
    ts1 = mean(x[i,"x"])
    ts2 = sample(x[i,"y"])
    mean(ts1) - mean(ts2)
}
My plyr function is the following:
temp = dlply(dat, c("z", "u"), summarise, boot_object = boot(dat, boot_fun, R = 1000))
Since what I want out of the boot object is the mean and CI, I then perform the following plyr function:
temp2 = ldply(temp, summarise, mean = mean(boot$t), lowCI = quantile(boot$t, 0.025), highCI = quantile(boot$t, 0.975))
This works and accomplishes exactly what I want it to (although with an error about subsetting which doesn't seem to affect anything I care about), but I feel like there should be a way to skip the intermediate dlply step.
Edit: to clarify what I'm trying to do, here is what it would look like if I weren't using plyr to split the groups.
If I was manually splitting instead of using plyr, it would look something like the following:
temp = boot(dat[dat$z == "SppA" & dat$u == "SiteA",], boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
If I didn't care about the groups at all and just wanted to do this to the whole group it would be something like
temp = boot(dat, boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
Your example is not reproducible by me.
When I do temp = boot(dat, boot_fun, R = 1000), I get a WARNING:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = dat, statistic = boot_fun, R = 1000)
Bootstrap Statistics :
WARNING: All values of t1* are NA
(This is likely because boot_fun overwrites i with sample(1:24, ...) while dat only has 10 rows, so the resampled rows contain NAs and every statistic comes out NA.)
I think your current code is pretty efficient, but if you're looking for other possibilities, you could use the tidyverse to 1) group_by the relevant columns, 2) nest the relevant data for bootstrapping, 3) run your bootstrap on the nested data, 4) isolate the statistics you want, and then 5) return to a normal data frame:
library(boot)
library(tidyverse)
dat1 <- dat %>%
    group_by(z, u) %>%
    nest() %>%
    mutate(data = map(data, ~boot(.x, boot_fun, R = 1000))) %>%
    mutate(data = map(data, ~data.frame(mean = mean(.x$t),
                                        lowCI = quantile(.x$t, 0.025),
                                        highCI = quantile(.x$t, 0.975)))) %>%
    unnest(data)
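Another hedged option, if your dplyr is recent enough to have group_modify() (0.8.1 or later), is to skip the nest/unnest round trip entirely. This is only a sketch reusing the same boot_fun as above, with dat2 as a made-up name:
library(boot)
library(dplyr)
dat2 <- dat %>%
    group_by(z, u) %>%
    group_modify(~ {
        ## .x is the data for one z/u group, without the grouping columns
        b <- boot(.x, boot_fun, R = 1000)
        data.frame(mean = mean(b$t),
                   lowCI = quantile(b$t, 0.025, names = FALSE),
                   highCI = quantile(b$t, 0.975, names = FALSE))
    }) %>%
    ungroup()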

Using dplyr within a function, non-standard evaluation


Resources