I've been reading through programming with dplyr and trying to apply the ideas it describes in my work. I have something that works, but it's unclear to me whether I've done it in the "right" way. Is there something more elegant or concise I could be doing?
I have a tibble where rows are scenarios and columns relate to tests that were run in that scenario. There are two types of columns, those that store a test statistic that was computed in that scenario and those that store the degrees of freedom of that test.
So, here's a small, toy example of the type of data I have:
library(tidyverse)
set.seed(27599)
my_tbl <- data_frame(test1_stat = rnorm(12), test1_df = rep(x = c(1, 2, 3), times = 4),
test2_stat = rnorm(12), test2_df = rep(x = c(1, 2, 3, 4), times = 3))
I want to compute a summary of each test that will be based on both its stat and its df. My example here is that I want to compute the median stat for each group, where groups are defined by df. The groupings are not guaranteed to be the same across tests, nor are the number of groups even guaranteed to be the same.
So, here's what I've done:
get_test_median = function(df, test_name) {
stat_col_name <- paste0(test_name, '_stat')
df_col_name <- paste0(test_name, '_df')
median_col_name <- paste0(test_name, '_median')
df %>%
dplyr::group_by(rlang::UQ(rlang::sym(df_col_name))) %>%
dplyr::summarise(rlang::UQ(median_col_name) := median(x = rlang::UQ(rlang::sym(stat_col_name)), na.rm = TRUE))
}
my_tbl %>% get_test_median(test_name = 'test1')
my_tbl %>% get_test_median(test_name = 'test2')
This works. But is it how an experienced rlang user would do it? I am new to NSE, and a bit surprised to be using two nested rlang functions repeatedly (UQ(sym(.))).
I am happy using UQ rather than !!, just because I'm more comfortable with traditional function notation.
Based on the comments, I got rid of the namespace::function notation and now my function doesn't look so verbose:
get_test_median = function(df, test_name) {
stat_col_name <- paste0(test_name, '_stat')
df_col_name <- paste0(test_name, '_df')
median_col_name <- paste0(test_name, '_median')
df %>%
dplyr::group_by(UQ(sym(df_col_name))) %>%
dplyr::summarise(UQ(median_col_name) := median(x = UQ(sym(stat_col_name)), na.rm = TRUE))
}
Related
I have a massive dataframe where I need to create "lagged" variables and compare them with former time points. As this process needs to be variable, I've chosen to write my own functions which create these lagged variables (not included here).
As I use GLM's, I want to use the stepAIC function and before I start writing tenth of "lag01 + lag02..." I wanted to create another function (modelfiller) which creates these strings according to my parameters and then I use string2lang to make them expressions.
This mostly works but there is one issue which I cannot get my head around.
As you can see in the reprex full.model can be created when I only use y~x+lag01+lag02. If I use modelfiller("y", 2, "x", "lag") at location 1 and 3 it also works. But the moment I put modelfiller("y", 2, "x", "lag") at location 2 in the code (within the stepAIC glm) it creates the following error message:
Error: Problem with `mutate()` input `GLM_AIC`.
x object '.x' not found
i Input `GLM_AIC` is `purrr::map(...)`.
i The error occurred in group 1: group = "a".
I have also tried as.formula with & without eval, but it caused the same issue.
group <- c(rep("a", 10), rep("b", 10), rep("c", 10))
order <- c(seq(1:10), seq(1:10), seq(1:10))
x <- c(runif(30))
y <- c(runif(30))
df <- data.frame(group, order, x, y)
df <- df %>%
dplyr::group_by(group) %>%
dplyr::arrange(group, order) %>%
dplyr::mutate(lag01 = dplyr::lag(x, n=1),
lag02 = dplyr::lag(x, n=2)) %>%
tidyr::drop_na()
modelfiller = function(depPar, maxlag, indepPar, str) {
varnames = list()
for (i in seq(1:maxlag)) {
varnames[i] = paste0(str, stringr::str_pad(i, width = 2, pad = "0"))
}
varnames = paste0(varnames, collapse="+")
varnames = paste(indepPar, varnames, sep = "+")
return(paste(depPar, varnames, sep = "~"))
}
full.model <- df %>%
tidyr::nest(- group) %>%
dplyr::mutate(
# Perform GLM calculation on each group and then a step-wise model selection based on AIC
GLM = purrr::map(
data, ~ lm(data = .x,
# Location 1 - Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)),
GLM_AIC = purrr::map(
data, ~ MASS::stepAIC(glm(data = .x,
# Location 2 - NOT Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)
,direction = "both", trace = FALSE, k = 2,
scope = list(
lower = lm(data = .x,
y ~ 1),
upper = glm(data = .x,
# Location 3 - Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)
)))
)
The issue is that glm stores the name of the variable used to reference the data, and stepAIC then attempts to retrieve this name and evaluate it to access the data, but gets confused about which environment the variable was defined in. To demonstrate, I'm going to simplify your code to
mdl <- str2lang(modelfiller("y", 2, "x", "lag")) # This is your y~x+lag01+lag02
dfn <- df %>% tidyr::nest( data = c(-group) ) # First step of your %>% chain
glms <- purrr::map( dfn$data, ~glm(data = .x, mdl) ) # Construct the models
# Examine glms to observe that
# Call: glm(formula = mdl, data = .x) <--- glm() remembers that the data is in .x
# but stepAIC is not properly aware of where .x
# is defined and behaves effectively as
MASS::stepAIC( glms[[1]] ) # Error: object '.x' not found
Option 1
One workaround is to manually construct the expression that contains the data and then evaluate it:
glm2 <- function(.df, ...) {
eval(rlang::expr(glm(!!rlang::enexpr(.df),!!!list(...)))) }
glms2 <- purrr::map( dfn$data, ~glm2(data = .x, mdl) ) # Same as above, but with glm2
MASS::stepAIC( glms2[[1]] ) # Now works
Changing glm to glm2 in your problematic spot makes your code work too. The down side is that the Call: then remembers the entire data frame, which can be problematic if they are very large.
Option 2
Another alternative is to replace the purrr call with a for loop, which helps maintain the calling frames assumed by stepAIC, thus guiding it to where the data is defined
# This fails with Error: object '.x' not found
purrr::map( dfn$data, ~MASS::stepAIC(glm(data=.x, mdl), direction="both") )
# This works
for( mydata in dfn$data )
MASS::stepAIC(glm(data=mydata, mdl), direction="both")
The advantage here is not needing to store the entire data frame inside the call. The disadvantage is that you effectively lose access to what purrr does to streamline the code.
I wrote some code to performed oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are not exactly the same anymore. I'm quite happy that it works now as intended, but...it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for 100s of data.frames which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Than I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount is added (here +- 1 %). Later on, I combine this data set with the original data set and have my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
rownames_to_column(var = "rowname") %>%
filter(grepl("\\.", row.names(train_oversampled))) %>%
rowwise() %>%
mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
ungroup() %>%
column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
setDT(current_table, keep.rownames="Sample")
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(train_set)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to .SDcols argument and maybe a shorter way (e.g. (.SD) or names(.SD)) to pass to the left hand side of :=
address OP's updated post:
The issue is that although each data.frame within the list is converted to a data.table, the train_list is not updated. You can update the list with a left bind before the for loop:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, train_list), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
I am using plyr to perform a bootstrapping function on subsets of a dataset.
Because the boot function creates a list object, I am currently using dlply to store the output of the function, then a ddply to get just the parts of the bootfunction that I want out
My example dataset is as follows:
dat = data.frame(x = rnorm(10, sd = 1),y = rnorm(10, sd = 1),z = rep(c("SppA", "SppB", "SppC", "SppD", "SppE"), 2),u = rep(c("SiteA", "SiteB"), 5))
the exact function isn't terribly important, but for the sake of reproducibility, here is the function I'm using:
boot_fun = function(x,i) {
i = sample(1:24, 24, replace = TRUE)
ts1 = mean(x[i,"x"])
ts2 = sample(x[i,"y"])
mean(ts1) - mean(ts2)
}
My plyr function is the following:
temp = dlply(dat, c("z", "u"), summarise, boot_object = boot(dat, boot_fun, R = 1000))
Since what I want out of the boot object is the mean and CI, I then perform the following plyr function:
temp2 = ldply(temp, summarise, mean = mean(boot$t), lowCI = quantile(boot$t, 0.025), highCI = quantile(boot$t, 0.975))
This works and accomplishes exactly what I want it to (although with an error about subsetting which doesn't seem to affect anything I care about), but I feel like there should be a way to skip the intermediate dlply step.
-edit- to clarify on what I'm trying to do if I didn't need to be splitting the groups
If I was manually splitting instead of using plyr, it would look something like the following:
temp = boot(dat[dat$z == "SppA" & dat$u == "SiteA",], boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
If I didn't care about the groups at all and just wanted to do this to the whole group it would be something like
temp = boot(dat, boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)
Your example is not reproducible by me.
When I do temp = boot(dat, boot_fun, R = 1000), I get a WARNING:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = dat, statistic = boot_fun, R = 1000)
Bootstrap Statistics :
WARNING: All values of t1* are NA
I think your current code is pretty efficient, but if you're looking for other possibilities, you could try tidyverse to 1) group_by the relevant columns, 2) nest the relevant data for bootstrapping, 3) run your bootstrap with the nested data, 4) isolate the statistics you desire, then 5) return to a normal data frame
library(boot)
library(tidyverse)
dat1 <- dat %>%
group_by(z,u) %>%
nest() %>%
mutate(data=map(data,~boot(.x, boot_fun, R=1000))) %>%
mutate(data=map(data,~data.frame(mean=mean(.x$t), lowCI=quantile(.x$t, 0.025), highCI=quantile(.x$t,0.975)))) %>%
unnest(data)
Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group'
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move onto this more complex version (which I was using this example for guidance) I need to get the evaluation working in the simple version first as thats the stumbling block, the call to dplyr::select() works ok.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably often write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that will give us ~mean("eg1", na.rm = TRUE) which is treating eg1 as a character instead of a variable name. So we use as.name, as is taught to us in vignette("nse").
Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group'
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move onto this more complex version (which I was using this example for guidance) I need to get the evaluation working in the simple version first as thats the stumbling block, the call to dplyr::select() works ok.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably often write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that will give us ~mean("eg1", na.rm = TRUE) which is treating eg1 as a character instead of a variable name. So we use as.name, as is taught to us in vignette("nse").