mutate_each_ non-standard evaluation - r

Really struggling with using dplyr functions inside my own functions. I understand the function_ suffix for the standard evaluation versions, but I'm still having problems, and have seemingly tried every combination of eval, paste, and lazy.
I'm trying to divide multiple columns by the median of the control rows within each group. The example data adds a column to iris named 'Control', so each species has 40 'normal' rows and 10 'control' rows.
data(iris)
control <- rep(c(rep("normal", 40), rep("control", 10)), 3)
iris$Control <- control
Normal dplyr works fine:
out_df <- iris %>%
  group_by(Species) %>%
  mutate_each(funs(./median(.[Control == "control"])), 1:4)
Trying to wrap this up into a function:
norm_iris <- function(df, control_col, control_val, species, num_cols = 1:4){
  out <- df %>%
    group_by_(species) %>%
    mutate_each_(funs(./median(.[control_col == control])), num_cols)
  return(out)
}
norm_iris(iris, control_col = "Control", control_val = "control", species = "Species")
I get the error:
Error in UseMethod("as.lazy_dots") :
no applicable method for 'as.lazy_dots' applied to an object of class "c('integer', 'numeric')"
Using funs_ instead of funs I get Error:...: need numeric data

If you haven't already, it might help you to read the vignette on standard evaluation here, although it sounds like some of this may be changing soon.
Your function is missing the use of interp from the lazyeval package in the mutate_each_ line. Because you are trying to use a variable name (the Control column) inside the funs, you need funs_ in this situation along with interp. Notice that this is a situation where you don't actually need mutate_each_ at all; you would only need it if you were selecting the columns to mutate by name instead of by position.
Here is what the line would look like in your function instead of what you have:
mutate_each(funs_(interp(~ ./median(.[x == control_val]), x = as.name(control_col))),
            num_cols)
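For reference, here is a sketch of the full function with that line in place (untested against current releases; mutate_each()/funs_() and lazyeval::interp() come from older versions of dplyr and have since been superseded):
library(dplyr)
library(lazyeval)
norm_iris <- function(df, control_col, control_val, species, num_cols = 1:4){
  df %>%
    group_by_(species) %>%
    # interp() substitutes the column named in control_col for x;
    # control_val is found in the function environment when the formula is evaluated
    mutate_each(funs_(interp(~ ./median(.[x == control_val]),
                             x = as.name(control_col))),
                num_cols)
}
norm_iris(iris, control_col = "Control", control_val = "control", species = "Species")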

Related

Issue computing AUC with pROC package

I'm trying to use a function that calls on the pROC package in R to calculate the area under the curve for a number of different outcomes.
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
  pROC::auc(outcome_var, predictor_var)
}
To do this, I am intending to refer to outcome names in a vector (much like below).
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
However, I am having problems defining variables to pass into this function. When I do, I get the error: "Error in roc.default(response, predictor, auc = TRUE, ...): 'response' must have two levels". I can't work out why, as I reckon I only have two levels...
I would be so happy if anyone could help me!
Here is reproducible code using the iris dataset in R.
library(pROC)
library(datasets)
library(dplyr)
# Use iris dataset to generate binary variables needed for function
df <- iris %>%
  dplyr::mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4) == 4),
                outcome_2 = as.numeric(ntile(Petal.Length, 4) == 4)) %>%
  dplyr::rename(predictor_1 = Petal.Width)
# Inspect binary outcome variables
df %>% group_by(outcome_1) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
df %>% group_by(outcome_2) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
  pROC::auc(outcome_var, predictor_var)
}
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
# Define variables to go into function
outcome_var <- df %>% dplyr::select(outcome[[1]])
predictor_var <- df %>% dplyr::select(predictor_1)
# Use function - first line works but not last line!
proc_auc(df$outcome_1, df$predictor_1)
proc_auc(outcome_var, predictor_var)
outcome_var and predictor_var are data frames with one column each, which means they cannot be used directly as arguments to the auc function.
Just specify the column names and it will work.
proc_auc(outcome_var$outcome_1, predictor_var$predictor_1)
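Alternatively, you could extract the columns as plain vectors in the first place, for example with dplyr::pull(), so that what you pass to the function is a vector rather than a one-column data frame:
# pull() returns the column as a vector, unlike select()
outcome_var <- df %>% dplyr::pull(outcome[[1]])
predictor_var <- df %>% dplyr::pull(predictor_1)
proc_auc(outcome_var, predictor_var)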
You'll have to familiarize yourself with dplyr's non-standard evaluation, which makes it fairly hard to program with. In particular, you need to realize that passing a variable name is a form of indirection, and that there is special syntax for it.
If you want to stay with pipes and non-standard evaluation, you can use the roc_ function, which follows an older naming convention for functions that take column names as strings instead of the actual columns.
proc_auc2 <- function(data, outcome_var, predictor_var) {
  pROC::auc(pROC::roc_(data, outcome_var, predictor_var))
}
At this point you can pass the actual column names to this new function:
proc_auc2(df, outcome[[1]], "predictor_1")
# or equivalently:
df %>% proc_auc2(outcome[[1]], "predictor_1")
That being said, for most use cases you probably want to follow @druskacik's answer and use standard R evaluation.
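A sketch of what that standard-evaluation style might look like, passing column names as strings and indexing with [[ (the name proc_auc3 is just for illustration):
proc_auc3 <- function(data, outcome_col, predictor_col) {
  # [[ extracts plain vectors, which is what pROC::auc() expects
  pROC::auc(data[[outcome_col]], data[[predictor_col]])
}
proc_auc3(df, outcome[[1]], "predictor_1")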

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions that I have been completely overlooking, instead of manually creating lists. Anyone have any suggestions?
This can work using the iris dataset as data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
Sepal.Length Sepal.Width
statistic 96.93744 63.57115
parameter 2 2
p.value 8.918734e-22 1.569282e-14
method "Kruskal-Wallis rank sum test" "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
R is a very diverse language, and there will ultimately be many ways to do the same thing. Some functions expect standard evaluation while others use NSE (non-standard evaluation).
However, you seem to be asking about functions that expect a single vector as input, as opposed to functions that have a data argument, in which case you use variable as opposed to data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be besides the point in regards to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument to the function. In this example you are passing a formula, for which the data argument is necessary, whereas the default method just requires x and g arguments (which would probably work with your original pipelines).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has different dispatch methods that will work for you.
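For instance, both of these calls run the same test (using iris for illustration):
# formula method: needs a data argument
kruskal.test(Sepal.Length ~ Species, data = iris)
# default method: takes the x and g vectors directly
kruskal.test(x = iris$Sepal.Length, g = iris$Species)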
Sidebar 2 - Data Frames
Data frames are really just collections of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning - there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are expected to be evaluated within a certain data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], id = data[[id]], na.rm=T), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages: dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before doing the sapply. With dplyr you can often circumvent this if there is a condition to select columns on.
data %>%
  group_by(group) %>% # group the data
  summarise_if(is.numeric, mean, na.rm = TRUE) # applies mean to every numeric column
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
  group_by(group) %>% # group so the column is retained through the next select
  select_if(is.numeric) %>% # select all numeric columns
  pivot_longer(cols = -group) %>% # reshape all columns except "group"; names go to `name`, values to `value`
  group_by(name) %>% # regroup on the name variable (the old numeric columns)
  summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) # perform the test per variable
I've only mentioned purrr because you can almost drop-in replace all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type.
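For instance, a map-based version of the kruskal.test sapply from earlier might look like this (a sketch using the iris example above):
library(purrr)
vars <- c("Sepal.Length", "Sepal.Width")
# set_names() keeps the variable names on the results; map() returns a list
krusk <- vars %>%
  set_names() %>%
  map(~ kruskal.test(iris[[.x]] ~ iris[["Species"]]))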
I hope this helps and wish you luck on your coding adventures.

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as df %>%, but I'm confused about how to turn the expression inside out. For example, should the na.rm=TRUE go inside mean(), pull(), or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
  pull(height) %>%
  mean(na.rm = TRUE) %>%
  print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
These pipe operators work with the majority of the tidyverse (not just magrittr). What you are trying to do actually comes from dplyr. The na.rm=TRUE argument is needed by many summary functions like mean and sd, as well as by functions that pick out specific data points like min and max; these functions don't play well with NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested, you may not even need to use pull:
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe the results into another data frame rather than just printing them, and pull the values out of that data frame whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]

R - Passing a column name to a function to be evaluated in a non standard way

I have a data frame where I want to pass a column name to a function and then filter based on that column.
I have tried reading a few tutorials on this, and it seems to be related to non-standard evaluation in R, but I can't seem to wrap my head around the examples in the blog posts I have read.
Just for simplicity, I have taken the iris dataset and I want to pass a column to a function which will then filter that dataset where the column value is greater than one.
mydf <- iris
filter_measurements <- function(mydf, measurement){
  mydf <- filter(measurement >= 1)
  mydf
}
mydf %>%
filter_measurements(measurement = Petal.Width)
Do I have to add something to my function so that R knows I want a column, rather than treating it as the string 'Petal.Width', for example?
I have seen Passing a variable name to a function in R, which I was unable to adapt to my example.
Thank you all for your time
A great resource for this is Programming with dplyr.
mydf <- iris
filter_measurements <- function(mydf, measurement){
  measurement <- enquo(measurement)
  mydf <- filter(mydf, (!!measurement) >= 1)
  mydf
}
mydf %>%
filter_measurements(measurement = Petal.Width)
You have to tell the function that measurement is given as a bare variable name. To do this, first use enquo to capture what is given in the measurement argument and store it as a quosure. Then, with !! in front of measurement, the filter function knows that it doesn't have to quote this argument, as it is already a quosure.
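As an aside, more recent versions of dplyr (via rlang 0.4.0 and later) provide the curly-curly operator, which wraps this enquo-and-!! pattern into a single step:
filter_measurements <- function(mydf, measurement){
  # {{ }} captures and splices the bare column name in one go
  filter(mydf, {{ measurement }} >= 1)
}
mydf %>%
  filter_measurements(measurement = Petal.Width)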
Alternative
You can also pass the column you want to filter on as a string and use filter_ (note that the underscore verbs are deprecated in current versions of dplyr):
filter_measurements <- function(mydf, measurement){
  mydf <- filter_(mydf, paste0(measurement, " >= 1"))
  mydf
}
mydf %>%
filter_measurements(measurement = "Petal.Width")
You have to pass the column name either as a character string or as the integer index of the column. Also, the line
mydf <- filter(measurement >= 1)
within your function never states what is being filtered, and expects measurement to be a stand-alone object, not a column of a data frame.
Try this:
filter_measurements <- function(mydf, measurement)
{
  mydf <- filter(mydf, mydf[, measurement] >= 1)
  mydf
}
iris %>% filter_measurements("Petal.Width")
A more convoluted invocation of the function would also work:
iris %>% filter_measurements(which(names(.)=="Petal.Width"))

Repeatedly mutate variable using dplyr and purrr

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_at.
My custom function
calculate_delta <- function(df, col) {
  # generate the new variable name
  newcolname = paste("d", col, sep = "")
  # get the formula for the first difference
  calculate_diff <- lazyeval::interp(~ (a + lag(a)) / a, a = as.name(col))
  # pass the formula to mutate_, naming the new variable with the name generated above
  df %>%
    mutate_(.dots = setNames(list(calculate_diff), newcolname))
}
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although obviously the meaning of the result is non-sensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious whether there is a method of repeatedly transforming variables that accomplishes the following:
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as happens with dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes all of this by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.
Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) /x
cols <- c("wt", "mpg")
#This
library(dplyr)
mtcars %>% mutate_at(cols, delta)
#Or
library(purrr)
mtcars %>% map_at(cols, delta)
#If necessary, in a function
f <- function(df, cols) {
  df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
Edit
If you would like to rename the columns afterwards, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
  names(object)[names(object) %in% old] <- new
  object
}
mtcars %>%
  mutate_at(cols, delta) %>%
  Rename(cols, paste0("lagged", cols))
Alternatively, to keep the original columns and add suffixed lagged versions in one step:
mtcars %>% mutate_at(cols, funs(lagged = delta))
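(Note that funs() has since been deprecated; in current dplyr the equivalent uses across() with a named list of functions, which auto-generates the suffixed names in the same way:)
# creates wt_lagged and mpg_lagged alongside the original columns
mtcars %>% mutate(across(all_of(cols), list(lagged = delta)))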
