Equivalent for deprecated select_() and mutate_() [duplicate] - r

This question already has an answer here:
Passing variable in function to other function variables in R
(1 answer)
Closed 2 years ago.
I am trying to make a function with this data and would really appreciate your help!
Imagine I have a data.frame like this one (the fusion of control and sites).
I want to select the InitDryW and FinalDryW columns of the Treatment “Control” and then calculate the average.
Inside the function I must write select_() and then mutate_(). However, I understand that these two functions are deprecated.
control <- data.frame(Day=c(0,0,0,0,0,0),
Replica=c(1,1,1,1,1,1),
Initial_Dry_Weight=c(5.010,5.010,5.010,5.010,5.010,5.000),
Final_Dry_Weight=c(4.990,4.940,4.840,4.820,4.960,4.970),
InitiaFraction=c(1.1071,1.1964,1.0647,1.0005,1.0453,1.1212),
FinalFraction=c(0.3858,0.3504,0.4248,0.3333,0.3417,0.3467),
Treatment=c("Control","Control","Control","Control","Control","Control"))
control
sites <-data.frame(Day=c(2,4,8,16,32,44),
Replica=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Initial_Dry_Weight=c(5.000,5.000,5.000,5.000,5.01,5.000,5.000,5.000,
5.000,5.000,5.000,5.01,5.01,5.01,5.000,5.000,5.000,5.000),
Final_Dry_Weight=c(4.65,4.63,4.67,4.64,4.37,4.37,4.17,3.72,4.12,4,3.99,3.64,
4.26,3.3,3.47,3.7,3.75,3.3),
InitiaFraction=c(1.0081,1.0972,1.1307,1.0898,1.075,1.0295,1.0956,1.042,1.0876,
1.006,1.1052,1.0922,1.0472,1.0843,1.0177,1.0143,1.1112,1.0061),
FinalFraction=c(0.3229,0.3605,0.3304,0.3489,0.3181,0.2948,0.4098,0.3762,0.3787,
0.3345,0.3595,0.3511,0.3921,0.3908,0.3385,0.347,0.3366,0.3318),
Treatment=c("CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC",
"CC","CC","CC","CC","CC"))
sites
total <- dplyr::bind_rows(control,sites)
total
My function is:
manipulation <- function(data,
InitDryW,
FinalDryW,
Treatment,
Difference) {control <- data %>%
filter(Treatment == "Control") %>%
select_(InitDryW,FinalDryW) %>%
mutate_(Difference = lazyeval::interp (~a/b, a=as.name(FinalDryW),b=as.name(InitDryW)))
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation()
Then, I run the example:
control <- manipulation(data= total,
InitDryW = "Initial_Dry_Weight",
FinalDryW = "Final_Dry_Weight",
Treatment = "Treatment")
control
Now, I'm getting warnings like these (for both select_() and mutate_()):
Warning message:
mutate_() is deprecated.
Please use mutate() instead
The 'programming' vignette or the tidyeval book can help you
to program with mutate() : https://tidyeval.tidyverse.org
The result is correct, but this is the first time the warning has appeared.
My question is: what is the equivalent of select_() and mutate_() in functions in this case?
I think now select_() is solved using only select()
Thanks in advance!!!

You can pass unquoted column names and use {{ }} to evaluate them.
library(dplyr)
library(rlang)
manipulation <- function(data,InitDryW,FinalDryW,Treatment,Difference) {
control <- data %>%
filter({{Treatment}} == "Control") %>%
select({{InitDryW}},{{FinalDryW}}) %>%
mutate(Difference = {{FinalDryW}}/{{InitDryW}})
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation(data= total,
InitDryW = Initial_Dry_Weight,
FinalDryW = Final_Dry_Weight,
Treatment = Treatment)
However, based on @27 ϕ 9's comment, we think you might want to do this instead:
manipulation <- function(data,InitDryW,FinalDryW,Treatment) {
control <- data %>%
# .env$ makes the right-hand side refer to the function argument, not the column
filter(Treatment == .env$Treatment) %>%
select({{InitDryW}},{{FinalDryW}}) %>%
mutate(Difference = {{FinalDryW}}/{{InitDryW}})
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation(data= total,
InitDryW = Initial_Dry_Weight,
FinalDryW = Final_Dry_Weight,
Treatment = "Control")
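If the column names have to stay as strings, as in the original call from the question, one option is the .data pronoun together with all_of(). The following is only a minimal sketch under that assumption; manipulation_chr, its argument names, and the toy data frame are hypothetical and chosen for illustration. The argument names deliberately differ from the column names so that nothing is masked by the data.

```r
library(dplyr)

# Toy data (hypothetical), standing in for `total`
toy <- data.frame(
  Initial_Dry_Weight = c(5, 5, 4),
  Final_Dry_Weight   = c(4, 3, 2),
  Treatment          = c("Control", "Control", "CC")
)

# Column names arrive as strings, so look them up with .data[[ ]]
manipulation_chr <- function(data, init_col, final_col, treatment_col,
                             level = "Control") {
  control <- data %>%
    filter(.data[[treatment_col]] == level) %>%
    select(all_of(c(init_col, final_col))) %>%
    mutate(Difference = .data[[final_col]] / .data[[init_col]])
  mean(control$Difference, na.rm = TRUE)
}

res_chr <- manipulation_chr(toy,
                            init_col      = "Initial_Dry_Weight",
                            final_col     = "Final_Dry_Weight",
                            treatment_col = "Treatment")
```

With this shape the caller keeps passing quoted names, and the treatment level to keep is an ordinary argument rather than a hard-coded value.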

I think NSE (non-standard evaluation) could help you. It might be a little confusing at first, but I think it's quite an elegant way to forget about the underscore functions :) Most dplyr verbs already work this way, so you should be familiar with the concept even if you didn't know its name:
... here is an example.
library(dplyr) # provides %>%
# some data
dat <- dplyr::tibble(A=1:5,
B=5:1)
# some function
some_function <- function(dat,
.var){
.var <- rlang::enquo(.var)
dat %>%
dplyr::select(!!.var)
}
# run function
some_function(dat,.var=B)
# output
# A tibble: 5 x 1
B
<int>
1 5
2 4
3 3
4 2
5 1
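For comparison, the same wrapper can be sketched with the {{ }} shorthand, which combines the enquo() and !! steps (it needs rlang 0.4.0 or later). The name some_function2 is hypothetical and only used here.

```r
library(dplyr)

dat <- tibble(A = 1:5, B = 5:1)

# Same idea as enquo() + !!, written with the {{ }} shorthand
some_function2 <- function(dat, .var) {
  dat %>%
    select({{ .var }})
}

out <- some_function2(dat, .var = B)
```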

Related

R 4.1.2: Dynamically check values for a cumulative pattern. Null following values if that pattern occurs at any time across values

This relates to another problem I posted, but I did not quite ask the right question. If anyone can help with this, it would really be appreciated.
I have a DF with several players' answers to 100 questions in a quiz (example data frame below with 10 questions and 10 players-not the real data, which is not really from a quiz, but the principle is the same).
My goal is to create a function that will check when a player has answered 3 questions incorrectly cumulatively at any point during their answers, and then change their following answers to the string "disc". I would like to be able to change the parameters also, so it could be 4 or 5 questions incorrect etc. In the df: 1=correct, 0=incorrect, and 2=unanswered. Unanswered is considered incorrect, but I do not want to recode it as 0.
df=data.frame(playerID=numeric(),
q1=numeric(),
q2=numeric(),
q3=numeric(),
q4=numeric(),
q5=numeric(),
q6=numeric(),
q7=numeric(),
q8=numeric(),
q9=numeric(),
q10=numeric())
set.seed(1)
for(i in 1:10){
list_i=c(i,sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1))
df[i,]=list_i
}
So, in this DF, for example, playerID 3, 8 and 9 should have their answers set to "disc" from q4 onwards, whereas playerID 5 should have "disc" from q8 onwards. So any time there are 3 consecutive incorrect answers (including values of 2), the following answers should change to "disc".
I presume the syntax would be a for loop with an if statement inside using mutate or similar.
One possible solution using mutate and across:
df %>%
ungroup() %>%
mutate(
# Mutate across all question columns
across(
starts_with("q"),
function(col) {
# Get previous columns
col_i <- which(names(cur_data())==cur_column())
previous_cols <- 2:(col_i-1)
# Get results for previous questions as string (i.e. zero, or 2)
previous_qs <- select(cur_data(), all_of(previous_cols)) %>%
mutate(across(everything(), ~as.numeric(.x %in% c(0,2)))) %>%
tidyr::unite("str", sep = "") %>%
pull(str)
# Check for three successive incorrect answers at some previous point
results <- grepl(pattern = "111", previous_qs)
# For those with three successive incorrect answers at some previous point, overwrite value with 'disc'
col[results] <- "disc"
col
}
)
)
Are you looking for something like this?
library(tidyverse)
n <- 100
f <- function(v, cap, new_value){
df <-
data.frame(v = v) |>
mutate(
b = cumsum(v),
v_new = ifelse(b > cap, new_value, v)
)
return(df$v_new)
}
# apply function to vector
v <- runif(n)
v_new <- f(v, 5, "disc")
# apply function in a dataframe with mutate
df <-
data.frame(a = runif(n))
df |>
mutate(
b = f(a, 5, "disc")
)
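For the quiz question above, a base-R alternative might look like the sketch below. It assumes that "cumulatively" means n consecutive incorrect answers, as the worked description in the question suggests. The helper name disqualify and the small quiz data frame are hypothetical.

```r
# Replace every answer after the first run of `n` consecutive incorrect
# answers (0 = incorrect, 2 = unanswered) with "disc".
disqualify <- function(answers, n = 3) {
  wrong <- answers %in% c(0, 2)
  # Length of the current run of incorrect answers at each position;
  # every correct answer starts a new group and resets the count
  run <- ave(as.integer(wrong), cumsum(!wrong), FUN = cumsum)
  hit <- which(run >= n)
  out <- as.character(answers)
  if (length(hit) > 0 && hit[1] < length(out)) {
    out[(hit[1] + 1):length(out)] <- "disc"
  }
  out
}

# Small hypothetical example: player 1 misses q2-q4, player 2 never misses 3 in a row
quiz <- data.frame(playerID = 1:2,
                   q1 = c(1, 0), q2 = c(0, 1), q3 = c(2, 0),
                   q4 = c(0, 1), q5 = c(1, 1), q6 = c(1, 0))

q_cols <- grep("^q", names(quiz))
res <- t(apply(as.matrix(quiz[q_cols]), 1, disqualify, n = 3))
quiz[q_cols] <- as.data.frame(res, stringsAsFactors = FALSE)
```

Because the threshold is an argument, calling disqualify(answers, n = 4) or n = 5 changes the rule without touching the rest of the code. Note that the question columns become character once "disc" is written into them.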

Using group_by inside a function [duplicate]

This question already has answers here:
Dplyr write a function with column names as inputs
(2 answers)
Closed 3 years ago.
I'm trying to write a function, using dplyr syntax, which includes grouping with group_by inside the function. There seems to be a problem with the group_by statement, and I can't figure out what's wrong. When I pass abc as an argument and use select inside the function, it works as I would have expected (Gfunc1). When I try to group_by the same argument, it gives me an error:
Error: Column dims is unknown
Please see the example below. I really hope I have not overlooked some embarrassingly simple thing... anyway, I would be grateful for help!
library(dplyr)
abc <- c("a","a","a","b","b","c")
num <- c(1,2,3,4,5,6)
df <- data.frame(abc,num)
Gfunc1 <- function(dims) {
test1 <- df %>%
select(dims)
assign("test1", test1, envir = .GlobalEnv)
}
Gfunc2 <- function(dims) {
test2 <- df %>%
group_by(dims)
assign("test2", test2, envir = .GlobalEnv)
}
Gfunc1("abc")
# Returns as expected; df test1 with only col = "abc"
Gfunc2("abc")
# Does not return what i expect; gives error: Error: Column `dims` is unknown
One can solve this by using {{ }} (I'm using rlang 0.4.1 and dplyr 0.8.3), as follows.
Functions that wrap dplyr verbs need a bit of extra work, which is usually done via tidy evaluation / non-standard evaluation (NSE). I added df as an argument because I feel it is always better to provide the dataset as an argument rather than calling it from an external environment.
Gfunc1 works because select() accepts a character vector of column names, whereas group_by() looks for a column that is literally called dims. With {{ }}, the column name is passed unquoted:
Gfunc2 <- function(df = NULL,dims) {
test2 <- df %>%
group_by({{dims}})
assign("test2", test2, envir = .GlobalEnv)
}
Gfunc2(df, abc) # with {{ }}, pass the column name unquoted
For earlier versions of rlang and dplyr, the same can be achieved using sym and !!:
Gfunc2 <- function(df = NULL,dims) {
test2 <- df %>%
group_by(!!sym(dims))
assign("test2", test2, envir = .GlobalEnv)
}
Gfunc2(df,"abc")
NOTE
It is almost always better to store results in a list instead of sending them to .GlobalEnv.
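As a sketch of what the note above suggests, the function can simply return the grouped data and let the caller decide where to store it. Gfunc2_ret is a hypothetical name; the data are the same as in the question.

```r
library(dplyr)

df <- data.frame(abc = c("a", "a", "a", "b", "b", "c"),
                 num = c(1, 2, 3, 4, 5, 6))

# Return the grouped data instead of writing it to .GlobalEnv
Gfunc2_ret <- function(df, dims) {
  df %>%
    group_by({{ dims }})
}

test2 <- Gfunc2_ret(df, abc)
sums  <- test2 %>% summarise(sum = sum(num))
```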
You can create a function by passing the dots to it. This way you can group by and select more than one variable at the time using NSE.
Gfunc1 <- function(.df, ...) {
test1 <- .df %>%
select(...)
assign("test1", test1, envir = .GlobalEnv)
}
Gfunc2 <- function(.df, ...) {
test2 <- .df %>%
group_by(...)
assign("test2", test2, envir = .GlobalEnv)
}
Gfunc1(df, abc)
Gfunc2(df, abc)
results
> test1
abc
1 a
2 a
3 a
4 b
5 b
6 c
test2 %>%
summarise(sum = sum(num))
abc sum
<fct> <dbl>
1 a 6
2 b 9
3 c 6
To see more about this, consider the material from the RstudioConf selecting and doing with the Tidy Eval
- slides
- video

How do you specify the possible states of an argument in functions using non-standard evaluation?

I am learning about programming with tidy evaluation and non-standard evaluation and have been trying to work out how to constrain the possible states of an argument in a function.
For instance given a data set:
set.seed(123)
data <- tibble(GROUP_ONE = sample(LETTERS[1:3], 10, replace = TRUE), # data_frame() is deprecated
GROUP_TWO = sample(letters[4:6], 10, replace = TRUE),
result = rnorm(10))
I can create a function which has an argument I use to group the data using a quosure like so:
my_function <- function(data, group = GROUP_ONE){
require(dplyr)
require(magrittr)
group <- enquo(group)
result <- data %>%
group_by(!!group) %>%
summarise(mean=mean(result))
return(result)
}
and this does what I want
my_function(data)
# A tibble: 3 x 2
GROUP_ONE mean
<chr> <dbl>
1 A 1.5054975
2 B 0.2817966
3 C -0.5129904
and I can supply a different group:
my_function(data, group = GROUP_TWO)
# A tibble: 3 x 2
GROUP_TWO mean
<chr> <dbl>
1 d -0.3308130
2 e 0.2352483
3 f 0.7347437
However, I cannot group by a column which is not present in the data.
e.g.
my_function(data, group = GROUP_THREE)
Error in grouped_df_impl(data, unname(vars), drop) : Column GROUP_THREE is unknown
I would like to add a step at the beginning of the function so that the function stops with a custom error message if the group argument is not GROUP_ONE or GROUP_TWO
something like:
if(!group %in% c(~GROUP_ONE, ~GROUP_TWO)) stop("CUSTOM ERROR MESSAGE")
except this does not work, as apparently you can't put quosures in a vector. It should be possible to convert the quosure to a string somehow and have a vector of strings, but I can't figure out how.
How is this done?
I think you need quo_name (from dplyr or rlang), which transforms a quoted symbol to a string:
my_function <- function(data, group = GROUP_ONE){
require(dplyr)
require(magrittr)
group <- enquo(group)
if(!quo_name(group) %in% names(data)) stop("CUSTOM ERROR MESSAGE")
result <- data %>%
group_by(!!group) %>%
summarise(mean=mean(result))
return(result)
}
# > my_function(data, GROUP_THREE)
# Error in my_function(data, GROUP_THREE) : CUSTOM ERROR MESSAGE
Edit
As noted by lionel in comment: except for quo_name, there are many other alternatives including base R as.character and as_string from rlang.
quo_name() is for transforming arbitrary expressions to text so that isn't robust for checking symbols.
If you expect only symbols, and if those symbols should only represent data frames columns, you don't need quosures. In this case you can capture with enexpr() (and there will be ensym() in the next version of rlang):
group <- enexpr(group)
stopifnot(is_symbol(group)) # Or some custom error
Then turn it to a string for the check:
as_string(group) %in% names(data)
You can then unquote the symbol just like you unquote the quosure.
df %>% group_by(!! group)
Alternatively, if you need quosures, you can check the contained expression (quo here stands for the captured quosure):
expr <- get_expr(quo)
is_symbol(expr) && as_string(expr) %in% names(data)
That should be the preferred UI because group_by() has mutate semantics, so you can do stuff like this: df %>% group_by(as.factor(col)). This also means that it's hopeless to try to provide custom error messages, unless you want to capture the error, parse it to make sure it's a "symbol not found" one, and rethrow another error.
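Putting the symbol-based pieces from this comment together, a restricted version of the function might look like the sketch below. It uses ensym(), which the comment mentions as upcoming and which is available in current rlang. The name my_function2, the allowed argument, and the toy data are hypothetical.

```r
library(dplyr)
library(rlang)

my_function2 <- function(data, group,
                         allowed = c("GROUP_ONE", "GROUP_TWO")) {
  # ensym() fails early if `group` is not a bare column name (or a string)
  group <- ensym(group)
  if (!as_string(group) %in% allowed) {
    stop("`group` must be one of: ", paste(allowed, collapse = ", "))
  }
  data %>%
    group_by(!!group) %>%
    summarise(mean = mean(result))
}

# Hypothetical toy data with the same column names as in the question
toy <- data.frame(GROUP_ONE = c("A", "A", "B"),
                  GROUP_TWO = c("d", "e", "e"),
                  result    = c(1, 3, 5))

ok  <- my_function2(toy, GROUP_ONE)
err <- tryCatch(my_function2(toy, GROUP_THREE),
                error = function(e) conditionMessage(e))
```

This trades away the flexibility of expressions such as as.factor(col) in exchange for a predictable custom error message.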

Simple mutate with dplyr gives "wrong result size" error

My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
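The issue is that mutate applies operators and functions to whole columns, not row by row. Thus, which is applied once to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop). In that comparison the 6 elements of subj.list are recycled along the 72 rows, and only 12 positions happen to match, so which returns 12 indices where mutate expects 72 or 1.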
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
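The recycling behind the error can be reproduced outside dplyr. The sketch below uses the InsectSprays data from the question and also shows match() as one vectorised way to get an id per row without rowwise(); the variable names idx_vectorised and subject.id are only illustrative.

```r
# InsectSprays ships with R (datasets package): 72 rows, 6 sprays
subject   <- as.character(InsectSprays$spray)
subj.list <- unique(subject)

# One vectorised comparison: subj.list (length 6) is recycled along the
# 72 values, so which() only returns the positions that happen to line up
idx_vectorised <- which(subject == subj.list)
length(idx_vectorised)  # 12, the "wrong result size" from the error

# A vectorised lookup that returns exactly one id per row
subject.id <- match(subject, subj.list)
length(subject.id)      # 72
```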
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))

R: Using dplyr inside a function. exception in eval(expr, envir, enclos): unknown column

I have created a function in R based on the kind help of @Jim M.
When I run the function I get the error: Error: unknown column 'rawdata'
When looking at the debugger I get the message: Rcpp::exception in eval(expr, envir, enclos): unknown column 'rawdata'
However, when I look at the environment window I can see the 2 variables which I have passed to the function, and they contain data: rawdata is a factor with 7 levels and refdata a factor with 28 levels.
function (refdata, rawdata)
{
wordlist <- expand.grid(rawdata = rawdata, refdata = refdata, stringsAsFactors = FALSE)
wordlist %>% group_by(rawdata) %>% mutate(match_score = jarowinkler(rawdata, refdata)) %>%
summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)])
}
This is the problem with functions using NSE (non-standard evaluation). Functions using NSE are very useful in interactive programming but cause many problems in development i.e. when you try to use those inside other functions. Due to expressions not being evaluated directly, R is not able to find the objects in the environments it looks in. I can suggest you read here and preferably the scoping issues chapter for more info.
First of all you need to know that ALL the standard dplyr functions use NSE. Let's see an approximate example to your problem:
Data:
df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))
> df
col1 col2
1 a 0.03366446
2 a 0.46698763
3 a 0.34114682
4 a 0.92125387
5 a 0.94511394
6 b 0.67241460
7 b 0.38168131
8 b 0.91107090
9 b 0.15342089
10 b 0.60751868
Let's see how NSE will make our simple problem crash:
First of all the simple interactive case works:
df %>% group_by(col1) %>% summarise(count = n())
Source: local data frame [2 x 2]
col1 count
1 a 5
2 b 5
Let's see what happens if I put it in a function:
lets_group <- function(column) {
df %>% group_by(column) %>% summarise(count = n())
}
>lets_group(col1)
Error: index out of bounds
Not the same error as yours but it is caused by NSE. Exactly the same line of code worked outside the function.
Fortunately, there is a solution to your problem and that is standard evaluation. Hadley also made versions of all the functions in dplyr that use standard evaluation. They are just the normal functions plus an underscore (_) at the end. (Note that these underscore variants have since been deprecated in favour of tidy evaluation.)
Now look at how this will work:
# notice the formula operator (~) in front of n() in summarise_
lets_group2 <- function(column) {
df %>% group_by_(column) %>% summarise_(count = ~n())
}
This yields the following result:
#also notice the quotes around col1
> lets_group2('col1')
Source: local data frame [2 x 2]
col1 count
1 a 5
2 b 5
I cannot test your problem but using SE instead of NSE will give you the results you want. For more info you can also read here
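Since the underscore verbs are deprecated in current dplyr (as the first question on this page notes), the same wrapper can be sketched with tidy evaluation instead. The names lets_group3 and lets_group4 are hypothetical, and the data frame is a simplified stand-in for the one above.

```r
library(dplyr)

df <- data.frame(col1 = rep(c("a", "b"), each = 5), col2 = 1:10)

# Unquoted column name, forwarded with {{ }}
lets_group3 <- function(data, column) {
  data %>% group_by({{ column }}) %>% summarise(count = n())
}

# Column name as a string, looked up with .data[[ ]]
lets_group4 <- function(data, column) {
  data %>% group_by(.data[[column]]) %>% summarise(count = n())
}

res3 <- lets_group3(df, col1)
res4 <- lets_group4(df, "col1")
```

The first form suits interactive-style calls, the second suits callers that already hold the column name in a character variable.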
