Converting string to data masking with dplyr inside function - r

I'm trying to make a function which restructures some data. I got this function to work and it looked something like this:
function_1 <- function(df, group, focal, reference){
if (reference == "rest"){
df <- df %>% dplyr::mutate({{group}} := recode({{group}}, {{focal}} := {{focal}}, .default = "rest"))
df <- df %>% dplyr::select(ac_id, question_id, question_result_score, {{group}})
df <- df[!(duplicated(dplyr::select(df, ac_id, question_id))), ]
df <- df %>% dplyr::arrange(ac_id)
}
else{
df <- dplyr::filter(df, {{group}} == {{focal}} | {{group}} == {{reference}})
df <- df %>% dplyr::select(ac_id, question_id, question_result_score, {{group}})
df <- df[!(duplicated(dplyr::select(df, ac_id, question_id))), ]
df <- df %>% dplyr::arrange(ac_id)
}
return(df)
}
# and I run the following command:
function_1(mydata, gender, "male", "rest")
This works exactly as I want it to. Now this needs to go inside another function (let's call this function_2), where I loop over different demographic characteristics (age, gender, english-native, etc.) and demographic indicators (e.g. "male" (from gender), "female" (from gender), etc.).
Inside function_2 we loop over the output of another function, which returns a dataframe with the following structure:
group    focal   reference
gender   female  male
gender   female  rest
gender   male    rest
english  native  non-native
...      ...     ...
The problem when looping over this output is (I THINK) that the input of function_1 becomes:
function_1(mydata, "gender", "female", "male")
#instead of
function_1(mydata, gender, "female", "male")
So without the quotation marks. Does anybody know how to fix function_1 such that it works with the input shown above?
Any help would be greatly appreciated, and if any other information is needed, let me know!
KR
P.S.
Maybe the following helps. To generate the table as shown above, we use a function which I stored in a variable called viable_cat and this output has the following properties:
> typeof(viable_cat)
[1] "character"
> class(viable_cat)
[1] "matrix" "array"

I recommend !!sym(.) for turning strings into variable names. For example:
library(dplyr)
data(mtcars)
in_var = "mpg"
out_var = "mpg2"
new = mtcars %>%
mutate(!!sym(out_var) := 2 * !!sym(in_var))
You can pass strings between multiple functions with ease.
I know this technique is not what the "Programming with dplyr" vignette currently recommends; it is an older approach to programming with dplyr. I find it more applicable to my use cases than some of the options currently recommended.
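Applied to the question, a minimal sketch of a string-based variant of function_1 might look like the following. The name function_1_str and the toy mydata and viable_cat objects are made up for illustration, and the group column is assumed to be character (or factor). It uses .data[[group]] to read the column and !!group := to assign to it, so every argument can be a plain string taken from a row of the character matrix.

```r
library(dplyr)

# Hypothetical string-based variant of function_1:
# group, focal and reference are all character strings.
function_1_str <- function(df, group, focal, reference) {
  if (reference == "rest") {
    # Keep the focal level, collapse every other level to "rest"
    df <- df %>%
      mutate(!!group := if_else(.data[[group]] == focal, focal, "rest"))
  } else {
    # Keep only rows belonging to the focal or the reference level
    df <- df %>% filter(.data[[group]] %in% c(focal, reference))
  }
  df %>%
    select(ac_id, question_id, question_result_score, all_of(group)) %>%
    distinct(ac_id, question_id, .keep_all = TRUE) %>%
    arrange(ac_id)
}

# Toy data (assumed structure, not the asker's real data)
mydata <- data.frame(
  ac_id = c(2, 1, 1, 3),
  question_id = c(1, 1, 1, 1),
  question_result_score = c(0, 1, 1, 1),
  gender = c("female", "male", "male", "other"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the character matrix described in the question
viable_cat <- matrix(
  c("gender", "female", "male",
    "gender", "male",   "rest"),
  ncol = 3, byrow = TRUE
)

# Loop over the rows; [[i, j]] extracts a single unnamed string
results <- lapply(seq_len(nrow(viable_cat)), function(i) {
  function_1_str(mydata, viable_cat[[i, 1]], viable_cat[[i, 2]], viable_cat[[i, 3]])
})
```

Because every argument is now a string, the same function can be called directly as function_1_str(mydata, "gender", "male", "rest").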

Related

How to pass a user-defined variable to the dplyr filter function in R? It seems that select works fine but filter gives wrong results

Here is the sample data:
sample,fit_result,Site,Dx_Bin,dx,Hx_Prev,Hx_of_Polyps,Age,Gender,Smoke,Diabetic,Hx_Fam_CRC,Height,Weight,NSAID,Diabetes_Med,stage
2003650,0,U Michigan,High Risk Normal,normal,0,1,64,m,,0,1,182,120,0,0,0
2005650,0,U Michigan,High Risk Normal,normal,0,1,61,m,0,0,0,167,78,0,0,0
2007660,26,U Michigan,High Risk Normal,normal,0,1,47,f,0,0,1,170,63,0,0,0
2009650,10,Toronto,Adenoma,adenoma,0,1,81,f,1,0,0,168,65,1,0,0
2013660,0,U Michigan,Normal,normal,0,0,44,f,0,0,0,170,72,1,0,0
2015650,0,Dana Farber,High Risk Normal,normal,0,1,51,f,1,0,0,160,67,0,0,0
2017660,7,Dana Farber,Cancer,cancer,1,1,78,m,1,1,0,172,78,0,1,3
2019651,19,U Michigan,Normal,normal,0,0,59,m,0,0,0,177,65,0,0,0
2023680,0,Dana Farber,High Risk Normal,normal,1,1,63,f,1,0,0,154,54,0,0,0
2025653,1509,U Michigan,Cancer.,cancer,1,1,67,m,1,0,0,167,58,0,0,4
2027653,0,Toronto,Normal,normal,0,0,65,f,0,0,0,167,60,0,0,0
below is the R code
library(tidyverse)
h <- 'Height'
w <- 'Weight'
data %>% select(h) %>% filter(h > 180)
I can see only the Height column in the output, but the filter is not applied. I don't get any error when I run the code. Similarly, the code below does not work:
s <- 'Site'
data %>% select(s) %>% mutate(s = str_replace(s," ","_"))
Output:
Site s
1 U Michigan Site
2 U Michigan Site
3 U Michigan Site
4 Toronto Site
I want to replace the space in the Site column, but it is obviously not recognizing s and is creating a new column s instead.
I tried running below code and still face the same issue.
exp <- substitute(s <- 'Site')
r <- eval(exp,data)
data %>% select(r) %>% mutate(r = str_replace(s," ","_"))
I searched everywhere and could not find a solution. Any help would be great. Thanks in advance. (I know the normal way to do it; I just want to be able to pass variables to the function.)
We can convert the string to a symbol with sym and evaluate it with !!. Also, to assign on the lhs of the operator, use := instead of = and evaluate with !!
library(dplyr)
library(stringr)
data %>%
select(all_of(s)) %>%
mutate(!!s := str_replace(!! rlang::sym(s)," ","_"))
Similarly for the filter
data %>%
select(all_of(h)) %>%
filter(!! rlang::sym(h) > 180)
Yet another option is to pass the variable to across (for filter, if_any/if_all can be used instead), which accepts one or more column names and applies the function to each selected column
data %>%
select(all_of(s)) %>%
mutate(across(all_of(s), ~ str_replace(.x, " ", "_")))
Or use .data
data %>%
select(all_of(s)) %>%
mutate(!!s := str_replace(.data[[s]]," ","_"))
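To see why the original filter(h > 180) ran without error but removed nothing, here is a small sketch on made-up data. The toy data frame and the helper name filter_above are assumptions for illustration only. Inside filter, h is not a column, so dplyr falls back to the variable h, and the comparison is made on the string "Height" itself.

```r
library(dplyr)

# Toy data (a small, made-up stand-in for the sample above)
data <- data.frame(
  Site = c("U Michigan", "Toronto", "Dana Farber"),
  Height = c(182, 167, 170)
)
h <- "Height"

# The string is compared, not the column: 180 is coerced to "180" and the
# comparison becomes a string comparison, giving a single logical value
# that filter() recycles to every row (so nothing is dropped when it is TRUE).
string_comparison <- h > 180

# .data[[h]] looks up the column whose name is stored in h
tall <- data %>% filter(.data[[h]] > 180)

# Hypothetical wrapper taking the column name as a string
filter_above <- function(df, col, cutoff) {
  df %>% filter(.data[[col]] > cutoff)
}
tall_too <- filter_above(data, "Height", 180)
```

The .data pronoun only accepts a single column name; for several columns at once, if_any/if_all with all_of is the closer fit.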

How to change variable to factor based on its name in some list by using across?

(I am new in R)
Trying to change variables data type of df members to factors based on condition if their names available in a list to_factors_list.
I have tried some code using mutate(across()) but it's giving errors.
Data prep.:
library(tidyverse)
# tidytuesday himalayan data
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# creating list of names
to_factors_list <- members %>%
map_df(~(data.frame(n_distinct = n_distinct(.x))),
.id = "var_name") %>%
filter(n_distinct < 15) %>%
select(var_name) %>% pull()
to_factors_list
############### output ###############
'season' 'sex' 'hired' 'success' 'solo' 'oxygen_used' 'died' 'death_cause' 'injured' 'injury_type'
Getting error in below code attempts:
members %>%
mutate(across(~.x %in% to_factors_list, factor))
members %>%
mutate_if( ~.x %in% to_factors_list, factor)
I am not sure what's wrong and how can I make this work ?
In base R, this can be done with lapply
members[to_factors_list] <- lapply(members[to_factors_list], factor)
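across expects a column selection as its first argument, not a predicate on column names. Wrapping the character vector in all_of makes it explicit that to_factors_list holds column names. The correct syntax is:
members %>% mutate(across(all_of(to_factors_list), factor))
Or if you prefer an older-version dplyr syntax:
members %>% mutate_at(vars(all_of(to_factors_list)), factor)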
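As a small self-contained check, the sketch below uses a made-up members_toy data frame (not the real tidytuesday data) and a hypothetical to_factors vector. It also shows where(), which selects columns by a predicate on their contents and so may replace the intermediate list of names altogether.

```r
library(dplyr)

# Toy stand-in for the members data
members_toy <- data.frame(
  season = c("Spring", "Autumn", "Spring"),
  sex = c("M", "F", "M"),
  age = c(40, 31, 27),
  stringsAsFactors = FALSE
)
to_factors <- c("season", "sex")

# Select by name: all_of() takes a character vector of column names
converted <- members_toy %>%
  mutate(across(all_of(to_factors), factor))

# Select by content: where() applies the predicate to each column
converted_where <- members_toy %>%
  mutate(across(where(~ n_distinct(.x) < 3), factor))
```

In the where() version the threshold of 3 is specific to the toy data; the question used 15.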

Can I write a function to revalue levels of a factor?

I have a column 'lg_with_children' in my data frame that has 5 levels, 'Half and half', 'Mandarin', 'Shanghainese', 'Other', 'N/A', and 'Not important'. I want to condense the 5 levels down to just 2 levels, 'Shanghainese' and 'Other'.
In order to do this I used the revalue() function from the plyr package to successfully rename the levels. I used the code below and it worked fine.
data$lg_with_children <- revalue(data$lg_with_children,
c("Mandarin" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Half and half" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("N/A" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Not important" = "Other"))
To condense the code a little, I went back to the data as it was before I revalued the levels and attempted to write a function. I tried the following after doing research on how to write your own functions (I'm rather new at this).
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c("source" = "target"))}
I intentionally left the df, col, source, and target generic because I need to revalue some other columns in the same way.
Next, I tried to run the code filling in the args and get this message:
warning message
I am not quite sure what the problem is. I tried the following adjustment to code and still nothing.
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c(source = target))}
Any guidance is appreciated. Thanks.
You can write your function to recode the levels - the easiest way to do that is probably to change the levels directly with levels(fac) <- list(new_lvl1 = c(old_lvl1, old_lvl2), new_lvl2 = c(old_lvl3, old_lvl4))
But there are already several functions that do it out of the box. I typically use the forcats package to manipulate factors.
Check out fct_recode from the forcats package.
There are also other functions that could help you - check out the comments below.
Now, as to why your code isn't working:
df$col looks for a column literally named col. The workaround is to do df[[col]] instead.
Don't forget to return df at the end of your function
c(source = target) will create a vector with one element named "source", regardless of what happens to be in the variable source.
The solution is to create the vector c(source = target) in 2 steps.
revalue_factor_levels <- function(df, col, source, target) {
to_rename <- target
names(to_rename) <- source
df[[col]] <- revalue(df[[col]], to_rename)
df
}
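The naming behaviour described above can be checked in base R. This is only an illustrative sketch: old_level, new_level and old_levels are hypothetical names, and setNames is shown as a one-step alternative to assigning names() separately.

```r
# Hypothetical variable names, for illustration only
old_level <- "Mandarin"
new_level <- "Other"

# c() uses the argument name literally as the element name
literal <- c(old_level = new_level)
names(literal)   # "old_level"

# setNames() uses the value stored in old_level instead
by_value <- setNames(new_level, old_level)
names(by_value)  # "Mandarin"

# Several old levels can be mapped to one new level in a single named vector
old_levels <- c("Mandarin", "Half and half", "N/A", "Not important")
mapping <- setNames(rep("Other", length(old_levels)), old_levels)
```

A vector like mapping could then be passed to revalue in a single call instead of four.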
Returning the df means the syntax is:
data <- revalue_factor_levels(data, "lg_with_children", "Mandarin", "Other")
I like functions that take the data as the first argument and return the modified data because they are pipeable.
library(dplyr)
data <- data %>%
revalue_factor_levels("lg_with_children", "Mandarin", "Other") %>%
revalue_factor_levels("lg_with_children", "Half and half", "Other") %>%
revalue_factor_levels("lg_with_children", "N/A", "Other")
Still, using forcats is easier and less prone to breaking on edge cases.
Edit:
There is nothing preventing you from both using forcats and creating your custom function. For example, this is closer to what you want to achieve:
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- forcats::fct_other(df[[col]], keep = ref_level)
df
}
# Will keep Shanghainese and revalue other levels to "Other".
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")
Here is what I ended up with thanks to help from the community.
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- fct_other(df[[col]], keep = ref_level)
df
}
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")

How to put a formula within a function in R?

I want to store a dplyr function/formula (e.g. filter(exercise=="Inadequate") or mutate(exercise="adequate")) in the variable_to_filter section of my function. I have lots of variables that need to go through this function. How can I do that? I know the code below doesn't work, but I hope you can see the logic in what I'm trying to do.
exercise_inadequate<-(exercise=="Inadequate")
variable_to_mutate<-(mutate(exercise="adequate"))
difference_pe<-function(percent, variable_to_filter, variable_to_mutate){
filtered <- dataset %>% filter(variable_to_filter)
sampled <- sample_frac(filtered, percent/100)
sampled <- sampled %>% mutate(variable_to_mutate)
}
difference_pe(100, exercise_inadequate, exercise_adequate)
I would prefer passing the column name and value separately to the function, because evaluating a string as a condition in a filter statement can be ugly.
library(dplyr)
library(rlang)
difference_pe<- function(dataset, percent, col, value) {
filtered <- dataset %>% filter({{col}} == value)
sampled <- sample_frac(filtered, percent/100)
return(sampled)
}
You can use this function as :
difference_pe(dataset, 100, exercise, "Inadequate")
If for some reason the above is not possible and you need to pass the condition as a string, we can use parse_expr, which is similar to eval(parse(text = ...)).
exercise_inadequate<- 'exercise=="Inadequate"'
difference_pe<- function(dataset, percent, variable_to_filter) {
filtered <- dataset %>% filter(eval(parse_expr(variable_to_filter)))
#filtered <- dataset %>% filter(eval(parse(text = variable_to_filter)))
sampled <- sample_frac(filtered, percent/100)
return(sampled)
}
difference_pe(dataset, 100, exercise_inadequate)
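The functions above cover the filter step only. A minimal sketch of how the mutate step from the question might be folded in is shown below. The function name difference_pe2, the new_value parameter and the toy dataset are assumptions for illustration. The column is passed unquoted and reused on the left-hand side through the "{{ col }}" := naming syntax, which needs a reasonably recent dplyr (1.0 or later).

```r
library(dplyr)

# Hypothetical extension: filter on col == value, sample, then overwrite col
difference_pe2 <- function(dataset, percent, col, value, new_value) {
  dataset %>%
    filter({{ col }} == value) %>%
    sample_frac(percent / 100) %>%
    mutate("{{ col }}" := new_value)
}

# Toy data (made up for illustration)
dataset <- data.frame(
  id = 1:4,
  exercise = c("Inadequate", "adequate", "Inadequate", "adequate"),
  stringsAsFactors = FALSE
)

out <- difference_pe2(dataset, 100, exercise, "Inadequate", "adequate")
```

With percent = 100 every filtered row is kept, although sample_frac may return them in a different order.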

use outside variable inside of rename() function in R

I'm new to R and have a problem
I am trying to reformat some data, and in the process I would like to rename the columns of the new data set.
here is how I have tried to do this:
first the .csv file is read in, lets say case1_case2.csv
then the name of the .csv file is broken up into two parts
each part is assigned to a vector
so it ends up being like this:
xName=case1
yName=case2
After I have put my data into new columns I would like to rename each column to be case1 and case2
to do this I tried using the rename function in R but instead of renaming to case1 and case2 the columns get renamed to xName and yName.
here is my code:
for ( n in 1:length(dirNames) ){
inFile <- read.csv(dirNames[n], header=TRUE, fileEncoding="UTF-8-BOM")
xName <- sub("_.*","",dirNames[n])
yName <- sub(".*[_]([^.]+)[.].*", "\\1", dirNames[n])
xValues <- inFile %>% select(which(str_detect(names(inFile), xName))) %>% stack() %>% rename( xName = values ) %>% subset( select = xName)
yValues <- inFile %>% select(which(!str_detect(names(inFile), xName))) %>% stack() %>% rename(yName = values, Organisms=ind)
finalForm <- cbind(xValues, yValues) %>% filter(complete.cases(.))
}
How can I make sure that the variables xName and yName are expanded inside the rename() function?
thanks.
You didn't provide a reproducible example, so I'll just demonstrate the idea in general. The rename function is part of the dplyr package.
You need to "unquote" the variable that contains the string you want to use as the new column name. The unquote operator is !! and you'll need to use the special := assignment operator to make unquoting on the left hand side allowed.
library(tidyverse)
df <- tibble(x = 1:3) # data_frame() is deprecated in favour of tibble()
y <- "Foo"
df %>% rename(y = x) # Not what you want - the column is named "y"; need to unquote y
# df %>% rename(!!y = x) # Gives error - need to use :=
df %>% rename(!!y := x) # Correct
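Mapped onto the pipeline from the question, the same idea might look like the sketch below. The toy inFile data frame and its column names are made up, and xName is set by hand rather than derived from a file name.

```r
library(dplyr)

# Stand-in for the name extracted from the file name
xName <- "case1"

# Toy stand-in for the data read from the .csv file
inFile <- data.frame(case1_a = 1:2, case1_b = 3:4)

# stack() produces the columns "values" and "ind";
# !!xName := values renames "values" to the string stored in xName
xValues <- inFile %>%
  stack() %>%
  rename(!!xName := values) %>%
  select(all_of(xName))
```

The same pattern would apply to yName in the second pipeline, e.g. rename(!!yName := values, Organisms = ind).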
