How to call function arguments inside tidyselect operators? - r

I am trying to create a function that relies on tidyselect operators. I am having trouble feeding non-string function arguments in there. I would appreciate any help trying to do this.
Here's an example of what I've tried to do using deparse(substitute(xvar)) to no avail.
library(tidyverse)
myfun <- function(xvar) {
  new_df <- mtcars |>
    select(starts_with(deparse(substitute(xvar))), qsec)
  return(new_df)
}
myfun(d) # intended: variables that start with d, plus qsec

To pass a bare name such as d (rather than the string "d") to the user-defined function, call deparse(substitute(xvar)) outside starts_with(), so that substitute() runs directly in the function body, where xvar is still the unevaluated argument:
myfun <- function(xvar) {
  xx <- deparse(substitute(xvar))
  mtcars |>
    select(starts_with(xx), qsec)
}
myfun(d)
#                disp drat  qsec
# Mazda RX4     160.0 3.90 16.46
# Mazda RX4 Wag 160.0 3.90 17.02
# Datsun 710    108.0 3.85 18.61
# ...
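As a possible alternative, here is a minimal sketch that captures the argument with rlang::ensym() and converts it to a string with rlang::as_name(). The function name myfun2 is made up for illustration, and the sketch assumes dplyr and rlang are installed. Because ensym() accepts either a bare name or a string, both calls below should select the same columns:

```r
library(dplyr)

# myfun2 is a hypothetical name used only for this sketch.
myfun2 <- function(xvar) {
  # Capture the argument as a symbol, then turn it into a plain string.
  prefix <- rlang::as_name(rlang::ensym(xvar))
  mtcars |>
    select(starts_with(prefix), qsec)
}

names(myfun2(d))    # bare name
names(myfun2("d"))  # string
```

Both calls return the columns disp, drat and qsec.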

Related

passing column names in functions as strings and evaluate with mutate

I see gobs of posts about passing column names as strings to a function, but none of them consider this use case, and none of the methods I have seen work. Here is one. Please compare the what_I_want column to the what_I_get column below. I want the values stored in the column, not the column name, of course. Thanks.
library(dplyr)
Fun <- function(df, column) {
  df %>%
    mutate(what_I_want = cyl) %>%
    # Current best practice? Doesn't work in this case.
    mutate(what_I_get := {{ column }})
}
mtcars[1:2, 1:3] %>% Fun("cyl")
#>               mpg cyl disp what_I_want what_I_get
#> Mazda RX4      21   6  160           6        cyl
#> Mazda RX4 Wag  21   6  160           6        cyl
Created on 2022-11-07 with reprex v2.0.2
Just wrap the column name in get():
Fun <- function(df, column) {
  df %>%
    mutate(what_I_want = get(column))
}
mtcars[1:2, 1:3] %>% Fun("cyl")
#>               mpg cyl disp what_I_want
#> Mazda RX4      21   6  160           6
#> Mazda RX4 Wag  21   6  160           6
We can use ensym(), which accepts both quoted and unquoted column names:
Fun <- function(df, column) {
  df %>%
    mutate(what_I_want = !!rlang::ensym(column))
}
Testing:
> mtcars[1:2, 1:3] %>% Fun("cyl")
              mpg cyl disp what_I_want
Mazda RX4      21   6  160           6
Mazda RX4 Wag  21   6  160           6
> mtcars[1:2, 1:3] %>% Fun(cyl)
              mpg cyl disp what_I_want
Mazda RX4      21   6  160           6
Mazda RX4 Wag  21   6  160           6
Using the .data pronoun you could do:
library(dplyr)
Fun <- function(df, column) {
  df %>%
    mutate(what_I_get = .data[[column]])
}
mtcars[1:2, 1:3] %>%
  Fun("cyl")
#>               mpg cyl disp what_I_get
#> Mazda RX4      21   6  160          6
#> Mazda RX4 Wag  21   6  160          6
For more on the .data pronoun see Data mask programming patterns.
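To make the difference between the approaches above concrete, here is a small sketch that puts the two common patterns side by side: {{ }} for a bare column name and .data[[ ]] for a column name held in a string. The function names add_copy_bare and add_copy_string and the column name copy are made up for this illustration.

```r
library(dplyr)

# Hypothetical helper: expects a bare column name, e.g. cyl
add_copy_bare <- function(df, column) {
  df %>% mutate(copy = {{ column }})
}

# Hypothetical helper: expects a column name as a string, e.g. "cyl"
add_copy_string <- function(df, column) {
  df %>% mutate(copy = .data[[column]])
}

small <- mtcars[1:2, 1:3]
bare_result   <- add_copy_bare(small, cyl)
string_result <- add_copy_string(small, "cyl")

# Both results hold the values of cyl, not the text "cyl".
identical(bare_result$copy, small$cyl)
identical(string_result$copy, small$cyl)
```

Passing a string to the {{ }} version reproduces the problem in the question: the string itself is recycled into the new column.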

How to rename a variable with spaces in the name dynamically in dplyr?

I want to rename a variable in my data frame using dplyr so that the new name contains a space, but the new name is a concatenation of a dynamic variable and a static string. In the following example, I need "Test1" to come from a variable:
df <- mtcars %>% select(`Test1 mpg` = "mpg")
So when I try this, I end up with an error:
var <- "Test1"
df <- mtcars %>% select(paste0(var, " mpg") = "mpg")
How could I go about making those new variable names dynamic?
Using the special assignment operator := you could do:
library(dplyr)
df <- mtcars %>% select(`Test1 mpg` = "mpg")
var <- "Test1"
mtcars %>%
  select("{var} mpg" := "mpg")
#>                Test1 mpg
#> Mazda RX4           21.0
#> Mazda RX4 Wag       21.0
#> Datsun 710          22.8
#> Hornet 4 Drive      21.4
or using !!sym() (note paste0(), so that only a single space separates the two parts of the name):
mtcars %>%
  select(!!sym(paste0(var, " mpg")) := "mpg")
#>                Test1 mpg
#> Mazda RX4           21.0
#> Mazda RX4 Wag       21.0
#> Datsun 710          22.8
#> Hornet 4 Drive      21.4
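Another way to build the name dynamically, sketched here as an alternative rather than as the method used in the answer above, is a named character vector passed to all_of(): the names are the new column names and the values are the existing ones. The object names lookup, renamed and only_new are made up for this example.

```r
library(dplyr)

var <- "Test1"

# Named vector: name = new column name, value = existing column name.
lookup <- "mpg"
names(lookup) <- paste(var, "mpg")  # paste() inserts the single space

# rename() keeps every column and only changes the matched name.
renamed <- mtcars %>% rename(all_of(lookup))

# Keep only the renamed column, mirroring the select() calls above.
only_new <- renamed %>% select(all_of(names(lookup)))
names(only_new)
```

This form scales to several columns at once, since lookup can hold any number of new-name/old-name pairs.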

Order columns from a list of pre-defined names and ignore column names which don't exist in the list

I want to order a data.table by using a set of predefined names available in a list.
For example:
library(data.table)
dt <- as.data.table(mtcars)
list_name <- c("mpg", "disp", "xyz")
# Order columns
setcolorder(dt, list_name) # requirement: if the "xyz" column doesn't exist, ignore it and use the rest
The use case is that multiple data.tables are being created, and all of them take their column names from one list of names. Some names may be missing from a given table, but each table still needs to be ordered according to the list.
output:
dt
disp wt mpg cyl hp drat qsec vs am gear carb
1: 160.0 2.620 21.0 6 110 3.90 16.46 0 1 4 4
2: 160.0 2.875 21.0 6 110 3.90 17.02 0 1 4 4
3: 108.0 2.320 22.8 4 93 3.85 18.61 1 1 4 1
One option is to put all of the data.tables in a list, loop over it with lapply(), and call setcolorder() on each one, using intersect() to keep only the names that actually exist in that table:
lst1 <- list(dt, dt)
lst1 <- lapply(lst1, function(x) setcolorder(x, intersect(list_name, names(x))))
If this needs to be reused, wrap it in a function:
f1 <- function(dat, nm1) {
  setcolorder(dat, intersect(nm1, names(dat)))
}
f1(dt, list_name)
f1(dt2, list_name) # dt2 stands for any other data.table to be ordered
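The reason intersect() works here is that it returns the elements of its first argument that also appear in the second, in the order of the first argument. A short sketch of that behaviour follows; the object names wanted, present and dt_demo are made up for the illustration, and the wanted order is chosen to differ from the original column order so the effect is visible.

```r
library(data.table)

dt_demo <- as.data.table(mtcars)
wanted  <- c("disp", "xyz", "mpg")   # "xyz" does not exist in mtcars

# Only the names that exist survive, in the order given by `wanted`.
present <- intersect(wanted, names(dt_demo))

# setcolorder() moves these columns to the front, by reference.
setcolorder(dt_demo, present)
names(dt_demo)[1:3]
```

Here present is c("disp", "mpg"), so the table starts with disp, mpg, cyl and the remaining columns keep their original relative order.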

Subset error when using a loop

I have been using this formula:
varlist <- c("A", "B")
for (i in c(1:2)) {
  print(varlist[i])
  print(summary(svyglm(as.formula(paste0(varlist[i], "~YEAR + REGION")),
                       design = subset(FEI.w, varlist[i] != "U"),
                       family = quasibinomial)))
}
I have more variables than just A and B, but I want to do a glm in the survey package using A and B as my dependent variable.
The problem I am running into is that when I subset the data to exclude unknown values in A and B, R doesn't do it and includes the whole data frame.
Any pointers as to why this is happening and how to fix this would be very much appreciated!
subset() uses non-standard evaluation, which means it takes the column names as unquoted variables, e.g.
subset(mtcars, mpg == 21)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
vs
subset(mtcars, "mpg" == 21)
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
Your varlist[i] != "U" compares the literal strings "A" and "U" and finds that they aren't equal.
You might be able to get around this with
eval(parse(text = varlist[i])) != "U"
i.e.
subset(mtcars, eval(parse(text="mpg")) == 21)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
but the old adage goes that if you're using eval(parse( then something has probably gone wrong.
svyglm has a subset parameter so you don't need to call subset on the design object. You should do the subsetting like this:
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
rstrat <- as.svrepdesign(dstrat)
for (type in unique(apistrat$stype)) {
  print(summary(svyglm(api00 ~ ell + meals + mobility,
                       design = rstrat,
                       subset = apistrat$stype == type)))
}
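The underlying issue, a column name held in a string being compared as a string, can be illustrated without the survey package. The sketch below uses a made-up data frame (toy) and plain [[ ]] indexing, which looks the column up by name instead of relying on non-standard evaluation. It only demonstrates the principle and is not a drop-in replacement for the survey code above.

```r
# Hypothetical toy data standing in for the real survey data.
toy <- data.frame(
  A    = c("Y", "N", "U", "Y"),
  B    = c("U", "N", "Y", "Y"),
  YEAR = 2001:2004
)

varlist <- c("A", "B")
kept_rows <- list()

for (v in varlist) {
  # toy[[v]] fetches the column named by the string in v,
  # so the comparison is against the column's values, not against "A" or "B".
  kept <- toy[toy[[v]] != "U", ]
  kept_rows[[v]] <- nrow(kept)
}

kept_rows
```

Each variable has exactly one "U" in this toy data, so three of the four rows are kept in each iteration.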

Adding new column with diff() function when there is one less row in R

If I have a sample data frame like mtcars, and I want to find the difference between mtcars$qsec for all rows, I can do diff(mtcars$qsec). But is there a simple way to make diff(mtcars$qsec) a new column in the original mtcars data frame? I'm finding it difficult because there's one less row in diff(mtcars$qsec) than the rest of mtcars.
> head(mtcars,3)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Here are two approaches. Both put an NA in the first row of diff_qsec and put diff(qsec) in the remaining rows:
library(dplyr)
mtcars %>% mutate(diff_qsec = qsec - lag(qsec)) # dplyr has its own version of lag
transform(mtcars, diff_qsec = c(NA, diff(qsec)))
Also, on the general issue of padding see: How can I pad a vector with NA from the front?
You could use the base function within() like so:
mtcars <- within(mtcars, difference <- c(NA,diff(qsec)))
This creates a column called "difference" with the first element NA and the rest calculated by diff(qsec).
You could create more columns at the same time by wrapping commands in {}, such as:
mtcars <- within(mtcars, {
  difference <- c(NA, diff(qsec))
  multiple <- qsec * 2
})
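Note that in the single-expression form you must use <- for the assignment and not =, because = would be parsed as a named argument to within() rather than as an assignment.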
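As a quick check of why the padding is needed, this sketch compares the lengths involved. It works on a copy of the column so that mtcars itself is left untouched; the object names qsec and padded are only for illustration.

```r
qsec <- mtcars$qsec

# diff() returns one fewer element than its input.
length(diff(qsec)) == nrow(mtcars) - 1

# Prepending NA restores the original length, so it fits as a column.
padded <- c(NA, diff(qsec))
length(padded) == nrow(mtcars)

# Row i holds qsec[i] - qsec[i - 1]; the first row has no predecessor.
is.na(padded[1])
all.equal(padded[2], qsec[2] - qsec[1])
```

All four expressions evaluate to TRUE.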
