Error: cannot join on columns: index out of bounds [duplicate] - r

I am trying to perform an inner join of two tables using dplyr, and I think I'm getting tripped up by non-standard evaluation rules. When using the by=c("a" = "b") argument, everything works as expected when "a" and "b" are actual strings. Here's a toy example that works:
library(dplyr)
data(iris)
inner_join(iris, iris, by=c("Sepal.Length" = "Sepal.Width"))
But let's say I was putting inner_join in a function:
library(dplyr)
data(iris)
myfn <- function(xname, yname) {
data(iris)
inner_join(iris, iris, by=c(xname = yname))
}
myfn("Sepal.Length", "Sepal.Width")
This returns the following error:
Error: cannot join on columns 'xname' x 'Sepal.Width': index out of bounds
I suspect there is some fancy expression, deparsing, quoting, or unquoting that I could do to make this work, but I'm a bit murky on those details.

You can use
myfn <- function(xname, yname) {
data(iris)
inner_join(iris, iris, by=setNames(yname, xname))
}
The suggested syntax in the ?inner_join documentation of
by = c("a"="b") # same as by = c(a="b")
is slightly misleading because those two values aren't both proper character values. You're actually creating a named character vector, and dynamically setting the names (to the left of the equals sign) works differently from setting the values (to the right). You can use setNames() to set the names of the vector dynamically.
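To make the difference concrete, here is a minimal base R sketch. The variable names by_literal, by_dynamic and by_structure are only illustrative:

```r
xname <- "Sepal.Length"
yname <- "Sepal.Width"

# c() takes the text left of "=" literally, so the name is "xname".
by_literal <- c(xname = yname)
names(by_literal)   # "xname"

# setNames() evaluates its second argument, so the name is the value of xname.
by_dynamic <- setNames(yname, xname)
names(by_dynamic)   # "Sepal.Length"

# structure() builds the same named vector.
by_structure <- structure(yname, names = xname)
identical(by_dynamic, by_structure)   # TRUE
```

Only the values on the right of the equals sign are evaluated by c(), which is why the join looks for a column literally called xname.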

I like MrFlick's answer and fber's addendum, but I prefer structure. To me, setNames feels like something that belongs at the end of a pipe, not an on-the-fly constructor. On another note, both setNames and structure allow variables to be used in the function call.
myfn <- function(xnames, ynames) {
data(iris)
inner_join(iris, iris, by = structure(names = xnames, .Data = ynames))
}
x <- "Sepal.Length"
myfn(x, "Sepal.Width")
A named vector argument would run into problems here, because c(x = ...) names the element literally "x" instead of using the value stored in x:
myfn <- function(byvars) {
data(iris)
inner_join(iris, iris, by = byvars)
}
x <- "Sepal.Length"
myfn(c(x = "Sepal.Width"))
You could solve that, though, by using setNames or structure in the function call.
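As a rough sketch of that fix on the calling side, the example below uses two small made-up data frames (left_tbl and right_tbl, with hypothetical columns) instead of iris so the result is easy to inspect. It assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical toy tables, not from the question.
left_tbl  <- data.frame(id_left  = c(1, 2, 3), value = c("a", "b", "c"))
right_tbl <- data.frame(id_right = c(2, 3, 4), score = c(10, 20, 30))

join_by_vars <- function(byvars) {
  inner_join(left_tbl, right_tbl, by = byvars)
}

x <- "id_left"
# setNames() uses the value of x as the name, which c(x = "id_right") would not.
res <- join_by_vars(setNames("id_right", x))
res
```

The same call works with structure("id_right", names = x).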

I know I'm late to the party, but how about:
myfn <- function(byvar) {
data(iris)
inner_join(iris, iris, by=byvar)
}
This way you can do what you want with:
myfn(c("Sepal.Length"="Sepal.Width"))

I faced a nearly identical challenge to @Peter's, but needed to pass multiple different sets of by = join parameters at one time. I chose to use the map() function from the tidyverse package purrr.
This is the subset of the tidyverse that I used.
library(magrittr)
library(dplyr)
library(rlang)
library(purrr)
First, I adapted myfn to use map() for the case posted by Peter. 42's comment and Felipe Gerard's answer made it clear that the by argument can take a named vector. map() requires a list over which to iterate.
myfn_2 <- function(xname, yname) {
by_names <- list(setNames(nm = xname, yname ))
data(iris)
# map() returns a single-element list. We index to retrieve dataframe.
map( .x = by_names,
.f = ~inner_join(x = iris,
y = iris,
by = .x)) %>%
`[[`(1)
}
myfn_2("Sepal.Length", "Sepal.Width")
I found that I didn't need quo_name() / !! in building the function.
Then, I adapted the function to take a list of by parameters. For each by_i in by_grps, we could extend x and y to add named values on which to join.
by_grps <- list( by_1 = list(x = c("Sepal.Length"), y = c("Sepal.Width")),
by_2 = list(x = c("Sepal.Width"), y = c("Petal.Width"))
)
myfn_3 <- function(by_grps_list, nm_dataset) {
by_named_vectors_list <- lapply(by_grps_list,
function(by_grp) setNames(object = by_grp$y,
nm = by_grp$x))
map(.x = by_named_vectors_list,
.f = ~inner_join(nm_dataset, nm_dataset, by = .x))
}
myfn_3(by_grps, iris)

Related

Function containing dataframe and variable using lapply

I have two dataframes and a function, which works when I use it on a single variable.
library(tidyverse)
iris1<-iris
iris2<-iris
iris_fn<-function(df,species_type){
df1<-df%>%
filter((Species==species_type))
return(df1)}
new_df<-iris_fn(df=iris1, species_type="setosa")
I want to pass a vector of variables to the function with the expected output being a list of dataframes (3), one filtered to each variable, for which I have been experimenting using lapply:
variables<-c("setosa","versicolor","virginica")
new_df<-lapply(df=iris1, species_type="setosa", FUN= iris_fn)
The error message is Error in is.vector(X) : argument "X" is missing, with no default, which I don't understand because I have stated the arguments of the function and the name of the function.
Can anyone suggest a solution to get the desired output? I essentially need a version of lapply or purrr function that will allow a dataframe and a vector as inputs.
lapply expects an argument called X as the main input. You could re-write it so that the function expects X instead of species_type e.g.
iris_fn <- function(df, X){
df1 <- df %>% filter((Species==X))
return(df1)
}
variables <- c("setosa", "versicolor", "virginica")
new_df <- lapply(X=variables, FUN=iris_fn, df=iris1)
EDIT:
Alternatively to avoid using X, you need the first argument of the function to match the lapply input e.g.
iris_fn <- function(species_type, df){
df1 <- df %>% filter((Species==species_type))
return(df1)
}
new_df <- lapply(variables, FUN=iris_fn, df=iris1)
Check out the split function for a convenient way to split a data.frame to a list e.g. split(iris, f=iris$Species)
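A short sketch of the split() suggestion, using only base R and the built-in iris data (by_species is just an illustrative name):

```r
# One data frame per level of Species, returned as a named list.
by_species <- split(iris, f = iris$Species)

names(by_species)          # the three species names
sapply(by_species, nrow)   # number of rows in each piece
head(by_species$setosa)    # the piece for a single species
```

This gives the list of per-species data frames directly, without writing a filtering function first.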
From ?lapply: lapply(X, FUN, ...). Because you named all your arguments, there is no X left that could be passed to the function as its first argument.
Try something like this:
library(dplyr)
iris1<-iris
# note the changed argument order
iris_fn<-function(species_type, df){
df1<-df%>%
filter((Species==species_type))
return(df1)}
variables<-c("setosa","versicolor","virginica")
new_df_list <-lapply(variables, iris_fn, df=iris1 )
Or with just an anonymous function:
new_df_list <-lapply(variables, \(x) filter(iris1, Species == x))
As you already use Tidyverse, perhaps with purrr::map() instead:
library(purrr)
new_df_list <- map(variables, ~ filter(iris1, Species == .x))
Created on 2022-11-14 with reprex v2.0.2

How to find object name passed to function

I have a function which takes a dataframe and its columns and processes it in various ways (left out for simplicity). We can put in column names as arguments or transform columns directly inside function arguments (like here). I need to find out what object(s) are passed in the function.
Reproducible example:
df <- data.frame(x= 1:10, y=1:10)
myfun <- function(data, col){
col_new <- eval(substitute(col), data)
# magic part
object_name <- ...
# magic part
plot(col_new, main= object_name)
}
For instance, the expected output for myfun(data= df, x*x) is the plot plot(df$x*df$x, main= "x"). So the title is x, not x*x. What I have got so far is this:
myfun <- function(data, col){
colname <- tryCatch({eval(substitute(col))}, error= function(e) {geterrmessage()})
colname <- gsub("' not found", "", gsub("object '", "", colname))
plot(eval(substitute(col), data), main= colname)
}
This function gives the expected output, but there must be a more elegant way to find out which object the input refers to. The answer must use base R only.
Use substitute to get the expression passed as col and then use eval and all.vars to get the values and name.
myfun <- function(data, col){
s <- substitute(col)
plot(eval(s, data), main = all.vars(s), type = "o", ylab = "")
}
myfun(df, x * x)
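To see what substitute() and all.vars() return without drawing a plot, here is a minimal sketch. inspect_col is a hypothetical helper name, not part of the answer above:

```r
# Hypothetical helper: returns the pieces myfun() uses instead of plotting them.
inspect_col <- function(data, col) {
  s <- substitute(col)        # the unevaluated expression, e.g. x * x
  list(
    expr   = deparse(s),      # the expression as text
    vars   = all.vars(s),     # names of the variables used in the expression
    values = eval(s, data)    # the expression evaluated inside the data frame
  )
}

df <- data.frame(x = 1:10, y = 1:10)
res <- inspect_col(df, x * x)
res$expr   # "x * x"
res$vars   # "x"
```

all.vars() drops function names such as `*` and returns each variable once, which is why the title comes out as x rather than x * x.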
Another possibility is to pass a one-sided formula.
myfun2 <- function(formula, data){
plot(eval(formula[[2]], data), main = all.vars(formula), type = "o", ylab = "")
}
myfun2(~ x * x, df)
The rlang package can be very powerful once you get the hang of it. Does something like this do what you want?
library(rlang)
myfun <- function (data, col){
.col <- enexpr(col)
unname(sapply(call_args(.col), as_string))
}
This gives you back the "wt" column name.
myfun(mtcars, as.factor(wt))
# [1] "wt"
I am not sure of your use case, but this would also work for multiple inputs.
myfun(mtcars, sum(x, y))
# [1] "x" "y"
And finally, it is possible you might not even need to do this, but rather store the expression and operate directly on the data. The tidyeval framework can help with that as well.
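One way this could look, as a sketch that assumes rlang is installed. eval_in_data is a hypothetical name chosen for illustration:

```r
library(rlang)

# Hypothetical helper: capture the expression, then evaluate it in the data.
eval_in_data <- function(data, col) {
  col_quo <- enquo(col)             # quosure: the expression plus its environment
  eval_tidy(col_quo, data = data)   # columns of `data` are looked up first
}

head(eval_in_data(mtcars, wt * 2))
```

Here the expression is never turned into a string. It is stored and evaluated against the data frame when needed.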

Convert list of symbols to character string in own function

I have the following data frame:
dat <- data.frame(time = runif(20),
group1 = rep(1:2, times = 10),
group2 = rep(1:2, each = 10),
group3 = rep(3:4, each = 10))
I'm now writing a function my_function that takes the following form:
my_function(data, time_var = time, group_vars = c(group1, group2))
If I'm not mistaken, I'm passing the group_vars as symbols to my function, right?
However, within my function I want to first do some error checks if the variables passed to the function exist in the data. For the time variable I was successful, but I don't know how I can turn my group_vars list into a vector of strings so that it looks like c("group1", "group2").
My current function looks like:
my_function <- function (data, time_var = NULL, group_vars = NULL)
{
time_var <- enquo(time_var)
time_var_string <- as_label(time_var)
group_vars <- enquos(group_vars)
# is "time" variable part of the dataset?
if (!time_var_string %in% colnames(data))
{
stop(paste0("The variable '", time_var_string, "' doesn't exist in your data set. Please check for typos."))
}
}
And I want to extend the latter part so that I can also do some checks in the form of !group_vars %in% colnames(data). I know I could pass the group_var variables already as a vector of strings to the function, but I don't want to do that for other reasons.
enquos is the wrong function here: it operates on multiple arguments, but you’re only passing a single argument. Just use enquo. However, either way the result isn’t directly usable, because you don’t get a vector of unevaluated names — you get an unevaluated c call.
Working with this is a bit more convoluted, I’m afraid:
group_vars_expr = quo_squash(group_vars)
group_var_names = if (is_symbol(group_vars_expr)) {
as_name(group_vars_expr)
} else {
stopifnot(is_call(group_vars_expr))
stopifnot(identical(group_vars_expr[[1L]], sym('c')))
stopifnot(all(purrr::map_lgl(group_vars_expr[-1L], is_symbol)))
purrr::map_chr(group_vars_expr[-1L], as_name)
}
stopifnot(all(group_var_names %in% colnames(data)))
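For context, the snippet above could be wrapped into a complete function along these lines. This is only a sketch assuming rlang is installed: check_group_vars is a hypothetical name, and base vapply() is used in place of purrr to keep the dependencies small:

```r
library(rlang)

# Hypothetical helper: returns the group variable names as strings,
# or stops if any of them is not a column of `data`.
check_group_vars <- function(data, group_vars) {
  expr <- quo_squash(enquo(group_vars))
  vars <- if (is_symbol(expr)) {
    as_name(expr)                          # a single bare name
  } else {
    stopifnot(is_call(expr, "c"))          # only c(...) is accepted
    args <- as.list(expr)[-1L]             # drop the `c` itself
    stopifnot(all(vapply(args, is_symbol, logical(1))))
    unname(vapply(args, as_name, character(1)))
  }
  missing_vars <- setdiff(vars, colnames(data))
  if (length(missing_vars) > 0) {
    stop("Not found in data: ", paste(missing_vars, collapse = ", "))
  }
  vars
}

dat <- data.frame(time = runif(20),
                  group1 = rep(1:2, times = 10),
                  group2 = rep(1:2, each = 10),
                  group3 = rep(3:4, each = 10))
check_group_vars(dat, c(group1, group2))
```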
If you'd like to use c() in this way, chances are you need selections. One easy way to take selections in an argument is to interface with dplyr::select():
my_function <- function(data, group_vars = NULL) {
group_vars <- names(dplyr::select(data, {{ group_vars }}))
group_vars
}
mtcars %>% my_function(c(cyl, mpg))
#> [1] "cyl" "mpg"
mtcars %>% my_function(starts_with("d"))
#> [1] "disp" "drat"

Pass generic column names to xtabs function in R

Is there any way to pass generic column names to functions like xtabs in R?
Typically, I'm trying to do something like:
xtabs(weight ~ col, data=dframe)
with col and weight two columns of my data.frame, weight being a column containing weights. It works, but if I want to wrap xtabs in a function to which I pass the column names as argument, it fails. So, if I do:
xtabs.wrapper <- function(dframe, colname, weightname) {
return(xtabs(weightname ~ colname, data=dframe))
}
it fails. Is there a simple way to do something similar? Perhaps I'm missing something with R logic, but it seems to me quite annoying not to be able to pass generic variables to such functions since I'm not particularly fond of copy/paste.
Any help or comments appreciated!
Edit: as mentioned in the comments, it was suggested that I use eval, and I came up with this solution:
xtabs.wrapper <- function(dframe, wname, cname) {
xt <- eval(parse(text=paste("xtabs(", wname, "~", cname, ", data=",
deparse(substitute(dframe)), ")")))
return(xt)
}
As I said, it seems to me to be an ugly trick, but I'm probably missing something about the language logic.
Not sure if this is any prettier, but here is a way to define a function without using eval ... it involves accessing the correct columns of dframe via []:
xtabs.wrapper <- function(dframe, wname, cname) {
tmp.wt <- dframe[,wname]
tmp.col <- dframe[,cname]
xt <- xtabs(tmp.wt~tmp.col)
return(xt)
}
Or you can shorten the guts of the function to:
xtabs.wrapper2 <- function(dframe, wname, cname) {
xt <- xtabs(dframe[,wname]~dframe[,cname])
return(xt)
}
To show they are equivalent here with an example from the mtcars data:
data(mtcars)
xtabs(wt~cyl, mtcars)
xtabs.wrapper(mtcars, "wt", "cyl")
xtabs.wrapper2(mtcars, "wt", "cyl")
I did this once:
creatextab<-function(factorsToUse, data)
{
newform<-as.formula(paste("Freq ~", paste(factorsToUse, collapse="+"), sep=""))
xtabs(formula= newform, drop.unused.levels = TRUE, data=data)
}
Obviously this is a different form because of the Freq, but basically you can generate the formula as a string and then just use xtabs() directly.
If you want an n-way crosstab and cname contains a character vector of variable names, then you'll want the following:
xtabs.wrapper3 <- function(dframe, wname, cname) {
# build e.g. "wt ~ cyl + vs" and convert the string to a formula
formula <- as.formula(paste0(wname, " ~ ", paste0(cname, collapse=" + ")))
xt <- xtabs(formula, data = dframe)
return(xt)
}
xtabs.wrapper3(mtcars, "wt", c("cyl", "vs"))
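As an alternative to pasting the formula together by hand, base R's reformulate() builds the formula from character vectors. The wrapper below is a sketch, and xtabs_reformulate is a hypothetical name:

```r
# Hypothetical wrapper: build the formula with reformulate() instead of paste().
xtabs_reformulate <- function(dframe, wname, cnames) {
  f <- reformulate(termlabels = cnames, response = wname)  # e.g. wt ~ cyl + vs
  xtabs(f, data = dframe)
}

one_way <- xtabs_reformulate(mtcars, "wt", "cyl")
two_way <- xtabs_reformulate(mtcars, "wt", c("cyl", "vs"))
one_way
```

The totals should match xtabs(wt ~ cyl, mtcars), since the formula that reaches xtabs() is the same.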

character string as function argument r

I'm working with dplyr and created code to compute new data that is plotted with ggplot.
I want to create a function with this code. It should take the name of a column of the data frame that is manipulated by dplyr. However, trying to work with column names does not work. Please consider the minimal example below:
df <- data.frame(A = seq(-5, 5, 1), B = seq(0,10,1))
library(dplyr)
foo <- function (x) {
df %>%
filter(x < 1)
}
foo(B)
Error in filter_impl(.data, dots(...), environment()) :
object 'B' not found
Is there any solution to use the name of a column as a function argument?
If you want to create a function which accepts the string "B" as an argument (as in your question's title)
foo_string <- function (x) {
eval(substitute(df %>% filter(xx < 1),list(xx=as.name(x))))
}
foo_string("B")
If you want to create a function which captures B as an argument (as in dplyr)
foo_nse <- function (x) {
# capture the argument without evaluating it
x <- substitute(x)
eval(substitute(df %>% filter(xx < 1),list(xx=x)))
}
foo_nse(B)
You can find more information in Advanced R
Edit
dplyr makes things easier in version 0.3. Functions with the suffix "_" accept a string or an expression as an argument
foo_string <- function (x) {
# construct the string
string <- paste(x,"< 1")
# use filter_ instead of filter
df %>% filter_(string)
}
foo_string("B")
foo_nse <- function (x) {
# capture the argument without evaluating it
x <- substitute(x)
# construct the expression
expression <- lazyeval::interp(quote(xx < 1), xx = x)
# use filter_ instead of filter
df %>% filter_(expression)
}
foo_nse(B)
You can find more information in this vignette
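Note that in later dplyr releases the underscore verbs such as filter_() were deprecated in favour of tidy evaluation. A sketch of the same two functions in that style follows, assuming a reasonably recent dplyr. Passing the data frame in as a data argument is a choice made for this sketch, not something from the answer above:

```r
library(dplyr)

df <- data.frame(A = seq(-5, 5, 1), B = seq(0, 10, 1))

# Column name supplied as a string: look it up via the .data pronoun.
foo_string <- function(data, x) {
  data %>% filter(.data[[x]] < 1)
}

# Column supplied as a bare name: forward it with the embrace operator {{ }}.
foo_nse <- function(data, x) {
  data %>% filter({{ x }} < 1)
}

foo_string(df, "B")
foo_nse(df, B)
```

Both calls should return the single row where B is 0.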
I remember a similar question which was answered by @Richard Scriven. I think you need to write something like this.
foo <- function(x,...)filter(x,...)
What @Richard Scriven mentioned was that you need to use ... here. If you type ?filter, you will find this: filter(.data, ...). I think you can replace .data with x or whatever name you like. If you want to pick up rows which have values smaller than 1 in B in your df, it will look like this.
foo <- function (x,...) filter(x,...)
foo(df, B < 1)
