Difficulties using `with` inside a function - r

I am trying to understand how to pass a data frame to an R function. I found an answer to this question on StackOverflow that provides the following demonstration / solution:
Pass a data.frame column name to a function
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
This makes sense to me, but I don't quit understand the rules for calling data frames within a function. Take the following example:
data(iris)
x.test <- function(df, x){
out <- with(df, mean(x))
return(out)
}
x.test(iris, "Sepal.Length")
The output of this is NA, with a warning message. But, if I do the same procedure without the function it seems to work just fine.
with(iris, mean(Sepal.Length))
I'm obviously missing something here -- any help would be greatly appreciated.
Thanks!

You have been given the correct advice already (which was to use "[" or "[[" rather than with inside functions) but it might also be helpful to ponder why the problem occurred. Inside the with you asked the mean function to return the mean of a character vector, so NA was the result. When you used with at the interactive level, you had no quotes around the character name of the column and if you had you would have gotten the same result:
> with(iris, mean('Sepal.Length'))
[1] NA
Warning message:
In mean.default("Sepal.Length") :
argument is not numeric or logical: returning NA
If you had used the R get mechanism for "promoting" a character object to return the result of a named object you would actually have succeeded, although with is still generally not recommended for programming use:
x.test <- function(df, x){
out <- with(df, mean( get(x)) ) # get() retrieves the named object from the workspace
return(out)
}
x.test(iris, "Sepal.Length")
#[1] 5.843333
See the Details section of the ?with page for warnings about its use in functions.

This will work
data(iris)
x.test <- function(df, x){
out <- mean(df[, x])
return(out)
}
x.test(iris, "Sepal.Length")
Your code is trying to take mean("Sepal.Length") which is clearly not what you want.

Related

How does the assignment of colnames work "behind the scenes"?

(I hope that this question hasn't been asked before).
For convenience I am using abbreviations for functions like "cn" instead of "colnames". However, for colnames/rownames the abbreviated functions only work for reading purposes. I am not able to set colnames with that new "cn" function. Can anyone explain the black magic behind the colnames function? This is the example:
cn <- match.fun(colnames)
x <- matrix(1:2)
colnames(x) <- "a" # OK, works.
cn(x) <- "b" # Error in cn(x) <- "b" : could not find function "cn<-"
Thank you, echasnovski, for the link to that great website.
It has helped me a lot to better understand R!
http://adv-r.had.co.nz/Functions.html#replacement-functions
In R, special "replacement functions" like foo<- can be defined. E.g. we can define a function
`setSecondElement<-` <- function(x, value){
x[2] <- value
return(x)
}
# Let's try it:
x <- 1:3
setSecondElement(x) <- 100
print(x)
# [1] 1 100 3
The colnames<- function works essentially the same. However, "behind the scenes" it will check if x is a data.frame or matrix and set either names(x) or dimnames(x)[[2]]. Just execute the following line in R and you'll see the underlying routine.
print( `colnames<-` )
For my specific problem the solution turns out to be very simple. Remember that I'd like to have a shorter version of colnames which shall be called cn. I can either do it like this:
cn <- match.fun(colnames);
`cn<-` <- function(x, value){
colnames(x) <- value
return(x)
}
More easily, as Stéphane Laurent points out, the definition of `cn<-` can be simplified to:
`cn<-` <- `colnames<-`
There is a minor difference between these approaches. The first approach will define a new function, which calls the colnames<- function. The second approach will copy the reference from the colnames<- function and make exactly the same function call even if you use cn<-. This approach is more efficient, since 1 additinal function call will be avoided.

R custom function to apply to all variables in a dataframe

I am trying to create a custom function that would, applied within a loop, give me a table with all the informations I need for all the variables of my table. My function is based on dplyr functions and base.
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
My problem is that the base function (names()) requires the y argument (the variable name) to be given with quotation marks, but the dplyr function n_distinct needs to be simply so without quotation marks to give the right answer with na.rm=TRUE (if I use n_distinct(x[y], na.rm=TRUE) it doesn't give me a result without NA values). So I don't know how to find a solution to have the good form of the y argument to pass in both functions. I've tried using \" for the names() function, but it didn't seemed to work. Here the errors I obtain:
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, "cyl")
Error: Error in summarise_impl(.data, dots) : variable 'y' not found
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, cyl)
Error: Error in summarise_impl(.data, dots) : Evaluation error: object 'cyl' not found.
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(x[y])), blank=n()-sum(!is.na(x[y])), distinct=n_distinct(x[y], na.rm=TRUE))
myfun(mtcars, "cyl")
No error, but na.rm=TRUE doesn't seem to be seen.
My goal would then be apple with some loop to make a table with one row for each variable of my dataframe that I could then export to have these informations for all the variables in just one table.
I tried to make a minimal reproducible example:
library(dplyr)
myfun <- function(x, y) summarise(x, var=names(x[, y]), n=sum(!is.na(x[, y])), blank=n()-sum(!is.na(x[, y])), n_distinct=n_distinct(x[, y], na.rm=TRUE))
a <- mtcars%>%
summarise(n=sum(!is.na(cyl)), blank=n()-sum(!is.na(cyl)), n_distinct=n_distinct(cyl, na.rm=TRUE))
a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x))))
a <- data.frame(bind_rows(a, myfun(mtcars, "cyl")))
a <- a%>%
filter(!is.na(var))%>%
distinct(var, .keep_all=TRUE)
But for some incomprehensible reason (at least for me) it doesn't work (line a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x)))), error message Error in summarise_impl(.data, dots) : Columnvaris of unsupported type NULL). It works fine with my dataframe, I subsetted it and it still worked fine, I manually created the same again by writting from hand all the same values in the same class, it didn't work... So I'm really lost, don't understand why it works for my dataset but no other, and because I'm new in R and just learn that by trying, without having lectures about this language code, I sometimes have no idea what I'm really doing but it works (like this code above for me), and then no more...
So this code works for me pretty good, there is just the problem as said that because I use n_distinct(x[, y]) it ignores na.rm=TRUE, what I cannot understand.
Sorry for the rather uncomprehensive question I asked I think, I would be glad to edit it if you leaves comment about how to clarify it. I'm simply totally lost with my try and have no idea how to present things in a clearer way. Thanks for the help and sorry for the mess
I'm not entirely clear on what on exactly what you are trying to do, but this might get at it.
First create a function that will be run for each column.
fn <- function(x){
a = levels(x)
n = n=sum(!is.na(x))
blank = length(x) - sum(!is.na(x))
dist = length(unique(x))
c(column = a, n=n, blank=blank, distinct=dist )
}
Then use apply to apply the function to each column of the data.frame. I've transposed it to provide rows.
t(apply(mtcars, 2, fn))

Print.myclass function in R

I am trying to develop my first package in R and I am facing some issues with "myclass" generic functions that i will try to describe.
Assume a data.frame X with n <- nrow(X) rows and K <- ncol(X) columns.
My main package function (too big to put it in this post) lets say
fun1 <- function(X){
# do staff...
out <- list(index= character vector, A= A, B= B,... etc)
return(out)
class(out) <- "myclass"
}
returns as an output a list. Then I have to use the output for the generic print method in a print.myclass function. However, in my print function I want to use the data frame X used in my main function without asking the user to provide it in an argument (i.e, print(out,X)) and without having it in my output list out (visible to the user at least). Is there any way to do that? Thanks in advance!

cannot combine with and functions in R

Could somebody please point out to me why is that the following example does not work:
df <- data.frame(ex =rep(1,5))
sample.fn <- function(var1) with(df, mean(var1))
sample.fn(ex)
It seems that I am using the wrong syntax to combine with inside of a function.
Thanks,
This is what I meant by learning to use "[" (actually"[["):
> df <- data.frame(ex =rep(1,5))
> sample.fn <- function(var1) mean(df[[var1]])
> sample.fn('ex')
[1] 1
You cannot use an unquoted ex since there is no object named 'ex', at least not at the global environment where you are making the call to sample.fn. 'ex' is a name only inside the environment of the df-dataframe and only df itself is "visible" when the sample.fn-function is called.
Out of interest, I tried using the method that the with.default function uses to build a function taking an unquoted expression argument in the manner you were expecting:
samp.fn <- function(expr) mean(
eval(substitute(expr), df, enclos = parent.frame())
)
samp.fn(ex)
#[1] 1
It's not a very useful function, since it would only be applicable when there was a dataframe named 'df' in the parent.frame(). And apologies for incorrectly claiming that there was a warning on the help page. As #rawr points out the warning about using functions that depend on non-standard evaluation appears on the subset page.

R sapply is.factor

I'm trying to separate a dataset into parts that have factor variables and non-factor variables.
I'm looking to do something like:
This part works:
factorCols <- sapply(df1, is.factor)
factorDf <- df1[,factorCols]
This part won't work:
nonFactorCols <- sapply(df1, !is.factor)
due to this error:
Error in !is.factor : invalid argument type
Is there a correct way to do this?
Correct way:
nonFactorCols <- sapply(df1, function(col) !is.factor(col))
# or, more efficiently
nonFactorCols <- !sapply(df1, is.factor)
# or, even more efficiently
nonFactorCols <- !factorCols
Joshua gave you the correct way to do it. As for why sapply(df1, !is.factor) did not work:
sapply is expecting a function. !is.factor is not a function. The bang operator returns a logical value (albeit, it cannot take is.factor as an argument).
Alternatively, you could use Negate(is.factor) which does in fact return a function.

Resources