R custom function to apply to all variables in a dataframe

I am trying to create a custom function that, applied within a loop, would give me a table with all the information I need for every variable in my table. My function is based on dplyr and base functions.
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
My problem is that the base function names() requires the y argument (the variable name) to be given in quotation marks, whereas the dplyr function n_distinct() needs it without quotation marks to give the right answer with na.rm=TRUE (if I use n_distinct(x[y], na.rm=TRUE), the NA values are not excluded from the result). So I don't know how to write the y argument in a form that works in both functions. I've tried using \" for the names() function, but it didn't seem to work. Here are the errors I get:
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, "cyl")
Error: Error in summarise_impl(.data, dots) : variable 'y' not found
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, cyl)
Error: Error in summarise_impl(.data, dots) : Evaluation error: object 'cyl' not found.
myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(x[y])), blank=n()-sum(!is.na(x[y])), distinct=n_distinct(x[y], na.rm=TRUE))
myfun(mtcars, "cyl")
No error, but na.rm=TRUE seems to be ignored.
My goal is then to apply it in a loop to build a table with one row per variable of my dataframe, which I could then export to have this information for all the variables in a single table.
I tried to make a minimal reproducible example:
library(dplyr)
myfun <- function(x, y) summarise(x, var=names(x[, y]), n=sum(!is.na(x[, y])), blank=n()-sum(!is.na(x[, y])), n_distinct=n_distinct(x[, y], na.rm=TRUE))
a <- mtcars%>%
summarise(n=sum(!is.na(cyl)), blank=n()-sum(!is.na(cyl)), n_distinct=n_distinct(cyl, na.rm=TRUE))
a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x))))
a <- data.frame(bind_rows(a, myfun(mtcars, "cyl")))
a <- a%>%
filter(!is.na(var))%>%
distinct(var, .keep_all=TRUE)
But for some reason I cannot understand, it doesn't work: the line a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x)))) fails with the error message Error in summarise_impl(.data, dots) : Column `var` is of unsupported type NULL. It works fine with my own dataframe, and it still worked after I subsetted it, but when I recreated the same data by hand, typing the same values with the same classes, it didn't work. So I'm really lost and don't understand why it works for my dataset but for no other. I'm new to R and am learning by trial and error, without any formal course on the language, so I sometimes have no idea what I'm really doing when something works (like the code above), and then it stops working.
So this code works pretty well for me; the only problem, as I said, is that because I use n_distinct(x[, y]) it ignores na.rm=TRUE, which I cannot understand.
Sorry for the rather unclear question. I would be glad to edit it if you leave comments on how to clarify it. I'm simply lost after my attempts and have no idea how to present things more clearly. Thanks for the help and sorry for the mess.

I'm not entirely clear on exactly what you are trying to do, but this might get at it.
First create a function that will be run for each column.
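fn <- function(x){
  # apply() passes only the column's values, not its name;
  # the column names come back as row names after transposing
  n <- sum(!is.na(x))                    # non-missing values
  blank <- length(x) - n                 # missing values
  dist <- length(unique(x[!is.na(x)]))   # distinct non-missing values
  c(n=n, blank=blank, distinct=dist)
}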
Then use apply to apply the function to each column of the data.frame. I've transposed it to provide rows.
t(apply(mtcars, 2, fn))
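The same table can also be built by looping over column names, which keeps the variable name as a regular column. What follows is only a minimal sketch in base R: the helper name describe_column and the small data frame toy are made up for illustration. It extracts each column with [[ so that the NA handling operates on a plain vector rather than on a one-column data frame.

```r
# Hypothetical helper: one-row summary of a single column, selected by name
describe_column <- function(df, col) {
  x <- df[[col]]                               # plain vector, not a data frame
  data.frame(
    var      = col,
    n        = sum(!is.na(x)),                 # non-missing values
    blank    = sum(is.na(x)),                  # missing values
    distinct = length(unique(x[!is.na(x)])),   # distinct non-missing values
    stringsAsFactors = FALSE
  )
}

# Made-up example data with some missing values
toy <- data.frame(a = c(1, 1, NA, 2), b = c("u", NA, NA, "v"))

# One row per variable, stacked into a single table
summary_table <- do.call(
  rbind,
  lapply(names(toy), function(col) describe_column(toy, col))
)
summary_table
```

Replacing toy with mtcars gives one row for each of its columns.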

Related

Truly understanding lapply et al

I have been using R for a long time and am very happy using the map-family of functions as well as rowwise. I really just don't get the apply-family, even after reading many a tutorial. Right now it's very much up to chance if I get any apply function to work, and if I do, I'm not sure why it did in that case. Could anyone give an intuitive explanation of the syntax? E.g. why does the code below fail?
stupid_function = function(x,y){
a = sum(x,y)
b = max(x,y)
return(list(MySum=a,MyMax=b))
}
mtcars %>%
rowwise() %>%
mutate(using_rowwise = list(stupid_function(vs, am))) %>%
unnest_wider(using_rowwise)
mtcars %>%
mutate(using_map = pmap(list(vs,am),stupid_function)) %>%
unnest_wider(using_map)
mtcars %>%
mutate(using_lapply = lapply(list(vs,am), stupid_function))
Using rowwise and pmap I get what I want/expect. But the last line yields the following error:
Error: Problem with `mutate()` input `using_lapply`.
x argument "y" is missing, with no default
i Input `using_lapply` is `lapply(list(vs, am), stupid_function)`.
Run `rlang::last_error()` to see where the error occurred.
The lapply() function has the following usage (from ?lapply).
lapply(X, FUN, ...)
The X argument is a list, vector or data.frame: something with elements. The FUN argument is a function. lapply then applies FUN to each element of X and returns the outputs in a list. The first element of this list is FUN(X[[1]]) and the second is FUN(X[[2]]).
In your example, lapply(list(vs,am), stupid_function), lapply is trying to apply stupid_function to vs and then to am. However, stupid_function appears to require two arguments. This is where the ... comes in. You pass additional arguments to FUN here. You just need to name them correctly. So, in your case, you would use lapply(vs, stupid_function, y = am).
However, this isn't really what you want either. This will use all of am as the second argument on every call and will not iterate over am. lapply only iterates over one variable, not two. You want to use a map function for this, or you need to do something like the following:
lapply(seq_len(nrow(mtcars)), function(i) stupid_function(mtcars$vs[i], mtcars$am[i]))
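To make the difference concrete, here is a small base R sketch contrasting lapply with Map, which walks along several vectors in parallel. The function two_stats is a hypothetical stand-in that mirrors stupid_function from the question.

```r
# Hypothetical two-argument function, mirroring the one in the question
two_stats <- function(x, y) list(MySum = sum(x, y), MyMax = max(x, y))

# lapply() iterates over vs only; the whole of am is passed as y each time
whole_am <- lapply(mtcars$vs, two_stats, y = mtcars$am)

# Map() walks along vs and am in parallel, one pair per row
pairwise <- Map(two_stats, mtcars$vs, mtcars$am)

whole_am[[1]]$MySum   # vs[1] plus the sum of all of am
pairwise[[1]]$MySum   # vs[1] + am[1]
```

Map returns a list with one element per row, which is the same shape that pmap produces in the question.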

How can I create a function using variables in a dataframe

I'm sure the question is a bit dumb (sorry)... I'm trying to create a function using different variables I have stored in a dataframe. The function looks like this:
mlr_turb <- function(Cond_in, Flow_in, pH_in, pH_out, Turb_in, nm250_i, nm400_i, nm250_o, nm400_o){
Coag = (+0.032690 + 0.090289*Cond_in + 0.003229*Flow_in - 0.021980*pH_in - 0.037486*pH_out
+0.016031*Turb_in -0.026006*nm250_i +0.093138*nm400_o - 0.397858*nm250_o - 0.109392*nm400_o)/0.167304
return(Coag)
}
m4_turb <- mlr_turb(dataset)
The problem arises when I try to run my function on a dataframe (with the same variable names). It doesn't detect my variables and shows this message:
Error in mlr_turb(dataset) :
argument "Flow_in" is missing, with no default
But the variable is actually there, along with all the others.
I think I have misplaced or omitted something in the function that would let it take the variables from the dataset. I have searched a lot but have not found an answer...
No dumb questions!
I think you're looking for do.call. This function allows you to unpack values into a function as arguments. Here's a really simple example.
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
# a simple data frame with columns x, y and z
myData <- data.frame(x=1:5,
y=(1:5)*pi,
z=(11:15))
# unpack the values into the function using do.call
do.call('myFun', myData)
Output:
[1] 0.3765084 0.6902654 0.9557522 1.1833122 1.3805309
You have run into a standard problem when writing R, related to standard evaluation (SE) versus non-standard evaluation (NSE). If you need more detail, you can have a look at this blog post I wrote.
I think the most convenient way to write functions that use variables is to pass the variable names as arguments of the function.
Let's take @Muon's example again.
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
The question is where R should find the values behind the names x, y and z. In a function, R first looks in the function's own environment (here x, y and z are defined as parameters), then in the environment where the function was defined (here the global environment), and then in the attached packages.
In myFun, R expects vectors. If you give a column name instead, you will get an error. What if you want to give a column name? You must tell R that the name you gave refers to a value within a dataframe. You can, for instance, do something like this:
myFun <- function(df, col1 = "x", col2 = "y", col3 = "z"){
result <- (df[,col1] + df[,col2])/df[,col3]
return(result)
}
You can go much further in this direction with the data.table package. If you start writing functions that need to use variables from a dataframe, I recommend having a look at this package.
I like Muon's answer, but I couldn't get it to work if there are columns in the data.frame not in the function. Using the with() function is a simple way to make this work as well...
#Code from Muon:
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
# a simple data frame with columns x, y and z
myData <- data.frame(x=1:5,
y=(1:5)*pi,
z=(11:15),
a=6:10) #adding a var not used in myFun
# unpack the values into the function using do.call
do.call('myFun', myData)
#generates an error for the unused "a" column
#using with() function:
with(myData, myFun(x, y, z))
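Another option, sketched here as an assumption rather than something proposed in the answers above, is to keep do.call but hand it only the columns the function actually accepts. formals() returns a function's argument list, so its names can be used to select the matching columns.

```r
myFun <- function(x, y, z) (x + y) / z

myData <- data.frame(x = 1:5, y = (1:5) * pi, z = 11:15, a = 6:10)

# Keep only the columns whose names match the function's arguments
wanted <- names(formals(myFun))        # "x" "y" "z"
res <- do.call(myFun, myData[wanted])  # the extra column a is never passed
res
```

This relies on every argument of the function having a column of the same name in the data frame.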

Difficulties using `with` inside a function

I am trying to understand how to pass a data frame to an R function. I found an answer to this question on StackOverflow that provides the following demonstration / solution:
Pass a data.frame column name to a function
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
This makes sense to me, but I don't quite understand the rules for calling data frames within a function. Take the following example:
data(iris)
x.test <- function(df, x){
out <- with(df, mean(x))
return(out)
}
x.test(iris, "Sepal.Length")
The output of this is NA, with a warning message. But, if I do the same procedure without the function it seems to work just fine.
with(iris, mean(Sepal.Length))
I'm obviously missing something here -- any help would be greatly appreciated.
Thanks!
You have been given the correct advice already (which was to use "[" or "[[" rather than with inside functions) but it might also be helpful to ponder why the problem occurred. Inside the with you asked the mean function to return the mean of a character vector, so NA was the result. When you used with at the interactive level, you had no quotes around the character name of the column and if you had you would have gotten the same result:
> with(iris, mean('Sepal.Length'))
[1] NA
Warning message:
In mean.default("Sepal.Length") :
argument is not numeric or logical: returning NA
If you had used the R get mechanism for "promoting" a character object to return the result of a named object you would actually have succeeded, although with is still generally not recommended for programming use:
x.test <- function(df, x){
out <- with(df, mean( get(x)) ) # get() looks up the object named by x, here a column of df
return(out)
}
x.test(iris, "Sepal.Length")
#[1] 5.843333
See the Details section of the ?with page for warnings about its use in functions.
This will work
data(iris)
x.test <- function(df, x){
out <- mean(df[, x])
return(out)
}
x.test(iris, "Sepal.Length")
Your code is trying to take mean("Sepal.Length") which is clearly not what you want.
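A slightly more defensive variant of the same idea is sketched below. The helper name col_mean, the argument checks and the na.rm = TRUE setting are assumptions added for illustration; they are not part of the answers above.

```r
# Hypothetical helper: mean of one column, chosen by its name as a string
col_mean <- function(df, col) {
  # Fail early with a clear error if col is not a single existing column name
  stopifnot(is.character(col), length(col) == 1, col %in% names(df))
  mean(df[[col]], na.rm = TRUE)   # na.rm = TRUE is an added assumption
}

col_mean(iris, "Sepal.Length")
```

Using [[ returns the column as a plain vector, so mean receives numbers rather than the column's name.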

Refer to a vector anonymously in R

Instead of writing one vector subscript operation a line, such as:
x.and.y <- intersect(x, y)
idx.x <- match(x, x.and.y)
idx.x <- idx.x[!is.na(idx.x)]
I could chain them in one line:
x.and.y <- intersect(x, y)
idx.x <- subset(tmp <- match(x, x.and.y), !is.na(tmp))
In order to do that, I must give intermediate vector a name to be used in subscript operations. To make code even more concise, is there a way to refer to a vector anonymously? Like this:
x.and.y <- intersect(x, y)
idx.x <- match(x, x.and.y)[!is.na] ## illegal R
Considering intersect calls match, what you're doing is redundant. intersect is defined as:
function (x, y)
{
y <- as.vector(y)
unique(y[match(as.vector(x), y, 0L)])
}
And you can pick out the elements common to both vectors directly with %in%: x[x %in% y].
I realize this may not be representative of your actual problem, but "referring to a vector anonymously" doesn't really fit the R paradigm. Function arguments are pass-by-value. You're essentially saying, "I want a function to manipulate an object, but I don't want to provide the object to the function."
You could use R's scoping rules to do this (which is what mplourde did using Filter with an anonymous function), but you're going to create quite a bit of convoluted code that way.
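For illustration, here are two ways to drop the NA values without naming the intermediate vector, using made-up vectors x and y. Both pass the result of match to a function instead of storing it first; treat this as a sketch of the idea, not a recommendation over the plain three-line version.

```r
# Made-up example vectors
x <- c(1, 5, 3, 9)
y <- c(3, 1, 7)
x.and.y <- intersect(x, y)

# An anonymous function, defined and called on the spot
idx.a <- (function(v) v[!is.na(v)])(match(x, x.and.y))

# Filter() with a negated predicate
idx.b <- Filter(Negate(is.na), match(x, x.and.y))

idx.a
idx.b
```

In both cases the intermediate vector still exists; it is simply bound to a function argument instead of a name in the workspace.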

Anonymous function in R

Using a dataset w, which includes a numeric column PY, I can do:
nrow(subset(w, PY==50))
and get the correct answer. If, however, I try to create a function:
fxn <- function(dataset, fac, lev){nrow(subset(dataset, fac==lev))}
and run
fxn(w, PY, 50)
I get the following error:
Error in eval(expr, envir, enclos) : object 'PY' not found
What am I doing wrong? Thanks.
From the documentation of subset:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
This rather obscure warning was very well explained here: Why is `[` better than `subset`?
The final word is that you can't use subset other than interactively, and in particular not via a wrapper like the one you are trying to write. You should use [ or [[ instead and pass the column name as a string, i.e. fxn(w, "PY", 50):
fxn <- function(dataset, fac, lev) nrow(dataset[dataset[[fac]] == lev, , drop = FALSE])
or more simply:
fxn <- function(dataset, fac, lev) sum(dataset[[fac]] == lev)
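As a runnable check of this approach, the sketch below uses the built-in mtcars data in place of w, which is not available here. The name count_level and the na.rm = TRUE setting are assumptions added for illustration; note that the column name is passed as a string.

```r
# Hypothetical wrapper: count rows where the column named by fac equals lev
count_level <- function(dataset, fac, lev) {
  # na.rm = TRUE is an added assumption: rows with a missing value are not counted
  sum(dataset[[fac]] == lev, na.rm = TRUE)
}

count_level(mtcars, "cyl", 4)

# Same count as the interactive subset() call
nrow(subset(mtcars, cyl == 4))
```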
