Applying multiple functions to the columns of a data frame using sapply in R

I have a small, simple data set, and I want to apply a set of functions to each column (variable) of the data frame using the sapply function. Below is the code from R Blogs:
multi.sapply <- function(...) {
  arglist <- match.call(expand.dots = FALSE)$...
  var.names <- sapply(arglist, deparse)
  has.name <- (names(arglist) != "")
  var.names[has.name] <- names(arglist)[has.name]
  arglist <- lapply(arglist, eval.parent, n = 2)
  x <- arglist[[1]]
  arglist[[1]] <- NULL
  result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)
  colnames(result) <- var.names[-1]
  return(result)
}
Since I am a novice R user, I would like to know how to modify the above code when the data has missing (NA) values. For example:
multi.sapply(mydata,mean, median, min, max)
works fine but yields NA for variables that have missing values.
The following call, however, gives an error:
multi.sapply(mydata,mean, median, valid.n, min, max, na.rm = TRUE)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Your help would be much appreciated!
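One possible way around the error, sketched here rather than taken from the original post: multi.sapply treats every argument after the first as a function, so na.rm = TRUE ends up being called as if it were one. The hypothetical multi.sapply2 below instead takes the functions as a named list and forwards any remaining arguments (such as na.rm) to each of them. The name multi.sapply2 and the sample data are assumptions for illustration; valid.n from the original call is left out because it is not part of base R.

```r
# Hypothetical variant: functions go in a named list, extra arguments in `...`.
multi.sapply2 <- function(x, funs, ...) {
  # For each function, apply it to every column of x, forwarding `...`
  # (for example na.rm = TRUE) to the function itself.
  sapply(funs, function(f) sapply(x, f, ...))
}

# Made-up data with one missing value per column
mydata <- data.frame(a = c(1, 2, NA), b = c(4, NA, 6))

result <- multi.sapply2(
  mydata,
  list(mean = mean, median = median, min = min, max = max),
  na.rm = TRUE
)
print(result)
```

The result has one row per column of the data frame and one column per function. Every function in the list must accept the forwarded arguments, so this only works for functions that understand na.rm (or silently accept it through their own `...`).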

Related

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input, ...) {
  res <- within(input, pasted <- paste(...))
  res
}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a workaround:
example.fun2 <- function(input, ...) {
  attach(input)
  res <- within(input, pasted <- paste(...))
  detach(input)
  res
}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
  eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
  data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
  vnames <- sapply(substitute(list(...))[-1], deparse)
  data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
  data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
  data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))
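To see what approach 2 relies on, the short sketch below prints what substitute(list(...)) captures: the unevaluated argument expressions, which deparse then turns into strings that can be used to index the data frame. The helper name dots_to_names is made up for this illustration, and it assumes each argument is a plain column name.

```r
# Hypothetical helper: return the arguments passed via `...` as character
# strings, without evaluating them. Assumes each argument is a simple name.
dots_to_names <- function(...) {
  captured <- substitute(list(...))   # the unevaluated call list(x, y)
  vapply(as.list(captured)[-1], deparse, character(1))
}

# x and y do not exist as objects here; only their names are captured.
vnames <- dots_to_names(x, y)
print(vnames)

# The captured names can then select columns, as in example.fun.3.
df <- data.frame(x = c(1, 2), y = c("a", "b"))
pasted <- do.call("paste", df[vnames])
print(pasted)
```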

Store the arguments an R function expects in a character vector

Is it possible to get the information which arguments are expected by a function and then store it in a character vector?
I know args(foo) but it only prints this information and returns NULL.
Why do I need this?
I want to work with the three dot arguments (dot dot dot, ...) and pass it to different functions.
Let me explain...
The following simple case works.
data <- c(1:10)
cv <- function(x, ...) {
  numerator <- mean(x, ...)
  denominator <- sd(x, ...)
  return(numerator / denominator)
}
cv(data, na.rm = TRUE)
However, in a slightly different case, R will not figure out automatically which arguments belong to which function.
data <- c(1:10)
roundCv <- function(x, ...) {
  numerator <- mean(x, ...)
  denominator <- sd(x, ...)
  result <- round(numerator / denominator, ...)
  return(result)
}
roundCv(data, na.rm = TRUE, digits = 2)
# Error in sd(x, ...) : unused argument (digits = 2)
If I want to separate those arguments, it gets a little hairy. The approach is not generic but has to be adapted to all functions involved.
data <- c(1:10)
roundCv2 <- function(x, ...) {
  args <- list(...)
  args1 <- args[ names(args) %in% "na.rm"] # For mean/sd
  args2 <- args[!names(args) %in% "na.rm"] # For round
  numerator <- do.call("mean", c(list(x = x), args1))
  denominator <- do.call("sd", c(list(x = x), args1))
  tmp <- numerator / denominator
  do.call("round", c(list(x = tmp), args2))
}
roundCv2(data, na.rm = TRUE, digits = 2)
Is there a simple way to do this?!
If I knew the arguments each function expects, I could handle it generically. That's why I'm asking:
Is it possible to get the information which arguments are expected by a function and then store it in a character vector?
A shout-out to MrFlick for pointing to similar questions and giving the answer in the comments.
You can use formals() to get a list-like object back, but it won't work for primitive functions. To get just the argument names, use names(formals(...)).
More details can be found here: https://stackoverflow.com/a/4128401/1553796
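As a rough sketch of how formals() could be used to split `...` generically: the helper names arg_names, call_matching and roundCv3 below are made up for illustration. Two assumptions are worth flagging. First, args() is wrapped around the function so that primitives such as round also report their arguments. Second, mean is a generic whose formals are only x and ..., so the sketch inspects mean.default instead in order to see na.rm.

```r
# Names of the formal arguments of f; args() makes this work for primitives
# such as round(), where formals() alone returns NULL.
arg_names <- function(f) names(formals(args(f)))

# Hypothetical helper: call f on x, passing along only those extra arguments
# that f names explicitly (anything f would only accept via `...` is dropped).
call_matching <- function(f, x, extra) {
  keep <- names(extra) %in% setdiff(arg_names(f), "...")
  do.call(f, c(list(x), extra[keep]))
}

roundCv3 <- function(x, ...) {
  extra <- list(...)
  numerator <- call_matching(mean.default, x, extra)   # receives na.rm
  denominator <- call_matching(sd, x, extra)           # receives na.rm
  call_matching(round, numerator / denominator, extra) # receives digits
}

result <- roundCv3(c(1:10, NA), na.rm = TRUE, digits = 2)
print(result)
```

Dropping unmatched arguments silently is a design choice made for brevity; a stricter version might warn about arguments that no function claimed.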

R apply() function issue with custom function included

I have coded the following function:
one_way_anova <- function(m, n, sample_means, sample_vars) {
  keskiarvo = 1/m*sum(sample_means)
  otosv = (sum((sample_means-keskiarvo)^2))/(m-1)
  TS = (n*otosv)/(sum(sample_vars)/m)
  parvo = 1-pf(TS, m-1, m*(n-1))
  return(parvo)
}
And I am using the following data:
set.seed(1)
dat <- matrix(rnorm(300*20), nrow=300)
sample_means <- matrix(rowMeans(dat), nrow=100)
sample_vars <- matrix(apply(dat, 1, var), nrow=100)
m <- nrow(sample_means)
n <- ncol(sample_means)
Now I am trying to use the apply function to calculate "parvo" with my function one_way_anova on the sample_means data set, row by row, with three samples per row (the matrix is 100x3).
apply(sample_means, 1, one_way_anova)
This gives the following error:
Error in FUN(newX[, i], ...) : argument "sample_means" is missing, with no default
Since your function one_way_anova needs multiple arguments, you have to pass all the arguments other than sample_means as well when you use apply.
If you want to run it over the rows of both sample_means and sample_vars, you can try sapply as below:
sapply(1:m,function(k) one_way_anova(m,n,sample_means[k,],sample_vars[k,]))
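To illustrate the first point with a toy case (the function scaled_sum and the matrix are invented for this sketch, not part of the question): apply passes each row as the first unnamed argument and forwards any further arguments to the function, so supplying the other arguments by name leaves the row to fill the one remaining parameter.

```r
# Toy function whose data argument comes last, like sample_means in one_way_anova
scaled_sum <- function(scale, offset, values) scale * sum(values) + offset

mat <- matrix(1:6, nrow = 2)   # rows are c(1, 3, 5) and c(2, 4, 6)

# scale and offset are matched by name; each row is matched to `values`
res <- apply(mat, 1, scaled_sum, scale = 2, offset = 1)
print(res)
```

This only helps when the extra arguments are the same for every row. When a second argument has to change row by row, as sample_vars does, indexing with sapply as in the answer above is the simpler route.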

Median replace, needs numeric data

I am trying to impute missing values based on a group. I am getting an error that the median() function requires numeric data, but all of my data is numeric, so I can't see the issue. Here is a minimal reproducible example.
set.seed(123)
cluster = sample(seq(1,10),1000,replace=TRUE)
V1 = sample(c(runif(100),NA),1000,replace=TRUE)
V2 = sample(c(runif(100),NA),1000,replace=TRUE)
df = as.data.frame(cbind(cluster,V1,V2))
df_fixed = by(df,df$cluster,function(x){replace(x,is.na(x),median(x, na.rm=TRUE))})
Error returned:
Error in median.default(x, na.rm = TRUE) : need numeric data
This code will work though, so the issue is with the median function.
df_fixed = by(df,df$cluster,function(x){replace(x,is.na(x),1)})
df_fixed <- apply(df[,2:3], 2, function(x) {
  md <- sapply(sort(unique(df$cluster)), function(k) median(x[df$cluster==k], na.rm=TRUE))
  x[is.na(x)] <- md[df$cluster][is.na(x)]
  return(x)
})
any(is.na(df_fixed))
# [1] FALSE
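A note on why the original call fails: by() hands each group to the function as a data frame, and median() on a whole data frame raises "need numeric data" even when every column is numeric. The median therefore has to be computed one column at a time. As an alternative sketch, base R's ave() can do the per-group replacement for a single column; the helper name impute_group_median and the toy data are assumptions made for this example.

```r
# Hypothetical helper: replace NAs in v by the median of v within each group g.
impute_group_median <- function(v, g) {
  ave(v, g, FUN = function(z) replace(z, is.na(z), median(z, na.rm = TRUE)))
}

# Small made-up data set in the same shape as the question's df
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  V1 = c(1, NA, 3, 10, NA, 20),
  V2 = c(NA, 5, 7, 8, 8, NA)
)

# Apply the helper column by column, keeping the result a data frame
toy[c("V1", "V2")] <- lapply(toy[c("V1", "V2")], impute_group_median, g = toy$cluster)
print(toy)
```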

Why the parameter FUN in tapply is invalid combined with colwise

I usually use the combination of colwise and tapply to calculate grouped values in a data frame. However, I found unexpectedly that the parameter FUN in tapply cannot work correctly with colwise from plyr. The example is as follows:
Data:
df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
Normal:
library(plyr)
colwise(tapply)(subset(df, select = c(a, c)), df$b, function(x){sum(x[x > 2])})
The above code is correct and works as expected. But if I name the FUN argument, it fails:
colwise(tapply)(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Error is:
Error in FUN(X[[1L]], ...) :
unused arguments (function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
FUN <- if (!is.null(FUN)) match.fun(FUN)
if (!is.list(INDEX)) INDEX <- list(INDEX)
nI <- length(INDEX)
if (!nI) stop("'INDEX' is of length zero")
namelist <- vector("list", nI)
names(namelist) <- names(INDEX)
extent <- integer(nI)
nx <- length(X)
one <- 1
group <- rep.int(one, nx)
ngroup <- one
for (i in seq_along(INDEX)) {
index <- as.factor(INDEX[[i]])
if (length(index) != nx) stop("arguments must have same length")
namelist[[i]] <- levels(index)
extent[i] <- nlevels(index)
group <- group + ngroup * (as.integer(index) - one)
ngroup <- ngroup * nlevels(index)
}
if (is.null(FUN)) return(group)
ans <- lapply(X = split(X, group), FUN = FUN, ...)
index <- as.integer(names(ans))
if (simplify && all(unlist(lapply(ans, length)) == 1)) {
ansmat <- array(dim = extent, dimnames = namelist)
Could anyone explain the reason? Thank you in advance.
Well, the issue is that both lapply and tapply have an argument named FUN. Note that colwise(tapply) is a function with the following line:
out <- do.call("lapply", c(list(filtered, .fun, ...), dots))
Let's go to this line with our debugger by writing
ct <- colwise(tapply); trace(ct, quote(browser()), at = 6)
and then running
ct(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Now let's print c(list(filtered, .fun, ...), dots). Notice that the first three (unnamed) arguments are now the data frame, tapply, and df$b, with the FUN argument above coming in last. However, this argument is named. Since this is a do.call on lapply, instead of that argument becoming an optional parameter for tapply, it now becomes the FUN argument of lapply itself! So what is happening is that you are turning this into:
lapply(subset(df, select = c(a, c)), function(x){sum(x[x > 2])}, tapply, df$b)
This, of course, makes no sense, and if you execute the above (still in your debugger) manually, you will get the exact same error you are getting. For a simple workaround, try:
tapply2 <- function(.FUN, ...) tapply(FUN = .FUN, ...)
colwise(tapply2)(subset(df, select = c(a, c)), df$b, .FUN = function(x){sum(x[x > 2])})
The plyr package should be checking for ... arguments named FUN (or anything else that can interfere with lapply's job), but the author does not seem to have included such a check. You could submit a pull request to the plyr package that implements any of the following workarounds:
1) Define a local
.lapply <- function(`*X*`, `*FUN*`, ...) lapply(X = `*X*`, `*FUN*`, ...)
(minimizing interference further).
2) Check names(list(...)) within the colwise(tapply) function for X and FUN (this can introduce problems if the author intended to prevent evaluation of promises until the child call).
3) Call do.call("lapply", ...) explicitly with named X and FUN, so that you get the intended error:
formal argument "FUN" matched by multiple actual arguments
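The argument clash can also be reproduced in base R alone, without plyr. The sketch below is an illustration, not plyr's actual code: it issues the same kind of do.call on lapply with a named FUN and captures the resulting error, then shows that wrapping tapply in an anonymous function keeps FUN out of lapply's argument matching. The variable names cols, f, clash and ok are chosen for this example.

```r
df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
f <- function(x) sum(x[x > 2])
cols <- df[c("a", "c")]

# The named FUN is matched to lapply's own FUN, so tapply and df$b are pushed
# into `...` and f ends up being called as f(column, tapply, df$b).
clash <- tryCatch(
  do.call("lapply", list(cols, tapply, df$b, FUN = f)),
  error = function(e) conditionMessage(e)
)
print(substr(clash, 1, 40))   # start of the captured error message

# Wrapping tapply in an anonymous function means lapply never sees a
# second FUN, so the named argument reaches tapply as intended.
ok <- lapply(cols, function(col) tapply(col, df$b, FUN = f))
print(ok)
```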

Resources