I am writing a wrapper to combine any number of datasets row-wise. Since some may have unique variables, I am first restricting to the variables in the data.
My function works like this
rcombine <- function(List, Vars) {
List2 <- lapply(List, subset, select=Vars)
Reduce(rbind, List2)
}
When I run the code directly, it works. But in the function, my variable Vars disappears.
For instance:
x <- data.frame('a'=sample(LETTERS, 10), 'b'=sample(LETTERS, 10), 'c'=sample(LETTERS, 10))
y <- data.frame('a'=sample(LETTERS, 10), 'b'=sample(LETTERS, 10), 'e'=sample(LETTERS, 10))
rcombine(list(x, y), c('a', 'b'))
gives me:
Error in eval(expr, envir, enclos) : object 'Vars' not found
but running:
List <- list(x, y)
Reduce(rbind, lapply(List, subset, select=c('a','b')))
Works. I can print Vars from the function, but inside lapply it disappears. What is going on?
subset really shouldn't be used for these types of things. From the help page
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
For your particular problem I don't see why just replacing subset with directly using "[" would be a problem.
rcombine <- function(List, Vars) {
List2 <- lapply(List, "[", i= , j = Vars, drop = FALSE) # here is the change
Reduce(rbind, List2)
}
# alternatively...
rcombine <- function(List, Vars) {
List2 <- lapply(List, function(x){x[, Vars, drop = FALSE]}) # here is the change
Reduce(rbind, List2)
}
x <- data.frame('a'=sample(LETTERS, 10), 'b'=sample(LETTERS, 10), 'c'=sample(LETTERS, 10))
y <- data.frame('a'=sample(LETTERS, 10), 'b'=sample(LETTERS, 10), 'e'=sample(LETTERS, 10))
rcombine(list(x, y), c('a', 'b'))
Related
I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))
I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MAR = 1, to apply a function to columns set MAR = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Often functions written for only a pair of values come out as perfectly vectorized. Probably the best way to call some_function is without any function of the apply-familiy as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
I have a list of dataframes that I would like to multiply for each element of vector.
The first dataframe in the list would be multiplied by the first observation of the vector, and so on, producing another list of dataframes already multiplied.
I tried to do this with a loop, but was unsuccessful. I also tried to imagine something using map or lapply, but I couldn't.
for(i in vec){
for(j in listdf){
listdf2 <- i*listdf[[j]]
}
}
Error in listdf[[j]] : invalid subscript type 'list'
Any idea how to solve this?
*Vector and the List of Dataframes have the same length.
Use Map :
listdf2 <- Map(`*`, listdf, vec)
in purrr this can be done using map2 :
listdf2 <- purrr::map2(listdf, vec, `*`)
If you are interested in for loop solution you just need one loop :
listdf2 <- vector('list', length(listdf))
for (i in seq_along(vec)) {
listdf2[[i]] <- listdf[[i]] * vec[i]
}
data
vec <- c(4, 3, 5)
df <- data.frame(a = 1:5, b = 3:7)
listdf <- list(df, df, df)
Similar to this question:
Return elements of list as independent objects in global environment
I cannot seem to adapt the answer to assign the list elements when list2env is called inside a function:
E.g.
lst <- list(a = c(1, 2), b = c(3, 4))
tmp_fn <- function(lst) {
# do computations on list elements
# first assign each to the function environment
list2env(lst, parent = parent.frame()) # fails
# do stuff
...
}
I thought the parent = parent.frame() would work, but while debugging tmp_fn I only see that lst gets assigned to the function environment as a list, not the individual variables a and b.
1) Use envir= here rather than parent= like this. Also, as shown, you may wish to add envir as an argument for flexibility:
lst <- list(a = c(1, 2), b = c(3, 4))
tmp_fn <- function(lst, envir = parent.frame()) {
invisible(list2env(lst, envir = envir))
}
tmp_fn(lst)
2) Another possibility is to use list[...]<- from the gsubfn package (development version):
devtools::install_github("ggrothendieck/gsubfn")
library(gsubfn)
func <- function(lst) lst
list[a, b] <- func(lst)
Now a and b will be in the current environment.
I usually use the combination of colwise and tapply to calculate grouped values in a data frame. However, I found unexpectedly that the parameter FUN in tapply cannot work correctly with colwise from plyr. The example is as follows:
Data:
df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
Normal:
library(plyr)
colwise(tapply)(subset(df, select = c(a, c)), df$b, function(x){sum(x[x > 2])})
Above code is correct and can work normally. But if I add FUN, it will be wrong:
colwise(tapply)(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Error is:
Error in FUN(X[[1L]], ...) :
unused arguments (function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
FUN <- if (!is.null(FUN)) match.fun(FUN)
if (!is.list(INDEX)) INDEX <- list(INDEX)
nI <- length(INDEX)
if (!nI) stop("'INDEX' is of length zero")
namelist <- vector("list", nI)
names(namelist) <- names(INDEX)
extent <- integer(nI)
nx <- length(X)
one <- 1
group <- rep.int(one, nx)
ngroup <- one
for (i in seq_along(INDEX)) {
index <- as.factor(INDEX[[i]])
if (length(index) != nx) stop("arguments must have same length")
namelist[[i]] <- levels(index)
extent[i] <- nlevels(index)
group <- group + ngroup * (as.integer(index) - one)
ngroup <- ngroup * nlevels(index)
}
if (is.null(FUN)) return(group)
ans <- lapply(X = split(X, group), FUN = FUN, ...)
index <- as.integer(names(ans))
if (simplify && all(unlist(lapply(ans, length)) == 1)) {
ansmat <- array(dim = extent, dimnames = namelist)
Could anyone explain the reason? Thank you in advance.
Well, the issue is that both lapply and tapply have an optional FUN argument. Note that colwise(tapply) is a function with the following line:
out <- do.call("lapply", c(list(filtered, .fun, ...), dots))
Let's go to this line with our debugger by writing
ct <- colwise(tapply); trace(ct, quote(browser()), at = 6)
and then running
ct(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Now let's print c(list(filtered, .fun, ...), dots). Notice that the first three (unnamed) arguments are now the dataframe, tapply, and db$b, with the FUN argument above coming in last. However, this argument is named. Since this is a do.call on lapply, instead of that argument becoming an optional parameter for tapply, it now becomes the main call on lapply! So what is happening is that you are turning this into:
lapply(subset(df, select = c(a, c)), function(x){sum(x[x > 2])}, tapply, df$b)
This, of course, makes no sense, and if you execute the above (still in your debugger) manually, you will get the exact same error you are getting. For a simple workaround, try:
tapply2 <- function(.FUN, ...) tapply(FUN = .FUN, ...)
colwise(tapply2)(subset(df, select = c(a, c)), df$b, .FUN = function(x){sum(x[x > 2])})
The plyr package should be checking for ... arguments named FUN (or anything that can interfere with lapply's job), but it doesn't seem the author included this. You can submit a pull request to the plyr package that implements any of the following workarounds:
Defines a local
.lapply <- function(`*X*`, `*FUN*`, ...) lapply(X = `*X*`, `*FUN*`, ...)
(minimizing interference further).
Checks names(list(...)) within the colwise(tapply) function for X and FUN (can introduce problems if the author intended to prevent evaluation of promises until the child call).
Calls do.call("lapply", ...) explicitly with named X and FUN, so that you get the intended
formal argument "FUN" matched by multiple actual arguments