In a lot of cases, I need to write some code that forms a logical block, and it feels right to place it in a function. However, when the code is used only once, moving it away from where it is applied and giving it a single-use name is cumbersome and pollutes the namespace.
Today, I was experimenting and I also came across a question about lambda expressions in R. So I implemented my logic as follows:
x <- (function(charsBase, n, m) {
  z <- apply(
    matrix(
      sample(unique(charsBase), n*m*3, replace = TRUE)
      , nrow = n*3, ncol = m
    )
    , 1
    , paste, collapse="")
  head(unique(z), n)
}) (LETTERS, 1000, 3)
Questions:
Is there a better way of creating a lambda in R?
While the namespace is apparently kept clean, what about memory? In my experience, R usually leaks when you create and remove objects in the global environment. If extensive allocation and freeing is done within the function, would this keep memory under control?
Thanks a lot in advance!
You can use with() with a list or data frame as its first argument. For example:
result <- with(list(a=3, b=4), {
foo <- a + b
foo^2
})
This keeps the global environment clean, because the part enclosed in braces is evaluated in a separate environment that is discarded after the evaluation takes place.
However, in my experience it can become cumbersome to program in this style. Sometimes I find it more practical to clean up unwanted objects with rm() when they're no longer needed. It's not as elegant, I agree.
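To check that claim, here is a minimal sketch (a, b and foo are reused from the example above; the exists() checks are an addition for illustration) confirming that nothing created inside with() is left behind in the global environment:

```r
result <- with(list(a = 3, b = 4), {
  foo <- a + b   # foo lives only in the temporary evaluation environment
  foo^2
})
result   # 49

# In a fresh session, neither foo nor a exists in the global environment
exists("foo", envir = globalenv(), inherits = FALSE)
exists("a", envir = globalenv(), inherits = FALSE)
```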
I would use local with shorter lines and more readable code:
x <- local({
charsBase <- LETTERS
n <- 1000
m <- 3
sam <- sample(unique(charsBase), n*m*3, replace = TRUE)
mtx <- matrix(sam, nrow = n*3, ncol = m)
z <- apply(mtx, 1, paste, collapse="")
head(unique(z), n)
})
None of the above "leaks" into the global environment unless you explicitly use global assignment with <<-. The value of the last "thing" evaluated within the curly brackets becomes the value of x. You can get an identical result with ...
local({
charsBase <- LETTERS
n <- 1000
m <- 3
sam <- sample(unique(charsBase), n*m*3, replace = TRUE)
mtx <- matrix(sam, nrow = n*3, ncol = m)
z <- apply(mtx, 1, paste, collapse="")
x <<- head(unique(z), n) # notice the weird assignment operator
})
... I'd say it's less elegant but then again, it's a matter of preference.
A useful trick I sometimes use when experimenting is ...
local(browser())
You don't have to keep track of the assignments, everything is temporary unless you use global assignment.
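Regarding the original question about lambdas: since R 4.1.0 there is also the shorthand \(x) for function(x). A minimal sketch of the immediately invoked version of the code above, assuming R >= 4.1.0 (the seed and the smaller n are only there to keep the example quick and repeatable):

```r
set.seed(1)  # only for reproducibility of the example
x <- (\(charsBase, n, m) {
  sam <- sample(unique(charsBase), n * m * 3, replace = TRUE)
  mtx <- matrix(sam, nrow = n * 3, ncol = m)
  z <- apply(mtx, 1, paste, collapse = "")
  head(unique(z), n)
})(LETTERS, 100, 3)

head(x)           # a few random three-letter strings
all(nchar(x) == 3)
```

As with local(), the intermediate objects sam, mtx and z exist only inside the function call.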
I would like to use a hash table to implement a simple conversion script.
An input should be multiplied by a factor identified by its symbol, i.e. y = x * 1E-12 should be callable as e.g. y <- X2Y(x,"p"), with "p" being the symbol for 1E-12.
library(hash)
sym2num <- function(x) {
h <- hash( c("f"=1E-15,"p"=1E-12,"n"=1E-9,"mu"=1E-6,"m"=1E-3,"c"=1E-2) )
return(h$x)
}
X2Y <- function(X,x) {
xNum <- sym2num(x)
Y <- X * xNum
return(Y)
}
# y = x * 1E-12
y <- X2Y(x,"p")
print(y)
With the above code I get numeric(0) as the result. Any ideas where it goes wrong?
There’s no benefit to using the {hash} library here. Indeed, since you rehash your vector before each subsetting, this will be substantially less efficient than a direct lookup.
Even if you only constructed the hash table once instead of repeatedly, it would probably still be faster not to use it: the hash table implementation carries a substantial constant overhead. It’s only faster than direct vector or list subsetting for fairly large tables.
Instead, just do this:
sym2num <- function(x) {
c(f = 1E-15, p = 1E-12, n = 1E-9, mu = 1E-6, m = 1E-3, c = 1E-2)[x]
}
This is idiomatic, efficient R code.
Fundamentally, the mistake in your R code was the subsetting, h$x. This fails because the subset operator $ doesn’t work with variables: it expects an unevaluated name on its right-hand side. The code will thus always look up the literal name x inside h. thc’s answer shows how to avoid the issue.
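As a quick illustration of the named-vector lookup (the X2Y wrapper is adapted from the question; unname() is an addition that drops the symbol name from the result), the same indexing also handles several symbols at once:

```r
sym2num <- function(x) {
  c(f = 1E-15, p = 1E-12, n = 1E-9, mu = 1E-6, m = 1E-3, c = 1E-2)[x]
}

X2Y <- function(X, x) X * unname(sym2num(x))

X2Y(5, "p")                  # 5e-12
X2Y(c(1, 2), c("m", "mu"))   # 0.001 and 2e-06
sym2num("k")                 # NA: an unknown symbol yields NA, not an error
```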
Your function sym2num always returns the hash of "x", which is NULL.
h$x is a shortcut for h[["x"]], but what you want is h[[x]].
Instead use this:
sym2num <- function(x) {
h <- hash( c("f"=1E-15,"p"=1E-12,"n"=1E-9,"mu"=1E-6,"m"=1E-3,"c"=1E-2) )
return(h[[x]])
}
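The difference between $ and [[ is not specific to {hash}. A minimal sketch with a plain list (h and key are names chosen just for this illustration) shows the same behaviour:

```r
h <- list(p = 1e-12, n = 1e-9)
key <- "p"

h$key        # NULL: looks for an element literally named "key"
h[["key"]]   # NULL: the same lookup written with [[
h[[key]]     # 1e-12: key is evaluated first, so this looks up "p"
```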
I have a list which contains more lists of lists:
results <- sapply(c(paste0("cv_", seq(1:50)), "errors"), function(x) NULL)
## Locations for results to be stored
step_results <- sapply(c("myFit", "forecast", "errors"), function(x) NULL)
step_errors <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
final_error <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
for(i in 1:50){results[[i]] <- step_results}
for(i in 1:50){results[[i]][[3]] <- step_errors}
results$errors <- final_error
Now in this whole structure, I would like to sum up all the values in sign_accuracy and save them in results$errors$sign_accuracy
I could maybe do this with a for-loop, indexing with i:
## This is just an example - it won't actually work!
sign_acc <- matrix(nrow = 50, ncol = 2)
for (i in 1:50){
sign_acc[i, ] <- `results[[i]][[3]][[3]]`
results$errors$sign_accuracy <- sign_acc
}
If I remember correctly, in Matlab there is something like list(:), which means all elements. In Python I have seen something like list(0:-1), which also means all elements.
What is the elegant R equivalent? I don't really like loops.
I have seen methods using the apply family of functions, with something like apply(data, "[[", 2), but I can't get it to work for deeper lists.
Did you try with c(..., recursive)?
Here is an option with a short example at the end:
sumList <- function(l, label) {
lc <- c(l, recursive=TRUE)
filter <- grepl(paste0("\\.",label, "$"), names(lc)) | (names(lc) == label)
nums <- lc[filter]
return(sum(as.numeric(nums)))
}
ex <- list(a=56,b=list("5",a=34,list(c="3",a="5")))
sumList(ex,"a")
In this case, you can do what you want with
results$errors$sign_accuracy <- do.call(sum, lapply(results, function(x){x[[3]][[3]]}))
lapply loops through the first layer of results and pulls out the third element of the third element of each. do.call(sum, ...) then collects all of those values and sums them.
The real problems with lists arise when the nesting is more irregular, or when you need to loop through more than one index. It can always be done in the same way, but it gets extraordinarily ugly very quickly.
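As a sketch of the same idea using names instead of positions, here is a hypothetical miniature of the structure from the question, filled with made-up numbers purely for illustration. vapply is used so that the extracted values are checked to be single numbers:

```r
# Hypothetical two-fold version of the results structure, with invented values
results <- list(
  cv_1 = list(myFit = NULL, forecast = NULL,
              errors = list(MAE = 0.1, MSE = 0.02, sign_accuracy = 0.6)),
  cv_2 = list(myFit = NULL, forecast = NULL,
              errors = list(MAE = 0.2, MSE = 0.05, sign_accuracy = 0.7)),
  errors = list(MAE = NULL, MSE = NULL, sign_accuracy = NULL)
)

# Keep only the per-fold entries, then pull one value out of each
folds <- results[startsWith(names(results), "cv_")]
acc <- vapply(folds, function(x) x$errors$sign_accuracy, numeric(1))

results$errors$sign_accuracy <- sum(acc)
results$errors$sign_accuracy   # 1.3
```

Selecting the cv_ entries by name avoids having to treat the final errors element as a special case.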
I'm doing cross-validation, so I wanted to split the data into 10 folds. Somebody posted the following code.
f_K_fold <- function(Nobs,K=10){
rs <- runif(Nobs)
id <- seq(Nobs)[order(rs)]
k <- as.integer(Nobs * seq(1, K-1) / K)
k <- matrix(c(0, rep(k, each=2), Nobs), ncol = 2, byrow = TRUE)
k[,1] <- k[,1]+1
l <- lapply(seq.int(K), function(x, k, d)
list(train=d[!(seq(d) %in% seq(k[x, 1],k[x, 2]))],
test=d[seq(k[x,1],k[x,2])]),
k=k,d=id)
return(l)
}
However, I don't really understand what the lapply is doing. Could someone explain it to a newbie? I'd appreciate it.
It's really unfortunate that the code formatting in this example is horrible, since having properly formatted code can aid in understanding the code and catching mistakes.
The last three lines can be viewed as an anonymous function passed to lapply. lapply in essence "climbs" a list and for each list element, applies that (anonymous) function. In the example below, I've disambiguated the lines into a not so anonymous function and a call to lapply.
notSoanonymousFunction <- function(x, k, d) {
list(train = d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test = d[seq(k[x,1],k[x,2])])
}
l <- lapply(seq.int(K), FUN = notSoanonymousFunction, k = k, d = id)
If you look at ?lapply, you'll notice that there are no k or d arguments. However, these arguments do belong to our notSoanonymousFunction, and lapply takes it in via the ... argument.
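A smaller example of the same mechanism may help. In this sketch, scaleElement is a hypothetical helper: lapply supplies each element of 1:3 as x, while k and d are passed through unchanged via the ... argument:

```r
# Hypothetical helper: picks element x of d and multiplies it by k
scaleElement <- function(x, k, d) d[x] * k

out <- lapply(1:3, scaleElement, k = 10, d = c(2, 4, 6))
unlist(out)   # 20 40 60
```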
As a mental exercise for you, I will show you one more trick how to learn what the function is doing. If you need to see what is happening inside the function, place a browser() call inside and run it. In your case, this would look like this:
notSoanonymousFunction <- function(x, k, d) {
browser()
list(train = d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test = d[seq(k[x,1],k[x,2])])
}
Once you run this, your console should say something along the lines of
Browser[1]>
You are now effectively inside the function. You can step to the next line by typing n, run the rest of the chunk with c, and quit the browser altogether with Q (see ?browser). You can view and manipulate objects ad libitum. Try checking your workspace with ls() to see which objects are inside the function. You can bet your family farm that there will be objects x, k and d.
I have following for loop in R:
v = c(1,2,3,4)
s = create.some.complex.object()
for (i in v){
print(i)
s = some.complex.function.that.updates.s(s)
}
# s here has the right content.
Needless to say, this loop is horribly slow in R.
I tried to write it in functional style:
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
})
# s wasn't updated.
But this doesn't work, because s is passed by value and not by reference.
I only need the result of the last iteration, not all of the intermediate steps.
How do I formulate the first loop in R-style?
Mulone
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
return(s)
})
The result will be a list containing one s for each value of v. The explicit return(s) is mostly for clarity: the function would have returned that value anyway, because the assignment was the last expression it evaluated. Note that every call starts from the original s, so the updates are not chained from one iteration to the next.
If you can't afford to create it many times, then there are not a lot of options. It is also hard to say without seeing the object that you are operating on. If the object is growing or being appended to, you could collect the intermediate results and do the appending at the end. If it is actually being mutated, you should try to get away from pass-by-value and use reference classes (http://www.inside-r.org/r-doc/methods/ReferenceClasses). The function that modifies it then becomes a method that you just call n times.
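If only the final state is needed, one option not mentioned above is base R's Reduce(), which threads an accumulator through the elements of v. A minimal sketch, in which the hypothetical update_s() stands in for some.complex.function.that.updates.s and a plain number stands in for the complex object:

```r
update_s <- function(s) s + 1   # hypothetical stand-in for the real update

v <- c(1, 2, 3, 4)
s0 <- 1                         # stand-in for create.some.complex.object()

# The accumulator s is handed from one step to the next; i is the current element
s <- Reduce(function(s, i) update_s(s), v, init = s0)
s   # 5

# accumulate = TRUE keeps every intermediate state as well
Reduce(function(s, i) update_s(s), v, init = s0, accumulate = TRUE)   # 1 2 3 4 5
```

This is still a sequential loop internally, so it changes the style rather than the running time.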
Is the loop itself really the problem? Or is it rather the time the execution of some.complex.function.that.updates.s needs?
Some R programmers will jump through hoops to avoid loops, but have a look at this example:
f <- function(a) a/1.001
loop <- function(n) { s = (1/f(1)^n); for (i in 1:n) s <- f(s); s}
system.time(loop(1E7))
user system elapsed
7.011 0.030 7.008
This is 0.7 microseconds (on a MacBook Pro) per call of a very trivial function in a loop.
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
s <<- some.complex.function.that.updates.s(s)
}) |> invisible()
Use of the <<- operator can sometimes get you into trouble and is (somewhat) discouraged, but when I want to mimic a for loop with side-effects this is a pattern I have found useful.
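One way to limit that risk is to keep the state inside a wrapper function, so that <<- modifies the wrapper's own s rather than a global variable. A minimal sketch (run_updates is a hypothetical name, and s + i stands in for the real update function):

```r
run_updates <- function(v, s) {
  lapply(v, function(i) {
    s <<- s + i   # assigns to s in run_updates(), the enclosing environment
  }) |> invisible()
  s               # return the final state
}

s_final <- run_updates(c(1, 2, 3, 4), s = 1)
s_final   # 11
```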
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
assign('s', some.complex.function.that.updates.s(s), envir = .GlobalEnv)
}) |> invisible()
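Using assign allows you to avoid the <<- operator, although <<- is significantly faster than invoking the assign function. In more performance-sensitive applications it can be worth replacing sequential for loops: in the toy benchmarks below, the median execution time of the lapply version was more than two orders of magnitude lower than that of the for loop. Keep in mind that lapply is itself a loop rather than a vectorized operation, and that microsecond-scale timings of a four-element loop say little about real workloads, so benchmark your own code before drawing conclusions. Here are the toy benchmarks: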
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
lapply(v, function(i) {
s <<- s + i
})
}, times = 1e4, unit = 'microseconds')
Median: ~ 4 microseconds
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
for(i in v) {
s <- s + i
}
}, times = 1e4, unit = 'microseconds')
Median: ~ 1488 microseconds
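Note that lapply and for both still iterate element by element. For this particular toy update, a genuinely vectorized equivalent (an addition for comparison, not part of the benchmarks above) needs no explicit iteration at all:

```r
v <- c(1, 2, 3, 4)

s <- 1 + sum(v)          # final value only: 11
steps <- 1 + cumsum(v)   # value after each iteration: 2 4 7 11
```

This only works because the update is a plain sum; an arbitrary update function cannot be rewritten this way.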
I am trying to come up with a variant of mapply (call it xapply for now) that combines the functionality (sort of) of expand.grid and mapply. That is, for a function FUN and a list of arguments L1, L2, L3, ... of unknown length, it should produce a list of length n1*n2*n3 (where ni is the length of list i) which is the result of applying FUN to all combinations of the elements of the list.
If expand.grid worked to generate lists of lists rather than data frames, one might be able to use it, but I have in mind that the lists may be lists of things that won't necessarily fit into a data frame nicely.
This function works OK if there are exactly three lists to expand, but I am curious about a more generic solution. (FLATTEN is unused, but I can imagine that FLATTEN=FALSE would generate nested lists rather than a single list ...)
xapply3 <- function(FUN,L1,L2,L3,FLATTEN=TRUE,MoreArgs=NULL) {
  retlist <- list()
  count <- 1
  for (i in seq_along(L1)) {
    for (j in seq_along(L2)) {
      for (k in seq_along(L3)) {
        retlist[[count]] <- do.call(FUN,c(list(L1[[i]],L2[[j]],L3[[k]]),MoreArgs))
        count <- count+1
      }
    }
  }
  retlist
}
edit: forgot to return the result. One might be able to solve this by making a list of the indices with combn and going from there ...
I think I have a solution to my own question, but perhaps someone can do better (and I haven't implemented FLATTEN=FALSE ...)
xapply <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
  L <- list(...)
  inds <- do.call(expand.grid,lapply(L,seq_along)) ## Marek's suggestion
  retlist <- list()
  for (i in 1:nrow(inds)) {
    arglist <- mapply(function(x,j) x[[j]],L,as.list(inds[i,]),SIMPLIFY=FALSE)
    if (FLATTEN) {
      retlist[[i]] <- do.call(FUN,c(arglist,MoreArgs))
    }
  }
  retlist
}
edit: I tried @baptiste's suggestion, but it's not easy (or wasn't for me). The closest I got was
xapply2 <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
L <- list(...)
xx <- do.call(expand.grid,L)
f <- function(...) {
do.call(FUN,lapply(list(...),"[[",1))
}
mlply(xx,f)
}
which still doesn't work. expand.grid is indeed more flexible than I thought (although it creates a weird data frame that can't be printed), but enough magic is happening inside mlply that I can't quite make it work.
Here is a test case:
L1 <- list(data.frame(x=1:10,y=1:10),
data.frame(x=runif(10),y=runif(10)),
data.frame(x=rnorm(10),y=rnorm(10)))
L2 <- list(y~1,y~x,y~poly(x,2))
z <- xapply(lm,L2,L1)
xapply(lm,L2,L1)
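For a quick sanity check that does not depend on model fitting, here is a compact variant of the same index-grid idea. It is only a sketch: xapply_sketch is a hypothetical name, FLATTEN is left out, and it relies on expand.grid varying its first argument fastest:

```r
xapply_sketch <- function(FUN, ..., MoreArgs = NULL) {
  L <- list(...)
  # One row per combination of positions; the first list varies fastest
  inds <- expand.grid(lapply(L, seq_along), KEEP.OUT.ATTRS = FALSE)
  lapply(seq_len(nrow(inds)), function(i) {
    # Pick the i-th combination: element inds[i, j] from the j-th list
    args <- Map(function(x, j) x[[j]], L, as.list(inds[i, ]))
    do.call(FUN, c(args, MoreArgs))
  })
}

res <- xapply_sketch(function(a, b) paste(a, b), list("x", "y"), list(1, 2, 3))
length(res)   # 6 = 2 * 3 combinations
unlist(res)   # "x 1" "y 1" "x 2" "y 2" "x 3" "y 3"
```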
@ben-bolker, I had a similar desire and think I have a preliminary solution worked out, which I've also tested to work in parallel. The function, which I somewhat confusingly called gmcmapply (g for grid), takes an arbitrarily large named list mvars (which gets expand.grid-ed within the function) and a FUN that uses the list names as if they were arguments to the function itself. gmcmapply updates the formals of FUN so that, by the time FUN is passed to mcmapply, its arguments reflect the variables that the user would like to iterate over (which would be the layers in a nested for loop). mcmapply then dynamically updates the values of these formals as it cycles over the expanded set of variables in mvars.
I've posted the preliminary code as a gist (reprinted with an example below) and would be curious to get your feedback on it. I'm a grad student and a self-described intermediately skilled R enthusiast, so this is pushing my R skills for sure. You or other folks in the community may have suggestions that would improve on what I have. Even as it stands, I think I'll be coming back to this function quite a bit in the future.
gmcmapply <- function(mvars, FUN, SIMPLIFY = FALSE, mc.cores = 1, ...){
require(parallel)
FUN <- match.fun(FUN)
funArgs <- formals(FUN)[which(names(formals(FUN)) != "...")] # allow for default args to carry over from FUN.
expand.dots <- list(...) # allows for expanded dot args to be passed as formal args to the user specified function
# Implement non-default arg substitutions passed through dots.
if(any(names(funArgs) %in% names(expand.dots))){
dot_overwrite <- names(funArgs[which(names(funArgs) %in% names(expand.dots))])
funArgs[dot_overwrite] <- expand.dots[dot_overwrite]
#for arg naming and matching below.
expand.dots[dot_overwrite] <- NULL
}
## build grid of mvars to loop over, this ensures that each combination of various inputs is evaluated (equivalent to creating a structure of nested for loops)
grid <- expand.grid(mvars,KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
# specify formals of the function to be evaluated by merging the grid to mapply over with expanded dot args
argdefs <- rep(list(bquote()), ncol(grid) + length(expand.dots) + length(funArgs) + 1)
names(argdefs) <- c(colnames(grid), names(funArgs), names(expand.dots), "...")
argdefs[which(names(argdefs) %in% names(funArgs))] <- funArgs # replace with proper dot arg inputs.
argdefs[which(names(argdefs) %in% names(expand.dots))] <- expand.dots # replace with proper dot arg inputs.
formals(FUN) <- argdefs
if(SIMPLIFY) {
#standard mapply
do.call(mcmapply, c(FUN, c(unname(grid), mc.cores = mc.cores))) # mc.cores = 1 == mapply
} else{
#standard Map
do.call(mcmapply, c(FUN, c(unname(grid), SIMPLIFY = FALSE, mc.cores = mc.cores)))
}
}
example code below:
# Example 1:
# just make sure variables used in your function appear as the names of mvars
myfunc <- function(...){
return_me <- paste(l3, l1^2 + l2, sep = "_")
return(return_me)
}
mvars <- list(l1 = 1:10,
l2 = 1:5,
l3 = letters[1:3])
### list output (like Map)
lreturns <- gmcmapply(mvars, myfunc)
### simplified, concatenated output (like mapply)
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE)
## N.B. Up to the order of the results, this is equivalent to running:
lreturns <- c()
for(l1 in 1:10){
  for(l2 in 1:5){
    for(l3 in letters[1:3]){
      lreturns <- c(lreturns,myfunc(l1,l2,l3))
    }
  }
}
### concatenated output run on 2 cores.
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE, mc.cores = 2)
# Example 2:
# Pass non-default args to FUN.
## Since the apply functions don't accept full calls as inputs (calls are internal), user can pass arguments to FUN through dots, which can overwrite a default option for FUN.
# e.g. apply(x,1,FUN) works and apply(x,1,FUN(arg_to_change= not_default)) does not, the correct way to specify non-default/additional args to FUN is:
# gmcmapply(mvars, FUN, arg_to_change = not_default)
## update myfunc to have a default argument
myfunc <- function(rep_letters = 3, ...){
return_me <- paste(rep(l3, rep_letters), l1^2 + l2, sep = "_")
return(return_me)
}
lreturns <- gmcmapply(mvars, myfunc, rep_letters = 1)
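Stripped of the formals rewriting and the parallel backend, the core idea is expand.grid followed by mapply over the columns of the grid. A minimal base-R sketch (grid_apply is a hypothetical name, and unlike myfunc above, the function here declares l1, l2 and l3 as explicit arguments):

```r
grid_apply <- function(mvars, FUN, SIMPLIFY = FALSE) {
  # One row per combination; the first variable varies fastest
  grid <- expand.grid(mvars, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
  # Each grid column becomes a named argument that mapply iterates over
  do.call(mapply, c(list(FUN = FUN, SIMPLIFY = SIMPLIFY), as.list(grid)))
}

f <- function(l1, l2, l3) paste(l3, l1^2 + l2, sep = "_")
mvars_small <- list(l1 = 1:2, l2 = 1:2, l3 = c("a", "b"))

out <- grid_apply(mvars_small, f, SIMPLIFY = TRUE)
length(out)   # 8 = 2 * 2 * 2 combinations
out[1:2]      # "a_2" "a_5": l1 changes first, then l2, then l3
```

Swapping mapply for parallel::mcmapply would be the natural next step towards the function above, at the cost of the extra argument handling it needs.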
Two bits of additional functionality I would like to add, but am still trying to work out, are:
cleaning up the output to be a pretty nested list with the names of mvars (normally, I'd create multiple lists within a nested for loop and tag lower-level lists onto higher-level lists all the way up until all layers of the gigantic nested loop were done). I think using some abstracted variant of the solution provided here will work, but I haven't figured out how to make the solution flexible to the number of columns in the expand.grid-ed data.frame.
I would like an option to log the outputs of the child processes that get called in mcmapply to a user-specified directory, so you could look at .txt outputs from every combination of variables generated by expand.grid (i.e. if the user prints model summaries or status messages as a part of FUN, as I often do). I think a feasible solution is to use the substitute() and body() functions, described here, to edit FUN so that it opens a sink() at the beginning and closes it at the end if the user specifies a directory to write to. Right now, I just program it right into FUN itself, but later it would be nice to pass gmcmapply an argument called something like log_children = "path_to_log_dir" and then edit the body of the function to (pseudocode) sink(file = file.path(log_children, paste0(paste(names(mvars), sep = "_"), ".txt")))
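As a rough sketch of the logging idea, here is one alternative to editing the body of FUN: wrap it in a closure that opens and closes the sink. with_log is a hypothetical helper, the sketch only captures printed output (not messages or warnings), and I have not tested it inside mcmapply child processes:

```r
# Hypothetical helper: wraps FUN so that everything it prints goes to log_file
with_log <- function(FUN, log_file) {
  function(...) {
    con <- file(log_file, open = "wt")
    sink(con)
    on.exit({
      sink()       # restore normal output first
      close(con)   # then close the log file
    })
    FUN(...)
  }
}

log_file <- tempfile(fileext = ".txt")
logged <- with_log(function(x) { print("fitting model"); x * 2 }, log_file)

value <- logged(21)
value                # 42
readLines(log_file)  # contains the printed "fitting model" line
```

Using on.exit means the sink is closed even if FUN stops with an error.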
Let me know what you think!
-Nate