Writing to large matrices within a function - fast vs slow - r

[Question amended following responses]
Thanks for the responses. I was unclear in my question, for which I apologise.
I'll try to give more details of our situation. We have c. 100 matrices that we keep in an environment. Each is very large. If at all possible we want to avoid any copying of these matrices when we perform updates. We're often running up against the 2GB memory limit, so this is very important for us.
So our two requirements are 1) avoiding copies and 2) addressing the matrices indirectly by name. Speed, whilst important, is a side-issue that would be solved by avoiding the copying.
It appears to me that Tommy's solution involved creating a copy (though it did entirely answer my actual original question, so I'm the one at fault).
The code below is what seems most obvious to us, but it clearly creates a copy (as shown by the memory.size increase):
myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)
testfnDirect <- function(paramEnv) {
  print(memory.size())
  for (i in 1:300) {
    temp <- paramEnv$testmat1[10,]
    paramEnv$testmat1[10,] <- temp * 0
  }
  print(memory.size())
}
system.time(testfnDirect(myenv))
Using the with function seems to avoid this, as shown below:
myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)
testfnDirect <- function(paramEnv) {
  print(gc())
  varname <- "testmat1" # unused, but see text
  with (paramEnv, {
    for (i in 1:300) {
      temp <- testmat1[10,]
      testmat1[10,] <- temp * 0
    }
  })
  print(gc())
}
system.time(testfnDirect(myenv))
However, that code works by addressing testmat1 directly by name. Our problem is that we need to address it indirectly (we don't know in advance which matrices we'll be updating).
Is there a way of amending testfnDirect such that we use the variable varname rather than hardcoding testmat?

A fairly recent change to the 'data.table' package was specifically to avoid copying when modifying values. So if your application can handle data.tables for the other operations, that could be a solution. (And it would be fast.)

Well, it would be nice if you could explain why the first solution isn't OK... It looks much neater AND runs faster.
To try to answer the questions:
A "nested replacement" operation like foo[bar][baz] <- 42 is very complex, and is optimized for certain cases to avoid copying. But it is very likely that your particular use case is not optimized. That would lead to lots of copies, and loss of performance.
A way to test that theory is to call gcinfo(TRUE) before your tests. You'll then see that the first solution triggers 2 garbage collects, and the second one triggers around 160!
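As a complement to gcinfo(TRUE), base R's tracemem() can report when a particular object is duplicated. The snippet below is only a small sketch on a toy matrix (the names m and m2 are made up for illustration). It assumes an R build with memory profiling enabled and skips the tracing otherwise.

```r
# Toy matrix standing in for one of the large matrices (hypothetical data).
m <- matrix(1.0, nrow = 3, ncol = 2)

# Assumption: tracing only works if R was built with memory profiling.
can_trace <- isTRUE(unname(capabilities("profmem")))
if (can_trace) invisible(tracemem(m))

m2 <- m        # no copy yet: both names point at the same data
m2[1, ] <- 0   # modifying m2 forces a duplicate; tracemem reports it if enabled

if (can_trace) untracemem(m)
```

If a tracemem line is printed at the assignment, a copy was made at that point. The original matrix m is left unchanged either way.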
Here's a variant of your second solution that converts the environment to a list, does its thing and then converts back to an environment. It is as fast as your first solution.
Code:
testfnList <- function() {
  mylist <- as.list(myenv, all.names=TRUE)
  thisvar <- "testmat2"
  for (i in 1:300) {
    temp <- mylist[[thisvar]][10,]
    mylist[[thisvar]][10,] <- temp * 0
  }
  myenv <<- as.environment(mylist)
}
system.time(testfnList()) # 0.02 secs
...it would of course be neater if you passed myenv to the function as an argument.
A small improvement (if you loop a lot, not just 300 times) would be to index by number instead of name (doesn't work for environments, but for lists). Just change thisvar:
thisvar <- match("testmat2", names(mylist))

Related

R not remembering objects written within functions

I'm struggling to clearly explain this problem.
Essentially, something seems to have happened within the R environment: none of the code I write inside my functions is working and no data is being saved. If I type a command directly into the console it works (e.g. Monkey <- 0), but if I type it within a function, it isn't stored when I run the function.
It could be that I'm missing a glaring error in the code, but I noticed the problem when I accidentally clicked on the debugger and tried to exit out of the browser[1] prompt which appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold=0) {
  directory <- paste(getwd(),"/",directory,"/",sep="")
  file.list <- list.files(directory)
  number <- 1:length(file.list)
  monkey <- c()
  for (i in number) {
    x <- paste(directory,file.list[i],sep="")
    y <- read.csv(x)
    t <- sum(complete.cases(y))
    if (t >= threshold) {
      correl <- cor(y$sulfate, y$nitrate, use='pairwise.complete.obs')
      monkey <- append(monkey,correl)
    }
  }
  #correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
  #summary(correl)
}
corr('specdata', 150)
monkey
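It's a scoping issue. Each function call creates its own environment, separate from the global environment, and objects created there disappear when the function returns.
Using <- assigns in the local environment. To assign outside the function, use <<-, which searches the enclosing environments and falls back to the global environment if the name is not found.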
Here's some information on R environments.
I suggest you have a look at a tutorial on using functions in R.
Briefly (and sorry for my horrible explanation), objects that you define within functions will ONLY be defined within functions, unless you explicitly export them using (one of the possible approaches) the return() function.
browser() is indeed used for debugging; it keeps you inside the function and lets you access objects created inside the function.
In addition, to increase the chance of getting useful answers, I suggest that you post a self-contained, working piece of code that quickly reproduces the issue. Here you are reading some files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)
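To make the point about return values concrete, here is a minimal sketch that needs no files. The function collect_squares and its inputs are hypothetical stand-ins for corr and the CSV loop; the only change that matters is that the accumulated vector is the last expression in the function, so it becomes the return value.

```r
# Hypothetical stand-in for corr(): accumulate results in a local vector
# and hand that vector back to the caller.
collect_squares <- function(values, threshold = 0) {
  kept <- c()
  for (v in values) {
    if (v >= threshold) {
      kept <- append(kept, v^2)
    }
  }
  kept  # last evaluated expression is the function's return value
}

# The caller stores the result; 'kept' itself never appears in the workspace.
out <- collect_squares(1:5, threshold = 3)
print(out)
```

Applied to the question's code, this would presumably mean ending corr with monkey (or return(monkey)) and assigning the result of the call to a variable.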

different results using one core and multiple cores to modify data.table

I found something very confusing when using multiprocessing to modify values in an R data.table.
I tried to modify values in place by using a function. It works well using one core, and the values in the data.table were successfully changed. But when I used multiple cores, it failed to change the values in the data.table.
That makes me very confused. Anyone know why?
library(data.table)
library(parallel)
aa <- as.data.table(iris)
aa[,tt:=0]
# modify aa$tt in place
main <- function(x){
  #set(aa,x,6L,5)
  aa[x,tt:=5]
  return(NULL)
}
# aa$tt changed
mclapply(1:nrow(aa), main, mc.cores = 1)
# aa$tt unchanged
mclapply(1:nrow(aa), main, mc.cores = 2)
Short answer: Parallel sub processes work on copies of aa.
Longer answer:
mclapply uses forked "sub" processes (= mainly copies* of the parent process) and therefore works on copied data (aa in your case).
This means in-place changes of aa in a sub process do not modify aa in the main process.
See ?parallel::mclapply for details, e.g. how to use the final result that is a return value (!).
*) In fact under Linux forking is implemented using copy-on-write memory pages to improve performance
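The usual pattern that follows from this is to let each worker return its result and apply the results in the parent process. The sketch below uses only base R, with an environment standing in for the data.table as a mutable object; state and worker are hypothetical names. The same reasoning should carry over to := in data.table, but that is an assumption rather than something tested here.

```r
library(parallel)

# An environment is modified by reference, like a data.table is with :=.
state <- new.env()
state$tt <- rep(0, 4)

# Hypothetical worker: it changes 'state' in place AND returns the new value.
worker <- function(i) {
  state$tt[i] <- 5  # in a forked child this only touches the child's copy
  5                 # the return value is what travels back to the parent
}

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(seq_along(state$tt), worker, mc.cores = cores)

# Apply the collected results in the parent process.
state$tt <- unlist(res)
```

With this structure the final state does not depend on how many cores were used, because the parent never relies on side effects made inside the workers.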

Calling files with c(x:y)

I have a large number of files (GBs in size). I want to run a for loop in which I call some files, do some processing that creates some files, bind them together, and save the result.
AA<-c(1,6)
BB<-c(5,10)
for(i in length(AA)){
  listofnames<-list.files(pattern="*eng")
  listofnames<- listofnames[c(paste(AA[i],BB[i],sep=":"))]
  listoffiles <- lapply( listofnames, readRDS)
}
But listofnames has NA. What am I doing wrong?
It took me a while looking at your code to realize that you were actually trying to construct a character representation of the expression 1:5 that was supposed to index a vector by position. This is very wrong; you just can't paste together arbitrary R commands/expressions and expect to drop them into your code wherever you like. (Technically, there are tools that do that sort of thing, but they are discouraged.)
Probably you're looking to do something closer to:
listofnames <- list.files(pattern="*eng")
ind <- rep(1:5,each = 5,length.out = length(listofnames))
listofnames_split <- split(listofnames,ind)
for (i in seq_along(listofnames_split)){
  my_data <- lapply(listofnames_split[[i]], readRDS)
  #Do processing here
  #...
  rm(my_data) #Assuming memory really is a problem
}
But I'm just sketching out hypothetical code here, I can't really match it to your exact situation since your example isn't really fully fleshed out.
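To show why the pasted index produces NA, here is a small sketch with made-up file names (files is hypothetical; AA and BB follow the question). paste() returns the character string "1:2", and indexing an unnamed vector by a character string looks for an element with that name, which does not exist. Using the : operator on the numbers gives a real integer sequence instead.

```r
files <- c("a_eng", "b_eng", "c_eng", "d_eng")  # hypothetical file names
AA <- c(1, 3)
BB <- c(2, 4)

as_text <- paste(AA[1], BB[1], sep = ":")  # the character string "1:2"
bad <- files[as_text]                      # looks for an element *named* "1:2"

chunks <- list()
for (i in seq_along(AA)) {                 # visits 1 and 2, not only length(AA)
  chunks[[i]] <- files[AA[i]:BB[i]]        # numeric range used as positions
}
print(chunks)
```

Note also that for(i in length(AA)) runs only once, with i equal to 2, whereas seq_along(AA) visits every index.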

How to reuse code in an R function?

I have a block of code that I want to use several times inside a function (let's call it myFunction). I naturally want to avoid duplicating this block of code, but I can't find a way of reusing it short of putting it in an external file and sourcing that each time.
The first thing I tried was to put the duplicate code in an internal mini-function with no arguments (let's call it internalFunction). This meant that I could call internalFunction as needed; however, this masked the objects output by internalFunction from the main environment of myFunction.
I then tried using the <<- operator to assign output objects within internalFunction, so that they would be made available to the main environment of the myFunction. Unfortunately, this also makes those objects available to the global R environment outside myFunction, which I want to avoid.
Is there a way of writing a block of R code to an object and then calling that, or sourcing from an object instead of a file? I would really like to a) avoid duplicate code and b) include all code within a single file.
I think what you want is an easy way to return multiple values to the calling function. This can be done with a list, as follows:
maxmin <- function(i1,i2){
  if (i1>i2){
    mx <- i1
    mn <- i2
  } else {
    mn <- i1
    mx <- i2
  }
  rv <- list(min=mn,max=mx)
  return(rv)
}
r1 <- maxmin(3,4)
r2 <- maxmin(6,5)
print(sprintf("minimums %d %d",r1$min,r2$min))
print(sprintf("maximums %d %d",r1$max,r2$max))
Edit: I got rid of the quotes for the list element names, they are not necessary
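The same list-returning idea also works when the helper is defined inside the outer function, which seems closer to what the question describes. This is only a sketch using the question's placeholder names myFunction and internalFunction; the arithmetic is invented for illustration.

```r
myFunction <- function(a, b) {
  # Helper defined inside myFunction: it is visible only here,
  # and it hands its results back as a list instead of using <<-.
  internalFunction <- function(x, y) {
    list(total = x + y, diff = x - y)
  }

  first  <- internalFunction(a, b)  # reuse the block once...
  second <- internalFunction(b, a)  # ...and again with other inputs

  c(first$total, first$diff, second$diff)
}

res <- myFunction(5, 3)
print(res)
```

Nothing leaks into the global environment: neither internalFunction nor its intermediate objects exist after myFunction returns.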
Here is another way, but it feels tricky and is probably not a good software engineering solution in most cases. Basically you can explicitly access a variable in the parent's environment.
fun1 <- function(x)
{
  maxminenv <- function(i1,i2){
    if (i1>i2){
      mx <- i1
      mn <- i2
    } else {
      mn <- i1
      mx <- i2
    }
    penv <- parent.frame()
    penv$min <- mn
    penv$max <- mx
  }
  maxminenv(3,4)
  print(sprintf("min:%d max:%d",min,max))
}
fun1()
For more information on environments see this excellent chapter in Hadley Wickham's new book. http://adv-r.had.co.nz/Environments.html

Why does R store the loop variable/index/dummy in memory?

I've noticed that R keeps the index from for loops stored in the global environment, e.g.:
for (ii in 1:5){ }
print(ii)
# [1] 5
Is it common for people to have any need for this index after running the loop?
I never use it, and am forced to remember to add rm(ii) after every loop I run (first, because I'm anal about keeping my namespace clean and second, for memory, because I sometimes loop over lists of data.tables--in my code right now, I have 357MB-worth of dummy variables wasting space).
Is there an easy way to get around this annoyance?
Perfect would be a global option to set (a la options(keep_for_index = FALSE)); something like for(ii in 1:5, keep_index = FALSE) could be acceptable as well.
In order to do what you suggest, R would have to change the scoping rules for for loops. This will likely never happen because I'm sure there is code out there in packages that relies on it. You may not use the index after the for loop, but given that loops can break at any time, the final iteration value isn't always known ahead of time. And having this as a global option again would cause problems with existing code in working packages.
As pointed out, it's far more common to use sapply or lapply loops in R. Something like
for(i in 1:4) {
  lm(data[, 1] ~ data[, i])
}
becomes
sapply(1:4, function(i) {
  lm(data[, 1] ~ data[, i])
})
You shouldn't be afraid of functions in R. After all, R is a functional language.
It's fine to use for loops for more control, but you will have to take care of removing the indexing variable with rm() as you've pointed out. Unless you're using a different indexing variable in each loop, I'm surprised that they are piling up. I'm also surprised that in your case, if they are data.tables, they are adding additional memory, since data.tables don't make deep copies by default as far as I know. The only memory "price" you would pay is a simple pointer.
I agree with the comments above. Even if you have to use a for loop (using just side effects, not functions' return values) it would be a good idea to structure your code in several functions and store your data in lists.
However, there is a way to "hide" index and all temporary variables inside the loop - by calling the for function in a separate environment:
do.call(`for`, alist(i, 1:3, {
  # ...
  print(i)
  # ...
}), envir = new.env())
But ... if you could put your code in a function, the solution is more elegant:
for_each <- function(x, FUN) {
  for(i in x) {
    FUN(i)
  }
}
for_each(1:3, print)
Note that when using a "for_each"-like construct you don't even see the index variable.
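Another option in the same spirit, sketched here as a suggestion rather than something from the answers above, is base R's local(), which evaluates a block in a fresh environment. The loop index and any temporaries stay inside that environment, and only the value of the last expression comes back. The names loop_idx, running and total are made up for the example.

```r
# Everything inside local() lives in a temporary environment.
total <- local({
  running <- 0
  for (loop_idx in 1:5) {
    running <- running + loop_idx
  }
  running  # value of the block, handed back to the caller
})

print(total)
print(exists("loop_idx"))  # the index did not leak into the workspace
```

This keeps the for loop syntax intact while avoiding the need for rm() afterwards.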

Resources