How to reuse code in an R function?

I have a block of code that I want to use several times inside a function (let's call it myFunction). I naturally want to avoid duplicating this block of code, but I can't find a way of reusing it short of putting it in an external file and sourcing that each time.
The first thing I tried was to put the duplicated code in an internal mini-function with no arguments (let's call it internalFunction). This meant that I could call internalFunction as needed; however, this masked the objects output by internalFunction from the main environment of myFunction.
I then tried using the <<- operator to assign output objects within internalFunction, so that they would be made available to the main environment of myFunction. Unfortunately, this also makes those objects available in the global R environment outside myFunction, which I want to avoid.
Is there a way of writing a block of R code to an object and then calling that, or sourcing from an object instead of a file? I would really like to a) avoid duplicate code and b) include all code within a single file.
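For what it's worth, the literal ask is possible: a block of code can be stored, unevaluated, in an object with quote() and then run in the current environment with eval(). A minimal sketch (the variable names here are made up for illustration):
myFunction <- function() {
  # store the shared block once, as an unevaluated expression
  block <- quote({
    y <- x * 2
    z <- y + 1
  })
  x <- 5
  eval(block)  # creates y and z in myFunction's environment
  x <- 10
  eval(block)  # reuses the same block with the new x
  z
}
myFunction()  # [1] 21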

I think what you want is an easy way to return multiple values to the calling function. This can be done with a list, as follows:
maxmin <- function(i1, i2) {
  if (i1 > i2) {
    mx <- i1
    mn <- i2
  } else {
    mn <- i1
    mx <- i2
  }
  rv <- list(min = mn, max = mx)
  return(rv)
}
r1 <- maxmin(3,4)
r2 <- maxmin(6,5)
print(sprintf("minimums %d %d",r1$min,r2$min))
print(sprintf("maximums %d %d",r1$max,r2$max))
Edit: I got rid of the quotes around the list element names; they are not necessary.

Here is another way, but it feels tricky and is probably not a good software engineering solution in most cases. Basically, you can explicitly access a variable in the parent's environment.
fun1 <- function() {
  maxminenv <- function(i1, i2) {
    if (i1 > i2) {
      mx <- i1
      mn <- i2
    } else {
      mn <- i1
      mx <- i2
    }
    penv <- parent.frame()
    penv$min <- mn
    penv$max <- mx
  }
  maxminenv(3, 4)
  print(sprintf("min:%d max:%d", min, max))
}
fun1()
For more information on environments, see the excellent Environments chapter in Hadley Wickham's book Advanced R: http://adv-r.had.co.nz/Environments.html
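A related trick, for completeness: instead of assigning into the parent frame from inside the helper, return a named list (as in the first answer) and splice it into the caller's environment with list2env(). A minimal sketch reusing maxmin() from above; the same style caveats apply:
fun2 <- function() {
  # copy the list elements min and max into fun2's environment
  list2env(maxmin(3, 4), envir = environment())
  print(sprintf("min:%d max:%d", min, max))
}
fun2()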

Related

Adding columns to data frame via user-defined function

I am trying to add columns to several data frames. I am trying to create a function that will add the columns, and then I want to use that function with lapply over a list of objects. The function currently just adds empty columns to the data frame. But, once I solve the problem below, I would like to extend it to populate the new columns automatically (while still keeping the initial name of each object).
This is the code I have so far:
AAA_Metadata <- data.frame(AAA_Code=character(), AAA_REV4=character(),
                           AAA_CONCEPT=character(), AAA_Unit=character(),
                           AAA_Date=character(), AAA_Vintage=character())
add_empty_metadata <- function(x) {
  temp_dataframe <- setNames(data.frame(matrix(ncol=length(AAA_Metadata), nrow=nrow(x))),
                             as.list(colnames(AAA_Metadata)))
  x <- cbind(temp_dataframe, x)
}
However, when I run this
a <- data.frame(matrix(ncol=6,nrow=100))
add_empty_metadata(a)
and look at the Global Environment, object "a" still has 6 columns instead of 12.
I understand that I am actually working on a copy of "a" within the function (based on the other topics I checked, e.g. Update data frame via function doesn't work). So I tried:
x <<- cbind(temp_dataframe,x)
and
x <- cbind(temp_dataframe,x)
assign('x',x, envir=.GlobalEnv)
But neither of those works. I want to have the new a in the Global Environment for future reference, keeping the name 'a' unchanged. Any idea what I am doing wrong here?
Is this what you're looking for:
addCol <- function(x, newColNames){
  for(i in newColNames){
    x[,i] <- NA
  }
  return(x)
}
a <- data.frame(matrix(ncol=6, nrow=100)); dim(a)
a <- addCol(a, newColNames = names(AAA_Metadata)); dim(a)
An amazing source for this kind of stuff is Advanced R by Hadley Wickham, with a website at http://adv-r.had.co.nz/.
R objects are effectively immutable - they don't change in place - they get destroyed and rebuilt with the same name. a is never changed: it is used as an input to the function, and unless the resulting object inside the function is returned and reassigned, the version inside the function (which lives in a separate environment) is binned when the function completes.
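To make that concrete, the fix is simply to capture the return value, and lapply does the same over a whole named list, preserving the names. A minimal sketch, assuming AAA_Metadata and add_empty_metadata() are defined as in the question:
a <- data.frame(matrix(ncol=6, nrow=100))
a <- add_empty_metadata(a)   # reassign the result: a now has 12 columns
dim(a)                       # [1] 100  12
# the same pattern over several data frames, names preserved by lapply:
dfs <- list(x1 = data.frame(matrix(ncol=6, nrow=10)),
            x2 = data.frame(matrix(ncol=6, nrow=20)))
dfs <- lapply(dfs, add_empty_metadata)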

Why does R store the loop variable/index/dummy in memory?

I've noticed that R keeps the index from for loops stored in the global environment, e.g.:
for (ii in 1:5){ }
print(ii)
# [1] 5
Is it common for people to have any need for this index after running the loop?
I never use it, and am forced to remember to add rm(ii) after every loop I run (first, because I'm anal about keeping my namespace clean, and second, for memory, because I sometimes loop over lists of data.tables - in my code right now, I have 357MB worth of dummy variables wasting space).
Is there an easy way to get around this annoyance?
Perfect would be a global option to set (a la options(keep_for_index = FALSE)); something like for(ii in 1:5, keep_index = FALSE) could be acceptable as well.
In order to do what you suggest, R would have to change the scoping rules for for loops. This will likely never happen, because I'm sure there is code out there in packages that relies on the current behaviour. You may not use the index after the for loop, but given that loops can break() at any time, the final iteration value isn't always known ahead of time. And having this as a global option would again cause problems with existing code in working packages.
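A quick illustration of why the final value isn't knowable ahead of time:
for (ii in 1:5) {
  if (ii == 3) break
}
print(ii)
# [1] 3 -- the value at the break, not 5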
As pointed out, it's far more common to use sapply or lapply loops in R. Something like
for(i in 1:4) {
  lm(data[, 1] ~ data[, i])
}
becomes
sapply(1:4, function(i) {
  lm(data[, 1] ~ data[, i])
})
You shouldn't be afraid of functions in R. After all, R is a functional language.
It's fine to use for loops when you need more control, but you will have to take care of removing the indexing variable with rm(), as you've pointed out. Unless you're using a different indexing variable in each loop, I'm surprised that they are piling up. I'm also surprised that, in your case, if they are data.tables, they are taking up additional memory, since data.tables don't make deep copies by default as far as I know. The only memory "price" you would pay is a simple pointer.
I agree with the comments above. Even if you have to use a for loop (relying on side effects rather than functions' return values), it would be a good idea to structure your code in several functions and store your data in lists.
However, there is a way to "hide" the index and all temporary variables inside the loop - by calling the for function in a separate environment:
do.call(`for`, alist(i, 1:3, {
  # ...
  print(i)
  # ...
}), envir = new.env())
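You can check that nothing leaks into the calling environment (assuming no i was defined there beforehand):
do.call(`for`, alist(i, 1:3, print(i)), envir = new.env())
exists("i")
# [1] FALSE -- the index lived only in the throwaway environment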
But ... if you could put your code in a function, the solution is more elegant:
for_each <- function(x, FUN) {
  for(i in x) {
    FUN(i)
  }
}
for_each(1:3, print)
Note that with a "for_each"-like construct you don't even see the index variable.

How to import only functions from .R file without executing the whole file

Let's say I have an R script, testScript.R:
test <- function(){cat('Hello world')}
cat('Bye world')
In the R console, I understand I can import the function test() by
source('testScript.R')
However, at the same time, it will also execute cat('Bye world'). Assuming I am not allowed to create or modify files, is there a way to import only the function test() without executing cat('Bye world')?
First of all, let me say that this really isn't a good idea. R is a functional programming language, so functions are just like regular objects. There's not a strong separation between calling a function and assigning a function. These are all pretty much the same thing:
a <- function(a) a+1
a(6)
# [1] 7
assign("a", function(i) i+1)
a(6)
# [1] 7
`<-`(a, function(i) i+1)
a(6)
# [1] 7
There's no difference between defining a function and calling an assignment function. You never know what the code in a file will do unless you run it; therefore it's not easy to tell which code creates "functions" and which does not. As @mdsumner pointed out, you would be better off manually separating the code you use to define functions from the code you use to run them.
That said, if you wanted to extract all the function assignments made with <- from a code file, you could do
cmds <- parse("testScript.R")
assign.funs <- sapply(cmds, function(x) {
  if(x[[1]] == "<-") {
    if(x[[3]][[1]] == "function") {
      return(TRUE)
    }
  }
  return(FALSE)
})
eval(cmds[assign.funs])
This will evaluate all the function assignments of the "standard" form.
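With testScript.R from the question, running the snippet above defines test() without ever printing 'Bye world':
test()
# Hello world
Note that assignments written with = or assign() are not of this "standard" form, so the check above won't pick them up.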
Oh man... that's interesting. I don't know of any way to do that without some atrocity like this:
# assume your two-line script is stored in testScript.R
a <- readLines("testScript.R")
a <- paste(a, collapse="\n")
library(stringr)
func_string <- str_extract(a, "[a-z]+ <- function.+}")
test <- eval(parse(text=func_string))
> test()
Hello world
You will certainly need to work on the regex to extract your functions. And str_extract_all() will be helpful if there's more than one function. Good luck.

Calling objects in nested functions in R

First off, I'm an R beginner taking an R programming course at the moment. It is extremely lacking in teaching the fundamentals of R, so I'm trying to learn them myself via you wonderful contributors on Stack Overflow. I'm trying to figure out how nested functions work, which means I also need to learn how lexical scoping works. I've got a function that computes the complete cases in multiple CSV files and spits out a nice table right now.
Here's the CSV files:
https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
And here's my code. I realize it'd be cleaner if I used the apply functions, but it works as is:
complete <- function(directory, id = 1:332){
  data <- NULL
  for (i in 1:length(id)) {
    data[[i]] <- c(paste(directory, "/", formatC(id[i], width=3, flag=0),
                         ".csv", sep=""))
  }
  cases <- NULL
  for (d in 1:length(data)) {
    cases[[d]] <- c(read.csv(data[d]))
  }
  df <- NULL
  for (c in 1:length(cases)){
    df[[c]] <- (data.frame(cases[c]))
  }
  dt <- do.call(rbind, df)
  ok <- (complete.cases(dt))
  finally <- as.data.frame(table(dt[ok, "ID"]), colnames=c("id", "nobs"))
  colnames(finally) <- c('id', 'nobs')
  return(finally)
}
I am now trying to use the variables in the data frame finally (the output of the function above) within this new function:
corr <- function(directory, threshold = 0){
  complete(directory, id = 1:332)
  finally$nobs
}
corr('specdata')
Without finally$nobs this function spits out the data frame, as it should, but when I try to access the column nobs of the object finally, it says object finally is not found. I realize this problem is due to my lack of understanding of lexical scoping; my professor hasn't made it very clear, so I'm not totally sure how to find the object within the nested function environment. Any help would be great.
The object finally is only in scope within the function complete(). If you want to do something further with the object you are returning, you need to store it in a variable in the environment you are working in (in this instance, that environment is the body of corr(); if we weren't working inside any function, it would be the global environment). In other words, this code should work:
corr <- function(directory, threshold = 0){
  this.finally <- complete(directory, id = 1:332)
  this.finally$nobs
}
I am calling the object that is returned by complete() this.finally to help distinguish it from the object finally that is now out of scope. Of course, you can call it anything you like!
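The same pattern works at the top level, if you want the values around for later (a small usage sketch, assuming the specdata directory from the question):
res <- corr('specdata')   # store what corr() returns...
head(res)                 # ...and the nobs values remain available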

Writing to large matrices within a function - fast vs slow

[Question amended following responses]
Thanks for the responses. I was unclear in my question, for which I apologise.
I'll try to give more details of our situation. We have c. 100 matrices that we keep in an environment. Each is very large. If at all possible we want to avoid any copying of these matrices when we perform updates. We're often running up against the 2GB memory limit, so this is very important for us.
So our two requirements are 1) avoiding copies and 2) addressing the matrices indirectly by name. Speed, whilst important, is a side-issue that would be solved by avoiding the copying.
It appears to me that Tommy's solution involved creating a copy (though it did entirely answer my actual original question, so I'm the one at fault).
The code below is what seems most obvious to us, but it clearly creates a copy (as shown by the memory.size increase):
myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)
testfnDirect <- function(paramEnv) {
  print(memory.size())
  for (i in 1:300) {
    temp <- paramEnv$testmat1[10,]
    paramEnv$testmat1[10,] <- temp * 0
  }
  print(memory.size())
}
system.time(testfnDirect(myenv))
Using the with keyword seems to avoid this, as shown below:
myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)
testfnDirect <- function(paramEnv) {
  print(gc())
  varname <- "testmat1" # unused, but see text
  with(paramEnv, {
    for (i in 1:300) {
      temp <- testmat1[10,]
      testmat1[10,] <- temp * 0
    }
  })
  print(gc())
}
system.time(testfnDirect(myenv))
However, that code works by addressing testmat1 directly by name. Our problem is that we need to address it indirectly (we don't know in advance which matrices we'll be updating).
Is there a way of amending testfnDirect so that it uses the variable varname rather than hardcoding testmat1?
A fairly recent change to the 'data.table' package was specifically to avoid copying when modifying values. So if your application can handle data.tables for the other operations, that could be a solution. (And it would be fast.)
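A minimal sketch of that approach, assuming each matrix can live as a data.table; set() updates cells by reference, without copying the table:
library(data.table)
dt <- as.data.table(matrix(1.0, nrow=6000, ncol=200))
# zero out row 10 in place, column by column
for (j in seq_len(ncol(dt))) {
  set(dt, i = 10L, j = j, value = 0)
}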
Well, it would be nice if you could explain why the first solution isn't OK... It looks much neater AND runs faster.
To try to answer the questions:
A "nested replacement" operation like foo[bar][baz] <- 42 is very complex, and is optimized for certain cases to avoid copying. But it is very likely that your particular use case is not optimized. That would lead to lots of copies, and loss of performance.
A way to test that theory is to call gcinfo(TRUE) before your tests. You'll then see that the first solution triggers 2 garbage collects, and the second one triggers around 160!
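For example:
gcinfo(TRUE)     # report each garbage collection as it happens
system.time(testfnDirect(myenv))
gcinfo(FALSE)    # back to silent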
Here's a variant of your second solution that converts the environment to a list, does its thing, and then converts back to an environment. It is as fast as your first solution.
Code:
testfnList <- function() {
  mylist <- as.list(myenv, all.names=TRUE)
  thisvar <- "testmat1"
  for (i in 1:300) {
    temp <- mylist[[thisvar]][10,]
    mylist[[thisvar]][10,] <- temp * 0
  }
  myenv <<- as.environment(mylist)
}
system.time(testfnList()) # 0.02 secs
...it would of course be neater if you passed myenv to the function as an argument.
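A sketch of that refactor, passing both the environment and the matrix name as arguments (the function name testfnListArg is just for illustration):
testfnListArg <- function(paramEnv, varname) {
  mylist <- as.list(paramEnv, all.names=TRUE)
  for (i in 1:300) {
    temp <- mylist[[varname]][10,]
    mylist[[varname]][10,] <- temp * 0
  }
  as.environment(mylist)   # return the rebuilt environment
}
myenv <- testfnListArg(myenv, "testmat1")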
A small improvement (if you loop a lot, not just 300 times) would be to index by number instead of by name (this doesn't work for environments, but does for lists). Just change thisvar:
thisvar <- match("testmat1", names(mylist))
