Creating Automation in R

I have created a script that analyzes a set of raw data and converts it into many different formats based on different parameters and functions. I have 152 more raw data sheets to go, but all I will have to do is run my script on each one. However, there will be times when I decide I need to change a variable or parameter, so I would like to set up a parameter list at the top of my script that would affect the rest of the functions in my soon-to-be very large script.
Global variables aren't the answer to this problem, which is best illustrated with an example:
exceedes <- function(L = NULL, R = NULL) {
  if (is.null(L) || is.null(R)) {
    print("exceedes: invalid L, R.")
    return(NULL)
  }
  test <- mean(L, na.rm = TRUE) - R * sd(L, na.rm = TRUE)
  test1 <- ifelse(is.na(L), NA, ifelse(L > test, 1, 0))
  return(test1)
}
L <- ROCC[, 2]
R <- 0.08
ROCC$newcolumn <- exceedes(L, R)
names(ROCC)[names(ROCC) == "newcolumn"] <- "Exceedes1"
L <- ROCC[, 2]
R <- 0.16
ROCC$newcolumn <- exceedes(L, R)
names(ROCC)[names(ROCC) == "newcolumn"] <- "Exceedes2"
L <- ROCC[, 2]
R <- 0.24
ROCC$newcolumn <- exceedes(L, R)
names(ROCC)[names(ROCC) == "newcolumn"] <- "Exceedes3"
So in the above example, I would like a way, at the top of my script, to change the range of R and have it affect the rest of the script, because this function will be repeated 152 times. The only way I can think of is to copy and paste the function over and over with a different variable each time, and set it globally. But I have to imagine there is a simpler way; perhaps my function needs to be rearranged?
File names and output names. I am not sure whether this is possible, but say all my input .csv files come in a format where one dataset is titled 123, another 124, another 125, and so on. Could R know to take the very next dataset, and then output that dataset to a specific folder on my computer, without me having to actually type read.csv(file="123.csv") and then write.csv(example, file="123.csv") each time?
General formatting of automation script
Before I dive into my automation, my plan was to copy and paste the script 152 times over and then change the file name and output name in each copy. This sounds ridiculous, but with my lack of programming skills I am not sure of a better way. Any ideas?
Thanks for all the help in advance.

You can rerun the function with different parameters by constructing a vector of parameters (say R):
R <- seq(0.1, 1, by = 0.01)
and then run your exceedes function length(R) times using sapply.
exceedes <- function(R, L) {} # notice the argument order
sapply(X = R, FUN = exceedes, L = ROCC[, 2])
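For instance, here is a sketch that builds all three Exceedes columns from the question in one step, calling the exceedes() defined in the question with named arguments (the column names are taken from the question):
R_values <- c(0.08, 0.16, 0.24)
# one column per cutoff; sapply returns an nrow(ROCC) x 3 matrix
new_cols <- sapply(R_values, function(r) exceedes(L = ROCC[, 2], R = r))
colnames(new_cols) <- paste0("Exceedes", seq_along(R_values))
ROCC <- cbind(ROCC, new_cols)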
You can pass other arguments to your function (e.g. file.name) and use them to create whatever file name you need.
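For the file-name part of the question, a minimal sketch under the stated assumptions (inputs named 123.csv, 124.csv, and so on; the directory names here are hypothetical):
input_dir <- "input"    # hypothetical folder holding 123.csv, 124.csv, ...
output_dir <- "output"  # hypothetical destination folder
files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  dat <- read.csv(f)
  # ... run the analysis on dat here ...
  write.csv(dat, file = file.path(output_dir, basename(f)), row.names = FALSE)
}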

Related

R: Enriched debugging for linear code chains

I am trying to figure out if it is possible, with a sane amount of programming, to create a certain debugging function by using R's metaprogramming features.
Suppose I have a block of code such that each line uses, as all or part of its input, the output of the line before -- the sort of code you might build with pipes (though no pipe is used here).
{
f1(args1) -> out1
f2(out1, args2) -> out2
f3(out2, args3) -> out3
...
fn(out<n-1>, args<n>) -> out<n>
}
Where for example it might be that:
f1 <- function(first_arg, second_arg, ...){my_body_code},
and you call f1 in the block as:
f1(second_arg = 1:5, list(a1 = "A", a2 = 1), abc = letters[1:3], fav = foo_foo)
where foo_foo is an object defined in the calling environment of f1.
I would like a function I could wrap around my block that would, for each line of code, create an entry in a list. Each entry would be named (line1, line2, ...) and each line entry would have a sub-entry for each argument and for the function output. The argument entries would consist, first, of the name of the formal to which the actual argument is matched; second, the expression or name supplied to that argument if there is one (and a placeholder if the argument is just a constant); and third, the value of that expression as if it were immediately forced on entry into the function. (I'd rather have the value as of the moment the promise is first kept, but that seems to me like a much harder problem, and the two values will most often be the same.)
All the arguments assigned to the ... (if any) would go in a dots = list() sublist, with entries named if they have names and appropriately labeled (..1, ..2, etc.) if they are assigned positionally. The last element of each line sublist would be the name of the output and its value.
The point of this is to create a fairly complete record of the operation of the block of code. I think of this as analogous to an elaborated version of purrr::safely that is not confined to iteration and keeps a more detailed record of each step, and indeed if a function exits with an error you would want the error message in the list entry as well as as much of the matched arguments as could be had before the error was produced.
It seems to me like this would be very useful in debugging linear code like this. This lets you do things that are difficult using just the RStudio debugger. For instance, it lets you trace code backwards. I may not know that the value in out2 is incorrect until after I have seen some later output. Single-stepping does not keep intermediate values unless you insert a bunch of extra code to do so. In addition, this keeps the information you need to track down matching errors that occur before promises are even created. By the time you see output that results from such errors via single-stepping, the matching information has likely evaporated.
I have actually written code that takes a piped function and eliminates the pipes to put it in this format, just using text manipulation. (Indeed, it was John Mount's "Bizarro pipe" that got me thinking of this.) And if I, or we, or you, can figure out how to do this, I would hope to make a serious run at a second version where each function calls the next, supplying it with arguments internally rather than externally -- like a traceback where you get the passed argument values as well as the function name and formals. Other languages have debugging environments like that (e.g. GDB), and I've been wishing for one for R for at least five years, maybe ten, and this seems like a step toward it.
Just issue the trace call shown below for each function that you want to trace.
f <- function(x, y) {
  z <- x + y
  z
}
trace(f, exit = quote(print(returnValue())))
f(1, 2)
giving the following, which shows the function name, the input, and the output. (The last 3 is printed by the function itself.)
Tracing f(1, 2) on exit
[1] 3
[1] 3
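If you want something closer to the per-line record the question describes, here is a minimal sketch of the idea (record_block is a hypothetical helper; it evaluates each line of a { } block and logs the deparsed call alongside its value, without the full argument matching the question asks for):
record_block <- function(block, env = parent.frame()) {
  stopifnot(is.call(block), identical(block[[1]], as.name("{")))
  steps <- list()
  for (i in seq_along(block)[-1]) {
    expr <- block[[i]]
    value <- eval(expr, env)  # assignments land in env, so later lines see them
    steps[[paste0("line", i - 1)]] <- list(call = deparse(expr), value = value)
  }
  steps
}
rec <- record_block(quote({
  out1 <- sqrt(16)
  out2 <- out1 + 1
}))
# rec$line1$call is "out1 <- sqrt(16)"; rec$line2$value is 5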

Using strings from loops as parts of function commands and variable names in R

How does one use a string coming from a loop
- to generate new variable names
- as part of a function call
- as a function's arguments
- as part of an if statement
in R?
Specifically, as an example (the code obviously doesn't work, but I'd like something no less intelligible than what is below):
list_dist <- c("unif", "norm")
for (dist in list_dist) {
  paste("rv", dist, sep="") = paste("r", dist, sep="")(100, 0, 1)
  paste("meanrv", dist, sep="") = mean(paste("rv", dist, sep=""))
  if (round(paste("meanrv", dist, sep=""), 3) != 0) {
    print("Not small enough")
  }
}
Note: this is an example, and I do need to use some kind of loop to avoid writing huge scripts.
I managed to use strings as in the example above, but only with eval/parse/paste, combining the whole statement (i.e. the whole "line") inside paste instead of pasting only the variable-name part or the function part, which makes the code ugly and illegible and the coding inefficient.
Other replies to similar questions that I've seen are not specific about how to deal with this sort of use of strings from loops.
I'm sure there must be a more efficient and flexible way to deal with this, as there is in some other languages.
Thanks in advance!
Resist the temptation to create variable names programmatically. Instead, structure your data properly into lists:
list_dist = list(unif = runif, norm = rnorm)
distributions = lapply(list_dist, function (f) f(100, 0, 1))
means = unlist(lapply(distributions, mean))
# … etc.
As you can see, this also gets rid of the loop, by using list functions instead.
Your last step can also be vectorised:
if (any(round(means, 3) != 0))
warning('not small enough')
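Everything is then addressed by name, which is what the generated variable names were trying to achieve:
means["norm"]        # mean of the normal draws
distributions$unif   # the uniform sample itself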
Try this, using assign() and get():
list_dist <- list(unif = runif, norm = rnorm)
for (i in seq_along(list_dist)) {
  assign(paste("rv", names(list_dist)[i], sep=""), list_dist[[i]](100, 0, 1))
  assign(paste("meanrv", names(list_dist)[i], sep=""), mean(get(paste("rv", names(list_dist)[i], sep=""))))
  if (round(get(paste("meanrv", names(list_dist)[i], sep="")), 3) != 0) {
    print("Not small enough")
  }
}

Good practice on how to store the result of a function for later use in R

I have a situation where I have written an R function, ComplexResult, that computes a computationally expensive result that two other functions, LaterFuncA and LaterFuncB, will later use.
I want to store the result of ComplexResult somewhere so that both LaterFuncA and LaterFuncB can use it, and it does not need to be recalculated. The result of ComplexResult is a large matrix that only needs to be calculated once, then re-used later on.
R is my first foray into the world of functional programming, so I am interested to understand what is considered good practice. My first line of thinking is as follows:
# run ComplexResult and get the result
cmplx.res <- ComplexResult(arg1, arg2)
# store the result in the global environment.
# NB this would not be run from a function
assign("CachedComplexResult", cmplx.res, envir = .GlobalEnv)
Is this at all the right thing to do? The only other approach I can think of is having a large "wrapper" function, e.g.:
MyWrapperFunction <- function(arg1, arg2) {
  cmplx.res <- ComplexResult(arg1, arg2)
  res.a <- LaterFuncA(cmplx.res)
  res.b <- LaterFuncB(cmplx.res)
  # do more stuff here ...
}
Thoughts? Am I heading at all in the right direction with either of the above? Or is there an Option C which is more cunning? :)
The general answer is that you should serialize/deserialize your big object for later use. The R way to do this is saveRDS/readRDS:
## save a single object to file
saveRDS(cmplx.res, "cmplx.res.rds")
## restore it under a different name
cmplx2.res <- readRDS("cmplx.res.rds")
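Either later function can then reuse the cached object without recomputing it:
cmplx.res <- readRDS("cmplx.res.rds")
res.a <- LaterFuncA(cmplx.res)
res.b <- LaterFuncB(cmplx.res)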
This assigns to the global environment:
CachedComplexResult <- ComplexResult(arg1, arg2)
To store I would use:
write.table(CachedComplexResult, file = "complex_res.txt")
And then to use it directly:
LaterFuncA(read.table("complex_res.txt"))
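One caveat, since ComplexResult returns a matrix: read.table() gives back a data.frame, so you may need to convert (and match the header/row-name options used when writing):
LaterFuncA(as.matrix(read.table("complex_res.txt")))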
Your approach works for saving to local memory; other answers have explained saving to global memory or a file. Here are some thoughts on why you would do one or the other.
Save to file: this is slowest, so only do it if your process is volatile and you expect it to crash hard and you need to pick up the pieces where it left off, OR if you just need to save the state once in a while where speed/performance is not a concern.
Save to global: if you need access from multiple spots in a large R program.
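If you want in-memory caching without assigning into the global environment, one sketch is a closure that computes the result once and then replays it (make_cached is a hypothetical helper; note it ignores any change in arguments after the first call):
make_cached <- function(f) {
  cache <- NULL
  function(...) {
    if (is.null(cache)) cache <<- f(...)  # compute once, then reuse
    cache
  }
}
cached_complex <- make_cached(ComplexResult)
res.a <- LaterFuncA(cached_complex(arg1, arg2))
res.b <- LaterFuncB(cached_complex(arg1, arg2))  # no recomputation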

Does an object need to be initialized before a for loop in R

I am wondering if I can create an object within a for loop, i.e. without having to initialize it first. I have tried doing this the way one might in MATLAB. Please see the following R code:
> for (i in 1:nrow(snp.ids)) {
+ snp.fasta[i]<-entrez_fetch(db="protein", id=snp.ids[i,], rettype="xml",retmode="text")
+ snp.seq[i]<-xpathSApply(xmlParse(snp.fasta[i]), "//Seq-data_iupacaa",xmlValue)
+ }
Error in snp.fasta[i] <- entrez_fetch(db = "protein", id = snp.ids[i, :
object 'snp.fasta' not found
It obviously does not find snp.fasta, but you can see from the code that I am trying to create snp.fasta. Can anyone shed any light on why it is not created within the for loop, and what the proper way to initialize snp.fasta would be if I cannot create it within the loop?
Thanks
Generally, yes, that would be an acceptable way to loop over a vector of ids. Just assign to a non-indexed object.
for (i in 1:nrow(snp.ids)) {
  snp.fasta <- entrez_fetch(db="protein", id=snp.ids[i,], rettype="xml", retmode="text")
  snp.seq <- xpathSApply(xmlParse(snp.fasta), "//Seq-data_iupacaa", xmlValue)
}
(You would then still need to assign any useful result to an indexable object, or build up such an object within the loop, or print some result. As it stands, this example will overwrite the values of snp.seq and leave only the last one.)
It's a bit confusing to see id=snp.ids[i,]. That would imply that snp.ids is two-dimensional, so I would have expected a column name or number to be used: id=snp.ids[i,"id"]. You should provide dput(head(snp.ids)) so we can do some realistic testing rather than this half-assed guesswork.
In R, subsetting is also a function, so assigning a value to an item in a vector:
a[1] = 123
is identical to
"["(a, 1) = 123
which R evaluates via the replacement function "[<-", with a as its first argument. Here [ is a normal function, so if a is not defined, there is an error.
Before the loop:
snp.fasta <- NULL
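A sketch of the pre-allocated version (assuming snp.ids as in the question; entrez_fetch is from the rentrez package, xpathSApply/xmlParse from the XML package, and it assumes one sequence per record):
library(rentrez)
library(XML)
n <- nrow(snp.ids)
snp.fasta <- character(n)  # pre-allocate so indexed assignment works
snp.seq <- character(n)
for (i in seq_len(n)) {
  snp.fasta[i] <- entrez_fetch(db = "protein", id = snp.ids[i, ], rettype = "xml", retmode = "text")
  snp.seq[i] <- xpathSApply(xmlParse(snp.fasta[i]), "//Seq-data_iupacaa", xmlValue)
}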

attach() inside function

I'd like to give a params argument to a function and then attach it so that I can use a instead of params$a every time I refer to the list element a.
run.simulation <- function(model, params) {
  attach(params)
  #
  # Use elements of params as parameters in a simulation
  detach(params)
}
Is there a problem with this? Say I have defined a global variable named c and have also defined an element named c of the list params; whose value would be used after the attach command?
Noah has already pointed out that using attach is a bad idea, even though you see it in some examples and books. There is a way around it: you can use a "local attach", which is called with(). In Noah's dummy example, this would look like
with(params, print(a))
which will yield identical result, but is tidier.
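Applied to the function from the question, that might look like this (a sketch; the body and the element a are placeholders):
run.simulation <- function(model, params) {
  with(params, {
    # elements of params (e.g. a) are visible here by name,
    # without touching the search path
    a * 2  # placeholder for the actual simulation code
  })
}
run.simulation(model = NULL, params = list(a = 3))  # returns 6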
Another possibility is:
run.simulation <- function(model, params) {
  # Assume params is a list of parameters from
  # "params <- list(name1 = value1, name2 = value2, etc.)"
  for (v in seq_along(params)) assign(names(params)[v], params[[v]])
  # Use elements of params as parameters in a simulation
}
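A tidier equivalent of that loop is list2env(), which copies every element of the list into the function's local environment in one call:
run.simulation <- function(model, params) {
  list2env(params, envir = environment())
  # elements of params are now local variables
}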
The easiest way to solve scope problems like this is usually to try something simple out:
a = 1
params = c()
params$a = 2
myfun <- function(params) {
  attach(params)
  print(a)
  detach(params)
}
myfun(params)
The following object(s) are masked _by_ .GlobalEnv:
a
# [1] 1
As you can see, R is picking up the global variable a here.
It's almost always a good idea to avoid using attach and detach wherever possible -- scope ends up being tricky to handle (incidentally, it's also best to avoid naming variables c -- R will often figure out what you're referring to, but there are so many other letters out there, why risk it?). In addition, I find code using attach/detach almost impossible to decipher.
Jean-Luc's answer helped me immensely for a case where I had a data.frame Dat instead of the list specified in the OP:
for (v in 1:ncol(Dat)) assign(names(Dat)[v], Dat[,v])
