How to structure R code for inherently nested problems to be easily readable?

There are problems that inherently require several layers of nesting to solve. In a current project, I frequently find myself using three nested apply calls to do something with the elements contained in the deepest layer of a nested list structure.
R's list handling and the apply family allow for quite concise code for this type of problem, but writing it still gives me headaches, and I'm pretty sure that anyone else reading it will need several minutes to understand what I am doing. And that's despite the fact that I'm basically doing nothing complicated; it's just the multiple layers of lists that have to be traversed.
Below is a snippet of code I wrote that serves as an example. I would consider it concise, yet hard to read.
Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and that are evoked by muscular activity. For this, I consider several muscles (first layer of lists), each of which consists of a number of so-called motor units (second layer of lists), each of which in turn is related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects firing.contribs, which contains the information on how strongly the activity of a motor unit influences the potential at a specific electrode, and firing.instants, which contains the instants in time at which these motor units fire. The given function then calculates the time course of the potential at each electrode.
Here's the question: What are possible options to render this type of code easily understandable for the reader? Particularly: How can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.
Notes:
I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
I am not interested in performance for the moment, just readability.
Here's the example code.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
    calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
        lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
               .MU.firing.instants)
    calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {
        MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
                              .MU.firing.instants, SIMPLIFY = FALSE)
        muscle.contribs <- reduce.contribs(MU.contribs)
    }
    muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
                              .firing.instants, SIMPLIFY = FALSE)
    surface.potentials <- reduce.contribs(muscle.contribs)
}
## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {
    contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
        sapply(obj.list, `[[`, i))
    contribs <- lapply(contribs.by.electrode, rowSums)
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
    ## This will in reality be more complicated since then .MU.firing.contrib
    ## will have a different (more complicated) structure.
    .MU.firing.contrib * .MU.firing.instants
}
firing.contribs <- list(list(list(1, 2), list(3, 4)),
                        list(list(5, 6), list(7, 8), list(9, 10)))
firing.instants <- list(list(c(0,0,1,0), c(0,1,0,0)),
                        list(c(0,0,0,0), c(1,0,1,0), c(0,1,1,0)))
surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)

As suggested by user @alexis_laz, a good option is not to represent the data as a nested list structure at all, but rather as a (flat, 2D, non-nested) data.frame. This greatly simplified the code and increased conciseness and readability for the above example, and it seems promising for other cases, too.
It completely eliminates the need for nested apply-foo and complicated list traversal. Moreover, it allows the use of a lot of built-in R functionality that operates on data frames.
I think there are two main reasons why I did not consider this solution before:
It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this hardly matters, since those columns hold only small indices that consume little memory, but it still feels somewhat wrong.
Here is the rewritten example. Comments are most welcome.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
    firing.info <- merge(.firing.contribs, .firing.instants)
    firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
                                        firing.info$contrib,
                                        firing.info$instants,
                                        SIMPLIFY = FALSE)
    surface.potentials <- by(I(firing.info$contribs.time),
                             factor(firing.info$electrode),
                             function(list) colSums(do.call(rbind, list)))
    surface.potentials
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
    ## This will in reality be more complicated since then .MU.firing.contrib
    ## will have a different (more complicated) structure.
    .MU.firing.contrib * .MU.firing.instants
}
firing.instants <- data.frame(muscle = c(1,1,2,2,2),
                              MU = c(1,2,1,2,3),
                              instants = I(list(c(F,F,T,F), c(F,T,F,F),
                                                c(F,F,F,F), c(T,F,T,F), c(F,T,T,F))))
firing.contribs <- data.frame(muscle = c(1,1,1,1,2,2,2,2,2,2),
                              MU = c(1,1,2,2,1,1,2,2,3,3),
                              electrode = c(1,2,1,2,1,2,1,2,1,2),
                              contrib = c(1,2,3,4,5,6,7,8,9,10))
surface.potentials <- sum.MU.firing.contributions(firing.contribs,
                                                  firing.instants)
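For completeness, here is a minimal sketch of how the nested firing.contribs list from the first version could be flattened into this format (flatten.contribs is a hypothetical helper, not part of the original code; the nested apply calls are still there, but only once, at the boundary between the two representations):
flatten.contribs <- function(.firing.contribs) {
    do.call(rbind, lapply(seq_along(.firing.contribs), function(m) {
        do.call(rbind, lapply(seq_along(.firing.contribs[[m]]), function(u) {
            contribs <- unlist(.firing.contribs[[m]][[u]])
            data.frame(muscle = m, MU = u,
                       electrode = seq_along(contribs),
                       contrib = contribs)
        }))
    }))
}

## reproduces the firing.contribs data.frame defined above
flatten.contribs(list(list(list(1, 2), list(3, 4)),
                      list(list(5, 6), list(7, 8), list(9, 10))))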

Related

R language: how to work with dynamically sized vector?

I'm learning R programming and trying to understand the best approach to work with a vector when you don't know what its final size will end up being. For example, in my case I need to build the vector inside a for loop, but only for some iterations, which aren't known beforehand.
METHOD 1
I could run through the loop a first time to determine the final vector length, initialize the vector to the correct length, then run through the loop a second time to populate the vector. This would be ideal from a memory usage standpoint, since the vector would occupy only the amount of memory actually required.
METHOD 2
Or, I could use a single for loop and simply append to the vector as needed, but this would be inefficient from a memory allocation standpoint, since a new block of memory may need to be allocated each time an element is appended to the vector. If you're working with big data, this could be a problem.
METHOD 3
In C or Matlab, I usually initialize the vector to the largest possible length it could end up with, populate a subset of its elements in the for loop, and then resize it appropriately once the loop completes.
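For illustration, here is a minimal sketch of METHOD 3 in R (the runif()/threshold test is just a hypothetical stand-in for whatever decides which iterations contribute an element):
n_max <- 100                    # largest length the result could possibly have
out <- numeric(n_max)           # preallocate to that maximum size
k <- 0                          # number of elements actually filled so far
for (i in seq_len(n_max)) {
    val <- runif(1)
    if (val > 0.5) {            # hypothetical keep/skip condition
        k <- k + 1
        out[k] <- val
    }
}
out <- out[seq_len(k)]          # resize to the number of elements actually used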
Since R is used a lot in data science, I thought this would be a topic others would have encountered and there may be a best practice that was recommended. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)
some_fun <- function(x) {
    if (x > 0.5) {
        return(x)
    } else {
        return(NULL)
    }
}
unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
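For the record, this toy example reduces to a single vectorized subsetting expression that gives the same result as the unlist(lapply(...)) call above:
x[x > 0.5]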
I think Method 1 is the best approach if you have a very large amount of data. But in general, you might want to read this chapter before you make a final decision:
http://adv-r.had.co.nz/memory.html

Are For loops evil in R?

I've heard that you're not meant to force a procedural programming style onto R. I'm finding this pretty hard. I've just solved a problem with a for loop. Is this wrong? Is there a better, more "R-style" solution?
The problem: I have two columns: Col1 and Col2. Col1 contains job titles that have been entered in a free-form way. I want to use Col2 to collect these job titles into categories (so that "Junior Technician", "Engineering technician" and "Mech. tech." are all listed as "Technician").
I've done it like this:
jobcategories <- list(
    "Junior Technician|Engineering technician|Mech. tech." = "Technician",
    "Manager|Senior Manager|Group manager|Pain in the ****" = "Manager",
    "Admin|Administrator|Group secretary" = "Administrator")
for (currentjob in names(jobcategories)) {
    df$Col2[grep(currentjob, df$Col1)] <- jobcategories[[currentjob]]
}
This produces the right results, but I can't shake the feeling that (because of my procedural experience) I'm not using R properly. Could an R expert put me out of my misery?
EDIT
I was asked for the original data. Unfortunately, I can't supply it, because it's got confidential info in it. It's basically two columns. The first column holds just over 400 rows of different job titles (and the odd personal name). There are about 20 different categories that these 400 titles can be split into. The second column starts off as NA, then gets populated after running the for loop.
You are right that for loops are often discouraged in R, and in my experience this is for two main reasons:
Growing objects
As eloquently described in Circle 2 of The R Inferno, it can be extremely inefficient to grow an object one element at a time, as is often the temptation in for loops. For instance, the following is a pretty common yet inefficient workflow, because it reallocates output on each iteration of the loop:
output <- c()
for (idx in indices) {
    scalar <- compute.new.scalar(idx)
    output <- c(output, scalar)
}
This inefficiency can be removed by pre-allocating output to the proper size and using a for loop or by using a function like sapply.
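To make that concrete, here are both fixes sketched against the same (hypothetical) indices and compute.new.scalar() used in the snippet above:
## (a) pre-allocate the output to its final length and fill it by position
output <- numeric(length(indices))
for (j in seq_along(indices)) {
    output[j] <- compute.new.scalar(indices[j])
}

## (b) or let sapply build the result vector for you
output <- sapply(indices, compute.new.scalar)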
Missing out on faster vectorized alternatives
The second source of inefficiency comes from performing a for loop over a fast operation when a vectorized alternative exists. For instance, consider the following code:
s <- 0
for (elt in x) {
    s <- s + elt
}
This is a for loop over a very fast operation (adding two numbers), and the overhead of the loop will be significant compared to the vectorized sum function, which adds up all the elements of the vector. The sum function is quick because it's implemented in C, so it will be more efficient to do s <- sum(x) than to use the for loop (not to mention less typing). Sometimes it takes more creativity to figure out how to replace a for loop that has a fast interior with a vectorized alternative (cumsum and diff come up a lot), but it can lead to significant efficiency improvements. In cases where you have a fast loop interior but can't figure out how to use vectorized functions to achieve the same thing, I've found that reimplementing the loop with the Rcpp package can yield a faster alternative.
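As a small illustration of that kind of rewrite, using the same x as above: a running total accumulated element by element is just cumsum(x), and the final total is sum(x).
running <- numeric(length(x))
s <- 0
for (i in seq_along(x)) {
    s <- s + x[i]        # slow: one addition per loop iteration
    running[i] <- s
}
stopifnot(all.equal(running, cumsum(x)),   # vectorized running total
          all.equal(s, sum(x)))            # vectorized grand total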
In summary...
For loops can be slow if you are incorrectly growing objects, or if the interior of the loop is very fast and the entire thing can be replaced with a vectorized operation. Otherwise you're probably not losing much efficiency, as the apply family of functions performs for loops on the inside, too.
For loops are not 'evil' in R, but they are typically slow compared to vector-based methods and frequently not the best available solution. However, they are easy to implement and easy to understand, and you should not underestimate the value of either of those qualities.
In my view, therefore, you should use a for loop if you need to get something done quickly, can't see a better way to do it, and don't need to worry too much about speed.
That said, you'll usually find that there is a non-for-loop way to do things.
For example:
If you create a simple table mapping your old jobs to the new ones:
job_map <- data.frame(
    current = c("Junior Technician", "Engineering technician", "Mech. tech.",
                "Manager", "Senior Manager", "Group manager", "Pain in the ****",
                "Admin", "Administrator", "Group secretary"),
    new = c(rep("Technician", 3), rep("Manager", 4), rep("Administrator", 3))
)
And you had a table of jobs to reclassify:
my_df <- data.frame(job_name = sample(job_map$current, 50, replace = TRUE))
The match command will help you:
my_df$new <- job_map$new[match(my_df$job_name, job_map$current)]
my_df

Yet another apply question

I am totally convinced that an efficient R program should avoid loops whenever possible and instead use the big family of apply functions.
But this cannot happen without pain.
For example, I'm facing a problem whose solution involves a sum inside the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem. Assume N = 100:
sapply(list(1:N), function(n) (
    choose(n, (floor(n/2)+1):n) *
    eps^((floor(n/2)+1):n) *
    (1 - eps)^(n - ((floor(n/2)+1):n))))
As you can see, the function inside causes the length of the built vector to explode,
whereas using sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
    choose(n, (floor(n/2)+1):n) *
    eps^((floor(n/2)+1):n) *
    (1 - eps)^(n - ((floor(n/2)+1):n))))
What I would like to have is a list of length N.
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the length of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
# Create output object. We're specifying the length in advance so that
# writing to it is cheap.
output <- numeric(length = length(N))
# Start the for loop
for (i in seq_along(output)) {
    output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
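A small sketch of that last point (my_df and the doubling are hypothetical stand-ins for the real data and computation):
my_df <- data.frame(val = runif(5))
output <- numeric(nrow(my_df))        # pre-allocated, as above
for (i in seq_len(nrow(my_df))) {
    output[i] <- my_df$val[i] * 2     # stand-in for the real per-row computation
}
my_df$new_col <- output               # modify the data.frame only once, at the end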

Approaches to preserving object's attributes during extract/replace operations

Recently I encountered the following problem in my R code. In a function accepting a data frame as an argument, I needed to add (or replace, if it already exists) a column with data calculated from the values of one of the data frame's original columns. I wrote the code, but testing revealed that the data frame extract/replace operations I had used resulted in the loss of the object's special (user-defined) attributes.
After realizing that and confirming that behavior by reading R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.html), I decided to solve the problem very simply - by saving the attributes before the extract/replace operations and restoring them thereafter:
myTransformationFunction <- function(data) {
    # save object's attributes
    attrs <- attributes(data)

    ## <data frame transformations; involves extract/replace operations on `data`>

    # restore the attributes
    attributes(data) <- attrs
    return(data)
}
This approach worked. However, I accidentally ran across another piece of R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.data.frame.html), which offers an IMHO interesting (and potentially more generic?) alternative approach to solving the same problem:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector

`[.avector` <- function(x, i, ...) {
    r <- NextMethod("[")
    mostattributes(r) <- attributes(x)
    r
}

d <- data.frame(i = 0:7, f = gl(2, 4),
                u = structure(11:18, unit = "kg", class = "avector"))
str(d[2:4, -1])  # 'u' keeps its "unit"
I would really appreciate it if people here could help by:
Comparing the two above-mentioned approaches, if they are comparable (I realize that the second approach as defined is for data frames, but I suspect it can be generalized to any object);
Explaining the syntax and meaning of the function definition in the second approach, especially as.data.frame.avector, as well as the purpose of the line as.data.frame.avector <- as.data.frame.vector.
I'm answering my own question, since I have just found an SO question (How to delete a row from a data.frame without losing the attributes), answers to which cover most of my questions posed above. However, additional explanations (for R beginners) for the second approach would still be appreciated.
UPDATE:
Another solution to this problem has been proposed in an answer to the following SO question: indexing operation removes attributes. Personally, however, I prefer the approach based on creating a new class, as it's IMHO semantically cleaner.

Why is using assign bad?

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
    assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason this R FAQ entry (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) says this pattern should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know about? I suspect it may mess with scoping or lazy evaluation, but I am not sure. Example code that demonstrates such problems would be great.
Actually, those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely that you then always need get to corral or gather up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but it does not imply that assign is "bad". I happen to read it as "less functional", since you are not actually returning a value that gets assigned. The effect looks to be a side effect (and in this case the assign strategy results in 26 separate side effects). The use of assign seems to be adopted by people coming from languages with global variables, as a way of avoiding picking up the "True R Way", i.e. functional programming with data objects. They really should be learning to use lists rather than littering their workspace with individually named items.
There is another assignment paradigm that can be used:
foo <- setNames(paste0(letters, 1:26), LETTERS)
That creates a named atomic vector rather than a named list, but access to the values in the vector is still done by passing names to [.
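Access by name then looks like this:
foo["A"]             # "a1", returned with its name attached
foo[c("B", "Z")]     # several values at once
unname(foo["A"])     # drop the name if you only want the value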
As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard to find bugs.
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected, which leads them either to complain about R (which is about as fair as complaining that my next-door neighbor's house is too far away because I chose to walk the long way around the block) or, even worse, to delve into using eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: the $ and $<- operators also work for environments, so in many cases the explicit assign and get aren't necessary there, either.
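A minimal sketch of that point, using a dedicated environment rather than the global one (the names alpha and beta are just placeholders):
e <- new.env()
assign("alpha", 1, envir = e)   # explicit assign into that environment
e$beta <- 2                     # the $<- operator does the same job
get("alpha", envir = e)         # explicit get ...
e$beta                          # ... or just $
ls(e)                           # "alpha" "beta"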
