Are for loops evil in R?

I've heard that you're not meant to force a procedural programming style onto R. I'm finding this pretty hard. I've just solved a problem with a for loop. Is this wrong? Is there a better, more "R-style" solution?
The problem: I have two columns, Col1 and Col2. Col1 contains job titles that have been entered in a free-form way. I want to use Col2 to collect these job titles into categories (so "Junior Technician", "Engineering technician" and "Mech. tech." are all listed as "Technician").
I've done it like this:
jobcategories <- list(
  "Junior Technician|Engineering technician|Mech. tech." = "Technician",
  "Manager|Senior Manager|Group manager|Pain in the ****" = "Manager",
  "Admin|Administrator|Group secretary" = "Administrator")

for (currentjob in names(jobcategories)) {
  df$Col2[grep(currentjob, df$Col1)] <- jobcategories[[currentjob]]
}
This produces the right results, but I can't shake the feeling that (because of my procedural experience) I'm not using R properly. Could an R expert put me out of my misery?
EDIT
I was asked for the original data. Unfortunately, I can't supply it, because it's got confidential info in it. It's basically two columns. The first column holds just over 400 rows of different job titles (and the odd personal name). There are about 20 different categories that these 400 titles can be split into. The second column starts off as NA, then gets populated after running the for loop.

You are right that for loops are often discouraged in R, and in my experience this is for two main reasons:
Growing objects
As eloquently described in circle 2 of the R inferno, it can be extremely inefficient to grow an object one element at a time, as is often the temptation in for loops. For instance, this is a pretty common yet inefficient work flow, because it reallocates output each iteration of the loop:
output <- c()
for (idx in indices) {
  scalar <- compute.new.scalar(idx)
  output <- c(output, scalar)
}
This inefficiency can be removed by pre-allocating output to the proper size and using a for loop or by using a function like sapply.
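For concreteness, here is a minimal sketch of both fixes, reusing the placeholder compute.new.scalar and indices from above:
# Fix 1: pre-allocate the result to its final size, then fill it in place.
output <- numeric(length(indices))
for (i in seq_along(indices)) {
  output[i] <- compute.new.scalar(indices[i])
}

# Fix 2: let sapply manage the output object for you.
output <- sapply(indices, compute.new.scalar)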
Missing out on faster vectorized alternatives
The second source of inefficiency comes from performing a for loop over a fast operation when a vectorized alternative exists. For instance, consider the following code:
s <- 0
for (elt in x) {
  s <- s + elt
}
This is a for loop over a very fast operation (adding two numbers), so the overhead of the loop will be significant compared to the vectorized sum function, which adds up all the elements of the vector. The sum function is quick because it's implemented in C, so it will be more efficient to do s <- sum(x) than to use the for loop (not to mention less typing). Sometimes it takes more creativity to figure out how to replace a for loop with a fast interior with a vectorized alternative (cumsum and diff come up a lot), but it can lead to significant efficiency improvements. In cases where you have a fast loop interior but can't figure out how to use vectorized functions to achieve the same thing, I've found that reimplementing the loop with the Rcpp package can yield a faster alternative.
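As a rough illustration (assuming x is the same numeric vector used in the loop above), the whole accumulation collapses to a single call, and the same idea extends to running totals and successive differences:
s <- sum(x)                    # replaces the entire accumulation loop

# Loops that build running totals or take successive differences can often
# be replaced the same way:
running.total <- cumsum(x)     # instead of accumulating inside a loop
successive.diffs <- diff(x)    # instead of subtracting neighbours in a loop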
In summary...
For loops can be slow if you are incorrectly growing objects, or if the loop has a very fast interior and the whole thing could be replaced with a vectorized operation. Otherwise you're probably not losing much efficiency, since the apply family of functions is performing for loops on the inside, too.

for loops are not 'evil' in R, but they are typically slow compared to vectorized methods and are frequently not the best available solution. However, they are easy to implement and easy to understand, and you should not underestimate the value of either of these.
In my view, therefore, you should use a for loop when you need to get something done quickly, can't see a better way to do it, and don't need to worry too much about speed.

You'll usually find that there is a non-'for loop' way to do things.
For example:
If you create a simple table mapping your old jobs to the new ones:
job_map <- data.frame(
  current = c("Junior Technician", "Engineering technician", "Mech. tech.",
              "Manager", "Senior Manager", "Group manager", "Pain in the ****",
              "Admin", "Administrator", "Group secretary"),
  new = c(rep("Technician", 3), rep("Manager", 4), rep("Administrator", 3))
)
And you had a table of jobs to reclassify:
my_df <- data.frame(job_name = sample(job_map$current, 50, replace = TRUE))
The match command will help you:
my_df$new <- job_map$new[match(my_df$job_name, job_map$current)]
my_df
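A nice side effect of the match approach: job titles that aren't in job_map (such as the odd personal name mentioned in the question) simply come back as NA in the new column, so you can list them afterwards and extend the lookup table as needed:
# Titles that didn't match anything in job_map end up as NA:
unmatched <- my_df$job_name[is.na(my_df$new)]
unique(unmatched)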

Related

Need to loop through columns to perform t tests?

So I have a customer survey, and I need to determine if there are significant differences between the four areas. I obviously want to do a t-test on these, and here is my current R solution.
for (i in colnames(c_survey))
  assign(i, subset(c_survey, select = i))

elements <- list(quality, ease.of.use, price, service)
elements_alt <- list(service, price, ease.of.use, quality)

for (i in elements) {
  print(names(elements)[i])
  for (j in elements_alt) {
    print(t.test(i, j)$p.value)
  }
}
(Edit) I figured out the nested loop, but I still think there's a faster way to do what I want than this whole two list, nested loop nonsense. Also my output from this has no names on it so I have no idea what's being compared to what, and it includes all the duplicate inverse comparisons. I also can't save the results. I think my solution certainly lies elsewhere.
Also, would producing so many t-test p values even be the best statistical way to accomplish what I want? It seems like there should be something easier than this...
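One loop-free sketch, assuming c_survey really does contain numeric columns named quality, ease.of.use, price and service as in the code above: combn() visits each unordered pair of columns exactly once, which removes the duplicate inverse comparisons and leaves the results in a named vector you can save.
cols <- c("quality", "ease.of.use", "price", "service")

# One p-value per unordered pair of columns.
pvals <- combn(cols, 2, FUN = function(pair)
  t.test(c_survey[[pair[1]]], c_survey[[pair[2]]])$p.value)

# Label each p-value with the pair it came from.
names(pvals) <- combn(cols, 2, FUN = paste, collapse = " vs ")
pvals
On the statistical point: base R also has pairwise.t.test(), which takes a single response vector plus a grouping factor (so the survey would need reshaping to long format) and applies a multiple-comparison correction by default; a one-way ANOVA across the four areas is another common first step.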

Yet another apply question

I am totally convinced that an efficient R program should avoid using loops whenever possible and should instead use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum inside the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem. Assume N = 100:
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2) + 1):n) *
  eps^((floor(n/2) + 1):n) *
  (1 - eps)^(n - ((floor(n/2) + 1):n))))
As you can see, the function inside causes the length of the built vector to explode, whereas using a sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2) + 1):n) *
  eps^((floor(n/2) + 1):n) *
  (1 - eps)^(n - ((floor(n/2) + 1):n))))
What I would like to have is a list of length N.
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly, because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the length of the vector is fairly costly, because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly, because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
# Create the output object. We're specifying the length in advance so that
# writing to it is cheap.
output <- numeric(length = length(N))

# Start the for loop
for (i in seq_along(output)) {
  output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow, because you're writing to a vector and you've specified the length in advance. And since data.frames are really lists of equally-sized vectors, you can even use this to work around some of the issues with running for loops over data.frames: if you're only writing to a single column of the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped over the data.frame, but it will run faster because you only have to modify the data.frame once.
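To make that last point concrete, here is a small sketch (df, some_input and your_computations_go_here are placeholders for illustration, not anything from the question):
# Loop over a plain vector, then attach the finished column in one step.
output <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  output[i] <- your_computations_go_here(df$some_input[i])
}
df$new_col <- output   # the data.frame is only modified once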

How to structure R code for inherently nested problems to be easily readable?

There are problems that inherently require several layers of nesting to be solved. In a current project, I frequently find myself using three nested applys in order to do something with the elements contained in the deepest layer of a nested list structure.
R's list handling and apply family allow for quite concise code for this type of problem, but writing it still gives me headaches, and I'm pretty sure that anyone else reading it will need several minutes to understand what I am doing. And that's even though I'm basically doing nothing complicated - if it weren't for the multiple layers of lists to be traversed.
Below I provide a snippet of code that I wrote that shall serve as an example. I would consider it concise, yet hard to read.
Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and which is invoked by muscular activity. For this, I consider several muscles (first layer of lists) which each consist of a number of so-called motor units (second layer of lists) which each are related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects .firing.contribs, which contains the information on how strong the action of a motor unit influences the potential at a specific electrode, and .firing.instants which contains the information on the instants in time at which these motor units are fired. The given function then calculates the time course of the potential at each electrode.
Here's the question: What are possible options to render this type of code easily understandable for the reader? In particular: how can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.
Notes:
I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
I am not interested in performance for the moment, just readability.
Here's the example code.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
  calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
    lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
           .MU.firing.instants)

  calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {
    MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
                          .MU.firing.instants, SIMPLIFY = FALSE)
    muscle.contribs <- reduce.contribs(MU.contribs)
  }

  muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
                            .firing.instants, SIMPLIFY = FALSE)
  surface.potentials <- reduce.contribs(muscle.contribs)
}
## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {
  contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
    sapply(obj.list, `[[`, i))
  contribs <- lapply(contribs.by.electrode, rowSums)
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
  ## This will in reality be more complicated since then .MU.firing.contrib
  ## will have a different (more complicated) structure.
  .MU.firing.contrib * .MU.firing.instants
}

firing.contribs <- list(list(list(1, 2), list(3, 4)),
                        list(list(5, 6), list(7, 8), list(9, 10)))
firing.instants <- list(list(c(0, 0, 1, 0), c(0, 1, 0, 0)),
                        list(c(0, 0, 0, 0), c(1, 0, 1, 0), c(0, 1, 1, 0)))

surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)
As suggested by user #alexis_laz, it is a good option to actually not use a nested list structure for representing data, but rather use a (flat, 2D, non-nested) data.frame. This greatly simplified coding and increased code conciseness and readability for the above example and seems to be promising for other cases, too.
It completely eliminates the need for nested apply-foo and complicated list traversing. Moreover, it allows for the usage of a lot of built-in R functionality that works on data frames.
I think there are two main reasons why I did not consider this solution before:
It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this doesn't matter by any measure since it's just indices that don't consume a lot of memory or something, but it still feels kind of bad.
Here is the rewritten example. Comments are most welcome.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
  firing.info <- merge(.firing.contribs, .firing.instants)
  firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
                                      firing.info$contrib,
                                      firing.info$instants,
                                      SIMPLIFY = FALSE)
  surface.potentials <- by(I(firing.info$contribs.time),
                           factor(firing.info$electrode),
                           function(list) colSums(do.call(rbind, list)))
  surface.potentials
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
  ## This will in reality be more complicated since then .MU.firing.contrib
  ## will have a different (more complicated) structure.
  .MU.firing.contrib * .MU.firing.instants
}

firing.instants <- data.frame(muscle = c(1, 1, 2, 2, 2),
                              MU = c(1, 2, 1, 2, 3),
                              instants = I(list(c(F, F, T, F), c(F, T, F, F),
                                                c(F, F, F, F), c(T, F, T, F),
                                                c(F, T, T, F))))
firing.contribs <- data.frame(muscle = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                              MU = c(1, 1, 2, 2, 1, 1, 2, 2, 3, 3),
                              electrode = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                              contrib = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

surface.potentials <- sum.MU.firing.contributions(firing.contribs,
                                                  firing.instants)

Why is using assign bad?

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
  assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason the R FAQ (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know? I suspect it may mess with the scoping or lazy evaluation but I am not sure? Example code that demonstrates such problems will be great.
Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely that you then always need get to corral or gather up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but did not imply it was "bad". I happen to read it as "less functional" since you are not actually returning a value that gets assigned. The effect looks to be a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign seems to be adopted by people that are coming from languages that have global variables as a way of avoiding picking up the "True R Way", i.e. functional programming with data-objects. They really should be learning to use lists rather than littering their workspace with individually-named items.
There is another assignment paradigm that can be used:
foo <- setNames(paste0(letters, 1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
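For example, accessing values by name with [:
foo <- setNames(paste0(letters, 1:26), LETTERS)
foo["A"]
#    A
# "a1"
foo[c("A", "B")]
#    A    B
# "a1" "b2"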
As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard-to-find bugs.
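To make the quiz concrete, here is one hypothetical way some.function.that.uses.assign could be written (the name comes from the quiz; the body is illustrative only):
some.function.that.uses.assign <- function(z) {
  # Writes into the global environment behind the caller's back.
  assign("x", mean(z), envir = globalenv())
  sd(z)
}

x <- 1
y <- some.function.that.uses.assign(rnorm(100))
x   # no longer 1 -- it was changed "at a distance"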
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected, which leads either to complaining about R (which is as fair as complaining that my next-door neighbor's house is too far away because I chose to walk the long way around the block) or, even worse, to delving into eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: the $ and $<- operators also work for environments, so in many cases the explicit assign and get aren't necessary there, either.
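A small sketch of that intended use, where assign/get (or $) target an explicit environment instead of polluting the global workspace:
e <- new.env()
assign("count", 1, envir = e)      # explicit environment, no global side effect
get("count", envir = e)
# [1] 1

# The $ operators often make the explicit assign/get unnecessary:
e$count <- e$count + 1
e$count
# [1] 2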

Merging dataframes in R on a pre-sorted column?

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data, user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine.
But merge doesn't take advantage of these dataframes being sorted on the common column, which makes the merge pretty CPU/memory heavy. However, this merge could be done in O(n).
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?
I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve.
For most practical purposes, data.table = data.frame + index. As a consequence, when used right, this improves the performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've set it up, functions like merge can easily use the index for better performance.
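A minimal sketch of the keyed approach, assuming two tables that share a user column as in the question (the toy columns here are made up for illustration):
library(data.table)

some.data <- data.table(user = 1:5, data_1 = rnorm(5), data_2 = rnorm(5))
user      <- data.table(user = 1:5, user_attr_1 = letters[1:5],
                        user_attr_2 = LETTERS[1:5])

setkey(some.data, user)   # builds the index; the table is now sorted by user
setkey(user, user)

m <- merge(some.data, user)   # merge.data.table can use the keys for a fast join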
If your set of common keys/indexes is totally overlapping, that is...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE, and they are, as you said, sorted, and there are no duplicate keys, then your merging problem is reduced to adding columns to a data.frame. Something along the lines of...
library(log4r)

t1 <- system.time(z <- merge(user, some.data, by = 'user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))

t2 <- Sys.time()
r <- data.frame(user.id = user$user.id, V1.x = user$V1, V2.x = user$V2)
r[, names(some.data)] <- some.data[, names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3 - t2))
If the assumptions above do not hold, then it gets slightly messier (set union of both key sets, a translation function, NA padding), but the sorting and overlapping assumptions alone get you a long way ahead.
Notice also that the timing of the second approach is biased, since it calls Sys.time() twice, unlike the merge() timing, which uses system.time() and only calls it once.
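For what it's worth, a sketch of timing both approaches the same way (system.time() around each), so the comparison is fair:
t.merge <- system.time(z <- merge(user, some.data, by = 'user.id'))

t.assign <- system.time({
  r <- user
  new.cols <- setdiff(names(some.data), 'user.id')
  r[, new.cols] <- some.data[, new.cols]
})

t.merge['elapsed']
t.assign['elapsed']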
(Excuse my lame usage of S.O. mark-up)
