Why is using assign bad? - r

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in seq_along(names.foo))
  assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason this (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) R-faq says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know? I suspect it may mess with the scoping or lazy evaluation, but I am not sure. Example code that demonstrates such problems would be great.

Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely that you then always need get to gather up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but did not imply it was "bad". I happen to read it as "less functional" since you are not actually returning a value that gets assigned. The effect looks to be a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign seems to be adopted by people that are coming from languages that have global variables as a way of avoiding picking up the "True R Way", i.e. functional programming with data-objects. They really should be learning to use lists rather than littering their workspace with individually-named items.
There is another assignment paradigm that can be used:
foo <- setNames( paste0(letters,1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
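A quick sketch of how access by name looks for that named vector, next to a named list built the lapply way. The object names foo_vec and foo_list are made up for this illustration:

```r
# Named atomic (character) vector, built as in the line above
foo_vec <- setNames(paste0(letters, 1:26), LETTERS)
foo_vec["C"]                 # lookup by name with single brackets
c_value <- unname(foo_vec["C"])

# Named list, for comparison (one object, 26 elements)
foo_list <- setNames(lapply(LETTERS, function(k) paste("This is:", k)), letters)
c_item <- foo_list[["c"]]    # double brackets extract a single element
```

Either way there is exactly one object in the workspace, and anything you want to do to all 26 values is a single vectorised call or a single lapply.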

As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard to find bugs.
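To make the quiz concrete, here is a minimal sketch. The function clobbering_mean is a hypothetical stand-in for some.function.that.uses.assign; nothing like it exists in base R:

```r
# Hypothetical function that looks harmless from the outside
clobbering_mean <- function(v) {
  # Side effect: silently overwrites 'x' in the global environment
  assign("x", length(v), envir = globalenv())
  mean(v)
}

x <- 1
y <- clobbering_mean(rnorm(100))

# The caller never touched x, yet the global x has changed
x_after <- get("x", envir = globalenv())
```

Nothing in the calling code hints that x would change, which is exactly what makes this kind of bug hard to track down.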
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected, which leads either to complaining about R (which is as fair as complaining that my next door neighbor's house is too far away because I chose to walk the long way around the block) or, even worse, to delving into eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
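A small sketch of what actually happens with the constructed name. The loop and the starting value of myvector are assumptions added for the illustration:

```r
myvector <- numeric(3)

for (i in 1:3) {
  curname <- paste('myvector[', i, ']')  # e.g. "myvector[ 1 ]"
  # Creates a NEW object literally named "myvector[ 1 ]";
  # it does not touch element 1 of myvector
  assign(curname, i)
}

unchanged <- all(myvector == 0)          # myvector is still all zeros
odd_one <- get("myvector[ 2 ]")          # the stray object, reachable only via get()

# Plain indexed assignment does what was intended
for (i in 1:3) myvector[i] <- i
```

assign treats its first argument as a name, never as an expression to be evaluated, so no amount of string pasting will turn it into an indexing operation.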

I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the global environment instead of a list or data.frame, vector, ...).
Side note: the $ and $<- operators also work for environments, so in many cases the explicit assign and get aren't necessary there either.
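A minimal sketch of assign and get used with an explicit environment, alongside the operator forms. The environment e and the names stored in it are made up for the example:

```r
e <- new.env()

# Explicit assign/get against a dedicated environment
assign("alpha", 1, envir = e)
alpha_val <- get("alpha", envir = e)

# The operator forms do the same job without assign/get
e$beta <- 2
e[["gamma"]] <- 3

contents <- sort(ls(e))                        # names stored in e
several <- mget(c("alpha", "gamma"), envir = e)  # fetch several values at once
```

Because everything lives in e, the global workspace stays clean and the whole collection can be passed around or discarded as one object.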

Related

Accessing index inside *apply

I have two containers, conty and contx. The values of both are tied to each other. conty[1] relates to contx[1] etc. while using apply on contx I want to access the index inside an apply structure so I can put values from corresponding element in conty into contz depending upon the index of x.
lapply(contx, function(x) {
if (x==1) append(contz,conty[xindex])
})
I could easily do this in a for loop but everybody insists that using the apply is better. And I tried to look for examples but the only thing I could find was mostly stuff for generating maps where it wasn't entirely clear how I could adapt to my problem.
There are a few issues here.
"everybody insists that using the apply is better". Sorry, but they're wrong; it's not necessarily better. See the old-school Burns Inferno ("If you are using R and you think you’re in hell, this is a map for you"), chapter 4 ("Overvectorization"):
A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. The lapply function buries the loop, but execution times tend to be roughly equal to an explicit for loop ... Base your decision of using an apply function on Uwe’s Maxim (page 20). The issue is of human time rather than silicon chip time. Human time can be wasted by taking longer to write the code, and (often much more importantly) by taking more time to understand subsequently what it does.
However, what you are doing that's bad is growing an object (also covered in the Inferno). Assuming that in your example contz started as an empty list, this should work (is my example reflective of your use case?)
x <- c(1,2,3,1)
conty <- list("a","b","c","d")
contz <- conty[which(x==1)]
Alternatively, if you want to use both the value and the index in your function, you can write a two-variable function f(val,index) and then use Map(f,my_list,seq_along(my_list))
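A sketch of the Map approach using the toy data from above. The function f is a hypothetical two-argument function, and dropping the NULLs with Filter is one choice among several:

```r
x <- c(1, 2, 3, 1)
conty <- list("a", "b", "c", "d")

# Hypothetical f(val, index): return the matching conty element when val is 1
f <- function(val, index) if (val == 1) conty[[index]] else NULL

# Map walks the values and their indices in parallel
picked <- Map(f, x, seq_along(x))

# Drop the NULL results to get the final container
contz <- Filter(Negate(is.null), picked)
```

For this particular task the one-liner conty[which(x==1)] is simpler; the Map form pays off when f needs to do real work with both the value and its position.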

Why do the R functions mean() and sum() behave differently with vectors vs. raw strings?

I was wondering if there was an underlying programming logic as to why some basic R functions behave differently towards raw data input into them vs. vectors.
For example, if I do this
mean(1,2,3)
I don't get the correct answer, and don't get an error
But if I do this
sum(1,2,3)
I do get the right answer, even though I'd assume proper syntax would be sum(c(1,2,3))
And if I do this
sd(1,2,3)
I get an error Error in sd(1, 2, 3) : unused argument (3)
I'm interested in what, if any, the underlying programming logic of these different behaviors is. (I'm sure if I rooted around in the source code I could figure out exactly why they behave differently, but I want to know if there is a reason why the code might have been written that way).
Practically, I'm teaching a basic R class and want to explain to my students why things work that way; they get a bit tired of me saying "That's just how R works, live with it; and always put things in vectors to make life easy."
EDITS: I have bolded some sections to add emphasis. My question is largely about software design, not how these particular functions happen to operate or how to determine their exact operation. That is, not "what arguments do these functions accept" but "why do simple mathematical functions in R appear (to a biologist) to have been designed differently".
The second argument taken by mean (in its default method) is trim, which is not a listed argument for sum. The first argument for sum is ..., so, I believe, the function will try to compute the sum of all values entered as unnamed arguments.
mean and sum are generic functions, so they get dispatched differently depending on an object's class.
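The argument matching can be checked directly. This is a small sketch using only base and stats functions; the variable names are made up:

```r
# Formal arguments of each function
mean_args <- names(formals(mean.default))  # x, trim, na.rm, ...
sum_args  <- names(formals(args(sum)))     # ..., na.rm
sd_args   <- names(formals(sd))            # x, na.rm

# mean(1, 2, 3) is matched as x = 1, trim = 2, na.rm = 3,
# so only the single value 1 is averaged
m_loose <- mean(1, 2, 3)

# sum collects every unnamed argument through ...
s_loose <- sum(1, 2, 3)

# Wrapping the values in c() makes them one vector for x
m_vec <- mean(c(1, 2, 3))
```

sd has only two formal arguments and no ..., which is why the third value in sd(1,2,3) is reported as unused rather than silently absorbed.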

rm(list=ls()) doesn't seem to make syntactical sense

I'm a long-time programmer in C, assembler, etc. I have just begun an intro to R (not by choice). On page 2, I encounter rm(list=ls()), which works to remove all objects. Since the = seems to be assigning the output of ls() to something named list, you would think that the same call with some other name in place of list would do the same thing, since the name shouldn't matter, or even more, why not just rm(ls())? Neither of those works. Further, the book says that ls() "lists all of the objects...", but the output of ls() isn't a list, it is a vector of character strings, apparently. So, apparently rm() takes something formally of type list, rather than simply a vector of character strings. Fine. But if I separately execute list=ls(), that creates a vector named list, not a list. So list=ls() followed by rm(list) is not the same as rm(list=ls()), apparently, in violation of what ought to be transitivity of operations or something.
So, apparently list=ls() inside of the rm() does not assign anything to something named list, apparently it produces something of type list to be fed into rm(). But that makes no sense, to write list=ls() if the goal is to produce something of type list from a vector (the output of ls()).
Should I just fall on my sword and realize that I am not going to be able to tolerate learning this language? I'm just too grumpy? Or, is the rest of R going to make perfect sense and I happened to stumble on some kind of ad-hoc adaptation of calling rm that was cooked up?
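As a possible starting point (a sketch, not a full answer): inside a call, name=value matches the value to the function's formal argument of that name; it is not an assignment. rm() happens to have a formal argument called list, which expects a character vector of object names. The scratch environment e below is an assumption of the example, used so that nothing in the real workspace gets deleted:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

# "list" is simply the name of one of rm()'s formal arguments
rm_args <- names(formals(rm))

# Passing a call in the ... slot is rejected: rm() accepts only
# bare names or character strings there
res <- try(rm(ls(e), envir = e), silent = TRUE)
failed <- inherits(res, "try-error")

# Supplying the character vector through the 'list' argument works
rm(list = ls(e), envir = e)
remaining <- ls(e)
```

Seen this way, the argument could have been called anything; it is named list only in the loose sense of "a list of names", not because it requires an object of type list.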

Ordered Map / Hash Table in R

While working with lists I've noticed an issue that I didn't expect.
result5 <- vector("list",length(queryResults[[1]]))
for(i in 1:length(queryResults[[1]])){
  id <- queryResults[[1]][i]
  result5[[id]] <-getPrices(id)
}
The problem is that after this code runs, instead of the result staying the same size (whatever the length of queryResults[[1]] is), it grows up to the last index, creating a bunch of NULL entries in the middle.
result5 currently stores a number of int/double lists, so it looks like:
result5[[index(int)]][[row]][col]
While on its own this is not too problematic, I would rather avoid it, simply for easier size calculations later on.
For clarification, id is an integer. And in the given case a for loop offers the same performance as the apply functions, but greater convenience.
After some testing, it seems the easiest way of doing it is to use the hash package and build a hash:
result6 <- hash(queryResults[[1]],lapply(queryResults[[1]],getPrices))
And if an element needs to be accessed, call
result6[[toString(id)]]
The difference in performance is marginal, although it's still fairly annoying having to include toString in your code.
It's not clear exactly what your question is, but judging by the structure of the loop, you probably want
result5[[i]] <- getPrices(id)
rather than result5[[id]] <- getPrices(id).
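A sketch of the difference, with made-up ids and a hypothetical stand-in for getPrices (the real one is not shown in the question). Naming the list by id is an extra step added here as one base-R way to keep lookup by id without the hash package:

```r
ids <- c(3L, 10L, 7L)                 # assumed example ids
getPrices <- function(id) id * 1.5    # hypothetical stand-in for the real function

# Indexing by id: the list is padded with NULLs up to the largest id
by_id <- vector("list", length(ids))
for (i in seq_along(ids)) by_id[[ids[i]]] <- getPrices(ids[i])
padded_length <- length(by_id)

# Indexing by position: the list keeps its size and the original order
by_pos <- vector("list", length(ids))
for (i in seq_along(ids)) by_pos[[i]] <- getPrices(ids[i])

# Names give lookup by id; they are stored as character strings
names(by_pos) <- ids
price_7 <- by_pos[["7"]]
```

The lookup still needs the id as a string (as.character(id) or toString(id)), because numeric indices always mean positions.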

Do you use attach() or call variables by name or slicing?

Many intro R books and guides start off with the practice of attaching a data.frame so that you can call the variables by name. I have always found it favorable to call variables with $ notation or square bracket slicing [,2]. That way I can use multiple data.frames without confusing them and/or use iteration to successively call columns of interest. I noticed Google recently posted coding guidelines for R which included the line
1) attach: avoid using it
How do people feel about this practice?
I never use attach. with and within are your friends.
Example code:
> N <- 3
> df <- data.frame(x1=rnorm(N),x2=runif(N))
> df$y <- with(df,{
    x1+x2
  })
> df
          x1         x2          y
1 -0.8943125 0.24298534 -0.6513271
2 -0.9384312 0.01460008 -0.9238312
3 -0.7159518 0.34618060 -0.3697712
>
> df <- within(df,{
    x1.sq <- x1^2
    x2.sq <- x2^2
    y <- x1.sq+x2.sq
    x1 <- x2 <- NULL
  })
> df
          y        x2.sq     x1.sq
1 0.8588367 0.0590418774 0.7997948
2 0.8808663 0.0002131623 0.8806532
3 0.6324280 0.1198410071 0.5125870
Edit: hadley mentions transform in the comments. Here is some code:
> transform(df, xtot=x1.sq+x2.sq, y=NULL)
       x2.sq       x1.sq       xtot
1 0.41557079 0.021393571 0.43696436
2 0.57716487 0.266325959 0.84349083
3 0.04935442 0.004226069 0.05358049
I much prefer to use with to obtain the equivalent of attach on a single command:
with(someDataFrame, someFunction(...))
This also leads naturally to a form where subset is the first argument:
with(subset(someDataFrame, someVar > someValue),
someFunction(...))
which makes it pretty clear that we operate on a selection of the data. And while many modelling functions have both data and subset arguments, the use above is more consistent, as it also applies to those functions that do not have data and subset arguments.
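A concrete instance of that pattern, using the built-in mtcars data set as a stand-in for someDataFrame (the choice of data and columns is only for illustration):

```r
# Mean mpg for four-cylinder cars: subset first, then evaluate inside with()
mpg_4cyl <- with(subset(mtcars, cyl == 4), mean(mpg))

# The same computation written with explicit $ indexing, for comparison
mpg_4cyl_check <- mean(mtcars$mpg[mtcars$cyl == 4])
```

mean has neither a data nor a subset argument, so this is a case where the with(subset(...)) form is the one that carries over unchanged.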
The main problem with attach is that it can result in unwanted behaviour. Suppose you have an object named xyz in your workspace. Now you attach dataframe abc, which has a column named xyz. If your code refers to xyz, can you guarantee whether it refers to the object or to the dataframe column? If you don't use attach, it is easy: plain xyz refers to the object, and abc$xyz refers to the column of the dataframe.
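A sketch of that scenario. The values are made up, and warn.conflicts = FALSE is used only to silence the masking message that attach would otherwise print:

```r
xyz <- "workspace object"
abc <- data.frame(xyz = 1:3)

attach(abc, warn.conflicts = FALSE)
# The workspace is searched before attached data frames,
# so plain xyz still finds the workspace object, not the column
found <- xyz
detach(abc)

# Unambiguous references, no attach needed
col_values <- abc$xyz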
One of the main reasons that attach is used frequently in textbooks is that it shortens the code.
"Attach" is an evil temptation. The only place where it works well is in the classroom setting where one is given a single dataframe and expected to write lines of code to do the analysis on that one dataframe. The user is unlikely to ever use that data again once the assignment is done and handed in.
However, in the real world, more data frames can be added to the collection of data in a particular project. Furthermore one often copies and pastes blocks of code to be used for something similar. Often one is borrowing from something one did a few months ago and cannot remember the nuances of what was being called from where. In these circumstances one gets drowned by the previous use of "attach."
Just like Leoni said, with and within are perfect substitutes for attach, but I wouldn't completely dismiss it. I use it sometimes, when I'm working directly at the R prompt and want to test some commands before writing them in a script. Especially when testing multiple commands, attach can be a more interesting, convenient and even harmless alternative to with and within, since after you run attach, the command prompt is clear for you to write inputs and see outputs.
Just make sure to detach your data after you're done!
I prefer not to use attach(), as it is far too easy to run a batch of code several times each time calling attach(). The data frame is added to the search path each time, extending it unnecessarily. Of course, good programming practice is to also detach() at the end of the block of code, but that is often forgotten.
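The growth of the search path is easy to see. In this sketch the data frame xxx is made up, and warn.conflicts = FALSE only suppresses the masking message from the second attach:

```r
xxx <- data.frame(y = 1:3)

before <- length(search())

# Running the same block twice attaches the data frame twice
attach(xxx, warn.conflicts = FALSE)
attach(xxx, warn.conflicts = FALSE)
during <- length(search())

# Each attach needs its own detach to restore the search path
detach(xxx)
detach(xxx)
after <- length(search())
```

Each forgotten detach leaves one more copy on the search path, and those copies do not update when xxx itself is later modified.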
Instead, I use xxx$y or xxx[,"y"]. It's more transparent.
Another possibility is to use the data argument available in many functions which allows individual variables to be referenced within the data frame. e.g., lm(z ~ y, data=xxx).
While I, too, prefer not to use attach(), it does have its place when you need to persist an object (in this case, a data.frame) through the life of your program when you have several functions using it. Instead of passing the object into every R function that uses it, I think it is more convenient to keep it in one place and call its elements as needed.
That said, I would only use it if I know how much memory I have available and only if I make sure that I detach() this data.frame once it is out of scope.
Am I making sense?

Resources