Approaches to preserving object's attributes during extract/replace operations - r

Recently I encountered the following problem in my R code. In a function, accepting a data frame as an argument, I needed to add (or replace, if it exists) a column with data calculated based on values of the data frame's original column. I wrote the code, but the testing revealed that data frame extract/replace operations, which I've used, resulted in a loss of the object's special (user-defined) attributes.
After realizing that and confirming that behavior by reading R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.html), I decided to solve the problem very simply - by saving the attributes before the extract/replace operations and restoring them thereafter:
myTransformationFunction <- function (data) {
# save object's attributes
attrs <- attributes(data)
<data frame transformations; involves extract/replace operations on `data`>
# restore the attributes
attributes(data) <- attrs
return (data)
}
This approach worked. However, accidentally, I ran across another piece of R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.data.frame.html), which offers IMHO an interesting (and, potentially, a more generic?) alternative approach to solving the same problem:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
d <- data.frame(i = 0:7, f = gl(2,4),
u = structure(11:18, unit = "kg", class = "avector"))
str(d[2:4, -1]) # 'u' keeps its "unit"
I would really appreciate if people here could help by:
Comparing the two above-mentioned approaches, if they are comparable (I realize that the second approach as defined is for data frames, but I suspect it can be generalized to any object);
Explaining the syntax and meaning in the function definition in the second approach, especially as.data.frame.avector, as well as what is the purpose of the line as.data.frame.avector <- as.data.frame.vector.

I'm answering my own question, since I have just found an SO question (How to delete a row from a data.frame without losing the attributes), answers to which cover most of my questions posed above. However, additional explanations (for R beginners) for the second approach would still be appreciated.
UPDATE:
Another solution to this problem has been proposed in an answer to the following SO question: indexing operation removes attributes. Personally, however, I better like the approach, based on creating a new class, as it's IMHO semantically cleaner.

Related

R - Operate on a column w/o explicitely reassigning it?

I'm often writing things like:
dataframe$this_column <- as.Date(dataframe$this_column)
That is, when changing some column in my data frame [table], I'm constantly writing the column twice. Is there some function that allows me to directly change the data frame w/o explicitly reassigning it? Say: ch(dataframe$this_column, as.Date())
EDIT: While similar, the potential duplicate is not the same. I am not looking for a way to shorten self-referential reassignments. I'm looking to avoid the explicit reassignment all together. The answer I accepted here is an appropriate solution (and much better than the answers provided in the "duplicate" question, in regards to their relevance to my question).
Here is the example using magrittr package:
library(magrittr)
x = c('2015-12-12','2015-12-13','2015-12-14')
df = data.frame(x)
df$x %<>% as.Date

How to structure R code for inherently nested problems to be easily readible?

There are problems that inherently require several layers of nesting to be solved. In a current project, I frequently find myself using three nested applys in order to do something with the elements contained in the deepest layer of a nested list structure.
R's list handling and apply-family allows for quite concise code for this type of problems, but writing it still gives me headaches and I'm pretty sure that anyone else reading it will have to take several minutes to understand what I am doing. This is although I'm basically doing nothing complicated - if it weren't for the multiple layers of lists to be traversed.
Below I provide a snippet of code that I wrote that shall serve as an example. I would consider it concise, yet hard to read.
Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and which is invoked by muscular activity. For this, I consider several muscles (first layer of lists) which each consist of a number of so-called motor units (second layer of lists) which each are related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects .firing.contribs, which contains the information on how strong the action of a motor unit influences the potential at a specific electrode, and .firing.instants which contains the information on the instants in time at which these motor units are fired. The given function then calculates the time course of the potential at each electrode.
Here's the question: What are possible options to render this type of code easily understable for the reader? Particularly: How can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.
Notes:
I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
I am not interested in performance for the moment, just readability.
Here's the example code.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
.MU.firing.instants)
calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {
MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
.MU.firing.instants, SIMPLIFY = FALSE)
muscle.contribs <- reduce.contribs(MU.contribs)
}
muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
.firing.instants, SIMPLIFY = FALSE)
surface.potentials <- reduce.contribs(muscle.contribs)
}
## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {
contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
sapply(obj.list, `[[`, i))
contribs <- lapply(contribs.by.electrode, rowSums)
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
## This will in reality be more complicated since then .MU.firing.contrib
## will have a different (more complicated) structure.
.MU.firing.contrib * .MU.firing.instants
}
firing.contribs <- list(list(list(1,2),list(3,4)),
list(list(5,6),list(7,8),list(9,10)))
firing.instants <- list(list(c(0,0,1,0), c(0,1,0,0)),
list(c(0,0,0,0), c(1,0,1,0), c(0,1,1,0)))
surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)
As suggested by user #alexis_laz, it is a good option to actually not use a nested list structure for representing data, but rather use a (flat, 2D, non-nested) data.frame. This greatly simplified coding and increased code conciseness and readability for the above example and seems to be promising for other cases, too.
It completely eliminates the need for nested apply-foo and complicated list traversing. Moreover, it allows for the usage of a lot of built-in R functionality that works on data frames.
I think there are two main reasons why I did not consider this solution before:
It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this doesn't matter by any measure since it's just indices that don't consume a lot of memory or something, but it still feels kind of bad.
Here is the rewritten example. Comments are most welcome.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
firing.info <- merge(.firing.contribs, .firing.instants)
firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
firing.info$contrib,
firing.info$instants,
SIMPLIFY = FALSE)
surface.potentials <- by(I(firing.info$contribs.time),
factor(firing.info$electrode),
function(list) colSums(do.call(rbind, list)))
surface.potentials
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
## This will in reality be more complicated since then .MU.firing.contrib
## will have a different (more complicated) structure.
.MU.firing.contrib * .MU.firing.instants
}
firing.instants <- data.frame(muscle = c(1,1,2,2,2),
MU = c(1,2,1,2,3),
instants = I(list(c(F,F,T,F),c(F,T,F,F),
c(F,F,F,F),c(T,F,T,F),c(F,T,T,F))))
firing.contribs <- data.frame(muscle = c(1,1,1,1,2,2,2,2,2,2),
MU = c(1,1,2,2,1,1,2,2,3,3),
electrode = c(1,2,1,2,1,2,1,2,1,2),
contrib = c(1,2,3,4,5,6,7,8,9,10))
surface.potentials <- sum.MU.firing.contributions(firing.contribs,
firing.instants)

Initialize a nested data frame in R

I have a nested data table called ldt. The highest dimension in each direction is
ldt[[44]][7200,4]
Meaning each nested data.table (DT) is 7200 rows by 4 columns, and there are 44 such nested DTs.
I want to create a second set of DTs based on the first.
Specifically, I want to execute the command
for (i in length(ldt)) {ldtSUM[[i]]<-ldt[[i]][, sum(V4), by="V1,V3"]}
If successful, this should generate a nested array of dimensionality
ldtSUM[[44]][120,4]
but instead doing so results in the error:
Error in ldtSUM[[i]] <- ldt[[i]][, sum(V4), by = "V1,V3"] : object 'ldtSUM' not found
presumably because I never initialized it.
So, to circumvent this, I attempted a variety of initialization techniques such as
ldtSUM=ldt[[1:44]][0,]
and many other statements, but they all failed for a variety of reasons, such as that I did not specify a number of empty rows, or "recursive indexing failed at level 3" etc.
So, in summary, what I would like to do is create a data.table wherein $V4 is summed by $V1,$V3 but I cannot seem to do this due to indexing and initialization issues.
Thank you very much!
Based on agstudy's answer, the correct code is:
ldtSUM<-lapply(ldt,function(x)x[, sum(V4), by = "V1,V3"]
You had two parts to your problem:
The first part you have already solved by:
ldtSUM<-lapply(ldt,function(x)x[, sum(V4), by = "V1,V3"]
The second part is on initialization, and that can be solved by following:
ldtSUM <- list()
After this initialization if you run your original code:
for (i in length(ldt)) {ldtSUM[[i]]<-ldt[[i]][, sum(V4), by="V1,V3"]}
should also work.
Note: For your problem, the solution suggested by agstudy is best. But the initialization is a general trick that is useful in variety of situations.

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
yeah? Well, you can't. You get an error that R cannot return multiple values.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical, the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note in your example carYear isn't a data frame - its a character vector of column names.
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution with an additional function perhaps, or with the recommended (i.e. list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need in a similar that Python can unpack variables from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing<-function(Data,GroupVar,condition){
g=Data$c+3
h=Data$c+1
NewD<-data.frame(a,b,g,h)
subset(NewD,select=c(a,b,GroupVar),GroupVar%in%condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the ease of the function there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
It's happening right as the function is trying to define GroupVar in the beginning. R is looking for the object h by itself (not within the dataframe).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
....
DF <- subset(Data, select=c("a","b", GroupVar))
DF <- DF[DF[,3] %in% condition,]
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.

Resources