Assignment to the result of a function changes variable - r

Looking through the ave function, I found a remarkable line:
split(x, g) <- lapply(split(x, g), FUN) # From ave
Interestingly, this line changes the value of x, which I found unexpected. I expected that split(x,g) would result in a list, which could be assigned to, but discarded afterward. My question is, why does the value of x change?
Another example may explain better:
a <- data.frame(id=c(1,1,2,2), value=c(4,5,7,6))
# id value
# 1 1 4
# 2 1 5
# 3 2 7
# 4 2 6
split(a,a$id) # Split a row-wise by id into a list of size 2
# $`1`
# id value
# 1 1 4
# 2 1 5
# $`2`
# id value
# 3 2 7
# 4 2 6
# Find the row with highest value for each id
lapply(split(a,a$id),function(x) x[which.max(x$value),])
# $`1`
# id value
# 2 1 5
# $`2`
# id value
# 3 2 7
# Assigning to the split changes the data.frame a!
split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),])
a
# id value
# 1 1 5
# 2 1 5
# 3 2 7
# 4 2 7
Not only has a changed, but it changed to a value that does not look like the right hand side of the assignment! Even if assigning to split(a,a$id) somehow changes a (which I don't understand), why does it result in a data.frame instead of a list?
Note that I understand that there are better ways to accomplish this task. My question is why does split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),]) change a?

The help page for split says in its header: "The replacement forms replace values corresponding to such a division." So it really should not be unexpected, although I admit it is not widely used. I do not understand how your example illustrates that the assigned values "do not look like the RHS of the assignment!". The max values are assigned to the 'value' lists within categories defined by the second argument factor.
(I do thank you for the question. I had not realized that split<- was at the core of ave. I guess it is more widely used than I realized, since I think ave is a wonderfully useful function.)
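To see the connection with ave concretely, here is a small sketch using the data frame a from the question. The name group_max is only for illustration; the expected values follow from the group maxima (5 for id 1, 7 for id 2).

```r
a <- data.frame(id = c(1, 1, 2, 2), value = c(4, 5, 7, 6))

# ave() applies FUN within each group and writes the results back into the
# original positions, which is the job `split<-` does for it internally.
group_max <- ave(a$value, a$id, FUN = max)
group_max
# [1] 5 5 7 7
```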

If you run split(a, a$id) <- 1 right after defining a, the result is:
> a
id value
1 1 1
2 1 1
3 1 1
4 1 1

The key here is that split<- is a replacement function: it modifies the object on the LHS using the values on the RHS.
Here's an example:
> x <- c(1,2,3);
> split(x,x==2)
$`FALSE`
[1] 1 3
$`TRUE`
[1] 2
> split(x,x==2) <- split(c(10,20,30),c(10,20,30)==20)
> x
[1] 10 20 30
Note the line where I re-assign split(x,x==2) <- . This actually reassigns x.
As the comments below have stated, you can look up the definition of split<- like so
> `split<-.default`
function (x, f, drop = FALSE, ..., value)
{
ix <- split(seq_along(x), f, drop = drop, ...)
n <- length(value)
j <- 0
for (i in ix) {
j <- j%%n + 1
x[i] <- value[[j]]
}
x
}
<bytecode: 0x1e18ef8>
<environment: namespace:base>
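The general rule behind this is that R rewrites f(x) <- value as x <- `f<-`(x, value = value). The sketch below illustrates that with a made-up replacement function (`second<-` is hypothetical, defined only for this example) and then calls `split<-` directly on the question's data, which returns the modified copy instead of overwriting a.

```r
# `second<-` is a made-up replacement function, defined only for this example.
`second<-` <- function(x, value) {
  x[2] <- value
  x                      # a replacement function must return the modified object
}

v <- c(10, 20, 30)
second(v) <- 99          # R evaluates this as: v <- `second<-`(v, value = 99)
v
# [1] 10 99 30

# The same desugaring applied to the question's example: calling `split<-`
# as an ordinary function returns the result and leaves `a` untouched.
a <- data.frame(id = c(1, 1, 2, 2), value = c(4, 5, 7, 6))
best <- lapply(split(a, a$id), function(x) x[which.max(x$value), ])
b <- `split<-`(a, a$id, value = best)
b$value
# [1] 5 5 7 7
a$value
# [1] 4 5 7 6
```

This also shows why the result is a data frame and not a list: the function returns x, the original object, with values filled in group by group.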


R function to recursively merge unnamed arguments, input a list

I am trying to write a multi-merge alternative to merge which can merge-together more than two datasets on a single key.
The code I have is like this:
multimerge <- function(..., by, all=T) {
value <- list(...)
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
But the thing I want to multi-merge is a list. Is it possible to pass a list argument as the ... in a function?
For instance:
List <- list(
data.frame('x'=c('a','b','c'), 'y'=1),
data.frame('x'=c('a','b','c'), 'z'=2)
)
would take
multimerge(List, by='x')
as an argument and give:
x y z
a 1 2
b 1 2
c 1 2
as output. But I do not want to write another version of multimerge.
purrr has a powerful function called flatten that would be perfect for this problem:
library(purrr)
multimerge <- function(..., by, all=T) {
value = flatten(list(...))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
No matter what is being fed into ..., flatten turns list(...) into a list of dataframes for Reduce. With this functionality, you can feed either a list of dataframes, several individual dataframes, both, or even several lists of dataframes.
You can also imitate the behavior of flatten by doing something like this in Base R:
multimerge <- function(..., by, all=T) {
value = list(...)
df_index = which(sapply(value, inherits, "data.frame"))
list_index = which(sapply(value, inherits, "list"))
value = c(value[df_index], unlist(value[list_index], recursive = FALSE))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
This applies unlist only to elements that are plain lists and keeps dataframes untouched. Note that I used inherits instead of is.list, because dataframes are technically also lists!
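A quick check of the distinction between is.list and inherits, as a sketch with throwaway objects (df and plain are illustrative names, not part of the answer above):

```r
df <- data.frame(x = 1:2)
plain <- list(df, df)

is.list(df)               # TRUE: a data frame is stored as a list of columns
inherits(df, "list")      # FALSE: its class attribute is "data.frame"
inherits(plain, "list")   # TRUE: a bare list has the implicit class "list"
```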
Result:
> multimerge(List, by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List[[1]], List[[2]], by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List, List[[1]], List[[2]], by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
> multimerge(List, List, by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
Additional Notes:
From the documentation of ?flatten:
These functions remove a level hierarchy from a list. They are similar to unlist(), only ever remove a single layer of hierarchy, and are type-stable so you always know what the type of the output is.
The key word is "type-stability", meaning it always returns the same type of data structure.
> flatten(list(List, List[[1]], List[[2]]))
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
[[3]]
x y
1 a 1
2 b 1
3 c 1
[[4]]
x z
1 a 2
2 b 2
3 c 2
> unlist(list(List, List[[1]], List[[2]]), recursive = FALSE)
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
$x
[1] a b c
Levels: a b c
$y
[1] 1 1 1
$x
[1] a b c
Levels: a b c
$z
[1] 2 2 2
The main difference between flatten and unlist + recursive = FALSE is that flatten "unlists" only if the output matches the data structure of the rest, whereas unlist + recursive = FALSE always flattens one level. That is why my Base R example needed an extra step to check whether each element is a list or a dataframe.
So, the problem is that when you pass a list into multimerge, list(...) wraps it in another list of length 1. Reduce then sees a single element with nothing to merge, so it just hands back your original list. You could check for a superfluous length-1 list and strip off that level:
multimerge <- function(..., by, all=T) {
value <- list(...)
if (length(value) == 1) value <- value[[1]]
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
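Another option, which leaves the question's original multimerge unchanged, is to spread the list over ... at the call site with do.call. This is a sketch using the List from the question; res is an illustrative name.

```r
multimerge <- function(..., by, all = TRUE) {
  value <- list(...)
  Reduce(function(x, y) merge(x, y, by = by, all = all), value)
}

List <- list(
  data.frame(x = c("a", "b", "c"), y = 1),
  data.frame(x = c("a", "b", "c"), z = 2)
)

# do.call() passes each list element as a separate argument; the named
# element `by` is matched to the formal argument of the same name.
res <- do.call(multimerge, c(List, list(by = "x")))
res
```

The trade-off is that the caller has to know to use do.call, whereas the length-1 check above keeps the call as multimerge(List, by='x').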

Complete.cases used on list of data frames

I'm trying to remove all the rows containing NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there a way of doing this with lapply? I have been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is some alternatives, or whether I was going about this right. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is itself a list of logical vectors, one per data frame, which is why using it as a subscript gives the "invalid subscript type 'list'" error. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
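To check that the two approaches agree, here is a sketch on a toy list. The data_in below is made up to stand in for the result of lapply(file_name, read.csv), and two_step and one_step are illustrative names.

```r
# Toy stand-in for the list returned by lapply(file_name, read.csv)
data_in <- list(
  first  = data.frame(x = 1:3, y = c(1, NA, 3)),
  second = data.frame(x = c(NA, 2), y = c(5, 6))
)

# Two-step version: one logical vector per data frame, then paired subsetting
comp <- lapply(data_in, complete.cases)
two_step <- Map(function(d, keep) d[keep, ], data_in, comp)

# One-step version: compute and apply the logical index inside one function
one_step <- lapply(data_in, function(x) x[complete.cases(x), ])

all.equal(two_step, one_step)
```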
Welcome to SO; please provide some working code next time.
Here is how I would do it with na.omit (since complete.cases only returns a logical vector):
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
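In the output above, $a and $b come back as plain vectors because subsetting a one-column data frame with [rows, ] drops to a vector by default. The sketch below shows how drop = FALSE keeps them as data frames. The lst here is rebuilt by hand from part of the printed output, so its exact construction is an assumption.

```r
# Hand-built approximation of two elements of the printed `lst`
lst <- list(
  a = data.frame(a = c(1, 2, NA, 3, 4)),
  d = data.frame(d = c(NA, NA, 3, 4, 5), e = c(1, 2, 3, NA, NA))
)

# drop = FALSE stops single-column results from collapsing to a vector
f <- function(x) x[complete.cases(x), , drop = FALSE]
res <- lapply(lst, f)
sapply(res, is.data.frame)
#    a    d
# TRUE TRUE
```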
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector. Using it as a row index, as above, does the job and returns only the rows with no NA values.

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Both methods didn't return an error, but the first didn't change the column name, while the second did.
I thought it has something to do with the fact that I'm operating only on a subset of df, but why, for example, the following works fine then?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
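The two translations can be run as ordinary function calls to confirm that they reproduce the observed behaviour. This is a sketch: v1 and v2 are illustrative names, and the results are stored in new variables so that df itself stays unchanged.

```r
df <- data.frame(a = 1:3, b = 5:7)

# Version 1: rename a one-column copy, then put it back with `[<-`
v1 <- `[<-`(df, 2, value = `names<-`(`[`(df, 2), value = "x"))
names(v1)
# [1] "a" "b"

# Version 2: edit the names vector, then assign it with `names<-`
v2 <- `names<-`(df, value = `[<-`(names(df), 2, "x"))
names(v2)
# [1] "a" "x"
```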
Also, turns out this is "documented" in R Inferno Section 8.2.34 (Thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

Extract only first line in a data frame from several subgroups that satisfy a conditional

I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a, b or c) I would like to extract the first line where Value is not NA, and only that first line. Since a group can contain several non-NA values that differ from each other, I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply<-(df,.(Group),function(sub_data){
for(i in 1:length(sub_data$value)){
if(sub_data$Value!='NA'){'take value but only for the first non NA')
return(first line that satisfies)
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
library(data.table)
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do yourself a favor and learn data.table
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example; in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- data.frame(Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to @hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
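For comparison, a base R sketch that is not taken from any of the answers above: drop the NA rows, then keep the first remaining row of each group with duplicated. It assumes Value holds real NAs and is numeric; non_missing and first_rows are illustrative names.

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, 10, NA, 4, 8, NA, NA, 2))

# Step 1: drop the rows whose Value is missing
non_missing <- df[!is.na(df$Value), ]

# Step 2: duplicated() flags every repeat of a Group, so negating it
# keeps only the first remaining row per group
first_rows <- non_missing[!duplicated(non_missing$Group), ]
first_rows
#   Group Value
# 3     a    10
# 5     b     4
# 9     c     2
```

Unlike the aggregate versions, this keeps every column of the original rows along with their row names.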

Matching without replacement by id in R

In R, I can easily match unique identifiers using the match function:
match(c(1,2,3,4),c(2,3,4,1))
# [1] 4 1 2 3
When I try to match non-unique identifiers, I get the following result:
match(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 3
Is there a way to match the indices "without replacement", that is, each index appearing only once?
othermatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4 # note the 4 where there was a 3 at the end
You're looking for pmatch. With its default duplicates.ok = FALSE, each element of the table is matched at most once:
pmatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4
A more naive approach -
library(data.table)
a <- data.table(p = c(1,2,3,1))
a[,indexa := .I]
b <- data.table(q = c(2,3,1,1))
b[,indexb := .I]
setkey(a,p)
setkey(b,q)
# since the two vectors are permutations of each other, cbinding the sorted tables gives ab with ab[,p] equal to ab[,q]
ab <- cbind(a,b)
setkey(ab,indexa)
ab[,indexb]
#[1] 3 1 2 4
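A base R sketch of the same idea, offered as an alternative and not taken from the answers above: tag each value with its occurrence number so that every identifier becomes unique, then use plain match. The helper names occ and match_once are made up for this example.

```r
# Occurrence counter: 1 for the first time a value appears, 2 for the second, ...
occ <- function(x) ave(seq_along(x), x, FUN = seq_along)

# Match the k-th occurrence of a value in x to its k-th occurrence in table
match_once <- function(x, table) {
  match(paste(x, occ(x)), paste(table, occ(table)))
}

match_once(c(1, 2, 3, 1), c(2, 3, 1, 1))
# [1] 3 1 2 4
match_once(c(1, 2, 3, 4), c(2, 3, 4, 1))
# [1] 4 1 2 3
```

If x contains more copies of a value than table does, the surplus copies come back as NA.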
