R - sample used in %in% modifies the data frame which is being subsetted

Not sure if I titled the question correctly, because I don't fully understand the reason for the following behaviour:
dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)
dfSet[1:10, ]$va1 <- 'o1'
dfSet[11:15, ]$va1 <- 'o2'
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
print(length(unique(dfSet$ID)))
I expect the final print to show 15, but it doesn't. Instead 13 or 14 appears, and dfSet is modified in such a way that there are at least two rows with the same ID. It seems that this part of the code:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
modifies the $ID column - I don't know why.
Workaround:
temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
dfSet[dfSet$ID %in% temp, ]$va3 <- 1
In this case everything works as expected - there are 15 rows with unique IDs.
The question is: why does using sample directly inside %in% modify the data frame?

The problem is that R does something tricky when you assign to function return values. For example, something like
a <- c(1,3)
names(a) <- c("one", "three")
would look very odd in most languages. How do you assign a value to the return value of a function? What's really happening is that there is a function named names<- defined. It returns a transformed version of the original object, which is then used to replace the object that was passed in. So it really looks like this
.temp. <- `names<-`(a, c("one","three"))
a <- .temp.
The variable a is always completely replaced, not just its names.
When you do something like
dfSet$a<-1
what's really happening again is
.temp. <- "$<-"(dfSet, a, 1)
dfSet <- .temp.
Now things get a bit more tricky when you try to do both [] and $ subsetting. Look at this sample
#for subsetting
f <- function(x,v) {print("testing"); x==v}
x <- rep(0:1, length.out=nrow(dfSet))
dfSet$a <- 0
dfSet[f(x,1),]$a<-1
Notice how "testing" is printed twice. What's going on is really more like
.temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
.temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
dfSet <- .temp2.
So the f(x,1) is evaluated twice. This means that sample would be evaluated twice as well.
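A minimal standalone sketch of the consequence (my example, not from the question): because the index expression is evaluated once for the extraction and once for the write-back, a random index sends the extracted rows, IDs included, back into a different set of rows.
set.seed(42)
df <- data.frame(ID = 1:6, val = 0)
# sample() runs twice: once to pick rows for `$<-`, and again to pick
# the rows that `[<-` writes back into, so the two row sets usually differ
df[sample(1:6, 3), ]$val <- 1
length(unique(df$ID))   # typically less than 6: some IDs got duplicated, others lost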
The error is a bit more obvious if you try to replace a variable that does not exist yet
dfSet[f(x,1),]$b<-1
# Warning message:
# In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L, :
# provided 4 variables to replace 3 variables
Here you get the warning because the .temp1. variable has added the column and now has 4 columns, but when you try to do the assignment to .temp2. you have a problem: the slice of the data frame that you are trying to replace is a different size.
The IDs are replaced because the $<- operator doesn't just return a new column, it returns a new data.frame with the column updated to whatever value you assigned. This means that the rows that were updated are returned along with the IDs they had when the assignment happened. This is saved in the .temp1. variable. Then when you do the [<- assignment, you are choosing a new set of rows to swap out, and the values of all columns of those rows are replaced with the values from .temp1.. Since the two row sets may differ, you overwrite the IDs of the replacement rows and are likely to wind up with two or more copies of a given ID.

Although I'm not 100% sure, I suspect that R is running the sample call twice. When you subset and assign in R, for example:
x[i:j,]$v1 <- 1
It gets evaluated as "take out rows i to j from x as a temporary data frame, assign 1 to the v1 column of that data frame, then copy the temporary data frame back into rows i to j in x".
So maybe the indexing expression (i:j) gets executed twice (once to extract, and once to put back), and if it's random, it puts the results back into different rows than the ones originally selected.

Consider this simpler example:
x <- data.frame(a=1:10, b=10:1)
x$b <- 5
What the second line actually does is
x <- `$<-`(x, 'b', 5)
You can see that $<- is just a function that takes three arguments: an object, a name and a value. (Note the backticks are necessary if you want to call $<- directly.)
The problem, I think, is that in your example the object passed to $<- is itself an expression that evaluates to something different each time it is evaluated, due to the call to sample, so you should avoid this.
An alternative is to use [<- which apparently doesn't have this problem:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), 'va3'] <- 1
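A quick sanity check (my sketch, reusing a printing probe like the one in the earlier answer): with an explicit column name the row index is evaluated only once, so sample would only run once.
f <- function(x, v) { print("evaluating index"); x == v }
x <- rep(0:1, length.out = nrow(dfSet))
dfSet[f(x, 1), ]$va3  <- 1   # "evaluating index" is printed twice
dfSet[f(x, 1), "va3"] <- 1   # printed only once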

Related

Subset data.table based on value in column of type list

So I currently have a data.table with one column of type list.
This list can contain different values, NULL among other possible values.
I tried to subset the data.table to keep only rows for which this column has the value NULL.
Behold... my attempts below (for the example I named the column "ColofTypeList"):
DT[is.null(ColofTypeList)]
It returns an empty data.table.
Then I tried:
DT[ColofTypeList == NULL]
It returns the following error (I expected an error):
Error in .prepareFastSubset(isub = isub, x = x, enclos = parent.frame(), :
RHS of == is length 0 which is not 1 or nrow (96). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead.
(Just to clarify: my original data.table contains 96 rows, which is why the error message says
which is not 1 or nrow (96).
The number of rows is not the point.)
Then I tried this:
DT[ColofTypeList == list(NULL)]
It returns the following error:
Error: comparison of these types is not implemented
I also tried to give a list of the same length as the column, and got the same error.
So my question is simple: What is the correct data.table way to subset the rows for which elements of this "ColofTypeList" are NULL ?
EDIT: here is a reproducible example
DT<-data.table(Random_stuff=c(1:9),ColofTypeList=rep(list(NULL,"hello",NULL),3))
Have fun!
If it is a list, we can loop through the list and apply is.null to each element to return a logical vector:
DT[unlist(lapply(ColofTypeList, is.null))]
#   ColofTypeList anotherCol
#1:                        3
Or another option is lengths
DT[lengths(ColofTypeList)==0]
data
DT <- data.table(ColofTypeList = list(0, 1:5, NULL, NA), anotherCol = 1:4)
I have found another way that is also quite nice:
DT[lapply(ColofTypeList, is.null)==TRUE]
It is also important to mention that using isTRUE() doesn't work.
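Applied to the reproducible example from the question (my quick sketch), all three variants pick out the NULL rows; note, as an aside, that lengths(...) == 0 would also match empty vectors such as character(0), not only NULL.
library(data.table)
DT <- data.table(Random_stuff = 1:9,
                 ColofTypeList = rep(list(NULL, "hello", NULL), 3))
DT[unlist(lapply(ColofTypeList, is.null))]   # logical vector built with is.null
DT[lengths(ColofTypeList) == 0]              # length 0 stands in for NULL here
DT[lapply(ColofTypeList, is.null) == TRUE]   # the variant from the last answer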

Add a column to empty data.frame

I want to initialise a column in a data.frame like so:
df$newCol = 1
where df is a data.frame that I have defined earlier and already done some processing on. As long as nrow(df) > 0, this isn't a problem, but sometimes my data.frame has 0 rows and I get:
> df$newCol = 1
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
I can work around this by changing my original line to
df$newCol = rep(1,nrow(df))
but this seems a bit clumsy and is computationally prohibitive if the number of rows in df is large. Is there a built-in or standard solution to this problem? Or should I use some custom function like this one?
addCol = function(df, name, value) {
  if (nrow(df) == 0) {
    df[, name] = rep(value, 0)
  } else {
    df[, name] = value
  }
  df
}
If I understand correctly,
df = mtcars[0, ]
df$newCol = numeric(nrow(df))
should be it?
This is assuming that by "row length" you mean the number of rows, in which case you need to append a vector of length 0. In that case, numeric(nrow(df)) will give you the exact same result as rep(0, nrow(df)).
It also assumes that you just need a new column, not specifically a column of ones - for ones you would simply add 1 afterwards, which is a vectorized operation and therefore fast.
Other than that, I'm not sure you can have an "empty" column - the vector must have the same number of elements as the other vectors in the data frame. But numeric() is fast, so it should not hurt.
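A tiny sketch of the difference (hypothetical zero-row data frame):
df <- data.frame(a = numeric(0))     # zero-row data frame
# df$newCol <- 1                     # would fail: 1 element to replace 0 elements
df$newCol <- numeric(nrow(df))       # a length-0 column matches the 0 rows
ncol(df)                             # 2: the column exists, it is just empty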

How to assign a subset from a data frame `a' to a subset of data frame `b'

It might be a trivial question (I am new to R), but I could not find an answer to my question, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag for subsets of the data frame, using a second data frame (aux) that contains two columns, namely key and value. The subsets are defined by id = n, for each n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for (i in unique(df$id)) {
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer, ]$tag <- aux[match(df[indexer, ]$tag, aux$key), ]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NA's. I got no errors, but the following warning message:
In `[<-.factor`(`*tmp*`, df$id == i, value = c(NA, :
  invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
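For reference, a self-contained sketch of that one-liner on the data from the question (stringsAsFactors = FALSE is added here so tag is character rather than factor, which also sidesteps the warning from the question):
aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)
df  <- data.frame(id  = rep(1:4, 3),
                  tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
                  stringsAsFactors = FALSE)
df$tag <- aux$value[match(df$tag, aux$key)]
head(df, 4)
#   id     tag
# 1  1 valueAA
# 2  2 valueBB
# 3  3 valueRR
# 4  4 valueFF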
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my solution (at least a preliminary one) was the following (a sketch is shown after the list):
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.
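A rough sketch of that per-subset pattern, using the df and aux from the question (my code, assuming data.table is loaded for rbindlist and that df$tag is character rather than factor):
library(data.table)
pieces <- list()
for (i in unique(df$id)) {
  sub <- df[df$id == i, ]
  sub$tag <- aux$value[match(sub$tag, aux$key)]   # update within the subset
  pieces[[as.character(i)]] <- sub
}
df_final <- rbindlist(pieces, use.names = TRUE)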

How to add a column to a dataframe with values of another based on multiple conditions

I have two data frames of different lengths, and I want to add a new column to the first data frame with the corresponding values from the second data frame.
The corresponding value is defined by the following condition: if (DF1[i,1] == DF2[,1] & DF1[i,2] == DF2[,2]) == TRUE, then the value of this row should be taken from DF2 and written to DF1$newColumn[i].
The following data frames are used to illustrate the question:
DF1 <- data.frame(X = rep(c("A","B","C"), each = 3),
                  Y = rep(c("a","b","c"), each = 3))
DF2 <- data.frame(X = c("A","B","C"),
                  Y = c("a","b","c"),
                  Z = c(1:3))
I tried to use if() statements as in the text above but the condition returns a vector of TRUE/FALSE and that doesn't seem to work.
The code that works, and that I use now, is
for (i in 1:length(DF1[,1])) {
  DF1$Z[i] <- subset(DF2, DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z
}
However it is incredibly slow (user 115.498, system 12.341, elapsed 127.799 for my full data frame) and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient than loops, but I don't know how to do that.
I do need to work with conditional statements though so something like
DF1$Zz<-rep(DF2$Z,each=3)
wouldn't work for my real dataset.
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X==DF1$X[i] & DF2$Y==DF1$Y[i]]) seems to be taking roughly a quarter of the time of your for loop.
I created DF1 with 300 reps each, and my function took ~2 secs to run; your loop with subset took ~8 secs, and repackaging your loop into an sapply took ~5 secs.
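For completeness, a reproducible run of that sapply version on the example data frames from the question:
DF1 <- data.frame(X = rep(c("A", "B", "C"), each = 3),
                  Y = rep(c("a", "b", "c"), each = 3))
DF2 <- data.frame(X = c("A", "B", "C"),
                  Y = c("a", "b", "c"),
                  Z = 1:3)
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i]])
DF1$Z
# [1] 1 1 1 2 2 2 3 3 3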

Losing Class information when I use apply in R

When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function(er) {
  print(er['_age_'])
  print(er[2])
  print(is.numeric(er[2]))
  print(class(er[2]))
  return(er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
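You can see the coercion directly (a quick check of mine, using the df from the question):
typeof(as.matrix(df))       # "character": the whole matrix is coerced
as.matrix(df)[1, '_age_']   # the age reaches dfunc as the string "20"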
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
The drop=FALSE is needed to avoid the result collapsing to a vector when only a single column is left.
-1 means all but the first column; you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')].
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
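For example (my sketch using the df from the question; dfunc2 is a hypothetical variant that only touches the numeric column):
dfunc2 <- function(row) row[['_age_']] + 2   # row arrives as a one-row data.frame
sapply(seq_len(nrow(df)), function(i) dfunc2(df[i, ]))
# [1] 22 32 52
sapply(split(df, seq_len(nrow(df))), dfunc2)
#  1  2  3
# 22 32 52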
