R - brevity when subsetting?

I'm still new to R and do all of my subsetting via the pattern:
data[ command that produces logical with same length as data ]
or
subset( data , command that produces logical with same length as data )
for example:
test = c("A", "B","C")
ignore = c("B")
result = test[ !( test %in% ignore ) ]
result = subset( test , !( test %in% ignore ) )
But I vaguely remember from my readings that there's a shorter/(more readable?) way to do this? Perhaps using the "with" function?
Can someone list alternatives to the example above to help me understand the options for subsetting?

I don't know of a more succinct way of subsetting for your specific example, using only vectors. What you may be thinking of, regarding with, is subsetting data frames based on conditions using columns from that data frame. For example:
dat <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
If we want to grab a subset of dat based on a condition on variable1 we could do this:
dat[dat$variable1 < 0.5, ]
or we can save ourselves having to write dat$* each time by using with:
with(dat, dat[variable1 < 0.5, ])
Now, you'll notice that I really didn't save any keystrokes by doing that in this case. But if you have a data frame with a long name, and a complicated condition it can save you a bit. See also the related ?within command if you're altering the data frame in question.
Alternatively, you can use subset which can do essentially the same thing:
subset(dat, variable1 < 0.5)
subset can also select which columns to return via its select argument.
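A hedged sketch of select on the same example data (the column names are just the ones defined above):
subset(dat, variable1 < 0.5, select = variable2)   # keep only variable2
subset(dat, variable1 < 0.5, select = -variable1)  # or drop variable1 instead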

The with function would help if test were a column in a data frame (or an object in a list), but for plain vectors in the global environment with does not help.
Some people have created a "not in" operator that could save a couple of keystrokes over what you did; a sketch follows below. If all the values in test are unique then the setdiff function may be what you are thinking of (but if, for example, you had multiple "A"s then setdiff would return only one of them).
With your ignore being only one value you could use test != ignore, but that does not generalize to ignore having two or more values.
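A hedged sketch of both ideas (the operator name %notin% is just a convention, not a built-in):
`%notin%` <- Negate(`%in%`)
test[test %notin% ignore]
setdiff(test, ignore)   # also removes duplicates, as noted above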

I have seen timed comparisons of alternate methods and %in% (based on match) was one of the best performing strategies.
Alternates:
test[!test=="B"] #logical indexing
test[which(test != "B")] #numeric indexing
# the which() is not superfluous when there are NA's if you want them ignored
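The NA point is easy to see with a small hedged example (the vector test2 is mine, just for illustration):
test2 <- c("A", NA, "C")
test2[test2 != "B"]          # logical indexing keeps an NA element
test2[which(test2 != "B")]   # which() drops it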

Another alternative to the original example:
test[test != ignore]
Other ways, using joran's example:
set.seed(1)
df <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
Returning one column as a vector: df[[1]]. (df$name is equivalent to df[["name", exact = FALSE]].)
df[df[[1]] < 0.5, ]
df[df["variable1"] < 0.5, ]
Returning one data frame of one column: df[1]
df[df[1] < 0.5, ]
Using with, which lets the condition use the bare column name:
with(df, df[variable1 < 0.5, ])
(The df[[1]] and df[1] forms gain nothing from with, since they still name df explicitly.)
Using dplyr:
library(dplyr)
filter(df, variable1 < 0.5)
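For the original vector example, a hedged sketch of the same ignore-list filter with dplyr (the data frame wrapper df2 is mine, since filter works on data frames):
df2 <- data.frame(test = c("A", "B", "C"))
filter(df2, !(test %in% c("B")))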

Related

Order and subset a multi-column dataframe in R?

I wanted to order a multi-column dataframe by one column and subset it, but the command I used did not work:
print(df[order(df$x) & df$x < 5,])
This does not order the results.
To debug this I generated a test dataframe with one column, but this 'simplification' had unexpected effects:
df <- data.frame(x = sample(1:50))
print(df[order(df$x) & df$x < 5,])
This does not order the results so I felt I had reproduced the problem but with simpler data.
Breaking down the process into first ordering and then subsetting led me to discover that the ordering in this case does not generate a dataframe object:
df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))
produces
[1] "integer"
Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.
print(ndf[ndf$x < 5, ])
obviously generates an error:
Error in ndf$x : $ operator is invalid for atomic vectors.
Simplifying even further, I found that subsetting alone (not applying the order function) does not generate a dataframe object either:
ndf <- df[df$x < 5,]
class(ndf)
[1] "integer"
It turns out that for the multi-column dataframe, separating the ordering and the subsetting does work as expected:
df <- data.frame(x = sample(1:50), y = rnorm(50))
ndf <- df[order(df$x),]
print(ndf[ndf$x < 5, ])
and this solved my original problem, but led to two further questions:
Why is the type of object returned in the one-column test case, as described above, not a dataframe? (I appreciate that a one-column dataframe just contains a single vector, but it is still wrapped in a dataframe?)
Is it possible to order and subset a multicolumn dataframe in 1 step?
A data.frame in R automatically simplifies to a vector when you select just one column. This is a common and useful simplification and is described in more detail in this question. Of course you can prevent it with drop=FALSE.
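A hedged sketch on the one-column example above:
df[df$x < 5, ]                 # simplifies to an integer vector
df[df$x < 5, , drop = FALSE]   # stays a one-column data frame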
Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). This line doesn't make a lot of sense
df[order(df$x) & df$x < 5,]
Subsetting in R can be done either with a vector of row indices (which order() returns) or with logical values (which the < comparison returns). Combining them with & coerces the indices to logicals, so the ordering information is lost and only the subsetting survives. But you can break that out into two steps with subset()
subset(df[order(df$x),], x < 5)
This does the ordering first and then the subsetting. Note that the condition no longer references df directly; it filters the data from the re-ordered data.frame.
Operations like this are one of the reasons many people prefer the dplyr library for data manipulation. For example, this can be done with
library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)

R - sample used in %in% modifies dataframe which is being subsetted

Not sure if I titled the question correctly, because I don't fully understand the reason for the following behaviour:
dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)
dfSet[1:10, ]$va1 <- 'o1'
dfSet[11:15, ]$va1 <- 'o2'
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
print(length(unique(dfSet$ID)))
I expect the final print to show 15, but it doesn't. Instead 13 or 14 appears, and dfSet is modified in such a way that there are at least two rows with the same ID. It seems that this part of the code:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
modifies the $ID column, and I don't know why.
Workaround:
temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
dfSet[dfSet$ID %in% temp, ]$va3 <- 1
In this case everything works as expected: there are 15 rows with unique IDs.
The question is: why does direct usage of sample inside %in% modify the data frame?
What seems to be the problem is that R does some tricky things when you assign to function return values. For example, something like
a <- c(1,3)
names(a) <- c("one", "three")
would look very odd in most languages. How do you assign a value to the return value of a function? What's really happening is that there is a function named names<-. It returns a transformed version of the original object, which is then used to replace the object that was passed in. So it really looks like this:
.temp. <- `names<-`(a, c("one","three"))
a <- .temp.
The variable a is always completely replaced, not just its names.
When you do something like
dfSet$a<-1
what's really happening again is
.temp. <- "$<-"(dfSet, a, 1)
dfSet <- .temp.
Now things get a bit more tricky when you try to do both [] and $ subsetting. Look at this sample
#for subsetting
f <- function(x,v) {print("testing"); x==v}
x <- rep(0:1, length.out=nrow(dfSet))
dfSet$a <- 0
dfSet[f(x,1),]$a<-1
Notice how "testing" is printed twice. What's going on is really more like
.temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
.temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
dfSet <- .temp2.
So the f(x,1) is evaluated twice. This means that sample would be evaluated twice as well.
The error is a bit more obvious if you try to replace a variable that does not exist yet:
dfSet[f(x,1),]$b<-1
# Warning message:
# In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L, :
# provided 4 variables to replace 3 variables
Here you get the warning because the .temp1. variable has added the column and now has 4 columns, but when you try to do the assignment to .temp2. the slice of the data frame that you are trying to replace is a different size.
The IDs are replaced because the $<- operator doesn't just return a new column, it returns a new data.frame with the column updated to whatever value you assigned. This means that the rows that were updated are returned along with the ID that was there when the assignment happened. This is saved in the .temp1. variable. Then when you do the [<- assignment, you are choosing a new set of rows to swap out. The values of all columns of these rows are replaced with the values from .temp1.. This means that you will be overwriting the IDs for the replacement rows and they may differ so you are likely to wind up with two or more copies of a given ID.
Although I'm not 100% sure, I suspect that R is running the sample two times. When you subset and assign in R, for example:
x[i:j,]$v1 <- 1
It gets evaluated as "take out rows i to j from x as a temporary data frame, assign 1 to the v1 column of that data frame, then copy the temporary data frame back into rows i to j in x".
So maybe the indexing expression (i:j) gets executed twice (once to extract, and once to put back), and if it's a random variable, it's going to put the results back in different rows than the ones originally selected.
Consider this simpler example:
x <- data.frame(a=1:10, b=10:1)
x$b <- 5
What the second line actually does is
x <- `$<-`(x, 'b', 5)
You can see that $<- is just a function that takes three arguments: an object, a name and a value. (Note that the backticks are necessary if you want to call $<- directly.)
The problem, I think, is that in your example x is an expression that evaluates to different things each time it's evaluated, due to the call to sample, so you should avoid this.
An alternative is to use [<- which apparently doesn't have this problem:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), 'va3'] <- 1
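A hedged way to check that this form evaluates the row index only once (the tracing helper g is mine):
g <- function(v) { cat("evaluating index\n"); v == "o1" }
dfSet[g(dfSet$va1), "va3"] <- 1   # "evaluating index" prints only once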

How to assign a subset from a data frame 'a' to a subset of data frame 'b'

It might be a trivial question (I am new to R), but I could not find an answer to my question, either here in SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the tag column of subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, key and value. The subsets are defined by id = n, for each n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for (i in unique(df$id)) {
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer, ]$tag <- aux[match(df[indexer, ]$tag, aux$key), ]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NAs. I got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
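A hedged sketch of the merge() route mentioned above (column names as defined in the question; the matched values arrive as a new value column, and merged is just my name):
merged <- merge(df, aux, by.x = "tag", by.y = "key", all.x = TRUE, sort = FALSE)
head(merged)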
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my solution (at least, a preliminary one) was the following:
to process each subset individually;
add each data frame to a list;
use data.table::rbindlist(a.list, use.names = TRUE) to get a complete data frame with the results (a sketch of the whole approach follows below).
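A hedged sketch of that approach under the question's setup (the names pieces and df_updated are mine):
library(data.table)
pieces <- lapply(unique(df$id), function(i) {
  sub <- df[df$id == i, ]
  sub$tag <- aux$value[match(sub$tag, aux$key)]
  sub
})
df_updated <- rbindlist(pieces, use.names = TRUE)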

How to add a column to a dataframe with values of another based on multiple conditions

I have two data frames of different length, and I want to add a new column to the first data frame with corresponding values of the second data frame.
The corresponding value is defined by the following condition: if (DF1[i,1] == DF2[,1] & DF1[i,2] == DF2[i,2]) == TRUE, then the value of this row should be taken from DF2 and written to DF1$newColumn[i].
The following data frames are used to illustrate the question:
DF1 <- data.frame(X = rep(c("A","B","C"), each = 3),
                  Y = rep(c("a","b","c"), each = 3))
DF2 <- data.frame(X = c("A","B","C"),
                  Y = c("a","b","c"),
                  Z = c(1:3))
I tried to use if() statements as in the text above but the condition returns a vector of TRUE/FALSE and that doesn't seem to work.
The code that works that I use now is
for (i in 1:length(DF1[,1])) {
  DF1$Z[i] <- subset(DF2, DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z
}
However, it is incredibly slow (user system elapsed 115.498 12.341 127.799 for my full dataframe) and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient than loops, but I don't know how to do that.
I do need to work with conditional statements though so something like
DF1$Zz<-rep(DF2$Z,each=3)
wouldn't work for my real dataset.
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X==DF1$X[i] & DF2$Y==DF1$Y[i]]) seems to be taking roughly a quarter of the time of your for loop.
I created DF1 with 300 reps of each, and my function took ~2 secs to run; your loop with subset took ~8 secs, and repackaging your loop into an sapply took ~5 secs.
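Not taken from the answers above, but a hedged, fully vectorized alternative is to join on both key columns with merge() (the name DF1merged is mine):
DF1merged <- merge(DF1, DF2, by = c("X", "Y"), all.x = TRUE, sort = FALSE)
head(DF1merged)   # Z now holds the matching value from DF2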

exclude rows which contain all 0's

I would like to exclude the rows in a data frame which contain all 0's.
I can check whether a row contains 0 by using the %in% operator, but I need to know how to iterate over an entire matrix and then print the new matrix excluding those rows.
How can I achieve that?
Using Senor O's sample data (DF <- data.frame(A=rep(c(1,0),5), B=0)), try:
DF[!rowSums(DF == 0) == ncol(DF), ]
This should work:
AllZeros = apply(DF, 1, function(X) all(X==0))
DF2 = DF[!AllZeros,]
Try it with the following as sample data:
DF <- data.frame(A = rep(c(1,0), 5), B = 0)
There are a ton of ways to do this as the guys who have answered so far will show you.
I'll provide one more example based on the same sample dataset:
DF <- data.frame(A=rep(c(1,0),5), B=0)
The subset command works well.
newDF <- subset(DF, !(A == 0 & B == 0) )
Depending on the size of your matrix and the naming convention of your variables, this may be tedious, in which case I'd go straight for the apply functions, or a rowSums one-liner like the sketch below.
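A hedged sketch in that spirit, mirroring the rowSums answer above so no column has to be named (newDF2 is just my name):
newDF2 <- DF[rowSums(DF != 0) > 0, , drop = FALSE]   # keep rows with at least one non-zero value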
