1) I need to intersect two vectors and return a vector of the same size containing the intersected values.
intersect() does not return a vector with the same size.
2) Also, why does this return c(TRUE, TRUE, TRUE) and not c(FALSE, TRUE, TRUE)?
set1 = c(TRUE,FALSE,TRUE)
set2 = c(FALSE,FALSE,TRUE)
testset = set1 %in% set2
> print(testset)
[1] TRUE TRUE TRUE
I got as result TRUE TRUE TRUE and I need FALSE FALSE TRUE.
To do the intersection, you need to use the & operator, as follows:
testset = set1 & set2
This will give you the following result: FALSE FALSE TRUE
Hope it helps.
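As a rough sketch of how the candidates differ on the vectors from the question (the variable names both, same, found and common are only illustrative):

```r
set1 <- c(TRUE, FALSE, TRUE)
set2 <- c(FALSE, FALSE, TRUE)

# Element-wise logical AND: TRUE only where both vectors are TRUE.
both <- set1 & set2
print(both)    # FALSE FALSE TRUE

# Element-wise equality: TRUE where the two vectors agree.
same <- set1 == set2
print(same)    # FALSE TRUE TRUE

# %in% asks "does this value occur anywhere in set2?". set2 contains both
# TRUE and FALSE, so every element of set1 is found.
found <- set1 %in% set2
print(found)   # TRUE TRUE TRUE

# intersect() treats its arguments as sets and drops duplicates,
# so its length need not match the inputs.
common <- intersect(set1, set2)
print(length(common))  # 2
```

Which one is "the intersection" depends on what the logical values stand for; if they are membership flags, & is probably the one you want.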
A %in% B checks, for every element in A, whether that element is in B. The result always has the same length as A. Try e.g.
1:3 %in% 1:9
1:9 %in% 1:3
I think what you want is this:
set1 == set2
[1] FALSE TRUE TRUE
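To make the length rule concrete, here is a small sketch of the two %in% calls suggested above, with the result lengths printed (a, b, a_in_b and b_in_a are illustrative names):

```r
a <- 1:3
b <- 1:9

# Every element of a occurs in b, so all three results are TRUE.
a_in_b <- a %in% b
print(a_in_b)          # TRUE TRUE TRUE
print(length(a_in_b))  # 3, the length of a

# Only the first three elements of b occur in a.
b_in_a <- b %in% a
print(b_in_a)          # TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
print(length(b_in_a))  # 9, the length of b
```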
My question concerns the practical difference between the == and %in% operators in R.
I have run into an instance at work where filtering with either operator gives different results (e.g. one results on 800 rows, and the other 1200). I have run into this problem in the past and am able to validate in a way that ensures I get the results I desire. However, I am still stumped regarding how they are different.
Can someone please shed some light on how these operators are different?
%in% is value matching. It is built on match(), which "returns a vector of the positions of (first) matches of its first argument in its second"; %in% itself returns a logical vector indicating whether each element of its left operand has a match (see help('%in%')). This means you can compare vectors of different lengths to see whether each element of one vector matches at least one element of another. The length of the output equals the length of the vector being compared (the first one).
1:2 %in% rep(1:2,5)
#[1] TRUE TRUE
rep(1:2,5) %in% 1:2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Note: the output is longer in the second case
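The relationship between match() and %in% can be sketched as follows. The vectors x and lookup are made up for illustration:

```r
x <- c("b", "z", "a")
lookup <- c("a", "b", "c", "b")

# match() gives the position of the first match, or NA when there is none.
pos <- match(x, lookup)
print(pos)     # 2 NA 1

# %in% only reports whether a match exists.
found <- x %in% lookup
print(found)   # TRUE FALSE TRUE

# A value is "in" the lookup vector exactly when match() finds a position.
same <- identical(found, !is.na(pos))
print(same)    # TRUE
```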
== is a logical operator that tests whether two things are exactly equal. If the vectors are of equal length, elements are compared element-wise. If not, the shorter vector is recycled. The length of the output equals the length of the longer vector.
1:2 == rep(1:2,5)
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
rep(1:2,5) == 1:2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
1:10 %in% 3:7
#[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#is same as
sapply(1:10, function(a) any(a == 3:7))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
NOTE: If you want to test whether two objects are equal as a whole, try to use identical or all.equal instead of ==.
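A short sketch of why that note matters, using made-up values: == answers per element and is exact, identical() answers once for the whole object, and all.equal() allows a small numeric tolerance.

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3)

# == answers element by element.
print(x == y)            # TRUE TRUE TRUE

# identical() answers once for the whole object.
print(identical(x, y))   # TRUE

# Floating-point arithmetic: == is exact, all.equal() allows a small tolerance.
print(0.1 + 0.2 == 0.3)                   # FALSE
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))  # TRUE

# identical() is also strict about type: integer 1:3 is not double c(1, 2, 3).
print(identical(1:3, x))  # FALSE
print(all(1:3 == x))      # TRUE
```

Wrapping all.equal() in isTRUE() is the usual pattern, since all.equal() returns a description of the difference rather than FALSE when the objects differ.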
Given two vectors, x and y, the code x == y will compare the first element of x with the first element of y, then the second element of x with the second element of y, and so on. When using x == y, the lengths of x and y should be the same; if they are not, R recycles the shorter vector. Here, compare means "is equal to" and therefore the output is a logical vector whose length equals that of x (or y).
In the code x %in% y, the first element of x is compared to all elements in y, then the second element of x is compared to all elements of y, and so on. Here, compare means "is the current element of x equal to any value in y" and therefore the output is a logical vector that has the same length as x and not (necessarily) y.
Here is a code snippet illustrating the difference. Note that x and y have the same lengths but the elements of y are the elements of x in different order. Note too in the final examples that x is a 3-element vector being compared to the letters vector, which contains 26 elements.
> x <- c('a','b','c')
> y <- c('c', 'b', 'a')
> x == y
[1] FALSE TRUE FALSE
> x %in% y
[1] TRUE TRUE TRUE
> x %in% letters
[1] TRUE TRUE TRUE
> letters %in% x
[1] TRUE TRUE TRUE FALSE FALSE FALSE
[7] FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE
Try it for objects of different length.
ac <- c("a", "b", "c")
ae <- c("a", "b", "c", "d", "e")
ac %in% ae
[1] TRUE TRUE TRUE
ac == ae
[1] TRUE TRUE TRUE FALSE FALSE
It becomes clear that %in% checks whether each element is contained in the other object, whereas == is a logical operator that compares elements position by position.
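This is probably also where the differing row counts in the question come from. A minimal sketch, assuming a made-up data frame df with a column grp and a vector wanted of values to keep:

```r
# Hypothetical data: 'grp' and 'wanted' are invented for illustration.
df <- data.frame(id = 1:6, grp = c("a", "a", "b", "b", "c", "a"))
wanted <- c("a", "b")

# == recycles 'wanted' along grp: row 1 vs "a", row 2 vs "b", row 3 vs "a", ...
# A row is kept only if it happens to line up with the recycled value.
by_eq <- df[df$grp == wanted, ]

# %in% keeps every row whose grp is any of the wanted values.
by_in <- df[df$grp %in% wanted, ]

print(by_eq$id)  # 1 4
print(by_in$id)  # 1 2 3 4 6
```

Because 6 is a multiple of 2, the == version runs without any warning, which makes the dropped rows easy to miss.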
== checks whether each element of a vector is equal to the corresponding element of another vector. Ideally the two vectors have the same size; otherwise the results can be unexpected, because R recycles the shorter vector (silently if the longer length is a multiple of the shorter one). For instance:
c(1,2,3) == c(1,3,2)
[1] TRUE FALSE FALSE
or
c(1,2) == c(1,3,2)
[1] TRUE FALSE FALSE
Warning message:
In c(1, 2) == c(1, 3, 2) :
longer object length is not a multiple of shorter object length
%in%, on the other hand, checks which elements of the first vector are included in the second:
c(1,2,3) %in% c(1,3,2)
[1] TRUE TRUE TRUE
or
c(1,2) %in% c(1,3,2)
[1] TRUE TRUE
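One more practical difference, not covered above but easy to verify, is how the two operators treat missing values. A small sketch with a made-up vector x:

```r
x <- c(1, NA, 3)

# == propagates NA: comparing a missing value gives a missing result.
eq <- x == 1
print(eq)      # TRUE NA FALSE

# %in% never returns NA: a missing value is simply "not found" ...
inn <- x %in% 1
print(inn)     # TRUE FALSE FALSE

# ... unless NA itself is among the values being looked up.
inn_na <- x %in% c(1, NA)
print(inn_na)  # TRUE TRUE FALSE
```

When such a result is used to filter rows, the NA from == can change how many rows come back, so this is worth checking if the counts disagree.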
I try to subset values in R depending on values in column y like shown in the following:
I have the data set "data" which is like this:
data <- data.frame(y = c(0,0,2000,1500,20,77,88),
a = "bla", b = "bla")
And would end up with this:
I have this R code:
data <- arrange(subset(data, y != 0 & y < 1000 & y !=77 & [...]), desc(y))
print(head(data, n =100))
Which works.
However I would like to collect the values to exclude in a list as:
[0, 1000, 77]
And somehow loop through this, with the lowest possible running time instead of hardcoding them directly in the formula. Any ideas?
The list, should only contain "!=" operations:
[0, 77]
and the "<" should remain in the formula or in another list.
I'm going to answer your original question because it's more interesting. I hope you won't mind.
Imagine you had values and operators to apply to your data:
my.operators <- c("!=","<","!=")
my.values <- c(0,1000,77)
You can use Map from base R to apply a function to two vectors. Here I'll use get so we can obtain the actual operator given by the character string.
Map(function(x,y)get(y)(data$y,x),my.values,my.operators)
[[1]]
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[[2]]
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[[3]]
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
As you can see, we get a list of logical vectors for each value, operator pair.
To better understand what's going on here, consider only the first value of each vector:
get("!=")(data$y,0)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
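The reason get("!=") works is that operators in R are ordinary functions. A small sketch, assuming the same y values as in the question (r1 to r4 are illustrative names):

```r
y <- c(0, 0, 2000, 1500, 20, 77, 88)

# An operator is an ordinary function; backticks let you call it by name.
r1 <- y != 0
r2 <- `!=`(y, 0)

# get() and match.fun() both look the function up from its name as a string.
r3 <- get("!=")(y, 0)
r4 <- match.fun("!=")(y, 0)

print(r1)  # FALSE FALSE TRUE TRUE TRUE TRUE TRUE
print(identical(r1, r2) && identical(r1, r3) && identical(r1, r4))  # TRUE
```

match.fun() is arguably the safer choice of the two, since it only ever returns a function.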
Now we can use Reduce to combine the logical vectors with &. With only != comparisons, as in the edited question:
Reduce(`&`,lapply(my.values,function(x) data$y!=x))
[1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
And finally subset the data:
data[Reduce("&",Map(function(x,y)get(y)(data$y,x),my.values,my.operators)),]
   y   a   b
5 20 bla bla
7 88 bla bla
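For the edited question, where the list holds only "!=" values, a simpler route might be %in%. This is a sketch in base R rather than the approach above; the names exclude, keep and result are illustrative:

```r
data <- data.frame(y = c(0, 0, 2000, 1500, 20, 77, 88),
                   a = "bla", b = "bla")

# Values to drop; extend this vector instead of adding more `!=` terms.
exclude <- c(0, 77)

# Keep rows whose y is not excluded and is below the threshold.
keep <- !(data$y %in% exclude) & data$y < 1000
result <- data[keep, ]

# Sort by y in decreasing order, as arrange(..., desc(y)) would.
result <- result[order(result$y, decreasing = TRUE), ]
print(result$y)  # 88 20
```

This avoids looping over the values entirely, since %in% is vectorised over both sides.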
I'm trying to test a column of my dataset for dynamically changing given values. The values come from a previous calculation and change all the time, such that the ifelse command cannot be used.
I tried it with a for-loop since it needs to be flexible but it was not working. An example of my problem is below:
require(dplyr)
data <- data.frame(step=c(1,1,1,1,3,3,3,3,4,4,5,6,7,7,7,7,4,4,4,4,6,5,7,7,3,4,3,1))
data <- mutate(data, col2 = 0)
data <- mutate(data, col3 = 0)
data_check <- data.frame(step=c(3,4))
for(j in 1:length(data_check)){
  for(i in 1:nrow(data)){
    if(data$step[i] == data_check[j]){
      data <- mutate(data, Occurrence = 1)
    } else {
      data <- mutate(data, Occurrence = 0)
    }
  }
}
The goal is to get an additional column 'Occurrence' in the dataset, which tells if any of the given values occur or not.
I can't understand what you're trying to do, but if you're trying to test if each entry in data$step is present in data_check or not, then something like:
data_check <- c(3, 4) # a plain vector of the values to look for
data$Occurrence <- data$step %in% data_check
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[25] TRUE TRUE TRUE FALSE
EDIT: and as Eumenedies said, you want to apply as.numeric() to that.
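Put together, the 0/1 column could look roughly like this. The step values here are a shortened, made-up version of the question's data:

```r
data <- data.frame(step = c(1, 1, 3, 4, 5, 3, 7))

# Values to look for; in the question these would come from an earlier calculation.
data_check <- c(3, 4)

# %in% gives TRUE/FALSE per row; as.numeric() turns that into 1/0.
data$Occurrence <- as.numeric(data$step %in% data_check)
print(data$Occurrence)  # 0 0 1 1 0 1 0
```

No loop is needed, and data_check can change length freely between runs.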
Looking for a better way: How can I make R check the values of a flexible subset of multiple columns element-wise (let's say Var2 and Var3 here) and write the result of the check to a new logical column?
Is there a shorter, more elegant way than using row-wise apply() here?
df <- read.csv(
text = '"Var1","Var2","Var3"
"","",""
"","","a"
"","a",""
"a","a","a"
"a","","a"
"","a",""
"","",""
"","","a"
"","a",""
"","","a"'
)
criticalColumns <- c("Var2", "Var3")
df$criticalColumnsAreEmpty <-
apply(df[, criticalColumns], 1, function(curRow) {
return(all(curRow == ""))
})
I could also do this explicitly, but then it is not flexible:
df$criticalColumnsAreEmpty <- df$Var2 == "" & df$Var3 == ""
Desired output:
Var1 Var2 Var3 criticalColumnsAreEmpty
               TRUE
          a    FALSE
     a         FALSE
a    a    a    FALSE
a         a    FALSE
     a         FALSE
               TRUE
          a    FALSE
     a         FALSE
          a    FALSE
We can use rowSums on the logical matrix
df$criticalColumnsAreEmpty <- !rowSums(df[criticalColumns]!="")
df$criticalColumnsAreEmpty
#[1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Another option (for big datasets, to avoid converting to a matrix for memory reasons) is to loop over the columns, check whether the elements are blank, and use Reduce with &:
Reduce(`&`, lapply(df[criticalColumns], function(x) !nzchar(as.character(x))))
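If the check is needed in several places, it could be wrapped in a small helper. This is only a sketch: all_blank is a hypothetical name, and the three-row df is a cut-down stand-in for the data in the question.

```r
# Hypothetical helper: TRUE for rows where every listed column is "".
all_blank <- function(df, cols) {
  unname(rowSums(df[cols] != "") == 0)
}

df <- data.frame(Var1 = c("", "a", "a"),
                 Var2 = c("", "a", ""),
                 Var3 = c("", "a", "a"),
                 stringsAsFactors = FALSE)
criticalColumns <- c("Var2", "Var3")

by_rowsums <- all_blank(df, criticalColumns)
by_reduce  <- Reduce(`&`, lapply(df[criticalColumns], function(x) !nzchar(x)))

print(by_rowsums)                    # TRUE FALSE FALSE
print(all(by_rowsums == by_reduce))  # TRUE
```

Both versions take the column names as a character vector, so the set of critical columns stays flexible.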