df <- structure(list(x = 1:10, time = c(0.5, 0.5, 1, 2, 3, 0.5, 0.5,
1, 2, 3)), .Names = c("x", "time"), row.names = c(NA, -10L), class = "data.frame")
df[df$time %in% c(0.5, 3), ]
##     x time
## 1   1  0.5
## 2   2  0.5
## 5   5  3.0
## 6   6  0.5
## 7   7  0.5
## 10 10  3.0
df[df$time == c(0.5, 3), ]
##     x time
## 1   1  0.5
## 7   7  0.5
## 10 10  3.0
What is the difference between %in% and == here?
The problem is vector recycling.
Your first line does exactly what you'd expect. It checks which elements of df$time are in c(0.5, 3) and returns the rows for which that is true.
Your second line is trickier. It's actually equivalent to
df[df$time == rep(c(0.5,3), length.out=nrow(df)),]
To see this, let's look at what happens if we use the vector rep(0.5, 10):
rep(0.5, 10) == c(0.5, 3)
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
See how it returns TRUE at every odd position. Essentially it's comparing each 0.5 to the corresponding element of the recycled vector c(0.5, 3, 0.5, 3, 0.5, ...).
You can manipulate a vector to produce no matches this way. Take the vector: rep(c(3, 0.5), 5):
rep(c(3, 0.5), 5) == c(0.5, 3)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
They're all FALSE; you are matching every 0.5 with 3 and vice versa.
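To check the recycling claim against the data from the question, here is a minimal sketch. It rebuilds df with data.frame() so it runs on its own; the names recycled, rows_eq and rows_in are only illustrative.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
df <- data.frame(x = 1:10, time = c(0.5, 0.5, 1, 2, 3, 0.5, 0.5, 1, 2, 3))

# What == effectively compares against: c(0.5, 3) repeated to 10 elements
recycled <- rep(c(0.5, 3), length.out = nrow(df))

# The implicit and the explicit comparison give the same logical vector
same <- identical(df$time == c(0.5, 3), df$time == recycled)

# Row numbers selected by each approach
rows_eq <- which(df$time == c(0.5, 3))    # 1 7 10
rows_in <- which(df$time %in% c(0.5, 3))  # 1 2 5 6 7 10
```

Rows 2, 5 and 6 are dropped by == only because they happen to sit opposite the "wrong" element of the recycled vector.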
In
df$time == c(0.5,3)
the c(0.5,3) is first recycled to the length of df$time, i.e. c(0.5,3,0.5,3,0.5,3,0.5,3,0.5,3). Then the two vectors are compared element by element.
On the other hand,
df$time %in% c(0.5,3)
checks whether each element of df$time belongs to the set {0.5, 3}.
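If you want to spell out the set-membership test by hand, something like the following should be equivalent for data without NAs. This is only a sketch: df is rebuilt here so the snippet is standalone, and via_in, via_or and via_match are made-up names.

```r
df <- data.frame(x = 1:10, time = c(0.5, 0.5, 1, 2, 3, 0.5, 0.5, 1, 2, 3))

# Set membership
via_in <- df$time %in% c(0.5, 3)

# One == per value, each against a length-1 operand, combined with |
via_or <- df$time == 0.5 | df$time == 3

# %in% is a thin wrapper around match()
via_match <- match(df$time, c(0.5, 3), nomatch = 0L) > 0L

identical(via_in, via_or)     # TRUE
identical(via_in, via_match)  # TRUE
```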
This is an old thread, but I haven't seen this answer anywhere and it might be relevant for some people.
Another difference between the two is handling of NAs (missing values).
NA == NA
[1] NA
NA %in% c(NA)
[1] TRUE
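This matters when subsetting, because an NA in a logical index produces an NA element in the result. A small sketch with a made-up vector v:

```r
v <- c(0.5, NA, 3)

# v == 0.5 is c(TRUE, NA, FALSE); the NA index yields an NA element
eq_rows <- v[v == 0.5]            # 0.5 NA

# %in% never returns NA, so the missing value is simply not selected
in_rows <- v[v %in% 0.5]          # 0.5

# which() drops NA, which is another way to keep == but lose the NA
which_rows <- v[which(v == 0.5)]  # 0.5
```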
I have two vectors,
x <- c(1,2,2,3,4)
y <- c(1,2,3)
And I want to get another vector of the elements that are in x that aren't in y; so in this case (2,4).
I've tried using the setdiff() function but this doesn't take into account duplicates (it would return only 4), so I'm not sure how to go about this.
Thank you!
Maybe try this:
x[-match(y,x,nomatch = 0)]
The nomatch = 0 is necessary to avoid mixing NAs with negative subscripts.
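One caveat worth knowing: if none of the elements of y occur in x, every index is 0, and x[-0] is the same as x[0], which is empty rather than the whole of x. Below is a sketch of a guarded version; drop_first_matches is a hypothetical helper name, not something from base R.

```r
x <- c(1, 2, 2, 3, 4)
y <- c(1, 2, 3)

res <- x[-match(y, x, nomatch = 0)]     # 2 4

# Nothing in c(9) matches, so the index is 0 and the result is empty
empty <- x[-match(9, x, nomatch = 0)]   # numeric(0)

# Hypothetical helper: drop the zeros and skip subsetting when nothing matched
drop_first_matches <- function(x, y) {
  idx <- match(y, x, nomatch = 0)
  idx <- idx[idx > 0]
  if (length(idx) > 0) x[-idx] else x
}

guarded <- drop_first_matches(x, 9)     # 1 2 2 3 4
```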
To deal with additional duplicates, as mentioned in the comments, another option might be to use vsetdiff from the package vecsets:
library(vecsets)
x = c(1, 2, 2, 3, 3, 4)
y = c(1, 2, 2, 3)
vsetdiff(x, y)
[1] 3 4
It won't give the results discussed by @Gregor; however, it should give the correct results for the example in the question:
x[duplicated(x) | !x %in% y]
[1] 2 4
In individual steps:
duplicated(x)
[1] FALSE FALSE TRUE FALSE FALSE
!x %in% y
[1] FALSE FALSE FALSE FALSE TRUE
duplicated(x) | !x %in% y
[1] FALSE FALSE TRUE FALSE TRUE
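To make the limitation concrete, here is the same expression applied to the data with additional duplicates from the vecsets example above. The name keep is only illustrative.

```r
x <- c(1, 2, 2, 3, 3, 4)
y <- c(1, 2, 2, 3)

# duplicated(x) flags the second 2 and the second 3;
# !x %in% y flags only the 4
keep <- duplicated(x) | !x %in% y

x[keep]  # 2 3 4, whereas removing each element of y once would leave 3 4
```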
Considering OP's original example and @Gregor's comment, I wrote the following function, which does what OP wants and also takes into account what @Gregor pointed out:
## function to find values in x that are absent in y
x.not.in.y <- function(x, y) {
  # get freq tables for x and y
  x.tab <- table(x)
  y.tab <- table(y)
  # if a value is missing in y then set its freq to zero
  y.tab[setdiff(names(x.tab), names(y.tab))] <- 0
  # drop values that occur only in y
  y.tab <- y.tab[names(y.tab) %in% names(x.tab)]
  # get the difference of x and y freq and keep if > 0
  diff.tab <- x.tab[order(names(x.tab))] - y.tab[order(names(y.tab))]
  diff.tab <- diff.tab[diff.tab > 0]
  # output vector of x values missing in y
  unlist(
    lapply(names(diff.tab), function(val) {
      rep(as.numeric(val), diff.tab[val])
    }),
    use.names = FALSE)
}
# OP's original data
x.not.in.y(x = c(1,2,2,3,4), y = c(1,2,3))
#> [1] 2 4
# @Gregor's data
x.not.in.y(x = c(1,2,2,3,3,4), y = c(1,2,2,3))
#> [1] 3 4
# some other data with an extra value in y that is absent in x
x.not.in.y(x = c(1,2,2,2,2,3,3,3,4,5), y = c(1,2,3,6))
#> [1] 2 2 2 3 3 4 5
Created on 2019-04-15 by the reprex package (v0.2.1)
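For comparison, a shorter base R sketch that removes one occurrence from x for each element of y. This is an alternative I am suggesting, not the function above; multiset_diff is a made-up name, and the loop is fine for short vectors but not tuned for large ones.

```r
# Hypothetical helper: remove the first remaining match in x for each element of y
multiset_diff <- function(x, y) {
  for (v in y) {
    i <- match(v, x)             # position of the first remaining match, or NA
    if (!is.na(i)) x <- x[-i]    # drop exactly one occurrence
  }
  x
}

multiset_diff(c(1, 2, 2, 3, 4), c(1, 2, 3))                       # 2 4
multiset_diff(c(1, 2, 2, 3, 3, 4), c(1, 2, 2, 3))                 # 3 4
multiset_diff(c(1, 2, 2, 2, 2, 3, 3, 3, 4, 5), c(1, 2, 3, 6))     # 2 2 2 3 3 4 5
```

Unlike the table-based version, this keeps the remaining elements in their original order in x and does not rely on converting names back with as.numeric().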
I am looking for an expression that returns the index of the first element of a vector satisfying a condition.
For example:
I have a vector b = c(0.1, 0.2, 0.7, 0.9)
I want to know the first index of b for which, say, b > 0.65. In this case the answer should be 3.
I tried which.min(subset(b, b > 0.65))
But this gives me 1 instead of 3.
Please help
Use which and take the first element of the result:
which(b > 0.65)[1]
#[1] 3
Be careful: which.max gives the wrong answer if the condition is never met, because it returns 1 rather than NA:
> a <- c(1, 2, 3, 2, 5)
> a >= 6
[1] FALSE FALSE FALSE FALSE FALSE
> which(a >= 6)[1]
[1] NA # desirable
> which.max(a >= 6)
[1] 1 # not desirable
Why? When all elements are equal, which.max returns 1:
> b <- c(2, 2, 2, 2, 2)
> which.max(b)
[1] 1
Note: FALSE < TRUE
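If you want a single call that returns NA when nothing satisfies the condition, match(TRUE, ...) is another option. A minimal sketch using the vector a from above; first_hit and first_miss are just illustrative names.

```r
a <- c(1, 2, 3, 2, 5)

# Position of the first TRUE, or NA if there is none
first_hit  <- match(TRUE, a >= 3)   # 3
first_miss <- match(TRUE, a >= 6)   # NA

# For contrast: which.max on an all-FALSE vector still returns 1
wm <- which.max(a >= 6)             # 1
```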
You may use which.max:
which.max(b > 0.65)
# [1] 3
From ?which.max: "For a logical vector x, [...] which.max(x) return[s] the index of the first [...] TRUE"
b > 0.65
# [1] FALSE FALSE TRUE TRUE
You should also have a look at the result of your code subset(b, b > 0.65) to see why it can't give you the desired result.
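Spelled out, assuming the b from the question (kept is an illustrative name):

```r
b <- c(0.1, 0.2, 0.7, 0.9)

# subset() returns the matching values, not their positions in b
kept <- subset(b, b > 0.65)   # 0.7 0.9

# so which.min() indexes into the two-element result, not into b
which.min(kept)               # 1

# the positions in the original vector come from which()
which(b > 0.65)               # 3 4
```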