Difference between using "%in%" and "==" while subsetting in R [duplicate]

df <- structure(list(x = 1:10, time = c(0.5, 0.5, 1, 2, 3, 0.5, 0.5,
1, 2, 3)), .Names = c("x", "time"), row.names = c(NA, -10L), class = "data.frame")
df[df$time %in% c(0.5, 3), ]
## x time
## 1 1 0.5
## 2 2 0.5
## 5 5 3.0
## 6 6 0.5
## 7 7 0.5
## 10 10 3.0
df[df$time == c(0.5, 3), ]
## x time
## 1 1 0.5
## 7 7 0.5
## 10 10 3.0
What is the difference between %in% and == here?

The problem is vector recycling.
Your first line does exactly what you'd expect. It checks which elements of df$time are in c(0.5, 3) and returns the rows where that is true.
Your second line is trickier. It's actually equivalent to
df[df$time == rep(c(0.5,3), length.out=nrow(df)),]
To see this, let's see what happens if you use the vector rep(0.5, 10):
rep(0.5, 10) == c(0.5, 3)
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
See how it returns TRUE at every odd position. Essentially it's matching 0.5 against the vector c(0.5, 3, 0.5, 3, 0.5...)
You can manipulate a vector to produce no matches this way. Take the vector: rep(c(3, 0.5), 5):
rep(c(3, 0.5), 5) == c(0.5, 3)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
They're all FALSE; you are matching every 0.5 with 3 and vice versa.
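A minimal sketch of the recycling described above, using the time column from the question as a plain vector. The names recycled, eq_mask and in_mask are only for illustration.

```r
# The time column from the question, as a plain vector
time <- c(0.5, 0.5, 1, 2, 3, 0.5, 0.5, 1, 2, 3)

# What == effectively compares against: c(0.5, 3) repeated to length 10
recycled <- rep(c(0.5, 3), length.out = length(time))

# Element-by-element comparison against the recycled vector
eq_mask <- time == c(0.5, 3)
identical(eq_mask, time == recycled)  # TRUE
which(eq_mask)                        # 1 7 10

# Membership test: position in c(0.5, 3) does not matter
in_mask <- time %in% c(0.5, 3)
which(in_mask)                        # 1 2 5 6 7 10
```

Because 10 is a multiple of 2, R recycles silently here. If the longer length were not a multiple of the shorter one, == would still recycle but would also emit a warning.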

In
df$time == c(0.5,3)
the c(0.5,3) first gets recycled (broadcast) to the length of df$time, i.e. c(0.5,3,0.5,3,0.5,3,0.5,3,0.5,3). Then the two vectors are compared element-by-element.
On the other hand,
df$time %in% c(0.5,3)
checks whether each element of df$time belongs to the set {0.5, 3}.
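If you want to spell the set-membership test out yourself, the sketch below shows two equivalent forms for this example: one == comparison per value combined with |, and the match() call that %in% is built on. The variable names are illustrative.

```r
time <- c(0.5, 0.5, 1, 2, 3, 0.5, 0.5, 1, 2, 3)

# Set membership via %in%
in_mask <- time %in% c(0.5, 3)

# Same result with one == per value, combined with |
or_mask <- time == 0.5 | time == 3

# %in% is a thin wrapper around match(): "did we find a position?"
match_mask <- match(time, c(0.5, 3), nomatch = 0L) > 0L

identical(in_mask, or_mask)     # TRUE
identical(in_mask, match_mask)  # TRUE
```

The | form gets unwieldy with many values, which is where %in% is the more readable choice.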

This is an old thread, but I haven't seen this answer anywhere and it might be relevant for some people.
Another difference between the two is handling of NAs (missing values).
NA == NA
[1] NA
NA %in% c(NA)
[1] TRUE
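This matters when subsetting: an NA in the logical index produces an NA element in the result, while %in% never returns NA. A small sketch with a made-up vector v:

```r
v <- c(1, NA, 3)

# == propagates the missing value into the logical index
eq_mask <- v == 1      # TRUE NA FALSE
# %in% treats NA as "not matched" and returns FALSE
in_mask <- v %in% 1    # TRUE FALSE FALSE

v[eq_mask]             # 1 NA  (the NA index yields an NA element)
v[in_mask]             # 1
```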

Related

How to find elements of one vector that aren't in another (not using setdiff)

I have two vectors,
x <- c(1,2,2,3,4)
y <- c(1,2,3)
And I want to get another vector of the elements that are in x that aren't in y; so in this case (2,4).
I've tried using the setdiff() function but this doesn't take into account duplicates (it would return only 4), so I'm not sure how to go about this.
Thank you!

Maybe try this:
x[-match(y,x,nomatch = 0)]
The nomatch = 0 is necessary to avoid mixing NAs with negative subscripts.
To deal with additional duplicates, as mentioned in the comments, another option might be to use vsetdiff from the package vecsets:
library(vecsets)
x = c(1, 2, 2, 3, 3, 4)
y = c(1, 2, 2, 3)
> vsetdiff(x,y)
[1] 3 4
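One edge case with the match() one-liner: if nothing in y occurs in x, every index is 0, and x[-0] is the same as x[0], which returns an empty vector instead of x. Below is a hedged sketch of a guarded version; remove_first_matches is a hypothetical helper name, not something from base R or vecsets.

```r
# Hypothetical helper: drop the first occurrence in x of each value in y
remove_first_matches <- function(x, y) {
  idx <- match(y, x, nomatch = 0L)
  idx <- idx[idx > 0L]              # discard the "no match" zeros
  if (length(idx) == 0L) return(x)  # nothing to drop: return x unchanged
  x[-idx]
}

remove_first_matches(c(1, 2, 2, 3, 4), c(1, 2, 3))  # 2 4
remove_first_matches(c(1, 2, 2, 3, 4), c(7, 8))     # 1 2 2 3 4
```

Like the one-liner, this only removes the first occurrence of each value, so repeated values in y are not counted separately.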

It won't give the results discussed by @Gregor; however, it should give the correct result for the example:
x[duplicated(x) | !x %in% y]
[1] 2 4
In individual steps:
duplicated(x)
[1] FALSE FALSE TRUE FALSE FALSE
!x %in% y
[1] FALSE FALSE FALSE FALSE TRUE
duplicated(x) | !x %in% y
[1] FALSE FALSE TRUE FALSE TRUE
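To make the limitation concrete, here is the same expression applied to the data with extra duplicates that appears elsewhere in this thread (x = c(1, 2, 2, 3, 3, 4), y = c(1, 2, 2, 3)). A count-aware difference would be 3 4, but this approach keeps every repeated value of x:

```r
x <- c(1, 2, 2, 3, 3, 4)
y <- c(1, 2, 2, 3)

# Keeps every repeat in x, regardless of how often the value occurs in y
keep <- duplicated(x) | !x %in% y
x[keep]  # 2 3 4
```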

Considering OP's original example and reading @Gregor's comment, I wrote the following function that does what OP wants and also takes into account what @Gregor pointed out
## function to find values in x that are absent in y
x.not.in.y <- function(x, y) {
  # get freq tables for x and y
  x.tab <- table(x)
  y.tab <- table(y)
  # if a value is missing in y then set its freq to zero
  y.tab[setdiff(names(x.tab), names(y.tab))] <- 0
  y.tab <- y.tab[names(y.tab) %in% names(x.tab)]
  # get the difference of x and y freq and keep if > 0
  diff.tab <- x.tab[order(names(x.tab))] - y.tab[order(names(y.tab))]
  diff.tab <- diff.tab[diff.tab > 0]
  # output vector of x values missing in y
  unlist(
    lapply(names(diff.tab), function(val) {
      rep(as.numeric(val), diff.tab[val])
    }),
    use.names = FALSE)
}
# OP's original data
x.not.in.y(x = c(1,2,2,3,4), y = c(1,2,3))
#> [1] 2 4
# @Gregor's data
x.not.in.y(x = c(1,2,2,3,3,4), y = c(1,2,2,3))
#> [1] 3 4
# some other data with an extra value in y that is absent in x
x.not.in.y(x = c(1,2,2,2,2,3,3,3,4,5), y = c(1,2,3,6))
#> [1] 2 2 2 3 3 4 5
Created on 2019-04-15 by the reprex package (v0.2.1)
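For comparison, a shorter sketch of the same count-aware difference that avoids table() and the as.numeric() conversion, so it should also work on character vectors. multiset_diff is a hypothetical name, and the loop is meant for clarity rather than speed on long vectors.

```r
# Hypothetical helper: for each element of y, remove one matching element of x
multiset_diff <- function(x, y) {
  for (val in y) {
    i <- match(val, x)            # first remaining position of val, or NA
    if (!is.na(i)) x <- x[-i]     # drop exactly one occurrence
  }
  x
}

multiset_diff(c(1, 2, 2, 3, 4), c(1, 2, 3))                       # 2 4
multiset_diff(c(1, 2, 2, 3, 3, 4), c(1, 2, 2, 3))                 # 3 4
multiset_diff(c(1, 2, 2, 2, 2, 3, 3, 3, 4, 5), c(1, 2, 3, 6))     # 2 2 2 3 3 4 5
multiset_diff(c("a", "b", "a"), "a")                              # "b" "a"
```

Unlike the table-based version, this keeps the remaining elements in their original order in x.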

R get index satisfying the condition [duplicate]

I am looking for a condition which will return the index of a vector satisfying a condition.
For example-
I have a vector b = c(0.1, 0.2, 0.7, 0.9)
I want to know the first index of b for which, say, b > 0.65. In this case the answer should be 3.
I tried which.min(subset(b, b > 0.65))
But this gives me 1 instead of 3.
Please help

Use which and take the first element of the result:
which(b > 0.65)[1]
#[1] 3

Be careful: which.max gives a misleading result if the condition is never met. It returns 1 rather than NA:
> a <- c(1, 2, 3, 2, 5)
> a >= 6
[1] FALSE FALSE FALSE FALSE FALSE
> which(a >= 6)[1]
[1] NA # desirable
> which.max(a >= 6)
[1] 1 # not desirable
Why? When all elements are equal, which.max returns 1:
> b <- c(2, 2, 2, 2, 2)
> which.max(b)
[1] 1
Note: FALSE < TRUE
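If you need this in several places, the safe behaviour can be wrapped in a small function. A minimal sketch; first_true is a hypothetical name, not a base R function.

```r
# Hypothetical helper: index of the first TRUE, or NA if there is none
first_true <- function(cond) {
  idx <- which(cond)   # which() also drops NA entries in cond
  if (length(idx) == 0L) NA_integer_ else idx[1L]
}

a <- c(1, 2, 3, 2, 5)
first_true(a >= 3)     # 3
first_true(a >= 6)     # NA
which.max(a >= 6)      # 1, even though no element satisfies the condition
```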

You may use which.max:
which.max(b > 0.65)
# [1] 3
From ?which.max: "For a logical vector x, [...] which.max(x) return[s] the index of the first [...] TRUE"
b > 0.65
# [1] FALSE FALSE TRUE TRUE
You should also have a look at the result of your code subset(b, b > 0.65) to see why it can't give you the desired result.
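A short sketch of why the original attempt returns 1: subset() throws away the original positions, so which.min() indexes into the shortened vector rather than into b. The name kept is only for illustration.

```r
b <- c(0.1, 0.2, 0.7, 0.9)

# subset() keeps only the matching values; their positions in b are lost
kept <- subset(b, b > 0.65)   # 0.7 0.9
which.min(kept)               # 1: position of 0.7 within kept, not within b

# Working on the logical vector keeps the positions relative to b
which(b > 0.65)[1]            # 3
```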
