I have been using the %in% operator ever since I learned about it.
However, I still don't fully understand how it works. I thought I did, but I am never sure about the order of the operands.
Here is an example:
This is my dataframe:
df <- data.frame("col1"=c(1,2,3,4,30,21,320,123,4351,1234,3,0,43), "col2"=rep("something",13))
This is how it looks:
> df
col1 col2
1 1 something
2 2 something
3 3 something
4 4 something
5 30 something
6 21 something
7 320 something
8 123 something
9 4351 something
10 1234 something
11 3 something
12 0 something
13 43 something
Let's say I have a numerical vector:
myvector <- c(30,43,12,333334,14,4351,0,5,55,66)
And I want to check if all the numbers (or some) from my vector are in the previous dataframe. To do that, I always use %in%.
I thought of two approaches:
#common in both: 30, 4351, 0, 43
# are the numbers from df$col1 in my vector?
trial1 <- subset(df, df$col1 %in% myvector)
# are the numbers of the vector in df$col1?
trial2 <- subset(df, myvector %in% df$col1)
Both approaches make sense to me and they should give the same result. However, only the result from trial1 is okay.
> trial1
col1 col2
5 30 something
9 4351 something
12 0 something
13 43 something
What I don't understand is why the second way gives me some of the common numbers plus some which are not in the vector:
> trial2
col1 col2
1 1 something
2 2 something
6 21 something
7 320 something
11 3 something
12 0 something
Could someone explain to me how the %in% operator works and why the second way gives me the wrong result?
Thanks very much in advance
Regards
The answer has already been given, but for a bit more detail, simply look at what %in% returns:
df$col1 %in% myvector
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
This one is correct: you subset df and keep the rows where the value is TRUE, i.e. rows 5, 9, 12 and 13
versus
myvector %in% df$col1
# [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
This one goes wrong: you subset df with a logical vector whose TRUE entries are at positions 1, 2, 6 and 7, and because its length is only 10, R recycles it for rows 11, 12 and 13 as TRUE, TRUE, FALSE. That is why rows 11 and 12 end up in your subset as well.
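To make the recycling concrete, here is a small sketch using the df and myvector from the question. The names keep_rows, short_mask, recycled and in_both are my own, chosen only for illustration.

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

# x %in% table returns one TRUE/FALSE per element of x (the left-hand side),
# so to filter the rows of df the left-hand side has to be the df column.
keep_rows <- df$col1 %in% myvector
length(keep_rows)    # 13, one entry per row of df
df[keep_rows, ]      # rows 5, 9, 12 and 13

# The reversed call has one entry per element of myvector: 10, not 13.
short_mask <- myvector %in% df$col1
length(short_mask)   # 10

# Used as a row index it is recycled: rows 11-13 reuse entries 1-3.
recycled <- rep(short_mask, length.out = nrow(df))
which(recycled)      # 1 2 6 7 11 12

# If you only need the common values and not the rows:
in_both <- intersect(myvector, df$col1)
in_both
```

The rule of thumb is that the thing you want to filter goes on the left of %in%, and the set of allowed values goes on the right.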
Suppose I have the following vector:
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
How can I find which elements are either 8 or 9?
This is one way to do it. First I get the indices at which x is either 8 or 9. Then we can verify that at those indices, x is indeed 8 or 9.
> inds <- which(x %in% c(8,9))
> inds
[1] 1 3 4 12 15 19
> x[inds]
[1] 8 9 9 8 9 8
You could try the | operator for short conditions
which(x == 8 | x == 9)
In this specific case you could also use grep:
# option 1
grep('[89]',x)
# option 2
grep('8|9',x)
which both give:
[1] 1 3 4 12 15 19
When you also want to detect numbers with more than one digit, the second option is preferred:
> grep('10|8',x)
[1] 1 12 18 19
However, I put the emphasis on this specific case at the start of my answer for a reason. As @DavidArenburg mentioned, this can lead to unintended results. For example, grep('1|8',x) will detect both 1 and 10:
> grep('1|8',x)
[1] 1 10 12 18 19
To avoid that side effect, you have to wrap the numbers to be detected in word boundaries:
> grep('\\b1\\b|8',x)
[1] 1 10 12 19
Now, the 10 isn't detected.
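If you do go the regex route for several whole numbers, one option is to build an anchored pattern so that only complete values match. This is only a sketch; targets and pattern are names I made up for the example.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
targets <- c(1, 8)

# ^ and $ anchor the alternation to the whole string,
# so the "1" alternative can no longer match inside "10".
pattern <- paste0("^(", paste(targets, collapse = "|"), ")$")
pattern            # "^(1|8)$"
grep(pattern, x)   # 1 10 12 19

# For plain numeric comparison, %in% gives the same positions without a regex.
which(x %in% targets)
```

For numeric data the %in% version is usually the safer choice, since grep has to convert x to character first.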
Here is a generalized solution to find the locations of all target values (only works for vectors and 1-dimensional arrays).
locate <- function(x, targets) {
  results <- lapply(targets, function(target) which(x == target))
  names(results) <- targets
  results
}
This function returns a list because each target may have any number of matches, including zero. The list is sorted (and named) in the original order of the targets.
Here is an example in use:
sequence <- c(1:10, 1:10)
locate(sequence, c(2,9))
$`2`
[1] 2 12
$`9`
[1] 9 19
Alternatively, if you do not need to use the indices but just the elements you can do
> x <- sample(1:10,20,replace=TRUE)
> x
[1] 6 4 7 2 9 3 3 5 4 7 2 1 4 9 1 6 10 4 3 10
> x[8<=x & x<=9]
[1] 9 9
If you want to find the answer using loops, then the following script will do the job:
> req_nos <- c(8, 9)
> pos <- list()
> for (i in seq_along(req_nos)) {
+   pos[[i]] <- which(x == req_nos[i])
+ }
The output will look like this:
>pos
[[1]]
[1] 1 12 19
[[2]]
[1] 3 4 15
Here, pos[[1]] contains the positions of 8 and pos[[2]] contains the positions of 9. With the %in% method, the output is the same whether you pass c(8,9) or c(9,8), so you cannot tell which position belongs to which value. This method avoids that problem.
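A loop-free way to get the same per-value positions, as a sketch: find all matching positions first, then split them by the value found there. idx and pos_by_value are hypothetical names.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
req_nos <- c(8, 9)

idx <- which(x %in% req_nos)         # positions holding any requested value
pos_by_value <- split(idx, x[idx])   # group those positions by the value found

pos_by_value[["8"]]   # 1 12 19
pos_by_value[["9"]]   # 3 4 15
```

Note that split() orders the list by the sorted values, not by the order of req_nos, and values with no match do not appear in the list at all.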
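grepl may be a useful function. Note that grepl is available in R 2.9.0 and later. What's handy about grepl is that it returns a logical vector of the same length as x. (The output below was evidently produced with a different x than the one in the question, where 8 sits at positions 1, 12 and 19.)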
grepl(8, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
grepl(9, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
To arrive at your answer, you could do the following
grepl(8,x) | grepl(9,x)
Sample data:
df <- data.frame(
time = c(1, 2, 3, 4, 5, 6, 7),
status = c("good", "good", "good", "bad", "good", "good", "good")
)
Output:
time status
1 1 good
2 2 good
3 3 good
4 4 bad
5 5 good
6 6 good
7 7 good
I would like to add a new column statuschange IF status differs from the row above or below. The output would look like this:
time status statuschange
1 1 good NA
2 2 good TRUE
3 3 good FALSE
4 4 bad FALSE
5 5 good FALSE
6 6 good TRUE
7 7 good NA
I have the sense that there are lots of ways to do this, but I haven't been able to figure it out. Any assistance is appreciated!
You can apply diff to see whether two adjacent entries are the same. You need two of these diffs, to check the entries on both sides of an element. The status column is converted with factor() first so that as.numeric() yields integer codes even when the column is stored as character (the default since R 4.0):
> s <- as.numeric(factor(df$status))
> !(c(NA, diff(s)) | c(rev(diff(rev(s))), NA))
[1] NA TRUE FALSE FALSE FALSE TRUE NA
The first expression tells whether the prior element is different:
> c(NA, diff(s))
[1] NA 0 0 -1 1 0 0
The second tells whether the following element is different:
> c(rev(diff(rev(s))), NA)
[1] 0 0 1 -1 0 0 NA
The "or" operation | treats any nonzero value as TRUE, so its result is TRUE wherever the preceding or following element is different. The leading ! then inverts that, giving TRUE where both neighbours are the same.
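The same result can be had without the numeric conversion by comparing the column with shifted copies of itself. This is a base R sketch; prev_status, next_status and same_as_neighbours are names chosen here for illustration, and stringsAsFactors = FALSE is set explicitly so it behaves the same on older R versions.

```r
df <- data.frame(
  time = 1:7,
  status = c("good", "good", "good", "bad", "good", "good", "good"),
  stringsAsFactors = FALSE
)

n <- nrow(df)
prev_status <- c(NA, df$status[-n])   # status of the row above (NA for row 1)
next_status <- c(df$status[-1], NA)   # status of the row below (NA for the last row)

# TRUE when a row matches both neighbours. At the two ends one comparison
# is NA, so the result there is NA unless the single neighbour differs
# (NA & FALSE is FALSE in R).
same_as_neighbours <- df$status == prev_status & df$status == next_status
df$statuschange <- same_as_neighbours
df
```

Because everything is vectorised, this also avoids the explicit loop over rows.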
You can use something like this:
df$A <- NA  # first and last rows have only one neighbour, so they stay NA
for (i in 2:(nrow(df) - 1)) {
  df$A[i] <- df$status[i] == df$status[i - 1] & df$status[i] == df$status[i + 1]
}
df
Another solution using looping:
n <- nrow(df)
statuschange <- rep(NA, n)  # end rows stay NA because they lack a neighbour
for (i in seq_len(n)[-c(1, n)]) {
  statuschange[i] <- df$status[i] == df$status[i - 1] && df$status[i] == df$status[i + 1]
}
df$statuschange <- statuschange
Here's an approach using zoo::rollapply to loop through the column:
> library(zoo)
> c(NA, rollapply(df$status, width=3, FUN=function(x) x[1]==x[2] & x[2]==x[3]), NA)
[1] NA TRUE FALSE FALSE FALSE TRUE NA
This creates a series of "windows", each of width 3, rolling across the data, and applies the function to each one. Because the window width is 3, the result is two elements shorter than the original column (width - 1 shorter). The endpoint NA values are then added with c.
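To see what the rolling window is doing without the zoo dependency, here is a rough base R equivalent. It is a sketch of the idea, not how rollapply is implemented; starts and all_same are names I picked.

```r
status <- c("good", "good", "good", "bad", "good", "good", "good")
width <- 3
n <- length(status)

# One window per valid start position: n - width + 1 windows in total.
starts <- seq_len(n - width + 1)
all_same <- vapply(starts, function(i) {
  window <- status[i:(i + width - 1)]
  length(unique(window)) == 1   # TRUE when every value in the window is equal
}, logical(1))

length(all_same)      # 5, i.e. width - 1 fewer than n
c(NA, all_same, NA)   # pad both ends so the result lines up with the rows
```

Each window is centred on one row, which is why one NA is added at each end for the rows that have no complete window.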
I have the following data frame:
> my.data
A.Seats B.Seats
1 14,15 14,15,16
2 7 7,8
3 12,13 16,17
4 <NA> 10,11
I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
But I don't know how to create this table. As a start, I tried using grep:
grep(my.data$A.Seats,my.data$B.Seats)
But I receive the following output
[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used
...and I can't get past this error. Any ideas as to how I can get the intended result?
Many Thanks
The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:
my.data <- data.frame(
A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
# A.Seats B.Seats
# 1 14,15 14,15,16
# 2 7 7,8
# 3 12,13 16,17
# 4 <NA> 10,11
# 5 14,19 14,15,16
library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1] TRUE TRUE FALSE NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1] TRUE TRUE FALSE NA TRUE
The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
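For the NA-to-FALSE conversion, two base R options are sketched here. found is a stand-in for the logical vector returned by the fixed-pattern call above, typed in by hand so the example runs without stringi.

```r
# Stand-in for the result of the fixed-pattern stri_detect() call
found <- c(TRUE, TRUE, FALSE, NA, FALSE)

# Option 1: overwrite the missing entries in a copy
check <- found
check[is.na(check)] <- FALSE
check            # TRUE TRUE FALSE FALSE FALSE

# Option 2: %in% never returns NA, so NA simply becomes FALSE
found %in% TRUE  # TRUE TRUE FALSE FALSE FALSE
```

Either result can be assigned straight to a Check column of my.data.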
If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:
vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats) # pattern is fixed
# [1] 1 1 0 NA 0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15 7 12|13 <NA> 14|19
# 1 1 0 NA 1
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats)) # coerce to logical
# [1] TRUE TRUE FALSE NA FALSE
Because this calls grepl on each element in the vector, I don't think this will scale well though.
This is an approach to get what you need
> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
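The any() in the call above means a row passes as soon as one seat from A.Seats appears in B.Seats. If every seat has to be present, all() can be swapped in. The sketch below shows both on the five-row sample data from the earlier answer; AnyMatch, AllMatch, a_seats and b_seats are names introduced only for this example.

```r
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a vector of seat numbers
a_seats <- strsplit(my.data$A.Seats, ",")
b_seats <- strsplit(my.data$B.Seats, ",")

# Row by row: is at least one A seat among the B seats? Are all of them?
my.data$AnyMatch <- mapply(function(a, b) any(a %in% b), a_seats, b_seats)
my.data$AllMatch <- mapply(function(a, b) all(a %in% b), a_seats, b_seats)
my.data
```

Row 5 ("14,19" against "14,15,16") is the one where the two columns disagree: one seat matches, but not all of them. The NA row comes out as FALSE in both columns because NA %in% anything without an NA is FALSE.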
Here's an alternative using grep
>transform(my.data,
Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))