I'm trying to check if a specific value is anywhere in a data frame.
I know the %in% operator should allow me to do this, but it doesn't seem to work the way I would expect when applied to a whole data frame:
A = data.frame(B=c(1,2,3,4), C=c(5,6,7,8))
1 %in% A
[1] FALSE
But if I apply this to the specific column the value is in it works the way I expect:
1 %in% A$B
[1] TRUE
What is the proper way of checking if a value is anywhere in a data frame?
You could do:
any(A==1)
#[1] TRUE
OR with Reduce:
Reduce("|", A==1)
OR
length(which(A==1))>0
OR
is.element(1,unlist(A))
To find the location of that value you can do, for example:
which(A == 1, arr.ind=TRUE)
# row col
#[1,] 1 1
Or simply
sum(A == 1) > 0
#[1] TRUE
Loop through the variables with sapply, then use any.
any(sapply(A, function(x) 1 %in% x))
[1] TRUE
or following digEmAll's comment, you could use unlist, which takes a list (data.frame) and returns a vector.
1 %in% unlist(A)
[1] TRUE
The trick to understanding why your first attempt doesn't work comes down to understanding what a data frame is: a list of vectors of equal length. What you want here is not to check whether that list of vectors matches your condition, but whether the values in those vectors match it.
Try:
any(A == 1)
Returns FALSE or TRUE
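To make the "list of vectors" point concrete, here is a small sketch using the A from the question. The second data frame, A_na, is a made-up example (not from the question) added to show one caveat: if the data contain NA and the value is absent, any() returns NA unless na.rm = TRUE is passed.

```r
A <- data.frame(B = c(1, 2, 3, 4), C = c(5, 6, 7, 8))

# A data frame is a list with one element per column
is_list <- is.list(A)            # TRUE
n_cols  <- length(A)             # 2 (columns B and C)

# %in% on the data frame compares 1 against whole columns, not their values
in_frame  <- 1 %in% A            # FALSE, as in the question
# Flattening the columns into one vector compares against the values
in_values <- 1 %in% unlist(A)    # TRUE

# Hypothetical data with missing values
A_na <- data.frame(B = c(1, NA, 3), C = c(5, 6, NA))
res_default <- any(A_na == 9)                  # NA: no TRUE found, but NAs present
res_na_rm   <- any(A_na == 9, na.rm = TRUE)    # FALSE
res_found   <- any(A_na == 1)                  # TRUE: a TRUE wins over NA
```

If the result feeds an if() statement, the na.rm = TRUE form is probably the safer choice, since if(NA) is an error.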
I have a simple question, but I can't figure out a simple solution:
library(data.table)
plouf <- data.table(1:10,letters[1:10])
plouf[V1 %in% c(3,1),V2]
[1] "a" "c"
I would like the output to keep the initial order of the subsetting vector, i.e. "c" "a". What are the possibilities?
I have
sapply(c(3,1),function(x){plouf[V1 == x,V2]})
but I find it ugly.
edit
I have
setkey(plouf,V1)
plouf[c(3,1),V2]
which is surely the right way for data.table.
Still, I am curious about what other solutions exist.
Here is one option with match that can be used in data.table and in base R as well. Unlike %in%, match returns the position index of the first match, and this can be used to get the corresponding elements of the other column, 'V2'.
plouf[, V2[match(c(3, 1), V1)]]
#[1] "c" "a"
plouf[, match(c(3, 1), V1)] # returns numeric index
#[1] 3 1
plouf[, V1 %in% c(3, 1)] # returns logical vector
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Because %in% returns a logical vector, using it to extract elements returns the elements corresponding to each TRUE value in row order, i.e. it extracts from the 1st and 3rd positions instead of the 3rd and 1st.
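Since match works in base R as well, the same idea can be sketched without data.table. The names plouf_df and wanted are made up for this illustration; the data mirror the question.

```r
# Base R analogue of the data.table from the question
plouf_df <- data.frame(V1 = 1:10, V2 = letters[1:10], stringsAsFactors = FALSE)
wanted <- c(3, 1)

# match() returns positions in the order of `wanted`
by_match <- plouf_df$V2[match(wanted, plouf_df$V1)]   # "c" "a"

# %in% returns a logical mask in row order, so the order of `wanted` is lost
by_in <- plouf_df$V2[plouf_df$V1 %in% wanted]         # "a" "c"
```

One thing to keep in mind: match returns NA for values that are not found, so a lookup value missing from V1 would show up as NA in the result rather than being dropped.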
Using data.table keys will accomplish what you're going for here; the "Keys and fast binary search based subset" vignette explains the usage.
library(data.table)
plouf <- data.table(1:10,letters[1:10])
## Set a key
setkey(plouf,V1)
## Use .() syntax for key subsetting to get associated values of V2
plouf[.(c(3,1)),V2]
#[1] "c" "a"
I've found R's ifelse statements to be pretty handy from time to time. For example:
ifelse(TRUE,1,2)
# [1] 1
ifelse(FALSE,1,2)
# [1] 2
But I'm somewhat confused by the following behavior.
ifelse(TRUE,c(1,2),c(3,4))
# [1] 1
ifelse(FALSE,c(1,2),c(3,4))
# [1] 3
Is this a design choice that's above my paygrade?
The documentation for ifelse states:
ifelse returns a value with the same
shape as test which is filled with
elements selected from either yes or
no depending on whether the element
of test is TRUE or FALSE.
Since you are passing test values of length 1, you are getting results of length 1. If you pass longer test vectors, you will get longer results:
> ifelse(c(TRUE, FALSE), c(1, 2), c(3, 4))
[1] 1 4
So ifelse is intended for the specific purpose of testing a vector of booleans and returning a vector of the same length, filled with elements taken from the (vector) yes and no arguments.
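A short sketch of that rule, using a test vector longer than yes and no (the variable names are only for illustration). The result always has the length of test, and yes and no are recycled to that length before elements are picked.

```r
test <- c(TRUE, FALSE, TRUE)

# yes is recycled to 1 2 1, no is recycled to 3 4 3;
# each position takes from yes where test is TRUE, otherwise from no
res <- ifelse(test, c(1, 2), c(3, 4))
res
# [1] 1 4 1
```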
It is a common confusion, because of the function's name, to use this when really you want just a normal if () {} else {} construction instead.
I bet you want a simple if statement instead of ifelse - in R, if isn't just a control-flow structure, it can return a value:
> if(TRUE) c(1,2) else c(3,4)
[1] 1 2
> if(FALSE) c(1,2) else c(3,4)
[1] 3 4
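Because if returns a value, the result can be assigned directly or returned from a function. A minimal sketch, where cond, res and choose_pair are hypothetical names used only for this example:

```r
cond <- TRUE

# The value of the if/else expression is assigned to res
res <- if (cond) c(1, 2) else c(3, 4)

# The same expression as the body of a small helper function;
# isTRUE() guards against NA or non-logical input
choose_pair <- function(cond) {
  if (isTRUE(cond)) c(1, 2) else c(3, 4)
}

res_false <- choose_pair(FALSE)
```

Note that if expects a single TRUE or FALSE as its condition, which is the key difference from the vectorised ifelse.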
Note that you can circumvent the problem if you assign the result inside the ifelse:
ifelse(TRUE, a <- c(1,2), a <- c(3,4))
a
# [1] 1 2
ifelse(FALSE, a <- c(1,2), a <- c(3,4))
a
# [1] 3 4
use `if`, e.g.
> `if`(T,1:3,2:4)
[1] 1 2 3
yeah, I think ifelse() is really designed for when you have a big long vector of tests and want to map each to one of two options. For example, I often do colors for plot() in this way:
plot(x,y, col = ifelse(x>2, 'red', 'blue'))
If you had a big long vector of tests but wanted pairs for outputs, you could use sapply() or plyr's llply() or something, perhaps.
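As a rough sketch of that last suggestion, lapply with a plain if/else per element returns one pair for each test. The names tests, pairs and pair_matrix are made up for the example, and base R's lapply is used here in place of plyr's llply.

```r
tests <- c(TRUE, FALSE, TRUE)

# One pair per test: c(1, 2) where the test is TRUE, c(3, 4) otherwise
pairs <- lapply(tests, function(tst) if (tst) c(1, 2) else c(3, 4))

# Optionally stack the pairs into a matrix with one row per test
pair_matrix <- do.call(rbind, pairs)
```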
Sometimes the user just needs a switch statement instead of an ifelse. In that case:
condition <- TRUE
switch(2-condition, c(1, 2), c(3, 4))
#### [1] 1 2
(which is another syntax option of Ken Williams's answer)
Here is an approach similar to that suggested by Cath, but it can work with existing pre-assigned vectors
It is based around using the get() like so:
a <- c(1,2)
b <- c(3,4)
get(ifelse(TRUE, "a", "b"))
# [1] 1 2
In your case, using if_else from dplyr would have been helpful: if_else is more strict than ifelse, and throws an error for your case:
library(dplyr)
if_else(TRUE,c(1,2),c(3,4))
#> `true` must be length 1 (length of `condition`), not 2
Found on everydropr:
ifelse(rep(TRUE, length(c(1,2))), c(1,2),c(3,4))
#>[1] 1 2
You can replicate the condition to the desired length, so that ifelse returns a result of that length.
I currently have a string in R that looks like this:
df <- c ("BMMBMMBMMMMMBMMBM")
I need to determine how many times MM's appear in this string (in this example it's 4).
I've been using str_count(df, "MM") but this only counts how many times two M's are next to each other in the string (which returns 5).
Any help would be great...
Thanks!
Here's a base R approach without regular expressions:
with(rle(unlist(strsplit(df, ""))), sum(values == "M" & lengths >= 2))
# [1] 4
A possible approach is:
stringr::str_count(df, "MM+")
#output
[1] 4
The + quantifier means "one or more" of the preceding M, so each run of two or more M's is counted once.
in base R:
lengths(gregexpr("MM+", df))
gregexpr returns a list, each element corresponds to one element of df.
lengths returns the length of each list element.
EDIT: as per the comment by @docendo discimus, the second option is a little dangerous since it will return 1 if the string was not found (gregexpr returns -1 when there is no match, which still has length 1).
lengths(gregexpr("xyz+", df))
#output
[1] 1
A safer option is:
lapply(gregexpr("MM+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 4
lapply(gregexpr("xyz+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 0
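The safer version can be wrapped into a small vectorised helper. This is only a sketch: count_runs is a hypothetical name, and the extra test strings are made up to show the no-match case.

```r
# Count non-overlapping matches of `pattern` in each element of x.
# gregexpr() returns -1 for "no match", so only positive positions are counted.
count_runs <- function(x, pattern = "MM+") {
  matches <- gregexpr(pattern, x)
  vapply(matches, function(m) sum(m > 0), integer(1))
}

s <- c("BMMBMMBMMMMMBMMBM", "BMB", "")

counts <- count_runs(s)            # runs of two or more M's: 4 0 0
pairs  <- count_runs(s[1], "MM")   # non-overlapping "MM" pairs: 5

# The runs themselves can be inspected with regmatches()
runs <- regmatches(s[1], gregexpr("MM+", s[1]))[[1]]
# "MM" "MM" "MMMMM" "MM"
```

The difference between 4 and 5 is the run of five M's, which "MM" matches twice but "MM+" matches once.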
Base solution:
s <- "BMMBMMBMMMMMBMMBM"
lengths(gregexpr("MM+", s))
## [1] 4
Note that the input called df in the question is a character string, not a data frame, and c("X") is identical to "X" so the c and the parentheses are not needed.
Try the following pattern:
str_count(df,"(M)\\1+")
This will count two or more M as one case.
Or
str_count(df,"M{2,}")
In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy sapply/which/%in% combinations didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
More fundamentally, there are only 48 elements in the list I was using.
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25
match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works via coercing lists to character, so fringe cases will fail (with a hat-tip to @nicola):
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE
You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this kind of comparison. When comparing integers and numerics, you will need to convert one side so that both have the same type for this to work. For example, 1:3 (integer) is not the same type as c(1, 2, 3) (double).
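The placement of fromLast = TRUE matters here, so a small sketch may help. It has to be an argument of duplicated(); if it ends up inside c(), it simply becomes one more list element, and the appended copy is the one flagged. The names L and y are chosen for this example (L avoids masking the base function list).

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
y <- c(1, 3, 2)

# fromLast is passed to duplicated(): the original element is flagged
correct <- which(duplicated(c(L, list(y)), fromLast = TRUE))     # 2

# fromLast is swallowed by c(): the list now has 5 elements, duplicated()
# scans from the front, and the appended copy at position 4 is flagged
misplaced <- which(duplicated(c(L, list(y), fromLast = TRUE)))   # 4
```

With the misplaced form the result is always length(L) + 1 whenever y is present, regardless of where the real match sits.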
> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1
Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index within the list.
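One caveat with all(z == x) is that == recycles, so vectors of different lengths can compare as equal. A possible middle ground between identical() and all(==), offered as a sketch rather than as the method from the answers above, is isTRUE(all.equal(...)), which treats integer and double alike but still checks lengths. The names L, x_int, strict, tolerant, recycled and safe are made up for the example.

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
x_int <- 1:3   # integer, while the list elements are double

# identical() is strict about type, so nothing matches
strict <- which(vapply(L, identical, logical(1), x_int))           # integer(0)

# all.equal() compares numeric values, ignoring integer vs double
tolerant <- which(vapply(L, function(el) isTRUE(all.equal(el, x_int)),
                         logical(1)))                              # 1

# Recycling pitfall of all(==): a length-6 vector "equals" a length-3 one
recycled <- all(c(1, 2, 3, 1, 2, 3) == x_int)                      # TRUE
safe     <- isTRUE(all.equal(c(1, 2, 3, 1, 2, 3), x_int))          # FALSE
```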