How to check if a character vector contains a string - r

I'm very new to R (I only got RStudio last week), so this might be a dumb question, but I think I'm getting contradictory answers about whether my string "rs2418691" is in my vector rsIDcolumn. When I use the %in% operator it says no, but which() does give me a position for it in the vector:
> "rs2418691" %in% rsIDcolumn
[1] FALSE
> which(rsIDcolumn == "rs2418691")
[1] 137853
Does anyone know what's going on please? Thank you!

I think you are referring to a data frame column. If you have a data frame called df with a column named rsIDcolumn, you can check whether a string is in it with:
"rs2418691" %in% df$rsIDcolumn

Just summing up what is in the comment from @Adamm:
x <- data.frame(a=c("b", "c"))
"c" %in% x
#[1] FALSE
which(x == "c")
#[1] 2
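To make the difference concrete, here is a minimal sketch. The data frame df and its values are made up for illustration; only the column name rsIDcolumn comes from the question. It assumes the object being tested is a data frame rather than a plain character vector.

```r
# Hypothetical data frame standing in for the asker's data
df <- data.frame(rsIDcolumn = c("rs123", "rs2418691"),
                 stringsAsFactors = FALSE)

# %in% treats the data frame as a list of whole columns, so the
# string is compared against each column as a unit, not its elements
in_df <- "rs2418691" %in% df

# Extracting the column first gives %in% a character vector to search
in_col <- "rs2418691" %in% df$rsIDcolumn

# == compares cell by cell, which is why which() finds a position
in_any <- any(df == "rs2418691")

print(c(in_df = in_df, in_col = in_col, in_any = in_any))
```

If the object really is a plain character vector, %in% and which() should agree, so checking class(rsIDcolumn) is a reasonable first step.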

Related

How to string count unique values in data strings

I am trying to find common words that contain 5 unique vowels (i.e. each of "aeiuo" in a single word, not counting repetitions).
I tried this:
library(tidyverse)
x<-c("appropriate","associate","available","colleague","experience","encourage","encouragi","associetu")
x[str_count(x,"[aeiuo]")>4]
Note that the made-up words "encouragi" and "associetu" are included to verify my intended answer.
The results I am getting are the following:
[1] "appropriate" "associate"
[3] "available" "colleague"
[5] "experience" "encourage"
[7] "encouragi" "associetu"
While I wanted to get only:
"encouragi" "associetu", which fulfill the criterion of having 5 distinct vowels (i.e. "aeiuo").
Is there a function that would serve as a string_count_unique? If yes, which one? If not, what other function would you recommend so that I can meet this criterion?
Thank you in advance for your help!
One option could be:
x[lengths(lapply(str_extract_all(x, "a|e|i|u|o"), unique)) == 5]
[1] "encouragi" "associetu"
Maybe strsplit could help you
> x[sapply(strsplit(x,""),function(v) sum(unique(v)%in%c("a","e","i","o","u"))>4) ]
[1] "encouragi" "associetu"
Here's a way to do it using strsplit and setdiff. We loop over each string using sapply, we split each string into its letters, then we check if all vowels are present in the vector resulting from strsplit. If the length of the setdiff is greater than 0, one or more vowels are not present in the string.
keep <- sapply(x, FUN = function(x){
  length(setdiff(c("a", "e", "i", "o", "u"), el(strsplit(x, "")))) == 0
})
x[keep]
# [1] "encouragi" "associetu"
The problem with your code is that it checks whether the total number of matches of ANY of a, e, i, o, u is >4, so a repeated vowel counts several times. What you want is to check that the count of a is >0 AND that the count of e is >0 and so on. So you could check the following:
x[str_count(x,"[a]")>0 & str_count(x,"[e]")>0 & str_count(x,"[i]")>0 & str_count(x,"[o]")>0 & str_count(x,"[u]")>0]
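As far as I know stringr has no built-in "count unique" function, so one possible sketch is a small base R helper. The name count_unique_vowels and its vowels argument are hypothetical, and the helper assumes lower-case input (wrap the words in tolower() otherwise).

```r
x <- c("appropriate", "associate", "available", "colleague",
       "experience", "encourage", "encouragi", "associetu")

# Hypothetical helper: number of distinct vowels present in each word
count_unique_vowels <- function(words, vowels = c("a", "e", "i", "o", "u")) {
  vapply(strsplit(words, ""),
         function(chars) sum(vowels %in% chars),  # one hit per vowel at most
         integer(1))
}

# Keep only the words that contain all five vowels
result <- x[count_unique_vowels(x) == 5]
print(result)
```

Testing vowels %in% chars rather than chars %in% vowels is what avoids counting repeats: each vowel contributes at most one TRUE.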

Is the r function `match()` directionally dependent?

I am using the binary operator %in% to subset a data frame (I got the idea from another Stack Overflow thread), but when I double-check the result by switching the arguments, I get different answers. I've read the R documentation on the match() function, and it seems like neither match() nor %in% should be directionally dependent. I really need to understand exactly what is happening to be confident in my results. Could anybody provide some insight?
> filtered_ordGeneNames_proteinIDs <- ordGeneNames_ProteinIDs[ordGeneNames_ProteinIDs$V4 %in% ordDEGs$X, ];
> filtered2_ordGeneNames_proteinIDs <- ordDEGs[ordDEGs$X %in% ordGeneNames_ProteinIDs$V4, ];
> nrow(filtered_ordGeneNames_proteinIDs)
[1] 5767
> nrow(filtered2_ordGeneNames_proteinIDs)
[1] 5746
Of course you get different results:
ordGeneNames_ProteinIDs$V4 %in% ordDEGs$X
tells you which elements of ordGeneNames_ProteinIDs$V4 are also in ordDEGs$X, whereas
ordDEGs$X %in% ordGeneNames_ProteinIDs$V4
tells you which elements of ordDEGs$X are also in ordGeneNames_ProteinIDs$V4.
Compare
c(1,2,3,4) %in% c(1,2,1, 2)
[1] TRUE TRUE FALSE FALSE
to
c(1,2,1, 2) %in% c(1,2,3,4)
[1] TRUE TRUE TRUE TRUE
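A small sketch of why the two row counts need not agree: x %in% y returns one logical value per element of x, so duplicated values in x are each counted. The vectors a and b below are made-up gene names, not the asker's data.

```r
# Made-up identifiers; "g2" appears twice in a
a <- c("g1", "g2", "g2", "g3")
b <- c("g2", "g3", "g4")

a_in_b <- a %in% b   # length 4, one value per element of a
b_in_a <- b %in% a   # length 3, one value per element of b

n_a <- sum(a_in_b)   # rows kept when subsetting by a %in% b
n_b <- sum(b_in_a)   # rows kept when subsetting by b %in% a

# The set of shared values is the same whichever way you look at it
shared <- intersect(a, b)

print(c(n_a = n_a, n_b = n_b))
print(shared)
```

Comparing length(unique(...)) of the matched values on both sides is one way to confirm that the difference in row counts comes from duplicates.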

matching first word from a string

I have the following R code.
Test<-"CLC2" %in% "CLC2,CLC2,CLC2"
Test
Test1<-"CLC2" %in% "CLC2"
Test1
In the first case, I also want the result to be TRUE, because the first word of the string matches (this is what I need in my case).
You can find a word in a string and (if necessary) check whether it is the first word of the string:
gregexpr(pattern = "CLC2","CLC2,CLC2,CLC2")[[1]][1] == 1
Try
"CLC2" %in% c("CLC2", "CLC2", "CLC2")
# [1] TRUE
or
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]]
# [1] TRUE
The second one splits your string at every , character.
Edit
If you just want to look at the first value, then it should be
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]][1]
"CLC2" %in% c("CLC2", "CLC2", "CLC2")[1]
as pointed out by @PierreLafortune. In that case, you don't need %in% but could also use ==, as you are just comparing one value to another value.
You can also try
grepl('\\<CLC2\\>', unlist(strsplit("CLC2,CLC2,CLC2", ","))[1])
#[1] TRUE
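The first-word check can also be wrapped in a small function. This is only a sketch: the name first_token_is and its sep argument are hypothetical, and it assumes the words are separated by a literal comma with no surrounding spaces.

```r
# Hypothetical helper: TRUE when the first sep-delimited token of
# each element of text equals word exactly
first_token_is <- function(word, text, sep = ",") {
  tokens <- strsplit(text, sep, fixed = TRUE)
  first <- vapply(tokens, function(tk) tk[1], character(1))
  first == word
}

r1 <- first_token_is("CLC2", "CLC2,CLC2,CLC2")  # first word matches
r2 <- first_token_is("CLC2", "ABC,CLC2")        # CLC2 is only the second word
r3 <- first_token_is("CLC2", "CLC20,CLC2")      # a prefix is not an exact match
print(c(r1, r2, r3))
```

Comparing whole tokens with == avoids the partial matches that a plain substring search would accept.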

How to detect that a vector is subset of specific vector?

I have two vectors (sets) like this:
first<-c(1,2,3,4,5)
second<-c(2,4,5)
How can I detect whether second is a subset of first? Is there a function for this?
Here's one way
> all(second %in% first)
[1] TRUE
Here's another
setequal(intersect(first, second), second)
## [1] TRUE
Or
all(is.element(second, first))
## [1] TRUE
If the order of the array elements matters, string conversion could help:
ord_match <- function(x, y){
  m <- c(0, grep(paste0(x, collapse = ""),
                 paste0(y, collapse = ""), fixed = TRUE))
  return(as.logical(m)[length(m)])
}
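One caveat with pasting numbers together is that different vectors can collapse to the same string (for example c(1, 23) and c(12, 3) both become "123"). A possible alternative, sketched below, compares elements directly. This is a different technique from the string conversion above, the name is_contiguous_run is hypothetical, and it assumes neither vector contains NA.

```r
# Hypothetical helper: TRUE if x occurs as a contiguous run inside y
is_contiguous_run <- function(x, y) {
  n <- length(x)
  if (n == 0) return(TRUE)            # the empty vector is trivially contained
  if (n > length(y)) return(FALSE)
  starts <- seq_len(length(y) - n + 1)
  # Slide a window of length n over y and compare element by element
  any(vapply(starts,
             function(s) all(y[s:(s + n - 1)] == x),
             logical(1)))
}

first <- c(1, 2, 3, 4, 5)
second <- c(2, 4, 5)

run_a <- is_contiguous_run(c(4, 5), first)         # 4, 5 are adjacent in first
run_b <- is_contiguous_run(second, first)          # 2 and 4 are not adjacent
run_c <- is_contiguous_run(c(1, 23), c(12, 3))     # no false positive from "123"
print(c(run_a, run_b, run_c))
```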

Having two vectors is there a nicer way to test if one is included in another?

Given two vectors:
x = c('a','b')
lookup = c('a','c','d','e','f')
Test whether each element of x is present in lookup. One way of doing it:
all(!is.na(match(x, lookup)))
I find this solution a bit verbose for R and wonder if there is better/shorter version.
%in% does this:
all(x %in% lookup)
## [1] FALSE
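Can also use setdiff, which returns the elements of x that are missing from lookup (so TRUE below means that something is missing). See the associated help page for other set operations.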
setdiff(x,lookup)
[1] "b"
> as.logical(length( setdiff(x,lookup) ) )
[1] TRUE
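To line the setdiff result up with all(x %in% lookup), the length test can be flipped. A minimal sketch using the vectors from the question (the variable names all_present, missing_items and none_missing are only illustrative):

```r
x <- c("a", "b")
lookup <- c("a", "c", "d", "e", "f")

# TRUE only if every element of x is found in lookup
all_present <- all(x %in% lookup)

# Elements of x that are absent from lookup
missing_items <- setdiff(x, lookup)

# Same answer as all_present, expressed through setdiff
none_missing <- length(missing_items) == 0

print(all_present)
print(missing_items)
```

The setdiff form is a little longer, but it also tells you which elements failed the test.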
