matching first word from a string - r

I have the following R code.
Test<-"CLC2" %in% "CLC2,CLC2,CLC2"
Test
Test1<-"CLC2" %in% "CLC2"
Test1
In the first case, I also want the condition to be TRUE, because the string matches on its first word (which is what I need).

You can find a word in a string and, if necessary, check whether it starts at the first position of the string:
gregexpr(pattern = "CLC2","CLC2,CLC2,CLC2")[[1]][1] == 1

Try
"CLC2" %in% c("CLC2", "CLC2", "CLC2")
# [1] TRUE
or
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]]
# [1] TRUE
The second one splits your string at every comma.
Edit
If you just want to look at the first value, then it should be
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]][1]
"CLC2" %in% c("CLC2", "CLC2", "CLC2")[1]
as pointed out by @PierreLafortune. In that case you don't need %in%; you could also use ==, as you are just comparing one value to another value.

You can also try
grepl('\\<CLC2\\>', unlist(strsplit("CLC2,CLC2,CLC2", ","))[1])
#[1] TRUE
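The answers above can be combined into a small helper. This is only a sketch: first_word_is is a hypothetical name, and it assumes the words are separated by a single literal separator (a comma by default). Unlike the position check with gregexpr, it compares the whole first field, so "CLC22,..." does not count as a match for "CLC2".

```r
# Hypothetical helper (not from the answers above): TRUE when `word`
# equals the first `sep`-delimited field of each string in `s`.
first_word_is <- function(word, s, sep = ",") {
  # Split on the literal separator and keep the first field of each string
  first <- vapply(strsplit(s, sep, fixed = TRUE), `[`, character(1), 1)
  first == word
}

first_word_is("CLC2", "CLC2,CLC2,CLC2")        # TRUE
first_word_is("CLC2", "CLC22,CLC2")            # FALSE
first_word_is("CLC2", c("CLC2", "ABC,CLC2"))   # TRUE FALSE
```

Because strsplit is vectorised, the same call works on a whole column of strings.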

Related

Substring match when filtering rows

I have strings in file1 that match part of the strings in file2. I want to filter out the strings from file2 that partly match those in file1. Please see my attempt below; I'm not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]
You cannot use the %in% operator in this way in R. It tests whether an element of one vector is equal to an element of another vector; unlike Python's in operator, it does not match substrings. Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here: you cannot pass a vector as the pattern argument of grepl, because only its first element will be used:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
  rowwise() |> # This makes sure you only pass one element at a time to `grepl`
  mutate(
    in_v2 = any(grepl(V1, file2$V1))
  ) |>
  filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
#   V1           in_v2
#   <chr>        <lgl>
# 1 species14341 FALSE
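For readers who prefer not to load dplyr, the same idea can be sketched in base R. The data frames are rebuilt here from the question so the snippet runs on its own, and in_file2 is an illustrative name. fixed = TRUE is an added assumption: it treats each file1 entry as literal text rather than as a regular expression.

```r
# Data rebuilt from the question
file1 <- data.frame(V1 = c("species1", "species121", "species14341"))
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"))

# For each file1 entry, is it a literal substring of any file2 entry?
in_file2 <- vapply(
  file1$V1,
  function(p) any(grepl(p, file2$V1, fixed = TRUE)),
  logical(1)
)

# Keep the file1 rows with no match in file2
file1[!in_file2, , drop = FALSE]
```

vapply passes one pattern at a time to grepl, which is the job rowwise() does in the dplyr version.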
One way to get what you want is the rm_between function from the qdapRegex package, combined with %in%. You can run the following code:
# Load library
library(qdapRegex)
# Extract the names in file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = TRUE))
# Which of these elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are (assumes at least one match was found)
file2$V1[-elem.are]
In v we save the names from file2$V1 we are interested in (those between | |).
Then we save in elem.are the positions of those names that appear in file1$V1.
Finally, we omit those elements using file2$V1[-elem.are]. Note that if elem.are is empty, this last step returns an empty vector rather than all of file2$V1.
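The same filtering can be sketched without an extra package. This version assumes the species name is always the second |-separated field, which holds for the sample data but is an assumption about the real file. It indexes with a logical vector instead of negative positions, so it also behaves sensibly when nothing matches.

```r
# Data rebuilt from the question
file1 <- data.frame(V1 = c("species1", "species121", "species14341"))
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"))

# Take the second "|"-separated field of each file2 entry
species <- vapply(strsplit(file2$V1, "|", fixed = TRUE), `[`, character(1), 2)

# Keep only the file2 entries whose species is not listed in file1
kept <- file2$V1[!(species %in% file1$V1)]
kept
```

Here the comparison is on whole species names, so species1 in file1 does not remove species1442 from file2, as a plain substring match would.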

How to check if a character vector contains a string

I'm very new to R and only got RStudio last week, so this might be a dumb question, but I think I'm getting contradictory answers about whether my string "rs2418691" is in my vector rsIDcolumn. When I use the %in% operator it says no, but the which function does give me a position for it in the vector:
> "rs2418691" %in% rsIDcolumn
[1] FALSE
> which(rsIDcolumn == "rs2418691")
[1] 137853
Does anyone know what's going on please? Thank you!
I think you are referring to a data frame column. If you have a data frame called df with a column named rsIDcolumn, you can check whether a string is in that column with:
"rs2418691" %in% df$rsIDcolumn
To sum up the comment from @Adamm:
x <- data.frame(a=c("b", "c"))
"c" %in% x
#[1] FALSE
which(x == "c")
#[1] 2
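A short sketch of why the two calls disagree, assuming rsIDcolumn is a one-column data frame rather than a plain vector (the question does not show its structure). %in% treats a data frame as a list of columns and compares against each column as a whole, while == compares against every cell.

```r
x <- data.frame(a = c("b", "c"))

whole_df  <- "c" %in% x          # FALSE: compared against the column as one unit
by_dollar <- "c" %in% x$a        # TRUE: compared against the values in the column
by_name   <- "c" %in% x[["a"]]   # TRUE: same column, extracted by name
any_cell  <- any(x == "c")       # TRUE: == checks every cell of the data frame
```

Running str() or class() on the object is a quick way to find out whether it is a vector or a data frame.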

Why does %in% return false when matching string?

Can someone explain why %in% returns false in this case? The string <sentiment> exists in the larger string.
> x<-"hahahaha <sentiment>too much</sentiment> <feature>doge</feature>."
> "<sentiment>" %in% x
[1] FALSE
%in% checks whether the former element matches any of the elements in the latter. In this case x only has the element "hahahaha <sentiment>too much</sentiment> <feature>doge</feature>.", not "<sentiment>", so "<sentiment>" %in% x returns FALSE. For example, the following returns TRUE:
y = c(x, "<sentiment>")
# > y
# [1] "hahahaha <sentiment>too much</sentiment> <feature>doge</feature>."
# [2] "<sentiment>"
"<sentiment>" %in% y
# [1] TRUE
If you want to check whether "<sentiment>" is a substring of x, use grepl:
grepl("<sentiment>", x, fixed = TRUE)
# [1] TRUE
or use str_detect from stringr:
stringr::str_detect(x, stringr::fixed("<sentiment>"))
# [1] TRUE
%in% is the match operator, defined in terms of the match function. It searches for an object among the elements of a vector (or similar), not for a substring within a string.
To search within a string, use one of the pattern matching functions, such as grep or similar.
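The difference between the two kinds of test can be sketched side by side. The variable names are illustrative only. The match() line spells out how base R defines %in%, and the last two lines show why fixed = TRUE matters when the search text contains regex metacharacters.

```r
x <- "hahahaha <sentiment>too much</sentiment> <feature>doge</feature>."

# %in% asks: is the left value equal to some element on the right?
in_result    <- "<sentiment>" %in% x                      # FALSE
match_result <- match("<sentiment>", x, nomatch = 0) > 0  # FALSE, the same test spelled out

# grepl() asks: does the pattern occur inside each string?
literal_hit <- grepl("<sentiment>", x, fixed = TRUE)      # TRUE

# Without fixed = TRUE the pattern is a regular expression,
# so "." matches any single character
regex_dot   <- grepl("a.c", "abc")                # TRUE
literal_dot <- grepl("a.c", "abc", fixed = TRUE)  # FALSE
```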

check if a character within a string is uppercase

I have a string
x <- "lowerUpper"
and want to determine if and which character within this string is an uppercase letter.
I can use toupper(x) == x, which tells me if all characters are uppercase, but how do I check if only some (and which) are?
One option is gregexpr to find the position where the character is uppercase
unlist(gregexpr("[A-Z]", x))
#[1] 6
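You can also use the POSIX character class [[:upper:]] to check for uppercase. (Note that "\\U" is not an uppercase class in R's default regex engine; it matches a literal U, which happens to sit at position 6 in this string.)
unlist(gregexpr("[[:upper:]]", "lowerUpper"))
#[1] 6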
> x <- "lowerUpper"
> sapply(strsplit(x, ''), function(a) which(a %in% LETTERS)[1])
[1] 6
or
> library(stringi)
> stri_locate_first_regex(x, "[A-Z]")
Another option is to check each letter:
which(toupper(strsplit(x,split = "")[[1]])==strsplit(x,split = "")[[1]])
#[1] 6
Perhaps a cleaner code version using %in%
unlist(strsplit("lowerUpper",'')) %in% LETTERS
An advantage here is that it returns a logical vector flagging each letter position in the string. It works for multiple uppercase letters too, as does gregexpr, whereas the options above that take only the first match (which(...)[1] and stri_locate_first_regex) report a single position. Lastly, using LETTERS makes for more readable code to my mind.
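A short sketch with more than one uppercase letter shows how the approaches compare. The string "lowerUpperCase" and the variable names are made up for illustration.

```r
x <- "lowerUpperCase"
chars <- strsplit(x, "")[[1]]

# Logical flag per character, then the positions of the uppercase ones
is_upper  <- chars %in% LETTERS
positions <- which(is_upper)                              # 6 11

# gregexpr() also reports every match, not just the first
regex_positions <- unlist(gregexpr("[[:upper:]]", x))     # 6 11

# With no uppercase letter, which() gives integer(0) but gregexpr() gives -1
none_which <- which(strsplit("lower", "")[[1]] %in% LETTERS)
none_regex <- unlist(gregexpr("[[:upper:]]", "lower"))    # -1
```

The different "no match" values are worth checking for before using either result as an index.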

Remove vector element with %in% returns character(0)

Got a quick question. I'm trying to remove a vector element with the code below. But I get character(0) returned instead of the rest of the vector elements.
What have I done wrong?
> str(ticker.names)
chr [1:10] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ASSA-B.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST"
> ticker.names[! 'AAK.ST' %in% ticker.names]
character(0)
To remove the elements of ticker.names that are equal to 'AAK.ST', put the vector on the left-hand side of %in%:
ticker.names[!ticker.names %in% 'AAK.ST']
Or use setdiff
setdiff(ticker.names, 'AAK.ST')
Consider the approach OP is using,
'AAK.ST' %in% ticker.names
#[1] TRUE
ticker.names['AAK.ST' %in% ticker.names]
#[1] "AAK.ST" "ABB.ST" "ALFA.ST"
By negating,
!'AAK.ST' %in% ticker.names
#[1] FALSE
ticker.names[!'AAK.ST' %in% ticker.names]
#character(0)
In the former case, the TRUE is recycled to the length of the 'ticker.names', so all the elements of the vector are returned, while in the latter, the FALSE gets recycled and no elements are returned.
data
ticker.names <- c('AAK.ST', 'ABB.ST', 'ALFA.ST')
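The recycling explanation can be checked directly by looking at the length of each logical index. This is a small sketch using the three-element sample data; the names wrong_index and right_index are illustrative.

```r
ticker.names <- c("AAK.ST", "ABB.ST", "ALFA.ST")

# Value on the left: a single TRUE/FALSE, recycled over the whole vector
wrong_index <- !("AAK.ST" %in% ticker.names)   # FALSE (length 1)
ticker.names[wrong_index]                      # character(0)

# Vector on the left: one TRUE/FALSE per element
right_index <- !(ticker.names %in% "AAK.ST")   # FALSE TRUE TRUE
ticker.names[right_index]                      # "ABB.ST" "ALFA.ST"
```

The result of %in% always has the length of its left-hand side, which is why the order of the operands matters when the result is used as an index.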
