Keep strings with partial match - r

I would like to keep only strings in my vector which partially match to strings in another vector.
Take a look on this example:
> dput(cc)
c("BLANK_0", "Greg_10", "Luke_40", "Luke_10", "Mark_10", "NA_40", "BLANK_10", "Joe_15", "Jane_10", "BLANK_40", "Greg_40", "Hvserk_40", "NA_10")
and I would like to keep strings starting like elements from a vector below:
> dput(vec_all_compounds)
c("Greg", "Luke", "Mark", "Joe", "Jane", "Hvserk")
It means all of: Greg_10, Luke_10, Hvserk_40, etc should be kept and remain unchanged. Doable ?

I would suggest next approach indexing your vectors with grepl():
#Code
cc[grepl(pattern = paste0(vec_all_compounds,collapse = '|'),cc)]
Output:
[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10" "Greg_40" "Hvserk_40"

You can also use grep with value = TRUE :
grep(paste0(vec_all_compounds, collapse = "|"), cc, value = TRUE)
#[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10" "Greg_40" "Hvserk_40"
Same with stringr::str_subset :
stringr::str_subset(cc, paste0(vec_all_compounds, collapse = "|"))

You can use gsub + %in%
> cc[gsub("_.*","",cc) %in% vec_all_compounds]
[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10"
[7] "Greg_40" "Hvserk_40"

Related

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name using elements from another vector with R. But I don't know how to escape sequence when using grep() in R.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or if you are getting names as it is and can't change it, you can split on whitespace and insert the .* between the words like
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed in the below fashion
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms, and generating individual regex patterns for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
We can do the grepl on splitted string and Reduce the list of logical vectors to a single logicalvector` and extract the matching element in 'string'
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words. If you would just like to match 'Cry' and 'River' in any order and don't care about case.
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]

String between first two (.dots)

Hi have data which contains two or more dots. My requirement is to get string from first to second dot.
E.g string <- "abcd.vdgd.dhdsg"
Result expected =vdgd
I have used
pt <-strapply(string, "\\.(.*)\\.", simplify = TRUE)
which is giving correct data but for string having more than two dots its not working as expected.
e.g string <- "abcd.vdgd.dhdsg.jsgs"
its giving dhdsg.jsgs but expected is vdgd
Could anyone help me.
Thanks & Regards,
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^\\.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks #TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can use read.table as well if you wish.Here providing the string as given in your problem and selecting the separator as dot("."), Once the column is converted into a data.frame, you may choose to select whatever column you want to pick(In this case it is column number 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector:
You can try tstrsplit from data.table, which is vectorized over string:
> string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
> tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
> sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"`

Extract expressions with square brackets

The example I have is as follow:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
I want extract the terms in names which contains one of the patterns in toMatch.
This is what I tried
grep(toMatch, names, value=T)
But, it didn't work for me. Any suggestions?
The problem is that [ character used in toMatch is an reserved character with special meaning in regex/pattern. Hence, we need to first replace [ character with \\[.
Now, collapse toMatch with | and then use it as pattern in grepl function to search matching character in names.
The solution results are:
#Just for indexes
grepl(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names)
#[1] TRUE FALSE TRUE
#For values
grep(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names, value = TRUE)
#[1] "apple[1]" "apple[3]"
Data:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
We could also remove the letter part and create a logical vector with %in%
names[sub("^[^[]*", "", names) %in% toMatch]
#[1] "apple[1]" "apple[3]"

Using Regex to edit a column in R [duplicate]

I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In python, the solution would be food.split('-')[0], but I can't find an equivalent for R.
If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explictly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both bits of code will cope well with selecting whichever value in the split list, and will deal with cases that are outside the range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently
unlist(strsplit(word, "-"))[1].
Essentially the idea is that split gives a list as a result, whose elements have to be accessed either by slicing (the former case) or by unlisting (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
R: sapply(words, first.word)
apple-orange-strawberry orange-juice
"apple" "orange"
I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"
Using str_remove() to delete everything after the pattern:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
words short
1 apple-orange-strawberry apple
2 chocolate chocolate
stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"

Removing a group of words from a character vector

Let's say that I have a character vector of random names. I also have another character vector with a number of car makes and I want to remove any occurrence of a car incident in the original vector.
So given the vectors:
dat = c("Tonyhonda","DaveFord","Alextoyota")
car = c("Honda","Ford","Toyota","honda","ford","toyota")
I want to end up with something like below:
dat = c("Tony","Dave","Alex")
How can I remove part of a string in R?
gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "")
[1] "Tony" "Dave" "Alex"
Just formalizing 42-'s comment above. Rather than using
car = c("Honda","Ford","Toyota","honda","ford","toyota")
You can just use:
carlist = c("Honda","Ford","Toyota")
gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "", ignore.case = TRUE)
[1] "Tony" "Dave" "Alex"
That allows you to only put each word you want to exclude in the list one time.

Resources