Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. For example, I have example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and matching_vector <- c('Turtle','Dog'). I want it to return 'Dog', because that is the first element of matching_vector that appears in the example string: LionabcdBear1231DogextKittyisananimalTurtleisslow
I already tried pmatch(example_string, matching_vector), but it doesn't work, since pmatch doesn't search for substrings.
Thanks!
Tim

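Does the following solution work for you? regexpr returns -1 when a pattern is not found, so non-matches are dropped before taking the minimum.
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector <- c('Turtle','Dog')
# start position of the first occurrence of each pattern (-1 if absent)
match_ids <- sapply(matching_vector, function(x) regexpr(x, example_string))
match_ids <- match_ids[match_ids > 0]
result <- names(match_ids)[which.min(match_ids)]
> result
[1] "Dog"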
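The same idea can be wrapped in a small helper. This is only a sketch: the name first_match is made up for illustration, and it assumes the candidates should be treated as literal text (fixed = TRUE) rather than as regular expressions.

```r
# Hypothetical helper: return the candidate that occurs earliest in `text`,
# or NA if none of the candidates occurs at all.
first_match <- function(text, candidates) {
  # regexpr() gives the start of the first occurrence, or -1 if absent
  pos <- vapply(candidates,
                function(p) as.integer(regexpr(p, text, fixed = TRUE)),
                integer(1))
  hits <- pos[pos > 0]          # keep only candidates that were found
  if (length(hits) == 0) return(NA_character_)
  names(hits)[which.min(hits)]  # name of the smallest start position
}

example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
first_match(example_string, c('Turtle', 'Dog'))
first_match(example_string, c('Zebra'))
```

The first call should give "Dog" and the second NA, because 'Zebra' never occurs in the string.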

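We can use stri_match_first from stringi. Joining the candidates with | builds a single alternation pattern, and the regex engine returns the leftmost match in the string.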
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))
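For comparison, the alternation approach can also be sketched in base R with regexpr and regmatches. This assumes the candidates contain no regex metacharacters; if they did, they would need escaping before being pasted together.

```r
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector <- c('Turtle', 'Dog')

# Build one pattern that matches any candidate: "Turtle|Dog"
pattern <- paste(matching_vector, collapse = "|")

# regexpr() locates the leftmost match; regmatches() extracts its text
first_hit <- regmatches(example_string, regexpr(pattern, example_string))
first_hit
```

If none of the candidates occurs, regmatches returns character(0), so the caller may want to check the length of the result.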

Related

Remove part of string after 3-digit number

I would like to substitute the strings in the list by cutting each string after the first 3-digit number.
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
I would like for it to look something like
a <- c("MTH314","LB471","PHY472")
I have tried something like
b <- gsub("[100-999].*","",a)
but it returns c("MTH","LB","PHY") without the first number
A possible solution, based on stringr::str_remove:
library(stringr)
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
str_remove(a, "(?<=\\d{3}).*")
#> [1] "MTH314" "LB471" "PHY472"
Another option, with str_extract (the %>% pipe is re-exported by stringr):
library(stringr)
c("MTH314PHY410","LB471LB472","PHY472CHM141") %>%
  str_extract('.+?\\d{3}')
#> [1] "MTH314" "LB471" "PHY472"
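To see why the [100-999] attempt in the question strips the number as well, here is a small base R sketch. The variable names attempt and trimmed are only for illustration.

```r
a <- c("MTH314PHY410", "LB471LB472", "PHY472CHM141")

# [100-999] is a character class, not a numeric range: it contains "1",
# "0", the range "0-9" and "9", so it matches any single digit. The match
# therefore starts at the first digit and ".*" removes the rest.
attempt <- gsub("[100-999].*", "", a)

# Capture the first run of three digits and put it back, so that only
# the text after it is dropped.
trimmed <- sub("([0-9]{3}).*", "\\1", a)

attempt
trimmed
```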

How to get any string we want?

The string is as shown below:
s <- "12N10-3A 12N10-3A-1 12N10-3A-2 YB10L-A2"
I can get every string except the second one.
gsub("\\s.*","",s) #12N10-3A
gsub(".*\\s","",s) #YB10L-A2
gsub(".*\\s.*\\s(.*).*\\s(.*)","\\1",s) #12N10-3A-2
How can I get the second string from s, and is there a shorter approach for each line of code? I tried what I learnt on regex101.com.
We can use stri_extract_last from stringi
library(stringi)
stri_extract_last(s, regex = '\\S+')
#[1] "YB10L-A2"
Or use word from stringr
library(stringr)
word(s, 4)
#[1] "YB10L-A2"
Just use strsplit:
items <- strsplit(s, "\\s+")[[1]]
If you want to access the last item, then just use:
items[4]
[1] "YB10L-A2"
If you really wanted to isolate the last term using sub, then here is one way:
sub(".*\\s+", "", s)
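The question also asks for the second string, which the split-based approach handles by index. A minimal sketch, assuming the items are always separated by whitespace:

```r
s <- "12N10-3A 12N10-3A-1 12N10-3A-2 YB10L-A2"

# Split on runs of whitespace; strsplit() returns a list, so take element 1
items <- strsplit(s, "\\s+")[[1]]

second <- items[2]              # any position can be picked by index
last   <- items[length(items)]  # last item without hard-coding its position

second
last
```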

String between first two (.dots)

I have data which contains two or more dots. I need the string between the first and the second dot.
E.g. string <- "abcd.vdgd.dhdsg"
Expected result: vdgd
I have used
pt <- strapply(string, "\\.(.*)\\.", simplify = TRUE)
which gives the correct result, but for strings with more than two dots it does not work as expected.
E.g. string <- "abcd.vdgd.dhdsg.jsgs"
It gives vdgd.dhdsg, but the expected result is vdgd.
Could anyone help me?
Thanks & Regards,
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks @TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can also use read.table. Pass the string via text= and set the separator to a dot ("."). Once the input has been read into a data.frame, select whichever column you need (in this case column 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector:
You can try tstrsplit from data.table, which is vectorized over string:
> library(data.table)
> string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
> tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
> sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"
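If adding a package is not an option, the same vectorized split can be sketched in base R. This assumes the wanted piece is always the second dot-separated field; strings with fewer than two fields would give NA.

```r
string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
parts <- strsplit(string, ".", fixed = TRUE)

# Take the second field of every element; vapply() guarantees a
# character vector with one entry per input string
second_field <- vapply(parts, function(p) p[2], character(1))
second_field
```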

Regexpr not working as expected

For the following string <10.16;13.05) I want to match only the first number (sometimes the first number does not exist, i.e. <;13.05)). I used the following regular expression:
grep("[0-9]+\\.*[0-9]*(?=;)","<10.16;13.05)",value=T,perl=T)
However, the result is not "10.16" but "<10.16;13.05)". Could anyone please help me with this one? Thanks.
You could also use strsplit here with minimal regex, i.e.
x <- '<10.16;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] 10.16
x <- '<;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] NA
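I believe you are using the wrong regex function. grep only reports which elements of the input contain the pattern (with value = TRUE it returns those whole elements); it does not extract the matched part.
Try instead, keeping the lookahead so that only a number directly before the ; is matched:
regmatches("<10.16;13.05)", regexpr("[0-9]+\\.?[0-9]*(?=;)", "<10.16;13.05)", perl = TRUE))
#[1] "10.16"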
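Since the first number is sometimes missing, a vectorized wrapper that returns NA in that case may be handy. The sketch below is only illustrative: the function name first_number is hypothetical, and it assumes the number to extract always sits directly before the semicolon.

```r
# Hypothetical helper: numeric value directly before ";" or NA if absent
first_number <- function(x) {
  # perl = TRUE is needed for the (?=;) lookahead
  m <- regexpr("[0-9]+\\.?[0-9]*(?=;)", x, perl = TRUE)
  out <- rep(NA_real_, length(x))
  # regmatches() returns matches only for the elements that matched,
  # so they are assigned back into the matching positions
  out[m > 0] <- as.numeric(regmatches(x, m))
  out
}

first_number(c("<10.16;13.05)", "<;13.05)"))
```

For these two inputs the result should be 10.16 followed by NA.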

Find pattern in URL with stringr and regex

I have a dataframe df with some urls. There are subcategories within the slashes in the URLs I want to extract with stringr and str_extract
My data looks like
Text URL
Hello www.facebook.com/group1/bla/exy/1234
Test www.facebook.com/group2/fssas/eda/1234
Text www.facebook.com/group-sdja/sdsds/adeds/23234
Texter www.facebook.com/blablabla/sdksds/sdsad
I now want to extract everything between .com/ and the next /
I tried suburlpattern <- "^.com//{1,20}//$"
and df$categories <- str_extract(df$URL, suburlpattern)
But I only end up with NA in df$categories
Any idea what I am doing wrong here? Is it my regex code?
Any help is highly appreciated! Many thanks beforehand.
If you want to use str_extract, you need a regex that will get the value you need into the whole match, and you will need a (?<=[.]com/) lookbehind:
(?<=[.]com/)[^/]+
Details:
(?<=[.]com/) - the current location must be preceded with .com/ substring
[^/]+ - matches 1 or more characters other than /.
R demo:
> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1" "group2" "group-sdja" "blablabla"
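The same lookbehind pattern should also work without stringr. A minimal base R sketch, assuming every URL contains .com/ followed by at least one more path segment:

```r
URL <- c("www.facebook.com/group1/bla/exy/1234",
         "www.facebook.com/group2/fssas/eda/1234",
         "www.facebook.com/group-sdja/sdsds/adeds/23234",
         "www.facebook.com/blablabla/sdksds/sdsad")

# perl = TRUE enables the (?<=...) lookbehind in base R regex functions
m <- regexpr("(?<=[.]com/)[^/]+", URL, perl = TRUE)

# Extract the matched text; elements without a match would be dropped
res <- regmatches(URL, m)
res
```

Note that, unlike str_extract, regmatches silently drops elements that do not match instead of returning NA, which matters when the result is assigned back to a data frame column.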
This will return everything between the first pair of forward slashes:
library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]
[1] "blablabla"
This works
library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234",
"www.facebook.com/group2/fssas/eda/1234",
"www.facebook.com/group-sdja/sdsds/adeds/23234",
"www.facebook.com/blablabla/sdksds/sdsad")
suburlpattern <- "/(.*?)/"
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)
Results:
[1] "group1" "group2" "group-sdja" "blablabla"
Will only get you what's between the first and second slashes... but that seems to be what you want.
