Extract substring in R using grepl - r

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and ending pattern would be ".csv". I wrote something using grepl but not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete

I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA

With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"

Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
please note:
I have to use gsub. Because I first remove ^.*Work then \\.csv$.
For [\\s\\S] or \\d\\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akruns approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
. matches also \n when using the R engine.

Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

Related

R splitting string on predefined location

I have string, which should be split into parts from "random" locations. Split occurs always from next comma after colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there to figure out position where it should be split up, but could not find suitable way to do it. Feels like there is always 6 characters after colon before the comma, but I can't be sure about that for whole data.
Here is example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what result should be
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")
Perehaps we can use regmatches like below
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"
here is one option with strsplit - replace the , after the digit followed by the . and one or more digits (\\d+) with a new delimiter using gsub and then split with strsplit in base R
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
-checking
> identical(result, result1)
[1] TRUE
If the number of characters are fixed, use a regex lookaround
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]

Use Regular expressions extract specific characters

text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')
I have tried but failed:
str_extract(text, pattern = 'f.*\\|')
How can I get
f__Closteroviridae
f__Alphaflexiviridae
f__Retroviridae
Any help will be high appreciated!
Make the regex non-greedy and since you don't want "|" in final output use positive lookahead.
stringr::str_extract(text, 'f.*?(?=\\|)')
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
In base R, we can use sub :
sub('.*(f_.*?)\\|.*', '\\1', text)
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
For a base R solution, I would use regmatches along with gregexpr:
m <- gregexpr("\\bf__[^|]+", text)
as.character(regmatches(text, m))
[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
The advantage of using gregexpr as above is that should an input contain more than one f__ matching term, we could also capture it. For example:
x <- 'd__Viruses|f__Closteroviridae|g__Closterovirus|f__some_virus'
m <- gregexpr("\\bf__[^|]+", x)
regmatches(x, m)[[1]]
[1] "f__Closteroviridae" "f__some_virus"
Data:
text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')

extracting the second last word between the special characters "/"

I would like to extract the second last string after the '/' symbol. For example,
url<- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
df<- data.frame (url)
I want to extract the second word from the last between the two // and would like to get the words 'ani' and 'bmc'
so, I tried this
library(stringr)
df$name<- word(df$url,-2)
I need output which as follows:
name
ani
bmc
You can use word but you need to specify the separator,
library(stringr)
word(url, -2, sep = '/')
#[1] "ani" "bmc"
Try this:
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[3,]
# V1 V2
#3 ani bmc
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[2:3,]
# V1 V2
#2 names names
#3 ani bmc
Use gsub with
.*?([^/]+)/[^/]+$
In R:
urls <- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
gsub(".*?([^/]+)/[^/]+$", "\\1", urls)
This yields
[1] "ani" "bmc"
See a demo on regex101.com.
Here is a solution using strsplit
words <- strsplit(url, '/')
L <- lengths(words)
vapply(seq_along(words), function (k) words[[k]][L[k]-1], character(1))
# [1] "ani" "bmc"
A non-regex approach using basename
basename(mapply(sub, pattern = basename(url), replacement = "", x = url, fixed = TRUE))
#[1] "ani" "bmc"
basename(url) "removes all of the path up to and including the last path separator (if any)" and returns
[1] "digitalcod-org" "ambulancecod.org"
use mapply to replace this outcome for every element in url by "" and call basename again.

How to delete all strings except some specific letters in R?

after researching for a while, I didn't find exactly what I would like.
What I'd like to do is to keep an exact pattern in a string.
So this is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
how to get exactly "THIS" in all strings:
res=c("THIS","THIS","THIS","","")
I tried gsubin r, but I don't know how to match characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This delete all string after "THIS".
gsub(".*(THIS)", "\\1", text) # This delete all string before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr return a list of vectors with some empty entries whenever no match is found, you need to convert them into NA first, then unlist, and then turn to "".
Here is some base R code:
> text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
[1] "THIS" "THIS" "THIS" "" "" ""
> matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
> res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
> res[is.na(res)] <- ""
> unlist(res)
[1] "THIS" "THIS" "THIS" "" "" "THAT"
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA
It is better to have NA rather than ""
This will first delete elements which don't match THIS and then follows your original idea while storing intermediate result to a variable. It seems that you want to have empty strings for elements that do not match, and last line does that.
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
gsub("[^THIS]","",text) seems to do the trick? "[^THIS]" matches everything except for THIS, and gsub replaces those matches with the empty string given as the second parameter. see comment, doesn't work as expected.

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, Ideally I'd like something that's easy to extend so that I can get the information in between the 1st and 2nd underscore and get the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base using _.* taking everything starting from _.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_] what is everything but not _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"

Resources