after researching for a while, I didn't find exactly what I would like.
What I'd like to do is to keep an exact pattern in a string.
So this is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
how to get exactly "THIS" in all strings:
res=c("THIS","THIS","THIS","","")
I tried gsubin r, but I don't know how to match characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This delete all string after "THIS".
gsub(".*(THIS)", "\\1", text) # This delete all string before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr return a list of vectors with some empty entries whenever no match is found, you need to convert them into NA first, then unlist, and then turn to "".
Here is some base R code:
> text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
[1] "THIS" "THIS" "THIS" "" "" ""
> matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
> res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
> res[is.na(res)] <- ""
> unlist(res)
[1] "THIS" "THIS" "THIS" "" "" "THAT"
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA
It is better to have NA rather than ""
This will first delete elements which don't match THIS and then follows your original idea while storing intermediate result to a variable. It seems that you want to have empty strings for elements that do not match, and last line does that.
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
gsub("[^THIS]","",text) seems to do the trick? "[^THIS]" matches everything except for THIS, and gsub replaces those matches with the empty string given as the second parameter. see comment, doesn't work as expected.
Related
Suppose I have list of string "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of string that contains only numbers with bracket like eg. [+229][+57].
Is there a convenient way in R to do this?
Using base R, then try it with
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)
U have a sentence where I need to extract the first even word. For example
df <- ("This is not the sentence")
For the above sentence, I need "This" to be extracted because it is the first even word
Another example is
df <- ("She is not going anywhere")
For the above sentence, I need "is" to be extracted because it is the first even word
We can write a function to do this. We split the string on whitespace count number of characters in each word and return the first even word.
extract_first_even_word <- function(text) {
all_words <- strsplit(text, "\\s+")[[1]]
all_words[which.max(nchar(all_words) %% 2 == 0)]
}
extract_first_even_word("This is not the sentence")
#[1] "This"
extract_first_even_word("She is not going anywhere")
#[1] "is"
I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and ending pattern would be ".csv". I wrote something using grepl but not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
please note:
I have to use gsub. Because I first remove ^.*Work then \\.csv$.
For [\\s\\S] or \\d\\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akruns approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
. matches also \n when using the R engine.
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
In R, I have strings looking like:
test <- 'ZYG11B|79699'
I want to keep only 'ZYG11B'.
My best attempt yet:
gsub ("|.*$", "", test) # should replace everything after '|' by nothing
but returns
> [1] ""
How should I do that?
It's a protected character which means it should be enclosed in square brackets or escaped with double slashes:
> gsub('[|].*$','', test)
[1] "ZYG11B"
> gsub('\\|.*$','', test)
[1] "ZYG11B"
We can do
library(stringr)
str_extract(test, "\\w+")
#[1] "ZYG11B"
I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, Ideally I'd like something that's easy to extend so that I can get the information in between the 1st and 2nd underscore and get the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base using _.* taking everything starting from _.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_] what is everything but not _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"