Getting only the matched part of the string in R

Is there a function in R that matches regexp and returns only the matched parts?
Something like grep -o, so:
> ogrep('.b.',c('abc','1b2b3b4'))
[[1]]
[1] abc
[[2]]
[1] 1b2 3b4

Try stringr:
library(stringr)
str_extract_all(c('abc','1b2b3b4'), '.b.')
# [[1]]
# [1] "abc"
#
# [[2]]
# [1] "1b2" "3b4"

I can't believe nobody ever mentioned regmatches!
x <- c('abc','1b2b3b4')
regmatches(x, gregexpr('.b.', x))
# [[1]]
# [1] "abc"
# [[2]]
# [1] "1b2" "3b4"
It makes me wonder, didn't regmatches exist two and a half years ago?
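The ogrep() call from the question can be sketched as a thin wrapper around this combination. The name ogrep is the hypothetical one used in the question, not an existing R function:

```r
# Hypothetical helper named after the question's ogrep(); not part of base R.
# Extra arguments (e.g. perl = TRUE) are passed on to gregexpr().
ogrep <- function(pattern, x, ...) {
  regmatches(x, gregexpr(pattern, x, ...))
}

res <- ogrep(".b.", c("abc", "1b2b3b4"))
print(res)
```

Strings without a match should come back as character(0) in the corresponding list slot.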

You should probably give Gabor Grothendieck the check for writing the gsubfn package:
require(gsubfn)
#Loading required package: gsubfn
strapply(c('abc','1b2b3b4'), ".b.", I)
#Loading required package: tcltk
#Loading Tcl/Tk interface ... done
[[1]]
[1] "abc"
[[2]]
[1] "1b2" "3b4"
This just applies the identity function, I, to the matches of the pattern.

You need to combine gregexpr with substring, I reckon:
> s = c('abc','1b2b3b4')
> m = gregexpr('.b.',s)
> substring(s[1],m[[1]],m[[1]]+attr(m[[1]],'match.length')-1)
[1] "abc"
> substring(s[2],m[[2]],m[[2]]+attr(m[[2]],'match.length')-1)
[1] "1b2" "3b4"
The returned list 'm' has the start and lengths of matches. Loop over s to get all the substrings.
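The loop mentioned above might look like the following sketch. The helper name extract_one and the extra non-matching string "xyz" are illustrative assumptions, not part of the original answer:

```r
s <- c("abc", "1b2b3b4", "xyz")
m <- gregexpr(".b.", s)

# Extract every match from one string, given its element of the gregexpr() result.
extract_one <- function(string, pos) {
  if (pos[1] == -1) return(character(0))  # gregexpr() signals "no match" with -1
  substring(string, pos, pos + attr(pos, "match.length") - 1)
}

# Map() walks over the strings and their match data in parallel
res <- unname(Map(extract_one, s, m))
print(res)
```

The explicit check for -1 matters: without it, substring() would be called with a start of -1 and return an empty string rather than nothing.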

Related

R Extract a word from a character string using pattern matching

I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix, from a long character string. The word I want to extract always starts with the same prefix (AA), but the word is not the same length, and does not occur in the same location of the string.
mytext1 <- as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH") # Return AA12345
mytext2 <- as.character("ELEPHANT AA100 KOALA POLAR.BEAR") # Want to return AA100
mytext3 <- as.character("CROCODILE DRAGON.FLY ANTELOPE") # Want to return NA
As an extension of this, what if there were two different patterns to match and I wanted to return a character string with both?
mytext4 <- as.character("TULIP AA999 DAISY BB123")
# Pattern matching to AA and BB
# Want to return AA999 BB123
Any help with this would be greatly appreciated :)
Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), and then as few characters as possible, .*?, until the next space or the end of the string, (?=$| ). Note that you can combine all the strings into a vector, and a vector will be returned. If you want all matches for each string, use str_extract_all instead of str_extract and you get a list with a vector for each string. If you want to match several prefixes, use alternation inside a group, (AA|BB), as shown.
mytext <- c(
as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100,
as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)
library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"
as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"
Created on 2018-04-29 by the reprex package (v0.2.0).
You can get a base R solution using sub
sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
> sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"
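One caveat with sub(): when the pattern does not match, the input string is returned unchanged rather than NA. A minimal sketch of a guard, assuming NA is the result you want for non-matching strings (the variable names here are illustrative):

```r
mytext <- c("HORSE MONKEY LIZARD AA12345 SWORDFISH",
            "CROCODILE DRAGON.FLY ANTELOPE")
pattern <- ".*\\b(AA\\w*).*"

# Only keep the substitution where the pattern actually matched
res <- ifelse(grepl(pattern, mytext),
              sub(pattern, "\\1", mytext),
              NA_character_)
print(res)
```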
I like keeping things in base R whenever possible, and there is already a solution for this. What you are really looking for is the regmatches() function, which its documentation describes as follows:
Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.
To solve your specific problem
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl=T)
regmatches(mytext1, matches)
> [1] "AA12345"
When there is no match:
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl=T)
regmatches(mytext3, matches)
> character(0)
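If you want to avoid character(0), put your strings in a vector and run them all at once. Note that regmatches() drops the elements with no match, so the result can be shorter than the input.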
alltext = c(mytext1, mytext2, mytext3)
matches = regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T)
regmatches(alltext, matches)
> [1] "AA12345" "AA100"
And finally, if you want a one-liner
regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T))
> [1] "AA12345" "AA100"

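If the result from regmatches() has to stay aligned with the input (one entry per string, NA where nothing matched), one possible sketch is to fill a pre-allocated vector. The simpler pattern \\bAA\\w* is an assumption borrowed from the sub() answer above, not the lookaround pattern used in the regmatches() answer:

```r
alltext <- c("HORSE MONKEY LIZARD AA12345 SWORDFISH",
             "ELEPHANT AA100 KOALA POLAR.BEAR",
             "CROCODILE DRAGON.FLY ANTELOPE")

m <- regexpr("\\bAA\\w*", alltext, perl = TRUE)

# Start with NA everywhere, then fill in only the positions that matched
out <- rep(NA_character_, length(alltext))
out[as.integer(m) != -1] <- regmatches(alltext, m)
print(out)
```
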
How to use str_split with regex in R?

I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string by the 6-digit numbers. I.e. - I want this:
235072,testing,some2wg2f4,wf484-things
224072,and,other25wg4,14-thingies
223552,testing,some/2wr24,14084-things
How do I do this with regex? The following does not work (using stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing??
Here's an approach with base R using a positive lookahead and lookbehind (thanks to @thelatemail for the correction):
strsplit(blahblah, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
An alternative approach with str_extract_all. Note I've used .*? to do 'non-greedy' matching, otherwise .* expands to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
An easy-to-understand approach is to add a marker and then split on the locations of those markers. This has the advantage of being able to only look for 6-digit sequences and not require any other features in the surrounding text, whose features may change as you add new and unvetted data.
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
out <-
str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
str_split("#SPLIT_HERE#") %>%
unlist
[1] "" "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
If your match occurs at the start or end of a string, str_split() will insert blank character entries in the results vector to indicate that (as it did above). If you don't need that information, you can easily remove it with out[nchar(out) != 0].
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"
With a less complex regex, you can do the following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
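The same locate-then-cut idea can be sketched in base R with gregexpr() and substring(). This assumes the string contains at least one 6-digit run and starts with one; any text before the first run would be dropped:

```r
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# Start position of every 6-digit run
starts <- as.integer(gregexpr("[0-9]{6}", s)[[1]])
# Each piece ends just before the next start; the last one ends with the string
ends <- c(starts[-1] - 1, nchar(s))

pieces <- substring(s, starts, ends)
print(pieces)
```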

grep exact match in vector inside a list in R

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex does not return the index of a string inside a vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply call gives, for each element of the list, the positions within that element where the pattern matches (integer(0) if there is no match). Comparing the result with != 0 and wrapping it in which() then gives the indices of the list elements in which your string occurs.
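Since the goal is an exact match, a regex is not strictly needed. A minimal sketch without grep, using %in% on each list element (find_exact is a hypothetical helper name):

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# Hypothetical helper: indices of the list elements that contain `target` exactly
find_exact <- function(lst, target) {
  which(vapply(lst, function(x) target %in% x, logical(1)))
}

idx_hij <- find_exact(map_tmp, "HIJ")
idx_kml <- find_exact(map_tmp, "KML")
idx_abc <- find_exact(map_tmp, "ABC")
print(c(idx_hij, idx_kml, idx_abc))
```

vapply() always returns one logical per list element, so this also copes with an element that contains the target more than once.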
Use either mapply or Map with str_detect to find the position. I have run it only for one string, "KML"; you can run it for the others in the same way.
First of all, we pad the list elements to equal length so that they can be processed easily:
library(stringr)
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
### Making the list even
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
>
I had the same question. I cannot explain why grep would work well in a list with characters but not with regex. Anyway, the best way I found to match a character string using base R is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with a structure similar to the input, containing NA or 1 depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
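A likely explanation for the grep behaviour, assuming grep() coerces a list argument with as.character() as its documentation describes: each list element becomes a single string, and multi-element vectors are deparsed into text. The sketch below only inspects that coercion:

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# What a character-vector function sees when handed the list
coerced <- as.character(map_tmp)
print(coerced)

# The two-element vectors are now text such as c("EGF", "HIJ"),
# so an anchored pattern like "^HIJ$" has nothing to match.
hits <- grep("^HIJ$", coerced)
print(hits)
```

An unanchored pattern such as "ABC" still finds element 3 because the deparsed text contains that substring.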

Using str_view with a list of words in R

I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x." I have a list of words generated by Corpora, but whenever I launch the code, it returns a blank view.
Common_words<-corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but no ideas so far.
I'd say this is not about programming with stringr but learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here \\w, the shorthand class for word characters (i.e., [A-Za-z0-9_]), is useful with quantifiers (+ and {3} in these 2 cases). PS: I use stringi here because stringr uses it in the backend anyway. Just skipping the middleman.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT Seems like these may be of use too:
## Just y starting words
stri_extract_all_regex(x, '\\by\\w*\\b')
## Just x ending words
stri_extract_all_regex(x, '\\b\\w*x\\b')
## Words with 4 or more characters
stri_extract_all_regex(x, '\\b\\w{4,}\\b')
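For the original question, where the input is a vector with one word per element rather than sentences, anchors are enough. A minimal base R sketch, assuming Common_words is a plain character vector (the words vector below is a made-up stand-in):

```r
# Hypothetical word vector standing in for Common_words
words <- c("yes", "box", "year", "fix", "relax", "cat", "you")

starts_with_y <- words[grepl("^y", words)]   # ^ anchors the start of each word
ends_with_x   <- words[grepl("x$", words)]   # $ must come after the x, not before it
three_letters <- words[nchar(words) == 3]    # alternatively grepl("^\\w{3}$", words)

print(starts_with_y)
print(ends_with_x)
print(three_letters)
```

The pattern "$[x]" from the question asks for an x after the end of the string, which can never match; that would explain the blank view.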

String Split into list R

Extract words from a string and make a list in R
str <- "qwerty keyboard"
result <- strsplit(str,"[[:space:]]")
What I get is (below):
result
[[1]]
[1] "qwerty" "keyboard"
What I need is (below):
result
[[1]]
[1] "qwerty"
[[2]]
[1] "keyboard"
[OR]
result
[[1]]
[1] "qwerty"
[2] "keyboard"
I am looking for a solution, if someone knows please post your solution here.
thanks in advance..
try:
str <- "qwerty keyboard"
result_1 <- strsplit(str,"[[:space:]]")[[1]][1]
result_2 <- strsplit(str,"[[:space:]]")[[1]][2]
result <- list(result_1,result_2)
Or
as.list(strsplit(str, '\\s+')[[1]])
as.list(unlist(strsplit(str, '[[:space:]]')))
As an alternative to strsplit(), you can make a list out of the result from scan().
as.list(scan(text=str, what=""))
# Read 2 items
# [[1]]
# [1] "qwerty"
#
# [[2]]
# [1] "keyboard"
