I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix, from a long character string. The word I want to extract always starts with the same prefix (AA), but the word is not the same length, and does not occur in the same location of the string.
mytext1 <- as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH") # Return AA12345
mytext2 <- as.character("ELEPHANT AA100 KOALA POLAR.BEAR") # Want to return AA100
mytext3 <- as.character("CROCODILE DRAGON.FLY ANTELOPE") # Want to return NA
As an extension of this, what if there were two different patterns to match and I wanted to return a character string with both?
mytext4 <- as.character("TULIP AA999 DAISY BB123")
# Pattern matching to AA and BB
# Want to return AA999 BB123
Any help with this would be greatly appreciated :)
Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string (?<=^| ), and then as few characters as possible .*? until the next space or the end of the string (?=$| ). Note that you can combine all the strings into a vector and a vector will be returned. If you want all matches for each string, then use str_extract_all instead of str_extract and you get a list with a vector for each string. If you want to specify multiple matches, use an option and a capturing group (AA|BB) as shown.
mytext <- c(
as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100,
as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)
library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"
as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"
Created on 2018-04-29 by the reprex package (v0.2.0).
You can get a base R solution using sub
sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
> sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"
I like keeping things in base R whenever possible, and there is already a solution for this. What you really are looking for is the regmatches() function. See Here
Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.
To solve your specific problem
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl=T)
regmatches(mytext1, matches)
> [1] "AA12345"
When there is no match:
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl=T)
regmatches(mytext3, matches)
> character(0)
If you want to avoid character(0) put your strings in a vector and run them all at once.
alltext = c(mytext1, mytext2, mytext3)
matches = regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T)
regmatches(alltext, matches)
> [1] "AA12345" "AA100"
And finally, if you want a one-liner
regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T))
> [1] "AA12345" "AA100"
Related
I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex do not return to me the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply function gives you a list of which for each element in the list (0 if there's no match). The which!=0 function gives you the element in the list where your string occurs.
Use either mapply or Map with str_detect to find the position, I have run only for one string "KML" , you can run it for all others. I hope this is helpful.
First of all we make the lists even so that we can process it easily
library(stringr)
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
### Making the list even
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
>
I had the same question. I cannot explain why grep would work well in a list with characters but not with regex. Anyway, the best way I found to match a character string using common R script is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with similar structure as the input with 'NA' or '1', depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex expression so it splits each string element at the "_" after the first occurrence of "R", "L" unless the first occurrence of "R" or "L" is followed by "pole", then it splits the string element after the first occurrence of "pole" and only splits each string element once?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
See the regex demo
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
split_again = function(x){
if(length(x) > 1){
return(x)
}
else{
str_split(
string = x,
pattern = '(?<=[R|L])_',
n = 2)
}
}
str_split(
string = string,
pattern = '(?<=pole)_',
n = 2) %>%
lapply(split_again) %>%
unlist()
you could use sub then strsplit as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string by the 6-digit numbers. I.e. - I want this:
235072,testing,some2wg2f4,wf484-things
224072,and,other25wg4,14-thingies
223552,testing,some/2wr24,14084-things
How do I do this with regex? The following does not work (using stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing??
Here's an approach with base R using a positive lookahead and lookbehind, and thanks to #thelatemail for the correction:
strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
An alternative approach with str_extract_all. Note I've used .*? to do 'non-greedy' matching, otherwise .* expands to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
An easy-to-understand approach is to add a marker and then split on the locations of those markers. This has the advantage of being able to only look for 6-digit sequences and not require any other features in the surrounding text, whose features may change as you add new and unvetted data.
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
out <-
str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
str_split("#SPLIT_HERE#") %>%
unlist
[1] "" "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
If your match occurs at the start or end of a string, str_split() will insert blank character entries in the results vector to indicate that (as it did above). If you don't need that information, you can easily remove it with out[nchar(out) != 0].
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"
With less complex regex, you can do as following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex do not return to me the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply function gives you a list of which for each element in the list (0 if there's no match). The which!=0 function gives you the element in the list where your string occurs.
Use either mapply or Map with str_detect to find the position, I have run only for one string "KML" , you can run it for all others. I hope this is helpful.
First of all we make the lists even so that we can process it easily
library(stringr)
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
### Making the list even
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
>
I had the same question. I cannot explain why grep would work well in a list with characters but not with regex. Anyway, the best way I found to match a character string using common R script is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with similar structure as the input with 'NA' or '1', depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA