R substring based on regular expression

I have strings like:
myString = "2 word1 & 4 word2"
myString = "4 word2"
myString = "2 word1"
I would like to get the number before word1 and the number before word2:
number1 = 2
number2 = 4
How can I do this with a regular expression in R?
I tried something like this, but it only gets the first number:
gsub("([0-9]+).*", "\\1", myString)

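You may extract a specific number before a specific string using a regex with a lookahead (here myString is a vector holding all three strings, and str_extract_all comes from the stringr package):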
> word1_res <- str_extract_all(myString, "\\d+(?=\\s*word1)")
> word1_res
[[1]]
[1] "2"
[[2]]
character(0)
[[3]]
[1] "2"
The results for word2 can be retrieved similarly:
word2_res <- str_extract_all(myString, "\\d+(?=\\s*word2)")
Details
\d+ - 1 or more digits...
(?=\\s*word2) - if immediately followed with:
\s* - 0+ whitespaces
word2 - a literal word2 substring.
A base R equivalent is
regmatches(myString, gregexpr("\\d+(?=\\s*word1)", myString, perl=TRUE))
regmatches(myString, gregexpr("\\d+(?=\\s*word2)", myString, perl=TRUE))
An almost equivalent solution with sub would be
> sub(".*?(\\d+)\\s*word1.*|.*","\\1",myString)
[1] "2" "" "2"
> sub(".*?(\\d+)\\s*word2.*|.*","\\1",myString)
[1] "4" "4" ""
Note that this implies there is only one result per string, while str_extract_all will get all occurrences from the string.
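As a rough illustration of the lookahead approach with one result per string, here is a minimal base R sketch. The helper name number_before is hypothetical (it is not part of base R or stringr), and it assumes that word contains no regex metacharacters and that only the first match per string is wanted.

```r
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# Hypothetical helper: returns the first number that directly precedes `word`,
# or NA when the string has no such number.
number_before <- function(strings, word) {
  # Assumption: `word` is plain text without regex metacharacters.
  pattern <- paste0("\\d+(?=\\s*", word, ")")
  m <- regexpr(pattern, strings, perl = TRUE)  # lookaheads need perl = TRUE
  out <- rep(NA_character_, length(strings))
  out[m != -1] <- regmatches(strings, m)       # fill in only the matched positions
  as.numeric(out)
}

number1 <- number_before(myString, "word1")
number2 <- number_before(myString, "word2")
print(number1)
print(number2)
```

Returning NA for strings without a match keeps the result the same length as the input, which is usually easier to work with than a list containing character(0).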
To extract any chunk of 1+ digits as a whole word, use a stringr solution with str_extract_all
library(stringr)
str_extract_all(myString, "\\b\\d+\\b")
or a base R one with regmatches/gregexpr:
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
regmatches(myString, gregexpr("\\b\\d+\\b", myString))
Output:
[[1]]
[1] "2" "4"
[[2]]
[1] "4"
[[3]]
[1] "2"
Details
\b - a word boundary
\d+ - 1 or more digits
\b - a word boundary.
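To see what the word boundaries contribute, the following small base R sketch compares the pattern with and without \b on the question's data. The variable names with_boundaries and without_boundaries are only illustrative.

```r
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# With word boundaries: only standalone numbers are returned.
with_boundaries <- regmatches(myString, gregexpr("\\b\\d+\\b", myString))

# Without them: the digits at the end of word1/word2 are matched as well.
without_boundaries <- regmatches(myString, gregexpr("\\d+", myString))

print(with_boundaries[[1]])     # "2" "4"
print(without_boundaries[[1]])  # "2" "1" "4" "2"
```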

Try:
myString = "2 word1 & 4 word2"
number1 = gsub("([0-9]+).*", "\\1", myString)
myString = "4 word2"
number2 = gsub("([0-9]+).*", "\\1", myString)
myString = "2 word1"
number3 = gsub("([0-9]+).*", "\\1", myString)
print(number1)
print(number2)
print(number3)
If you assign 3 times a string to myString, myString will only contain the last one.
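Since sub and gsub are vectorised, one way to avoid the overwriting problem is to keep all strings in a single character vector and call the function once. This is only a sketch of that idea; the name first_number is illustrative, and the pattern still returns just the first number of each string.

```r
# Keep all inputs in one character vector instead of reassigning myString.
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# sub() processes every element; each result is the leading number of its string.
first_number <- sub("([0-9]+).*", "\\1", myString)
print(first_number)
```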

This removes each occurrence of a letter or ampersand possibly followed by other non-space characters and then scans in what is left. The scan also converts them to numeric. No packages are used.
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
lapply(myString, function(x) scan(text = gsub("[[:alpha:]&]\\S*", "", x), quiet = TRUE))
giving:
[[1]]
[1] 2 4
[[2]]
[1] 4
[[3]]
[1] 2

Related

How to strip down comma-separated strings to unique substrings

I'm struggling to strip down comma-separated strings to unique substrings in a clean fashion:
x <- c("Anna & x, Anna & x",
"Alb, Berta 222, Alb",
"Al Pacino",
"Abb cd xy, Abb cd xy, C123, C123, B")
I seem to be doing fine with this combination of a negated character class, a negative lookahead and a backreference; however, what bothers me is that many of the substrings contain unwanted whitespace:
library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"
[[2]]
[1] " Berta 222" " Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] " Abb cd xy" " C123" " B"
How can the pattern be refined so that no unwanted whitespace gets extracted?
Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb" "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
EDIT:
Just wanted to share this solution with double negative lookahead, which also works well (and thanks for the many useful solutions proposed!)
str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")
You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:
library(tidyverse)
str_split(x, ", ?") %>% map(unique)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Alb" "Berta 222"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
If you want the output as a single vector of unique strings, you could do:
unique(unlist(str_split(x, ", ?")))
[1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123" "B"
In the code above we used the regex ", ?" to split at a comma or a comma followed by a space so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had used "," in str_split we could do the following:
str_split(x, ",") %>%
map(str_trim) %>%
map(unique)
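For readers who prefer to avoid extra packages, the same split, trim and deduplicate steps can be sketched in base R. This is an assumed equivalent rather than part of the answer above; it uses strsplit, trimws and unique.

```r
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split on literal commas, then trim whitespace and drop duplicates per string.
parts <- strsplit(x, ",", fixed = TRUE)
result <- lapply(parts, function(p) unique(trimws(p)))
print(result)
```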
Change your pattern to the one below:
str_extract_all(x, "(\\b[^,]+)(?!.*\\1)")
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
You need to start matching from a char other than a whitespace and a comma, then optionally match any zero or more chars other than a comma up to a char other than whitespace and a comma:
str_extract_all(x, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")
Mind that if your strings contain line breaks, you need to prepend the pattern with (?s): str_extract_all(x, "(?s)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)").
If you need to make it case insensitive (e.g. Abb cd xy and ABB cD Xy are considered duplicates), add the i flag: str_extract_all(x, "(?i)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") (or str_extract_all(x, "(?si)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") if the DOTALL behavior is needed).
Details:
([^\s,](?:[^,]*[^\s,])?) - Group 1:
[^\s,] - a char other than whitespace and a comma
(?:[^,]*[^\s,])? - an optional sequence of
[^,]* - zero or more chars other than a comma
[^\s,] - a char other than whitespace and a comma
(?!.*\1) - a negative lookahead that fails the match if there are zero or more chars, as many as possible, followed with the Group 1 value.
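One consequence of the negative lookahead is worth illustrating: a substring is rejected while a copy of it still follows, so the pattern keeps the last occurrence of each duplicate, whereas unique keeps the first. The sketch below shows this with base R (gregexpr with perl = TRUE instead of stringr); the names keep_last and keep_first are illustrative.

```r
x <- "Alb, Berta 222, Alb"
pattern <- "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)"

# Regex approach: the first "Alb" is rejected because "Alb" occurs again later.
keep_last <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Split-and-unique approach: the first "Alb" is kept, the later one is dropped.
keep_first <- unique(trimws(strsplit(x, ",", fixed = TRUE)[[1]]))

print(keep_last)   # "Berta 222" "Alb"
print(keep_first)  # "Alb" "Berta 222"
```

If the original order of first appearance matters, the split-and-unique route is probably the safer choice.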
Not exactly what you asked for, but the NLP frameworks can be helpful when the problems get more complex.
library(tidytext)
library(dplyr)
library(tibble)
tibble(text = x) %>%
rowid_to_column("stringid") %>%
unnest_regex(substring, text, pattern = ",", to_lower = FALSE) %>%
distinct(stringid, substring = trimws(substring))
# # A tibble: 7 x 2
#   stringid substring
#      <int> <chr>
# 1        1 Anna & x
# 2        2 Alb
# 3        2 Berta 222
# 4        3 Al Pacino
# 5        4 Abb cd xy
# 6        4 C123
# 7        4 B
Just add lapply(..., str_trim) to your code:
library(stringr)
lapply(str_extract_all(x, "([^,]+)(?!.*\\1)"), str_trim)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"

How to split a string in R with a regular expression when parts of the regular expression are to be kept in the resulting substrings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
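Base R's strsplit has no counterpart to the n argument, so a base R version has to cut at the first match by hand. The following is a minimal sketch under that assumption; split_first is a hypothetical helper, not an existing function.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Hypothetical helper: split a single string at the first match of `pattern`.
split_first <- function(s, pattern) {
  m <- regexpr(pattern, s, perl = TRUE)  # the lookahead needs perl = TRUE
  if (m == -1) return(s)                 # no split point: return the string as is
  c(substr(s, 1, m - 1),
    substr(s, m + attr(m, "match.length"), nchar(s)))
}

# ", " is consumed by the split; the lookahead keeps the next character.
result <- lapply(x, split_first, pattern = ", (?=[a-z0-9])")
print(result)
```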

Split string after comma without trailing whitespace

As the title already says, I want to split this string
strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",")
to that
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
Thus, the regular expression has to consider that no whitespace should occur after the comma. Could be a dupe, but was not able to find a solution by googling.
You wrote that the "regular expression has to consider that no whitespace should occur after the comma".
Use a negative lookahead assertion:
> strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",(?!\\s)", perl = TRUE)
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
,(?!\\s) matches , only if it's not followed by a space
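Note that \s covers any whitespace character, not only the space. The short sketch below, using a made-up input with a tab after the first comma, contrasts \s with a literal space in the lookahead.

```r
s <- "a,\tb,c"  # illustrative input: a tab follows the first comma

# \s also covers the tab, so only the second comma is a split point.
any_whitespace <- strsplit(s, ",(?!\\s)", perl = TRUE)[[1]]

# A literal space does not cover the tab, so both commas are split points.
space_only <- strsplit(s, ",(?! )", perl = TRUE)[[1]]

print(any_whitespace)
print(space_only)
```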
Just to provide an alternative using (*SKIP)(*FAIL):
pattern <- ",\\s(*SKIP)(*FAIL)|,"
data <- c("aaa,aaa", "bbb, bbb", "ddd , ddd")
strsplit(data, pattern, perl = TRUE)
This yields the same as above.

Extract string between spaces

I have this character vector:
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
[1] "AA AAAA 1B" "A BBB 1" "CC RR 1W3" "SS RGTYC 0"
and I want to extract what is between spaces.
Desired result:
[1] "AAAA" "BBB" "RR" "RGTYC"
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
lst <- strsplit(df," ")
sapply(lst, '[[', 2)
# [1] "AAAA" "BBB" "RR" "RGTYC"
Instead of splitting it first and then selecting the relevant piece, you can also extract it straight away using the stringr package:
library(stringr)
str_extract(df, "(?<=\\s)(.*)(?=\\s)")
# [1] "AAAA" "BBB" "RR" "RGTYC"
This solution uses regular expressions, and this pattern is built up like this:
(?<=\\s) checks whether there is whitespace before
(?=\\s) checks whether there is a whitespace after
(.*) extracts everything in between the white spaces
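Because .* is greedy, this pattern returns everything between the first and the last whitespace, which only equals the second field when a string has exactly three fields. The base R sketch below uses a made-up four-field string to show the difference and a possible variant with \S+; the variable names are illustrative.

```r
s <- "AA AAAA 1B ZZ"  # illustrative input with four fields

# Greedy .* runs from the first whitespace to the last one.
greedy <- regmatches(s, regexpr("(?<=\\s).*(?=\\s)", s, perl = TRUE))

# \S+ cannot cross whitespace, so only the second field is returned.
second_field <- regmatches(s, regexpr("(?<=\\s)\\S+(?=\\s)", s, perl = TRUE))

print(greedy)        # "AAAA 1B"
print(second_field)  # "AAAA"
```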
Here is a gsub-based approach (from base R). We match one or more non-whitespace characters from the start (^) of the string followed by one or more whitespace characters, or (|) one or more whitespace characters followed by one or more non-whitespace characters at the end of the string ($), and replace the match with an empty string ("").
gsub("^\\S+\\s+|\\s+\\S+$", "", df)
#[1] "AAAA" "BBB" "RR" "RGTYC"
There is also a convenient function word from stringr
stringr::word(df, 2)
#[1] "AAAA" "BBB" "RR" "RGTYC"

How do I extract the first number that occurs after a matching pattern

Consider these examples:
examples <- c(
"abc foo",
"abc foo 17",
"0 abc defg foo 5 121",
"abc 12 foo defg 11"
)
Here I would like to return the first number that occurs after "foo". In this case: NA, 17, 5, 11. How can I do this? I tried using a look-behind, but with no luck.
library(stringr)
str_extract(examples, "(?<=foo.*)[0-9]+")
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)
This seems to work:
str_match(examples, "foo.*?(\\d+)")
[,1] [,2]
[1,] NA NA
[2,] "foo 17" "17"
[3,] "foo 5" "5"
[4,] "foo defg 11" "11"
From ?regex:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier.
From ?str_extract:
See Also
?str_match to extract matched groups; ?stri_extract for the underlying implementation.
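The effect of the lazy quantifier can be checked directly. The following base R sketch uses regexec with perl = TRUE in place of str_match and compares the lazy and the greedy form on one of the examples; the names lazy and greedy are illustrative.

```r
s <- "0 abc defg foo 5 121"

# Lazy .*? stops at the first digits after "foo".
lazy <- regmatches(s, regexec("foo.*?(\\d+)", s, perl = TRUE))[[1]]

# Greedy .* consumes as much as possible and leaves only the last digit for the group.
greedy <- regmatches(s, regexec("foo.*(\\d+)", s, perl = TRUE))[[1]]

print(lazy)    # "foo 5" "5"
print(greedy)  # "foo 5 121" "1"
```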
You may use a base R solution like this:
> res <- gsub(".*?foo\\D*(\\d+).*|.*", "\\1", examples)
> res[nchar(res)==0] <- NA
> res
[1] NA "17" "5" "11"
As the regex will always match any string, you do not need to run a regex replacement twice, just fill out empty values with NA as the second step.
The pattern matches:
.*?foo - any 0+ chars as few as possible (since *? is lazy) up to the first occurrence of foo and then foo itself
\\D* - zero or more non-digit chars
(\\d+) - Group 1 that captures 1 or more digits (later, the value stored in the group can be referred with \1 backreference)
.* - the rest of the string
| - OR
.* - the whole string even if empty.
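If numeric values are wanted rather than strings, the same pattern can be followed by an explicit conversion. This is a sketch under a few assumptions: it uses sub with perl = TRUE rather than gsub, marks empty results with nzchar, and converts with as.integer.

```r
examples <- c("abc foo",
              "abc foo 17",
              "0 abc defg foo 5 121",
              "abc 12 foo defg 11")

# The first alternative captures the number after "foo"; the fallback .* matches
# everything and leaves the capture group empty.
res <- sub(".*?foo\\D*(\\d+).*|.*", "\\1", examples, perl = TRUE)

# Empty strings mean "no number after foo": mark them as missing, then convert.
res[!nzchar(res)] <- NA
numbers <- as.integer(res)
print(numbers)
```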
Base R gsub can do it (note that this first attempt ignores "foo"):
# pulls the first run of digits in each string
gsub('^\\D*(\\d*).*', '\\1', examples)
[1] "" "17" "0" "12"
Edit: actual solution using base R
ifelse(
grepl('foo\\D*\\d', examples),
gsub('^\\D*(\\d+).*', '\\1', gsub('.*foo\\s*', '', examples)),
NA)
[1] NA "17" "5" "11"
