R - Replace Group 1 match in regex but not full match

Suppose I want to extract all letters between the letters a and c. So far I have been using the stringr package, which gives a clear idea of the full matches and the groups. For example, the package gives the following.
library(stringr)
str_match_all("abc", "a([a-z])c")
# [[1]]
# [,1] [,2]
# [1,] "abc" "b"
Suppose I want to replace only the group, not the full match (in this case the letter b). The following, however, replaces the full match.
str_replace_all("abc", "a([a-z])c", "z")
[1] "z"
# Desired result: "azc"
Is there a good way to replace only the capture group? Suppose I also wanted to handle multiple matches.
str_match_all("abcdef", "a([a-z])c|d([a-z])f")
# [[1]]
# [,1] [,2] [,3]
# [1,] "abc" "b" NA
# [2,] "def" NA "e"
str_replace_all("abcdef", "a([a-z])c|d([a-z])f", "z")
# [1] "zz"
# Desired result: "azcdzf"
Matching groups was easy enough, but I haven't found a solution when a replacement is desired.

That is not how regex was designed: capturing is a mechanism for getting at the parts of a string you need, and in a replacement it is used to keep parts of the match, not to discard them.
Thus, a natural solution is to wrap what you need to keep in capturing groups.
In this case, use
str_replace_all("abc", "(a)[a-z](c)", "\\1z\\2")
Or with lookarounds (provided the lookbehind is a fixed-width pattern):
str_replace_all("abc", "(?<=a)[a-z](?=c)", "z")

Usually, when I want to replace a certain pattern of characters in a text/string, I use the grep family of functions, which is what we call working with regular expressions.
You can use the sub function from the grep family to make replacements in strings.
Example:
sub("b","z","abc")
[1] "azc"
You may face more challenging replacements; for those, the grep family offers plenty of functionality:
replacing all characters except a and c with a character of your choice:
sub("[^ac]+","z","abBbbbc")
[1] "azc"
replacing two consecutive b's (note that sub is case-sensitive, so the "bB" at the start does not match):
sub("b{2}","z","abBbbbc")
[1] "abBzbc"
replacing the pattern and everything after it:
sub("b.*","z","abc")
[1] "az"
the same as above but requiring a non-c character at the end; note that in "abc" there is nothing between "b" and "c", so the pattern finds no match and the string is returned unchanged:
sub("b.*[^c]","z","abc")
[1] "abc"
And so on...
You can search the internet for "regular expressions in R using grep" and find many more ways to work with regular expressions.
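One more note: sub replaces only the first match in each string, while its sibling gsub replaces all matches:
sub("b", "z", "abcabc")
# [1] "azcabc"
gsub("b", "z", "abcabc")
# [1] "azcazc"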

R regex Grouping Not Working as Expected

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"

Split string WITHOUT regex

I'm sure I used to know this, and I'm sure it is covered somewhere, but since I can't find any Google/SO hits for this title search, there probably should be one...
I want to split a string without using regex, e.g.
str = "abcx*defx*ghi"
Of course we can use stringr::str_split or strsplit with argument 'x[*]', but how can we just suppress regex entirely?
The argument fixed=TRUE can be useful in this instance:
strsplit(str, "x*", fixed=TRUE)[[1]]
#[1] "abc" "def" "ghi"
Since the question also mentions a stringr::str_split, a stringr way might be of help, too.
You may use str_split with fixed(<YOUR_DELIMITER_STRING_HERE>, ignore_case = FALSE) or coll(pattern, ignore_case = FALSE, locale = "en", ...). See the stringr docs:
fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.
coll: Compare strings respecting standard collation rules
See the following R demo:
> str_split(str, fixed("x*"))
[[1]]
[1] "abc" "def" "ghi"
Collations are better illustrated with a letter that can have two representations:
> x <- c("Str1\u00e1Str2", "Str3a\u0301Str4")
> str_split(x, fixed("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3áStr4" ""
> str_split(x, coll("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3" "Str4"
A note about fixed():
fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent.
...
coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale parameter.
Simply wrap the regex inside fixed() to stop it being treated as a regex inside stringr::str_split()
Example
Normally, stringr::str_split() will treat the pattern as a regular expression, meaning certain characters have special meanings, which can cause errors if those regular expressions are not valid, e.g.:
library(stringr)
str_split("abcdefg[[[klmnop", "[[[")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify, :
Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
But if we simply wrap the pattern we are splitting by inside fixed(), it treats the pattern as a string literal rather than a regular expression:
str_split("abcdefg[[[klmnop", fixed("[[["))
[[1]]
[1] "abcdefg" "klmnop"

Forced To Use mapply Is There A Workaround

I have a data.frame with a single column "Term", which contains strings of multiple words. Each term contains at least two words, with no upper limit.
From this column "Term", I would like to extract the last word and store it in a new column "Last".
# load library
library(dplyr)
library(stringi)
# read csv
df <- read.csv("filename.txt", stringsAsFactors = FALSE)
# show df
head(df)
# Term
# 1 this is for the
# 2 thank you for
# 3 the following
# 4 the fact that
# 5 the first
I have prepared a function LastWord which works well when a single string is given.
However, when a vector of strings is given, it only works on the first string in the vector. This has forced me to use mapply with mutate to add the column, as seen below.
LastWord <- function(InputWord) {
  stri_sub(InputWord, stri_locate_last(str = InputWord, fixed = " ")[1, 1] + 1, stri_length(InputWord))
}
df <- mutate(df, Last=mapply(LastWord, df$Term))
Using mapply makes the process very slow; I generally need to process around 10 to 15 million lines or terms at a time, and it takes hours.
Could anyone suggest a way to write the LastWord function so that it works on a vector rather than a single string?
You can try:
df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
# Term LastWord
# 1 this is for the the
# 2 thank you for for
# 3 the following following
# 4 the fact that that
# 5 the first first
In the gsub call, the expression between the parentheses matches anything that is not a space, one or more times (instead of [^ ]+, [a-zA-Z]+ could work too), at the end of the string ($). Because it is between parentheses, the expression is captured and can be referenced with \\1, so gsub keeps only the captured part as the replacement.
EDIT:
As @akrun mentioned in the comments, in this case sub can also be used instead of gsub.
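A minimal illustration of that note; since the pattern is anchored at the end of the string, it can match at most once, so sub gives the same result:
df$LastWord <- sub(".* ([^ ]+)$", "\\1", df$Term)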
To extract the last word only, you can use a vectorized function from stringi directly, which should be very fast:
library(stringi)
df$LastWord <- stri_extract_last_words(df$Term)
Now if you want two new columns, one containing all words but the last and another containing the last words, you can use a regular expression like
stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")
# [,1] [,2] [,3]
# [1,] "this is for the" "this is for" "the"
# [2,] "thank you for" "thank you" "for"
# [3,] "the following" "the" "following"
# [4,] "the fact that" "the fact" "that"
# [5,] "the first" "the" "first"
So what you want is
df[c("ExceptLast", "LastWord")] <-
  stri_match(df$Term, regex = "([\\w*\\s]*)\\s(\\w*)")[, 2:3]
(Note that this won't work if df$Term contains only one word. In that case you will need to modify the regular expression, depending on which column you want it to be included in.)
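As an aside on the original LastWord function: the [1,1] indexing is what discards all but the first result, since stri_locate_last returns one row per input string. A sketch of a vectorized version (LastWordVec is a hypothetical name):
library(stringi)
LastWordVec <- function(x) {
  # stri_locate_last_fixed returns a (start, end) matrix with one row per
  # element of x; keep the whole start column instead of just [1,1]
  pos <- stri_locate_last_fixed(x, " ")[, 1]
  stri_sub(x, pos + 1)  # stri_sub without 'to' runs to the end of the string
}
LastWordVec(c("this is for the", "thank you for"))
# [1] "the" "for"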

Extracting pattern substrings from a text file in R

I wish to use R to extract from a text file all unique substrings of the form "matrixname[rowname,column number]". I have had only limited success with grep and str_extract_all (stringr), in the sense that they return the entire line and not just the substring. Trying to remove the unwanted text using gsub has also been unsuccessful. Here is an example of the code I have been using.
# Read in file
txt <- read.table("Project_R_code.R")
# create a new object keeping the lines that contain this pattern
txt2 <- grep("param\\[.*1\\]", txt$V1, value = TRUE)
# remove all text that does not match the above pattern
gsub("[^param\\[.*1\\]]", "", txt2, perl = TRUE)
The second line works (but again doesn't give me only the substrings matching the pattern). However, the gsub code for removing non-matching text keeps the lines and turns them into something like this:
[200] "[p.p]param[ama1]param[ama11]*[r1]param[ama1]...
and I have no idea why. I realise this method of paring each line down into something more manageable is tedious, but it's the only way I know how to get at the patterns.
Ideally, I would like R to produce a list of all the (unique) substrings it finds in the text file that match my pattern, but I don't know the command. Any help on this is much appreciated.
If you'd like to extract individual components, try str_match:
test <- c("aaa[name1,1]", "bbb[name2,3]", "ccc[name3,3]")
stringr::str_match(test, "([a-zA-Z0-9_]+)[[]([a-zA-Z0-9_]+),.*?(\\d+)\\]")
## [,1] [,2] [,3] [,4]
## [1,] "aaa[name1,1]" "aaa" "name1" "1"
## [2,] "bbb[name2,3]" "bbb" "name2" "3"
## [3,] "ccc[name3,3]" "ccc" "name3" "3"
Otherwise, use str_extract.
Note that to match [ in ERE/TRE we use a set containing a single [ character, i.e. [[].
Moreover, if you have many matches in a single string, use str_match_all or str_extract_all.
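Applied to the question, a sketch of collecting the unique bracketed substrings; this assumes the file is read line by line with readLines (a more natural fit than read.table for source code) and uses a slightly generalized pattern that matches any bracket contents:
library(stringr)
lines <- readLines("Project_R_code.R")
# str_extract_all returns only the matching substrings, not whole lines
unique(unlist(str_extract_all(lines, "param\\[[^\\]]*\\]")))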

Extract websites links from a text in R

I have multiple texts, each of which may contain references to one or more web links. For example:
text1 = "s#1212a as www.abcd.com asasa11"
How do I extract:
"www.abcd.com"
from this text in R? In other words, I am looking to extract patterns that start with www and end with .com.
regmatches. This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)
Which gives
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"
[[2]]
[1] "www.boo.com"
[[3]]
character(0)
[[4]]
[1] "www.test.com"
Notice it returns a list. If we want a vector, we can just use unlist on the result. This is because we used gregexpr, which allows for multiple matches in each string. If we know there is at most one match, we could use regexpr instead:
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Notice, however, that this returns all results as a vector and only returns a single result from each string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method, because gsub returns the entire input if no match is found. For example, take a look:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
Shameless self-promotion: regmatches was introduced with R 2.14, so if you're stuck with an earlier version of R you might be out of luck, unless you're able to install the future2.14 package from my GitHub repo, which provides some of the functions introduced in 2.14 to earlier versions of R.
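And if you want all matches from every string flattened into a single vector, unlist the gregexpr result as mentioned above:
unlist(regmatches(text1, gregexpr(pattern, text1)))
# [1] "www.abcd.com" "www.cats.com" "www.boo.com"  "www.test.com"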
strapplyc. An alternative which gives the same result as ## above is:
library(gsubfn)
> strapplyc(text1, pattern)
The regular expression. Here is some explanation of how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the www portion
\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.
.*? The . represents any character, the * tells to match 0 or more times, and the ? following the * tells it to not be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.
\\. Once again we need to escape an actual dot character
com This part matches the ending 'com' that we want to match
Putting it all together, it says: start with "www.", then match any characters until you reach the first ".com".
Check out the gsub function:
x = "s#1212a as www.abcd.com asasa11"
gsub(x = x, pattern = ".*(www.*com).*", replacement = "\\1")
The basic idea is to surround the text you want to retain with parentheses, then replace the entire string with it. The replacement argument of gsub, "\\1", refers to what was found within the parentheses.
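Given the caveat noted earlier that gsub returns the whole input when nothing matches, a guarded sketch:
x <- c("s#1212a as www.abcd.com asasa11", "no links here")
hits <- grepl("www\\..*\\.com", x)  # detect before replacing
ifelse(hits, gsub(".*(www\\..*?\\.com).*", "\\1", x), NA)
# [1] "www.abcd.com" NA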
The solutions here are great and in base R. For those who want a quick solution, you can use qdap's genXtract. This function takes left and right boundary element(s) and extracts everything in between; setting with = TRUE includes those elements:
text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "http://www.talkstats.com/ and http://stackoverflow.com/",
           "blargwww.test.comasdf")
library(qdap)
genXtract(text1, "www.", ".com", with=TRUE)
## > genXtract(text1, "www.", ".com", with=TRUE)
## $`www. : .com1`
## [1] "www.abcd.com" "www.cats.com"
##
## $`www. : .com2`
## [1] "www.boo.com"
##
## $`www. : .com3`
## character(0)
##
## $`www. : .com4`
## [1] "www.talkstats.com"
##
## $`www. : .com5`
## [1] "www.test.com"
PS: if you look at the code for the function, it is a wrapper around Dason's solution.
