Newb regex help: string with ampersand, using R

I know this should be simple, but I cannot return a subset of characters from a string using regex in R.
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
Test <- grep(pattern=Reg, x=Foo, value=TRUE)
This captures the entire string for me and I want to capture just the R206411. The string I want to capture might vary in length and content, so the key is to have the capture begin after the '=' in propertyid=, and then end the capture once it sees the '&' in '&state_id'.
Thanks for your time.

You can use positive lookbehind and lookahead assertions, which need perl=TRUE in base R:
Foo <- 'propertyid=R206411&state_id='
Reg <- gregexpr('(?<=propertyid=).*(?=&state_id=)', Foo, perl=TRUE)
regmatches(Foo, Reg)

Well, grep doesn't play well with capture groups, which is what you are trying to use: it only reports which elements match, and with value=TRUE it returns those elements whole. What you probably want is gsub:
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
gsub(Reg, "\\1", Foo)
# [1] "R206411"
Here we take your pattern and replace the match with "\1", which stands for the first capture group (that is what the parentheses indicate). Since R requires us to escape backslashes in strings, we double the backslash. Because your pattern matches the whole string, the whole string is replaced with just the captured portion.
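The "match the whole string" part matters. As a minimal sketch, assuming a hypothetical URL (not from the question) with extra text on both sides of the target, the same pattern leaves that extra text in place unless it is padded with .* on both ends:

```r
# Hypothetical input: the target sits in the middle of a longer string.
url <- "http://example.com/page?propertyid=R206411&state_id=5"
pattern <- "propertyid=(.*)&state_id="

# Only the matched part is replaced; text outside the match is kept.
partial <- sub(pattern, "\\1", url)

# Padding with .* makes the match cover the whole string,
# so only the capture group survives the replacement.
full <- sub(paste0(".*", pattern, ".*"), "\\1", url)

print(partial)  # "http://example.com/page?R2064115"
print(full)     # "R206411"
```

sub is enough here because only one replacement is wanted.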

The strapplyc function in the gsubfn package can do exactly that. Using Foo and Reg from the question:
> library(gsubfn)
>
> strapplyc(Foo, Reg, simplify = TRUE)
[1] "R206411"

Related

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the matched text. To extract part of the input string with a capturing group and a backreference, the pattern therefore has to match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
Note that gsub replaces every occurrence, but you only need a single one, so it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
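To see the difference side by side, here is a small sketch that runs the question's original pattern and the corrected one on the not-at-end input. The names old_p and new_p, and the multi-digit input, are made up for this illustration:

```r
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"

old_p <- ".*&AppliedWhenID=(.\\d*)"    # stops right after the captured value
new_p <- ".*&AppliedWhenID=(\\d+).*"   # also consumes the rest of the string

# The old pattern leaves the unmatched tail in the result.
old_result <- sub(old_p, "\\1", testNotAtEnd)
# The new pattern consumes everything, so only the group remains.
new_result <- sub(new_p, "\\1", testNotAtEnd)

# Hypothetical multi-digit value, to check that \\d+ takes all the digits.
multi <- sub(new_p, "\\1", "a=1&AppliedWhenID=42&b=3")

print(old_result)  # "2&AgDateTypeID=1"
print(new_result)  # "2"
print(multi)       # "42"
```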
You could try using a positive lookbehind assertion, "(?<=...)", with str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)
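If some inputs might not contain the parameter at all, a base R variant with regexpr and regmatches can return NA for those instead of echoing the input back, which is what gsub does on a non-match. This is only a sketch: extract_param is a hypothetical helper name, and it assumes the parameter name contains no regex metacharacters.

```r
# Hypothetical helper: value of a numeric query parameter, or NA if absent.
# Assumes `name` is a plain literal (no regex metacharacters).
extract_param <- function(x, name) {
  pattern <- paste0("(?<=", name, "=)\\d+")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches drops non-matches
  out
}

inputs <- c(
  "ReportType=233&AppliedWhenID=2",
  "ReportType=233&AppliedWhenID=2&AgDateTypeID=1",
  "ReportType=233"
)
result <- extract_param(inputs, "AppliedWhenID")
print(result)
```

The first two elements come back as "2" and the third as NA.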

Capturing just "abc-def-ghi" in the string "utm_campaign=abc-def-ghi"

I am new to regex. As per the title, I would like to capture abc-def-ghi in the string utm_campaign=abc-def-ghi. The string is usually embedded in a URL. Using the pattern (utm_campaign=[a-zA-Z0-9_-]+) I can match the entire string, but I really just want the second part, which is abc-def-ghi. Is there an efficient way to do this in regex? The preferred language for this question is R.
Another option: gsub
> string <- "utm_campaign=abc-def-ghi"
> gsub(".*=(\\w*)", "\\1", string)
[1] "abc-def-ghi"
A positive lookbehind also works:
(?<=utm_campaign=)[\w-]+
(?<=utm_campaign=) Positive lookbehind ensuring what precedes matches utm_campaign= literally
[\w-]+ Match any word character (a-zA-Z0-9_) or hyphen character one or more times
In R:
x <- "utm_campaign=abc-def-ghi"
m <- regexpr("(?<=utm_campaign=)[\\w-]+", x, perl=TRUE)
regmatches(x, m)
Result: abc-def-ghi
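Since the question says the string is usually embedded in a URL, here is a rough comparison on a made-up URL (the URL and variable names are assumptions for illustration). The generic gsub pattern keys on the last "=" in the string, while the lookbehind keys on the parameter name:

```r
# Hypothetical URL with parameters before and after utm_campaign.
url <- "https://example.com/?utm_source=news&utm_campaign=abc-def-ghi&utm_medium=email"

# Greedy .*= runs to the LAST "=", so this returns the last value instead.
greedy <- gsub(".*=(\\w*)", "\\1", url)

# The lookbehind anchors on the parameter name, wherever it appears.
m <- regexpr("(?<=utm_campaign=)[\\w-]+", url, perl = TRUE)
campaign <- regmatches(url, m)

print(greedy)    # "email"
print(campaign)  # "abc-def-ghi"
```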

R: How to extract specific digits from a string?

I want to retrieve the first Numbers (here -> 344002) from a string:
string <- '<a href="/Archiv-Suche/!344002&amp;s=&amp;SuchRahmen=Print/" ratiourl-ressource="344002"'
I am preferably looking for a regular expression, which looks for the Numbers after the ! and before the &amp.
All I came up with is this but this catches the ! as well (!344002):
regmatches(string, gregexpr("\\!([[:digit:]]+)", string, perl =TRUE))
Any ideas?
Use this regex:
(?<=\!)\d+(?=&amp)
Use this code:
regmatches(string, gregexpr("(?<=!)\\d+(?=&amp)", string, perl=TRUE))
(?<=\!) is a lookbehind, the match will start following !
\d+ matches one digit or more
(?=&amp) stops the match if next characters are &amp
library(gsubfn)
strapplyc(string, "!(\\d+)")[[1]]
[Old answer]
Test this code.
library(stringr)
str_extract(string, "[0-9]+")
A similar question and answer is available under the title "Extract a regular expression match in R version 2.10".
You may capture the digits (\d+) in between ! and &amp and get it with regexec/regmatches:
> string <- '<a href="/Archiv-Suche/!344002&amp;s=&amp;SuchRahmen=Print/" ratiourl-ressource="344002"'
> pattern = "!(\\d+)&"
> res <- unlist(regmatches(string,regexec(pattern,string)))
> res[2]
[1] "344002"

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, meaning I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
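One thing to keep in mind with this pattern: [^=]*= stops at the first equal sign. As a sketch on a hypothetical string with two parameters (not part of the question's data), swapping in a greedy .*= keys on the last equal sign instead:

```r
# Hypothetical input containing two "=" signs.
s <- "/players/playerpage.htm?team=NYK&ilkid=THOMPWIL01"

# [^=]*= removes everything up to and including the FIRST "=".
after_first <- sub("[^=]*=", "", s)

# Greedy .*= removes everything up to and including the LAST "=".
after_last <- sub(".*=", "", s)

print(after_first)  # "NYK&ilkid=THOMPWIL01"
print(after_last)   # "THOMPWIL01"
```

For the vector in the question, where each string has a single "=", both patterns give the same result.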
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU-based regex engine supports look-behind directly (in base R, look-behind needs perl=TRUE).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise class if you are confident about the format of what you match. For example, '\w' matches any letter, digit, or underscore, so you could use "(?<==)\\w+$" (the \ must be escaped in an R string, so you end up with \\w).

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and makes it optional (a plain . would fail on the case without the leading P). The () signify a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference them too, with '\\2' and so on. Inside the group (and you had this part of the syntax correct) I want only digits, and exactly 4 of them. The final piece, .*, matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
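Note that .? only absorbs one leading character. As a rough sketch, with a made-up second element that has a two-letter prefix, extracting with regexpr and regmatches avoids having to describe the surrounding text at all:

```r
# First element is from the question; the second is a hypothetical
# input with a two-letter prefix.
x2 <- c("P2134.asfsafasfs", "AB1234.x")

# .? absorbs at most one leading character, so the "A" is left behind.
via_sub <- sub(".?([0-9]{4}).*", "\\1", x2)

# Extracting the match directly returns just the first run of four digits.
via_extract <- regmatches(x2, regexpr("[0-9]{4}", x2))

print(via_sub)      # "2134"  "A1234"
print(via_extract)  # "2134"  "1234"
```

regmatches silently drops elements with no match, so the result can be shorter than the input.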
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the line above into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps only the positions of the first four digits in each x
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
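The same "first four digits, wherever they appear" result can probably be reached more simply by deleting every non-digit and then taking a substring. This is an alternative sketch, not the approach shown above, and the last input element is made up to show digits scattered through a string:

```r
# Question's data plus one hypothetical element with scattered digits.
x3 <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds", "a1b2c3d4e5")

# Drop every character that is not a digit...
digits_only <- gsub("[^0-9]", "", x3)
# ...then keep the first four of what is left.
first_four <- substr(digits_only, 1, 4)

print(first_four)  # "2134" "0983" "8723" "1234"
```

Unlike the mapply version, a string with fewer than four digits just yields a shorter result here.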
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
