R: How to extract specific digits from a string?


I want to retrieve the first number (here 344002) from a string:
string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'
I would prefer a regular expression that matches the digits after the ! and before the &.
All I came up with is the following, but it captures the ! as well (!344002):
regmatches(string, gregexpr("\\!([[:digit:]]+)", string, perl = TRUE))
Any ideas?

Use this regex:
(?<=!)\d+(?=&)
Use this code (inside an R string literal the backslash must be doubled):
regmatches(string, gregexpr("(?<=!)\\d+(?=&)", string, perl=TRUE))
(?<=!) is a lookbehind: the match starts right after the !
\d+ matches one or more digits
(?=&) is a lookahead: the match stops where the next character is &
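A minimal sketch of the lookaround approach, assuming the string exactly as shown in the question (where the digits are followed by a plain &). Lookarounds are a PCRE feature, so perl = TRUE is required, and regexpr is used here because only the first match is wanted.

```r
# String from the question; the digits are followed by a plain "&" here.
string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'

# (?<=!) and (?=&) assert the surrounding context without consuming it,
# so only the digits end up in the match.
m <- regexpr("(?<=!)\\d+(?=&)", string, perl = TRUE)
digits <- regmatches(string, m)
digits
# [1] "344002"
```

The second 344002 in the string is not returned because it is preceded by a quote rather than a !.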

library(gsubfn)
strapplyc(string, "!(\\d+)")[[1]]
Old answer:
library(stringr)
str_extract(string, "[0-9]+")
A similar question and answer: Extract a regular expression match in R version 2.10

You may capture the digits (\d+) between ! and & and extract them with regexec/regmatches:
> string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'
> pattern = "!(\\d+)&"
> res <- unlist(regmatches(string,regexec(pattern,string)))
> res[2]
[1] "344002"
See the online R demo
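The same regexec/regmatches idea can be applied to a whole vector. This is only a sketch: the links vector below is made up for illustration, and returning NA for non-matching elements is one possible convention.

```r
# Hypothetical inputs: two links in the question's format and one without an id.
links <- c('<a href="/Archiv-Suche/!344002&s=">',
           '<a href="/Archiv-Suche/!17&s=">',
           '<a href="/Archiv-Suche/">')

# regexec() reports the full match plus one entry per capture group;
# regmatches() yields character(0) for elements that did not match.
hits <- regmatches(links, regexec("!(\\d+)&", links))

# Keep the first capture group (element 2), or NA when nothing matched.
ids <- vapply(hits,
              function(h) if (length(h) >= 2) h[2] else NA_character_,
              character(1), USE.NAMES = FALSE)
ids
# [1] "344002" "17"     NA
```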

Related

substitute string when there is a dot + number + ':'

I have strings that look like these:
> ABCD.1:f_HJK
> ABFD.1:f_HTK
> CJD:f_HRK
> QQYP.2:f_HDP
So I always have a string in the first part, optionally followed by a . and a number, and after that there is always a ':' and another string.
I would like to remove the '. + number' part when it is present, using R.
I know that regular expressions could be useful, but I have no idea how to apply them in this context. I know that I can substitute the '.' with gsub, but I don't know how to include the number and the ':'.
Thank you for your help.

Does this work:
v <- c('ABCD.1:f_HJK','ABFD.1:f_HTK','CJD:f_HRK','QQYP.2:f_HDP')
v
[1] "ABCD.1:f_HJK" "ABFD.1:f_HTK" "CJD:f_HRK" "QQYP.2:f_HDP"
gsub('([A-Z]{,4})(\\.\\d)?(:.*)','\\1\\3',v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"

You could also use any of the following depending on the structure of your string
If no other period and numbers in the string
sub("\\.\\d+", "", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If you are only interested in the first pattern matched.
sub("^([A-Z]+)\\.\\d+:", "\\1:", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
Same as above with perl = TRUE: \\K discards what has been matched so far, so no capture group is needed.
sub("^[A-Z]+\\K\\.\\d+", "", v, perl = TRUE)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
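To see why the anchored variants can matter, here is a small sketch with a made-up vector v2 in which a period and digit also appear after the colon. The input is an assumption for illustration, not data from the question.

```r
# Hypothetical input: the second element has ".2" only after the colon.
v2 <- c("ABCD.1:f_HJK", "CJD:f_H.2K")

# Unanchored: removes the first ".digits" anywhere, including in the suffix.
loose <- sub("\\.\\d+", "", v2)

# Anchored: only removes ".digits" between the leading letters and the ":".
strict <- sub("^([A-Z]+)\\.\\d+:", "\\1:", v2)

loose
# [1] "ABCD:f_HJK" "CJD:f_H2K"
strict
# [1] "ABCD:f_HJK" "CJD:f_H.2K"
```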

If I understood your explanation correctly, this should do the trick:
gsub("(\\.\\d+)", "", string)

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.

You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
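The greediness of .* is what makes this pick the last pipe. As a rough comparison (the second pattern is added here for contrast and is not part of the answer above), a negated character class anchored at the start stops at the first pipe instead:

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# Greedy .* runs as far as it can, so the \\| matches the LAST pipe.
after_last <- sub(".*\\|", "", x)

# [^|]* cannot cross a pipe, so anchored at ^ it stops at the FIRST pipe.
after_first <- sub("^[^|]*\\|", "", x)

after_last
# [1] "Apodemia_mejicanus"
after_first
# [1] "b|c|BLCU142-09|Apodemia_mejicanus"
```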

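We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and remove that substring with str_remove. Note that this removes the text up to the first |, which is the same as the last | when the string contains only one.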
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If [A-Z] is also part of the character class, every uppercase letter is matched and replaced with an empty string (""), which is what happened in the OP's str_replace_all.
data
str1 <- "BLCU142-09|Apodemia_mejicanus"

You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.

I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
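Note that regexpr finds the first pipe only. If the string may contain several pipes, one possible variation (a sketch with a made-up input, assuming at least one pipe is present) is to take the last position reported by gregexpr:

```r
# Hypothetical multi-pipe input.
my_string <- "a|b|BLCU142-09|Apodemia_mejicanus"

# gregexpr() returns the position of every literal "|";
# fixed = TRUE means the pattern is not treated as a regex.
pipes <- gregexpr("|", my_string, fixed = TRUE)[[1]]

# Everything after the last pipe.
last_field <- substring(my_string, max(pipes) + 1L)
last_field
# [1] "Apodemia_mejicanus"
```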

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A or a B, or if it is not preceded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function, but I don't know how to add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- "0000"
replacement <- str_replace_all(string, pattern, "XXXX")
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when the preceding character is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
See the regex demo
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
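As an alternative sketch (not from the answer above), the same condition can be written as a negative lookbehind, which avoids the capture group and the backreference. It needs perl = TRUE, and reads as "not preceded by a character other than A or B", which also holds at the start of the string.

```r
string <- "0000A0000B0000C0000D0000E0000A0000"

# Capture-group version from the answer, kept for comparison.
with_group <- gsub("(^|[AB])0000", "\\1XXXX", string)

# Lookbehind version: (?<![^AB]) succeeds after A, after B, or at the
# very start, where there is no preceding character at all.
with_lookbehind <- gsub("(?<![^AB])0000", "XXXX", string, perl = TRUE)

with_lookbehind
# [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
```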

Could you please try the following? It uses a positive lookahead.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"

Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

Capturing just "abc-def-ghi" in the string "utm_campaign=abc-def-ghi"

I am new to regex. As per the title, I would like to capture abc-def-ghi in the string utm_campaign=abc-def-ghi. The string is usually embedded in a URL. Using the pattern (utm_campaign=[a-zA-Z0-9_-]+) I can match the entire string, but I really just want the second part, which is abc-def-ghi. Is there an efficient way to do this in regex? The preferred language for this question is R.

Another option: gsub
> string <- "utm_campaign=abc-def-ghi"
> gsub(".*=(\\w*)", "\\1", string)
[1] "abc-def-ghi"

See regex in use here
(?<=utm_campaign=)[\w-]+
(?<=utm_campaign=) Positive lookbehind ensuring what precedes matches utm_campaign= literally
[\w-]+ Match any word character (a-zA-Z0-9_) or hyphen character one or more times
See code in use here
x <- "utm_campaign=abc-def-ghi"
m <- regexpr("(?<=utm_campaign=)[\\w-]+", x, perl=TRUE)
regmatches(x, m)
Result: abc-def-ghi
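Since the question mentions that the parameter is usually embedded in a URL, here is a sketch with a made-up URL (the address and the other parameters are assumptions for illustration). The character class [\w-] does not include &, so the match ends before the next parameter.

```r
# Hypothetical URL with the campaign parameter in the middle.
url <- "https://example.com/page?utm_source=news&utm_campaign=abc-def-ghi&utm_medium=email"

# The lookbehind pins the match to the text right after "utm_campaign=".
m <- regexpr("(?<=utm_campaign=)[\\w-]+", url, perl = TRUE)
campaign <- regmatches(url, m)
campaign
# [1] "abc-def-ghi"
```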

Newb regex help: string with ampersand, using R

I know this should be simple, but I cannot return a subset of characters from a string using regex in R.
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
Test <- grep(pattern=Reg, x=Foo, value=TRUE)
This captures the entire string for me and I want to capture just the R206411. The string I want to capture might vary in length and content, so the key is to have the capture begin after the '=' in propertyid=, and then end the capture once it sees the '&' in '&state_id'.
Thanks for your time.

You have to use positive lookbehind and lookahead assertions like this:
Foo <- 'propertyid=R206411&state_id='
Reg <- gregexpr('(?<=propertyid=).*(?=&state_id=)', Foo, perl=TRUE)
regmatches(Foo, Reg)

Well, grep doesn't play well with captured groups which is what you are trying to do. What you probably want is gsub
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
gsub(Reg, "\\1", Foo)
# [1] "R206411"
Here we take your pattern and replace the match with "\\1" (R requires backslashes to be escaped in string literals, hence the doubled slash), which stands for the first capture group (the part in parentheses). Since the pattern matches the whole string, the whole string is replaced with just the captured portion.
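One caveat with this replace-the-whole-match trick is that strings which do not match come back unchanged. A small sketch, using a made-up second element and NA as one possible way to flag non-matches:

```r
# Hypothetical inputs: the second element does not match the pattern.
Foo2 <- c("propertyid=R206411&state_id=", "no property here")
Reg <- "propertyid=(.*)&state_id="

# sub() leaves non-matching elements untouched.
replaced <- sub(Reg, "\\1", Foo2)

# Guard with grepl() so non-matching elements become NA instead.
ids <- ifelse(grepl(Reg, Foo2), sub(Reg, "\\1", Foo2), NA_character_)

replaced
# [1] "R206411"          "no property here"
ids
# [1] "R206411" NA
```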

The strapplyc function in the gsubfn package can do exactly that. Using Foo and Reg from the question:
> library(gsubfn)
>
> strapplyc(Foo, Reg, simplify = TRUE)
[1] "R206411"
