Modify regex to exclude characters that occur at beginning - r

Using the code below, I'm extracting a generated HTML link:
mystr <- c("/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq")
str_extract(mystr, "^.*(?=(&sa))")
This returns:
[1] "/url?q=http://www.mypage.html"
How can I modify the regex to exclude /url?q=, so that just http://www.mypage.html is returned?

You can replace the start-of-string anchor (i.e. ^) with the literal http, so that the match starts there:
stringr::str_extract(mystr, "http.*(?=(&sa))")
#[1] "http://www.mypage.html"

You may also use a base R sub solution that matches up to the first http and captures it together with any chars other than & that follow:
sub(".*?(http[^&]*).*", "\\1", mystr)
You may make the pattern more precise by adding q= after .*? so that http is only matched right after q=.
Details
.*? - any 0+ chars as few as possible,
(http[^&]*) - capturing group #1 matching http and then any zero or more chars other than &
.* - the rest of the string.
The \1 is a replacement backreference to the Group 1 value.
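As a quick check of the q= refinement, here is a minimal sketch in base R. Passing perl = TRUE is an assumption on my part, chosen so that the lazy .*? is handled by PCRE; the name res is only illustrative.

```r
mystr <- "/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq"

# Require q= right before http, then capture everything up to the next &
res <- sub(".*?q=(http[^&]*).*", "\\1", mystr, perl = TRUE)
print(res)
# expected: "http://www.mypage.html"
```

Because sub returns its input unchanged when nothing matches, a string without q=http would come back as is.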

Related

Remove all dots but first in a string using R

I have some errors in some numbers, which show up like "59.34343.23". I know the first dot is correct, but the second one (or any after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However it gets the following error "invalid regular expression". But I have been trying the same code in a regex tester and it works.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): any zero or more chars from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
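To see how the alternation behaves on inputs with different numbers of dots, here is a small sketch. The extra test values and the name cleaned are illustrative assumptions, not part of the original question.

```r
# Inputs with two dots, three dots, one dot and no dot
x <- c("59.34343.23", "1.2.3.4", "7.5", "42")

# First alternative keeps the text up to and including the first dot (via \\1);
# second alternative matches any later dot, where group 1 is empty
cleaned <- gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
print(cleaned)
# expected values: "59.3434323", "1.234", "7.5", "42"
```

Strings with a single dot or none are left unchanged, which is what makes this pattern safe to apply to a whole column.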
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
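Here \K resets the start of the reported match, so everything matched before it stays in the string and only the character after it is removed. Note that the . after \K is unescaped and matches any character, so this pattern assumes the string contains at least two dots.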
Another option is to write the dot back only if it is the first one in the line.
The key is to consume the other dots without writing them back.
The effect is to delete every dot after the first.
The pattern below uses a branch reset group to accomplish this (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
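The branch reset pattern above is written for a generic PCRE tester with a $1 replacement. A minimal sketch of how it might be called from R follows, assuming gsub with perl = TRUE and the \\1 replacement syntax; the two-line input is a made-up example to show the effect of (?m).

```r
# One single-line value and one hypothetical two-line value
x <- c("59.34343.23", "1.2.3\n4.5.6")

# Group 1 is either "line start up to the first dot" or empty (branch reset),
# so writing back \\1 keeps the first dot of each line and drops the rest
res <- gsub("(?m)(?|(^[^.\\n]*\\.)|()\\.+)", "\\1", x, perl = TRUE)
print(res)
# expected values: "59.3434323" and "1.23\n4.56"
```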
The pattern that you tried is rejected because the lookbehind (?<=\\..*) contains an unbounded quantifier, which is not supported.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match zero or more chars other than ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Match zero or more chars other than .
\K\. Clear the match buffer and match the dot (to be removed)
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"

R: How to use stringr to extract the substring as the output to mutate a column of strings that begins with a string pattern and end with a number?

I'm creating a small example to be put into mutate(). Not sure why this doesn't work.
> str_extract("rs1234-<b>C</b>","^rs*\\d$")
[1] NA
It'd be great if you could point out my misunderstanding of the language instead of merely providing a solution. I expect to get "rs1234".
The ^rs*\d$ regex matches
^ - start of string
rs* - r and zero or more occurrences of s char
\d - a digit
$ - end of string.
So, your pattern matches strings like rsssss1, r3, etc.
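The breakdown above can be checked directly with grepl. This is a small sketch in base R; the candidate strings besides the one from the question are illustrative.

```r
candidates <- c("rsssss1", "r3", "rs1234", "rs1234-<b>C</b>")

# ^rs*\d$ : r, any number of s, exactly one digit, then end of string
hits <- grepl("^rs*\\d$", candidates, perl = TRUE)
print(setNames(hits, candidates))
# expected: TRUE, TRUE, FALSE, FALSE
```

"rs1234" fails because only one digit is allowed before the end of the string, and the string from the question fails for the same reason plus the trailing markup.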
You need
str_extract("rs1234-<b>C</b>", "^rs\\d+")
where ^rs\d+ matches rs at the start of string and then one or more digits.
But what if I just want the substring between "rs" and the last number? What should I do?
You would use rs.*\d:
str_extract("rs1234-<b>C</b>", "rs.*\\d")
where rs.*\d matches rs, then any zero or more chars other than line break chars as many as possible and then a digit.
NOTE: If you need to match line endings, too, you need to prepend the (?s) inline DOTALL modifier to the last pattern.
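The difference between the two patterns only shows up when a digit appears later in the string. Here is a minimal sketch using base R's regexpr and regmatches instead of stringr, so that it runs without extra packages; the input with a second digit is a hypothetical variation of the one in the question.

```r
x <- "rs1234-<b>C2</b>"  # hypothetical: a digit also appears inside the tag

# rs followed by the first run of digits only
first_run <- regmatches(x, regexpr("^rs\\d+", x, perl = TRUE))

# greedy .* runs to the end and backtracks to the last digit in the string
up_to_last <- regmatches(x, regexpr("rs.*\\d", x, perl = TRUE))

print(first_run)   # expected: "rs1234"
print(up_to_last)  # expected: "rs1234-<b>C2"
```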

Regex UTM google

I'm trying to extract a UTM parameter from a Google link using R, but my regex doesn't seem to work properly.
Here is an example of a Google link:
xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco
I tried the following regex to extract TOTO:
.+&utm_campaign=([[a-z]]+)&.+
with no success.
If someone can help, thanks!
In your pattern, [[a-z]]+ is a malformed bracket expression, because it matches any char from the [[a-z] bracket expression (any lowercase ASCII letter or [) and then matches one or more ] chars. You meant to use single [ and ] here.
You may use sub with the following regex:
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
Details
.* - any 0+ chars, as many as possible
[&?] - a ? or &
utm_campaign= - a literal substring
([^&]+) - Capturing group 1: one or more chars other than & chars
.* - any 0+ chars, as many as possible
The \1 is the replacement backreference that puts the contents of Group 1 into the result.
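One thing to keep in mind with the sub approach is that sub returns the input unchanged when the pattern does not match. If a missing parameter should give NA instead, a sketch along these lines may help. The helper name extract_utm_campaign and the extra URLs are hypothetical, and it relies on base R's regexec and regmatches.

```r
# Hypothetical helper: returns the utm_campaign value, or NA when it is absent
extract_utm_campaign <- function(url) {
  m <- regmatches(url, regexec("[&?]utm_campaign=([^&]+)", url))
  # each element is c(full match, group 1), or character(0) on no match
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

urls <- c("xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco",
          "xxx/yyy?utm_campaign=LAST",
          "xxx/yyy?utm_medium=display")
campaigns <- extract_utm_campaign(urls)
print(campaigns)
# expected values: "TOTO", "LAST", NA
```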
See the R demo:
s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
## => [1] "TOTO"
You could use:
(?:&utm_campaign=)(\w+)
and use the first group captured
Here's a regex string that'll match the value of a utm_campaign parameter, regardless of its position in the query string.
(?<TOTO>(?<=utm_campaign=).*?(?=&|$))
Explanation:
?<TOTO> captures the result into a TOTO key after the regex is executed
(?<=utm_campaign=) is a look-behind that will ensure that the value is preceded by utm_campaign=
.*? will find the parameter value (i.e. TOTO). The ? makes the quantifier lazy: it matches as few chars as possible and stops as soon as the next rule is matched (see point below)
(?=&|$) is a look-ahead that will match either an & or the end of the string (in the case that utm_campaign is the last parameter)
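In R, lookarounds need perl = TRUE. The sketch below drops the named group, since regmatches returns the whole match anyway; the second URL is a made-up case where utm_campaign is the last parameter.

```r
s <- c("xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco",
       "xxx/yyy?utm_campaign=LAST")

# Look-behind pins the start, lazy .*? stops at the first & or at end of string
vals <- regmatches(s, regexpr("(?<=utm_campaign=).*?(?=&|$)", s, perl = TRUE))
print(vals)
# expected values: "TOTO", "LAST"
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input vector.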
You are searching for [[a-z]]+ however TOTO is uppercase, so not between 'a' and 'z'. You can update it to [[A-Za-z]]+ to match any case letter.
EDIT: [[A-Za-z]]+ will match any case letter, but will also match any '[' or ']' characters. If you do not wish to match these then you can change it to [A-Za-z]+ to only match any case letters

Remove characters before first and after second underscore extracting string between first and second underscore

I am using
substr(gsub(".*_","",ldf[[j]]),1,nchar(gsub(".*_","",ldf[[j]]))-4)
to create a path and filename to write to. It works fine for names in ldf that only have one underscore. For a filename with another underscore further back, it cuts off everything in front of the second underscore.
I have for example:
Arof_07122016_2.csv and I want 07122016, but I get 2. I don't understand why this is happening. How can I use this line to only cut off the characters in front of the first underscore and keep the second one?
It seems you want
sub("^[^_]*_([^_]*).*", "\\1", ldf[[j]])
The pattern matches
^ - start of string
[^_]* - 0+ chars other than _
_ - an underscore
([^_]*) - Capturing group #1: any 0+ chars other than _
.* - the rest of the string.
The \1 in the replacement pattern only keeps the captured value in the result.
R demo:
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")
sub("^[^_]*_([^_]*).*", "\\1", v)
# => [1] "07122016" "99999"
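If a regex is not required, the same field can be picked out by splitting on the underscore. This is an alternative sketch, not the approach from the answer above; the names parts and second are illustrative.

```r
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")

# Split on the literal underscore and take the second piece of each name
parts <- strsplit(v, "_", fixed = TRUE)
second <- vapply(parts, function(p) p[2], character(1))
print(second)
# expected values: "07122016", "99999"
```

A name without any underscore would yield NA here, whereas the sub call would return the name unchanged.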
Regular expression repetition is greedy by default, as explained in ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ? to
the quantifier. (There are further quantifiers that allow approximate
matching: see the TRE documentation.)
So you should use the pattern ".*?_". However, gsub will make multiple matches so you end up with the same result. To remedy this use sub which will only make 1 match or specify that you want to match at the start of the string by using ^ in the regex.
sub(".*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
gsub("^.*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
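The three behaviours described above can be compared side by side. This sketch passes perl = TRUE, which is my assumption for predictable lazy matching; the variable names are illustrative.

```r
x <- "Arof_07122016_2.csv"

once     <- sub(".*?_", "", x, perl = TRUE)    # removes only "Arof_"
every    <- gsub(".*?_", "", x, perl = TRUE)   # removes "Arof_" and then "07122016_"
anchored <- gsub("^.*?_", "", x, perl = TRUE)  # ^ allows a match at the start only

print(c(once = once, every = every, anchored = anchored))
# expected values: "07122016_2.csv", "2.csv", "07122016_2.csv"
```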

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for your expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
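Here is a short sketch running both suggestions on the sample string from the question. The third call, with only the short alternative, is an illustrative addition to show why the A...mB.. alternative is needed.

```r
string <- "B04 + B02 - A10mB03"

# Whole pattern wrapped in a capturing group and re-used as \\1
full <- gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)

# Any run of word chars
words <- gsub("(\\w+)", "raster(\\1)", string)

# Only the short alternative: wraps the B03 inside A10mB03
short_only <- gsub("(B[0-1][0-9])", "raster(\\1)", string)

print(full)        # expected: "raster(B04) + raster(B02) - raster(A10mB03)"
print(short_only)  # expected: "raster(B04) + raster(B02) - A10mraster(B03)"
```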
