Remove characters before first and after second underscore extracting string between first and second underscore - r

I am using
substr(gsub(".*_","",ldf[[j]]),1,nchar(gsub(".*_","",ldf[[j]]))-4)
to create a path and filename to write to. It works fine for names in ldf that have only one underscore. When a filename has another underscore further along, everything in front of that second underscore is cut off.
I have for example:
Arof_07122016_2.csv and I want 07122016, but I get 2. I don't understand why this is happening. How can I change this line so that it only cuts off the characters in front of the first underscore and keeps the second one?

It seems you want
sub("^[^_]*_([^_]*).*", "\\1", ldf[[j]])
The pattern matches
^ - start of string
[^_]* - 0+ chars other than _
_ - an underscore
([^_]*) - Capturing group #1: any 0+ chars other than _
.* - the rest of the string.
The \1 in the replacement pattern only keeps the captured value in the result.
R demo:
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")
sub("^[^_]*_([^_]*).*", "\\1", v)
# => [1] "07122016" "99999"
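The question mentions that some names contain only one underscore, in which case the pattern above would leave the file extension in the result. A minimal sketch that strips the extension first; the helper name between_underscores is hypothetical, and tools::file_path_sans_ext() is assumed to be an acceptable way to drop the extension:

```r
# Hypothetical helper (not from the original answer): returns the text
# between the first and second underscore, or everything after the first
# underscore when there is no second one.
between_underscores <- function(x) {
  stem <- tools::file_path_sans_ext(x)   # drop ".csv" and similar extensions
  sub("^[^_]*_([^_]*).*", "\\1", stem)
}

files <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv", "Single_0101.csv")
between_underscores(files)
# => "07122016" "99999" "0101"
```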

Regular expression repetition is greedy by default, as explained in ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ? to
the quantifier. (There are further quantifiers that allow approximate
matching: see the TRE documentation.)
So you should use the lazy pattern ".*?_". However, gsub replaces every match, so you end up with the same result as before. To remedy this, use sub, which replaces only the first match, or anchor the pattern at the start of the string with ^.
sub(".*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
gsub("^.*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
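The four combinations of greedy/lazy and sub/gsub can be compared side by side. This is only an illustrative sketch; perl = TRUE is an assumption added here so that the lazy quantifier follows PCRE semantics, whereas the calls above use the default engine:

```r
x <- "Arof_07122016_2.csv"

# Greedy: .* runs to the LAST underscore, so only "2.csv" is left.
greedy_sub  <- sub(".*_", "", x, perl = TRUE)

# Lazy: .*? stops at the FIRST underscore.
lazy_sub    <- sub(".*?_", "", x, perl = TRUE)

# Lazy but global: the second match removes "07122016_" as well.
lazy_gsub   <- gsub(".*?_", "", x, perl = TRUE)

# Anchored lazy: ^ can only match once, so gsub behaves like sub here.
anchor_gsub <- gsub("^.*?_", "", x, perl = TRUE)

c(greedy_sub, lazy_sub, lazy_gsub, anchor_gsub)
# => "2.csv" "07122016_2.csv" "2.csv" "07122016_2.csv"
```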

Related

Remove all dots but first in a string using R

Some of my numbers contain errors and look like "59.34343.23". I know the first dot is correct, but the second one (or any dot after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However, both calls fail with the error "invalid regular expression", although the same patterns work in a regex tester.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): any zero or more chars from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
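To see how the alternation behaves on other inputs, here is a small sketch; the wrapper name remove_extra_dots and the extra test strings are made up for illustration:

```r
# Hypothetical wrapper around the gsub call above.
remove_extra_dots <- function(x) {
  # Branch 1 re-inserts the text up to and including the first dot (\\1);
  # branch 2 matches any later dot, where \\1 is empty, so the dot vanishes.
  gsub("^([^.]*\\.)|\\.", "\\1", x)
}

remove_extra_dots(c("59.34343.23", "1.2.3.4", "59.34", "100"))
# => "59.3434323" "1.234" "59.34" "100"
```

Strings with a single dot or no dot at all pass through unchanged.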
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
\K resets the starting point of the reported match, so the text consumed before it is kept and only the character matched after it is removed. Note that this pattern assumes the string contains at least two dots.
There is always the option to write the dot back only if it is the first one in the line.
The key is to consume the other dots without writing them back.
The effect is to delete every dot after the first.
Below uses a branch reset to accomplish the goal (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace with $1 (written as "\\1" in an R replacement string)
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
The pattern that you tried does not work because the lookbehind (?<=\\..*) contains an unbounded quantifier, which is not supported.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non-capturing group for the alternation |
^[^.]*\. Start of string, match zero or more chars other than ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start of the string)
)[^.]* Close the group, then match zero or more chars other than .
\K\. Clear the match buffer and match the dot (to be removed)
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"

R: How to use stringr to extract the substring as the output to mutate a column of strings that begins with a string pattern and end with a number?

I'm creating a small example to be put into mutate(). Not sure why this doesn't work.
> str_extract("rs1234-<b>C</b>","^rs*\\d$")
[1] NA
It'd be great if you could point out my misunderstanding of the language instead of merely providing a solution. I expect to get "rs1234".
The ^rs*\d$ regex matches
^ - start of string
rs* - r and zero or more occurrences of s char
\d - a digit
$ - end of string.
So, your pattern matches strings like rsssss1, r3, etc.
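A quick check of that reading, using base R grepl with perl = TRUE (the candidate strings are made up for illustration):

```r
pattern <- "^rs*\\d$"
candidates <- c("rsssss1", "r3", "rs1234", "rs1234-<b>C</b>")

# Only strings of the form: r, any number of s, exactly one digit, end.
hits <- grepl(pattern, candidates, perl = TRUE)
hits
# => TRUE TRUE FALSE FALSE
```

"rs1234" fails because \d matches a single digit and $ then requires the end of the string.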
You need
str_extract("rs1234-<b>C</b>", "^rs\\d+")
where ^rs\d+ matches rs at the start of string and then one or more digits.
But if I just want the substring in between "rs" and the last number. What should I do?
You would use rs.*\d:
str_extract("rs1234-<b>C</b>", "rs.*\\d")
where rs.*\d matches rs, then any zero or more chars other than line break chars as many as possible and then a digit.
NOTE: If you need to match line endings, too, you need to prepend the last pattern with (?s) inline DOTALL modifier.
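For readers without stringr, the two patterns can be compared with a base R sketch. The helper extract_first is hypothetical and only approximates str_extract (first match per element, NA when nothing matches); the second and third test strings are invented:

```r
# Hypothetical base R stand-in for stringr::str_extract.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)          # regmatches drops non-matches
  out
}

x <- c("rs1234-<b>C</b>", "rs12-x9-y", "abc")

anchored <- extract_first(x, "^rs\\d+")   # rs plus the digits right after it
greedy   <- extract_first(x, "rs.*\\d")   # rs up to the LAST digit

anchored
# => "rs1234" "rs12" NA
greedy
# => "rs1234" "rs12-x9" NA
```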

Regex UTM google

I'm trying to extract a UTM parameter from a Google link using R, but my regex doesn't seem to work properly.
Here an example of a google link :
xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco
I tried the following regex to extract TOTO:
.+&utm_campaign=([[a-z]]+)&.+
with no success.
If someone can help, thanks!
In your pattern, [[a-z]]+ does not do what you expect: it matches one char from the [[a-z] bracket expression (any lowercase ASCII letter or [) and then one or more ] chars. You meant to use a single [ and ] here.
You may use sub with the following regex:
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
Details
.* - any 0+ chars, as many as possible
[&?] - a ? or &
utm_campaign= - a literal substring
([^&]+) - Capturing group 1: one or more chars other than & chars
.* - any 0+ chars, as many as possible
The \1 is the replacement backreference that puts the contents of Group 1 into the result.
See the R demo:
s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
## => [1] "TOTO"
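Because sub returns the input unchanged when nothing matches, a missing parameter would silently yield the whole URL. A minimal sketch that generalizes the call and returns NA instead; the function name get_param is hypothetical, and the parameter name is assumed to contain no regex metacharacters:

```r
# Hypothetical helper: value of a query parameter, or NA if it is absent.
# Assumes `name` has no regex metacharacters (true for utm_* names).
get_param <- function(url, name) {
  pattern <- paste0(".*[&?]", name, "=([^&]*).*")
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
get_param(s, "utm_campaign")  # "TOTO"
get_param(s, "utm_medium")    # "display"
get_param(s, "utm_term")      # NA
```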
You could use:
(?:&utm_campaign=)(\w+)
and use the first group captured
Here's a regex string that'll match the value of a utm_campaign parameter, regardless of its position in the query string.
(?<TOTO>(?<=utm_campaign=).*?(?=&|$))
Explanation:
?<TOTO> captures the result into a TOTO key after the regex is executed
(?<=utm_campaign=) is a look-behind that will ensure that the value is preceded by utm_campaign=
.*? will find the parameter value (i.e. TOTO). The reason for the ? is lazy evaluation - it will only search until the next rule is matched (see point below)
(?=&|$) is a look-ahead that will match either an & or the end of the string (in the case that utm_campaign is the last parameter)
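In R, the lookaround part of this pattern can be used directly with regmatches and regexpr, provided perl = TRUE is set. The sketch below drops the named group, since the whole match is already the value; the second URL is an invented example of the parameter appearing last:

```r
pattern <- "(?<=utm_campaign=).*?(?=&|$)"

s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
middle <- regmatches(s, regexpr(pattern, s, perl = TRUE))

# Parameter at the very end of the query string: the $ branch applies.
s_last <- "xxx/yyy?utm_campaign=TOTO"
at_end <- regmatches(s_last, regexpr(pattern, s_last, perl = TRUE))

c(middle, at_end)
# => "TOTO" "TOTO"
```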
You are searching for [[a-z]]+ however TOTO is uppercase, so not between 'a' and 'z'. You can update it to [[A-Za-z]]+ to match any case letter.
EDIT: [[A-Za-z]]+ will match any case letter, but will also match any '[' or ']' characters. If you do not wish to match these then you can change it to [A-Za-z]+ to only match any case letters

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the regex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
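This can be checked by listing every match of the pattern with base R's gregexpr, a small sketch of the reasoning above:

```r
x <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"

# All non-overlapping matches of the original pattern.
matches <- regmatches(x, gregexpr("_ABC.*$", x))[[1]]

length(matches)
# => 1
matches
# => "_ABCDEF_ABC:45_ABC:454:"
```

The single match already spans from the first _ABC to the end of the string, so replacing the "last" match replaces all of it.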
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
All three lines of code return _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
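The three approaches can be run against the input from the question to confirm that they agree (a plain restatement of the calls above, with the results stored for comparison):

```r
x <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"

# 1) Greedy capture of everything before the last _ABC.
r1 <- sub("(.*)_ABC.*", "\\1_CBA", x)

# 2) Tempered greedy token: no _ABC may start between the match and $.
r2 <- sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA", x, perl = TRUE)

# 3) Negative lookahead: no further _ABC anywhere to the right.
r3 <- sub("(?s)_ABC(?!.*_ABC).*", "_CBA", x, perl = TRUE)

c(r1, r2, r3)
# => all three are "_AB:C-_ABCDEF_ABC:45_CBA"
```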
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Modify regex to exclude characters that occur at beginning

Using below code I'm extracting a generated html link :
mystr <- c("/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq")
str_extract(mystr, "^.*(?=(&sa))")
This returns :
[1] "/url?q=http://www.mypage.html"
How to modify regex in order to exclude /url?q= ? So just http://www.mypage.html is returned ?
You can replace the beginning of the string (i.e. ^) with http,
stringr::str_extract(mystr, "http.*(?=(&sa))")
#[1] "http://www.mypage.html"
You may also use a base R sub solution to match up to the first http and capture it together with any chars other than & that follow:
sub(".*?(http[^&]*).*", "\\1", mystr)
You may make the pattern more precise by adding q= after .*? so that it only matches after q=.
Details
.*? - any 0+ chars as few as possible,
(http[^&]*) - capturing group #1 matching http and then any zero or more chars other than &
.* - the rest of the string.
The \1 is a replacement backreference to the Group 1 value.
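A sketch of the stricter variant that requires q= before the captured URL. perl = TRUE is an assumption added here so that the lazy .*? follows PCRE semantics:

```r
mystr <- c("/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq")

# Skip as little as possible up to q=, then capture http... up to the next &.
link <- sub(".*?q=(http[^&]*).*", "\\1", mystr, perl = TRUE)
link
# => "http://www.mypage.html"
```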

Resources