Remove all dots but first in a string using R

I have some malformed numbers such as "59.34343.23". I know the first dot is correct, but the second one (or any after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However, it raises the error "invalid regular expression", even though the same pattern works when I try it in a regex tester.
What is my mistake here?

You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
See the R demo online and the regex demo.
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): zero or more chars other than . from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
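As a quick check of the alternation approach described above, the sketch below wraps it in a small helper and applies it to a few inputs. The helper name keep_first_dot and all test strings except "59.34343.23" are made up for illustration.

```r
# Hypothetical helper (the name is an assumption, not from the answer):
# keep the first dot and drop every later one.
keep_first_dot <- function(x) {
  # Alternative 1 matches the text up to and including the first dot and is
  # written back via \1. Alternative 2 matches any later dot; group 1 is
  # empty for those matches, so the dot is replaced with nothing.
  gsub("^([^.]*\\.)|\\.", "\\1", x)
}

x <- c("59.34343.23", "1.2.3.4", "42", "7.5")
cleaned <- keep_first_dot(x)
print(cleaned)
# The cleaned strings can then be converted to numbers.
print(as.numeric(cleaned))
```

Strings with zero or one dot pass through unchanged, which is usually what you want when cleaning a mixed column.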

We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"

With perl = TRUE, you can replace matches of the following regular expression with empty strings:
^[^.]*\.[^.]*\K\.|\.
\K resets the start of the reported match, so in the first alternative only the dot matched after \K is removed.
Note that this pattern assumes the string contains at least two dots: if there is only one, the second alternative matches it and that dot is removed.

There is always the option to write back the dot only if it's the first in the line.
The key is to consume the other dots but not write them back.
The effect is to delete all dots after the first.
Below uses a branch reset to accomplish the goal (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
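In R, the branch-reset pattern above could be applied roughly as follows. This is a sketch that assumes perl = TRUE and the R-style replacement "\\1" in place of $1; the multi-line test string is invented for illustration.

```r
# Branch reset (?|...) makes both alternatives share group number 1:
# either group 1 holds the text up to the first dot of a line,
# or it is empty and the matched run of dots is dropped.
pattern <- "(?m)(?|(^[^.\\n]*\\.)|()\\.+)"

res_single <- gsub(pattern, "\\1", "59.34343.23", perl = TRUE)
print(res_single)

# With (?m), ^ also matches at the start of each line, so the first dot
# of every line is kept.
multi <- "1.2.3\n4.5.6"
res_multi <- gsub(pattern, "\\1", multi, perl = TRUE)
cat(res_multi, "\n")
```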

The pattern that you tried fails to compile because the lookbehind (?<=\\..*) contains an unbounded quantifier (.*), which is not supported in PCRE lookbehinds.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match zero or more chars other than ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Optionally match any char except .
\K\. Clear the match buffer and match the dot (to be removed)
Regex demo | R demo
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"
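To see that the PCRE-based patterns from the answers above agree, one can run them over the same inputs. This is only a sanity-check sketch; the variable names and the extra test strings are assumptions, not part of any answer.

```r
x <- c("59.34343.23", "1.2.3.4", "42", "7.5")

# Capture group written back for the first dot, nothing for later dots.
by_group <- gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
# (*SKIP)(*FAIL) discards the part up to the first dot, then removes dots.
by_skip <- gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", x, perl = TRUE)
# \G continues from the previous match; \K drops everything before the dot.
by_g <- gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", x, perl = TRUE)

print(by_group)
print(identical(by_group, by_skip) && identical(by_group, by_g))
```

The (*SKIP)(*FAIL) variant uses [^.]+ rather than [^.]*, so it may behave differently from the others on strings that start with a dot; such inputs are deliberately left out here.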

Related

REGEX pattern match in R for Course number

I need to identify course numbers that match the pattern xx.3xxxxxx.
These are some examples of the course numbers.
26.3730004
27.0210000
26.3730009
26.7114001
23.9610071
26.0A34430
23.3670005
26.0B05430
I tried many patterns; one example is the pattern below. It did not produce any match.
"[^0-9]{2}\Q.\E3[^0-9]+$"
I tried using grep and grepl. I actually need the code to return indexes.
This code shows my attempt to tag the rows that have matches.
Teacher$virtual[which(grepl("[^0-9]{2}\\Q.\\E3[^0-9]+$", Teacher$CourseNumber))] <- "1"
I need to remove any row from my dataframe that has a course number matching that pattern: XX.3XXXXXX
But, my code did not find any match. Can you please help me?
You should use
grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)
Details:
^ - start of a string
[0-9]{2} - two digits
\\. - a dot (note that a regex escape is a literal backslash, but inside a string literal, "...", a single backslash is used to form string escape sequences, hence the backslash must be doubled to obtain a literal backslash char necessary for a regex escape)
3 - a 3 char.
NOTE: If you want to use in-pattern quoting with \Q and \E (in between which all chars are treated literally) you need to use PCRE regex, add perl=TRUE and use
grepl("^[0-9]{2}\\Q.\\E3", Teacher$CourseNumber, perl=TRUE)
Now, the dot is treated as a literal dot, not a . metacharacter that matches any char but a line break char (in a PCRE regex, . does not match line break chars by default).
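Since the question asks for row indexes and for removing the matching rows, here is a minimal sketch of both steps. The Teacher data frame is rebuilt from the sample course numbers in the question; its single-column layout and the names is_virtual and kept are assumptions.

```r
# Sample data taken from the question; the real data frame has more columns.
Teacher <- data.frame(
  CourseNumber = c("26.3730004", "27.0210000", "26.3730009", "26.7114001",
                   "23.9610071", "26.0A34430", "23.3670005", "26.0B05430"),
  stringsAsFactors = FALSE
)

# TRUE for course numbers of the form xx.3xxxxxx
is_virtual <- grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)

# Row indexes of the matches
print(which(is_virtual))

# Drop the matching rows; drop = FALSE keeps the result a data frame
kept <- Teacher[!is_virtual, , drop = FALSE]
print(kept)
```

Indexing with the logical vector directly avoids the pitfall of Teacher[-which(...), ], which returns zero rows when nothing matches.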
Here, this simple expression would likely cover that:
^[0-9]{2}\.[3].+$
which requires a 3 right after the dot. It would probably work without the start and end anchors:
[0-9]{2}\.[3].+
We can add or relax the constraints if necessary.

Regex UTM google

I'm trying to extract a UTM parameter from a Google link using R, but my regex doesn't seem to work properly.
Here is an example of a Google link:
xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco
I tried the following regex to extract TOTO:
.+&utm_campaign=([[a-z]]+)&.+
with no success.
If someone can help, thanks!
In your pattern, [[a-z]]+ is a malformed bracket expression, because it matches any char from the [[a-z] bracket expression (any lowercase ASCII letter or [) and then matches one or more ] chars. You meant to use single [ and ] here.
You may use sub with the following regex:
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
See the regex demo.
Details
.* - any 0+ chars, as many as possible
[&?] - a ? or &
utm_campaign= - a literal substring
([^&]+) - Capturing group 1: one or more chars other than & chars
.* - any 0+ chars, as many as possible
The \1 is the replacement backreference that puts the contents of Group 1 into the result.
See the R demo:
s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
## => [1] "TOTO"
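One caveat with sub is that it returns the input unchanged when the pattern does not match. A possible guard, sketched below, returns NA in that case. The helper name extract_campaign and the two extra URLs are made up for illustration.

```r
# Hypothetical helper: extract the utm_campaign value, or NA if absent.
extract_campaign <- function(url) {
  pattern <- ".*[&?]utm_campaign=([^&]+).*"
  # grepl guards against sub() silently returning the whole input
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

urls <- c(
  "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco",
  "xxx/yyy?utm_campaign=LAST",
  "xxx/yyy?foo=bar"
)
campaigns <- extract_campaign(urls)
print(campaigns)
```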
You could use:
(?:&utm_campaign=)(\w+)
and use the first group captured
Here's a regex string that'll match the value of a utm_campaign parameter, regardless of its position in the query string.
(?<TOTO>(?<=utm_campaign=).*?(?=&|$))
Explanation:
?<TOTO> captures the result into a TOTO key after the regex is executed
(?<=utm_campaign=) is a look-behind that will ensure that the value is preceded by utm_campaign=
.*? will find the parameter value (i.e. TOTO). The reason for the ? is lazy evaluation - it will only search until the next rule is matched (see point below)
(?=&|$) is a look-ahead that will match either an & or the end of the string (in the case that utm_campaign is the last parameter)
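In R, a lookaround pattern like this needs perl = TRUE, and the match itself can be pulled out with regexpr and regmatches. The sketch below leaves out the named group for simplicity, since the whole match is already the parameter value.

```r
s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"

# The lookbehind anchors the match right after "utm_campaign=";
# the lazy .*? stops at the next "&" or at the end of the string.
m <- regexpr("(?<=utm_campaign=).*?(?=&|$)", s, perl = TRUE)
value <- regmatches(s, m)
print(value)
```

When there is no match, regmatches returns a zero-length character vector, so check length(value) before using it.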
You are searching for [[a-z]]+, but TOTO is uppercase, so it is not between 'a' and 'z'. The doubled brackets are also a problem: [[a-z] matches a single '[' or lowercase letter, and ]+ then requires one or more literal ']' characters.
EDIT: use [A-Za-z]+ (single brackets) to match letters of either case.

Remove characters before first and after second underscore extracting string between first and second underscore

I am using
substr(gsub(".*_","",ldf[[j]]),1,nchar(gsub(".*_","",ldf[[j]]))-4)
to create a path and filename to write to. It works fine for names in ldf that have only one underscore. For a filename with a second underscore further along, it cuts off everything in front of the second underscore.
I have for example:
Arof_07122016_2.csv and I want 07122016, but I get 2. I don't understand why this is happening. How can I change this line to cut off only the characters in front of the first underscore and keep what follows, up to the second one?
It seems you want
sub("^[^_]*_([^_]*).*", "\\1", ldf[[j]])
See the regex demo
The pattern matches
^ - start of string
[^_]* - 0+ chars other than _
_ - an underscore
([^_]*) - Capturing group #1: any 0+ chars other than _
.* - the rest of the string.
The \1 in the replacement pattern only keeps the captured value in the result.
R demo:
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")
sub("^[^_]*_([^_]*).*", "\\1", v)
# => [1] "07122016" "99999"
Regular expression repetition is greedy by default, as explained in ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ? to
the quantifier. (There are further quantifiers that allow approximate
matching: see the TRE documentation.)
So you should use the pattern ".*?_". However, gsub will make multiple matches so you end up with the same result. To remedy this use sub which will only make 1 match or specify that you want to match at the start of the string by using ^ in the regex.
sub(".*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
gsub("^.*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
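The difference between the greedy and lazy forms can be seen side by side in the short sketch below. The final step, which strips everything from the next underscore onward, is an addition for illustration and not part of the answer above.

```r
x <- "Arof_07122016_2.csv"

# Greedy: .* runs to the LAST underscore, so only "2.csv" survives.
greedy <- gsub(".*_", "", x)
# Lazy + sub: .*? stops at the FIRST underscore and only one match is replaced.
lazy <- sub(".*?_", "", x)
# Assumed extra step: drop everything from the next underscore onward.
middle <- sub("_.*", "", lazy)

print(c(greedy = greedy, lazy = lazy, middle = middle))
```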

Remove everything before the last space

I have the following string. I tried to remove everything before the last space, but I can't get it to work.
I tried to follow this post
Use gsub remove all string before first white space in R
str <- c("Veni vidi vici")
gsub("\\s*","\\1",str)
"Venividivici"
What I want to have is only "vici" string left after removing everything before the last space.
Your gsub("\\s*","\\1",str) code replaces each occurrence of 0 or more whitespaces with a reference to the capturing group #1 value (which is an empty string since you have not specified any capturing group in the pattern).
You want to match up to the last whitespace:
sub(".*\\s", "", str)
If you do not want to get a blank result in case your string has trailing whitespace, trim the string first:
sub(".*\\s", "", trimws(str))
Or, use a handy stri_extract_last_regex from stringi package with a simple \S+ pattern (matching 1 or more non-whitespace chars):
library(stringi)
stri_extract_last_regex(str, "\\S+")
# => [1] "vici"
Note that .* matches any 0+ chars as many as possible (since * is a greedy quantifier and . in a TRE pattern matches any char including line break chars), and grabs the whole string at first. Then, backtracking starts since the regex engine needs to match a whitespace with \s. Yielding character by character from the end of the string, the regex engine stumbles on the last whitespace and calls it a day returning the match that is removed afterwards.
See the R demo and a regex demo online:
str <- c("Veni vidi vici")
gsub(".*\\s", "", str)
## => [1] "vici"
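The effect of trimming first can be checked on a few edge cases. This is a small sketch; the helper name last_word and the extra inputs are made up for illustration.

```r
# Hypothetical helper: keep only what follows the last whitespace,
# ignoring leading and trailing whitespace.
last_word <- function(x) sub(".*\\s", "", trimws(x))

inputs <- c("Veni vidi vici", "Veni vidi vici  ", "single")
words <- last_word(inputs)
print(words)

# Without trimws(), trailing whitespace leaves nothing behind:
untrimmed <- sub(".*\\s", "", "Veni vidi vici  ")
print(untrimmed)
```

A string without any whitespace has no match and is returned unchanged.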

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the regex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and inserting the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
See the R demo online, all these three lines of code returning _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
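The three variants listed above can be run together to confirm they produce the same result on the sample input. This is a minimal sketch; the variable names are assumptions.

```r
input <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"

# 1) Greedy capture of everything before the last _ABC
r1 <- sub("(.*)_ABC.*", "\\1_CBA", input)
# 2) Tempered greedy token: no _ABC may start between the match and the end
r2 <- sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA", input, perl = TRUE)
# 3) Negative lookahead: no further _ABC anywhere to the right
r3 <- sub("(?s)_ABC(?!.*_ABC).*", "_CBA", input, perl = TRUE)

print(c(r1, r2, r3))
```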
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Resources