Remove everything before the last space - r

I have a following string. I tried to remove all the strings before the last space but it seems I can't achieve it.
I tried to follow this post
Use gsub remove all string before first white space in R
str <- c("Veni vidi vici")
gsub("\\s*","\\1",str)
"Venividivici"
What I want to have is only "vici" string left after removing everything before the last space.

Your gsub("\\s*","\\1",str) code replaces each occurrence of 0 or more whitespaces with a reference to the capturing group #1 value (which is an empty string since you have not specified any capturing group in the pattern).
You want to match up to the last whitespace:
sub(".*\\s", "", str)
If you do not want to get a blank result in case your string has trailing whitespace, trim the string first:
sub(".*\\s", "", trimws(str))
Or, use a handy stri_extract_last_regex from stringi package with a simple \S+ pattern (matching 1 or more non-whitespace chars):
library(stringi)
stri_extract_last_regex(str, "\\S+")
# => [1] "vici"
Note that .* matches any 0+ chars as many as possible (since * is a greedy quantifier and . in a TRE pattern matches any char including line break chars), and grabs the whole string at first. Then, backtracking starts since the regex engine needs to match a whitespace with \s. Yielding character by character from the end of the string, the regex engine stumbles on the last whitespace and calls it a day returning the match that is removed afterwards.
See the R demo and a regex demo online:
str <- c("Veni vidi vici")
gsub(".*\\s", "", str)
## => [1] "vici"
Also, you may want to see how backtracking works in the regex debugger:
Those red arrows show backtracking steps.

Related

Remove all dots but first in a string using R

I have some errors in some numbers showing numbers like "59.34343.23". I know the first dot is correct but the second one (or any after the first) should be remove. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However it gets the following error "invalid regular expression". But I have been trying the same code in a regex tester and it works.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
See the R demo online and the regex demo.
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): any zero or more chars from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
Start your engine!
If you are unfamiliar with \K hover over it in the regular expression at the link to see an explanation of its effect.
There is always the option to only write back the dot if its the first in the line.
Key feature is to consume the other dots but don't write it back.
Effect is to delete trailing dots.
Below uses a branch reset to accomplish the goal (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
The pattern that you tried does not match, because there is an infinite quantifier in the lookbehind (?<=\\..*) that is not supported.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match any char except ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Optionally match any char except .
\K\. Clear the match buffer an match the dot (to be removed)
Regex demo | R demo
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"

R Capturing String inside Brackets

I'm trying to parse some of my chess pgn data but I'm having some trouble capturing characters just inside one bracket.
testString <- '[Event \"?\"]\n[Site \"http://www.chessmaniac.com play free chess\"]\n[Date \"2018.08.25\"]\n[Round \"-\"]\n[White \"NothingFancy 1497\"]\n[Black \"JR Smith 1985\"]\n[Result \"1-0\"]\n\n1.'
#Attempt to just get who white is, which is inside a bracket [White xxx]
findWhite <- regexpr('\\[White.*\\]', tempString)
regmatches(tempString, findWhite)
The stringr package seems to do what I want, but I'm curious what is different about the use of the same regular expression. I'm fine using stringr, but I like to also know how to do this in base R.
library(stringr)
str_extract(tempString, '\\[White.*\\]')
If you need the whole match starting with [White and ending with ] you may use
regmatches(testString, regexpr("\\[White\\s*[^][]*]", testString))
[1] "[White \"NothingFancy 1497\"]"
If you only need the substring inside double quotes:
regmatches(testString, regexpr("\\[White\\s*\\K[^][]*", testString, perl=TRUE))
[1] "\"NothingFancy 1497\""
See the regex demo.
To strip the double quotes, you may use something like
regmatches(testString, regexpr('\\[White\\s*"\\K.*(?="])', testString, perl=TRUE))
[1] "NothingFancy 1497"
See another regex demo and an online R demo.
Details
\\[ - a [ char
White - a literal substring
\\s* - 0+ whitespaces
\\K - match reset operator discarding the text matched so far
[^][]* - 0+ chars other than [ and ]
.* (in the other version) - matches any 0+ chars other than line break chars, as many as possible
(?="]) - a positive lookahead that matches a position inside a string that is immediately followed with "].
At least one way to do it in base R is to use sub and only keep the part that you want.
sub(".*\\[White\\s(*.*?)\\].*", "\\1", testString)
[1] "\"NothingFancy 1497\""

Remove characters before first and after second underscore extracting string between first and second underscore

I am using
gsub(".*_","",ldf[[j]]),1,nchar(gsub(".*_","",ldf[[j]]))-4)
to create a path and filename to write to. It works fine for names in lfd that only have one underscore. Having a filename with another underscore further back, it cuts everything off that is in front of the second underscore.
I have for example:
Arof_07122016_2.csv and I want 07122016, but I get 2. But I don't get why this is happening. How can I use this line to only cut off the characters in fromt of the first underscore and keep the second one?
It seems you want
sub("^[^_]*_([^_]*).*", "\\1", ldf[[j]])
See the regex demo
The pattern matches
^ - start of string
[^_]* - 0+ chars other than _
_ - an underascxore
([^_]*) - Capturing group #1: any 0+ chars other than _
.* - the rest of the string.
The \1 in the replacement pattern only keeps the captured value in the result.
R demo:
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")
sub("^[^_]*_([^_]*).*", "\\1", v)
# => [1] "07122016" "99999"
Regular expression repetition is greedy by default, as explained in ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ? to
the quantifier. (There are further quantifiers that allow approximate
matching: see the TRE documentation.)
So you should use the pattern ".*?_". However, gsub will make multiple matches so you end up with the same result. To remedy this use sub which will only make 1 match or specify that you want to match at the start of the string by using ^ in the regex.
sub(".*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"
gsub("^.*?_","","Arof_07122016_2.csv")
[1] "07122016_2.csv"

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the refex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
See the R demo online, all these three lines of code returning _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
Demo
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

R utf-8 and replace a word from a sentence based on ending character

I have a requirement where I am working on a large data which is having double byte characters, in korean text. i want to look for a character and replace it. In order to display the korean text correctly in the browser I have changed the locale settings in R. But not sure if it gets updated for the code as well. below is my code to change locale to korean and the korean text gets visible properly in viewer, however in console it gives junk character on printing-
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table format that contains a column with text in korean. example -
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word which ends with "시" character. Then I want to get rid of the "(남동,(지하))" an the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search the end of word, I am using \\b and since I want to replace any word ending with "시" character I am giving it as \\b시.....is this not the way to give? How to take care of () at the end of the sentence.
What would be a good source to refer to for regular expressions.
Is a utf-8 setting needed for the script as well?How to do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

Resources