Applying a regular expression to a string in R

I'm just getting to know R, having previously worked with Python. The task is to replace the last character of each word in a string with *.
Expected behavior: for the input example text in string, the result should be exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Please help me solve this task.
Update: I noticed a mistake; the pattern should be .\b. With that pattern the code runs without an error, but nothing in the string is replaced.

If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
See the R demo and the regex demo.
Details:
\p{L} - a Unicode category (property) class matching any Unicode letter
\b - a word boundary; in this case, it makes sure there is no other word character immediately on the right. It will fail the match if the letter matched with \p{L} is immediately followed by a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
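To make the escaping point concrete, here is a small base R sketch (not part of the original answer). The variable names are illustrative, and the raw-string syntax r"(...)" assumes R 4.0 or later.

```r
# "\b" in an R string literal is a single backspace character,
# not a regex word boundary.
single  <- "\b"
doubled <- "\\b"   # backslash + b: what the regex engine needs to see

nchar(single)      # 1
nchar(doubled)     # 2

# A raw string (R >= 4.0) avoids the doubling and yields the same pattern text.
raw_pattern <- r"(\p{L}\b)"
identical(raw_pattern, "\\p{L}\\b")  # TRUE
```

This is likely why a pattern written as ".\b" runs without an error but replaces nothing: the engine looks for a literal backspace after any character.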
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed with - + another letter (i.e. some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.
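The boundary variants discussed in this answer can be compared side by side. The sketch below uses base R gsub with perl = TRUE instead of stringr (an assumption made so that it runs without extra packages); star_last and the test strings other than the question's are hypothetical.

```r
# Base R stand-in for str_replace_all; perl = TRUE enables \p{L} and lookaheads.
star_last <- function(x, pattern) gsub(pattern, "*", x, perl = TRUE)

star_last("example text in string", "\\p{L}\\b")
# [1] "exampl* tex* i* strin*"

# \b treats digits and _ as word characters, so "c" in "abc1" is not at a boundary.
star_last("abc1 de_", "\\p{L}\\b")              # [1] "abc1 de_"
star_last("abc1 de_", "\\p{L}(?!\\p{L})")       # [1] "ab*1 d*_"

# \B at the start leaves one-letter words alone.
star_last("a cat in hat", "\\B\\p{L}\\b")       # [1] "a ca* i* ha*"

# (?!-) skips a letter that is directly followed by a hyphen.
star_last("well-known fact", "\\p{L}\\b(?!-)")  # [1] "well-know* fac*"
```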

Related

Remove everything before the last space

I have the following string. I tried to remove everything before the last space, but I can't get it to work.
I tried to follow this post
Use gsub remove all string before first white space in R
str <- c("Veni vidi vici")
gsub("\\s*","\\1",str)
"Venividivici"
What I want to have is only "vici" string left after removing everything before the last space.
Your gsub("\\s*","\\1",str) code replaces each occurrence of 0 or more whitespaces with a reference to the capturing group #1 value (which is an empty string since you have not specified any capturing group in the pattern).
You want to match up to the last whitespace:
sub(".*\\s", "", str)
If you do not want to get a blank result in case your string has trailing whitespace, trim the string first:
sub(".*\\s", "", trimws(str))
Or, use a handy stri_extract_last_regex from stringi package with a simple \S+ pattern (matching 1 or more non-whitespace chars):
library(stringi)
stri_extract_last_regex(str, "\\S+")
# => [1] "vici"
Note that .* matches any 0+ chars as many as possible (since * is a greedy quantifier and . in a TRE pattern matches any char including line break chars), and grabs the whole string at first. Then, backtracking starts since the regex engine needs to match a whitespace with \s. Yielding character by character from the end of the string, the regex engine stumbles on the last whitespace and calls it a day returning the match that is removed afterwards.
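The effect of greedy matching plus backtracking can be seen by contrasting it with a lazy quantifier. This is only an illustrative sketch; the lazy variant uses perl = TRUE, and the string with a trailing space is a made-up variation of the question's input.

```r
x <- "Veni vidi vici"

# Greedy .* backtracks from the end, so the match stops at the LAST whitespace.
sub(".*\\s", "", x)                          # [1] "vici"

# Lazy .*? (PCRE here) stops at the FIRST whitespace instead.
sub(".*?\\s", "", x, perl = TRUE)            # [1] "vidi vici"

# With trailing whitespace, the last whitespace is at the very end,
# so nothing is left unless the string is trimmed first.
sub(".*\\s", "", "Veni vidi vici ")          # [1] ""
sub(".*\\s", "", trimws("Veni vidi vici "))  # [1] "vici"
```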
See the R demo and a regex demo online:
str <- c("Veni vidi vici")
gsub(".*\\s", "", str)
## => [1] "vici"
Also, you may want to see how backtracking works in the regex debugger, where red arrows mark the backtracking steps.

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How can I keep the original matched values (B04, B02, A10mB03) in the replacement?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for your expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
See the R demo online.
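A short runnable check of the points above, using the sample string from the question. The "A,0" test input is hypothetical and only there to show what a comma inside a character class does.

```r
string <- "B04 + B02 - A10mB03"

# A comma inside [...] is a literal comma, not a separator.
grepl("A[1,2,6]0", "A,0")   # TRUE (probably unintended)
grepl("A[126]0",   "A,0")   # FALSE

# Wrapping the whole pattern in (...) makes the match available as \\1.
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
# [1] "raster(B04) + raster(B02) - raster(A10mB03)"

# The looser \\w+ version gives the same result for this input.
gsub("(\\w+)", "raster(\\1)", string)
# [1] "raster(B04) + raster(B02) - raster(A10mB03)"
```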

keep only alphanumeric characters and space in a string using gsub

I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, this removes the special characters (punctuation and non-UTF-8 characters), but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces but there are still non utf8 characters present.
Any work around?
For the sample string above, output should be:
Sample string here
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases clearly illustrate that the characters still present are considered letters. The EBHP inside are also still letters, so the condition on which you're replacing is not correct. You don't want to keep all letters, you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*? fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible with this piece.
everything between () will be stored and can be referred to in the replacement by \\1
\\b indicates a word boundary
This is followed at least once (+) by any character that's A-Z, a-z, 0-9 or a space. You have to do it that way, because the special letters are contained in between the upper and lowercase in the code table. So using A-z will include all special letters (which are UTF-8 btw!)
after that sequence, fit anything at least zero times to remove the rest of the string.
the backreference \\1, in combination with .* in the regex, will make sure only the required part remains in the output.
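The difference between deleting unwanted characters and capturing the wanted run can be sketched on a clean input. The string noisy below is a hypothetical ASCII-only stand-in for the sample string (whose odd characters do not survive copying), and perl = TRUE is an assumption made here so that the lazy quantifier and \\b behave predictably.

```r
# Hypothetical ASCII-only stand-in for the noisy sample string.
noisy <- "#+ Sample 2 string here ={>E<"

# Deleting unwanted characters keeps the stray "E" from the noise.
gsub("[^A-Za-z0-9 ]", "", noisy)
# [1] " Sample 2 string here E"

# Capturing the first run of allowed characters and dropping the rest.
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*", "\\1", noisy, perl = TRUE)
# [1] "Sample 2 string here"
```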
stringr uses a different regex engine (ICU, via stringi) that supports POSIX character classes. Here ascii names the class, which must be written with its own brackets and colons, [:ascii:], within the outer square brackets. The [^ indicates negation of the match.
library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")
results in
[1] "+ Sample string here ={>EBHP<]{>"
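If stringr is not available, a similar filter can be sketched in base R. This assumes perl = TRUE (the [:ascii:] class is a PCRE feature, not part of the default engine); the input string is hypothetical and written with \u escapes so the example does not depend on the file encoding.

```r
# \u escapes keep this example independent of the source file encoding.
x <- "caf\u00e9 + na\u00efve {ok}"

# [:ascii:] is a PCRE class, so perl = TRUE is required in base R.
gsub("[^[:ascii:]]", "", x, perl = TRUE)
# [1] "caf + nave {ok}"
```

As in the stringr output, ASCII punctuation is kept; only characters outside the ASCII range are dropped.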

R utf-8 and replace a word from a sentence based on ending character

I am working with a large data set that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is the code I use to change the locale to Korean; the Korean text then displays properly in the viewer, but printing it in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table that contains a column with Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I tried gsub, but it does not seem to work.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to match the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I wrote \\b시. Is that not the right way to write it? And how do I handle the () at the end of the sentence?
What would be a good reference for regular expressions?
Is a UTF-8 setting needed for the script as well? If so, how do I set it?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
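To see what the recursion buys, the same pattern can be tried on an ASCII input and compared with a non-recursive pattern. The string addr and the name balanced_tail are hypothetical, chosen only so the sketch does not depend on locale or encoding settings.

```r
# Hypothetical ASCII address with nested parentheses at the end.
addr <- "Main St 49 (South,(basement))"

# (?1) re-enters group 1, so inner balanced pairs are matched recursively.
balanced_tail <- "\\s*(\\((?:[^()]++|(?1))*\\))$"
sub(balanced_tail, "", addr, perl = TRUE)
# [1] "Main St 49"

# A non-recursive pattern cannot span the nested pair, so nothing is removed.
sub("\\s*\\([^()]*\\)$", "", addr, perl = TRUE)
# [1] "Main St 49 (South,(basement))"
```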
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

R regex: remove apostrophes except the ones preceded and followed by a letter

I'm cleaning a text and I'd like to remove every apostrophe except the ones preceded and followed by letters, as in i'm, i'll, he's, etc.
I have the following preliminary solution, which handles many cases, but I want a better one:
rmAps <- function(x) gsub("^\'+| \'+|\'+ |[^[:alpha:]]\'+(a-z)*|\\b\'*$", " ", x)
rmAps("'i'm '' ' 'we end' '")
[1] " i'm we end "
I also tried:
(?<![a-z])'(?![a-z])
But I think I am still missing something.
x <- "'i'm '' ' 'we end' '"
gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
#[1] "i'm   we end "
Remove occasions when your character is not followed by a word character: '(?!\\w).
Remove occasions when your character is not preceded by a word character: (?<!\\w)'.
If either of those situations occur, you want to remove it, so '(?!\\w)|(?<!\\w)' should do the trick. Just note that \\w includes the underscore, and adjust as necessary.
Another option is
gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
In this case, you match any instances when ' is surrounded by word characters: \\w'\\w, and then force that match to fail with (*SKIP)(*FAIL). But, also look for ' using |'. The result is that only occurrences of ' not wrapped in word characters will be matched and substituted out.
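Both patterns can be run on the question's input to compare them. The helper names lookaround and skip_fail, and the extra input "a'b'c", are hypothetical; the latter is included because the two approaches appear to differ when two apostrophes share a neighbouring letter.

```r
x <- "'i'm '' ' 'we end' '"

lookaround <- function(s) gsub("'(?!\\w)|(?<!\\w)'", "", s, perl = TRUE)
skip_fail  <- function(s) gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", s, perl = TRUE)

lookaround(x)   # [1] "i'm   we end "
skip_fail(x)    # [1] "i'm   we end "

# \w'\w consumes the letter after the first apostrophe, so the second
# apostrophe is no longer seen as preceded by a word character.
lookaround("a'b'c")   # [1] "a'b'c"
skip_fail("a'b'c")    # [1] "a'bc"

# Collapse the leftover runs of spaces if needed.
gsub("\\s+", " ", lookaround(x))   # [1] "i'm we end "
```

The lookaround version does not consume neighbouring characters, which makes it the safer choice for inputs like "a'b'c".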
You can use the following regular expression:
(?<=\w)'(?=\w)
(?<=) is a positive lookbehind. Everything inside needs to match before the next selector
(?=) is a positive lookahead. Everything inside needs to match after the previous selector
\w any alphanumeric character and the underscore
You could also switch \w to e.g. [a-zA-Z] if you want to restrict the results.
→ Here is your example on regex101 for live testing.

Resources