Replace punctuation in string - r

I want to replace the punctuation in a string by adding '\\' before the punctuation. The reason is I will be using regex on the string afterwards and it fails if there is a question mark without '\\' in front of it.
So basically, I would like to do something like this:
gsub("\\?","\\\\?", x)
Which converts a string "How are you?" to "How are you\\?" But I would like to do this for all punctuation. Is this possible?

You can use gsub with the [[:punct:]] regular expression alias as follows:
> x <- "Hi! How are you today?"
> gsub('([[:punct:]])', '\\\\\\1', x)
[1] "Hi\\! How are you today\\?"
Note that the replacement starts with '\\\\', which inserts a single literal backslash (R displays it as a double backslash when printing the string, which is the form you requested), while the '\\1' portion preserves the punctuation mark.
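As a minimal sketch of how the escaped string might then be used, the snippet below wraps the gsub call in a helper and searches with the result. The name escape_punct and the sample text are made up for illustration; perl = TRUE is an assumption chosen so that a backslash before any punctuation character is treated as a literal. If the goal is only to search for the string verbatim, fixed = TRUE avoids the escaping step altogether.

```r
# Hypothetical helper (not from the answer above): put a backslash in
# front of every punctuation character.
escape_punct <- function(x) gsub("([[:punct:]])", "\\\\\\1", x)

x <- "Hi! How are you today?"
pat <- escape_punct(x)

print(pat)      # print() shows R's escaped form of the string
cat(pat, "\n")  # cat() shows the actual characters, one backslash each

text <- "She said: Hi! How are you today? Fine."

# Use the escaped string as a regex (PCRE treats "\!" and "\?" as literals).
found_regex <- grepl(pat, text, perl = TRUE)

# Alternative: skip escaping and match the original string literally.
found_fixed <- grepl(x, text, fixed = TRUE)
```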

Related

Applying a regular expression to a string in R

I'm just getting to know R, having previously worked with Python. The challenge is to replace the last character of each word in a string with *.
Example input: example text in string. Expected result: exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Please help me solve this task.
Oh, I noticed an error: the pattern should be .\b. With that change the code runs, but nothing in the string is replaced.
If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
Details:
\p{L} - a Unicode category (property) class matching any Unicode letter
\b - a word boundary; in this case, it makes sure there is no other word character immediately on the right. The match fails if the letter matched with \p{L} is immediately followed by a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed with - + another letter (i.e. some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.
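The variants above can be compared side by side. This is only a sketch: it uses base R's gsub with perl = TRUE instead of stringr so that it runs without extra packages, and it assumes plain ASCII input (the sample sentences are invented for illustration).

```r
x <- "example text in a well-known string"

# Last letter of every word.
all_words <- gsub("\\p{L}\\b", "*", x, perl = TRUE)
# "exampl* tex* i* * wel*-know* strin*"

# Skip one-letter words: \B requires a word character before the letter.
skip_single <- gsub("\\B\\p{L}\\b", "*", x, perl = TRUE)
# "exampl* tex* i* a wel*-know* strin*"

# Skip letters that are followed by a hyphen.
skip_hyphen <- gsub("\\p{L}\\b(?!-)", "*", x, perl = TRUE)
# "exampl* tex* i* * well-know* strin*"

# \b versus a letter-only lookahead when a digit follows the letter.
y <- "abc1 de"
with_boundary  <- gsub("\\p{L}\\b", "*", y, perl = TRUE)         # "abc1 d*"
with_lookahead <- gsub("\\p{L}(?!\\p{L})", "*", y, perl = TRUE)  # "ab*1 d*"
```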

Split a character string in R on a single backslash [duplicate]

I am trying to extract the part of the string before the first backslash but I can't seem to get it to work properly.
I have tried multiple ways of getting it to work, based on the manual page for strsplit and after searching online.
In my actual situation the strings are in a dataframe which I get from a database connection but I can simplify the situation with the following:
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=FALSE)
Error in strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3", "\\", fixed = FALSE) :
invalid regular expression '\', reason 'Trailing backslash'
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:\\BLAAT2\\BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=FALSE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
The expected output would also split on the \ between BLAAT1 and 022E:
Thanks in advance
When you use a regex with the strsplit function, a literal backslash must be written in the pattern as two backslashes, because \ is a regex metacharacter used to form regex escapes such as \d and \w. On top of that, R string literals support string escape sequences (like "\r" for a carriage return or "\n" for a newline char), so each literal backslash in a string literal must itself be written as a double backslash.
So, "\\" is a literal \, and a regex pattern to match a literal backslash char, being \\, should be coded with 4 backslashes, "\\\\".
Note that "\022" in your string is not a backslash followed by digits: it is an octal escape for a single non-printable character. Here is a regex that you can use: it splits at a backslash or at a non-printable character:
strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\|[^[:print:]]",fixed=FALSE)
# [1] "BLAAT1" "E:" "BLAAT2" "BLAAT3"
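The two escaping levels can be checked directly. The following is a small sketch with a made-up string that contains only real backslashes, plus one way (an assumption about what the asker ultimately needs) to keep just the part before the first backslash.

```r
s <- "BLAAT1\\BLAAT2\\BLAAT3"  # each "\\" in the source is ONE backslash

nchar(s)             # 20, not 22
utf8ToInt("\022")    # 18: "\022" is a single control character, not a backslash

# Fixed matching: the pattern is one literal backslash, written "\\".
parts_fixed <- strsplit(s, "\\", fixed = TRUE)[[1]]

# Regex matching: the regex \\ is written "\\\\" in an R string literal.
parts_regex <- strsplit(s, "\\\\")[[1]]

# Keep only the text before the first backslash.
first_part <- sub("\\\\.*", "", s)
```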

Regular Expression for deleting emoticons in R [duplicate]

I have a string like:
q <-"<U+00A6> 1000-66329"
I want to remove <U+00A6> and get only 1000 66329.
I tried using:
gsub("\u00a6"," ", q,perl=T)
But it is not removing anything. How should I do gsub in order to get only 1000 66329?
I just want to remove unicode <U+00A6> which is at the beginning of string.
Then you do not need a gsub, you can use a sub with "^\\s*<U\\+\\w+>\\s*" pattern:
q <-"<U+00A6> 1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
Pattern details:
^ - start of string
\\s* - zero or more whitespaces
<U\\+ - a literal char sequence <U+
\\w+ - 1 or more letters, digits or underscores
> - a literal >
\\s* - zero or more whitespaces.
If you also need to replace the - with a space, add a |- alternative and use gsub (since we now expect several replacements and the replacement must be a space, as in akrun's answer):
trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
If it is always the first character, you can try:
substring("\U00A6 1000-66329", 2)
If R prints the string as <U+00A6> 1000-66329 instead of ¦ 1000-66329, then <U+00A6> is stored as the literal text "<U+00A6>" instead of the Unicode character. Then you can do:
substring("<U+00A6> 1000-66329",9)
Both ways the result is:
[1] " 1000-66329"
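A quick way to tell the two cases apart is to compare lengths and search for the literal text. This is a sketch only; the variable names are invented for illustration.

```r
# Case 1: the text "<U+00A6>" is stored literally (8 ASCII characters).
q_literal <- "<U+00A6> 1000-66329"
# Case 2: the string holds the real character U+00A6 (broken bar).
q_char <- "\u00a6 1000-66329"

nchar(q_literal)  # 19
nchar(q_char)     # 12

# fixed = TRUE searches for the characters "<U+" as-is, no regex involved.
is_literal <- grepl("<U+", q_literal, fixed = TRUE)  # TRUE
is_char    <- grepl("<U+", q_char, fixed = TRUE)     # FALSE

# Remove the prefix with a pattern that fits each case.
clean_literal <- sub("^\\s*<U\\+\\w+>\\s*", "", q_literal)
clean_char    <- sub("^\u00a6\\s*", "", q_char)
```

Both clean_literal and clean_char end up as "1000-66329"; the pattern from the question only fits the second case, which is why it removed nothing from the literal text.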
We can also do
trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"
Instead of removing it, you should convert it to the appropriate format. You have to set your locale to UTF-8 like so:
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
Maybe you will see the following message:
Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In this case you can try stringi::stri_trans_general(x, "zh").
Here "zh" means Chinese; you need to know which language you have to convert to.

R utf-8 and replace a word from a sentence based on ending character

I am working with a large dataset that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether the change applies to the code as well. Below is the code I use to change the locale to Korean; the Korean text then displays properly in the viewer, but printing it in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table that contains a column with Korean text, for example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search at the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I wrote the pattern as \\b시. Is this not the right way to write it? And how do I take care of the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
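To see the recursion at work without any Unicode concerns, here is a small sketch on invented ASCII strings using the same pattern. Only a balanced, possibly nested, parenthesized group at the very end of the string is removed.

```r
pattern <- "\\s*(\\((?:[^()]++|(?1))*\\))$"

# Nested parentheses at the end: the whole group is removed.
nested <- sub(pattern, "", "Main St 49 (south,(basement))", perl = TRUE)
# "Main St 49"

# An earlier group in the middle of the string is left alone,
# because the match must reach the end of the string.
middle <- sub(pattern, "", "a (b) c (d(e))", perl = TRUE)
# "a (b) c"

# No trailing parentheses: the string is returned unchanged.
none <- sub(pattern, "", "Main St 49", perl = TRUE)
# "Main St 49"
```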
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

Is there a simple way to get substring in R?

I get the substring of a word in the following way:
word="xyz9874"
pattern="[0-9]+"
x=gregexpr(pattern,word)
substr(word,start=x[[1]],stop=x[[1]]+attr(x[[1]],"match.length")-1)
[1] "9874"
Is there a simpler way to get the result in R?
Sure, use gsub and backreferencing:
gsub( ".*?([0-9]+).*", "\\1", word )
Explanation: in most regex implementations, \1 is the back reference to the first subpattern matched. The subpattern is enclosed in parentheses. In R, you need to escape the backslash irrespective of the type of quotation marks you are using.
The question mark after .* makes the quantifier non-greedy (lazy): it takes as little of the string as possible. Otherwise, the .* in the pattern .*([0-9]+) would match xyz987 and ([0-9]+) would match only 4. Alternatively, we can write
gsub( ".*[^0-9]+([0-9]+).*", "\\1", word )
but then we have a problem with strings that start with a number.
By the way, note that instead of [0-9] you can write \d, or, actually, \\d:
gsub( ".*?(\\d+).*", "\\1", word )
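Another base R option, not mentioned in the answer above, is regmatches combined with regexpr or gregexpr. A short sketch (the extra sample strings are invented): unlike the gsub approach, which returns the whole input when nothing matches, regmatches returns an empty result.

```r
word <- "xyz9874"

# First match only.
first_num <- regmatches(word, regexpr("[0-9]+", word))
# "9874"

# All matches in a string with several numbers.
many <- "a1b22c333"
all_nums <- regmatches(many, gregexpr("[0-9]+", many))[[1]]
# "1" "22" "333"

# No digits at all: an empty character vector comes back.
no_match <- regmatches("abc", regexpr("[0-9]+", "abc"))
# character(0)
```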
