Remove single character in R - r

i'm working on sentiment analysis with Arabic language by using R and in cleaning step i need to remove the single character.
I used this code to remove them and it works but a had some problem
for example here is the data
R<-("للمدافعين قال شركة وطنية قلت أقنعهم يعاملوننا كمواطنينقال جودتها عالية قلت جيدة غيرها غ")
as you see here "غ" is single character
gsub(" *\\b[[:alpha:]]{1}\\b *", "", R)
[1] "للمدافعين قال شركة وطنية قلت أقنعهم يعاملوننا كمواطنينقال جودتها عالية قلت جيدة غيرها\n"
but when I tried to apply it on the whole data set on text column like here
subdata1$text = gsub("*\\b[[:alpha:]]{1}\\b *", "", subdata1$text)
its doesn't remove anything and I don't known why?
hope you understand me
thank you

It seems the [:alpha:] POSIX character class does not work with all Unicode letters in your case.
I suggest using a PCRE pattern:
gsub("(*UCP)\\b\\p{L}\\b", "", R, perl=TRUE)
Here, (*UCP) is required to make \b word boundary Unicode aware and \p{L} matches any Unicode letter from a BMP plane. The perl=TRUE argument is required for the pattern to be processed with the PCRE regex engine.

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
[:punct:] pattern in an ICU regex flavor does not match math symbols (\p{S}), it only matches punctuation proper (\p{P}), if you still want to match all of them, combine the two classes, [\p{P}\p{S}]
"\\1\\s\\2" replacement contains a \s regex escape sequence, and these are not supported in the replacement patterns, you need to use a literal space
str_replace only replaces one, first occurrence, use str_replace_all to handle all matches
Even if you use all the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, in order not to consume the second punctuation.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online, all three lines yield [1] "- + ? =" output.
Note that in PCRE regex flavor (used with gsub and per=TRUE) the POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.

Add a white-space between number and special character condition R

I'm trying to use stringr or R base calls to conditionally add a white-space for instances in a large vector where there is a numeric value then a special character - in this case a $ sign without a space. str_pad doesn't appear to allow for a reference vectors.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, then use sub to capture digit and the $ separately and in the replacement add the space between the backreferences of the captured group
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
This will work if you are working with a vectored string:
mystring<-as.vector('$6.88$7.34 $8.34$4.31')
gsub("(?<=\\d)\\$", " $", mystring, perl=T)
[1] "$6.88 $7.34 $8.34 $4.31"
This includes cases where there is already space as well.
Regarding the question asked in the comments:
mystring2<-as.vector('Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly')
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl=T)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ appears due to nested quotes in the vector, should not make a difference. Also <U+2020> appears due to encoding the special character.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive look-behind created by ?<=, this basically looks behind anything we are trying to match to make sure what we define in the look behind is there. In this case we are looking for [[:alpha:]] which matches a alphabetic character.
We then check for a blank space with \s, in R we have to use a double escape so \\s, this is what we are trying to match.
Finally we use (?=[[:alpha:]]+), which is a positive look-ahead defined by ?= that checks to make sure our match is followed by another letter as explained above.
The logic is to find a blank space between letters, and match the space, which then is replaced by gsub, with a _
See all the regex here

Positive lookbehind on non-ASCII characters in R

I have an R function which tries to capitalise the first letter of every "word"
proper = function(x){
gsub("(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}
This works pretty well, but when I have a word with a Māori macron in it like Māori I get improper capitalisation, e.g.
> proper("Māori")
[1] "MāOri"
Clearly the RE engine thinks the macron ā is a word boundary. Not sure why.
Since you are using a PCRE regex engine (enabled with perl=TRUE) you must pass the (*UCP) flag to the regex so that all shorthands and word boundaries could detect correct symbols/locations inside Unicode text:
proper = function(x){
gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}
proper("Māori")
## [1] "Māori"
See the R demo.
Note that \b is already a zero-width assertion and does not have to be placed into a positive lookbehind, i.e. (?<=\b) = \b.
\b basically denotes a boundary on characters other than [a-zA-Z0-9_] which includes multibyte characters as well unless a modifier called Unicode is set to affect engine behavior.
Unfortunately, gsub in R doesn't have this flag or I couldn't find anything about it in documentations.
A workaround would be:
(?<!\\S)([[:alpha:]])
which, in other hand, obviously fails on āmori.

R utf-8 and replace a word from a sentence based on ending character

I have a requirement where I am working on a large data which is having double byte characters, in korean text. i want to look for a character and replace it. In order to display the korean text correctly in the browser I have changed the locale settings in R. But not sure if it gets updated for the code as well. below is my code to change locale to korean and the korean text gets visible properly in viewer, however in console it gives junk character on printing-
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table format that contains a column with text in korean. example -
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word which ends with "시" character. Then I want to get rid of the "(남동,(지하))" an the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search the end of word, I am using \\b and since I want to replace any word ending with "시" character I am giving it as \\b시.....is this not the way to give? How to take care of () at the end of the sentence.
What would be a good source to refer to for regular expressions.
Is a utf-8 setting needed for the script as well?How to do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

Regular expression to remove specific multi-byte characters in R

I am trying to remove specific multi-byte characters in R.
Multibyte <- "Sungpil_한성필_韓盛弼_Han"
The linguistic structure of Multibyte is "English_Korean_Chinese_English" What I want to remove is the Korean word only or Chinese word only (not both).
A desired result is either :
Sungpil_한성필__Han # Chinese characters were removed.
or
Sungpil__韓盛弼_Han # Korean characters were removed.
Is there a simple way to do it by using gsub? I am only aware of a method to get English-only characters.
gsub("[^A-Za-z_]", "", Multibyte)
[1] "Sungpil___Han"
Answering the question itself, yes, you may do it with a mere gsub using a PCRE regex and Unicode property classes \p{Hangul} for matching Korean chars, and \p{Han} to match Chinese chars:
> Multibyte <- "Sungpil_한성필_韓盛弼_Han"
> gsub("\\p{Hangul}+", "",Multibyte, perl=TRUE)
[1] "Sungpil__韓盛弼_Han"
> gsub("\\p{Han}+", "",Multibyte, perl=TRUE)
[1] "Sungpil_한성필__Han"
See R online demo.
However, if you have a specific structure of the input text, use the other solution.
We can try with sub
sub("[^_]+_([A-Za-z]+)$", "_\\1", Multibyte)
#[1] "Sungpil_한성필__Han"

Resources