Regular expression to remove specific multi-byte characters in R

I am trying to remove specific multi-byte characters in R.
Multibyte <- "Sungpil_한성필_韓盛弼_Han"
The linguistic structure of Multibyte is "English_Korean_Chinese_English". What I want to remove is either the Korean word only or the Chinese word only (not both).
The desired result is either:
Sungpil_한성필__Han # Chinese characters were removed.
or
Sungpil__韓盛弼_Han # Korean characters were removed.
Is there a simple way to do it by using gsub? I am only aware of a method to get English-only characters.
gsub("[^A-Za-z_]", "", Multibyte)
[1] "Sungpil___Han"

Answering the question itself, yes, you may do it with a mere gsub using a PCRE regex and Unicode property classes \p{Hangul} for matching Korean chars, and \p{Han} to match Chinese chars:
> Multibyte <- "Sungpil_한성필_韓盛弼_Han"
> gsub("\\p{Hangul}+", "",Multibyte, perl=TRUE)
[1] "Sungpil__韓盛弼_Han"
> gsub("\\p{Han}+", "",Multibyte, perl=TRUE)
[1] "Sungpil_한성필__Han"
See R online demo.
However, if you have a specific structure of the input text, use the other solution.
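As a hedged sketch of the property-class approach above, the two gsub calls can be wrapped in a small helper. The function name remove_script and its script argument are illustrative and not part of any package. The script name is pasted into a PCRE \p{...} class, so it has to be a script name that PCRE recognises (Hangul and Han are the two used above), and the input is assumed to be UTF-8 encoded.

```r
# Hypothetical helper: remove every run of characters that belong to one
# Unicode script. `script` must be a script name PCRE knows, e.g. "Hangul".
remove_script <- function(x, script) {
  pattern <- paste0("\\p{", script, "}+")
  gsub(pattern, "", x, perl = TRUE)
}

Multibyte <- "Sungpil_한성필_韓盛弼_Han"
remove_script(Multibyte, "Hangul")  # drops the Korean part
remove_script(Multibyte, "Han")     # drops the Chinese part
```

The underscores around the removed word are left untouched, which matches the desired output in the question.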

We can try with sub
sub("[^_]+_([A-Za-z]+)$", "_\\1", Multibyte)
#[1] "Sungpil_한성필__Han"

Related

Regex issue in R when escaping regex special characters with str_extract

I'm trying to extract the status -- in this case the word "Active" from this pattern:
Status\nActive\nHometown\
Using this regex: https://regex101.com/r/xegX00/1, but I cannot get it to work in R using str_extract. It does seem weird to have dual escapes, but I've tried every possible combination here and cannot get this to work. Any help appreciated!
mutate(status=str_extract(df, "(?<=Status\\\\n)(.*?)(?=\\\\)"))
You can use sub in base R -
x <- "Status\nActive\nHometown\n"
sub('.*Status\n(.*?)\n.*', '\\1', x)
#[1] "Active"
If you want to use stringr, here is a suggestion with str_match which avoids using lookahead regex
stringr::str_match(x, 'Status\n(.*)\n')[, 2]
#[1] "Active"
Your regex fails because you tested it against the wrong text.
"Status\nActive\nHometown" is a string literal that denotes (defines, represents) the following plain text:
Status
Active
Hometown
In regular expression testers, you need to test against plain text!
To match a newline, you can use "\n" (i.e. a line feed char, an LF char), or "\\n", a regex escape that also matches a line feed char.
You can use
library(stringr)
x <- "Status\nActive\nHometown\n"
stringr::str_extract(x, "(?<=Status\\n).*") ## => [1] "Active"
## or
stringr::str_extract(x, "(?<=Status\n).*") ## => [1] "Active"
See the R demo online and a correct regex test.
Note that you do not need \n at the end of the pattern: in the ICU regex flavor (used by the stringr regex functions), the . pattern matches any character other than line break characters, so .* alone is enough to match the rest of the line.
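To see why the doubled-backslash pattern from the question cannot match, it may help to check what the string literal really contains. The sketch below uses only base R. regmatches and regexpr with perl = TRUE stand in for str_extract here, which is a substitution for illustration and not the answerer's code.

```r
x <- "Status\nActive\nHometown\n"

# The literal holds real line feed characters, not a backslash followed by "n".
writeLines(x)

# A pattern that looks for a literal backslash finds nothing ...
grepl("Status\\\\n", x, perl = TRUE)   # FALSE
# ... while both spellings of "line feed" match.
grepl("Status\\n", x, perl = TRUE)     # TRUE
grepl("Status\n", x, perl = TRUE)      # TRUE

# Base R version of the lookbehind extraction; in PCRE, . stops at \n.
status <- regmatches(x, regexpr("(?<=Status\\n).*", x, perl = TRUE))
status                                 # "Active"
```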

Remove single character in R

I'm working on sentiment analysis of Arabic text using R, and in the cleaning step I need to remove single-character words.
I used this code to remove them and it works, but I ran into a problem.
For example, here is the data:
R<-("للمدافعين قال شركة وطنية قلت أقنعهم يعاملوننا كمواطنينقال جودتها عالية قلت جيدة غيرها غ")
As you can see, "غ" is a single character:
gsub(" *\\b[[:alpha:]]{1}\\b *", "", R)
[1] "للمدافعين قال شركة وطنية قلت أقنعهم يعاملوننا كمواطنينقال جودتها عالية قلت جيدة غيرها\n"
But when I tried to apply it to the text column of the whole data set, like this:
subdata1$text = gsub("*\\b[[:alpha:]]{1}\\b *", "", subdata1$text)
it doesn't remove anything, and I don't know why.
I hope this is clear.
Thank you.
It seems the [:alpha:] POSIX character class does not work with all Unicode letters in your case.
I suggest using a PCRE pattern:
gsub("(*UCP)\\b\\p{L}\\b", "", R, perl=TRUE)
Here, (*UCP) is required to make the \b word boundary Unicode-aware, and \p{L} matches any Unicode letter. The perl=TRUE argument is required for the pattern to be processed with the PCRE regex engine.
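A minimal sketch of how this could be applied to a whole column, assuming the text is UTF-8 encoded. The helper name remove_single_letters and the sample vector are made up for illustration. The extra whitespace clean-up is an addition, since deleting a letter leaves its surrounding spaces behind.

```r
# Hypothetical helper: drop one-letter words, then tidy the leftover spaces.
remove_single_letters <- function(x) {
  out <- gsub("(*UCP)\\b\\p{L}\\b", "", x, perl = TRUE)
  # Collapse runs of whitespace and strip leading/trailing blanks.
  trimws(gsub("\\s+", " ", out))
}

txt <- c("قلت جيدة غيرها غ", "a quick b test")
remove_single_letters(txt)
```

With the data frame from the question, this would be used as subdata1$text <- remove_single_letters(subdata1$text).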

Remove backslashes from string in R [duplicate]

Here is the string:
> raw.data[27834,1]
[1] "\xff$GPGGA"
I have tried advice from the following two questions, but with no luck:
How to escape a backslash in R?
How to escape backslashes in R string
Does anyone have a different solution from the above questions that might help? The ideal solution would be to remove the "\xff" portion, but for any combination of letters.
There is no backslash in that string. The displayed backslash is an escape marker. This and other features of the entry and display of special characters are described on the ?Quotes help page. You've been given one rather elliptical regex approach to removal. Here are a couple of other approaches, only some of which actually succeed, because the \xff is the first "character" and it's not really legal as an R character:
s <- "\xff$GPGGA"
strsplit(s, "")
#[[1]]
#[1] NA
Warning message:
In strsplit(s, "") : input string 1 is invalid in this locale
substr(s, 1,1)
#Error in substr(s, 1, 1) : invalid multibyte string at '<ff>$GP<47>GA'
gsub('.*([^A-Za-z].*)', '\\1', "\xff$GPGGA")
#[1] "$GPGGA"
?Quotes
gsub('\xff', '', "\xff$GPGGA")
#[1] "$GPGGA"
I think the reason that the regex functions don't choke on that string is that regex is actually a system-mediated process, whereas strsplit and substr are internal R functions.
@RichardScriven posted an example, and when I tried to replicate it, I got a different result, which shows that the mapping to displayed characters is system-specific. I'm on OS X 10.10.1 (Yosemite):
cat('\xff')
ˇ
(I left off the octothorpe (#) that I would normally put in.)
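Since the question asks for a way to strip such bytes in general, one possible route is to work on the raw bytes rather than on characters. This is only a sketch under the assumption that everything worth keeping is plain ASCII; it would also delete legitimate non-ASCII text. charToRaw, rawToChar and validUTF8 are base R functions (validUTF8 needs R 3.3.0 or later), while the helper name keep_ascii is made up.

```r
# Hypothetical helper: keep only bytes below 0x80, i.e. plain ASCII.
# Note: NA inputs are not handled specially in this sketch.
keep_ascii <- function(x) {
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    rawToChar(bytes[as.integer(bytes) < 128L])
  }, character(1), USE.NAMES = FALSE)
}

s <- "\xff$GPGGA"
validUTF8(s)    # FALSE: the byte 0xff never occurs in valid UTF-8
keep_ascii(s)   # "$GPGGA"
```

Because this never asks R to interpret the string as characters, it sidesteps the "invalid multibyte string" errors shown above.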

How do I strip dollar signs ($) from data/ escape special characters in R?

I've been using gsub("toreplace","replacement", myvector) to clean out data in R. While this works for commas and the like, removing "$" has no effect. So if I do gsub("$","",myvector) all the dollar signs remain in place.
I think this is because $ is a special character in R. I tried escaping it "\$" but that yields the same result (no effect). And I couldn't find a resource on escaping special characters in R.
Obviously I should do this in preprocessing. But I was wondering if anyone out there knew how to either a) escape special characters in R b) get rid of pesky $ in R directly. For science.
You have to escape it twice, first for R, second for the regex.
gsub('\\$', '', c("a$a", "bb$"))
[1] "aa" "bb"
See ?Quotes for details on quoting and escaping.
Use fixed = TRUE:
gsub('$', '', c("a$a", "bb$"), fixed = TRUE)
Then you don't need to worry about any special characters. In stringr, this is implemented a little differently:
library(stringr)
str_replace_all(c("$100","ta$ty"), fixed("$"), "")
Thanks to DiggyF and James for the examples!
Escaping characters can be a pain sometimes, but putting the character in square brackets (making it a character class) helps with this:
> gsub("[$]","",c("$100","ta$ty"))
[1] "100" "taty"
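To put the answers above side by side: a bare "$" is the regex end-of-string anchor, so it matches an empty position and nothing visible is replaced, while the three spellings below all treat it as a literal dollar sign. This is just a base R comparison sketch, and the sample vector is made up.

```r
x <- c("$100", "ta$ty", "bb$")

gsub("$", "", x)                           # unchanged: "$" only anchors the end
escaped <- gsub("\\$", "", x)              # escaped once for R, once for the regex
literal <- gsub("$", "", x, fixed = TRUE)  # no regex interpretation at all
bracket <- gsub("[$]", "", x)              # "$" inside a character class is literal

escaped                                    # "100" "taty" "bb"
identical(escaped, literal) && identical(escaped, bracket)  # TRUE
```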
If you have $ followed by a number in a set of data columns (e.g. $400,000), there is an easier way that worked like a charm for me:
library(dplyr)
library(readr)
data %>%
  mutate_at(5:6, parse_number)
where 5:6 are the data column numbers.
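If readr is not available, roughly the same clean-up can be sketched in base R by stripping the dollar sign and the thousands separators before converting. The data frame df, its column names and the helper to_number are invented for illustration, and this assumes US-style formatting (comma as thousands separator, dot as decimal mark).

```r
# Made-up example data with currency stored as text.
df <- data.frame(
  item  = c("a", "b"),
  price = c("$400,000", "$1,250.50"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove "$" and "," and convert to numeric.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

df$price <- to_number(df$price)
df$price   # numeric values 400000 and 1250.5
```

Unlike parse_number, this sketch does not cope with other currency symbols or surrounding text; anything it cannot convert becomes NA with a warning.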
