Regular Expression for deleting emoticons in R [duplicate]

I have a string like:
q <-"<U+00A6> 1000-66329"
I want to remove <U+00A6> and get only 1000 66329.
I tried using:
gsub("\u00a6"," ", q,perl=T)
But it is not removing anything. How should I do gsub in order to get only 1000 66329?

I just want to remove unicode <U+00A6> which is at the beginning of string.
Then you do not need a gsub, you can use a sub with "^\\s*<U\\+\\w+>\\s*" pattern:
q <-"<U+00A6> 1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
Pattern details:
^ - start of string
\\s* - zero or more whitespaces
<U\\+ - a literal char sequence <U+
\\w+ - 1 or more letters, digits or underscores
> - a literal >
\\s* - zero or more whitespaces.
If you also need to replace the - with a space, add a |- alternative and use gsub (since we now expect several replacements, and the replacement must be a space, the same as in akrun's answer):
trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
See the R online demo

If it is always the first character, you can try:
substring("\U00A6 1000-66329", 2)
If R prints the string as <U+00A6> 1000-66329 instead of ¦ 1000-66329, then <U+00A6> is being treated as the literal text "<U+00A6>" rather than as the Unicode character. In that case you can do:
substring("<U+00A6> 1000-66329",9)
Both ways the result is:
[1] " 1000-66329"
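The distinction above also explains why gsub("\u00a6", ...) in the question changed nothing: the string held the literal text, not the character. A minimal sketch that handles both cases, assuming the input starts with either the text marker or the real U+00A6 character; the helper name strip_leading_marker is hypothetical and not taken from the answers:

```r
# Hypothetical helper: drop a leading "<U+XXXX>" text marker or a leading
# U+00A6 character, then turn "-" into a space.
strip_leading_marker <- function(x) {
  x <- sub("^\\s*(<U\\+[0-9A-Fa-f]+>|\u00a6)\\s*", "", x)
  gsub("-", " ", x, fixed = TRUE)
}

q_literal <- "<U+00A6> 1000-66329"  # the marker is eight plain ASCII characters
q_char    <- "\u00a6 1000-66329"    # the marker is one real Unicode character

nchar(q_literal)  # 19
nchar(q_char)     # 12

strip_leading_marker(q_literal)  # "1000 66329"
strip_leading_marker(q_char)     # "1000 66329"
```

Comparing nchar() on the two inputs is a quick way to find out which of the two situations you are in.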

We can also do
trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"

Instead of removing it, you should convert it to the appropriate format ... You have to set your locale to UTF-8 like so:
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
Maybe you will see the following message:
Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In this case you should use stringi::stri_trans_general(x, "zh")
Here "zh" means Chinese. You need to know which language you have to convert to.

Related

Split a character string in R on a single backslash [duplicate]

I am trying to extract the part of the string before the first backslash but I can't seem to get it to work properly.
I have tried multiple ways of getting it to work, based on the manual page for strsplit and after searching online.
In my actual situation the strings are in a dataframe which I get from a database connection but I can simplify the situation with the following:
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=FALSE)
Error in strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3", "\\", fixed = FALSE) :
invalid regular expression '\', reason 'Trailing backslash'
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:\\BLAAT2\\BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=FALSE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
The expected output would also split on the \ between BLAAT1 and 022E:
Thanks in advance
If you use a regex with strsplit, a literal backslash must be written in the pattern as two backslashes, because \ is the regex metacharacter that forms escapes such as \d and \w. R string literals also treat the backslash as an escape character (as in "\r" for a carriage return or "\n" for a newline), so each backslash in the pattern has to be doubled again in the source code.
So "\\" is a single literal \, and the regex \\ that matches a literal backslash is written with 4 backslashes, "\\\\".
Here is a regex that you can use: it splits at \ or at any non-printable character:
strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\|[^[:print:]]",fixed=FALSE)
# [1] "BLAAT1" "E:" "BLAAT2" "BLAAT3"
See IDEONE demo
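The escaping rules above can be checked directly. A small sketch using the string from the question; it also shows that "\022" is an octal escape for a single control character, which is why no backslash-based split can separate BLAAT1 from 022E:

```r
# "\022" is an octal escape: one control character (code 18), not a backslash.
x <- "BLAAT1\022E:\\BLAAT2\\BLAAT3"

nchar("\\")        # 1: the two typed backslashes are one character
utf8ToInt("\022")  # 18

# Splitting on a literal backslash, without and with the regex engine.
by_fixed <- strsplit(x, "\\", fixed = TRUE)[[1]]
by_regex <- strsplit(x, "\\\\")[[1]]
identical(by_fixed, by_regex)  # TRUE

# Everything before the first real backslash (the control character remains).
before_first <- sub("\\\\.*$", "", x)
identical(before_first, by_fixed[1])  # TRUE
```

With fixed = TRUE the pattern is plain text, so only the string-literal level of escaping applies and two backslashes are enough.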

R utf-8 and replace a word from a sentence based on ending character

I am working with a large data set that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is the code I use to change the locale to Korean; the Korean text shows up properly in the viewer, but printing it in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table that contains a column with Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to match the end of a word I am using \\b, and since I want to replace any word ending with the "시" character I wrote \\b시. Is this not the right way to write it? And how do I handle the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
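To see the recursion at work without depending on locale settings, here is a minimal sketch that applies the same pattern to plain ASCII input; the sample strings are made up for illustration:

```r
# Pattern from the answer above, stored once for reuse.
trailing_parens <- "\\s*(\\((?:[^()]++|(?1))*\\))$"

# Two levels of nesting at the end of the string are removed as one unit.
res_nested <- sub(trailing_parens, "", "abc 49 (x,(y))", perl = TRUE)
res_nested  # "abc 49"

# Deeper nesting is handled too, and a group that is not at the end is kept.
res_keep <- sub(trailing_parens, "", "abc (keep) 49 (x,(y,(z)))", perl = TRUE)
res_keep  # "abc (keep) 49"
```

The (?1) call is what lets the pattern cope with any nesting depth; perl = TRUE is required because the default regex engine does not support recursion or possessive quantifiers.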
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

Replace punctuation in string

I want to replace the punctuation in a string by adding '\\' before the punctuation. The reason is I will be using regex on the string afterwards and it fails if there is a question mark without '\\' in front of it.
So basically, I would like to do something like this:
gsub("\\?","\\\\?", x)
Which converts a string "How are you?" to "How are you\\?" But I would like to do this for all punctuation. Is this possible?
You can use gsub with the [[:punct:]] regular expression alias as follows:
> x <- "Hi! How are you today?"
> gsub('([[:punct:]])', '\\\\\\1', x)
[1] "Hi\\! How are you today\\?"
Note that the replacement starts with '\\\\', which inserts one literal backslash (R prints it as \\ when it displays the string), while the '\\1' portion preserves the punctuation mark.
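A short sketch that makes the difference between the printed and the actual string visible, and then uses the escaped string as a pattern. Using perl = TRUE for the final match is an assumption on my part, chosen because PCRE treats a backslash before any punctuation character as a literal:

```r
x <- "Hi! How are you today?"
escaped <- gsub("([[:punct:]])", "\\\\\\1", x)

print(escaped)       # escaped display: "Hi\\! How are you today\\?"
writeLines(escaped)  # actual characters: Hi\! How are you today\?
nchar(escaped) - nchar(x)  # 2: one backslash added per punctuation mark

# The escaped string now matches the original text literally.
grepl(escaped, x, perl = TRUE)  # TRUE

# Alternative that needs no escaping at all: treat the pattern as fixed text.
grepl("today?", x, fixed = TRUE)  # TRUE
```

If the string is only ever used as a literal search term, fixed = TRUE is usually the simpler route, since it skips the escaping step entirely.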

Remove backslashes from string in R [duplicate]

Here is the string:
> raw.data[27834,1]
[1] "\xff$GPGGA"
I have tried advice from the following two questions, but with no luck:
How to escape a backslash in R?
How to escape backslashes in R string
Does anyone have a different solution from the above questions that might help? The ideal solution would be to remove the "\xff" portion, but for any combination of letters.
There is no backslash in that string. The displayed backslash is an escape marker. This and other features of the entry and display of "special situations" are described on the ?Quotes help page. You've been given one rather elliptical regex approach to removal. Here are a couple of other approaches, only some of which actually succeed, because the \xff is the first "character" and it's not really legal as an R character:
s <- "\xff$GPGGA"
strsplit(s, "")
#[[1]]
#[1] NA
Warning message:
In strsplit(s, "") : input string 1 is invalid in this locale
substr(s, 1,1)
#Error in substr(s, 1, 1) : invalid multibyte string at '<ff>$GP<47>GA'
gsub('.*([^A-Za-z].*)', '\\1', "\xff$GPGGA")
#[1] "$GPGGA"
?Quotes
gsub('\xff', '', "\xff$GPGGA")
#[1] "$GPGGA"
#[1] "$GPGGA"
I think the reason that the regex functions don't choke on that string is that regex is actually a system-mediated process, whereas strsplit and substr are internal R functions.
@RichardScriven posts an example, and when I tried to replicate it I got yet a different result, which shows that the mapping to displayed characters is system-specific. I'm on OS X 10.10.1 (Yosemite):
cat('\xff')
ˇ
(I left off the octothorpe (#) that I would normally put in.)
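Since the question asks for something that works for any stray byte, not just \xff, here is a hedged sketch that operates on bytes rather than characters. It assumes the wanted content is printable ASCII, which holds for the "$GPGGA" sentence shown but may not hold for other data:

```r
# Build the string from raw bytes so the example does not depend on how the
# parser treats "\xff" in the current locale.
s <- rawToChar(c(as.raw(0xff), charToRaw("$GPGGA")))

validUTF8(s)  # FALSE: the byte 0xff never occurs in valid UTF-8

# Remove one known byte, matching byte-wise instead of character-wise.
res_one <- gsub(rawToChar(as.raw(0xff)), "", s, fixed = TRUE, useBytes = TRUE)
res_one  # "$GPGGA"

# Assumption: only printable ASCII should survive. Drop every other byte.
res_any <- gsub("[^\\x20-\\x7E]", "", s, perl = TRUE, useBytes = TRUE)
res_any  # "$GPGGA"
```

useBytes = TRUE is what sidesteps the "invalid in this locale" warning and the invalid multibyte string error shown above, because the string is never decoded into characters.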
