I want to determine which elements of my vector contain emoji:
x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
x
# [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"
Related posts only cover other languages, and since they mostly rely on specialized libraries, I couldn't figure out how to translate them to R:
What is the regex to extract all the emojis from a string?
How do I remove emoji from string
replace emoji unicode symbol using regexp in javascript
Regular expression matching emoji in Mac OS X / iOS
remove unicode emoji using re in python
The second looked very promising, but alas (not fixed by supplying perl = TRUE):
x[grepl('[\u{1F600}-\u{1F6FF}]', x)]
Error: invalid \u{xxxx} sequence (line 1)
Similar issues come about from other questions. How can we match emoji in R?
I convert the encoding to UTF-8 so that the UTF-8 values of the emoji in the vector can be compared against the UTF-8 values of all the emoji in the remoji library, which are also UTF-8. I use the stringr library to find the positions of the emoji in the vector; you are free to use grep or any other function.
1st Method:
library(stringr)
xvect = c('😂', 'no', '🍹', '😀', 'no', '😛')
Encoding(xvect) <- "UTF-8"
which(str_detect(xvect, "[^[:ascii:]]"))
# [1] 1 3 4 6
Here positions 1, 3, 4, and 6 are the emoji characters in this case.
Edit:
2nd Method:
Install the remoji package from GitHub via devtools using the commands below. Since we have already converted the items to UTF-8, we can now compare them against the UTF-8 values of all the emoji present in the remoji library. Use trimws to remove the whitespace.
install.packages("devtools")
devtools::install_github("richfitz/remoji")
library(remoji)
emj <- emoji(list_emoji(), TRUE)
xvect %in% trimws(emj)
Output:
which(xvect %in% trimws(emj))
# [1] 1 3 4 6
Neither of the above methods is foolproof: the first assumes that the vector contains no non-ASCII characters other than emoji, and the second relies on the emoji list shipped with remoji. If a given emoji is missing from that library, the last command will yield FALSE instead of TRUE.
Final Edit:
As per the discussion between the OP (@MichaelChirico) and @SymbolixAU (thanks to both of them), the problem comes down to a small typo: the escape needs a capital U. The working regex is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)]. The character class spans the range U+1F300 to U+1F6FF; one can of course widen it if an emoji falls outside this range. This is not the complete list, and over time these ranges may keep growing/changing.
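Putting the pieces together, a minimal sketch of that working approach (assuming a UTF-8 locale; on some Windows setups you may additionally need perl = TRUE):

xvect <- c('😂', 'no', '🍹', '😀', 'no', '😛')

# Capital \U accepts up to 8 hex digits; lowercase \u{...} is what caused the original error
emoji_pattern <- '[\U{1F300}-\U{1F6FF}]'

which(grepl(emoji_pattern, xvect))
# [1] 1 3 4 6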
Related
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in tests below that gsub() and grepl() can identify "�" in both a list and a data frame, but when I try to use the same commands on the real data, both commands fail to identify "n�o" and even "�". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
Are there multiple types of � based on the underlying character? Is there some way to search for or replace � characters that will pick up any instance?
This example shows that gsub() and grepl() both work fine on a list or data frame:
list <- c("n�o ç não", "n�o", "nao", "não")
gsub("�", "ã", list)
grepl("�", list)
library(dplyr)
df <- data.frame(list)
df.new <- df %>%
  mutate(
    sub = gsub("�", "ã", df$list),
    replace = grepl("�", list)
  )
df.new$sub
df.new$replace
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
This same code fails to identify "�" in my real data.
My guess is you are on a Windows machine, which sometimes doesn't play nice with Unicode characters. To recreate the problem I'm parsing your actual post to show you what you can do. I suggest using the stringi library; as a shortcut I'm replacing all of the bad characters with ã, but really you'd want to handle each possible case rather than using a blanket replacement. Check out ?stringi-search-charclass for more info on how to do this. From your original post:
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in tests below that gsub() and grepl() can identify "�" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "n�o" and even "�". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
We get:
library(xml2)
library(stringi)
this_post = "https://stackoverflow.com/questions/66540384/identifying-unicode-replacement-characters-ufffd-or-or-black-diamond-questio#66540384"
read_html(this_post) %>%
  xml_find_all('//*[@id="question"]/div/div[2]/div[1]/p[1]') %>%
  xml_text() %>%
  stri_replace_all_regex("\\p{So}", "ã")
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with ã. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "não". I can see in tests below that gsub() and grepl() can identify "ã" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "não" and even "ã". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
For your original data... see if this works:
stringi::stri_escape_unicode(orig_data) %>%
  stringi::stri_replace_all_regex("\\p{So}", "ã")
One more thing
You cannot grepl for the unknown character directly, because the function has no idea what you are asking it to match; instead try this:
stringi::stri_unescape_unicode("\\u00e3")
[1] "ã"
grepl("\u00e3", stringi::stri_escape_unicode(orig_data), perl = TRUE)
[1] TRUE FALSE FALSE TRUE
EDIT based on comment:
Below is a better starting point, as the "question mark" characters you were getting were likely lost as a result of being forced into ASCII. NOTE that in the example I gave you would merely be replacing ANY/ALL bad characters with "ã". Obviously that isn't a good general approach, but if you read the help docs I am sure you'll see how to blend this approach with the escaping shown above so it works for all your strings.
orig_data$repaired_text <- stringi::stri_enc_toutf8(orig_data$text) %>%
  stringi::stri_replace_all_regex("\\p{So}", "ã")
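If (as often happens with this kind of mojibake) every lost character shows up as the Unicode replacement character U+FFFD, a narrower sketch is to target just that code point instead of every \p{So} symbol; the "ã" substitution is still only correct for the "n�o" cases, so treat this as illustrative:

library(stringi)

x <- c("n\ufffdo", "nao", "não")

# Replace only U+FFFD (the replacement character), leaving other symbols alone
stri_replace_all_fixed(x, "\ufffd", "ã")
# [1] "não" "nao" "não"

# Or just flag the affected entries
stri_detect_fixed(x, "\ufffd")
# [1]  TRUE FALSE FALSE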
I don't want a function. I just want that to be the default way in which the R interpreter displays numbers. Thanks in advance.
While I'm not aware of a way to have your numbers always display with commas, there is a way to turn-off scientific notation for your session and then format your numerical output to show commas as a string.
Here's one possible solution:
# Load library
library(scales)
# Turn-off scientific notation for your R session
options(scipen = 999)
# An example vector of big numbers
x = c(1000000000000000, 2000000000000, 3000000000000)
# Use the scales::comma() function to add commas
# Output will be formatted as a string
comma(x)
#> [1] "1,000,000,000,000,000" "2,000,000,000,000" "3,000,000,000,000"
I have a data set called event_table that has a column titled "c.Comments" which contains strings mostly in english, but has some arabic in some of the comment entries. I want to filter out rows in which the comments entry contains arabic characters.
I read the data into R from an xlsx file and the arabic characters show as UTF-8 "< U+4903 >< U+483d >" (with no spaces) etc.
I've tried using regular expressions to achieve what I want, but the strings I'm trying to match refuse to be filtered out. I've tried all kinds of different regular expressions, but none of them seem to do the trick. I even tried filtering out the literal "<U+" text:
event_table <- event_table %>%
  filter(!grepl("<U+", c.Comments, fixed = TRUE))

event_table <- event_table %>%
  filter(!grepl("<U\\+", c.Comments))
"\x", "\d\d\d\d", and all sorts of other combinations have done nothing for me
I'm starting to suspect that my method of filtering may be the issue rather than the regular expression, so any suggestions would be greatly appreciated.
Arabic chars can be detected with grep/grepl using a PCRE regex like \p{Arabic}:
> df <- data.frame(x=c("123", "abc", "ﺏ"))
> df
x
1 123
2 abc
3 <U+FE8F>
> grepl("\\p{Arabic}", df$x, perl=TRUE)
[1] FALSE FALSE TRUE
In your case, the code will look like
event_table <- event_table %>%
  filter(!grepl("\\p{Arabic}", c.Comments, perl = TRUE))
Look at the ?Syntax help page. The Unicode character associated with <U+4903> may vary with the assumed code page. On my machine the R character would be created with the string "\u4903", but it prints as a Chinese glyph. R's regex functions (as documented in the ?regex help page, which you should refer to now) use the TRE engine by default and PCRE when perl = TRUE.
The pattern in this grepl expression will filter out the printing non-ASCII characters:
grepl("[[:alnum:]]|[[:punct:]]", "\u4903")
[1] FALSE
And I don't think you should be negating that grepl result:
dplyr::filter(data.frame("\u4903"), grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
[1] X.䤃.
<0 rows> (or 0-length row.names)
dplyr::filter(data.frame("\u4903"), !grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
X.䤃.
1 䤃
I have the following backticks in my list's names. Prior lists did not have these backticks.
$`1KG_1_14106394`
[1] "PRDM2"
$`1KG_20_16729654`
[1] "OTOR"
I found out that this is an 'ASCII grave accent' and read the R page on encoding types. However, what should I do about it? I am not clear whether this will affect some functions (such as matching on list names) or whether it is OK to leave it as is.
Encoding help page: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
Thanks!
My understanding (and I could be wrong) is that the backticks are just a means of escaping a list name which otherwise could not be used if left unescaped. One example of using backticks to refer to a list name is the case of a name containing spaces:
lst <- list(1, 2, 3)
names(lst) <- c("one", "after one", "two")
If you wanted to refer to the list element containing the number two, you could do this using:
lst[["after one"]]
But if you want to use the dollar sign notation you will need to use backticks:
lst$`after one`
Update:
I just poked around on SO and found this post, which discusses a similar question to yours. Backticks around variable names are necessary whenever a variable name would otherwise be forbidden. Spaces are one example, but so is using a reserved keyword as a variable name.
if <- 3 # forbidden because if is a keyword
`if` <- 3 # allowed, because we use backticks
In your case:
Your list has an element whose name begins with a number. The rules for variable names in R are pretty lax, but names cannot begin with a number, hence:
1KG_1_14106394 <- 3 # fails, variable name starts with a number
KG_1_14106394 <- 3 # allowed, starts with a letter
`1KG_1_14106394` <- 3 # also allowed, since escaped in backticks
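Regarding the other part of your question (matching on list names): the backticks are only how R prints and parses a non-syntactic name, not part of the name itself, so name-based lookups work as usual. A quick sketch with your example names:

lst <- list("PRDM2", "OTOR")
names(lst) <- c("1KG_1_14106394", "1KG_20_16729654")

lst[["1KG_1_14106394"]]            # character indexing: no backticks needed
# [1] "PRDM2"

lst$`1KG_1_14106394`               # dollar notation: backticks needed
# [1] "PRDM2"

"1KG_1_14106394" %in% names(lst)   # matching on names works normally
# [1] TRUE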
Is there a way to convert special letters in a text to English letters in R? For example:
Æ -> AE
Ø -> O
Å -> A
Edit: The reason I need this conversion is that R can't see that these two words are the same:
stringdist('oversættelse','oversaettelse')
[1] 2
grepl('oversættelse','oversaettelse')
FALSE
Some people tend to write using only English characters and others don't. In order to compare texts I need to have them in the same format.
I recently had a very similar problem and was pointed to the question Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?
Basically, the gist is that for many of these special characters more than one Unicode representation exists, which will mess with text comparisons. The suggested solution is to use the stringi package function stri_trans_nfc; the package also has a function stri_trans_general that supports transliteration, which might be exactly what you need.
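For the concrete letters in the question, stri_trans_general with the "Latin-ASCII" transliterator looks like a good fit (a sketch; it assumes the stringdist package from the question is installed, and it is worth spot-checking the output on your real data):

library(stringi)

stri_trans_general("Æ Ø Å oversættelse", "Latin-ASCII")
# [1] "AE O A oversaettelse"

# After transliteration the two spellings compare as equal
stringdist::stringdist(stri_trans_general("oversættelse", "Latin-ASCII"), "oversaettelse")
# [1] 0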
You can use chartr
x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"
And/or gsub
y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"