I have a set of Twitter emojis:
description r.encoding unicode width
shootingstar <f0><9f><8c><a0> U+1F320 16
wrappedgift <f0><9f><8e><81> U+1F381 16
yellowheart <f0><9f><92><9b> U+1F49B 16
femalesign <e2><99><80> U+2640 12
frowningface <e2><98><b9> U+2639 12
And a set of tweets:
[1] "Ring<f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f> Order today and have it within 3 days<e2><9d><a3><ef><b8><8f>\n"
[2] "Really I have been thinking <f0><9f><a4><94> about surfing <f0><9f><8f><84><e2><80><8d><e2><99><80><ef><b8><8f>"
I'm trying to get the names of the emojis in these texts using:
vec <- str_count(string, matchto) # string is the text, matchto is r.encoding
matches <- which(vec != 0)
In some cases this gives the wrong result, specifically for emojis that are not in my emoji set.
For example, the "femalesign" emoji is <e2><99><80>.
My code reports "female sign" for both tweets, but when I checked them, the users are actually using "woman biking" and "woman surfing", which are not in my emoji dataset:
woman biking: <f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f>
woman surfing: <f0><9f><8f><84><e2><80><8d><e2><99><80><ef><b8><8f>
So the output I was expecting was:
NA
NA
Is there a solution? Is there a specific pattern that can help?
I was wondering if there is a pattern/regex that can recognise whether a byte sequence such as "<f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f>" belongs to an emoji, regardless of which emoji it is.
There are more than 2,000 emojis, so it would be very time consuming to gather information on all of them: I couldn't find a comprehensive file that maps emoji names to their UTF-8 encodings, and emojis are updated frequently.
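To make the expected behaviour concrete, here is the kind of matching I have in mind, written as a rough sketch (my own guess rather than working code; it assumes a reasonably recent stringi/ICU, and that the emoji set is stored as actual characters built from the U+ code points above):
library(stringi)
emoji_set <- c(shootingstar = "\U0001F320", wrappedgift = "\U0001F381",
               yellowheart = "\U0001F49B", femalesign = "\u2640",
               frowningface = "\u2639")
tweets <- c("Ring\U0001F6B4\u200D\u2640\uFE0F Order today and have it within 3 days",
            "Really I have been thinking \U0001F914 about surfing \U0001F3C4\u200D\u2640\uFE0F")
# Split each tweet into grapheme clusters: a ZWJ sequence such as woman biking
# stays together as one cluster, so the bare female sign inside it never matches.
clusters <- stri_split_boundaries(tweets, type = "character")
lapply(clusters, function(cl) names(emoji_set)[emoji_set %in% cl])
# Should give character(0) for both tweets, i.e. none of my five emojis, which is the result I expect.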
Good morning my hero!
I have a list of journal titles in English, Spanish and Portuguese that I want to convert to their abbreviated form. The official abbreviation dictionary for journal titles is the List of Title Word Abbreviations found on the ISSN website.
# example of my data
journal_names <- c("peste revista psicanalise sociedade", "abanico veterinario",
                   "abcd arquivos brasileiros cirurgia digestiva sao paulo",
                   "academo asuncion", "accion psicologica", "acimed",
                   "acta academica", "acta amazonica", "acta bioethica",
                   "acta bioquimica clinica latinoamericana")
I have split each title into a list of single words. So currently I have a list of lists, where each title is a list of its individual words.
[[1]]
[1] "peste" "revista" "psicanalise" "sociedade"
[[2]]
[1] "abanico" "veterinario"
Once I remove the stop words (as seen above), I need to match any relevant words to the suffixes or prefixes in the LTWA and then convert them to their abbreviations. I have converted the LTWA entries into regular expressions so that they can be matched easily with a package like stringi.
# This is an excerpt from the data frame I created from the LTWA.
The ABBREVIATIONS_NA column replaces "n.a." with the original word, and the REXP column holds the prefix/suffix as a regular expression:
WORDS,ABBREVIATIONS,LANGUAGES,REXP,ABBREVIATIONS_NA
proofreader,proofread.,eng,proofreader,proofread.
prophylact-,prophyl.,eng,^prophylact.*,prophyl.
propietario,prop.,spa,propietario,prop.
propriedade,propr.,por,propriedade,propr.
prostético,prostét.,spa,prostético,prostét.
protecção,prot.,por,protecção,prot.
proteccion-,prot.,spa,^proteccion.*,prot.
prototyping,prototyp.,eng,prototyping,prototyp.
provisional,n.a.,eng,provisional,provisional
provisóri-,n.a.,por,^provisóri.*,provisóri-
proyección,proyecc.,spa,proyección,proyecc.
psicanalise,psicanal.,por,psicanalise,psicanal.
psicoeduca-,psicoeduc.,spa,^psicoeduca.*,psicoeduc.
psicosomat-,psicosom.,spa,^psicosomat.*,psicosom.
psicotecni-,psicotec.,spa,^psicotecni.*,psicotec.
psicoterap-,psicoter.,spa,^psicoterap.*,psicoter.
psychedelic,n.a.,eng,psychedelic,psychedelic
psychoanal-,psychoanal.,eng,^psychoanal.*,psychoanal.
psychodrama,n.a.,eng,psychodrama,psychodrama
psychopatha,n.a.,por,psychopatha,psychopatha
pteridolog-,pteridol.,eng,^pteridolog.*,pteridol.
publicitar-,public.,spa,^publicitar.*,public.
puericultor,pueric.,spa,puericultor,pueric.
Puerto Rico,P. R.,spa,Puerto Rico,P. R.
The search and conversion need to be done from the largest prefix/suffix to the smallest, and words that have already been processed cannot be processed again.
The issue: I would like to convert each title word to its proper abbreviation. However, a word like 'latinoamericano' should only match the prefix 'latinoameri-' and be converted to 'latinoam.'; the problem is that it also matches 'latin-' and then gets converted to 'latin.'. How can I make sure each word is only processed once?
Also note that my LTWA database only has about 12,000 words in total, so there will be words that don't have a match at all.
I have gotten up to this point, but I'm not sure where to go from here to accomplish this. So far I have only come up with very clunky solutions that do not work perfectly.
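To give a sense of the direction I mean, a rough, untested sketch of the longest-pattern-first idea (ltwa stands for the LTWA data frame excerpted above, and abbreviate_word is just an illustrative name, not something from my actual code):
library(stringi)
abbreviate_word <- function(word, ltwa) {
  ltwa <- ltwa[order(-nchar(ltwa$WORDS)), ]            # longest LTWA entry first
  hit  <- which(stri_detect_regex(word, ltwa$REXP))    # rows whose pattern matches
  if (length(hit) == 0) return(word)                   # no match at all: keep the word
  ltwa$ABBREVIATIONS_NA[hit[1]]                        # use only the first (longest) match
}
vapply(c("peste", "revista", "psicanalise", "sociedade"),
       abbreviate_word, character(1), ltwa = ltwa)
# Each word is looked up once: "psicanalise" should become "psicanal.", while
# words without an LTWA entry are returned unchanged.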
Thank you!
The point is that I'm trying to remove some weird words (like <U+0001F399><U+FE0F>) from my text corpus to do some Twitter analysis.
There are many words like that which I just can't remove using tm_map(X, removeWords).
I have plenty of tweets aggregated in a dataset. Then I use the following code:
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))
If I try swapping those weird words for regular ones (like "life" or "animal") that also appear in my dataset, the regular ones get removed easily.
Any idea how to solve this?
As these are Unicode characters, you need to figure out how to properly enter them in R.
The escape syntax for Unicode in R is not <U+xxxx>, but rather something like \Uxxxxxxxx. See the manual for details. (I don't use R myself; I find it too inconsistent, and this is an example of such an inconsistency: the string is apparently printed differently from what R accepts as input.)
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD","\uFE0F","\uFE0E"))
NOTE: You use a backslash and lowercase u followed by 4 hex digits to specify a character from Unicode plane 0 (the BMP); you must use uppercase U followed by 8 hex digits for the other planes (which is where most emoji live, given you are working with tweets).
BTW, see "Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it?" for why you are getting the FE0F in there: it appears when the user wants a particular variation of an emoji, e.g. the coloured presentation. FE0E is its partner (requesting the plain-text glyph).
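If you would rather drop every such character than list them one by one, a blunt workaround (just a sketch, not part of the question's code, and note it also strips accented letters, not only emoji) is to remove all non-ASCII characters from the corpus:
library(tm)
# Remove every non-ASCII character from each document in the corpus.
corpus_tweets <- tm_map(corpus_tweets,
                        content_transformer(function(x) gsub("[^\x01-\x7F]", "", x)))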
I scraped tweets from the Twitter API with the rtweet package, but I don't know how to work with the text because the emojis come in the form '\U0001f600', and all the regex code I have tried so far has failed. I can't get anything out of it.
For example
text = 'text text. \U0001f600'
grepl('U',text)
gives me FALSE.
grepl('000',text)
also gives me FALSE.
Another problem is that they are often stuck to the preceding word (for example "i am here\U0001f600").
So how can I make R recognize emojis in that format? What can I put in grepl that will return TRUE for any emoji of that format?
In R there tends to be a package for most things, and in this case it is textclean, which brings along the lexicon package with its many dictionaries. textclean gives you two functions you can use: replace_emoji and replace_emoji_identifier.
text = c("text text. \U0001f600", "i am here\U0001f600")
# replace emoji with identifier:
textclean::replace_emoji_identifier(text)
[1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis "
# replace emoji with text representation
textclean::replace_emoji(text)
[1] "text text. grinning face " "i am here grinning face "
Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for their presence, as in your expected output:
grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
[1] TRUE TRUE
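And if you do want the sentiment scoring mentioned above rather than just presence, a minimal sketch (all sentimentr options left at their defaults) could be:
library(sentimentr)
# Score sentiment after the emoji have been converted to word descriptions.
sentiment(textclean::replace_emoji(text))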
Your problem is that the single backslash \ in your code:
text = 'text text. \U0001f600'
produces the actual emoji character. If you want the string to contain the literal text \U0001f600, it really should be \\:
text = 'text text. \\U0001f600'
I had a similar experience using the rtweet library.
In my case the tweets contain some Unicode code points, not just emoji, in the following format: "some text<U+code-point>". What I did was "convert" the code point to its graphic representation:
library(stringi)
#I use gsub() to replace "<U+code-point>" with "\\ucode-point", the appropriate format
# And stri_unescape_unicode() to un-escape all Unicode sequences
stri_unescape_unicode(gsub("<U\\+(\\S+)>",
"\\\\u\\1", #replace by \\ucode-point
"some text with #COVID<U+30FC>19"))
#[1] "some text with #COVIDー19"
If the Unicode code point is not delimited as in my case (<>), you should change the regular expression from "<U\\+(\\S+)>" to "U(\\S+)". Be careful here: this only works correctly if a space character appears after the code point. If words are attached to the code point both before and after, the pattern must be more specific and indicate the number of characters that compose it, for example "U(....)".
You can refine this regular expression using character classes, for example by matching only hexadecimal digits: "U([A-Fa-f0-9]+)".
Note that the emoji will not be displayed in the RStudio console; you can still apply this function, but to actually see the emoji you need an R library for that purpose. Other characters, however, are displayed: "#COVID<U+30FC>19" appears in the RStudio console as "#COVIDー19".
Edit: Actually "\\S+" didn't work for me when there were consecutive Unicode code points like "<U+0001F926><U+200D><U+2642>". In that case it only replaced the first occurrence; I didn't dig into why, and simply changed my regular expression to "<U\\+([A-Fa-f0-9]+)>", where "[A-Fa-f0-9]" matches hexadecimal digits.
(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrices with columns of values, so what I'm really after is the indices of the matches, so I can tell which row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a vector of matches. For i = 2 it would return 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio does not seem to be handling the Unicode characters properly: when I import the sets and view them, they show the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R recognizes in some contexts that it should display Unicode characters, while in others it sticks to ASCII, or at least some more limited set; I'm not entirely sure.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple of reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, yet none of them can be searched. (That is, α and ß didn't turn into their Unicode keys, but they're also not searchable by themselves. So I'd have to turn them into their keys, then alter those keys into a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format, and even if I could, a lot of my entries include characters that would need to be escaped (e.g. parentheses), so having to drop the fixed=T argument would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) Unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (Related to that second question: is there a method to convert a Unicode character to its key? I can't seem to find anything that does that specific job.)
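(On that last sub-question: the closest thing I have found to "character to key" is stringi::stri_escape_unicode, mentioned here only in case it helps frame an answer; I haven't checked how it behaves for every kind of character.)
stringi::stri_escape_unicode("\u03b1\u03b2\u03b3")
# [1] "\\u03b1\\u03b2\\u03b3"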
I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have any function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do it:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R and you can find a good and short intro to it here or you can check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
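For Spanish, for example, the pattern is the same as above (untested here; texto stands for any Spanish character string):
install.packages("openNLPmodels.es") ## Installs the model files for Spanish
library(openNLPmodels.es)
sentDetect(texto, language = "es")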
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex husband of Mrs. Johson." can contain full stops).
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
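If you also need the sentences themselves, the same end-of-sentence idea can be used to split the text (a rough heuristic only; like any simple regex it will trip over abbreviations such as "Dr."):
strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
# Returns the two sentences: "Hello world!!" and "Here are two sentences for you..."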