Translation and mapping of emoticons encoded as UTF-8 code in text - r

I am working with text which includes emoticons. I need to be able to find these and replace them with tags which can be analysed. How can I do this?
> main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
> grepl("\xf0", main$text[[4]])
[1] FALSE
I tried the above. Why did it not work? I also tried iconv into ASCII; the resulting byte encoding could then be searched with grepl.
> abc<-iconv(main$text[[4]], "UTF-8", "ASCII", "byte")
> abc
[1] "Spread d wrd<f0><9f><98><8e>"
> grepl("<f0>", abc)
[1] TRUE
I really do not understand what I did here and what happened. I also do not understand how the above conversion introduced \n characters into the text.
I also did not know how to encode these, once they were searchable. I found a list here, but it fell short (for example, "U+E00E" - <ee><80><8e> was not in the list). Is there a comprehensive list for such a mapping?
ADDENDUM
After a lot of trial and error, here is what I realised. There are two kinds of encodings for the emojis in the data. One is in the form of bytes, which is searchable by grepl("\x9f", ...., useBytes=T), like the main$text[[4]], and another (main$text[[6]]) which is searchable as the unicode character without useBytes=T, i.e. grepl("\ue00e",....). Even the way they are displayed in View() and when called on the console is different. I am absolutely confused as to what is going on here.
main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
main[4,]
timestamp fromMe remoteResource remoteResourceDisplayName type
b 2014-08-30 02:58:58 FALSE 112233#s.whatsapp.net ABC text
text date
b Spread d wrd<f0><U+009F><U+0098><U+008E> 307114
main$text[[6]]
[1] ""
main[6,]
timestamp fromMe remoteResource remoteResourceDisplayName type text
b 2014-08-30 02:59:17 FALSE 12345#s.whatsapp.net XYZ text <U+E00E>
date
b 307114
grepl("\ue00e", main$text[[6]])
[1] TRUE
grepl("<U+E00E>", main$text[[6]])
[1] FALSE
grepl("\u009f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]], fixed=T)
[1] FALSE
grepl("\x9f", main$text[[4]], useBytes=T)
[1] TRUE
The maps I have are also different. The one for the bytes case works well. But the other one does not, since I am unable to create the "\ue00e" required to search. Here is a sample of the other map, corresponding to the Softbank <U+E238>.
emmm[11]
[1] "E238"

Searching for a single byte of a multi-byte UTF-8 encoded character only works if done with useBytes = TRUE. The fact that "\xf0" here is a part of a multi-byte character is obscured by the less than perfect Unicode support of R on Windows (used in the original example, I presume). How to match by bytes:
foo <- "\xf0\x9f\x98\x8e" # U+1F60E SMILING FACE WITH SUNGLASSES
Encoding(foo) <- "UTF-8"
grepl("\xf0", foo, useBytes = TRUE)
I don't see much use for matching one byte, though. Searching for the whole character would then be:
grepl(foo, paste0("Smiley: ", foo, " and more"), useBytes = TRUE)
Valid ASCII codes correspond to integers 0–127. The iconv() conversion to ASCII in the example replaces any invalid byte 0xYZ (corresponding to integers 128–255) with the literal text <yz> where y and z are hexadecimal digits. As far as I can see, it should not introduce any newlines ("\n").
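That byte substitution can be verified directly; a minimal sketch using the smiley from the question:

```r
# U+1F60E encoded as four UTF-8 bytes; bytes >= 0x80 are not valid ASCII
foo <- "Smiley: \xf0\x9f\x98\x8e"
Encoding(foo) <- "UTF-8"
iconv(foo, "UTF-8", "ASCII", sub = "byte")
# [1] "Smiley: <f0><9f><98><8e>"
```

No newline characters are introduced by this conversion, consistent with the point above.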
Using the character list linked to in the question, here is some example code which performs one kind of "emoji tagging" to input strings, namely replacing the emoji with its (slightly formatted) name.
emoji_table <- read.csv2("https://github.com/today-is-a-good-day/Emoticons/raw/master/emDict.csv",
                         stringsAsFactors = FALSE)
emoji_names <- emoji_table[, 1]
text_bytes_to_raw <- function(x) {
  loc <- gregexpr("\\x", x, fixed = TRUE)[[1]] + 2
  as.raw(paste0("0x", substring(x, loc, loc + 1)))
}
emoji_raw <- lapply(emoji_table[, 3], text_bytes_to_raw)
emoji_utf8 <- vapply(emoji_raw, rawToChar, "")
Encoding(emoji_utf8) <- "UTF-8"
gsub_many <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  x2 <- x
  for (k in seq_along(patterns)) {
    x2 <- gsub(patterns[k], replacements[k], x2, useBytes = TRUE)
  }
  x2
}
tag_emojis <- function(x, codes, names) {
  gsub_many(x, codes, paste0("<", gsub("[[:space:]]+", "_", names), ">"))
}
each_tagged <- tag_emojis(emoji_utf8, emoji_utf8, emoji_names)
all_in_one <- tag_emojis(paste0(emoji_utf8, collapse = ""),
                         emoji_utf8, emoji_names)
stopifnot(identical(paste0(each_tagged, collapse = ""), all_in_one))
As to why U+E00E is not on that emoji list, I don't think it should be. This code point is in a Private Use Area, where character mappings are not standardized. For comprehensive Unicode character lists, you cannot find a better authority than the Unicode Consortium, e.g. Unicode Emoji. Additionally, see convert utf8 code point strings like <U+0161> to utf8.
Edit after addendum
When there is a string of exactly four hexadecimal digits representing a Unicode code point (let's say "E238"), the following code will convert the string to the corresponding UTF-8 representation, the occurrence of which can be checked with the grep() family of functions. This answers the question of how to "automatically" generate the character that can be manually created by typing "\uE238".
library(stringi)
hex4_to_utf8 <- function(x) {
  stopifnot(grepl("^[[:xdigit:]]{4}$", x))
  stringi::stri_enc_toutf8(stringi::stri_unescape_unicode(paste0("\\u", x)))
}
foo <- "E238"
foo_utf8 <- hex4_to_utf8(foo)
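If stringi is not available, the same conversion can be sketched in base R with strtoi() and intToUtf8() (the function name hex4_to_utf8_base is my own):

```r
hex4_to_utf8_base <- function(x) {
  stopifnot(grepl("^[[:xdigit:]]{4}$", x))
  # Parse the four hex digits as an integer code point, then build the character
  enc2utf8(intToUtf8(strtoi(x, base = 16L)))
}
hex4_to_utf8_base("E238")  # same character as typing "\uE238"
```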
The value of the useBytes option should not matter in the following grep() call. In the previous code example, I used useBytes = TRUE as a precaution, as I'm not sure how well R on Windows handles Unicode code points U+10000 and larger (five or six digits). Clearly it cannot properly print such codepoints (as shown by the U+1F60E example), and input with the \U + 8 digits method is not possible.
The example in the question shows that R (on Windows) may print Unicode characters with the <U+E238> notation rather than as \ue238. The reason seems to be format(), also used in print.data.frame(). For example (R for Windows running on Wine):
> format("\ue238")
[1] "<U+E238>"
When tested in an 8-bit locale on Linux, the same notation is already used by the default print method. One must note that in this case, this is only a printed representation, which is different from how the character is originally stored.

Create decimal values of character hex

With the following code I'm reading a text file and displaying one column on the screen.
externalData <- read.delim("testdata.txt", header = FALSE, sep = "#")
i = 1;
while (i < 11) {
  appel <- as.character(externalData[i, 3, 1])
  i = i + 1;
  print(appel)
}
The output looks like this:
I'm trying to convert these values from hexadecimal to decimal.
I've tried the following:
strtoi(c(appel))
but this doesn't seem to work: it only removes the quotation marks from the first one and the last one, and sets everything in between to NA (probably because there are letters in them).
Here are 3 ways to convert hexadecimal(character) to decimal(numeric).
x <- c("158308", "bb1787", "853f91")
# 1.
strtoi(x, base = 16L)
# 2.
as.integer(as.hexmode(x))
# 3.
as.integer(paste0("0x", x))
# more general edition:
# as.integer(ifelse(!grepl("^0x", x), paste0("0x", x), x))
From ?strtoi
Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.
Arguments
x a character vector, or something coercible to this by as.character.
base an integer which is between 2 and 36 inclusive, or zero (default).
1. Create a minimal reproducible example:
appel <- c("158308", "d8db89")
2. Solution using strtoi's base argument:
strtoi(appel, base = 16)
Returns:
[1] 1409800 14211977
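strtoi() is also vectorised, and any input that is not valid in the given base simply becomes NA (the third value here is a made-up invalid string):

```r
appel <- c("158308", "d8db89", "xyz")
strtoi(appel, base = 16L)
# [1]  1409800 14211977       NA
```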

gsub to remove unwanted precision

Could anyone please help to achieve the following with gsub in R?
input string: a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02
desired output: a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200, g=850.02
Practically: remove the redundant 0s after the decimal point if they are all just 0s, but don't remove them if a real fractional part exists.
I couldn't get this to work using gsub alone, but we can try splitting your input vector on comma, and then using an apply function with gsub:
x <- "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
input <- sapply(unlist(strsplit(x, ",")), function(x) gsub("(?<=\\d)\\.$", "", gsub("(\\.[1-9]*)0+$", "\\1", x), perl=TRUE))
input <- paste(input, collapse=",")
input
[1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Demo
I actually make two calls to gsub. The first call strips off all trailing zeroes appearing after a decimal point, should the number have one. The second call removes the stray decimal point left behind by a number like 5.00, which the first call reduces to 5. rather than the 5 we want.
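The two-step logic can be traced on a single token (a minimal sketch using the same two patterns):

```r
# Step 1: drop trailing zeros that follow a decimal point
step1 <- gsub("(\\.[1-9]*)0+$", "\\1", "5.00")         # "5."
# Step 2: drop the now-stray trailing decimal point
step2 <- gsub("(?<=\\d)\\.$", "", step1, perl = TRUE)  # "5"
```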
To remove trailing 0s after the decimal, try this:
EDIT Forgot 5.00
x = c('5.00', '0.500', '120', '0.0003', '0.02', '5.20', '1200', '850.02')
gsub("\\.$", "", gsub("(\\.(|[1-9]+))0+$", "\\1", x))
# [1] "5" "0.5" "120" "0.0003" "0.02" "5.2" "1200" "850.02"
HT #TimBiegeleisen: I misread the input as a vector of strings. For a single-string input, convert it to a vector of strings, which you can call gsub on, then collapse the output back to a single string (note the split must be on "," so that every number ends its own piece and the $-anchored patterns can apply):
x <- "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
paste(
  gsub("\\.$", "", gsub("(\\.(|[1-9]+))0+$", "\\1",
       unlist(strsplit(x, ",")))),
  collapse=",")
[1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
gsub is a text processing tool that works on character level. It’s ignorant of any semantic interpretation.
However, you are specifically interested in manipulating this semantic interpretation, namely, the precision of numbers encoded in your text.
So use that: parse the numbers in the text, and write them out with the desired precision:
parse_key_value_pairs = function (text) {
  parse_pair = function (pair) {
    pair = strsplit(pair, "\\s*=\\s*")[[1]]
    list(key = pair[1], value = as.numeric(pair[2]))
  }
  pairs = unlist(strsplit(text, "\\s*,\\s*"))
  structure(lapply(pairs, parse_pair), class = 'kvp')
}
as.character.kvp = function (x, ...) {
  format_pair = function (pair) {
    sprintf('%s = %g', pair$key, pair$value)
  }
  pairs = vapply(x, format_pair, character(1))
  paste(pairs, collapse = ", ")
}
And use it as follows:
text = "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
parsed = parse_key_value_pairs(text)
as.character(parsed)
This uses several interesting features of R:
For text processing, it still uses regular expressions (inside strsplit).
To process multiple values, use lapply to apply a parsing function to parts of the string in turn
To reconstruct a key–value pair, format the string using sprintf. sprintf is a primitive text formatting tool adapted from C. But it’s fairly universal and it works OK in our case.
The parsed value is tagged with an S3 class name. This is how R implements object orientation.
Provide an overload of the standard generic as.character for our type. This means that any existing function that takes an object and displays it via as.character can deal with our parsed data type. In particular, this works with the {glue} library:
> glue::glue("result: {parsed}")
result: a = 5, b = 120, c = 0.0003, d = 0.02, e = 5.2, f = 1200, g = 850.02
This is probably not the most ideal solution, but for educational purposes, here is one way to call gsub only once using conditional regex:
x = 'a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02'
gsub('(?!\\d+(?:,|$))(\\.[0-9]*[1-9])?(?(1)0+\\b|\\.0+(?=(,|$)))', '\\1', x, perl = TRUE)
# [1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Notes:
(?!\\d+(?:,|$)) is a negative lookahead that fails the match when what lies ahead is one or more digits followed by a comma or the end of the string. This effectively excludes plain integers (like 120) from the overall regex match.
(\\.[0-9]*[1-9])? matches a literal dot, a digit zero or more times and a digit (except zero). The ? makes this pattern optional, and is crucial to how the conditional handles the back reference.
(?(1)0+\\b|\\.0+(?=(,|$))) is a conditional with the logic (?(IF)THEN|ELSE)
(1) is the (IF) part which checks if capture group 1 matched. This refers to (\\.[0-9]*[1-9])
0+\\b is the (THEN) part which matches only if (IF) is TRUE. In this case, only if (\\.[0-9]*[1-9]) matched will the regex try to match one or more zeros followed by a word boundary
\\.0+(?=(,|$)) is the (ELSE) part which matches only if (IF) is FALSE. In this case, only if (\\.[0-9]*[1-9]) didn't match will the regex try to match a literal dot and one or more zeros, followed (via lookahead) by a comma or end of string
If we put 2. and 3. together, we get either (\\.[0-9]*[1-9])0+\\b or \\.0+(?=(,|$))
\\1 as a replacement therefore turns either (\\.[0-9]*[1-9])0+\\b to the pattern matched by (\\.[0-9]*[1-9]) or \\.0+(?=(,|$)) to blank. which translates to:
5.20 to 5.2 for the former
5.00 to 5 and 1200.0 to 1200 for the latter

Write variable length strings in single console line

I use RStudio. Within a loop, I want to display in a single console line a string of variable length. I am using cat(). If I use \n, different lines are written (not what I want):
A <- c("AAAAA","BBB","C")
for (i in 1:3){cat(A[i],"\n"); Sys.sleep(1)}
AAAAA
BBB
C
The use of \r works well when names are of the (nearly) same length, but in this case, the result is again not what I want:
for (i in 1:3){cat(A[i],"\r"); Sys.sleep(1)}
C B A
as only the string "C" should remain when the loop is finished.
I have also tried deleting many spaces with \b, but the length difference is large and many times the information is written one line above the current console line.
Is there a simple way to do this? (base R preferred)
Edit: What I want is that, in a single line, first the string "AAAAA" appears. After one second, only the string "BBB" should appear (not "BBB A"). After one second, only the string "C" should appear (not "C B A").
Your current method works if you first pad all the strings to the length of the longest one:
A <- c("AAAAA","BBB","C")
max_length = max(nchar(A))
A_filled = stringr::str_pad(A, max_length, side = "right")
for (i in 1:3){cat(A_filled[i],"\r"); Sys.sleep(1)}
To pad the strings in base R you can use sprintf:
max_length = max(nchar(A))
pad_format = paste0("%-", max_length, "s")
A_filled = sprintf(pad_format, A)
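A quick check that the sprintf() padding gives every string the same width (values taken from the question):

```r
A <- c("AAAAA", "BBB", "C")
A_filled <- sprintf("%-5s", A)  # left-justify in a field of width 5
nchar(A_filled)
# [1] 5 5 5
```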
I tend to believe you want all the strings printed. This is a base R solution:
A <- c("AAAAA","BBB","C")
x <-formatC(A, width = -max(nchar(A)))
for (i in 1:3){cat("\r",x[i]); Sys.sleep(1)}
Hopefully, a simple cat works fine:
> for (i in 1:3){cat(A[i], " "); Sys.sleep(1)}
AAAAA BBB C
> for (i in 1:3){cat(A[i]); Sys.sleep(1)}
AAAAABBBC

Integers/expressions as names for elements in lists

I am trying to understand names, lists and lists of lists in R. It would be convenient to have a way to dynamically label them like this:
> ll <- list("1" = 2)
> ll
$`1`
[1] 2
But this is not working:
> ll <- list(as.character(1) = 2)
Error: unexpected '=' in "ll <- list(as.character(1) ="
Neither is this:
> ll <- list(paste(1) = 2)
Error: unexpected '=' in "ll <- list(paste(1) ="
Why is that? Both paste() and as.character() are returning "1".
The reason is that paste(1) is a function call that evaluates to a string, not a string itself.
The R Language Definition says this:
Each argument can be tagged (tag=expr), or just be a simple expression.
It can also be empty or it can be one of the special tokens ‘...’, ‘..2’, etc.
A tag can be an identifier or a text string.
Thus, tags can't be expressions.
However, if you want to set names (which are just an attribute), you can do so with structure, eg
> structure(1:5, names=LETTERS[1:5])
A B C D E
1 2 3 4 5
Here, LETTERS[1:5] is most definitely an expression.
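Base R's setNames() does the same job and likewise accepts an arbitrary expression for the names:

```r
# Name a one-element list with the result of an expression
ll <- setNames(list(2), as.character(1))
ll
# $`1`
# [1] 2
```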
If your goal is simply to use integers as names (as in the question title), you can type them in with backticks or single- or double-quotes (as the OP already knows). They are converted to characters, since all names are characters in R.
I can't offer a deep technical explanation for why your later code fails beyond "the left-hand side of = is not evaluated in that context (of enumerating items in a list)". Here's one workaround:
mylist <- list()
mylist[[paste("a")]] <- 2
mylist[[paste("b")]] <- 3
mylist[[paste("c")]] <- matrix(1:4,ncol=2)
mylist[[paste("d")]] <- mean
And here's another:
library(data.table)
tmp <- rbindlist(list(
  list(paste("a"), list(2)),
  list(paste("b"), list(3)),
  list(paste("c"), list(matrix(1:4,ncol=2))),
  list(paste("d"), list(mean))
))
res <- setNames(tmp$V2,tmp$V1)
identical(mylist,res) # TRUE
The drawbacks of each approach are pretty serious, I think. On the other hand, I've never found myself in need of richer naming syntax.

R: serialize base64 encode/decode of text not exactly matching

In my previous question about using serialize() to create a CSV of objects, I got a great answer from jmoy, who recommended base64 encoding of my serialized text. That was exactly what I was looking for. Oddly enough, when I try to put this into practice, I get results that look right but don't exactly match what I ran through the serialize/encoding process.
The example below takes a list with 3 vectors and serializes each vector. Then each vector is base64 encoded and written to a text file along with a key. The key is simply the index number of the vector. I then reverse the process and read each line back from the csv. At the very end you can see some items don't exactly match. Is this a floating point issue? Something else?
require(caTools)
randList <- NULL
set.seed(2)
randList[[1]] <- rnorm(100)
randList[[2]] <- rnorm(200)
randList[[3]] <- rnorm(300)
#delete file contents
fileName <- "/tmp/tmp.txt"
cat("", file=fileName, append=F)
i <- 1
for (item in randList) {
  myLine <- paste(i, ",", base64encode(serialize(item, NULL, ascii=T)), "\n", sep="")
  cat(myLine, file=fileName, append=T)
  i <- i+1
}
linesIn <- readLines(fileName, n=-1)
parsedThing <- NULL
i <- 1
for (line in linesIn){
  parsedThing[[i]] <- unserialize(base64decode(strsplit(linesIn[[i]], split=",")[[1]][[2]], "raw"))
  i <- i+1
}
#floating point issue?
identical(randList, parsedThing)
for (i in 1:length(randList[[1]])) {
  print(randList[[1]][[i]] == parsedThing[[1]][[i]])
}
i<-3
randList[[1]][[i]] == parsedThing[[1]][[i]]
randList[[1]][[i]]
parsedThing[[1]][[i]]
Here's the abridged output:
> #floating point issue?
> identical(randList, parsedThing)
[1] FALSE
>
> for (i in 1:length(randList[[1]])) {
+ print(randList[[1]][[i]] == parsedThing[[1]][[i]])
+ }
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
...
>
> i<-3
> randList[[1]][[i]] == parsedThing[[1]][[i]]
[1] FALSE
>
> randList[[1]][[i]]
[1] 1.587845
> parsedThing[[1]][[i]]
[1] 1.587845
>
ascii=T in your call to serialize makes R do imprecise binary-decimal-binary conversions when serializing and unserializing, causing the values to differ. If you remove ascii=T, you get exactly the same numbers back, as a binary representation is then written out.
base64encode can encode raw vectors so it doesn't need ascii=T.
The binary representation used by serialize is architecture independent, so you can happily serialize on one machine and unserialize on another.
Reference: http://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats
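The exact-roundtrip claim is easy to check with binary serialization (the default, i.e. no ascii=T):

```r
# Serialize to a raw vector and back; doubles survive bit-for-bit
set.seed(2)
x <- rnorm(100)
y <- unserialize(serialize(x, NULL))
identical(x, y)
# [1] TRUE
```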
JD: I ran your code snippet on my Linux box, then looked at the differences computed by randList[[1]][[i]] - parsedThing[[1]][[i]].
Yes, the values are different, but only at the level my machine's floating-point tolerance. A typical difference was -4.440892e-16 -- which is pretty tiny. Some differences were zero.
It does not surprise me that the save/restore introduced that (tiny) level of change. Any significant data conversion runs the risk of "bobbling" the least significant digit.
OK, now that you show the output, I can explain what you're seeing (following Paul's lead here).
As that is a known issue (see e.g. this R FAQ entry), you should buckle up and use any one of
identical()
all.equal()
functions from the RUnit package such as checkEquals
In sum, there seems to be nothing wrong with the base64 encoding you are using. You simply employed the wrong definition of "exactly". But hey, we're economists, and anything below a trillion or two is rounding error anyway...
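The difference between the two definitions of "exactly" can be sketched with a deliberately perturbed value (the perturbation is at the level of machine precision):

```r
a <- 1.587845
b <- a - 4.440892e-16    # a difference of a couple of ulps
identical(a, b)          # FALSE: bit-for-bit comparison
isTRUE(all.equal(a, b))  # TRUE: within numerical tolerance (default ~1.5e-8)
```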
