Can I convert Unicode into plain text in R?

The data I am using contains many escaped characters of the form "<U+XXXX>". A typical data point looks like this: "<U+043E><U+043A><U+0430><U+0437><U+044B>: 673".
I would like to know what to use to convert these easily and reliably into ordinary plain text. My table has many rows of this kind, and I am stuck.
I looked for conversion methods online, but most of them don't work. For example, I tried this code to convert my data from UTF-8 to Latin-1; it failed:
library(magrittr)  # provides %>%
www <- c("<U+043C>")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "<U+043C>"
I also tried it without the angle brackets; still, it doesn't convert:
www <- c("U+043C")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "U+043C"
Alternatively, I tried this function:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
iconv(example, "UTF-8", "latin1")
[1] "<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025"
Any ideas, folks?

When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether that string is declared as latin-1 or UTF-8 doesn't matter, since both encodings represent these 8 ASCII characters the same way.
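You can verify that this really is 8 separate characters, not one Cyrillic character:
nchar("<U+043C>")
#> [1] 8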
What you need to do is unescape the Unicode strings. The stringi package can do this for you, but you first need a bit of conversion to get them into the right format. The following function should take care of it:
f <- function(x) {
  # Turn "<U+XXXX>" into "\uXXXX", then unescape it
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  stringi::stri_unescape_unicode(x)
}
So you can do:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")
f(example)
#> [1] "Показы: 58025"
f(www)
#> [1] "м"

Related

Convert a column of character mode into numeric in R?

I downloaded historical prices of an index, but all the prices are characters, i.e. of the form "24,31" (I checked the mode).
I tried several things, such as:
as.numeric(as.character(VDAXcsv$Dernier))
which returns only NAs, or:
sapply(VDAXcsv$Dernier, as.numeric)
sapply(VDAXcsv, as.numeric)
or simply:
as.numeric(VDAXcsv)
Still only NAs. I also tried putting stringsAsFactors = FALSE into my read.zoo() call, but it doesn't change anything, and as.format doesn't work either.
The comma is being used as a decimal mark, so replace it with a period before converting:
x <- "24,31"
y <- as.numeric(gsub(",", ".", x))
y
# [1] 24.31
class(y)
# [1] "numeric"
A side note
Depending on your data file, you may even want to prevent this from happening in the first place by setting dec when reading the data. Be careful if your sep is also a comma, though. Read this way, the values never arrive as character, so there is no need for any replacements.
fread(file, header = TRUE, sep = ";", dec = ",") # fread is from data.table; read.csv and friends accept a dec argument as well
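In base R, read.csv2() already defaults to sep = ";" and dec = ",", so (with file standing in for the same path) something like this should give numeric prices directly:
VDAXcsv <- read.csv2(file, header = TRUE, stringsAsFactors = FALSE)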
You can use this function:
library(readr)
parse_number("24,31", locale = locale(decimal_mark = ","))
It is vectorized, so you can pass VDAXcsv$Dernier directly as the first argument.
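For example, to convert the whole column in place (VDAXcsv$Dernier being the character column from the question):
library(readr)
VDAXcsv$Dernier <- parse_number(VDAXcsv$Dernier, locale = locale(decimal_mark = ","))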

Converting encoding of deparsed strings

I have the following list:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list with the Unicode escapes replaced by the actual UTF-8 characters, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string were instead:
wouldbenice <- "Roma\u00F1ach"
then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly (throws errors at parse or eval time) with x[[1]] and x[[3]]. How can I reliably parse the entire list into the appropriate encoding?
Use the stringi package: stri_replace_all_regex for the replacement and stri_unescape_unicode to unescape the Unicode symbols.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
removed_quotes <- stri_replace_all_regex(x, "\"", "")
unescaped <- stri_unescape_unicode(removed_quotes)
# [1] "Chamberlain" "Romañach" "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(x){
  # Look for a literal "\u" escape; fixed = TRUE avoids regex trouble,
  # though this is still not a fully robust way to detect a Unicode escape
  i <- grepl("\\u", x, fixed = TRUE)
  if (any(i)) x[i] <- eval(parse(text = x[i]))
  x
}
lapply(x, utf8me)

In R, switch uppercase to lowercase and vice-versa in a string

I am familiar with the functions toupper() and tolower(), however that doesn't exactly give what I want here. Here is an example of the string I have and the string I want:
this = "This is the string THAT I have!"
that = "tHIS IS THE STRING that i HAVE!"
Simple enough to describe with an example, harder to implement (I think).
Thanks!
I'm sort of curious if there is a better way than:
chartr(x = this,
       old = paste0(c(letters, LETTERS), collapse = ""),
       new = paste0(c(LETTERS, letters), collapse = ""))
Helpful observation by @Joris in the comments: ?chartr notes that you can use character ranges, avoiding the paste:
chartr("a-zA-Z", "A-Za-z", this)
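Applied to the example, either form returns exactly the target string from the question:
#> [1] "tHIS IS THE STRING that i HAVE!"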
Here's a way that also works with characters outside the a-z / A-Z ranges:
text <- "aBàÉ"
up <- gregexpr("\\p{Lu}", text, perl=TRUE)
lo <- gregexpr("\\p{Ll}", text, perl=TRUE)
regmatches(text,up)[[1]] <- tolower(regmatches(text,up)[[1]])
regmatches(text,lo)[[1]] <- toupper(regmatches(text,lo)[[1]])
text
#> [1] "AbÀé"

Generalizing a function to return a list of data.frame columns with invalid UTF-8 bytes/code points

I recently wrote a function that uses grep and regex to find invalid UTF-8 code points (since I work on a Mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as the function looks for invalid UTF-8 bytes. I wrote it for work and was wondering if anyone could provide tips for generalizing it or catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
  library(dplyr)
  library(magrittr)
  # Create vector of all possible 2-digit hexadecimal numbers
  # (2 hex digits is the length of a byte)
  allBytes <- list(x_esc = '\\x',
                   hex1 = as.character(c(seq(0, 9),
                                         c('a', 'b', 'c', 'd', 'e', 'f'))),
                   hex2 = as.character(c(seq(0, 9),
                                         c('a', 'b', 'c', 'd', 'e', 'f')))
                   ) %$%
    expand.grid(x_esc, hex1, hex2) %>%
    apply(1, paste, collapse = '')
  # Valid mixed alphanumeric bytes
  validBytes1 <- list(x_esc = '\\x',
                      hexNum = as.character(seq(2, 7)),
                      hexAlpha = c('a', 'b', 'c', 'd', 'e', 'f')
                      ) %$%
    expand.grid(x_esc, hexNum, hexAlpha) %>%
    apply(1, paste, collapse = '') %>%
    extract(. != '\\x7f')
  # Valid purely numeric bytes
  validBytes2 <- list(x_esc = '\\x',
                      hexNum2 = as.character(seq(20, 79))
                      ) %$%
    expand.grid(x_esc, hexNum2) %>%
    apply(1, paste, collapse = '')
  # New-line byte
  validBytes3 <- '\\x0a'
  # charToRaw('\n')
  # [1] 0a
  # Filter all possible combinations down to only invalid bytes
  validBytes <- c(validBytes1, validBytes2, validBytes3)
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  # Create list of data.frame columns with invalid bytes
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(data)[2]) {
    a_vector <- data[, i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  }
  # Get rid of empty list elements
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  # matches <- matches[lapply(matches, length) > 0]
  return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
  library(dplyr)
  library(magrittr)
  rASCII <- c('\n', '\r', '\t', '\b',
              '\a', '\f', '\v', '\\', '\'', '\"', '\`')
  # All 256 possible bytes as grep patterns (from the answer below)
  allBytes <- paste0("\\x", as.character(as.hexmode(0:255)))
  validBytes <- paste0("\\x",
                       c(as.character(as.hexmode(32:126)),
                         sapply(rASCII, charToRaw))) %>%
    extract(not(duplicated(.)))
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(dataset)[2]) {
    a_vector <- dataset[, i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  } # sapply() is preferable to lapply() due to USE.NAMES = TRUE
  names(matches) <- names(dataset)
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  return(matches)
}
2nd Edit: A better strategy was to use iconv. Say you have a file or object that is generally UTF-8 but contains some invalid bytes. This is often the case on Macs, whose default locale setting seems to be UTF-8; moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding.

You can use iconv to substitute the Unicode replacement symbol for all invalid bytes (normally displayed as hexadecimal bytes, e.g. "\x8f"). Then you can search for that symbol, return a list of the unique observations within a data.frame column that contain it, and use sub() to replace those characters with the desired ones.

One thing to note: converting the file to another encoding, say latin-1, can have unexpected results if invalid bytes are present. When I did this, some invalid bytes were converted to Unicode control characters, while other invalid bytes happened to match valid latin-1 bytes and were displayed as nonsensical characters.

In any case, I wrote a package to search data.frames for these characters, return a list, and do some replacement. The package isn't nearly as official as something off CRAN, but if anyone's interested, here's a link to the repository: https://github.com/jkroes/FixEncoding. Note that the "stable" version of the package isn't on the "master" branch; it's on the "iconv" branch. After installing from that branch, the documentation can be browsed in R via "?FixEncoding", then by looking up help for the functions listed there.
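A minimal sketch of that iconv step, assuming x is a mostly-UTF-8 string with one stray invalid byte (using "\uFFFD" as the sub string is my choice for illustration, not something from the original post):
x <- "abc\x8fdef"
# Re-encode UTF-8 to UTF-8, substituting the replacement symbol for invalid bytes
iconv(x, from = "UTF-8", to = "UTF-8", sub = "\uFFFD")
# returns "abc\uFFFDdef", displayed as "abc�def" in a UTF-8 locale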
This will construct character versions of all the hex numbers up to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or as greppish patterns as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
The "special" characters that R recognizes do include the "\n" that you listed, but also a few more, listed on the ?Quotes help page:
rASCII <- c('\n', '\r', '\t', '\b',
            '\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for the printable characters, space up to tilde ("~"), with just this (space is 0x20, i.e. decimal 32):
validBytes1 <- c(rASCII, paste0("\\x", as.character(as.hexmode(32:126))))
I have concerns about using this strategy, since my R throws errors when it tries to do greppish matches on what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
Or you may use tools::showNonASCII, as suggested by Duncan Murdoch when this came up on R-help four years ago:
?tools::showNonASCII
# and the help page has a reproducible example of its use:
out <- c(
  "fa\xE7ile test of showNonASCII():",
  "\\details{",
  " This is a good line",
  " This has an \xfcmlaut in it.",
  " OK again.",
  "}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.

How to encode a URL in R

How can I encode a URL like this:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI=1S/C21H30O9/c1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19/h5-8,14,16-19,22,25-28H,9-10H2,1-4H3/b6-5+,11-7-/t14-,16-,17+,18-,19+,21-/m1/s1&token=e4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
to make this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InChI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1
in R?
I tried URLencode, but it does not work.
Thanks
It seems that you want to drop all but the first GET parameter from the URL and then percent-encode its value.
url <- "..." # the full URL from the question
library(stringi)
(addr <- stri_replace_all_regex(url, "\\?.*", ""))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES"
args <- stri_match_first_regex(url, "[?&](.*?)=([^&]+)")
(data <- stri_replace_all_regex(
stri_trans_general(args[,3], "[^a-zA-Z0-9\\-()]Any-Hex/XML"),
"&#x([0-9a-fA-F]{2});", "%$1"))
## [1] "InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
(addr <- stri_c(addr, "?", args[,2], "=", data))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
Here I made use of the ICU transliterator (via stri_trans_general). All characters except A-Z, a-z, 0-9, (, ), and - were converted to hexadecimal representations of the form &#xNN; (it seems that URLencode does not handle ",", even with reserved = TRUE). Then each &#xNN; was converted to %NN with stri_replace_all_regex.
Here are two approaches:
1) gsubfn/URLencode If u is an R character string containing the URL, then try this. It passes everything after the ? to URLencode, replacing that input with the function's output. Note that "\\K" discards everything matched up to that point, so the ? itself does not get encoded:
library(gsubfn)
gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)
It gives the following (which is not identical to the output in the question but may be sufficient):
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3dInchI%3d1S%2fC21H30O9%2fc1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2fh5-8,14,16-19,22,25-28H,9-10H2,1-4H3%2fb6-5+,11-7-%2ft14-,16-,17+,18-,19+,21-%2fm1%2fs1%26token%3de4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
2) gsubfn/curlEscape For a somewhat different output, continuing to use gsubfn, try:
library(RCurl)
gsubfn("\\?\\K(.*)", curlEscape, u, perl = TRUE)
giving:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3DInchI%3D1S%2FC21H30O9%2Fc1%2D11%285%2D6%2D21%2828%2912%282%298%2D13%2823%299%2D20%2821%2C3%294%297%2D15%2824%2930%2D19%2D18%2827%2917%2826%2916%2825%2914%2810%2D22%2929%2D19%2Fh5%2D8%2C14%2C16%2D19%2C22%2C25%2D28H%2C9%2D10H2%2C1%2D4H3%2Fb6%2D5%2B%2C11%2D7%2D%2Ft14%2D%2C16%2D%2C17%2B%2C18%2D%2C19%2B%2C21%2D%2Fm1%2Fs1%26token%3De4a6d6fb%2Dae07%2D4cf6%2Dbae8%2Dc0e6115bc681
