The data I am using has many characters like "<U+XXXX>". For example, a data point originally looks like this: "<U+043E><U+043A><U+0430><U+0437><U+044B>: 673".
I am curious what I should use to convert these easily and effectively into ordinary plain text. I have rows of these Unicode escapes in my table, and I am confused now.
I was looking for ways to convert them online, but most of them don't work. For example, I have tried this code on my data to convert it from UTF-8 to Latin-1; it failed.
www <- c("<U+043C>")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "<U+043C>"
Also, I have tried this without the angle brackets. Still, it doesn't convert.
www <- c("U+043C")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "U+043C"
Alternatively, I tried the same function on a full data point.
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
iconv(example, "UTF-8", "latin1")
[1] "<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025"
Any ideas, folks?
When you type "<U+043C>" it is being interpreted as a literal string of 8 characters. Whether this string is interpreted as latin-1 or UTF doesn't matter, since they both encode these literal 8 characters the same way.
What you need to do is unescape the unicode strings. The stringi package can do this for you, but you need to do a bit of conversion first to get it in the right format. The following function should take care of it:
f <- function(x) {
  # rewrite "<U+043C>" as the literal escape text "\u043C" ...
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  # ... which stri_unescape_unicode() can then turn into the actual character
  stringi::stri_unescape_unicode(x)
}
So you can do:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")
f(example)
#> [1] "Показы: 58025"
f(www)
#> [1] "м"
I have the following list:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list with the Unicode escape replaced by the actual UTF-8 character, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string was instead:
wouldbenice <- "Roma\u00F1ach"
Then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly (throws parse or evaluation errors) with x[[1]] and x[[3]]. How can I reliably parse the entire list into the appropriate encoding?
Use the stringi package.
From stringi, use stri_replace_all_regex() to strip the quotes and stri_unescape_unicode() to unescape the Unicode symbols.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
removed_quotes <- stri_replace_all_regex(x, "\"", "")
unescaped <- stri_unescape_unicode(removed_quotes)
# [1] "Chamberlain" "Romañach" "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(x){
  i <- grepl('\\u', x)              # not a robust way to detect a Unicode escape?
  x[i] <- eval(parse(text = x[i]))  # parse/eval only the deparsed elements
  x
}
lapply(x, utf8me)
How can I replace multiple backslashes with a single one? I know that in a string a single backslash is represented with \\ as demonstrated here:
nchar('\\')
[1] 1
So I want to replace all the backslashes in this string, 'thre\\\\fd', with one (which prints as two) so that wrapping it in cat will produce: thre\fd. I thought the stringi package had a way to do this easily but I can't figure it out.
MWE (not correct output)
cat(gsub('\\\\', '\\', 'thre\\\\fd'))
## threfd
Desired Catted Output
thre\fd
Using the fixed = TRUE argument, we get
cat(gsub('\\\\', '\\', 'thre\\\\fd', fixed = TRUE), '\n')
#thre\fd
cat(gsub('\\\\\\', '\\\\', 'thre\\\\\\fd', fixed = TRUE), '\n')
#thre\\fd
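If you prefer a regular expression over fixed = TRUE, quantifying the escaped backslash should give the same result; this is just a sketch of the same replacement:
# \\\\ in the pattern is one literal backslash, {2} makes it match two of them,
# and \\\\ in the replacement emits a single backslash
cat(gsub('\\\\{2}', '\\\\', 'thre\\\\fd'), '\n')
#thre\fd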
If all strings have the same number of backslashes, this is a very simple gsub (note the fixed = TRUE and the "\\" replacement):
x <- "test\\\\123"
gsub("\\\\", "\\", x, fixed = TRUE)
# prints as "test\\123"; cat() shows test\123
I was using grep to do a case-insensitive search, but the problem is that I get all values containing the pattern, not just the exact match; and if I use fixed = TRUE, that invalidates the ignore.case = TRUE parameter.
g = c("PLD3","PLD2","PLD2ABC","DTPLD2a")
r = "pLd2"
grep(r,g,ignore.case=TRUE,value=TRUE)
>[1] "PLD2" "PLD2ABC" "DTPLD2a"
grep(r,g,ignore.case=TRUE,value=TRUE,fixed=TRUE)
character(0)
EDIT
r is a user input, so basically it can be anything from a list of 30,000 genes, and it can be all lower-case, all upper-case, or a mixture of both.
And also in my list g the elements can be upper-case, lower-case or a mixture (it is a list of around 15,000 genes)
Try:
g = c("PLD3","PLD2","PLD2ABC","DTPLD2a")
r <- 'pLd2'
r2 <- paste('^', r, '$', sep = '')
grep(r2 , g ,ignore.case = T, value=TRUE)
[1] "PLD2"
Basically, the metacharacters ^ and $ anchor the regular expression at the start and end of the string, so only exact (case-insensitive) matches are returned.
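If the user-supplied gene name might contain regex metacharacters, a fully literal case-insensitive exact match avoids building a regular expression at all. A small sketch of that alternative, just comparing upper-cased copies:
g[toupper(g) == toupper(r)]
[1] "PLD2"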
I recently wrote a function that uses grep and regex to find invalid UTF-8 bytes (since I work on a Mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as the function is looking for invalid UTF-8 bytes. I wrote it for work, and was wondering if anyone could provide tips for generalizing it or catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
  library(dplyr)
  library(magrittr)

  # Create vector of all possible 2-digit hexadecimal numbers (2 digits is the length of a byte)
  allBytes <- list(x_esc = '\\x',
                   hex1 = as.character(c(seq(0,9),
                                         c('a','b','c','d','e','f'))),
                   hex2 = as.character(c(seq(0,9),
                                         c('a','b','c','d','e','f')))
                   ) %$%
    expand.grid(x_esc, hex1, hex2) %>%
    apply(1, paste, collapse = '')

  # Valid mixed alphanumeric bytes
  validBytes1 <- list(x_esc = '\\x',
                      hexNum = as.character(c(seq(2,7))),
                      hexAlpha = c('a','b','c','d','e','f')
                      ) %$%
    expand.grid(x_esc, hexNum, hexAlpha) %>%
    apply(1, paste, collapse = '') %>%
    extract(. != '\\x7f')

  # Valid purely numeric bytes
  validBytes2 <- list(x_esc = '\\x',
                      hexNum2 = as.character(seq(20,79))
                      ) %$%
    expand.grid(x_esc, hexNum2) %>%
    apply(1, paste, collapse = '')

  # New-line byte
  validBytes3 <- '\\x0a'
  # charToRaw('\n')
  # [1] 0a

  # Filter all possible combinations down to only invalid bytes
  validBytes <- c(validBytes1, validBytes2, validBytes3)
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))

  # Create list of data.frame columns with invalid bytes
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(data)[2]) {
    a_vector <- data[,i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  }

  # Get rid of empty list elements
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  # matches <- matches[lapply(matches,length) > 0]

  return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
  library(dplyr)
  library(magrittr)

  rASCII <- c( '\n', '\r', '\t','\b',
               '\a', '\f', '\v', '\\', '\'', '\"', '\`')

  # All 256 possible bytes as grep patterns (see the answer below)
  allBytes <- paste0("\\x", as.character(as.hexmode(0:255)))

  validBytes <- paste0("\\x",
                       c(as.character(as.hexmode(32:126)),
                         sapply(rASCII, charToRaw))) %>%
    extract(not(duplicated(.)))

  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))

  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(dataset)[2]) {
    a_vector <- dataset[,i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  } # sapply() is preferable to lapply() due to USE.NAMES = TRUE

  names(matches) <- names(dataset)
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)

  return(matches)
}
2nd Edit: A better strategy was to use iconv. Let's say you have a file or object with some invalid bytes but that is generally UTF-8. This is often the case with Mac computers, whose default locale setting seems to be UTF-8. Moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding.
Anyway, you can use iconv to substitute all invalid bytes (normally displayed as hexadecimal bytes, e.g. "\x8f") with the Unicode replacement symbol. Then you can search for that symbol and return a list of unique observations within a data.frame column that contain it. Based on that, you can use sub() to replace those characters with the desired ones. One thing to note is that converting the file to another encoding, say Latin-1, can have unexpected results if invalid bytes are present. When I did this, I noticed that some invalid bytes were converted to Unicode control characters, while other invalid bytes apparently matched valid Latin-1 bytes and were displayed as nonsensical characters.
In either case, I wrote a package to search data.frames for these characters and return a list, then do some replacement. The package isn't nearly as official as something off of CRAN, but if anyone's interested, here's a link to the repository: https://github.com/jkroes/FixEncoding. It's important to note that the "stable" version of the package isn't on the "master" branch; it's actually on branch "iconv". The documentation can be searched in R via "?FixEncoding" after installation of the correct branch, then finding the functions listed there and searching help for those.
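As a rough illustration of that iconv strategy (a sketch, not the package code):
x <- "ttt\nuuu\tiii\xff"   # contains one invalid UTF-8 byte (0xff)
# sub = "byte" marks each invalid byte as "<ff>"; any literal string works too,
# e.g. the Unicode replacement character, which is then easy to search for
iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
# [1] "ttt\nuuu\tiii<ff>"
marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "\uFFFD")
grepl("\uFFFD", marked)
# [1] TRUE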
This will construct all 256 byte values as two-digit hex strings, up to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or, as grep-style patterns, as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
The "special" characters that R recognizes does include the "\n" that you lissted but also a few more listed on the ?Quotes help page:
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for the printable characters from space up to tilde ("~") just with this:
validBytes1 <- c(rASCII, paste0("\\x", as.character(as.hexmode(32:126))))
I have concerns about using this strategy, since my R throws errors when it tries to do grep matches against what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
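To apply the grepRaw() idea across a whole data.frame without hitting the locale errors above, something along these lines should work (a sketch, assuming character columns; find_ff is a made-up helper):
# returns, per column, the indices of values whose bytes contain 0xff
find_ff <- function(col) {
  which(vapply(col, function(s) length(grepRaw("\xff", s)) > 0, logical(1)))
}
lapply(dfrm, find_ff)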
Or you may use tools::showNonASCII, as suggested by Duncan Murdoch when this came up on R-help 4 years ago:
?tools::showNonASCII
# and the help page has a reproducible example of its use:
out <- c(
  "fa\xE7ile test of showNonASCII():",
  "\\details{",
  " This is a good line",
  " This has an \xfcmlaut in it.",
  " OK again.",
  "}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.