I have the following list:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list in which the escaped Unicode character is replaced by the actual UTF-8 character, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string were instead:
wouldbenice <- "Roma\u00F1ach"
Then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
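For instance, in a UTF-8 locale this already prints correctly:
enc2native(wouldbenice)
# [1] "Romañach"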
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly with x[[1]] and x[[3]] (parse or evaluation errors). How can I reliably convert the entire list to the appropriate encoding?
Use the stringi package. From stringi, use stri_replace_all_regex to strip the embedded quotes and stri_unescape_unicode to unescape the Unicode escape sequences.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
removed_quotes <- stri_replace_all_regex(x, "\"", "")
unescaped <- stri_unescape_unicode(removed_quotes)
# [1] "Chamberlain" "Romañach" "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(x){
  i <- grepl('\\u', x, fixed = TRUE) # detect a literal "\u" escape -- not a robust test in general
  x[i] <- eval(parse(text = x[i]))   # re-parse the deparsed string so the escape is interpreted
  x
}
lapply(x, utf8me)
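Assuming a UTF-8 locale, unlisting the result should match the stringi output above:
unlist(lapply(x, utf8me))
# [1] "Chamberlain" "Romañach"    "<node>"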
Related
The data I am using has many characters like "<U+XXXX>". For example, one data point looks like this: "<U+043E><U+043A><U+0430><U+0437><U+044B>: 673".
What should I use to convert these easily and effectively into ordinary plain text? I have many rows like this in my table, and I am stuck.
I looked for conversion methods online, but most of them don't work. For example, I tried this code to convert my data from UTF-8 to Latin-1; it failed.
www <- c("<U+043C>")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "<U+043C>"
I also tried it without the angle brackets; it still doesn't convert.
www <- c("U+043C")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "U+043C"
Alternatively, I tried this function.
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
iconv(example, "UTF-8", "latin1")
[1] "<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025"
Any ideas, folks?
When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether this string is treated as latin-1 or UTF-8 doesn't matter, since both encodings represent these 8 characters the same way.
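You can see this directly by counting the characters:
nchar("<U+043C>")
# [1] 8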
What you need to do is unescape the unicode strings. The stringi package can do this for you, but you need to do a bit of conversion first to get it in the right format. The following function should take care of it:
f <- function(x) {
  # rewrite "<U+XXXX>" as the escape form "\uXXXX", then unescape it
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  stringi::stri_unescape_unicode(x)
}
So you can do:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")
f(example)
#> [1] "Показы: 58025"
f(www)
#> [1] "м"
I'm working in R with strings like the following:
"a1_1;a1_2;a1_5;a1_6;a1_8"
"two1_1;two1_4;two1_5;two1_7"
I need to split these strings into two strings based on the last digit being less than 7 or not. For instance, the desired output for the two strings above would be:
"a1_1;a1_2;a1_5;a1_6" "a1_8"
"two1_1;two1_4;two1_5" "two1_7"
I attempted the following to no avail:
x <- "a1_1;a1_2;a1_5;a1_6;a1_8"
str_split("x", "(\\d<7);")
In an earlier version of the question someone helped me with the following function, but I don't think it's set up to handle digits both before and after the semicolon in the strings above. I've tried to modify it but haven't been able to get it to come out correctly.
f1 <- function(strn) {
strsplit(gsubfn("(;[A-Za-z]+\\d+)", ~ if(readr::parse_number(x) >= 7)
paste0(",", sub(";", "", x)) else x, strn), ",")[[1]]
}
Can anyone help me understand what I'd need to do to make this split as desired?
Splitting and recombining on ;, with a simple regex capture in between.
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")
sp <- strsplit(s, ";")
lapply(sp,
function(x) {
l <- sub(".*(\\d)$", "\\1", x) < 7
c(paste(x[l], collapse=";"), paste(x[!l], collapse=";"))
}
)
# [[1]]
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
#
# [[2]]
# [1] "two1_1;two1_4;two1_5" "two1_7"
Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that each named value is substituted. So in this example I should end up with the string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
  function(x) {
    sub(x = format.string,
        pattern = paste0(c("#", x, "#"), collapse = ""),
        replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?
The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to differ from each other, so the #...# delimiters from the question won't work directly. Also note that value.list has to be either a list or a data frame:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def
To answer your actual question: by using lapply over names(value.list) you take each element of value.list and perform the corresponding replacement on format.string. However, this all happens independently, i.e., the individual replacements are never combined into a single result.
To make something very similar to your approach work, we can use Reduce, which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "AB"), "BC"), "DF")
exactly as you intended, I believe.
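Written out explicitly, with the anonymous function named f for illustration and the vector form of value.list from the question:
value.list <- c(AB = "a", BC = "bcd", DF = "def")
f <- function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x)
f(f(f(format.string, "AB"), "BC"), "DF")
# [1] "a-bcd/def"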
We can use gsubfn, which can take a list of key/value pairs as the replacement and swaps each matched key for its 'value':
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a vector and not a list
In R, what is an efficient way to convert a string encoded in hexadecimal, such as "40414243", to its equivalent characters, e.g. "@ABC"?
For instance, the equivalent of this code:
library(stringr)
FromHexString <- function(hex.string) {
  result <- ""
  length <- str_length(hex.string)
  for (i in seq(1, length, by = 2)) {
    hex.value <- str_sub(hex.string, i, i + 1)
    char.code <- strtoi(hex.value, 16)
    char <- rawToChar(as.raw(char.code))
    result <- paste(result, char, sep = "")
  }
  result
}
Which produces:
> FromHexString("40414243")
[1] "#ABC"
While the above code works, it's not efficient at all, using a lot of string concatenations.
So the question is how to write an idiomatic, efficient R function that does this operation.
Edit: My sample works only for ASCII encoding, not for UTF-8 encoded byte arrays.
Test whether this is more efficient (for longer strings):
string <- "40414243"
intToUtf8(
strtoi(
do.call(
paste0,
as.data.frame(
matrix(
strsplit(string, split = "")[[1]],
ncol=2,
byrow=TRUE),
stringsAsFactors=FALSE)),
base=16L)
)
#[1] "#ABC"
Otherwise you could look for a C/C++ implementation.
Modify your code so that it uses a lookup table (an example for R is here). Your lookup table will have 255 values; put them in a vector and get the characters from that vector.
Note: no other solution will beat this one if you need to do a lot of conversions.
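A minimal sketch of that idea, assuming plain ASCII/latin-1 bytes (no nul byte); lookup and FromHexLookup are names made up for this example:
lookup <- vapply(1:255, function(i) rawToChar(as.raw(i)), character(1))
names(lookup) <- sprintf("%02X", 1:255)

FromHexLookup <- function(hex.string) {
  # split "40414243" into c("40", "41", "42", "43") and look each pair up
  pairs <- substring(hex.string,
                     seq(1, nchar(hex.string), by = 2),
                     seq(2, nchar(hex.string), by = 2))
  paste(lookup[toupper(pairs)], collapse = "")
}

FromHexLookup("40414243")
# [1] "@ABC"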
If you don't want to use a lookup table (or just like codegolfing :-)), consider writing a vectorized version of something like:
bar <- unlist(strsplit(foo,'')) #separates input into individual elements
items <- sapply(1:(length(bar)/2),function(j)paste0(bar[(2*j-1):(2*j)],sep='',collapse=''))
followed with strtoi or whatever.
But even easier (I hope...) is
sapply(1:(nchar(foo)/2), function(j) substr(foo, (2*j-1), (2*j)))
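For completeness, here is that last variant wired up with strtoi and intToUtf8 (foo is a hypothetical input; like the answer above, this treats each byte as a code point, so ASCII only):
foo <- "40414243"
pairs <- sapply(1:(nchar(foo)/2), function(j) substr(foo, (2*j-1), (2*j)))
intToUtf8(strtoi(pairs, base = 16L))
# [1] "@ABC"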
How can I encode a URL like this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI=1S/C21H30O9/c1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19/h5-8,14,16-19,22,25-28H,9-10H2,1-4H3/b6-5+,11-7-/t14-,16-,17+,18-,19+,21-/m1/s1&token=e4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
so that it becomes this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InChI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1
in R?
I tried URLencode, but it does not work.
Thanks
It seems that you want to keep only the first GET parameter of the URL and percent-encode its value.
url <- "..."
library(stringi)
(addr <- stri_replace_all_regex(url, "\\?.*", ""))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES"
args <- stri_match_first_regex(url, "[?&](.*?)=([^&]+)")
(data <- stri_replace_all_regex(
stri_trans_general(args[,3], "[^a-zA-Z0-9\\-()]Any-Hex/XML"),
"&#x([0-9a-fA-F]{2});", "%$1"))
## [1] "InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
(addr <- stri_c(addr, "?", args[,2], "=", data))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
Here I made use of ICU's transliterator (via stri_trans_general). All characters except A..Z, a..z, 0..9, (, ), and - are converted to a hexadecimal representation of the form &#xNN; (it seems that URLencode does not handle "," even with reserved=TRUE). Then each &#xNN; is converted to %NN with stri_replace_all_regex.
Here are two approaches:
1) gsubfn/URLencode If u is an R character string containing the URL, then try this. It passes everything after the ? to URLencode, replacing that part of the input with the function's output. Note that "\\K" discards everything matched up to that point, so the ? itself does not get encoded:
library(gsubfn)
gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)
It gives the following (which is not identical to the output in the question but may be sufficient):
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3dInchI%3d1S%2fC21H30O9%2fc1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2fh5-8,14,16-19,22,25-28H,9-10H2,1-4H3%2fb6-5+,11-7-%2ft14-,16-,17+,18-,19+,21-%2fm1%2fs1%26token%3de4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
2) gsubfn/curlEscape For a somewhat different output, continuing to use gsubfn, try:
library(RCurl)
gsubfn("\\?\\K(.*)", curlEscape, u, perl = TRUE)
giving:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3DInchI%3D1S%2FC21H30O9%2Fc1%2D11%285%2D6%2D21%2828%2912%282%298%2D13%2823%299%2D20%2821%2C3%294%297%2D15%2824%2930%2D19%2D18%2827%2917%2826%2916%2825%2914%2810%2D22%2929%2D19%2Fh5%2D8%2C14%2C16%2D19%2C22%2C25%2D28H%2C9%2D10H2%2C1%2D4H3%2Fb6%2D5%2B%2C11%2D7%2D%2Ft14%2D%2C16%2D%2C17%2B%2C18%2D%2C19%2B%2C21%2D%2Fm1%2Fs1%26token%3De4a6d6fb%2Dae07%2D4cf6%2Dbae8%2Dc0e6115bc681
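As a quick sanity check, either result should decode back to the original string, since URLdecode simply reverses the percent-encoding:
u2 <- gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)
identical(URLdecode(u2), u)
# [1] TRUE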