URL / URI encoding in R - r

I have to request an API with an URL encoding according to RFC 3986, knowing that I have accented characters in my query.
For instance, this argument :
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I have the resulting encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is on accented letters: for instance "é" is converted into "%E9" instead of "%C3%A9"...
I struggle with this URL encoding without finding any issue... As I don't have the hand on the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which word with accent are cutted into 2 different lines :
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is splitted into "<U+FFFD>" (which is an ASCII encoding of "é") and "crivain".
I become mad with this encoding problem, if a brilliant mind could help me I would be very gratefull!

Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"

I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode() it seems that your accented characters are converted into the trailing part of their unicode representation preceeded by a %. To convert your characters into readable characters you might turn them into "real unicode" and use the package stringi to make them readable. For your single string the solution worked on my machine, at least. I hope it also works for you.
Please note that I have introduced a % character at the end of your string to demonstrate that below gsub command should work in any case.
You might have to adapt the replacement pattern \\u00 to also cover unicode patterns that have more than the last two positions filled with something but 0, if this is relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
#replacing % by a single \ backslash to directly get correct unicode representation
#does not work since it is an escape character, therefore "\\"
str <- gsub("%", paste0("\\", "u00"), str , fixed = T)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
#since we have double escapes, we need the unescape function from stringi
#which recognizes double backslash as single backslash for the conversion
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"

Related

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \? Such that, the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did do a search on SO to see if there were any Q's that asked "what is an escape character in R". But I didn't review all the 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situation. You illustrated one such situation: including double-quote character inside a result that will be printed with surrounding double-quotes. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surround single-quotes nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review:How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use teh fixed argument to get around that requirement.
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: Look at the series of help pages that start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first and it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved. and I then realized what I wanted to refer you to was most likely ?Quotes where all the R-specific escape sequence letters should be listed.

Split String by Upper Case Consisting of Both Latin and Unicode

Building on the Splitting String based on letters case answer;
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = T)
results in
"Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "NorskРусский"
The problem is the last pair is not separated as Russian is converted to UTF-8 (there will be more variation in the strings; e.g. more or less all other languages in Wikipedia). I checked online Regex testers and other SO answers but they are not much help with R. Tried iconv and Encoding workarounds in base R as well (can't seem to convert to UTF-16; conversion to bytes doesn't help). Thoughts?
Use unicode property \p{Lu} that means an uppercase (u) letter (L) in any alphabet. See http://www.regular-expressions.info/unicode.html
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=\p{Lu})", perl = TRUE)

Remove backslashes from string in R [duplicate]

Here is the string:
> raw.data[27834,1]
[1] "\xff$GPGGA"
I have tried advice from the following two questions, but with no luck:
How to escape a backslash in R?
How to escape backslashes in R string
Does anyone have a different solution from the above questions that might help? The ideal solution would be to remove the "\xff" portion, but for any combination of letters.
There is no backslash in that string. The displayed backslash is an escape marker. This and other features about entry and display of "special situations" are described in the ?Quotes help page.. You've been given one regex rather elliptical approach to removal. Here are a couple of other approaches .... only some of which actually succeed because the \ff is the first "character" and it's not really legal as an R character:
s <- "\xff$GPGGA"
strsplit(s, "")
#[[1]]
#[1] NA
Warning message:
In strsplit(s, "") : input string 1 is invalid in this locale
substr(s, 1,1)
#Error in substr(s, 1, 1) : invalid multibyte string at '<ff>$GP<47>GA'
gsub('.*([^A-Za-z].*)', '\\1',"\xff$GPGGA")#[1]
#[1] "$GPGGA"
?Quotes
gsub('\xff', '',"\xff$GPGGA")#[1]
#[1] "$GPGGA"
I think the reason that the regex functions don't choke on that string is that regex is actually a system mediated process whereas strsplit and substr are internal R functions.
#RichardScriven posts an example and when I tried to replicated it, I get yet a different example that shows the mapping to displayed characters is system specific. I'm on OSX 10.10.1 (Yosemite)>
cat('\xff')
ˇ
(I left off the octothorpe (#) that I would normally out in.)

Remove leading backslash from string R

Here is the string:
> raw.data[27834,1]
[1] "\xff$GPGGA"
I have tried advice from the following two questions, but with no luck:
How to escape a backslash in R?
How to escape backslashes in R string
Does anyone have a different solution from the above questions that might help? The ideal solution would be to remove the "\xff" portion, but for any combination of letters.
There is no backslash in that string. The displayed backslash is an escape marker. This and other features about entry and display of "special situations" are described in the ?Quotes help page.. You've been given one regex rather elliptical approach to removal. Here are a couple of other approaches .... only some of which actually succeed because the \ff is the first "character" and it's not really legal as an R character:
s <- "\xff$GPGGA"
strsplit(s, "")
#[[1]]
#[1] NA
Warning message:
In strsplit(s, "") : input string 1 is invalid in this locale
substr(s, 1,1)
#Error in substr(s, 1, 1) : invalid multibyte string at '<ff>$GP<47>GA'
gsub('.*([^A-Za-z].*)', '\\1',"\xff$GPGGA")#[1]
#[1] "$GPGGA"
?Quotes
gsub('\xff', '',"\xff$GPGGA")#[1]
#[1] "$GPGGA"
I think the reason that the regex functions don't choke on that string is that regex is actually a system mediated process whereas strsplit and substr are internal R functions.
#RichardScriven posts an example and when I tried to replicated it, I get yet a different example that shows the mapping to displayed characters is system specific. I'm on OSX 10.10.1 (Yosemite)>
cat('\xff')
ˇ
(I left off the octothorpe (#) that I would normally out in.)

How do I strip dollar signs ($) from data/ escape special characters in R?

I've been using gsub("toreplace","replacement", myvector) to clean out data in R. While this works for commas and the like, removing "$" has no effect. So if I do gsub("$","",myvector) all the dollar signs remain in place.
I think this is because $ is a special character in R. I tried escaping it "\$" but that yields the same result (no effect). And I couldn't find a resource on escaping special characters in R.
Obviously I should do this in preprocessing. But I was wondering if anyone out there knew how to either a) escape special characters in R b) get rid of pesky $ in R directly. For science.
You have to escape it twice, first for R, second for the regex.
gsub('\\$', '', c("a$a", "bb$"))
[1] "aa" "bb"
See ?Quotes for details on quoting and escaping.
Use fixed = TRUE:
gsub('$', '', c("a$a", "bb$"), fixed = TRUE)
Then you don't need to worry about any special characters. In stringr, this is implemented a little differently:
library(stringr)
str_replace_all(c("$100","ta$ty"), fixed("$"), "")
Thanks to DiggyF and James for the examples!
Escaping characters can be a pain some times, but just putting it in square brackets (make it a character class) helps with this:
> gsub("[$]","",c("$100","ta$ty"))
[1] "100" "taty"
if you have $ followed by number in set of data columns (e.g. $400,000) there is an easier way that worked like charm for me.
data%>%
mutate_at(5:6, parse_number)
where 5:6 are the data column numbers.

Resources