No space between unicode character and next character - r

Say I want to print degrees Celsius in R. I could use a Unicode escape like this:
print("\U00B0 C")
[1] "° C"
Note, however, the space. I don't want it there, so I remove it:
print("\U00B0C")
[1] "ଌ"
Clearly, 00B0C (U+0B0C) is the code point of a very different character! Presumably, any hex digit that follows the escape is, understandably, read as part of the code point. I could use paste or something similar like this:
print(paste("\U00B0","C", sep = ""))
[1] "°C"
but is there a more concise way to indicate that the unicode is finished and I'm now just using regular letters?

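Use lower case u. \u reads at most four hex digits, whereas \U reads up to eight, so with \u the escape ends after 00B0 and the C is taken as a literal letter: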
print("\u00B0C")
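For reference, here is a small sketch of the ways to end a Unicode escape before a literal letter. The braced form \U{...} is described in ?Quotes; the variable names are only illustrative.

```r
# Three spellings of "degree sign followed by C" (variable names are illustrative).
short_escape  <- "\u00B0C"               # \u reads at most 4 hex digits, so C stays a letter
braced_escape <- "\U{00B0}C"             # braces mark exactly where the code point ends
pasted        <- paste0("\U00B0", "C")   # the workaround from the question

# All three are the same two-character string.
stopifnot(identical(short_escape, braced_escape),
          identical(short_escape, pasted))

# Code points of the result: 176 (the degree sign) and 67 ("C").
print(utf8ToInt(short_escape))
# [1] 176  67
```

The braced form is handy when the next literal character happens to be a hex digit, because it does not rely on the digit limit of \u.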

Unable to enclose double quotes inside string in R

I am unable to enclose "hey" in double quotes within a string. I cannot get the output Hey "Hey".
ad<- "hey"
fd<- paste("Hey","",ad,sep="")
The dQuote function is made for this:
dQuote("hey")
# [1] "\"hey\""
Note that, depending on the OS and your environment, dQuote may add "fancy quotes" (angled/directional double quotes). They may look good, but if you want to reuse the result as a string in R it won't work, because R does not recognize smart quotes as string boundaries. You can explicitly disable them with dQuote(x, q = FALSE). (The default is FALSE on Windows except for the Rgui console, but I believe the default is TRUE elsewhere.)
Depending on your need, you may also like shQuote due to its escaping of existing embedded quotes:
cat(dQuote('"hey" there'), "\n")
# ""hey" there" # may not be right
cat(shQuote('"hey" there'), "\n")
# "\"hey\" there"   (Windows default, type = "cmd"; on Unix-alikes the default type = "sh" gives '"hey" there')
though whether that is correct depends on your needs; shQuote was designed for shell-quoting/escaping.
Ultimately in your example, I think you would use
ad <- "Hey"
paste("Hey", dQuote(ad, q = FALSE))
# [1] "Hey \"Hey\""
Double quotes can be added with
paste0('"', "Hey", '"')
#[1] "\"Hey\""
Or
sprintf('"%s"', "Hey")
#[1] "\"Hey\""
Note that R displays strings wrapped in double quotes ("), so to show a double quote that is part of the string it escapes it with a backslash (\). To see the actual string, you can use cat on it.
cat(paste0('"', "Hey", '"'))
#"Hey"
You can write raw strings (available since R 4.0.0) like:
r"{"Hey"}"
r'{"Hey"}'
[1] "\"Hey\""
See ?Quotes
Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.
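To illustrate the delimiter and dash rules quoted above, here is a minimal sketch. It assumes R 4.0.0 or later, and the example strings are made up.

```r
# Raw strings need R 4.0.0 or later; the contents below are illustrative.
plain  <- r"("Hey")"                 # default delimiters ( )
braces <- r"{"Hey"}"                 # { } work the same way
dashed <- r"-(contains )" inside)-"  # one dash on each side lets )" appear inside

stopifnot(identical(plain, "\"Hey\""),
          identical(braces, plain),
          identical(dashed, "contains )\" inside"))

writeLines(dashed)
# contains )" inside
```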

URL / URI encoding in R

I have to query an API with a URL encoded according to RFC 3986, and my query contains accented characters.
For instance, this argument:
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is with accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9".
I have been struggling with this URL encoding without finding a solution. Since I don't control the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which words with accents are cut across two different lines:
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, which stands in for the "é" that could not be decoded) and "crivain".
This encoding problem is driving me mad; if a brilliant mind could help me, I would be very grateful!
Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
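URLencode percent-encodes the bytes of the string as R stores it, so "%E9" suggests the input was in a Latin-1 native encoding. A hedged sketch of one way to make the result independent of the locale: convert to UTF-8 first with enc2utf8. The enc2utf8 step is my assumption about the cause, not something the question confirms, and the variable names are illustrative.

```r
# The \u escape keeps this example independent of the source file's encoding.
query <- "quel \u00E9crivain ?"

# Assumption: the "%E9" comes from a Latin-1 encoded input, so convert to UTF-8 first.
encoded <- URLencode(enc2utf8(query), reserved = TRUE)

print(encoded)
# [1] "quel%20%C3%A9crivain%20%3F"
```

With reserved = TRUE the question mark is encoded as well, which matches the form the question asks for (apart from the trailing %0D%0A, which is a carriage return and line feed that the example string does not contain).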
I do not think I am a brilliant mind, but I still have a possible solution for you. After URLencode(), your accented characters seem to be converted into the trailing part of their Unicode code point, preceded by a %. To turn them back into readable characters, you can rewrite them as "real" Unicode escapes and use the stringi package to unescape them. The solution worked for your single string on my machine, at least; I hope it also works for you.
Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.
You might have to adapt the replacement pattern \\u00 to also cover code points whose leading two hex digits are not 00, if this is relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
#replacing % with a single backslash would give the Unicode escape directly,
#but that does not work because the backslash is itself an escape character, hence "\\"
str <- gsub("%", paste0("\\", "u00"), str, fixed = TRUE)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
#since we have double escapes, we need the unescape function from stringi
#which recognizes double backslash as single backslash for the conversion
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"

regex to capture pattern with non-fixed length lookbehind - split string

I'd like to split the string into the following
S <- "No. Ok (whatever). If you must. Please try to be careful (shakes head)."
[1] No.
[2] Ok (whatever). If you must.
[3] Please try to be careful (shakes head).
The pattern is the first . before each (...).
I'm familiar with (?<=...) (i.e. positive lookbehind) but this doesn't seem to work with non-fixed length patterns. I'd like to know if I'm wrong about positive lookbehind or if there's some regex magic to do this. Thanks!
Note that I don't know much about ruby, but there should be something like a split method that uses a regex pattern as a delimiter and splits the string accordingly.
Use this regex:
(?<=\.) (?=[^.]+?\(.+?\))
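This looks for a space character. Behind the space, there must be a dot: (?<=\.). After it, the lookahead (?=...) requires a run of characters that are not dots, [^.]+?, followed by a pair of parentheses with something inside, \(.+?\). The lookbehind stays fixed-length; only the lookahead is variable-length, which is allowed.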
Try it online: https://regex101.com/r/8PcbFJ/1
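Since the string in the question is assigned with R syntax, here is a sketch of how the pattern could be applied with strsplit. perl = TRUE is needed for the lookarounds, and every backslash of the regex is doubled inside the R string.

```r
S <- "No. Ok (whatever). If you must. Please try to be careful (shakes head)."

# Same regex as above, with each backslash doubled for the R string literal.
pattern <- "(?<=\\.) (?=[^.]+?\\(.+?\\))"

# perl = TRUE switches to the PCRE engine, which supports lookbehind and lookahead.
parts <- strsplit(S, pattern, perl = TRUE)[[1]]

print(parts)
# [1] "No."
# [2] "Ok (whatever). If you must."
# [3] "Please try to be careful (shakes head)."
```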

Paste "25 \%" in R for further processing in LaTeX

I want a character variable in R that takes the value of, let's say, "a" and appends " \%", to create a %-sign later in LaTeX.
Usually I'd do something like:
a <- 5
paste(a,"\%")
but this fails.
Error: '\%' is an unrecognized escape in character string starting "\%"
Any ideas? A workaround would be to define another command giving the %-sign in LaTeX, but I'd prefer a solution within R.
As in many other languages, certain characters in strings take on a different meaning when they are escaped. One example is \n, which means newline instead of n. When you write \%, R tries to interpret it as an escape sequence and fails. Escape the backslash itself so that it is just a backslash:
paste(a, "\\%")
See ?Quotes for the escape sequences that R recognizes.
You can also look at the latexTranslate function from the Hmisc package, which escapes special characters in strings to make them LaTeX-compatible:
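A short sketch of what the doubled backslash actually produces. The value 25 is taken from the title; the raw-string variant assumes R 4.0.0 or later.

```r
a <- 25

# "\\" in the source code is a single backslash in the resulting string.
label <- paste(a, "\\%")
stopifnot(nchar("\\%") == 2)

print(label)       # print() shows the escaped form
# [1] "25 \\%"
writeLines(label)  # writeLines() shows what LaTeX will receive
# 25 \%

# Raw-string variant (assumes R >= 4.0.0): no doubling needed.
raw_label <- paste(a, r"(\%)")
stopifnot(identical(label, raw_label))
```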
R> latexTranslate("You want to give me 100$ ? I agree 100% !")
[1] "You want to give me 100\\$ ? I agree 100\\% !"

How do I strip dollar signs ($) from data/ escape special characters in R?

I've been using gsub("toreplace","replacement", myvector) to clean out data in R. While this works for commas and the like, removing "$" has no effect. So if I do gsub("$","",myvector) all the dollar signs remain in place.
I think this is because $ is a special character in R. I tried escaping it "\$" but that yields the same result (no effect). And I couldn't find a resource on escaping special characters in R.
Obviously I should do this in preprocessing. But I was wondering if anyone out there knew how to either a) escape special characters in R b) get rid of pesky $ in R directly. For science.
You have to escape it twice, first for R, second for the regex.
gsub('\\$', '', c("a$a", "bb$"))
[1] "aa" "bb"
See ?Quotes for details on quoting and escaping.
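To see why the unescaped pattern has no effect: in a regular expression, $ is the end-of-string anchor, so gsub("$", "", x) replaces the empty position at the end with nothing. A small sketch (the replacement "!" is only there to make the match visible):

```r
x <- c("a$a", "bb$")

# Unescaped $ matches the empty end-of-string position, so nothing is removed.
unchanged <- gsub("$", "", x)
stopifnot(identical(unchanged, x))

# Replacing that position with "!" shows where the match happens.
marked <- gsub("$", "!", x)
print(marked)
# [1] "a$a!" "bb$!"

# Escaped once for the regex and once for the R string, $ is a literal dollar sign.
cleaned <- gsub("\\$", "", x)
print(cleaned)
# [1] "aa" "bb"
```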
Use fixed = TRUE:
gsub('$', '', c("a$a", "bb$"), fixed = TRUE)
Then you don't need to worry about any special characters. In stringr, this is implemented a little differently:
library(stringr)
str_replace_all(c("$100","ta$ty"), fixed("$"), "")
Thanks to DiggyF and James for the examples!
Escaping characters can be a pain sometimes, but just putting the character in square brackets (making it a character class) helps with this:
> gsub("[$]","",c("$100","ta$ty"))
[1] "100" "taty"
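The three base R spellings suggested in this thread give the same result; a quick sketch to confirm it (the input vector is illustrative):

```r
x <- c("$100", "ta$ty", "a$a")

escaped   <- gsub("\\$", "", x)              # backslash-escaped regex
literal   <- gsub("$", "", x, fixed = TRUE)  # no regex at all
bracketed <- gsub("[$]", "", x)              # character class

stopifnot(identical(escaped, literal),
          identical(escaped, bracketed))

print(escaped)
# [1] "100"  "taty" "aa"
```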
If you have $ followed by a number in a set of data columns (e.g. $400,000), there is an easier way that worked like a charm for me:
library(dplyr)  # mutate_at and %>%
library(readr)  # parse_number
data %>%
  mutate_at(5:6, parse_number)
where 5:6 are the positions of the data columns.
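If you would rather not depend on extra packages, roughly the same cleanup can be sketched in base R. The data frame, its column names, and its values are made up for illustration, and this version only strips dollar signs and thousands separators, which is less forgiving than parse_number.

```r
# Hypothetical data frame; column names and values are made up for illustration.
sales <- data.frame(region  = c("north", "south"),
                    revenue = c("$400,000", "$1,250"),
                    cost    = c("$90,500", "$300"),
                    stringsAsFactors = FALSE)

money_cols <- c("revenue", "cost")

# Drop "$" and "," with a character class, then convert the rest to numeric.
sales[money_cols] <- lapply(sales[money_cols],
                            function(col) as.numeric(gsub("[$,]", "", col)))

str(sales)
```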
