R: replace quoted values in YAML

I am replacing values in YAML files, and need to choose which values will be quoted. I don't understand the current behaviour of write_yaml:
library(yaml)
test <- list("a"=list("b"="123", "c"="rabbitmq"))
as.yaml(test)
write_yaml(test, "test.yaml")
In test.yaml, "123" is quoted while rabbitmq is not. How can I deliberately choose what will be quoted?

It is actually possible to generate quoted character values quite straightforwardly, without any esoteric workaround.
From the help for yaml::as.yaml we have:
"Character vectors that have an attribute of ‘quoted’ will be wrapped in double quotes (see below for example)."
Now, here is the test:
veck1 <- c("a")
veck2 <- veck1
attr(veck2, which = "quoted") <- TRUE # actually any attribute value works here
test_list <- list(unquoted_char = veck1, quoted_char = veck2)
write_yaml(test_list, "./test.yaml")
gives me the yaml file:
unquoted_char: a
quoted_char: "a"
which is exactly what was asked for, @gaut.
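If several fields need this, a small helper keeps it tidy. This is just a sketch; quote_value() is a hypothetical name, the mechanism is the documented 'quoted' attribute:
library(yaml)
# hypothetical helper: mark a value so as.yaml()/write_yaml() double-quote it
quote_value <- function(x) {
  attr(x, "quoted") <- TRUE
  x
}
test <- list(a = list(b = quote_value("123"), c = quote_value("rabbitmq")))
cat(as.yaml(test))
#> a:
#>   b: "123"
#>   c: "rabbitmq"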
Maybe the R/package versions need an update; for reference, mine are R version 4.2.1 (2022-06-23, nickname "Funny-Looking Kid", svn rev 82513) and yaml 2.3.5.

workaround
Although this might be difficult for more complex YAML file outputs, there is a workaround to the apparent inability of the yaml package to force quotations, as explored below. We can use regex to match, in this example, any non-quoted continuous group of letters or numbers and force quotations around it.
library(stringr)
test <- list("a"=list("b"=123L, "c"="rabbitmq", d = "Yes"))
test_yaml <- as.yaml(test)
test_yaml
#> [1] "a:\n b: 123\n c: rabbitmq\n d: 'Yes'\n"
We can see here that we might want to quote 123 and rabbitmq (Yes is already in quotes).
test_yaml <- str_replace_all(test_yaml, "(?<=:\\s)([a-zA-Z0-9]*?)(?=\n)", "'\\1'")
test_yaml
#> [1] "a:\n b: '123'\n c: 'rabbitmq'\n d: 'Yes'\n"
Then write this string to your YAML file. Since test_yaml is already serialized YAML text, use writeLines() rather than write_yaml() (which would re-serialize and quote the whole string):
writeLines(test_yaml, "/Users/caldwellst/Desktop/test.yaml")
We then get what I assume is your desired output:
a:
  b: '123'
  c: 'rabbitmq'
  d: 'Yes'
yaml package
write_yaml quotes "123" because you passed it as a string; without the quotes it would be re-read as a number. You don't need quotes for rabbitmq because it will automatically be interpreted as a string. If you actually want 123 to be interpreted as numeric, just pass it as such.
test <- list("a"=list("b"=123, "c"="rabbitmq"))
as.yaml(test)
write_yaml(test, "test.yaml")
Then you get:
a:
  b: 123.0 # pass b as 123L to force integer output
  c: rabbitmq
Per the documentation of write_yaml:
Character vectors that have a class of ‘verbatim’ will not be quoted in the output YAML document except when the YAML specification requires it
Although that is a way to force certain values to be output without quotes, there appears to be no easy way to force quoting of values. I've tried passing unicode = TRUE and putting explicit quotes inside the string, but they are removed. I also tried to trick the system by passing in various handlers that manipulate the class (and strip verbatim where present), but none of them worked. As the serialization code is written in C, I was not able to get anything working with write_yaml.
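For completeness, the 'verbatim' direction from the documentation quoted above does work; a minimal sketch (note the unquoted value will re-read as a number):
v <- "123"
class(v) <- "verbatim"
cat(as.yaml(list(b = v)))
#> b: 123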
There's a detailed answer about the YAML syntax and quoting in general in this great SO post.

Related

How to pass a chr variable into r"(...)"?

I've seen that since 4.0.0, R supports raw strings using the syntax r"(...)". Thus, I could do:
r"(C:\THIS\IS\MY\PATH\TO\FILE.CSV)"
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
While this is great, I can't figure out how to make this work with a variable, or better yet with a function. See this comment which I believe is asking the same question.
This one can't even be evaluated:
construct_path <- function(my_path) {
  r"my_path"
}
#> Error: malformed raw string literal at line 2
#> Error: unexpected '}' in "}"
Nor this attempt:
construct_path_2 <- function(my_path) {
  paste0(r, my_path)
}
construct_path_2("(C:\THIS\IS\MY\PATH\TO\FILE.CSV)")
#> Error: '\T' is an unrecognized escape in character string starting ""(C:\T"
Desired output
# pseudo-code
my_path <- "C:\THIS\IS\MY\PATH\TO\FILE.CSV"
construct_path(my_path)
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
EDIT
EDIT: In light of @KU99's comment, I want to add context to the problem. I'm writing an R script to be run from the command line using Windows's CMD and Rscript. I want to let the user who executes my R script provide an argument saying where the script's output should be written. And since Windows's CMD accepts paths in the format C:\THIS\IS\MY\PATH\TO, I want to be consistent with that format as the input to my R script. So ultimately I want to take that path input and convert it to a path format that is easy to work with inside R. I thought that the r"()" construct could be a proper solution.
I think you're getting confused about what the string literal syntax does. It just says "don't try to escape any of the following characters". For external inputs like text input or files, none of this matters.
For example, if you run this code
path <- readline("> enter path: ")
You will get this prompt:
> enter path:
and if you type in your (unescaped) path:
> enter path: C:\Windows\Dir
You get no error, and your variable is stored appropriately:
path
#> [1] "C:\\Windows\\Dir"
This is not in any special format that R uses; it is plain text. The backslashes are printed this way to avoid ambiguity, but they are "really" just single backslashes, as you can see by doing
cat(path)
#> C:\Windows\Dir
The raw string syntax is only useful for shortening what you need to type. There would be no point in trying to get it to do anything else, and we need to remember that it is a feature of the R parser: it is not a function, nor is there any way to get R to apply the raw string syntax dynamically in the way you are attempting. Even if you could, it would be a long way round for a shortcut.
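The same point applies to the command-line use case from the edit: an argument arriving via commandArgs() is already plain text, so no raw-string machinery is needed. A minimal sketch, assuming the script is saved as script.R and invoked with Rscript:
# run as:  Rscript script.R C:\THIS\IS\MY\PATH\TO\FILE.CSV
args <- commandArgs(trailingOnly = TRUE)
path <- args[1]     # the backslashes arrive as plain single backslashes
cat(path, "\n")     # C:\THIS\IS\MY\PATH\TO\FILE.CSV
print(path)         # "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV" (escaped display only)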

iconv() returns NA when given a string with a specific special character

I am trying to convert some strings of an input file from UTF8 to ASCII. For most of the strings I give it, the conversion works perfectly fine with iconv(). However on some of them, it returns NA. While manually fixing the issue in the file seems like the simplest option, it is unfortunately not an option that I have available at the moment at all.
I have made a reproducible example of my problem, but we have to assume that I must find a way for iconv() to somehow convert the string in s1 without getting NA.
Here is the reproducible example:
s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing
s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')
I get the following output:
> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"
After playing around with the code, I figured that something was wrong in the entry "Besançon" that is now copied exactly from the input file. When I input it manually myself, the problem is solved. Since I can't modify the input file at all, what do you think is the exact issue and would you have any idea on how to solve it?
Thanks in advance,
Edit:
After closer inspection, there is something odd in the characters of the first line, though it seems to be stripped by SO's formatting. The best demonstration I could give was two screenshots: the first places my cursor just before the #; the second, taken after pressing delete, which should remove the whitespace, shows that the " gets deleted instead. So there is definitely something weird there.
It turns out that using sub='' actually solved the issue although I am quite unsure why.
iconv(s1, to='ASCII//TRANSLIT', sub='')
From the documentation of sub:
character string. If not NA it is used to replace any non-convertible bytes in the input. (This would normally be a single character, but can be more.) If "byte", the indication is "<xx>" with the hex code of the byte. If "Unicode" and converting from UTF-8, the Unicode point in the form "<U+xxxx>".
So I eventually figured out that there was a character I couldn't convert (nor see) in the string and using sub was a way to eliminate it. I am still not sure what this character is though. But the problem is solved.
There is probably a latin1 (or other encoding) character in your supposedly utf8 file. For example:
> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
You can e.g. make one new vector or data column with the result of each character set encoding in your data, and if one is NA, fall back to the next one, e.g.
utf8 = iconv(x,'utf8','ascii//translit')
latin1 = iconv(x,'latin1','ascii//translit')
win1250 = iconv(x,'Windows-1250','ascii//translit')
result = ifelse(
  is.na(utf8),
  ifelse(
    is.na(latin1),
    win1250,
    latin1
  ),
  utf8
)
If these encodings don't work, make a file with just the problem word, then use the unix/linux file command to detect the encoding, or else try some likely encodings.
I have in the past simply listed all of iconv's supported encodings, tried them all with lapply, and then used whichever results worked on each string. But some "from" encodings will return a non-NA yet incorrect result, so it is best to try this on each unique character in your data in order to decide which subset of iconv's encodings to use, and in which order.
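A sketch of that brute-force approach, assuming x holds the problem strings:
encs <- iconvlist()   # every encoding name this build of R supports
tries <- lapply(encs, function(e)
  tryCatch(iconv(x, e, 'ASCII//TRANSLIT'), error = function(err) NA))
names(tries) <- encs
ok <- Filter(function(v) !anyNA(v), tries)  # encodings that converted everything
names(ok)  # candidate "from" encodings; inspect the results before trusting them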

How can I keep toJSON from quoting my JSON string in R?

I am using OpenCPU and R to create a web API that takes in some inputs and returns a topoJSON file from a database, as well as some other information. OpenCPU automatically pushes the output through toJSON, which results in JSON output that has quoted JSON in it (i.e., the topoJSON). This is obviously not ideal, especially since it then gets incredibly cluttered with backslash-escaped quotes (\"). I tried using fromJSON to convert it to an R object that could then be converted back (which is incredibly inefficient), but it returns slightly different syntax, and the result is that it doesn't work.
I feel like there should be some way to convert the string to some other type of object that results in toJSON calling a different handler that tells it to just leave it alone, but I can't figure out how to do that.
> s <- '{"type":"Topology","objects":{"map": "0"}}'
> fromJSON(s)
$type
[1] "Topology"
$objects
$objects$map
[1] "0"
> toJSON(fromJSON(s))
{"type":["Topology"],"objects":{"map":["0"]}}
That's just the beginning of the file (I replaced the actual map with "0"), and as you can see, brackets appeared around "Topology" and "0". Alternately, if I just keep it as a string, I end up with this mess:
> toJSON(s)
["{\"type\":\"Topology\",\"objects\":{\"0000595ab81ec4f34__csv\": \"0\"}}"]
Is there any way to fix this so that I just get the verbatim string but without quotes and backticks?
EDIT: Note that because I'm using OpenCPU, the output needs to come from toJSON (so no other function can be used, unfortunately), and I can't do any post-processing.
It seems you just want scalar values rather than vectors. Set auto_unbox = TRUE to turn length-one vectors into scalar values:
toJSON(fromJSON(s), auto_unbox = TRUE)
# {"type":"Topology","objects":{"map":"0"}}
That does print without escaping for me (using jsonlite_1.5). Maybe you are using an older version of jsonlite. You can also get around that by using cat() to print the result. You won't see the slashes when you do that.
cat(toJSON(fromJSON(s), auto_unbox = TRUE))
You can manually unbox the relevant entries:
library(jsonlite)
s <- '{"type":"Topology","objects":{"map": "0"}}'
j <- fromJSON(s)
j$type <- unbox(j$type)
j$objects$map <- unbox(j$objects$map)
toJSON(j)
# {"type":"Topology","objects":{"map":"0"}}

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrixes with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which returns a logical vector of matches. For i = 2, element 1 of the result would be TRUE, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems not to deal with Unicode characters properly: when I import the sets and view them, they use the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
This is not ideal for a couple of reasons, most jarringly that it doesn't look possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, yet none of them can be searched. (That is, α and ß didn't turn into their Unicode keys, but they're also not searchable by themselves; so I'd have to turn them into their keys, then alter the keys into a form grep() can use, then search.)
That means I can't just regex the keys into a searchable format--and even if I could, I have a lot of entries including characters that'd need to be escaped (e.g., () or ), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (coordinate to that second question: is there a method to convert a unicode character to its key? I can't seem to find anything that does that specific function.)
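On the side question of converting between a character and its <U+xxxx> key: base R's utf8ToInt() and intToUtf8() go both ways. A minimal sketch:
utf8ToInt("\u03B3")                       # 947, the code point
sprintf("<U+%04X>", utf8ToInt("\u03B3"))  # character -> key: "<U+03B3>"
intToUtf8(0x03B3)                         # key -> character: "γ"
# rewrite a printed key back into a searchable character:
key <- "<U+03B3>"
intToUtf8(strtoi(sub("<U\\+([0-9A-Fa-f]+)>", "\\1", key), 16L))  # "γ"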

Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?

In Unicode, letters with accents can be represented in two ways: as the accented letter itself, and as the combination of the bare letter plus the combining accent. For example, é (U+00E9) and e´ (U+0065 U+0301) are usually displayed in the same way.
R renders the following (version 3.0.2, Mac OS 10.7.5):
> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"
However, of course:
> "\u00e9" == "\u0065\u0301"
[1] FALSE
Is there a function in R which converts two-unicode-character-letters into their one-character form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".
That would be extremely handy to process large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and is better handled by plot.
Thanks a lot in advance.
Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.
It has Unicode normalization functions, as I was looking for (here form C):
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:
> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical order.
Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!
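As a side note, stringi also exposes the inverse transform (form D), which makes the relationship between the two representations easy to verify:
library(stringi)
stri_trans_nfc('\u0065\u0301')             # composes to the single code point é
utf8ToInt(stri_trans_nfc('\u0065\u0301'))  # 233, i.e. 0xE9
nchar(stri_trans_nfd('\u00e9'))            # 2: decomposed back to e + combining accent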
