I built a script in R that automatically creates a very long and complex SQL query, which builds a view over similar tables in 5 databases.
Of course there were integration issues to solve. The only one remaining is the problem I am going to present to you now.
Consider a very long string like
'"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",'
My objective is to replace
this string '"/*' with this string '/*'
this string '*/",' with this string '*/'
I tried with:
gsub('"/*', '/*', '"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",')
but it returns the string
'/**NOTES*//*, /**TABLE_ID*//*, /**TABLE_SUB_ID*//*, /**TABLE_SUB_SUB_ID*//*, /*OTHER_COLUMNS/*,'
whereas my expected output is the following string:
'/*NOTES*/ /*TABLE_ID*/ /*TABLE_SUB_ID*/ /*TABLE_SUB_SUB_ID*/ "OTHER_COLUMNS",'
Note that the * is not escaped here; it represents the start (/*) and end (*/) of comments once the string is eventually run by the SQL engine.
Escaping regex metacharacters in an R string requires two backslashes, so the following will get you what you want:
gsub('"?(/\\*|\\*/)"?', '\\1', '"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",')
# [1] "/*NOTES*/, /*TABLE_ID*/, /*TABLE_SUB_ID*/, /*TABLE_SUB_SUB_ID*/, \"OTHER_COLUMNS\","
FYI, double backslashes are required for most special characters, but the following are legitimate single-backslash escape sequences:
'\a\b\f\n\r\t\v'
# [1] "\a\b\f\n\r\t\v"
'\u0101' # unicode, numbers are variable
# [1] "a"
'\x0A' # hex, hex-numbers are variable
# [1] "\n"
Perhaps there are more; I didn't find the authoritative list, though I'm sure it's in the documentation somewhere (?Quotes is the usual place to look).
I've seen that since 4.0.0, R supports raw strings using the syntax r"(...)". Thus, I could do:
r"(C:\THIS\IS\MY\PATH\TO\FILE.CSV)"
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
While this is great, I can't figure out how to make this work with a variable, or better yet with a function. See this comment which I believe is asking the same question.
This one can't even be evaluated:
construct_path <- function(my_path) {
r"my_path"
}
Error: malformed raw string literal at line 2
}
Error: unexpected '}' in "}"
Nor this attempt:
construct_path_2 <- function(my_path) {
paste0(r, my_path)
}
construct_path_2("(C:\THIS\IS\MY\PATH\TO\FILE.CSV)")
Error: '\T' is an unrecognized escape in character string starting ""(C:\T"
Desired output
# pseudo-code
my_path <- "C:\THIS\IS\MY\PATH\TO\FILE.CSV"
construct_path(my_path)
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
EDIT
In light of #KU99's comment, I want to add the context to the problem. I'm writing an R script to be run from the command line using Windows's CMD and Rscript. I want to let the user who executes my R script provide an argument saying where they want the script's output to be written. And since Windows's CMD accepts paths in the format C:\THIS\IS\MY\PATH\TO, I want to be consistent with that format as the input to my R script. So ultimately I want to take that path input and convert it to a path format that is easy to work with inside R. I thought that the r"()" thing could be a proper solution.
I think you're getting confused about what the raw string literal syntax does. It just tells the parser not to interpret escape sequences in what follows. For external inputs like text input or files, none of this matters.
For example, if you run this code
path <- readline("> enter path: ")
You will get this prompt:
> enter path:
and if you type in your (unescaped) path:
> enter path: C:\Windows\Dir
You get no error, and your variable is stored appropriately:
path
#> [1] "C:\\Windows\\Dir"
This is not in any special format that R uses; it is plain text. The backslashes are printed doubled to avoid ambiguity, but they are "really" just single backslashes, as you can see by doing
cat(path)
#> C:\Windows\Dir
The raw string syntax is only useful for shortening what you need to type. There would be no point in trying to get it to do anything else, and we need to remember that it is a feature of the R parser: it is not a function, nor is there any way to get R to apply the raw string syntax dynamically in the way you are attempting. Even if you could, it would be a long way around for a shortcut.
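The same applies to the Rscript use case in the edit: command-line arguments arrive as plain text that already contains single backslashes, so no raw-string trick is needed. A minimal sketch, where the script name myscript.R and the normalization step are just illustrative assumptions:
# myscript.R -- run as: Rscript myscript.R C:\THIS\IS\MY\PATH\TO\FILE.CSV
args <- commandArgs(trailingOnly = TRUE)
path <- args[1]            # arrives as plain text containing single backslashes
cat(path)                  # prints the path exactly as the user typed it
# optionally switch to forward slashes, which are easier to work with inside R
path <- normalizePath(path, winslash = "/", mustWork = FALSE)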
I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \, such that the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did search SO to see if there were any questions asking "what is an escape character in R", but I didn't review all 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quotes nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review:How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use teh fixed argument to get around that requirement.
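For instance, here is a tiny illustration of both approaches, assuming a string containing literal plus signs:
x <- "a+b+c"
gsub("\\+", "-", x)              # double-escape the regex metacharacter
#[1] "a-b-c"
gsub("+", "-", x, fixed = TRUE)  # or match the pattern literally
#[1] "a-b-c"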
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: look at the series of help pages that start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first, and it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved.) I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequence letters are listed.
I am trying to compare a string (in memory) to the contents of a file to see if they are the same. Boring details on motivation are below the question if anyone cares.
My confusion is that when I hash file contents, I get a different result than when I hash the string.
library(readr)
library(digest)
# write the string to the file
the_string <- "here is some stuff"
the_file <- "fake.txt"
readr::write_lines(the_string, the_file)
# both of these functions (predictably) give the same hash
tools::md5sum(the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
digest(file = the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
# now read it back to a string and get something different
back_to_a_string <- readr::read_file(the_file)
# "here is some stuff\n"
digest(back_to_a_string)
# "03ed1c8a2b997277100399bef6f88939"
# add a newline because that's what write_lines did
orig_with_newline <- paste0(the_string, "\n")
# "here is some stuff\n"
digest(orig_with_newline)
# "03ed1c8a2b997277100399bef6f88939"
What I want to do is just digest(orig_with_newline) == digest(file = the_file) to see if they're the same (they are) but that returns FALSE because, as shown, the hashes are different.
Obviously I could either read the file back to a string with read_file or write the string to a temp file, but both of those seem a bit silly and hacky. I guess both of those are actually fine solutions; I really just want to understand why this is happening so that I can better understand how the hashing works.
Boring details on motivation
The situation is that I have a function that will write a string to a file, but if the file already exists then it will error unless the user has explicitly passed .overwrite = TRUE. However, if the file exists, I would like to check whether the string about to be written is in fact the same thing that's already in the file. If this is the case, then I will skip the error (and the write). This code could be called in a loop, and it would be obnoxious for the user to continually see an error saying they are about to overwrite a file with the same thing that's already in it.
Short answer: I think you need to set serialize=FALSE. Supposing that the file doesn't contain the extra newline (see below),
digest(the_string,serialize=FALSE) == digest(file=the_file) ## TRUE
(serialize has no effect on the file= version of the command)
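To see why serialize matters here: with the default serialize = TRUE, the hash is taken over R's serialization of the object (including its type and attributes), not over the raw characters, so these two calls give different results (hash values omitted; the point is just that they differ):
digest(the_string)                     # hashes the serialized R object
digest(the_string, serialize = FALSE)  # hashes just the characters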
dealing with newlines
If you read ?write_lines, it only says
sep: The line separator ... [information about defaults for different OSes]
To me, this seems ambiguous as to whether the separator will be added after the last line or not. (You don't expect a "comma-separated list" to end with a comma ...)
On the other hand, ?base::writeLines is a little more explicit,
sep: character string. A string to be written to the connection
after each line of text.
If you dig down into the source code of readr you can see that it uses
output << na << sep;
for each line it writes, i.e. it's behaving the same way as writeLines.
If you really just want to write the string to the file with no added nonsense, I suggest cat():
identical(the_string, { cat(the_string,file=the_file); readr::read_file(the_file) }) ## TRUE
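Putting the two ideas together for the overwrite check described in the question, a rough sketch might look like the following (write_if_changed is a hypothetical helper, not part of any package):
library(digest)

write_if_changed <- function(the_string, the_file, .overwrite = FALSE) {
  if (file.exists(the_file)) {
    # compare content hashes: serialize = FALSE so the string is hashed as raw text
    same <- digest(the_string, serialize = FALSE) == digest(file = the_file)
    if (same) return(invisible(FALSE))  # identical content: silently skip the write
    if (!.overwrite) stop("file exists with different content; pass .overwrite = TRUE")
  }
  cat(the_string, file = the_file)      # cat() adds no trailing newline
  invisible(TRUE)
}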
(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrices with columns of values, so what I'm trying to do is find the indices of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which results in a logical vector of matches. For i = 2, the match is at position 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems not to be dealing with Unicode characters properly: when I import the sets and view them, they show the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple of reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####: not every Unicode character is converted to the <U+####> format, but none of them can be searched. (i.e., α and ß didn't turn into their Unicode keys, but they're also not searchable by themselves, so I'd have to turn them into their keys, then alter their keys to a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format--and even if I could, I have a lot of entries including characters that'd need to be escaped (e.g., () or ), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) Unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (As a corollary to that second question: is there a method to convert a Unicode character to its key? I can't seem to find anything that does that specific thing.)
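For what it's worth, regarding that side question of converting a character to its key: base R's utf8ToInt() gives the code point(s), which can then be formatted in the \u#### style that worked above. A small sketch (the helper name is just illustrative):
to_unicode_escape <- function(ch) {
  # utf8ToInt() returns one integer code point per character in ch
  paste(sprintf("\\u%04X", utf8ToInt(ch)), collapse = "")
}
to_unicode_escape("γ")
#[1] "\\u03B3"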
I am using the Rlibstree package, version 0.3-2, with the function getLongestCommonSubstring.
So I have character strings that only contain 0-9 and >; they look like this:
String A:
0113>0213>0212>0312>0411>0611>0711>0812>1012>1112>1212>1412>1313>1413>1412>1311>1211>1212>1012>1013>0912>0812>0712>0513>0612>0511>0410>0309>0209>0308>0207>0107>0007>0109>0010>0110>0010>0008>0007>0106>0105>0204>0304>0503>0603>0701>0801>0802>0803>0904>1003>1002>1001>1002>1103>1004>0904>0803>0802>0701>0702>0603>0503>0403>0303>0204>0105>0104>0203>0302>0401>0302>0203>0204>0104>0105>0106>0107>0307>0308>0409>0410>0311>0212>0113>0213>0113>0213
String B:
0113>0213>0212>0312>0411>0511>0410>0409>0308>0307>0207>0107>0108>0109>0010>0110>0010>0009>0107>0207>0307>0308>0309>0209>0309>0410>0411>0611>0711>0812>0912>1012>1112>1212>1412>1313>1412>1212>1112>1012>1013>0912>0812>0612>0613>0513>0612>0611>0511>0411>0312>0213>0113>0213>0113>0212>0311>0411>0312>0213>0212>0311>0312>0311>0411>0410>0409>0308>0307>0207>0107>0106>0105>0204>0304>0503>0604>0603>0602>0601>0701>0801>0802>0803>0804>0904>1004>1003>1002>1001>1002>1001>1003>1004>0904>0803>0802>0801>0701>0602>0604>0504>0404>0304>0104>0105>0107>0108>0109>0108>0107>0207>0308>0409>0410>0311>0212>0213
String C:
0113>0213>0113>0213>0113>0213>0212>0311>0411>0611>0812>0912>1012>1212>1312>1412>1413>1314>1313>1213>1413>1412>1411>1311>1212>1011>0911>0811>0712>0611>0411>0410>0409>0309>0209>0309>0408>0410>0510>0611>0712>0611>0511>0411>0311>0310>0409>0309>0307>0207>0108>0109>0110>0010>0109>0108>0107>0006>0106>0105>0204>0203>0303>0204>0203>0302>0401>0402>0401>0302>0203>0304>0404>0504>0503>0604>0705>0605>0705>0604>0505>0504>0603>0503>0403>0303>0203>0104>0105>0005>0107>0108>0109>0108>0107>0207>0107>0106>0104>0204>0304>0404>0504>0603>0604>0603>0503>0504>0603>0702>0701>0801>0802>0804>0904>1004>1003>1002>1001>1002>1003>1104>1205>1304>1303>1403>1404>1403>1304>1205>1104>0904>0804>0802>0801>0701>0602>0703>0604>0704>0602>0701>0601>0602>0603>0504>0404>0303>0203>0204>0105>0106>0107>0207>0308>0408>0409>0308>0309>0409>0410>0411>0511>0611>0812>0912>1012>1112>1012>0912>1013>1012>1112>1212>1312>1313>1213>1313>1312>1412>1313>1312>1413>1313>1213>1313>1312>1112>1012>0911>1011>1112>1312>1412>1312>1413>1313>1312>1212>1112>0911>0811>0711>0511>0411>0312>0212>0312>0411>0511>0611>0612>0413>0513>0612>0611>0411>0312>0212>0213>0212>0213>0113>0213>0113
I want to compare my input string with String A.
See example below:
If I compare A and B, no problem: it finds two longest common substrings, happy!
getLongestCommonSubstring(c(A,B))
[1] "07>0106>0105>0204>0304>0503>060" "12>1012>1112>1212>1412>1313>141"
BUT, if I compare A and C, something strange happens. As you can see in the results,
I get \xc1 or ! at the end, and these special characters change every time.
Execute First time:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>1\xc1" ">0603>0503>0403>0303>020!"
Execute Second time:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>11" ">0603>0503>0403>0303>020!"
Execute Third time:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>1\xc1" ">0603>0503>0403>0303>020\xc1"
With these special characters, or escape characters, in the string, I cannot perform tasks like calling the nchar() function; these characters are redundant and annoying.
For me, the only difference between B and C is their length; their format is the same, and I really cannot figure out why this happens.