R problems with encoding after reading from MS SQL via ODBC

We have a database and R scripts running on different virtual machines.
First we connect to the db:
library(DBI)
library(odbc)

con <- dbConnect(
  odbc(),
  Driver = "SQL Server",
  Server = "server", Database = "db", UID = "uid", PWD = "pwd",
  encoding = "UTF-8"
)
and gather the data:
data <- dbGetQuery(con, "SELECT * FROM TableName")
The problem is the following: when different machines run the same script, on some of them we run into encoding issues with character variables.
For example, this is what we have on machine A
> data$char_var[1]
[1] "фамилия"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "гревцев"
and this is what we have on machine B
> data$char_var[1]
[1] "<e3><f0><e5><e2><f6><e5><e2>"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "фамилия"
On machine A the script prints the initial value correctly, but returns gibberish once the encoding is re-declared. The same code on machine B initially prints escaped bytes and prints readable values only after the re-declaration. What could be the reason for such a difference?
In the end we want the script to produce the same "фамилия" output on every machine, so that it can be shown on a dashboard.

According to the result of your call to Encoding(data$char_var[1]), both machines are declaring the returned results to be encoded using UTF-8.
On the first machine, that appears to be true, because you're seeing valid output. Then you mess it up by declaring the encoding incorrectly as "1251", and you see gibberish.
On the second machine, the result comes to you declared as UTF-8, but it isn't (which is why it looks like gibberish to start). When you change the declared encoding to be "1251" it looks okay, so that's what it must have been all along.
So you have two choices:
Make sure both machines are consistent about what they return from the dbGetQuery. You can handle either encoding, but you need to know what it is and make sure it's declared correctly.
Alternatively, try to detect what is being returned, and declare it appropriately. One way to do this might be to put a known string in the database and compare the result to that. If you know you should get "фамилия" and you get something else, switch the declared encoding. You could also try the readr::guess_encoding() function.
One other issue is that some downstream function might only be able to handle one of the two encodings, UTF-8 or 1251. (Windows R is really bad at non-native encodings, and UTF-8 is never native on Windows.) In that case you may want to actually convert to a common encoding. You can use the iconv() function to do that, e.g.
iconv(char_var, from = "cp1251", to = "UTF-8")
will try to convert to UTF-8.
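The detect-and-convert idea can be sketched as follows. This is a hypothetical fix_encoding helper (not part of DBI/odbc), assuming the only two candidate encodings are UTF-8 and CP1251 and that validUTF8() sees the bytes as they came from the driver:

```r
# Hypothetical helper: normalise a character vector to UTF-8, assuming each
# element is either already valid UTF-8 or CP1251 bytes mislabelled as UTF-8.
fix_encoding <- function(x) {
  is_utf8 <- validUTF8(x)   # checks whether the underlying bytes are valid UTF-8
  # re-decode only the elements that are not valid UTF-8, treating them as CP1251
  x[!is_utf8] <- iconv(x[!is_utf8], from = "CP1251", to = "UTF-8")
  x
}
```

With this, CP1251 bytes such as 0xE3 0xF0 0xE5 0xE2 0xF6 0xE5 0xE2 come out as the proper UTF-8 string, while strings that were already valid UTF-8 pass through untouched.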

How can I make a connection object without an open connection in R?

I cannot seem to create a closed connection object. I want to have a connection object sitting around, so I can give it a connection if isOpen(myCon) returns FALSE. I would expect myCon<-file() to give me a closed connection, but isOpen(myCon) returns TRUE. I see from the help file that file() actually returns a connection to an "anonymous file", which is apparently just a memory location that acts like a file. ...not what I want. If I create the anonymous file and execute close(myCon), then isOpen(myCon) gives an invalid connection error rather than returning FALSE. I don't want to be trapping errors just to get my false value. How can I create a valid connection that's not open? There must be a way isOpen(myCon) can return FALSE, otherwise it's a somewhat pointless function. My OS is Windows 7.
file() works as long as the first parameter (description) is a non-empty string. Example:
> myCon <- file("dumdum")
> isOpen(myCon)
[1] FALSE
The key is that the second parameter (open) is left empty (the default). It does not matter whether the string used for description is an existing file or not. However, this does not mean that the connection can't be used. R opens the connection as needed. For example:
> myCon <- file("important log file.txt")
> isOpen(myCon)
[1] FALSE
> cat("Thinking this will fail, because the connection is closed. ...wrong!", file=myCon)
> isOpen(myCon)
[1] FALSE
The file has just been overwritten with that one line.
The safe way to make a stand-by connection is to generate the description with tempfile(). That returns a string which is guaranteed "not to be currently in use" (...according to the help page. I interpret that to mean that the string is not the name of an existing file.)
> myCon <- file( tempfile() )
> isOpen(myCon)
[1] FALSE
> cat("Didn't mean to do this, but all it will do is create a new file.", file=myCon)
> isOpen(myCon)
[1] FALSE
In this case, that line was written to a file, but it did not overwrite anything.
Many thanks to Martin Morgan for pointing me in the right direction. I welcome additional comments.
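Putting the pieces above together, a short sketch of the stand-by pattern: the description comes from tempfile(), so nothing existing can be overwritten, and open() is called explicitly when the connection is finally needed:

```r
# A stand-by connection: created closed, opened deliberately only when needed.
path <- tempfile()
myCon <- file(path)          # description only; the connection starts closed
stopifnot(!isOpen(myCon))

open(myCon, "w")             # now open it explicitly for writing
stopifnot(isOpen(myCon))
writeLines("now in use", myCon)
close(myCon)                 # closing also invalidates the connection object
```

Note that after close(), isOpen(myCon) gives an invalid-connection error rather than FALSE, as observed in the question, so keep the open/close bracketing tight.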

R: Reading a UCS-2 LE bom file from GitHub

I have a program which creates and stores files automatically on GitHub. An example is
https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master/test-999-666.txt
However, the files are encoded on a DOS/Windows machine as UCS-2 LE BOM (according to Notepad++).
I am trying to read this text file into R but to no avail:
repo <- "https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master"
file <- "test-999-666.txt"
myurl <- paste(repo, file, sep="/")
library(RCurl)
cnt <- getURL(myurl)
I get an error (the message is French for "null character in the middle of the string"):
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
caractère nul au milieu de la chaîne : '<ff><fe>*'
How can I configure getURL to read this file? I also tried with httr::GET (but receive an empty content).
This seems to be a relatively common pain point when working with files produced by Windows. I'm going to be honest and say that the solution I'm presenting doesn't seem the best, because it mainly bypasses getting everything into the right encoding and instead goes to the binary directly.
Using the same variables as you:
cnt <- getURLContent(myurl, binary = TRUE)
cnt <- rawToChar(cnt[cnt != as.raw(0)])
This should produce a parsable string.
The idea is that instead of trying to have curl read the file, we let it treat the payload as binary and deal with the encoding later. This gives us a vector of type raw. Then, since the main issue seems to be that null characters (i.e. \00) were causing a problem, we just exclude them from cnt before coercing cnt from raw to char.
In the end, from your example, I get
"ÿþ*** Header Start ***\r\nVersionPersist: 1\r\nLevelName: Session\r\nLevelName: Block\r\nLevelName: Trial\r\nLevelName: SubTrial\r\nLevelName: LogLevel5\r\nLevelName: LogLevel6\r\nLevelName: LogLevel7\r\nLevelName: LogLevel8\r\nLevelName: LogLevel9\r\nLevelName: LogLevel10\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\n*** Header End ***\r\nLevel: 1\r\n*** LogFrame Start ***\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\nClock.Information: <?xml version=\"1.0\"?>\\n<Clock xmlns:dt=\"urn:schemas-microsoft-com:datatypes\"><Description dt:dt=\"string\">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt=\"int\">0</Timestamp><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt=\"r8\">2742255</Frequency><Timestamp dt:dt=\"r8\">492902384024</Timestamp><Current dt:dt=\"r8\">0</Current><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\\n\r\nStudioVersion: 2.0.10.252\r\nRuntimeVersion: 2.0.10.356\r\nRuntimeVersionExpected: 2.0.10.356\r\nRuntimeCapabilities: Professional\r\nExperimentVersion: 1.0.0.543\r\nExperimentStuff.RT: 2555\r\n*** LogFrame End ***\r\n"
Which seems to contain all the right content.
If you want, you can try adding options(encoding = "UCS-2LE-BOM") before this code; I don't know if it changes anything, but it seems to affect rawToChar.
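An alternative that keeps all the information is to decode the bytes as UTF-16 instead of dropping the nulls. This sketch relies on the facts that iconv() accepts a list of raw vectors and that from = "UTF-16" honours the BOM; with RCurl, the raw vector would come from getURLContent(myurl, binary = TRUE):

```r
# Example bytes: a UTF-16 LE BOM (0xFF 0xFE) followed by "Hi" in UTF-16LE.
bytes <- as.raw(c(0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00))

# iconv() can take a list of raw vectors; "UTF-16" reads the BOM to pick
# the byte order and drops it from the decoded output.
txt <- iconv(list(bytes), from = "UTF-16", to = "UTF-8")
```

This avoids both the stray "ÿþ" prefix and the risk of mangling any non-ASCII characters that happen to use the high byte.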

R script file encoding (R Studio)

Which file encoding do I have to use to be able to save this vector (from Matching complex URLs within text blocks (R)) correctly in an R script? The special characters and Chinese characters seem to complicate things.
x <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"(Something like http://foo.com/blah_blah)",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/more_(than)_one_(parens)",
"(Something like http://foo.com/blah_blah_(wikipedia))",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://foo.com/blah_blah.",
"http://foo.com/blah_blah/.",
"<http://foo.com/blah_blah>",
"<http://foo.com/blah_blah/>",
"http://foo.com/blah_blah,",
"http://www.extinguishedscholar.com/wpglob/?p=364.",
"http://✪df.ws/1234",
"rdar://1234",
"rdar:/1234",
"x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
"message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e",
"http://➡.ws/䨹",
"www.c.ws/䨹",
"<tag>http://example.com</tag>",
"Just a www.example.com link.",
"http://example.com/something?with,commas,in,url, but not at end",
"What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets).",
"mailto:name#example.com",
"bit.ly/foo",
"“is.gd/foo/”",
"WWW.EXAMPLE.COM",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
"http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634))")
I appreciate any help.
Running your example,
source('file.R', encoding="unknown")
works fine and saving as R object and reloading works as well:
save(x, file='kk.Rd')
load('kk.Rd')
You can get all different encodings with iconvlist() and test them all, for example:
vals <- lapply(iconvlist(), function(x)
  tryCatch(source('file.R', encoding=x),
           error=function(e) return(NULL)))
with file.R being your script, and then
iconvlist()[which(!sapply(vals, function(x)is.null(x)))]
gives you all encodings where no error was thrown while loading.
Does this help?
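For completeness, the round trip can also be sketched explicitly: save the script as UTF-8 and state that encoding when sourcing. This toy example writes a two-element version of the vector (one URL with a non-ASCII character) to a temporary script and reads it back:

```r
# Write a small script containing a non-ASCII URL as UTF-8, then source it
# back with the encoding stated explicitly.
path <- tempfile(fileext = ".R")
writeLines('x <- c("http://\u272adf.ws/1234", "http://foo.com/blah_blah")', path)
source(path, encoding = "UTF-8")
```

In RStudio the same effect is achieved via File > Save with Encoding > UTF-8, plus source(..., encoding = "UTF-8").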

Validate a character as a file path?

What's the best way to determine whether a character string is a valid file path? So CheckFilePath("my*file.csv") would return FALSE (on Windows, * is an invalid character), whereas CheckFilePath("c:\\users\\blabla\\desktop\\myfile.csv") would return TRUE.
Note that a file path can be valid but not exist on disk.
This is the code that save uses to perform that check:
....
else file(file, "wb")
on.exit(close(con))
}
else if (inherits(file, "connection"))
con <- file
else stop("bad file argument")
......
Perhaps file.exists() is what you're after? From the help page:
file.exists returns a logical vector indicating whether the files named by its argument exist. (Here ‘exists’ is in the sense of the system's stat call: a file will be reported as existing only if you have the permissions needed by stat. Existence can also be checked by file.access, which might use different permissions and so obtain a different result.)
Several other functions to tap into the computer's file system are available as well, also referenced on the help page.
No, there's no way to do this (reliably). I don't see an operating-system interface in either Windows or Linux to test this. You would normally try to create the file and get a failure message, or try to read the file and get a 'does not exist' kind of message.
So you should rely on the operating system to let you know if you can do what you want to do to the file (which will usually be read and/or write).
I can't think of a reason other than a quiz ("Enter a valid fully-qualified Windows file path:") to want to know this.
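If a purely syntactic check is still wanted despite the caveats above, here is a rough sketch. check_file_path is a hypothetical helper that only looks for characters Windows forbids in file names; it says nothing about whether the path can actually be created:

```r
# Hypothetical helper: a purely syntactic check that rejects file names
# containing characters Windows forbids; it does NOT prove the path is usable.
check_file_path <- function(path) {
  # strip the directory part, accepting both / and \ as separators
  fname <- sub(".*[/\\\\]", "", path)
  nchar(fname) > 0 && !grepl('[<>:"/\\\\|?*[:cntrl:]]', fname)
}

check_file_path("my*file.csv")                            # FALSE: '*' is forbidden
check_file_path("c:\\users\\blabla\\desktop\\myfile.csv") # TRUE
```

Note this only inspects the final path component, so drive letters and separators in the directory part are not flagged; a real validator would also need per-platform rules for reserved names such as CON or NUL.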
I would suggest trying the checkPathForOutput function offered by the checkmate package. As stated in the linked documentation, the function:
Check[s] if a file path can safely be used to create a file and write to it.
Example
checkmate::checkPathForOutput(x = tempfile(pattern = "sample_test_file", fileext = ".tmp"))
# [1] TRUE
checkmate::checkPathForOutput(x = "c:\\users\\blabla\\desktop\\myfile.csv")
# [1] TRUE
Invalid path
The \0 character should not be used in Linux¹ file names:
checkmate::check_path_for_output("my\0file.csv")
# Error: nul character not allowed (line 1)
¹ Not tested on Windows, but looking at the code of checkmate::check_path_for_output indicates that the function should work correctly on MS Windows systems as well.

InputB vs. Get; code pages; slow reading on unix server

We have been using the usual code to read in a complete file into a string to then parse in VB6. The files are ANSI text but encoded using whatever code page the user was in at the time (we have Chinese and English users for example). This is the code
Open FileName For Binary As nFileUnit
sContents = StrConv(InputB(LOF(nFileUnit), nFileUnit), vbUnicode)
However, we have discovered this is VERY slow reading a file from a server running unix/linux, particularly when the ownership of the file is not the same as the process doing the reading.
I have rewritten the above using Get and discovered that it is much faster and does not suffer from any issues with file ownership. I appreciate that the ownership issue might be solved by reconfiguring the server somehow, but since discovering that even without that issue the Get method is still much faster than InputB, I'd like to replace my existing code with Get.
I wonder if someone could tell me if this will really do the same thing. In particular, is it correctly doing the ANSI to Unicode conversion and will this always be true. My testing suggests the following replacement code does the same thing but faster:
Open FileName For Binary As nFileUnit
sContents = String(LOF(nFileUnit), " ")
Get #nFileUnit, , sContents
I also realise I could use a byte array, but again my tests suggest the above is simpler and works. So how does Get handle the buffer correctly? (If you believe the online help for Get, it talks of characters returned; clearly this would cause problems when reading an ANSI file written on the Chinese code page with 2-byte Chinese characters in it.)
The following might be of interest, because the InputB approach is commonly given as the method to read a complete file, yet it is much slower. Examples:
Reading 380Kb file across the network from the unix server
InputB (file owned) = 0.875 sec
InputB (not owned) = 72.8 sec
Get (either) = 0.0156 sec
Reading a 9Mb file across the network from the unix server
InputB (file owned) = 19.65 sec
Get (either) = 0.42 sec
Thanks
Jonathan
InputB() is CVar(InputB$()), and is known to be horribly slow. My suspicion is that InputB$() reads the bytes and converts them to Unicode using the current codepage via some stock logic for reading text from disk, then does another conversion back to ANSI using the current codepage.
You might be far ahead to use ADODB.Stream.LoadFromFile() to load complete ANSI text files. You can set the .Type = adTypeText and .Charset = the appropriate ANSI encoding as required to read Unicode back out of it via .ReadText(x) where x can be a number of bytes, or adReadAll or adReadLine. For line reading you can set .LineSeparator to adCR, adCRLF, or adLF as required.
Many Charset values are supported: KOI8 for Cyrillic, Big5 for Chinese, etc.