I have a program which creates and stores files automatically on GitHub. An example is
https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master/test-999-666.txt
However, the files are created on a DOS/Windows machine and encoded as UCS-2 LE BOM (according to Notepad++).
I am trying to read this text file into R but to no avail:
repo <- "https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master"
file <- "test-999-666.txt"
myurl <- paste(repo, file, sep="/")
library(RCurl)
cnt <- getURL(myurl)
I get an error
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
caractère nul au milieu de la chaîne : '<ff><fe>*'
(the French locale message for "embedded nul in string: '<ff><fe>*'")
How can I configure getURL to read this file? I also tried with httr::GET but received empty content.
This seems to be a relatively common pain point when working with files produced by Windows. To be honest, the solution I'm presenting may not be the best, because it largely bypasses getting everything into the right encoding and instead works on the binary content directly.
Using the same variables as you:
cnt <- getURLContent(myurl, binary = TRUE)   # fetch the body as raw bytes rather than text
cnt <- rawToChar(cnt[cnt != as.raw(0)])      # drop the embedded nul bytes, then coerce to character
Should produce a parsable string.
The idea is that instead of trying to have curl decode the file, we let it treat the content as binary and deal with the encoding later. This gives us a vector of type raw. Then, since the main issue seems to be that null characters (i.e. \00) were causing the problem, we simply exclude them from cnt before coercing cnt from raw to character.
In the end, from your example, I get
"ÿþ*** Header Start ***\r\nVersionPersist: 1\r\nLevelName: Session\r\nLevelName: Block\r\nLevelName: Trial\r\nLevelName: SubTrial\r\nLevelName: LogLevel5\r\nLevelName: LogLevel6\r\nLevelName: LogLevel7\r\nLevelName: LogLevel8\r\nLevelName: LogLevel9\r\nLevelName: LogLevel10\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\n*** Header End ***\r\nLevel: 1\r\n*** LogFrame Start ***\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\nClock.Information: <?xml version=\"1.0\"?>\\n<Clock xmlns:dt=\"urn:schemas-microsoft-com:datatypes\"><Description dt:dt=\"string\">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt=\"int\">0</Timestamp><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt=\"r8\">2742255</Frequency><Timestamp dt:dt=\"r8\">492902384024</Timestamp><Current dt:dt=\"r8\">0</Current><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\\n\r\nStudioVersion: 2.0.10.252\r\nRuntimeVersion: 2.0.10.356\r\nRuntimeVersionExpected: 2.0.10.356\r\nRuntimeCapabilities: Professional\r\nExperimentVersion: 1.0.0.543\r\nExperimentStuff.RT: 2555\r\n*** LogFrame End ***\r\n"
Which seems to contain all the right content.
If you want, you can try adding options(encoding = "UCS-2LE-BOM") before this code. I don't know whether it changes anything here, but it does seem to affect rawToChar.
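Another option, if you would rather decode the bytes explicitly instead of stripping the nuls, is to hand the raw vector to iconv(). This is only a sketch, and it assumes the file really is UTF-16LE/UCS-2 LE with a BOM, as Notepad++ reports:
library(RCurl)
raw_bytes <- getURLContent(myurl, binary = TRUE)   # same raw vector as above
# "UTF-16" lets iconv use the BOM to pick the byte order; try "UTF-16LE" if needed
cnt <- iconv(list(raw_bytes), from = "UTF-16", to = "UTF-8")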
I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). In order to save disk space I prefer not to unzip the archive, but to read these files directly into R using unz(). However, it seems that unz() is unable to extract files from the end of 'large' zip files:
ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)
gff.files <- files$Name[ grep("gff$", files$Name) ]
## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )
## this gives an empty data structure (read.table() stops
## with an error saying no lines available, or similar)
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )
## there are 31 more gff files after the 129th one.
## no lines are read from any of these.
The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line and unzip -t does not report any errors.
I've tried this with R versions 3.5 (openSUSE Leap 15.1), 3.6, and 4.2 (CentOS 7) and with more than one zip file, and get exactly the same result.
I attached strace to R whilst reading in the 128th and 129th files. In both cases I initially get a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31); this is where I assume the zip central directory can be found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a set of lseeks and reads (that extend beyond 2^31).
For the 129th file, I get the same reads towards the end of the file, but then rather than finding a position within the file I get:
lseek(3, 2845933568, SEEK_SET) = 2845933568
lseek(3, 4294963200, SEEK_SET) = 4294963200
read(3, "", 4096) = 0
lseek(3, 4095, SEEK_CUR) = 4294967295
read(3, "", 4096) = 0
Which is a bit weird since the file itself is only about 2.8 GB; 4294967295 is, of course, 2^32 - 1.
To me this feels like an integer overflow bug, and I am considering posting a bug report. But I am wondering if anyone has seen something similar before, or if I am doing something stupid.
Having done what I should have started with (reading the zip64 format specification), it's actually clear that this is not an integer overflow error.
Zip files contain a central directory at the end of the archive; this contains amongst other things the names of the compressed files and the offset of the compressed data in the zip archive. The offset (and file size fields) are only given 4 bytes each in the standard directory field; when the offset is larger than this it should instead be given in the extra fields section and the value in the standard field should be set to 0xFFFFFFFF. Since this is the offset that gets used when reading the file it seems clear that the problem lies in the parsing of the extra field.
I had a look at the source code for R 4.2.1 and it seems that the problem is due to the way the offset specified in the standard offset field is tested:
if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
changing this to == 0xFFFFFFFF seems to fix the problem.
I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.
Still, I'm curious as to whether anyone else has come across the same issue. Seems a bit unlikely that my experience is unique.
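In the meantime, a possible workaround (just a sketch, reusing the variables from the question) is to shell out to the system unzip, which handled the archive fine, extract only the member you need, read it, and delete the extracted copy afterwards:
tmp <- tempdir()
system2("unzip", c("-o", shQuote(ncbi.zip), shQuote(gff.files[129]), "-d", shQuote(tmp)))
gff.129 <- readLines(file.path(tmp, gff.files[129]))
unlink(file.path(tmp, gff.files[129]))   # reclaim the disk space again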
We have a database and R scripts running on different virtual machines.
First we connect to db
con <- dbConnect(
odbc(),
Driver = "SQL Server",
Server = "server", Database = "db", UID = "uid", PWD = "pwd",
encoding = "UTF-8"
)
and gathering the data
data <- dbGetQuery(con, "SELECT * FROM TableName")
The problem is the following: when different machines run the same script, on some of them we face character-variable encoding issues.
For example, this is what we have on machine A
> data$char_var[1]
[1] "фамилия"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "гревцев"
and this is what we have on machine B
> data$char_var[1]
[1] "<e3><f0><e5><e2><f6><e5><e2>"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "фамилия"
The first machine prints the initial value correctly but returns gibberish after the encoding is re-declared. The same code running on machine B initially prints the raw bytes and only shows the correct value after the encoding is re-declared. What could be the reason for such a difference?
In the end, we want a script that produces the same "фамилия" output value on both machines, so we can show it on a dashboard.
According to the result of your call to Encoding(data$char_var[1]), both machines are declaring the returned results to be encoded using UTF-8.
On the first machine, that appears to be true, because you're seeing valid output. Then you mess it up by declaring the encoding incorrectly as "1251", and you see gibberish.
On the second machine, the result comes to you declared as UTF-8, but it isn't (which is why it looks like gibberish to start). When you change the declared encoding to be "1251" it looks okay, so that's what it must have been all along.
So you have two choices:
Make sure both machines are consistent about what they return from the dbGetQuery. You can handle either encoding, but you need to know what it is and make sure it's declared correctly.
Alternatively, try to detect what is being returned, and declare it appropriately. One way to do this might be to put a known string in the database and compare the result to that. If you know you should get "фамилия" and you get something else, switch the declared encoding. You could also try the readr::guess_encoding() function.
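A rough sketch of the detection idea (readr is assumed to be installed; the probe just inspects the first value):
library(readr)
guess_encoding(charToRaw(data$char_var[1]))
## returns a small table of candidate encodings with confidence scores,
## which you can use to decide how to declare or convert the column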
One other issue is that some downstream function might only be able to handle one or the other of the UTF-8 and 1251 encodings. (Windows R is really bad at non-native encodings, and UTF-8 is never native on Windows.) In that case you may want to actually convert to a common encoding. You can use the iconv() function to do that, e.g.
iconv(char_var, from = "cp1251", to = "UTF-8")
will try to convert to UTF-8.
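Putting that together, a minimal sketch (assuming the only two cases are genuine UTF-8 and cp1251 bytes mis-declared as UTF-8) that normalises the column to UTF-8 on either machine:
normalize_utf8 <- function(x) {
  bad <- !validUTF8(x)                                    # TRUE where the bytes are not actually valid UTF-8
  x[bad] <- iconv(x[bad], from = "cp1251", to = "UTF-8")  # re-interpret those as cp1251
  x
}
data$char_var <- normalize_utf8(data$char_var)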
While working with an R Notebook I get the following error when getting the output of a linear regression model via broom.
This is a dummy example of what I encounter:
N <- 100
a <- rnorm(N)
b <- a + rnorm(N)
df1 <- data.frame(a, b)
lModel <- lm(b ~ a, df1)
summary(lModel)
Then if I want to get the output of tidy(lModel) I get the error:
Error in tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf") : temporary name too long
The thing is that I have used the tidy() function from broom not long before and got the output. I wonder what may be the issue, and how it can be fixed.
This is the traceback of the error above:
Error in tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf") : temporary name too long
4. tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf")
3. overridePrint(o$x, o$options, o$className, o$nRow, o$nCol)
2. print.data.frame(x)
1. function (x, ...) UseMethod("print")(x)
Thanks a lot in advance.
This error occurs on Windows systems when directories are nested too deeply. The Windows API has a maximum path length of 260 characters.
Maximum Path Length Limitation
In the Windows API (with some exceptions discussed in the following paragraphs), the maximum length for a path is MAX_PATH, which is defined as 260 characters. A local path is structured in the following order: drive letter, colon, backslash, name components separated by backslashes, and a terminating null character. For example, the maximum path on drive D is "D:\some 256-character path string<NUL>" where "<NUL>" represents the invisible terminating null character for the current system codepage. (The characters < > are used here for visual clarity and cannot be part of a valid path string.)
This is pretty easy to avoid: adjust your working directory, or the structure of where you're saving your temp file. Either your file name is too long, or your directory is nested so deeply that the path exceeds Windows' path limit.
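For instance, a quick way to check and work around this from R ("C:/proj" is just an illustrative short path):
nchar(getwd())     # length of the current working directory path
nchar(tempdir())   # length of the session temp directory path
setwd("C:/proj")   # run/knit from a shorter path so generated temp file names stay under the limit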
As an aside, on Unix systems the maximum path length is significantly longer, but there is a maximum file-name length of 255 characters.
This can be fixed with a registry change on Windows 10.
Open the registry edit tool by searching regedit in the start menu
Navigate to Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
Change LongPathsEnabled from 0 to 1
Now the 260-character restriction will be ignored.
Update Aug 4th '18: If your R working directory is inside a Google Drive File Stream folder then the limit is still enforced. This is because GDFS is a virtual drive that has its own limitations.
Which file encoding do I have to use to be able to save this vector (from Matching complex URLs within text blocks (R)) correctly in an R script? The special characters and Chinese characters seem to complicate things.
x <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"(Something like http://foo.com/blah_blah)",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/more_(than)_one_(parens)",
"(Something like http://foo.com/blah_blah_(wikipedia))",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://foo.com/blah_blah.",
"http://foo.com/blah_blah/.",
"<http://foo.com/blah_blah>",
"<http://foo.com/blah_blah/>",
"http://foo.com/blah_blah,",
"http://www.extinguishedscholar.com/wpglob/?p=364.",
"http://✪df.ws/1234",
"rdar://1234",
"rdar:/1234",
"x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
"message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e",
"http://➡.ws/䨹",
"www.c.ws/䨹",
"<tag>http://example.com</tag>",
"Just a www.example.com link.",
"http://example.com/something?with,commas,in,url, but not at end",
"What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets).",
"mailto:name#example.com",
"bit.ly/foo",
"“is.gd/foo/”",
"WWW.EXAMPLE.COM",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
"http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634))")
I appreciate any help.
Running your example,
source('file.R', encoding="unknown")
works fine, and saving the vector as an R object and reloading it works as well:
save(x, file='kk.Rd')
load('kk.Rd')
You can get all different encodings with iconvlist() and test them all, for example:
vals <- lapply(iconvlist(), function(x)
  tryCatch(source('file.R', encoding = x),
           error = function(e) return(NULL)))
with file.R being your script, and then
iconvlist()[which(!sapply(vals, function(x)is.null(x)))]
gives you all encodings where no error was thrown while loading.
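In practice, the simplest route is probably to save the script as UTF-8 from your editor and tell source() so explicitly; this sketch assumes the vector x lives in file.R:
source('file.R', encoding = "UTF-8")
stopifnot(all(validUTF8(x)))  # quick sanity check that the URLs survived intact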
Does this help?
We have been using the usual code to read a complete file into a string and then parse it in VB6. The files are ANSI text but encoded using whatever code page the user was in at the time (we have Chinese and English users, for example). This is the code:
Open FileName For Binary As nFileUnit
sContents = StrConv(InputB(LOF(nFileUnit), nFileUnit), vbUnicode)
However, we have discovered this is VERY slow reading a file from a server running unix/linux, particularly when the ownership of the file is not the same as the process doing the reading.
I have rewritten the above using Get and discovered that it is much faster and does not suffer from any issues with file ownership. I appreciate that the ownership issue might be solved by reconfiguring the server somehow, but since discovering that the Get method is much faster than InputB even without that issue, I'd like to replace my existing code with Get.
I wonder if someone could tell me whether this will really do the same thing. In particular, is it correctly doing the ANSI to Unicode conversion, and will this always be the case? My testing suggests the following replacement code does the same thing, but faster:
Open FileName For Binary As nFileUnit
sContents = String(LOF(nFileUnit), " ")
Get #nFileUnit, , sContents
I also realise I could use a byte array, but again my tests suggest the above is simpler and works. So how does the buffer work correctly? (If you believe the online help for Get, it talks of characters returned - clearly this would cause problems when reading an ANSI file written on the Chinese code page with 2-byte Chinese characters in it.)
The following might be of interest because the InputB approach is commonly given as the method to read a complete file, yet it is much slower. Examples:
Reading 380Kb file across the network from the unix server
InputB (file owned) = 0.875 sec
InputB (not owned) = 72.8 sec
Get (either) = 0.0156 sec
Reading a 9Mb file across the network from the unix server
InputB (file owned) = 19.65 sec
Get (either) = 0.42 sec
Thanks
Jonathan
InputB() is CVar(InputB$()), and is known to be horribly slow. My suspicion is that InputB$() reads the bytes and converts them to Unicode using the current codepage via some stock logic for reading text from disk, then does another conversion back to ANSI using the current codepage.
You might be far ahead to use ADODB.Stream.LoadFromFile() to load complete ANSI text files. You can set .Type = adTypeText and .Charset to the appropriate ANSI encoding, then read Unicode back out via .ReadText(x), where x can be a number of bytes, or adReadAll or adReadLine. For line reading you can set .LineSeparator to adCR, adCRLF, or adLF as required.
Many Charset values are supported: KOI8 for Cyrillic, Big5 for Chinese, etc.