.rds file internal format - r

I have lost a .rds file because the device it was written to (let's call it volume 1) filled up. Usually when that happened, R would throw an error and stop; in that case I still had a safe copy on a different volume (volume 2). This time, however, R wrote the file to volume 1 without an error and copied it over to volume 2. Now the file cannot be opened with readRDS anymore; it fails with the error "error reading from connection".
The file contains a data.table, is stored uncompressed and infoRDS can read the metadata:
> infoRDS('corrupt.rds')
$version
[1] 3

$writer_version
[1] "3.6.3"

$min_reader_version
[1] "3.5.0"

$format
[1] "xdr"

$native_encoding
[1] "UTF-8"
Also, hexView::readRaw can read the file and shows the names of the columns of the data table.
Using
readRaw('corrupt.rds', endian = 'big', human = 'real', width = 8, offset = 5)
I can see many of the numbers I need to recover. However, this seems like a very tedious approach, since I don't understand the internal format of the .rds file.
I also looked into xmlDeserializeHook, but I don't understand how to use it.
Of course the C code used by readRDS (unserializeFromConn) contains all the information about the structure used, but higher-level documentation would be helpful.
Is there an easier way than to dive into that C code or pick up the numbers manually one by one?

The R Internals manual contains documentation of the serialisation format. Unless somebody has published a more detailed description on the mailing list, that's probably the best we can do. But (at a glance) it looks to be a fairly comprehensive description, definitely when taken together with the implementation.
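If it helps to poke at the broken file by hand, here is a minimal sketch of reading the header fields with readBin/readChar, assuming the file really is uncompressed XDR as infoRDS reports (everything after the header is the stream of typed objects that the R Internals chapter describes):
con <- file("corrupt.rds", "rb")
fmt <- readChar(con, 2)   # "X\n" marks the big-endian XDR format
vers <- readBin(con, "integer", n = 3, size = 4, endian = "big")
# vers[1] = serialization format version (3)
# vers[2], vers[3] = writer and minimal reader R versions,
#   packed as major*65536 + minor*256 + patch (198147 -> 3.6.3)
enc_len <- readBin(con, "integer", size = 4, endian = "big")  # present in version 3 only
enc <- readChar(con, enc_len)   # native encoding, e.g. "UTF-8"
close(con)
list(format = fmt, versions = vers, encoding = enc)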

Related

Attempts to parse bencode / torrent file in R

I wish I could parse torrent files automatically via R. I tried to use the R bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but ran into this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Are there other ways to do this in R? Or could I perhaps embed code in another language in my Rscript?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...') doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
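For example, assuming the package's bdecode accepts a character string (as in the question's own call), well-formed bencode values look like this; the literals are illustrative and the exact shape of the returned R objects depends on the package:
bencode::bdecode("i42e")          # an integer
bencode::bdecode("4:spam")        # a length-prefixed string
bencode::bdecode("l4:spami42ee")  # a list holding "spam" and 42
bencode::bdecode("d3:foo3:bare")  # a dictionary mapping "foo" to "bar"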
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The content a torrent describes is split into equally-sized pieces (except, possibly, the last one), and the torrent file contains a concatenation of the SHA1 hashes of those pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
Secondly, it seems that the library itself suffers from a similar issue, as the readChar call it makes use of also assumes that it's dealing with text.
This might be due to changes in recent R versions, though, seeing as the library is over six years old. I was able to apply a quick hack and get it working by passing useBytes = TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
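For illustration only (this is not the package's internal code), the difference is roughly this: reading from a binary connection with useBytes = TRUE passes the bytes through without trying to re-encode what may well be invalid UTF-8:
con <- file("/home/user/Downloads/some_file.torrent", "rb")
chunk <- readChar(con, nchars = 11, useBytes = TRUE)  # typically "d8:announce" at the start of a torrent
close(con)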
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

R: how to write a raster to disk without auxiliary file?

I'm writing a dataset to a file in ERMapper format (.ers) using the raster package in R, but I'm having issues with the resulting .aux.xml auxiliary file (which I'm actually not interested in).
Simple example:
rst <- raster(ncols=15000,nrows=10000)
rst[] <- 1.234
writeRaster(rst, filename='_test.ers', overwrite=TRUE)
The writeRaster() line takes some time to execute; the data file is quite large, about 1.2 GB on disk.
When checking what's happening while writeRaster() executes, I find that the .ers file (header file plus associated data file) is typically generated in about 20 seconds. It then takes writeRaster() another 20-25 seconds to generate the .aux.xml file, which only contains statistics such as min, max, mean, and standard deviation (which likely explains why it takes so long to compute).
Since I don't care about the .aux.xml file, I would like writeRaster() to not bother with it at all and save me those 20-25 seconds of execution time (I'm writing lots of these datasets to disk, so a ~50% speedup in my code would be quite substantial).
Does anyone have any idea how to tell writeRaster() not to create an .aux.xml file? I suspect it's a GDAL-related issue, but I haven't been able to find an answer yet after much research...
Any help most welcome!
Options related to the GDAL file format drivers can be set using the (not so easy to find) rgdal::setCPLConfigOption function.
In your case,
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
should disable the xml file creation.
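Putting it together with the example from the question (a sketch, untested here; assumes rgdal is installed alongside raster):
library(raster)
# Disable GDAL's Persistent Auxiliary Metadata (PAM) so no .aux.xml sidecar is written
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
rst <- raster(ncols = 15000, nrows = 10000)
rst[] <- 1.234
writeRaster(rst, filename = '_test.ers', overwrite = TRUE)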
HTH

Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving OSM (OpenStreetMap) data. In order to manipulate geographic objects, I have to convert the data (an OSM XML file) into an object. The osmar package lets me do this, but it fails to parse the raw XML data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10 GB), knowing that I have 64 GB of memory?
Thanks a lot !
This is the solution I came up with, even though it is not 100% satisfying.
1. Transform the .osm file by removing every newline (but the last) in your shell.
2. Run the exact same code as before, skipping the paste step, which is no longer needed (since you just did the equivalent in the shell).
3. Profit :)
Obviously, I'm not very happy with it, because modifying the data file in the shell is more of a trick than an actual solution :(

Split big data in R

I have a big data file (~1GB) and I want to split it into smaller ones. I have R in hand and plan to use it.
Loading the whole file into memory cannot be done, as I would get the "cannot allocate memory for vector of xxx" error message.
Instead I want to use the read.table() function with the skip and nrows parameters to read in only parts of the file, then save them out to individual files.
To do this, I'd like to know the number of lines in the big file first, so I can work out how many rows to put in each file and how many files to split it into.
My question is: how can I get the number of lines from the big data file without fully loading it into R?
Assume I can only use R, so no other programming languages.
Thank you.
Counting the lines should be pretty easy -- check this tutorial: http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines" part).
The gist is to use ireadLines to open an iterator over the file.
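A minimal sketch of that approach (assumes the iterators package is installed; the iterator signals the end of the file by raising a 'StopIteration' error):
library(iterators)
count_lines <- function(path) {
  it <- ireadLines(path, n = 1)             # yields one line per call to nextElem()
  n <- 0L
  repeat {
    line <- try(nextElem(it), silent = TRUE)
    if (inherits(line, "try-error")) break  # end of file reached
    n <- n + 1L
  }
  n
}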
For Windows, something like this should work:
fname <- "blah.R" # example file
# 'find /v /c ""' prints a line count for the file; capture the command's output
res <- system(paste("find /v /c \"\"", fname), intern = TRUE)[[2]]
# extract the trailing digits (the line count)
regmatches(res, gregexpr("[0-9]+$", res))[[1]]
# [1] "39"
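Once you have the line count, the splitting itself can be done the way the question suggests, with read.table's skip and nrows arguments. A rough sketch with hypothetical file names and an example count:
n_total  <- 1000000                       # line count obtained above (example value)
n_files  <- 10
per_file <- ceiling(n_total / n_files)
for (i in seq_len(n_files)) {
  skip_rows <- (i - 1) * per_file
  if (skip_rows >= n_total) break         # nothing left to read
  chunk <- read.table("bigfile.txt", header = FALSE,
                      skip = skip_rows, nrows = per_file)
  write.table(chunk, sprintf("part_%02d.txt", i), row.names = FALSE)
}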

Multiple procedures in IDL program

I've written a procedure in IDL which performs some calculations on data and outputs an array of values. The calculations take about 2 minutes to run.
I need to then perform analysis on these results, and ideally I would like not to have to perform the initial calculations each time I want to perform some different analysis.
Is the best way to achieve this to save the output from the calculation to a data file and then read this in from a different program? Or is there a less cumbersome way to go about this?
Thanks in advance for any help
Yes, saving to a file is the easiest way to save the results from your first program for later use in the second (assuming you quit IDL between the two). There are many ways to save the data, depending on its type and your preferences.
Easiest Way:
An IDL .sav file created by the SAVE command can store any kind of data, IDL variables, and even the whole state of your IDL session. Unfortunately, it only works with IDL (no other languages), and it may need to be re-generated if you upgrade your IDL version. You read these files with RESTORE, which even remembers the names of the variables.
my_variable = 'Some data here.'
SAVE, my_variable, FILENAME='myfile.sav' ; save variable(s)
... IDL opened and closed here ...
RESTORE, 'myfile.sav' ; read variable(s) from file
print, my_variable
Some data here.
Most Portable Way:
For simple tabular data, CSV has the advantage of being highly portable and human readable. However, it's also slow, since numbers are stored in ASCII. Use WRITE_CSV to write, and READ_CSV to read.
Most Portable Binary Formats:
For complex data that needs to be read by multiple languages, consider the HDF5 or NetCDF libraries. Both of these are binary formats that can store most types of IDL-supported data. Note that NetCDF is actually an easier-to-use subset of HDF5.
Simplest Binary Format:
Another option for tabular data is a simple binary dump. Use WRITEU to write to a normal file opened for writing, and READU to read from a normal file opened for reading.
Assuming that your data calculations will only change very rarely, then, yes, your best solution is to just save the calculations to an output file, and then read them back into your analysis program. You don't say what kind of data this is, so it's hard to give a more specific answer. Assuming that you have a two-dimensional array of data, you could just write the results as a "flat" binary file:
pro perform_calculations
...
; assume mydata is a float array of dimensions [m,n]
openw, 1, 'results.dat'
writeu, 1, mydata
close, 1
end
Then, in either the same file or preferably a different .pro file:
pro perform_analysis
mydata = fltarr(m, n)
openr, 1, 'results.dat'
readu, 1, mydata
close, 1
...
end
Hope this helps.
Saving is a good way to do it, but if you run both programs in the same session and the second one won't mess up the data from the first, you can just call the first one and then pass its results to the second.
pro do_calculations,result1,result2,result3
result1=1
result2=1.
result3=result1/result2
return
end
pro use_calculations,result1,result2,result3,result4
result4=result1-result2+result3
return
end
Then
IDL> do_calculations,result1,result2,result3
IDL> use_calculations,result1,result2,result3,result4
If you edit use_calculations, you can simply run it again:
IDL> use_calculations,result1,result2,result3,result4
The earlier results will stay in memory, unless use_calculations does something bad to them.
You could also set up the second procedure to check to see if it has valid results from the first one and call it as needed.
