Retrieving Sets in an HDF5 file in Julia - julia

There seems to be an issue between the JLD, HDF5, and JLD2 formats for storing complex data in Julia. Until recently, JLD seemed to work pretty well, and it was supposed to be HDF5-compatible. I have a data set that includes some data of type Set. Saving and retrieving it in JLD format was straightforward. But JLD is now deprecated as of Julia 1.0.0, and JLD2 is "the new way to do it", yet its page explicitly says it is not backwards compatible and may not be safe ("If your tolerance for data loss is low, JLD may be a better choice at this time"). Okay, but JLD does not compile on Julia 1.0.0, so I cannot use it.
HDF5 seems to open the JLD file adequately and can retrieve arrays well, but not the sets. When I try to retrieve a Set I get this:
HDF5.HDF5Compound{1}((UInt8[0x68, 0x59, 0x46, 0x02, 0x00, 0x00, 0x00, 0x00],), ("dict_",), (HDF5.HDF5ReferenceObj,))
I do not know what kind of object this is or how to interpret it. The original data was a set containing two integer arrays as elements. I tried to read it by indices, but that is not possible; it says something akin to "dict", but it has no names in it.
I may be able to translate all the data to JLD or to plain HDF5 with a Julia 0.6.2 script, but maybe I can learn to do it right with HDF5 + Julia 1.0.0.
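Since JLD files are plain HDF5 containers underneath, one way to see how the Set was actually encoded is to walk the file's groups and datasets from Python with h5py. This is a sketch under assumptions: it builds a tiny stand-in file (the real .jld path and its internal layout will differ), but the same walk applied to your file should reveal the internal dict group that the compound object references.

```python
# Sketch: JLD files are plain HDF5 underneath, so h5py can show how a
# Set was encoded (typically a compound holding a reference into an
# internal group). We build a tiny stand-in file to walk, since the
# real data.jld path and layout are assumptions here.
import h5py

def walk(path):
    entries = []
    def show(name, obj):
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        entries.append((kind, name))
    with h5py.File(path, "r") as f:
        f.visititems(show)
    return entries

# Stand-in for data.jld: one internal group and one dataset.
with h5py.File("demo.h5", "w") as f:
    f.create_group("_refs")
    f.create_dataset("myset", data=[1, 2, 3])

for kind, name in walk("demo.h5"):
    print(kind, name)
```

Applied to the real file, the names and attributes printed by the walk usually make it clear which group holds the referenced dict entries behind the HDF5Compound object.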

Related

Loading a .RDS file in Python 3

I have a dataset in RDS format that I managed in RStudio, but I would like to open it in Python to do the analysis. Is it possible to open this type of format in Python?
I already tried the following:
pip install pyreadr
import pyreadr
result = pyreadr.read_r('/path/to/file.Rds')
However, I get a
MemoryError: Unable to allocate 18.9 MiB for an array with shape
(2483385,) and data type float64.
What can I do?
Pyreadr is a wrapper around the C library librdata, and librdata has a hardcoded limit on the size an R vector can have. The limit used to be very low in old versions, but it has since been increased. Your vector would fail in older versions but should work in a recent one, so please check that you are using the most recent version.
If that doesn't help, then it may be a bug. If you can share the file, please submit an issue on GitHub.
Here is a link to the old issues on GitHub for librdata and pyreadr (theoretically now solved):
https://github.com/WizardMac/librdata/issues/19.
https://github.com/ofajardo/pyreadr/issues/3
EDIT:
The limit is now permanently removed in pyreadr 0.3.0, so this should no longer be an issue.
From my knowledge, you could also store the data in a pandas DataFrame, as mentioned in this link.
How can I explicitly free memory in Python?
If you wrote a Python program that acts on a large input file to create a few million objects, and it's taking tons of memory, what is the best way to tell Python that you no longer need some of the data so that it can be freed?
The simple answer to this problem is:
Drop your references to the data, then force the garbage collector to release unreferenced memory with gc.collect().
I hope this answers your query.
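A minimal sketch of that advice. Note that for non-cyclic data, dropping the last reference with del already frees the memory via reference counting; gc.collect() mainly matters for objects involved in reference cycles.

```python
# Minimal sketch: drop references to the data, then force a
# collection pass. Plain (acyclic) objects are freed as soon as the
# last reference goes away; gc.collect() additionally reclaims
# objects trapped in reference cycles and returns how many
# unreachable objects it found.
import gc

big = [object() for _ in range(100_000)]
del big               # remove the only reference to the list
freed = gc.collect()  # run a full collection pass
print("unreachable objects collected:", freed)
```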

Does the R binary format change from version to version?

My question is whether an object in R saved to binary format using the save function can differ when saved from different (but recent) versions of R.
That is because I have a script that makes some calculations and saves its results to a file. When reproducing the same calculations later, I decided to compare the two files using
diff --binary -s mv3p.Rdata mv3p.Rdata.backup
To my surprise, the two files are different. However, when analysing the contents in R, they are identical.
The new version is R 3.3.1. I believe the older file was created by R 3.3.0, but it could also have been 3.2.x; I am not 100% sure. I used the save command with only the object I wanted to save and the filename as arguments.
So my questions are: is it normal that the same object is written differently by different versions of R? Is this documented somewhere? How can I be sure of reproducing exactly the same file? What can it depend on (R version, OS, processor architecture, etc.)?
Please note, I am NOT asking whether files saved by one version can be read by another version of R, and I am NOT asking about very old R versions.
R data files also include the R version used to write them; that is one reason the files may differ. See the documentation here: http://biostat.mc.vanderbilt.edu/wiki/Main/RBinaryFormat
Also, you can use save(..., ascii=TRUE) to see the difference in plain text.

Rcpp integrated with R package: Documentation of CPP code objects

I have been developing a package with Rcpp for C++ integration. I used RcppExport to make the functions return SEXP objects.
The issue is that Travis CI gives warnings about undocumented code objects (these are the C++ functions). However, I do not want users to access those functions directly, either.
How can I resolve this issue? How can I document these functions?
You seem to have an elementary misunderstanding here.
If your NAMESPACE contains a wildcard 'export all' a la exportPattern("^[[:alpha:]]+"), then every global symbol is exported, and per R standards (which are clearly documented) each exported symbol needs a help entry.
One easy fix is NOT to export everything and just write documentation for what you want exported. We sometimes do that and call the Rcpp function something like foo_impl and then have R functions foo (with documentation) call foo_impl. In that case you would just export foo and all is good.
In short, you are confused about R packages and not so much Rcpp. I would recommend downloading the sources of a few (small) Rcpp packages to get a feel for what they do.

How can I certify a file is exactly the same?

Is there a package or function that can be applied to a whole (and heavy) data object to get back a measure of changes in the file? Something based on hash keys would be great, so I can keep track of changes to a shared file.
The digest package (digest function) lets you compute hashes of R objects (available algorithms: "md5", "sha1", "crc32", "sha256", "sha512", "xxhash32", "xxhash64"). You can also run external programs from R (e.g. md5sum on Linux) with the system command (see e.g. here).
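For comparison, the same change-tracking idea can be sketched outside R with Python's standard-library hashlib: hash the file's bytes and compare digests over time. The file name below is a stand-in:

```python
# Sketch: hash a file's contents so changes to a shared file can be
# detected by comparing digests. hashlib is in the standard library;
# reading in chunks keeps memory use flat for heavy files.
import hashlib

def file_digest(path, algo="sha256", chunk=1 << 20):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Stand-in data file for the demonstration.
with open("shared.dat", "wb") as f:
    f.write(b"model coefficients v1")

print(file_digest("shared.dat"))
```

If two snapshots of the file produce the same digest, the contents are (for practical purposes) identical.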

Can you save your session in Julia

I'm very new to Julia and was trying to save my session (all the values, including functions, for example) and didn't see any easy way. There seems to be a pretty complete low-level write function for ints, floats, arrays, etc., but it doesn't, for example, write a DataFrame. Is there an easy way to do this, or do I need to code all this from scratch? I'm using v0.2.1.
Have you tried using the IJulia notebook? It might be useful for what you're describing. https://github.com/JuliaLang/IJulia.jl
You can do this with HDF5.jl. I don't know how well it works for functions, but it should work fine for data frames and any other native Julia type.
For functions you want to keep, I would probably just define them in a regular .jl file and include("def.jl") at the start of the session, for example.
Check out the Julia Data format: https://github.com/JuliaIO/JLD.jl
It can save both built-in Julia types and types you created yourself, and it has macros to save your entire workspace at once.
I think it can be done with the Julia Data format (JLD).
https://github.com/JuliaIO/JLD.jl
If you have your own data format, e.g. a Model type:
type Model
    version::String
    id::String
    equations::Vector{Equation}
    coefs::Vector{Matrix}
end
You can save it with the command
using JLD
save("MODEL.jld", "modelS", model1)
and read as
pathReport = joinpath(homedir(),".julia/v0.5/foo/test")
m = JLD.load(joinpath(pathReport, "MODEL.jld"))
model2 = m["modelS"]
model2.equations[1].terms[2] == "EX_01"
