So I've been trying to read this particular .mat file into R. I don't know too much about matlab, but I know enough that the R.matlab package can only read uncompressed data into R, and to save it as uncompressed I need to save it as such in matlab by using
save new.mat -v6.
Okay, so I did that, but when I used readMat("new.mat") in R, it just got stuck loading that forever. I also tried using package hdf5 via:
> hdf5load("new.mat", load=FALSE)->g
Error in hdf5load("new.mat", load = FALSE) :
can't handle hdf type 201331051
I'm not sure what this problem could be, but if anyone wants to try to figure this out the file is located at http://dibernardo.tigem.it/MANTRA/MANTRA_online/Matlab_Code%26Data.html and is called inventory.mat (the first file).
Thanks for your help!
This particular file has one object, inventory, which is a struct object, with a lot of different things inside of it. Some are cell arrays, others are vectors of doubles or logicals, and a couple are matrices of doubles. It looks like R.matlab does not like cells arrays within structs, but I'm not sure what's causing issues for R to load this. For reasons like this, I'd generally recommend avoiding mapping structs in Matlab to objects in R. It is similar to a list, and this one can be transformed to a list, but it's not always a good idea.
I recommend creating a new file, one for each object, e.g. ids = inventory.instance_ids and save each object to either a separate .mat file, or save all of them, except for the inventory object, into 1 file. Even better is to go to text, e.g via csvwrite, so that you can see what's being created.
I realize that's going around use of a Matlab to R reader, but having things in a common, universal format is much more useful for reproducibility than to acquire a bunch of different readers for a proprietary format.
Alternatively, you can pass objects in memory via R.matlab, or this set of functions + the R/DCOM interface (on Windows).
Although this doesn't address how to use R.matlab, I've done a lot of transferring of data between R and Matlab, in both directions, and I find that it's best to avoid .mat files (and, similarly, .rdat files). I like to pass objects in memory, so that I can inspect them on each side, or via standard text files. Dealing with application specific file formats, especially those that change quite a bit and are inefficient (I'm looking at you MathWorks), is not a good use of time. I appreciate the folks who work on readers, but having a lot more control over the data structures used in the target language is very much worth the space overhead of using a simple output file format. In-memory data transfer is very nice because you can interface programs, but that may be a distraction if your only goal is to move data.
Have you run the examples in http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf on pages 22 to 24? That will test your ability to read from versions 4 and 5. I'm not sure that R cannot read compressed files. There is an Rcompresssion package in Omegahat.
Related
So I'm working with a network dataset from Stanford's SNAP Datasets and "SNAP" has wrappers for Python and C++ but not R - however, the data is still usable since I believe it's a mix of CSV files.
I can actually read in the .edges file and form an igraph object but want to read in the other files, get the attributes & add those attributes to the igraph object for analysis. I'm just confused on how to work with the .circles, .egofeat, .feat, and .featnames files since the documentation on the dataset is very scarce. Hoping someone has worked with the dataset in R or even another language and has any pointers to get started.
Thank you!
I am currently storing the output (a Julia Dataframe) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (eg. a list of (byte-)strings) to that same output file.
Preferably, these parameters are different for each column as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.
I have found a solution for Python using pyarrow
https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
Do you know a way how to do it in Julia?
It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile.
I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.
It's probably also worth mentioning that the main use case I recommend for parquet is if you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl as the format has a number of advantages over parquet for most use cases, please see my FAQ answer on this. Of course, the reason I undertook writing the package is because parquet is arguably the only ubiquitous binary format in "big data world" so a robust writer is desperately needed.
For a shiny app in a repository containing a single static data file, what is the optimal format for that flat file (and corresponding function to read that file) which minimises the read time for that flat file to a data.frame?
For example, suppose when a shiny app starts it reads an .RDS, but suppose that takes ~30 seconds and we wish to decrease that. Are there any methods of saving the file and using a function which can save time?
Here's what I know already:
I have been reading some speed comparison articles, but none seem to comprehensive benchmark all methods in the context of a shiny app (and possible cores/threading implications). Some offer sound advice like trying to load in less data
I notice languages like julia can sometimes be faster, but I'm not sure if reading a file using another language would help since it would have to be converted to an object R recognises, and presumably that process would take longer than simply reading as an R object initially
I have noticed identical files seem to be smaller when saved as .RDS compared to .csv, however, I'm not sure if file size necessarily has an effect on read time.
I have a large file that I'm not able to load so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The documentation on xgboost says that the colset argument on slice is "currently not used" and there is no metion of this feature in the github page. And I haven't found any other clue of how to do column subsetting with external memory.
I wish to compare models generated with different features subsettings. The only thing I could think of is to create a new file with the features that I want to use but it's taking a long time and will take a lot of memory... I can't help wondering if there is a better way.
ps.: I tried using h2o package too but h2o.importFile froze.
Can I read a binary file written by C++ in R?
I have been using Rcpp in my R package and the simulations typically generate a large amount of data. I am planning to write the output to binary files in C++ and then read those back in R. This works if I write as text files but I didn't find a solution with binary files. The program sometimes crashes abruptly if I pass data using many NumericVectors (I am yet to fully understand the memory management using Rcpp).
Can this approach enable me to share larger datasets between C++ and R compared to what is possible by passing vectors? In C++, the maximum vector size is limited by RAM and address bus (may be?) but I think R is able to load larger vectors using swap. Am I correct or misunderstanding the concepts?
Yes you can. But it's "complicated".
You are embarking on a topic called binary serialization. There is a lot of work out there. In essence you are somewhere in the continum between of
minimal: open a file, write out N binary items; then on the other side read N binaries. We did something similar at work years ago where wrote some metadata with <rows,cols,version> and then a binary blob of rows * cols double to attach to a matrix
maximal: use a fully descriptive meta language like Protocol Buffer or MessagePack to describe the binary content, write it in C++ (using the appropriate library) and read in back in R (using the corresponding packages---I am involved with one each: RProtoBuf and RcppMsgPack).
And a lot in between. If you really only need to communicate between C(++) and R you could try the RData / rds format. There is one library: librdata and I experimented with it (and filed some bug reports and made some pull requests). I might start there.
So in short: do some research, figure out what to do and then do it :)
PS If you call C++ via Rcpp from R then you may not need files. We can pass large object back and forth -- the limit may be your RAM.