Write custom metadata to Parquet file in Julia

I am currently storing the output (a Julia DataFrame) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (e.g. a list of (byte-)strings) to that same output file.
Preferably, these parameters would be stored per column, since each column is the result of different starting conditions in my code. However, I could also work with a global parameter list and untangle it afterwards by indexing.
I have found a solution for Python using pyarrow:
https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
Does anyone know a way to do this in Julia?

It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (via the keyword arguments metadata and column_metadata in Parquet2.writefile).
I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot; a rough sketch of the intended usage is below. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.
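A minimal sketch, assuming the keyword arguments accept string-keyed dictionaries (the file name, metadata keys, and values here are made up for illustration):

using Parquet2, DataFrames

df = DataFrame(col1 = rand(10), col2 = rand(10))

# Hypothetical call: the keyword names come from the answer above, but the
# accepted value types (string-keyed Dicts here) are an assumption.
Parquet2.writefile("simulation_output.parquet", df;
    metadata = Dict("simulation_id" => "run_42", "seed" => "1234"),
    column_metadata = Dict(
        "col1" => Dict("starting_condition" => "T0 = 300K"),
        "col2" => Dict("starting_condition" => "T0 = 350K"),
    ),
)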
It's probably also worth mentioning that the main use case I recommend parquet for is when you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as that format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data world", so a robust writer is desperately needed.

Related

Subset of features on external memory

I have a large file that I'm not able to load, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument of slice is "currently not used", and there is no mention of this feature on the GitHub page. I haven't found any other clue about how to do column subsetting with external memory.
I want to compare models generated with different feature subsets. The only thing I could think of is to create a new file with just the features I want to use, but that takes a long time and a lot of memory... I can't help wondering if there is a better way.
P.S.: I tried the h2o package too, but h2o.importFile froze.

Read C++ binary file in R

Can I read a binary file written by C++ into R?
I have been using Rcpp in my R package, and the simulations typically generate a large amount of data. I am planning to write the output to binary files in C++ and then read those back in R. This works if I write text files, but I haven't found a solution for binary files. The program sometimes crashes abruptly if I pass data using many NumericVectors (I have yet to fully understand memory management with Rcpp).
Can this approach let me share larger datasets between C++ and R than is possible by passing vectors? In C++, the maximum vector size is limited by RAM and the address bus (maybe?), but I think R is able to load larger vectors using swap. Am I correct, or am I misunderstanding the concepts?
Yes, you can. But it's "complicated".
You are embarking on a topic called binary serialization. There is a lot of work out there. In essence you are somewhere on the continuum between
minimal: open a file, write out N binary items; then on the other side read N binary items back. We did something similar at work years ago, where we wrote some metadata with <rows,cols,version> and then a binary blob of rows * cols doubles to attach to a matrix (a minimal R-side reader for this kind of layout is sketched below).
maximal: use a fully descriptive meta language like Protocol Buffers or MessagePack to describe the binary content, write it in C++ (using the appropriate library) and read it back in R (using the corresponding packages---I am involved with one of each: RProtoBuf and RcppMsgPack).
And a lot in between. If you really only need to communicate between C(++) and R, you could try the RData / rds format. There is one C library for it, librdata, and I experimented with it (and filed some bug reports and made some pull requests). I might start there.
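For the minimal end of that continuum, the R side can be a couple of readBin calls. Here is a sketch under assumed conventions (the file name is hypothetical, and it assumes the C++ side wrote two 32-bit integers for rows and cols, then rows * cols doubles in native byte order, column-major as R expects):

con <- file("sim_output.bin", "rb")
dims <- readBin(con, what = "integer", n = 2L, size = 4L)   # rows, cols
vals <- readBin(con, what = "double", n = dims[1] * dims[2], size = 8L)
close(con)

# rebuild the matrix; assumes the data were written column-major
m <- matrix(vals, nrow = dims[1], ncol = dims[2])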
So in short: do some research, figure out what to do and then do it :)
PS If you call C++ via Rcpp from R then you may not need files at all. We can pass large objects back and forth -- the limit may be your RAM.

R: Help reading a particular .mat file into R

So I've been trying to read this particular .mat file into R. I don't know too much about Matlab, but I know enough to know that the R.matlab package can only read uncompressed data into R, and that to get an uncompressed file I need to save it as such in Matlab, using
save new.mat -v6
Okay, so I did that, but when I used readMat("new.mat") in R, it just got stuck loading forever. I also tried the hdf5 package via:
> hdf5load("new.mat", load=FALSE)->g
Error in hdf5load("new.mat", load = FALSE) :
can't handle hdf type 201331051
I'm not sure what this problem could be, but if anyone wants to try to figure this out the file is located at http://dibernardo.tigem.it/MANTRA/MANTRA_online/Matlab_Code%26Data.html and is called inventory.mat (the first file).
Thanks for your help!
This particular file has one object, inventory, which is a struct with a lot of different things inside it. Some are cell arrays, others are vectors of doubles or logicals, and a couple are matrices of doubles. It looks like R.matlab does not like cell arrays within structs, but I'm not sure exactly what is preventing R from loading this. For reasons like this, I'd generally recommend against mapping Matlab structs to R objects. A struct is similar to a list, and this one can be transformed to a list, but it's not always a good idea.
I recommend pulling out each object, e.g. ids = inventory.instance_ids, and saving each one to a separate .mat file, or saving all of them, except for the inventory object itself, into one file. Even better is to go to text, e.g. via csvwrite, so that you can see what's being created.
I realize that's working around the use of a Matlab-to-R reader, but having things in a common, universal format is much more useful for reproducibility than acquiring a bunch of different readers for a proprietary format.
Alternatively, you can pass objects in memory via R.matlab, or this set of functions + the R/DCOM interface (on Windows).
Although this doesn't address how to use R.matlab, I've done a lot of transferring of data between R and Matlab, in both directions, and I find that it's best to avoid .mat files (and, similarly, .rdat files). I like to pass objects in memory, so that I can inspect them on each side, or via standard text files. Dealing with application-specific file formats, especially ones that change quite a bit and are inefficient (I'm looking at you, MathWorks), is not a good use of time. I appreciate the folks who work on readers, but having a lot more control over the data structures used in the target language is very much worth the space overhead of a simple output file format. In-memory data transfer is very nice because you can interface programs, but it may be a distraction if your only goal is to move data.
Have you run the examples in http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf on pages 22 to 24? They will test your ability to read version 4 and version 5 files. I'm not sure that R cannot read compressed files; there is an Rcompression package in Omegahat.

R bindings for Mapnik?

I frequently find myself doing some analysis in R and then wanting to make a quick map. The standard plot() function does a reasonable job of quick maps, but I quickly find that I need to go to ggplot2 when I want something that looks nice or has more complex symbology requirements. ggplot2 is great, but it is sometimes cumbersome to convert a SpatialPolygonsDataFrame into the format it requires. ggplot2 can also be a tad slow when dealing with large maps that require specific projections.
It seems like I should be able to use Mapnik to plot spatial objects directly from R, but after exhausting my Google-fu, I cannot find any evidence of bindings. Rather than assume that such a thing doesn't exist, I thought I'd check here to see if anyone knows of an R - Mapnik binding.
The Mapnik FAQ explicitly mentions Python bindings -- as does the wiki -- with no mention of R, so I think you are correct that no (Mapnik-sponsored, at least) R bindings currently exist for Mapnik.
You might get a more satisfying (or at least more detailed) answer by asking on the Mapnik users list. They will know for certain if any projects exist to make R bindings for Mapnik, and if not, your interest may incite someone to investigate the possibility of generating bindings for R.
I would write the SpatialWotsitDataFrames to shapefiles and then launch a Python Mapnik script. You could even use R to generate the Python script (the 'brew' package is handy for making files from templates and inserting values from R).
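A rough sketch of that hand-off (the object, file, and script names here are hypothetical, and it assumes rgdal is available for shapefile export):

library(rgdal)

# spdf is an existing SpatialPolygonsDataFrame from your analysis
writeOGR(spdf, dsn = "map_data", layer = "regions", driver = "ESRI Shapefile")

# hand the shapefile off to a Python script that drives Mapnik
system("python render_map.py map_data/regions.shp map.png")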

writing functions vs. line-by-line interpretation in an R workflow

Much has been written here about developing a workflow in R for statistical projects. The most popular workflow seems to be Josh Reich's LCFD model, with a main.R containing:
source('load.R')
source('clean.R')
source('func.R')
source('do.R')
so that a single source('main.R') runs the entire project.
Q: Is there a reason to prefer this workflow to one in which the line-by-line interpretive work done in load.R, clean.R, and do.R is replaced by functions which are called by main.R?
I can't find the link now, but I had read somewhere on SO that when programming in R one must get over the desire to write everything in terms of function calls---that R was MEANT to be written in this line-by-line interpretive form.
Q: Really? Why?
I've been frustrated with the LCFD approach and am going to probably write everything in terms of function calls. But before doing this, I'd like to hear from the good folks of SO as to whether this is a good idea or not.
EDIT: The project I'm working on right now is to (1) read in a set of financial data, (2) clean it (quite involved), (3) Estimate some quantity associated with the data using my estimator (4) Estimate that same quantity using traditional estimators (5) Report results. My programs should be written in such a way that it's a cinch to do the work (1) for different empirical data sets, (2) for simulation data, or (3) using different estimators. ALSO, it should follow literate programming and reproducible research guidelines so that it's simple for a newcomer to the code to run the program, understand what's going on, and how to tweak it.
I think that any temporary stuff created in source'd files won't get cleaned up. If I do:
x = matrix(runif(big^2), big, big)
z = sum(x)
and source that as a file, x hangs around although I don't need it. But if I do:
ff = function(big) {
  x = matrix(runif(big^2), big, big)
  z = sum(x)
  return(z)
}
and, instead of sourcing it, do z = ff(big) in my script, then the x matrix goes out of scope and so gets cleaned up.
Functions enable neat little re-usable encapsulations and don't pollute outside themselves. In general, they don't have side effects. Your line-by-line scripts could be using global variables and names tied to the data set in current use, which makes them hard to re-use.
I sometimes work line-by-line, but as soon as I get more than about five lines I see that what I have really needs making into a proper reusable function, and more often than not I do end up re-using it.
I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.
1) functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.
On the other hand, the advantage of well designed functions is that you can unit test them. That is you can test them apart from the rest of the code making them easier to test. Also when you use a function, modulo certain lower level constructs, you know that the results of one function won't affect the others unless they are passed out and this may limit the damage that one function's erroneous processing can do to another's. You can use the debug facility in R to debug your functions and being able to single step through them is an advantage.
2) LCFD. Whether you should use a load/clean/func/do decomposition at all, regardless of whether it's done via source or functions, is a second question. The problem with this decomposition, however it's done, is that you need to run one step just to be able to test out the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.
On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try it on different data and can replace the other steps independently of the load and clean steps if you want to try different processing.
3) Number of files. There may be a third question implicit in what you are asking: whether everything should be in one source file or several. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, if you have routines that are not being used or not relevant to the current function you are looking at, they won't interrupt the flow, since you can arrange for them to be in other files.
On the other hand, there may be an advantage in putting everything in one file from the viewpoint of (a) deployment, i.e. you can just send someone that single file; (b) editing convenience, as you can put the entire program in a single editor session, which, for example, facilitates searching since you can search the entire program using the editor's functions without having to work out which file a routine is in (also, successive undo commands move backward across all units of your program, and a single save saves the current state of all modules since there is only one); and (c) speed, i.e. if you are working over a slow network it may be faster to keep a single file on your local machine and just write it out occasionally, rather than having to go back and forth to the slow remote.
Note: One other thing to think about is that using packages may be superior for your needs relative to sourcing files in the first place.
No one has mentioned an important consideration when writing functions: there's not much point in writing them unless you're repeating some action again and again. In some parts of an analysis, you'll be doing one-off operations, so there's not much point in writing a function for them. If you have to repeat something more than a few times, it's worth investing the time and effort to write a re-usable function.
Workflow:
I use something very similar:
Base.r: pulls primary data, calls on other files (items 2 through 5)
Functions.r: loads functions
Plot Options.r: loads a number of general plot options I use frequently
Lists.r: loads lists, I have a lot of them because company names, statements and the like change over time
Recodes.r: most of the work is done in this file; essentially it's data cleaning and sorting
No analysis has been done up to this point. This is just for data cleaning and sorting.
At the end of Recodes.r I save the environment to be reloaded into my actual analysis.
save(list=ls(), file="Cleaned.Rdata")
With the cleaning done, and functions and plot options ready, I start getting into my analysis. Again, I continue to break it up into smaller files that are focused on topics or themes, like: demographics, client requests, correlations, correspondence analysis, plots, etc. I almost always run the first 5 automatically to get my environment set up, and then I run the others on a line-by-line basis to ensure accuracy and explore.
At the beginning of every file I load the cleaned data environment and prosper.
load("Cleaned.Rdata")
Object Nomenclature:
I don't use lists, but I do use a nomenclature for my objects.
df.YYYY # Data for a certain year
demo.describe.YYYY ## Demographic data for a certain year
po.describe ## Plot option
list.describe.YYYY ## lists
f.describe ## Functions
Using a friendly mnemonic to replace "describe" in the above.
Commenting
I've been trying to get myself into the habit of using comment(x), which I've found incredibly useful. Comments in the code are helpful but oftentimes not enough.
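For example (the object name follows the df.YYYY nomenclature above; the wording of the note is made up), comment() attaches a text annotation that is saved with the object but not printed with it:

comment(df.2010) <- "Survey wave 3, cleaned in Recodes.r"   # attach a note to the object
comment(df.2010)                                            # retrieve it later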
Cleaning Up
Again, here, I always try to use the same objects for easy cleanup: tmp, tmp1, tmp2, tmp3, for example, making sure to remove them at the end.
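One way to do that sweep at the end of a file (assuming all the scratch objects start with "tmp"):

rm(list = ls(pattern = "^tmp"))   # drop all tmp* scratch objects from the workspace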
Functions
There has been some commentary in other posts about only writing a function for something if you're going to use it more than once. I'd like to adjust this to say: if you think there's a possibility that you may EVER use it again, you should throw it into a function. I can't even count the number of times I wished I had written a function for a process I created on a line-by-line basis.
Also, BEFORE I change a function, I throw it into a file called Deprecated Functions.r, again, protecting against the "how the hell did I do that" effect.
I often divide up my code similarly to this (though I usually put Load and Clean in one file), but I never just source all the files to run the entire project; to me that defeats the purpose of dividing them up.
Like the comment from Sharpie, I think your workflow should depend a lot on the kind of work you're doing. I do mostly exploratory work, and in that context, keeping the data input (load and clean) separate from the analysis (functions and do) means that I don't have to reload and re-clean when I come back the next day; I can instead save the cleaned data set and import it again.
I have little experience doing repetitive munging of daily data sets, but I imagine that I would find a different workflow helpful; as Hadley answers, if you're only doing something once (as I am when I load/clean my data), it may not be helpful to write a function. But if you're doing it over and over again (as it seems you would be) it might be much more helpful.
In short, I've found dividing up the code helpful for exploratory analyses, but would probably do something different for repetitive analyses, just like you're thinking about.
I've been pondering workflow tradeoffs for some time.
Here is what I do for any project involving data analysis:
Load and Clean: Create clean versions of the raw datasets for the project, as if I were building a local relational database. Thus, I structure the tables in third normal form where possible. I perform basic munging, but I do not merge or filter tables at this step; again, I'm simply creating a normalized database for a given project. I put this step in its own file, and at the end I save the objects to disk using save.
Functions: I create a function script with functions for data filtering, merging and aggregation tasks. This is the most intellectually challenging part of the workflow as I'm forced to think about how to create proper abstractions so that the functions are reusable. The functions need to generalize so that I can flexibly merge and aggregate data from the load and clean step. As in the LCFD model, this script has no side effects as it only loads function definitions.
Function Tests: I create a separate script to test and optimize the performance of the functions defined in step 2. I clearly define what the output from the functions should be, so this step serves as a kind of documentation (think unit testing).
Main: I load the objects saved in step 1. If the tables are too big to fit in RAM, I can filter the tables with a SQL query, keeping with the database thinking. I then filter, merge and aggregate the tables by calling the functions defined in step 2. The tables are passed as arguments to the functions I defined. The output of the functions are data structures in a form suitable for plotting, modeling and analysis. Obviously, I may have a few extra line by line steps where it makes little sense to create a new function.
This workflow allows me to do lightning-fast exploration at the Main.R step. This is because I have built clear, generalizable, and optimized functions. The main difference from the LCFD model is that I do not perform line-by-line filtering, merging or aggregating; I assume that I may want to filter, merge, or aggregate the data in different ways as part of exploration. Additionally, I don't want to pollute my global environment with a lengthy line-by-line script; as Spacedman points out, functions help with this.
