Object memory usage with R reticulate and Python

I'm wondering how efficiently reticulate handles memory with python objects.
Suppose I have a 5GB pandas DataFrame called data_pandas in the Python session managed by reticulate, and I'd like to analyze it in R.
When I access the object from R as py$data_pandas, does reticulate copy it into an R data.frame internally (i.e. create another 5GB data.frame in R)?
And what about the reverse (accessing an R data.frame from Python)?

I'm no expert, but it seems from the vignette on arrays that reticulate makes at least two copies of every python object:
"R arrays are only copied to Python when they need to be, otherwise data are shared. Python arrays are always copied when moved into R arrays. This can sometimes lead to three copies of any one array in memory at any one time (at the moment this was written). Future versions will reduce that copy overhead to two."
(From https://rstudio.github.io/reticulate/articles/arrays.html)
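To make the vignette's distinction concrete, here is a minimal sketch in plain Python of what "shared" versus "copied" means for a numeric buffer. It does not call reticulate and says nothing about how reticulate is implemented; the names original, shared, and copied are illustrative only.

```python
from array import array

# A small buffer of doubles stands in for a large numeric column.
original = array("d", [0.0] * 1000)

# Sharing: a memoryview exposes the same buffer, so no elements are duplicated.
shared = memoryview(original)

# Copying: building a new array duplicates every element.
copied = array("d", original)

original[0] = 42.0
shared_first = shared[0]   # sees the update, because the memory is shared
copied_first = copied[0]   # unchanged, because the copy is independent
print(shared_first, copied_first)

shared.release()
```

With a 5GB object, the "copied" case is the one that doubles memory use, which is why it matters which of the two a conversion performs.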

Related

RJulia - sharing large data structures without copying them

I would like to call R functions from within Julia on very large data structures.
I understand that PyCall can share variables between Julia and Python without copying them.
I also understand that as of August 2019 variables still needed to be copied when transferring between R and Julia (https://github.com/Non-Contradiction/JuliaCall/issues/114).
Q1: Is there any way to call R functions in Julia without having to copy data structures?
Q2: Does Apache Arrow work with RJulia?
Q3: If there is a way to transfer variables without copying them, does this apply to all data types? Numerical array, categorical array, time array, table, list, and dictionary?
Thank you in advance

Read C++ binary file in R

Can I read a binary file written by C++ in R?
I have been using Rcpp in my R package and the simulations typically generate a large amount of data. I am planning to write the output to binary files in C++ and then read those back in R. This works if I write as text files but I didn't find a solution with binary files. The program sometimes crashes abruptly if I pass data using many NumericVectors (I am yet to fully understand the memory management using Rcpp).
Can this approach enable me to share larger datasets between C++ and R compared to what is possible by passing vectors? In C++, the maximum vector size is limited by RAM and the address bus (maybe?), but I think R is able to load larger vectors using swap. Am I correct or am I misunderstanding the concepts?
Yes you can. But it's "complicated".
You are embarking on a topic called binary serialization. There is a lot of work out there. In essence you are somewhere on the continuum between
minimal: open a file, write out N binary items; then on the other side read N binary items. We did something similar at work years ago, where we wrote some metadata with <rows,cols,version> and then a binary blob of rows * cols doubles to attach to a matrix
maximal: use a fully descriptive meta language like Protocol Buffers or MessagePack to describe the binary content, write it in C++ (using the appropriate library) and read it back in R (using the corresponding packages; I am involved with one each: RProtoBuf and RcppMsgPack).
And a lot in between. If you really only need to communicate between C(++) and R you could try the RData / rds format. There is one library: librdata and I experimented with it (and filed some bug reports and made some pull requests). I might start there.
So in short: do some research, figure out what to do and then do it :)
PS If you call C++ via Rcpp from R then you may not need files. We can pass large objects back and forth -- the limit may be your RAM.
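The "minimal" end of that continuum can be sketched as follows. This is an illustration in Python rather than C++ or R, under assumed conventions that are not from the answer: a header of three little-endian 32-bit integers (rows, cols, version) followed by rows * cols little-endian doubles. The function names write_matrix and read_matrix are hypothetical.

```python
import os
import struct
import tempfile

# Assumed header layout: rows, cols, version as little-endian int32.
HEADER = struct.Struct("<iii")

def write_matrix(path, rows, cols, values, version=1):
    """Write the header followed by rows * cols doubles."""
    if len(values) != rows * cols:
        raise ValueError("values must contain rows * cols items")
    with open(path, "wb") as fh:
        fh.write(HEADER.pack(rows, cols, version))
        fh.write(struct.pack(f"<{rows * cols}d", *values))

def read_matrix(path):
    """Read the header, then exactly rows * cols doubles."""
    with open(path, "rb") as fh:
        rows, cols, version = HEADER.unpack(fh.read(HEADER.size))
        payload = fh.read(rows * cols * 8)
    values = list(struct.unpack(f"<{rows * cols}d", payload))
    return rows, cols, version, values

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
write_matrix(path, 2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
result = read_matrix(path)
print(result)
```

On the R side, a file laid out this way could be read with readBin(), first three integers and then rows * cols doubles. R stores matrices in column-major order, so the writer would need to emit values in that order or the reader would need to transpose.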

Calling R functions from Julia

Is there a convenient way to call R functions from Julia?
If so, what mechanisms for doing so exist? (Potentially ranging from simply calling an R script from the shell & hand-coding the I/O to/from Julia, to interacting with an R environment over multiple Julia calls with Julia DataFrames being seamlessly converted to/from R DataFrames).
Calling R scripts and handcoding I/O is the best way to work with R for the moment. We have functions for reading the RDA binary format that R likes and should add some tools for working with it more easily and also writing data in that format, which will speed up I/O considerably relative to passing CSV files around -- which I've done in the past.
Converting between R and Julia DataFrames could be done, but would be quite costly as Julia isn't using a binary representation of data (e.g. NA) that's nearly equivalent to R's. So you'd need to do some non-trivial work to make this work in a way that would be substantially more efficient than using the RDA binary format.
One thing that would be really nice is to build solid Thrift bindings for both R and Julia and then call back and forth using those bindings.
For calling out to R from within Julia, the RCall package is currently your best bet. For calling out to Julia from within R, try the RJulia package. Both are still works in progress.

How to put R objects in manually allocated memory?

I am developing a function in C for an R package, and I need to initialize an R numeric vector in manually allocated memory that is not garbage collected.
The standard function allocVector(REALSXP, XXX) allocates memory for me and initializes the object. I have already an allocated piece of memory, I need to initialize R object in this memory and return it to userspace.
Algorithm I am trying to follow:
1. Allocate memory myself (actually it is a memory mapped file)
2. Put an R object (a standard R numeric vector) in this memory (How?)
3. Prevent the garbage collector from trying to collect it (How?)
4. Register a finalizer for this object
5. Return the R object so the user can use it
6. Get a notification that the object is no longer referenced and deallocate it
Your problem starts with 1. as the Writing R Extensions manual tells you (in its cryptic ways, see Section 5.9.2) that you must use R's memory "pool" for objects that you hand back to R. How else could R release the object's memory if it doesn't control the access?
Unless you use external pointers, which are also covered (somewhat) in the same manual, and some other places (other questions here, r-devel archives, several packages, ...).
And the R package bigmemory pretty much covers exactly this (also see the related bigmemory website). You could, if you're so inclined, start with bigmemory and derive a package 'mmapmemory' from it. Oh, and there is a package mmap but maybe you knew that already.
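For readers unfamiliar with the underlying idea, here is a rough sketch of a file-backed vector of doubles using a memory-mapped file. It is written in Python with the standard mmap module and does not use R's C API, external pointers, bigmemory, or the mmap package; the file name and the size N are arbitrary assumptions.

```python
import mmap
import os
import struct
import tempfile

N = 1000  # number of doubles in the file-backed vector (arbitrary)
path = os.path.join(tempfile.mkdtemp(), "vector.bin")

# Create a file large enough for N doubles (8 bytes each).
with open(path, "wb") as fh:
    fh.truncate(N * 8)

# Map the file and treat the mapping as an array of doubles.
with open(path, "r+b") as fh:
    with mmap.mmap(fh.fileno(), 0) as mm:
        raw = memoryview(mm)
        doubles = raw.cast("d")
        doubles[0] = 3.14
        doubles[N - 1] = 2.71
        n_mapped = len(doubles)
        # Release the views before the mapping is closed.
        doubles.release()
        raw.release()
        mm.flush()

# The values persist in the file after the mapping is gone.
with open(path, "rb") as fh:
    first = struct.unpack("d", fh.read(8))[0]
print(n_mapped, first)
```

The explicit release and close steps play the role that a finalizer would play in the question's algorithm: whoever owns the mapping is responsible for tearing it down once nothing refers to it.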

Big Data Process and Analysis in R

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance as I have no formal training in Computer Science and am entirely self taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply, any tips or pointers that you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a 3rd grade level either.
Thanks in advance.
If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple kb of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
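A preprocessing step of that kind might look like the sketch below. It assumes the raw file holds one JSON object per line and that the tweet text and author live under the keys "text" and "user" / "screen_name"; those field names and the function name extract_fields are assumptions for illustration, so check them against your own data.

```python
import csv
import io
import json

def extract_fields(lines, out):
    """Stream line-delimited JSON and keep only author and text as CSV."""
    writer = csv.writer(out)
    writer.writerow(["author", "text"])
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict) or "text" not in tweet:
            continue  # skip records that carry no tweet text
        author = (tweet.get("user") or {}).get("screen_name", "")
        writer.writerow([author, tweet["text"]])
        kept += 1
    return kept

# Tiny in-memory sample standing in for the 10 GB file.
sample = io.StringIO(
    '{"text": "hello world", "user": {"screen_name": "alice"}, "lang": "en"}\n'
    '\n'
    '{"delete": {"status": {"id": 1}}}\n'
    '{"text": "second tweet", "user": {"screen_name": "bob"}}\n'
)
out = io.StringIO()
kept = extract_fields(sample, out)
print(kept)
print(out.getvalue())
```

Because the input is consumed one line at a time, memory use stays roughly constant regardless of file size, and the resulting CSV can then be read into R with far less overhead than the raw JSON.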
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
read.table function remains the main data import function in R. This
function is memory inefficient and, according to some estimates, it
requires three times as much memory as the size of a dataset in order
to read it into R.
The reason for such inefficiency is that R stores data.frames in
memory as columns (a data.frame is no more than a list of equal length
vectors) whereas text files consist of rows of records. Therefore, R's
read.table needs to read whole lines, process them individually
breaking into tokens and transposing these tokens into column oriented
data structures.
ColByCol approach is memory efficient. Using Java code, it reads the
input text file and outputs it into several text files, each holding
an individual column of the original dataset. Then, these files are
read individually into R thus avoiding R's memory bottleneck.
The approach works best for big files divided into many columns,
specially when these columns can be transformed into memory efficient
types and data structures: R representation of numbers (in some
cases), and character vectors with repeated levels via factors occupy
much less space than their character representation.
Package ColByCol has been successfully used to read multi-GB datasets
on a 2GB laptop.
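The column-splitting idea in that description can be sketched in a few lines. This is not the colbycol package's Java code, only an illustration of the technique in Python; the function name split_columns and the col_<i>.txt file naming are hypothetical, and the sketch assumes a delimited file with a header row and no embedded newlines in values.

```python
import csv
import os
import tempfile

def split_columns(src_path, out_dir, delimiter=","):
    """Write each column of a delimited file to its own text file."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src, delimiter=delimiter)
        header = next(reader)
        paths = [os.path.join(out_dir, f"col_{i}.txt") for i in range(len(header))]
        outs = [open(p, "w", newline="") for p in paths]
        try:
            # Only one row is held in memory at a time.
            for row in reader:
                for out, value in zip(outs, row):
                    out.write(value + "\n")
        finally:
            for out in outs:
                out.close()
    return dict(zip(header, paths))

# Small demonstration file.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "data.csv")
with open(src_path, "w", newline="") as fh:
    fh.write("id,name\n1,a\n2,b\n")
columns = split_columns(src_path, work_dir)
with open(columns["name"]) as fh:
    name_values = fh.read().split()
print(name_values)
```

Each per-column file can then be loaded on its own, so only the columns actually needed ever occupy memory.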
10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque versions (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory mapped file doesn't really require much infrastructure. I suspect you're working with just a few 10s of millions of rows and a few columns (beyond the actual text of the Tweet). This is easily handled on a laptop with efficient use of memory mapped files. Doing complex statistics will require either more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory mapped files.
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
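The same chunked-reading idea, sketched in Python for comparison: read a fixed number of lines per pass from an open file handle and process each batch before moving on. The helper name read_in_chunks and the chunk size are illustrative assumptions.

```python
import io
from itertools import islice

def read_in_chunks(fh, chunk_size):
    """Yield lists of at most chunk_size lines until the input is exhausted."""
    while True:
        chunk = list(islice(fh, chunk_size))
        if not chunk:
            break
        yield chunk

# Five lines read two at a time.
lines = io.StringIO("a\nb\nc\nd\ne\n")
sizes = [len(chunk) for chunk in read_in_chunks(lines, 2)]
print(sizes)  # [2, 2, 1]
```

Keeping the handle open between chunks means each pass continues where the previous one stopped, rather than skipping over already-processed lines again.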
