Why is collect in SparkR so slow?

I have a 500K row spark DataFrame that lives in a parquet file. I'm using spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8gb of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr can't do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?

@Will I don't know whether the following comment actually answers your question, but Spark evaluates lazily. The transformations done in Spark (or SparkR) don't create any data; they just build a logical plan to follow.
When you run an action like collect, Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted it).
If your data is small enough to be handled easily by local R, there is no need for SparkR. Another solution is to cache data that you use frequently.
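To make the lazy-evaluation point concrete, here is a toy sketch in plain Python. It is not the Spark or SparkR API: LazyDataset, load_source and the other names are made up for illustration, and the only thing it tries to show is that transformations merely record a plan, that every collect re-reads the source, and that caching avoids the repeated read.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, load, steps=()):
        self._load = load            # function that reads the source data
        self._steps = tuple(steps)   # recorded transformations, not yet run
        self._keep = False           # whether collect() should keep its result
        self._cached = None

    def map(self, fn):
        # Recording a step is cheap: no data is touched here.
        return LazyDataset(self._load, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._load, self._steps + (("filter", pred),))

    def cache(self):
        # Ask the next collect() to keep its result in memory.
        self._keep = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        rows = self._load()          # the source is only read here
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        if self._keep:
            self._cached = rows
        return list(rows)


source_reads = []


def load_source():
    source_reads.append("read")      # count how often the source is hit
    return list(range(10))


plan = LazyDataset(load_source).map(lambda x: x * 2).filter(lambda x: x > 10)
first = plan.collect()    # runs the whole plan against the source
second = plan.collect()   # runs it again, because nothing was cached

plan.cache()
third = plan.collect()    # reads the source once more and keeps the result
fourth = plan.collect()   # served from the cache, no source read
```

In this toy version the source is read three times for four collects. The same reasoning is why caching helps when a Spark DataFrame is used repeatedly.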

Short: Serialization/deserialization is very slow.
See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.

Related

Write custom metadata to Parquet file in Julia

I am currently storing the output (a Julia Dataframe) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (eg. a list of (byte-)strings) to that same output file.
Preferably, these parameters are different for each column as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.
I have found a solution for Python using pyarrow
https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
Do you know a way how to do it in Julia?
It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile).
I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.
It's probably also worth mentioning that the main use case I recommend for parquet is when you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as the format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data world", so a robust writer is desperately needed.

Subset of features on external memory

I have a large file that I'm not able to load, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument of slice is "currently not used", and there is no mention of this feature on the GitHub page. I haven't found any other clue about how to do column subsetting with external memory.
I wish to compare models generated with different features subsettings. The only thing I could think of is to create a new file with the features that I want to use but it's taking a long time and will take a lot of memory... I can't help wondering if there is a better way.
ps.: I tried using h2o package too but h2o.importFile froze.

Sharing a data.table in memory for parallel computing

Following the post about data.table and parallel computing, I'm trying to find a way to get an operation on a data.table parallelized.
I have a data.table with 4 million rows of 14 observations and would like to share it in common memory so that operations on it can be parallelized using the "parallel"-package with parLapply, without having to copy the table to each node in the cluster (which is what parLapply does). At the moment the cost of moving the data.table around is bigger than the benefit of parallel computation.
I found the "bigmemory"-package as an answer for sharing memory, but it doesn't maintain the "data.table"-structure of the data. So does anyone know a way to:
1) put the data.table in shared memory
2) maintain the "data.table"-structure of the data by doing so
3) use parallel processing on this data.table?
Thanks in advance!
Old question, but here is an answer since nobody else has answered and it might be helpful. I assume the problem you are having is because you are on Windows and have to use the PSOCK type of cluster. Unfortunately, on Windows this means you have to copy the data to each node. However, there is a workaround. Get hold of Docker and spin up an Rserve instance on the Docker VM (e.g. stevenpollack/docker-rserve). Since this will be Linux based, you can create a FORK cluster on the Docker VM. Then, using your native R instance, you can send over only one copy of the data to the Rserve instance (check out the RSclient library), do your parallelized job on the VM, and collect the results back into your native R.
The "complete" solution, shared read and write access from multiple processes, and their problems is discussed here: https://github.com/Rdatatable/data.table/issues/3104
As rookie mentioned, if you fork an R process (with parallel::makeCluster(type = "FORK") or future::plan(multicore); note that this does not work reliably in RStudio), the operating system will reuse memory pages that are not modified by the child process. So your workers will share the same memory as long as they don't modify it (copy-on-write). But this works only if all parallel workers are on the same machine, and fork() has its own problems (although this might be going too far if you simply want to conduct some parallel analysis).
Meanwhile, you could find the packages feather and fst interesting. feather provides a file format that can be read both by R and python and if I understood the docs correctly, feather::feather() gives you a file-backed read-only data-frame, albeit no data.table. This allows for moving data between those two languages.
fst employs the Zstandard compression algorithm to achieve very fast reading and writing speeds to disk. You can read in a part of a fst file using the fst() function (instead of read_fst()). So, every worker could just read the part of your table that it needs. Concurrent writing to the fst file is not possible. You would need to save every result in its own file and concatenate them afterwards.
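The pattern described here (each worker reads only its slice, writes its own result file, and the pieces are concatenated afterwards) can be sketched roughly as follows. This is plain Python, not fst: a fixed-width binary file stands in for the fst file so that a row range can be reached by seeking, threads stand in for the worker processes, and every name (write_table, read_rows, worker, the part_*.csv files) is hypothetical.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

ROW = struct.Struct("<id")  # one row: int32 id + float64 value, 12 bytes


def write_table(path, rows):
    """Write (id, value) rows as fixed-width binary records."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(ROW.pack(*row))


def read_rows(path, start, stop):
    """Read only rows [start, stop) by seeking; the rest is never loaded."""
    with open(path, "rb") as f:
        f.seek(start * ROW.size)
        data = f.read((stop - start) * ROW.size)
    return [ROW.unpack_from(data, i) for i in range(0, len(data), ROW.size)]


def worker(path, start, stop, out_path):
    """Process one slice and write the result to this worker's own file."""
    with open(out_path, "w") as out:
        for row_id, value in read_rows(path, start, stop):
            out.write(f"{row_id},{value * 2}\n")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "table.bin")
    write_table(table, [(i, float(i)) for i in range(10)])

    bounds = [(0, 4), (4, 8), (8, 10)]  # one row range per worker
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(worker, table, a, b, os.path.join(tmp, f"part_{k}.csv"))
            for k, (a, b) in enumerate(bounds)
        ]
        parts = [f.result() for f in futures]

    # No concurrent writes happened: concatenate the per-worker files in order.
    combined = []
    for part in parts:
        with open(part) as f:
            combined.extend(f.read().splitlines())
```

Because no two workers ever write to the same file, no locking is needed; the only coordination is the final concatenation.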
Alternatively, for concurrent reading and writing, you could switch to a database, albeit that is slower than data.table. See SO/SQLite concurrent access

How to use zoo or xts with large data?

How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, bigmemory that can deal with this problem but you have to use their limited set of commands, they don't have the functions of zoo or xts and I don't know how to make zoo or xts to use them.
How can I use it?
I've seen that there are also some other options related to databases, such as sqldf, hadoopstreaming, RHadoop, or the ones used by Revolution R. What do you advise? Are there any others?
I just want to aggregate series, cleanse them, and run some cointegration analyses and plots.
I wouldn't like to need to code and implement new functions for every command I need, using small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (albeit I was only playing with 9-10 GBs). My experience is that there is no way R can handle so much data on its own, especially since your dataset appears to contain time series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ )
I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database; it uses SQL syntax. Data is downloaded into R as a dataframe. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
Drawback: you would need to upload data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? )
I would recommend using PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.
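As a rough illustration of the "load it into a database, then aggregate with SQL" workflow, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server. The table name prices, its columns and the sample rows are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice this would be a large CSV on disk.
raw_csv = io.StringIO(
    "ts,symbol,price\n"
    "2020-01-01 09:00:00,AAA,10.0\n"
    "2020-01-01 15:00:00,AAA,12.0\n"
    "2020-01-02 09:00:00,AAA,11.0\n"
    "2020-01-02 15:00:00,AAA,15.0\n"
)

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE prices (ts TEXT, symbol TEXT, price REAL)")

# Stream the CSV into the table row by row, without holding it all in memory.
conn.executemany(
    "INSERT INTO prices (ts, symbol, price) VALUES (:ts, :symbol, :price)",
    csv.DictReader(raw_csv),
)
conn.commit()

# Aggregate inside the database; only the small daily summary comes back.
daily = conn.execute(
    "SELECT substr(ts, 1, 10) AS day, symbol, AVG(price) AS mean_price, "
    "COUNT(*) AS n FROM prices GROUP BY day, symbol ORDER BY day"
).fetchall()
conn.close()
```

The point of the design is that the heavy lifting (grouping and averaging) happens where the data lives, and only the aggregated series needs to fit in the analysis session.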

Big Data Process and Analysis in R

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance as I have no formal training in Computer Science and am entirely self taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply, any tips or pointers that you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a 3rd grade level either.
Thanks in advance.
If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple of KB. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
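A minimal sketch of that preprocessing step in Python, assuming the raw file holds one JSON object per line and that the fields of interest are text and user.screen_name (check both assumptions against your own dump). The function name extract_fields and the tab-separated output layout are made up for the example.

```python
import io
import json


def extract_fields(lines, out):
    """Write 'author<TAB>text' for each usable tweet; return how many were kept."""
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # blank keep-alive lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue                      # truncated or malformed record
        if not isinstance(tweet, dict):
            continue
        text = tweet.get("text")
        if text is None:
            continue                      # e.g. delete notices carry no text
        user = tweet.get("user") or {}
        author = user.get("screen_name", "")
        clean = " ".join(text.split())    # collapse newlines and tabs in the text
        out.write(f"{author}\t{clean}\n")
        kept += 1
    return kept


# Tiny in-memory sample; for the real file, pass open file handles instead.
sample = io.StringIO(
    '{"text": "hello\\nworld", "user": {"screen_name": "alice"}, "lang": "en"}\n'
    "\n"
    '{"delete": {"status": {"id": 1}}}\n'
    "not json\n"
    '{"text": "second tweet", "user": {"screen_name": "bob"}}\n'
)
out = io.StringIO()
kept = extract_fields(sample, out)
```

Because the file is processed one line at a time, memory use stays flat regardless of the input size, and the resulting two-column file should be far smaller than the raw JSON.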
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
read.table function remains the main data import function in R. This
function is memory inefficient and, according to some estimates, it
requires three times as much memory as the size of a dataset in order
to read it into R.
The reason for such inefficiency is that R stores data.frames in
memory as columns (a data.frame is no more than a list of equal length
vectors) whereas text files consist of rows of records. Therefore, R's
read.table needs to read whole lines, process them individually
breaking into tokens and transposing these tokens into column oriented
data structures.
ColByCol approach is memory efficient. Using Java code, it reads the
input text file and outputs it into several text files, each holding
an individual column of the original dataset. Then, these files are
read individually into R thus avoiding R's memory bottleneck.
The approach works best for big files divided into many columns,
especially when these columns can be transformed into memory efficient
types and data structures: R representation of numbers (in some
cases), and character vectors with repeated levels via factors occupy
much less space than their character representation.
Package ColByCol has been successfully used to read multi-GB datasets
on a 2GB laptop.
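The column-splitting idea in the description above can be sketched in a few lines. This is a Python illustration of the approach, not the package's Java code; split_columns, read_column and the file-naming scheme are hypothetical, and it assumes a delimited file with a header row and no embedded newlines inside fields.

```python
import csv
import os
import tempfile


def split_columns(src_path, out_dir, delimiter=","):
    """Stream src_path once and write one text file per column into out_dir."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src, delimiter=delimiter)
        header = next(reader)
        # Assumes column names are safe to use in file names.
        paths = [
            os.path.join(out_dir, f"{i:03d}_{name}.txt")
            for i, name in enumerate(header)
        ]
        outs = [open(p, "w") for p in paths]
        try:
            for row in reader:            # only one row is in memory at a time
                for out, value in zip(outs, row):
                    out.write(value + "\n")
        finally:
            for out in outs:
                out.close()
    return dict(zip(header, paths))


def read_column(path, convert=str):
    """Load a single column file, converting each value as it is read."""
    with open(path) as f:
        return [convert(line.rstrip("\n")) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.csv")
    with open(src_path, "w", newline="") as f:
        f.write("id,city,temp\n1,Oslo,3.5\n2,Lima,19.0\n3,Oslo,4.0\n")

    column_files = split_columns(src_path, tmp)
    # Load only the columns that are actually needed.
    temps = read_column(column_files["temp"], float)
    cities = read_column(column_files["city"])
```

Once the file is split, each column can be loaded and converted to a compact type on its own, so the unneeded columns never enter memory.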
10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque versions (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory mapped file doesn't really require much infrastructure. I suspect you're working with just a few 10s of millions of rows and a few columns (beyond the actual text of the Tweet). This is easily handled on a laptop with efficient use of memory mapped files. Doing complex statistics will require either more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory mapped files.
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
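For comparison, the same skip/nlines style of chunked reading looks roughly like this in Python. It is only a sketch: read_chunk and iter_chunks are made-up helper names, and the input is assumed to hold one JSON record per line.

```python
import io
import json
from itertools import islice


def read_chunk(lines, skip, nlines):
    """Return up to nlines lines after skipping the first `skip` lines."""
    return [line.rstrip("\n") for line in islice(lines, skip, skip + nlines)]


def iter_chunks(lines, nlines):
    """Yield successive lists of up to nlines lines in a single pass."""
    while True:
        chunk = list(islice(lines, nlines))
        if not chunk:
            return
        yield chunk


# Seven one-line JSON records as a small in-memory stand-in for the big file.
data = "".join(json.dumps({"id": i}) + "\n" for i in range(7))

# Random-access style: reopen the input and skip ahead for each chunk.
second_chunk = read_chunk(io.StringIO(data), skip=3, nlines=3)
ids = [json.loads(line)["id"] for line in second_chunk]

# Streaming style: keep the handle open and take the next chunk each time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(data), 3)]
```

Skipping still means reading past the skipped lines, so for a full pass over a large file the streaming variant, which never rereads anything, is likely the cheaper of the two.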
