I was wondering if Julia has a package similar to sparklyr in R that can handle out-of-memory data. My data is 11 GB in CSV format.
I installed the HPAT package in Julia, but I am not sure whether it helps with handling big data. I also noticed that there is a Spark package for Julia; does it have a function for importing local data, like the spark_read_csv function in sparklyr?
You can try https://github.com/JuliaComputing/JuliaDB.jl. This package is pretty new and still in development, but it is capable of loading CSV datasets larger than memory.
Related
Is there a data dictionary package that will work for R?
I have located a data_dict package at the following link; however, it will not run on the version of R I am using.
http://optimumsportsperformance.com/blog/creating-a-data-dictionary-function-in-r/
I am looking for a data dictionary package that will make light work of a number of large and complex data tables... I have heard these elusive data dictionary packages exist.
I wrote a package that might be what you're after. It's on CRAN under datadictionary, but it might be better to use the dev version on Github because it handles difftimes and dates better.
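As a rough sketch of how I'd expect it to be used (I'm assuming the exported function is create_dictionary(); check the package documentation for the exact name and arguments):
# install.packages("datadictionary")   # CRAN release
library(datadictionary)
# Build a dictionary for an example data frame; create_dictionary() is an assumed name
dict <- create_dictionary(mtcars)
head(dict)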
I have a Spark data frame that I want to save as a table in Cassandra and as Parquet in S3. Examples are available everywhere in Python, Java, and Scala, but I'm not finding the right solution in R.
I have found the RCassandra package; can I use it to do the same?
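One possible approach (a sketch, not a tested answer) is to do both writes from sparklyr: spark_write_parquet() for S3 and spark_write_source() with the DataStax spark-cassandra-connector for Cassandra. The connector coordinates, bucket path, keyspace, and table names below are placeholders, and S3 access is assumed to be already configured on the Spark side.
library(sparklyr)
# Put the Cassandra connector on the Spark classpath (version is an assumption)
config <- spark_config()
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.12:3.4.0"
sc <- spark_connect(master = "local", config = config)
sdf <- sdf_copy_to(sc, iris, name = "iris_tbl")   # example Spark data frame
# Write to S3 as Parquet (bucket/path is a placeholder)
spark_write_parquet(sdf, path = "s3a://my-bucket/iris_parquet", mode = "overwrite")
# Write to Cassandra through the connector (keyspace/table are placeholders)
spark_write_source(
  sdf,
  source = "org.apache.spark.sql.cassandra",
  mode = "append",
  options = list(keyspace = "my_keyspace", table = "iris_tbl")
)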
I am working with a very large dataset (~30 million rows). I typically work with this dataset in SAS, but I would like to use some machine learning methods that exist in R but not in SAS. Unfortunately, my PC can't handle a dataset of this size in R, because R stores the entire dataset in memory.
Will calling the R functions from a SAS program solve this? At the very least I can run SAS on a server (I cannot do this with R).
I have a 500K-row Spark DataFrame that lives in a parquet file. I'm using Spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date path inside joins and filters as easily as SparkR, and so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
@Will
I don't know whether the following actually answers your question or not, but Spark evaluates operations lazily. Transformations in Spark (or SparkR) don't really create any data; they just build a logical plan to follow.
When you run an action such as collect, Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not that large and can easily be handled by local R, then there is no need to go with SparkR. Another solution is to cache your data for frequent use.
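As a minimal sketch of the caching idea in SparkR (the parquet path is a placeholder; cache() only marks the DataFrame, so an action such as count() is used to materialize it):
library(SparkR)
sparkR.session()
df <- read.df("path/to/file.parquet", source = "parquet")   # placeholder path
cache(df)     # mark the DataFrame for caching
count(df)     # an action; forces the data to be materialized in the cache
local_df <- collect(df)   # subsequent collect() reads from the cached data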
In short: serialization/deserialization is very slow.
See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.
I have created a 300000 x 7 numeric matrix in R and I want to work with it in both R and Matlab. However, I'm not able to create a file that Matlab can read properly.
When I use the command save() with file = "xx.csv", Matlab recognizes 5 columns instead of 7; with the .txt extension, all the data is opened in a single column.
I have also tried the ff and ffdf packages to manage this big dataset (I guess the problem of R identifying rows and columns when saving is somehow related to this), but I don't know how to save it afterwards in a format readable by Matlab.
An example of this dataset would be:
output <- matrix(runif(2100000, 1, 1000), ncol=7, nrow=300000)
If you want to work with both R and Matlab, and you have a matrix as big as yours, I'd suggest using the R.matlab package. It provides the functions readMat and writeMat, which read/write the binary .mat format understood by Matlab (and, through R.matlab, also by R).
Install the package by typing
install.packages("R.matlab")
Subsequently, don't forget to load the package, e.g. by
library(R.matlab)
The documentation of readMat and writeMat, accessible through ?readMat and ?writeMat, contains easy usage examples.
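For the example matrix above, a minimal round trip would look something like this (the file name output.mat is arbitrary):
library(R.matlab)
output <- matrix(runif(2100000, 1, 1000), ncol = 7, nrow = 300000)
# Write the matrix to a Matlab binary file; the argument name ("output")
# becomes the variable name seen inside Matlab
writeMat("output.mat", output = output)
# Reading it back into R returns a named list
dat <- readMat("output.mat")
str(dat$output)
In Matlab, load('output.mat') should then expose the 300000 x 7 matrix as the variable output.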