Operating on Spark data frames with SparkR and sparklyr - unrealistic settings?

I am currently working with the SparkR and sparklyr packages, and I think that they are not suitable for high-dimensional sparse data sets.
Both packages follow the paradigm that you select/filter columns and rows of a data frame by simple logical conditions on a few columns or rows. But this is often not what you do on such large data sets: there you need to select rows and columns based on the values of hundreds of row or column entries. Often you first have to calculate statistics on each row or column and then use these values for the selection. Or you want to address only certain individual values in the data frame.
For example:
1. How can I select all rows or columns that have less than 75% missing values?
2. How can I impute missing values with column- or row-specific values derived from each column or row?
To solve (2), I need to execute functions on each row or column of a data frame separately. However, even functions like SparkR's dapplyCollect do not really help, as they are far too slow.
Maybe I am missing something, but I would say that SparkR and sparklyr do not really help in these situations. Am I wrong?
As a side note, I do not understand how libraries like MLlib or H2O could be integrated with sparklyr if there are such severe limitations, e.g. in handling missing values.
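For what it is worth, the column-wise variants of (1) and (2) can at least be expressed through sparklyr's dplyr interface plus Spark ML's Imputer; whether this stays practical with thousands of columns is exactly my concern. A rough, untested sketch, assuming an existing connection sc and a Spark data frame sdf with only numeric columns (both hypothetical):

library(sparklyr)
library(dplyr)

# (1) Fraction of missing values per column, computed in Spark and
#     collected as a single local row.
na_frac <- sdf %>%
  summarise_all(~ mean(as.integer(is.na(.)))) %>%
  collect()

# Keep only the columns with less than 75% missing values.
keep_cols <- names(na_frac)[unlist(na_frac) < 0.75]
sdf_kept  <- sdf %>% select(all_of(keep_cols))

# (2) Impute each remaining column with its own mean via Spark ML's Imputer.
sdf_imputed <- sdf_kept %>%
  ft_imputer(input_cols  = keep_cols,
             output_cols = paste0(keep_cols, "_imputed"),
             strategy    = "mean")

The row-wise versions of (1) and (2) are exactly what I cannot express efficiently, which is the core of my question.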

Related

Read from SAS to R for only a subset of rows

I have a very large dataset in SAS (> 6 million rows). I'm trying to read it into R. For this purpose, I'm using "read_sas" from the "haven" package in R.
However, due to its extremely large size, I'd like to split the data into subsets (e.g., 12 subsets each having 500,000 rows) and then read each subset into R. I was wondering if there is any possible way to address this issue. Any input is highly appreciated!
Is there any way you can split the data with SAS beforehand ... ?
read_sas has skip and n_max arguments, so if your increment size is N=5e5 you should be able to set an index i to read in the ith chunk of data using read_sas(..., skip=(i-1)*N, n_max=N). (There will presumably be some performance penalty to skipping rows, but I don't know how bad it will be.)
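A sketch of that chunked approach, assuming a hypothetical file big.sas7bdat and 12 chunks of 500,000 rows:

library(haven)

N      <- 5e5                    # rows per chunk
chunks <- vector("list", 12)     # 12 chunks of 500,000 rows ~ 6 million rows

for (i in seq_along(chunks)) {
  chunks[[i]] <- read_sas("big.sas7bdat",        # hypothetical file name
                          skip  = (i - 1) * N,
                          n_max = N)
}

# Process each chunk separately, or combine them if memory allows:
# full_data <- do.call(rbind, chunks)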

What are the differences between data.frame, tibble and matrix?

In R, some functions only work on a data.frame and others only on a tibble or a matrix.
Converting my data using as.data.frame or as.matrix often solves this, but I am wondering how the three are different.
Because they serve different purposes.
Short summary:
A data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that each column has its own data type, so different columns can be of different types. This makes data frames useful for data storage.
A matrix is a special case of an atomic vector that has two dimensions. This means that the whole matrix has to have a single data type, which makes matrices useful for algebraic operations. It can also make numeric operations faster in some cases, since no type checks have to be performed. However, if you are careful with your data frames, the difference will not be big.
A tibble is a modernized version of a data frame, used in the tidyverse. Tibbles use several techniques to make them 'smarter' - for example lazy loading.
Long description of matrices, data frames and other data structures as used in R.
So to sum up: matrix and data frame are both 2D data structures. Each of them serves a different purpose and thus behaves differently. The tibble is an attempt to modernize the data frame that is used in the widespread tidyverse.
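A minimal illustration of the type behaviour described above, using a throwaway example data frame:

df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(0.5, 0.7, 0.9))
sapply(df, class)          # "integer", "character", "numeric": each column keeps its own type

m <- as.matrix(df)
class(m[1, 1])             # "character": everything is coerced to one common type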
If I try to rephrase it from a less technical perspective:
Each data structure is making tradeoffs.
A data frame trades a little of its efficiency for convenience and clarity.
A matrix is efficient, but harder to wield, since it enforces restrictions on its data.
A tibble trades even more efficiency for even more convenience, while also trying to mask that tradeoff with techniques that postpone the computation to a time when it no longer appears to be the tibble's fault.
Regarding the difference between data frames and tibbles, the two main differences are explained here: https://www.rstudio.com/blog/tibble-1-0-0/
Besides, my understanding is the following:
- If you subset a tibble, you always get back a tibble.
- Tibbles can have complex entries.
- Tibbles can be grouped.
- Tibbles display better.
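A small sketch of the first two points, using a throwaway tibble (requires the tibble package):

library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))
df <- as.data.frame(tb)

class(df[, "x"])           # "integer": single-column subsetting drops to a vector
class(tb[, "x"])           # "tbl_df" "tbl" "data.frame": a tibble stays a tibble

# Complex entries: a tibble column can be a list holding arbitrary objects.
tb2 <- tibble(id = 1:2, fit = list(lm(mpg ~ wt, data = mtcars), NULL))
tb2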

rbind and cbind commands equivalent in SPSS

Can someone provide equivalent SPSS code that merges datasets to replicate the rbind and cbind commands available in R? Many thanks!
To add rows from dataset1 to dataset2, you can use ADD FILES. This requires that both datasets hold the same variables, with matching variable names and formats.
To add columns from dataset1 to dataset2, use MATCH FILES. This command matches the values for the desired variables in dataset2 to the right rows in dataset1 using keys present in both files (such as a respondent id). The keys are defined in the BY subcommand.
Please note that R and SPSS work in totally different ways. In short, SPSS (mainly) works with datasets in which variables are defined and formatted, while R can handle single values, vectors, matrices, data frames, etc. Simply copying columns from one dataset to another (without paying attention to how the files are sorted) and simply adding rows without matching the variable names and types in the existing dataset are very unusual in SPSS.
If you post an example of what you are trying to achieve, I could give you a more useful answer...

I want to process tens of thousands of columns using Spark via sparklyr, but I can't

I tried using sdf_pivot() to widen a column with duplicated values into multiple (a very large number of) columns. I planned to use these columns as the feature space for training an ML model.
Example: I have a language-element sequence in one column (words), which I wish to turn into a binary matrix of huge width (say, 100,000 columns) and run a sentiment analysis using logistic regression.
The first problem is that, by default, sparklyr does not allow me to create more than 10,000 columns, citing a possible error in my design.
The second problem is that even if I override this warning and create lots of columns, further calculations take forever on this very wide data.
Question 1: is it good practice to create such extra-wide datasets, or should I work differently with such large feature spaces while still using the power of Spark's fast parallel computations?
Question 2: is it possible to construct a vector-type feature column while avoiding the generation of a very wide matrix?
I just need a small example or practical tips to follow.
https://github.com/rstudio/sparklyr/issues/1322
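The linked issue points in the same direction as question 2: instead of pivoting into tens of thousands of physical columns, build a single vector-valued feature column with Spark's feature transformers. A rough sketch via sparklyr, assuming a hypothetical Spark data frame reviews_sdf with a raw text column text and a 0/1 column label:

library(sparklyr)
library(dplyr)

model <- reviews_sdf %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col  = "tokens",
                      output_col = "features",
                      binary     = TRUE,        # 0/1 presence instead of counts
                      vocab_size = 100000) %>%  # one vector column, not 100,000 columns
  ml_logistic_regression(features_col = "features", label_col = "label")

The 100,000-dimensional term matrix then lives inside a single features column of Spark vectors, which is what MLlib-backed estimators such as ml_logistic_regression expect.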

how to use the princomp() or prcomp() functions in R with large datasets, without transposing the data?

I have just started learning about PCA and I wish to use it for a huge microarray dataset with more than 400,000 rows. My columns are samples and my rows are genes/loci. I did go through some tutorials on using PCA and came across princomp() and prcomp() and a few others.
Now, as I have learned, in order to plot "samples" in the biplot, I would need to have them in the rows and the genes/loci in the columns, and hence I will have to transpose my data before using it for PCA.
However, since there are more than 400,000 rows, I am not really able to transpose them into columns, because the number of columns is limited. So my question is: is there any way to perform a PCA on my data without transposing it, using these R functions? If not, can anyone suggest another way or method to do so?
Why are you so reluctant to transpose your data? It's easy!
If you read your data into R (for example as the matrix microarray.data), you can transpose it with a single command:
transposed.microarray.data <- t(microarray.data)
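Building on that transposed object, a minimal PCA sketch, under the assumption that microarray.data is a numeric matrix with genes/loci in rows and samples in columns; note that the number of principal components is bounded by the number of samples, so the transposed matrix with roughly 400,000 columns is not a problem in itself:

# scale. = TRUE assumes no gene has zero variance across the samples;
# drop constant genes first if that is not the case.
pca <- prcomp(transposed.microarray.data, center = TRUE, scale. = TRUE)

summary(pca)    # variance explained per component
biplot(pca)     # samples plotted in the space of the first two components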
