I want to extract some statistical measurements from large Spark DataFrames (approx. 250K records, 250 columns), e.g. max, mean, stddev, etc., but I also want some non-standard measurements such as skewness.
I am working on Databricks using the SparkR API. I know that there is the possibility of getting basic statistics via summary:
df <- SparkR::as.DataFrame(mtcars)
SparkR::summary(df, "min", "max", "50%", "mean") -> mySummary
head(mySummary)
However, this does not cover other measures such as skewness, which I am basically extracting using
SparkR::select(df, SparkR::skewness(df$mpg)) -> mpg_Skewness
head(mpg_Skewness)
This works, but I am looking for a more performant way of doing it. No matter whether I loop over the columns or use an apply function over them, Spark always executes one job per column, and these jobs run sequentially.
I also tried to compute this for all columns as a single job, but that was even slower.
Is it possible to force Spark to run these kinds of calculations for all columns separately but in parallel? What else would you propose to make such column-wise operations as performant as possible? Any hints are highly appreciated!
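For reference, the all-columns-in-one-job variant mentioned above can be expressed roughly as follows; this is only a sketch (untested), and num_cols is a placeholder for the numeric columns of the DataFrame:
num_cols <- c("mpg", "cyl", "disp")                                  # placeholder list of numeric columns
exprs <- lapply(num_cols, function(cn) SparkR::skewness(df[[cn]]))   # one Column expression per column
all_skews <- do.call(SparkR::agg, c(list(df), exprs))                # evaluated as a single Spark job
head(all_skews)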
This is my first question, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame size and can perform many operations over the data within the tables. I am happy with that, as I can manipulate the data at will: merges, concatenations and row and column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec per instruction, over a 6-million-row table, and it took over an hour.
Maybe the approach in R was wrong to begin with. I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)), but since, as far as I can tell, this is not something you can optimize with some sort of algorithm like graphs or heaps (it's just tables; you have to iterate through them all), I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some RStudio configuration that improves performance.
This is the loop, in case it helps to understand the question:
library(dplyr)  # for %>% and pull()

my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")),
                         format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once finished.
Please let me know if the question lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and work properly with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.
df <- as.data.frame(table)
In this case, doing this converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
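For completeness, a minimal vectorized sketch of the original loop (assuming Date_of_count ends up as character in %Y-%m-%d form after the conversion to a data frame):
df <- as.data.frame(table)
df$Date_formatted <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"),
                            format = "%Y-%m-%d")   # whole column at once, no loop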
I understand from excellent resources here, here and here that data.table utilises automatic indexing (creating a key, i.e. supercharged row names) and binary-search-based subsetting, in contrast to the tidyverse, which relies on vector scanning.
I understand that vector scanning requires scanning each individual row and creating logical vectors of length nrow(dataset), and that doing this repeatedly is not as efficient.
I'm wondering if someone can help me frame exactly how these two methods mean that data.table operations run a lot faster than tidyverse ones when you need to group by a variable. That is, is it because data.table automatically indexes the group_by column, breaks it into grouped subsets and runs operations on each subset, whilst a vector-scanning approach would need to generate one logical vector per unique group and then run operations on each of those logical vectors before collating the results?
Also, according to the data.table vignette,
We can set keys on multiple columns and the column can be of different
types...
Since the rows are reordered, a data.table can have at most one key
because it can not be sorted in more than one way.
What does it mean that we can set keys on multiple columns and yet a data.table can have at most one key? That is, is it that at any moment during an operation there is only one reference key, but the column(s) the reference key is set on can change as we move to another component of the overall operation?
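For concreteness, here is a minimal sketch of what I understand by "setting keys on multiple columns" (toy data assumed):
library(data.table)
DT <- data.table(id = c("b", "a", "a"), grp = c(2L, 1L, 2L), val = 1:3)
setkey(DT, id, grp)   # one key spanning two columns; rows are reordered by (id, grp)
key(DT)               # "id" "grp" -- still a single (composite) key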
Thank you in advance!
No, it is not. There are different ways of finding groups, and then of computing an expression by group, and each of those pieces can be implemented differently. They are not related to keys or indexes, and data.table does not automatically create a key/index during a group by (as of now).
data.table has a very fast, carefully implemented order function, which is used to find the groups; it was later contributed to base R. There is an idea to use it in dplyr to speed up grouping: https://github.com/tidyverse/dplyr/issues/4406
data.table's order function has been improved since then and now scales even better.
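For context, a small sketch of that contribution: base R's order() gained a radix method that originated in data.table (timings will vary by machine):
x <- sample.int(1e7)
system.time(order(x, method = "shell"))   # base R's comparison sort
system.time(order(x, method = "radix"))   # radix ordering contributed from data.table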
Aside from finding groups, there is the part about computing the expression. If we evaluate a user-defined function, it will always be much slower. Many common functions are internally optimized, so they don't switch between R and C for every group; here data.table has its very carefully implemented "GForce" functions. I am not sure, but in dplyr the analogous mechanism is called "hybrid evaluation".
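A small sketch to see this in action: with verbose output enabled, data.table reports when it replaces j with its internally optimized GForce version (the exact wording of the log differs between versions):
library(data.table)
DT <- data.table(grp = sample(letters, 1e6, TRUE), val = rnorm(1e6))
options(datatable.verbose = TRUE)
DT[, .(avg = mean(val)), by = grp]   # verbose log shows j being optimized to gmean()
options(datatable.verbose = FALSE)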
It is always important to test on your particular data and use case. If you have just 2 unique groups in your data, then fast grouping algorithms will not shine much.
There is also a community repository meant to describe data.table's algorithms, https://github.com/asantucci/algo_data.table, but it is not very active. I recently posted a question there about "group-by optimization" and will paste it here as well; the answer was provided by data.table author Matt Dowle.
Q: Does GForce allocate memory for the biggest group, then copy the values of a group there, so aggregation can benefit from being contiguous in memory and be more cache efficient? If so, do we check whether the groups are already sorted, so we can avoid the allocation and copy?
A: GForce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-GForce (dogroups.c), which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it switches to a memcpy. The memcpy is very fast (contiguous, pre-fetch), so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint; it's an offset from the SEXP.
I have a main df of 250k observations, to which I want to add a set of variables that I had to compute in smaller dfs (5 different dfs of 50k observations each) due to the row-size limitation of left_join/merge (2^31 - 1 observations).
I am now trying to use left_join or merge on the main df and the 5 smaller ones to add the columns for the new variables to the main df, 50k observations at each step.
mainFrame <- left_join(mainFrame, newVariablesFirstSubsample)
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample)
mainFrame <- left_join(mainFrame, newVariablesThirdSubsample)
mainFrame <- left_join(mainFrame, newVariablesFourthSubsample)
mainFrame <- left_join(mainFrame, newVariablesFifthSubsample)
After the first left_join (which includes the new variables' values for the first 50k observations), R doesn't seem to fill in any values for the following groups of 50k observations when I run the second to fifth left_joins. I draw this conclusion from the summary statistics of the respective columns after each left_join.
Any idea on what I do wrong or which other functions I might use?
data.tables allow you to create "keys", which are R's version of SQL indexes. They will help speed up the lookups on the columns that R uses for merging or left-joining.
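A hedged sketch of what that could look like with data.table (the join key "id" is an assumption; use whatever columns your data frames actually share):
library(data.table)
setDT(mainFrame)
setDT(newVariablesFirstSubsample)
setkey(mainFrame, id)                                 # assumed join key
setkey(newVariablesFirstSubsample, id)
mainFrame <- newVariablesFirstSubsample[mainFrame]    # keyed left join onto mainFrame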
If I were you, I would just export all of them to CSV files and work on them from SQL or using the SSIS service.
The problem I see is that you are only noticing the error from the summary statistics. Have you tried reversing the order in which you join the tables, or explicitly stating the names of the columns used in your left join?
Please let us know the outcome.
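For example, spelling out the join keys explicitly might look like this (a sketch; "id" is an assumed key column, and the column checked afterwards is hypothetical):
library(dplyr)
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample, by = "id")
summary(mainFrame$someNewVariable)   # hypothetical column, to verify the join filled in values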
Does anyone know how one could apply the following function, which converts a 3-column table into a matrix, to a file that has 2 billion rows (with less than 10 GB of memory)?
Here x is the 1st, y the 2nd and z the 3rd column:
library(plyr)
daply(a, .(x, y), function(x) x$z)
If you cannot load all the tuples at once: I know this is not the answer you are looking for, but use SQLite.
The problem with R is that it must load the entire frame at once. If you don't have enough memory, then it simply can't continue.
SQLite is way smarter than R at doing aggregates. Perhaps the most important feature is that it optimizes the memory available and, if it can, it does not need to read all the elements at once. See this post for details on how to do it:
http://www.r-bloggers.com/using-sqlite-in-r/
If SQLite does not support the aggregate you want, you can create it yourself (see user defined functions in SQLite).
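A rough sketch of that route with RSQLite (the column names x, y, z come from the question; the chunked import and the MAX aggregate are assumptions to illustrate the idea):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "big.sqlite")
# Import the 3-column file in chunks so it never has to fit in memory at once:
chunk <- read.table("big_file.txt", nrows = 1e6, col.names = c("x", "y", "z"))
dbWriteTable(con, "a", chunk, append = TRUE)
# ... repeat for the remaining chunks (e.g. using the skip argument) ...
# Then let SQLite do the aggregation instead of R:
res <- dbGetQuery(con, "SELECT x, y, MAX(z) AS z FROM a GROUP BY x, y")
dbDisconnect(con)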
Alternatively, you can try to partition your data (outside R) so that you can aggregate it in stages. But that still requires some sort of program that can read and process the files in less than the available memory. Unix/macOS/Linux sort is one of those utilities that can deal with more-than-available-memory data; it might be useful.
Most of the data sets that I have worked with have generally been of moderate size (mostly fewer than 100k rows), so my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 data frames as arguments (with, say, m and n rows) and returns a new data frame with m*n rows. I then have to perform some operations on the resulting data set, so even with small values of m and n (say around 1,000 each) the resulting data frame has more than a million rows.
When I try even simple operations on this data set, the code takes an intolerably long time to run. Specifically, my resulting data frame has 2 columns with numeric values, and I need to add a new column that compares the values of these columns and categorizes them as "Greater than", "Less than" or "Tied".
I am using the following code:
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))
And, as I mentioned before, this takes forever to run. I did some research and found that operations on data.tables are apparently significantly faster than on data frames, so maybe that's one option I could try.
But I have never used data.table before, so before I plunge into that, I was quite curious to know whether there are other ways to speed up computations on large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)

registerDoParallel()  # You could specify the number of cores to use here; see the documentation.

df$compare <- foreach(m = df$m, n = df$n, .combine = 'c') %dopar% {
  # Borrowing from @nicola in the comments because it's a good solution.
  c('Less Than', 'Tied', 'Greater Than')[sign(m - n) + 2]
}
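If the parallel overhead turns out to dominate, the same indexing trick also works as a plain vectorized one-liner (columns m and n as in the example above), which is often fast enough on a few million rows:
df$compare <- c('Less Than', 'Tied', 'Greater Than')[sign(df$m - df$n) + 2]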