I'm trying to get the distinct values from a SparkDataFrame using the statement below.
distVals <- collect(distinct(select(dataframeName, 'Column_name')))
Executing this statement takes around 30-40 minutes. Is there a better way to do this?
Also, there is not much time difference between collecting the full dataframe and collecting only the distinct values. So why is it suggested not to collect the entire dataset? Is it only because of the data size?
Since I have to get different kinds of filtered data, I'm looking for a way to collect the results faster.
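For example, would caching the SparkDataFrame first help, so that the repeated queries reuse the in-memory data instead of recomputing everything from the source each time? Something like this (just an untested sketch on my side, using the same dataframeName and Column_name as above):
cache(dataframeName)  # persist the SparkDataFrame before running repeated queries against it
distVals <- collect(distinct(select(dataframeName, 'Column_name')))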
First question here, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on dataframe size and can perform many operations over the data within the tables. I'm happy with that, as I can manipulate the data at will: merges, concatenations and row and column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec per instruction, over a 6-million-row table, and it took over an hour.
Maybe the R approach was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)). But since, as far as I can tell, this is not something you can optimize with some sort of algorithm like graphs or heaps (it's just tables, you have to iterate through them all), I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...), or some configuration of RStudio, that would improve performance.
This is the loop, in case it helps to understand the question:
library(dplyr)  # needed for %>% and pull()
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
As mentioned, the table has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once it is finished.
Please let me know if the question lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the points above.
df <- as.data.frame(table)
In this case, the dates were converted directly to character, so I did not have to apply any more conversions.
New execution time over 6 million rows: 25.25 sec.
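For reference, a vectorized version of the original loop might look roughly like this (just a sketch; Date_formatted is a made-up column name, and it assumes Date_of_count ends up as a character column in %Y-%m-%d format after the conversion):
df <- as.data.frame(table)
# convert the whole column at once instead of row by row
df$Date_formatted <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"), "%Y-%m-%d")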
I have a main df of 250k observations to which I want to add a set of variables that I had to compute in smaller dfs (5 different dfs of 50k observations each) due to the limitation on the left_join/merge functions' row size (2^31-1 observations).
I am now trying to use left_join or merge on the main df and the 5 smaller ones to add the columns with the new variables to the main df, 50k observations at a time.
mainFrame <- left_join(mainFrame, newVariablesFirstSubsample)
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample)
mainFrame <- left_join(mainFrame, newVariablesThirdSubsample)
mainFrame <- left_join(mainFrame, newVariablesFourthSubsample)
mainFrame <- left_join(mainFrame, newVariablesFifthSubsample)
After the first left_join (which fills in the new variables' values for the first 50k observations), R does not seem to add any values for the following groups of 50k observations when I run the second to fifth left_joins. I conclude this from the summary statistics for the respective columns after each left_join.
Any idea what I am doing wrong, or which other functions I might use?
data.table allows you to create "keys", which are R's version of SQL indexes. They help expedite the lookups on the columns that R uses for merging or left-joining.
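A rough sketch of what that could look like with data.table (the names mainDT, subDT and id are placeholders for illustration, not from the question):
library(data.table)
mainDT <- as.data.table(mainFrame)
subDT <- as.data.table(newVariablesFirstSubsample)
setkey(mainDT, id)  # "id" stands in for your actual join column
setkey(subDT, id)
mainDT <- subDT[mainDT]  # keyed join: all rows of mainDT with subDT's columns added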
If I were you, I would just export them all to CSV files and work on them from SQL or via an SSIS service.
What I'm noting is that you are only discovering the error from the summary statistics. Have you tried reversing the order in which you join the tables, or explicitly stating the names of the columns used in your left_join?
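For example, stating the join column explicitly would look roughly like this (ID_COLUMN is a placeholder for whatever key the frames share):
library(dplyr)
mainFrame <- left_join(mainFrame, newVariablesFirstSubsample, by = "ID_COLUMN")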
Please let us know the outcome.
I have a Spark data frame with 10 million rows, where each row is an alphanumeric string representing the id of a user, for example:
602d38c9-7077-4ea1-bc8d-af5c965b4e85
My objective is to check whether another id, like aaad38c9-7087-4ef1-bc8d-af5c965b4e85, is present in the 10-million-row list.
I want to do this efficiently, rather than searching all 10 million records every single time a search happens. For example, can I sort my records alphabetically and ask SparkR to search only within records that begin with "a", instead of the whole universe, to speed up the search and make it computationally efficient?
A solution primarily using SparkR would be preferred; failing that, any Spark solution would be helpful.
You can use rlike, which does a regex search within a dataframe column.
df.filter($"foo".rlike("regex"))
Or you can index the Spark dataframe into Solr, which will search for your string within a few milliseconds.
https://github.com/lucidworks/spark-solr
I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table that I wish to filter on. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (which is why I don't provide actual data here). This is clearly not efficient when the values are basically put into an IN statement (the query essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use the between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make a query for each batch, then combine the results; a sketch of this approach is below. It may take a while, but it should be stable and worth the wait.
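A minimal sketch of the batched approach, assuming a batch size of 1,000 and using the con, MY_TABLE, FIELD and my_values names from the question:
library(dplyr)
library(purrr)
# split the 500K+ values into batches of at most 1,000
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))
# run one query per batch and row-bind the collected results
result <- map_dfr(batches, function(vals) {
  con %>% tbl('MY_TABLE') %>% filter(FIELD %in% !!vals) %>% collect()
})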
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end. This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
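Roughly, that could look like this (just a sketch; one query per value will be slow for a list this size, and the names are taken from the question):
library(dplyr)
library(purrr)
result <- my_values$FIELD %>%
  map(~ con %>% tbl('MY_TABLE') %>% filter(FIELD == !!.x) %>% collect()) %>%
  bind_rows()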
I have a 3 GB CSV file that is too large to load into R on my computer. Instead, I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format, with each row being one record, you can do:
sort -R data | head -n 1000 >data.sample
This randomly sorts all the rows and puts the first 1000 of them into a separate file, data.sample.
(2) Database:
If the data is not small enough to fit into memory, another option is to store it in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your column to guarantee query speed. You can also compare the mean or standard deviation of the whole dataset against your sample, taking advantage of the power of the database, if you want.
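A minimal sketch of pulling such a sample into R with RMySQL (the connection details and tablename are placeholders):
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost", user = "user", password = "password")  # placeholder credentials
sample_df <- dbGetQuery(con, "select * from tablename order by rand() limit 1000")
dbDisconnect(con)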
These are the two most commonly used ways, based on my experience, of dealing with 'big' data.