Currently, I am using a foreach loop from the doParallel package to run function calls in parallel across multiple cores of the same machine, which looks something like this:
library(doParallel)
registerDoParallel()  # register a parallel backend for %dopar%

out_results <- foreach(i = 1:length(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}
Here some_list is a list of data frames, and each data frame has a different number of columns. function_call() is a function that does several things to the data: data manipulation, then random forest for variable selection, and finally a least squares fit. The variable out is itself a list of 3 data frames, so out_results will be a list of lists.
Inside the function call I use CRAN packages and some custom packages I have written. I want to avoid using the Spark ML libraries because of their limited functionality and the rewriting of the entire code that they would require.
I want to leverage Spark to run these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of the sparklyr documentation, but it doesn't seem to help much since the examples provided there are very straightforward.
sparklyr's homepage gives examples of arbitrary R code distributed on the Spark cluster. In particular, see their example with grouped operations.
Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):
library(dplyr)     # for tibble() and %>%
library(sparklyr)  # for spark_apply()

some_list = list(tibble(a = 1[0]), tibble(b = 1), tibble(c = 1:2))
all_data = tibble(i = seq_along(some_list), df = some_list)
# Replace this with your actual code.
# It should take one data frame and produce one data frame.
# Embedded data frame columns are OK.
transform_one = function(df_wrapped) {
  # in your example, you expect only one record per group
  stopifnot(nrow(df_wrapped) == 1)
  df = df_wrapped$df
  res0 = df
  res1 = tibble(x = 10)
  res2 = tibble(y = 10:11)
  return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}
# Note: all_data must be a Spark DataFrame before calling spark_apply(),
# e.g. copied into Spark first with copy_to(sc, all_data).
all_data %>% spark_apply(
  transform_one,
  group_by = c("i"),
  columns = c("res0" = "list", "res1" = "list", "res2" = "list"),
  packages = c("randomForest", "etc")
)
All in all, this approach seems unnatural, as if we were forcing the use of Spark on a task which does not really fit. Maybe you should check for another parallelization framework?
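If a Spark-free route is an option, the same list-based pattern scales from one machine to several with the future/furrr framework. A minimal sketch (my addition, not from the original answer; it assumes the same some_list and function_call as in the question):
library(furrr)

# Use all local cores; swap in plan(cluster, workers = c("host1", "host2"))
# to spread the work across several machines.
plan(multisession)

out_results <- future_map(some_list, function_call)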
Related
I currently have a list made up of around 80+ data frames. What I would like to do is loop a chunk of code over each individual data frame within the list, without naming each one individually or splitting them into individual data frames to work on.
Currently I split the list into each individual data frame using the below code:
dat5split <- setNames(split(dat5, dat5$CODE), paste0("df", unique(dat5$CODE)))
list2env(dat5split, globalenv())
I then work through each data frame individually:
# call in SPC function and write to 'results10000'
results10000 <- SPC_XBAR(df10000, vol_n, seasonality)
results10000 <- results10000 %>%
  cbind(Spec = df10000$CODE) %>%
  subset(`table_n` == 1)
results10000 <- results10000[order(results10000$tpd),]
results10000$Date <- as.Date(cbind(Date = df10000$CENSUS_DATE))
# call in SPC function and write to 'results10001'
results10001 <- SPC_XBAR(df10001, vol_n, seasonality)
results10001 <- results10001 %>%
  cbind(Spec = df10001$CODE) %>%
  subset(`table_n` == 1)
results10001 <- results10001[order(results10001$tpd),]
results10001$Date <- as.Date(cbind(Date = df10001$CENSUS_DATE))
Currently I call the function 'SPC_XBAR', with vol_n and seasonality set earlier in the script. The script passes the values to the function, which assigns the results to 'results10000', 'results10001', etc. I then do a small bit of data wrangling on each newly created data frame before feeding the results back into SQL Server at the end.
As you can see, each one is being individually hard coded, which is not efficient.
What I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually.
I believe a loop would solve this issue but I am a little inexperienced when it comes to the ability to create a loop around it. Any advice would be much appreciated.
Cheers
Have you considered using lapply instead of a loop over the list? Check it here...
EDIT: I'll try to elaborate a bit more... What happens if you do this:
myFunction <- function(x) {
  results <- SPC_XBAR(x, vol_n, seasonality)
  results <- results %>%
    cbind(Spec = x$CODE) %>%
    subset(`table_n` == 1)
  results <- results[order(results$tpd),]
  results$Date <- as.Date(cbind(Date = x$CENSUS_DATE))
  results
}
lapply(dat5split, myFunction)
I would expect it to return a list of the resulting data sets.
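If the end goal is a single table to push back into SQL Server, the list can also be collapsed afterwards. A rough sketch (my addition, assuming the dplyr package is available and the list elements share the same columns):
library(dplyr)

results_list <- lapply(dat5split, myFunction)

# Stack the per-CODE results into one data frame; .id keeps the source name
# (df10000, df10001, ...) as an extra column.
results_all <- bind_rows(results_list, .id = "source_df")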
Let's say I defined an R function that takes two numerics as inputs:
effectifTouche <- function(audience, extrapolated) {
  TM <- audience / 1000000
  VE <- extrapolated / 100
  TME <- TM * VE
  nbVis <- TME / 1000000.1
  return(nbVis)
}
It gives me back a score, so I would like to use it as a UDF on two columns of a SparkR DataFrame.
It was working in PySpark, and I was wondering how SparkR handles this.
So I tried many things in both sparklyr and SparkR, but I can't get this UDF working.
Ideally, I would love to just do this :
df %>%
  dapply(df_join,
         function(p) { effectifTouche(p$audience, p$extrapolated) })
effectifTouche being my R function, and audience and extrapolated being my two columns of the Spark DataFrame.
I will gladly take answers for both libraries, SparkR and sparklyr, because I have tried both and checked every single GitHub issue with no success.
Thanks a lot
Edit for another tricky use case
df %>%
  mutate(my_var = as.numeric(strptime(endHour, format = "%H:%M:%S"), unit = "secs"))
With simple arithmetic like this you're probably better off pushing the computation to Spark SQL, e.g.
df %>%
  mutate(TM = audience / 1000000,
         VE = extrapolated / 100,
         TME = TM * VE,
         nbVis = TME / 1000000.1)
If you actually need to use external R packages, we can help you better if you provide an example of df.
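For the edited strptime use case, one possibility is again to lean on Spark SQL functions instead of base R. A hedged sketch (my addition, untested; it assumes endHour is stored as an "HH:mm:ss" string and that the unrecognised unix_timestamp() call is passed straight through to Spark SQL):
df %>%
  # unix_timestamp("12:34:56", "HH:mm:ss") is parsed by Spark as that time on
  # 1970-01-01, so the result is roughly the number of seconds since midnight
  # (the session time zone can shift this, so verify on your cluster).
  mutate(my_var = unix_timestamp(endHour, "HH:mm:ss"))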
I'm trying to update an xml file with new nodes using xml2. It's easy if I just write everything manually as text,
oldXML <- read_xml("<Root><Trial><Number>3.14159 </Number><Adjective>Fast </Adjective></Trial></Root>")
but I'm developing an application that will run calculations and then put those values into the xml, so I need a mix of character and variables. It ends up looking like:
var1 <- 4.567
var2 <- "Slow"
newLine <- read_xml(paste0("<Trial><Number>", var1, " </Number><Adjective>", var2, " </Adjective></Trial>"))
xml_add_child(oldXML, newLine)
I suspect there's a much less kludgy way to do this than using paste0, but I can't get anything else to work. I'd like to be able to just instruct it to update the xml by reference to the dataframe, such that it can create new trials:
<Trial>
<Number>df$number[1]</Number>
<Adjective>df$adjective[1]</Adjective>
</Trial>
<Trial>
<Number>df$number[2]</Number>
<Adjective>df$adjective[2]</Adjective>
</Trial>
Is there any way to create new Trial nodes in approximately that fashion, or at least more naturally than using paste0 to insert variables? Is this something the XML package does better than xml2?
If you have your new values in a data.frame like this:
vars <- data.frame(Number = c(4.567, 3.211),
                   Adjective = c("Slow", "Slow"),
                   stringsAsFactors = FALSE)
you can convert it to a list of xml_document's as follows:
vars_xml <- lapply(purrr::transpose(vars),
                   function(x) {
                     as_xml_document(list(Trial = lapply(x, as.list)))
                   })
Then you can add the new nodes to the original xml:
for(trial in vars_xml) xml_add_child(oldXML, trial)
I don't know that this is better than your paste approach. Either way, you can wrap it in a function so you only have to write the ugly code once.
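For example, the whole thing could be wrapped up roughly like this (my sketch, assuming xml2 and purrr are attached and vars has the shape shown above):
library(xml2)
library(purrr)

add_trials <- function(doc, vars) {
  # one xml_document per row of vars
  vars_xml <- lapply(transpose(vars),
                     function(x) as_xml_document(list(Trial = lapply(x, as.list))))
  for (trial in vars_xml) xml_add_child(doc, trial)
  invisible(doc)  # xml2 modifies doc in place; return it invisibly for convenience
}

add_trials(oldXML, vars)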
Here's a solution that builds on #Ista's excellent answer. Basically, I've dropped the first lapply in favor of purrr::map (we could probably replace the second lapply with a map, but I couldn't find a more readable way to accomplish that).
library(purrr)
vars_xml <- transpose(vars) %>%
  map(~as_xml_document(list(Trial = lapply(.x, as.list))))
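The resulting nodes can then be attached exactly as in the previous answer, for instance with walk (my addition, assuming xml2 is still attached):
walk(vars_xml, ~xml_add_child(oldXML, .x))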
I have quite a large number of quite heavy datasets. I would like to extract a subset out of each of them and save it into a different CSV file (one per dataset). These are the commands I would like to loop over all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so any help would be much appreciated!
EDIT:
Following #SimonShine's suggestion, I have run this code and it works!
You don't specify whether you are trying to collect the subsets into one dataset or to make one file per subset. You refer to OPP_1985, which appears to be out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
  df <- read.csv(csvfile, header = FALSE, stringsAsFactors = TRUE, sep = "\t")
  df_short <- df[df$V6 == "OPP", ]
  csvfile_short <- gsub(".csv", "_short.csv", csvfile)
  write.csv(df_short, file = csvfile_short, row.names = FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles", full.names = TRUE)  # full.names so read.csv() can find the files
for (csvfile in csvfiles) {
extract_and_save_from_dataset(csvfile)
}
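If instead you wanted a single combined dataset (the other reading of the question), a rough sketch along the same lines (my addition, untested):
all_short <- do.call(rbind, lapply(csvfiles, function(csvfile) {
  df <- read.csv(csvfile, header = FALSE, stringsAsFactors = TRUE, sep = "\t")
  df[df$V6 == "OPP", ]
}))
write.csv(all_short, file = "OPP_all.csv", row.names = FALSE)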
The data.table approach is probably the fastest option, especially if you have a large dataset. The function fwrite{data.table} works in parallel using many CPUs, making it extremely fast.
Here is how you can divide your original data according to subgroups defined based on the values of df$V6 and save each subset into a separate .csv file.
library(data.table)
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
P.S. The files will be named output_*.csv, where * is the corresponding V6 value.
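A small end-to-end sketch of how this might look for the 1985 file from the question (my addition; fread is data.table's fast, multi-threaded reader and names headerless columns V1, V2, ...):
library(data.table)

df <- fread("1985.csv", header = FALSE, sep = "\t")
df[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]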
For an input data frame
input <- data.frame(col1 = seq(1, 10000), col2 = seq(1, 10000), col3 = seq(1, 10000), col4 = seq(1, 10000))
I have to run the following summaries, stored in another data frame:
summary <- data.frame(Summary_name = c('Col1_col2', 'Col3_Col4', 'Col2_Col3'),
                      ColIndex = c("1,2", "3,4", "2,3"))
#summary
Summary_name ColIndex
Col1_col2 1,2
Col3_Col4 3,4
Col2_Col3 2,3
I have the following function to run the aggregates
library(stringr)  # for str_split()

loopSum <- function(input, summary) {
  for (i in seq(1, nrow(summary))) {
    summary$aggregate[i] <- sum(input[, as.numeric(unlist(str_split(summary$ColIndex[i], ',')))])
  }
  return(summary)
}
My requirement is to run the sums used in loopSum in parallel, i.e. I would like to run all the summaries in one shot and thus reduce the total time taken to create them. Is there a way to do this?
My actual scenario requires me to create summary statistics over hundreds of columns for each Summary_name in the summary data frame, so I am looking for the most optimized way to do this. Any help is much appreciated.
Does it improve the running time?
library(tidyr)
input1 <- colSums(input)
summary1 <- separate(summary, "ColIndex", into=c("X1", "X2"), sep=",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]
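Since the real scenario involves hundreds of columns per Summary_name, here is a rough generalization of the same colSums idea that handles any number of indices per summary row (my addition, untested against the real data):
# Pre-compute each column's total once, then sum the relevant totals per row.
input1 <- colSums(input)

summary$aggregate <- vapply(
  strsplit(as.character(summary$ColIndex), ","),
  function(idx) sum(input1[as.numeric(idx)]),
  numeric(1)
)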