Is there a way to parallelize summary functions running over loop? - r

For an input data frame
input<-data.frame(col1=seq(1,10000),col2=seq(1,10000),col3=seq(1,10000),col4=seq(1,10000))
I have to run the following summaries, stored in another data frame:
summary<-data.frame(Summary_name=c('Col1_col2','Col3_Col4','Col2_Col3'),
ColIndex=c("1,2","3,4","2,3"))
#summary
Summary_name ColIndex
Col1_col2 1,2
Col3_Col4 3,4
Col2_Col3 2,3
I have the following function to run the aggregates:
library(stringr)

loopSum <- function(input, summary){
  for(i in seq(1, nrow(summary))){
    summary$aggregate[i] <- sum(input[, as.numeric(unlist(str_split(summary$ColIndex[i], ',')))])
  }
  return(summary)
}
My requirement is to run the sums used in loopSum in parallel, i.e. I would like to compute all the summaries in one shot and so reduce the total time the function takes to create them. Is there a way to do this?
My actual scenario requires me to create summary statistics over hundreds of columns for each Summary_name in the summary data frame, so I am looking for the most optimized way to do this. Any help is much appreciated.

Does it improve the running time?
library(tidyr)
input1 <- colSums(input)
summary1 <- separate(summary, "ColIndex", into=c("X1", "X2"), sep=",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]
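If the goal is specifically to run the per-row sums in parallel (for example when each Summary_name spans hundreds of columns), here is a minimal sketch with the parallel package, reusing the input and summary objects from the question; the mc.cores value is illustrative, and on Windows you would use parLapply with a cluster instead of mclapply:
library(parallel)
library(stringr)

# One worker per summary row; mclapply forks on Unix-alikes.
summary$aggregate <- unlist(mclapply(seq_len(nrow(summary)), function(i) {
  cols <- as.numeric(unlist(str_split(summary$ColIndex[i], ",")))
  sum(input[, cols])
}, mc.cores = 2))
For sums alone, though, the vectorized colSums approach above will usually beat forking workers, since the per-row work is tiny.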

Related

Running parallel function calls with sparklyr

Currently, I am using a foreach loop from the doParallel library to run function calls in parallel across multiple cores of the same machine, which looks something like this:
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)  # register a parallel backend first

out_results <- foreach(i = 1:length(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}
Here some_list is a list of data frames, and each data frame can have a different number of columns. function_call() is a function that does several things to the data, such as data manipulation, then random forest for variable selection, and finally a least squares fit. The variable out is again a list of 3 data frames, and out_results will be a list of lists.
Inside the function call I use CRAN libraries and some custom libraries I created; I want to avoid using Spark ML libraries due to their limited functionality and the rewriting of the entire code this would require.
I want to leverage Spark for running these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of documentation from sparklyr, but it doesn't seem to help much since the examples provided there are very straightforward.
SparklyR's homepage gives examples of arbitrary R code distributed on the Spark cluster. In particular, see their example with grouped operations.
Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):
library(dplyr)
library(sparklyr)

some_list = list(tibble(a = 1[0]), tibble(b = 1), tibble(c = 1:2))
all_data = tibble(i = seq_along(some_list), df = some_list)

# Replace this with your actual code.
# Should get one data frame and produce one data frame.
# Embedded data frame columns are OK.
transform_one = function(df_wrapped) {
  # in your example, you expect only one record per group
  stopifnot(nrow(df_wrapped) == 1)
  df = df_wrapped$df
  res0 = df
  res1 = tibble(x = 10)
  res2 = tibble(y = 10:11)
  return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}

# all_data has to live in Spark first, e.g. via copy_to(sc, all_data)
all_data %>% spark_apply(
  transform_one,
  group_by = c("i"),
  columns  = c("res0" = "list", "res1" = "list", "res2" = "list"),
  packages = c("randomForest", "etc")
)
All in all, this approach seems unnatural, as if we were forcing the use of Spark on a task which does not really fit. Maybe you should check for another parallelization framework?
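If Spark is not a hard requirement, one such framework is future.apply, which would keep the existing function_call untouched. A minimal sketch, assuming some_list and function_call from the question:
library(future.apply)

plan(multisession, workers = 4)  # or plan(cluster, ...) to span several machines

out_results <- future_lapply(some_list, function_call)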

R create a loop for a list of data frames

I currently have a list made up of around 80+ data frames. What I would like to do is loop a chunk of code over each individual data frame within the list, without naming each one individually or splitting them into individual data frames to work on.
Currently I split the list into each individual data frame using the below code:
dat5split <- setNames(split(dat5, dat5$CODE), paste0("df", unique(dat5$CODE)))
list2env(dat5split, globalenv())
I then work through each data frame individually:
# call in SPC function and write to 'results10000'
results10000 <- SPC_XBAR(df10000, vol_n, seasonality)
results10000 <- results10000 %>%
  cbind(Spec = df10000$CODE) %>%
  subset(`table_n` == 1)
results10000 <- results10000[order(results10000$tpd),]
results10000$Date <- as.Date(cbind(Date = df10000$CENSUS_DATE))

# call in SPC function and write to 'results10001'
results10001 <- SPC_XBAR(df10001, vol_n, seasonality)
results10001 <- results10001 %>%
  cbind(Spec = df10001$CODE) %>%
  subset(`table_n` == 1)
results10001 <- results10001[order(results10001$tpd),]
results10001$Date <- as.Date(cbind(Date = df10001$CENSUS_DATE))
Currently I call the function 'SPC_XBAR', with vol_n and seasonality set earlier in the code. The script passes the values to the function, which assigns the results to 'results10000', 'results10001', etc. I then do a small bit of data wrangling on each newly created data frame before feeding the results back into SQL Server at the end.
As you can see each one is being individually hard coded which is not efficient.
What I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually.
I believe a loop would solve this issue but I am a little inexperienced when it comes to the ability to create a loop around it. Any advice would be much appreciated.
Cheers
Have you considered using lapply to iterate over the list instead of a loop? Check it here...
EDIT: I'll try to elaborate a bit more... What happens if you do this:
myFunction <- function(x) {
  results <- SPC_XBAR(x, vol_n, seasonality)
  results <- results %>%
    cbind(Spec = x$CODE) %>%
    subset(`table_n` == 1)
  results <- results[order(results$tpd),]
  results$Date <- as.Date(cbind(Date = x$CENSUS_DATE))
  results
}
lapply(dat5split, myFunction)
I would expect it to return a list of the resulting datasets
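If the per-CODE results then need to go back to SQL Server as a single table, the list that lapply returns can be stacked afterwards. A small sketch, assuming dplyr is available and that each element already carries its Spec column:
library(dplyr)

results_list <- lapply(dat5split, myFunction)

# Keep the list names (df10000, df10001, ...) as an id column while stacking.
all_results <- bind_rows(results_list, .id = "source_df")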

Saving output of for-loop for every iteration

I am currently working on an imputation project where I need to evaluate my methods of imputation. I have my incomplete data frame with NAs, from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases, which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. The data frame with the generated NAs gets stored in the object "result", as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put the loop that generates the NAs inside another loop that contains the replicate() command, counts from 1:100, and saves these 100 replicated data frames, but it didn't work at all.
result = data.frame(res0 = rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
  dat = comp_cas[,i]
  missRate = Z32_miss_item$miss_per_item[i]
  cat(i, " ", paste0(dat, collapse=","), " ", missRate, "!\n")
  df <- data.frame("res" = GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
  colnames(df) = gsub("res", paste0("Var", i), colnames(df))
  result = cbind(result, df)
}
result = result[,-1]
I expect every data frame of the 100 runs to get saved in a separate .rda file in my project folder.
Also, is imputation and evaluating how well it works beginner stuff in R, or at what level of proficiency am I, judging by the code that I posted?
It is difficult to guess exactly what you are doing without some dummy data, but it is fine to have loops within loops and to save data frames. Firstly, I would avoid the replicate function here, as it has a strange syntax, and stick with plain loops. Secondly, make sure the loops use different indexes (i.e. for(i ... should be surrounded by, say, for(j ...), since loop variables are not scoped to their loop in R. Finally, use saveRDS rather than save, as you can then save each object (data frame) in a separate .rds file. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
  df <- data.frame(x = rnorm(5))
  names(df) <- paste0("x", i)
  df
}

for(j in 1:100){
  res <- data.frame(id = 1:5)
  for(i in 1:10){
    res <- cbind(res, fun(i))
  }
  saveRDS(res, sprintf("replication_%s.rds", j))
}
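Applied to the objects from the question (comp_cas, Z32_miss_item, GenMiss), the same pattern might look roughly like this, assuming GenMiss draws a fresh set of NAs on every call:
for (j in 1:100) {
  result <- data.frame(res0 = rep(NA, dim(comp_cas)[1]))
  for (i in 1:length(Z32_miss_item$miss_per_item)) {
    dat <- comp_cas[, i]
    missRate <- Z32_miss_item$miss_per_item[i]
    df <- data.frame("res" = GenMiss(x = dat, missrate = missRate), stringsAsFactors = FALSE)
    colnames(df) <- gsub("res", paste0("Var", i), colnames(df))
    result <- cbind(result, df)
  }
  result <- result[, -1]
  # one .rds file per replication, e.g. result_001.rds ... result_100.rds
  saveRDS(result, sprintf("result_%03d.rds", j))
}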

R - Creating subsets of several datasets in a loop

I have quite a large number of quite heavy datasets. I would like to extract a subset out of each of them and save it into a different csv file (one per dataset). These are the commands I would like to loop over all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would appreciate a lot help with this!
EDIT:
Following #SimonShine's suggestion, I have run this code and it works!
You don't specify if you are trying to collect the subsets into one dataset, or if you are trying to make one file per subset. You refer to OPP_1985, which appears to be out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
  df <- read.csv(csvfile, header=FALSE, stringsAsFactors=TRUE, sep="\t")
  df_short <- df[df$V6 == "OPP",]
  csvfile_short <- gsub(".csv", "_short.csv", csvfile)
  write.csv(df_short, file=csvfile_short, row.names=FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles", full.names = TRUE)  # full paths so read.csv can find the files
for (csvfile in csvfiles) {
  extract_and_save_from_dataset(csvfile)
}
The data.table approach is probably the fastest option, especially if you have a large dataset. The function fwrite{data.table} works in parallel using multiple CPUs, making it extremely fast.
Here is how you can divide your original data according to subgroups defined based on the values of df$V6 and save each subset into a separate .csv file.
library(data.table)
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
P.S. The files will be named output_*.csv, where * is the corresponding V6 value.
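Combining this with the original single-file snippet, here is a sketch of the whole per-year loop with fread/fwrite; the folder path is a placeholder, and the V6/"OPP" filter comes from the question:
library(data.table)

csvfiles <- list.files("path/to/folder", pattern = "\\.csv$", full.names = TRUE)

for (f in csvfiles) {
  df <- fread(f, sep = "\t", header = FALSE)
  # keep only the OPP rows, as in the original snippet, and write one file per input
  fwrite(df[V6 == "OPP"], file.path(dirname(f), paste0("OPP_", basename(f))))
}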

R for loop to summarize matrix of data

New user to R (like 2 days of use new) and coming from MATLAB, the syntax nuances are driving me a little crazy. If anyone can point me in a direction on this topic I would really appreciate it. I have this dataset (fl1.back) that has 32 variables (columns) and 513 measurements (rows), and I want to create a table with basic stats for 9 of the 32 columns of data. There's a separate dataset (fl2.back) from which I would also like to pull 1 column of data for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- (sum(back.sod.rep$P00930>=20)/(nrow(back.sod.rep)))
back.sod$P00930
length(back.sod[back.sod$P00930>back.sod.summ$WQ_crit,"P00930"])
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting/copying and pasting the above code to create the data frame final, I would like to loop over the two datasets since I'm looking to repeat the same task, just on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!
Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. Anyway, you probably should use (as I show above) the "split-apply-combine" approach as implemented in various functions and packages.
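Closer to the table built in the question, here is a sketch that loops over a vector of parameter codes and produces one summary row per column; the codes in params are placeholders, and the detect-flag filtering (the R00930 != "<" pieces) could be added inside the helper in the same way:
# Parameter columns to summarise (placeholders; use your own column codes)
params <- c("P00930", "P00940", "P00945")

summarise_param <- function(p, dat) {
  # fl1.back was read with colClasses="character", so convert after dropping blanks
  vals <- as.numeric(dat[dat[[p]] != "", p])
  data.frame(Compound = p,
             n        = length(vals),
             min      = min(vals),
             max      = max(vals),
             mean     = mean(vals),
             median   = median(vals))
}

final <- do.call(rbind, lapply(params, summarise_param, dat = fl1.back))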
