I am currently trying parallel computing in R.
I am trying to train a logistic ridge model, and I have 4 cores on my computer. I would like to split my data set equally into 4 pieces and use each core to train a model on its piece of the training data, then save the results of all cores into a single vector. The problem is that I have no clue how to do it. Right now I have tried to parallelize with the foreach package, but each core sees the same training data. Here is the code with the foreach package (which doesn't split the data):
library(ridge)
library(parallel)
library(foreach)
library(iterators) # provides icount()
num_of_cores <- detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata)/num_of_cores)
result <- data.frame()
# NB: no parallel backend is registered here, so %dopar% falls back to sequential execution
r <- foreach(i = icount(4), .combine = cbind) %dopar% {
  # every worker fits on the full data set; nothing is split here
  result <- logisticRidge(admit ~ gre + gpa + rank, data = mydata)
  coefficients(result)
}
Any idea how to split the data into x chunks and train the models on them in parallel?
How about something like this? It uses snowfall instead of the foreach library, but should give the same results.
library(snowfall)
library(ridge)
# for reproducibility
set.seed(123)
num_of_cores <- parallel::detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata)/num_of_cores)
# assign each row randomly to one of the workers via sampleid
mydata$sampleid <- sample(1:num_of_cores, nrow(mydata), replace = TRUE)
# create a small function that calculates the coefficients
regfun <- function(dat) {
  library(ridge) # this has to be inside the function, otherwise snowfall doesn't find logisticRidge()
  result <- logisticRidge(admit ~ gre + gpa + rank, data = dat)
  coefs <- as.numeric(coefficients(result))
  return(coefs)
}
# prepare the data
datlist <- lapply(1:num_of_cores, function(i){
  mydata[mydata$sampleid == i, ]
})
# initiate the cluster
sfInit(parallel = TRUE, cpus = num_of_cores)
# export the function to the workers (the data is passed as an argument below)
sfExport("regfun")
# calculate (sfClusterApply is very similar to sapply)
res <- sfClusterApply(datlist, function(datlist.element) {
  regfun(dat = datlist.element)
})
# stop the cluster
sfStop()
# convert the list to a data.frame. data.table::rbindlist(list(res)) does the same job
res <- data.frame(t(matrix(unlist(res), ncol = num_of_cores)))
names(res) <- c("intercept", "gre", "gpa", "rank")
res
# res
#   intercept          gre           gpa        rank
# 1 -3.002592 1.558363e-03  0.7048146997 -0.382462408
# 2 -4.142939 1.060692e-03  0.9978841880 -0.314589628
# 3 -2.967130 2.315487e-03  0.6797382218 -0.464219036
# 4 -1.176943 4.786894e-05 -0.0004576679 -0.007618317
The itertools package provides a number of functions for iterating over various data structures with foreach loops. In this case, you could use the isplitRows function to split the data frame row-wise into one chunk per worker:
library(ridge)
library(doParallel)
library(itertools)
num_of_cores <- detectCores()
cl <- makePSOCKcluster(num_of_cores)
registerDoParallel(cl)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
r <- foreach(d = isplitRows(mydata, chunks = num_of_cores),
             .combine = cbind, .packages = "ridge") %dopar% {
  result <- logisticRidge(admit ~ gre + gpa + rank, data = d)
  coefficients(result)
}
isplitRows also takes a chunkSize argument if you want to control the maximum size of each chunk.
Note that using this technique, each worker only receives an appropriate fraction of mydata. This is particularly important for larger data frames with a PSOCK cluster.
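For comparison, here is a minimal sketch of the same row-wise chunking done with base split() instead of itertools, reusing the cl, mydata and num_of_cores objects from above:
# cut() assigns each row a chunk index from 1 to num_of_cores;
# split() then yields a list of contiguous, roughly equal data frames
chunks <- split(mydata, cut(seq_len(nrow(mydata)), num_of_cores, labels = FALSE))
r2 <- foreach(d = chunks, .combine = cbind, .packages = "ridge") %dopar% {
  coefficients(logisticRidge(admit ~ gre + gpa + rank, data = d))
}
stopCluster(cl) # shut the workers down when finished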
The R scripts and a reduced version of the folder with the multiple csv-formatted datasets can all be found in my GitHub repository.
In my script called 'LASSO code', after loading a folder full of N csv-formatted datasets into R and assigning them all to a list called 'dfs', I ran the following code to fit N LASSO regressions, one to each of the datasets:
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(dfs, function(i)
  enet(x = as.matrix(select(i, starts_with("X"))),
       y = i$Y, lambda = 0, normalize = FALSE))
Now, I would like to replicate this same process for a Backward Elimination Stepwise Regression (we'll keep it simple by just using the step() function from the stats library) using another apply function rather than a loop. The problem is this: the only way I know how to do it is by initializing or prepping it before running it, by first establishing:
set.seed(100) # for reproducibility
full_fits <- vector("list", length = length(dfs))
Backward_Stepwise_fits <- vector("list", length = length(dfs))
And only then fitting all of the Backward_Stepwise_fits. But I cannot figure out how to put both full_fits and Backward_Stepwise_fits into the same apply function; the only way I can think of would be a for loop that stacks them on top of each other, which would be very computationally inefficient. And the number of datasets N I will be running both of these on is 260,000!
I wrote a for loop that does in fact run, but it took over 12 hours to finish on just 58,500 datasets, which is unacceptably slow.
The code I used for it is the following:
set.seed(100) # for reproducibility
for(i in seq_along(dfs)) {
  full_fits[[i]] <- lm(formula = Y ~ ., data = dfs[[i]])
  Backward_Stepwise_fits[[i]] <- step(object = full_fits[[i]],
                                      scope = formula(full_fits[[i]]),
                                      direction = 'backward', trace = 0)
}
I have tried the following, but get the corresponding error message in the Console:
> full_model_fits <- lapply(dfs, function(i)
+ lm(formula = Y ~ ., data = dfs))
Error in terms.formula(formula, data = data) :
duplicated name 'X1' in data frame using '.'
Ever thought about parallelizing the whole thing?
First, you could define the code more succinctly.
system.time(
  res <- lapply(lst, \(X) {
    full <- lm(Y ~ ., X)
    back <- step(full, scope = formula(full), dir = 'back', trace = FALSE)
  })
)
#    user  system elapsed
#   3.895   0.008   3.897
system.time(
  res1 <- lapply(lst, \(X) step(lm(Y ~ ., X), dir = 'back', trace = FALSE))
)
#    user  system elapsed
#   3.820   0.016   3.833
stopifnot(all.equal(res, res1))
The results are equal, but no time difference.
Now, using parallel::parLapply.
library(parallel)
CL <- makeCluster(detectCores() - 1L)
system.time(
  res2 <- parLapply(CL, lst, \(X) step(lm(Y ~ ., X), dir = 'back', trace = FALSE))
)
#    user  system elapsed
#   0.075   0.032   0.861
stopCluster(CL)
stopifnot(all.equal(res, res2))
On this machine about 4.5 times faster.
If we now want to run a forward step-wise selection using res2 as scope=, we need parallel::clusterMap, the multivariate variant of parLapply:
# CL <- makeCluster(detectCores() - 1L)
res3 <- clusterMap(CL, \(X, Y) step(lm(Y ~ 1, X), scope = formula(Y), dir = 'forw', trace = FALSE),
                   lst, res2)
# stopCluster(CL)
NB: This yielded the same coefficients as the for loop you showed in the comments:
stopifnot(all.equal(lapply(FS_fits, coef), unname(lapply(res3, coef))))
Your error duplicated name 'X1' in data frame using '.' means that some of your datasets contain two columns named "X1". Here's how to find them:
names(lst$dat6)[9] <- 'X1' ## producing duplicated column X1 for demo
sapply(lst, \(x) anyDuplicated(names(x)))
# dat1 dat2 dat3 dat4 dat5 dat6 dat7 dat8 dat9 dat10 dat11
# 0 0 0 0 0 9 0 0 0 0 0
# ...
The result shows that in dataset dat6 the 9th column is the (first) duplicate. All others are clean.
Data:
set.seed(42)
n <- 50
lst <- replicate(n, {
  dat <- data.frame(matrix(rnorm(500*30), 500, 30))
  cbind(Y = rowSums(as.matrix(dat) %*% rnorm(ncol(dat))) + rnorm(nrow(dat)), dat)
}, simplify = FALSE) |>
  setNames(paste0('dat', seq_len(n)))
I have a table with ~1M rows (each line is an insurance contract, i.e. one client can have multiple contracts) and the columns client_id, names and adresses.
The problem I am trying to solve is that the same client can have a different client_id for each new contract.
To resolve this I have done the following:
Create a New_ID as a 4th column in the table
Iterate twice over names and calculate the name similarity for each combination
Iterate twice over adresses and calculate the address similarity for each combination
Inside each iteration: if name_similarity > 0.9 & adresses_similarity > 0.85, then New_ID takes the value of j
Used packages + fake data:
library(tidyverse)
library(stringdist) # strings' similarities
library(parallel) # parallel programming
library(foreach) # parallel programming
library(doParallel) # parallel programming
library(doSNOW) # parallel programming
# Fake data
client_id <- 1:6
names <- c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee")
adresses <- c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss")
A <- data.frame(cbind(client_id, names, adresses)) %>%
  mutate(New_ID = NA)
Script
The nested for loops below work well:
for(i in seq_along(A$client_id)){
  for(j in seq_along(A$client_id)){
    # calculate name similarity
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate address similarity
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "qgram",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] = j # New ID
      }
    } # decision end
  } # close j loop
} # close i loop
Although the script above produces the expected result, it would take days to iterate over the real data (~1M rows). So I thought of parallel programming.
Parallel programming:
I have tried to nest two foreach loops using the %:% operator and run them in parallel with the %dopar% operator of the doParallel package.
cl <- makeCluster(detectCores()) # initiate the clusters (I have 8 cores on my local machine)
registerDoSNOW(cl) # register the parallel backend with foreach
clusterExport(cl, list("A")) # export the data to the workers
clusterEvalQ(cl, c(library(tidyverse),
                   library(stringdist))) # load the used packages on the workers
foreach(i = seq_along(A$client_id)) %:%
  foreach(j = seq_along(A$client_id)) %dopar% {
    # calculate name similarity
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate address similarity
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "osa",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] = j # New ID (this writes to the worker's copy of A, not the master's)
      }
    } # decision end
  }
stopCluster(cl)
However, after running the parallel nested foreach loops, the New_ID column is still empty.
I've tried to unlist() the result, since the foreach loop returns its values in a list, but that doesn't work either.
How can I write the nested parallel foreach to obtain the same result as the nested for loops? Thanks
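One way to restructure this, sketched here as an untested suggestion: foreach workers operate on copies of A, so assignments into A inside %dopar% never reach your session. Instead, return one New_ID per outer iteration and assign the combined vector afterwards; only the outer loop is parallelized, and as in the sequential version the last matching j wins:
new_ids <- foreach(i = seq_along(A$client_id), .combine = c,
                   .packages = "stringdist") %dopar% {
  id <- NA_integer_
  for (j in seq_along(A$client_id)) {
    name_similarity <- stringdist::stringsim(A$names[i], A$names[j],
                                             method = "osa", useBytes = TRUE)
    adresses_similarity <- stringdist::stringsim(A$adresses[i], A$adresses[j],
                                                 method = "qgram", useBytes = TRUE)
    if (name_similarity > 0.9 && adresses_similarity > 0.85) id <- j
  }
  id # returned to the master and combined with c()
}
A$New_ID <- new_ids
Note the pairwise comparison is still O(n^2), so parallelizing buys only a constant factor on ~1M rows.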
I am using R for an ML project. I have prepared the dataset and split the data into 10 equal splits, but the problem is I need to fit the model 10 times manually (10-fold CV). I have tried to create the train and test data using a for loop, but each time it runs, train is the whole dataset and test is NULL. Can someone help me, please?
# Preparing the data
data <- read.csv("./project.csv")
id <- seq(1:103342)
data[, 'id'] <- id
for (i in 3:8) {
  data[,i] <- as.factor(data[,i])
}
# splitting the data into 10 equal data frames
f <- rep(seq(1, 10), each=round(103342/10), length.out=103342)
df <- split(data, f)
lapply(df, dim)
# running 10-fold cross-validation and computing error rate and AUC for each run.
results <- matrix(nrow=10, ncol=2, dimnames= list(c(), c('error_rate', 'auc')))
for (i in 1:10) {
  train <- data[!(data$id %in% df$`i`$id),]
  test <- df$`i`
  print(dim(test)) # Here is my problem: the print statement prints NULL 10 times
  glm.fit <- glm(canceled ~ ., data=train, family=binomial)
  glm.prob <- predict(glm.fit, newdata=test, type="response")
  ...
}
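A sketch of the likely fix: df$`i` looks up an element literally named "i", which doesn't exist, so test is NULL and train keeps every row. Indexing by position with [[ ]] (or by name with as.character(i), since split() names the pieces "1" to "10") resolves it:
for (i in 1:10) {
  test  <- df[[i]]                          # the i-th fold
  train <- data[!(data$id %in% test$id), ]  # everything else
  print(dim(test))                          # now prints the fold's dimensions
  glm.fit  <- glm(canceled ~ ., data = train, family = binomial)
  glm.prob <- predict(glm.fit, newdata = test, type = "response")
  # ... compute the error rate and AUC as before
}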
I am currently working on a for loop in R. If I run the for loop on my own data it takes ages, and I believe that is because I did something inefficient in my code. Could you please help me improve it?
# Loop through the samples, explaining one instance at a time.
shap_values <- vector("list", nrow(X)) # initialize the results list.
system.time({
  for (i in seq_along(shap_values)) {
    set.seed(224)
    shap_values[[i]] <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)$results
    shap_values[[i]]$predicted_value <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)$y.hat.interest
    shap_values[[i]]$sample_num <- i # identifier to track our instances.
  }
  data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
I believe that my problem is in the shap_values[[i]]$sample_num variable, since there I am redoing the calculations of the previous shap_values[[i]]$predicted_value variable. The reason I added that variable is that I needed the $y.hat.interest as part of the new data frame (which is called "shap_values" and later "data_shap_values").
REPRODUCIBLE EXAMPLE (starts at "THIS IS THE IMPORTANT PART"):
#Example Shapley
#https://cran.r-project.org/web/packages/iml/vignettes/intro.html
data("Boston", package = "MASS")
head(Boston)
set.seed(42)
#install.packages("iml")
library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)
# We create a Predictor object, that holds the model and the data.
# The iml package uses R6 classes: New objects can be created by calling Predictor$new()
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(rf, data = X, y = Boston$medv)
# Feature Importance
## Shifting each feature, and measuring how much the performance drops ##
imp = FeatureImp$new(predictor, loss = "mae")
plot(imp)
# Shapley value. Assume that for one data point, the feature values play a game together, in which
# they get the prediction as payout. Tells us how to fairly distribute the payout among the feature values.
View(X)
shapley = Shapley$new(predictor, x.interest = X[1,])
shapley$plot()
# Reuse the object to explain other data points
shapley$explain(x.interest = X[2,])
shapley$plot()
# Results in data.frame form can be extracted like this:
results = shapley$results
head(results)
# THIS IS THE IMPORTANT PART:
# It might make sense for testing, to reduce the data:
X = X[1:10,]
# Loop through the samples, explaining one instance at a time.
shap_values <- vector("list", nrow(X)) # initialize the results list.
system.time({
  for (i in seq_along(shap_values)) {
    set.seed(224)
    shap_values[[i]] <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)$results
    shap_values[[i]]$predicted_value <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)$y.hat.interest
    shap_values[[i]]$sample_num <- i # identifier to track our instances.
  }
  data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
Update
As requested by @Ralf Stubner, I have profiled the for loop.
You are doubling your run-time by calling iml::Shapley$new twice with identical parameters. As an alternative, you can create the object once and extract the two values:
system.time({
  for (i in seq_along(shap_values)) {
    set.seed(224)
    shapley <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)
    shap_values[[i]] <- shapley$results
    shap_values[[i]]$predicted_value <- shapley$y.hat.interest
    shap_values[[i]]$sample_num <- i # identifier to track our instances.
  }
  data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
If you have sufficient RAM to store your data multiple times, you might also try parallel processing using parallel, foreach or future.apply.
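For example, here is a minimal sketch with future.apply, reusing predictor and X from the reproducible example. Note that future_lapply's future.seed argument replaces the per-iteration set.seed() call with parallel-safe RNG streams, so the sampled values will differ from the sequential run:
library(future.apply)
plan(multisession) # one background R session per core
shap_values <- future_lapply(seq_len(nrow(X)), function(i) {
  shapley <- iml::Shapley$new(predictor, x.interest = X[i, ], sample.size = 30)
  res <- shapley$results
  res$predicted_value <- shapley$y.hat.interest
  res$sample_num <- i # identifier to track our instances
  res
}, future.seed = 224) # reproducible parallel RNG
data_shap_values <- dplyr::bind_rows(shap_values)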
A critical difference (CD) plot for comparing classifiers over multiple data sets (Demšar 2006) can be generated with the mlr package like this:
# THIS WORKS
library(mlr)
lrns = list(makeLearner("classif.knn"), makeLearner("classif.svm"))
tasks = list(iris.task, sonar.task)
rdesc = makeResampleDesc("CV", iters = 2L)
meas = list(acc)
bmr = benchmark(lrns, tasks, rdesc, measures = meas)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
This requires the evaluation results to reside in a rather complex BenchmarkResult object, although the data is basically a matrix (where M[i, j] holds the score of classifier i for data set j).
I have previously generated such data in a Python workflow and imported it into R as a data.frame (as there seems to be no Python package for such plots).
How can I generate a CD plot from this data?
I thought about creating a BenchmarkResult from the data.frame, but didn't know where to start:
# THIS DOES NOT WORK
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results = data.frame(replicate(5, runif(30, 0, 1)))
# This is the functionality I'm looking for
bmr = benchmarkResultFromDataFrame(results)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
I finally managed to create the plot. It is necessary to set only a handful of the BenchmarkResult's attributes:
learners with id and short.name for each classifier
measures
results with aggr for each dataset/classifier combination
The code may then look like this (smaller example of 5 datasets):
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results <- data.frame(replicate(5, runif(30, 0, 1)))
clf <- c('clf1', 'clf2', 'clf3', 'clf4', 'clf5')
clf.short.name <- c('c1', 'c2', 'c3', 'c4', 'c5')
dataset <- c('dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5')
score <- list(acc)
# Setting up the learners: id, short.name
bmr <- list()
for (i in 1:5){
  bmr$learners[[clf[i]]]$id <- clf[i]
  bmr$learners[[clf[i]]]$short.name <- clf.short.name[i]
}
# Setting up the measures
bmr$measures <- list(acc)
# Setting up the results
for (i in 1:5){
  bmr$results$`dataset1`[[clf[i]]]$aggr <- list('acc.test.mean' = results[1, i])
}
for (i in 1:5){
  bmr$results$`dataset2`[[clf[i]]]$aggr <- list('acc.test.mean' = results[2, i])
}
for (i in 1:5){
  bmr$results$`dataset3`[[clf[i]]]$aggr <- list('acc.test.mean' = results[3, i])
}
for (i in 1:5){
  bmr$results$`dataset4`[[clf[i]]]$aggr <- list('acc.test.mean' = results[4, i])
}
for (i in 1:5){
  bmr$results$`dataset5`[[clf[i]]]$aggr <- list('acc.test.mean' = results[5, i])
}
# Set BenchmarkResult class
class(bmr) <- "BenchmarkResult"
# Statistics and plot
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
Anyone who could teach me better R to avoid these for loops and code duplication would still be very welcome!
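As a sketch of one such improvement: the five result loops can be collapsed into a single nested loop over the dataset and clf vectors defined above, which should fill bmr$results identically:
for (d in seq_along(dataset)) {
  for (i in seq_along(clf)) {
    bmr$results[[dataset[d]]][[clf[i]]]$aggr <- list('acc.test.mean' = results[d, i])
  }
}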