Scoping issue when using doParallel in R

I am trying to estimate multiple nonparametric models using the doParallel package. My problem, though, seems to be related to the np package.
Take a look at this reproducible example:
library(np)
library(doParallel)
df <- data.frame(Y = runif(100, 0, 10), X = rnorm(100))
models <- list(as.formula(Y ~ X))
npestimate <- function(m, data) {
  LCLS <- npregbw(m, data = data, regtype = "lc", bwmethod = "cv.ls")
  LLLS <- npregbw(m, data = data, regtype = "ll", bwmethod = "cv.ls")
  # sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  return(list(LCLS = LCLS, LLLS = LLLS))
}
cl <- makeCluster(length(models))
registerDoParallel(cl)
results <- foreach(m = models, .packages = "np", .verbose = TRUE) %dopar%
  npestimate(m, data = df)
stopCluster(cl)
As you can see, I created a function called npestimate() to compute different things for each model. I commented out one line where I want to run significance tests using npsigtest. Usually, npsigtest finds the data by looking in the environment in which npregbw was called.
But that does not work here. I am not sure why, but npsigtest simply cannot find the data used in the two lines of code right above.
The data is automatically exported to the nodes, so using .export in foreach is redundant.
Any suggestions how to make this work?

npsigtest pretty much copies the approach used by lm and the functions for lm objects, and thus has the same potential scoping pitfalls. The issue is the environment associated with the formula:
environment(models[[1]])
#<environment: R_GlobalEnv>
It's easy to fix:
npestimate <- function(m, data) {
  environment(m) <- environment()
  LCLS <- npregbw(m, data = data, regtype = "lc", bwmethod = "cv.ls")
  LLLS <- npregbw(m, data = data, regtype = "ll", bwmethod = "cv.ls")
  sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  return(list(LCLS = LCLS, LLLS = LLLS, sigt = sigt))
}
I actually often prefer eval(bquote()) constructs because of such issues.
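To illustrate, here is a minimal sketch of that construct, under the assumption that npsigtest re-evaluates the recorded call (untested with np, so treat it as an illustration of the idea rather than a verified fix):
npestimate <- function(m, data) {
  # Splicing the formula and data into the call with .() makes the recorded
  # call self-contained, so later re-evaluation does not depend on where m
  # was originally created:
  LCLS <- eval(bquote(npregbw(.(m), data = .(data), regtype = "lc", bwmethod = "cv.ls")))
  LLLS <- eval(bquote(npregbw(.(m), data = .(data), regtype = "ll", bwmethod = "cv.ls")))
  sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  list(LCLS = LCLS, LLLS = LLLS, sigt = sigt)
}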

Related

Why does lm keep the whole environment when called within a function

If you call lm or glm inside a function, the returned model object keeps a reference to the function's entire evaluation environment.
Example:
fit_lm = function(dt){
  # Do some heavy data processing
  tmp = data.frame(x = rnorm(10000000))
  # fit and return model
  return(lm(y ~ x, data = dt))
}
dt = data.frame(x = runif(100))
dt$y = 4 * dt$x + rnorm(100, sd = 0.5)
fit = fit_lm(dt)
If I look at the environment attr(fit$terms, ".Environment"), it contains the data used for model fitting, dt, but it also contains the data frame tmp, even though tmp was not used by lm.
Does anyone know why?
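As in the doParallel answer above, the culprit is the environment attached to the formula: y ~ x is created inside fit_lm, so it captures the function's evaluation environment, and lm stores that environment with the terms. A hedged sketch of one workaround (fit_lm2 is a made-up name, and rm(tmp) before the fit works just as well):
fit_lm2 = function(dt){
  tmp = data.frame(x = rnorm(10))  # stand-in for the heavy intermediate data
  fmla = y ~ x
  # The formula captured the current environment, which holds tmp.
  # Rebinding it to a fresh, empty environment drops that reference:
  environment(fmla) = new.env(parent = globalenv())
  lm(fmla, data = dt)
}
fit2 = fit_lm2(dt)
ls(attr(fit2$terms, ".Environment"))
# character(0)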

How to speed up parallel foreach in R

I want to calculate a series of approximately 1,000,000 wilcox.tests in R:
result <- foreach(i = 1:ncol(data), .combine = bind_rows, .multicombine = TRUE, .maxcombine = 1000) %do% {
  w = wilcox.test(data[, i] ~ as.factor(groups), exact = FALSE)
  df <- data.frame(Characters = character(),
                   Doubles = double(),
                   Doubles = double(),
                   stringsAsFactors = FALSE)
  df[1, ] = c(colnames(data)[i], w$statistic, w$p.value)
  rownames(df) = colnames(beta_t1)[i]
  colnames(df) = c("cg", "statistic", "p.value")
  return(df)
}
If I do it with %dopar% and 15 cores, it is slower than with single-core %do%.
I suspect it is a memory access problem; my processors are hardly running at capacity either. Is it possible to split the data dataframe into chunks, have each processor calculate 100K tests, and then add the results together? How can I speed up this foreach loop?
One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:
data.frame(
  cg = colnames(data)[i],
  statistic = w$statistic,
  p.value = w$p.value,
  row.names = colnames(beta_t1)[i],
  stringsAsFactors = FALSE
)
However, the bigger problem is that after the loop has run, foreach has to row-bind all these data.frames, and that operation is slow. It's more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be provided afterwards, and then don't require subsetting and re-concatenation).
That is, change your code to
result = foreach(col = data) %do% {
  w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
  list(w$statistic, w$p.value)
}

# Combine `result` and transform it into a data.frame:
results = data.frame(
  cg = colnames(data),
  statistic = vapply(result, `[[`, double(1L), 1L),
  p.value = vapply(result, `[[`, double(1L), 2L),
  row.names = colnames(beta_t1),
  stringsAsFactors = FALSE  # only necessary for R < 4.0!
)
(I never use foreach, so I'm not exactly sure how to use it here, but the above should roughly work; otherwise try mclapply from the 'parallel' package, which does the same thing using the familiar syntax of lapply.)
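For reference, here is a rough mclapply equivalent (a sketch assuming data has one column per test; note that mc.cores is ignored on Windows, where parLapply would be needed instead):
library(parallel)

result <- mclapply(data, function(col) {
  w <- wilcox.test(col ~ as.factor(groups), exact = FALSE)
  list(w$statistic, w$p.value)
}, mc.cores = 15L)
# then assemble the data.frame from `result` exactly as shown above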

Plot multiple ROC curves using a for loop

I need to plot a number of different ROC curves on a single plot. To avoid manually creating each ROC curve, I have created a for loop that automates the process. However, for some reason, the code only outputs a single curve, for the last model in the list of names. Can anyone help me figure out why it's not working? Please see below for a reproducible example:
library(pROC)
library(tidyverse)
dat_tst_2 <- data.frame(result = sample(letters[1:2], 100, replace = TRUE))
preds_1 <- data.frame(x = runif(100),
                      y = runif(100))
preds_2 <- data.frame(x = runif(100),
                      y = runif(100))
names_preds <- c("preds_1", "preds_2")
output <- list()
for (j in 1:length(names_preds)) {
  for (i in names_preds) {
    roc_model <- roc(response = dat_tst_2$result,
                     predictor = eval(as.name(i))[, 2],
                     levels = c("a", "b"),
                     plot = FALSE)
    output[[j]] <- roc_model
  }
}
ggroc(output)
First make sure output has multiple items using str(output). Then try passing each item in output to ggroc:
lapply(output, function(out) {
  png()
  print(ggroc(out))
  dev.off()
})
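One likely reason for the single curve, judging from the code above: the inner loop runs over all of names_preds for every j and keeps overwriting output[[j]], so every list element ends up holding the curve for the last model. A single loop should populate output correctly (a sketch, untested):
output <- list()
for (i in names_preds) {
  # store each ROC curve under its model's name instead of overwriting
  output[[i]] <- roc(response = dat_tst_2$result,
                     predictor = eval(as.name(i))[, 2],
                     levels = c("a", "b"),
                     plot = FALSE)
}
ggroc(output)  # one curve per model on a single plot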

Rhadoop mapreduce for multiple input files

I'm building a mapreduce program in R that extracts the relevant features from a set of features in a dataset using a genetic algorithm. I need to pass many files as input to my mapreduce job. My code below is my mapreduce program, but it works only for one input file (data.csv).
library(caret)
library(dplyr)
library(rmr2)
Sys.setenv(HADOOP_CMD="/home/rania/hadoop-2.7.3/bin/hadoop")
Sys.getenv("HADOOP_CMD")
Sys.setenv(HADOOP_STREAMING="/home/rania/hadoop-streaming-2.7.3.jar")
library(rhdfs)
hdfs.init()
rmr.options(backend = "hadoop")
hdfs.mkdir("/user/rania/genetic")
hdfs.mkdir("/user/rania/genetic/data")
I put my files in one folder in HDFS:
hadoop fs -copyFromLocal /home/rania/Downloads/matrices/*.csv /user/rania/genetic/data/
This is the map function:
mon.map <- function(., data) {
  data <- read.csv("/home/rania/Downloads/dataset.csv", header = TRUE, sep = ";")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)
  ga_ctrl <- gafsControl(functions = rfGA,  # assess fitness with RF
                         method = "cv")     # 10-fold cross-validation
  set.seed(10)
  lev <- c("1", "0")
  rf_ga3 <- gafs(x = data, y = y,
                 iters = 10,   # 10 generations of the algorithm
                 popSize = 4,  # population size for each generation
                 levels = lev,
                 gafsControl = ga_ctrl)
  keyval(rf_ga3$ga$final, data[names(data) %in% rf_ga3$ga$final])
}
This is the reduce function:
mon.reduce <- function(k, v) {
  keyval(k, v)
}
Now I apply the mapreduce job:
hdfs.root = 'genetic'
hdfs.data = file.path(hdfs.root, 'data')
hdfs.out = file.path(hdfs.root, 'out')
csv.format <- make.output.format("csv")
genetic = function(input, output) {
  mapreduce(input = input, output = output,
            input.format = "csv", output.format = csv.format,
            map = mon.map, reduce = mon.reduce)
}
out = genetic(hdfs.data, hdfs.out)
Then we print the result from HDFS:
results <- from.dfs(out, format = "csv")
print(results)
or:
hdfs.cat("/genetic/out/part-00000")
I tried to change the map function to make it work for many files, but it failed:
mon.map <- function(., data) {
  data <- list.files(path = "/home/rania/Downloads/matrices/", full.names = TRUE, pattern = "\\.csv") %>%
    lapply(read.csv, header = TRUE, sep = ",")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)
  for (i in 1:4) {
    ga_ctrl <- gafsControl(functions = rfGA,  # assess fitness with RF
                           method = "cv")     # 10-fold cross-validation
    set.seed(10)
    lev <- c("1", "0")
    rf_ga3 <- gafs(x = data[[i]], y = y,
                   iters = 10,   # 10 generations of the algorithm
                   popSize = 4,  # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
  }
  keyval(rf_ga3$ga$final, do.call(cbind, Map(`[`, data, c(rf_ga3$ga$final))))
}
What can I change in the previous map function to make it work for many input files? Thanks!
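Not a full answer, but one observation that may matter: both versions of mon.map ignore the data argument and re-read files from the local filesystem, so the input folder copied to HDFS is never actually used. As far as I understand rmr2, each map call already receives a parsed chunk of the configured input. A heavily hedged sketch of the shape that implies (untested):
mon.map <- function(., data) {
  # `data` is the chunk rmr2 parsed from the HDFS input (input.format = "csv"),
  # so work on it directly instead of calling read.csv() on a local path.
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)  # must match the chunk's rows
  ga_ctrl <- gafsControl(functions = rfGA, method = "cv")
  set.seed(10)
  rf_ga3 <- gafs(x = data, y = y, iters = 10, popSize = 4,
                 levels = c("1", "0"), gafsControl = ga_ctrl)
  keyval(rf_ga3$ga$final, data[names(data) %in% rf_ga3$ga$final])
}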

Using lapply and !is.na to subset list vectors in R

I'm trying to apply the solution I found here to generate machine learning models:
Best way to name objects programmatically using R?
Here's a dummy data set:
data_pred <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 21:30)
data_resp <- data.frame(y1 = c(1:5, NA, 7:10), y2 = c(NA, 2, NA, 4:10))
Here was my for() loop method of modeling the predictors in data_pred on each individual column of measured responses in data_resp using the caret package:
# data_pred contains predictors
# data_resp contains one column per measurement
# 1 matching row per observation in both data_pred and data_resp
for (i in 1:ncol(data_resp)) {
  train(x = data_pred[!is.na(data_resp[, i]), ],
        y = data_resp[!is.na(data_resp[, i]), i],
        ...)
}
Now I'm trying to do the same with lapply, which I think has numerous advantages. I'm having an issue with translating the !is.na() criteria on the fly so that I'm only modeling with non-NA cases for each response. Here was my initial function to test the lapply method:
rf_func <- function(y) {
  train(x = data_pred,
        y = y,
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl)
}
Then create an empty list to store results and apply the function to data_resp:
models <- list(NULL)
models$rf <- lapply(as.list(data_resp), rf_func)
That works fine since randomForest can handle NAs, but other methods cannot, so I need to remove those rows from each data_resp element as well as the corresponding rows from my predictors.
I tried this without success:
train(x = data_pred_scale[!is.na(y), ],
      y = y[!is.na(y)],
      ...)
I also tried y[[!is.na(y)]]
How do I translate the data.frame method (df[!is.na(df2), ]) to lapply?
There are several different ways to go about it. A simple approach is an anonymous function:
lapply(data_resp, function(x) rf_func(x[!is.na(x)]))
In fiddling around quite a bit with a single element of my as.list(data_frame) to simulate what lapply would be passing, I came up with this, which I think is working:
rf_func <- function(y) {
  train(x = data_pred_scale[!(unlist(lapply(y, is.na))), ],
        y = y[!(unlist(lapply(y, is.na)))],
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl)
}
models$rf <- lapply(as.list(data_resp), rf_func)
It does seem to be working. I [hackishly] compared the non-NA data set to the trainingData results in each caret model like so:
nas <- NULL
for (i in 1:ncol(data_resp)) {
  nas <- c(nas, length(data_resp[!is.na(data_resp[, i]), i]))
}
model_nas <- NULL
for (i in 1:length(nas)) {
  model_nas <- c(model_nas, nrow(models$rf[[i]]$trainingData))
}
identical(nas, model_nas)
# [1] TRUE
So, is y[!unlist(lapply(y, is.na))] the best/most elegant way to do this sort of thing? It's pretty ugly...
Edit: Based on @Ricardo Saporta's answer, I was able to come up with this (probably obvious to the veterans, but bear with me):
rf_func <- function(x, y) {
  train(x = x,
        y = y,
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl)
}
models$rf <- lapply(data_resp, function(y) {
  rf_func(data_pred_scale[!is.na(y), ], y[!is.na(y)])
})
Is there still a better way, or is that fairly decent? (Certainly prettier than my first mess-up above.)
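For what it's worth, that looks fairly decent already; one small polish is to compute the mask once per response (a sketch; complete.cases() from base R is an equivalent spelling for a single vector):
models$rf <- lapply(data_resp, function(y) {
  keep <- !is.na(y)  # computed once, used to subset both predictors and response
  rf_func(data_pred_scale[keep, ], y[keep])
})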
