readBin from memory, rather than the disk

I've been working a bit lately with xgboost models:
library(xgboost)
set.seed(23)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(
  data = train$data, label = train$label, max.depth = 2,
  eta = 1, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
Unfortunately, if you try to save and load an xgboost model, the result is a segfault:
f <- tempfile()
saveRDS(bst, f)
bst_new <- readRDS(f)
unlink(f)
print(bst_new)  # NULL pointer
pred2 <- predict(bst_new, test$data)  # this line segfaults
This is because an xgboost model is really just a pointer to a C++ object in memory: the R package is a thin wrapper around the underlying C++ code:
> print(bst)
<pointer: 0x10a5ae750>
attr(,"class")
[1] "xgb.Booster"
The package authors are aware of this and have written two custom functions (xgb.save and xgb.load) for saving and loading xgboost models, but they aren't as convenient as simply using save and load from base R. I've come up with a hack for saving the model as part of an R session, but it's not very pretty:
#Hack to save an xgboost model
f1 <- tempfile()
f2 <- tempfile()
xgb.save(bst, f1)
model_raw_bytes <- readBin(f1, what = 'raw', n = file.info(f1)[1, "size"])
unlink(f1)
print(head(model_raw_bytes))
saveRDS(model_raw_bytes, f2)
#Hack to load an xgboost model
f3 <- tempfile()
model_raw_bytes <- readRDS(f2)
writeBin(model_raw_bytes, f3)
model_2 <- xgb.load(f3)
pred_2 <- predict(model_2, test$data)
all.equal(pred_2, pred)
unlink(f2)
unlink(f3)
Is there a way to readBin directly from a location in memory, rather than from the disk? (This would save me from writing temp files to save and load the models).
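To make that concrete, what I'm imagining is something along these lines (a sketch using base R's rawConnection; I'm not sure whether this is the intended way to do it, and it doesn't by itself remove the temp files above, since xgb.save and xgb.load only accept file paths, but it shows readBin and writeBin working against memory rather than disk):
# Illustration only: readBin from an in-memory raw vector
bytes <- as.raw(1:10)
con <- rawConnection(bytes, open = "rb")
readBin(con, what = "raw", n = length(bytes))
close(con)
# And the write side: writeBin into memory, then retrieve the bytes
wcon <- rawConnection(raw(0), open = "wb")
writeBin(bytes, wcon)
identical(rawConnectionValue(wcon), bytes)
close(wcon)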
Better yet, is there an accepted way to wrap C++ objects inside R objects that I could point the package authors towards? They've been pretty receptive to my comments on GitHub, and if I could submit a PR to make this work, I think they'd accept it.
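For reference, here is roughly the shape of wrapper I would propose in such a PR (a sketch only; the function names are mine and not part of the xgboost API). The idea is to keep the serialised bytes alongside the pointer so the object survives saveRDS/readRDS, and rebuild the handle from the bytes when needed:
# Sketch: store the booster's serialised bytes next to the external pointer
wrap_booster <- function(bst) {
  f <- tempfile()
  on.exit(unlink(f))
  xgb.save(bst, f)
  structure(
    list(handle = bst,
         raw = readBin(f, what = "raw", n = file.info(f)$size)),
    class = "xgb_booster_wrapped")
}
# Sketch: rebuild a usable booster from the stored bytes after readRDS
unwrap_booster <- function(obj) {
  f <- tempfile()
  on.exit(unlink(f))
  writeBin(obj$raw, f)
  xgb.load(f)
}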

Related

R package development: tests pass in console, but fail via devtools::test()

I am developing an R package that calls functions from the package rstan. As an MWE, my test file is currently set up like this, using code taken verbatim from rstan's example:
library(testthat)
library(rstan)
# stan's own example
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
# I added this line, and it's the culprit
summary(fit)$summary
When I run this code in the console or via the "Run Tests" button in RStudio, no errors are thrown. However, when I run devtools::test(), I get:
Error (test_moments.R:11:1): (code run outside of `test_that()`)
Error in `summary(fit)$summary`: $ operator is invalid for atomic vectors
and this error is definitely not occurring upstream of that final line of code, because removing the final line allows devtools::test() to run without error. I am running up-to-date versions of devtools and rstan.
It seems that devtools::test evaluates the test code in a setting where S4 dispatch does not work in the usual way, at least for packages that you load explicitly in the test file (in this case rstan). As a result, summary dispatches to summary.default instead of the S4 method implemented in rstan for class "stanfit".
The behaviour that you're seeing might relate to this issue on the testthat repo, which seems unresolved.
Here is a minimal example that tries to illuminate what is happening, showing one possible (admittedly inconvenient) work-around.
pkgname <- "foo"
usethis::create_package(pkgname, rstudio = FALSE, open = FALSE)
setwd(pkgname)
usethis::use_testthat()
path_to_test <- file.path("tests", "testthat", "test-summary.R")
text <- "test_that('summary', {
library('rstan')
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
expect_identical(class(fit), structure('stanfit', package = 'rstan'))
expect_true(existsMethod('summary', 'stanfit'))
x <- summary(fit)
expect_error(x$summary)
expect_identical(x, summary.default(fit))
print(x)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
str(y)
})
"
cat(text, file = path_to_test)
devtools::test(".") # all tests pass
If your package actually imports rstan (in the NAMESPACE sense, not in the DESCRIPTION sense), then S4 dispatch seems to work fine, presumably because devtools loads your package and its dependencies in a "proper" way before running any tests.
cat("import(rstan)\n", file = "NAMESPACE")
newtext <- "test_that('summary', {
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
x <- summary(fit)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
expect_identical(x, y)
})
"
cat(newtext, file = path_to_test)
## You must restart your R session here. The current session
## is contaminated by the previous call to 'devtools::test',
## which loads packages without cleaning up after itself...
devtools::test(".") # all tests pass
If your test is failing and your package imports rstan, then something else may be going on, but it is difficult to diagnose without a minimal version of your package.
Disclaimer: Going out of your way to import rstan to get around a relatively obscure devtools issue should be considered more of a hack than a fix, and documented accordingly...

Save non-SparkDataFrame from Azure Databricks to local computer as .RData

In Databricks (SparkR), I run the batch algorithm of the self-organizing map from the kohonen package in parallel, as it gives me considerable reductions in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine, to continue working with the results (creating plots etc.) in ways that are not available in Databricks. I know how to save and download a SparkDataFrame to csv:
sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")
However, I am not sure how to do this for a 'regular' R list object.
Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?
EDIT :
library(kohonen)
# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")
# Collect SDF to RDF as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)
# Change rdf to matrix as is required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal",
                    neighbourhood.fct = "gaussian")
set.seed(1)
som.model <- som(rdf.som, grid=som.grid, rlen=10, alpha=c(0.05,0.01), keep.data = TRUE, dist.fcts = "euclidean", mode = "online")
Any help is very much appreciated!
If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.
Here is an example of fitting kohonen models, one for each iris species:
library(SparkR)
library(kohonen)
fit_model <- function(df) {
  library(kohonen)
  grid_size <- ceiling(nrow(df) ^ (1/2.5))
  som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = T)
  som_model <- som(data.matrix(df), grid = som_grid)
  som_model
}
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models
The models variable contains a list of kohonen models fitted in parallel:
$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
Then you can save/serialise the R object as usual:
saveRDS(models, file="/dbfs/kohonen_models.rds")
Note that any file stored under the /dbfs/ path will be available through Databricks' DBFS, accessible with the CLI or the API.
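For example (a sketch, assuming the Databricks CLI is configured on your local machine; the exact CLI syntax may differ between versions), after copying the file down with something like databricks fs cp dbfs:/kohonen_models.rds ., you can load the models into a local R session and keep working with them:
# On the local machine, after downloading the .rds file from DBFS
library(kohonen)
models <- readRDS("kohonen_models.rds")
plot(models$setosa, type = "changes")  # e.g. inspect training progress locally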

Why does caret::predict() use parallel processing with XGBtree only?

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were still being used.
After some trial and error, I found that this happens for models fitted with caret::train() using xgbTree only (possibly other methods too), but not for other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
  # get prediction
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
In summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to set an environment variable:
OMP_NUM_THREADS
And, as you note, this only happens for xgbTree.
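Applied to the question's caret example, a hedged sketch of both options (this assumes xgbFit$finalModel is the underlying xgb.Booster, which is how caret stores xgbTree fits as far as I know):
library(xgboost)
# Option 1: tell the booster itself to predict with a single thread
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)
# Option 2: cap OpenMP threads for the session (may need to be set
# before xgboost initialises its thread pool)
Sys.setenv(OMP_NUM_THREADS = 1)
preds <- predict(xgbFit, newdata = iris_big)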

Rscript - long execution time

I'm trying to create a predictive model with the caret package in R and invoke predictions for new data from the terminal/cmd. Here is a reproducible example:
# Sonar_training.R
## learning and saving model
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75,list = FALSE)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
saveRDS(testing,"test.rds")
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3)
plsFit <- train(Class ~ ., data = training, method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
saveRDS(plsFit,"fit.rds")
And here is the script to invoke with Rscript:
# script.R
##reading model and predict test data
t <- Sys.time()
pls <- readRDS("fit.rds")
testing <- readRDS("test.rds")
head(predict(pls, newdata = testing))
print(Sys.time() - t)
I run this in the terminal with the following command:
pawel#pawel-MS-1753:~$ Rscript script.R
Loading required package: pls
Attaching package: ‘pls’
The following object is masked from ‘package:stats’:
loadings
[1] M M R M R R
Levels: M R
Time difference of 2.209697 secs
Is there any way to do this faster or more efficiently? For example, is it possible to avoid loading the packages on every execution? Is readRDS the right way to read models in this case?
You can try to profile your code with the "profvis" package:
library(profvis)
profvis({
  for (i in 1:100) {
    # your code here
  }
})
I tried it, and it turns out that 99% of the execution time is training time, 1% is saving/loading the RDS data, and the rest (loading packages, loading data, ...) costs essentially nothing.
So if you don't want to optimize the training function itself, it seems you have very few ways to reduce execution time.
I've seen this occur for PLS classification models and I'm not sure of the issue. However, try using method = "simpls" instead. You will get approximately the same answers and it should complete quickly.
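For example (a sketch of the suggested change, reusing the question's training and ctrl objects and leaving everything else as it was):
plsFit <- train(Class ~ ., data = training, method = "simpls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))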

Runtime Library error in R with Random forest (Rborist)

I am using the Rborist library in R. I built a random forest model and saved the object with saveRDS.
Then I shut down R and, after restarting, loaded the object back in with readRDS.
The error occurs when I try to predict with the loaded random forest model.
This is the error message:
Microsoft Visual C++ Runtime Library
This application has requested the Runtime to terminate it an unusual
way. Please contact the application's support team for more
information.
This is the code:
library(caret)
library(Rborist)
dat <- read.csv("data.csv", header=T)
dat <- transform(dat, y = as.factor(y))
index <- createDataPartition(dat$y, p=.8, list=F)
train <- dat[index, ];test <- dat[-index,]
model <- Rborist(train[,-1], train$y, predProb=0.1, nTree = 500)
table = table(predict(model, test[,-1])$yPred,test$y)
table
sum(diag(table))/sum(table)
saveRDS(model,file="model.rds")
# after shutting down and restarting R
library(Rborist)
test <- read.csv("test.csv", header=T)
model <- readRDS(file="model.rds")
pred = predict(model, test[,-1])$yPred # Error!!
