Save non-SparkDataFrame from Azure Databricks to local computer as .RData - r

In Databricks (SparkR), I run the batch algorithm of the self-organizing map from the kohonen package in parallel, as it gives me a considerable reduction in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine so I can continue working with the results (create plots, etc.) in a way that is not available in Databricks. I know how to save and download a SparkDataFrame to CSV:
sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")
However, I am not sure how to do this for a 'regular' R list object.
Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?
EDIT:
library(kohonen)
# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")
# Collect the SparkDataFrame into a regular R data frame, as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)
# Convert the R data frame to a matrix, as required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal",
                    neighbourhood.fct = "gaussian")
set.seed(1)
som.model <- som(rdf.som, grid=som.grid, rlen=10, alpha=c(0.05,0.01), keep.data = TRUE, dist.fcts = "euclidean", mode = "online")
Any help is very much appreciated!

If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.
Here is an example of fitting kohonen models, one for each iris species:
library(SparkR)
library(kohonen)
fit_model <- function(df) {
  library(kohonen)
  grid_size <- ceiling(nrow(df) ^ (1/2.5))
  som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = TRUE)
  som_model <- som(data.matrix(df), grid = som_grid)
  som_model
}
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models
The models variable contains a list of kohonen models fitted in parallel:
$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
Then you can save/serialise the R object as usual:
saveRDS(models, file="/dbfs/kohonen_models.rds")
Note that any file stored under the /dbfs/ path will be available through Databricks' DBFS and is accessible with the CLI or API.
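If you also want the models on your local machine (the original question), one option (an assumption on my part, not spelled out in the answer above) is to copy the file out of DBFS with the Databricks CLI and then load it in a local R session:
# In a local shell (not in R), copy the file out of DBFS with the Databricks CLI:
#   databricks fs cp dbfs:/kohonen_models.rds kohonen_models.rds
# Then, in a local R session, load the list of models as usual:
models <- readRDS("kohonen_models.rds")
class(models$setosa)   # "kohonen"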

Related

Where to put a sample(, J) function in R (inside or outside?) a regsubsets() which is the FUN in an lapply() so it only runs J possible models?

The end goal here is to run a random sample (without replacement) of J different possible regression models, rather than all 2^k - 1 possible models as in a traditional All Subsets Regression (aka Best Subset Regression, also sometimes called Exhaustive Regression), on each of I different csv-formatted datasets, all located within the same folder.
Here is my code (it is in my GitHub repository for this project, where it is called 'EER script'):
# Load all libraries needed for this script.
# The library specifically needed to run a basic ASR is the 'leaps' library.
library(dplyr)
library(tidyverse)
library(stats)
library(ggplot2)
library(lattice)
library(caret)
library(leaps)
library(purrr)
directory_path <- "~/DAEN_698/sample obs"
filepath_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(filepath_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
datasets <- lapply(filepath_list, read.csv)
# code to run a normal All Subsets Regression
ASR_fits <- lapply(datasets, function(i)
  regsubsets(x = as.matrix(select(i, starts_with("X"))),
             y = i$Y, data = i, nvmax = 15,
             intercept = TRUE, method = "exhaustive"))
ASR_fits_summary <- summary(ASR_fits)
This is the part where I am completely stuck. I got the above to run, and the ASR_fits_summary object is a list with I elements, all of class 'regsubsets', which is exactly what I was hoping for. But that is still just the list of estimates made by a traditional ASR; what I need to figure out is where I should insert a sample(, J) function within this lapply so that each regsubsets call chooses the optimal model out of just J randomly selected models from the 2^k - 1 possible models, to make it computationally feasible.
I am guessing I will have to either nest another lapply within my current lapply function, or write a simple custom function that takes J random samples without replacement, but I just don't know at what step to put it.
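To make the idea concrete, here is a rough sketch (my own, not from the post) that skips the exhaustive search in regsubsets() entirely and instead fits J randomly sampled predictor subsets per dataset with lm(), keeping the best by BIC; fit_J_random_models is a hypothetical helper:
# Hedged sketch: per dataset, draw J random predictor subsets (sampled without
# replacement within each subset) and keep the model with the lowest BIC.
fit_J_random_models <- function(df, J = 100) {
  predictors <- names(dplyr::select(df, starts_with("X")))
  fits <- lapply(seq_len(J), function(j) {
    subset_size <- sample(seq_along(predictors), 1)   # random subset size
    vars <- sample(predictors, subset_size)           # random subset, no replacement
    lm(reformulate(vars, response = "Y"), data = df)
  })
  fits[[which.min(sapply(fits, BIC))]]                # best of the J candidate models
}
best_fits <- lapply(datasets, fit_J_random_models, J = 100)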

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán in answer to the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't figure out how to fix it.
Below I provide an example (using 4 cores) that generates 2 simulations. You will see that, after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines) and are different from the ones using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines fail, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional simulations with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but the simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual-core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
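To illustrate the difference between the two splitting strategies (a small example of my own, not part of the original answer):
# split() with a recycled factor interleaves the indices across workers:
split(1:10, 1:2)
# $`1`: 1 3 5 7 9     $`2`: 2 4 6 8 10
# splitIndices() hands each worker one consecutive block instead:
parallel::splitIndices(10, 2)
# [[1]]: 1 2 3 4 5    [[2]]: 6 7 8 9 10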
Original answer, which addresses only a minor effect: it still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about kriging, it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
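A minimal sketch of that seeding, assuming the cluster setup from the question:
library(parallel)
set.seed(123)                          # fixes the draws for the sequential krige() run
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, iseed = 123)   # reproducible L'Ecuyer-CMRG streams on the workers
# ... clusterExport(), clusterEvalQ() and parLapply() as in the question ...
stopCluster(cl)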

Automatically select SOM_GRID in R

I am trying to run the SOM algorithm in R using the kohonen package. In it I have to define the xdim and ydim dimensions manually. Refer to the code below:
som_grid <- somgrid(xdim = 5, ydim = 6, topo = "hexagonal")
som_model <- som(data_train_matrix,
                 grid = som_grid,
                 keep.data = TRUE)
My questions:
Is there a method that automatically selects the dimensions based on the data?
Can anyone explain the logic behind this selection, so that we can write a function in R to identify the dimensions automatically?
I'm not very good at R, but I think this can help you:
#Consider a dummy xdim and ydim Data.
x<-c(seq(0,5,by=0.5))
y<-c(seq(0,6,by=0.5))
## Determine the sector starting and end points.
a<-rbind(1,2,3,4,5)
b<-rbind(1,2,3,4,5,6)
sectors<-cbind(a,b)
sectors
## See the table of the sector.
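For what it's worth, a widely quoted rule of thumb (an assumption of mine, not part of the answer above) is to aim for roughly 5 * sqrt(N) map units, where N is the number of training rows, and then take a near-square grid; choose_som_grid below is a hypothetical helper built on that heuristic:
library(kohonen)
# Hypothetical helper: pick a near-square grid with about 5 * sqrt(N) units.
choose_som_grid <- function(data_matrix, topo = "hexagonal") {
  n_units <- ceiling(5 * sqrt(nrow(data_matrix)))
  dim_xy  <- ceiling(sqrt(n_units))
  somgrid(xdim = dim_xy, ydim = dim_xy, topo = topo)
}
som_grid  <- choose_som_grid(data_train_matrix)   # data_train_matrix as in the question
som_model <- som(data_train_matrix, grid = som_grid, keep.data = TRUE)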

Keep R Model in memory for Rest API

We have a GLM R model that is ~2 GB in size. We are using this model to service a REST API. We need < 3 second response times. The problem is that using the conventional saveRDS / readRDS functions takes way too long. Our last option is to find a way to keep the R model in memory between sessions.
At a high level:
Service a REST API.
Keep a multi gigabyte model in memory.
Keep response times below 3 seconds.
Stuff we've tried that doesn't work:
Shrinking the size of the model. Our data scientist says that's as small as he can get it.
I've experimented with compression and other settings on saveRDS. The best I could do is 12 seconds to load the model from file.
We tried Microsoft R Server. All the web API requests would be redirected to the same session. The problem is that we had to wrap a decent amount of code around keeping the session alive, and even then it would flake out often.
Microsoft R Server Real Time is out because it only accepts models generated by the ScaleR functions. I know ScaleR has a GLM function, but I've been told it's not an option.
Faster I/O doesn't seem to help. It appears that the bottleneck is the deserialization of the RData file. R being single-threaded doesn't help.
Edit:
The question is: which R REST API library/service would allow us to statefully keep a model in memory between calls?
Continuing on my comment from above, as well as the suggestion of @TenniStats, the best approach is to reduce the size of the GLM. Consider the following:
#generating some sample data that's fairly large
sample.data <- data.frame('target'      = sample(c(1:10), size = 5000000, replace = T),
                          'regressor1'  = rnorm(5000000),
                          'regressor2'  = rnorm(5000000),
                          'regressor3'  = rnorm(5000000),
                          'regressor4'  = rnorm(5000000),
                          'regressor5'  = rnorm(5000000),
                          'regressor6'  = rnorm(5000000),
                          'regressor7'  = rnorm(5000000),
                          'regressor8'  = rnorm(5000000),
                          'regressor9'  = rnorm(5000000),
                          'regressor10' = rnorm(5000000))
#building a toy glm - this one is about 3.3 GB
lm.mod <- glm(sample.data, formula = target ~ ., family = gaussian)
#baseline predictions
lm.default.preds <- predict(lm.mod, sample.data)
#extracting coefficients
lm.co <- coefficients(lm.mod)
#applying coefficients to original data set by row and adding intercept
lightweight.preds <- lm.co[1] +
  apply(sample.data[, 2:ncol(sample.data)],
        1,
        FUN = function(x) sum(x * lm.co[2:length(lm.co)]))
#clearing names from vector for comparison
names(lm.default.preds) <- NULL
#taa daa
all.equal(lm.default.preds, lightweight.preds)
Then we can do the following:
#saving for our example and starting timing
saveRDS(lm.co, file = 'myfile.RDS')
start.time <- Sys.time()
#reading from file
coefs.from.file <- readRDS('myfile.RDS')
#scoring function
light.scoring <- function(coeff, new.data) {
  prediction <- coeff[1] + sum(coeff[2:length(coeff)] * new.data)
  names(prediction) <- NULL
  return(prediction)
}
#same as before
light.scoring(coefs.from.file, sample.data[1, 2:11])
#~.03 seconds on my machine
Sys.time() - start.time
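To keep the (now tiny) coefficient vector in memory between REST calls, one option (my own suggestion, not part of the answer above) is a plumber API that reads it once at startup; api.R and the /predict route are hypothetical names:
# file: api.R -- the coefficient vector is read once when the API starts
# and stays in memory for every subsequent request.
library(plumber)
coefs <- readRDS("myfile.RDS")

#* Score one observation given a comma-separated string of regressor values
#* @param features e.g. "0.1,-1.2,0.3"
#* @post /predict
function(features) {
  x <- as.numeric(strsplit(features, ",")[[1]])
  unname(coefs[1] + sum(coefs[-1] * x))
}
Launch it with plumber::plumb("api.R")$run(port = 8000); since the scoring function only touches the small coefficient vector, the multi-gigabyte GLM never has to be deserialized per request.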

Runtime Library error in R with Random forest (Rborist)

I am using the Rborist library in R. I once managed to construct a random forest model and to save the object with the function saveRDS.
Then I shut down R and loaded the object with the function readRDS.
The error happened when I tried to predict using the random forest model.
This is the error message:
Microsoft Visual C++ Runtime Library
This application has requested the Runtime to terminate it in an unusual
way. Please contact the application's support team for more
information.
This is the code:
library(caret)
library(Rborist)
dat <- read.csv("data.csv", header=T)
dat <- transform(dat, y = as.factor(y))
index <- createDataPartition(dat$y, p=.8, list=F)
train <- dat[index, ]; test <- dat[-index, ]
model <- Rborist(train[,-1], train$y, predProb=0.1, nTree = 500)
table = table(predict(model, test[,-1])$yPred,test$y)
table
sum(diag(table))/sum(table)
saveRDS(model,file="model.rds")
# once R has been shut down and booted up again
library(Rborist)
test <- read.csv("test.csv", header=T)
model <- readRDS(file="model.rds")
pred = predict(model, test[,-1])$yPred # Error!!
