e1071 Package: naiveBayes prediction is slow

I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where prediction takes longer than training, by a factor of ~300.
I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.
This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.
I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.
I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).
The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.
# Remove everything from the environment and clear out memory
rm(list = ls())
gc()
# Load required packages and datasets
require(e1071)
data(iris)
# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
# set seed for reproducibility
set.seed(12345)
#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()
#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5 # no. of locations
N <- 1e4*L
# Data
married <- 1*(runif(N,0.0,1.0)>.45)
kids <- 1*(runif(N,0.0,1.0)<.22)
birthloc <- sample(1:L,N,TRUE)
major <- 1*(runif(N,0.0,1.0)>.4)
exper <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter <- 2*runif(N,0.0,1.0)-1
occShifter <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)
# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)
# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
X$major*Gamma[2,l] + X$married*Gamma[3,l] +
X$kids*Gamma[4,l] + X$exper*Gamma[5,l] +
X$occShifter*Gamma[6,l] + X$migShifter*X$married*Gamma[7,l] +
eps[ ,l]
}
choice <- apply(u, 1, which.max)
# Add choice to data frame
dat <- cbind(choice,X)
# factorize categorical variables for estimation
dat$major <- as.factor(dat$major)
dat$married <- as.factor(dat$married)
dat$kids <- as.factor(dat$kids)
dat$birthloc <- as.factor(dat$birthloc)
dat$choice <- as.factor(dat$choice)
tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
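For anyone reproducing these timings, base R's system.time() measures the same elapsed time without storing state in an environment. A minimal sketch, where lm() on iris is only a stand-in for the naiveBayes() and predict() calls (the stand-in model is an assumption for illustration, not part of the question):

```r
# Time a fit and a predict step separately with base R's system.time().
# lm() is only a stand-in model here; swap in naiveBayes()/predict() as needed.
data(iris)

t_fit <- system.time(
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
)
t_pred <- system.time(
  pred <- predict(fit, iris)
)

# "elapsed" is wall-clock seconds, the quantity tic()/toc() report by default
c(fit = t_fit[["elapsed"]], predict = t_pred[["elapsed"]])
```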

I ran into the same problem. I needed to fit naive Bayes and predict thousands of times on some big matrices (10000 rows, 1000-2000 columns). Since I had some time, I decided to write my own implementation of naive Bayes to make it a little faster.
I turned this work into a package: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. Moreover, it implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, there is a mixed model, where it's possible to use different event models for different columns and combine them.
The reason e1071 is so slow in the predict function is that it essentially uses a double for loop. A pull request that vectorized at least one of these loops had been open since around the beginning of 2017, but it had not been accepted yet.
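To illustrate what vectorizing the prediction step can look like, here is a minimal base R sketch of Gaussian naive Bayes that loops only over classes and lets dnorm() handle all rows and features at once. It is not the code of e1071 or fastNaiveBayes; fit_gnb() and predict_gnb() are hypothetical helper names, and the sketch assumes numeric features only, with no Laplace smoothing and no handling of zero-variance features.

```r
# Sketch: Gaussian naive Bayes with prediction vectorized over rows and features.
# fit_gnb() and predict_gnb() are hypothetical helper names, not from any package.
fit_gnb <- function(X, y) {
  y <- as.factor(y)
  groups <- split(as.data.frame(X), y)          # one data frame per class
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),  # class frequencies
    mu     = sapply(groups, colMeans),          # features x classes
    sigma  = sapply(groups, function(d) apply(d, 2, sd))
  )
}

predict_gnb <- function(model, X) {
  Xt <- t(as.matrix(X))                         # features x rows
  # Loop over the (few) classes only; dnorm() covers all rows and features at once.
  loglik <- vapply(seq_along(model$levels), function(k) {
    dens <- dnorm(Xt, mean = model$mu[, k], sd = model$sigma[, k], log = TRUE)
    dens <- matrix(dens, nrow = nrow(Xt))
    log(model$prior[k]) + colSums(dens)
  }, numeric(ncol(Xt)))
  dim(loglik) <- c(ncol(Xt), length(model$levels))  # rows x classes, also for one row
  # Normalize in log space to get posterior probabilities per row
  post <- exp(loglik - apply(loglik, 1, max))
  post <- post / rowSums(post)
  colnames(post) <- model$levels
  post
}

data(iris)
model <- fit_gnb(iris[, 1:4], iris$Species)
post  <- predict_gnb(model, iris[, 1:4])
pred  <- model$levels[apply(post, 1, which.max)]
mean(pred == iris$Species)                      # training accuracy
```

The per-class loop stays small regardless of the number of rows, which is the part that matters when N is large.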

Related

Avoid failure of confint.merMod on weighted models in lme4 when data object modified in calling frame

I'm facing a problem when using lme4 glmer function with weights, where if the data object passed to glmer is modified, some functions such as confint no longer work on the model. Here is an example:
library(lme4)
set.seed(1)
n <- 1000
df <- data.frame(
y=rbinom(n,1,.5),
w=runif(n,0,1)*.1+.95,
g=as.integer(round(runif(n,0,4)))
)
m <- glmer(cbind(y,1-y)~(1|g),data=df,weights=w,family=binomial())
confint(m)
df$w <- df$w*2
confint(m)
The 2nd call to confint gives this error:
Computing profile confidence intervals ...
Error in profile.merMod(object, which = parm, signames = oldNames, ...) :
Profiling over both the residual variance and
fixed effects is not numerically consistent with
profiling over the fixed effects only
It seems this has something to do with the profile function, as that function doesn't work after modifying the data frame.
The following seems to work to remove the dependency on the data object, but I am a bit uneasy not knowing if there might ever be bad side effects:
glmer2 <- function(...){
cl <- match.call()
df <- eval.parent(cl$data)
cl[1] <- call("glmer")
cl$data <- as.name("df")
eval(cl)
}
m <- glmer2(cbind(y,1-y)~(1|g),data=df,weights=w,family=binomial())
confint(m)
df$w <- df$w*2
confint(m)
(results of confint don't change)
The reason I need something like this is that I am creating a series of models, and need to re-compute the weights between each one, and it would be quite messy to keep all of the data objects.
Why do model functions seem to rely on the data object still being present and unchanged in the calling environment? And is there a better way to solve this issue?
(R version 3.6.3 (2020-02-29), x86_64-redhat-linux-gnu, lme4_1.1-21)

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones produced using only one core at a time (no parallelization). It looks like a geometry problem, but I can't figure out how to fix it.
Below I provide an example (using 4 cores) that generates 2 simulations. After running the code, you will see that the simulated maps derived from parallelization show some artifacts (like vertical lines) and are different from the ones produced using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". i don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
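The difference between the two ways of splitting is easy to see on a small index vector. In this sketch, n = 10 and no_cores = 2 are stand-in values chosen for illustration:

```r
library(parallel)

n <- 10          # stand-in for length(srgr)
no_cores <- 2

# split() recycles the grouping factor, so indices are dealt out round-robin
interleaved <- split(x = 1:n, f = 1:no_cores)
interleaved[[1]]   # 1 3 5 7 9
interleaved[[2]]   # 2 4 6 8 10

# splitIndices() returns consecutive blocks instead
chunks <- splitIndices(n, no_cores)
chunks[[1]]        # 1 2 3 4 5
chunks[[2]]        # 6 7 8 9 10
```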
Original answer, which turned out to address only a minor effect. It still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging, the simulation requires drawing random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
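A minimal sketch of seeding the workers, assuming a two-worker cluster and an arbitrary seed value (42); runif() stands in for whatever random draws the real job makes:

```r
library(parallel)

cl <- makeCluster(2)

# Seed the workers' RNG streams, draw, then reseed and draw again
clusterSetRNGStream(cl, iseed = 42)
a <- parSapply(cl, 1:2, function(i) runif(1))

clusterSetRNGStream(cl, iseed = 42)
b <- parSapply(cl, 1:2, function(i) runif(1))

stopCluster(cl)

identical(a, b)   # the two parallel runs draw the same numbers
```

This makes repeated parallel runs agree with each other. It does not make them match a sequential run seeded with set.seed, since the workers use their own streams.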

neuralnet:ANN results not reproducible even after setting seed

In a nutshell, I will explain the code:
I am trying to forecast by building 24 hourly models for a single day and collating the results in a data frame. The basic issue is that I am not able to reproduce the output even after setting the seed. Can anyone please help? I have written some custom functions and objects, and there is no randomization in them (just FYI).
f <- as.formula("actual~ lag.1 + last3.avg+monsoon+mon.thurs+wdaySaturday+wdaySunday+holiday
") #Defining the formula for neural network
require(dplyr);require(neuralnet)
set.seed(123456)
nnet.hour=data.frame()#Initializing a dataframe
#k=0
#x=list()
for(i in 1:24){#Running it for 24 hours in a day
sub<-new.day.ahead[new.day.ahead$hour==i,]
sub$lag.1<-lag(sub$actual,1)
for(i in 1:nrow(sub)){
sub$last3.avg[i]=sum(lag(sub$actual,1)[i],lag(sub$actual,2)[i],lag(sub$actual,3)[i],na.rm=TRUE)/3
}
ind=which(sub$mod.date==ymd(t[1]));ind#t[1] is basically a date #initialisation,getting the index
monsoon=as.factor(sub$Monsoon.Dummy)
wday=as.factor(sub$wday.dummy)
holiday=as.factor(sub$holiday)
sub=as.data.frame(cbind(sub[,c(4,16,17)],cbind(
monsoon=model.matrix(~monsoon)[,-1],
wday=model.matrix(~wday)[,-1],
holiday=model.matrix(~holiday)[,-1]
)))
names(sub)[5]<-"mon.thurs"
##Normalising the data for training in a neural net
sub[,2][1]=0
maxs <- apply(sub, 2, max)
mins <- apply(sub, 2, min)
scaled <- as.data.frame(scale(sub, center = mins, scale = maxs - mins))
train<- scaled[1:I(ind-1),]
test<- scaled[ind,]
set.seed(123456)
nn <- neuralnet(f,data=train,hidden =7,linear.output = TRUE)
pr.nn<-neuralnet::compute(nn,test[,-1])
#Normalising back
pr.nn.<- pr.nn$net.result*(max(sub$actual)-min(sub$actual))+min(sub$actual)
test.r <- (test$actual)*(max(sub$actual)-min(sub$actual))+min(sub$actual)
u=mape(as.numeric(test.r),as.numeric(pr.nn.));u#Calculating Mean Absolute Percentage Error
if(i==1){
nnet.hour=data.frame(actual=as.numeric(test.r),forecast1=as.numeric(pr.nn.),mape=u)
}else{
nnet.hour=rbind(nnet.hour,data.frame(data.frame(actual=as.numeric(test.r),forecast1=as.numeric(pr.nn.),mape=u)))
nnet.hour=data.frame(nnet.hour)
}
}
Yes, this is solved. In some runs I had not loaded the dplyr package, so the lag variables I was creating with lag() (a function named lag exists both in base R and in dplyr) were just the same series as the variable I was trying to forecast, which is why the errors were close to negligible.
Once I load the dplyr package, the results are reproducible.
Thanks.
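The masking effect can be checked in base R alone. In this sketch the vector x is made-up data, and the manual shift illustrates the kind of result dplyr::lag(x, 1) gives on a plain vector:

```r
x <- c(10, 20, 30, 40)

# stats::lag() only shifts the time attribute; the values stay where they are
unshifted <- as.numeric(stats::lag(x, 1))
unshifted                      # 10 20 30 40

# A real one-step lag moves the values down and pads with NA
shifted <- c(NA, head(x, -1))
shifted                        # NA 10 20 30
```

Writing dplyr::lag() explicitly in the modelling code avoids depending on which packages happen to be attached.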

Abysmally slow performance of caret package, with multicore

I am going through the book "Applied Predictive Modeling" by the author of the caret package.
The first example of a training on a svm takes hours to run on my 64 bit i7 16 GB xubuntu desktop [I gave up after 4 hours]. Since this is a "toy" dataset [800 rows, 42 variables], there sure must be a way to run this in a reasonable amount of time.
library(caret)
data(GermanCredit)
library(doMC)
registerDoMC(8)
GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL
## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest <- GermanCredit[-inTrain, ]
set.seed(1056)
svmFit = train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial")
Question: if this code is correct, how can it be run in a reasonable amount of time?
I ran into incredibly poor performance of svmRadial on Linux. It turns out that the issue for me, too, was using multicore via doMC; svmRadial runs fine on a single core. The kernlab functions are the only ones in caret that I've seen exhibit this behaviour. This is very frustrating, as I had to drop multicore for my entire script just to get the SVM functions working.

Running PLSR predictions parallel in R using foreach

Users,
I am looking for a solution to "parallelize" my PLSR predictions in order to save processing time. I was trying to use the "foreach" construct with "doPar" (cf. 2nd part of code below), but I was unable to allocate the predicted values as well as the model performance parameters (RMSEP) to the output variable.
The code:
set.seed(10000) # generate some data...
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method = "euclidean") # distance matrix to find close samples
eDm <- as.matrix(eD)
kns <- matrix(NA,nrow(x),10) # empty matrix to allocate 10 closest samples
for (i in 1:nrow(eDm)) { # identify closest samples in a loop and allocate to kns
kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
So far I consider the code as "safe", but the next part is challenging me, since I never used the "foreach" construct before:
library(pls)
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", subset=kns[j,])
predict(pls, ncomp=5, newdata=x[j,,drop=F])
RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand, the code line starting with "RMSEP(pls,..." is simply overwriting the previously written data from the "predict" code line. Somehow I was assuming the .combine option would take care of this?
Many thanks for your help!
Best, Chega
If you want to return two objects from the body of a foreach loop, you need to put them into an object such as a list:
out <- foreach(j = 1:nrow(mat), .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", subset=kns[j,])
list(p=predict(pls, ncomp=5, newdata=x[j,,drop=F]),
r=RMSEP(pls, estimate="CV")$val[1,1,5])
}
Only the "final value" of the loop body is returned to the master and then processed by the .combine function.
Note that I removed the .combine argument so that the result will be a list of lists of length 2. It's not clear to me that rbind is the appropriate function to use to process the results.
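Once the loop returns a list of lists, the two pieces can be pulled apart afterwards. A minimal sketch, where out is a made-up stand-in for the foreach() result and the 1 x 1 x 1 array shape of p is an assumption about what predict() returns here:

```r
# Stand-in for the foreach() result: one list(p = ..., r = ...) per iteration.
# The numbers are made up for illustration.
out <- lapply(1:3, function(j) {
  list(p = array(j * 0.5, dim = c(1, 1, 1)),  # assumed shape of one prediction
       r = j / 10)
})

# Extract each component into its own numeric vector
preds <- vapply(out, function(o) as.numeric(o$p), numeric(1))
rmsep <- vapply(out, function(o) o$r, numeric(1))

result <- data.frame(prediction = preds, rmsep = rmsep)
result
```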
Since this question was originally answered, the pls package has been modified to allow the cross-validation to be run in parallel. The implementation is trivially easy: simply a matter of defining either a persistent cluster, or the number of cores to use in a transient cluster, in pls.options.
If transient clusters are used, implementation literally requires only two lines of code:
library(parallel)
pls.options(parallel=NumberOfCoresToUse)
No changes to the output variables are needed.
I haven't checked whether parallelizing at the calibration level, as in the question, would be more efficient. I suspect it would be, particularly when the number of calibration iterations is much larger than the number of cross-validation steps (especially when the number of CVs isn't a multiple of the number of cores used), but this approach is so straightforward that the extra coding effort may not be worth it.
