neuralnet:ANN results not reproducible even after setting seed - r

In a nutshell, here is what the code does:
I am trying to forecast by creating 24 hourly models for a single day and collating the results in a data frame. The basic issue is that I am not able to reproduce the output even after setting the seed. Can anyone please help? I have made some custom functions and objects, and there is no randomization in them (just FYI).
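# Defining the formula for the neural network
f <- as.formula("actual ~ lag.1 + last3.avg + monsoon + mon.thurs + wdaySaturday + wdaySunday + holiday")
library(dplyr)      # lag() below is meant to be dplyr::lag
library(neuralnet)
library(lubridate)  # provides ymd(), used below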
set.seed(123456)
nnet.hour=data.frame()#Initializing a dataframe
#k=0
#x=list()
for(i in 1:24){#Running it for 24 hours in a day
sub<-new.day.ahead[new.day.ahead$hour==i,]
sub$lag.1<-lag(sub$actual,1)
for(j in 1:nrow(sub)){ # j rather than i, so the outer hour index is not overwritten
sub$last3.avg[j]=sum(lag(sub$actual,1)[j],lag(sub$actual,2)[j],lag(sub$actual,3)[j],na.rm=TRUE)/3
}
ind=which(sub$mod.date==ymd(t[1])) # t[1] is a date; ind is the index of the row to forecast
monsoon=as.factor(sub$Monsoon.Dummy)
wday=as.factor(sub$wday.dummy)
holiday=as.factor(sub$holiday)
sub=as.data.frame(cbind(sub[,c(4,16,17)],cbind(
monsoon=model.matrix(~monsoon)[,-1],
wday=model.matrix(~wday)[,-1],
holiday=model.matrix(~holiday)[,-1]
)))
names(sub)[5]<-"mon.thurs"
##Normalising the data for training in a neural net
sub[,2][1]=0
maxs <- apply(sub, 2, max)
mins <- apply(sub, 2, min)
scaled <- as.data.frame(scale(sub, center = mins, scale = maxs - mins))
train<- scaled[1:I(ind-1),]
test<- scaled[ind,]
set.seed(123456)
nn <- neuralnet(f,data=train,hidden =7,linear.output = TRUE)
pr.nn<-neuralnet::compute(nn,test[,-1])
#Normalising back
pr.nn.<- pr.nn$net.result*(max(sub$actual)-min(sub$actual))+min(sub$actual)
test.r <- (test$actual)*(max(sub$actual)-min(sub$actual))+min(sub$actual)
u=mape(as.numeric(test.r),as.numeric(pr.nn.));u#Calculating Mean Absolute Percentage Error
if(i==1){
nnet.hour=data.frame(actual=as.numeric(test.r),forecast1=as.numeric(pr.nn.),mape=u)
}else{
nnet.hour=rbind(nnet.hour,data.frame(data.frame(actual=as.numeric(test.r),forecast1=as.numeric(pr.nn.),mape=u)))
nnet.hour=data.frame(nnet.hour)
}
}
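The scaling and back-transformation steps used in the loop above can be checked in isolation. The following is a minimal sketch in base R with made-up numbers (the vector `actual` here is illustrative, not the poster's data); it only shows that min-max scaling followed by the inverse transform recovers the original values.

```r
# Illustrative values only; not taken from the original data set
actual <- c(120, 150, 90, 200)

lo <- min(actual)
hi <- max(actual)

# Min-max scaling to the [0, 1] range
scaled <- (actual - lo) / (hi - lo)

# Inverse transform, as done for the network output in the loop
restored <- scaled * (hi - lo) + lo

print(range(scaled))
print(isTRUE(all.equal(restored, actual)))
```

Note that the inverse transform has to use the minimum and maximum of the same column that was scaled, otherwise the forecasts come back on the wrong scale.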

Yes, this is solved. For some iterations I had failed to load the 'dplyr' package, so the lag variables I was creating with lag() (a function named 'lag' exists both in base R and in dplyr) were just returning the same series as the variable I was trying to forecast, which is why the errors were close to negligible.
Once I load the dplyr package, the results are reproducible.
Thanks.
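The difference between the two lag functions is easy to see on a small vector. This is a minimal sketch with a made-up vector `x`; the dplyr part only runs if dplyr is installed.

```r
x <- c(10, 20, 30, 40)

# stats::lag() only shifts the time base (the tsp attribute);
# the values themselves stay where they are
shifted_stats <- as.numeric(stats::lag(x, 1))
print(shifted_stats)

# dplyr::lag() shifts the values and pads the start with NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  shifted_dplyr <- dplyr::lag(x, 1)
  print(shifted_dplyr)
}
```

Writing `dplyr::lag(...)` explicitly in the modelling code would make the result independent of which packages happen to be attached.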

Related

How to implement shapper:shap for whole dataset?

I have created a Random Forest model using the randomForest package
model_rf <- randomForest(y~ . , data = data_train,ntree=1000, keep.forest=TRUE,importance=TRUE)
To calculate Shapley values for the different features based on this RF model, I first create an "explainer object" and then use the "shapper" package
exp_rf <- DALEX::explain(model_rf, data = data_test[,-1], y = data_test[,1])
ive_rf <- shap(exp_rf, new_observation = data_test[1,-1])
To my knowledge, I can only apply the "shap" function to one observation (the "new_observation").
But I am looking for a way to calculate the Shapley values for all of the respondents in my data file.
I know this is possible in the "SHAP" package in Python, but is it also possible with the "shapper" package in R?
At the moment, I have created a loop to calculate the Shapley values for all respondents, but it will take days to run for my entire data file.
shapruns <- NULL # collects one column of attributions per respondent
for(i in c(1:nrow(data_test)))
{
ive_rf <- shap(exp_rf,new_observation=data_test[i,-1])
shapruns<-cbind(shapruns,ive_rf[,"_attribution_"])
}
Any help would be much appreciated.
I recently published two R packages that are optimized for this kind of task: "kernelshap" (calculates SHAP values fast) and "shapviz" (plots SHAP values from any source). In your case, a working example would be:
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")

ARFIMA model and accuracy function

I am forecasting with data sets from the fpp2 and forecast packages. My intention is to automate forecasting across several time series, so I wrap the forecasting step in a function. You can see the code below:
# CODE
library(fpp2)
library(dplyr)
library(forecast)
df<-qauselec
# Forecasting function
fct_fun <- function(Z, hrz = forecast_horizon) {
timeseries <- msts(Z, start = 1956, seasonal.periods = 4)
forecast <- arfima(timeseries)
}
acc_list <- lapply(X = df, fct_fun)
The next step is to check the accuracy of the model. For that I am trying the line of code below:
accurancy_arfima <- lapply(acc_list, accuracy)
Until now, the accuracy function worked perfectly with other models such as snaive, ets, etc., but it does not work properly with arfima.
Can anybody help me resolve this problem with the accuracy function?
According to the R documentation, accuracy() returns a range of summary measures of the forecast accuracy. If x is provided, the function measures test set forecast accuracy based on x-f. If x is not provided, the function only produces training set accuracy measures of the forecasts based on f["x"]-fitted(f).
The usage summary is:
accuracy(f, x, test = NULL, d = NULL, D = NULL,
...)
So:
accuracy(acc_list[[1]]$fitted, df)
If you want to evaluate the accuracy separately, this will work:
a <- c()
for (i in 1:4) {
b <- accuracy(df[i], acc_list[[1]]$fitted[i])
a <- rbind(a,b)
}
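For reference, the training set measures described above (errors computed as observed minus fitted) can also be worked out by hand. The sketch below is not forecast::accuracy itself: `training_accuracy` is a hypothetical helper, and stats::arima on the built-in `lh` series stands in for an arfima fit so that the example runs with base R only.

```r
# Hypothetical helper: common error measures from observed and fitted values
training_accuracy <- function(actual, fitted_values) {
  actual <- as.numeric(actual)
  err <- actual - as.numeric(fitted_values)
  c(ME   = mean(err),
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = mean(abs(err / actual)) * 100)
}

# Stand-in model: an AR(1) fit to the built-in lh series
fit <- arima(lh, order = c(1, 0, 0))

# Fitted values are the observations minus the residuals
acc <- training_accuracy(lh, lh - residuals(fit))
print(round(acc, 3))
```

The same helper could be applied to any model object that exposes fitted values or residuals, assuming both series have the same length and alignment.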

e1071 Package: naiveBayes prediction is slow

I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where the time it takes to predict takes longer than the time it takes to train, by a factor of ~300.
I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.
This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.
I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.
I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).
The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.
# Remove everything from the environment and clear out memory
rm(list = ls())
gc()
# Load required packages and datasets
require(e1071)
data(iris)
# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
# set seed for reproducibility
set.seed(12345)
#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()
#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5 # no. of locations
N <- 1e4*L
# Data
married <- 1*(runif(N,0.0,1.0)>.45)
kids <- 1*(runif(N,0.0,1.0)<.22)
birthloc <- sample(1:L,N,TRUE)
major <- 1*(runif(N,0.0,1.0)>.4)
exper <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter <- 2*runif(N,0.0,1.0)-1
occShifter <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)
# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)
# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
X$major*Gamma[2,l] + X$married*Gamma[3,l] +
X$kids*Gamma[4,l] + X$exper*Gamma[5,l] +
X$occShifter*Gamma[6,l] + X$migShifter*X$married*Gamma[7,l] +
eps[ ,l]
}
choice <- apply(u, 1, which.max)
# Add choice to data frame
dat <- cbind(choice,X)
# factorize categorical variables for estimation
dat$major <- as.factor(dat$major)
dat$married <- as.factor(dat$married)
dat$kids <- as.factor(dat$kids)
dat$birthloc <- as.factor(dat$birthloc)
dat$choice <- as.factor(dat$choice)
tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
I ran into the same problem. I needed to run naive Bayes and predict many times (thousands of times) on some big matrices (10000 rows, 1000-2000 columns). Since I had some time, I decided to write my own implementation of naive Bayes to make it faster, and I turned that work into a package: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. It also implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster), as well as a mixed model where it's possible to use different event models for different columns and combine them.
The reason e1071 is so slow in the predict function is that it essentially uses a double for loop. A pull request that vectorized at least one of these loops had been open since around the beginning of 2017 but had not been accepted yet.
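To illustrate what vectorizing the prediction step can look like, here is a minimal Gaussian naive Bayes sketch in base R on the iris data. It is neither the e1071 nor the fastNaiveBayes code; all names (`mu`, `sdev`, `log_post`) are made up for this example, and it handles numeric features only.

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Class priors as a named numeric vector
prior <- as.vector(table(y)) / length(y)
names(prior) <- classes

# Per-class feature means and standard deviations (features x classes)
mu   <- sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE]))
sdev <- sapply(classes, function(k) apply(X[y == k, , drop = FALSE], 2, sd))

# Prediction: one vectorized dnorm() call per class, no loop over rows.
# t(X) is features x rows, so the mean and sd vectors recycle per feature.
log_post <- sapply(classes, function(k) {
  ll <- dnorm(t(X), mean = mu[, k], sd = sdev[, k], log = TRUE)
  colSums(ll) + log(prior[[k]])
})

pred <- classes[max.col(log_post, ties.method = "first")]
print(mean(pred == y))
```

The only loop left runs over the handful of classes, so the cost per row stays inside vectorized calls.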

SVM is not generating forecast using R

I have sales data for 5 different products along with weather information. The data consists of daily sales at a particular store and daily weather information, such as the temperature and average speed, for the area where the store is located.
I am using a Support Vector Machine for prediction. It works well for all the products except one. For that product it gives me the following error:
tunedModelLOG
named numeric(0)
Below is the code:
# load the packages
library(zoo)
library(MASS)
library(e1071)
library(rpart)
library(caret)
normalize <- function(x) {
a <- min(x, na.rm=TRUE)
b <- max(x, na.rm=TRUE)
(x - a)/(b - a)
}
# Define the train and test data
test_data <- train[1:23,]
train_data<-train[24:nrow(train),]
# Define the factors for the categorical data
names<-c("year","month","dom","holiday","blackfriday","after1","back1","after2","back2","after3","back3","is_weekend","weeday")
train_data[,names]<- lapply(train_data[,names],factor)
test_data[,names] <- lapply(test_data[,names],factor)
# Normalized the continuous data
normalized<-c("snowfall","depart","cool","preciptotal","sealevel","stnpressure","resultspeed","resultdir")
train_data[,normalized] <- data.frame(lapply(train_data[,normalized], normalize))
test_data[,normalized] <- data.frame(lapply(test_data[,normalized], normalize))
# Define the same level in train and test data
levels(test_data$month)<-levels(train_data$month)
levels(test_data$dom)<-levels(train_data$dom)
levels(test_data$year)<-levels(train_data$year)
levels(test_data$after1)<-levels(train_data$after1)
levels(test_data$after2)<-levels(train_data$after2)
levels(test_data$after3)<-levels(train_data$after3)
levels(test_data$back1)<-levels(train_data$back1)
levels(test_data$back2)<-levels(train_data$back2)
levels(test_data$back3)<-levels(train_data$back3)
levels(test_data$holiday)<-levels(train_data$holiday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$blackfriday)<-levels(train_data$blackfriday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$weeday)<-levels(train_data$weeday)
# Fit the SVM model and tune the parameters
svmReFitLOG=tune(svm,logunits~year+month+dom+holiday+blackfriday+after1+after2+after3+back1+back2+back3+is_weekend+depart+cool+preciptotal+sealevel+stnpressure+resultspeed+resultdir,data=train_data,ranges = list(epsilon = c(0,0.1,0.01,0.001), cost = 2^(2:9)))
retunedModeLOG <- svmReFitLOG$best.model
tunedModelLOG <- predict(retunedModeLOG,test_data)
Working file is available at the below link
https://drive.google.com/file/d/0BzCJ8ytbECPMVVJ1UUg2RHhQNFk/view?usp=sharing
What am I doing wrong? I would appreciate any kind of help.
Thanks in advance.
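One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the normalize function above returns NaN for a column whose values are all identical, because it divides by zero. With only 23 test rows, a weather column such as snowfall could easily be constant. Since predict for an svm model omits rows with missing values by default (na.action = na.omit), a test set in which every row contains a NaN would yield an empty prediction. The toy data frame below is made up for illustration.

```r
normalize <- function(x) {
  a <- min(x, na.rm = TRUE)
  b <- max(x, na.rm = TRUE)
  (x - a) / (b - a)
}

# Toy test set: snowfall is constant, cool varies
toy_test <- data.frame(snowfall = c(0, 0, 0), cool = c(1, 4, 7))
toy_test[] <- lapply(toy_test, normalize)
print(toy_test)

# Rows that would survive na.omit
rows_usable <- complete.cases(toy_test)
print(sum(rows_usable))
```

If this turns out to be the cause, scaling the test set with the minimum and maximum taken from the training set would avoid the division by zero.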

error in Predict.lm plyr Is there an alternative to mdply for prediction using newdata?

I am at a loss trying to figure out how to get predictions from new data passed to predict.lm using plyr in place of a loop. Can anyone help? Example:
Because I am new to R and not a highly skilled programmer, my code will be painfully inefficient.
Stack Overflow community:
Thanks for the suggestions to create fake code of the problem. I am hoping this will help me solve this headache.
My goal is to make predictions on a new validation dataset using the coefficients from model built on training dataset. I will eventually be building an ARIMA as well as a linear model once I can get help solving the problem. I am building 24 regression models. One model for each hour of the day. My training data would be 90 days and my validation data would be 31 days.
Creating Some Data
require(plyr)
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:90
dates <-seq(as.Date("2012-01-01"), as.Date("2012-3-30"), by = "day")
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Adspend <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Adspend)
## ok, done with the fake data generation.
myData
#Run regression on training data
FIT <- dlply(myData, "Hour", function(x) lm(x[,4] ~ x[,3], data=x))
# Create new fake validation dataset (31days)
Hour <- 1:24
Day <- 1:31
dates <-seq(as.Date("2012-03-31"), as.Date("2012-4-30"), by = "day")
newData <- expand.grid( Day, Hour)
names(newData) <- c("Date","Hour")
set.seed(310)
fooNew <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*5*(myHour) + (300*myDate)
}
newData$AdspendNew <- apply(newData, 1, function(x) fooNew(x[2], x[1]))
newData$Date <-dates
I then try to make predictions of Demand using the New values for Adspend
NewDatabyHour <-dlply(newData,"Hour")
PREDFIT <-mdply(cbind(mod=FIT, df=NewDatabyHour), function(mod,df) {
transform(df, pred=predict(mod,df))})
The error I am now getting is the following:
Error in data.frame(list(Date = c(15430, 15431, 15432, 15433, 15434, 15435, :
arguments imply differing number of rows: 31, 90
In addition: Warning message:
'newdata' had 31 rows but variables found have 90 rows
My question is:
How do I make predictions on new data when the new data has fewer observations than the training data?
My second question is: is the process the same for auto.arima as for lm()?
Thank you again for any help.
Your problem arises from the way you construct your formula, which leaves the newdata argument to predict.lm without consistent names (also, mdply is not really what you want here).
predict.lm looks for objects in newdata that have the same names as the terms in your model object. Your current definition has x[,3] as the predictor term and x[,4] as the response, and no column in newdata goes by those names.
Instead, use the column names, i.e.
FIT <- dlply(myData, "Hour", function(x) lm(Demand ~ Adspend, data=x))
Now when you create newData, continue to use the name Adspend
newData$Adspend <- apply(newData, 1, function(x) fooNew(x[2], x[1]))
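The effect of positional versus named terms can be reproduced on a small scale. This is a minimal sketch with simulated data (`train` and `new` are made-up names, not the objects from the question): with positional terms, predict cannot find the predictor in newdata and falls back to the training data.

```r
set.seed(42)
train <- data.frame(Adspend = runif(30, 0, 100))
train$Demand <- 0.75 * train$Adspend + rnorm(30)
new <- data.frame(Adspend = runif(12, 0, 100))

# Positional terms: the model remembers "x[, 1]", which newdata cannot supply,
# so predict() picks up x from the environment where the model was fitted
fit_bad <- (function(x) lm(x[, 2] ~ x[, 1], data = x))(train)
p_bad <- suppressWarnings(predict(fit_bad, newdata = new))

# Named terms: predict() finds Adspend in newdata
fit_good <- lm(Demand ~ Adspend, data = train)
p_good <- predict(fit_good, newdata = new)

print(c(bad = length(p_bad), good = length(p_good)))
```

The mismatch in the number of predictions is the same one reported in the warning about 31 versus 90 rows.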
Now you can use Map (a wrapper for mapply, which is a base R function rather than a plyr one) to step through FIT and NewDatabyHour together, make the predictions, and combine them with the new data:
predicted <- Map(object = FIT, newdata = NewDatabyHour,
f = function(object,newdata) {
newdata$predicted = predict(object, newdata)
newdata})
# combine into whole data frame again
predDF <- rbind.fill(predicted)
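The same pattern can be written without plyr. The sketch below uses only base R and simulated data with three hours instead of 24; the object names (`train`, `valid`, `fits`, `pred_df`) are assumptions made for this example.

```r
set.seed(1)
train <- data.frame(Hour = rep(1:3, each = 10), Adspend = runif(30, 0, 100))
train$Demand <- 0.75 * train$Adspend + rnorm(30)
valid <- data.frame(Hour = rep(1:3, each = 4), Adspend = runif(12, 0, 100))

# One model per hour, fitted with named terms
fits <- lapply(split(train, train$Hour),
               function(d) lm(Demand ~ Adspend, data = d))

# Split the validation data the same way so the list elements line up by hour
valid_by_hour <- split(valid, valid$Hour)

# Pair each model with its hour's validation rows
predicted <- Map(function(fit, newdata) {
  newdata$predicted <- predict(fit, newdata = newdata)
  newdata
}, fits, valid_by_hour)

# Combine into one data frame again
pred_df <- do.call(rbind, predicted)
print(head(pred_df))
```

This relies on both lists being ordered by the same hour levels, which split() guarantees here because both data frames contain the same set of hours.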
Another, entirely different approach would be to use lmList from the nlme package. From its documentation:
Data is partitioned according to the levels of the grouping factor g and individual lm fits are obtained for each data partition, using the model defined in object.
library(nlme)
# fit the model to each subset
FITS <- lmList(Demand ~ Adspend | Hour, data = myData)
# make the predictions
newData$predicted <- predict(FITS, newdata = newData)
(Please note that these regression models are almost certainly not the best way to analyse these data !)

Resources