RMOA Hoeffding Tree with Holdout Evaluation

I'm using the RMOA package for R to implement a Hoeffding Tree stream classifier with holdout evaluation.
Everything is training correctly, except when I try to evaluate my model from my held-out test stream I'm getting hit with the following error message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('HoeffdingTree', 'MOA_classifier', 'MOA_model')"
Having checked the answer to this question, the problem may stem from the fact that a predict() method exists in both the stats and RMOA packages. I have tried using the :: notation to specify the package, but I can't seem to point to RMOA's predict(). I also tried uninstalling stats altogether, but it hasn't helped.
Does anyone know how to point directly to RMOA's predict(), or is my issue caused by something else entirely?
My R code is below. I'm simply streaming the iris data set for just now, and extracting the first 30 stream items to use for holdout evaluation.
holdout <- function(){
  require("RMOA")
  # Initialise streams
  stream <- datastream_dataframe(iris)
  test <- stream$get_points(n=30)
  test <- datastream_dataframe(test)
  # Specify model
  mymodel <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
  # Record execution time for training
  start_time <- Sys.time()
  while(!stream$finished)
  {
    mymodel <<- trainMOA(model=mymodel, formula = Species ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=stream)
  }
  end_time <- Sys.time()
  time_taken <- end_time - start_time
  cat("Finished training. Elapsed time: ", time_taken)
  # Empty vector to store individual accuracy results of holdout stream elements
  accuracies <- c()
  # Record the execution time of holdout evaluation
  start_time <- Sys.time()
  while(!test$finished)
  {
    samp <- test$get_points(n=1)
    pred <- predict(mymodel, samp, type="response")
  }
  end_time <- Sys.time()
  time_taken <- end_time - start_time
  cat("Finished holdout evaluation. Elapsed time: ", time_taken)
}

The predict method in the RMOA package is internal; you can call it like this:
RMOA:::predict.MOA_trainedmodel
Full example:
library(RMOA)
data(iris)
stream <- datastream_dataframe(iris)
test <- stream$get_points(n = 30)
test <- datastream_dataframe(test)
mymodel <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
mymodel <- trainMOA(model = mymodel, formula = Species ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data = stream)
In my case the predict function is not masked (which would be odd if it were not exported):
pred1 <- predict(mymodel, iris, type = "response")
but if it were, I could use:
pred2 <- RMOA:::predict.MOA_trainedmodel(mymodel, iris, type = "response")
and the result would be the same:
all.equal(pred1, pred2)
#output
[1] TRUE
I checked the NAMESPACE of RMOA and the predict method appears there, but for some reason
RMOA::predict.MOA_trainedmodel
results in
Error: 'predict.MOA_trainedmodel' is not an exported object from
'namespace:RMOA'
while
RMOA:::predict.MOA_trainedmodel
does not. The likely reason is that the method is registered with S3method() rather than export(): registered S3 methods are available for dispatch (which is why predict() is not masked above), but they are not exported objects, so they can only be reached directly via :::.
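An equivalent way to reach an unexported S3 method without ::: (a small aside; getS3method() is base R, in the utils package) would be:
# getS3method() looks up a registered S3 method even when it is not exported
predict_moa <- utils::getS3method("predict", "MOA_trainedmodel")
pred3 <- predict_moa(mymodel, iris, type = "response")
all.equal(pred1, pred3)  # should also be TRUE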

Related

Avoid failure of confint.merMod on weighted models in lme4 when data object modified in calling frame

I'm facing a problem when using lme4's glmer function with weights: if the data object passed to glmer is modified afterwards, some functions such as confint no longer work on the model. Here is an example:
library(lme4)
set.seed(1)
n <- 1000
df <- data.frame(
  y = rbinom(n, 1, .5),
  w = runif(n, 0, 1) * .1 + .95,
  g = as.integer(round(runif(n, 0, 4)))
)
m <- glmer(cbind(y, 1-y) ~ (1|g), data = df, weights = w, family = binomial())
confint(m)
df$w <- df$w * 2
confint(m)
The 2nd call to confint gives this error:
Computing profile confidence intervals ...
Error in profile.merMod(object, which = parm, signames = oldNames, ...) :
Profiling over both the residual variance and
fixed effects is not numerically consistent with
profiling over the fixed effects only
It seems this has something to do with the profile function, as that function doesn't work after modifying the data frame.
The following seems to work to remove the dependency on the data object, but I am a bit uneasy, not knowing whether there might be bad side effects:
glmer2 <- function(...){
  cl <- match.call()
  df <- eval.parent(cl$data)   # evaluate the data in the caller's frame
  cl[1] <- call("glmer")       # turn this call into a call to glmer
  cl$data <- as.name("df")     # point the call at the local snapshot
  eval(cl)
}
m <- glmer2(cbind(y,1-y)~(1|g),data=df,weights=w,family=binomial())
confint(m)
df$w <- df$w*2
confint(m)
(results of confint don't change)
The reason I need something like this is that I am creating a series of models, and need to re-compute the weights between each one, and it would be quite messy to keep all of the data objects.
Why do model functions seem to rely on the data object still being present and unchanged in the calling environment? And is there a better way to solve this issue?
(R version 3.6.3 (2020-02-29), x86_64-redhat-linux-gnu, lme4_1.1-21)
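A related trick, for what it's worth (a sketch only, not verified against this exact lme4 version): building the fit with do.call() splices the evaluated arguments, including the data frame itself, into the stored call, so confint() re-evaluates a call that carries its own copy of the data.
library(lme4)
## do.call() evaluates the arguments before constructing the call, so the
## model's stored call embeds the data frame itself rather than the name `df`;
## later changes to `df` should then be invisible to confint().
## Caveat: the stored call now contains the full data, so print(m2) gets verbose.
m2 <- do.call(glmer, list(cbind(y, 1-y) ~ (1|g), data = df,
                          weights = df$w, family = binomial()))
confint(m2)
df$w <- df$w * 2   # modifying df no longer touches the stored call
confint(m2)        # should reproduce the first interval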

Scoping with formulae in coxph objects

I'm trying to write a set of functions where the first function fits a cox model (via coxph in the survival package in R), and the second function gets estimated survival for a new dataset, given the fitted model object from the first function. I'm running into some sort of scoping issue that I don't quite know how to solve without substantially re-factoring my code (the only way I could think to do it would be much less general and much harder to read).
I have a very similar set of functions based on glm that do not run into the same issue and give me the answers I would expect. I've included a short worked example below that demonstrates the issue. glue.cox and glue.glm are functions that have the basic functionality I am trying to get. glue.glm works as expected (yielding the same values as a calculation in the global environment), but glue.cox complains that it can't find the data that was used to fit the cox model and ends with an error. I suspect substitute is the way forward, but I don't understand how to use it here; I've hit a wall with experimenting.
library(survival)
data.global = data.frame(time=runif(20), x=runif(20))
newdata.global = data.frame(x=c(0,1))
f1 = Surv(time) ~ x # this is the part that messes it up!!!!! Surv gets eval
f2 = time ~ x       # plain glm formula for comparison
myfit.cox.global = coxph(f1, data=data.global)
myfit.glm.global = glm(f2, data=data.global)
myfit.glm.global2 = glm(time ~ x, data=data.global)
myfit.cox <- function(f, dat.local){
  coxph(f, data=dat.local)
}
myfit.glm <- function(f, dat.local){
  glm(f, data=dat.local)
}
mypredict.cox <- function(ft, dat.local){
  newdata = data.frame(x=c(0,1))
  tail(survfit(ft, newdata)$surv, 1)
}
mypredict.glm <- function(ft, dat.local){
  newdata = data.frame(x=c(0,1))
  predict(ft, newdata)
}
glue.cox <- function(f, dat.local){
  fit = myfit.cox(f, dat.local)
  mypredict.cox(fit, dat.local)
}
glue.glm <- function(f, dat.local){
  fit = myfit.glm(f, dat.local)
  mypredict.glm(fit, dat.local)
}
# these numbers are the goal for non-survival data
predict(myfit.glm.global, newdata = newdata.global)
0.5950440 0.4542248
glue.glm(f2, data.global)
0.5950440 0.4542248 # this works
# these numbers are the goal for survival data
tail(survfit(myfit.cox.global, newdata = newdata.global)$surv, 1)
[20,] 0.02300798 0.03106081
glue.cox(f1, data.global)
Error in eval(predvars, data, env) : object 'dat.local' not found
This appears to work, at least in the narrow sense of making glue.cox() work as desired:
myfit.cox <- function(f, dat.local){
  environment(f) <- list2env(list(dat.local = dat.local))
  coxph(f, data = dat.local)
}
The trick here is that most R modeling/model-processing functions look for data in the environment associated with the formula.
I don't know why glue.glm works without doing more digging, beyond the general observation that glm objects store more of the information needed for downstream processing internally (e.g. in the $qr element) than other model types do.
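With the patched myfit.cox() in place, the previously failing call should go through and match the numbers computed in the global environment (illustrative only, since the data are random):
glue.cox(f1, data.global)
# [20,] 0.02300798 0.03106081   (same values as the survfit() goal above)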

How to write a predict function for mlr predict to upload in AzureML as webservice?

I am trying to upload an R model to AzureML as a webservice. The model uses the mlr package in R and its predict function; the output of mlr's predict is a table of class "PredictionClassif" "Prediction". For a linear model like regression I use
PredictAction <- function(inputdata){
  predict(RegModel, inputdata, type="response")
}
This is working perfectly fine in Azure.
When I use the mlr package for classification with predict type probability, I have to write the predict function as:
PredictAction <- function(inputdata){
  require(mlr)
  predict(randomForest, newdata=inputdata)
}
When calling the function
publishWebService(ws, fun, name, inputSchema)
it produces an error:
converting `inputSchema` to data frame
Error in convertArgsToAMLschema(lapply(x, class)) :
Error: data type "table" not supported
As the predict function produces a table that I don't know how to convert or modify, I supply the output schema:
publishWebService(ws, fun, name, inputSchema,outputschema)
I am not sure how to specify outputschema (it is a list); see https://cran.r-project.org/web/packages/AzureML/AzureML.pdf.
The predict function from mlr produces output of class
class(pred_randomForest)
"PredictionClassif" "Prediction"
and the data slot is a data frame:
class(pred_randomForest$data)
"data.frame"
I am seeking help with the syntax for outputschema in the publishWebService function, or whether I have to add any other arguments. I am not sure where the issue lies: whether AzureML can't read the wrapped model, or whether the predict function of mlr is executed properly in AzureML.
I am getting the following error in AzureML:
Execute R Script Piped (RPackage) : The following error occurred during evaluation of R script: R_tryEval: return error: Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "c('FilterModel', 'BaseWrapperModel', 'WrappedModel')"
Here is an example of using the XGBoost library in R:
library("xgboost") # the main algorithm
##Load the Azure workspace. You can find the ID and the pass in your workspace
ws <- workspace(
  id = "Your workspace ID",
  auth = "Your Auth Pass"
)
##Download the dataset
dataset <- download.datasets(ws, name = "Breast cancer data", quote="\"")
## split the dataset to get train and score data
## 75% of the sample size
smp_size <- floor(0.75 * nrow(dataset))
## set the seed to make your partition reproducible
set.seed(123)
## get index to split the dataset
train_ind <- sample(seq_len(nrow(dataset)), size = smp_size)
##Split train and test data
train_dataset <- dataset[train_ind, ]
test_dataset <- dataset[-train_ind, ]
#Get the features columns
features<-train_dataset[ , ! colnames(train_dataset) %in% c("Class") ]
#get the label column
labelCol <-train_dataset[,c("Class")]
#convert to data matrix
test_gboost<-data.matrix(test_dataset)
train_gboost<-data.matrix(train_dataset)
#train model
bst <- xgboost(data = train_gboost, label = train_dataset$Class, max.depth = 2, eta = 1,
               nround = 2, objective = "binary:logistic")
#predict the model
pred <- predict(bst, test_gboost)
#Score model
test_dataset$Scorelabel<-pred
test_dataset$Scoreclasses<- as.factor(as.numeric(pred >= 0.5))
# Create scoring function
predict_xgboost <- function(new_data){
  predictions <- predict(bst, data.matrix(new_data))
  output <- data.frame(new_data, ScoredLabels = predictions)
  output
}
#Publish the score function
api <- publishWebService(
  ws,
  fun = predict_xgboost,
  name = "xgboost classification",
  inputSchema = as.data.frame(as.table(train_gboost)),
  data.frame = TRUE)
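Coming back to the mlr case in the question: publishWebService() choked on the "PredictionClassif" table, so one hedged fix (a sketch, untested in AzureML; randomForest is the trained mlr model from the question) is to return the prediction's $data slot, which the question shows is an ordinary data.frame:
PredictAction <- function(inputdata){
  require(mlr)
  pred <- predict(randomForest, newdata = inputdata)
  pred$data   # the $data slot of an mlr Prediction is a plain data.frame
}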

Getting an error when bootstrapping to test a predictive model

library(boot)  # for boot()

rsq <- function(formula, Data1, indices) {
  d <- Data1[indices,] # allows boot to select sample
  fit <- lm(formula, Data1=d)
  return(summary(fit)$r.square)
}
results = boot(data = Data1, statistic = rsq, R = 500)
When I execute the code, I get the following error:
Error in Data1[indices,] : incorrect number of dimensions
Background info: I am creating a predictive model using linear regression. I would like to test my predictive model, and through some research I decided to use the bootstrapping method.
Credit goes to @Rui Barradas; check the comments on the original post.
If you read the help page for the function boot::boot, you will see that the function it calls takes data as its first argument, then indices, then others. So change the order of your function definition to rsq <- function(Data1, indices, formula).
Another problem that I had was that I hadn't defined the function.
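Putting the advice together, a corrected sketch might look like this (the formula y ~ x and the shape of Data1 are placeholders; boot() forwards extra named arguments to the statistic):
library(boot)

## boot() calls statistic(data, indices, ...), so data and indices must be
## the first two parameters; the formula is forwarded through `...`.
rsq <- function(Data1, indices, formula) {
  d <- Data1[indices, ]            # boot selects the bootstrap sample
  fit <- lm(formula, data = d)     # note: data =, not Data1 =
  summary(fit)$r.square
}

results <- boot(data = Data1, statistic = rsq, R = 500, formula = y ~ x)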

Setting random seeds does not affect the classification methods C5.0 and ctree

I want to compare two different classification methods, namely ctree and C5.0 from the libraries party and C50 respectively; the comparison is to test their sensitivity to the initial starting points. The test should be carried out 30 times; each time the number of wrongly classified items is calculated and stored in a vector, and then, using a t-test, I hope to see whether they are really different or not.
library("foreign"); # for read.arff
library("party")    # for ctree
library("C50")      # for C5.0
trainTestSplit <- function(data, trainPercentage){
  newData <- list();
  all <- nrow(data);
  splitPoint <- floor(all * trainPercentage);
  newData$train <- data[1:splitPoint, ];
  newData$test <- data[splitPoint:all, ];
  return (newData);
}
ctreeErrorCount <- function(st,ss){
  set.seed(ss);
  model <- ctree(Class ~ ., data=st$train);
  class <- st$test$Class;
  st$test$Class <- NULL;
  pre = predict(model, newdata=st$test, type="response");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
C50ErrorCount <- function(st,ss){
  model <- C5.0(Class ~ ., data=st$train, seed=ss);
  class <- st$test$Class;
  pre = predict(model, newdata=st$test, type="class");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
compare <- function(n = 30){
  data <- read.arff(file.choose());
  set.seed(100);
  errors = list(ctree = c(), c50 = c());
  seeds <- floor(abs(rnorm(n) * 10000));
  for(i in 1:n){
    splitData <- trainTestSplit(data, 0.66);
    errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
    errors$c50[i] <- C50ErrorCount(splitData, seeds[i]);
  }
  cat("\n\n");
  cat("============= ctree Vs C5.0 =================\n");
  cat(paste(errors$ctree, " ", errors$c50, "\n"))
  tt <- t.test(errors$ctree, errors$c50);
  print(tt);
}
The program shown is supposed to do the comparison, but because the number of errors does not change in the vectors, the t.test function produces an error. I used iris from within R (but renaming class to Class) and the Winchester breast cancer data, which can be downloaded here, to test it, but any data can be used as long as it has a Class attribute.
The problem I run into is that the results of both methods remain constant and do not change while I am changing the random seed. In theory, as described in their documentation, both functions use random seeds: ctree uses set.seed(x) while C5.0 uses an argument called seed. Unfortunately I cannot observe any effect.
Could you please tell me how to control the initial conditions of these functions?
ctree only depends on a random seed in the case where you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11).
As regards C5.0 trees, the seed is used this way:
ctrl = C5.0Control(sample=0.5, seed=ss);
model <- C5.0(Class ~ ., data=st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5).
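Based on that, a hedged sketch of how the question's ctreeErrorCount could be made seed-sensitive (assuming the party package; mtry = 2 is an arbitrary choice):
ctreeErrorCount <- function(st, ss){
  set.seed(ss);
  # mtry > 0 makes ctree consider a random subset of inputs at each split,
  # so the seed actually influences the fitted tree
  ctrl <- ctree_control(mtry = 2);
  model <- ctree(Class ~ ., data=st$train, controls = ctrl);
  class <- st$test$Class;
  st$test$Class <- NULL;
  pre <- predict(model, newdata=st$test, type="response");
  return(length(which(class != pre))); # number of misclassified items
}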
