How to run predict.boosting for new data? - r

I am trying to use predict.boosting from the adabag package on new data. I can't find a way to use it (or any other function from that package) on data without labels.
I am trying:
pr <- predict.boosting(modelfit, test[,2:ncol(test)])
It gives:
Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
undefined columns selected
However, if I include labels:
pr <- predict.boosting(modelfit, test)
it works just fine. But there has to be a way to use it as a predictive model for data without labels.
Thanks for any help!
EDIT
Example from package:
library(rusboost)
library(rpart)
data(iris)
# make it an unbalanced dataset by removing most of the setosa observations
df <- iris[41:150,]
# create binary variable
df$Setosa <- factor(ifelse(df$Species == "setosa", "setosa", "notsetosa"))
# create index of negative examples
idx <- df$Setosa == "notsetosa"
# run model
test.rusboost <- rusb(Setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = df, boot = F, iters = 20, sampleFraction = .1, idx = idx)
# works: df still contains the label column Setosa
predict.boosting(test.rusboost, df)
# fails: the label column is missing
predict.boosting(test.rusboost, df[,1:4])

You should check that every column in train (the set you used to train the model) is present in test, with the same name.
Please check:
all(colnames(train) %in% colnames(test))
If it's FALSE, you will need to check how you built train and test.
If it's TRUE, and in general, please provide a reproducible example.
Edit:
A nice way to check that the columns are the same and contain the same factor levels is to use sameShape from the dataPreparation package. If that is not the case, it will add the missing levels and columns (and warn you).
To use it:
library(dataPreparation)
test <- sameShape(test, train)

I came up with a workaround: I attached a column with the same name as the labels to my newdata and filled it with random factor levels.
df$Setosa <- factor(sample( c("setosa", "notsetosa"), nrow(df), replace=TRUE, prob=c(0.5, 0.5) ))
Then it works just fine.
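The workaround above can be wrapped in a small helper. This is only a sketch: add_placeholder_response is a hypothetical name, not part of adabag or rusboost, and it assumes (as the error message suggests) that predict.boosting only needs a column named after the left-hand side of the formula to exist. A constant placeholder is used instead of random levels; either way the confusion matrix and error in the result carry no meaning.

```r
# Hypothetical helper (not part of adabag): add a placeholder response column
# so that the column named on the left of the formula exists in newdata.
add_placeholder_response <- function(newdata, formula, class_levels) {
  response <- as.character(formula[[2]])
  if (!response %in% colnames(newdata)) {
    newdata[[response]] <- factor(rep(class_levels[1], nrow(newdata)),
                                  levels = class_levels)
  }
  newdata
}

# Unlabeled data: only the four measurement columns
unlabeled <- iris[41:150, 1:4]
f <- Setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
padded <- add_placeholder_response(unlabeled, f, c("notsetosa", "setosa"))
str(padded$Setosa)

# With a fitted model such as test.rusboost (not run here):
# pr <- predict.boosting(test.rusboost, padded)
# pr$class  # predicted labels; ignore pr$confusion and pr$error
```

Only pr$class, pr$votes and pr$prob are worth reading in that case, because the placeholder labels are not real observations.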

Related

Why is there an error when I run ggttest?

When I run the t-test for a numeric and a dichotomous variable there is no problem and I can see the results. The problem appears when I run ggttest on the same t-test: there is an error saying that one of my variables is not found. I do not know why that happens. The aml dataset I used is from the boot package. Below you can see the code:
https://i.stack.imgur.com/7kuaA.png
library(gginference)
time_group.test16537 = t.test(formula = time~group,
data = aml,
alternative = "two.sided",
paired = FALSE,
var.equal = FALSE,
conf.level = 0.95)
time_group.test16537
ggttest(time_group.test16537,
colaccept="lightsteelblue1",
colreject="gray84",
colstat="navyblue")
The problem comes with these lines of code in ggttest:
datnames <- strsplit(t$data.name, splitter)
len1 <- length(eval(parse(text = datnames[[1]][1])))
len2 <- length(eval(parse(text = datnames[[1]][2])))
It tries to find the length of group and time by evaluating those names, but it doesn't see that they came from a data.frame. Pretty bad bug...
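The failure can be reproduced with base R alone. The sketch below uses the built-in sleep data in place of aml, and assumes the splitter is " by " (the separator t.test puts in data.name for a formula call); the actual value inside ggttest may be set differently.

```r
# Built-in sleep data stands in for aml
res <- t.test(extra ~ group, data = sleep)
res$data.name                      # "extra by group"

# The htest object keeps only the names, not the data itself
"data" %in% names(res)

parts <- strsplit(res$data.name, " by ")[[1]]

# Evaluating the bare name fails, because 'extra' only exists inside sleep
lookup <- tryCatch(eval(parse(text = parts[1])),
                   error = function(e) conditionMessage(e))
lookup

# Evaluating it inside the data frame works
n_inside <- length(eval(parse(text = parts[1]), envir = sleep))
n_inside
```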
In your case, each group presumably has fewer than 30 observations, so a t-distribution is plotted, and you can call the internal plotting function directly:
library(gginference)
library(boot)
gginference:::normt(t.test(time~group,data=aml),
colaccept = "lightsteelblue1",colreject = "grey84",
colstat = "navyblue")
t.test doesn't store your data in its output, so there is no way to extract the data from the list that t.test returns.
The only way to use a formula is to reference the columns directly:
library(gginference)
t_test <- t.test(questionnaire$pulse ~ questionnaire$gender)
ggttest(t_test)
Original answer here: How to extract the dataset from an "htest" object when using formula in r

Column changes from "WinorLoss" to "Class"

I am building a logistic model in R (I am a beginner in R and am following a tutorial on building logistic models). I have done the following and everything works, but when I run the downSample function, the column named "WinorLoss" is for some reason renamed to "Class", and I am sure this will cause issues with everything that follows.
Could anyone please let me know whether what I am doing makes sense, or whether I am making big errors?
my_data <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE)
my_data
str(my_data)
library(mlbench)
glm(Map ~ WinorLoss, family="binomial", data=my_data)
table(my_data$Map)
table(my_data$WinorLoss)
my_data$WinorLoss <- ifelse(my_data$WinorLoss == "W", 1,0)
my_data$WinorLoss <- factor(my_data$WinorLoss, levels = c(0,1))
my_data
table(my_data$WinorLoss)
library(caret)
'%ni%' <- Negate('%in%')
options(scipen=999)
set.seed(100)
trainDataIndex <- createDataPartition(my_data$WinorLoss, p=0.7, list=F)
trainData <- my_data[trainDataIndex, ]
testData <- my_data[-trainDataIndex, ]
trainData
testData
table(trainData$WinorLoss)
table(testData$WinorLoss)
set.seed(100)
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
y = trainData$WinorLoss)
down_train
When I print trainData, the columns returned are Date, Event, opponent, Map, Score, WinorLoss, winner, but after I run downSample the columns become Date, Event, opponent, Map, Score, winner, Class.
Help please!
Yep, downSample and some of the other caret functions do that by default, unless specified otherwise.
If you have a question about a particular function, try the package manual first:
?downSample
If you do this you will see all of the arguments:
downSample(x, y, list = FALSE, yname = "Class")
So by default the function names the outcome column "Class" (the default value of yname), which is what you are seeing.
Thus to get your desired output:
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
y = trainData$WinorLoss,
yname = "WinorLoss")

library(e1071), tune Variable lengths differ

I have been attempting to utilize the iris dataset and although I've gotten svm to work from the e1071 library, I keep getting a 'variable lengths differ' error when I attempt to make tune work:
library(e1071)
data <- data.frame(iris$Sepal.Width,iris$Petal.Length,iris$Species)
svm_tr <- data[sample(nrow(data), 100), ] #sample 100 random rows
tuned <- tune(svm, svm_tr$iris.Species~.,
data = svm_tr[1:2],
kernel = "linear",
ranges = list(cost=c(.001,.01,.1,1,10,100)))
I have checked the lengths of each of the columns in svm_tr[1:2] and they are the same length. I know the function doesn't take a dataframe directly but maybe I'm missing something?
I can get it to work with:
tune(svm, iris.Species ~ ., data = svm_tr[1:3],
kernel = "linear", ranges = list(cost=c(.001,.01,.1,1,10,100)))
With a formula interface you shouldn't refer to a variable using $, because all the required variables are taken from the object given in the data= argument. Note that I've also used data=svm_tr[1:3] instead of 1:2 so that the iris.Species column is included.
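The same error can be reproduced without e1071. The sketch below uses lm and made-up data (d, half are illustrative names), on the assumption that tune refits the model on subsets of data during cross-validation, so a response referenced with $ keeps its full length while the predictors shrink.

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))
half <- d[1:50, ]   # a subset, such as one cross-validation fold

# d$y always has 100 values, while x is taken from the 50-row subset
msg <- tryCatch(
  {
    lm(d$y ~ x, data = half)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# With bare column names, both variables come from the data argument
fit <- lm(y ~ x, data = half)
nobs(fit)
```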

MXnet odd error

This is my first ANN so I imagine that there might be a lot of things done wrong here. I don't follow
I'm trying to predict the species of flowers from the iris data set provided in R, but I get the following error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
invalid 'dimnames' given for data frame
My code:
require(mxnet)
train <- iris[1:130,]
test <- iris[131:150,]
train.data <- as.data.frame(train[-5])
train.label <- data.frame(model.matrix(data=train,object =~Species-1))
test.data <- as.data.frame(test[-5])
test.label <- data.frame(model.matrix(data=test,object =~Species-1))
var1 <- mx.symbol.Variable("data")
layer0 <- mx.symbol.FullyConnected(var1, num.hidden=3)
cat.out <- mx.symbol.SoftmaxOutput(layer0)
net.model <- mx.model.FeedForward.create(cat.out,
array.layout = "auto",
X=train.data,
y=train.label,
eval.data = list(data=test.data,label=test.label),
num.round = 20,
array.batch.size = 20,
learning.rate=0.1,
momentum=0.9,
eval.metric = mx.metric.accuracy)
UPDATE:
I managed to get rid of this error by specifying which column to use in the labels (train.label[,1] and test.label[,1]).
However, now I'm training my net to predict just one of my binary variables, while I have 3 (one for each species).
I had the same problem, turned out that:
train.data should be a matrix
train.label should be a numeric vector
Check these two and hopefully it should work.
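For the iris case in the question, the two conversions could look like the sketch below. It uses base R only, and the zero-based class coding is an assumption about what the softmax output expects, so check it against the mxnet documentation. A single numeric label vector with three distinct values also replaces the three one-hot columns.

```r
train <- iris[1:130, ]
test <- iris[131:150, ]

# Features as a numeric matrix instead of a data frame
train.x <- data.matrix(train[-5])
test.x <- data.matrix(test[-5])

# One numeric label vector with a code per species
# (assumed zero-based: 0 = setosa, 1 = versicolor, 2 = virginica)
train.y <- as.numeric(train$Species) - 1
test.y <- as.numeric(test$Species) - 1

dim(train.x)
table(train.y)
```

These objects would then be passed as X = train.x and y = train.y in place of train.data and train.label.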
I had a similar problem but during the prediction step. It turns out that my features were in a Data Frame which was causing the issue. Once I converted the data frame into a matrix, the issue went away.
pred.values = stats::predict(model,as.matrix(features))
instead of
pred.values = stats::predict(model,features)
So, the features need to be a matrix both during training and during the process of making predictions.

R. How to apply sapply() to random forest

I need to use a batch of models from the randomForest package. I decided to store them in a list, list.of.models, but now I don't know how to apply them. I append to the list using
list.of.models <- append(list.of.models, randomForest(data, as.factor(label)))
and then tried to use
sapply(list.of.models[length(list.of.models)], predict, data, type = "prob")
to call the last one, but the problem is that randomForest returns a list of many values, not a learner.
How can I add an RF model to a list and then call it? For example, let's take this source code:
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
With append you're extending your model list with the elements inside the RF model. The result is not recognized by predict.randomForest etc., because the outer container list and its attribute class="randomForest" are lost. Also, the data input is named newdata in predict.randomForest.
This should work:
set.seed(1234)
library(randomForest)
data(iris)
test = sample(150,25)
#create 3 models
RF.models = lapply(1:3,function(mtry) {
randomForest(formula=Species~.,data=iris[-test,],mtry=mtry)
})
#append extra model
RF.models[[length(RF.models)+1]] = randomForest(formula=Species~.,data=iris[-test,],mtry=4)
summary(RF.models)
#predict all models
sapply(RF.models,predict,newdata=iris[test,])
#predict one model
predict(RF.models[[length(RF.models)]],newdata=iris[test,])
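The flattening caused by append can be seen with base R alone. The sketch below uses an lm fit on the built-in cars data as a stand-in for a randomForest object, since both are lists with a class attribute; wrapping the model in list() before appending keeps it intact.

```r
fit <- lm(dist ~ speed, data = cars)

# append() concatenates the model's internal elements into the list
flattened <- append(list(), fit)
length(flattened) == length(fit)   # one entry per component of the model
class(flattened)                   # plain "list", the "lm" class is gone

# Wrapping the model in list() keeps it as a single element
wrapped <- append(list(), list(fit))
length(wrapped)
inherits(wrapped[[1]], "lm")

# The stored model can be used for prediction as usual
predict(wrapped[[1]], newdata = data.frame(speed = 10))
```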
