Predicting outcome with a model in R

I am trying to do a simple prediction using linear regression.
I have a data.frame where some of the items are missing a price (and are therefore recorded as NA).
This apparently doesn't work:
#Simple LR
fit <- lm(Price ~ Par1 + Par2 + Par3, data=combi[!is.na(combi$Price),])
Prediction <- predict(fit, data=combi[is.na(combi$Price),], OOB=TRUE, type = "response")
What should I put instead of data=combi[is.na(combi$Price),]?

Change data to newdata. Look at ?predict.lm to see which arguments predict accepts for an lm fit; arguments it does not recognize are silently ignored. So in your case data (and OOB) are ignored, and the default is to return predictions on the training data.
Prediction <- predict(fit, newdata = combi[is.na(combi$Price),])
identical(predict(fit), predict(fit, data = combi[is.na(combi$Price),]))
## [1] TRUE
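A minimal sketch of the difference, using the built-in mtcars data as a hypothetical stand-in for combi (the columns mpg, wt, hp and disp and the blanked-out rows are assumptions for illustration, not the asker's data):

```r
# Hypothetical stand-in for `combi`: mtcars with a few mpg values set to NA.
combi <- mtcars
combi$mpg[c(3, 7, 11)] <- NA

known   <- combi[!is.na(combi$mpg), ]   # rows used to fit the model
unknown <- combi[is.na(combi$mpg), ]    # rows whose outcome we want to predict

fit <- lm(mpg ~ wt + hp + disp, data = known)

wrong <- predict(fit, data = unknown)     # `data` is ignored: fitted values for `known`
right <- predict(fit, newdata = unknown)  # one prediction per row of `unknown`

length(wrong)   # equals nrow(known)
length(right)   # equals nrow(unknown)

# Fill the missing outcomes with the predictions.
combi$mpg[is.na(combi$mpg)] <- right
```

Comparing length(Prediction) with the number of rows passed in is a cheap check that the new data was actually used.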

Related

difference between data and newdata arguments when predicting

When predicting in R with the predict() function, the data to predict on is passed as newdata = . My question is: what happens when I write data = instead of newdata = ? It doesn't give an error, but the RMSE obtained is not the same as with newdata = .
Here is an example:
library(MASS)
set.seed(18)
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx,]
Boston_test = Boston[-Boston_idx,]
library(rpart)
Boston_tree<-rpart(medv~., data=Boston_train)
tree.pred <- predict(Boston_tree, data=Boston_test)
tree.pred2 <- predict(Boston_tree, newdata=Boston_test)
rmse = function(m, o){
  sqrt(mean((m - o)^2))
}
rmse(tree.pred,Boston_test$medv)
rmse(tree.pred2,Boston_test$medv)
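data is the data used for fitting the model; newdata is the data used for prediction. predict() has no data argument for rpart fits, so data=Boston_test is ignored and tree.pred holds the fitted values for Boston_train. The help page of ?predict.rpart says: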
newdata: data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata. If missing, the fitted values are returned.
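One reason the mistake goes unnoticed in the question's example is that nrow(Boston) is 506, so both halves have 253 rows and the fitted values happen to match Boston_test$medv in length. A sketch with an uneven split (the training size of 300 is an arbitrary choice for illustration) makes the difference visible:

```r
library(MASS)   # Boston data
library(rpart)

set.seed(18)
idx   <- sample(nrow(Boston), 300)   # uneven split: 300 training rows, 206 test rows
train <- Boston[idx, ]
test  <- Boston[-idx, ]

tree <- rpart(medv ~ ., data = train)

p_data <- predict(tree, data = test)      # `data` is ignored: fitted values for `train`
p_new  <- predict(tree, newdata = test)   # predictions for `test`

length(p_data)   # number of training rows
length(p_new)    # number of test rows

# A guard like this fails loudly if predictions and test rows do not line up.
stopifnot(length(p_new) == nrow(test))
```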

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata, and then mgcv looking for random.x and finding it in the global environment. You should really gather those variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.
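Putting both pieces of advice together might look like the sketch below. It is not the asker's exact setup: the data frame dat, the factor name site, the smaller N and the 20 grouping levels are assumptions chosen to keep the example quick.

```r
library(mgcv)

set.seed(1)
N <- 2000
# Keep every model variable in one data frame and fit with `data =`.
dat <- data.frame(
  x    = runif(N),
  site = factor(sample(letters[1:20], N, replace = TRUE))  # hypothetical random-effect factor
)
dat$y <- 0.5 * dat$x / (dat$x + 0.2) + rnorm(N, sd = 0.1)

fit <- bam(y ~ s(x) + s(site, bs = "re"), data = dat, discrete = TRUE)

# newdata carries a placeholder level for the random effect ...
newdat <- data.frame(
  x    = runif(200),
  site = factor(levels(dat$site)[1], levels = levels(dat$site))
)

# ... whose contribution is dropped again through `exclude`.
pred <- predict(fit, newdata = newdat, exclude = "s(site)")
length(pred)   # one prediction per row of newdat
```

Because nothing named site or x lives in the global environment, a forgotten column in newdat raises an error instead of silently picking up a stray object.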

Am I not using rpart() right for classification?

I'm trying to do an example classification prediction with rpart() but for whatever reason it doesn't seem to be giving me the right predictions when I pass test data into a fitted tree.
library(rpart)
data.samples <- sample(1:nrow(cu.summary), nrow(cu.summary) * 0.7, replace = FALSE)
training.data <- cu.summary[data.samples, ]
test.data <- cu.summary[-data.samples, ]
fit <- rpart(
  Type ~ Price + Country + Reliability + Mileage,
  method = "class",
  data = training.data
)
fit.pruned<- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
prediction <- predict(fit.pruned, test.data)
prediction
#table(prediction, test.data$Type)
This seems to give me everything except the classes I was trying to predict in the first place. Am I using a particular syntax wrong somewhere?
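You must specify the type of prediction. For a classification tree, predict() returns a matrix of class probabilities by default; type="class" returns the predicted classes: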
predict(fit.pruned, test.data, type="class")
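A short sketch contrasting the two return types on the same cu.summary data (the seed is an arbitrary assumption added for reproducibility, and pruning is left out to keep it brief):

```r
library(rpart)   # also provides the cu.summary data

set.seed(42)     # arbitrary seed, only for reproducibility
idx <- sample(nrow(cu.summary), floor(0.7 * nrow(cu.summary)))
training.data <- cu.summary[idx, ]
test.data     <- cu.summary[-idx, ]

fit <- rpart(Type ~ Price + Country + Reliability + Mileage,
             method = "class", data = training.data)

probs   <- predict(fit, test.data)                    # default: matrix of class probabilities
classes <- predict(fit, test.data, type = "class")    # factor of predicted classes

dim(probs)       # one row per test case, one column per class

# Confusion table of predicted against observed classes.
table(predicted = classes, observed = test.data$Type)
```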

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish predictive performance (AUC ROC) for a glmer model. When I try and use the predict() function on a test data set, the output for this function is the length of my train data set.
folds = 10;
glmerperf=rep(0,folds); glmperf=glmerperf;
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[,c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
  testdata = TB_Train.glmer.cns[foldsamples[[n]],]
  traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
  GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
  glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
  glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
  glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects are the length of the traindata object, despite specifying the newdata argument. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several errors about needing to scale my data (which I already have) and that the model fails to converge. Any ideas on how to fix this? I have already bumped up the iterations and selected a new optimizer.
I figured out that the error arose from using the "." shortcut to specify all predictors in the model formula.
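The answer does not show the replacement formula. One possible way to avoid the "." shortcut, not necessarily what the poster did, is to build the formula from the column names with reformulate() from base R. The toy data frame and the predictor names x1 and x2 below are hypothetical:

```r
# Toy stand-in for `traindata`; x1 and x2 are hypothetical predictor columns.
traindata <- data.frame(
  Case.Status = factor(c("case", "control")),
  IDNO        = c(1, 2),
  x1          = c(0.1, 0.2),
  x2          = c(1, 0)
)

# Every column except the response and the grouping variable is a fixed effect.
predictors <- setdiff(names(traindata), c("Case.Status", "IDNO"))

# Spell the terms out explicitly and append the random intercept.
model_formula <- reformulate(c(predictors, "(1 | IDNO)"), response = "Case.Status")
model_formula

# The formula would then replace `Case.Status ~ . + (1 | IDNO)`, for example:
# lme4::glmer(model_formula, data = traindata, family = "binomial")
```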

Predict outcome in R

I have been using the predict function in R to predict the outcomes of a randomForest model for a testing set, when it suddenly started returning only the predicted levels instead of the probabilities. I specified the type as response, but it still returns factors. What could possibly cause this?
The data consists of 23 variables, 20 of which are factors (unordered) and two of which are numeric. I am trying to predict whether a product will sell or not (0 or 1). Here is the code for the prediction:
library(randomForest)
rf = randomForest(sold ~., data = train, ntree=200, nodesize=25)
prf <- predict(rf, newdata = test, type ="response")
Set type="prob":
data(iris)
library(randomForest)
set.seed(1234)
train.key = sort(sample(1:dim(iris)[1],100))
iris.train = iris[train.key,]
iris.test = iris[-train.key,]
rf = randomForest(Species ~., data = iris.train)
predicted.prob = predict(rf, newdata=iris.test, type="prob")
