difference between data and newdata arguments when predicting - r

When predicting in R with the predict() function, the argument for the data on which we want to predict is newdata =. My question is: what happens when we pass data = instead of newdata =? It doesn't give an error, yet the RMSE obtained is not the same as with newdata =.
Here is an example:
library(MASS)
set.seed(18)
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx,]
Boston_test = Boston[-Boston_idx,]
library(rpart)
Boston_tree <- rpart(medv ~ ., data = Boston_train)
tree.pred  <- predict(Boston_tree, data = Boston_test)
tree.pred2 <- predict(Boston_tree, newdata = Boston_test)
rmse <- function(m, o) {
  sqrt(mean((m - o)^2))
}
rmse(tree.pred,Boston_test$medv)
rmse(tree.pred2,Boston_test$medv)

data is the argument used when fitting the model; newdata is the argument used for prediction. predict.rpart has no data argument at all, so data = Boston_test is matched to ... and silently ignored, and because newdata is missing you simply get back the fitted values for the training data. The help page ?predict.rpart says:
newdata: data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata. If missing, the fitted values are returned.
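As a quick check, here is a minimal sketch (reusing the objects from the example above) confirming that the data = call just returns the fitted values:
# data = Boston_test is matched to ... and silently ignored, so tree.pred is
# just the fitted values on the training data; tree.pred2 uses the test set
identical(tree.pred, predict(Boston_tree))   # expected: TRUE
identical(tree.pred, tree.pred2)             # expected: FALSE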

Related

Random Forest prediction error in R "No forest component in the object"

I am attempting to use a random forest regressor to predict "area_pct" over a raster stack, but an error prevents the prediction. Have I not trained the model properly?
d100 is my dataset, with predictor variables d100[,4:ncol(d100)] and response variable d100["area_pct"].
#change NA values to zero
d100[is.na(d100)] <- 0
set.seed(100)
#split dataset into training (70%) and testing (30%)
id <- sample(2, nrow(d100), replace = TRUE, prob = c(0.7, 0.3))
train_100 <- d100[id == 1, ]
test_100  <- d100[id == 2, ]
I train the random forest model with the randomForest package; this appears to work fine:
library(randomForest)
final_CC_rf_20 = randomForest(x = train_100[, 4:ncol(train_100)], y = train_100$area_pct,
                              xtest = test_100[, 4:ncol(test_100)], ytest = test_100$area_pct,
                              mtry = 14, importance = TRUE, ntree = 600)
Then I try to predict a raster.
A new raster stack with the predictor variables:
library(raster)
sentinel_2_20 <- stack( paste(getwd(), "Sentinel_SR_clip_20.tif", sep="/") )
area_classified_20_2018 <- predict(object = final_CC_rf_20, newdata = sentinel_2_20,
                                   type = 'response', progress = 'window')
but an error pops up:
#Error in predict.randomForest(object = final_CC_rf_20, newdata = sentinel_2_20, :
# No forest component in the object
Any help would be extremely useful.
The arguments you are using for predict (with raster data) are not correct. In raster::predict the first argument, object, should be the raster data and the second argument, model, should be the fitted model; there is no newdata argument.
Another problem is that you are effectively using keep.forest=FALSE, which is the default when xtest is not NULL, so no forest is stored in the fitted object (hence the "No forest component" error). You could set keep.forest=TRUE, but that is generally not a good approach: you should fit your model with all the data before making a prediction (at that point you are no longer evaluating your model). I would therefore suggest fitting your model without xtest, like this:
rfmod <- randomForest(x = d100[, 4:ncol(d100)], y = d100$area_pct,
                      mtry = 14, importance = TRUE, ntree = 600)
And then do
p <- predict(sentinel_2_20, rfmod, type='response')
See ?raster::predict or ?terra::predict for working examples.
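For completeness, a minimal sketch of the same workflow with terra, assuming the bands of the Sentinel raster correspond, in order, to the predictor columns of d100 (rename the layers if they do not):
library(randomForest)
library(terra)
# fit on all the data (no xtest), so the forest is kept in the object
rfmod <- randomForest(x = d100[, 4:ncol(d100)], y = d100$area_pct,
                      mtry = 14, importance = TRUE, ntree = 600)
r <- rast("Sentinel_SR_clip_20.tif")        # SpatRaster with the predictor layers
names(r) <- names(d100)[4:ncol(d100)]       # align layer names with the predictor names
p <- predict(r, rfmod, type = "response")   # terra::predict: raster first, then the model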

How do we make a model in r using more than one row

Below is my R code to create a model predicting the prices of diamonds from the diamonds dataset. I am not able to create the model when I log-transform each variable; without the log transform I get a horrible model with wildly incorrect predicted prices. I am also pasting the error and a link to the dataset for reference.
The error is given below:
> mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'x'
A link to the dataset: https://www.kaggle.com/shivam2503/diamonds
Below is the code:
setwd ("C:/akash/study videos/virginia")
akash = read.csv("diamonds.csv")
#summary(akash)
ind = sample(2, nrow(akash),replace = TRUE , prob = c(0.8,0.2))
train = akash[ind==1,]
test = akash[ind==2,]
mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
summary(mod)
predicted = predict(mod,newdata = test)
mon = round(exp(predicted),0)
head(mon)
#head(test)
#View(akash)
Your model fails because the minimum value of your variables x, y and z is 0, so when you log-transform them you obtain -Inf:
lapply(c("x","y","z"),function(x)summary(log(diamonds[[x]])))
You can log-transform just the outcome, drop the rows with zero values before transforming, or simply change the model.
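For instance, a minimal sketch of the second option, dropping the rows where x, y or z is zero before fitting (using ggplot2's diamonds copy of the data):
library(ggplot2)   # for the diamonds dataset
data("diamonds")
d <- subset(diamonds, x > 0 & y > 0 & z > 0)   # drop rows with zero dimensions
mod <- lm(log(price) ~ log(carat) + log(x) + log(y) + log(z), data = d)
summary(mod)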
Just as an example, here I compare the RMSE of an lm with no transformation, an lm with a log(price) transformation, and a simple random forest model from the ranger package. I'm using caret to get the same modelling interface for all three (by default caret::train performs 25 bootstrap resamples to choose the best parameters for the given model, so in this example only the random forest has tuning parameters).
library(ggplot2)#for "diamonds" dataset
data("diamonds")
set.seed(5)
ind = sample(2, nrow(diamonds),replace = TRUE , prob = c(0.8,0.2))
train = diamonds[ind==1,]
test = diamonds[ind==2,]
library(caret)
rf <- train(price~carat+x+y+z,data=train,method="ranger")
lm <- train(price~carat+x+y+z,data=train,method="lm")
lm_log <- train(log(price)~carat+x+y+z,data=train,method="lm")
RMSE(predict(rf,test),test$price)/mean(test$price)*100
RMSE(predict(lm,test),test$price)/mean(test$price)*100
RMSE(exp(predict(lm_log,test)),test$price)/mean(test$price)*100
which gives me:
[1] 35.73012
[1] 40.2437
[1] 45.92143

predict function with lasso regression

I am trying to implement lasso regression for my sales prediction problem. I am using the glmnet package and the cv.glmnet function to train the model.
library(glmnet)
set.seed(123)
model = cv.glmnet(x = as.matrix(train[, -which(names(train) %in% "Sales")]),
                  y = train$Sales,
                  alpha = 1,
                  lambda = 10^seq(4, -1, -0.1))
best_lambda = model$lambda.min
lasso_predictions_valid <- predict(model, s = best_lambda, type = "coefficients")
After reading a few articles about implementing lasso regression, I still don't know how to supply the test data on which I want to make predictions. There is a newx argument to the predict function that I also don't understand. In most regression functions there is a newdata or data argument into which we pass our test data.
I think there is an error in your lasso_predictions_valid: you shouldn't pass valid$Sales as newx, as that is the actual sales values, not the predictors.
Once you have fitted the model on the training set, newx should be a matrix of the predictor values you want predictions for; in this case that will be your validation set.
Looking at your example code above, I think your predict line should be something like:
lasso_predictions_valid <- predict(model, s = best_lambda,
                                   newx = as.matrix(valid[, -which(names(valid) %in% "Sales")]),
                                   type = "response")   # "response" returns predictions; "coefficients" ignores newx
Then you should run your RMSE() line:
RMSE(lasso_predictions_valid, valid$Sales)
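To keep the two uses of predict() apart, here is a minimal sketch (assuming a valid data frame with the same columns as train): newx is needed only for predictions, while the coefficients at the chosen lambda can be obtained with coef() and no newx at all:
x_valid <- as.matrix(valid[, -which(names(valid) %in% "Sales")])
preds   <- predict(model, s = best_lambda, newx = x_valid)   # predictions for the validation set
coefs   <- coef(model, s = best_lambda)                      # lasso coefficients at best_lambda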

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish predictive performance (AUC ROC) for a glmer model. When I try to use the predict() function on a test data set, the output has the length of my training data set.
library(dplyr)   # for %>%, select(), one_of()
library(lme4)
library(pROC)    # for roc()
folds = 10
glmerperf = rep(0, folds); glmperf = glmerperf
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[, c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
testdata = TB_Train.glmer.cns[foldsamples[[n]],]
traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects have the length of the traindata object, despite specifying the newdata argument. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several errors about needing to scale my data (which I already have) and that the model fails to converge. Any ideas on how to fix this? I have already bumped up the iterations and selected a new optimizer.
I figured out that the error was arising from using the "." shortcut to specify all predictors in the model formula.
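A minimal sketch of the alternative, writing the fixed effects out instead of "." (the predictor names here are hypothetical placeholders for the columns of traindata; note also that newdata should be the whole test data frame, not a single column):
# hypothetical predictor names; replace with the actual columns of traindata
form <- Case.Status ~ Age + Sex + Non.TB.Case + (1 | IDNO)
GLMER <- lme4::glmer(form, data = traindata, family = "binomial",
                     control = lme4::glmerControl(optimizer = "bobyqa",
                                                  optCtrl = list(maxfun = 1e6)))
glmer.probs <- predict(GLMER, newdata = testdata, type = "response",
                       allow.new.levels = TRUE)   # tolerate IDNO levels unseen in training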

predicting outcome with a model in R

I am trying to do a simple prediction using linear regression.
I have a data.frame where some of the items are missing the price (and are therefore noted NA).
This apparently doesn't work:
#Simple LR
fit <- lm(Price ~ Par1 + Par2 + Par3, data=combi[!is.na(combi$Price),])
Prediction <- predict(fit, data=combi[is.na(combi$Price),], OOB=TRUE, type = "response")
What should I put instead of data = combi[is.na(combi$Price), ]?
Change data to newdata. Look at ?predict.lm to see what arguments predict can take; additional arguments are ignored. So in your case data (and OOB) are ignored and the default is to return predictions on the training data.
Prediction <- predict(fit, newdata = combi[is.na(combi$Price),])
identical(predict(fit), predict(fit, data = combi[is.na(combi$Price),]))
## [1] TRUE
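If the goal is then to fill in the missing prices, a small follow-up sketch reusing fit and combi from above:
missing <- is.na(combi$Price)
combi$Price[missing] <- predict(fit, newdata = combi[missing, ])   # impute the NA prices with the model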
