R: Random forest predict fails with NAs in predictors

The documentation (if I'm reading it correctly) says that the random forest predict function produces NA predictions if it encounters NA predictors for certain observations:
NOTE: If the object inherits from randomForest.formula, then any data
with NA are silently omitted from the prediction. The returned value
will contain NA correspondingly in the aggregated and individual tree
predictions (if requested), but not in the proximity or node matrices
However, if I try to use the predict function on a dataset with some NA's in predictors [NA's in 7 observations out of 2688] I encounter the following error condition, and prediction fails.
Error in predict.randomForest(model, new.ds) : missing values in newdata
There is a slightly messy work-around that I would like to avoid if possible.
Am I doing or reading something wrong? Does it have something to do with the "inherits from randomForest.formula" clause?

Using some examples from the documentation:
library(randomForest)
set.seed(1)
x <- data.frame(x1=gl(32, 5), x2=runif(160), y=rnorm(160))
rf1 <- randomForest(x[-3], x[[3]], ntree=10)
> inherits(rf1,"randomForest.formula")
[1] FALSE
> iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
proximity=TRUE)
> inherits(iris.rf,"randomForest.formula")
[1] TRUE
So you probably called randomForest without using the formula interface to fit your model.
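The behaviour described in the documentation note can be sketched in base R. This is only an illustration of the "omit rows with NA, then report NA for them" idea, not the package's actual code; `toy_predict` is a hypothetical stand-in for a fitted model's prediction rule.

```r
# Toy new data with missing predictors in rows 2 and 4.
newdata <- data.frame(x1 = c(1, NA, 3, 4), x2 = c(0.5, 0.2, 0.9, NA))

# Hypothetical stand-in for a fitted model's prediction rule.
toy_predict <- function(d) d$x1 + d$x2

# Formula-style handling: drop incomplete rows and remember which rows survived.
mf   <- model.frame(~ x1 + x2, data = newdata, na.action = na.omit)
keep <- match(row.names(mf), row.names(newdata))

# Predict on the complete rows only, then pad back to the original length.
pred <- rep(NA_real_, nrow(newdata))
pred[keep] <- toy_predict(mf)
pred
```

For a model fitted through the x/y interface, this padding has to be done by hand, or the model can be refitted with the formula interface.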

Related

How can I handle a confusionMatrix error when it says my Data is null

I am trying to run a random forests analysis in R and it works well when I fit the model and predict it on the test group but when I run the confusionMatrix it gives me the following error:
Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length
load the test and training data
trainData <- read.csv("./pml-training.csv")
testData <- read.csv("./pml-testing.csv")
dim(trainData)
dim(testData)
Data cleaning - Here, variables with nearly zero variance or that are almost always NA,
and the columns containing summary statistics or irrelevant data will be removed.
trainClean <- trainData[,colMeans(is.na(trainData))< .9]
trainClean <- trainData[,-c(1:7)]
nvz <- nearZeroVar(trainClean)
trainClean <- trainClean[,-nvz]
dim(trainClean)
Split the data into training (70%) and validation (30%)
inTrain <- createDataPartition(y=trainClean$classe, p=0.7, list=FALSE)
train <- trainClean[inTrain,]
valid <- trainClean[-inTrain,]
# Create a control for 3 fold validation
control <- trainControl(method="cv", number=3, verboseIter = FALSE)
Building the models
Random Forests
# Fit the model on train using random forest
modFit <- train(classe~., data=train, method="rf", trControl=control, tuneLength=5, na.action=na.omit)
modFit
modPredict<- predict(modFit, valid, na.action=na.omit) # predict on the valid data set.
# Turn valid$classe into a factor and check it
valid$classe <- as.factor(valid$classe)
modCM <- confusionMatrix(modPredict, as.factor(valid$classe))
modCM
table(modPredict, valid$classe)
When I check the length of modPredict it is 122, while valid$classe has length 5885. If I call dim on modPredict, I get NULL. I have tried using na.action=na.omit in the prediction chunk. I have also tried NOT using na.action=na.omit in either the prediction or the fit chunk.
I checked the test and valid data sets where I split the data using:
```length(train); length(valid); length(valid$classe); nrow(valid); nrow(train)```
The output is:
[1] 94
[1] 94
[1] 5885
[1] 5885
[1] 13737
I have been struggling with this problem and similar problems on my decision tree chunk as well. I don't want people to do my homework for me, but I could use a hint.
Thanks in advance
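One thing worth checking in the cleaning step: the second assignment starts again from trainData rather than from trainClean, so the NA-column filter from the line before is discarded. If mostly-NA columns remain, most validation rows are incomplete, which would be consistent with far fewer predictions than rows. A minimal sketch with a hypothetical toy data frame (the column names are made up for illustration):

```r
# Toy data (hypothetical): 'id' is irrelevant, 'mostly_na' is almost always NA.
toy <- data.frame(id = 1:20,
                  mostly_na = c(1, rep(NA, 19)),
                  x = rnorm(20),
                  classe = rep(c("A", "B"), 10))

# Pattern from the question: the second step restarts from the raw data,
# so the NA filter from the first step is lost.
lost <- toy[, colMeans(is.na(toy)) < 0.9]
lost <- toy[, -1]
names(lost)   # 'mostly_na' is back

# Chained version: each step works on the previous result.
kept <- toy[, colMeans(is.na(toy)) < 0.9]
kept <- kept[, -1]
names(kept)

# Rows that would survive NA handling at prediction time.
sum(complete.cases(lost))
sum(complete.cases(kept))
```

Comparing length(modPredict) with sum(complete.cases(valid)) is a quick way to confirm whether dropped rows explain the mismatch.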

Get row number for prediction with caret

I use caret a lot for my machine learning tasks in R and I like it a lot.
But I face the following problem:
I train a model in caret, say a linear regression with lm()
When I want to score new data, I do: predict(model, new_data)
When new_data contains missing values in my predictors, predict returns no prediction for those rows, instead of, say, NA
Is it possible to either:
return a prediction for all rows in new_data with a prediction of NA when it is not possible or
return predictions + the row number of the dataframe the prediction corresponds to?
E.g. like the mlr-package does with an id-column that shows which row the prediction corresponds to:
Here is the link to the mlr-predict page with more details:
mlr-package: predict with row-id
Any help greatly appreciated!
You can identify the cases with missing values prior to running caret::train() by creating a new column with the row names in your data set, since these default to the row numbers in the data frame.
Using the Sonar data set from the mlbench package as an illustration:
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# add row numbers
Sonar$rowId <- rownames(Sonar)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set column 60 to NA for some values in test data
testing[48:51,60] <- NA
testing[!complete.cases(testing),"rowId"]
...and the output:
> testing[!complete.cases(testing),"rowId"]
[1] "193" "194" "200" "206"
You can then run predict() on the rows in the test data set that have complete cases. Again using the Sonar dataset with a random forest model and 3 fold cross validation to expedite processing:
fitControl <- trainControl(method = "cv",number = 3)
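# fit on the training split, leaving out the rowId helper column
fit <- train(Class ~ ., data = training[, names(training) != "rowId"], method = "rf", trControl = fitControl)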
predicted <- predict(fit,testing[complete.cases(testing),])
Another way to handle this situation is to use an imputation strategy to eliminate the missing values for the independent variables in your model. My article on GitHub, Strategies for Handling Missing Values, links to a number of research papers on this topic.
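The complete-cases approach can also be wrapped so that every input row gets an entry, with NA where no prediction is possible. The sketch below is an assumption-laden illustration: `predict_with_row_id` is a hypothetical helper, and an `lm()` fit on mtcars stands in for a caret model so that the example runs in base R. It assumes numeric predictions; a classifier would need a character or factor column instead of NA_real_.

```r
# Hypothetical helper: predict for complete rows, return one row per input row.
predict_with_row_id <- function(model, newdata, predictors) {
  ok  <- complete.cases(newdata[, predictors, drop = FALSE])
  out <- data.frame(rowId = rownames(newdata), prediction = NA_real_)
  out$prediction[ok] <- predict(model, newdata[ok, , drop = FALSE])
  out
}

# Stand-in model: lm() on mtcars instead of a caret fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# New data with missing predictor values in rows 2 and 4.
new_data <- mtcars[1:5, ]
new_data$hp[c(2, 4)] <- NA

res <- predict_with_row_id(fit, new_data, predictors = c("wt", "hp"))
res
```

The rowId column plays the same role as the id column in the mlr output mentioned in the question.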

use cox model to estimate survival

I first establish a cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
When I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, along with the warning
"'newdata' had 1 row but variables found have 20 rows "
FYI, there are 20 events and the output contains 20 survival results.
The columns of the data frame supplied as the basis for a prediction must have the same names as the variables on the RHS of the model formula. I don't think yours will qualify unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the dataframe.
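The same warning can be reproduced without survival data. In the sketch below, `lm()` is used as a stand-in for coxph() (an assumption made only to keep the example in base R), and the data are simulated. The point is how a `$` in the formula makes the prediction ignore newdata.

```r
set.seed(42)
d <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
new_row <- data.frame(x = 21)

# Formula written with `$`: the term is literally `d$x`, which is not a
# column of newdata, so the 20 training values are used instead.
# The warning is suppressed here only to keep the output short.
bad <- lm(d$y ~ d$x, data = d)
bad_pred <- suppressWarnings(predict(bad, newdata = new_row))
length(bad_pred)

# Formula written with bare column names: newdata is used as intended.
good <- lm(y ~ x, data = d)
good_pred <- predict(good, newdata = new_row)
length(good_pred)
```

Removing suppressWarnings() shows the "'newdata' had 1 row but variables found have 20 rows" message for the first model.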

Different no of tuples for the prediction model and test set data in SVM

I have a dataset with two columns as shown below, where Column 1, timestamp, is a point in time and Column.10 gives the total power usage at that instant. There are 81502 instances in total.
I'm doing support vector regression on this data in R using the e1071 package to predict the future usage of power. The code is given below. I first divided the dataset into training and test data. Then using the training data modeled the data using the svm function and then predict the power usage for the testset.
library(e1071)
attach(data.csv)
index <- 1:nrow(data.csv)
testindex <- sample(index,trunc(length(index)/3))
testset <- na.omit(data.csv[testindex, ])
trainingset <- na.omit(data.csv[-testindex, ])
model <- svm(Column.10 ~ timestamp, data=trainingset)
prediction <- predict(model, testset[,-2])
tab <- table(pred = prediction, true = testset[,2])
However, when I try to make a confusion matrix from the prediction, I'm getting the error:
Error in table(pred = prediction, true = testset[, 2]) : all arguments must have the same length
So I tried to find the length of the two arguments and found that
the length(prediction) to be 81502
and the length(testset[,2]) to be 27167
Since I had done the prediction only for the testset, I don't know how a prediction was made for 81502 values. Why does the number of values differ between the prediction and the testset? How is the power value for the entire dataset getting predicted even though I gave it only the testset?
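Change
prediction <- predict(model, testset[,-2])
to
prediction <- predict(model, testset)
With only two columns, testset[,-2] collapses to a bare vector that no longer carries a timestamp column, so pass the whole data frame instead.
Also, you should not use table when doing regression; use the MSE instead.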
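As a minimal sketch of the MSE suggestion, with made-up numbers standing in for the observed power values and the SVM predictions:

```r
# Hypothetical observed values and model predictions (same length).
truth      <- c(10.2, 11.5, 9.8, 12.1)
prediction <- c(10.0, 11.9, 9.5, 12.4)

# Guard against the length mismatch described in the question.
stopifnot(length(prediction) == length(truth))

# Mean squared error and its square root (in the units of the response).
mse  <- mean((prediction - truth)^2)
rmse <- sqrt(mse)
mse
rmse
```

With the real data, `truth` would be testset$Column.10 and `prediction` the output of predict(model, testset).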

How to deal with NA in a panel data regression?

I am trying to predict fitted values over data containing NAs, and based on a model generated by plm. Here's some sample code:
require(plm)
test.data <- data.frame(id=c(1,1,2,2,3), time=c(1,2,1,2,1),
y=c(1,3,5,10,8), x=c(1, NA, 3,4,5))
model <- plm(y ~ x, data=test.data, index=c("id", "time"),
model="pooling", na.action=na.exclude)
yhat <- predict(model, test.data, na.action=na.pass)
test.data$yhat <- yhat
When I run the last line I get an error stating that the replacement has 4 rows while data has 5 rows.
I have no idea how to get predict to return a vector of length 5...
If instead of running a plm I run an lm (as in the line below) I get the expected result.
model <- lm(y ~ x, data=test.data, na.action=na.exclude)
As of version 2.6.2 of plm (2022-08-16), this should work out of the box. See "Predict out of sample on fixed effects model"; from the NEWS file:
prediction implemented for fixed effects models incl. support for argument newdata and out-of-sample prediction. Help page (?predict.plm) added to specifically explain the prediction for fixed effects models and the out-of-sample case.
I think this is something that predict.plm ought to handle for you -- seems like an oversight on the package authors' part -- but you can use ?napredict to implement it for yourself:
pp <- predict(model, test.data)
na.stuff <- attr(model$model,"na.action")
(yhat <- napredict(na.stuff,pp))
## [1] 1.371429 NA 5.485714 7.542857 9.600000
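The padding step itself can be tried in base R. The sketch below uses `lm()` instead of plm (an assumption, so that no extra package is needed) on the same y and x values, and takes the unpadded fitted values straight from the model object. Note that napredict() only pads when the recorded action is na.exclude; with na.omit it returns its input unchanged.

```r
test.data <- data.frame(y = c(1, 3, 5, 10, 8), x = c(1, NA, 3, 4, 5))
fit <- lm(y ~ x, data = test.data, na.action = na.exclude)

# Unpadded fitted values: one per row actually used in the fit.
pp <- fit$fitted.values
length(pp)

# The na.action object records which rows were dropped.
na.stuff <- fit$na.action

# Pad back to the original number of rows, with NA for the dropped row.
yhat <- napredict(na.stuff, pp)
test.data$yhat <- yhat
test.data
```

After padding, the assignment to test.data$yhat succeeds because yhat has one entry per original row.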

Resources