subscript out of bounds Error, Random Forest Model - r

I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. I've gotten the data from a much larger data set that contains dozens of variables, but I've cleaned it into this "clean" data.frame with NA values omitted and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()

The problem is most likely that you have some kind of variable level in your test data that is not reflected in your training data, so when the model goes to assign the outcome it has no basis to do so.
It is impossible to say for sure without sample data, but that is the most likely scenario. Try setting a seed with set.seed(3), then change the seed number (set.seed(28), and so on) a few times to see whether you end up with a split where you do not get the error.
Then compare the data frame that produces the error with one that does not to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with missing levels (say a factor has 10 levels but only 8 are represented in the training set while 9 or 10 appear in the test set), that can be a problem. The levels should be balanced between the sets if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
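If the culprit is an empty factor level, a quick check along these lines usually surfaces it. This is only a sketch, assuming your cleaned data frame is called clean (as in the question) and that Gender is a factor; droplevels() is the usual fix when a level has zero observations in the training split.
# Sketch: assumes the cleaned data frame is called `clean` and Gender is a factor
library(randomForest)
set.seed(3)
ind <- sample(nrow(clean), 0.8 * nrow(clean))
train <- clean[ind, ]
test  <- clean[-ind, ]

# Levels defined on the factor vs. levels actually observed in the training split
levels(train$Gender)
table(train$Gender)

# Levels observed in test but absent from train (and vice versa)
setdiff(unique(test$Gender), unique(train$Gender))

# Dropping unused levels before fitting is often enough to avoid the error
train <- droplevels(train)
rf <- randomForest(Gender ~ ., data = train, ntree = 20)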

Related

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
If you don't get this error, re-run the cross-validation and at some point you will get a similar one.
The nature of the problem is that when you do cross-validation, the train and test subsets might not contain exactly the same factor levels. Here our variable z has three levels (D, E, F), but the data contain far more D's than E's or F's. So whenever you take a small subset of the whole data (as cross-validation does), there is a very good chance that all the z values in that subset are D's. The E and F levels then get dropped, which produces the error (this answer is helpful for understanding the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this a reproducible example, the actual data I am using has many variables like z, I would like to avoid deleting them.)
To answer your question in the comment: I don't know whether such a function already exists. Most likely there is one, but I have no idea which package contains it. For this example, the following function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A", "B"), times = c(500, 500))
z <- rep(x = c("D", "E", "F"), times = c(997, 2, 1))
data <- data.frame(x, y, z)
# optional: tag rows for later identification
# data$rowid <- 1:nrow(data)
stratified <- function(df, column, percent) {
  # split the data frame into groups based on the column
  listdf <- split(df, df[[column]])
  testsubgroups <- lapply(listdf, function(x) {
    # pick the number of samples per group, rounding up
    numsamples <- ceiling(percent * nrow(x))
    # select the rows
    whichones <- sample(1:nrow(x), numsamples, replace = FALSE)
    x[whichones, ]
  })
  # combine the subgroups into one data frame
  testgroup <- do.call(rbind, testsubgroups)
  testgroup
}
testgroup <- stratified(data, "z", 0.8)
This will just split the initial data by column z; if you are interested in grouping by multiple columns, this could be extended using the group_by function from the dplyr package, but that would be another question.
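As a small usage sketch (it assumes you created the optional rowid tag shown above): use the stratified sample as the training set and take the remaining rows as the test set, which guarantees that every level of z appears in training.
data$rowid <- 1:nrow(data)                       # the optional tag from above
train_set <- stratified(data, "z", 0.8)          # 80% of each z level, rounded up
test_set  <- data[!data$rowid %in% train_set$rowid, ]

table(train_set$z)  # every level of z is represented here
table(test_set$z)   # rare levels may have zero rows left over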
Comment on the statistics: if you have only a few examples of any particular factor level, what kind of fit do you expect? A poor one, with wide confidence limits.

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model with 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the test set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions for the test set. I discovered this because the code below, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn), the data set that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
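A quick sanity check (just a sketch on the simulated data above): with the spectra matrix embedded in the one data frame, predictions for the held-out rows now differ from the fitted values on the training rows, which is exactly what was missing in the original setup.
fitted_vals <- predict(refl.pls, ncomp = 3)                        # training rows only
test_preds  <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train, ])
dim(fitted_vals)  # 20 x 1 x 1: the 20 training observations
dim(test_preds)   # 36 x 1 x 1: the 36 held-out observations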

Used Predict function on New Dataset with different Columns

Using "stackloss" data in R, I created a regression model as seen below:
stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,data=stackloss)
stackloss.lm
newdata = data.frame(Air.Flow=stackloss$Air.Flow, Water.Temp= stackloss$Water.Temp, Acid.Conc.=stackloss$Acid.Conc.)
Suppose I get a new data set and need to predict its "stack.loss" based on the previous model, as seen below:
#suppose I need to used my model on a new set of data
stackloss$predict1[-1] <- predict(stackloss.lm, newdata)
I get this error:
Error in `$<-.data.frame`(`*tmp*`, "predict1", value = numeric(0)) :
replacement has 0 rows, data has 21
Is there a way to use the predict function on a different data set with the same columns but a different number of rows?
Thanks in advance.
You can predict into a new data set of whatever length you want; you just need to make sure you assign the results to an existing vector of the appropriate size.
This line causes a problem
stackloss$predict1[-1] <- predict(stackloss.lm, newdata)
because you can't assign into and subset a non-existent vector at the same time. This also doesn't work:
dd <- data.frame(a=1:3)
dd$b[-1]<-1:2
The stackloss data you used to fit the model will always have the same number of rows, so re-assigning a different number of predicted values to that data.frame doesn't make sense. If you want to use a smaller dataset to predict on, that's fine:
stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,data=stackloss)
newdata = head(data.frame(Air.Flow=stackloss$Air.Flow, Water.Temp= stackloss$Water.Temp, Acid.Conc.=stackloss$Acid.Conc.),5)
predict(stackloss.lm, newdata)
1 2 3 4 5
38.76536 38.91749 32.44447 22.30223 19.71165
Since the result has the same number of values as newdata has rows (n=5), it makes sense to attach these to newdata. It would not make sense to attach them to stackloss, because that has a different number of rows (n=21).
newdata$predict1 <- predict(stackloss.lm, newdata)
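For a genuinely new data set (the asker's actual situation), the same pattern works as long as the column names match the model formula; the values below are made up purely for illustration.
# Hypothetical new observations; only the column names need to match the model
new_obs <- data.frame(Air.Flow = c(60, 70),
                      Water.Temp = c(20, 22),
                      Acid.Conc. = c(85, 90))
new_obs$predict1 <- predict(stackloss.lm, newdata = new_obs)
new_obs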

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to include that factor in the prediction model itself.
How do I solve for the column I want on high-ish dimensionality data, without having to spell out exactly which columns to use in the formula (so I don't end up with something like cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ... etc.)?
EDIT:
At a high level, I believe one basically:
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of the target (in my case data[,66]) given data[1:65].
So my PROBLEM is: if I now give it a new set of test data, say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" because it is expecting data[,66]. I want to predict data[,66] given the rest of the data!
I think the issue is that if the response is also a column of train3, then "~ ." will use it as a feature.
I believe this is more like what you want:
library(party)
crtl <- cforest_unbiased(ntree = 1000, mtry = 3)
mod <- cforest(iris[, 5] ~ ., data = iris[, -5], controls = crtl)
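To address the second part of the question (predicting data[,66] from data[1:65] without an error), here is a small sketch using iris as stand-in data: keep the response under its own name in the training data so the formula is just response ~ ., then hand predict() a newdata frame containing only the features.
library(randomForest)
set.seed(415)

train_idx <- sample(nrow(iris), nrow(iris) / 2)          # half the rows for training
fit.rf <- randomForest(Species ~ ., data = iris[train_idx, ], ntree = 500)

test_features <- iris[-train_idx, -5]                    # like data[1:65]: no response column
pred <- predict(fit.rf, newdata = test_features)
head(pred)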

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector, so it first had to be converted to a data frame. I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
First, your coercion to a factor is not working because sapply() returns a matrix rather than a data frame, and the syntax needs tweaking. Second, you should always use explicit indexing when specifying an RF model. Here are changes to your code that should make it work.
library(randomForest)
training <- as.data.frame(sapply(training.temp, as.numeric))  # keep it a data frame so the factor survives
training[, "Class"] <- as.factor(training[, "Class"])
training_rf <- randomForest(x = training[, 1:(ncol(training) - 1)], y = training[, "Class"],
                            importance = TRUE, do.trace = 100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x = training[, 1:(ncol(training) - 1)], y = as.factor(training[, "Class"]),
                            importance = TRUE, do.trace = 100)
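As a quick check (a sketch with simulated data standing in for training.temp, which is not shown in the question): when y is a proper factor, the fitted object reports type "classification" rather than "regression".
library(randomForest)
set.seed(1)
training.temp <- data.frame(matrix(rnorm(100 * 11), ncol = 11),        # 11 hypothetical predictors
                            Class = sample(0:1, 100, replace = TRUE))  # binary class column
training <- as.data.frame(sapply(training.temp, as.numeric))

training_rf <- randomForest(x = training[, 1:(ncol(training) - 1)],
                            y = as.factor(training[, "Class"]),
                            importance = TRUE)
training_rf$type  # "classification"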
