R: Missing data causes error with XGBoost / sparse.model.matrix

As far as I understand, XGBoost should have the benefit of dealing with missing data; however, whenever I test it on the Boston housing set with a few NAs added, I get the error:
The length of labels must equal to the number of rows in the input data
The code I am running is
trainm <- sparse.model.matrix(class ~ ., data = train)
train_label <- train[,"class"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
When I don't add the NAs everything runs fine. I am pretty sure that the issue is that the NAs are removed from a sparse matrix which causes the confusion, but I am not sure how to address it.
Any feedback that can help me move forward will be highly appreciated.

You need to take corrective action to handle NAs while building the sparse model matrix; beyond that, there is no problem with your code/data. This is the modified code:
options(na.action = 'na.pass')  # keep rows containing NAs instead of dropping them
trainm <- sparse.model.matrix(class ~ ., data = train)
train_label <- train$class
train_matrix <- xgb.DMatrix(data = trainm, label = train_label)
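For completeness, here is a minimal end-to-end sketch of the fix on the Boston data. It is hedged: the question doesn't show how train was built, so this assumes the MASS, Matrix, and xgboost packages, uses medv as the target, and injects a few NAs into one predictor for illustration.

library(Matrix)    # sparse.model.matrix()
library(xgboost)   # xgb.DMatrix(), xgb.train()

set.seed(1)
train <- MASS::Boston
train$crim[sample(nrow(train), 10)] <- NA   # a few NAs in one predictor (illustrative)

# Without na.pass, model.matrix() drops the NA rows, so the design matrix
# ends up shorter than the label vector and xgb.DMatrix() raises the error.
options(na.action = "na.pass")
trainm <- sparse.model.matrix(medv ~ . - 1, data = train)
train_label <- train$medv
train_matrix <- xgb.DMatrix(data = trainm, label = train_label)

# XGBoost treats NA entries as missing values by default.
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = train_matrix, nrounds = 10)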

Related

Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research it appears that this error occurs when there are missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read somewhere that the missing values may be introduced during the re-sampling routine in caret, which I suspect is what's happening.
In an attempt to solve problem 1, I tried "pre-processing" the data during the re-sampling in caret by removing zero-variance and near-zero-variance predictors, and automatically imputing missing values using caret's knn imputation method, preProcess(c('zv','nzv','knnImpute')), but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say, I checked and confirmed that the input data sets are indeed matrices, so I don't understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed), train.transformed)
y.train <- as.matrix(select(train.transformed, class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE, FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
                                number = 10,
                                repeats = 3,
                                search = 'random',
                                adaptive = list(min = 5, alpha = 0.05,
                                                method = "gls", complete = TRUE),
                                allowParallel = TRUE)
fit.vbmp.01 <- train(x = x.train,
                     y = y.train,
                     method = 'vbmpRadial',
                     trControl = adaptive_trctrl,
                     preProcess(c('zv','nzv','knnImpute')),
                     tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary:
- There are no missing values in the data
- Both x.train and y.train are definitely matrices
- I tried using a standard 'repeatedcv' method instead of 'adaptive_cv' in trainControl, with the same exact outcome
- Forgot to mention that the outcome class has 3 levels
Does anyone have any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr
I had the same problem with my data. After some digging, I found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))), the computation ran without error.
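If you don't know which column holds the offending values, a hedged sketch along these lines (assuming dplyr >= 1.0; df and df_clean are illustrative names) drops every row with a non-finite value in any numeric column:

library(dplyr)

# Keep only rows where every numeric column is finite.
# Note that is.finite() is also FALSE for NA and NaN, so those rows go too.
df_clean <- df %>%
  filter(if_all(where(is.numeric), is.finite))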

subscript out of bounds Error, Random Forest Model

I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. I've gotten the data from a much larger data set that contains dozens of variables, but I've cleaned it into this "clean" data.frame with omitted NA values and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()
The problem is likely that you have a variable level in your test data that was not reflected in your training data. So when it goes to assign the outcome, it has no basis to do so.
It is impossible to say for sure without sample data, but that is the most likely scenario. Try setting a seed with set.seed(3), then changing the seed number to set.seed(28), and so on, a few times to see if you end up finding a combination where you do not get the error.
Compare the data frame that produces the error with one that does not, to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with levels missing (meaning it has 10 levels but only 8 are represented in the train with 9 or 10 in the test) it might be a problem. They should be balanced if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
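A hedged sketch of those checks (Gender is the column from the question; train and test as in the posted code):

# Inspect the factor structure of both sets.
str(train)
str(test)

# Levels actually observed in each split:
table(train$Gender)
table(test$Gender)

# If train carries empty levels (declared but with zero rows), drop them
# before fitting:
train <- droplevels(train)

# Re-code test with exactly the training levels so predict() sees a
# consistent factor; observations with unseen levels become NA.
test$Gender <- factor(test$Gender, levels = levels(train$Gender))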

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through the confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the
confusionMatrix predict. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method = "glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
which specifies y = training2$hold1yes0no and x = trainPC2.
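A hedged follow-up showing the full scoring step under that change (all object names come from the question; it also assumes hold1yes0no is a factor, which caret needs for classification):

modelFit <- train(x = trainPC2, y = training2$hold1yes0no, method = "glm")

# Score the preprocessed test set and compare predictions to the test labels.
pred <- predict(modelFit, newdata = trainTR2)
confusionMatrix(data = pred, reference = testing2$hold1yes0no)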

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
# I'll try to reproduce this with a public data set; in the meantime here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65]  # basically data w/o the "answers"
m = sample(1:nrow(data), nrow(data)/2, replace = FALSE)
o = sample(1:nrow(data), nrow(data)/2, replace = FALSE)
train2 = data[m,]
train3 = data[o,]
# random forest implementation
fit.rf <- randomForest(train2[,66] ~ ., data = train2, importance = TRUE, ntree = 10000)
Prediction.rf <- predict(fit.rf, test)  # to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
# cforest implementation
fit.cf <- cforest(train3[,66] ~ ., data = train3, controls = cforest_unbiased(ntree = 10000, mtry = 10))
Prediction.cf <- predict(fit.cf, test, OOB = TRUE)  # to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to include the target in the prediction model itself.
How do I solve for the dimension I want on high-ish-dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with something like cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitted model so one can predict values of the target (in my case data[,66]) given data[1:65].
so my PROBLEM is now: if I give it a new set of test data, let's say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" where it is expecting data[,66]. I want to basically predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
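A hedged companion sketch for the randomForest half of the question, using the x/y interface so the target never has to be spelled out in a formula (iris stands in for the poster's data, with column 5 playing the role of data[,66]):

library(randomForest)

# x = the feature columns, y = the target column; no formula required.
fit.rf <- randomForest(x = iris[, -5], y = iris[, 5],
                       importance = TRUE, ntree = 1000)

# New data only needs the feature columns, mirroring test = data[1:65].
predict(fit.rf, newdata = iris[1:5, -5])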

Use of randomForest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomForest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I then proceeded to create the forest with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic matrix (which sapply() returns rather than a data frame), so it first had to be converted to a data frame. I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
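A hedged alternative that avoids the matrix detour entirely (training.temp and Class are the names from the question):

# lapply() returns a list, so the data frame structure survives the
# numeric conversion and the class column can genuinely hold a factor.
training <- as.data.frame(lapply(training.temp, as.numeric))
training$Class <- factor(training$Class)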
First, your coercion to a factor is not working because of syntax errors. Second, you should always use indexing when specifying an RF model. Here are changes to your code that should make it work:
training <- sapply(training.temp, as.numeric)
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x = training[, 1:(ncol(training)-1)], y = training[,"Class"],
                            importance = TRUE, do.trace = 100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x = training[, 1:(ncol(training)-1)], y = as.factor(training[,"Class"]),
                            importance = TRUE, do.trace = 100)
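As a side note, a small hedged demo (toy data, not the poster's) of why the original factor assignment silently failed while training was still a matrix:

m <- sapply(data.frame(a = 1:3, Class = c(0, 1, 0)), as.numeric)  # sapply() yields a numeric matrix
m[, "Class"] <- factor(m[, "Class"])  # the factor is coerced to its integer level codes
class(m[, "Class"])                   # "numeric" -- no factor survives inside a matrix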
