R randomForest combined models - Error Message

This is the problem I am having, and I hope someone can explain why.
I have a large dataset I am using to predict a categorical value - L, M, H - which is a factor in the original data.frame.
The training set is too large to fit in memory, so I took a sample of my training dataset and created a randomForest. Then I created a different random sample and created a second forest, and so on. They all have similar performance, which was a concern.
I found the combine function in randomForest and decided to use it to combine my models.
I then need to use the new model to score the training set to get an OOB estimate, and then do the same with my validation sample.
I am having a problem with the prediction on the test set.
I basically get a message saying "Error in eval(expr, envir, enclos) : object 'XXX' not found", where XXX is the variable name. But this makes no sense, as the variables never changed names.
I redid this a few times, in case my data got corrupted.
Any idea why I am getting this?

Without the data it is hard to know, but this is my hunch based on similar errors in the past: if you are sampling your data and running separate models, you may run into a problem with categorical variables where the factor levels in one model do not match the factor levels in another model. The way to potentially fix this is to specify the factor levels in the data frame (using the levels function) before you run the model.
Edit: one way to debug is to run two models on the same sample data, combine them, and try to apply the model to see if you get the same error.
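A minimal sketch of the fix (data and variable names here are made up): force every sample to carry the same set of levels before training, so the combined forests agree on the factor encoding.
# hypothetical names: full_data, region, target, sample1, sample2
all_levels <- levels(full_data$region)    # levels taken from the complete data
sample1$region <- factor(sample1$region, levels = all_levels)
sample2$region <- factor(sample2$region, levels = all_levels)
library(randomForest)
rf1 <- randomForest(target ~ ., data = sample1, ntree = 500)
rf2 <- randomForest(target ~ ., data = sample2, ntree = 500)
rf_all <- combine(rf1, rf2)    # the combined forest now sees consistent levels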

Related

Warning: Error in predict.randomForest: Type of predictors in new data do not match that of the training data

I get the above error message when trying to predict for a single new instance in Shiny, hence there cannot be a problem with the test data having a different number of levels.
RF <- randomForest(incurred_loss2 ~ turnover + Limite.PI + NUTS1, data = sec, importance = TRUE, ntree = 2000)
modelPred <- reactive({
    turnoverInput <- as.numeric(input$sliderTurnover)
    LOIInput <- as.numeric(input$sliderLOI)
    LegalInput <- as.factor(input$selectLegal)
    NUTS1Input <- as.factor(input$selectNUTS1)
    predict(RF, newdata = data.frame(legal_form = LegalInput, turnover = turnoverInput, Limite.PI = LOIInput, NUTS1 = NUTS1Input))
})
When I drop the two factor variables from the model, so that I am only left with the two numerical variables, I do not get the error message.
The error means that the data structure used to train the "RF" model isn't the same as the test data structure, and the fact that dropping the factor variables makes your code work confirms you have an issue with factors that aren't identical.
Looking at your code, this is plausible, as your training and testing sets aren't subsetted from the same data.frame, and you do several manual conversions (including manual conversions of factors).
I'd see two things to try, depending on what you need:
merging your sets into one data.frame with a key column you can use to separate train and test
or creating a function which outputs a standardized dataset you can use in your task, and passing your test and train sets through this function (setting NA as an explicit level)
Below is an example of the problems manual conversions can cause with factors in random forests if one of your datasets has NA and the other does not:
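A minimal sketch with made-up data: an NA kept as an explicit level on one side but absent from the other makes the level sets diverge, which predict() can then reject; rebuilding the factor with the training levels realigns them.
library(randomForest)
set.seed(42)
# made-up data: the raw training column contains NA, kept as an explicit level
raw <- sample(c("a", "b", NA), 100, replace = TRUE)
train <- data.frame(y = factor(sample(c("L", "M", "H"), 100, replace = TRUE)),
                    f = addNA(factor(raw)))
test <- data.frame(f = factor(c("a", "b")))    # no NA level on this side
rf <- randomForest(y ~ f, data = train)
# levels differ ("a","b",NA vs "a","b"), which can trigger the error above
# fix: rebuild the test factor with the training levels before predicting
test$f <- factor(test$f, levels = levels(train$f), exclude = NULL)
predict(rf, newdata = test)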
r random forest error - type of predictors in new data do not match

Dealing with different number of levels in train and test data sets

My knowledge in this area is poor, so forgive me if this is a trivial question.
I need to train a model, and I have two data sets: train data for building the model and scoring data to apply the model to.
One important categorical variable has 200 levels in the train data and only 50 levels in the scoring data. In fact, they only share 20 levels.
So, what is the correct way to deal with such a situation? Should I limit the levels to the intersection of the levels, keep them as they are, or something else?
Best.
There are a number of different options here. I assume you are talking about a single attribute, and since you are talking about levels, I am also assuming it is numerical:
The first option is to do nothing and see what result you get.
The second is to normalize the values, setting them all on the same scale from 0 to 1.
You could also try binning; I'm not sure how that is done in R.
I'm not an expert, but I've found that doing some testing and trying different methods doesn't hurt. A program I use with school is called Weka; it's free and open source, and there are instructional videos that will introduce you to the theory behind data analysis:
http://www.cs.waikato.ac.nz/ml/index.html
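For reference, both techniques mentioned above are short in R (a sketch; x is a made-up numeric attribute):
x <- c(2, 5, 9, 14)                        # made-up numeric attribute
x01 <- (x - min(x)) / (max(x) - min(x))    # min-max normalization to [0, 1]
bins <- cut(x, breaks = 3)                 # equal-width binning into 3 intervals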
When using your test dataset to test your model, you will need to filter out the levels that are not present in your test dataset (assuming your model cannot handle missing levels).
Alternatively, you could re-partition your data into test and training sets where all of the levels in the test set are present in the training set. The createDataPartition function from the caret package will do this for you - e.g. see here.
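A sketch of the re-partitioning approach (df and cat are hypothetical names): createDataPartition samples within each level of the factor it is given, so levels that end up in the test split should also be present in the training split (very rare levels permitting).
library(caret)
set.seed(1)
idx <- createDataPartition(df$cat, p = 0.7, list = FALSE)    # stratified on the factor
train <- df[idx, ]
test  <- df[-idx, ]
# check: every level seen in test also occurs in train
all(unique(test$cat) %in% unique(train$cat))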

How do I set the levels in a dataset using the model data structure from bnlearn?

I'm trying to use models from the bnlearn package in R to do classifier predictions, but with some datasets, some of the variable values (levels) are rarely seen, which means that the test data partition may not have all of the values for a variable represented in the data file.
When using predict() with the bn model on this type of data set, an error message similar to the following is returned:
In check.data(data) : variable V3 has levels that are not observed in the data.
I would like to reset the levels in the model similar to the method here:
Error in bn.fit predict function in bnlearn R
but I don't have access to the original data, just the model.
So, how do I get the number of levels from the bn data structure to set the number of levels in the data set to be predicted?
The answer is that the question is asking the wrong thing. After quite a bit of poring over the code, the answer lies in a function, check.data, used to verify the data for both the learning and the predicting phases, which is, in this case, nonsensical. The correct answer is to modify bnlearn to eliminate this bug.
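For the narrower ask in the question - recovering the levels from the fitted model when the original data is gone - a hedged sketch, assuming a discrete network and hypothetical names (fitted_model, V3, newdata): in a bn.fit object, each node carries a conditional probability table whose first dimension is indexed by that node's own levels.
library(bnlearn)
# first CPT dimension = the node's own levels (discrete networks)
model_levels <- dimnames(fitted_model$V3$prob)[[1]]
# re-level the prediction data to declare exactly the model's levels
# (values outside the model's levels become NA)
newdata$V3 <- factor(newdata$V3, levels = model_levels)
pred <- predict(fitted_model, node = "target_node", data = newdata)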

Applying univariate coxph function to multiple covariates (columns) at once

First, I gathered from this link Applying a function to multiple columns that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R, so I apologize in advance if this is a really "newb" question.
My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (followup time/time to recurrence), as well as recurrence risk factors (t-stage, tumor size, age at dx, etc.). Some risk factors are categorical and some are continuous. I have been running my univariate analyses by hand, one at a time, like this example: univariateageatdx <- coxph(survobj ~ agedx), and then collecting the data. This gets very tedious for multiple factors, especially when doing it for a few different recurrence types.
I figured there must be a way to code this such that I could basically have one line of code with the coxph equation, apply it to all of my variables of interest, and get a result containing the univariate analysis for each factor. I tried using cbind to bind variables (i.e. x <- cbind("agedx", "tumor size")) and then running coxph(recurrencesurvobj ~ x), but this of course just ran a multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6:9], collapse='+')))
I then ran it as coxph(f).
This gave me the results of a multivariate cox analysis.
Thanks!
**edit: I just fixed the error; I needed to use the column numbers, I suppose, not the names. Changes are reflected in the code above. However, it still runs the selected variables as a multivariate analysis and not as true univariate analyses...
If you want to go the formula route (which, in your case with multiple outcomes and multiple variables, might be the most practical way to go about it), you need to create one formula per model you want to fit. I've split the steps here a bit (making formulas, making models, and extracting data); they can of course be combined, but this way you can inspect all your models.
# example using the transplant data from the survival package
library(survival)
# make a new event variable (death or no death) to have a dichotomous outcome
transplant$death <- transplant$event == "death"
# making the formulas, one per covariate
univ_formulas <- sapply(c("age", "sex", "abo"),
                        function(x) as.formula(paste('Surv(futime, death) ~', x)))
# making a list of models
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = transplant))
# extract data (here I've gone for the HR and its confidence interval)
univ_results <- lapply(univ_models, function(x) exp(cbind(coef(x), confint(x))))
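As a follow-up, one way (among others) to collect the list into a single overview table:
# stack the per-variable rows into one matrix for easy reading
res_table <- do.call(rbind, univ_results)
res_table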

Multiple Imputation on New/Predictor Data

Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R, and all seem to impute only the training and test sets (at the same time). How do you then handle new unlabeled data so that it is imputed in the same way as the train/test data? Basically, I want to use multiple imputation for missing values in the training/test set and the same model/method for new predictor data. Based on my research of multiple imputation (I am not an expert), this does not seem feasible with MI. However, with caret, for example, you can easily use the same preprocessing model from training/test on new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, my data set contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m <- mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values, and I would like to somehow apply this MI incrementally.
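For context, the workflow described above looks roughly like this in mice (a sketch; the outcome and predictors are hypothetical placeholders):
library(mice)
imp <- mice(sg.enc, m = 5, seed = 1)    # sg.enc as in the question
# fit the same model on each imputed data set, then pool the estimates
fits <- with(imp, glm(outcome ~ x1 + x2, family = binomial))
pooled <- pool(fits)
summary(pooled)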
It does not do multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I have not yet tested whether factors can be part of the "predictors" for the missing target variables. yaImpute also supports other imputation methods besides knn.
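A hedged sketch of that workflow (names are hypothetical, and the exact predict() arguments may differ by yaImpute version; see ?predict.yai):
library(yaImpute)
# x: predictor columns observed everywhere; y: columns that may be missing
fit <- yai(x = train_x, y = train_y, method = "randomForest")
# impute the y-values for new observations, per the answer's description
imputed_new <- predict(fit, newdata = new_x)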
