How to call randomForest predict for use with ROCR?

I am having a hard time understanding how to build a ROC curve, and I have now come to the conclusion that maybe I am not creating the model correctly. I am running a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have divided the dataset into bank_training and bank_testing for prediction purposes.
Here are the steps I take:
bankrf <- randomForest(y_n ~ ., data = bank_training, mtry = 4, ntree = 2,
                       keep.forest = TRUE, importance = TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type = 'response',
                       predict.all = TRUE, norm.votes = TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with two components named aggregate and individual. I don't understand where these two names came from. Moreover, when I run:
summary(bankrf.pred)
           Length Class  Mode
aggregate  22606  factor numeric
individual 45212  -none- character
What does this summary mean? The training and testing datasets are 22605 and 22606 rows long, respectively. If someone can explain to me what is happening I would be very grateful. I think there is something wrong in all this.
When I try to design the ROC curve with ROCR I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is it just a mistake in the way I try to create the ROC curve, or did something already go wrong with randomForest?

The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions A vector, matrix, list, or data frame containing the
predictions.
labels A vector, matrix, list, or data frame containing the true
class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of the random forest model. As documented at ?predict.randomForest, with predict.all=TRUE it will be a list with two components: aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree. That also explains your summary() output: aggregate has one prediction per test row (22606), while individual has 22606 × 2 = 45212 entries because you grew ntree = 2 trees.
So you probably want to do something like this:
pred <- prediction(bankrf.pred$aggregate, bank_testing$y_n)
See how that works? The predicted values are passed to the predictions argument, while the "labels", or true values, are passed to the labels argument.

You should drop the predict.all=TRUE argument from predict if you simply want the predicted classes. With predict.all=TRUE you are telling the function to keep the predictions of all individual trees rather than just the aggregated prediction from the forest.
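If what you ultimately want is the ROC curve, it is usually better to feed ROCR a continuous score rather than a hard class label. A minimal sketch, assuming y_n is a factor with levels "0" and "1" (ntree is also raised from 2, since vote fractions from two trees are too coarse to draw a useful curve):
library(randomForest)
library(ROCR)

bankrf <- randomForest(y_n ~ ., data = bank_training, mtry = 4,
                       ntree = 500, importance = TRUE)

# type = "prob" returns a matrix of class probabilities; the "1" column
# is the forest's estimate of P(y_n = 1), a continuous score
prob <- predict(bankrf, bank_testing, type = "prob")[, "1"]

pred <- prediction(prob, bank_testing$y_n)   # predictions first, labels second
perf <- performance(pred, "tpr", "fpr")
plot(perf)                                   # the ROC curve
performance(pred, "auc")@y.values[[1]]       # area under the curve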

Related

how to get global p for categorical variables in svy_vglm

I'm using the function svyVGAM::svy_vglm to run a multinomial model with survey weights:
mmodel <- svy_vglm(y ~ x1 + x2 + x3 + x4..., family = multinomial, design = w_data)
where the x's are categorical variables, some with three or more levels. From the model summary I can get the p-value for each coefficient, but I don't know how to get a p-value for a variable as a whole (a global p-value).
In other contexts, anova(), waldtest(), lrtest(), ... could be used, but none of them seem to work with svy_vglm objects. tbl_regression does not work either: Error: No tidy method for objects of class svy_vglm.
Any help?
Thanks
You can do this using the coef and vcov methods. There's probably a package that does it, but it's not hard to program yourself.
Suppose model is your model object, design is your survey design object, and index is a vector with the positions of the coefficients you want to test. If you had ten coefficients and wanted to test all except the first two, you would set index <- 3:10, for example.
beta <- coef(model)[index]                    # coefficients under test
V <- vcov(model)[index, index]                # their covariance block
teststat <- crossprod(beta, solve(V, beta)) / length(beta)  # Wald F statistic
pf(teststat, df1 = length(beta), df2 = degf(design), lower.tail = FALSE)
This doesn't give you a likelihood ratio test; you'd probably need to write to the package author and suggest that as a new feature.
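For reuse, the same test is easy to wrap in a small helper (a sketch; svy_wald is a made-up name, not part of any package):
# Wald test of H0: all coefficients in 'index' are zero
svy_wald <- function(model, design, index) {
  beta <- coef(model)[index]
  V <- vcov(model)[index, index]
  stat <- crossprod(beta, solve(V, beta)) / length(beta)   # F statistic
  p <- pf(stat, df1 = length(beta), df2 = degf(design), lower.tail = FALSE)
  c(F = as.numeric(stat), p.value = as.numeric(p))
}

# e.g. a global p-value for a factor whose dummies sit in positions 3:10:
# svy_wald(mmodel, w_data, 3:10)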

Regression in subpopulations using svyglm function in the R survey package

I would like to use the svyglm function from the survey package to run stratified regression models, i.e. regression models on subsets of my population.
Suppose x is my predictor, y is my outcome, and z is a third (factor) variable. I would like to examine the relationship between x and y separately for each level of z.
The documentation for this package says that "The correct standard error estimate for a subpopulation that isn't a stratum is not just obtained by pretending that the subpopulation was a designed survey of its own. However, the subset function and [ method for survey design objects handle all these details automagically, so you can ignore this problem."
There is a subset argument in the svyglm function. My question is: do you specify the subpopulation by subsetting the design object, in the svyglm call, or both?
Either one, but not both.
The code inside svyglm looks like this:
subset <- substitute(subset)
subset <- eval(subset, model.frame(design), parent.frame())
if (!is.null(subset))
    design <- design[subset, ]
The first two lines handle where to look up the subset expression, and then it is simply used to subset the design with [.
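In practice, the two calls below fit the same model (a sketch, assuming a design object des whose data contain y, x, and a factor z):
library(survey)

# Option 1: subset the design object first
fit1 <- svyglm(y ~ x, design = subset(des, z == "A"))

# Option 2: let svyglm do the subsetting via its subset argument
fit2 <- svyglm(y ~ x, design = des, subset = z == "A")

# one model per level of z (des$variables is where the design stores its data)
fits <- lapply(levels(des$variables$z), function(lev)
  svyglm(y ~ x, design = subset(des, z == lev)))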

Retrieving Conditional Probabilities for Naive Bayes Model developed using caret Package in R

My main question is this:
How do you retrieve conditional probabilities for a Naïve Bayes model using the caret Package in R?
Background:
I have run a Naïve Bayes model using the caret package in R. The dataset is essentially a health dataset with a binary outcome variable (mistake vs. not a mistake), a series of categorical predictors, and one or two numerical predictors. We are using 5-fold cross-validation.
The model runs fine, but I would like to retrieve the conditional probabilities. How do I do this? For example, one of the predictors is "Pulse", which has 3 levels: Low, Normal, and High. I would like to retrieve something like "Given a Low Pulse, what is the probability of a Mistake?", i.e.:
p(y = "Mistake" | Pulse="Low").
The relevant code is here:
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
mod4 <- train(Target ~ ., data = train, method = "nb", trControl = ctrl)
In the klaR package, it's not hard to do (the second line displays this):
model4 <- naiveBayes(Target ~ ., data = train, scale = T)
model4_variable_posterior_prob <- model4$tables[[var2]]
However, I'd really like to use the cross-validated model that caret produces above, because it's a lot more accurate.
I should note that caret produces some tables here:
mod4$finalModel$tables
However, I'm not sure whether these tables contain the conditional probabilities or some other values.
For example, mod4$finalModel$tables$PulseX2 produces the following:
        [,1]      [,2]
X1 0.1343284 0.3415149
X2 0.1731343 0.3789293
I believe PulseX2 is the table for Pulse = Normal and PulseX3 is the table for Pulse = High, but I'm not entirely sure. I do know that X1 is a "mistake" and X2 is "not a mistake". My question is: is column [,1] a "0" value for the dummy variable PulseX2, and column [,2] a "1" value? By that logic, is 0.3415149 p(y = Mistake | Pulse = X2), relative to the baseline level PulseX1? Does anyone know what these values mean?
Alternatively, if there is some way I can retrieve some information on the important individual factors (not just important variables) that too would be fine.
This isn't really about caret; that object is created by the NaiveBayes function in the klaR package. The documentation for that package says:
tables: A list of tables, one for each predictor variable. For each categorical variable a table giving, for each attribute level, the conditional probabilities given the target class. For each numeric variable, a table giving, for each target class, mean and standard deviation of the (sub-)variable or an object of class density.
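Read against that documentation, your table makes sense: caret's formula interface expands Pulse into numeric 0/1 dummy columns, so klaR treats PulseX2 as a numeric variable. Each row is a target class, column [,1] is the mean of the dummy within that class, and column [,2] is its standard deviation (sanity check: sqrt(0.134 × 0.866) ≈ 0.341, matching the second column). For a 0/1 dummy, the per-class mean is exactly P(PulseX2 = 1 | class). A sketch of turning that around with Bayes' rule, assuming mod4$finalModel$apriori holds the class priors as klaR documents:
tab <- mod4$finalModel$tables$PulseX2

# rows = target classes; column 1 = P(PulseX2 = 1 | class) for a 0/1 dummy
p_x_given_y <- tab[, 1]

# class priors P(class) from the klaR fit
priors <- mod4$finalModel$apriori / sum(mod4$finalModel$apriori)

# Bayes' rule: P(class | PulseX2 = 1) is proportional to prior * likelihood
post <- priors * p_x_given_y
post / sum(post)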

using fitted() on output from lm with dummy variables

reg_ss <- predict(lm(stem_d ~ stand_id * yr, ss))
fitted.values(reg_ss)
# Error: $ operator is invalid for atomic vectors
I have tried this with fitted() and fitted.values() and receive the same error.
stand_id is a factor with 300+ levels and yr is an integer from 1 to 19; both look like numbers.
I have data on tree stem density collected in stands every 2-3 years for 20 years. I want to run a linear regression and predict stem density for stands in the years between samplings, i.e. use data from year 1 and 3 to predict stem density in year 2.
Any suggestions on how I can get predicted values using fitted() or any other method would be greatly appreciated. I suspect it has something to do with the dummy variables assigned to the categories, but I can't seem to find any information on a solution.
Thanks in advance!
If you want fitted values, you should not be calling predict() first.
reg_ss <- lm(stem_d ~ stand_id * yr, data = ss)
predict(reg_ss)
fitted(reg_ss)
When you don't pass new data to predict(), it does essentially the same thing as fitted(), so you get the same values back. Both fitted() and predict() return a simple named vector, and you cannot call fitted() on a named vector (hence the error message).
If you want to predict unobserved values, you need to pass a newdata= argument to predict(). Pass a data.frame with columns named "stand_id" and "yr", just like ss, and make sure the factor levels match as well.
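A minimal sketch of the interpolation described in the question, assuming ss has columns stem_d, stand_id (a factor), and yr (the stand id "101" below is made up):
reg_ss <- lm(stem_d ~ stand_id * yr, data = ss)

# predict stem density in unsampled year 2 for one stand; building the
# factor from levels(ss$stand_id) guarantees the levels match the model
newd <- data.frame(stand_id = factor("101", levels = levels(ss$stand_id)),
                   yr = 2)
predict(reg_ss, newdata = newd)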

using predict() and table() in r

I have fit a glm on the learning data set, which has 49511 observations after removing NAs.
glmodel <- glm(RESULT ~ ., family = binomial, data = learnfram)
Using that glm, I tried to predict the probabilities for the test data set, which has 49943 observations without NAs, but my resulting prediction has only 49511 elements.
predct <- predict(glmodel, type = "response", data = testfram)
Why does the result of predict not have 49943 elements?
I want to look at false positives and false negatives. I used table(), but it throws an error:
table(testfram$RESULT, predct>0.02)
## Error in table(testfram$RESULT, predct> 0.02) :
## all arguments must have the same length
How can I get my desired result?
You used the wrong argument name in predict: it should be newdata=, not data=. Since data= is not an argument predict recognizes, it is ignored, and the default behaviour when no new data is supplied is to return the predicted values for the data the model was built with. Hence you're getting the 49511 predictions for your original training data.
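A sketch of the corrected workflow, using the same object names as the question:
glmodel <- glm(RESULT ~ ., family = binomial, data = learnfram)

# newdata= (not data=) scores the test set: one prediction per row
predct <- predict(glmodel, newdata = testfram, type = "response")
length(predct)   # now matches nrow(testfram)

# false positives and false negatives at the 0.02 cutoff
table(testfram$RESULT, predct > 0.02)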
