Using predict() and table() in R

I have used glm on the learning data set, which has 49511 observations after removing NAs.
glmodel<-glm(RESULT ~ ., family=binomial,data=learnfram)
Using that glm, I tried to predict probabilities for the test data set, which has 49943 observations without NAs. My resulting prediction has only 49511 elements.
predct<-predict(glmodel, type="response", data=testfram)
Why does the result of predict not have 49943 elements?
I want to look for false positives and false negatives. I used table, but it throws an error:
table(testfram$RESULT, predct>0.02)
## Error in table(testfram$RESULT, predct> 0.02) :
## all arguments must have the same length
How can I get my desired result?

You used the wrong argument name in predict: it should be newdata=, not data=. Because data= is not a recognized argument, predict falls back to its default behaviour, which is to return predicted values for the data you created the model with. That is why you get 49511 elements: you are getting the predicted values for your original learning set.
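With the argument name corrected, the prediction has one element per test-set row and the table call works; a sketch using the objects from the question:

```r
# Use newdata=, not data=, so predict() scores the test set
predct <- predict(glmodel, newdata = testfram, type = "response")
length(predct)  # now matches nrow(testfram)

# Cross-tabulate actual outcomes against thresholded predictions
table(testfram$RESULT, predct > 0.02)
```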

Related

Using Base R, how would I accomplish the following tasks?

Using base R, I've created a model and am testing it with the predict function, which returns the probability of making more than $50k in a year. I then turn that probability into a usable categorical variable and add the predicted outcome to my test data frame dataToModel2 using the code below, but I'm unsure if I've done it right. Have I correctly fed my binary model's predictions into the data frame used to test my model, and which column represents the real outcomes here?
probabilities <- predict(theModel, newdata = dataToModel2 , type = "response")
dataToModel2$predictions <- probabilities > .5
str(dataToModel2)
If so, is there a formula that calculates the accuracy, false negatives, false positives, and positive predictive value? I understand vaguely that it involves putting the real-outcome column and my model's prediction column in the same units (making the real outcome TRUE/FALSE or 1/0), but I'm unsure how to do that or why it is necessary.
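For reference, a minimal sketch of a confusion matrix and the derived rates; the column name over50k is hypothetical and stands in for whatever column holds the true TRUE/FALSE outcome:

```r
# Hypothetical column: over50k holds the true TRUE/FALSE outcome
tab <- table(actual    = dataToModel2$over50k,
             predicted = dataToModel2$predictions)

accuracy  <- sum(diag(tab)) / sum(tab)            # overall accuracy
false_pos <- tab["FALSE", "TRUE"]                 # predicted TRUE, actually FALSE
false_neg <- tab["TRUE", "FALSE"]                 # predicted FALSE, actually TRUE
ppv       <- tab["TRUE", "TRUE"] / sum(tab[, "TRUE"])  # positive predictive value
```

This only works once both columns are logical (or the same 0/1 coding), which is why the units matter.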

Missing values in the outcome of predict.glmnet

I've built a lasso model with glmnet function.
Then I try to use this model in the function predict with a test set of 28055 rows, but the prediction output has only 25118 values. I guess it did not include rows with NA values, because the predictors in the test set have some missing values. I know that for glm one can deal with this with na.action = na.pass, but that option does not seem to exist in the glmnet package. Any suggestions?
EDIT: neither my test set nor my training set has any missing values.
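For context, with a plain stats::glm fit the row-dropping described above can be avoided by passing na.action = na.pass to predict, which keeps NA rows in the output; a sketch with hypothetical object names (glm_fit, test_df):

```r
# Hypothetical objects: glm_fit is a fitted glm, test_df is the test data frame
pred <- predict(glm_fit, newdata = test_df,
                type = "response", na.action = na.pass)
length(pred) == nrow(test_df)  # TRUE: rows with NA predictors yield NA predictions
```

glmnet, by contrast, takes a numeric matrix (newx) and expects it to be complete, so rows with NAs have to be handled (imputed or tracked separately) before calling predict.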

set random forest to classification

I am attempting a random forest on some data where the class variables is binary (either 1 or 0). Here is the code I'm running:
forest.model <- randomForest(x = ticdata2000[,1:85], y = ticdata2000[,86],
ntree=500,
mtry=9,
importance=TRUE,
norm.votes=TRUE,
na.action=na.roughfix,
replace=FALSE)
But when the forest gets to the end, I get the following error:
Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[, :
The response has five or fewer unique values. Are you sure you want to do regression?
The answer, of course, is no. I don't want to do regression. I have a single, discrete variable that only takes on 2 classes. And indeed, when I run predictions with this model, I get continuous numbers, when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this use regression rather than classification?
Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.
This is mentioned in the documentation under the y argument:
A response vector. If a factor, classification is assumed, otherwise
regression is assumed. If omitted, randomForest will run in
unsupervised mode.
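Concretely, a sketch of the fix, converting the response to a factor before fitting (other arguments as in the question):

```r
# Convert the 0/1 response to a factor so randomForest does classification
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = as.factor(ticdata2000[, 86]),
                             ntree = 500, mtry = 9,
                             importance = TRUE, norm.votes = TRUE,
                             replace = FALSE)

# Predictions are now factor levels ("0"/"1"), not continuous numbers
pred <- predict(forest.model, newdata = ticdata2000[, 1:85])
```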

How to fit a model I built to another data set and get residuals?

I fitted a mixed model to Data A as follows:
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=A)
Next, I want to see how the model fits Data B and also get the estimated residuals. Is there a function in R that I can use to do so?
(I tried the following method but got all new coefficients.)
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=B)
The reason you are getting new coefficients in your second attempt with data=B is that lme fits a new model to whatever data set you pass it, using the formula you provide, and stores that freshly fitted model in the variable model.
To get more information about a model you can type summary(model_name). The nlme library includes a method called predict.lme which allows you to make predictions based on a fitted model. You can type predict(my_model) to get the predictions for the original data set, or predict(my_model, some_other_data) to generate predictions from the same model for a different data set.
In your case, to get the residuals you just need to subtract the predicted values from the observed values: some_other_data$dependent_var - predict(my_model, some_other_data), or here B$Y - predict(model, B).
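Put together, a minimal sketch using the object names from the question:

```r
library(nlme)

# Fit the mixed model on data set A only
model <- lme(Y ~ 1 + X1 + X2 + X3, random = ~ 1 | Class, method = "ML", data = A)

# Score data set B with the model fitted on A (no refitting)
pred_B  <- predict(model, newdata = B)
resid_B <- B$Y - pred_B   # residuals: observed minus predicted
```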
Your model:
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=A)
Two predictions based on your model:
pred1=predict(model,newdata=A,type='response')
pred2=predict(model,newdata=B,type='response')
missed: a function that calculates the proportion of misclassified observations, with the cut-off set to 0.5 (every observation whose thresholded prediction disagrees with the true value is counted):
missed = function(values,prediction){sum(((prediction > 0.5)*1) !=
values)/length(values)}
missed(A$Y,pred1)
missed(B$Y,pred2)

How to call randomForest predict for use with ROCR?

I am having a hard time understanding how to build a ROC curve, and I have now come to the conclusion that maybe I am not creating the model correctly. I am running a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have divided the data into bank_training and bank_testing for prediction purposes.
Here are the steps I take:
bankrf <- randomForest(y_n~., data=bank_training, mtry=4, ntree=2,
keep.forest=TRUE, importance=TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type='response',
predict.all=TRUE, norm.votes=TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with 2 components named aggregate and individual. I don't understand where these 2 component names came from. Moreover, when I run:
summary(bankrf.pred)
Length Class Mode
aggregate 22606 factor numeric
individual 45212 -none- character
What does this summary mean? The datasets (training and testing) are 22605 and 22606 rows long, respectively. If someone can explain to me what is happening, I would be very grateful. I think something is wrong in all this.
When I try to design the ROC curve with ROCR I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is just a mistake in the way I try to create the ROC curve or is it from the beginning with randomForest?
The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions A vector, matrix, list, or data frame containing the
predictions.
labels A vector, matrix, list, or data frame containing the true
class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of the random forest model. As documented at ?predict.randomForest, it will be a list with two components. aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree.
So you probably want to do something like this:
pred <- prediction(bankrf.pred$aggregate, bank_testing$y_n)
See how that works? The predicted values are passed to the predictions argument, while the "labels" or true values, are passed to the labels argument.
You should drop the predict.all=TRUE argument from predict if you simply want the predicted classes. With predict.all=TRUE you are telling the function to also keep the predictions of every individual tree rather than only the aggregated prediction of the forest.
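A fuller sketch of building the ROC curve from class probabilities rather than hard labels; the probability column name "1" is an assumption and depends on the factor levels of y_n:

```r
library(randomForest)
library(ROCR)

# Class probabilities work better than hard labels for a ROC curve;
# assumes "1" is the positive level of y_n
probs <- predict(bankrf, bank_testing, type = "prob")[, "1"]

pred <- prediction(probs, bank_testing$y_n)   # predictions first, labels second
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)                                    # ROC curve
performance(pred, measure = "auc")@y.values   # area under the curve
```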
