Using base R, I've fitted a model and am testing it with the predict function, which returns the probability of making more than $50k in a year. I turn that probability into a usable categorical variable and then add the predicted outcome to my test data frame dataToModel2 using the code below, but I am unsure whether I've done it right. Have I correctly added my binary model's predictions to the data frame used to test the model, and what would represent the real outcomes here?
probabilities <- predict(theModel, newdata = dataToModel2 , type = "response")
dataToModel2$predictions <- probabilities > .5
str(dataToModel2)
If so, is there a formula that calculates the accuracy, false negatives, false positives, and positive predictive value? I vaguely understand that the column for the real outcome and the column for my model's predictions have to be in the same units (making the real outcome TRUE/FALSE or 1/0), but I am unsure how to do that or why it is necessary.
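A minimal sketch of how these quantities could be computed in base R, using made-up data. The names actual and prob are hypothetical stand-ins for the observed outcome column (whatever column the model was fitted to predict) and for the predicted probabilities. Both vectors are put on the same TRUE/FALSE scale because the counts below rely on comparing them element by element.

```r
# Toy stand-ins: 'actual' is the observed outcome, 'prob' the predicted probability
actual <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
prob   <- c(0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3, 0.2, 0.55)

# Put both on the same TRUE/FALSE scale so they can be compared directly
actual_pos    <- actual == 1
predicted_pos <- prob > 0.5

# Confusion matrix: rows are observed, columns are predicted
conf_mat <- table(actual = actual_pos, predicted = predicted_pos)

tp <- sum(predicted_pos & actual_pos)    # true positives
tn <- sum(!predicted_pos & !actual_pos)  # true negatives
fp <- sum(predicted_pos & !actual_pos)   # false positives
fn <- sum(!predicted_pos & actual_pos)   # false negatives

accuracy <- (tp + tn) / length(actual)
ppv      <- tp / (tp + fp)               # positive predictive value
```

With real data, actual would be the outcome column of dataToModel2 and predicted_pos would be dataToModel2$predictions.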
Suppose I am given the following data.
id1=rep(1:10,2)
trt=c(rep(1,10),rep(0,10))
set.seed(1005)
outcome=rnorm(20)
missing=c()
for(i in outcome){
if(rbinom(1,1,0.8*abs(i/max(abs(outcome))))==1){
missing=c(missing,which(outcome==i))
}
}
missing
trt[missing]=NA
dat=data.frame(id=id1,trt=as.factor(trt),outcome=outcome)
a1=mice::mice(dat,method=c('','logreg',''))
I can use mice to impute the data first and then analyse a1, where I assume that outcome and id predict trt through logistic regression. In fact, only outcome predicts trt here. a1$formulas$trt gives the formula used for imputation. I want to modify this formula so that it includes a constant offset.
forms_a1=a1$formulas
forms_a1$trt=as.formula(trt~outcome+offset(2))
mice::mice(dat,method=c('','logreg',''),formulas = forms_a1)
However, mice gives the following output.
iter imp variable
1 1 trtError in model.frame.default(formula, data = data, na.action = na.pass) :
variable lengths differ (found for 'offset(2)')
$Q1:$ How do I offset the intercept here? I think I could add an extra column to use as the offset variable and modify the formula and predictorMatrix. The $\delta$ shift is implemented here (https://stefvanbuuren.name/fimd/sec-sensitivity.html) for the continuous-variable case. However, it seems that this may change the estimated coefficients.
$Q2:$ If I am interested in offsetting a slope, say by 10*id in the above formula, how would I do so?
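The error message itself can be reproduced outside mice. The sketch below uses plain glm on simulated data (x, y and n are hypothetical) to show that offset(2) fails because the offset must have one value per row, while a full-length constant vector works. It does not test whether mice's logreg method honours an offset term in the formula; that remains an open assumption.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))

# offset(2) has length 1 rather than n, so building the model frame fails
err <- tryCatch(glm(y ~ x + offset(2), family = binomial),
                error = function(e) conditionMessage(e))

# A constant offset supplied as a full-length vector is accepted
fit_plain  <- glm(y ~ x, family = binomial)
fit_offset <- glm(y ~ x + offset(rep(2, n)), family = binomial)

# The intercept absorbs the constant (it drops by about 2); the slope is unchanged
coef_shift <- coef(fit_plain) - coef(fit_offset)
```

An offset such as 10*id is already a full-length vector, so in plain glm it can be written directly as offset(10*id).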
I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
#Load packages, set random seed
library(gbm)
set.seed(1)
#Generate random data
N<-1000
x<-rnorm(N)
y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
z<-rep(0,N)
for(i in 1:N){
if(x[i]-y[i]+0.2*rnorm(1)>1.0){
z[i]=1
}
}
#Create data frame
myData<-data.frame(x,y,z)
#Split data set into train and test
train<-sample(N,800,replace=FALSE)
test<-(-train)
#Boosting
boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?
Your observed behavior is correct. From the documentation:
If type="response" then gbm converts back to the same scale as the
outcome. Currently the only effect this will have is returning
probabilities for bernoulli.
So you should be getting probabilities when using type="response", which is correct. Also, distribution="bernoulli" merely tells gbm that the labels follow a Bernoulli (0/1) pattern. If you omit it, gbm tries to guess the distribution from the response, and the model will still run fine.
To proceed, use predict_class <- pred.boost > 0.5 (cutoff = 0.5), or plot the ROC curve to choose a cutoff yourself.
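A rough sketch of both options in base R. The vectors z_test and p_hat are simulated stand-ins for the test labels and for pred.boost, not real gbm output, and picking the cutoff that maximises the true positive rate minus the false positive rate is only one possible rule.

```r
set.seed(1)
# Simulated stand-ins for the test labels and the predicted probabilities
z_test <- rbinom(200, 1, 0.3)
p_hat  <- plogis(2 * z_test - 1 + rnorm(200))

# Hard classification at the default cutoff of 0.5
predict_class <- as.integer(p_hat > 0.5)
table(observed = z_test, predicted = predict_class)

# Sweep cutoffs and record true/false positive rates (the points of a ROC curve)
cutoffs <- seq(0.05, 0.95, by = 0.05)
rates <- t(sapply(cutoffs, function(k) {
  pred <- p_hat > k
  c(cutoff = k,
    tpr = sum(pred & z_test == 1) / sum(z_test == 1),
    fpr = sum(pred & z_test == 0) / sum(z_test == 0))
}))

# One possible rule: take the cutoff with the largest tpr - fpr
best <- rates[which.max(rates[, "tpr"] - rates[, "fpr"]), ]
```

Plotting rates[, "fpr"] against rates[, "tpr"] gives a coarse ROC curve.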
Try using adabag. Class, probabilities, votes and error are built into adabag, which makes the results easy to interpret and, of course, takes fewer lines of code.
I've run a simple model using orm (i.e. reg <- orm(formula = y ~ x)) and I'm having trouble understanding how to get predicted values for Y. I've never worked with models that use multiple intercepts. I want to know for each and every value of Y in my dataset what the predicted value from the model would be. I tried predict(reg, type="mean") and this produced values that are close to the predicted values from an OLS regression, but I'm not sure if this is what I want. I really just want something analogous to OLS where you can obtain the E(Y) given a set of predictors. If possible, please provide code I can run to do this with a very short explanation.
I have used glm on the learning data set, which has 49511 observations without NAs.
glmodel<-glm(RESULT ~ ., family=binomial,data=learnfram)
Using that glm, I tried to predict the probability for the test data set, which has 49943 observations without NAs. My resulting prediction has only 49511 elements.
predct<-predict(glmodel, type="response", data=testfram)
Why does the result of predict not have 49943 elements?
I want to look for false positives and false negatives. I used table, but it throws an error:
table(testfram$RESULT, predct>0.02)
## Error in table(testfram$RESULT, predct> 0.02) :
## all arguments must have the same length
How can I get my desired result?
You used the wrong parameter name in predict. It should be newdata=, not data=. When you don't specify newdata, predict returns the predicted values for the data the model was fitted with. That is why you get 49511 elements: they are the predicted values for your original learning data.
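The difference can be checked on a small simulated example. The data frames learn and test are hypothetical stand-ins for learnfram and testfram.

```r
set.seed(123)
learn <- data.frame(x = rnorm(100))
learn$RESULT <- rbinom(100, 1, plogis(learn$x))
test <- data.frame(x = rnorm(40))
test$RESULT <- rbinom(40, 1, plogis(test$x))

fit <- glm(RESULT ~ ., family = binomial, data = learn)

# 'data' is not an argument of predict.glm, so it is silently ignored
p_wrong <- predict(fit, type = "response", data = test)
length(p_wrong)   # 100, the size of the learning data

# 'newdata' is the argument that supplies the rows to predict for
p_right <- predict(fit, type = "response", newdata = test)
length(p_right)   # 40, the size of the test data

# Lengths now match, so table() works
tab <- table(observed = test$RESULT, predicted = p_right > 0.5)
```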
I fitted a mixed model to Data A as follows:
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=A)
Next, I want to see how the model fits Data B and also get the estimated residuals. Is there a function in R that I can use to do so?
(I tried the following method but got all new coefficients.)
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=B)
The reason you are getting new coefficients in your second attempt with data=B is that lme always fits a model to the data set you pass it, using the formula you provide, and stores the result in the variable you assign it to (here, model). Calling it with data=B therefore estimates a new model instead of applying the one fitted to A.
To get more information about a model you can type summary(model_name). The nlme package includes a method called predict.lme, which makes predictions from a fitted model. Type predict(my_model) to get predictions for the original data set, or predict(my_model, some_other_data) to generate predictions from that model for a different data set.
In your case, to get the residuals you just need to subtract the predicted values from the observed values. So use some_other_data$dependent_var - predict(my_model, some_other_data), or in your case B$Y - predict(model, B).
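A small simulated sketch of this workflow, assuming A and B share the same Class levels. The helper make_data and the single predictor X1 are hypothetical simplifications of the original data.

```r
library(nlme)

set.seed(2024)
u <- rnorm(5)  # class-level intercept shifts shared by A and B
make_data <- function(n_per_class) {
  Class <- factor(rep(1:5, each = n_per_class))
  X1 <- rnorm(length(Class))
  Y <- 1 + 2 * X1 + u[as.integer(Class)] + rnorm(length(Class))
  data.frame(Y, X1, Class)
}
A <- make_data(20)
B <- make_data(10)

# Fit once, on A only
model <- lme(Y ~ 1 + X1, random = ~ 1 | Class, method = "ML", data = A)

# Apply the fitted model to B without refitting
pred_fixed <- predict(model, newdata = B, level = 0)  # fixed effects only
pred_class <- predict(model, newdata = B, level = 1)  # adds the Class random intercepts

# Residuals on B: observed minus predicted
resid_B <- B$Y - pred_class
```

If B contained Class levels that do not occur in A, the level = 1 predictions for those rows would be NA, and level = 0 would be the only usable option.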
Your model:
model <- lme(Y~1+X1+X2+X3, random=~1|Class, method="ML", data=A)
Two predictions based on your model:
pred1=predict(model,newdata=A,type='response')
pred2=predict(model,newdata=B,type='response')
missed: a function that calculates the proportion of misclassified observations, with the cut-off set to 0.5. It assumes the observed values are coded 0/1.
(This counts false positives, predicted true but in reality not positive, together with false negatives.)
missed = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
missed(A$Y,pred1)
missed(B$Y,pred2)