How to predict using glm in R?

So I have a glm that is defined like this
oring.glm = glm(oring.data$Damaged ~ oring.data$Temp, data = oring.data, family=binomial)
The data looks like this
Oring Temp
1 15
0 20
1 30
I want to predict what happens to the Oring at a specific temperature
I've tried doing this
logodds = predict(oring.glm, list(Temp=31))
But this gives me a list of values, as opposed to a single odds value.
How do I get that?

If I'm correct to assume that your DV is dichotomous, then I'd use the logit link function and access the predicted values of the fit using simple indexing:
g = glm(y ~ x, family = binomial("logit"))  # fit
predict(g)[1234]  # linear predictor (log-odds) for the 1234th observation
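For the original question, the vector of values usually comes from writing the formula as oring.data$Damaged ~ oring.data$Temp: predict() then cannot match Temp in newdata and simply returns the fitted values for every training row. A minimal sketch of the usual fix, assuming the response column in oring.data is actually called Damaged:
# refit with plain column names so predict() can use newdata
oring.glm <- glm(Damaged ~ Temp, data = oring.data, family = binomial)
# single log-odds value at Temp = 31
predict(oring.glm, newdata = data.frame(Temp = 31))
# predicted probability at Temp = 31
predict(oring.glm, newdata = data.frame(Temp = 31), type = "response")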

Related

How can I incorporate the prior weight into my GLM function?

I am trying to incorporate the prior weighting of my dependent variable in my logistic regression in R using the glm function. The data set I am using was created to predict churn.
So far I am using the function below:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family =
binomial(link='logit'))
What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the values 0 or 1. The data set is imbalanced: only 10% of observations have a value of 1 on the dependent variable CH1 and the other 90% have a value of 0. Therefore the weights are (0.1, 0.9).
My data set is built up in the following manner, where the independent variables vary in data type between continuous and class variables.
Although the ratio of 1s to 0s is 1:9, it does not mean the weights are 0.1 and 0.9. The weights decide how much emphasis you want to give each observation compared to the others.
And in your case, if you want to predict something, it is essential you split your data into train and test, and see what influence the weights have on prediction.
Below, using the Pima Indian diabetes example, I subsample the Yes type so that the training set has a 1:9 ratio.
set.seed(111)
library(MASS)
# we sample 10 from Yes and 90 from No
idx = unlist(mapply(sample,split(1:nrow(Pima.tr),Pima.tr$type),c(90,10)))
Data = Pima.tr
trn = Data[idx,]
test = Data[-idx,]
table(trn$type)
No Yes
90 10
Let's try regressing it with weight 9 if positive and 1 if negative:
library(caret)
W = 9
lvl = levels(trn$type)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
# we test it on the test set
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 34 26
Yes 8 32
You can see from the above that it's doing OK, but you are missing 26 of the actual positives and also producing 8 false positives. Let's say we try W = 3:
W = 3
lvl = levels(trn$type)
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 39 30
Yes 3 28
Now we manage to get almost all of the positive calls correct, but we still miss out on a lot of potential "Yes". The bottom line is that the code above might work, but you need to do some checks to figure out the right weight for your data.
You can also look around the other stats provided by confusionMatrix in caret to guide your choice.
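For example, the object returned by confusionMatrix can be stored and its per-class statistics inspected directly (a small illustration using the objects defined above):
cm <- confusionMatrix(pred, test$type, positive = lvl[2])
# sensitivity (recall on "Yes"), specificity and balanced accuracy
cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]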
In your dataset trainingset, create a column called weights_col that contains your weights (0.1, 0.9) and then run:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family = binomial(link='logit'), weights = weights_col)
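If CH1 is coded 0/1, the column could be built with something like the following sketch (weights_col is just the hypothetical name used above, and whether 0.1/0.9 are the right values is worth checking on a held-out set, as discussed in the previous answer):
# give the minority class (CH1 == 1) weight 0.9 and the majority class weight 0.1
trainingset$weights_col <- ifelse(trainingset$CH1 == 1, 0.9, 0.1)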

Extract individual coefficients from plm model (R)

I would like to ask if it is possible to extract coefficients for individual observations for a given variable in panel data using plm in R, for instance using the example data:
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
In other words, in this example a coefficient for each of the firms in the sample.
It is not clear to me what you are looking for:
1) "coefficients for individual observations"
vs.
2) "a coefficient for each of the firms"
These are two different things, and I believe you can have neither.
1) Asking for more than one coefficient for one observation does not work.
2) Asking for one coefficient per firm but trying to estimate two (value and capital) does not seem plausible.
Maybe this answer gives what you are looking for:
Extract all individual slope coefficient from pooled OLS estimation in R
I think you're looking for the function fixef in the plm package, which extracts the "fixed effects" (firm coefficients in this case). In your example:
library(plm)
data("Grunfeld")
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
and then run:
> fixef(wi)
1 2 3 4 5 6 7 8 9
-134.227709 72.826531 -269.458508 -38.873866 -139.666304 -31.339066 -82.761099 -66.737194 -104.010153
10
-7.390586
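Since the model was estimated with effect = "twoways", the time fixed effects can be extracted the same way via the effect argument of fixef (a short sketch):
# time effects from the same twoways within model
fixef(wi, effect = "time")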

Linear predictions always the same regardless of features in model

I'm using the Ames, Iowa housing prices data set.
I have a train set and test set. The test set is missing the dependent variable SalePrice. (No column for SalePrice exists).
I have fit a linear model and am now trying to predict the SalePrice values on the test set. But when doing so, I always get the same predicted values for SalePrice, regardless of the model used.
Then when trying to calculate RMSE, I get NA.
Here is my model:
lm2 <- lm(SalePrice ~
GarageCars +
Neighborhood +
I(OverallQual^2) + OverallQual +
OverallQual*GrLivArea +
log2(LotArea) +
log2(GrLivArea) +
KitchenQual +
I(TotalBsmtSF^2) +
TotalBsmtSF
, data=train)
# Add an empty column to the test set,
# to be later filled in by predictions
# (Is this even necessary?):
test[, "SalePrice"] <- NA
# My predictions:
predictions <- predict(lm2, newdata = test)
head(predictions)
1 2 3 4 5 6
121093.5 170270.7 170029.5 187012.1 239359.2 172962.1
I always get these same values regardless of the model used. I suspect I'm just not understanding predict(). I suspect I am only getting the predicted values based on my train set rather than on my test set.
I know that the variable names need to match exactly those used in the model, but what other aspect of predict am I not understanding? Do I need to perform the same predictor variable transformations in the test set? Must I create variables to hold them?
Then I calculate the RMSE:
# Formula function for calculating RMSE:
rmse <- function(actual, pred) sqrt(mean((actual-pred)^2))
# Calculate rmse on test set:
rmse(test$SalePrice, predictions)
[1] NA
Could you please tell me what I'm doing wrong? Let me know if you need to see the data.
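One thing worth noting: because test$SalePrice was just filled with NA, rmse() is comparing the predictions against a column of missing values, which on its own is enough to produce NA. A sketch of one way to get an RMSE anyway, by holding back part of the training data (the split proportion here is arbitrary):
set.seed(1)
idx <- sample(nrow(train), 0.8 * nrow(train))
train_part <- train[idx, ]
valid_part <- train[-idx, ]
lm2v <- update(lm2, data = train_part)             # refit on the 80% portion
valid_pred <- predict(lm2v, newdata = valid_part)  # predict the held-out 20%
rmse(valid_part$SalePrice, valid_pred)
Transformations written into the formula, such as I(OverallQual^2) and log2(LotArea), are applied to newdata automatically, so no extra columns need to be created by hand.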

How to plot glm model coefficients with abline in R?

I'm struggling to plot the coefficients of a glm model using abline. Let's take this simple 2D example:
d <- iris[51:150, c(3:4,5)]
d[,3] <- factor(d[,3])
plot(d[,1:2], col=d[,3])
The glm model yields 4 coefficients:
m <- glm(formula = Species~Petal.Length*Petal.Width, data = d, family = "binomial")
m$coefficients
# (Intercept) Petal.Length Petal.Width Petal.Length:Petal.Width
# -131.23813 22.93553 63.63527 -10.63606
How to plot those with a simple abline?
Binomial models are usually not set up like this. You usually have a single 0/1 response variable (i.e. you predict whether a sample belongs to a given species). Because you only have 2 species included in your model, it still seems to work (this is not the case when all 3 species are included).
The second trick is to predict with type="response" and round these values to get discrete predictions:
d$pred <- factor(levels(d[,3])[round(predict(m, type="response"))+1])
plot(d[,1:2], col=d[,3])
points(d[,1:2], col=d$pred, pch=4)
Here I've added an "X" for the predictions. If the color is the same, the prediction was correct. I count 5 samples where the prediction was incorrect.
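If the goal is literally a straight abline, one option is to drop the interaction term, since the p = 0.5 decision boundary of a main-effects-only logistic model is the line where the linear predictor equals zero (a sketch under that assumption):
m2 <- glm(Species ~ Petal.Length + Petal.Width, data = d, family = "binomial")
cf <- coef(m2)
# 0 = cf[1] + cf[2]*Petal.Length + cf[3]*Petal.Width
# => Petal.Width = -cf[1]/cf[3] - (cf[2]/cf[3]) * Petal.Length
abline(a = -cf[1] / cf[3], b = -cf[2] / cf[3], lty = 2)
With the interaction kept in the model the boundary is a curve, so abline can at best approximate it.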

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variate regression model trained on 5 samples, I want to make a prediction for the test data point where the first variate is 3 and the second variate is 5. But I get a warning from the above code saying that 'newdata' had 1 row but variable(s) found have 5 rows. How can I correct the above code? The code below works fine, where I give the variables separately to the model formula. But since I will have hundreds of variates, I have to give them in a matrix, as it would be infeasible to append hundreds of columns using the + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that it is named the same way as the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
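To see why the names line up, both data.frame() calls expand the matrix into columns named predictors.1 and predictors.2 (a quick check):
names(indata)
# "price" "predictors.1" "predictors.2"
names(data.frame(predictors = matrix(c(3, 5), nrow = 1)))
# "predictors.1" "predictors.2"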
You need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))
