How to plot glm model coefficients with abline in R?

I'm struggling to plot the coefficients of a glm model using abline. Let's take this simple 2D example:
d <- iris[51:150, c(3:4,5)]
d[,3] <- factor(d[,3])
plot(d[,1:2], col=d[,3])
The glm model yields 4 coefficients:
m <- glm(formula = Species~Petal.Length*Petal.Width, data = d, family = "binomial")
m$coefficients
# (Intercept) Petal.Length Petal.Width Petal.Length:Petal.Width
# -131.23813 22.93553 63.63527 -10.63606
How to plot those with a simple abline?

Binomial models are usually not set up like this. You would usually have a single 0|1 response variable (i.e. predict whether a sample is a given species). It probably still works here because only 2 species are included in your model (this is not the case when all 3 species are included).
The second trick is to predict with type="response" and round these values to get discrete predictions:
d$pred <- factor(levels(d[,3])[round(predict(m, type="response"))+1])
plot(d[,1:2], col=d[,3])
points(d[,1:2], col=d$pred, pch=4)
Here I've added an "X" for the predictions. If the color is the same, the prediction was correct. I count 5 samples where the prediction was incorrect.
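For the abline part of the question, one possible approach is sketched below. It assumes a simpler additive model (m2 is a name introduced here, not part of the original code), because a straight line only describes the 0.5-probability boundary when there is no interaction term. With Petal.Length:Petal.Width in the model, setting the linear predictor to zero gives Petal.Width = -(b0 + b1*Petal.Length) / (b2 + b3*Petal.Length), which is a curve that abline cannot draw.

```r
# Same two-species subset as in the question
d <- iris[51:150, c(3:4, 5)]
d[, 3] <- factor(d[, 3])

# Assumption: additive model without the interaction term
m2 <- glm(Species ~ Petal.Length + Petal.Width, data = d, family = "binomial")
b <- unname(coef(m2))

# Boundary where the linear predictor is 0 (predicted probability 0.5):
# b[1] + b[2] * Petal.Length + b[3] * Petal.Width = 0
boundary_intercept <- -b[1] / b[3]
boundary_slope     <- -b[2] / b[3]

plot(d[, 1:2], col = d[, 3])
abline(a = boundary_intercept, b = boundary_slope)
```

Points on one side of the line are predicted as the first factor level, points on the other side as the second.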

Related

Too many coefficients with lm

I'm studying the relationship between expenditure per student and performance on PISA (a standardized test). I know that this regression can't give me a ceteris paribus relationship, but that is the point of my exercise: I have to explain why it will not work.
I was running the regression in R with the basic code:
lm1=lm(a~b)
but the problem is that R reports 32 coefficients, which is the number of units in my population, while I should only get the slope and the intercept, given that this is a simple regression.
This is the output that R gives me:
Call:
lm(formula = a ~ b)
Coefficients:
(Intercept)     b10167.3     b10467.8     b10766.4     b10863.4     b10960.1    b11.688.4     b11028.1       b11052     b11207.3     b11855.9     b12424.3     b13930.8
   522.9936       5.9561       0.3401     -20.6884     -14.8603     -15.0777      -3.5752     -23.0459     -27.1021     -42.2692     -20.4485     -35.3906     -30.7468
   b14353.3     b2.997.9     b20450.9      b3714.8      b4996.3      b5291.6      b5851.7      b6190.7      b6663.3      b6725.3      b6747.2      b7074.9      b8189.1
   -18.4412    -107.2872     -39.6793     -98.2315     -80.2505     -36.2202     -48.6179     -64.2414       1.3887     -19.0389     -59.9734     -32.0751     -31.5962
    b8406.2      b8533.5      b8671.1      b8996.3      b9265.7      b9897.2
   -13.4219     -26.0155     -13.9045     -37.9996     -17.0271     -27.2954
As you can see there are 32 coefficients while I should get only two. It seems that R is reading each unit of the population as a variable, but the dataset is, as always, set up with the variables in rows. I can't figure out what the problem is.
It's not a problem with the lm function. It appears that R is treating $b$ as a categorical variable.
I have made a small data set with 5 observations of $a$ (numeric variable) and $b$ (categorical variable).
When I fit the model, I see output similar to yours (5 estimated coefficients).
data = data.frame(a = 1:5, b = as.factor(rnorm(5)))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept)  b-0.16380292500502  b0.213340249988902  b0.423891299272316  b0.63738307939327
          4                  -3                  -1                   1                 -2
To correct this you need to convert $b$ into a numerical vector.
data$b = as.numeric(as.character(data$b))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept)            b
     2.9580       0.2772
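Before converting, it may help to check why the column was read as text in the first place. Level names such as b11.688.4 and b2.997.9 in the question's output suggest that some values contain an extra separator, though that is only a guess from the printed names. The sketch below uses a made-up vector (b_raw is hypothetical, not the asker's data) to show how such entries can be located, since as.numeric() turns them into NA.

```r
# Hypothetical values stored as text; the second has a stray separator
b_raw <- c("10167.3", "11.688.4", "8996.3")

# A character or factor column on the right-hand side is expanded into
# one dummy coefficient per level, which is what the question shows
class(b_raw)

# Convert, silencing the "NAs introduced by coercion" warning
b_num <- suppressWarnings(as.numeric(b_raw))

# Entries that could not be parsed and need cleaning in the source file
bad_entries <- b_raw[is.na(b_num)]
bad_entries
```

If bad_entries is not empty, fixing those values in the input file (or at import) is safer than silently carrying NA into lm.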

What is the correct way to use weights in a logistic regression in R?

My data is survey data of car buyers. It has a weight column that I used in SPSS to get sample sizes. The weight column is affected by demographic factors and vehicle sales. Now I am trying to put together a logistic regression model for a car segment which includes a few vehicles. I want to use the weight column in the logistic regression model, and I tried to do so using "weights" in the glm function. But the results are horrific: deviances are too high and McFadden's R-squared is too low. My dependent variable is binary; the independent variables are on a 1 to 5 scale. The weight column is numerical, ranging from 32 to 197. Could that be a reason the results are poor? Do I need to have values in the weight column below 1?
The format of the input file to R is:
WGT output I1 I2 I3 I4 I5
67 1 1 3 1 5 4
I1, I2, I3 being independent variables
logr<-glm(output~1,data=data1,weights=WGT,family="binomial")
logrstep<-step(logr,direction = "both",scope = formula(data1))
logr1<-glm(output~ (formula from final iteration),weights = WGT,data=data1,family="binomial")
hl <- hoslem.test(data1$output,fitted(logr1),g=10)
I want a logistic regression model with better accuracy, and to gain a better understanding of using weights with logistic regression.
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following, assuming your data is in a data frame called df:
my_svy <- svydesign(ids = ~1, weights = ~WGT, data = df)
Then you can do the following:
my_fit <- svyglm(output ~1, my_svy, family = "binomial")
For a full reprex check out the below example
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As for the stepwise regression, this typically isn't a great methodology from a statistical point of view. It results in an R-squared that is biased high, along with other problems for inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).
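On the question of whether the weights need to be below 1: the sketch below, using base R only, illustrates what rescaling the weights does in glm. The data frame dat, the simulated weight column w (drawn from the 32 to 197 range mentioned in the question) and the model vs ~ disp are all assumptions made for illustration. Dividing the weights by their mean leaves the coefficients essentially unchanged and scales the deviance by a constant, so large raw deviances are partly an artifact of the weight scale. It does not produce design-based standard errors, which is what the survey package is for.

```r
set.seed(1)
dat <- mtcars
dat$w <- runif(nrow(dat), 32, 197)   # made-up weights in the asker's range

# quasibinomial avoids the non-integer-successes warning that
# family = "binomial" gives with fractional weights
fit_raw    <- glm(vs ~ disp, data = dat, weights = w,
                  family = quasibinomial)
fit_scaled <- glm(vs ~ disp, data = dat, weights = w / mean(w),
                  family = quasibinomial)

# Coefficients agree up to numerical tolerance
rbind(raw = coef(fit_raw), scaled = coef(fit_scaled))

# Deviance scales with the weights; the ratio equals mean(w)
dev_ratio <- deviance(fit_raw) / deviance(fit_scaled)
c(dev_ratio = dev_ratio, mean_w = mean(dat$w))
```

Because the null deviance scales by the same constant, a ratio such as 1 - deviance/null.deviance is unaffected by this rescaling.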

Regression with weights: Fewer standardized residuals than observations

I modelled a multiple regression based on the Mincer wage equation, and I added a weighting factor to make it representative of the whole population.
But when I add the weights argument to my model, R calculates fewer standardized residuals than I have observations.
Here's my model:
lm(log(earings) ~ Gender + Age + Age^2 + Education, weights= phrf)
So I have trouble analyzing the residuals: when I try to plot rstandard against fitted.values, R reports: Different Variable Length in rstandard() found.
This problem occurs only with rstandard and rstudent; when I plot the normal resid() against fitted.values there is no problem.
And when I leave out the weights argument I have no problems either.
In the help file for rstudent():
Note that cases with weights == 0 are dropped from all these functions, but that if a linear model has been fitted with na.action = na.exclude, suitable values are filled in for the cases excluded during fitting.
A simple example to demonstrate:
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0
fit <- lm(y ~ x, weights=w)
length(fitted(fit))
length(rstudent(fit))
Gives:
> length(fitted(fit))
[1] 100
> length(rstudent(fit))
[1] 99
And this makes sense. Weights are taken to be inversely proportional to the variance, so a weight of 0 corresponds to an infinite theoretical variance, and a studentized or standardized residual is not defined for that observation.
Since you are effectively deleting those observations, you can either subset the call to lm with subset=w!=0, or use that same condition to index the fitted values:
plot(fitted(fit)[w!=0], rstudent(fit))
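The other option mentioned above, dropping the zero-weight cases at fitting time, can be sketched as follows. It reuses the simulated data from the example; fit_sub is a name introduced here.

```r
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0

# Exclude zero-weight observations when fitting
fit_sub <- lm(y ~ x, weights = w, subset = w != 0)

# Fitted values and standardized residuals now have the same length
length(fitted(fit_sub))
length(rstandard(fit_sub))

plot(fitted(fit_sub), rstandard(fit_sub))
```

With this approach no indexing is needed later, because every vector extracted from the model refers to the same set of observations.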

binary response variable and categorical predictors in phylolm package

I am using the phylolm package to run a model with a binary response variable (0/1), a continuous predictor, and a categorical predictor with more than 3 levels. If I treat the categorical predictor as continuous, i.e., 0, 1, 2, 3, the model runs well and I can use summary(model) to obtain the model output. However, treating categorical levels as continuous does not fit reality; I think it is right to treat them as categories. Fitted that way, the model works, but I have a problem obtaining the model output: summary(model) gives the results for each category compared to the first. I would like an "anova"-style table summarizing the significance of each variable, but the anova function does not apply to this kind of analysis. Is there any way to obtain such results for this model?
Some example scripts:
require(phylolm)
set.seed(123456)
# Simulate a tree of 50 species
tre = rtree(50)
# Simulate a continuous trait
conTrait = rTrait(n=1,phy=tre)
# Make a design matrix for the binary trait simulation
X = cbind(rep(1,50),conTrait)
# Simulate a binary trait
binTrait = rbinTrait(n=1,phy=tre, beta=c(-1,0.5), alpha=1 ,X=X)
# Simulate a random categorical trait
catTrait <- as.factor(sample(c("A","B","C"), size=length(tre$tip.label), replace=TRUE))
# Create data frame
dat = data.frame(binTrait = binTrait, conTrait = conTrait, catTrait = catTrait)
### run the model
fit = phyloglm(binTrait ~ conTrait*catTrait, phy=tre, data=dat)
##model output
summary(fit)
Call:
phyloglm(formula = binTrait ~ conTrait * catTrait, data = dat,
    phy = tre)
       AIC     logLik Pen.logLik
     52.07     -19.04     -17.28
Method: logistic_MPLE
Mean tip height: 3.596271
Parameter estimate(s):
alpha: 1.437638
Coefficients:
                   Estimate   StdErr z.value p.value
(Intercept)        -0.61804  0.83270 -0.7422  0.4580
conTrait            1.52295  1.16256  1.3100  0.1902
catTraitB           0.92563  0.98812  0.9368  0.3489
catTraitC          -0.24900  1.01255 -0.2459  0.8057
conTrait:catTraitB  0.49031  1.41858  0.3456  0.7296
conTrait:catTraitC -0.74376  1.29850 -0.5728  0.5668
Note: Wald-type p-values for coefficients, conditional on alpha=1.437638

'predict' gives different results than using manually the coefficients from 'summary'

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Using the coefficients' values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two sets of predictions for the last 30 values differ, maybe only to a small extent, but they do differ. Why is that? Shouldn't they be exactly the same?
The difference is really small, and I think it is just due to the precision of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficient values from the model and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in the test set using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
  sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason your predictions differ is that predict() uses the coefficients at full precision, whereas your "manual" calculation uses them rounded to five decimal places. The summary() function doesn't display the complete value of your coefficients; it rounds them to make the output more readable.
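To see the size of the rounding effect, the following sketch compares a manual prediction with full-precision coefficients against one with coefficients rounded to five decimals. It rebuilds the train/test split from the answer above; X, manual_full and manual_rounded are names introduced here for illustration.

```r
train.df <- iris[1:120, 1:4]
test.df  <- iris[-(1:120), 1:4]
fit <- lm(Petal.Width ~ ., data = train.df)

# Design matrix for the test rows: intercept column plus the predictors,
# in the same order as coef(fit)
X <- cbind(1, as.matrix(test.df[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]))

manual_full    <- drop(X %*% coef(fit))            # full precision
manual_rounded <- drop(X %*% round(coef(fit), 5))  # as read off a printout

pred <- predict(fit, test.df)

# Largest absolute discrepancy in each case
diff_full    <- max(abs(manual_full - pred))
diff_rounded <- max(abs(manual_rounded - pred))
c(full = diff_full, rounded = diff_rounded)
```

The full-precision version should match predict() up to floating-point error, while the rounded version shows a small but nonzero gap.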
