Average Marginal Effects in R with complex interaction terms

I am using R to compute the linear regression on the following model, as well as find the marginal effects of age on pizza at specific points (20,30,40,50,55).
mod6.22c <- lm(pizza ~ age + income + age*income +
I((age*age)*income), data = piz4)
The problem I am running into is that the margins command does not see interaction terms that are entered into lm() as I((age*age)*income). The margins command only produces accurate average marginal effects when the interaction terms are written directly with formula operators (e.g. variable1*variable2). I also can't create a new variable in my table with table$newvariable <- table$variable1^2, because the margins command won't recognize that newvariable is related to variable1.
This has been fine up until now, since my interaction terms were only a quadratic or an x*y interaction. Now, however, I need to calculate the average marginal effects with the interaction term age^2*income in the model, and the only way I can get the lm summary output to be correct is by using I(age^2*income) or by creating a new variable in my table. As stated before, the margins command can't read I(age^2*income), and if I create a new variable, margins doesn't recognize that the variables are related, so the average marginal effects it produces are incorrect.
The error I am receiving:
> summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)),
variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
I appreciate any help in advance.
Summary of data:
pizza is annual expenditure on pizza; female, hs, college, and grad are dummy variables; income is in thousands of dollars per year; age is in years.
> head(piz4)
  pizza female hs college grad income age agesq
1   109      1  0       0    0   19.5  25   625
2     0      1  0       0    0   39.0  45  2025
3     0      1  0       0    0   15.6  20   400
4   108      1  0       0    0   26.0  28   784
5   220      1  1       0    0   19.5  25   625
6   189      1  1       0    0   39.0  35  1225
Libraries used:
library(data.table)
library(dplyr)
library(margins)
tl;dr
This works:
mod6.22 <- lm(pizza ~ age + income + age*income, data = piz4)
summary(margins(mod6.22, at = list(age = c(20,30,40,50,55)), variables = "income"))
factor age AME SE z p lower upper
income 20.0000 4.5151 1.5204 2.9697 0.0030 1.5352 7.4950
income 30.0000 3.2827 0.9049 3.6276 0.0003 1.5091 5.0563
income 40.0000 2.0503 0.4651 4.4087 0.0000 1.1388 2.9618
income 50.0000 0.8179 0.7100 1.1520 0.2493 -0.5736 2.2095
income 55.0000 0.2017 0.9909 0.2036 0.8387 -1.7403 2.1438
This doesn't work:
mod6.22c <- lm(pizza ~ age + income + age*income + I((age * age)*income), data = piz4)
summary(margins(mod6.22c, at = list(age = c(20,30,40,50,55)), variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
How do I get margins to read my interaction variable I((age*age)*income)?
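One workaround worth sketching (this is my addition, not part of the original question): since the fitted model is linear in the coefficients, the marginal effect of income at each age can be computed by hand from coef(), bypassing margins entirely. The coefficient name for the I() term below is how R deparses that formula term; check names(coef(mod6.22c)) against your own fit. The piz4 data here are simulated stand-ins, since the real data are not reproduced in full.

```r
# Simulated stand-in for piz4 (the real data are not fully reproduced here)
set.seed(1)
piz4 <- data.frame(age = runif(200, 18, 60), income = runif(200, 10, 80))
piz4$pizza <- 150 + 2*piz4$age + 6*piz4$income -
  0.1*piz4$age*piz4$income + rnorm(200, sd = 20)

mod6.22c <- lm(pizza ~ age + income + age*income +
                 I((age*age)*income), data = piz4)

# E[pizza] = b0 + b1*age + b2*income + b3*(age*income) + b4*(age^2*income)
# so d E[pizza] / d income = b2 + b3*age + b4*age^2
b    <- coef(mod6.22c)
ages <- c(20, 30, 40, 50, 55)
me_income <- b["income"] + b["age:income"] * ages +
  b["I((age * age) * income)"] * ages^2   # name as deparsed by R
me_income
```

Standard errors would follow from the delta method: with the coefficients ordered as (Intercept), age, income, I(...), age:income, the gradient for the income effect at a given age is (0, 0, 1, age^2, age), and the SE is sqrt(t(g) %*% vcov(mod6.22c) %*% g).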

Related

How can I solve the issue that VIF doesn't work after adding more variables?

I created ordered logit models using polr in RStudio and then ran vif on them. It worked fine and I got my results.
But then I decided to add more variables, such as dummies for age groups, education, and income groups. When I now run vif I get an error about a subscript being out of bounds.
Here is an example of my data set:
work less   lifesatisfied   country   work much
0           8               GB        1
1           8               SE        0
0           9               DK        1
0           9               DE        1
NA          5               NO        NA
continued:
health   education   income   age   marital status
3        3           NA       61    NA
4        2           2        30    NA
1        3           4        39    6
5        7           5        52    4
4        1           5        17    5
gender is a dummy, 1 or 2
age is the respondent's age, e.g. 35, 47
income is scaled 1 to 10
educ (education) is 1 to 7
health is scaled 1 to 5
work less is a dummy, i.e. 1 or 0
work much is a dummy, i.e. 1 or 0
marital status is scaled 1 to 6
lifesatisfied is the dependent variable and is scaled 0 to 10.
My ordered regression model:
myorderedmodel = polr(factor(lifesatisfied, ordered = TRUE) ~ maritalstatus + gender + age + age20_29 + age30_39 + age40_49 + age50_59 + income + lowincome + avgincome + highincome + noeducation + loweducation + higheducation + health + child + religion + workless + workmuch, data = mydata, method = "logistic", Hess = TRUE)
vif(myorderedmodel)
Gives the following error:
Error in R[subs, subs] : subscript out of bounds
I really didn't understand the error. What does it mean? And how can it be solved?

Compute the average intercept and slope for a group from the lmer or lme function?

I was wondering, does anyone know how to print the lmer or lme summary output for a group within a data set in R? For example, if this is what the header of my data (df) looks like:
SubjectID   week   group   weight
1           1      1       12.5
1           2      1       10.6
2           1      3       6.4
2           2      3       6.3
3           1      4       23.5
3           2      4       15.2
And I want to get the specific intercept and slope for the subjects in group 3 only. I would use the lmer function in the code below:
fit.coef <- lmer(weight ~ week*group + (week|SubjectID),
data = df,
control = lmerControl(optimizer = "bobyqa"))
I can get statistics for an individual in the data set or the intercept and slopes across the entire data set but I can't figure out how to calculate these specific values for all the items within a group (e.g. all subjects within group 3). I know this is easy in SAS but I can't figure out any way to do this in R despite googling for hours.
You haven't given us a reproducible example, but I think you're looking for coef()?
It gives the predicted effects for each random-effect term, for each level of the grouping variable: in the example below, the intercept and slope for each subject.
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
coef(fm1)
$Subject
(Intercept) Days
308 253.6637 19.6662617
309 211.0064 1.8476053
310 212.4447 5.0184295
330 275.0957 5.6529356
331 273.6654 7.3973743
332 260.4447 10.1951090
333 268.2456 10.2436499
334 244.1725 11.5418676
335 251.0714 -0.2848792
337 286.2956 19.0955511
349 226.1949 11.6407181
350 238.3351 17.0815038
351 255.9830 7.4520239
352 272.2688 14.0032871
369 254.6806 11.3395008
370 225.7921 15.2897709
371 252.2122 9.4791297
372 263.7197 11.7513080
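A follow-up sketch (my addition, not part of the original answer): if you want one intercept and slope per group rather than per subject, you can average the subject-level rows of coef() within each group. The grouping below is invented for illustration; with the question's data it would come from df$group.

```r
library(lme4)  # assumes lme4 is installed

fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
cc  <- coef(fm1)$Subject   # one row per subject: (Intercept) and Days

# Invented grouping for illustration; with the question's data you would
# look up each subject's group, e.g. df$group[match(rownames(cc), df$SubjectID)]
grp <- rep(1:3, length.out = nrow(cc))

# Average intercept and slope within each group
aggregate(cc, by = list(group = grp), FUN = mean)
```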

How to account for categorical variables in calculating a risk score from regression model?

I have a data set which has a number of variables that I'd like to use to generate a risk score for getting a disease.
I have created a basic version of what I'm trying to do.
The dataset looks like this:
ID DISEASE_STATUS AGE SEX LOCATION
1 1 20 1 FRANCE
2 0 22 1 GERMANY
3 0 24 0 ITALY
4 1 20 1 GERMANY
5 1 20 0 ITALY
So the model I ran was:
glm(disease_status ~ age + sex + location, data=data, family=binomial(link='logit'))
The beta values produced by this model were as follows:
bage = −0.193
bsex = −0.0497
blocation= 1.344
To produce a risk score, I want to multiply each individual's values by the beta values, e.g.:
risk score = (-0.193 * 20 (age)) + (-0.0497 * 1 (sex)) + (1.344 * ??? (location))
However, what value would I use to multiply the beta for location by?
Thank you!
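A sketch of one way to handle this (my addition; the toy data below stand in for the real data set): a categorical variable with k levels enters a glm as k-1 0/1 indicator columns, with one beta per non-reference level, so each individual multiplies a location beta by 1 if they are in that location and 0 otherwise. model.matrix() produces exactly those columns, and the full risk score (the linear predictor) is X %*% beta:

```r
# Toy stand-in for the question's data
data <- data.frame(
  disease_status = c(1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1),
  age      = c(20, 22, 24, 20, 20, 25, 21, 23, 22, 24, 26, 20),
  sex      = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0),
  location = factor(rep(c("FRANCE", "GERMANY", "ITALY"), 4))
)

f <- glm(disease_status ~ age + sex + location,
         data = data, family = binomial(link = "logit"))

X <- model.matrix(f)         # location expanded to 0/1 dummy columns
risk_score <- X %*% coef(f)  # same as predict(f, type = "link")
```

With FRANCE as the reference level, an individual in GERMANY contributes betaGERMANY * 1 (and betaITALY * 0) to their score, while an individual in FRANCE contributes 0 for both dummies.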

Plotting Logistic Regression in R, but I keep getting errors

I'm trying to plot a logistic regression in R, for a continuous independent variable and a dichotomous dependent variable. I have very limited experience with R, but my professor has asked me to add this graph to a paper I'm writing, and he said R would probably be the best way to create it. Anyway, I'm sure there are tons of mistakes here, but this is the sort of thing previously suggested on StackOverflow:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=FALSE)
curve(predict(ggg, data.frame(V1=x), type="response"), add=TRUE)
where vvv is the name of my csv file (31 obs. of 2 variables), V1 is the continuous variable, and V2 is the dichotomous one. Also, ggg (List of 30?) is the following:
ggg<- glm(formula = vvv$V2 ~ vvv$V1, family = "binomial", data = vvv)
The ggplot function produces a graph of my data points, but no logistic regression curve. The curve function results in the following error:
"Error in curve(predict(ggg, data.frame(V1 = x), type = "resp"), add = TRUE) : 'expr' did not evaluate to an object of length 'n'
In addition: Warning message:'newdata' had 101 rows but variables found have 31 rows"
I'm not sure what the problem is, and I'm having trouble finding resources for this specific issue. Can anybody help? It would be greatly appreciated :)
Edit: Thanks to anyone who responded! My data, vvv, is the following, where the percent was the initial probability for presence/absence of a species in a specific area, and the 1 and 0 indicate whether or not the species ended up being observed:
V1 V2
1 95.00% 1
2 95.00% 0
3 95.00% 1
4 92.00% 1
5 92.00% 1
6 92.00% 1
7 92.00% 1
8 92.00% 1
9 92.00% 1
10 92.00% 1
11 85.00% 1
12 85.00% 1
13 85.00% 1
14 85.00% 1
15 85.00% 1
16 80.00% 1
17 80.00% 0
18 77.00% 1
19 77.00% 1
20 75.00% 0
21 70.00% 1
22 70.00% 0
23 70.00% 0
24 70.00% 1
25 70.00% 0
26 69.00% 1
27 65.00% 0
28 60.00% 1
29 50.00% 1
30 35.00% 0
31 25.00% 0
As @MrFlick commented, V1 is probably a factor, so first you have to convert it to numeric. This replaces "%" with nothing and divides by 100, giving you proportions of class numeric:
vvv$V1<-as.numeric(sub("%","",vvv$V1))/100
Doing this, you can use your own code and you will have a plot for a logistic regression:
ggplot(vvv, aes(x = V1, y = V2)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = binomial), se = FALSE)
This should plot not only the points but also the logistic regression curve. (Note: in current ggplot2 releases, family is no longer a direct stat_smooth() argument and must be passed via method.args, as above.) I don't see the point of using curve() here; from what I understand of your question, this is enough for what you need.

How to plot variable's predicted probability based on glm model

I would like to plot each of the variables that are part of the glm model, where the y axis is the predicted probability and the x axis is the variable levels or values.
Here is my code that I tried in order to do it:
The data:
dat <- read.table(text = "target apcalc admit num
0 0 0 21
0 0 1 24
0 1 0 55
0 1 1 72
1 0 0 5
1 0 1 31
1 1 0 11
1 1 1 3", header = TRUE)
The glm model:
f<-glm(target ~ apcalc + admit +num, data = dat,family=binomial(link='logit'))
The loop to present the desired plot:
for(i in 1:length(f$var.names)){
plot(predict(f,i.var.names=i,newdata=dat,type='response'))
}
I got a strange plot as an output ("Index" in the x axis and "predict(f,i.var.names=i,newdata=dat,type='response')" in the y axis. How can I fix my code in order to get the desired result?
(I don't have the reputation yet to post it here.)
Here's how to plot all your variables against the predicted probability:
f<-glm(target ~ apcalc + admit +num, data=dat,family=binomial(link="logit"))
PredProb=predict(f,type='response') #predicting probabilities
par(mfrow=c(2,2))
for(i in names(dat)){
plot(dat[,i],PredProb,xlab=i)
}
On running the f <- glm(.....) part, f$var.names gives NULL as output, so there must be an error there.
f<-glm(target ~ apcalc + admit +num, data=dat,family=binomial("logit"))
f
Call: glm(formula = target ~ apcalc + admit + num, family = binomial("logit"),
data = dat)
Coefficients:
(Intercept) apcalc admit num
2.2690 3.1742 2.4406 -0.1721
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.172 AIC: 13.17
f$var.names
NULL
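As a cross-check on the answer above (a sketch I'm adding, not part of the original answer): the predicted probabilities are just the inverse logit of the linear predictor, so you can verify PredProb by hand with base R:

```r
dat <- read.table(text = "target apcalc admit num
0 0 0 21
0 0 1 24
0 1 0 55
0 1 1 72
1 0 0 5
1 0 1 31
1 1 0 11
1 1 1 3", header = TRUE)

f <- glm(target ~ apcalc + admit + num,
         data = dat, family = binomial("logit"))

eta      <- model.matrix(f) %*% coef(f)  # linear predictor X %*% beta
PredProb <- plogis(as.numeric(eta))      # inverse logit

# matches predict(f, type = "response")
all.equal(PredProb, as.numeric(predict(f, type = "response")))
```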
