Regression in R using poly() function - r

The function poly() in R produces orthogonal polynomial vectors, which can help when interpreting coefficient significance. However, I don't see the point of using it for prediction. In my view, the following two models (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But they don't. Why?

Because they are not the same model. Your second one has one unique covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
Inside a formula, ^ is a formula operator rather than arithmetic exponentiation, so q^2 collapses to q. Wrap the term in I() so that it is evaluated arithmetically and the regression treats it as a separate covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
This gives the same predictions as model_1:
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
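As a quick check of the explanation above, here is a minimal sketch (model_raw, terms_plain and terms_wrapped are names introduced only for this illustration). It shows that the formula parser drops q^2 unless it is wrapped in I(), and that poly(q, 2, raw = TRUE) reproduces the I(q^2) coefficients. The default orthogonal poly() changes the coefficients but not the fitted values.

```r
q <- 1:11
v <- c(3, 5, 7, 9.2, 14, 20, 26, 34, 50, 59, 80)

# Term labels as the formula parser sees them
terms_plain   <- attr(terms(v ~ 1 + q + q^2), "term.labels")     # q^2 collapses to q
terms_wrapped <- attr(terms(v ~ 1 + q + I(q^2)), "term.labels")  # I() protects the square

model_1   <- lm(v ~ poly(q, 2))               # orthogonal polynomial basis
model_2   <- lm(v ~ 1 + q + I(q^2))           # raw polynomial via I()
model_raw <- lm(v ~ poly(q, 2, raw = TRUE))   # raw polynomial via poly()

# Same fitted values, whichever basis is used
same_fit <- isTRUE(all.equal(unname(predict(model_1)), unname(predict(model_2))))

# Raw poly() and I(q^2) also share the same coefficients
same_coef <- isTRUE(all.equal(unname(coef(model_raw)), unname(coef(model_2))))
```

So for prediction the choice of basis does not matter; it only affects how the individual coefficients and their tests read.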

Related

Set intercept to zero when using predict.glm

How do I remove the intercept from the prediction when using predict.glm? I'm not talking about the model itself, just in the prediction.
For example, I want to get the difference and standard error between x=1 and x=3
I tried putting newdata=list(x=2), intercept = NULL when using predict.glm and it doesn't work
So for example:
m <- glm(speed ~ dist, data=cars, family=gaussian(link="identity"))
prediction <- predict.glm(m, newdata=list(dist=c(2)), type="response", se.fit=T, intercept=NULL)
I'm not sure if this is somehow implemented in predict, but you could use the following trick.
Add a manual intercept column (i.e. a vector of 1s) to the data and use it in the model while adding 0 to the RHS of the formula (to remove the "automatic" intercept).
cars$intercept <- 1L
m <- glm(speed ~ 0 + intercept + dist, family=gaussian, data=cars)
This gives us an intercept column in the model.frame, internally used by predict,
model.frame(m)
# speed intercept dist
# 1 4 1 2
# 2 4 1 10
# 3 7 1 4
# 4 7 1 22
# ...
which allows us to set it to an arbitrary value such as zero.
predict.glm(m, newdata=list(dist=2, intercept=0), type="response", se.fit=TRUE)
# $fit
# 1
# 0.3311351
#
# $se.fit
# [1] 0.03498896
#
# $residual.scale
# [1] 3.155753
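Since the model is linear in dist, the difference between the predictions at dist=3 and dist=1 is simply 2 times the slope, which is what the call above returns. A small sketch checking this by hand, assuming the same cars model (cars2, beta and se_beta are names made up for this example):

```r
# Work on a copy so the built-in cars data set is left untouched
cars2 <- cars
cars2$intercept <- 1L

m <- glm(speed ~ 0 + intercept + dist, family = gaussian, data = cars2)

# Prediction with the intercept switched off, at dist = 3 - 1 = 2
p <- predict(m, newdata = data.frame(dist = 2, intercept = 0), se.fit = TRUE)

# The same quantities computed directly from the slope and its variance
beta    <- coef(m)[["dist"]]
se_beta <- sqrt(vcov(m)["dist", "dist"])
manual_fit <- 2 * beta
manual_se  <- 2 * se_beta
```

With a non-identity link this shortcut would only hold on the link scale, so treat it as specific to the gaussian/identity case used here.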

pooled model summary not showing R squared or adjusted R Squared

I'm using imputed data (via r-MICE) to carry out some linear regressions.
eg:
fitimp2 <- with(impdatlong_mids,
lm(nat11 ~ sex + AGE +
I(fasbathroom + fasbedroom + fascomputers +
fasdishwash + fasfamcar + fasholidays)+fatherhome1 +
motherhome1 +talkfather +talkmother + I(famsup+famhelp)+
fmeal))
When I call a summary:
summary(pool(fitimp2))
I don't get the signif codes / asterisks, which isn't a huge deal, just kind of inconvenient, but more importantly, I don't get the R or Adjusted R squared like I would with a regular model summary.
My output looks like:
term estimate
1 (Intercept) 1.560567449
2 sex 0.219087438
3 AGE 0.005548590
4 I(fasbathroom + fasbedroom + fascomputers + fasdishwash + fasfamcar + fasholidays) -0.009028995
5 fatherhome1 -0.055150616
6 motherhome1 0.001564544
7 talkfather 0.115541883
8 talkmother 0.149495541
9 I(famsup + famhelp) -0.006991828
10 fmeal 0.081613347
std.error statistic df p.value
1 0.162643898 9.59499539 1118.93509 0.000000e+00
2 0.024588831 8.91003863 4984.09857 0.000000e+00
3 0.007672715 0.72315871 3456.13665 4.696313e-01
4 0.005495148 -1.64308498 804.41067 1.007561e-01
5 0.030861154 -1.78705617 574.98597 7.445506e-02
6 0.057226626 0.02733944 90.61856 9.782491e-01
7 0.012924577 8.93970310 757.72150 0.000000e+00
8 0.016306200 9.16801814 239.68789 0.000000e+00
9 0.003215294 -2.17455343 1139.07321 2.986886e-02
10 0.011343686 7.19460591 2677.98522 8.095746e-13
Any ideas how to get the Rsquared values to display? Thanks in advance.
Have you tried pool.r.squared()?
For the unadjusted R-squared:
pool.r.squared(fitimp2)
For the adjusted R-squared:
pool.r.squared(fitimp2, adjusted = TRUE)
See p. 51 of https://cran.r-project.org/web/packages/mice/mice.pdf

How do I use vector values as variables in R

I have a data frame called repay, and I have created a vector called variables that holds the names of the variables I am interested in.
variables<-names(repay)[22:36]
I want to write a for loop that does some univariate analysis on each of the variables in variables. For example:
for (i in 1:length(variables))
{
model<-glm(Successful~ variables[i]
,data=repay
,family=binomial(link='logit'))
}
However it doesn't recognize variables[i] as a variable, giving the following error message:
Error in model.frame.default(formula = Successful ~ variables[i], data
= repay, : variable lengths differ (found for 'variables[i]')
Try using the formula function in R. It builds the model formula from a string, so each variable name is interpreted correctly:
for (i in seq_along(variables)) {
  myglm <- glm(formula(paste("Successful", "~", variables[i])),
               data = repay, family = binomial(link = 'logit'))
}
See my post here for more things you can do in this context.
Alternatively, you can use assign, which yields as many model objects as there are variables.
Let us consider
library(data.table)
repay<-data.table(Successful=runif(10),a=sample(10),b=sample(10),c=runif(10))
variables<-names(repay)[2:4]
yielding:
>repay
Successful a b c
1: 0.8457686 7 9 0.2930537
2: 0.4050198 6 6 0.5948573
3: 0.1994583 2 8 0.4198423
4: 0.1471735 1 5 0.5906494
5: 0.7765083 8 10 0.7933327
6: 0.6503692 9 4 0.4262896
7: 0.2449512 4 1 0.7311928
8: 0.6754966 3 3 0.4723299
9: 0.7792951 10 7 0.9101495
10: 0.6281890 5 2 0.9215107
Then you can perform the loop
for (i in 1:length(variables)){
assign(paste0("model",i),eval(parse(text=paste("glm(Successful~",variables[i],",data=repay,family=binomial(link='logit'))"))))
}
resulting in 3 objects: model1,model2 and model3.
>model1
Call: glm(formula = Successful ~ a, family = binomial(link = "logit"),
data = repay)
Coefficients:
(Intercept) a
-0.36770 0.05501
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 5.752
Residual Deviance: 5.69 AIC: 17.66
The same goes for model2, model3, etc.
You could create a language object from a string,
var = "cyl"
lm(as.formula(sprintf("mpg ~ %s", var)), data=mtcars)
# alternative (see also substitute)
lm(bquote(mpg~.(as.name(var))), data=mtcars)
Small workaround that might help
for (i in 22:36)
{
  ivar <- repay[[i]] # extract column i as a plain vector
  repay2 <- data.frame(Successful = repay$Successful, ivar = ivar) # new two-column data frame, used only for this model
  # run the model on the new data frame repay2
  model <- glm(Successful ~ ivar,
               data = repay2,
               family = binomial(link = 'logit'))
}
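Note that the loops above overwrite model on every pass unless assign is used. One way to keep all the fits is to build each formula with reformulate() and collect the models in a named list. This is only a sketch on simulated data: repay_demo and variables_demo are stand-ins invented here, not the asker's real data.

```r
set.seed(1)

# Simulated stand-in for the real `repay` data frame
repay_demo <- data.frame(
  Successful = rbinom(40, 1, 0.5),
  a = rnorm(40),
  b = rnorm(40),
  c = rnorm(40)
)
variables_demo <- names(repay_demo)[2:4]

# One univariate logistic regression per variable name
models <- lapply(variables_demo, function(v) {
  glm(reformulate(v, response = "Successful"),
      data = repay_demo, family = binomial(link = "logit"))
})
names(models) <- variables_demo

# Individual fits are then available by name, e.g. models$b
```

A list is usually easier to work with afterwards than numbered objects, for example sapply(models, AIC) compares all the fits in one call.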

Extract the variance-covariance matrix from a plm fixed effects model

I would like to extract the variance-covariance matrix from a simple plm fixed effects model. For example:
library(plm)
data("Grunfeld")
M1 <- plm(inv ~ lag(inv) + value + capital, index = 'firm',
data = Grunfeld)
The usual vcov function gives me:
vcov(M1)
lag(inv) value capital
lag(inv) 3.561238e-03 -7.461897e-05 -1.064497e-03
value -7.461897e-05 9.005814e-05 -1.806683e-05
capital -1.064497e-03 -1.806683e-05 4.957097e-04
plm's fixef function only gives:
fixef(M1)
1 2 3 4 5 6 7
-286.876375 -97.190009 -209.999074 -53.808241 -59.348086 -34.136422 -34.397967
8 9 10
-65.116699 -54.384488 -6.836448
Any help extracting the variance-covariance matrix that includes the fixed effects would be much appreciated.
Using names sometimes is very useful:
names(M1)
[1] "coefficients" "vcov" "residuals" "df.residual"
[5] "formula" "model" "args" "call"
M1$vcov
lag(inv) value capital
lag(inv) 1.265321e-03 3.484274e-05 -3.395901e-04
value 3.484274e-05 1.336768e-04 -7.463365e-05
capital -3.395901e-04 -7.463365e-05 3.662395e-04
Picking up your example, do the following to get the standard errors (if that is what you are interested in; it is not the whole variance-covariance matrix):
library(plm)
data("Grunfeld")
M1 <- plm(inv ~ lag(inv) + value + capital, index = 'firm',
data = Grunfeld)
fix <- fixef(M1)
fix_se <- attr(fix, "se")
fix_se
1 2 3 4 5 6 7 8 9 10
43.453642 25.948160 20.294977 11.245009 12.472005 9.934159 10.554240 11.083221 10.642589 9.164694
You can also use the summary function for more info:
summary(fix)
Estimate Std. Error t-value Pr(>|t|)
1 -286.8764 43.4536 -6.6019 4.059e-11 ***
2 -97.1900 25.9482 -3.7455 0.0001800 ***
3 -209.9991 20.2950 -10.3473 < 2.2e-16 ***
4 -53.8082 11.2450 -4.7851 1.709e-06 ***
5 -59.3481 12.4720 -4.7585 1.950e-06 ***
6 -34.1364 9.9342 -3.4363 0.0005898 ***
7 -34.3980 10.5542 -3.2592 0.0011174 **
8 -65.1167 11.0832 -5.8753 4.222e-09 ***
9 -54.3845 10.6426 -5.1101 3.220e-07 ***
10 -6.8364 9.1647 -0.7460 0.4556947
Btw, the documentation explains the "se" attribute:
"Value An object of class "fixef". It is a numeric vector containing
the fixed effects with attribute se which contains the standard
errors. [...]"
Note: You might need the latest development version for that because much has improved there about fixef: https://r-forge.r-project.org/R/?group_id=406
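If the full variance-covariance matrix including the fixed effects is what you need, one option outside plm is to fit the equivalent least-squares dummy variable (LSDV) model with lm(), whose slope estimates coincide with the within estimator. The sketch below uses a small simulated panel rather than Grunfeld, and every name in it (panel, id, x, y, lsdv) is made up for the illustration. It also leaves out the lagged regressor, which lm() would not handle as a panel lag.

```r
set.seed(42)

# Simulated balanced panel: 5 units, 8 periods each
panel <- data.frame(id = factor(rep(1:5, each = 8)), x = rnorm(40))
unit_effects <- c(-2, -1, 0, 1, 2)
panel$y <- unit_effects[as.integer(panel$id)] + 0.5 * panel$x + rnorm(40, sd = 0.3)

# LSDV: no global intercept, one dummy per unit
lsdv <- lm(y ~ 0 + x + id, data = panel)

# Full variance-covariance matrix: slope plus all unit effects
full_vcov <- vcov(lsdv)

# Standard errors of the unit effects (everything except the slope)
fe_se <- sqrt(diag(full_vcov))[-1]
```

For panels with many units the dummy approach gets expensive, so this is mostly useful for small examples or for cross-checking the values from fixef().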

User-defined function with lapply function

I'm attempting to establish a user-defined function that inputs predetermined variables (independent and dependent) from the active data frame. Let's take the example data frame df below looking at a coin toss outcome as a result of other recorded variables:
> df
outcome toss person hand age
1 H 1 Mary Left 18
2 T 2 Allen Left 12
3 T 3 Dom Left 25
4 T 4 Francesca Left 42
5 H 5 Mary Right 18
6 H 6 Allen Right 12
7 H 7 Dom Right 25
8 T 8 Francesca Right 42
The df data frame has a binomial response outcome, being either heads or tails, and I am going to look at how person, hand, and age might affect this categorical outcome. I plan to use a forward-selection approach which will test one variable against toss and then progress to add more.
To keep things simple, I want to be able to identify the response/dependent (e.g., outcome) and predictor/independent (e.g., person, hand) variables before my user-defined function, as such:
> independent<-c('person','hand','age')
> dependent<-'outcome'
Then create my function using the lapply and glm functions:
> test.func<-function(some_data,the_response,the_predictors)
+ {
+ lapply(the_predictors,function(a)
+ {
+ glm(substitute(as.name(the_response)~i,list(i=as.name(a))),data=some_data,family=binomial)
+ })
+ }
Yet, when I attempt to run the function with the predetermined vectors, this occurs:
> test.func(df,dependent,independent)
Error in as.name(the_response) : object 'the_response' not found
My expected response would be the following:
models<-lapply(independent,function(x)
+ {
+ glm(substitute(outcome~i,list(i=as.name(x))),data=df,family=binomial)
+ })
> models
[[1]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
Coefficients:
(Intercept) personDom personFrancesca personMary
1.489e-16 -1.799e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
**End Snippet**
As you can tell, using lapply and glm, I have created 3 simple models without all of the extra work doing it individually. You may be asking why create a user-defined function when you have simple code right there? I plan to run a while or repeat loop and it will decrease clutter.
Thank you for your assistance
I know code-only answers are deprecated, but I thought you were almost there and could just use a nudge to use the formula function (and to include the_response in the substitution):
test.func<-function(some_data,the_response,the_predictors)
{
lapply(the_predictors,function(a)
{print( form<- formula(substitute(resp~i,
list(resp=as.name(the_response), i=as.name(a)))))
glm(form, data=some_data,family=binomial)
})
}
Test:
> test.func(df,dependent,independent)
outcome ~ person
<environment: 0x7f91a1ba5588>
outcome ~ hand
<environment: 0x7f91a2b38098>
outcome ~ age
<environment: 0x7f91a3fad468>
[[1]]
Call: glm(formula = form, family = binomial, data = some_data)
Coefficients:
(Intercept) personDom personFrancesca personMary
8.996e-17 -1.540e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = form, family = binomial, data = some_data)
#snipped
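The reason the original attempt failed is that substitute() only replaces symbols that appear in its list. It does not evaluate calls such as as.name(the_response), so that call stays inside the formula verbatim and is only looked up later, where the_response is no longer visible. A minimal sketch of the difference (bad and good are names made up for this illustration):

```r
the_response <- "outcome"
a <- "person"

# as.name(the_response) is left untouched because it is a call, not a listed symbol
bad <- substitute(as.name(the_response) ~ i, list(i = as.name(a)))

# Both sides are placeholders that substitute() can replace
good <- substitute(resp ~ i, list(resp = as.name(the_response), i = as.name(a)))

bad_text  <- deparse(bad)   # "as.name(the_response) ~ person"
good_text <- deparse(good)  # "outcome ~ person"

# Converting the substituted call into a real formula object
form <- formula(good)
```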

Resources