What is the function of I()? - r

I found the following code on the internet
mod1 <- lm(mpg ~ weight + I(weight^2) + foreign, auto)
What is the function I()? It seems that the result of weight^2 is same as I(weight^2).

The function of I() is to isolate terms in formulae from the usual formula parsing & syntax. There are other uses of I() in data frames where it helps create objects that have or inherit from class "AsIs" which allows the embedding of objects without the usual conversion that takes place.
In the formula case, as that is what you specifically ask about, ^ is a special formula operator which indicates crossing of the terms to the nth degree where n is given following the operator like so: ^n. As such ^ does not have it's usual arithmetic interpretation in a formula. (Likewise, the -, +, / and * operators also have special formula meanings and as a result I() is required to use them to isolate them from the formula parsing tools.)
In the specific example you give (which I'll illustrate using an in-built data set trees), if you forget to use I() round a the quadratic term, R will, in this case, ignore that term completely as Volume (weight in your example) is already in the model and you are asking for multi-way interactions of the variable with itself, which is not a quadratic term.
First without I():
> lm(Height ~ Volume + Volume^2, data = trees)
Call:
lm(formula = Height ~ Volume + Volume^2, data = trees)
Coefficients:
(Intercept) Volume
69.0034 0.2319
Notice how only the Volume term is in the formula? The correct specification for a quadratic model (actually it may not be, see below) is
> lm(Height ~ Volume + I(Volume^2), data = trees)
Call:
lm(formula = Height ~ Volume + I(Volume^2), data = trees)
Coefficients:
(Intercept) Volume I(Volume^2)
65.33587 0.47540 -0.00314
I said it may not be correct; this is due to correlation between Volume and Volume^2. An identical but more stable fit can be achieved by the use of orthogonal polynomials, whichpoly()` can produce for you. So the more stable speficiation would be:
> lm(Height ~ poly(Volume, 2), data = trees)
Call:
lm(formula = Height ~ poly(Volume, 2), data = trees)
Coefficients:
(Intercept) poly(Volume, 2)1 poly(Volume, 2)2
76.000 20.879 -5.278
Note that the fit is identical to the earlier model though with different coefficient estimates as the input data are different (orthogonal polynomials vs raw polynomials). You can see this by their summary() output if you don't believe me:
> summary(lm(Height ~ poly(Volume, 2), data = trees))
Call:
lm(formula = Height ~ poly(Volume, 2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-11.2266 -3.6728 -0.0745 2.4073 9.9954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.0000 0.9322 81.531 < 2e-16 ***
poly(Volume, 2)1 20.8788 5.1900 4.023 0.000395 ***
poly(Volume, 2)2 -5.2780 5.1900 -1.017 0.317880
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.19 on 28 degrees of freedom
Multiple R-squared: 0.3808, Adjusted R-squared: 0.3365
F-statistic: 8.609 on 2 and 28 DF, p-value: 0.001219
> summary(lm(Height ~ Volume + I(Volume^2), data = trees))
Call:
lm(formula = Height ~ Volume + I(Volume^2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-11.2266 -3.6728 -0.0745 2.4073 9.9954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.335867 4.110886 15.893 1.52e-15 ***
Volume 0.475398 0.246279 1.930 0.0638 .
I(Volume^2) -0.003140 0.003087 -1.017 0.3179
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.19 on 28 degrees of freedom
Multiple R-squared: 0.3808, Adjusted R-squared: 0.3365
F-statistic: 8.609 on 2 and 28 DF, p-value: 0.001219
Note the difference in the t-tests for the linear and quadratic terms in the models. This is where the orthogonality of the input polynomial terms helps.
To really see what ^ does in a formula (in case the terminology in ?formula is not something you are familiar with, consider this model:
> lm(Height ~ (Girth + Volume)^2, data = trees)
Call:
lm(formula = Height ~ (Girth + Volume)^2, data = trees)
Coefficients:
(Intercept) Girth Volume Girth:Volume
75.40148 -2.29632 1.86095 -0.05608
As there are two terms in (...)^2, the formula parsing code converts that into the main effects of the two variables plus their 2nd-order interaction. That model could be more succinctly written as Height ~ Girth * Volume, but ^ helps when you want higher-order interactions or interactions among a larger number of variables.

Related

How to fit a known linear equation to my data in R?

I used a linear model to obtain the best fit to my data, lm() function.
From literature I know that the optimal fit would be a linear regression with the slope = 1 and the intercept = 0. I would like to see how good this equation (y=x) fits my data? How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how good the equation y=x fits the data presented above (R^2 and p-value)?
I am very grateful if somebody can help me with this (basic) problem, as I found no answers to my question on stackoverflow?
Best regards Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and the intercept = 0, essentially you have a fit and your modelled is already the predicted value. You need the r-square from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS − RSS)/TSS
See this link for definition of RSS and TSS.
We can only work with observations that are complete (non NA). So we calculate each of them:
TSS = nonNA = !is.na(modelled) & !is.na(measured)
# residuals from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2,na.rm=T)
# total residuals from data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2,na.rm=T)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled but the formula is different if there is or is not an intercept in the undisclosed model. The signs are that there is no intercept since if there were an intercept sum(modelled - measured, an.rm = TRUE) should be 0 but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model is the following then using the builtin CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we have the this output where the last two lines show R squared and p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

plm vs lm - different results?

I tried several times to use lm and plm to do a regression. And I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Further I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure if + factor (Region) is necessary, however, if it is not there, I don't see the coefficients (and significance) for the dummy.
So, my question is:
I am using the plm function wrong? (or what is wrong about it)
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate.
Results from LM:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
data = Panel)
Residuals:
Min 1Q Median 3Q Max
-31.141 -4.856 -0.642 1.262 192.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3488 4.9134 3.531 0.000558 ***
Policychanges 0.6412 0.1215 5.277 4.77e-07 ***
factor(Region)Asia -19.3377 6.7804 -2.852 0.004989 **
factor(Region)C America + Carib 0.1147 6.8049 0.017 0.986578
factor(Region)Eurasia -17.6476 6.8294 -2.584 0.010767 *
factor(Region)Europe -20.7759 8.8993 -2.335 0.020959 *
factor(Region)Middle East -17.3348 6.8285 -2.539 0.012200 *
factor(Region)N America -17.5932 6.8064 -2.585 0.010745 *
factor(Region)Oceania -14.0440 6.8417 -2.053 0.041925 *
factor(Region)S America -14.3580 6.7781 -2.118 0.035878 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared: 0.3455, Adjusted R-squared: 0.3043
F-statistic: 8.386 on 9 and 143 DF, p-value: 5.444e-10`
Results from PLM:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
model = "within", index = c("Region", "Year"))
Balanced Panel: n = 9, T = 17, N = 153
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-31.14147 -4.85551 -0.64177 1.26236 192.80277
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Policychanges 0.64118 0.12150 5.277 4.769e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 66459
Residual Sum of Squares: 55627
R-Squared: 0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07`
You would need to leave out + factor(Region) in your formula for the within model with plm to get what you want.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept on you estimated model. The help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest to check if the within specification seems worthwhile.
The R squareds diverge between the lm model and the plm model. This is due to the lm model (if used like this with the dummies, it is usually called the LSDV model (least squares dummy variables)) gives what is sometimes called the overall R squared while plm will give you the R squared of the demeaned regression, sometimes called the within R squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf

Likelihood ratio test: 'models were not all fitted to the same size of dataset'

I'm an absolute R beginner and need some help with my likelihood ratio tests for my univariate analyses. Here's the code:
#Univariate analysis for conscientiousness (categorical)
fit <- glm(BCS_Bin~Conscientiousness_cat,data=dat,family=binomial)
summary(fit)
#Likelihood ratio test
fit0<-glm(BCS_Bin~1, data=dat, family=binomial)
summary(fit0)
lrtest(fit, fit0)
The results are:
Call:
glm(formula = BCS_Bin ~ Conscientiousness_cat, family = binomial,
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8847 -0.8847 -0.8439 1.5016 1.5527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.84933 0.03461 -24.541 <2e-16 ***
Conscientiousness_catLow 0.11321 0.05526 2.049 0.0405 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7962.1 on 6439 degrees of freedom
Residual deviance: 7957.9 on 6438 degrees of freedom
(1963 observations deleted due to missingness)
AIC: 7961.9
Number of Fisher Scoring iterations: 4
And:
Call:
glm(formula = BCS_Bin ~ 1, family = binomial, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8524 -0.8524 -0.8524 1.5419 1.5419
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.82535 0.02379 -34.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10251 on 8337 degrees of freedom
Residual deviance: 10251 on 8337 degrees of freedom
(65 observations deleted due to missingness)
AIC: 10253
Number of Fisher Scoring iterations: 4
For my LRT:
Error in lrtest.default(fit, fit0) :
models were not all fitted to the same size of dataset
I understand that this is happening because there's different numbers of observations missing? That's because it is data from a large questionnaire, and many more drop outs had occurred by the question assessing my predictor variable (conscientiousness) when compared with the outcome variable (body condition score/BCS). So I just have more data for BCS than conscientiousness, for example (it's producing the same error for many of my other variables too).
In order to run the likelihood ratio test, the model with just the intercept has to be fit to the same observations as the model that includes Conscientiousness_cat. So, you need the subset of the data that has no missing values for Conscientiousness_cat:
BCS_bin_subset = BCS_bin[complete.cases(BCS_bin[,"Conscientiousness_cat"]), ]
You can run both models on this subset of the data and the likelihood ratio test should run without error.
In your case, you could also do:
BCS_bin_subset = BCS_bin[!is.na(BCS_bin$Conscientiousness_cat), ]
However, it's nice to have complete.cases handy when you want a subset of a data frame with no missing values across multiple variables.
Another option that is more convenient if you're going to run multiple models, but that is also more complex is to first fit whatever model uses the largest number of variables from BCS_bin (since that model will exclude the largest number of observations due to missingness) and then use the update function to update that model to models with fewer variables. We just need to make sure that update uses the same observations each time, which we do using a wrapper function defined below. Here's an example using the built-in mtcars data frame:
library(lmtest)
dat = mtcars
# Create some missing values in mtcars
dat[1, "wt"] = NA
dat[5, "cyl"] = NA
dat[7, "hp"] = NA
# Wrapper function to ensure the same observations are used for each
# updated model as were used in the first model
# From https://stackoverflow.com/a/37341927/496488
update_nested <- function(object, formula., ..., evaluate = TRUE){
update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
m1 = lm(mpg ~ wt + cyl + hp, data=dat)
m2 = update_nested(m1, . ~ . - wt) # Remove wt
m3 = update_nested(m1, . ~ . - cyl) # Remove cyl
m4 = update_nested(m1, . ~ . - wt - cyl) # Remove wt and cyl
m5 = update_nested(m1, . ~ . - wt - cyl - hp) # Remove all three variables (i.e., model with intercept only)
lrtest(m5,m4,m3,m2,m1)

R - Regression Analysis for Logarthmic

I perform regression analysis and try to find the best fit model for the dataset diamonds.csv in ggplot2. I use price(response variable) vs carat and I perform linear regression, quadratic, and cubic regression. The line is not the best fit. I realize the logarithmic from excel has the best fitting line. However, I couldn't figure out how to code in R to find the logarithmic fitting line. Anyone can help?
Comparing Price vs Carat
model<-lm(price~carat, data = diamonds)
Model 2 uses the polynomial to compare
model2<-lm(price~carat + I(carat^2), data = diamonds)
use cubic in model3
model3 <- lm(price~carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the log in R to get same result as excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!
The result you report from excel y = 0.4299ln(x) - 2.5495 does not contain any polynomial or cubic terms. What are you trying to do? price is very skewed and as with say 'income' it is common practice to take the log from that. This also provides the R2 you are referring to, but very different coefficients for the intercept and carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16

Coding difference between linear and cubic model

I have two variables ENERGY and TEMP
I have created two other variables temp2 and temp 3
> temp2 <- data$temp^2
> temp3 <- data$temp^3
>data=cbind(data, energy, temp,temp2,temp3)
Now to create a cubic model would it look just like a linear model?
>model<-lm(energy~temp+temp2+temp3)
Edit:
Ok so I did what you suggested and this is the output:
> ?poly
> model<- lm( energy ~ poly(temp, 3) , data=data )
> summary(model)
Call:
lm(formula = energy ~ poly(temp, 3), data = data)
Residuals:
Min 1Q Median 3Q Max
-19.159 -11.257 -2.377 9.784 26.841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.50 3.21 29.752 < 2e-16 ***
poly(temp, 3)1 207.90 15.72 13.221 2.41e-11 ***
poly(temp, 3)2 -50.07 15.72 -3.184 0.00466 **
poly(temp, 3)3 81.59 15.72 5.188 4.47e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.73 on 20 degrees of freedom
Multiple R-squared: 0.9137, Adjusted R-squared: 0.9008
F-statistic: 70.62 on 3 and 20 DF, p-value: 8.105e-11
I would assume that I would test for the goodness of fit test the same way and look at the Pr(>|t|). This would lead me to believe that all of the variables are significant.
would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?
Instead of coding up dummy variable you should consider using the poly function:
?poly # Polynomial contrasts
model<- lm( energy ~ poly(temp, 3) , data=data )
If you want to use the same columns as you would have gotten with the dummies approach (which is not good for statistical inference purposes), you can use the 'raw' parameter:
model.r<- lm( energy ~ poly(temp, 3, raw=TRUE) , data=data )
Predictions will be the same, but the standard errors will not. This should give you the same estimates as would be returned by #RomanLuštrik's suggestion. The terms will not be orthogonal, so their necessary correlations will be high and you will be unable to make correct inferences about independent effects.
Added question: "would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?"
No. You would need to specify a particular two temperatures and then predict could give you a difference, but that difference will vary depending on what the reference point is, even if the magnitude of the difference is the same.. That was a consequence of using a non-linear term. Maybe you should describe your goals and use a forum that is more geared to methods questions. SO is for coding when you know what you want to do. http://stats.stackexchange.com may be more appropriate when you have formulated your question with more clarity.
There are two ways to do polynomial regression with lm:
lm( y ~ x + I(x^2) + I(x^3) )
and
lm( y ~ poly(x, 3, raw=TRUE) )
(That's cubic. I'm sure you can generalise to quartic, quintic, etc.)

Resources