How does predict deal with models that include the AsIs function in R?

I have the model lm(y ~ x + I(log(x))) and I would like to use predict to get predictions for a new data frame containing new values of x, based on my model. How does predict deal with the AsIs function I() in the model? Does I(log(x)) need to be supplied separately in the newdata argument of predict, or does predict understand that it should construct and use I(log(x)) from x?
UPDATE
@DWin: The way the variables enter the model affects the coefficients, especially for interactions. My example is simplistic, but try this out:
x<-rep(seq(0,100,by=1),10)
y<-15+2*rnorm(1010,10,4)*x+2*rnorm(1010,10,4)*x^(1/2)+rnorm(1010,20,100)
z<-x^2
plot(x,y)
lm1<-lm(y~x*I(x^2))
lm2<-lm(y~x*x^2)
lm3<-lm(y~x*z)
summary(lm1)
summary(lm2)
summary(lm3)
You see that lm1 and lm3 are the same, but lm2 is something different (only one slope coefficient). Assuming you don't want to create the dummy variable z (computationally inefficient for large datasets), the only way to build an interaction model like lm3 is with I(). Again, this is a very simplistic example (that may make no statistical sense), but it makes the point for complicated models.
@Ben Bolker: I would like to avoid guessing and ask for an authoritative answer (I can't directly check this with my models since they are much more complicated than the example). My guess is that predict correctly constructs I(log(x)) from x.

You do not need to make your variable names look like the term I(x). Just use "x" in the newdata argument.
The reason lm(y~x*I(x^2)) and lm(y~x*x^2) are different is that "^" and "*" are operators with special meanings in R formulas. That is not the case with the log function, and it is also not true that interactions can only be constructed with I(). If you want a second-degree polynomial in R, you should use poly(x, 2). Whether you build the model with I(log(x)) or with plain log(x), you get the same fit, and either way predict will compute the transformed predictor properly if you use:
newdata = data.frame( x = seq( min(x), max(x), length = 10) )
Using poly() will also protect you from the incorrect inferences that are so commonly caused by the use of I(x^2).
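As a quick sanity check (a minimal sketch with made-up data, not the asker's actual model), you can verify that predict() re-evaluates I(log(x)) from the x column of newdata:
set.seed(1)
x <- runif(50, 1, 10)
y <- 3 + 2 * x + 5 * log(x) + rnorm(50)
fit <- lm(y ~ x + I(log(x)))
nd  <- data.frame(x = seq(min(x), max(x), length.out = 10))
predict(fit, newdata = nd)   # no log(x) or I(log(x)) column is required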

Related

Prediction at a new value using lowess function in R

I am using the lowess function to fit a regression between two variables x and y. Now I want to know the fitted value at a new value of x. For example, how do I find the fitted value at x = 2.5 in the following example? I know loess can do that, but I want to reproduce someone's plot, and he used lowess.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
plot(x, y)
lines(fit)
Local regression (lowess) is a non-parametric statistical method; it's not like linear regression, where you can use the fitted model directly to estimate new values.
You'll need to take the fitted values from the function (that's why it only returns a list to you), choose your own interpolation scheme, and use that scheme to predict your new points.
A common technique is spline interpolation (but there are others):
https://www.r-bloggers.com/interpolation-and-smoothing-functions-in-base-r/
EDIT: I'm pretty sure the predict function does the interpolation for you. However, I can't find any information about what exactly predict uses, so I've tried to trace the source code.
https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/stats/R/loess.R
else { ## interpolate
## need to eliminate points outside original range - not in pred_
I'm sure the R code calls the underlying C implementation, but it's not well documented so I don't know what algorithm it uses.
My suggestion is: either trust the predict function or roll your own interpolation algorithm.
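For example, a minimal sketch reusing the toy data from above: base R's approx() (linear) and spline() functions can interpolate the lowess output at a new x value.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
# interpolate the smoothed curve at x = 2.5
approx(fit$x, fit$y, xout = 2.5)$y   # linear interpolation
spline(fit$x, fit$y, xout = 2.5)$y   # spline interpolation, usually very close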

Predict y value for a given x in R

I have a linear model:
mod=lm(weight~age, data=f2)
I would like to input an age value and have returned the corresponding weight from this model. This is probably simple, but I have not found a simple way to do this.
It's usually more robust to use the predict method of lm:
f2<-data.frame(age=c(10,20,30),weight=c(100,200,300))
f3<-data.frame(age=c(15,25))
mod<-lm(weight~age,data=f2)
pred3<-predict(mod,f3)
This spares you from wrangling with all of the coefficients when models are potentially large.
If you only need a single prediction, you can grab the coefficients with
coef(mod)
Or you can build the prediction by hand, where your_value is the age at which you want a prediction:
coef(mod)[1] + your_value*coef(mod)[2]

Weighting the inverse of the variance in linear mixed model

I have a linear mixed model which is run 50 different times in a loop.
Each time the model is run, I want the response variable b to be weighted inversely with the variance. So if the variance of b is small, I want the weighting to be bigger and vice versa. This is a simplified version of the model:
model <- lme(b ~ type, random = ~ 1 | replicate, weights = ~ I(1/b))
Here's the R data files:
b: https://www.dropbox.com/s/ziipdtsih5f0252/b.Rdata?dl=0
type: https://www.dropbox.com/s/90682ewib1lw06e/type.Rdata?dl=0
replicate: https://www.dropbox.com/s/kvrtao5i2g4v3ik/replicate.Rdata?dl=0
I'm trying to do this using the weights option in lme. Right now I have this as:
weights = ~ I(1/b).
But I don't think this is correct... maybe weights = ~ I(1/var(b))?
I also want to adjust this slightly as b consists of two types of data specified in the factor variable (of 2 levels) type.
I want to inversely weight the variance of each of these two levels separately. How could I do this?
I'm not sure it makes sense to talk about weighting the response variable in this manner. The descriptions I have found on the R-SIG-mixed-models mailing list refer to inverse weighting derived from the predictor variables, either the fixed effects or the random effects. The weighting is used when minimizing the deviations of the model fit from the response. There is a function, varFixed (one of the varFunc family of variance-function constructors), that specifies a fixed variance structure, and it has a help page (linked from the weights section of the ?gls page):
?varFixed
?varFunc
It requires a formula object as its argument. So my original guess was:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~type) )
Which you proved incorrect. How about seeing if this works:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~1| type) )
(My continuing guess is that this weighting is the default situation and specifying these particular weights may not be needed. The inverse nature of the weighting is implied and does not need to be explicitly stated with "1/type". In the case of mixed models the "correct" construction depends on the design and the prior science, and none of that has been presented, so this is really only a syntactic comment and not an endorsement of this model. I did not download the files. It seems rather odd to have three separate files and no code for linking them into a data frame; generally one would want a single data object whose column names are used in the formulas of the regression function. I also suspect this is the default behavior of this function, so my untested prediction is that you would see no change if you omitted the 'weights' argument entirely.)
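If it helps, here is a minimal untested sketch (I did not download the files, so the object names b, type and replicate are assumed): combine the three objects into one data frame and, if the goal is a separate variance for each level of type, varIdent() is the nlme variance structure usually used for that, offered here as an alternative to the varFixed() guesses above.
library(nlme)
dat <- data.frame(b = b, type = type, replicate = replicate)
# one variance parameter per level of 'type'; the implied weights are the inverse variances
model <- lme(b ~ type, random = ~ 1 | replicate,
             weights = varIdent(form = ~ 1 | type), data = dat)
summary(model)   # the estimated variance ratio appears under "Variance function"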

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and am trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, which fits a cumulative link model with, to be precise, a probit link.
I tried the function pR2 in the pscl package to get the pseudo R-squared, with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2 measures. I don't like to use any of them because I do not see the results as having a meaning in the real world. They do not estimate effect sizes of any sort, and they are not particularly good for statistical inference. Furthermore, in situations like this with multiple observations per entity, I think it is debatable which value for "n" (the number of subjects) or which degrees of freedom is appropriate. Some people use McFadden's R^2, which is relatively easy to calculate, since clm returns a list with one of its elements named "logLik". You just need to know that the log-likelihood is only a multiplicative constant (-2) away from the deviance. If one had the model in the first example:
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
I recommend using the nagelkerke function from the rcompanion package to get pseudo R-squared values.
When your predictor or outcome variables are categorical or ordinal, the R-squared will typically be lower than with truly numeric data. R-squared is merely a very weak indicator of a model's fit, and you shouldn't choose a model based on it alone.
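A minimal sketch of that suggestion, reusing the wine model from the previous answer (assuming nagelkerke() accepts clm fits; I have not run this against your own data):
library(ordinal)
library(rcompanion)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
nagelkerke(fm1)   # reports McFadden, Cox and Snell, and Nagelkerke pseudo R-squared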

Re-transform a linear model. Case study with R

Let's say I have a response variable which is not normally distributed and an explanatory variable. Let's create these two variables first (coded in R):
set.seed(12)
resp = (rnorm(120)+20)^3.79
expl = rep(c(1,2,3,4),30)
I run a linear model and I realize that the residuals are not normally distributed. (I know that running a Shapiro test might not be enough to justify claiming that the residuals are not normally distributed, but that is not the point of my question.)
m1=lm(resp~expl)
shapiro.test(residuals(m1))
0.01794
Therefore I want to transform my response variable (looking for a transformation with, for example, a Box-Cox):
m2=lm(resp^(1/3.79)~expl)
shapiro.test(residuals(m2))
0.4945
Ok, now my residuals are normally distributed, fine! I now want to make a graphical representation of my data and my model. But I do not want to plot my response variable in its transformed form, because I would lose much of its intuitive meaning. Therefore I do:
plot(x=expl,y=resp)
What if I now want to add the model? I could do this
abline(m2) # m2 is the model with transformed variable
but of course the line does not fit the data represented. I could do this:
abline(m1) # m1 is the model with the original variable.
but it is not the model I ran for the statistics! How can I re-transform the line predicted by m2 so that it fits the data?
plotexpl <- seq(1,4,length.out=10)
predresp <- predict(m2,newdata=list(expl=plotexpl))
lines(plotexpl, predresp^(3.79))
I won't discuss the statistical issues here (e.g. a non-significant test does not mean that H0 is true, and with these simulated data your model is no better than just using the mean).
Since you've mentioned that the transformation might be based on the Box-Cox formula,
I would like to point out an issue you might want to consider.
According to the Box-Cox transformation formula in the paper Box, George E. P.; Cox, D. R. (1964), "An analysis of transformations", your transformation implementation (in case it is meant to be a Box-Cox one) might need to be slightly edited. The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (Actually, y^(lambda) is the Tukey transformation, which is a distinct transformation formula.)
So, with lambda = 1/3.79 (the power used above), the code should be:
lambda <- 1/3.79
m2 <- lm((resp^lambda - 1)/lambda ~ expl)
shapiro.test(residuals(m2))
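As an aside (a hedged sketch, assuming you are willing to use the MASS package), you can also profile lambda from the data with boxcox() instead of fixing it by hand:
library(MASS)
m1 <- lm(resp ~ expl)
bc <- boxcox(m1, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
(lambda_hat <- bc$x[which.max(bc$y)])   # lambda that maximizes the profile log-likelihood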
More information
Correct implementation of Box-Cox transformation formula by boxcox() in R:
https://www.r-bloggers.com/on-box-cox-transform-in-regression-models/
A great comparison between Box-Cox transformation and Tukey transformation. http://onlinestatbook.com/2/transformations/box-cox.html
One could also find the Box-Cox transformation formula on Wikipedia:
en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation
Please correct me if I misunderstood your implementation.
