Predict y value for a given x in R

I have a linear model:
mod=lm(weight~age, data=f2)
I would like to input an age value and have returned the corresponding weight from this model. This is probably simple, but I have not found a simple way to do this.

It's usually more robust to use the predict method of lm:
f2<-data.frame(age=c(10,20,30),weight=c(100,200,300))
f3<-data.frame(age=c(15,25))
mod<-lm(weight~age,data=f2)
pred3<-predict(mod,f3)
This spares you from wrangling with all of the coefficients, which matters when models are potentially large.

If you only need a single prediction, you can grab the coefficients with
coef(mod)
and build the equation yourself, for example:
coef(mod)[1] + your_value*coef(mod)[2]  # where your_value is the new age
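For example, with the toy f2 data above, both routes give the same answer for a new age of 15 (a minimal sketch; 15 is just an illustrative value):
new_age <- 15
predict(mod, newdata=data.frame(age=new_age))   # via predict
coef(mod)[1] + new_age*coef(mod)[2]             # via the coefficients directly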

Related

Estimation of a state-space model with lags in the measurement equation in R

I'm trying to estimate an SS model from this paper that has the following form:
Setting the order of the first lag polynomial to zero and the second one to one, we can reformulate it using terms from the MARSS package guide when applicable (x is the state, y is the observed variable, d is exogenous):
The MARSS package allows estimation of a simpler model that doesn't include lagged variables in the measurement equation. Is there a way to estimate this one using MARSS or any other package without rewriting the estimation routine for this special case? Maybe there is a way to reformulate it so it could be "fed" to MARSS or some other package?
Take a look at how, say, the BSM structural time series model or an ARMA model is formulated as a MARSS model, i.e. a multivariate state-space model. That'll give you an idea of how to rewrite your model in multivariate state-space form.
Basically, your x equation will look like (reading B and Q off the model list below):
x_t = B x_{t-1} + w_t, with B = [b 0; 1 0] and Q = [q 0; 0 0]
so x_1(t) = b*x_1(t-1) + w(t) and x_2(t) = x_1(t-1). See how x_2 is just a dummy that is forced to be x_1(t-1)?
Now the y equation:
y_t = z*x_1(t) + c*x_2(t) + d*d_t + a*y_{t-1} + v_t
The d and a are your D and A. I wrote them in lower case to indicate that they are scalars, but they can be matrices in general (if y is multivariate, say). Your inputs are d_t and y_{t-1}; you prepare that 2-row input matrix (one column per time step) and pass it in as d.
Be careful with your initial condition specification. It is probably best/easiest to set it at t=1 and estimate it, or use a diffuse prior.
You can fit this model with MARSS. You can also fit it with any Kalman filter function that lets you pass inputs into the y equation (some do, some don't); for example, the KFAS package allows this if you build the model with SSModel() using the SSMcustom() component and run KFS() on it.
In MARSS the model list will look like so:
mod.list=list(
  B=matrix(list("b",1,0,0),2,2),   # state transition: x1(t)=b*x1(t-1), x2(t)=x1(t-1)
  U=matrix(0,2,1),                 # no state intercept
  Q=matrix(list("q",0,0,0),2,2),   # process error on x1 only
  Z=matrix(c("z", "c"),1,2),       # maps the two states to y
  A=matrix(0),                     # no observation intercept
  R=matrix("r"),                   # observation error variance
  D=matrix(c("d", "a"),1,2),       # coefficients on the inputs d_t and y_{t-1}
  x0=matrix(c("x1","x2"),2,1),     # initial states (estimated)
  tinitx=1,                        # initial state is defined at t=1
  d=rbind(dt[2:TT],y[1:(TT-1)])    # 2 x (T-1) input matrix: exogenous d_t and lagged y
)
dat <- y[2:TT] # since you need y_{t-1} in the d (inputs)
fit <- MARSS(dat, model=mod.list)
It'll probably complain that it wants initial conditions for x0. Anything will work; the EM algorithm isn't sensitive to that the way a BFGS or Newton algorithm is. But method="BFGS" is actually often better for this type of structural time-series model, and in that case pick a reasonable initial condition for x (reasonable = close to your data, in this case, I think).
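For example, a minimal sketch of that BFGS variant (the particular starting value here is just an illustrative assumption):
fit.bfgs <- MARSS(dat, model=mod.list, method="BFGS",
                  inits=list(x0=matrix(dat[1], 2, 1)))  # start x0 near the first observation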

Command for finding the best linear model in R

Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.
There are numerous ways this could be achieved, but for a simple approach I would suggest having a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, here is a very simple example of model selection, as given on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
This should get you started, though you should read my comment from above. The code below builds a model for every combination of the variables in your dataset and then compares all of the models with AIC and BIC.
# create an empty list called model so we have something to add our fitted models to
model = list()
# create a vector of the data frame column names used to build the formula
vars = names(data)
# remove variable names you don't want to use, at least
# the response variable (here assumed to be in the first column and called y)
vars = vars[-1]
# combn generates every combination of i variables; the loop then fits a glm to each
for(i in 1:length(vars)){
  xx = combn(vars, i)
  if(is.null(dim(xx))){
    fla = paste("y ~", paste(xx, collapse="+"))
    model[[length(model)+1]] = glm(as.formula(fla), data=data)
  } else {
    for(j in 1:dim(xx)[2]){
      fla = paste("y ~", paste(xx[1:dim(xx)[1], j], collapse="+"))
      model[[length(model)+1]] = glm(as.formula(fla), data=data)
    }
  }
}
# see how many models were built using the loop above
length(model)
# create a vector to extract AIC and BIC values from the model variable
AICs = NULL
BICs = NULL
for(i in 1:length(model)){
AICs[i] = AIC(model[[i]])
BICs[i] = BIC(model[[i]])
}
#see which models were chosen as best by both methods
which(AICs==min(AICs))
which(BICs==min(BICs))
I ended up running forwards, backwards, and stepwise procedures on data to select models and then comparing them based on AIC, BIC, and adj. R-sq. This method seemed most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to only model each explanatory variable against the response, so I was able to just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't worried about how they interacted with each other.
This is a very old question, but for those who are still encountering this discussion: the olsrr package, and specifically the function ols_step_all_possible, exhaustively produces an OLS model for every possible subset of variables, based on an lm object (so by feeding it a full model you get all possible combinations), and returns a data frame with R-squared, adjusted R-squared, AIC, BIC, etc. for all the models. This is very helpful in finding the best predictors, but it is also very time-consuming.
see https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best-performing model; rather, look at the output and choose carefully for the most reasonable outcome. In case you want to immediately get the best-performing model (by some criteria, say number of predictors and R2), you may write a function that saves the data frame, arranges it by number of predictors, orders it by descending R2, and spits out the top result.
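For example, a minimal sketch with the built-in mtcars data (the predictors chosen here are just for illustration):
library(olsrr)
full <- lm(mpg ~ disp + hp + wt + qsec, data=mtcars)
all_fits <- ols_step_all_possible(full)   # one fit per subset of the predictors
all_fits                                  # table of R2, adjusted R2, AIC, etc. per model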
The dredge() function in the MuMIn package also accomplishes this.
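A minimal sketch of that approach, assuming the MuMIn package and illustrative variable names:
library(MuMIn)
options(na.action="na.fail")               # dredge requires this for the global model
global <- lm(y ~ x1 + x2 + x3, data=mydata)
all_subsets <- dredge(global)              # fits every subset and ranks them by AICc
head(all_subsets)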

How to get Loess function for my data in R?

I have some data and I draw them on a plot using R.
After that, I draw the loess curve for that data.
Here is the code:
data <- read.table("D:/data.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
ur <- subset(data, select = c(users,responseTime))
ur <- ur[with(ur, order(users, responseTime)), ]
plot(ur, xlab="Users", ylab="Response Time (ms)")
lines(ur)
loess_fit <- loess(responseTime ~ users, ur)
lines(ur$users, predict(loess_fit), col = "blue")
Here's my plot's image:
How can I get the function of this regression?
For example: responseTime = 68 + 45 * users.
Thanks.
You can use the loess_fit object from your code to predict the response time. If you want to estimate the average response time for 230 users, you could do:
predict(loess_fit, newdata=data.frame(users=230))
Here is an interesting blog post on this subject.
EDIT: If you want to make predictions for values outside your data, you need a theory or further assumptions. The simplest assumption would be a linear fit:
lm_fit <- lm(responseTime ~ users, data=ur)
predict(lm_fit, newdata=data.frame(users=400))
However, your data may show heteroscedasticity (non-constant variance) and non-normal residuals. You might want to check whether that is the case. If it is, then a robust linear fitting procedure such as rlm from the MASS package, or a generalized linear model (glm), might be worth a try. I am not an expert on that; maybe someone else here or at Cross Validated can provide better help.
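For example, a minimal sketch of the robust alternative with the same ur data frame (just an illustration, not a recommendation for your particular data):
library(MASS)
rlm_fit <- rlm(responseTime ~ users, data=ur)      # robust linear fit
predict(rlm_fit, newdata=data.frame(users=400))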
The loess.demo function in the TeachingDemos package shows the logic underlying the loess fit. This can help you understand what is going on and why there is no simple prediction equation. However, there is a predict function that works with loess fits to create predictions. You can also find the linear equation that will predict for a specific value of x (but it will be different for each value of x you may want to predict for).
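For example, a minimal sketch of that demo with the ur data from the question (assuming the data frame is already loaded; the demo is interactive):
library(TeachingDemos)
loess.demo(ur$users, ur$responseTime)   # click near a point to see the local weighting and fit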

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and am trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, to perform a cumulative link model with a probit link, to be precise:
I tried the function pR2 in the package pscl to get the pseudo R squared with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2 measures. I don't like to use any of them, because I do not see the results as having a meaning in the real world. They do not estimate effect sizes of any sort, and they are not particularly good for statistical inference. Furthermore, in situations like this with multiple observations per entity, I think it is debatable which value for "n" (the number of subjects) or which degrees of freedom is appropriate. Some people use McFadden's R^2, which would be relatively easy to calculate, since clm returns a list with one of its components named "logLik". You just need to know that the log-likelihood is only a multiplicative constant (-2) away from the deviance. If one had the model in the first example:
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
I recommend the nagelkerke function from the rcompanion package to get pseudo R-squared values.
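For instance, a minimal sketch with the fm1 and fm0 clm fits from the answer above (assuming the rcompanion package is installed):
library(rcompanion)
nagelkerke(fm1)              # reports McFadden, Cox and Snell, and Nagelkerke pseudo R-squared
nagelkerke(fm1, null=fm0)    # or compare explicitly against the intercept-only model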
When your predictor or outcome variables are categorical or ordinal, the R-squared will typically be lower than with truly numeric data. R-squared is merely a very weak indicator of a model's fit, and you shouldn't choose a model based on it alone.

How does predict deal with models that include the AsIs function?

I have the model lm(y~x+I(log(x))) and I would like to use predict to get predictions for a new data frame containing new values of x, based on my model. How does predict deal with the AsIs function I in the model? Does the I(log(x)) need to be specified separately in the newdata argument of predict, or does predict understand that it should construct and use I(log(x)) from x?
UPDATE
@DWin: The way the variables enter the model affects the coefficients, especially for interactions. My example is simplistic, but try this out:
x<-rep(seq(0,100,by=1),10)
y<-15+2*rnorm(1010,10,4)*x+2*rnorm(1010,10,4)*x^(1/2)+rnorm(1010,20,100)
z<-x^2
plot(x,y)
lm1<-lm(y~x*I(x^2))
lm2<-lm(y~x*x^2)
lm3<-lm(y~x*z)
summary(lm1)
summary(lm2)
summary(lm3)
You see that lm1 = lm3, but lm2 is something different (only one slope coefficient). Assuming you don't want to create the dummy variable z (computationally inefficient for large datasets), the only way to build an interaction model like lm3 is with I. Again, this is a very simplistic example (that may make no statistical sense), but it makes sense in complicated models.
@Ben Bolker: I would like to avoid guessing and ask for an authoritative answer (I can't directly check this with my models, since they are much more complicated than the example). My guess is that predict correctly assumes and constructs the I(log(x)).
You do not need to make your variable names look like the term I(x). Just use "x" in the newdata argument.
The reason lm(y~x*I(x^2)) and lm(y~x*x^2) are different is that "^" and "*" are reserved symbols in R formulas. That's not the case with the log function. It is also incorrect that interactions can only be constructed with I(). If you want a second-degree polynomial in R, you should use poly(x, 2). If you build the model with I(log(x)) or with just log(x), you should get the same fit. Both of them will be transformed to the predictor value properly by predict if you use:
newdata=data.frame( x=seq( min(x), max(x), length=10) )
Using poly will protect you from incorrect inferences that are so commonly caused by the use of I(x^2).
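A minimal sketch that checks this claim (the variable names and simulated data are just illustrative assumptions):
set.seed(1)
x <- runif(100, 1, 50)
d <- data.frame(x=x, y=3 + 2*log(x) + rnorm(100))
fit_asis  <- lm(y ~ x + I(log(x)), data=d)
fit_plain <- lm(y ~ x + log(x), data=d)
nd <- data.frame(x=seq(min(x), max(x), length=10))
all.equal(predict(fit_asis, newdata=nd), predict(fit_plain, newdata=nd))  # TRUE: same predictions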
