How can I print out the design matrix from lm in R? - r

Assume x2-x4 are continuous predictors, each occupying one column in the design matrix created by lm() in R. I want to include x1, a categorical variable that has 3 levels.
Regression R code:
fit <- lm(y ~ as.factor(x1) + x2 + x3 + x4, data = mydata)
How can I print the design matrix from lm() in R, and what would it look like? I need to know the default coding used in R so I can write contrast statements properly.

I think the model.matrix() function is what you're after.
As kjetil b halvorsen says, you can specify the model formula in the argument, or you can just hand model.matrix() the fitted model object:
fit <- lm(y ~ as.factor(x1) + x2 + x3 + x4, data = mydata)
model.matrix(fit)
You can even get a design matrix for new data by reusing the fitted model's terms: model.matrix(delete.response(terms(fit)), data = newdata)

The comment by kjetil b halvorsen helped me the most:
Call res <- lm() with the argument x=TRUE; the design matrix will then be returned in the model object res. Then call str(res) to see the structure of res, and you will know how to get the design matrix from it. But it is easier to call model.matrix(y ~ x + f, data=...) with the same model formula you use in lm.
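To see the default coding, here is a minimal sketch (toy data; the variable names mirror the question but the values are made up). R's default is treatment contrasts: the first factor level is the baseline, and each remaining level gets a 0/1 indicator column:

```r
# toy data mirroring the question: a 3-level factor plus a continuous predictor
set.seed(1)
mydata <- data.frame(x1 = factor(rep(c("a", "b", "c"), 4)),
                     x2 = rnorm(12),
                     y  = rnorm(12))
fit <- lm(y ~ x1 + x2, data = mydata)
# level "a" is the baseline; x1b and x1c are 0/1 indicator columns
colnames(model.matrix(fit))
# [1] "(Intercept)" "x1b" "x1c" "x2"
```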

Related

Memory issue when using factor for fixed effect regression

I am working with a dataset of 900,000 observations. There is a categorical variable x with 966 unique values that needs to be entered as fixed effects. I am including the fixed effects using factor(x) in the regression, which gives me an error like this:
Error: cannot allocate vector of size 6.9 Gb
How do I fix this error? Or do I need to do something different in the regression for the fixed effects?
Then, how do I run a regression like this:
rlm(y~x+ factor(fe), data=pd)
The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix + lm.fit rather than being able to use lm).
In contrast to @RuiBarradas's estimate of 3.5 Gb for a dense model matrix:
m <- Matrix::sparse.model.matrix(~x,
data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
If you are using the rlm function from the MASS package, something like this should work:
library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
rlm(y=pd$y, x=mm, ...)
Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
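To see the savings at toy scale (the sizes here are made up and far smaller than the poster's data), compare the dense and sparse versions of the same model matrix:

```r
library(Matrix)

set.seed(1)
x <- factor(sample(1:50, 1e4, replace = TRUE))  # 50-level factor, 10,000 rows
dense  <- model.matrix(~ x)                     # 10000 x 50 dense matrix
sparse <- sparse.model.matrix(~ x)              # same matrix, sparse storage
object.size(dense)   # several megabytes
object.size(sparse)  # a small fraction of that
```

The dense matrix stores every zero explicitly; the sparse one stores only the nonzero entries (one indicator per row plus the intercept column), which is where the memory savings come from.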

Predicting with plm function in R

I was wondering if it is possible to predict with the plm function from the plm package in R for a new dataset of predictor variables. I have created a model object using:
model <- plm(formula, data, index, model = 'pooling')
Now I'm hoping to predict a dependent variable from a new dataset which has not been used in the estimation of the model. I can do it through using the coefficients from the model object like this:
col_idx <- c(...)
df <- cbind(rep(1, nrow(df)), df[(1:ncol(df))[-col_idx]])
fitted_values <- as.matrix(df) %*% as.matrix(model_object$coefficients)
That is, I first record in col_idx the index columns used in the model and the columns dropped due to collinearity, and then construct a matrix of data to be multiplied by the coefficients from the model. However, errors can creep in much more easily with this manual dropping of columns.
A function designed to do this would make the code a lot more readable, I guess. I have also found the pmodel.response() function, but I can only get it to work for the dataset that was used to fit the model object itself.
Any help would be appreciated!
I wrote a function (predict.out.plm) to do out of sample predictions after estimating First Differences or Fixed Effects models with plm.
The function is posted here:
https://stackoverflow.com/a/44185441/2409896
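For a pooling model specifically, the prediction is just the new design matrix times the coefficient vector, so the manual approach in the question can be made less error-prone by letting model.matrix() build the matrix from the formula. A base-R sketch (toy data and variable names are made up; lm stands in for plm's pooling estimator, which produces the same coefficients):

```r
set.seed(1)
df   <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(20, sd = 0.1)
fit  <- lm(y ~ x1 + x2, data = df)   # pooling-OLS stand-in

new_df <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
# build the design matrix from the formula instead of binding columns by hand
X_new  <- model.matrix(~ x1 + x2, data = new_df)
manual <- drop(X_new %*% coef(fit))
all.equal(unname(manual), unname(predict(fit, newdata = new_df)))  # TRUE
```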

polyval: from Matlab to R

I would like to use in R the following expression given in Matlab:
y1=polyval(p,end_v);
where p in Matlab is:
p = polyfit(Nodes_2,CInt_interp,3);
Right now in R I have:
p <- lm(Spectra_BIR$y ~ poly(Spectra_BIR$x,3, raw=TRUE))
But I do not know which command in R corresponds to the polyval from Matlab.
Many thanks!
r:
library(polynom)
predict(polynomial(1:3), c(5,7,9))
[1] 86 162 262
matlab (official example):
p = [3 2 1];
polyval(p,[5 7 9])
ans = 86 162 262
There is no exact equivalent in R for polyfit and polyval, as these MATLAB routines are primitive compared with R's statistical toolbox.
In MATLAB, polyfit mainly returns polynomial regression coefficients (covariance can be obtained if required, though). polyval takes regression coefficients p and a set of new x values at which to evaluate the fitted polynomial.
In R, the fashion is: use lm to obtain a regression model (much broader; not restricted to polynomial regression); use summary.lm for the model summary, like obtaining covariance; use predict.lm for prediction.
So here is the way to go in R:
## don't use `$` in formula; use `data` argument
fit <- lm(y ~ poly(x,3, raw=TRUE), data = Spectra_BIR)
Note, fit contains not only the coefficients but also other components essential for prediction. If you want to extract the coefficients, do coef(fit), or unname(coef(fit)) if you don't want the coefficient names to be shown.
Now, to predict, we do:
x.new <- rnorm(5) ## some random new `x`
## note, `predict.lm` takes a "lm" model, not coefficients
predict.lm(fit, newdata = data.frame(x = x.new))
predict.lm is much more powerful than polyval. It can return confidence intervals. Have a read of ?predict.lm.
There are a few sensitive issues with the use of predict.lm. There have been countless questions / answers regarding this; you can find the root question to which those questions are often closed as duplicates:
Getting Warning: “ 'newdata' had 1 row but variables found have 32 rows” on predict.lm in R
Predict() - Maybe I'm not understanding it
So make sure you get the good habit of using lm and predict at the early stage of learning R.
Extra
It is also not difficult to construct something identical to polyval in R. The function g in my answer Function for polynomials of arbitrary order does this; by setting nderiv we can also get derivatives of the polynomial.
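For completeness, here is a tiny hand-rolled equivalent (the helper name polyval_r is made up) using Horner's scheme, with coefficients in decreasing powers as in MATLAB:

```r
# evaluate a polynomial with coefficients p (decreasing powers) at points x
polyval_r <- function(p, x) {
  y <- 0
  for (coef in p) y <- y * x + coef  # Horner's scheme, vectorized over x
  y
}
polyval_r(c(3, 2, 1), c(5, 7, 9))
# [1]  86 162 262
```

This reproduces the official MATLAB example quoted earlier: p = [3 2 1] is 3x^2 + 2x + 1.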

Command for finding the best linear model in R

Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.
There are numerous ways this could be achieved, but for a simple way of doing this I would suggest that you have a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, a very simple example of model selection is available on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
This should get you started. Although you should read my comment from above. This should build you a model based on all the data in your dataset and then compare all of the models with AIC and BIC.
# create an empty list called model so we have something to add our fits to
model <- list()
# create a vector of the data frame column names used to build the formula
vars <- names(data)
# remove variable names you don't want to use (at least
# the response variable, if it's in the first column)
vars <- vars[-1]
# combn() generates every different combination of variables; glm() is run on each
for (i in seq_along(vars)) {
  xx <- combn(vars, i)
  if (is.null(dim(xx))) {
    fla <- paste("y ~", paste(xx, collapse = "+"))
    model[[length(model) + 1]] <- glm(as.formula(fla), data = data)
  } else {
    for (j in seq_len(ncol(xx))) {
      fla <- paste("y ~", paste(xx[, j], collapse = "+"))
      model[[length(model) + 1]] <- glm(as.formula(fla), data = data)
    }
  }
}
# see how many models were build using the loop above
length(model)
# create vectors to hold the AIC and BIC values from each fitted model
AICs <- numeric(length(model))
BICs <- numeric(length(model))
for (i in seq_along(model)) {
  AICs[i] <- AIC(model[[i]])
  BICs[i] <- BIC(model[[i]])
}
#see which models were chosen as best by both methods
which(AICs==min(AICs))
which(BICs==min(BICs))
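The same approach can be condensed into a self-contained sketch on toy data (variable names are made up), pulling out the AIC-best model at the end:

```r
set.seed(42)
d <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
vars   <- c("x1", "x2", "x3")
models <- list()
for (i in seq_along(vars)) {
  combos <- combn(vars, i)                        # every size-i subset
  for (j in seq_len(ncol(combos))) {
    fla <- reformulate(combos[, j], response = "y")
    models[[length(models) + 1]] <- glm(fla, data = d)
  }
}
length(models)  # 7 subsets: choose(3,1) + choose(3,2) + choose(3,3)
aics <- vapply(models, AIC, numeric(1))
formula(models[[which.min(aics)]])  # the AIC-best model
```

reformulate() builds the formula directly from the character vector of predictors, which avoids the paste/as.formula round-trip.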
I ended up running forward, backward, and stepwise procedures on the data to select models, and then comparing them based on AIC, BIC, and adjusted R-squared. This method seemed most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to model each explanatory variable against the response only, so I could just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't concerned with how the variables interacted with each other.
This is a very old question, but for those who are still encountering this discussion: the package olsrr, and specifically the function ols_step_all_possible, exhaustively fits an OLS model for every possible subset of variables, based on an lm object (so by feeding it a full model you get all possible combinations), and returns a data frame with R squared, adjusted R squared, AIC, BIC, etc. for all the models. This is very helpful in finding the best predictors, but it is also very time-consuming.
see https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best performing model; rather, I would actually look at the output and choose carefully for the most reasonable outcome. If you want to immediately get the best performing model (by some criteria, say number of predictors and R2), you may write a function that saves the data frame, arranges it by number of predictors, orders it by descending R2, and spits out the top result.
The dredge() function in the MuMIn package also accomplishes this.

How does predict deal with models that include the AsIs function?

I have the model lm(y ~ x + I(log(x))) and I would like to use predict to get predictions for a new data frame containing new values of x, based on my model. How does predict deal with the AsIs function I in the model? Does I(log(x)) need to be specified separately in the newdata argument of predict, or does predict understand that it should construct and use I(log(x)) from x?
UPDATE
@DWin: The way the variables enter the model affects the coefficients, especially for interactions. My example is simplistic, but try this out:
x<-rep(seq(0,100,by=1),10)
y<-15+2*rnorm(1010,10,4)*x+2*rnorm(1010,10,4)*x^(1/2)+rnorm(1010,20,100)
z<-x^2
plot(x,y)
lm1<-lm(y~x*I(x^2))
lm2<-lm(y~x*x^2)
lm3<-lm(y~x*z)
summary(lm1)
summary(lm2)
summary(lm3)
You see that lm1 = lm3, but lm2 is something different (only 1 coefficient for x). Assuming you don't want to create the dummy variable z (computationally inefficient for large datasets), the only way to build an interaction model like lm3 is with I(). Again, this is a very simplistic example (that may make no statistical sense); however, it makes sense in complicated models.
@Ben Bolker: I would like to avoid guessing and ask for an authoritative answer (I can't directly check this with my models since they are much more complicated than the example). My guess is that predict correctly assumes and constructs I(log(x)).
You do not need to make your variable names look like the term I(x). Just use "x" in the newdata argument.
The reason lm(y~x*I(x^2)) and lm(y~x*x^2) are different is that "^" and "*" are reserved symbols in R formulas. That's not the case with the log function. It is also incorrect that interactions can only be constructed with I(). If you want a second-degree polynomial in R, you should use poly(x, 2). Whether you build the model with I(log(x)) or with just log(x), you should get the same model. Both will be transformed to the predictor value properly by predict if you use:
newdata = data.frame(x = seq(min(x), max(x), length = 10))
Using poly will protect you from incorrect inferences that are so commonly caused by the use of I(x^2).
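A quick sketch confirming the point (toy data, made-up numbers): I(log(x)) and log(x) produce identical predictions, and newdata only needs a plain x column in either case:

```r
set.seed(1)
d <- data.frame(x = runif(30, 1, 10))
d$y <- 2 + 3 * log(d$x) + rnorm(30, sd = 0.1)

f1 <- lm(y ~ x + I(log(x)), data = d)
f2 <- lm(y ~ x + log(x),    data = d)

nd <- data.frame(x = seq(min(d$x), max(d$x), length = 10))
# predict reconstructs I(log(x)) from x itself; no extra column is needed
all.equal(predict(f1, newdata = nd), predict(f2, newdata = nd))  # TRUE
```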
