I have fitted an lm model as follows:
data <- data.frame(x=rnorm(50), x2=runif(50), y=rnorm(50), g=rep(1:3,length.out=50))
model <- lm(y ~ x + x2 + factor(g), data=data)
I want to extract the coefficients for each level of the factor variable by referring to them by name, the way I would with a continuous variable such as 'x':
model$coefficients["x"]
I have tried using:
> model$coefficients["g"]
<NA>
NA
But it fails since the levels are renamed as can be observed below:
> model$coefficients
(Intercept) x x2 factor(g)2 factor(g)3
0.60058881 0.01232678 -0.65508242 -0.25919674 -0.04841089
I have also tried the displayed names:
model$coefficients["factor(g)2"]
but it doesn't work. How can I get this right?
Many thanks.
In these cases I usually combine the coef() function with grep(). I would do something like this:
data <- data.frame(x=rnorm(50), x2=runif(50), y=rnorm(50), g=rep(1:3,length.out=50))
model <- lm(y ~ x + x2 + factor(g), data=data)
estimates <- coef(model)
# Just get the g:2
estimates[grep("^factor\\(g\\)2", names(estimates))]
# If you want to get both factors you just skip the 2
estimates[grep("^factor\\(g\\)", names(estimates))]
# This case does not really require fancy
# regular expressions so you could write
estimates[grep("factor(g)", names(estimates), fixed=TRUE)]
# This comes much more in handy when you have a more complex situation where
# coefficients have similar names
data <- data.frame(x=rnorm(50), great_g_var=runif(50), y=rnorm(50),
g_var=factor(rep(1:3,length.out=50)),
g_var2=factor(sample(1:3,size=50, replace=TRUE)))
model <- lm(y ~ x + great_g_var + g_var + g_var2, data=data)
estimates <- coef(model)
# Now if you want to do a simple fixed grep you could end up
# with unexpected estimates
estimates[grep("g_var", names(estimates), fixed=TRUE)]
# Returns:
# great_g_var g_var2 g_var3 g_var22 g_var23
# -0.361707955 -0.058988495 0.010967326 -0.008952616 -0.297461520
# Therefore you may want to use regular expressions, here's how you select g_var
estimates[grep("^g_var[0-9]$", names(estimates))]
# Returns:
# g_var2 g_var3
# -0.05898849 0.01096733
# And if you want to have g_var2 you write:
estimates[grep("^g_var2[0-9]$", names(estimates))]
# Returns:
# g_var22 g_var23
# -0.008952616 -0.297461520
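As an alternative to pattern matching, here is a minimal sketch that looks coefficients up by model term instead of by name pattern. It relies on the "assign" attribute of the model matrix, which records the term each column came from. coef_by_term is a hypothetical helper name, not part of base R, and the sketch assumes no coefficients were dropped. Exact-name indexing such as coef(model)["factor(g)2"] should also return the value, provided the string matches the displayed name character for character.

```r
set.seed(1)
dat <- data.frame(x = rnorm(50), x2 = runif(50), y = rnorm(50),
                  g = rep(1:3, length.out = 50))
model <- lm(y ~ x + x2 + factor(g), data = dat)

# Hypothetical helper: return all coefficients that belong to one model term
coef_by_term <- function(model, term) {
  term_labels <- attr(terms(model), "term.labels")
  assign_idx <- attr(model.matrix(model), "assign")  # 0 marks the intercept
  coef(model)[assign_idx == match(term, term_labels)]
}

# All level coefficients of factor(g)
g_coefs <- coef_by_term(model, "factor(g)")
g_coefs

# Exact-name lookup of a single level
one_coef <- coef(model)["factor(g)2"]
one_coef
```

Because the lookup goes through the term label, similarly named variables (g_var versus great_g_var in the example above) cannot be confused with each other.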
I have a linear model with lots of explanatory (independent) variables
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula
model <- lm(y ~ x1 + x3 + ... + x100)
I already learned how to retrieve the name of the explaining variable which has the highest VIF coefficient:
max_vif <- function(x) {
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
But I still don't understand how to find the relevant explanatory variable, delete it from the formula, and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
# find variable(s) needing to be removed
rm_var <- max_vif(x)
# concatenate with "-" to remove variable(s) from formula
rm_var <- paste(paste0("-", rm_var), collapse = " ")
# update model
update(x, paste0(".~.", rm_var))
}
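To show how this can be repeated until no VIF is too high, here is a minimal self-contained sketch. It does not use max_vif or the vif function from above. Instead, vif_base is a hypothetical base-R helper that computes VIFs as the diagonal of the inverse predictor correlation matrix, which assumes a model with an intercept and numeric predictors only. drop_high_vif and the threshold of 5 are likewise illustrative choices.

```r
# Hypothetical helper: VIFs for an lm fit with an intercept and numeric
# predictors only, via the inverse of the predictor correlation matrix
vif_base <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  diag(solve(cor(X)))
}

# Repeatedly remove the predictor with the largest VIF using update()
drop_high_vif <- function(fit, threshold = 5) {
  repeat {
    # stop when only one predictor is left
    if (length(attr(terms(fit), "term.labels")) < 2) break
    v <- vif_base(fit)
    if (max(v) < threshold) break
    worst <- names(v)[which.max(v)]
    fit <- update(fit, paste(". ~ . -", worst))
  }
  fit
}

fit_full <- lm(wt ~ mpg + cyl + disp + hp + am, data = mtcars)
fit_reduced <- drop_high_vif(fit_full)
formula(fit_reduced)
```

The loop refits after every removal, so the VIFs are always computed on the current model rather than on the original one.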
Problem solved!
I created a list containing all the variables for the lm model:
Price <- list(y,x1,...,x100)
Then I set up the lm model in a different way:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the function I already came up with, the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding name of variable with the highest VIF
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
m <- max_vif(model) # Price must be a named list (or data frame) so that Price[[m]] finds the variable
n <- max(data.frame(vif(model)))
while(n >= 5) { # Delete the variable with the highest VIF from `Price` one at a time, until no VIF is 5 or higher
Price[[m]] <- NULL
model <- lm(y ~ ., data = Price) # refit, so the next VIFs come from the reduced model
m <- max_vif(model)
n <- max(data.frame(vif(model)))
}
Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some other columns. Usually a formula and a model are used for that purpose. For example:
linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)
That does the job well if the formula is coded statically. If one wants to iterate over several models with a constant number of explanatory variables (say, 2), it can be handled like this:
for (i in seq_len(factor_number - 1)) { # stop one short so that i + 1 never exceeds factor_number
for (j in seq(i + 1, factor_number)) {
linear_model <- lm(Y ~ F1 + F2, list(Y=foo_data_frame$Y,
F1=foo_data_frame[[i]],
F2=foo_data_frame[[j]]))
# linear_model further analyzing...
}
}
My question is how to achieve the same effect when the number of variables changes dynamically at run time?
for (number_of_factors in seq_len(5)) {
# Then iterate over subsets with #number_of_factors cardinality.
for (factors_subset in all_subsets_with_fixed_cardinality) {
# Here I want to fit model with factors from factors_subset.
linear_model <- lm(Does R provide smth to write here?)
}
}
See ?as.formula, e.g.:
factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2
where factors is a character vector containing the names of the factors you want to use in the model. You can pass this to lm, e.g.:
set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))
# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))
# Coefficients:
# (Intercept) factor1 factor2
# 0.542471 -0.002525 -0.147433
An oft forgotten function is reformulate. From ?reformulate:
reformulate creates a formula from a character vector.
A simple example:
listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')
will yield this formula:
y ~ factor1 + factor2
Although not explicitly documented, you can also add interaction terms:
listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors),
response = 'y')
will yield:
y ~ factor1 + factor2 + (factor3 + factor4)^2
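To tie this back to the loop in the question, here is a minimal sketch that fits one model per subset of predictors, using combn to enumerate the subsets and reformulate to build each formula. The data frame contents and the column names A, B and C are made up for illustration.

```r
set.seed(42)
foo_data_frame <- data.frame(Y = rnorm(30), A = rnorm(30),
                             B = rnorm(30), C = rnorm(30))
predictors <- setdiff(names(foo_data_frame), "Y")

fits <- list()
for (number_of_factors in seq_along(predictors)) {
  # all subsets of the predictors with the given cardinality
  for (vars in combn(predictors, number_of_factors, simplify = FALSE)) {
    f <- reformulate(vars, response = "Y")
    fits[[paste(vars, collapse = " + ")]] <- lm(f, data = foo_data_frame)
  }
}

names(fits)
```

Passing the data frame through the data argument keeps the formulas short and avoids relying on variables in the global environment.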
Another option could be to use a matrix in the formula:
Y = rnorm(10)
foo = matrix(rnorm(100),10,10)
factors=c(1,5,8)
lm(Y ~ foo[,factors])
You don't actually need a formula. This works:
lm(data_frame[c("Y", "factor1", "factor2")])
as does this:
v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.
model_response <- "response_field_name"
setnames(model_data_train, c(model_response), "response") #if using data.table
model_gbm <- gbm(response ~ ., data=model_data_train, ...)
Using different sources, I wrote a little function that creates a table of standard errors, t statistics and p-values, where the standard errors are clustered according to a group variable "cluster", after a linear regression model. The code is as follows:
cl1 <- function(modl,clust) {
# modl is the regression model
# clust is the cluster variable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
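On the second part of the question, here is a minimal base-R sketch that computes the same kind of table without {plm}. It works on the model matrix of the fitted lm, so the way the formula was written (including y ~ .) does not matter. cluster_table is a hypothetical helper. It assumes that cluster has one entry per observation used in the fit, that no coefficients were dropped, and that p-values use the residual degrees of freedom of the fit. It applies the same (G/(G-1))*(N-1)/(N-K) adjustment as the function above.

```r
# Hypothetical helper: clustered standard errors for an lm fit, base R only
cluster_table <- function(fit, cluster) {
  X <- model.matrix(fit)
  u <- residuals(fit)
  N <- nrow(X)
  K <- ncol(X)
  G <- length(unique(cluster))
  bread <- solve(crossprod(X))
  scores <- rowsum(X * u, group = cluster)  # per-cluster sums of x_i * u_i
  meat <- crossprod(scores)
  dfa <- (G / (G - 1)) * ((N - 1) / (N - K))  # small-sample adjustment
  V <- dfa * bread %*% meat %*% bread
  se <- sqrt(diag(V))
  tstat <- coef(fit) / se
  # Assumption: residual degrees of freedom of the lm fit
  pval <- 2 * pt(-abs(tstat), df = fit$df.residual)
  data.frame(se = se, tstat = tstat, pval = pval)
}

set.seed(123)
dat <- data.frame(cluster = rep(1:10, each = 5), x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(50)

# The cluster column is left out of the data so that "." does not pick it up
reg2 <- lm(y ~ ., data = dat[, c("y", "x1", "x2")])
out <- cluster_table(reg2, dat$cluster)
out
```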
Indeed, you can't use the . notation in a formula within the plm package.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula when it contains a dot (.). Here is a custom function that does the job (this is surely done in other packages too):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- varNames[!varNames %in% ii] # exact match, so names that merely contain the response name are kept
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it :
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>
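As a possible shortcut, base R can expand the dot itself: terms() accepts a data argument, and formula() turns the result back into a formula. A minimal sketch on a small subset of mtcars (chosen only to keep the output short):

```r
dat <- mtcars[, c("mpg", "wt", "hp")]

# terms() resolves "." against the columns of `dat`
expanded <- formula(terms(mpg ~ ., data = dat))
expanded
# mpg ~ wt + hp
```

The expanded formula can then be passed on to a function that does not understand the dot.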
Stargazer produces very nice LaTeX tables for lm (and other) objects. Suppose I've fit a model by maximum likelihood. I'd like stargazer to produce an lm-like table for my estimates. How can I do this?
Although it's a bit hacky, one way might be to create a "fake" lm object containing my estimates -- I think this would work as long as summary(my.fake.lm.object) works. Is that easily doable?
An example:
library(stargazer)
N <- 200
df <- data.frame(x=runif(N, 0, 50))
df$y <- 10 + 2 * df$x + 4 * rt(N, 4) # True params
plot(df$x, df$y)
model1 <- lm(y ~ x, data=df)
stargazer(model1, title="A Model") # I'd like to produce a similar table for the model below
ll <- function(params) {
## Log likelihood for y ~ x + student's t errors
params <- as.list(params)
return(sum(dt((df$y - params$const - params$beta*df$x) / params$scale, df=params$degrees.freedom, log=TRUE) -
log(params$scale)))
}
model2 <- optim(par=c(const=5, beta=1, scale=3, degrees.freedom=5), lower=c(-Inf, -Inf, 0.1, 0.1),
fn=ll, method="L-BFGS-B", control=list(fnscale=-1), hessian=TRUE)
model2.coefs <- data.frame(coefficient=names(model2$par), value=as.numeric(model2$par),
se=as.numeric(sqrt(diag(solve(-model2$hessian)))))
stargazer(model2.coefs, title="Another Model", summary=FALSE) # Works, but how can I mimic what stargazer does with lm objects?
To be more precise: with lm objects, stargazer nicely prints the dependent variable at the top of the table, includes SEs in parentheses below the corresponding estimates, and has the R^2 and number of observations at the bottom of the table. Is there a(n easy) way to obtain the same behavior with a "custom" model estimated by maximum likelihood, as above?
Here are my feeble attempts at dressing up my optim output as a lm object:
model2.lm <- list() # Mimic an lm object
class(model2.lm) <- c(class(model2.lm), "lm")
model2.lm$rank <- model1$rank # Problematic?
model2.lm$coefficients <- model2$par
names(model2.lm$coefficients)[1:2] <- names(model1$coefficients)
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
model2.lm$model <- df
model2.lm$terms <- model1$terms # Problematic?
summary(model2.lm) # Not working
I was just having this problem and overcame it by using the coef, se, and omit arguments of stargazer, e.g.:
stargazer(regressions, ...
coef = list(... list of coefs...),
se = list(... list of standard errors...),
omit = c(sequence),
covariate.labels = c("new names"),
dep.var.labels.include = FALSE,
notes.append=FALSE), file="")
You need to first instantiate a dummy lm object, then dress it up:
#...
model2.lm = lm(y ~ ., data.frame(y=runif(5), beta=runif(5), scale=runif(5), degrees.freedom=runif(5)))
model2.lm$coefficients <- model2$par
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
stargazer(model2.lm, se = list(model2.coefs$se), summary=FALSE, type='text')
# ===============================================
# Dependent variable:
# ---------------------------
# y
# -----------------------------------------------
# const 10.127***
# (0.680)
#
# beta 1.995***
# (0.024)
#
# scale 3.836***
# (0.393)
#
# degrees.freedom 3.682***
# (1.187)
#
# -----------------------------------------------
# Observations 200
# R2 0.965
# Adjusted R2 0.858
# Residual Std. Error 75.581 (df = 1)
# F Statistic 9.076 (df = 3; 1)
# ===============================================
# Note: *p<0.1; **p<0.05; ***p<0.01
(and then of course make sure the remaining summary stats are correct)
I don't know how committed you are to using stargazer, but you can try the broom and xtable packages. The problem is that this won't give you the standard errors for the optim model:
library(broom)
library(xtable)
xtable(tidy(model1))
xtable(tidy(model2))
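If the standard errors are needed in that table, one option is to build the data frame by hand from the optim output. This is a minimal sketch. mle_table is a hypothetical helper, it assumes the Hessian is that of the log-likelihood at the maximum (as with control=list(fnscale=-1) and hessian=TRUE above), and it uses normal-approximation z statistics. The toy input below has a known answer so the arithmetic can be checked.

```r
# Hypothetical helper: coefficient table from maximum-likelihood output
mle_table <- function(est, hessian) {
  se <- sqrt(diag(solve(-hessian)))  # inverse of the negative Hessian
  z <- est / se
  data.frame(term = names(est),
             estimate = unname(est),
             std.error = unname(se),
             statistic = unname(z),
             p.value = unname(2 * pnorm(-abs(z))))
}

# Toy input: the implied variances are 1/4 and 1/25
est <- c(a = 1, b = 2)
hess <- -diag(c(4, 25))
tab <- mle_table(est, hess)
tab$std.error
# 0.5 0.2
```

With the objects from the question, the call would be mle_table(model2$par, model2$hessian), and the resulting data frame can be handed to xtable like any other.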