Using different sources, I wrote a little function that creates a table with standard errors, t statistics and standard errors that are clustered according to a group variable "cluster" after a linear regression model. The code is as follows
cl1 <- function(modl,clust) {
# model is the regression model
# clust is the clustervariable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
Indeed you can't use . notation for formula within plm pacakge.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula when you have a .. Here is a custom function that does the job (surely is done within other packages):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- varNames[!grepl(paste0(ii,collapse='|'),varNames)]
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it :
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>
Related
In a Cox regression framework, I'd like to implement piecewise continuous time-dependent coefficients. This is for variable that do not satisify the PH assumption.
In this vignette https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf, there are examples for step functions (p.17), and coefficient with some functional form (p.20).
What I'd like is to somehow have a piecewise relationship. Using the example provided in the vignette:
library(survival)
vfit <- coxph(Surv(time, status) ~ trt + prior + karno, veteran)
zp <- cox.zph(vfit, transform= function(time) log(time +20))
## Step functions
vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180),
episode= "tgroup", id="id")
vfit2 <- coxph(Surv(tstart, time, status) ~ trt + prior +
karno:strata(tgroup), data=vet2)
## Functional form
vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
data=veteran,
tt = function(x, t, ...) x * log(t+20))
plot(zp[3])
abline(coef(vfit3)[3:4], col=2)
From the plot (also on p.21 in the vignette), we might argue that we could have a similar but inverted trend from approx. Time=200. I've tried but without success.
First tried directly with a piecewise function with the tt argument but it does not give two sets of coefficents, only one coef for karno and one for tt(karno). I mean we should have something like ax+b for t<200 and cx+d for t>=200
vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
data=veteran,
tt = function(x, t, t1, t2, ...) x * log(t1+20) * (t<200) +
x * t2 * (t>=200))
So in a second step, I tried to mix both step functions with some functional form for each. Meaning to split the data in two time periods as for step functions and then fit a function in each. But gives error.
vfit3 <- coxph(Surv(tstart, time, status) ~ trt + prior +
(karno + tt(karno)):strata(tgroup),
data=vet2,
tt = function(x, t, ...) x * log(t+20) * (t<200) -
x * t * (t>=200))
Does someone knows how to implement this?
EDIT:
This is what I've come up with
library(survival)
## Original model
m1 <- coxph(formula = Surv(time, status) ~ trt + prior + karno,
data = veteran)
## Transform to long format as in the link
vet1 <- survSplit(Surv(time, status)~., data = veteran, id = "id",
cut = unique(veteran$time))
## Add a grouping variable (strata) for time before 200 days and after.
vet1$tgroup <- ifelse(vet1$time < 200, 1, 2)
## Add a time-transform function
## Here it is the same function for both strata, but they could be different
## e.g. ifelse(vet1$time < 200, f1(time), f2(time))
## Actually not sure, as we need to be careful with the time scale... Anyway
vet1$time1 <- log(vet1$time + 20)
## Same model as in the link, but then add an interaction with the strata
m2 <- coxph(formula = Surv(tstart, time, status)~
trt + prior + (karno + karno:time1):strata(tgroup), data = vet1)
## Some plots as in the vignette
zp <- cox.zph(m1, transform = function(time) log(time +20))
plot(zp[3])
abline(coef(m2)[c(3,5)], col="tomato")
abline(coef(m2)[c(4,6)], col="tomato")
In R stepwise forward regression, I would like to specify several minimal models. I am looking for the best model whith choices between 12 variables (6 flow variables Q_ and 6 precipitation variables LE_).
Biggest model takes into account all the variables :
formule <- "Q ~ 0 + Q_minus_1h + Q_minus_2h + Q_minus_3h + Q_minus_4h + Q_minus_5h + Q_minus_6h + LE_6h + LE_12h + LE_18h + LE_24h + LE_30h + LE_36h"
biggest <- formula(lm(formule, Sub_fit))
With Sub_fit my set of data (data frame with Q and my 12 variables).
I would like to have at least one variable "LE_XX" in my model. So my minimal model could be :
formule <- "Q ~ 0 + LE_6h"
smallest <- formula(lm(formule, Sub_fit))
OR
formule <- "Q ~ 0 + LE_12h"
smallest <- formula(lm(formule, Sub_fit))
OR...
formule <- "Q ~ 0 + LE_36h"
smallest <- formula(lm(formule, Sub_fit))
With finally :
modele.res <- step(lm(as.formula("Q ~ 0"),data=Sub_fit), direction='forward', scope=list(lower=smallest, upper=biggest))
"lower", into "scope", does not allow a list but should be one unique formula. Is it possible to do what I need ?
To specify several minimal models in stepwise forward regression, create the smallest formulas with, for instance, lapply and then loop through them.
In the example below, built-in data set mtcars is used to fit several models having mpg as response, one per each of the 3 last variables in the data set.
data(mtcars)
biggest <- mpg ~ .
sml <- names(mtcars)[9:11]
small_list <- lapply(sml, function(x) {
fmla <- paste("mpg", x, sep = "~")
as.formula(fmla)
})
names(small_list) <- sml
fit <- lm(mpg ~ ., mtcars)
fit_list <- lapply(small_list, function(smallest){
step(fit, scope = list(lower = smallest, upper = biggest))
})
Now select with AIC as criterion
min_aic <- sapply(fit_list, AIC)
min_aic
# am gear carb
#154.1194 155.9852 154.5631
fit_list[[which.min(min_aic)]]
stepwise function in StepReg R package can include some variables you want in all models during the stepwise regression.
library(StepReg)
f1 <- Q ~ 0 + Q_minus_1h + Q_minus_2h + Q_minus_3h + Q_minus_4h + Q_minus_5h + Q_minus_6h + LE_6h + LE_12h + LE_18h + LE_24h + LE_30h + LE_36h
## include LE_6h in the model
stepwise(formula=f1,
data=yourdata,
include="LE_6h",
selection="forward",
select="AIC")
## include LE_6h and LE_12h in the model
stepwise(formula=f1,
data=yourdata,
include=c("LE_6h","LE_12h"),
selection="forward",
select="AIC")
I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come that in formula dummies are considered and in coefficients are not ? Is there any rational explanation or I did something wrong ?
One point why you do not see the dummies on the output is because they are linear dependent to the other data after the fixed-effect time transformation. They are dropped so what is estimable is estimated and output.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep is included in package plm) has some more nice examples on linear dependence before and after FE transformation.
A suggestion:
As for constructing dummy variables, I suggest to use R's factor with three levels and not have the dummy matrix constructed yourself. Using a factor is typically more convinient and less error prone. It is converted to the binary dummies (treatment style) by your typical estimation function using the model.frame/model.matrix framework.
I have a linear model with lots of explaining variables (independent variables)
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linear depended on each other (multicollinearity).
I want the machine to search for the name of the explaining variable which has the highest VIF coefficient (x2 for example), delete it from the formula and then run the old lm function with the new formula
model <- lm(y ~ x1 + x3 + ... + x100)
I already learned how to retrieve the name of the explaining variable which has the highest VIF coefficient:
max_vif <- function(x) {
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
But I still don't understand how to search the needed explaining variable, delete it from the formula and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
# find variable(s) needing to be removed
rm_var <- max_vif(x)
# concatenate with "-" to remove variable(s) from formula
rm_var <- paste(paste0("-", rm_var), collapse = " ")
# update model
update(x, paste0(".~.", rm_var))
}
Problem solved!
I created a list containing all variables for lm model:
Price <- list(y,x1,...,x100)
Then I used different way for setting lm model:
model <- lm(y ~ ., data = Price)
So we can just delete variable with the highest VIF from Price list.
With the function i already came up the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding name of variable with the highest VIF
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
n <- max(data.frame(vif(model)))
while(n >= 5) { # Loop for deleting variable with the highest VIF from `Price` list one after another, untill there is no VIF equal or higher then 5
Price[[m]] <- NULL
model_auto <- lm(y ~ ., data = Price)
m <- max_vif(model)
n <- max(data.frame(vif(model)))
}
I am trying to refit a full model of class merMod with just the intercept (the null model). However, refitting using update.merMod gives a different answer than fitting the null model by hand, e.g.:
# Generate random data
set.seed(9)
dat <- data.frame(
x = do.call(c, lapply(1:5, function(x) rnorm(100, x))),
random = letters[1:5]
)
dat$y = rnbinom(500, mu = exp(dat$x), size = 1)
library(lme4)
# Get full model
full <- glmer.nb(y ~ x + (1 | random), dat)
# Write out intercept-only model by hand
null <- glmer.nb(y ~ 1 + (1 | random), dat)
# Update
null2 <- update(full, . ~ 1 -. + (1 | random))
VarCorr(null)
VarCorr(null2)
Any idea why this is an how I can use update to get the same vcov matrix?