Applying 'clustering functions' to a series of linear models - r

I want to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible, since I am running a linear model across many columns of a dataframe. My problem is specifying additional arguments inside the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed across 4 months). The variable int is a dummy for when an intervention takes place.
library(sandwich)

df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
df[1:10, ]

outcome <- df[3]

est <- lapply(outcome, FUN = function(x) {
  lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df)
})

se <- lapply(est, function(x) {
  sqrt(diag(vcovCL(x, cluster = ~ precinct + month)))
})
I receive the following error message when adding the cluster argument inside the vcovCL function.
Error in eval(expr, envir, enclos) : object 'x' not found
The only way around it, as far as I can tell, would be to index the dataframe directly (i.e., df$precinct, df$month) when specifying the clustering variables. Could this be achieved by passing df as an additional argument inside the function call? Is this code efficient?
Maybe specifying the model equation as a formula is a better way to go.
Any thoughts/comments are always helpful :)

Here is one approach that would retrieve clustered standard errors for multiple models:
library(sandwich)

# I am going to use the same model three times to get the "sequence" of linear models.
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)

# define a function to retrieve clustered standard errors:
robust_se <- function(mod) {
  sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))
}

# apply the function to all models:
se <- lapply(list(mod, mod, mod), robust_se)
If you want to get the entire output adjusted, the following might be helpful:
library(lmtest)

adj_stats <- function(mod) {
  coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))
}

adjusted_models <- lapply(list(mod, mod, mod), adj_stats)
To address the multiple-column issue:
In case you are struggling with running linear models over several columns, the following might be helpful. Everything above stays the same, except for how you build the list of models that you pass to lapply.
First, let's use this dataframe here:
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
Let's define the outcome columns:
outcome_columns <- c("crime", "crime2", "crime3")
Now, let's run a regression with each outcome:
models <- lapply(outcome_columns, function(outcome) {
  lm(eval(parse(text = paste0(outcome, " ~ as.factor(precinct) + as.factor(month) + int"))), data = df)
})
And then you would just call
adjusted_models <- lapply(models, adj_stats)
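As a side note, if you prefer to avoid eval(parse(...)), the same models can be built with base R's reformulate(); this is just an optional variation on the code above:
# optional variation: build each formula with reformulate() instead of eval(parse(...))
models <- lapply(outcome_columns, function(outcome) {
  f <- reformulate(c("as.factor(precinct)", "as.factor(month)", "int"), response = outcome)
  lm(f, data = df)
})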
Regarding efficiency:
The above code is efficient in that it is easily adjustable and quick to write. For most use cases it will be perfectly fine. In terms of computational efficiency, note that your design matrix is the same in all cases, so by precomputing the common elements (e.g. inv(X'X) * X') you could save some computation. You would, however, lose the convenience of many built-in functions.
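As a rough sketch of that precomputation idea (assuming, as above, that every outcome shares the same design matrix), you could solve for all coefficient vectors in one pass, or simply hand lm() a matrix response so that it reuses a single decomposition:
# sketch only: shared design matrix, all outcomes solved at once
X <- model.matrix(~ as.factor(precinct) + as.factor(month) + int, data = df)
Y <- as.matrix(df[, outcome_columns])
beta_hat <- solve(crossprod(X), crossprod(X, Y))   # inv(X'X) X'Y, one column per outcome

# equivalently, lm() accepts a matrix response and fits all outcomes in one call
fit_all <- lm(Y ~ as.factor(precinct) + as.factor(month) + int, data = df)
Note that the clustered standard errors would still be computed per model, e.g. via the adj_stats function above.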

Related

lm formula with variable names in it

I want to write a function that would take an lm model, try to add some feature and test its statistical significance. I've given it a go with the code as follows:
library(rlang)
library(tidyverse)
dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- dataset %>%
  mutate(x10 = rnorm(100, 20, 9),
         x11 = rnorm(100, 3, 3))
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(y_name ~ ., data = dataset_new)
  return(summary(model_new))
}
As you can see, to create a new model from the available dataset I need to specify which variable should be the dependent variable. However, I don't know this name directly; I just need to pull it out of the original model. I did that in the function above, but it results in an error:
Error in model.frame.default(formula = y_name ~ ., data = dataset_new, :
variable lengths differ (found for 'y')
Correct me if I'm wrong, but I believe this is due to y_name being a string, not a symbol. So I tried the following modification:
test_var <- function(data, var, model){
  y_name <- sym(names(model$model)[1])
  dataset_new <- data %>%
    select_at(vars(!!y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(eval(y_name) ~ ., data = dataset_new)
  return(summary(model_new))
}
Although it seems to work, the resulting model is a perfect fit, as y is used not only as the dependent variable but also as one of the features. Specifying the formula as eval(y_name) ~ . - eval(y_name) doesn't help here. So my question is: how should I pass the dependent variable name to the lm formula to build a correct model?
Since dataset_new contains the dependent variable in the first column, you may in fact use simply
lm(dataset_new)
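Putting that into the function, a minimal sketch (using the same setup as above) could look like this; lm(dataset_new) regresses the first column on all remaining columns, so no symbol gymnastics are needed:
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  # the response sits in the first column, so the data frame itself is enough
  summary(lm(dataset_new))
}
test_var(dataset2, "x10", model1)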

Plotting large gts objects in dygraphs

I am new to R programming. I've generated a hierarchical time series using the hts package. I need to plot the time series in each hierarchy separately using dygraphs.
library(hts)
abc <- ts(5 + matrix(sort(rnorm(1000)), ncol = 10, nrow = 100))
colnames(abc) <- c("A10A", "A10B", "A10C", "A20A", "A20B",
"B30A", "B30B", "B30C", "B40A", "B40B")
y <- hts(abc, characters = c(1, 2, 1))
fcasts1 <- forecast(y, method = "bu" ,h=4, fmethod = "arima",
parallel = TRUE)
dygraph(fcasts1,y)
I keep getting this error message:
Error in UseMethod("as.xts") :
no applicable method for 'as.xts' applied to an object of class "c('gts', 'hts')"
Is there a solution to this issue? Maybe someone could tell me how to pass the variables to dygraph correctly.
It is not possible to plot hts objects directly with dygraph. What you need to do is convert the $bts component into a matrix and then into a regular time series using the ts() function.
Here is an example I've worked out.
library(hts)
library(dygraphs)

abc <- ts(5 + matrix(sort(rnorm(1000)), ncol = 10, nrow = 100))
colnames(abc) <- c("A10A", "A10B", "A10C", "A20A", "A20B",
                   "B30A", "B30B", "B30C", "B40A", "B40B")
y <- hts(abc, characters = c(1, 2, 1))
fcasts1 <- forecast.gts(y, method = "bu", h = 4, fmethod = "arima",
                        parallel = TRUE)

# pull out the bottom-level forecasts and turn them into a plain ts object
ts1 <- as.matrix(fcasts1$bts)
ts1 <- ts(ts1, start = c(2016, 3), frequency = 12)

dygraph(ts1[, "A10A"], main = 'Sample dygraph', ylab = 'Demand')
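Since the question asks for each hierarchy separately, one way (sketched here, not tested against your data) is to loop over the columns of ts1 and build one dygraph per bottom-level series:
# one dygraph widget per bottom-level series (sketch)
plots <- lapply(colnames(ts1), function(s) dygraph(ts1[, s], main = s, ylab = 'Demand'))
names(plots) <- colnames(ts1)
plots$A10A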

Why does a substituted formula work for lm and oneway.test, but not aov?

example <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  class = c(rep(1, 4), rep(2, 4))
)
example$class <- as.factor(example$class)
This question provides a fix for using substitute and as.name to create a formula for aov, but I don't understand why the formula works for oneway.test and lm. Can someone explain?
fm <- substitute(i ~ class, list(i = as.name('var1')))
oneway.test(fm, example)
One-way analysis of means (not assuming equal variances)
data: var1 and class
F = 19.2, num df = 1, denom df = 6, p-value = 0.004659
lm(fm, example)
Call:
lm(formula = fm, data = example)
Coefficients:
(Intercept) class2
2.5 4.0
aov(fm, example)
Error in terms.default(formula, "Error", data = data) :
no terms component nor attribute
The problem is that substitute is returning an unevaluated call, not a formula. Compare
class(substitute(a~b))
# [1] "call"
class(a~b)
# [1] "formula"
If you evaluate it (as was done in the other answer), both will work
fm <- eval(substitute(i ~ class, list(i = as.name('var1'))))
oneway.test(fm, example)
aov(fm, example)
The error message you were getting came from the terms function, which is called by aov(). That function needs to operate on a formula, not a call. This is basically what was happening:
# ok
terms(a~b)
# doesn't work
unf <- quote(a~b) #same as substitute(a~b)
terms(unf)
# Error in terms.default(unf) : no terms component nor attribute
# ok
terms(eval(unf))
One possible source of the difference is that fm is actually a call, not a formula, and apparently some functions do the conversion while others do not.
If you do
fm <- as.formula(fm)
then the call to aov will work.
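As a quick check of that coercion:
fm <- as.formula(fm)   # coerce the call to a genuine formula
class(fm)
# [1] "formula"
aov(fm, example)       # now runs without the terms error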

random model formula object

I want to pass a formula to a random-effects model, but I think the following error is due to a malformed formula object (?), and I could not fix it.
set.seed(1234)
mydata <- data.frame(A = rep(1:3, each = 20), B = rep(1:2, each = 30),
                     C = rnorm(60, 10, 5))
mydata$A <- as.factor(mydata$A)
mydata$B <- as.factor(mydata$B)

myfunction <- function (mydata, yvars, genovar, replication) {
  require("lme4")
  formula = paste ("yvars" ~ 1|"genovar" + 1|"replication")
  model1 <- lmer(formula, data = dataframe, REML = TRUE)
  return(ranef(model2))
}

myfunction(mydata = dataf, yvars = "C", genovar = "A", replication = "B")
Error: length(formula <- as.formula(formula)) == 3 is not TRUE
There were several wonky things in here, but this is I think close to what you want.
set.seed(1234)
mydata <- data.frame(A = factor(rep(1:3, each = 20)),
                     B = factor(rep(1:2, each = 30)),
                     C = rnorm(60, 10, 5))
require("lme4")

myfunction <- function (mydata, yvars, genovar, replication) {
  formula <- paste(yvars, "~ (1|", genovar, ") + (1|", replication, ")")
  model1 <- lmer(as.formula(formula), data = mydata, REML = TRUE)
  return(ranef(model1))
}

myfunction(mydata = mydata, yvars = "C", genovar = "A", replication = "B")
Beware, however, that lmer doesn't work the way that classical random-effects ANOVA does -- it may perform very badly with such small numbers of replicates. (In the example I tried, it set the variance of A to zero, which is at least not unreasonable.) The GLMM FAQ has some discussion of this issue. (Random-effects ANOVA would have exceedingly low power in that case but might not be quite as bad.) If you really want to fit random-effects models on such small samples, you might want to consider reconstructing the classical method-of-moments approach (as I recall there is/was a raov function in S-PLUS that did random-effects ANOVA, but I don't know whether it was ever implemented in R).
Finally, for future questions along these lines you may do better on the r-sig-mixed-models@r-project.org mailing list -- Stack Overflow is nice, but there is more R/mixed-model expertise over there.

How to grab coefficients with R when estimating a Zero Inflation Model

This is probably pretty easy, but I want to know how to grab coefficients when using the zeroinfl command.
treatment <- factor(rep(c(1, 2), c(43, 41)),
                    levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
                   levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 2)
healthvalue <- rpois(84, 0.5)
y <- data.frame(healthvalue, numberofdrugs, treatment, improved)

require(pscl)
ZIP <- zeroinfl(healthvalue ~ numberofdrugs + treatment + improved, data = y)
summary(ZIP)
I usually use ZIP$coef[1] to grab a coefficient, but unfortunately here you grab a whole bunch. So how can I grab one single coefficient from a ZIP model?
Use the coef extraction function to list all coefficients in one long vector, and then you can use single index notation to select them:
coef(ZIP)[1]
count_(Intercept)
0.1128742
Alternatively, you need to select which model you want to get the coefficients from first:
ZIP$coef$count[1]
(Intercept)
0.1128742
ZIP$coef[[1]][1]
(Intercept)
0.1128742
If you wanted to get fancy you could split the coefficients into a list:
clist <- function(m) {
  cc <- coef(m)
  ptype <- gsub("_.+$", "", names(cc))   # "count" or "zero" prefix of each coefficient
  ss <- split(cc, ptype)
  # strip the prefixes and return the renamed coefficient vectors
  lapply(ss, function(x) { names(x) <- gsub("^.*_", "", names(x)); x })
}
> clist(ZIP)
$count
(Intercept) numberofdrugs treatmenttreated improvedsome
-1.16112045 0.16126724 -0.07200549 -0.34807344
improvedmarked
0.23593220
$zero
(Intercept) numberofdrugs treatmenttreated improvedsome
7.509235 -14.449669 -58.644743 -8.060501
improvedmarked
58.034805
c1 <- clist(ZIP)
c1$count["numberofdrugs"]
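If I remember correctly, pscl's coef() method for zeroinfl objects also accepts a model argument, which gives the same split without a helper function (worth double-checking against your installed version):
coef(ZIP, model = "count")["numberofdrugs"]
coef(ZIP, model = "zero")["numberofdrugs"]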