R Stargazer - Manually Specify R^2 And Write to .tex

I am using Stargazer to report the results from some models where I use robust standard errors. The process of calculating these and then feeding the models to Stargazer strips out data such as R^2 and so I need to add it in manually. Doing so, however, is causing me problems. Below is the basic stargazer() call that I'm trying to run. Following it and some discussion is the code needed to generate the data going into the stargazer() call:
stargazer(fit1_robust, fit2_robust,
keep.stat = c("n", "adj.rsq"), # doesn't actually result in keeping the stats, but including it just to demonstrate such.
add.lines = list(c("Adjusted $R^2$", fit1_r2, fit2_r2)),
out = "~/Test.tex"
)
When I call this, I get the following error:
Error in if (nchar(text.matrix[r, c]) > max.length[real.c]) { :
missing value where TRUE/FALSE needed
There are some interesting aspects to this:
The error does not occur if I omit the ^ and instead just use "Adjusted $R2$"
The error does not occur if I don't use the out argument to specify a .tex file to export to.
Addressing either of these bullets "solves" the error, but at the expense of my code not really doing what I want it to. How can I manually add the adjusted R^2 in the way I've done here (and more generally, add notes involving ^)?
(Note: I also tried escaping the ^ character, replacing it with \^. That gave an error. If I use a double escape, \\^, that prevents the error, but then a single backslash shows up in the generated .tex file, and that's not what I want.)
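The two escape attempts can be checked in plain R, independently of stargazer. This is a minimal sketch, assuming the escapes in question are backslashes; the variable names are only for illustration. A single backslash before ^ is not a valid escape in an R string literal, so it fails when the code is parsed. A doubled backslash parses, but the resulting string holds one literal backslash followed by the caret, which is what then lands in the .tex file.

```r
# "\^" is not a recognised escape sequence, so parsing the literal fails.
single <- tryCatch(parse(text = '"\\^"'),
                   error = function(e) conditionMessage(e))
is.character(single)   # TRUE means parsing raised an error

# "\\^" parses fine and holds two characters: a backslash and a caret.
double <- "\\^"
nchar(double)
writeLines(double)
```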
Here is the rest of the code to get to the point of having all objects needed for the above stargazer() call:
library(stargazer)
library(lmtest)
library(sandwich)
#################
# Simulate Data #
#################
N = 100
A = rnorm(N)
B = rnorm(N)
Y = 2*A + B + rnorm(N)
Data = data.frame(Y, A, B)
#####################################
# Fit Models and Find Robust Errors #
#####################################
fit1 = lm(Y~A)
fit2 = lm(Y~A+B)
fit1_robust = coeftest(fit1, vcov = sandwich)
fit2_robust = coeftest(fit2, vcov = sandwich)
fit1_r2 = round(summary(fit1)$adj.r.squared, 4)
fit2_r2 = round(summary(fit2)$adj.r.squared, 4)

One workaround is to save the output of stargazer in a variable, then write it to file afterwards:
star = stargazer(fit1_robust, fit2_robust,
add.lines = list(c("Adjusted $R^2$", fit1_r2, fit2_r2))
)
cat(star, sep = '\n', file = 'Test.tex')
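The write step in this workaround does not depend on stargazer itself, so it can be tried in isolation. Below is a minimal sketch in which a hand-made character vector stands in for the value returned by stargazer() (its contents are made up for illustration), and writeLines() is used as a base R alternative to cat(); the temporary file name is also an assumption.

```r
# Hypothetical stand-in for the character vector returned by stargazer()
star <- c("\\begin{tabular}{lcc}",
          "Adjusted $R^2$ & 0.7512 & 0.8123 \\\\",
          "\\end{tabular}")

# Write one element per line, then read the file back to inspect it
out_file <- tempfile(fileext = ".tex")
writeLines(star, out_file)
written <- readLines(out_file)
written
```

If the caret survives the round trip here, the text itself is not the problem, which is consistent with the error only appearing when the out argument is used.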

Related

ggcoef_model error when two random intercepts

When trying to graph the conditional fixed effects of a glmmTMB model with two random intercepts in GGally I get the error:
There was an error calling "tidy_fun()". Most likely, this is because the
function supplied in "tidy_fun=" was misspelled, does not exist, is not
compatible with your object, or was missing necessary arguments (e.g. "conf.level=" or "conf.int="). See error message below.
Error: Error in "stop_vctrs()":
! Can't recycle "..1" (size 3) to match "..2" (size 2).
I have tinkered with the issue, and it seems to be related to the two random intercepts included in the model. I have also tried extracting the coefficient and standard error information separately through broom.mixed::tidy and then feeding the data frame into GGally::ggcoef(), to no avail. Any suggestions?
# Example with built-in randu data set
data(randu)
randu$A <- factor(rep(c(1,2), 200))
randu$B <- factor(rep(c(1,2,3,4), 100))
# Model
test <- glmmTMB(y ~ x + z + (0 +x|A) + (1|B), family="gaussian", data=randu)
# A few of my attempts at graphing--works fine when only one random effects term is in model
ggcoef_model(test)
ggcoef_model(test, tidy_fun = broom.mixed::tidy)
ggcoef_model(test, tidy_fun = broom.mixed::tidy, conf.int = T, intercept=F)
ggcoef_model(test, tidy_fun = broom.mixed::tidy(test, effects="fixed", component = "cond", conf.int = TRUE))
There are some (old!) bugs that have recently been fixed (here, here) that would make confidence interval reporting on RE parameters break for any model with multiple random terms (I think). I believe that if you are able to install updated versions of both glmmTMB and broom.mixed:
remotes::install_github("glmmTMB/glmmTMB/glmmTMB@ci_tweaks")
remotes::install_github("bbolker/broom.mixed")
then ggcoef_model(test) will work.

Is it possible to use lqmm with a mira object?

I am using the package lqmm, to run a linear quantile mixed model on an imputed object of class mira from the package mice. I tried to make a reproducible example:
library(lqmm)
library(mice)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
"Error in lqmm(Ozone ~ Solar.R + Wind + Temp + Day, random = ~1, tau = 0.5, :
`data' must be a data frame"
Yes, it is possible to get lqmm() to work in mice. Viewing the code for lqmm(), it turns out that it's a picky function. It requires that the data argument is supplied, and although it appears to check if the data exists in another environment, it doesn't seem to work in this context. Fortunately, all we have to do to get this to work is capture the data supplied from mice and give it to lqmm().
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
The explanation is that ls() gets the names of the variables available, mget() gets those variables as a list, and data.frame() converts them into a data frame.
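The same ls() / mget() / data.frame() idiom can be tried without mice or lqmm. A minimal sketch, assuming that base R's with() on an ordinary data frame exposes the columns in the same way that with() does for each imputed data set; the name rebuilt is only for illustration.

```r
# Inside with(), the columns of airquality are visible as variables,
# so ls() lists them and mget() collects them into a named list.
rebuilt <- with(airquality, data.frame(mget(ls())))

names(rebuilt)   # column names, in the order returned by ls()
dim(rebuilt)
```

Note that ls() returns names in sorted order, so the columns of the rebuilt data frame may be ordered differently from the original.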
The next problem you're going to find is that mice::pool() requires there to be tidy() and glance() methods to properly pool the multiple imputations. It looks like neither broom nor broom.mixed have those defined for lqmm. I threw together a very quick and dirty implementation, which you could use if you can't find anything else.
To get pool(fit2) to run you'll need to create the function tidy.lqmm() as below. Then pool() will assume the sample size is infinite and perform the calculations accordingly. You can also create the glance.lqmm() function before running pool(fit2), which will tell pool() the residual degrees of freedom. Afterwards you can use summary(pooled) to find the p-values.
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
broom:::as_tidy_tibble(data.frame(
estimate = coef(x),
std.error = sqrt(
diag(summary(x, covariance = TRUE,
R = 50)$Cov[names(coef(x)),
names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
broom:::as_glance_tibble(
logLik = as.numeric(stats::logLik(x)),
df.residual = summary(x, R = 2)$rdf,
nobs = stats::nobs(x),
na_types = "rii")
}
Note: lqmm uses bootstrapping to estimate the standard error. By default it uses R = 50 bootstrapping replicates, which I've copied in the tidy.lqmm() function. You can change that line to increase the number of replicates if you like.
WARNING: Use these functions and the results with caution. I know just enough to be dangerous. To me it looks like these functions work to give sensible results, but there are probably intricacies that I'm not aware of. If you can find a more authoritative source for similar functions that work, or someone who is familiar with lqmm or pooling mixed models, I'd trust them more than me.

GLM model runs in interactive code but not when I use knitr

I'm having issues with knitr.
Specifically, I have a model which runs absolutely fine in the console but when I try and knit the document, R throws an error.
Load the dataset (available here to facilitate replication):
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
scabies <- scabies %>% mutate(scabies = case_when(scabies_infestation=="yes"~1,
scabies_infestation=="no"~0)) %>%
mutate(impetigo = case_when(impetigo_active=="yes" ~1,
impetigo_active=="no" ~0))
fit the model
scabiesrisk <- glm(scabies~agegroups+gender+house_cat,data=scabies,family=binomial())
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
This code runs absolutely fine in the Console.
But when I try knitr I get:
Error in model.frame.default(formula = scabies ~ agegroups + gender +
: invalid type (list) for variable 'scabies'
Calls: ... glm -> eval -> eval -> -> model.frame.default
I was able to reproduce the problem you describe, but haven't yet fully understood what happens under the hood.
This Markdown chunk is interesting:
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
If I manually execute the lines in the chunk quickly, one after another (Ctrl+Enter x 4), I sometimes get two profiling messages:
Waiting for profiling to be done...
Waiting for profiling to be done...
In this case, scabiesrisk_summary is a matrix:
> class(scabiesrisk_summary)
[1] "matrix" "array"
If I manually slowly execute the lines in the chunk, I get only one profiling message:
Waiting for profiling to be done...
In that case, scabiesrisk_summary is a summary.glm:
> class(scabiesrisk_summary)
[1] "summary.glm"
It looks as if profiling is launched on a separate thread and, depending on whether it has finished or not, the summary function does not behave the same way. If profiling has finished, it returns the expected summary.glm object; if not, it launches another profiling run and returns a matrix.
In particular, with a matrix scabiesrisk_summary$coefficients isn't available and I get in this situation the following error message:
Error in scabiesrisk_summary$coefficients :
$ operator is invalid for atomic vectors
This could possibly also happen while knitting: does the knitting overhead make profiling slower, so that the problem occurs?
With the workaround found here (use confint.default instead of confint), I wasn't able to reproduce the above problem:
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint.default((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
OR 2.5 % 97.5 % Estimate Std. Error
(Intercept) 0.09357141 0.06984512 0.1253575 -2.3690303 0.1492092
agegroups0-10 2.20016940 1.60953741 3.0075383 0.7885344 0.1594864
agegroups11-20 2.53291768 1.79985894 3.5645415 0.9293719 0.1743214
gendermale 1.44749159 1.13922803 1.8391682 0.3698321 0.1221866
house_cat6-10 1.30521927 1.02586104 1.6606512 0.2663710 0.1228792
house_cat10+ 1.17003712 0.67405594 2.0309692 0.1570355 0.2813713
z value Pr(>|z|)
(Intercept) -15.8772359 9.110557e-57
agegroups0-10 4.9442116 7.645264e-07
agegroups11-20 5.3313714 9.747386e-08
gendermale 3.0267824 2.471718e-03
house_cat6-10 2.1677478 3.017788e-02
house_cat10+ 0.5581076 5.767709e-01
So you could also probably try this in your case.
Unlike confint.default, which is a directly readable R function, confint is an S3 generic that dispatches to a method (thanks @Ben Bolker for the internal references in comments), and I haven't yet investigated further what could explain this surprising behaviour.
Another option seems to be to save the final table in a different variable instead of overwriting scabiesrisk_summary.
I tried hard but was never able to reproduce the problem after doing so:
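To see how the two functions differ, here is a minimal sketch on a built-in data set rather than the scabies data (the model vs ~ mpg on mtcars is just an assumed stand-in). For a glm, confint() computes profile-likelihood intervals, which is where the "Waiting for profiling to be done..." message comes from, while confint.default() computes Wald intervals from the estimated standard errors.

```r
# Stand-in logistic regression on a built-in data set
fit <- glm(vs ~ mpg, data = mtcars, family = binomial())

# Profile-likelihood intervals (may emit the profiling message)
ci_profile <- suppressMessages(confint(fit))

# Wald intervals based on the asymptotic normal approximation
ci_wald <- confint.default(fit)

ci_profile
ci_wald
```

The two tables have the same layout, so either can be passed to exp(cbind(...)) as in the code above, but the numbers generally differ somewhat.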
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_final <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_final
```
I strongly suspect that you forgot to include library(tidyverse) in your script. If tidyverse is loaded, then your code works fine. If it's not:
the step where you try to mutate() (and use %>%) fails, so the scabies variable is never created within the scabies data set
glm(scabies ~ ...) then interprets the response variable scabies as being the whole data set, and complains that the response variable is "invalid type(list)".
For this reason it's good practice to avoid having variables within data frames that have the same name as the data frames themselves ...
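The name clash can be reproduced on a toy data set. A minimal sketch, where the data frame d and its columns are made up for illustration: d has no column called d, so the response in the formula resolves to the data frame itself, which is a list.

```r
# Toy data: note that there is no column named `d`
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                g = c(0, 0, 1, 0, 1, 1))

# The response `d` is not found among the columns, so R falls back to the
# data frame `d` in the calling environment and rejects it as a variable.
msg <- tryCatch(glm(d ~ x, data = d, family = binomial()),
                error = function(e) conditionMessage(e))
msg
```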
Your data transformation steps can be cleaned up a little bit (as.factor() is redundant; you can do all of the transformations as steps within a single mutate() call; as.numeric(x=="yes") is a shorter way to turn a string into a 0/1 variable ...) If I were going to do a lot more of this I would write a custom mycut() function that took breakpoints and a desired reference level as input arguments, constructed custom labels, and did the releveling.
library(tidyverse)
scabies <- (read.csv(file = "S1-Dataset_CSV.csv") %>%
mutate(agegroups = cut(age, c(0,10,20,Inf),
labels = c("0-10","11-20","21+"),
include.lowest = TRUE),
agegroups = relevel(agegroups, ref = "21+"),
house_cat = cut(house_inhabitants, c(0,5,10,Inf),
labels = c("0-5","6-10","10+"),
include.lowest = TRUE),
house_cat = relevel(house_cat, ref = "0-5"),
scabies = as.numeric(scabies_infestation=="yes"),
impetigo = as.numeric(impetigo_active=="yes"))
)
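The mycut() helper mentioned in this answer is not spelled out, so here is a minimal sketch of what it might look like. The name, the arguments and the labelling scheme are all hypothetical, and the sketch assumes integer-valued breakpoints with Inf as the last one.

```r
# Hypothetical helper: cut `x` at `breaks`, build labels such as
# "0-10", "11-20", "21+", and make `ref` the reference level.
mycut <- function(x, breaks, ref) {
  lower <- breaks[-length(breaks)]
  upper <- breaks[-1]
  # After the first interval, each label starts one above the previous bound
  lower[-1] <- lower[-1] + 1
  labels <- ifelse(is.infinite(upper),
                   paste0(lower, "+"),
                   paste0(lower, "-", upper))
  relevel(cut(x, breaks, labels = labels, include.lowest = TRUE), ref = ref)
}

ages <- c(0, 5, 10, 11, 20, 21, 60)
agegroups <- mycut(ages, c(0, 10, 20, Inf), ref = "21+")
levels(agegroups)
```

With this scheme the last label for breakpoints c(0, 5, 10, Inf) would be "11+" rather than the "10+" used above, so the labels would need adjusting if the original names matter.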

How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?

I want to persist a lm object to a file and reload it into another program. I know I can do this by writing/reading a binary file via saveRDS/readRDS, but I'd like to have an ASCII file instead of a binary file. More generally, I'd like to know why my idioms for reading in dput output are not behaving as I'd expect.
Below are examples of making a simple fit, and successful and unsuccessful recreations of the model:
dat_train <- data.frame(x=1:4, z=c(1, 2.1, 2.9, 4))
fit <- lm(z ~ x, dat_train)
rm(dat_train) # Just to make sure fit is not dependent upon `dat_train` existence
dat_score <- data.frame(x=c(1.5, 3.5))
## This works (of course)
predict(fit, dat_score)
# 1 2
# 1.52 3.48
Saving to binary file works:
## http://stackoverflow.com/questions/5118074/reusing-a-model-built-in-r
saveRDS(fit, "model.RDS")
fit2 <- readRDS("model.RDS")
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
So does this (dput it in the R session not to a file):
fit2 <- eval(dput(fit))
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
But if I persist file to disk, I cannot figure out how to get back into normal shape:
dput(fit, file = "model.R")
fit3 <- source("model.R")$value
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit3, dat_score)
# Error in predict(fit3, dat_score): object 'fit3' not found
Trying to be explicit with the eval does not work either:
## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string
dput(fit, file="model.R")
fit4 <- eval(parse(text=paste(readLines("model.R"), collapse=" ")))
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit4, dat_score)
# Error in predict(fit4, dat_score): object 'fit4' not found
In both cases above, I expect fit3 and fit4 to both work, but they don't recompile into a lm object that I can use with predict().
Can anyone advise me on how I can persist a model to a file with a structure(...) ASCII-like structure, and then re-read it back in as a lm object I can use in predict()? And why my current methods are not working?
Step 1:
You need to control de-parsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read more on all possible options in ?.deparseOpts.
The "quoteExpressions" wraps all calls / expressions / languages with quote, so that they are not evaluated when you later re-parse it. Note:
source is doing parsing;
call field in your fitted "lm" object is a call:
fit$call
# lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R will try to evaluate the lm call when the file is sourced. Evaluating it means fitting a linear model again, so R will look for dat_train, which does not exist in your new R session.
"showAttributes" is another mandatory option, as an "lm" object has a class attribute. You certainly don't want to discard all class attributes and only export a plain "list" object, right? Moreover, many elements in an "lm" object, like model (the model frame), qr (the compact QR matrix) and terms (terms info), all have attributes. You want to keep them all.
If you don't set control, the default setting with:
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, there is no "quoteExpressions", so you will get into trouble.
You can also specify "keepInteger" and "keepNA", but I don't see the need for "lm" object.
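The effect of "quoteExpressions" can be seen on a much smaller object than a fitted model. A minimal sketch, where the list obj and the temporary files are made up for illustration: the list holds an unevaluated call, just as fit$call does, and the call is chosen so that evaluating it raises an error.

```r
# A list holding an unevaluated call, standing in for the `call` field of "lm"
obj <- list(call = quote(stop("evaluated!")), value = 1:3)

f_default <- tempfile(fileext = ".R")
f_quoted  <- tempfile(fileext = ".R")
dput(obj, file = f_default)
dput(obj, file = f_quoted, control = c("quoteExpressions", "showAttributes"))

# Default control: the call is written bare, so sourcing the file evaluates it
res_default <- tryCatch(source(f_default)$value,
                        error = function(e) conditionMessage(e))

# With "quoteExpressions": the call is wrapped in quote() and survives as a call
res_quoted <- source(f_quoted)$value
is.call(res_quoted$call)
```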
------
Step 2:
The above step will get source working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like summary and predict to work. Why?
The critical issue is that the terms object in fit1 is not really a "terms" object, but only a formula (in fact it is not even a formula, only a "language" object without the "formula" class!). Just compare fit$terms and fit1$terms, and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful to prevent evaluation of call, it has a side effect on terms. So we need to reconstruct terms as best as we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all the information in fit$terms (the variable classes, for example, are missing), the result is a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more on this, as it is really technical, so I will stop here.
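The difference between the bare language object and a rebuilt "terms" object can be checked directly. A minimal sketch, using a quoted formula as an assumed stand-in for what ends up in fit1$terms after sourcing:

```r
# A quoted formula is only a call; it carries no "formula" or "terms" class
f_quoted <- quote(z ~ x)
class(f_quoted)

# terms.formula() rebuilds the attributes that the modelling functions rely on
tt <- terms.formula(f_quoted)
class(tt)
attr(tt, "term.labels")
attr(tt, "response")
```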
Once this is done, we can successfully use predict (and summary, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
-------
Concluding remark:
Although I have shown you how to get things to work, I don't really recommend doing this in general. An "lm" object will be pretty large when you fit a model to a large dataset: for example, residuals and fitted.values are long vectors, and qr and model are huge matrices / data frames. So think about this.
This is an important update!
As mentioned in the previous answer, the most challenging bit is to recover $terms as best as we can. The suggested method using terms.formula works for OP's example, but not for the following with bs() and poly():
dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
library(splines)
fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
rm(dat)
If we follow the previous answer:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
fit1 <- source("model.R")$value
fit1$terms <- terms.formula(fit1$terms)
We will see that summary.lm and anova.lm work correctly, but not predict.lm:
predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))
Error in bs(x1, df = 3) : could not find function "bs"
This is because ".Environment" attribute of $terms is missing. We need
environment(fit1$terms) <- .GlobalEnv
Now, running the above predict again, we see a different error:
Error in poly(x2, degree = 3) :
'degree' must be less than number of unique points
This is because we are missing the "predvars" attribute, which is needed for safe / correct prediction with bs() and poly().
A remedy is to dput this special attribute as well:
dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")
then read and add it
attr(fit1$terms, "predvars") <- source("predvars.R")$value
Now running predict works correctly.
Note that the "dataClasses" attribute of $terms is also missing, but this does not seem to cause any problem for any generic functions.
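To see what the "predvars" attribute contributes, it can be inspected on a freshly fitted model. A minimal sketch on simulated data (the seed, the data and the variable names are assumptions for illustration): in the original fit, the poly() call stored in "predvars" carries the fitted coefs, whereas a terms object rebuilt from the formula alone has no "predvars" at all, so poly() would be recomputed from the new data.

```r
set.seed(1)
dat <- data.frame(x2 = runif(20), y = rnorm(20))
fit <- lm(y ~ poly(x2, degree = 3), data = dat)

# In the fitted model, the poly() call has been augmented with `coefs`
pv <- attr(terms(fit), "predvars")
names(pv[[3]])

# A terms object rebuilt from the formula has no "predvars" attribute
rebuilt <- terms(formula(fit))
is.null(attr(rebuilt, "predvars"))
```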

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any
help/advice would be appreciated. My situations are as follows:
I'm trying to run a general linear model on some data and, when I run it
through the confusionMatrix, I get 'the data and reference factors must have
the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the
confusionMatrix predict. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.
