Writing Regression summary as CSV, including Model Stats - r

The question is as it sounds. At present I use the broom package to tidy up my regression summary and then use write_csv to turn that summary into a CSV. The problem is that this 'tidied' summary doesn't contain useful stats like R squared, the residual distribution and the p-value from the F statistic.
Does anyone know how to write a regression summary to a CSV that contains this information?
Thanks.

An alternative would be to create a function that stores in a list all the information you need as follows:
lm(mpg ~ cyl, mtcars) -> model
model_stats <- function(model) {
data.frame(model = toString(model$call),
broom::augment(model)) -> info1
data.frame(model = toString(model$call),
broom::tidy(model)) -> info2
data.frame(model = toString(model$call),
broom::glance(model)) -> info3
list(info1, info2, info3) -> info_all
return(info_all)
}
out <- model_stats(model)
# Write each element of `out` to its own file (reusing `out` instead of refitting the summaries)
invisible(lapply(seq_along(out), function(i) write.csv(out[[i]], paste0('info', i, '.csv'))))

I assume the core issue of your question is that you don't handle all summary tables as data.frames or lists (by compiling all the information), right?
So if you just want to write a particular statistics summary (aov, TukeyHSD, augmented, glance etc.) to a CSV, you should convert it to a data.frame first.
An example from the "broom" vignette: https://cran.r-project.org/web/packages/broom/vignettes/broom.html
library(broom)
glmfit <- glm(am ~ wt, mtcars, family = "binomial")
tidy(glmfit)
fit1 <- as.data.frame(augment(glmfit))
write.csv(fit1, "test.csv")

The value returned by lm is an object which can be further processed using summary. From the value returned by summary you can access different information and process it manually.
# Data and model fit
df <- data.frame(a = rnorm(100), b = rnorm(100))
mod <- lm(a~b, data = df)
su <- summary(mod)
# Helper function to create the output
fill <- function(row, mat) {
c(row, rep("", ncol(mat)-length(row)))
}
# Create the output. `su$coefficients` provides the matrix to which more information is added using `rbind`.
# The matrix is stored first because `output` does not exist yet while it is being built.
coefs <- su$coefficients
output <- rbind(
coefs,
`Additional stats` = fill("", coefs),
Rsq = fill(su$r.squared, coefs),
Adj.Rsq = fill(su$adj.r.squared, coefs),
F = fill(su$fstatistic, coefs))
# Write to disk with row and colnames (col.names = NA)
write.table(output, file="model-summary.csv", sep = ",", col.names=NA)
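The question also asks for the p-value of the F statistic, which summary() prints but does not store. Below is a minimal sketch of one way to compute it and collect the model-level statistics in a one-row data frame, using base R only. The names `model_stats` and `stats_file` are illustrative and not part of the answer above.

```r
# Fit the same kind of model as above (seed added so the run is reproducible)
set.seed(1)
df <- data.frame(a = rnorm(100), b = rnorm(100))
mod <- lm(a ~ b, data = df)
su <- summary(mod)

# summary.lm stores the F statistic as c(value, numdf, dendf);
# its p-value is the upper tail of the corresponding F distribution
f <- su$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

model_stats <- data.frame(
  r.squared     = su$r.squared,
  adj.r.squared = su$adj.r.squared,
  sigma         = su$sigma,
  f.statistic   = f[["value"]],
  f.p.value     = f_p
)

# Written to a temporary file here; swap in a real path as needed
stats_file <- tempfile(fileext = ".csv")
write.csv(model_stats, stats_file, row.names = FALSE)
```

For a regression with a single predictor, `f.p.value` should match the p-value of the slope in the coefficient table, which is a quick sanity check.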

Related

Automate statistical analysis (one way ANOVA, Dunnett post-hoc) in a for loop R

I need to run a statistical analysis that compares control vs treatment for each gene. I've identified the type of stats that I need, and it works when performed on one gene:
library(dplyr)     # filter(), %>%
library(multcomp)  # glht(), mcp()
df <- data.frame(Gene = c("A","A","A","A","A","A","B","B","B","B","B","B"),
Value =c(12.554595492,13.554595492,14.554595492,8.554595492,2.554595492,3.554595492,13.554595492,8.554595492,16.5444425154,11.5444425154,3.5444425154,4.5444425154),
Treat=c("tre1","tre1","tre2","tre2","Control","Control","tre1","tre1","tre2","tre2","Control","Control"))
df$Treat <- as.factor(df$Treat)
A <- df %>% filter("A"== Gene)
model <- aov(Value ~ Treat, data = A)
summary(model)
model_1 <- glht(model = model,linfct = mcp(Treat="Dunnett"))
summary(model_1)
output<-capture.output(summary(model_1), file=NULL, append=FALSE)
output_df <-as.data.frame(output)
I would like to automate this operation to all the genes in the df and generate a final df that collects the results of all summaries.
When I try to perform the code in a for loop, I get the error
"Error in eval(predvars, data, env) : object 'Value' not found". Where is the code wrong?
df_list <- split(df,f=df$Gene)
df_final <- data.frame()[1:16, ]
for(i in length(df_list)){
model <- aov(Value ~ Treat, data = i)
summary(model)
model_1 <- glht(model = model,linfct = mcp(Treat="Dunnett"))
summary(model_1)
output<-capture.output(summary(model_1), file=NULL, append=FALSE)
output_df <-as.data.frame(output)
df_final <- cbind (df_final,output_df)
}
The only difference I found between the data frame A and those in the list is that Value is shown as a number in A and as a double in the list. Is this the reason? How do I fix it?
Any help or suggestion is appreciated!
Thanks.
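
For what it is worth, the loop header and the `data` argument look like the likely culprits: `for(i in length(df_list))` visits only the single number 2, and `data = i` then hands that number to aov() instead of a data frame. Below is a minimal sketch of the looping pattern, base R only, assuming one result row per gene is wanted. The Value column is rounded from the question, `results` and the extracted columns are illustrative, and the glht() step would go where the comment indicates.

```r
df <- data.frame(
  Gene  = rep(c("A", "B"), each = 6),
  Value = c(12.55, 13.55, 14.55, 8.55, 2.55, 3.55,
            13.55, 8.55, 16.54, 11.54, 3.54, 4.54),  # rounded from the question
  Treat = factor(rep(c("tre1", "tre1", "tre2", "tre2", "Control", "Control"), 2))
)
df_list <- split(df, df$Gene)

results <- lapply(names(df_list), function(gene) {
  d <- df_list[[gene]]                    # the data frame itself, not its index
  model <- aov(Value ~ Treat, data = d)
  # the Dunnett step, e.g. glht(model, linfct = mcp(Treat = "Dunnett")), would go here
  tab <- summary(model)[[1]]              # ANOVA table as a data frame
  data.frame(Gene = gene, F = tab[1, "F value"], p = tab[1, "Pr(>F)"])
})

# Stack the per-gene rows (rbind rather than cbind, so each gene is one row)
df_final <- do.call(rbind, results)
```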

How to add a q value (adjusted p-value) to a modelsummary table after pooling the results of a multinomial model over multiple imputed datasets

I am using modelsummary to display the results of several multinomial models, each pooled over 5 datasets using the mice::pool function. It works great, but I want to add the q-value / adjusted p-value for false discovery rate.
I understand I need to create a tidy_custom.mipo function to add this statistic but I can't get it to work.
Below is the code to get the 'pool_univariate' list of mipo objects, which I then pass to modelsummary. It works great, I just want to add the q-value statistic.
Any idea how to do that?
Thanks a lot!
# list of exposures (Cs() comes from the Hmisc package)
exposures <- c(
Cs(exposure1,exposure2,exposure3)
)
## model function
models <- function(x) {
lapply(imputed_data, function(y)
multinom(as.formula(
paste0(
"outcome ~ ",
x
)
), data = y, model = TRUE)
)
}
## run models
models_univariate <- as.list(seq(1,length(exposures)))
models_univariate <- pblapply(exposures, models)
## pool
pool_univariate <- as.list(seq(1,length(exposures)))
# run pool
for(j in seq_along(exposures)) {
pool_univariate[[j]] <- pool(models_univariate[[j]])
}

It is difficult to answer this question without a minimal working example. Here I give a simpler example than the original, for the linear regression context.
First, load the package and estimate a regression model:
library(modelsummary)
mod <- lm(mpg ~ hp + drat + vs + am, data = mtcars)
Second, since we want to summarize a model of class lm, we define a new method called tidy_custom.lm. This function takes a statistical model as input, and returns a data frame that conforms to the broom package specification, with one column called term and other columns containing matching statistics. In the current example, the data frame will include three new statistics (q.value, bonferroni and holm). These values are computed using R’s p.adjust function, which adjusts p values for multiple comparison:
tidy_custom.lm <- function(x, ...) {
out <- broom::tidy(x)
out$q.value <- p.adjust(out$p.value, n = 10, method = "fdr")
out$bonferroni <- p.adjust(out$p.value, n = 10, method = "bonferroni")
out$holm <- p.adjust(out$p.value, n = 10, method = "holm")
return(out)
}
Now, we can call modelsummary with our lm model, and request the statistics:
modelsummary(mod, statistic = "q.value")
We can also compare different p values and label them nicely using glue strings:
modelsummary(mod,
statistic = c(
"p = {p.value}",
"q = {q.value}",
"p (Bonferroni) = {bonferroni}",
"p (Holm) = {holm}"
)
)
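
The `n = 10` in tidy_custom.lm above tells p.adjust to correct as if ten tests had been run in total, rather than only the coefficients of this one model. Here is a small illustration of that argument with made-up p values, using base R only; the vector `p` is purely hypothetical.

```r
p <- c(0.01, 0.02, 0.03, 0.04)  # made-up p values, for illustration only

# Default: n is the number of p values supplied, so each is multiplied by 4
bonf_default <- p.adjust(p, method = "bonferroni")

# Larger n: corrects as if 10 tests had been run, so each is multiplied by 10
bonf_n10 <- p.adjust(p, method = "bonferroni", n = 10)

# "fdr" is an alias for the Benjamini-Hochberg ("BH") method
q_fdr <- p.adjust(p, method = "fdr")
q_bh  <- p.adjust(p, method = "BH")
identical(q_fdr, q_bh)
```

When several models are summarised together, `n` would normally be chosen to reflect the total number of tests across all of them.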

How can I extract one specific coefficient from multiple lavaan models?

I wrote a function to run several lavaan models at once (from 5 different datasets). In the output I get the 5 different outputs. However, I would like to extract one specific estimate from each of these models, because I am using these in a meta-analysis (and I have many more models)
Here is my code for running the model:
df_list <- list ('Y1'=emo_dyn_1,'Y2'=emo_dyn_2,'Y3'=emo_dyn_3,'Y4'=emo_dyn_4,'Y5'=emo_dyn_5)
model <- 'DepB ~ isdNA + imeanNA + sex + age'
fun = function(emo_dyn){
fit=sem(model,
data=emo_dyn,
estimator = "MLR",
missing = "ml.x")
summ = summary(fit, standardized = TRUE)
list(fit = fit,summary = summ)
}
results <- lapply(df_list,fun)
names(results) <- names(df_list)
results
And this is how I extract the coefficient. It kinda makes it a dataframe and then I extract the specific value from it. Not sure if that is the best option. It is about the standardized estimate of a specific path. But it is just copy and paste and I am sure this goes easier, but I don't know how to write this loop.
emo_dyn_1_est<-standardizedSolution(results$Y1$fit) # Standardised coefficients
emo_dyn_1_est_1<-emo_dyn_1_est[1, 4]
emo_dyn_1_est_1
emo_dyn_2_est<-standardizedSolution(results$Y2$fit) # Standardised coefficients
emo_dyn_2_est_2<-emo_dyn_2_est[1, 4]
emo_dyn_2_est_2
emo_dyn_3_est<-standardizedSolution(results$Y3$fit) # Standardised coefficients
emo_dyn_3_est_3<-emo_dyn_3_est[1, 4]
emo_dyn_3_est_3
emo_dyn_4_est<-standardizedSolution(results$Y4$fit) # Standardised coefficients
emo_dyn_4_est_4<-emo_dyn_4_est[1, 4]
emo_dyn_4_est_4
emo_dyn_5_est<-standardizedSolution(results$Y5$fit) # Standardised coefficients
emo_dyn_5_est_5<-emo_dyn_5_est[1, 4]
emo_dyn_5_est_5

lavaan has the parameterEstimates function so you can do something like:
df_list <- list ('Y1'=emo_dyn_1,'Y2'=emo_dyn_2,'Y3'=emo_dyn_3,'Y4'=emo_dyn_4,'Y5'=emo_dyn_5)
model <- 'DepB ~ isdNA + imeanNA + sex + age'
fun <- function(emo_dyn){
fit <- sem(model,
data=emo_dyn,
estimator = "MLR",
missing = "ml.x")
fit
}
results <- lapply(df_list,fun)
names(results) <- names(df_list)
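## Get a specific parameter by its row position in the parameter table
get_param <- function(fit, coef_pos) {
# "std.lv" standardizes the latent variables only; use "std.all" to match the default of standardizedSolution()
param <- parameterEstimates(fit, standardized = TRUE)[coef_pos, "std.lv"]
param
}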
lapply(results, get_param, coef_pos = 1)
I made one change: in your lapply to get the results I only kept the model fit. If you want all the summaries you can just do lapply(results, summary). The get_param function assumes that you know the position in the results table of the parameter you want.
If you want to keep your existing lapply for the results then something like this would work:
results_fit_only <- lapply(results, "[[", "fit")
lapply(results_fit_only, get_param, coef_pos = 1)

Grouping by a user defined list within a custom function in R

I am trying to create a custom function in R that lets the user perform linear regressions on a data set. I would like the user to be able to input variables for the data to be grouped by, so that multiple regressions are performed on the data set. The problem I am having is trying to get a user-defined list of variables into the custom function. Below I have tried using "...", however this does not work. If anyone has any idea how I should be approaching this, that would be great. For reference: lr.1 = the dataset, ddate = the x variable, value = the y variable, ... = the variables that the data should be grouped by.
grouped.lr = function(lr.1,ddate, value, ...){
test = lr.1 %>%
group_by(...) %>%
nest() %>%
mutate(mod = map(data, fitmodel.test),
pars = map(mod, tidy),
pred = map(mod, augment))}

It seems like the use of a formula might be fitting here, as it allows the user to specify the predictor-response relations.
The formula object is also accepted as a format for various models and can thus be directly passed down to the lm() function.
library(randomForest)  # provides randomForest()
# function training a linear model and a random forest
build_my_models <- function(formula, data) {
lm.fit <- lm(formula, data)
rf.fit <- randomForest(formula, data)
return(list(lm.fit, rf.fit))
}
# data frame with three continuous variables
a <- rnorm(100)
b <- rnorm(100, mean = 2, sd = 4)
c <- 2*a + b
my_data <- data.frame(a = a, b = b, c = c)
# build the models
my_models <- build_my_models(a ~ ., my_data)
# here the formula 'a ~ .' defines the relation between response and predictors
# (the dot indicates that 'a' depends on all other variables in the data frame)
If you want to implement a model yourself, it's never a bad idea to stick to R's syntax and conventions. You can check the documentation on how to parse the formula for your specific needs.
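
Returning to the grouping part of the question, here is a base R sketch of one possible approach, assuming the grouping columns are passed as a character vector rather than through "...". The helper name `grouped_lm` and its arguments are hypothetical, and mtcars stands in for the real data.

```r
# Hypothetical helper: fits one lm() per combination of the grouping columns
grouped_lm <- function(data, x, y, group_vars) {
  f <- reformulate(x, response = y)                     # builds y ~ x from strings
  pieces <- split(data, data[group_vars], drop = TRUE)  # one data frame per group
  lapply(pieces, function(d) lm(f, data = d))
}

fits <- grouped_lm(mtcars, x = "wt", y = "mpg", group_vars = c("cyl", "am"))

names(fits)             # one entry per non-empty cyl/am combination
t(sapply(fits, coef))   # intercept and slope for each group
```

Passing column names as strings sidesteps non-standard evaluation entirely, at the cost of a slightly less fluent call than the group_by(...) style in the question.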

how to use loop to do linear regression in R

I wonder if I can use something like a for loop or an apply function to do linear regression in R. I have a data frame containing variables such as crim, rm, ad, wd. I want to do a simple linear regression of crim on each of the other variables.
Thank you!

If you really want to do this, it's pretty trivial with lapply(), where we use it to "loop" over the other columns of df. A custom function takes each variable in turn as x and fits a model for that covariate.
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
# `dat = df` is passed through lapply() so that `crim` can be found inside the function
mods <- lapply(df[, -1], function(x, dat) lm(crim ~ x, data = dat), dat = df)
mods is now a list of lm objects. The names of mods contain the names of the covariates used to fit each model. The main drawback is that every model is fitted using a variable called x, so the coefficient is labelled x rather than by the covariate's name. More effort could probably solve this, but I doubt that effort is worth the time.
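If the labelling does matter, one possible way around it is to build each formula from the column name instead of passing the column itself. This is a sketch using base R's reformulate(); the `covariates` vector is introduced here for illustration only.

```r
set.seed(42)  # only so the example is reproducible
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))

# Loop over the column names rather than the columns
covariates <- setdiff(names(df), "crim")
mods <- lapply(covariates, function(v) {
  lm(reformulate(v, response = "crim"), data = df)  # e.g. crim ~ rm
})
names(mods) <- covariates

# Coefficients are now labelled with the real column name rather than "x"
coef(mods$rm)
```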
If you are just selecting models, which may be dubious, there are other ways to achieve this. For example via the leaps package and its regsubsets function:
library("leaps")
a <- regsubsets(crim ~ ., data = df, nvmax = 1, nbest = ncol(df) - 1)
summa <- summary(a)
Then plot(a) will show which of the models is "best", for example.
Original
If I understand what you want (crim is a covariate and the other variables are the responses you want to predict/model using crim), then you don't need a loop. You can do this using a matrix response in a standard lm().
Using some dummy data:
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
we create a matrix or multivariate response via cbind(), passing it the three response variables we're interested in. The remaining parts of the call to lm are entirely the same as for a univariate response:
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)
mods
Call:
lm(formula = cbind(rm, ad, wd) ~ crim, data = df)

Coefficients:
             rm        ad        wd
(Intercept)  -0.12026  -0.47653  -0.26419
crim         -0.26548   0.07145   0.68426
The summary() method produces a standard summary.lm output for each of the responses.

Suppose the response variable is fixed as the first column of your data frame, and you want to run a simple linear regression of it on each of the other variables in turn.
h=iris[,-5]
for (j in 2:ncol(h)){
assign(paste("a", j, sep = ""),lm(h[,1]~h[,j]))
}
The code above fits one regression per remaining column and stores the resulting lm objects in a2, a3, and so on.
