Feeding probitmfx output in a list to Stargazer

Similarly to here, I am running a series of regressions on subgroups (all combinations of year and group) using tidyr.
year <- rep(2014:2015, length.out = 10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace = T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year = year, group = group, value = value, female=female, smoker = smoker)
library(mfx)
library(dplyr)
library(broom)
library(stargazer)
# create list of dfs
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()
# apply the model to each df and produce stargazer result
model_list <- lapply(table_list, function(x) probitmfx(smoker ~ female, data = x))
stargazer(model_list, type = "text")
I get an error saying
% Error: Unrecognized object type.
Does anybody know how I can get around this issue?

As Colin noted in the comments, this type of model does not appear to be supported by stargazer. However, it is supported out-of-the-box by the modelsummary package (disclaimer: I am the author).
You can omit the output argument altogether to get a nice HTML table, or change it to save your table to LaTeX, Word, or lots of other formats.
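For example (a quick sketch; the file names below are just placeholders), passing a file path to output writes the table straight to disk in the matching format:
modelsummary(model_list, output = "table.html")  # HTML file
modelsummary(model_list, output = "table.tex")   # LaTeX file
modelsummary(model_list, output = "table.docx")  # Word file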
# Code from the original question
library(mfx)
library(dplyr)
year <- rep(2014:2015, length.out = 10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace = T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year = year, group = group, value = value, female=female, smoker = smoker)
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()
model_list <- lapply(table_list, function(x) probitmfx(smoker ~ female, data = x))
# New code
library(modelsummary)
modelsummary(model_list, output = "markdown")
|          | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9 | Model 10 | Model 11 | Model 12 | Model 13 | Model 14 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|----------|----------|----------|----------|----------|
| female   | 0.022   | -0.026  | -0.033  | 0.013   | -0.030  | -0.001  | -0.003  | -0.073  | 0.006   | -0.041   | 0.075    | -0.006   | -0.009   | 0.023    |
|          | (0.039) | (0.038) | (0.036) | (0.037) | (0.037) | (0.038) | (0.037) | (0.038) | (0.036) | (0.038)  | (0.037)  | (0.038)  | (0.037)  | (0.038)  |
| Num.Obs. | 670     | 688     | 761     | 727     | 740     | 675     | 739     | 688     | 751     | 703      | 733      | 710      | 737      | 678      |
| AIC      | 932.4   | 957.3   | 1058.1  | 1009.6  | 1029.2  | 939.7   | 1027.6  | 953.8   | 1044.9  | 977.4    | 1016.0   | 988.2    | 1025.6   | 943.5    |
| BIC      | 941.4   | 966.4   | 1067.4  | 1018.8  | 1038.4  | 948.7   | 1036.8  | 962.9   | 1054.1  | 986.5    | 1025.2   | 997.4    | 1034.8   | 952.5    |
| Log.Lik. | -464.206 | -476.648 | -527.074 | -502.806 | -512.588 | -467.837 | -511.809 | -474.924 | -520.425 | -486.682 | -506.004 | -492.123 | -510.810 | -469.740 |


Any way to reverse the direction of comparisons when using emmeans contrast with "interaction" argument?

I'm trying to use emmeans to test "contrasts of contrasts" with custom orthogonal contrasts applied to a zero-inflated negative binomial model. The study design has 4 groups (study_group: grp1, grp2, grp3, grp4), each of which is assessed at 3 timepoints (time: Time1, Time2, Time3).
With the code below, I am able to get very close to, but not exactly, what I want. The contrasts that emerge are expressed in terms of ratios such as grp1/grp2, grp1/grp3,..., grp3/grp4 ("lower over higher"; see output following code).
What would be immensely helpful to me is a way to flip these ratios to be grp2/grp1, grp3/grp1,..., grp4/grp3 ("higher over lower"). I've tried sticking reverse=TRUE in various spots, but to no effect.
Short of re-leveling the study_group factor, is there any way to do this in emmeans?
Thanks!
library(glmmTMB)
library(emmeans)
set.seed(3456)
# Building grid for study design: 4 groups of 3 sites,
# each with 20 participants observed 3 times
site <- rep(1:12, each=60)
pid <- 1000*site+10*(rep(rep(1:20,each=3),12))
study_group <- c(rep("grp1",180), rep("grp2",180), rep("grp3",180), rep("grp4",180))
grp_num <- c(rep(0,180), rep(1,180), rep(2,180), rep(3,180))
time <- c(rep(c("Time1", "Time2", "Time3"),240))
time_num <- c(rep(c(0:2),240))
# Site-level random effects (intercepts)
site_eff_count = rep(rnorm(12, mean = 0, sd = 0.5), each = 60)
site_eff_zeros = rep(rnorm(12, mean = 0, sd = 0.5), each = 60)
# Simulating a neg binomial outcome
y_count <- rnbinom(n = 720, mu=exp(3.25 + grp_num*0.15 + time_num*-0.20 + grp_num*time_num*0.15 + site_eff_count), size=0.8)
# Simulating some extra zeros
log_odds = (-1.75 + grp_num*0.2 + time_num*-0.40 + grp_num*time_num*0.50 + site_eff_zeros)
prob_1 = plogis(log_odds)
prob_0 = 1 - prob_1
y_zeros <- rbinom(n = 720, size = 1, prob = prob_0)
# Building dataset with ZINB-ish outcome
data_ZINB <- data.frame(site, pid, study_group, time, y_count, y_zeros)
data_ZINB$y_obs <- ifelse(y_zeros==1, y_count, 0)
# Estimating ZINB GLMM in glmmTMB
mod_ZINB <- glmmTMB(y_obs ~ 1
                    + study_group + time + study_group*time
                    + (1|site),
                    family = nbinom2,
                    zi = ~ .,
                    data = data_ZINB)
#summary(mod_ZINB)
# Getting model-estimated "cell" means for conditional (non-zero) sub-model
# in response (not linear predictor) scale
count_means <- emmeans(mod_ZINB,
                       pairwise ~ time | study_group,
                       component = "cond",
                       type = "response",
                       adjust = "none")
# count_means
# Defining custom contrast function for orthogonal time contrasts
# contr1 = Time 2 - Time 1
# contr2 = Time 3 - Times 1 and 2
compare_arms.emmc <- function(levels) {
  k <- length(levels)
  contr1 <- c(-1, 1, 0)
  contr2 <- c(-1, -1, 2)
  coef <- as.data.frame(lapply(seq_len(k - 1), function(i) {
    if (i == 1) contr1 else contr2
  }))
  names(coef) <- c("T1vT2", "T1T2vT3")
  attr(coef, "adjust") <- "none"
  coef
}
# Estimating pairwise between-group "contrasts of contrasts"
# i.e., testing if time contrasts differ across groups
compare_arms_contrast <- contrast(count_means[[1]],
                                  interaction = c("compare_arms", "pairwise"),
                                  by = NULL)
compare_arms_contrast
Applying the emmeans::contrast function as above yields this:
time_compare_arms study_group_pairwise ratio SE df null t.ratio p.value
T1vT2 grp1 / grp2 1.091 0.368 693 1 0.259 0.7957
T1T2vT3 grp1 / grp2 0.623 0.371 693 1 -0.794 0.4276
T1vT2 grp1 / grp3 1.190 0.399 693 1 0.520 0.6034
T1T2vT3 grp1 / grp3 0.384 0.241 693 1 -1.523 0.1283
T1vT2 grp1 / grp4 0.664 0.245 693 1 -1.108 0.2681
.
.
.
T1T2vT3 grp3 / grp4 0.676 0.556 693 1 -0.475 0.6346
Tests are performed on the log scale
The answer, provided by Russ Lenth in the comments and in the emmeans documentation for the contrast function, is to replace pairwise with revpairwise in the contrast function call.
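Applied to the code above, the call becomes:
compare_arms_contrast <- contrast(count_means[[1]],
                                  interaction = c("compare_arms", "revpairwise"),
                                  by = NULL)
compare_arms_contrast
This reverses each pairwise group comparison, so the ratios are reported as grp2/grp1, grp3/grp1,..., grp4/grp3.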

R/gtsummary: excluding outliers in tbl_summary

Is it possible to have gtsummary::tbl_summary compute the mean excluding outliers? For example, in the following code I present sample data of some z-scores. Is it possible to specify, or add a clause for, how gtsummary::tbl_summary handles each column?
library(dplyr)
library(gtsummary)
set.seed(42)
n <- 1000
dat <- data.frame(id = 1:n,
                  treat = factor(sample(c('Treat','Control'), n, rep=TRUE, prob=c(.5, .5))),
                  outcome1 = runif(n, min=-3.6, max=2.3),
                  outcome2 = runif(n, min=-1.9, max=3.3),
                  outcome3 = runif(n, min=-2.5, max=2.8),
                  outcome4 = runif(n, min=-3.1, max=2.2))
dat %>%
  select(-c(id)) %>%
  tbl_summary(by = treat, statistic = list(all_continuous() ~ "{mean} ({min} to {max})"))
For example, suppose I want the table to report the mean of outcome1 only in cases where outcome1 >= -2.9, and for outcome2 only in cases where outcome2 < 3.0, etc.
Many thanks in advance for any guidance offered.
You can define a new mean function that excludes outlying values. You can define the outlier in any way you'd like. Then pass that function to tbl_summary(). Example below!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.5.2'
set.seed(42)
n <- 1000
dat <- data.frame(id = 1:n,
                  treat = factor(sample(c('Treat','Control'), n, rep=TRUE, prob=c(.5, .5))),
                  outcome1 = runif(n, min=-3.6, max=2.3),
                  outcome2 = runif(n, min=-1.9, max=3.3),
                  outcome3 = runif(n, min=-2.5, max=2.8),
                  outcome4 = runif(n, min=-3.1, max=2.2))
mean_no_extreme <- function(x) {
  x <- na.omit(x)
  sd <- sd(x)
  mean <- mean(x)
  # calculate mean excluding extremes (more than 3 SDs from the mean)
  mean(x[x >= mean - sd * 3 & x <= mean + sd * 3])
}
dat %>%
  select(-c(id)) %>%
  tbl_summary(
    by = treat,
    statistic = all_continuous() ~ "{mean_no_extreme} ({min} to {max})"
  ) %>%
  as_kable()
| Characteristic | Control, N = 527       | Treat, N = 473         |
|:---------------|:-----------------------|:-----------------------|
| outcome1       | -0.64 (-3.59 to 2.30)  | -0.70 (-3.60 to 2.30)  |
| outcome2       | 0.68 (-1.89 to 3.30)   | 0.78 (-1.87 to 3.28)   |
| outcome3       | 0.20 (-2.47 to 2.80)   | 0.23 (-2.48 to 2.80)   |
| outcome4       | -0.36 (-3.09 to 2.19)  | -0.41 (-3.10 to 2.20)  |
Created on 2022-03-22 by the reprex package (v2.0.1)
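If you instead want a different cutoff for each variable (as in the question, e.g. keep outcome1 only where outcome1 >= -2.9 and outcome2 only where outcome2 < 3.0), one option is to write one helper per column and pass variable-specific formulas to the statistic argument. This is only a sketch along the same lines as the answer above; the helper names are made up:
mean_outcome1 <- function(x) mean(x[x >= -2.9], na.rm = TRUE)  # mean of outcome1, excluding values below -2.9
mean_outcome2 <- function(x) mean(x[x < 3.0], na.rm = TRUE)    # mean of outcome2, excluding values at or above 3.0
dat %>%
  select(-c(id)) %>%
  tbl_summary(
    by = treat,
    statistic = list(
      outcome1 ~ "{mean_outcome1} ({min} to {max})",
      outcome2 ~ "{mean_outcome2} ({min} to {max})"
    )
  )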

Changing bootstrapped sample variables w/o regenerating bootstrap

I am using a bootstrapped dataset to fit a model. After fitting the model, I would like to change the bootstrapped dataset and use this new dataset to predict.
My problem is that I can't change the bootstrapped dataset. It often tells me that the variable that I am trying to change cannot be found. Other times (as in the case below) it won't let me calculate the mean by bootstrapped sample.
Why is this?
library(tidymodels)
library(broom)
year <- rep(2014:2016, length.out=10000)
group <- factor(sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000))
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- tibble(year = year, group = group, female = female, smoker = smoker)
boot <- bootstraps(dta,
                   times = 2,
                   apparent = TRUE,
                   replace = TRUE)
mods <- boot %>%
  nest(data = c(-all_of(female))) %>%
  mutate(model = map(data, ~ glm(smoker ~ group, data = .,
                                 family = binomial(link = "probit"))))
new_boot <- boot %>%
  group_by(id) %>% # calculate the mean by bootstrapped sample
  mutate(female = mean(female),
         smoker = mean(smoker))
new_boot # female and smoker are calculated for entire dataset
splits id female smoker
<list> <chr> <dbl> <dbl>
1 <split [10000/3578]> Bootstrap1 0.492 0.502
2 <split [10000/3681]> Bootstrap2 0.492 0.502
3 <split [10000/10000]> Apparent 0.492 0.502
Why is this? How can I change the bootstrapped sample?
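One note (not from the original post, just a sketch of why the mutate above touches the whole dataset): boot is a tibble whose splits column holds rsample split objects, not the resampled rows themselves, so mutating boot never reaches the bootstrapped data. Assuming rsample's analysis() accessor (rsample, dplyr, and purrr are all loaded by library(tidymodels) above), the per-sample data can be extracted and modified like this:
new_data <- map(boot$splits, function(s) {
  analysis(s) %>%                    # the actual bootstrapped data frame for this split
    mutate(female = mean(female),    # means are now computed within each bootstrap sample
           smoker = mean(smoker))
})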

can't group by with a tibble

I'm doing cross-validation (five-fold). Then I want to calculate the mean squared error for each fold (group) of the data set I used for that CV. Please note that I need to use the following functions.
data(mpg)
library(modelr)
cv <- crossv_kfold(mpg, k = 5)
models1 <- map(cv$train, ~lm(hwy ~ displ, data = .))
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
MSE1 <- pred1 %>% group_by(Run) %>%
  summarise(MSE = mean((hwy - pred)^2))
MSE1
My problem lies with the output of 'summarise'. The function should be applied to each group. The result should look something like this:
## # A tibble: 5 x 2
## Run MSE
## <chr> <dbl>
## 1 1 27.889532
## 2 2 8.673054
## 3 3 17.033056
## 4 4 12.552037
## 5 5 9.138741
Unfortunately, I get only one value:
MSE
1 14.77799
How can I get a tibble like that above?
When I run your code, I get the style of output you are expecting (though the numbers differ, since no seed was set in your example); I do not see the summarise problem you describe:
library(ggplot2)
library(modelr)
library(purrr)
library(dplyr)
data(mpg)
cv <- crossv_kfold(mpg, k = 5)
models1 <- map(cv$train, ~lm(hwy ~ displ, data = .))
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
MSE1 <- pred1 %>% group_by(Run) %>%
  summarise(MSE = mean((hwy - pred)^2))
MSE1
# A tibble: 5 x 2
Run MSE
<chr> <dbl>
1 1 7.80
2 2 12.5
3 3 9.82
4 4 27.3
5 5 17.5
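Since the only difference from the expected output is the random fold assignment, setting a seed before crossv_kfold() makes the numbers reproducible (a minimal sketch; any fixed seed will do):
set.seed(123)                  # arbitrary seed, purely for reproducibility
cv <- crossv_kfold(mpg, k = 5)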

Tabulate coefficients from lm

I have 10 linear models from which I only need some information, namely: r-squared, p-value, and the slope and intercept coefficients. I managed to extract these values (by ridiculously repeating the code). Now I need to tabulate these values (info in the columns; the rows listing results from linear models 1-10). Can anyone please help me? I have hundreds more linear models to do; I'm sure there must be a way.
Data file hosted here
Code:
d<-read.csv("example.csv",header=T)
# Subset data
A3G1 <- subset(d, CatChro=="A3G1"); A4G1 <- subset(d, CatChro=="A4G1")
A3G2 <- subset(d, CatChro=="A3G2"); A4G2 <- subset(d, CatChro=="A4G2")
A3G3 <- subset(d, CatChro=="A3G3"); A4G3 <- subset(d, CatChro=="A4G3")
A3G4 <- subset(d, CatChro=="A3G4"); A4G4 <- subset(d, CatChro=="A4G4")
A3G5 <- subset(d, CatChro=="A3G5"); A4G5 <- subset(d, CatChro=="A4G5")
A3D1 <- subset(d, CatChro=="A3D1"); A4D1 <- subset(d, CatChro=="A4D1")
A3D2 <- subset(d, CatChro=="A3D2"); A4D2 <- subset(d, CatChro=="A4D2")
A3D3 <- subset(d, CatChro=="A3D3"); A4D3 <- subset(d, CatChro=="A4D3")
A3D4 <- subset(d, CatChro=="A3D4"); A4D4 <- subset(d, CatChro=="A4D4")
A3D5 <- subset(d, CatChro=="A3D5"); A4D5 <- subset(d, CatChro=="A4D5")
# Fit individual lines
rA3G1 <- lm(Qend~Rainfall, data=A3G1); summary(rA3G1)
rA3D1 <- lm(Qend~Rainfall, data=A3D1); summary(rA3D1)
rA3G2 <- lm(Qend~Rainfall, data=A3G2); summary(rA3G2)
rA3D2 <- lm(Qend~Rainfall, data=A3D2); summary(rA3D2)
rA3G3 <- lm(Qend~Rainfall, data=A3G3); summary(rA3G3)
rA3D3 <- lm(Qend~Rainfall, data=A3D3); summary(rA3D3)
rA3G4 <- lm(Qend~Rainfall, data=A3G4); summary(rA3G4)
rA3D4 <- lm(Qend~Rainfall, data=A3D4); summary(rA3D4)
rA3G5 <- lm(Qend~Rainfall, data=A3G5); summary(rA3G5)
rA3D5 <- lm(Qend~Rainfall, data=A3D5); summary(rA3D5)
rA4G1 <- lm(Qend~Rainfall, data=A4G1); summary(rA4G1)
rA4D1 <- lm(Qend~Rainfall, data=A4D1); summary(rA4D1)
rA4G2 <- lm(Qend~Rainfall, data=A4G2); summary(rA4G2)
rA4D2 <- lm(Qend~Rainfall, data=A4D2); summary(rA4D2)
rA4G3 <- lm(Qend~Rainfall, data=A4G3); summary(rA4G3)
rA4D3 <- lm(Qend~Rainfall, data=A4D3); summary(rA4D3)
rA4G4 <- lm(Qend~Rainfall, data=A4G4); summary(rA4G4)
rA4D4 <- lm(Qend~Rainfall, data=A4D4); summary(rA4D4)
rA4G5 <- lm(Qend~Rainfall, data=A4G5); summary(rA4G5)
rA4D5 <- lm(Qend~Rainfall, data=A4D5); summary(rA4D5)
# Gradient
summary(rA3G1)$coefficients[2,1]
summary(rA3D1)$coefficients[2,1]
summary(rA3G2)$coefficients[2,1]
summary(rA3D2)$coefficients[2,1]
summary(rA3G3)$coefficients[2,1]
summary(rA3D3)$coefficients[2,1]
summary(rA3G4)$coefficients[2,1]
summary(rA3D4)$coefficients[2,1]
summary(rA3G5)$coefficients[2,1]
summary(rA3D5)$coefficients[2,1]
# Intercept
summary(rA3G1)$coefficients[2,2]
summary(rA3D1)$coefficients[2,2]
summary(rA3G2)$coefficients[2,2]
summary(rA3D2)$coefficients[2,2]
summary(rA3G3)$coefficients[2,2]
summary(rA3D3)$coefficients[2,2]
summary(rA3G4)$coefficients[2,2]
summary(rA3D4)$coefficients[2,2]
summary(rA3G5)$coefficients[2,2]
summary(rA3D5)$coefficients[2,2]
# r-sq
summary(rA3G1)$r.squared
summary(rA3D1)$r.squared
summary(rA3G2)$r.squared
summary(rA3D2)$r.squared
summary(rA3G3)$r.squared
summary(rA3D3)$r.squared
summary(rA3G4)$r.squared
summary(rA3D4)$r.squared
summary(rA3G5)$r.squared
summary(rA3D5)$r.squared
# adj r-sq
summary(rA3G1)$adj.r.squared
summary(rA3D1)$adj.r.squared
summary(rA3G2)$adj.r.squared
summary(rA3D2)$adj.r.squared
summary(rA3G3)$adj.r.squared
summary(rA3D3)$adj.r.squared
summary(rA3G4)$adj.r.squared
summary(rA3D4)$adj.r.squared
summary(rA3G5)$adj.r.squared
summary(rA3D5)$adj.r.squared
# p-level
p <- summary(rA3G1)$fstatistic
pf(p[1], p[2], p[3], lower.tail=FALSE)
p2 <- summary(rA3D1)$fstatistic
pf(p2[1], p2[2], p2[3], lower.tail=FALSE)
p3 <- summary(rA3G2)$fstatistic
pf(p3[1], p3[2], p3[3], lower.tail=FALSE)
p4 <- summary(rA3D2)$fstatistic
pf(p4[1], p4[2], p4[3], lower.tail=FALSE)
p5 <- summary(rA3G3)$fstatistic
pf(p5[1], p5[2], p5[3], lower.tail=FALSE)
p6 <- summary(rA3D3)$fstatistic
pf(p6[1], p6[2], p6[3], lower.tail=FALSE)
p7 <- summary(rA3G4)$fstatistic
pf(p7[1], p7[2], p7[3], lower.tail=FALSE)
p8 <- summary(rA3D4)$fstatistic
pf(p8[1], p8[2], p8[3], lower.tail=FALSE)
p9 <- summary(rA3G5)$fstatistic
pf(p9[1], p9[2], p9[3], lower.tail=FALSE)
p10 <- summary(rA3D5)$fstatistic
pf(p10[1], p10[2], p10[3], lower.tail=FALSE)
This is the structure of my expected outcome:
Is there any way to achieve this?
Here is a base R solution:
data <- read.csv("./data/so53933238.csv", header = TRUE)
# split by value of CatChro into a list of datasets
dataList <- split(data, data$CatChro)
# process the list with lm(), extract results to a data frame, write to a list
lmResults <- lapply(dataList, function(x){
  y <- summary(lm(Qend ~ Rainfall, data = x))
  Intercept <- y$coefficients[1,1]
  Slope <- y$coefficients[2,1]
  rSquared <- y$r.squared
  adjRSquared <- y$adj.r.squared
  f <- y$fstatistic[1]
  pValue <- pf(y$fstatistic[1], y$fstatistic[2], y$fstatistic[3], lower.tail = FALSE)
  data.frame(Slope, Intercept, rSquared, adjRSquared, pValue)
})
lmResultTable <- do.call(rbind, lmResults)
# add CatChro indicators
lmResultTable$catChro <- names(dataList)
lmResultTable
...and the output:
> lmResultTable
Slope Intercept rSquared adjRSquared pValue catChro
A3D1 0.0004085644 0.011876543 0.28069553 0.254054622 0.0031181110 A3D1
A3D2 0.0005431693 0.023601325 0.03384173 0.005425311 0.2828170556 A3D2
A3D3 0.0001451185 0.022106960 0.04285322 0.002972105 0.3102578215 A3D3
A3D4 0.0006614213 0.009301843 0.37219027 0.349768492 0.0003442445 A3D4
A3D5 0.0001084626 0.014341399 0.04411669 -0.008987936 0.3741011769 A3D5
A3G1 0.0001147645 0.024432020 0.03627553 0.011564648 0.2329519751 A3G1
A3G2 0.0004583538 0.026079409 0.06449971 0.041112205 0.1045970987 A3G2
A3G3 0.0006964512 0.043537869 0.07587433 0.054383038 0.0670399684 A3G3
A3G4 0.0006442175 0.023706652 0.17337420 0.155404076 0.0032431299 A3G4
A3G5 0.0006658466 0.025994831 0.17227383 0.150491566 0.0077413595 A3G5
>
To render the output in a tabular format in HTML, one can use knitr::kable().
library(knitr)
kable(lmResultTable[1:5],row.names=TRUE,digits=5)
...which produces the following output after rendering the Markdown:
Consider building a matrix of lm results. First, create a defined function to handle the generalized model build and results extraction for a data frame. Then call by(), which subsets your data frame by a factor column and passes each subset into the defined function. Finally, rbind all grouped matrices together for a single output.
lm_results <- function(df) {
  model <- lm(Qend ~ Rainfall, data = df)
  res <- summary(model)
  p <- res$fstatistic
  c(gradient = res$coefficients[2,1],
    intercept = res$coefficients[2,2],   # note: [2,2] is the slope's std. error; the intercept estimate is coefficients[1,1]
    r_sq = res$r.squared,
    adj_r_sq = res$adj.r.squared,
    f_stat = p[['value']],
    p_value = unname(pf(p[1], p[2], p[3], lower.tail = FALSE))
  )
}
matrix_list <- by(d, d$group, lm_results)
final_matrix <- do.call(rbind, matrix_list)
To demonstrate on random, seeded data
set.seed(12262018)
data_tools <- c("sas", "stata", "spss", "python", "r", "julia")
d <- data.frame(
  group = sample(data_tools, 500, replace = TRUE),
  int = sample(1:15, 500, replace = TRUE),
  Qend = rnorm(500) / 100,
  Rainfall = rnorm(500) * 10
)
Results
mat_list <- by(d, d$group, lm_results)
final_matrix <- do.call(rbind, mat_list)
final_matrix
# gradient intercept r_sq adj_r_sq f_stat p_value
# julia -1.407313e-04 1.203832e-04 0.017219149 0.004619395 1.3666258 0.24595273
# python -1.438116e-04 1.125170e-04 0.018641512 0.007230367 1.6336233 0.20464162
# r 2.031717e-04 1.168037e-04 0.041432175 0.027738349 3.0256098 0.08635510
# sas -1.549510e-04 9.067337e-05 0.032476668 0.021355710 2.9203121 0.09103619
# spss 9.326656e-05 1.068516e-04 0.008583473 -0.002682623 0.7618853 0.38511469
# stata -7.079514e-05 1.024010e-04 0.006013841 -0.006568262 0.4779679 0.49137093
Here it is in only a couple of lines:
library(tidyverse)
library(broom)
# create grouped dataframe:
df_g <- df %>% group_by(CatChro)
df_g %>% do(tidy(lm(Qend ~ Rainfall, data = .))) %>%
  select(CatChro, term, estimate) %>%
  spread(term, estimate) %>%
  left_join(df_g %>% do(glance(lm(Qend ~ Rainfall, data = .))) %>%
              select(CatChro, r.squared, adj.r.squared, p.value),
            by = "CatChro")
And the result will be:
# A tibble: 10 x 6
# Groups: CatChro [?]
CatChro `(Intercept)` Rainfall r.squared adj.r.squared p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A3D1 0.0119 0.000409 0.281 0.254 0.00312
2 A3D2 0.0236 0.000543 0.0338 0.00543 0.283
3 A3D3 0.0221 0.000145 0.0429 0.00297 0.310
4 A3D4 0.00930 0.000661 0.372 0.350 0.000344
5 A3D5 0.0143 0.000108 0.0441 -0.00899 0.374
6 A3G1 0.0244 0.000115 0.0363 0.0116 0.233
7 A3G2 0.0261 0.000458 0.0645 0.0411 0.105
8 A3G3 0.0435 0.000696 0.0759 0.0544 0.0670
9 A3G4 0.0237 0.000644 0.173 0.155 0.00324
10 A3G5 0.0260 0.000666 0.172 0.150 0.00774
So, how does this work?
The following creates a dataframe with all coefficients and the corresponding statistics (tidy turns the result of lm into a dataframe):
df_g %>%
  do(tidy(lm(Qend ~ Rainfall, data = .)))
A tibble: 20 x 6
Groups: CatChro [10]
CatChro term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A3D1 (Intercept) 0.0119 0.00358 3.32 0.00258
2 A3D1 Rainfall 0.000409 0.000126 3.25 0.00312
3 A3D2 (Intercept) 0.0236 0.00928 2.54 0.0157
4 A3D2 Rainfall 0.000543 0.000498 1.09 0.283
I understand that you want to have the intercept and the coefficient on Rainfall as individual columns, so let's "spread" them out. This is achieved by first selecting the relevant columns, and then invoking tidyr::spread, as in
select(CatChro, term, estimate) %>% spread(term, estimate)
This gives you:
df_g %>% do(tidy(lm(Qend ~ Rainfall, data = .))) %>%
  select(CatChro, term, estimate) %>% spread(term, estimate)
A tibble: 10 x 3
Groups: CatChro [10]
CatChro `(Intercept)` Rainfall
<chr> <dbl> <dbl>
1 A3D1 0.0119 0.000409
2 A3D2 0.0236 0.000543
3 A3D3 0.0221 0.000145
4 A3D4 0.00930 0.000661
glance() gives you the summary statistics you are looking for, one row per model. The models are indexed by group, here CatChro, so it is easy to just merge them onto the previous dataframe, which is what the rest of the code is about.
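For reference, that glance step on its own (pulled straight from the full pipeline above) is:
df_g %>% do(glance(lm(Qend ~ Rainfall, data = .))) %>%
  select(CatChro, r.squared, adj.r.squared, p.value)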
Another solution, with lme4::lmList. The summary() method for objects produced by lmList does almost everything you want (although it doesn't store p-values; that's something I had to add below).
m <- lme4::lmList(Qend ~ Rainfall | CatChro, data = d)
s <- summary(m)
pvals <- apply(s$fstatistic, 1, function(x) pf(x[1], x[2], x[3], lower.tail = FALSE))
data.frame(intercept = coef(s)[, "Estimate", "(Intercept)"],
           slope = coef(s)[, "Estimate", "Rainfall"],
           r.squared = s$r.squared,
           adj.r.squared = unlist(s$adj.r.squared),
           p.value = pvals)
Using library(data.table) you can do
d <- fread("example.csv")
d[, .(
    r2 = (fit <- summary(lm(Qend ~ Rainfall)))$r.squared,
    adj.r2 = fit$adj.r.squared,
    intercept = fit$coefficients[1,1],
    gradient = fit$coefficients[2,1],
    p.value = {p <- fit$fstatistic; pf(p[1], p[2], p[3], lower.tail = FALSE)}),
  by = CatChro]
# CatChro r2 adj.r2 intercept gradient p.value
# 1: A3G1 0.03627553 0.011564648 0.024432020 0.0001147645 0.2329519751
# 2: A3D1 0.28069553 0.254054622 0.011876543 0.0004085644 0.0031181110
# 3: A3G2 0.06449971 0.041112205 0.026079409 0.0004583538 0.1045970987
# 4: A3D2 0.03384173 0.005425311 0.023601325 0.0005431693 0.2828170556
# 5: A3G3 0.07587433 0.054383038 0.043537869 0.0006964512 0.0670399684
# 6: A3D3 0.04285322 0.002972105 0.022106960 0.0001451185 0.3102578215
# 7: A3G4 0.17337420 0.155404076 0.023706652 0.0006442175 0.0032431299
# 8: A3D4 0.37219027 0.349768492 0.009301843 0.0006614213 0.0003442445
# 9: A3G5 0.17227383 0.150491566 0.025994831 0.0006658466 0.0077413595
#10: A3D5 0.04411669 -0.008987936 0.014341399 0.0001084626 0.3741011769
