Summarizing data in a table by group for each variable in R

I have some data that I'd like to format with some summary values in R. I've played with aggregate and other things such as summaryBy, but none produced what I wanted.
Here's the data
data <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48),
x1 = c(0.2846,0.3741,0.4208,0.3756,0.3476,0.3664,0.2852,0.3537,0.3116,0.3124,0.364,0.3934,0.3456,0.3034,0.3139,0.2766,0.3034,0.3159,0.3648,0.4046,0.3961,0.3451,0.2059,0.3184,0.2481,0.3503,0.331,0.3166,0.3203,0.1868,0.245,0.1625,0.2227,0.196,0.1697,0.2064,0.1369,0.1938,0.1498,0.1315,0.1523,0.2151,0.168,0.1427,0.3083,0.301,0.2328,0.2747),
x2 = c(-0.4364,-0.5262,-0.5338,-0.5037,-0.4758,-0.5003,-0.4359,-0.5002,-0.4027,-0.424,-0.4811,-0.5492,-0.3846,-0.3899,-0.4473,-0.3688,-0.3946,-0.4112,-0.4833,-0.4909,-0.4865,-0.368,0.295,-0.3221,-0.2482,-0.5424,-0.5021,-0.4453,-0.3952,0.3915,0.4472,0.364,0.436,0.3877,0.4077,0.2737,0.3104,0.3514,0.3256,0.287,0.3126,0.3648,-0.2596,-0.1913,-0.3656,-0.4598,-0.3198,-0.3685),
x3 = c(0.6043,0.5141,0.4638,0.486,0.3691,0.4104,0.426,0.3846,0.3191,0.4347,0.5842,0.4638,0.4418,0.523,0.5009,0.4568,0.5105,0.5421,0.4857,0.4063,0.391,0.4114,0.5189,0.5248,0.4942,0.2855,0.6107,0.4712,0.2009,0.4632,0.4457,0.3914,0.4547,0.4801,0.4873,0.5501,0.4442,0.4458,0.4651,0.5748,0.5231,0.4869,0.1769,0.099,0.5013,0.4543,0.4601,0.4396),
x4 = c(0.4895,0.6991,0.6566,0.6106,0.6976,0.6883,0.6533,0.6951,0.6852,0.5062,0.5682,0.6172,0.5073,0.6514,0.577,0.5228,0.6571,0.6132,0.4893,0.7904,0.6519,0.6582,0.6919,0.6011,0.6145,0.5943,0.4608,0.5997,0.4431,0.4082,0.5641,0.4535,0.5448,0.4632,0.4237,0.6187,0.4115,0.4995,0.4504,0.4103,0.4511,0.527,0.3654,0.2537,0.6317,0.478,0.5915,0.5283),
trt = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C","C","C","C","D","D","D","D","D","D")
)
And I'd like to summarize the data in the following way (one Mean and one Std.Dev column per group A-D, with the variables as rows):
|    | A Mean     | A Std.Dev  | B Mean    | B Std.Dev | C Mean    | C Std.Dev  | D Mean     | D Std.Dev  |
|----|------------|------------|-----------|-----------|-----------|------------|------------|------------|
| X1 | 0.3456     | 0.04104    | 0.3207333 | 0.0514311 | 0.1821923 | 0.0350107  | 0.2379167  | 0.06966645 |
| X2 | -0.4674143 | 0.05489628 | -0.37406  | 0.2003379 | 0.3584308 | 0.05489583 | -0.3274333 | 0.0936547  |
| X3 | 0.4589214  | 0.07952784 | 0.45406   | 0.1036369 | 0.4778769 | 0.04866813 | 0.3552     | 0.1713025  |
| X4 | 0.6232571  | 0.0762495  | 0.5976867 | 0.0914621 | 0.4789231 | 0.06686731 | 0.4747667  | 0.1428023  |
One of the ways that I tried, using dplyr, is the following:
library(dplyr)
t(data[,2:5] %>% group_by(data$trt) %>% summarise_each(funs(mean, sd)))
but it produced output in this format:
[,1] [,2] [,3] [,4]
data$trt "A" "B" "C" "D"
x1_mean "0.3456000" "0.3207333" "0.1821923" "0.2379167"
x2_mean "-0.4674143" "-0.3740600" " 0.3584308" "-0.3274333"
x3_mean "0.4589214" "0.4540600" "0.4778769" "0.3552000"
x4_mean "0.6232571" "0.5976867" "0.4789231" "0.4747667"
x1_sd "0.04104517" "0.05143110" "0.03501070" "0.06966645"
x2_sd "0.05489628" "0.20033792" "0.05489583" "0.09365470"
x3_sd "0.07952784" "0.10363689" "0.04866813" "0.17130249"
x4_sd "0.07624950" "0.09146218" "0.06686731" "0.14280235"
Is it possible to do what I want in R?

Here's one way to do it:
library(dplyr)
library(tidyr)
data %>%
  select(-id) %>%
  gather(row, val, -trt) %>%
  group_by(trt, row) %>%
  summarise_all(funs(Mean=mean, `Std.Dev`=sd)) %>%
  gather(col, val, Mean, `Std.Dev`) %>%
  unite("col", trt, col) %>%
  spread(col, val)
# # A tibble: 4 x 9
# row A_Mean A_Std.Dev B_Mean B_Std.Dev C_Mean C_Std.Dev D_Mean D_Std.Dev
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 x1 0.346 0.0410 0.321 0.0514 0.182 0.0350 0.238 0.0697
# 2 x2 -0.467 0.0549 -0.374 0.200 0.358 0.0549 -0.327 0.0937
# 3 x3 0.459 0.0795 0.454 0.104 0.478 0.0487 0.355 0.171
# 4 x4 0.623 0.0762 0.598 0.0915 0.479 0.0669 0.475 0.143
You might add %>% tibble::column_to_rownames("row") to turn the first column into row names; note, though, that row names are discouraged for tibbles.
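Note that gather() and spread() are superseded in current tidyr. As a sketch (same data, same result), the equivalent pipeline with pivot_longer()/pivot_wider() would be:
library(dplyr)
library(tidyr)
data %>%
  select(-id) %>%
  pivot_longer(-trt, names_to = "row") %>%
  group_by(trt, row) %>%
  summarise(Mean = mean(value), Std.Dev = sd(value), .groups = "drop") %>%
  pivot_longer(c(Mean, Std.Dev), names_to = "col") %>%
  unite("col", trt, col) %>%
  pivot_wider(names_from = col, values_from = value)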

Here is a way to do it using base R and aggregate
apply(data[,2:5], 2, function(x) aggregate(x, by=list(data$trt), FUN=summary))
$x1
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.2846 0.3118 0.3506 0.3456 0.3722 0.4208
2 B 0.2059 0.3086 0.3184 0.3207 0.3477 0.4046
3 C 0.1315 0.1523 0.1868 0.1822 0.2064 0.2450
4 D 0.1427 0.1842 0.2538 0.2379 0.2944 0.3083
$x2
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A -0.5492 -0.5028 -0.4784 -0.4674 -0.4270 -0.3846
2 B -0.5424 -0.4849 -0.4112 -0.3741 -0.3684 0.2950
3 C 0.2737 0.3126 0.3640 0.3584 0.3915 0.4472
4 D -0.4598 -0.3678 -0.3427 -0.3274 -0.2746 -0.1913
$x3
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.3191 0.4143 0.4528 0.4589 0.5071 0.6043
2 B 0.2009 0.4088 0.4857 0.4541 0.5147 0.6107
3 C 0.3914 0.4458 0.4651 0.4779 0.4873 0.5748
4 D 0.0990 0.2426 0.4470 0.3552 0.4586 0.5013
$x4
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.4895 0.5788 0.6524 0.6233 0.6875 0.6991
2 B 0.4431 0.5499 0.6011 0.5977 0.6545 0.7904
3 C 0.4082 0.4237 0.4535 0.4789 0.5270 0.6187
4 D 0.2537 0.3936 0.5032 0.4748 0.5757 0.6317
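If you only want the Mean/Std.Dev layout from the question rather than the full six-number summary, here is a base R sketch along the same lines (it reuses data from above; the columns come out grouped by statistic rather than interleaved):
m <- aggregate(. ~ trt, data = data[, -1], FUN = mean)   # group means
s <- aggregate(. ~ trt, data = data[, -1], FUN = sd)     # group standard deviations
out <- cbind(t(m[, -1]), t(s[, -1]))                     # variables as rows, groups as columns
colnames(out) <- c(paste0(m$trt, "_Mean"), paste0(s$trt, "_Std.Dev"))
out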

Related

How does gtsummary produce confidence intervals and standard error statistics for glm models? (Code Examples Included)

Want to preface this with heaps of appreciation for gtsummary -- wonderful package.
After using tidymodels, GLM, and gtsummary for a while, I've been trying to understand gtsummary's computations for GLM model performance and confidence intervals.
Can anyone (and/or Dr. Sjoberg and the gtsummary team) explain the following two questions?
Question 1: Why are standard errors different when using broom::tidy() vs. parameters::model_parameters() to extract the model's coefficient statistics?
(Bolded text in the printouts shows the differences.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
library(dplyr)  # for %>% and select()
library(tidyr)  # for drop_na()
trial2 <- trial %>% select(age, grade, response, trt) %>%
drop_na()
model_trial2 <- glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit"))
broom::tidy(model_trial2, exponentiate = TRUE)
# # A tibble: 5 × 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 0.184 **0.630** -2.69 0.00715
# 2 age 1.02 0.0114 1.67 0.0952
# 3 gradeII 0.852 **0.395** -0.406 0.685
# 4 gradeIII 1.01 0.385 0.0199 0.984
# 5 trtDrug B 1.13 **0.321** 0.387 0.699
preadmission_model_parameters <- model_trial2 %>% parameters::model_parameters(exponentiate = TRUE)
preadmission_model_parameters
# Parameter | Odds Ratio | SE | 95% CI | z | p
# ---------------------------------------------------------------
# (Intercept) | 0.18 | **0.12** | [0.05, 0.61] | -2.69 | 0.007
# age | 1.02 | 0.01 | [1.00, 1.04] | 1.67 | 0.095
# grade [II] | 0.85 | **0.34** | [0.39, 1.85] | -0.41 | 0.685
# grade [III] | 1.01 | 0.39 | [0.47, 2.15] | 0.02 | 0.984
# trt [Drug B] | 1.13 | **0.36** | [0.60, 2.13] | 0.39 | 0.699
Question 2: (a) What method does gtsummary use to produce confidence intervals? (b) can the user define (stratified or unstratified) k-fold cross-validation or bootstraps to produce confidence intervals?
(Bolded text shows where the bootstrapped confidence intervals from reg_intervals() differ from the tbl_regression() confidence intervals, whose method is unknown to me.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
library(dplyr)  # for %>% and select()
library(tidyr)  # for drop_na()
trial2 <- trial %>% select(age, grade, response, trt) %>%
drop_na()
bootstraps(trial2, times = 10)
trial_bootrapped_confidence_intervals <- reg_intervals(response ~ age + grade + trt,
data = trial2,
model_fn = "glm",
keep_reps = TRUE,
family=binomial(link="logit"))
trial_bootrapped_confidence_intervals_exp <- trial_bootrapped_confidence_intervals %>%
select(term:.alpha) %>%
mutate(across(.cols = c(.lower, .estimate, .upper), ~exp(.))) %>%
as_tibble()
trial_bootrapped_confidence_intervals_exp
# # A tibble: 4 × 5
# term .lower .estimate .upper .alpha
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 age 0.997 1.02 1.04 0.05
# 2 gradeII **0.400** 0.846 **1.86** 0.05
# 3 gradeIII 0.473 1.01 2.10 0.05
# 4 trtDrug B 0.600 1.14 2.22 0.05
model_trial2_tbl_regression <-
glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit")) %>%
tbl_regression(
exponentiate = T
) %>%
add_global_p()
model_trial2_tbl_regression_metrics <- model_trial2_tbl_regression$table_body %>%
select(
label,
estimate,
std.error,
statistic,
conf.low ,
conf.high,
p.value
)
model_trial2_tbl_regression_metrics
# A tibble: 8 × 7
# label estimate std.error statistic conf.low conf.high p.value
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Age 1.02 0.0114 1.67 0.997 1.04 0.0909
# 2 Grade NA NA NA NA NA 0.894
# 3 I NA NA NA NA NA NA
# 4 II 0.852 0.395 -0.406 **0.389** **1.85** NA
# 5 III 1.01 0.385 0.0199 0.472 2.15 NA
# 6 Chemotherapy Treatment NA NA NA NA NA 0.699
# 7 Drug A NA NA NA NA NA NA
# 8 Drug B 1.13 0.321 0.387 0.603 2.13 NA
The issue is with the exponentiation (applied since the family is binomial): broom::tidy() does not exponentiate the standard errors, but parameters::model_parameters() does. You can see this by comparing broom::tidy(model_trial2, exponentiate = TRUE) with broom::tidy(model_trial2, exponentiate = FALSE), which return the same standard errors, whereas parameters::model_parameters(exponentiate = TRUE) and parameters::model_parameters(exponentiate = FALSE) return different ones. With exponentiate = FALSE, the standard errors from the two packages match. This is discussed in the broom issue "Check exponentiate behavior in tidy methods" (#422).
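To see this numerically, here is a small check of the delta-method relationship SE(exp(b)) = exp(b) * SE(b), reusing model_trial2 from above (the check itself is mine, not taken from either package):
tidy_log <- broom::tidy(model_trial2, exponentiate = FALSE)
data.frame(term = tidy_log$term,
           se_tidy  = tidy_log$std.error,                           # what broom reports either way
           se_delta = exp(tidy_log$estimate) * tidy_log$std.error)  # reproduces parameters' exponentiated SE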
To create a custom tidier for gtsummary, see FAQ + Gallery

Tidy way to get `summary` output per group?

My code frequently uses tapply and summary as shown below:
library(tibble)
data <- tibble(
  year = rep(2018:2021, 3),
  x = runif(length(year))
)
tapply(data$x, data$year, summary)
The output looks like:
$`2018`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3914 0.5696 0.7477 0.6668 0.8045 0.8614
$`2019`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1910 0.2863 0.3816 0.4179 0.5313 0.6809
(etc.)
Is there a way to get such summary-like output in a tibble?
Desired output, using ugly code:
library(purrr)
library(tibble)
tapply(data$x, data$year, summary) %>%
  map(~ as.numeric(round(.x, 2))) %>%
  map_dfr(set_names, names(summary(1))) %>%
  add_column(year = 2018:2021, .before = 1)
# A tibble: 4 x 7
year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018 0.39 0.570 0.75 0.67 0.8 0.86
2 2019 0.19 0.290 0.38 0.42 0.53 0.68
3 2020 0.01 0.35 0.7 0.55 0.82 0.93
4 2021 0.06 0.15 0.24 0.32 0.45 0.66
I'm hoping that there is a nice combination of dplyr functions to do that better -- my code to get the desired output is hacky.
Of course, I'm hoping not to have to rewrite base R's summary function, as below:
summarise(`Min` = min(x), `1st Qu.` = quantile(x, 0.25), ...)
Here is a concise tidyverse way.
library(dplyr)
library(purrr)
library(tidyr)
data %>%
nest_by(year) %>%
mutate(data = map(data, summary)) %>%
unnest_wider(data)
# # A tibble: 4 x 7
# year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2018 0.105 0.256 0.407 0.307 0.407 0.407
# 2 2019 0.0354 0.205 0.375 0.313 0.452 0.529
# 3 2020 0.272 0.467 0.662 0.546 0.684 0.705
# 4 2021 0.00564 0.107 0.208 0.252 0.375 0.542
You can also just convert the table output from your original line. Note that here it converted year to character, so you would probably want to change that back.
library(purrr)
tapply(data$x, data$year, summary) %>%
map_dfr(c, .id = "year")
# # A tibble: 4 x 7
# year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2018 0.105 0.256 0.407 0.307 0.407 0.407
# 2 2019 0.0354 0.205 0.375 0.313 0.452 0.529
# 3 2020 0.272 0.467 0.662 0.546 0.684 0.705
# 4 2021 0.00564 0.107 0.208 0.252 0.375 0.542
Base R solution
Try with by followed by do.call/rbind.
do.call(rbind, by(data$x, data$year, summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#2018 0.45126737 0.5437956 0.6363238 0.6343376 0.7258727 0.8154215
#2019 0.70134602 0.7425629 0.7837798 0.8227042 0.8833833 0.9829869
#2020 0.02726706 0.3338530 0.6404389 0.4591294 0.6750606 0.7096822
#2021 0.26667973 0.3242120 0.3817443 0.4953048 0.6096173 0.8374904
This returns a "matrix":
class(do.call(rbind, by(data$x, data$year, summary)))
#[1] "matrix" "array"
To get a "data.frame", coerce the return value afterwards; don't use rbind.data.frame, as it will lose the column names.
smry <- do.call(rbind, by(data$x, data$year, summary))
as.data.frame(smry)
dplyr solution.
A dplyr and purrr solution could be the following. Note that it doesn't round; instead it coerces the columns returned by map_dfr, which are of class "table", to numeric.
library(purrr)
library(dplyr)
tapply(data$x, data$year, summary)%>%
map_dfr(set_names, names(summary(1))) %>%
mutate(across(everything(), as.numeric))
## A tibble: 4 x 6
# Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0.451 0.544 0.636 0.634 0.726 0.815
#2 0.701 0.743 0.784 0.823 0.883 0.983
#3 0.0273 0.334 0.640 0.459 0.675 0.710
#4 0.267 0.324 0.382 0.495 0.610 0.837
Another possible tidyverse solution. Same basic idea as Rui's solution above, but a little more verbose since it uses nest() and unnest() before pivoting back to wide data.
library(tidyverse)
data <- tibble(
year = rep(2018:2021, 3),
x = runif(length(year))
)
df_summary <- data %>%
nest_by(year) %>%
mutate(
summary = map(data, ~list(summary(.x))),
df = map(summary, ~data.frame(names = names(.x), values = c(.x))),
) %>%
unnest(df) %>%
select(-data, -summary) %>%
pivot_wider(names_from = names, values_from = values)
year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018 0.204 0.351 0.498 0.538 0.705 0.912
2 2019 0.548 0.673 0.798 0.767 0.877 0.956
3 2020 0.228 0.416 0.604 0.604 0.792 0.980
4 2021 0.240 0.314 0.388 0.357 0.416 0.443
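With dplyr >= 1.0 you can also unpack the named vector that summary() returns directly inside summarise(); a short sketch, reusing data from above:
library(dplyr)
library(tibble)
data %>%
  group_by(year) %>%
  summarise(as_tibble_row(c(summary(x))), .groups = "drop")
Here c() strips the summaryDefault class, so as_tibble_row() sees a plain named numeric vector and turns it into one row of columns.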

Extracting final p-value statistic from an lm lapply loop with multiple models

I have the following code that automatically performs lm between my independent variable (Kpl) and all my other dependent variables (Y1, Y2, ...., Yi):
linear_summary <- lapply(testdata[,-1], function(x) summary(lm(Kpl ~ x)))
The output for this is:
$Y1
Call:
lm(formula = Kpl ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.37567 -0.52392 0.04236 0.67444 0.81316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7282 0.3456 5.001 0.000402 ***
x -0.1550 0.2712 -0.571 0.579196
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.772 on 11 degrees of freedom
Multiple R-squared: 0.02883, Adjusted R-squared: -0.05946
F-statistic: 0.3265 on 1 and 11 DF, p-value: 0.5792
$Y2
Call:
lm(formula = Kpl ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.2472 -0.4236 -0.2057 0.7140 1.0348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6900 0.9010 0.766 0.460
x 0.8832 0.8767 1.007 0.335
Residual standard error: 0.7495 on 11 degrees of freedom
Multiple R-squared: 0.08447, Adjusted R-squared: 0.001238
F-statistic: 1.015 on 1 and 11 DF, p-value: 0.3354
Etc. (I have truncated it for just the first 2 correlations)
I wanted to extract the final p-value for the whole model for each of the instances (0.5792 and 0.3354 in these two cases). Ideally this would come in some sort of table form with the associated response variable, i.e. Y1 = 0.5792, Y2 = 0.3354.
Most of the info I can find either seems to work only for a single model (as opposed to an lapply over multiple models), or I cannot get it to work, which could be a problem with my original code.
Any suggestions for a person just starting with R on how to solve this?
Edit: The data looks something like this
| X | Y1 | Y2 | Y3 | Y4 |
| -------- | ------------|-------------|-------------|-------------|
| 0.33767 | 2.33063062 | 1.013212308 | 1.277996888 | 1.373238355 |
| 0.33767 | 0.095967324 | 0.508830529 | 0.789257027 | 0.815877121 |
| 1.010474 | 2.344657045 | 0.842490752 | 1.240582283 | 1.262360905 |
| 1.010474 | 0.08135992 | 0.912535398 | 0.384427466 | 0.409817599 |
| 1.183276 | 0.135626937 | 0.967877981 | 0.505801442 | 0.576288093 |
| 1.536974 | 1.507146148 | 1.428839993 | 1.316569449 | 1.392022619 |
| 1.536974 | 1.255210981 | 1.191822955 | 1.395769591 | 1.41903939 |
| 2.017965 | 1.410299711 | 1.121560244 | 1.369835675 | 1.385143026 |
| 2.017965 | 1.032587109 | 1.372235121 | 1.390878783 | 1.42741762 |
| 2.3436 | 1.275999998 | 0.930400789 | 1.19877482 | 1.217540034 |
| 2.3436 | 1.250513383 | 1.063880146 | 1.206719195 | 1.23325973 |
| 2.387598 | 0.182866909 | 0.89588293 | 0.416923749 | 0.45364797 |
| 2.387598 | 0.097133916 | 0.750430855 | 0.506463633 | 0.03434754 |
These are the actual values that I used to get the correlations above
I think the p-value is not stored; you need to calculate it from the fstatistic component of the summary, maybe something like this:
set.seed(111)
testdata = data.frame(Kpl = rnorm(100), Y1 = rnorm(100),
Y2 = rnorm(100), Y3 = rnorm(100))
IV = colnames(testdata)[-1]
DV = "Kpl"
linear_summary <- lapply(IV,function(x){
summary(lm(reformulate(response=DV,termlabels=x),data=testdata))
})
names(linear_summary) = IV
tab = lapply(IV,function(x){
p = with(
linear_summary[[x]],
pf(fstatistic[1],fstatistic[2],fstatistic[3],lower.tail=FALSE)
)
data.frame(IV = x, p = p)
})
do.call(rbind,tab)
IV p
value Y1 0.5757187
value1 Y2 0.4922582
value2 Y3 0.4009439
Check, for example, the first summary:
linear_summary[[1]]
Call:
lm(formula = reformulate(response = DV, termlabels = x), data = testdata)
Residuals:
Min 1Q Median 3Q Max
-2.94515 -0.73325 0.05448 0.57901 2.76026
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01382 0.10747 -0.129 0.898
Y1 -0.05950 0.10597 -0.562 0.576
Residual standard error: 1.075 on 98 degrees of freedom
Multiple R-squared: 0.003207, Adjusted R-squared: -0.006964
F-statistic: 0.3153 on 1 and 98 DF, p-value: 0.5757
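The same calculation can be written more compactly with sapply, reusing linear_summary from above:
sapply(linear_summary, function(s)
  pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE))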
OK, I edited my code in the following way:
library(purrr)
library(dplyr)
library(broom)
library(tidyr)
# df holds the example data from the question (columns X, Y1..Y4)
df %>% # Solution 1
  pivot_longer(-X) %>%
  group_split(name) %>%
  set_names(nm = map(., ~ first(.x$name))) %>%
  map(~ tidy(lm(X ~ value, data = .))) %>%
  bind_rows(.id = "var") %>%
  filter(term == "value")
# A tibble: 4 x 6
var term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Y1 value -0.155 0.271 -0.571 0.579
2 Y2 value 0.883 0.877 1.01 0.335
3 Y3 value 0.0341 0.552 0.0618 0.952
4 Y4 value -0.158 0.469 -0.337 0.743
Or you can use this:
df %>% # Solution 2
pivot_longer(Y1:Y4) %>%
group_by(name) %>%
arrange(.by_group = TRUE) %>%
nest() %>%
mutate(models = map(data, ~ lm(X ~ value, data = .)),
glance = map(models, glance)) %>%
unnest(glance)
# A tibble: 4 x 15
# Groups: name [4]
name data models r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<chr> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Y1 <tibbl~ <lm> 0.0288 -0.0595 0.772 0.327 0.579 1 -14.0 34.0 35.7
2 Y2 <tibbl~ <lm> 0.0845 0.00124 0.750 1.01 0.335 1 -13.6 33.2 34.9
3 Y3 <tibbl~ <lm> 0.000348 -0.0905 0.783 0.00382 0.952 1 -14.2 34.4 36.1
4 Y4 <tibbl~ <lm> 0.0102 -0.0798 0.779 0.113 0.743 1 -14.1 34.2 35.9
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
I know you already got your answer, but here I presented two other solutions. I thought it might be useful to see alternative ways of dealing with a problem. Thank you for your question; it was very good.

Reproducing a result from R in Stata - Telling R or Stata to remove the same variables causing perfect collinearity/singularities

I am trying to reproduce a result from R in Stata (Please note that the data below is fictitious and serves just as an example). For some reason however, Stata appears to deal with certain issues differently than R. It chooses different dummy variables to kick out in case of multicollinearity.
I have posted a related question dealing with the statistical implications of these country-year dummies being removed here.
In the example below, R kicks out three dummies, while Stata kicks out two, leading to a different result. Check, for example, the coefficients and p-values for vote and votewon.
In essence, all I want to know is how to tell either R or Stata which variables to kick out, so that they both do the same.
Data
The data looks as follows:
library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)
data('mroz')
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
DT <- data.table( country = rep(sample(country, length(mroz), replace = T), each = n),
year = c(replicate(length(mroz), sample(year, n))))
x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT) ; DT <- rbind(DT, DT); DT <- rbind(DT, x)
mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
rm(mroz,x, country, year)
The lm regression in R
summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data=DT))
Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
data = DT)
Residuals:
Min 1Q Median 3Q Max
-7.450 -0.805 -0.268 0.954 5.332
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.170064 0.418578 26.69 < 0.0000000000000002 ***
exper 0.103880 0.029912 3.47 0.00055 ***
I(exper^2) -0.002965 0.000966 -3.07 0.00222 **
vote 0.576865 0.504540 1.14 0.25327
votewon 0.622522 0.636241 0.98 0.32818
countryA:as.factor(year)2005 -0.196348 0.503245 -0.39 0.69653
countryB:as.factor(year)2005 -0.530681 0.616653 -0.86 0.38975
countryC:as.factor(year)2005 0.650166 0.552019 1.18 0.23926
countryD:as.factor(year)2005 -0.515195 0.638060 -0.81 0.41968
countryE:as.factor(year)2005 0.731681 0.502807 1.46 0.14605
countryG:as.factor(year)2005 0.213345 0.674642 0.32 0.75192
countryH:as.factor(year)2005 -0.811374 0.637254 -1.27 0.20334
countryI:as.factor(year)2005 0.584787 0.503606 1.16 0.24594
countryJ:as.factor(year)2005 0.554397 0.674789 0.82 0.41158
countryA:as.factor(year)2010 0.388603 0.503358 0.77 0.44035
countryB:as.factor(year)2010 -0.727834 0.617210 -1.18 0.23869
countryC:as.factor(year)2010 -0.308601 0.504041 -0.61 0.54056
countryD:as.factor(year)2010 0.785603 0.503165 1.56 0.11888
countryE:as.factor(year)2010 0.280305 0.452293 0.62 0.53562
countryG:as.factor(year)2010 0.672074 0.674721 1.00 0.31954
countryH:as.factor(year)2010 NA NA NA NA
countryI:as.factor(year)2010 NA NA NA NA
countryJ:as.factor(year)2010 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared: 0.037, Adjusted R-squared: 0.0119
F-statistic: 1.47 on 19 and 728 DF, p-value: 0.0882
Same regression in Stata
write.dta(DT, "C:/Users/.../mroz_adapted.dta")
encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year
note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity
Source | SS df MS Number of obs = 748
-------------+---------------------------------- F(21, 726) = 1.80
Model | 192.989406 21 9.18997171 Prob > F = 0.0154
Residual | 3705.47583 726 5.1039612 R-squared = 0.0495
-------------+---------------------------------- Adj R-squared = 0.0220
Total | 3898.46524 747 5.21882897 Root MSE = 2.2592
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .1109858 .0297829 3.73 0.000 .052515 .1694567
|
c.exper#c.exper | -.0031891 .000963 -3.31 0.001 -.0050796 -.0012986
|
vote | .0697273 .4477115 0.16 0.876 -.8092365 .9486911
votewon | -.0147825 .6329659 -0.02 0.981 -1.257445 1.227879
|
n_country#year |
A#2010 | .0858634 .4475956 0.19 0.848 -.7928728 .9645997
B#2005 | -.4950677 .5003744 -0.99 0.323 -1.477421 .4872858
B#2010 | .0951657 .5010335 0.19 0.849 -.8884818 1.078813
C#2005 | -.5162827 .447755 -1.15 0.249 -1.395332 .3627664
C#2010 | -.0151834 .4478624 -0.03 0.973 -.8944434 .8640767
D#2005 | .3664596 .5008503 0.73 0.465 -.6168283 1.349747
D#2010 | .5119858 .500727 1.02 0.307 -.4710599 1.495031
E#2005 | .5837942 .6717616 0.87 0.385 -.7350329 1.902621
E#2010 | .185601 .5010855 0.37 0.711 -.7981486 1.169351
F#2005 | .5987978 .6333009 0.95 0.345 -.6445219 1.842117
F#2010 | .4853639 .7763936 0.63 0.532 -1.038881 2.009608
G#2005 | -.3341302 .6328998 -0.53 0.598 -1.576663 .9084021
G#2010 | .2873193 .6334566 0.45 0.650 -.956306 1.530945
H#2005 | -.4365233 .4195984 -1.04 0.299 -1.260294 .3872479
H#2010 | -.1683725 .6134262 -0.27 0.784 -1.372673 1.035928
I#2005 | -.39264 .7755549 -0.51 0.613 -1.915238 1.129958
I#2010 | 0 (omitted)
J#2005 | 1.036108 .4476018 2.31 0.021 .1573591 1.914856
J#2010 | 0 (omitted)
|
_cons | 11.58369 .350721 33.03 0.000 10.89514 12.27224
---------------------------------------------------------------------------------
Just to address your question about which "variables to kick out": I guess you meant which combination of interaction terms is used as the reference group for calculating the regression coefficients.
By default, Stata uses the combination of the lowest values of two variables as the reference while R uses the highest values of two variables as the reference. I use Stata auto data to demonstrate this:
# In R
webuse::webuse("auto")
auto$foreign = as.factor(auto$foreign)
auto$rep78 = as.factor(auto$rep78)
# Model
r_model <- lm(mpg ~ rep78:foreign, data=auto)
broom::tidy(r_model)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.3 1.65 15.9 2.09e-23
2 rep781:foreign0 -5.33 3.88 -1.38 1.74e- 1
3 rep782:foreign0 -7.21 2.41 -2.99 4.01e- 3
4 rep783:foreign0 -7.33 1.91 -3.84 2.94e- 4
5 rep784:foreign0 -7.89 2.34 -3.37 1.29e- 3
6 rep785:foreign0 5.67 3.88 1.46 1.49e- 1
7 rep781:foreign1 NA NA NA NA
8 rep782:foreign1 NA NA NA NA
9 rep783:foreign1 -3.00 3.31 -0.907 3.68e- 1
10 rep784:foreign1 -1.44 2.34 -0.618 5.39e- 1
11 rep785:foreign1 NA NA NA NA
In Stata:
. reg mpg i.foreign#i.rep78
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | -1.875 3.921166 -0.48 0.634 -9.715855 5.965855
Domestic#3 | -2 3.634773 -0.55 0.584 -9.268178 5.268178
Domestic#4 | -2.555556 3.877352 -0.66 0.512 -10.3088 5.19769
Domestic#5 | 11 4.959926 2.22 0.030 1.082015 20.91798
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 2.333333 4.527772 0.52 0.608 -6.720507 11.38717
Foreign#4 | 3.888889 3.877352 1.00 0.320 -3.864357 11.64213
Foreign#5 | 5.333333 3.877352 1.38 0.174 -2.419912 13.08658
|
_cons | 21 3.507197 5.99 0.000 13.98693 28.01307
-------------------------------------------------------------------------------
To reproduce the previous R results in Stata, we could recode those two variables, foreign and rep78, into foreign2 and rep2 with the level order reversed (the recode itself is not shown):
. reg mpg i.foreign2#i.rep2
note: 0b.foreign2#1.rep2 identifies no observations in the sample
note: 0b.foreign2#2.rep2 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign2#rep2 |
0 1 | 0 (empty)
0 2 | 0 (empty)
0 3 | -3 3.306617 -0.91 0.368 -9.61199 3.61199
0 4 | -1.444444 2.338132 -0.62 0.539 -6.119827 3.230938
1 0 | 5.666667 3.877352 1.46 0.149 -2.086579 13.41991
1 1 | -5.333333 3.877352 -1.38 0.174 -13.08658 2.419912
1 2 | -7.208333 2.410091 -2.99 0.004 -12.02761 -2.389059
1 3 | -7.333333 1.909076 -3.84 0.000 -11.15077 -3.515899
1 4 | -7.888889 2.338132 -3.37 0.001 -12.56427 -3.213506
|
_cons | 26.33333 1.653309 15.93 0.000 23.02734 29.63933
-------------------------------------------------------------------------------
The same approach applies to reproduce Stata results in R, just redefine levels of those two factor variables.
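For example, on the R side, a sketch with the auto data from above: reversing the factor level order changes which aliased cells R drops.
auto$rep78 <- factor(auto$rep78, levels = rev(levels(auto$rep78)))       # highest category first
auto$foreign <- factor(auto$foreign, levels = rev(levels(auto$foreign)))
summary(lm(mpg ~ rep78:foreign, data = auto))                            # different cells are now aliased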

Change the order in which summary functions are printed by skim

I'm using skimr, and I added two summary functions (iqr_na_rm and median_na_rm) to the list of summary functions for the function skim. However, by default these new summary functions (called skimmers in skimr documentation) appear at the end of the table. Instead, I'd like median and iqr to appear after mean and sd.
The final goal is to show the results in a .Rmd report like this:
---
title: "Test"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
message = FALSE,
echo = FALSE)
```
## Test
```{r test, results = 'asis'}
library(skimr)
library(dplyr)
library(ggplot2)
iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)
skim_with(numeric = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm),
integer = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm))
msleep %>%
group_by(vore) %>%
skim(sleep_total) %>%
kable()
```
In the rendered HTML, median and iqr are printed at the end of the table, after the sparkline histogram. I'd like them to be printed after sd and before p0. Is it possible?
There are two parts in the skim() output. If you want to control the numeric part, you can use skim_to_list(), like this. It also makes it easier to export in another format.
msleep %>%
group_by(vore) %>%
skim_to_list(sleep_total)%>%
.[["numeric"]]%>%
dplyr::select(vore,variable,missing,complete,n,mean,sd,
median,iqr,p0,p25,p75,p100,hist)
# A tibble: 5 x 14
vore variable missing complete n mean sd median iqr p0 p25 p75 p100 hist
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 carni sleep_total 0 19 19 10.38 4.67 10.4 " 6.75" 2.7 6.25 "13 " 19.4 ▃▇▂▇▆▃▂▃
2 herbi sleep_total 0 32 32 " 9.51" 4.88 10.3 " 9.92" 1.9 "4.3 " 14.22 16.6 ▆▇▁▂▂▆▇▅
3 insecti sleep_total 0 5 5 14.94 5.92 18.1 "11.1 " 8.4 "8.6 " "19.7 " 19.9 ▇▁▁▁▁▁▃▇
4 omni sleep_total 0 20 20 10.93 2.95 " 9.9" " 1.83" "8 " "9.1 " 10.93 "18 " ▆▇▂▁▁▁▁▂
5 NA sleep_total 0 7 7 10.19 "3 " 10.6 " 3.5 " 5.4 8.65 12.15 13.7 ▃▃▁▁▃▇▁▇
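Note that the skim_to_list() output columns are character; a sketch converting the numeric ones back (assumes dplyr >= 1.0 for across()):
msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist) %>%
  dplyr::mutate(dplyr::across(missing:p100, as.numeric))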
EDIT
Adding kable() as requested in comment.
msleep %>%
group_by(vore) %>%
skim_to_list(sleep_total)%>%
.[["numeric"]]%>%
dplyr::select(vore,variable,missing,complete,n,mean,sd,median,iqr,p0,p25,p75,p100,hist)%>%
kable()
| vore | variable | missing | complete | n | mean | sd | median | iqr | p0 | p25 | p75 | p100 | hist |
|---------|-------------|---------|----------|----|-------|------|--------|------|-----|------|-------|------|----------|
| carni | sleep_total | 0 | 19 | 19 | 10.38 | 4.67 | 10.4 | 6.75 | 2.7 | 6.25 | 13 | 19.4 | ▃▇▂▇▆▃▂▃ |
| herbi | sleep_total | 0 | 32 | 32 | 9.51 | 4.88 | 10.3 | 9.92 | 1.9 | 4.3 | 14.22 | 16.6 | ▆▇▁▂▂▆▇▅ |
| insecti | sleep_total | 0 | 5 | 5 | 14.94 | 5.92 | 18.1 | 11.1 | 8.4 | 8.6 | 19.7 | 19.9 | ▇▁▁▁▁▁▃▇ |
| omni | sleep_total | 0 | 20 | 20 | 10.93 | 2.95 | 9.9 | 1.83 | 8 | 9.1 | 10.93 | 18 | ▆▇▂▁▁▁▁▂ |
| NA | sleep_total | 0 | 7 | 7 | 10.19 | 3 | 10.6 | 3.5 | 5.4 | 8.65 | 12.15 | 13.7 | ▃▃▁▁▃▇▁▇ |
Here's another approach that uses the append = FALSE argument.
library(skimr)
library(dplyr)
library(ggplot2)
iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)
my_skimmers <- list(n = length, missing = n_missing, complete = n_complete,
mean = mean.default, sd = purrr::partial(sd, na.rm = TRUE),
median = median_na_rm, iqr = iqr_na_rm
)
skim_with(numeric = my_skimmers,
integer = my_skimmers, append = FALSE)
msleep %>%
group_by(vore) %>%
skim(sleep_total) %>%
kable()
I didn't put all the stats but you can look in the functions.R and stats.R files to see how the various statistics are defined.
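One caveat: skim_with(..., append = FALSE) replaces the default skimmers for the rest of the session. In skimr v1 (the API used here) you can restore them afterwards with the reset helper:
skim_with_defaults()   # restore skimr's default summary functions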
