Summarizing data in a table by group for each variable in R
I have some data that I'd like to format properly with some summary values in R. I've played with aggregate and other functions such as summaryBy, but none produced what I wanted.
Here's the data:
data <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48),
x1 = c(0.2846,0.3741,0.4208,0.3756,0.3476,0.3664,0.2852,0.3537,0.3116,0.3124,0.364,0.3934,0.3456,0.3034,0.3139,0.2766,0.3034,0.3159,0.3648,0.4046,0.3961,0.3451,0.2059,0.3184,0.2481,0.3503,0.331,0.3166,0.3203,0.1868,0.245,0.1625,0.2227,0.196,0.1697,0.2064,0.1369,0.1938,0.1498,0.1315,0.1523,0.2151,0.168,0.1427,0.3083,0.301,0.2328,0.2747),
x2 = c(-0.4364,-0.5262,-0.5338,-0.5037,-0.4758,-0.5003,-0.4359,-0.5002,-0.4027,-0.424,-0.4811,-0.5492,-0.3846,-0.3899,-0.4473,-0.3688,-0.3946,-0.4112,-0.4833,-0.4909,-0.4865,-0.368,0.295,-0.3221,-0.2482,-0.5424,-0.5021,-0.4453,-0.3952,0.3915,0.4472,0.364,0.436,0.3877,0.4077,0.2737,0.3104,0.3514,0.3256,0.287,0.3126,0.3648,-0.2596,-0.1913,-0.3656,-0.4598,-0.3198,-0.3685),
x3 = c(0.6043,0.5141,0.4638,0.486,0.3691,0.4104,0.426,0.3846,0.3191,0.4347,0.5842,0.4638,0.4418,0.523,0.5009,0.4568,0.5105,0.5421,0.4857,0.4063,0.391,0.4114,0.5189,0.5248,0.4942,0.2855,0.6107,0.4712,0.2009,0.4632,0.4457,0.3914,0.4547,0.4801,0.4873,0.5501,0.4442,0.4458,0.4651,0.5748,0.5231,0.4869,0.1769,0.099,0.5013,0.4543,0.4601,0.4396),
x4 = c(0.4895,0.6991,0.6566,0.6106,0.6976,0.6883,0.6533,0.6951,0.6852,0.5062,0.5682,0.6172,0.5073,0.6514,0.577,0.5228,0.6571,0.6132,0.4893,0.7904,0.6519,0.6582,0.6919,0.6011,0.6145,0.5943,0.4608,0.5997,0.4431,0.4082,0.5641,0.4535,0.5448,0.4632,0.4237,0.6187,0.4115,0.4995,0.4504,0.4103,0.4511,0.527,0.3654,0.2537,0.6317,0.478,0.5915,0.5283),
trt = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C","C","C","C","D","D","D","D","D","D")
)
And I'd like to summarize the data in the following way.
     |            A            |            B            |            C            |            D
     |    Mean    |  Std.Dev   |    Mean    |  Std.Dev   |    Mean    |  Std.Dev   |    Mean    |  Std.Dev
-----+------------+------------+------------+------------+------------+------------+------------+------------
  X1 |  0.3456    | 0.04104    |  0.3207333 | 0.0514311  |  0.1821923 | 0.0350107  |  0.2379167 | 0.06966645
  X2 | -0.4674143 | 0.05489628 | -0.37406   | 0.2003379  |  0.3584308 | 0.05489583 | -0.3274333 | 0.0936547
  X3 |  0.4589214 | 0.07952784 |  0.45406   | 0.1036369  |  0.4778769 | 0.04866813 |  0.3552    | 0.1713025
  X4 |  0.6232571 | 0.0762495  |  0.5976867 | 0.0914621  |  0.4789231 | 0.06686731 |  0.4747667 | 0.1428023
One of the ways I tried, using dplyr, is the following:
library(dplyr)
t(data[,2:5] %>% group_by(data$trt) %>% summarise_each(funs(mean, sd)))
but it produced output in this format:
[,1] [,2] [,3] [,4]
data$trt "A" "B" "C" "D"
x1_mean "0.3456000" "0.3207333" "0.1821923" "0.2379167"
x2_mean "-0.4674143" "-0.3740600" " 0.3584308" "-0.3274333"
x3_mean "0.4589214" "0.4540600" "0.4778769" "0.3552000"
x4_mean "0.6232571" "0.5976867" "0.4789231" "0.4747667"
x1_sd "0.04104517" "0.05143110" "0.03501070" "0.06966645"
x2_sd "0.05489628" "0.20033792" "0.05489583" "0.09365470"
x3_sd "0.07952784" "0.10363689" "0.04866813" "0.17130249"
x4_sd "0.07624950" "0.09146218" "0.06686731" "0.14280235"
Is it possible to do what I want in R?
Here's one way to do it with dplyr and tidyr:
library(dplyr)
library(tidyr)

data %>%
  select(-id) %>%
  gather(row, val, -trt) %>%                             # long format: one measurement per row
  group_by(trt, row) %>%
  summarise_all(funs(Mean = mean, `Std.Dev` = sd)) %>%   # funs() is superseded by list() in dplyr >= 0.8
  gather(col, val, Mean, `Std.Dev`) %>%
  unite("col", trt, col) %>%                             # build A_Mean, A_Std.Dev, ... column names
  spread(col, val)
# # A tibble: 4 x 9
# row A_Mean A_Std.Dev B_Mean B_Std.Dev C_Mean C_Std.Dev D_Mean D_Std.Dev
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 x1 0.346 0.0410 0.321 0.0514 0.182 0.0350 0.238 0.0697
# 2 x2 -0.467 0.0549 -0.374 0.200 0.358 0.0549 -0.327 0.0937
# 3 x3 0.459 0.0795 0.454 0.104 0.478 0.0487 0.355 0.171
# 4 x4 0.623 0.0762 0.598 0.0915 0.479 0.0669 0.475 0.143
You might add %>% tibble::column_to_rownames("row") to turn the first column into row names; note, however, that row names on tibbles are deprecated.
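If you have tidyr 1.1+ and a recent dplyr, the same reshape can be written with pivot_longer()/pivot_wider(), which supersede gather()/spread(). A minimal sketch under that assumption (column order may differ slightly from the output above):

library(dplyr)
library(tidyr)

data %>%
  select(-id) %>%
  pivot_longer(-trt, names_to = "row") %>%
  group_by(trt, row) %>%
  summarise(Mean = mean(value), Std.Dev = sd(value), .groups = "drop") %>%
  pivot_wider(names_from = trt,
              values_from = c(Mean, Std.Dev),
              names_glue = "{trt}_{.value}")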
Here is a way to do it using base R and aggregate, applying summary() to each variable by group:
apply(data[,2:5], 2, function(x) aggregate(x, by=list(data$trt), FUN=summary))
$x1
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.2846 0.3118 0.3506 0.3456 0.3722 0.4208
2 B 0.2059 0.3086 0.3184 0.3207 0.3477 0.4046
3 C 0.1315 0.1523 0.1868 0.1822 0.2064 0.2450
4 D 0.1427 0.1842 0.2538 0.2379 0.2944 0.3083
$x2
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A -0.5492 -0.5028 -0.4784 -0.4674 -0.4270 -0.3846
2 B -0.5424 -0.4849 -0.4112 -0.3741 -0.3684 0.2950
3 C 0.2737 0.3126 0.3640 0.3584 0.3915 0.4472
4 D -0.4598 -0.3678 -0.3427 -0.3274 -0.2746 -0.1913
$x3
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.3191 0.4143 0.4528 0.4589 0.5071 0.6043
2 B 0.2009 0.4088 0.4857 0.4541 0.5147 0.6107
3 C 0.3914 0.4458 0.4651 0.4779 0.4873 0.5748
4 D 0.0990 0.2426 0.4470 0.3552 0.4586 0.5013
$x4
Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 A 0.4895 0.5788 0.6524 0.6233 0.6875 0.6991
2 B 0.4431 0.5499 0.6011 0.5977 0.6545 0.7904
3 C 0.4082 0.4237 0.4535 0.4789 0.5270 0.6187
4 D 0.2537 0.3936 0.5032 0.4748 0.5757 0.6317
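If you only need the mean and standard deviation shown in the desired table, base aggregate() can also compute both in one call by returning a named vector from FUN. A small sketch (each x column of the result is a two-column matrix holding Mean and Std.Dev):

# Mean and SD of every x column by treatment group
aggregate(cbind(x1, x2, x3, x4) ~ trt, data = data,
          FUN = function(x) c(Mean = mean(x), Std.Dev = sd(x)))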
Related
How does gtsummary produce confidence intervals and standard error statistics for glm models? (Code Examples Included)
Want to preface this with heaps of appreciation for gtsummary -- wonderful package. After using tidymodels, GLM, and gtsummary for a while, I've been trying to understand gtsummary's computations for GLM model performance and confidence intervals. Can anyone and/or Dr. Sjoberg + the gtsummary team explain the following questions 1 & 2?

Question 1: Why are standard errors different when using the broom::tidy() vs. parameters::model_parameters() functions to extract model results? (Bolded text in the printouts shows the differences.)

library(gtsummary)
library(parameters)
library(rsample)
library(broom)

trial2 <- trial %>%
  select(age, grade, response, trt) %>%
  drop_na()

model_trial2 <- glm(response ~ age + grade + trt,
                    data = trial2,
                    family = binomial(link = "logit"))

broom::tidy(model_trial2, exponentiate = TRUE)
# # A tibble: 5 × 5
#   term        estimate std.error statistic p.value
#   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
# 1 (Intercept)    0.184 **0.630**   -2.69   0.00715
# 2 age            1.02    0.0114     1.67   0.0952
# 3 gradeII        0.852 **0.395**   -0.406  0.685
# 4 gradeIII       1.01    0.385      0.0199 0.984
# 5 trtDrug B      1.13  **0.321**    0.387  0.699

preadmission_model_parameters <- model_trial2 %>%
  parameters::model_parameters(exponentiate = TRUE)
preadmission_model_parameters
# Parameter    | Odds Ratio |       SE |       95% CI |     z |     p
# ---------------------------------------------------------------
# (Intercept)  |       0.18 | **0.12** | [0.05, 0.61] | -2.69 | 0.007
# age          |       1.02 |     0.01 | [1.00, 1.04] |  1.67 | 0.095
# grade [II]   |       0.85 | **0.34** | [0.39, 1.85] | -0.41 | 0.685
# grade [III]  |       1.01 |     0.39 | [0.47, 2.15] |  0.02 | 0.984
# trt [Drug B] |       1.13 | **0.36** | [0.60, 2.13] |  0.39 | 0.699

Question 2: (a) What method does gtsummary use to produce confidence intervals? (b) Can the user define (stratified or unstratified) k-fold cross-validation or bootstraps to produce confidence intervals? (Bolded differences between the reg_intervals() bootstrapped confidence intervals and the unknown-method gtsummary tbl_regression() confidence intervals.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)

trial2 <- trial %>%
  select(age, grade, response, trt) %>%
  drop_na()

bootstraps(trial2, times = 10)

trial_bootrapped_confidence_intervals <- reg_intervals(
  response ~ age + grade + trt,
  data = trial2,
  model_fn = "glm",
  keep_reps = TRUE,
  family = binomial(link = "logit")
)

trial_bootrapped_confidence_intervals_exp <- trial_bootrapped_confidence_intervals %>%
  select(term:.alpha) %>%
  mutate(across(.cols = c(.lower, .estimate, .upper), ~ exp(.))) %>%
  as_tibble()

trial_bootrapped_confidence_intervals_exp
# # A tibble: 4 × 5
#   term        .lower .estimate   .upper .alpha
#   <chr>        <dbl>     <dbl>    <dbl>  <dbl>
# 1 age          0.997     1.02     1.04    0.05
# 2 gradeII  **0.400**     0.846 **1.86**   0.05
# 3 gradeIII     0.473     1.01     2.10    0.05
# 4 trtDrug B    0.600     1.14     2.22    0.05

model_trial2_tbl_regression <- glm(response ~ age + grade + trt,
                                   data = trial2,
                                   family = binomial(link = "logit")) %>%
  tbl_regression(exponentiate = TRUE) %>%
  add_global_p()

model_trial2_tbl_regression_metrics <- model_trial2_tbl_regression$table_body %>%
  select(label, estimate, std.error, statistic, conf.low, conf.high, p.value)

model_trial2_tbl_regression_metrics
# # A tibble: 8 × 7
#   label                  estimate std.error statistic  conf.low conf.high p.value
#   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
# 1 Age                       1.02     0.0114    1.67       0.997      1.04  0.0909
# 2 Grade                    NA       NA        NA         NA         NA     0.894
# 3 I                        NA       NA        NA         NA         NA    NA
# 4 II                        0.852    0.395    -0.406  **0.389**  **1.85** NA
# 5 III                       1.01     0.385     0.0199     0.472      2.15 NA
# 6 Chemotherapy Treatment   NA       NA        NA         NA         NA     0.699
# 7 Drug A                   NA       NA        NA         NA         NA    NA
# 8 Drug B                    1.13     0.321     0.387      0.603      2.13 NA
The issue is with the exponentiation (applied because the family is binomial). broom::tidy() does not exponentiate the standard errors, but parameters::model_parameters() does. You can see this by comparing broom::tidy(model_trial2, exponentiate = TRUE) with broom::tidy(model_trial2, exponentiate = FALSE): they return the same standard errors. parameters::model_parameters(exponentiate = TRUE) and parameters::model_parameters(exponentiate = FALSE) return different standard errors. When exponentiate = FALSE for parameters, the standard errors match broom's. This is discussed in the broom issue "Check exponentiate behavior in tidy methods" #422. To create a custom tidier for gtsummary, see the gtsummary FAQ + Gallery.
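As a side check (not from the original answer): the exponentiated SEs printed by parameters are consistent with the delta method, under which SE(exp(b)) ≈ exp(b) * SE(b). A minimal sketch of that relationship, reusing model_trial2 from the question above:

library(broom)
library(dplyr)

td <- broom::tidy(model_trial2)  # estimates and SEs on the log-odds scale

# Delta method for g(b) = exp(b): SE(exp(b)) ≈ exp(b) * SE(b).
# The se_odds_ratio column should line up with the SE column that
# model_parameters(exponentiate = TRUE) prints (0.12, 0.34, 0.36, ...).
td %>%
  mutate(se_odds_ratio = exp(estimate) * std.error) %>%
  select(term, se_log_odds = std.error, se_odds_ratio)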
Tidy way to get `summary` output per group?
My code frequently uses tapply and summary, as shown below:

data <- tibble(
  year = rep(2018:2021, 3),
  x = runif(length(year))
)
tapply(data$x, data$year, summary)

The output looks like:

$`2018`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3914  0.5696  0.7477  0.6668  0.8045  0.8614

$`2019`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1910  0.2863  0.3816  0.4179  0.5313  0.6809

(etc.)

Is there a way to get such summary-like output in a tibble? Desired output, using ugly code:

tapply(data$x, data$year, summary) %>%
  map(~ as.numeric(round(.x, 2))) %>%
  map_dfr(set_names, names(summary(1))) %>%
  add_column(year = 2018:2021, .before = 1)

# A tibble: 4 x 7
   year  Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
  <int> <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
1  2018  0.39     0.570   0.75  0.67      0.8   0.86
2  2019  0.19     0.290   0.38  0.42      0.53  0.68
3  2020  0.01     0.35    0.7   0.55      0.82  0.93
4  2021  0.06     0.15    0.24  0.32      0.45  0.66

I'm hoping that there is a nice combination of dplyr functions to do that better -- my code to get the desired output is hacky. Of course, I'm hoping not to have to rewrite base R's summary function, as below:

summarise(`Min` = min(x), `1st Qu.` = quantile(x, 0.25), ...)
Here is a concise tidyverse way.

library(dplyr)
library(purrr)
library(tidyr)

data %>%
  nest_by(year) %>%
  mutate(data = map(data, summary)) %>%
  unnest_wider(data)
# # A tibble: 4 x 7
#    year    Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
#   <int>   <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
# 1  2018 0.105       0.256  0.407 0.307     0.407 0.407
# 2  2019 0.0354      0.205  0.375 0.313     0.452 0.529
# 3  2020 0.272       0.467  0.662 0.546     0.684 0.705
# 4  2021 0.00564     0.107  0.208 0.252     0.375 0.542

You can also just convert the table output from your original line. Note that here it converted year to character, so you would probably want to change that back.

library(purrr)

tapply(data$x, data$year, summary) %>%
  map_dfr(c, .id = "year")
# # A tibble: 4 x 7
#   year     Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
#   <chr>   <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
# 1 2018  0.105       0.256  0.407 0.307     0.407 0.407
# 2 2019  0.0354      0.205  0.375 0.313     0.452 0.529
# 3 2020  0.272       0.467  0.662 0.546     0.684 0.705
# 4 2021  0.00564     0.107  0.208 0.252     0.375 0.542
Base R solution

Try by followed by do.call/rbind:

do.call(rbind, by(data$x, data$year, summary))
#            Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 2018 0.45126737 0.5437956 0.6363238 0.6343376 0.7258727 0.8154215
# 2019 0.70134602 0.7425629 0.7837798 0.8227042 0.8833833 0.9829869
# 2020 0.02726706 0.3338530 0.6404389 0.4591294 0.6750606 0.7096822
# 2021 0.26667973 0.3242120 0.3817443 0.4953048 0.6096173 0.8374904

This returns a "matrix":

class(do.call(rbind, by(data$x, data$year, summary)))
# [1] "matrix" "array"

To get a "data.frame", coerce the return value afterwards. Don't use rbind.data.frame; it will lose the column names.

smry <- do.call(rbind, by(data$x, data$year, summary))
as.data.frame(smry)

dplyr solution

A dplyr and purrr solution could be the following. Note that it doesn't round; it coerces the return value of map_dfr, whose columns are of class "table", to numeric instead.

library(purrr)
library(dplyr)

tapply(data$x, data$year, summary) %>%
  map_dfr(set_names, names(summary(1))) %>%
  mutate(across(everything(), as.numeric))
# # A tibble: 4 x 6
#     Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
#    <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
# 1 0.451      0.544  0.636 0.634     0.726 0.815
# 2 0.701      0.743  0.784 0.823     0.883 0.983
# 3 0.0273     0.334  0.640 0.459     0.675 0.710
# 4 0.267      0.324  0.382 0.495     0.610 0.837
Another possible tidyverse solution. Same basic idea as Rui's solution above, but a little more verbose since it uses nest() and unnest() before pivoting back to wide data.

library(tidyverse)

data <- tibble(
  year = rep(2018:2021, 3),
  x = runif(length(year))
)

df_summary <- data %>%
  nest_by(year) %>%
  mutate(
    summary = map(data, ~ list(summary(.x))),
    df = map(summary, ~ data.frame(names = names(.x), values = c(.x)))
  ) %>%
  unnest(df) %>%
  select(-data, -summary) %>%
  pivot_wider(names_from = names, values_from = values)

   year  Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
  <int> <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
1  2018 0.204     0.351  0.498 0.538     0.705 0.912
2  2019 0.548     0.673  0.798 0.767     0.877 0.956
3  2020 0.228     0.416  0.604 0.604     0.792 0.980
4  2021 0.240     0.314  0.388 0.357     0.416 0.443
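On newer versions, there is an even shorter variant. A sketch, assuming dplyr >= 1.1 (for reframe() and .by) and tibble (for as_tibble_row()):

library(dplyr)
library(tibble)

# summary(x) returns a named numeric vector; as_tibble_row() turns it into a
# one-row tibble, and reframe() binds one such row per year.
data %>%
  reframe(as_tibble_row(c(summary(x))), .by = year)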
Extracting final p-value statistic from an lm lapply loop with multiple models
I have the following code that automatically performs an lm between my independent variable (Kpl) and all my other dependent variables (Y1, Y2, ..., Yi):

linear_summary <- lapply(testdata[,-1], function(x) summary(lm(Kpl ~ x)))

The output for this is:

Call:
lm(formula = Kpl ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-1.37567 -0.52392  0.04236  0.67444  0.81316

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.7282     0.3456   5.001 0.000402 ***
x            -0.1550     0.2712  -0.571 0.579196
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.772 on 11 degrees of freedom
Multiple R-squared:  0.02883,	Adjusted R-squared:  -0.05946
F-statistic: 0.3265 on 1 and 11 DF,  p-value: 0.5792

$Y2

Call:
lm(formula = Kpl ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2472 -0.4236 -0.2057  0.7140  1.0348

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6900     0.9010   0.766    0.460
x             0.8832     0.8767   1.007    0.335

Residual standard error: 0.7495 on 11 degrees of freedom
Multiple R-squared:  0.08447,	Adjusted R-squared:  0.001238
F-statistic: 1.015 on 1 and 11 DF,  p-value: 0.3354

Etc. (I have truncated it to just the first two models.)

I want to extract the final p-value for the whole model for each of the instances (0.5792 and 0.3354 in these two cases). Ideally this would come in some sort of table form with the associated variable, i.e. Y1 = 0.5792, Y2 = 0.3354.

Most of the info I can find either seems to only work for a single model (as opposed to an lapply over multiple models), or I cannot get it to work, which could be a problem with my original code. Any suggestions for a person just starting with R on how to solve this?

Edit: The data looks something like this:

| X        | Y1          | Y2          | Y3          | Y4          |
| -------- | ----------- | ----------- | ----------- | ----------- |
| 0.33767  | 2.33063062  | 1.013212308 | 1.277996888 | 1.373238355 |
| 0.33767  | 0.095967324 | 0.508830529 | 0.789257027 | 0.815877121 |
| 1.010474 | 2.344657045 | 0.842490752 | 1.240582283 | 1.262360905 |
| 1.010474 | 0.08135992  | 0.912535398 | 0.384427466 | 0.409817599 |
| 1.183276 | 0.135626937 | 0.967877981 | 0.505801442 | 0.576288093 |
| 1.536974 | 1.507146148 | 1.428839993 | 1.316569449 | 1.392022619 |
| 1.536974 | 1.255210981 | 1.191822955 | 1.395769591 | 1.41903939  |
| 2.017965 | 1.410299711 | 1.121560244 | 1.369835675 | 1.385143026 |
| 2.017965 | 1.032587109 | 1.372235121 | 1.390878783 | 1.42741762  |
| 2.3436   | 1.275999998 | 0.930400789 | 1.19877482  | 1.217540034 |
| 2.3436   | 1.250513383 | 1.063880146 | 1.206719195 | 1.23325973  |
| 2.387598 | 0.182866909 | 0.89588293  | 0.416923749 | 0.45364797  |
| 2.387598 | 0.097133916 | 0.750430855 | 0.506463633 | 0.03434754  |

These are the actual values that I used to get the output above.
I think the p-value is not stored; you need to calculate it from the fstatistic component, maybe something like this:

set.seed(111)
testdata <- data.frame(Kpl = rnorm(100), Y1 = rnorm(100),
                       Y2 = rnorm(100), Y3 = rnorm(100))

IV <- colnames(testdata)[-1]
DV <- "Kpl"

linear_summary <- lapply(IV, function(x) {
  summary(lm(reformulate(response = DV, termlabels = x), data = testdata))
})
names(linear_summary) <- IV

tab <- lapply(IV, function(x) {
  p <- with(
    linear_summary[[x]],
    pf(fstatistic[1], fstatistic[2], fstatistic[3], lower.tail = FALSE)
  )
  data.frame(IV = x, p = p)
})

do.call(rbind, tab)
       IV         p
value  Y1 0.5757187
value1 Y2 0.4922582
value2 Y3 0.4009439

Check, for example, the first summary:

linear_summary[[1]]

Call:
lm(formula = reformulate(response = DV, termlabels = x), data = testdata)

Residuals:
     Min       1Q   Median       3Q      Max
-2.94515 -0.73325  0.05448  0.57901  2.76026

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01382    0.10747  -0.129    0.898
Y1          -0.05950    0.10597  -0.562    0.576

Residual standard error: 1.075 on 98 degrees of freedom
Multiple R-squared:  0.003207,	Adjusted R-squared:  -0.006964
F-statistic: 0.3153 on 1 and 98 DF,  p-value: 0.5757
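A side note, not from the answer above: when each model has exactly one predictor, the overall F-test and the slope's t-test are equivalent (F = t²), so the same p-value can also be read straight from the stored coefficient table. A sketch, reusing the named linear_summary list built above:

# Row 2 of each coefficient matrix holds the slope; for one-predictor
# models its t-test p-value equals the model F-test p-value.
sapply(linear_summary, function(s) coef(s)[2, "Pr(>|t|)"])
# should match the pf() values above:
#        Y1        Y2        Y3
# 0.5757187 0.4922582 0.4009439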
Ok, I edited my code in the following way:

library(purrr)
library(dplyr)
library(broom)
library(tidyr)

# Solution 1
df %>%
  pivot_longer(-X) %>%
  group_split(name) %>%
  set_names(nm = map(., ~ first(.x$name))) %>%
  map(~ tidy(lm(X ~ value, data = .))) %>%
  bind_rows(.id = "var") %>%
  filter(term == "value")

# A tibble: 4 x 6
  var   term  estimate std.error statistic p.value
  <chr> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 Y1    value  -0.155      0.271   -0.571    0.579
2 Y2    value   0.883      0.877    1.01     0.335
3 Y3    value   0.0341     0.552    0.0618   0.952
4 Y4    value  -0.158      0.469   -0.337    0.743

Or you can use this:

# Solution 2
df %>%
  pivot_longer(Y1:Y4) %>%
  group_by(name) %>%
  arrange(.by_group = TRUE) %>%
  nest() %>%
  mutate(models = map(data, ~ lm(X ~ value, data = .)),
         glance = map(models, glance)) %>%
  unnest(glance)

# A tibble: 4 x 15
# Groups:   name [4]
  name  data    models r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
  <chr> <list>  <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1 Y1    <tibbl~ <lm>    0.0288        -0.0595  0.772   0.327     0.579     1  -14.0  34.0  35.7
2 Y2    <tibbl~ <lm>    0.0845         0.00124 0.750   1.01      0.335     1  -13.6  33.2  34.9
3 Y3    <tibbl~ <lm>    0.000348      -0.0905  0.783   0.00382   0.952     1  -14.2  34.4  36.1
4 Y4    <tibbl~ <lm>    0.0102        -0.0798  0.779   0.113     0.743     1  -14.1  34.2  35.9
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

I know you already got your answer, but here are two other solutions. I thought it might be useful to learn alternative ways of dealing with the problem. Thank you for your question; it was a good one.
Reproducing a result from R in Stata - Telling R or Stata to remove the same variables causing perfect collinearity/singularities
I am trying to reproduce a result from R in Stata (please note that the data below is fictitious and serves just as an example). For some reason, however, Stata appears to deal with certain issues differently than R: it chooses different dummy variables to kick out in case of multicollinearity. I have posted a related question dealing with the statistical implications of these country-year dummies being removed here. In the example below, R kicks out 2, while Stata kicks out 3, leading to a different result. Check, for example, the coefficients and p-values for vote and votewon. In essence, all I want to know is how to communicate to either R or Stata which variables to kick out, so that they both do the same.

Data

The data looks as follows:

library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)

data('mroz')
year <- c(2005, 2010)
country <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2

DT <- data.table(
  country = rep(sample(country, length(mroz), replace = T), each = n),
  year = c(replicate(length(mroz), sample(year, n))))

x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, x)

mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)

DT <- DT %>%
  group_by(country) %>%
  mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
  group_by(country, year) %>%
  mutate(taxrate = base_rate + as.integer(runif(1, -2.5, +2.5)))

DT <- DT %>%
  group_by(country, year) %>%
  mutate(vote = sample(c(0, 1), 1),
         votewon = ifelse(vote == 1, sample(c(0, 1), 1), 0))

rm(mroz, x, country, year)

The lm regression in R

summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data = DT))

Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
    data = DT)

Residuals:
   Min     1Q Median     3Q    Max
-7.450 -0.805 -0.268  0.954  5.332

Coefficients: (3 not defined because of singularities)
                               Estimate Std. Error t value             Pr(>|t|)
(Intercept)                   11.170064   0.418578   26.69 < 0.0000000000000002 ***
exper                          0.103880   0.029912    3.47              0.00055 ***
I(exper^2)                    -0.002965   0.000966   -3.07              0.00222 **
vote                           0.576865   0.504540    1.14              0.25327
votewon                        0.622522   0.636241    0.98              0.32818
countryA:as.factor(year)2005  -0.196348   0.503245   -0.39              0.69653
countryB:as.factor(year)2005  -0.530681   0.616653   -0.86              0.38975
countryC:as.factor(year)2005   0.650166   0.552019    1.18              0.23926
countryD:as.factor(year)2005  -0.515195   0.638060   -0.81              0.41968
countryE:as.factor(year)2005   0.731681   0.502807    1.46              0.14605
countryG:as.factor(year)2005   0.213345   0.674642    0.32              0.75192
countryH:as.factor(year)2005  -0.811374   0.637254   -1.27              0.20334
countryI:as.factor(year)2005   0.584787   0.503606    1.16              0.24594
countryJ:as.factor(year)2005   0.554397   0.674789    0.82              0.41158
countryA:as.factor(year)2010   0.388603   0.503358    0.77              0.44035
countryB:as.factor(year)2010  -0.727834   0.617210   -1.18              0.23869
countryC:as.factor(year)2010  -0.308601   0.504041   -0.61              0.54056
countryD:as.factor(year)2010   0.785603   0.503165    1.56              0.11888
countryE:as.factor(year)2010   0.280305   0.452293    0.62              0.53562
countryG:as.factor(year)2010   0.672074   0.674721    1.00              0.31954
countryH:as.factor(year)2010         NA         NA      NA                   NA
countryI:as.factor(year)2010         NA         NA      NA                   NA
countryJ:as.factor(year)2010         NA         NA      NA                   NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared:  0.037,	Adjusted R-squared:  0.0119
F-statistic: 1.47 on 19 and 728 DF,  p-value: 0.0882

Same regression in Stata

write.dta(DT, "C:/Users/.../mroz_adapted.dta")

encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year

note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =       748
-------------+----------------------------------   F(21, 726)      =      1.80
       Model |  192.989406        21  9.18997171   Prob > F        =    0.0154
    Residual |  3705.47583       726   5.1039612   R-squared       =    0.0495
-------------+----------------------------------   Adj R-squared   =    0.0220
       Total |  3898.46524       747  5.21882897   Root MSE        =    2.2592

---------------------------------------------------------------------------------
           educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
          exper |   .1109858   .0297829     3.73   0.000      .052515    .1694567
                |
c.exper#c.exper |  -.0031891    .000963    -3.31   0.001    -.0050796   -.0012986
                |
           vote |   .0697273   .4477115     0.16   0.876    -.8092365    .9486911
        votewon |  -.0147825   .6329659    -0.02   0.981    -1.257445    1.227879
                |
 n_country#year |
         A#2010 |   .0858634   .4475956     0.19   0.848    -.7928728    .9645997
         B#2005 |  -.4950677   .5003744    -0.99   0.323    -1.477421    .4872858
         B#2010 |   .0951657   .5010335     0.19   0.849    -.8884818    1.078813
         C#2005 |  -.5162827    .447755    -1.15   0.249    -1.395332    .3627664
         C#2010 |  -.0151834   .4478624    -0.03   0.973    -.8944434    .8640767
         D#2005 |   .3664596   .5008503     0.73   0.465    -.6168283    1.349747
         D#2010 |   .5119858    .500727     1.02   0.307    -.4710599    1.495031
         E#2005 |   .5837942   .6717616     0.87   0.385    -.7350329    1.902621
         E#2010 |    .185601   .5010855     0.37   0.711    -.7981486    1.169351
         F#2005 |   .5987978   .6333009     0.95   0.345    -.6445219    1.842117
         F#2010 |   .4853639   .7763936     0.63   0.532    -1.038881    2.009608
         G#2005 |  -.3341302   .6328998    -0.53   0.598    -1.576663    .9084021
         G#2010 |   .2873193   .6334566     0.45   0.650     -.956306    1.530945
         H#2005 |  -.4365233   .4195984    -1.04   0.299    -1.260294    .3872479
         H#2010 |  -.1683725   .6134262    -0.27   0.784    -1.372673    1.035928
         I#2005 |    -.39264   .7755549    -0.51   0.613    -1.915238    1.129958
         I#2010 |          0  (omitted)
         J#2005 |   1.036108   .4476018     2.31   0.021     .1573591    1.914856
         J#2010 |          0  (omitted)
                |
          _cons |   11.58369    .350721    33.03   0.000     10.89514    12.27224
---------------------------------------------------------------------------------
Just on your question about which "variables to kick out": I guess you mean which combination of interaction terms is used as the reference group for calculating the regression coefficients. By default, Stata uses the combination of the lowest values of the two variables as the reference, while R uses the highest values. I use the Stata auto data to demonstrate this:

# In R
webuse::webuse("auto")
auto$foreign <- as.factor(auto$foreign)
auto$rep78 <- as.factor(auto$rep78)

# Model
r_model <- lm(mpg ~ rep78:foreign, data = auto)
broom::tidy(r_model)
# A tibble: 11 x 5
   term            estimate std.error statistic  p.value
   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        26.3       1.65    15.9   2.09e-23
 2 rep781:foreign0    -5.33      3.88    -1.38  1.74e- 1
 3 rep782:foreign0    -7.21      2.41    -2.99  4.01e- 3
 4 rep783:foreign0    -7.33      1.91    -3.84  2.94e- 4
 5 rep784:foreign0    -7.89      2.34    -3.37  1.29e- 3
 6 rep785:foreign0     5.67      3.88     1.46  1.49e- 1
 7 rep781:foreign1    NA        NA       NA    NA
 8 rep782:foreign1    NA        NA       NA    NA
 9 rep783:foreign1    -3.00      3.31    -0.907 3.68e- 1
10 rep784:foreign1    -1.44      2.34    -0.618 5.39e- 1
11 rep785:foreign1    NA        NA       NA    NA

In Stata:

. reg mpg i.foreign#i.rep78

note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      4.88
       Model |  839.550121         7  119.935732   Prob > F        =    0.0002
    Residual |  1500.65278        61  24.6008652   R-squared       =    0.3588
-------------+----------------------------------   Adj R-squared   =    0.2852
       Total |   2340.2029        68  34.4147485   Root MSE        =    4.9599

-------------------------------------------------------------------------------
          mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
   Domestic#2 |     -1.875   3.921166    -0.48   0.634    -9.715855    5.965855
   Domestic#3 |         -2   3.634773    -0.55   0.584    -9.268178    5.268178
   Domestic#4 |  -2.555556   3.877352    -0.66   0.512     -10.3088     5.19769
   Domestic#5 |         11   4.959926     2.22   0.030     1.082015    20.91798
    Foreign#1 |          0  (empty)
    Foreign#2 |          0  (empty)
    Foreign#3 |   2.333333   4.527772     0.52   0.608    -6.720507    11.38717
    Foreign#4 |   3.888889   3.877352     1.00   0.320    -3.864357    11.64213
    Foreign#5 |   5.333333   3.877352     1.38   0.174    -2.419912    13.08658
              |
        _cons |         21   3.507197     5.99   0.000     13.98693    28.01307
-------------------------------------------------------------------------------

To reproduce the previous R results in Stata, we could recode those two variables foreign and rep78:

. reg mpg i.foreign2#i.rep2

note: 0b.foreign2#1.rep2 identifies no observations in the sample
note: 0b.foreign2#2.rep2 identifies no observations in the sample

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      4.88
       Model |  839.550121         7  119.935732   Prob > F        =    0.0002
    Residual |  1500.65278        61  24.6008652   R-squared       =    0.3588
-------------+----------------------------------   Adj R-squared   =    0.2852
       Total |   2340.2029        68  34.4147485   Root MSE        =    4.9599

-------------------------------------------------------------------------------
          mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign2#rep2 |
         0 1  |          0  (empty)
         0 2  |          0  (empty)
         0 3  |         -3   3.306617    -0.91   0.368     -9.61199     3.61199
         0 4  |  -1.444444   2.338132    -0.62   0.539    -6.119827    3.230938
         1 0  |   5.666667   3.877352     1.46   0.149    -2.086579    13.41991
         1 1  |  -5.333333   3.877352    -1.38   0.174    -13.08658    2.419912
         1 2  |  -7.208333   2.410091    -2.99   0.004    -12.02761   -2.389059
         1 3  |  -7.333333   1.909076    -3.84   0.000    -11.15077   -3.515899
         1 4  |  -7.888889   2.338132    -3.37   0.001    -12.56427   -3.213506
              |
        _cons |   26.33333   1.653309    15.93   0.000     23.02734    29.63933
-------------------------------------------------------------------------------

The same approach applies to reproduce Stata results in R: just redefine the levels of those two factor variables.
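Going the other direction, here is a rough sketch of redefining the factor levels in R so that lm() drops the same cells as Stata's default (lowest levels as base). The *_rev variable names are illustrative, not from the original answer, and this only changes which aliased cells get dropped, not the model fit:

# Reversing the level order makes R drop the lowest combinations instead of
# the highest, lining up with Stata's "lowest combination as base" default.
auto$rep78_rev   <- factor(auto$rep78,   levels = rev(levels(as.factor(auto$rep78))))
auto$foreign_rev <- factor(auto$foreign, levels = rev(levels(as.factor(auto$foreign))))

stata_like_model <- lm(mpg ~ rep78_rev:foreign_rev, data = auto)
broom::tidy(stata_like_model)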
Change the order in which summary functions are printed by skim
I'm using skimr, and I added two summary functions (iqr_na_rm and median_na_rm) to the list of summary functions for the skim function. However, by default these new summary functions (called skimmers in the skimr documentation) appear at the end of the table. Instead, I'd like median and iqr to appear after mean and sd. The final goal is to show the results in a .Rmd report like this:

---
title: "Test"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE)
```

## Test

```{r test, results = 'asis'}
library(skimr)
library(dplyr)
library(ggplot2)

iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)

skim_with(numeric = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm),
          integer = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm))

msleep %>%
  group_by(vore) %>%
  skim(sleep_total) %>%
  kable()
```

Rendered HTML: (screenshot not shown)

As you can see, median and iqr are printed at the end of the table, after the sparkline histogram. I'd like them to be printed after sd and before p0. Is it possible?
There are two parts in the skim() output. If you want to control the numeric part, you can use skim_to_list like this. It's also easier to export in another format.

msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist)
# A tibble: 5 x 14
  vore    variable    missing complete n     mean    sd    median iqr     p0    p25    p75     p100  hist
* <chr>   <chr>       <chr>   <chr>    <chr> <chr>   <chr> <chr>  <chr>   <chr> <chr>  <chr>   <chr> <chr>
1 carni   sleep_total 0       19       19    10.38   4.67  10.4   " 6.75" 2.7   6.25   "13 "   19.4  ▃▇▂▇▆▃▂▃
2 herbi   sleep_total 0       32       32    " 9.51" 4.88  10.3   " 9.92" 1.9   "4.3 " 14.22   16.6  ▆▇▁▂▂▆▇▅
3 insecti sleep_total 0       5        5     14.94   5.92  18.1   "11.1 " 8.4   "8.6 " "19.7 " 19.9  ▇▁▁▁▁▁▃▇
4 omni    sleep_total 0       20       20    10.93   2.95  " 9.9" " 1.83" "8 "  "9.1 " 10.93   "18 " ▆▇▂▁▁▁▁▂
5 NA      sleep_total 0       7        7     10.19   "3 "  10.6   " 3.5 " 5.4   8.65   12.15   13.7  ▃▃▁▁▃▇▁▇

EDIT: Adding kable(), as requested in the comments.

msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist) %>%
  kable()

| vore    | variable    | missing | complete | n  | mean  | sd   | median | iqr  | p0  | p25  | p75   | p100 | hist     |
|---------|-------------|---------|----------|----|-------|------|--------|------|-----|------|-------|------|----------|
| carni   | sleep_total | 0       | 19       | 19 | 10.38 | 4.67 | 10.4   | 6.75 | 2.7 | 6.25 | 13    | 19.4 | ▃▇▂▇▆▃▂▃ |
| herbi   | sleep_total | 0       | 32       | 32 | 9.51  | 4.88 | 10.3   | 9.92 | 1.9 | 4.3  | 14.22 | 16.6 | ▆▇▁▂▂▆▇▅ |
| insecti | sleep_total | 0       | 5        | 5  | 14.94 | 5.92 | 18.1   | 11.1 | 8.4 | 8.6  | 19.7  | 19.9 | ▇▁▁▁▁▁▃▇ |
| omni    | sleep_total | 0       | 20       | 20 | 10.93 | 2.95 | 9.9    | 1.83 | 8   | 9.1  | 10.93 | 18   | ▆▇▂▁▁▁▁▂ |
| NA      | sleep_total | 0       | 7        | 7  | 10.19 | 3    | 10.6   | 3.5  | 5.4 | 8.65 | 12.15 | 13.7 | ▃▃▁▁▃▇▁▇ |
Here's another option that uses the append = FALSE option.

library(skimr)
library(dplyr)
library(ggplot2)

iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)

my_skimmers <- list(n = length,
                    missing = n_missing,
                    complete = n_complete,
                    mean = mean.default,
                    sd = purrr::partial(sd, na.rm = TRUE),
                    median = median_na_rm,
                    iqr = iqr_na_rm)

skim_with(numeric = my_skimmers, integer = my_skimmers, append = FALSE)

msleep %>%
  group_by(vore) %>%
  skim(sleep_total) %>%
  kable()

I didn't put in all the stats, but you can look in the functions.R and stats.R files to see how the various statistics are defined.
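For readers on skimr 2.0 or later, the API above no longer applies: skim_with() now takes sfl() and returns a skim function, and the result is an ordinary tibble whose columns can simply be reordered with dplyr::select(). A rough sketch under that assumption (the type-prefixed column names such as numeric.mean follow skimr 2's convention):

library(skimr)
library(dplyr)

# skimr >= 2.0: build a custom skimmer, then reorder columns afterwards
my_skim <- skim_with(
  numeric = sfl(median = ~ median(., na.rm = TRUE),
                iqr = ~ IQR(., na.rm = TRUE)),
  append = TRUE
)

ggplot2::msleep %>%
  group_by(vore) %>%
  my_skim(sleep_total) %>%
  select(skim_variable, vore, numeric.mean, numeric.sd,
         numeric.median, numeric.iqr)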