I am trying to create a basic descriptive summary table in my R Markdown PDF document.
library(summarytools)
library(dplyr)  # needed for %>% and group_by()

data(iris)

iris %>%
  group_by(Species) %>%
  descr(Sepal.Length, stats = "fivenum", headings = FALSE)
I get the following output:
Descriptive Statistics
Sepal.Length by Species
Data Frame: iris
N: 50
                setosa   versicolor   virginica
------------ --------- ------------ -----------
         Min      4.30         4.90        4.90
          Q1      4.80         5.60        6.20
      Median      5.00         5.90        6.50
          Q3      5.20         6.30        6.90
         Max      5.80         7.00        7.90
How do I get rid of this part from the final output?:
Descriptive Statistics
Sepal.Length by Species
Data Frame: iris
N: 50
I assumed headings = FALSE was going to do it, but I guess I was wrong! Any help would be much appreciated.
Should be fixed soon, but in the meantime, this will work:
iris %>%
  group_by(Species) %>%
  descr(Sepal.Length, stats = "fivenum") %>%
  print(headings = FALSE)
##                 setosa   versicolor   virginica
## ------------ --------- ------------ -----------
##          Min      4.30         4.90        4.90
##           Q1      4.80         5.60        6.20
##       Median      5.00         5.90        6.50
##           Q3      5.20         6.30        6.90
##          Max      5.80         7.00        7.90
I would use tb() to get rid of the descriptive header above and then knitr::kable() to make a nice table for R Markdown output. For example:
library(summarytools)
library(dplyr)
iris %>%
  group_by(Species) %>%
  descr(Sepal.Length, stats = "fivenum") %>%
  tb() %>%
  knitr::kable()
Output:
|Species |variable | min| q1| med| q3| max|
|:----------|:------------|---:|---:|---:|---:|---:|
|setosa |Sepal.Length | 4.3| 4.8| 5.0| 5.2| 5.8|
|versicolor |Sepal.Length | 4.9| 5.6| 5.9| 6.3| 7.0|
|virginica |Sepal.Length | 4.9| 6.2| 6.5| 6.9| 7.9|
We could also transpose the table using t(). For example:
iris %>%
  group_by(Species) %>%
  descr(Sepal.Length, stats = "fivenum") %>%
  tb() %>%
  t() %>%
  knitr::kable()
| | | | |
|:--------|:------------|:------------|:------------|
|Species |setosa |versicolor |virginica |
|variable |Sepal.Length |Sepal.Length |Sepal.Length |
|min |4.3 |4.9 |4.9 |
|q1 |4.8 |5.6 |6.2 |
|med |5.0 |5.9 |6.5 |
|q3 |5.2 |6.3 |6.9 |
|max |5.8 |7.0 |7.9 |
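If you want the transposed table to have real column headers instead of the empty header row above, one possibility (a rough sketch of my own, not from the summarytools docs) is to promote the Species row to the column names before passing the matrix to kable():

tab <- iris %>%
  group_by(Species) %>%
  descr(Sepal.Length, stats = "fivenum") %>%
  tb() %>%
  t()

colnames(tab) <- tab["Species", ]               # species names become the headers
tab <- tab[setdiff(rownames(tab), "Species"), ] # drop the now-redundant row
knitr::kable(tab)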
Example
library(glmmTMB)
library(ggeffects)
## Zero-inflated negative binomial model
(m <- glmmTMB(count ~ spp + mined + (1 | site),
              ziformula = ~ spp + mined,
              family = nbinom2,
              data = Salamanders,
              na.action = "na.fail"))
summary(m)
ggemmeans(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.11 | [0.66, 1.86]
PR | 0.42 | [0.11, 1.59]
DM | 1.32 | [0.81, 2.13]
EC-A | 0.75 | [0.37, 1.53]
EC-L | 1.81 | [1.09, 3.00]
DES-L | 2.00 | [1.25, 3.21]
DF | 0.99 | [0.61, 1.62]
ggeffects::ggeffect(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.14 | [0.69, 1.90]
PR | 0.44 | [0.12, 1.63]
DM | 1.36 | [0.85, 2.18]
EC-A | 0.78 | [0.39, 1.57]
EC-L | 1.87 | [1.13, 3.07]
DES-L | 2.06 | [1.30, 3.28]
DF | 1.02 | [0.63, 1.65]
Questions
Why are ggeffect and ggemmeans giving different results for the marginal effects? Is it simply something internal with how the packages emmeans and effects are computing them? Also, does anyone know of some resources on how to compute marginal effects from scratch for a model like that in the example?
You fit a complex model: a zero-inflated negative binomial model with random effects.
What you observe has little to do with the model specification. Let's show this by fitting a simpler model: Poisson with fixed effects only.
library("glmmTMB")
library("ggeffects")
m <- glmmTMB(
  count ~ spp + mined,
  family = poisson,
  data = Salamanders
)
ggemmeans(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.73 | [0.59, 0.89]
#> PR | 0.18 | [0.12, 0.27]
#> DM | 0.91 | [0.76, 1.10]
#> EC-A | 0.34 | [0.25, 0.45]
#> EC-L | 1.35 | [1.15, 1.59]
#> DES-L | 1.43 | [1.22, 1.68]
#> DF | 0.79 | [0.64, 0.96]
ggeffect(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.76 | [0.62, 0.93]
#> PR | 0.19 | [0.13, 0.28]
#> DM | 0.96 | [0.79, 1.15]
#> EC-A | 0.35 | [0.26, 0.47]
#> EC-L | 1.41 | [1.20, 1.66]
#> DES-L | 1.50 | [1.28, 1.75]
#> DF | 0.82 | [0.67, 1.00]
The documentation explains that internally ggemmeans() calls emmeans::emmeans() while ggeffect() calls effects::Effect().
Both emmeans and effects compute marginal effects, but they make a different (default) choice of how to marginalize out (i.e., average over) mined in order to get the effect of spp.
mined is a categorical variable with two levels: "yes" and "no". The crucial bit is that the two levels are not balanced: there are slightly more "no"s than "yes"s.
xtabs(~ mined + spp, data = Salamanders)
#>      spp
#> mined GP PR DM EC-A EC-L DES-L DF
#>   yes 44 44 44   44   44    44 44
#>   no  48 48 48   48   48    48 48
Intuitively, this means that the weighted average over mined [think of (44 × yes + 48 × no) / 92] is not the same as the simple average over mined [think of (yes + no) / 2].
Let's check the intuition by specifying how to marginalize out mined when we call emmeans::emmeans() directly.
# mean (default)
emmeans::emmeans(m, "spp", type = "response", weights = "equal")
#> spp rate SE df lower.CL upper.CL
#> GP 0.726 0.0767 636 0.590 0.893
#> PR 0.181 0.0358 636 0.123 0.267
#> DM 0.914 0.0879 636 0.757 1.104
#> EC-A 0.336 0.0497 636 0.251 0.449
#> EC-L 1.351 0.1120 636 1.148 1.590
#> DES-L 1.432 0.1163 636 1.221 1.679
#> DF 0.786 0.0804 636 0.643 0.961
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
# weighted mean
emmeans::emmeans(m, "spp", type = "response", weights = "proportional")
#> spp rate SE df lower.CL upper.CL
#> GP 0.759 0.0794 636 0.618 0.932
#> PR 0.190 0.0373 636 0.129 0.279
#> DM 0.955 0.0909 636 0.793 1.152
#> EC-A 0.351 0.0517 636 0.263 0.469
#> EC-L 1.412 0.1153 636 1.203 1.658
#> DES-L 1.496 0.1196 636 1.279 1.751
#> DF 0.822 0.0832 636 0.674 1.003
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
The second option, with proportional weights, reproduces the marginal effects computed by ggeffects::ggeffect().
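Regarding the second part of the question (computing these marginal effects from scratch): as a minimal sketch of my own (not taken from either package), you can reproduce both numbers for a single species by averaging the predictions on the link (log) scale, with equal or proportional weights, and then back-transforming. Using the simpler Poisson model m from above:

# Predictions for spp = "GP" at both levels of mined, on the log (link) scale
nd <- data.frame(
  spp   = factor("GP", levels = levels(Salamanders$spp)),
  mined = factor(c("yes", "no"), levels = levels(Salamanders$mined))
)
eta <- predict(m, newdata = nd, type = "link")

exp(mean(eta))                          # equal weights        -> ~0.73, as in ggemmeans()
exp(weighted.mean(eta, w = c(44, 48)))  # proportional weights -> ~0.76, as in ggeffect()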
Update
@Daniel points out that ggeffects accepts the weights argument and passes it on to emmeans. This way you can keep using ggeffects and still control how predictions are averaged to compute marginal effects.
Try it out for yourself with:
ggemmeans(m, terms="spp", weights = "proportional")
ggemmeans(m, terms="spp", weights = "equal")
I have 3 data frames (df1, df2, df3), each with 1274 rows and 2192 columns. I want to count the number of occurrences where the value of a cell matches 0.968 in df1, 0.972 in df2 and 0.909 in df3. Note that the cells have to be in the exact same location (same row and column number).
Example,
df1
| 0.968 | 0.526 |
| 0.938 | 0.632 |
| 0.873 | 0.968 |
df2
| 0.342 | 0.972 |
| 0.545 | 0.231 |
| 0.434 | 0.972 |
df3
| 0.673 | 0.812 |
| 0.128 | 0.764 |
| 0.909 | 0.909 |
The answer should return: 1
Is using a loop the best option to solve this?
You can try the code below
sum(df1 == 0.968 & df2 == 0.972 & df3 == 0.909)
If you would like to index the TRUE values, you can use which
which(df1 == 0.968 & df2 == 0.972 & df3 == 0.909, arr.ind = TRUE)
Another possible solution:
sum((df1 == 0.968) * (df2 == 0.972) * (df3 == 0.909))
#> [1] 1
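For completeness, here is a self-contained version of the toy example (the data frames are typed in by hand, so this is just an illustration); it also shows a tolerance-based comparison, which is safer than == if the values come out of floating-point arithmetic rather than being entered exactly:

df1 <- data.frame(a = c(0.968, 0.938, 0.873), b = c(0.526, 0.632, 0.968))
df2 <- data.frame(a = c(0.342, 0.545, 0.434), b = c(0.972, 0.231, 0.972))
df3 <- data.frame(a = c(0.673, 0.128, 0.909), b = c(0.812, 0.764, 0.909))

sum(df1 == 0.968 & df2 == 0.972 & df3 == 0.909)
#> [1] 1

# near-equality within a small tolerance
tol <- 1e-8
sum(abs(df1 - 0.968) < tol & abs(df2 - 0.972) < tol & abs(df3 - 0.909) < tol)
#> [1] 1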
I want to create summary statistics for my dataset. I have tried searching but haven't found anything that matches what I want. I want the columns to be listed vertically, with the statistic measures as headings. Here is how I want it to look:
| Column   | Mean          | Standard deviation | 25th perc. | Median | 75th perc. |
|:---------|:--------------|:-------------------|:-----------|:-------|:-----------|
| Column 1 | Mean column 1 | Std column 1       | ...        | ...    | ...        |
| Column 2 | Mean column 2 | ...                | ...        | ...    | ...        |
| Etc      | ...           | ...                | ...        | ...    | ...        |
How do I do this? Thankful for any help I can get! :)
If there is a specific function where I can also do some formatting/styling, some info about that would also be appreciated, but the main point is that it should look as described. :)
You may want to check out the summarytools package; it has built-in support for both markdown and HTML.
library(summarytools)
descr(iris,
stats = c("mean", "sd", "q1", "med", "q3"),
transpose = TRUE)
## Non-numerical variable(s) ignored: Species
## Descriptive Statistics
## iris
## N: 150
##
## Mean Std.Dev Q1 Median Q3
## ----------------- ------ --------- ------ -------- ------
## Petal.Length 3.76 1.77 1.60 4.35 5.10
## Petal.Width 1.20 0.76 0.30 1.30 1.80
## Sepal.Length 5.84 0.83 5.10 5.80 6.40
## Sepal.Width 3.06 0.44 2.80 3.00 3.30
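If you are knitting this to markdown/HTML, you will probably also want summarytools to emit markdown rather than plain console text; something along these lines should work (a sketch - check the summarytools vignette and st_options() for the exact rendering setup), combined with results = 'asis' in the chunk options:

descr(iris,
      stats = c("mean", "sd", "q1", "med", "q3"),
      transpose = TRUE,
      style = "rmarkdown")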
We could use descr from collapse
library(collapse)
descr(iris)
Your question is missing some important features, but I think you want something like this:
Example with just the numerical variables of the iris dataset:
iris_numerical <- iris[, 1:4]
Calculate statistics:
new_df <- sapply(iris_numerical, function(x) {
  c(mean = mean(x), SD = sd(x), Q1 = quantile(x, 0.25),
    median = median(x), Q3 = quantile(x, 0.75))
})
This gives you summary statistics column-wise
> new_df
Sepal.Length Sepal.Width Petal.Length Petal.Width
mean 5.8433333 3.0573333 3.758000 1.1993333
SD 0.8280661 0.4358663 1.765298 0.7622377
Q1.25% 5.1000000 2.8000000 1.600000 0.3000000
median 5.8000000 3.0000000 4.350000 1.3000000
Q3.75% 6.4000000 3.3000000 5.100000 1.8000000
Then create final dataframe in the desired format, with colnames as rownames:
new_df <- data.frame(column = colnames(new_df), apply(new_df, 1, function(x) x))
> new_df
column mean SD Q1.25. median Q3.75.
Sepal.Length Sepal.Length 5.843333 0.8280661 5.1 5.80 6.4
Sepal.Width Sepal.Width 3.057333 0.4358663 2.8 3.00 3.3
Petal.Length Petal.Length 3.758000 1.7652982 1.6 4.35 5.1
Petal.Width Petal.Width 1.199333 0.7622377 0.3 1.30 1.8
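If you then want some basic formatting/styling for a report (as mentioned at the end of the question), one simple option is to pass the result to knitr::kable(), e.g.:

knitr::kable(new_df, digits = 2, row.names = FALSE)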
I noticed that plot_models from the sjPlot package gives confidence intervals based on the naive standard errors. I want it to use the robust SEs. Is there a simple fix?
Currently, sjPlot does not support this option, however, it is planned for a forthcoming update. sjPlot uses the parameters package to compute model parameters - if you don't mind updating the parameters package from GitHub (and installing the see package), you can already use this feature:
library(parameters)
library(gee)
data(warpbreaks)
model <- gee(breaks ~ tension, id = wool, data = warpbreaks)
#> Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) tensionM tensionH
#> 36.38889 -10.00000 -14.72222
mp <- model_parameters(model)
mp
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> ------------------------------------------------------------------------
#> (Intercept) | 36.39 | 2.80 | [ 30.90, 41.88] | 12.99 | 51 | < .001
#> tension [M] | -10.00 | 3.96 | [-17.76, -2.24] | -2.53 | 51 | 0.015
#> tension [H] | -14.72 | 3.96 | [-22.48, -6.96] | -3.72 | 51 | < .001
plot(mp)
mp <- model_parameters(model, robust = TRUE)
mp
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> ------------------------------------------------------------------------
#> (Intercept) | 36.39 | 5.77 | [ 25.07, 47.71] | 6.30 | 51 | < .001
#> tension [M] | -10.00 | 7.46 | [-24.63, 4.63] | -1.34 | 51 | 0.186
#> tension [H] | -14.72 | 3.73 | [-22.04, -7.41] | -3.94 | 51 | < .001
plot(mp)
Created on 2019-12-23 by the reprex package (v0.3.0)
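For reference, one way to get the development version of parameters and the see package mentioned above (assuming the remotes package is installed):

# install the GitHub version of parameters plus the see package
remotes::install_github("easystats/parameters")
install.packages("see")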
I'm using skimr, and I added two summary functions (iqr_na_rm and median_na_rm) to the list of summary functions for the function skim. However, by default these new summary functions (called skimmers in skimr documentation) appear at the end of the table. Instead, I'd like median and iqr to appear after mean and sd.
The final goal is to show the results in a .Rmd report like this:
---
title: "Test"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE,
                      echo = FALSE)
```
## Test
```{r test, results = 'asis'}
library(skimr)
library(dplyr)
library(ggplot2)  # msleep comes from ggplot2
library(knitr)    # for kable()

iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)

skim_with(numeric = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm),
          integer = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm))

msleep %>%
  group_by(vore) %>%
  skim(sleep_total) %>%
  kable()
```
Rendered HTML: [screenshot of the rendered table, with median and iqr appearing as the last columns]
As you can see, median and iqr are printed at the end of the table, after the sparkline histogram. I'd like them to be printed after sd and before p0. Is it possible?
There are two parts in the skim() output. If you want to control the numeric part, you can use skim_to_list() like this; it also makes the result easier to export in another format.
msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist)
# A tibble: 5 x 14
vore variable missing complete n mean sd median iqr p0 p25 p75 p100 hist
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 carni sleep_total 0 19 19 10.38 4.67 10.4 " 6.75" 2.7 6.25 "13 " 19.4 ▃▇▂▇▆▃▂▃
2 herbi sleep_total 0 32 32 " 9.51" 4.88 10.3 " 9.92" 1.9 "4.3 " 14.22 16.6 ▆▇▁▂▂▆▇▅
3 insecti sleep_total 0 5 5 14.94 5.92 18.1 "11.1 " 8.4 "8.6 " "19.7 " 19.9 ▇▁▁▁▁▁▃▇
4 omni sleep_total 0 20 20 10.93 2.95 " 9.9" " 1.83" "8 " "9.1 " 10.93 "18 " ▆▇▂▁▁▁▁▂
5 NA sleep_total 0 7 7 10.19 "3 " 10.6 " 3.5 " 5.4 8.65 12.15 13.7 ▃▃▁▁▃▇▁▇
EDIT
Adding kable() as requested in comment.
msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist) %>%
  kable()
| vore | variable | missing | complete | n | mean | sd | median | iqr | p0 | p25 | p75 | p100 | hist |
|---------|-------------|---------|----------|----|-------|------|--------|------|-----|------|-------|------|----------|
| carni | sleep_total | 0 | 19 | 19 | 10.38 | 4.67 | 10.4 | 6.75 | 2.7 | 6.25 | 13 | 19.4 | ▃▇▂▇▆▃▂▃ |
| herbi | sleep_total | 0 | 32 | 32 | 9.51 | 4.88 | 10.3 | 9.92 | 1.9 | 4.3 | 14.22 | 16.6 | ▆▇▁▂▂▆▇▅ |
| insecti | sleep_total | 0 | 5 | 5 | 14.94 | 5.92 | 18.1 | 11.1 | 8.4 | 8.6 | 19.7 | 19.9 | ▇▁▁▁▁▁▃▇ |
| omni | sleep_total | 0 | 20 | 20 | 10.93 | 2.95 | 9.9 | 1.83 | 8 | 9.1 | 10.93 | 18 | ▆▇▂▁▁▁▁▂ |
| NA | sleep_total | 0 | 7 | 7 | 10.19 | 3 | 10.6 | 3.5 | 5.4 | 8.65 | 12.15 | 13.7 | ▃▃▁▁▃▇▁▇ |
Here's another option, which uses append = FALSE in skim_with().
library(skimr)
library(dplyr)
library(ggplot2)
iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)
my_skimmers <- list(n = length, missing = n_missing, complete = n_complete,
                    mean = mean.default, sd = purrr::partial(sd, na.rm = TRUE),
                    median = median_na_rm, iqr = iqr_na_rm)

skim_with(numeric = my_skimmers,
          integer = my_skimmers,
          append = FALSE)
msleep %>%
  group_by(vore) %>%
  skim(sleep_total) %>%
  kable()
I didn't put all the stats but you can look in the functions.R and stats.R files to see how the various statistics are defined.
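If you do want the remaining statistics back, in that order, a hypothetical extension of my_skimmers (untested; it assumes the skimr v1 API, where n_missing, n_complete and inline_hist are exported and the quantile skimmers are built with purrr::partial()) could look like this:

my_skimmers <- list(
  n = length, missing = n_missing, complete = n_complete,
  mean = mean.default, sd = purrr::partial(sd, na.rm = TRUE),
  median = median_na_rm, iqr = iqr_na_rm,
  p0   = purrr::partial(quantile, probs = 0,    na.rm = TRUE),
  p25  = purrr::partial(quantile, probs = 0.25, na.rm = TRUE),
  p75  = purrr::partial(quantile, probs = 0.75, na.rm = TRUE),
  p100 = purrr::partial(quantile, probs = 1,    na.rm = TRUE),
  hist = inline_hist
)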