Comparing unique values across multiple data frames in R

I have 3 data frames (df1,df2,df3), each with 1274 rows and 2192 columns. I want to count the number of occurrences when the value of a cell matches 0.968 in df1, 0.972 in df2 and 0.909 in df3. Note that the cells have to be in the exact same location (same row and column number).
Example:
df1
| 0.968 | 0.526 |
| 0.938 | 0.632 |
| 0.873 | 0.968 |
df2
| 0.342 | 0.972 |
| 0.545 | 0.231 |
| 0.434 | 0.972 |
df3
| 0.673 | 0.812 |
| 0.128 | 0.764 |
| 0.909 | 0.909 |
The answer should return: 1 (only the cell in row 3, column 2 matches in all three data frames).
Is using a loop the best option to solve this?

You can try the code below:
sum(df1 == 0.968 & df2 == 0.972 & df3 == 0.909)
If you would like to index the TRUE values, you can use which():
which(df1 == 0.968 & df2 == 0.972 & df3 == 0.909, arr.ind = TRUE)

Another possible solution:
sum((df1 == 0.968) * (df2 == 0.972) * (df3 == 0.909))
#> [1] 1
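A caution worth adding: if these values are computed rather than stored exactly, == can miss matches due to floating-point representation. A minimal sketch of a tolerance-based variant that also scales to any number of data frames (the tol value and the list layout are assumptions to adapt):
tol <- 1e-8                            # tolerance is an arbitrary choice here
targets <- c(0.968, 0.972, 0.909)
dfs <- list(df1, df2, df3)
# one logical matrix per data frame, then AND them together cellwise
hits <- Reduce(`&`, Map(function(d, v) abs(as.matrix(d) - v) < tol, dfs, targets))
sum(hits)                              # number of positions matching in all three
which(hits, arr.ind = TRUE)            # row/column indices of those positions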

Computing marginal effects: Why do ggeffect and ggemmeans give different answers?

Example
library(glmmTMB)
library(ggeffects)
## Zero-inflated negative binomial model
(m <- glmmTMB(count ~ spp + mined + (1 | site),
              ziformula = ~ spp + mined,
              family = nbinom2,
              data = Salamanders,
              na.action = "na.fail"))
summary(m)
ggemmeans(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.11 | [0.66, 1.86]
PR | 0.42 | [0.11, 1.59]
DM | 1.32 | [0.81, 2.13]
EC-A | 0.75 | [0.37, 1.53]
EC-L | 1.81 | [1.09, 3.00]
DES-L | 2.00 | [1.25, 3.21]
DF | 0.99 | [0.61, 1.62]
ggeffects::ggeffect(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.14 | [0.69, 1.90]
PR | 0.44 | [0.12, 1.63]
DM | 1.36 | [0.85, 2.18]
EC-A | 0.78 | [0.39, 1.57]
EC-L | 1.87 | [1.13, 3.07]
DES-L | 2.06 | [1.30, 3.28]
DF | 1.02 | [0.63, 1.65]
Questions
Why are ggeffect and ggemmeans giving different results for the marginal effects? Is it simply something internal with how the packages emmeans and effects are computing them? Also, does anyone know of some resources on how to compute marginal effects from scratch for a model like that in the example?
You fit a complex model: zero-inflated negative binomial model with random effects.
What you observe has little to do with the model specification. Let's show this by fitting a simpler model: Poisson with fixed effects only.
library("glmmTMB")
library("ggeffects")
m <- glmmTMB(
  count ~ spp + mined,
  family = poisson,
  data = Salamanders
)
ggemmeans(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.73 | [0.59, 0.89]
#> PR | 0.18 | [0.12, 0.27]
#> DM | 0.91 | [0.76, 1.10]
#> EC-A | 0.34 | [0.25, 0.45]
#> EC-L | 1.35 | [1.15, 1.59]
#> DES-L | 1.43 | [1.22, 1.68]
#> DF | 0.79 | [0.64, 0.96]
ggeffect(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.76 | [0.62, 0.93]
#> PR | 0.19 | [0.13, 0.28]
#> DM | 0.96 | [0.79, 1.15]
#> EC-A | 0.35 | [0.26, 0.47]
#> EC-L | 1.41 | [1.20, 1.66]
#> DES-L | 1.50 | [1.28, 1.75]
#> DF | 0.82 | [0.67, 1.00]
The documentation explains that internally ggemmeans() calls emmeans::emmeans() while ggeffect() calls effects::Effect().
Both emmeans and effects compute marginal effects, but they make a different (default) choice of how to marginalize out (i.e., average over) mined in order to get the effect of spp.
mined is a categorical variable with two levels: "yes" and "no". The crucial bit is that the two levels are not balanced: there are slightly more "no"s than "yes"s.
xtabs(~ mined + spp, data = Salamanders)
#> spp
#> mined GP PR DM EC-A EC-L DES-L DF
#> yes 44 44 44 44 44 44 44
#> no 48 48 48 48 48 48 48
Intuitively, this means that the weighted average over mined [think of (44 × yes + 48 × no) / 92] is not the same as the simple average over mined [think of (yes + no) / 2].
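To make that concrete with invented numbers (a hypothetical species whose predictions are 0.5 under "yes" and 1.0 under "no"):
pred_yes <- 0.5; pred_no <- 1.0        # hypothetical predictions, for illustration only
(44 * pred_yes + 48 * pred_no) / 92    # proportional (weighted) average: ~0.761
(pred_yes + pred_no) / 2               # equal (simple) average: 0.75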
Let's check the intuition by specifying how to marginalize out mined when we call emmeans::emmeans() directly.
# mean (default)
emmeans::emmeans(m, "spp", type = "response", weights = "equal")
#> spp rate SE df lower.CL upper.CL
#> GP 0.726 0.0767 636 0.590 0.893
#> PR 0.181 0.0358 636 0.123 0.267
#> DM 0.914 0.0879 636 0.757 1.104
#> EC-A 0.336 0.0497 636 0.251 0.449
#> EC-L 1.351 0.1120 636 1.148 1.590
#> DES-L 1.432 0.1163 636 1.221 1.679
#> DF 0.786 0.0804 636 0.643 0.961
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
# weighted mean
emmeans::emmeans(m, "spp", type = "response", weights = "proportional")
#> spp rate SE df lower.CL upper.CL
#> GP 0.759 0.0794 636 0.618 0.932
#> PR 0.190 0.0373 636 0.129 0.279
#> DM 0.955 0.0909 636 0.793 1.152
#> EC-A 0.351 0.0517 636 0.263 0.469
#> EC-L 1.412 0.1153 636 1.203 1.658
#> DES-L 1.496 0.1196 636 1.279 1.751
#> DF 0.822 0.0832 636 0.674 1.003
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
The first option matches ggemmeans(); the second option returns the marginal effects computed by ggeffects::ggeffect().
Update
@Daniel points out that ggeffects accepts the weights argument and will pass it to emmeans. This way you can keep using ggeffects and still control how predictions are averaged to compute marginal effects.
Try it out for yourself with:
ggemmeans(m, terms="spp", weights = "proportional")
ggemmeans(m, terms="spp", weights = "equal")

Reproducing a result from R in Stata - Telling R or Stata to remove the same variables causing perfect collinearity/singularities

I am trying to reproduce a result from R in Stata (Please note that the data below is fictitious and serves just as an example). For some reason however, Stata appears to deal with certain issues differently than R. It chooses different dummy variables to kick out in case of multicollinearity.
I have posted a related question dealing with the statistical implications of these country-year dummies being removed here.
In the example below, R kicks out 3, while Stata kicks out 2, leading to a different result. Check for example the coefficients and p-values for vote and votewon.
In essence, all I want to know is how to communicate to either R or Stata, which variables to kick out, so that they both do the same.
Data
The data looks as follows:
library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)
data('mroz')
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
DT <- data.table(country = rep(sample(country, length(mroz), replace = TRUE), each = n),
                 year = c(replicate(length(mroz), sample(year, n))))
x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, x)
mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)
DT <- DT %>%
  group_by(country) %>%
  mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
  group_by(country, year) %>%
  mutate(taxrate = base_rate + as.integer(runif(1, -2.5, +2.5)))
DT <- DT %>%
  group_by(country, year) %>%
  mutate(vote = sample(c(0, 1), 1),
         votewon = ifelse(vote == 1, sample(c(0, 1), 1), 0))
rm(mroz,x, country, year)
The lm regression in R
summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data=DT))
Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
data = DT)
Residuals:
Min 1Q Median 3Q Max
-7.450 -0.805 -0.268 0.954 5.332
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.170064 0.418578 26.69 < 0.0000000000000002 ***
exper 0.103880 0.029912 3.47 0.00055 ***
I(exper^2) -0.002965 0.000966 -3.07 0.00222 **
vote 0.576865 0.504540 1.14 0.25327
votewon 0.622522 0.636241 0.98 0.32818
countryA:as.factor(year)2005 -0.196348 0.503245 -0.39 0.69653
countryB:as.factor(year)2005 -0.530681 0.616653 -0.86 0.38975
countryC:as.factor(year)2005 0.650166 0.552019 1.18 0.23926
countryD:as.factor(year)2005 -0.515195 0.638060 -0.81 0.41968
countryE:as.factor(year)2005 0.731681 0.502807 1.46 0.14605
countryG:as.factor(year)2005 0.213345 0.674642 0.32 0.75192
countryH:as.factor(year)2005 -0.811374 0.637254 -1.27 0.20334
countryI:as.factor(year)2005 0.584787 0.503606 1.16 0.24594
countryJ:as.factor(year)2005 0.554397 0.674789 0.82 0.41158
countryA:as.factor(year)2010 0.388603 0.503358 0.77 0.44035
countryB:as.factor(year)2010 -0.727834 0.617210 -1.18 0.23869
countryC:as.factor(year)2010 -0.308601 0.504041 -0.61 0.54056
countryD:as.factor(year)2010 0.785603 0.503165 1.56 0.11888
countryE:as.factor(year)2010 0.280305 0.452293 0.62 0.53562
countryG:as.factor(year)2010 0.672074 0.674721 1.00 0.31954
countryH:as.factor(year)2010 NA NA NA NA
countryI:as.factor(year)2010 NA NA NA NA
countryJ:as.factor(year)2010 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared: 0.037, Adjusted R-squared: 0.0119
F-statistic: 1.47 on 19 and 728 DF, p-value: 0.0882
Same regression in Stata
write.dta(DT, "C:/Users/.../mroz_adapted.dta")
encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year
note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity
Source | SS df MS Number of obs = 748
-------------+---------------------------------- F(21, 726) = 1.80
Model | 192.989406 21 9.18997171 Prob > F = 0.0154
Residual | 3705.47583 726 5.1039612 R-squared = 0.0495
-------------+---------------------------------- Adj R-squared = 0.0220
Total | 3898.46524 747 5.21882897 Root MSE = 2.2592
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .1109858 .0297829 3.73 0.000 .052515 .1694567
|
c.exper#c.exper | -.0031891 .000963 -3.31 0.001 -.0050796 -.0012986
|
vote | .0697273 .4477115 0.16 0.876 -.8092365 .9486911
votewon | -.0147825 .6329659 -0.02 0.981 -1.257445 1.227879
|
n_country#year |
A#2010 | .0858634 .4475956 0.19 0.848 -.7928728 .9645997
B#2005 | -.4950677 .5003744 -0.99 0.323 -1.477421 .4872858
B#2010 | .0951657 .5010335 0.19 0.849 -.8884818 1.078813
C#2005 | -.5162827 .447755 -1.15 0.249 -1.395332 .3627664
C#2010 | -.0151834 .4478624 -0.03 0.973 -.8944434 .8640767
D#2005 | .3664596 .5008503 0.73 0.465 -.6168283 1.349747
D#2010 | .5119858 .500727 1.02 0.307 -.4710599 1.495031
E#2005 | .5837942 .6717616 0.87 0.385 -.7350329 1.902621
E#2010 | .185601 .5010855 0.37 0.711 -.7981486 1.169351
F#2005 | .5987978 .6333009 0.95 0.345 -.6445219 1.842117
F#2010 | .4853639 .7763936 0.63 0.532 -1.038881 2.009608
G#2005 | -.3341302 .6328998 -0.53 0.598 -1.576663 .9084021
G#2010 | .2873193 .6334566 0.45 0.650 -.956306 1.530945
H#2005 | -.4365233 .4195984 -1.04 0.299 -1.260294 .3872479
H#2010 | -.1683725 .6134262 -0.27 0.784 -1.372673 1.035928
I#2005 | -.39264 .7755549 -0.51 0.613 -1.915238 1.129958
I#2010 | 0 (omitted)
J#2005 | 1.036108 .4476018 2.31 0.021 .1573591 1.914856
J#2010 | 0 (omitted)
|
_cons | 11.58369 .350721 33.03 0.000 10.89514 12.27224
---------------------------------------------------------------------------------
Just for your question about which "variables to kick out": I guess you meant which combination of interaction terms is used as the reference group for calculating the regression coefficients.
By default, Stata uses the combination of the lowest values of the two variables as the reference, while R uses the combination of the highest values. I use Stata's auto data to demonstrate this:
# In R
webuse::webuse("auto")
auto$foreign = as.factor(auto$foreign)
auto$rep78 = as.factor(auto$rep78)
# Model
r_model <- lm(mpg ~ rep78:foreign, data=auto)
broom::tidy(r_model)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.3 1.65 15.9 2.09e-23
2 rep781:foreign0 -5.33 3.88 -1.38 1.74e- 1
3 rep782:foreign0 -7.21 2.41 -2.99 4.01e- 3
4 rep783:foreign0 -7.33 1.91 -3.84 2.94e- 4
5 rep784:foreign0 -7.89 2.34 -3.37 1.29e- 3
6 rep785:foreign0 5.67 3.88 1.46 1.49e- 1
7 rep781:foreign1 NA NA NA NA
8 rep782:foreign1 NA NA NA NA
9 rep783:foreign1 -3.00 3.31 -0.907 3.68e- 1
10 rep784:foreign1 -1.44 2.34 -0.618 5.39e- 1
11 rep785:foreign1 NA NA NA NA
In Stata:
. reg mpg i.foreign#i.rep78
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | -1.875 3.921166 -0.48 0.634 -9.715855 5.965855
Domestic#3 | -2 3.634773 -0.55 0.584 -9.268178 5.268178
Domestic#4 | -2.555556 3.877352 -0.66 0.512 -10.3088 5.19769
Domestic#5 | 11 4.959926 2.22 0.030 1.082015 20.91798
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 2.333333 4.527772 0.52 0.608 -6.720507 11.38717
Foreign#4 | 3.888889 3.877352 1.00 0.320 -3.864357 11.64213
Foreign#5 | 5.333333 3.877352 1.38 0.174 -2.419912 13.08658
|
_cons | 21 3.507197 5.99 0.000 13.98693 28.01307
-------------------------------------------------------------------------------
To reproduce the previous R result in Stata, we can recode those two variables foreign and rep78 (here as foreign2 and rep2, with the value order reversed):
. reg mpg i.foreign2#i.rep2
note: 0b.foreign2#1.rep2 identifies no observations in the sample
note: 0b.foreign2#2.rep2 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign2#rep2 |
0 1 | 0 (empty)
0 2 | 0 (empty)
0 3 | -3 3.306617 -0.91 0.368 -9.61199 3.61199
0 4 | -1.444444 2.338132 -0.62 0.539 -6.119827 3.230938
1 0 | 5.666667 3.877352 1.46 0.149 -2.086579 13.41991
1 1 | -5.333333 3.877352 -1.38 0.174 -13.08658 2.419912
1 2 | -7.208333 2.410091 -2.99 0.004 -12.02761 -2.389059
1 3 | -7.333333 1.909076 -3.84 0.000 -11.15077 -3.515899
1 4 | -7.888889 2.338132 -3.37 0.001 -12.56427 -3.213506
|
_cons | 26.33333 1.653309 15.93 0.000 23.02734 29.63933
-------------------------------------------------------------------------------
The same approach applies to reproduce Stata results in R: just redefine the levels of those two factor variables, as in the sketch below.
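A hedged sketch of that redefinition in R, using the auto example from above (reversing the factor level order so that R's implicit baseline moves to Stata's lowest-value combination; verify the aliased terms against your own output):
auto$rep78   <- factor(auto$rep78,   levels = rev(levels(auto$rep78)))
auto$foreign <- factor(auto$foreign, levels = rev(levels(auto$foreign)))
# refit: the combinations R drops as aliased (NA) should now mirror Stata's
broom::tidy(lm(mpg ~ rep78:foreign, data = auto))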

compute k-means after PCA

I'm new to R and I want to do k-means clustering based on the results of a PCA. I did it like this (taking the iris dataset as an example):
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- iris %>%
  select(-Species)
# compute PCA
res.pca <- PCA(df,
               scale.unit = TRUE,
               graph = FALSE)
summary(res.pca)
# k-means clustering
kc <- kmeans(res.pca, 3)
Then I got an error: Error in storage.mode(x) <- "double" : a list cannot be converted automatically into a 'double'.
The output of the PCA is:
> res.pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 150 individuals, described by 4 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
>
> summary(res.pca)
Call:
PCA(X = df, scale.unit = TRUE, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 2.918 0.914 0.147 0.021
% of var. 72.962 22.851 3.669 0.518
Cumulative % of var. 72.962 95.813 99.482 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 2.319 | -2.265 1.172 0.954 | 0.480 0.168 0.043 | -0.128 0.074 0.003 |
2 | 2.202 | -2.081 0.989 0.893 | -0.674 0.331 0.094 | -0.235 0.250 0.011 |
3 | 2.389 | -2.364 1.277 0.979 | -0.342 0.085 0.020 | 0.044 0.009 0.000 |
4 | 2.378 | -2.299 1.208 0.935 | -0.597 0.260 0.063 | 0.091 0.038 0.001 |
5 | 2.476 | -2.390 1.305 0.932 | 0.647 0.305 0.068 | 0.016 0.001 0.000 |
6 | 2.555 | -2.076 0.984 0.660 | 1.489 1.617 0.340 | 0.027 0.003 0.000 |
7 | 2.468 | -2.444 1.364 0.981 | 0.048 0.002 0.000 | 0.335 0.511 0.018 |
8 | 2.246 | -2.233 1.139 0.988 | 0.223 0.036 0.010 | -0.089 0.036 0.002 |
9 | 2.592 | -2.335 1.245 0.812 | -1.115 0.907 0.185 | 0.145 0.096 0.003 |
10 | 2.249 | -2.184 1.090 0.943 | -0.469 0.160 0.043 | -0.254 0.293 0.013 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Sepal.Length | 0.890 27.151 0.792 | 0.361 14.244 0.130 | -0.276 51.778 0.076 |
Sepal.Width | -0.460 7.255 0.212 | 0.883 85.247 0.779 | 0.094 5.972 0.009 |
Petal.Length | 0.992 33.688 0.983 | 0.023 0.060 0.001 | 0.054 2.020 0.003 |
Petal.Width | 0.965 31.906 0.931 | 0.064 0.448 0.004 | 0.243 40.230 0.059 |
Could somebody help me with this problem? What should I put instead of res.pca in kmeans()? I don't know which part of the PCA results I should extract to use in the function kmeans().
Thank you in advance.
The principal component scores are stored under res.pca$ind$coord, and that is what you want to run kmeans on. So we can do:
kc <- kmeans(res.pca$ind$coord, 3)
plot(res.pca$ind$coord[,1:2],col=factor(kc$cluster))
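If you also want a quick visual check, factoextra (already loaded above) can plot the clusters on the first two components; a small sketch, with set.seed() and nstart added for reproducibility and stability:
set.seed(123)                                       # k-means uses random starts
kc <- kmeans(res.pca$ind$coord, centers = 3, nstart = 25)
fviz_cluster(kc, data = res.pca$ind$coord, geom = "point")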
It seems kmeans() expects a numeric matrix as input; however, you are giving it res.pca, which is a list. Thus you get the error "cannot convert object of type list to double" ("double" is R's type for matrices and vectors of pure numbers).
I'm not sure about what the PCA function outputs, so you must find a way to extract the PCA values from it, make them a matrix, and then run kmeans().
Hope it helps.
But for future reference, you can do a few things to make your questions easier to answer:
Provide a reproducible example (a df with a few lines)
Translate error messages to English
Add the packages the function is from

Using Robust SE's for plot_models from package gee

I noticed that plot_models from package sjPlot gives confidence intervals based on the naive standard errors. I want it to use the robust SEs. Is there a simple fix?
Currently, sjPlot does not support this option; however, it is planned for a forthcoming update. sjPlot uses the parameters package to compute model parameters, so if you don't mind updating the parameters package from GitHub (and installing the see package), you can already use this feature:
library(parameters)
library(gee)
data(warpbreaks)
model <- gee(breaks ~ tension, id = wool, data = warpbreaks)
#> Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) tensionM tensionH
#> 36.38889 -10.00000 -14.72222
mp <- model_parameters(model)
mp
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> ------------------------------------------------------------------------
#> (Intercept) | 36.39 | 2.80 | [ 30.90, 41.88] | 12.99 | 51 | < .001
#> tension [M] | -10.00 | 3.96 | [-17.76, -2.24] | -2.53 | 51 | 0.015
#> tension [H] | -14.72 | 3.96 | [-22.48, -6.96] | -3.72 | 51 | < .001
plot(mp)
mp <- model_parameters(model, robust = TRUE)
mp
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> ------------------------------------------------------------------------
#> (Intercept) | 36.39 | 5.77 | [ 25.07, 47.71] | 6.30 | 51 | < .001
#> tension [M] | -10.00 | 7.46 | [-24.63, 4.63] | -3.94 | 51 | 0.186
#> tension [H] | -14.72 | 3.73 | [-22.04, -7.41] | -1.34 | 51 | < .001
plot(mp)
Created on 2019-12-23 by the reprex package (v0.3.0)
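If you only need the numbers rather than the plot, note that gee's summary() already carries the robust (sandwich) standard errors, so Wald-type CIs can be rebuilt by hand and cross-checked against model_parameters(); a sketch assuming the usual summary.gee column names:
cf  <- summary(model)$coefficients    # includes "Naive S.E." and "Robust S.E." columns
est <- cf[, "Estimate"]
se  <- cf[, "Robust S.E."]
cbind(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)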

Change the order in which summary functions are printed by skim

I'm using skimr, and I added two summary functions (iqr_na_rm and median_na_rm) to the list of summary functions for the function skim. However, by default these new summary functions (called skimmers in skimr documentation) appear at the end of the table. Instead, I'd like median and iqr to appear after mean and sd.
The final goal is to show the results in a .Rmd report like this:
---
title: "Test"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
message = FALSE,
echo = FALSE)
```
## Test
```{r test, results = 'asis'}
library(skimr)
library(dplyr)
library(ggplot2)
iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)
skim_with(numeric = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm),
          integer = list(p50 = NULL, median = median_na_rm, iqr = iqr_na_rm))
msleep %>%
group_by(vore) %>%
skim(sleep_total) %>%
kable()
```
Rendered HTML:
As you can see, median and iqr are printed at the end of the table, after the sparkline histogram. I'd like them to be printed after sd and before p0. Is it possible?
There are two parts in the skim() output. If you want to control the numeric part, you can use skim_to_list(), like this. It also makes the result easier to export to another format.
msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist)
# A tibble: 5 x 14
vore variable missing complete n mean sd median iqr p0 p25 p75 p100 hist
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 carni sleep_total 0 19 19 10.38 4.67 10.4 " 6.75" 2.7 6.25 "13 " 19.4 ▃▇▂▇▆▃▂▃
2 herbi sleep_total 0 32 32 " 9.51" 4.88 10.3 " 9.92" 1.9 "4.3 " 14.22 16.6 ▆▇▁▂▂▆▇▅
3 insecti sleep_total 0 5 5 14.94 5.92 18.1 "11.1 " 8.4 "8.6 " "19.7 " 19.9 ▇▁▁▁▁▁▃▇
4 omni sleep_total 0 20 20 10.93 2.95 " 9.9" " 1.83" "8 " "9.1 " 10.93 "18 " ▆▇▂▁▁▁▁▂
5 NA sleep_total 0 7 7 10.19 "3 " 10.6 " 3.5 " 5.4 8.65 12.15 13.7 ▃▃▁▁▃▇▁▇
EDIT
Adding kable() as requested in the comments.
msleep %>%
  group_by(vore) %>%
  skim_to_list(sleep_total) %>%
  .[["numeric"]] %>%
  dplyr::select(vore, variable, missing, complete, n, mean, sd,
                median, iqr, p0, p25, p75, p100, hist) %>%
  kable()
| vore | variable | missing | complete | n | mean | sd | median | iqr | p0 | p25 | p75 | p100 | hist |
|---------|-------------|---------|----------|----|-------|------|--------|------|-----|------|-------|------|----------|
| carni | sleep_total | 0 | 19 | 19 | 10.38 | 4.67 | 10.4 | 6.75 | 2.7 | 6.25 | 13 | 19.4 | ▃▇▂▇▆▃▂▃ |
| herbi | sleep_total | 0 | 32 | 32 | 9.51 | 4.88 | 10.3 | 9.92 | 1.9 | 4.3 | 14.22 | 16.6 | ▆▇▁▂▂▆▇▅ |
| insecti | sleep_total | 0 | 5 | 5 | 14.94 | 5.92 | 18.1 | 11.1 | 8.4 | 8.6 | 19.7 | 19.9 | ▇▁▁▁▁▁▃▇ |
| omni | sleep_total | 0 | 20 | 20 | 10.93 | 2.95 | 9.9 | 1.83 | 8 | 9.1 | 10.93 | 18 | ▆▇▂▁▁▁▁▂ |
| NA | sleep_total | 0 | 7 | 7 | 10.19 | 3 | 10.6 | 3.5 | 5.4 | 8.65 | 12.15 | 13.7 | ▃▃▁▁▃▇▁▇ |
Here's another approach, which uses the append = FALSE option.
library(skimr)
library(dplyr)
library(ggplot2)
iqr_na_rm <- function(x) IQR(x, na.rm = TRUE)
median_na_rm <- function(x) median(x, na.rm = TRUE)
my_skimmers <- list(n = length, missing = n_missing, complete = n_complete,
                    mean = mean.default, sd = purrr::partial(sd, na.rm = TRUE),
                    median = median_na_rm, iqr = iqr_na_rm)
skim_with(numeric = my_skimmers,
          integer = my_skimmers, append = FALSE)
msleep %>%
  group_by(vore) %>%
  skim(sleep_total) %>%
  kable()
I didn't put all the stats but you can look in the functions.R and stats.R files to see how the various statistics are defined.
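One caveat: skim_with() changes skimr's global state, so later skim() calls in the same session keep these skimmers. With the (older) skimr API used here, the defaults can be restored afterwards; treat the function name as an assumption if your version differs:
skim_with_defaults()   # restore skimr's default skimmer lists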
