I want to create a gt table where I see some metrics like number of observations, mean and median, and I want a column with its histogram. For this question I will use the iris dataset.
I have recently learned how to put a plot in a tibble using this code:
library(dplyr)
library(tidyr)
library(purrr)
library(gt)
library(ggplot2)
my_tibble <- iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2),
plots = list(ggplot(cur_data(), aes(Values)) + geom_histogram()))
Now I want to use the plots column to draw a histogram per variable, so I have tried this:
my_tibble %>%
mutate(ggplot = NA) %>%
gt() %>%
text_transform(
locations = cells_body(vars(ggplot)),
fn = function(x) {
map(.$plots,ggplot_image)
}
)
But it returns an error:
Error in body[[col]][stub_df$rownum_i %in% loc$rows] <- fn(body[[col]][stub_df$rownum_i %in% :
replacement has length zero
The gt table should be like this:
Any help will be greatly appreciated.
After reviewing the excellent ideas from @akrun and @TarJae, I have this solution that gives the required gt table:
plots <- iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
nest() %>%
mutate(plot = map(data,
function(df) df %>%
ggplot(aes(Values)) +
geom_histogram())) %>%
select(-data)
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2)) %>%
mutate(ggplot = NA) %>%
gt() %>%
text_transform(
locations = cells_body(columns = ggplot),
fn = function(x) {
map(plots$plot, ggplot_image, height = px(100))
}
)
And this is the table:
I had to create the plot outside the output table, so I could call it in the gt table.
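One likely explanation for the original error, assuming magrittr's usual placeholder rules: inside the `text_transform()` call, `.` is the left-hand side of the last pipe, which is the gt object rather than the tibble. `.$plots` is then `NULL`, `map()` returns an empty list, and there is nothing to assign to the cells. A minimal sketch without gt, where `tbl_like` is a hypothetical stand-in for the gt object:

```r
library(magrittr)
library(purrr)

# Hypothetical stand-in for a gt object: a list with no `plots` element.
tbl_like <- list(`_data` = data.frame(x = 1:3))

# Inside the braces, `.` is `tbl_like`, so `.$plots` is NULL.
res <- tbl_like %>% { map(.$plots, identity) }

length(res)  # an empty list, which matches "replacement has length zero"
```

This is why both working answers refer to the plots through a named object (`plots$plot` or `dat$plots`) instead of `.`.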
We need to loop over the plots
library(dplyr)
library(tidyr)
library(purrr)
library(gt)
library(ggplot2)
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
nest_by(Vars) %>%
mutate(n = nrow(data),
mean = round(mean(data$Values), 2),
median = round(median(data$Values), 2),
plots = list(ggplot(data, aes(Values)) + geom_histogram()), .keep = "unused") %>%
ungroup() %>%
mutate(ggplot = NA) %>%
{dat <- .
dat %>%
select(-plots) %>%
gt() %>%
text_transform(locations = cells_body(c(ggplot)),
fn = function(x) {
map(dat$plots, ggplot_image, height = px(100))
}
)
}
Then check the output.
Update (see comments):
Since you want to use this in a Shiny app, you could use summarytools; see https://cran.r-project.org/web/packages/summarytools/vignettes/introduction.html
It is compatible with Shiny.
Here is a small example:
library(summarytools)
dfSummary(iris,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
view(dfSummary(iris))
Try this:
library(skimr)
skim(iris)
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length          0             1  5.84 0.828   4.3   5.1   5.8   6.4   7.9 ▆▇▇▅▂
2 Sepal.Width           0             1  3.06 0.436     2   2.8     3   3.3   4.4 ▁▆▇▂▁
3 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂
4 Petal.Width           0             1  1.20 0.762   0.1   0.3   1.3   1.8   2.5 ▇▁▇▅▃
Related
I want to do something like
df1 <- iris %>% distinct(Species, .keep_all = TRUE) %>% group_by(Petal.Width) %>% summarise(Sepal.Length.mean1=mean(Sepal.Length), .groups = "drop")
df2 <- iris %>% distinct(Species, Petal.Width, .keep_all = TRUE) %>% group_by(Petal.Width) %>% summarise(Sepal.Length.mean2 =mean(Sepal.Length), .groups = "drop")
inner_join(df1, df2, by="Petal.Width")
But this is tedious to read because of the repetition. Is it possible to do all in one pipe? I cannot recover the initial dataset after distinct() so I wonder if there's a replacement to that.
A possible solution is to create first a function and then use it inside pipes:
library(tidyverse)
f <- function(df = iris, var1 = Species, var2 = Petal.Width,
var3 = Sepal.Length, i)
{
x <- enquo(var3)
df %>%
distinct({{var1}}, .keep_all = TRUE) %>% group_by({{var2}}) %>%
summarise(!!str_c(quo_name(x), ".mean", i , sep = "") := mean({{var3}}),
.groups = "drop")
}
inner_join(f(i = 1), f(i = 2), by="Petal.Width")
#> # A tibble: 3 × 3
#> Petal.Width Sepal.Length.mean1 Sepal.Length.mean2
#> <dbl> <dbl> <dbl>
#> 1 0.2 5.1 5.1
#> 2 1.4 7 7
#> 3 2.5 6.3 6.3
A workaround would be to use an expression with {}
Here is the beginning of the solution
iris %>% {
df1 <- distinct(., Species, .keep_all = TRUE)
df2 <- distinct(., Species, Petal.Width, .keep_all = TRUE)
list(df1, df2)} %>%
map(~ group_by(.x, Petal.Width)) # SOLUTION TO BE COMPLETED
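One way the sketch above could be completed, assuming purrr's `imap()` and `reduce()` are acceptable here. The helper `mean_by_width()` is hypothetical and not part of the original answer; it names the output column after the position of each data frame in the list.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: mean Sepal.Length per Petal.Width,
# stored in a column named Sepal.Length.mean<i>.
mean_by_width <- function(df, i) {
  out_name <- paste0("Sepal.Length.mean", i)
  df %>%
    group_by(Petal.Width) %>%
    summarise(!!out_name := mean(Sepal.Length), .groups = "drop")
}

res <- iris %>%
  {
    # Both de-duplicated versions are built from the same `.`,
    # so the original data never has to be "recovered".
    list(
      distinct(., Species, .keep_all = TRUE),
      distinct(., Species, Petal.Width, .keep_all = TRUE)
    )
  } %>%
  imap(mean_by_width) %>%                   # second argument is the list position (1, 2)
  reduce(inner_join, by = "Petal.Width")    # join the two summaries

res
```

Because the list is unnamed, `imap()` passes the integer position as the second argument, which is what gives the `mean1` and `mean2` suffixes.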
I cannot figure out why the bang-bang operator in my function is not unquoting my grp argument. Any help would be much appreciated!
library(dplyr)
test_func <- function(dat, grp){
dat %>%
group_by(!!grp) %>%
summarise(N = n())
}
test_func(dat = iris, grp = "Species")
Instead of grouping by species it just produces the summary for the entire data:
If we are passing a string, convert it to a symbol and then evaluate it with !!:
test_func <- function(dat, grp){
dat %>%
group_by(!! rlang::ensym(grp)) %>%
summarise(N = n(), .groups = 'drop')
}
Testing:
test_func(dat = iris, grp = "Species")
# A tibble: 3 x 2
# Species N
#* <fct> <int>
#1 setosa 50
#2 versicolor 50
#3 virginica 50
Another option is to use across() with all_of():
test_func <- function(dat, grp){
dat %>%
group_by(across(all_of(grp))) %>%
summarise(N = n(), .groups = 'drop')
}
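For context on why the original attempt returns a single row: `!!grp` inlines the string itself, so `group_by()` sees the constant `"Species"` rather than the column, and every row falls into the same group. A third option, sketched here on the assumption that a reasonably recent dplyr is installed, is to index the `.data` pronoun with the string:

```r
library(dplyr)

test_func <- function(dat, grp) {
  dat %>%
    group_by(.data[[grp]]) %>%            # look the column up by its string name
    summarise(N = n(), .groups = "drop")
}

res <- test_func(dat = iris, grp = "Species")
res
```

As with the `ensym()` and `across()` versions, this gives one row per species.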
How do I add frequency counts to a 2x2 prop.table? So here 'dataset' contains two categorical variables.
dataset %>% prop.table(margin = 2) %>% '*' (100) %>% round(2)
I would like the counts in addition to percentages of each category.
Sorry for the dopey example, but it should look like this, except the sum doesn't need to be reported in every cell.
A reproducible example and solution:
tab <-iris %>% mutate(size = factor(1+(Sepal.Length>median(iris$Sepal.Length)),levels = 1:2, labels = c('S','L'))) %>%
select(Species, size) %>%
table()
prop <- prop.table(tab,margin = 2) %>% '*' (100) %>% round(2)
matrix(paste(tab,prop),nrow = nrow(tab),dimnames = dimnames(tab))
gives
            size
Species      S         L
  setosa     "50 62.5" "0 0"
  versicolor "24 30"   "26 37.14"
  virginica  "6 7.5"   "44 62.86"
or another solution:
iris %>% mutate(size = factor(1+(Sepal.Length>median(iris$Sepal.Length)),levels = 1:2, labels = c('S','L'))) %>%
group_by(Species, size) %>%
summarise(n = n()) %>%
group_by(size) %>%
mutate(p = paste(n,round(n/sum(n)*100,2))) %>%
select(-n) %>%
spread(size,p,fill = paste(0,0))
gives
# A tibble: 3 x 3
Species S L
<fct> <chr> <chr>
1 setosa 50 62.5 0 0
2 versicolor 24 30 26 37.14
3 virginica 6 7.5 44 62.86
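Both approaches above paste the count and the column percentage into one cell. If a labelled format such as "count (percent%)" is preferred, a base R sketch with `sprintf()` could look like the following. The use of `mtcars` and the exact cell format are assumptions made for illustration only.

```r
# Counts and column percentages for two categorical variables in mtcars.
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
pct <- 100 * prop.table(tab, margin = 2)

# Format each cell as "count (percent%)"; as.vector() drops the table class
# so that sprintf() receives plain integer and numeric vectors.
out <- matrix(
  sprintf("%d (%.2f%%)", as.vector(tab), as.vector(pct)),
  nrow = nrow(tab),
  dimnames = dimnames(tab)
)

out
```

The result is a character matrix with the same row and column labels as the original table.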
addmargins applied to your table might do what you want.
set.seed(34)
n <- 20
tab <- table(sample(1:3, n, replace = TRUE), sample(c("A", "B"), n, replace = TRUE))
addmargins(tab)
I am using summarise_at() to obtain the mean and standard error of multiple variables by group.
The output has 1 row for each group, and 1 column for each calculated quantity, per group. I'd like to have a table with 1 row for each variable, and 1 column for each calculated quantity:
data <- mtcars
data$condition <- as.factor(c(rep("control", 16), rep("treat", 16)))
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt),
funs(mean = mean, se=sd(.)/sqrt(n())))
# A tibble: 2 x 7
condition mpg_mean cyl_mean wt_mean mpg_se cyl_se wt_se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 18.2 6.5 3.56 1.04 0.387 0.204
2 treat 22.0 5.88 2.87 1.77 0.499 0.257
Here's what I think would be more useful (the numbers are not meaningful):
# MEAN.control, MEAN.treat, SE.control, SE.treat
# mpg 1.5 2.4 .30 .45
# cyl 3.2 1.9 .20 .60
# disp 12.3 17.8 .20 .19
Any ideas? New to the tidyverse, so sorry if this is too basic.
funs() is deprecated in dplyr; use list() in summarise_at()/mutate_at() instead. Naming the list elements MEAN and SE gives the upper-case prefixes directly. After the summarise step, gather the data into long format, separate the 'key' column into 'key1' and 'key2' by splitting at the delimiter _, unite 'key2' and 'condition' into 'cond', spread back to wide format and, if needed, turn the 'key1' column into row names.
library(tidyverse)
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(MEAN = ~ mean(.),
SE = ~sd(.)/sqrt(n()))) %>%
gather(key, val, -condition) %>%
separate(key, into = c("key1", "key2")) %>%
unite(cond, key2, condition, sep=".") %>%
spread(cond, val) %>%
column_to_rownames('key1')
#     MEAN.control MEAN.treat SE.control  SE.treat
# cyl     6.500000   5.875000  0.3872983 0.4989572
# mpg    18.200000  21.981250  1.0369024 1.7720332
# wt      3.560875   2.873625  0.2044885 0.2571034
A different possibility could be:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
separate(var, c("vars", "var2")) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
vars MEAN_control MEAN_treat SE_control SE_treat
<chr> <dbl> <dbl> <dbl> <dbl>
1 cyl 6.5 5.88 0.387 0.499
2 mpg 18.2 22.0 1.04 1.77
3 wt 3.56 2.87 0.204 0.257
Here, after your initial steps, it performs a wide-to-long data transformation, excluding the "condition" column. Second, it separates the variable names into two columns. Third, it combines the metric and the condition, with the metric being upper case. Finally, it removes the redundant variable and returns it to the desired format.
Or you can avoid separate() by using some regex:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = gsub("_.*$", "", var),
var2 = gsub(".*\\_", "", var)) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or with strsplit():
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = sapply(strsplit(var, "_"), function(x) x[1]),
var2 = sapply(strsplit(var, "_"), function(x) x[2])) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or you can completely rewrite it to:
data %>%
select(mpg, cyl, wt, condition) %>%
gather(vars, val, -condition) %>%
group_by(condition, vars) %>%
summarise(mean = mean(val),
se = sd(val)/sqrt(n())) %>%
ungroup() %>%
gather(var2, val, -c(condition, vars)) %>%
mutate(var2 = paste(toupper(var2), condition, sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
In this case, it first selects the variables of interest. Second, it transforms the data from wide to long format, excluding the "condition" column. Third, it groups by condition and variable name and calculates the metrics. Fourth, it performs a second wide-to-long transformation, excluding the "condition" column and the column with the initial variable names. Finally, it combines the metric (upper case) with the condition, removes the redundant variable and spreads the result into the desired format.
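The same reshaping can also be sketched with the newer verbs, assuming dplyr 1.0 or later for `across()` and tidyr 1.0 or later for `pivot_longer()`/`pivot_wider()`. The column names `variable` and `stat` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

data <- mtcars
data$condition <- factor(rep(c("control", "treat"), each = 16))

res <- data %>%
  group_by(condition) %>%
  summarise(
    across(c(mpg, cyl, wt),
           list(MEAN = ~ mean(.x), SE = ~ sd(.x) / sqrt(n()))),
    .groups = "drop"
  ) %>%
  # Columns are named like mpg_MEAN, mpg_SE; split them at the underscore.
  pivot_longer(-condition, names_to = c("variable", "stat"), names_sep = "_") %>%
  # One column per statistic/condition pair, e.g. MEAN.control.
  pivot_wider(names_from = c(stat, condition), values_from = value, names_sep = ".")

res
```

This keeps one row per variable and avoids the separate `gather()`, `separate()`, `unite()` and `spread()` steps.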
When computing any statistic using summarise and group_by we only get the summary statistic per-category, and not the value for all the population (Total). How to get both?
I am looking for something clean and short. Until now I can only think of:
bind_rows(
iris %>% group_by(Species) %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1))
,
iris %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1)) %>%
mutate(Species = "Total")
)
But I would like something more compact. In particular, I don't want to type the code (for summarize) twice, once for each group and once for the total.
You can simplify it if you untangle what you're trying to do: you have iris data that has several species, and you want that summarized along with data for all species. You don't need to calculate those summary stats before you can bind. Instead, bind iris with a version of iris that's been set to Species = "Total", then group and summarize.
library(tidyverse)
bind_rows(
iris,
iris %>% mutate(Species = "Total")
) %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Width),
Median = median(Sepal.Width),
sd = sd(Sepal.Width),
p10 = quantile(Sepal.Width, probs = 0.1))
#> # A tibble: 4 x 5
#> Species Mean Median sd p10
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.379 3
#> 2 Total 3.06 3 0.436 2.5
#> 3 versicolor 2.77 2.8 0.314 2.3
#> 4 virginica 2.97 3 0.322 2.59
I like the caution in the comments above, though I have to do this sort of calculation for work enough that I have a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but it's something I need to do a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
A bit shorter, though quite similar to bind_rows:
q10 <- function(x){quantile(x , probs=0.1)}
iris %>%
select(Species,Sepal.Width)%>%
group_by(Species) %>%
summarise_all(c("mean", "sd", "q10")) %>%
t() %>%
cbind(c("total", iris %>% select(Sepal.Width) %>% summarise_all(c("mean", "sd", "q10")))) %>%
t()
Probably cleaner:
bind_rows(
iris %>%
group_by(Species) %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10"))
,
iris %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10")) %>%
mutate(Species = "Total")
)
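If the main concern is not typing the summary code twice, another sketch is to wrap it in a small function and call it on both the grouped and the ungrouped data. The helper name `summarise_width()` is hypothetical.

```r
library(dplyr)

# Hypothetical helper so the summary statistics are written only once.
summarise_width <- function(df) {
  df %>%
    summarise(
      Mean   = mean(Sepal.Width),
      Median = median(Sepal.Width),
      sd     = sd(Sepal.Width),
      p10    = quantile(Sepal.Width, probs = 0.1)
    )
}

res <- bind_rows(
  # Per-species rows; Species is converted to character so it can be
  # combined with the "Total" label below.
  iris %>%
    mutate(Species = as.character(Species)) %>%
    group_by(Species) %>%
    summarise_width(),
  # Overall row, computed from the full data.
  iris %>%
    summarise_width() %>%
    mutate(Species = "Total")
)

res
```

Unlike the version that binds the raw data first, this keeps the "Total" row at the bottom and does not duplicate the data before summarising.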