The code below extracts certain values for one parameter in my data frame (Calcium), but I want to do this for all of the parameters in my data frame. There are multiple rows for Calcium, which is why I took the median value.
How can I create a loop that does this for the other drug substance parameters?
Cal_limits=ag_limits_5 %>% filter(PARAMETER=="Drug Substance.Calcium")
lcl <- median(Cal_limits$LCL, na.rm = TRUE)
ucl <- median(Cal_limits$UCL, na.rm = TRUE)
lsl <- median(Cal_limits$LSL_1, na.rm = TRUE)
usl <- median(Cal_limits$USL_1, na.rm = TRUE)
cl <- median(Cal_limits$TARGET_MEAN, na.rm = TRUE)
stdev <- median(Cal_limits$TARGET_STDEV, na.rm = TRUE)
sigabove <- ucl + stdev #3.219 #(UCL + sd (3.11+0.107))
sigbelow <- lcl - stdev#2.363 #(LCL - sd (2.47-0.107))
Snapshot showing that there are multiple rows dedicated to one parameter, the columns not pictured have confidential information but include the values I am looking to extract
Edit: I am creating an RShiny app, so I am not sure if I will need to incorporate a reactive function
Using mtcars, you can do
aggregate(. ~ cyl, data = mtcars, FUN = median)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 4 26.0 108.0 91.0 4.080 2.200 18.900 1 1 4 2.0
# 2 6 19.7 167.6 110.0 3.900 3.215 18.300 1 0 4 4.0
# 3 8 15.2 350.5 192.5 3.115 3.755 17.175 0 0 3 3.5
which provides the median for each of the variables (. means "all others") for each of the levels of cyl. I'm going to guess that this would apply to your data as
aggregate(. ~ PARAMETER, data = ag_limits_5, FUN = median)
If you have more columns than you want to reduce, then you can name the ones you want on the left-hand side with cbind() (using + there would add the columns together instead of summarising each one):
aggregate(cbind(LCL, UCL, LSL_1, USL_1, TARGET_MEAN, TARGET_STDEV) ~ PARAMETER,
data = ag_limits_5, FUN = median)
and I think you'll get output something like
# PARAMETER LCL UCL LSL_1 USL_1 TARGET_MEAN TARGET_STDEV
# 1 Drug Substance.Calcium 1.1 1.2 1.3 1.4 ...
# 2 Drug Substance.Copper ...
(with real numbers, I'm just showing structure there).
Since it appears that you're using dplyr, you can do it this way, too:
mtcars %>%
group_by(cyl) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
# # A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 26 108 91 4.08 2.2 18.9 1 1 4 2
# 2 6 19.7 168. 110 3.9 3.22 18.3 1 0 4 4
# 3 8 15.2 350. 192. 3.12 3.76 17.2 0 0 3 3.5
which for you might be
ag_limits_5 %>%
group_by(PARAMETER) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
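To also get the derived limits from the question (sigabove and sigbelow) for every parameter at once, one option is to add a mutate() step after the grouped summary. This is a minimal sketch on made-up data: toy_limits and its values are invented for illustration, and only the column names are taken from the question.

```r
library(dplyr)

# Hypothetical stand-in for ag_limits_5: several rows per parameter
toy_limits <- data.frame(
  PARAMETER    = rep(c("Drug Substance.Calcium", "Drug Substance.Copper"), each = 3),
  LCL          = c(2, 2, 3, 10, 11, NA),
  UCL          = c(4, 4, 5, 20, 21, 22),
  TARGET_STDEV = c(0.5, 0.5, 1, 2, 2, 3)
)

limits_by_param <- toy_limits %>%
  group_by(PARAMETER) %>%
  # median of every numeric column within each parameter
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)),
            .groups = "drop") %>%
  # derived limits, computed once per parameter
  mutate(sigabove = UCL + TARGET_STDEV,
         sigbelow = LCL - TARGET_STDEV)

limits_by_param
```

Restricting across() to where(is.numeric) avoids calling median() on character columns. In a Shiny app, a pipeline like this could be wrapped in reactive() if the input data depends on user choices, and a single parameter's row pulled out with filter().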
I have this code
data_2012 %>%
group_by(job2) %>%
filter(!is.na(job2)) %>%
summarise(mean = mean(persinc2, na.rm = T),
sd = sd(persinc2, na.rm = T))
This gives me a small table for that one variable, which is perfect. However, I have multiple variables that I want the mean and SD for, all in one table. How do I do that?
I am very new to R.
You can use across(), choosing your columns with tidyselect syntax:
data_2012 %>%
group_by(job2) %>%
filter(!is.na(job2)) %>%
summarise(across(your_columns, list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE))))
With a toy dataset
iris %>%
group_by(Species) %>%
summarise(across(everything(), list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE))))
# A tibble: 3 x 9
Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.352 3.43 0.379
2 versicolor 5.94 0.516 2.77 0.314
3 virginica 6.59 0.636 2.97 0.322
# ... with 4 more variables: Petal.Length_mean <dbl>, Petal.Length_sd <dbl>,
# Petal.Width_mean <dbl>, Petal.Width_sd <dbl>
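If the default column_function naming is not what you want, across() also takes a .names argument with a glue-style template. A small sketch, assuming you would rather have the statistic first; the object name iris_stats is made up for the example:

```r
library(dplyr)

iris_stats <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE)),
                   # {.fn} is the name given in the list, {.col} the column name
                   .names = "{.fn}_{.col}"),
            .groups = "drop")

names(iris_stats)
```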
With base R, we may use split() to split the data by some factor variable. This returns a list of a number of elements that is equal to the number of levels of that factor variable. We can then obtain the mean and sd (or any other statistic you like) per column per level using members of the *apply() family as follows:
# toy data
df <- mtcars[, 1:5]
# splitting by a factor variable
lapply(split(df, df$cyl), function(x) {
sapply(x, function(i) data.frame(Mean=mean(i), SD=sd(i)))
})
Output
$`4`
mpg cyl disp hp drat
Mean 26.66364 4 105.1364 82.63636 4.070909
SD 4.509828 0 26.87159 20.93453 0.3654711
$`6`
mpg cyl disp hp drat
Mean 19.74286 6 183.3143 122.2857 3.585714
SD 1.453567 0 41.56246 24.26049 0.4760552
$`8`
mpg cyl disp hp drat
Mean 15.1 8 353.1 209.2143 3.229286
SD 2.560048 0 67.77132 50.97689 0.3723618
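If a single data frame is easier to work with than a list, a possible base R variation is to run aggregate() once per statistic and merge the results. This is only a sketch; the names group_means, group_sds and stats are made up here:

```r
# toy data
df <- mtcars[, 1:5]

# one row per level of cyl, one column per remaining variable
group_means <- aggregate(. ~ cyl, data = df, FUN = mean)
group_sds   <- aggregate(. ~ cyl, data = df, FUN = sd)

# join on cyl; the suffixes mark which statistic each column holds
stats <- merge(group_means, group_sds, by = "cyl", suffixes = c("_mean", "_sd"))
stats
```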
With some df tab1 that gives summary output like below
summary(tab1)
I need to extract a table out of this like below
What is the right way to extract the data for the above table? I am looking for a solution that can use R's summary object or an equivalent. Eventually I will use kable to pretty-print it.
We could combine apply with kable_styling from the kableExtra package:
library(dplyr)
library(kableExtra)
#example data
df <- mtcars %>%
select(1:3)
as.data.frame(apply(df,2,summary)) %>%
kbl() %>%
kable_styling()
Here is another tidyverse approach. Just select the columns you want and iterate over them one at a time. This gives you a list of them that you can then join together.
library(dplyr)
library(purrr)
library(tibble)
library(tidyr) # replace_na() comes from tidyr
# add in a NA value
mtcars2 <- mtcars
mtcars2[5, "wt"] <- NA
# select variables of interest
# iterate over each column to create a data frame of the summary statistics
# move through the list joining the results together
# iterate through each joined column, replacing NA with 0
summary_vars <- mtcars2 %>%
select(wt, hp) %>%
imap(~ enframe(c(summary(.x)), name = "Metric", value = .y)) %>%
reduce(full_join, by = "Metric") %>%
modify_if(is.numeric, replace_na, 0) %>%
filter(Metric %in% c("Mean", "Min.", "Max.", "Median", "NA's"))
summary_vars
# # A tibble: 5 x 3
# Metric wt hp
# <chr> <dbl> <dbl>
# 1 Min. 1.51 52
# 2 Median 3.22 123
# 3 Mean 3.21 147.
# 4 Max. 5.42 335
# 5 NA's 1 0
You can now use this summary_vars with kable().
I think that this will give you what you want:
df <- data.frame(
a = c(rnorm(100), NA),
b = rnorm(101),
c = rnorm(101)
)
sumtab <- summary(df)
rownames(sumtab) <- names(summary(df$a))
sumtab <- sumtab[c("Mean", "Min.", "Max.", "Median", "NA's"),]
sumtab[5,][is.na(sumtab[5,])] <- 0
knitr::kable(gsub("^.+\\:", "", sumtab))
| | a | b | c |
|:------|:-------|:--------|:-------|
|Mean |0.1033 |-0.03059 |-0.1172 |
|Min. |-2.0331 |-2.63760 |-1.9736 |
|Max. |1.9081 |1.93582 |2.1688 |
|Median |0.2622 |0.05986 |-0.1869 |
|NA's |1 |0 |0 |
Alternatively, you could just select the rows that you want from the summary table when you initially calculate the summary and use your preferred names:
sumtab <- summary(df)[c(4, 1, 6, 3, 7),]
rownames(sumtab) <- c("Mean", "Min", "Max", "Median", "NA")
sumtab["NA", ][is.na(sumtab["NA", ])] <- 0
knitr::kable(gsub("^.+\\:", "", sumtab))
| | a | b | c |
|:------|:--------|:-------|:--------|
|Mean |-0.06337 |0.1074 |-0.00909 |
|Min |-2.27831 |-1.9913 |-2.01234 |
|Max |2.09241 |2.2401 |2.58487 |
|Median |-0.08660 |0.2212 |0.01049 |
|NA |1 |0 |0 |
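Both variants above work by trimming the formatted strings that summary() returns. If numeric values are preferred, one alternative is to compute the statistics directly. A minimal sketch with a small made-up data frame (the values and the name stat_tab are illustrative only):

```r
# small deterministic example with one missing value in column a
df <- data.frame(a = c(1, 2, 3, NA), b = c(4, 5, 6, 7))

# sapply() returns a numeric matrix: one row per statistic, one column per variable
stat_tab <- sapply(df, function(x) {
  c(Mean   = mean(x, na.rm = TRUE),
    Min    = min(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    "NA's" = sum(is.na(x)))
})
stat_tab
```

The resulting matrix can be passed to knitr::kable() in the same way as sumtab.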
# summary of numeric (including factor class) columns of a data frame
my_summ <- function(df) {
df_nonchar <- df[, !sapply(df, typeof) %in% "character"]
summ <- data.frame(summary(df_nonchar), row.names = NULL)
# test for empty columns usually the 1st column is empty as a result of class
# (summary obj) which is "table" to data.frame coercion.
empty <- sapply(summ, function(x) all(x == ""))
summ <- summ[, !empty]
summ <- setNames(summ, c("var_name", "stats"))
summ <- summ[which(!is.na(summ$stats)), ]
# just in case if there are multiple :'s, we need to split only at the first match
summ$stats <- sub(":", "-;-", summ$stats)
summ <- data.frame(summ[1], do.call(rbind, strsplit(summ$stats, "-;-")))
names(summ)[-1] <- c("stats", "value")
# pivot into wide form, using 'stats' column as a key.
summ <- reshape(summ,
direction = "wide",
idvar = "var_name",
timevar = "stats",
v.names = "value"
)
var_nms <- strsplit(colnames(summ)[-1], "value\\.")
var_nms <- vapply(var_nms, function(x) x[[2]], NA_character_)
var_nms <- gsub("\\s+$", "", var_nms) # remove white spaces
names(summ)[-1] <- var_nms
rownames(summ) <- NULL
# remove white spaces
summ <- as.data.frame(sapply(summ, function(x) gsub("\\s+$", "", x)))
# when vars in the dataset contain NAs, we may have two additional columns in
# summary call
nas <- "NA's" %in% colnames(summ)
if (any(nas)) {
colnames(summ)[colnames(summ) == "NA's"] <- "missing"
}
summ
}
my_summ(mtcars[, 1:5])
var_name Min. 1st Qu. Median Mean 3rd Qu. Max.
1 mpg 10.40 15.43 19.20 20.09 22.80 33.90
2 cyl 4.000 4.000 6.000 6.188 8.000 8.000
3 disp 71.1 120.8 196.3 230.7 326.0 472.0
4 hp 52.0 96.5 123.0 146.7 180.0 335.0
5 drat 2.760 3.080 3.695 3.597 3.920 4.930
# transpose of that would put stats(summaries) on rows and vars on columns
t(my_summ(mtcars[, 1:5]))
[,1] [,2] [,3] [,4] [,5]
var_name " mpg" " cyl" " disp" " hp" " drat"
Min. "10.40" "4.000" " 71.1" " 52.0" "2.760"
1st Qu. "15.43" "4.000" "120.8" " 96.5" "3.080"
Median "19.20" "6.000" "196.3" "123.0" "3.695"
Mean "20.09" "6.188" "230.7" "146.7" "3.597"
3rd Qu. "22.80" "8.000" "326.0" "180.0" "3.920"
Max. "33.90" "8.000" "472.0" "335.0" "4.930"
There is a nice package for this called vtable:
df <- data.frame(a = c(rnorm(100), NA), b = rnorm(101), c = rnorm(101))
library(vtable)
# return will return the data as data.frame but look at other options as well
sumtable(df, out = "return")
# Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
# 1 a 100 0.235 0.955 -1.593 -0.433 0.792 3.078
# 2 b 101 0.052 1.017 -2.542 -0.675 0.654 2.753
# 3 c 101 -0.163 1.01 -2.894 -0.832 0.497 2.326
# https://cran.r-project.org/web/packages/vtable/vignettes/sumtable.html
# Options for out:
# browser: Loads output in web browser.
# viewer: Loads output in Viewer pane (RStudio only).
# htmlreturn: Returns HTML code for output file.
# return: Returns summary table in data frame format. Depending on options, the data frame may be entirely character variables.
# csv: Returns summary table in data.frame format and, with a file option, saves that to CSV.
# kable: Returns a knitr::kable()
# latex: Returns a LaTeX table.
# latexpage: Returns an independently-buildable LaTeX document.
A solution based on purrr::map and data.table:
library(tidyverse)
library(data.table)
mtcars2 <- mtcars; mtcars2[4,7] <- NA
mtcars2 %>%
map(summary) %>% bind_rows %>% transpose(keep.names = "rn") %>%
setnames(-1, names(mtcars2)) %>% as_tibble
#> # A tibble: 7 × 12
#> rn mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Min. 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1
#> 2 1st Qu. 15.4 4 121. 96.5 3.08 2.58 16.9 0 0 3 2
#> 3 Median 19.2 6 196. 123 3.70 3.32 17.6 0 0 4 2
#> 4 Mean 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
#> 5 3rd Qu. 22.8 8 326 180 3.92 3.61 18.8 1 1 4 4
#> 6 Max. 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8
#> 7 NA's NA NA NA NA NA NA 1 NA NA NA NA
A tidyverse approach:
library(tidyverse)
mtcars[4,7] <- NA
mtcars %>%
summary %>%
as.data.frame %>%
separate(Freq, into = c("name", "value"),sep = ":") %>%
mutate(across(everything(), ~ str_trim(as.character(.))),
value = parse_number(value)) %>%
pivot_wider(id_cols = "name", names_from = "Var2", values_from = "value") %>%
filter(!str_detect(name,"Qu."))
#> # A tibble: 5 × 12
#> name mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Min. 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1
#> 2 Median 19.2 6 196. 123 3.70 3.32 17.6 0 0 4 2
#> 3 Mean 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
#> 4 Max. 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8
#> 5 NA's NA NA NA NA NA NA 1 NA NA NA NA
I have a numeric dataframe (m rows * n columns)
For each row of this dataframe, I want to treat it as
a numeric vector (1 * n) and subtract from it another
fixed (1 * n) vector. So for each row we return a
(1 * n) vector.
I would like to return a list with this vector subtraction
done for each row of the dataframe. So in this case
a list with m number of 1 * n vectors.
I have manually done this for 2 rows in a simple reprex
below:
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
# A function that takes a row as a vector
diff_vec <- function(inp_vec, diff_val){
base::return(inp_vec - diff_val)
}
# Create a test (dummy) dataset with 3 rows and 4 columns
test_dat <- mtcars %>% dplyr::slice(c(1, 3, 6)) %>% dplyr::select(1:4)
test_dat
#> mpg cyl disp hp
#> 1 21.0 6 160 110
#> 2 22.8 4 108 93
#> 3 18.1 6 225 105
# This is the vector we want to subtract from each row
diff_v <- c(3.2, 5.4, 7.5, 8.2)
first_row <- test_dat %>% dplyr::slice(1) %>% as.vector()
test_out1 <- diff_vec(inp_vec = first_row, diff_val = diff_v)
first_row
#> mpg cyl disp hp
#> 1 21 6 160 110
test_out1
#> mpg cyl disp hp
#> 1 17.8 0.6 152.5 101.8
second_row <- test_dat %>% dplyr::slice(2) %>% as.vector()
test_out2 = diff_vec(inp_vec = second_row, diff_val = diff_v)
second_row
#> mpg cyl disp hp
#> 1 22.8 4 108 93
test_out2
#> mpg cyl disp hp
#> 1 19.6 -1.4 100.5 84.8
Created on 2019-06-07 by the reprex package (v0.2.1)
Could anyone please show how to do this using
purrr based approach?
Thanks
A simple solution exists:
test_dat %>% map2_dfc(diff_v, ~ .x - .y)
Resulting tibble:
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 17.8 0.600 152. 102.
2 19.6 -1.4 100. 84.8
3 14.9 0.600 218. 96.8
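The question asked for a list with one vector per row rather than a tibble. A possible way to get that shape is purrr::pmap(), which calls a function once per row with the columns as named arguments. A sketch that rebuilds the question's test data in base R:

```r
library(purrr)

# same three rows and four columns as the question's test_dat
test_dat <- mtcars[c(1, 3, 6), 1:4]
diff_v <- c(3.2, 5.4, 7.5, 8.2)

# c(...) collects one row into a named numeric vector before subtracting
diff_list <- pmap(test_dat, function(...) c(...) - diff_v)

diff_list[[1]]
```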
I'm trying to group_by multiple columns in my data frame and I can't write out every single column name in the group_by function so I want to call the column names as a vector like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by using a combination of group_by and across. And summarise has an experimental .groups argument where you can choose how to handle the grouping after you create a summarised object. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.
Use dplyr in combination with tidyr to reshape the end result.
library(dplyr)
library(tidyr)
df <- tbl_df(mtcars)
df.sum <- df %>%
select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
summarise_each(funs(min = min,
q25 = quantile(., 0.25),
median = median,
q75 = quantile(., 0.75),
max = max,
mean = mean,
sd = sd))
# the result is a wide data frame
> dim(df.sum)
[1] 1 42
# reshape it using tidyr functions
df.stats.tidy <- df.sum %>% gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, q25, median, q75, max, mean, sd) # reorder columns
> print(df.stats.tidy)
var min q25 median q75 max mean sd
1 am 0.0 0.000 0.0 1.0 1.0 0.40625 0.4989909
2 carb 1.0 2.000 2.0 4.0 8.0 2.81250 1.6152000
3 cyl 4.0 4.000 6.0 8.0 8.0 6.18750 1.7859216
4 gear 3.0 3.000 4.0 4.0 5.0 3.68750 0.7378041
5 mpg 10.4 15.425 19.2 22.8 33.9 20.09062 6.0269481
6 vs 0.0 0.000 0.0 1.0 1.0 0.43750 0.5040161
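One caveat with the separate(stat, sep = "_") step is that it splits on every underscore, so variable names that contain underscores are not handled as intended. A possible workaround, sketched here with across() and pivot_longer() (current dplyr/tidyr verbs, not the functions used above), is to join column and statistic with a separator that cannot occur in the column names. The toy data and the ".." separator are assumptions made for the example:

```r
library(dplyr)
library(tidyr)

# toy data with an underscore in one column name (hypothetical)
toy <- data.frame(engine_size = c(1.6, 2.0, 3.0), mpg = c(30, 25, 20))

toy_stats <- toy %>%
  summarise(across(everything(),
                   list(min = min, mean = mean, max = max),
                   # build names like "engine_size..min"
                   .names = "{.col}..{.fn}")) %>%
  # split only on the ".." separator, so underscores are left alone
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_sep = "\\.\\.")

toy_stats
```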
A potentially easy solution could be created with broom::tidy and purrr::map_df. broom::tidy summarises key objects from statistical output into a tibble. purrr::map_df applies a function to each element (in this case, each column) and returns a tibble.
Example
library(tidyverse)
mtcars %>%
select(mpg, cyl, vs, am, gear, carb) %>%
map_df(.f = ~ broom::tidy(summary(.x)), .id = "variable")
Results
# A tibble: 6 x 7
# variable minimum q1 median mean q3 maximum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 mpg 10.4 15.4 19.2 20.1 22.8 33.9
# 2 cyl 4 4 6 6.19 8 8
# 3 vs 0 0 0 0.438 1 1
# 4 am 0 0 0 0.406 1 1
# 5 gear 3 3 4 3.69 4 5
# 6 carb 1 2 2 2.81 4 8
I liked paljenczy's idea of just using dplyr/tidy and getting the table in a data.frame/tibble before formatting it. But I ran into robustness issues: Because it relies on parsing variable names it choked on columns with underscores in the names. After trying to fix this within the dplyr framework it seemed like it would always be somewhat fragile because it relied on string parsing.
So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. A full example duplicating the previous solutions is included below, combining psych::describe() with some tidyverse stuff to get the exact tibble we are looking for.
It is worth noting that this answer has been updated to reflect the changed behavior of as_tibble() with regards to how it handles rownames in data.frames:
library(psych)
library(tidyverse)
# Create an extended version with a bunch of stats
d.summary.extended <- mtcars %>%
select(mpg, cyl, vs, am, gear, carb) %>%
psych::describe(quant=c(.25,.75)) %>%
as_tibble(rownames="rowname") %>%
print()
<OUTPUT>
# A tibble: 6 x 16
rowname vars n mean sd median trimmed mad min max range skew kurtosis se Q0.25 Q0.75
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 1 32 20.09062 6.0269481 19.2 19.6961538 5.41149 10.4 33.9 23.5 0.6106550 -0.372766 1.06542396 15.425 22.8
2 cyl 2 32 6.18750 1.7859216 6.0 6.2307692 2.96520 4.0 8.0 4.0 -0.1746119 -1.762120 0.31570933 4.000 8.0
3 vs 3 32 0.43750 0.5040161 0.0 0.4230769 0.00000 0.0 1.0 1.0 0.2402577 -2.001938 0.08909831 0.000 1.0
4 am 4 32 0.40625 0.4989909 0.0 0.3846154 0.00000 0.0 1.0 1.0 0.3640159 -1.924741 0.08820997 0.000 1.0
5 gear 5 32 3.68750 0.7378041 4.0 3.6153846 1.48260 3.0 5.0 2.0 0.5288545 -1.069751 0.13042656 3.000 4.0
6 carb 6 32 2.81250 1.6152000 2.0 2.6538462 1.48260 1.0 8.0 7.0 1.0508738 1.257043 0.28552971 2.000 4.0
</OUTPUT>
# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
print()
<OUTPUT>
# A tibble: 6 x 8
var min q25 median q75 max mean sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 10.4 15.425 19.2 22.8 33.9 20.09062 6.0269481
2 cyl 4.0 4.000 6.0 8.0 8.0 6.18750 1.7859216
3 vs 0.0 0.000 0.0 1.0 1.0 0.43750 0.5040161
4 am 0.0 0.000 0.0 1.0 1.0 0.40625 0.4989909
5 gear 3.0 3.000 4.0 4.0 5.0 3.68750 0.7378041
6 carb 1.0 2.000 2.0 4.0 8.0 2.81250 1.6152000
</OUTPUT>
If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.
library(stargazer)
df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
stargazer(
df[, cols], type = "text",
summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)
================================================================
Statistic Min Pctl(25) Median Pctl(75) Max Median St. Dev.
----------------------------------------------------------------
mpg 10.400 15.430 19.200 22.800 33.900 19.200 6.027
cyl 4 4 6 8 8 6 1.786
vs 0 0 0 1 1 0 0.504
am 0 0 0 1 1 0 0.499
gear 3 3 4 4 5 4 0.738
carb 1 2 2 4 8 2 1.615
----------------------------------------------------------------
You can also set type to 'latex' or 'html', and save the result to a file by passing a file name to the out argument.
There's a "new" package called skimr that has a function called skim() that gives wonderful output describing individual variables in a data.frame/tibble.
Try:
skimr::skim(mtcars)
and you'll get:
It is customizable and works well with pipes, etc.
See ?skimr::skim
and vignette("Using_skimr", package = "skimr")
You can achieve the same result using data.table as well. You might consider using it if your table is big.
library(data.table)
dt <- data.table(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')
dt.sum <- dt[
,
lapply(
.SD,
function(x) list(
min(x), quantile(x, 0.25), median(x),
quantile(x, 0.75), max(x), mean(x), sd(x)
)
),
.SDcols = cols
]
dt.sum
mpg cyl vs am gear carb
1: 10.4 4 0 0 3 1
2: 15.43 4 0 0 3 2
3: 19.2 6 0 0 4 2
4: 22.8 8 1 1 4 4
5: 33.9 8 1 1 5 8
6: 20.09 6.188 0.4375 0.4062 3.688 2.812
7: 6.027 1.786 0.504 0.499 0.7378 1.615
# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(dt.sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))
dt.sum.t
var min q25 median q75 max mean sd
1: mpg 10.4 15.43 19.2 22.8 33.9 20.09 6.027
2: cyl 4 4 6 8 8 6.188 1.786
3: vs 0 0 0 1 1 0.4375 0.504
4: am 0 0 0 1 1 0.4062 0.499
5: gear 3 3 4 4 5 3.688 0.7378
6: carb 1 2 2 4 8 2.812 1.615
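A possible alternative that avoids the transpose step is to melt the table to long format first and then summarise by variable. This is a sketch rather than a drop-in replacement; the names long and dt_stats are made up here:

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# long format: one row per (variable, value) pair
long <- melt(dt, measure.vars = cols, variable.name = "var", value.name = "value")

# one row per variable, one column per statistic
dt_stats <- long[, .(min    = min(value),
                     q25    = quantile(value, 0.25),
                     median = median(value),
                     q75    = quantile(value, 0.75),
                     max    = max(value),
                     mean   = mean(value),
                     sd     = sd(value)),
                 by = var]
dt_stats
```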
Similar to the accepted answer, but tidied up a bit into a function:
summarise_continuous = function(d, cvars) {
d %>%
select(all_of(cvars)) %>%
mutate_all(as.numeric) %>%
summarise(across(all_of(cvars), list(N = ~sum(!is.na(.)),
mean = ~mean(., na.rm=T),
sd = ~sd(., na.rm=T),
median = ~median(., na.rm=T),
min = ~min(., na.rm=T),
max = ~max(., na.rm=T)))) %>%
pivot_longer(everything(),
names_to = c("variable",".value"),
names_pattern = "(.+)_(.+)") # %>%
# knitr::kable()
# uncomment these bits if you want a nicely formatted table in a .Rmd document
}
summarise_continuous(mtcars, c("mpg", "cyl", "vs", "am", "gear", "carb"))