Create a table from summary() in R

I used the generic summary() function to get some data. Now I want to display some of the summary data in a table and then knit it to PDF. How do I create a table from the results of calling summary()?
Using TeX, kableExtra and ggplot2.
summary(segmentdata)
summary(subset(segmentdata, Segment == "Suburb mix"))
summary(subset(segmentdata, Segment == "Urban hip"))
summary(subset(segmentdata, Segment == "Travelers"))
summary(subset(segmentdata, Segment == "Moving up"))
E.g. data
age gender income kids ownHome
Min. :20.00 Length:300 Min. :-13292 Min. :0.000 Length:300
1st Qu.:32.75 Class :character 1st Qu.: 38122 1st Qu.:0.000 Class :character
Median :39.00 Mode :character Median : 51134 Median :1.000 Mode :character
Mean :40.59 Mean : 50259 Mean :1.163
3rd Qu.:47.00 3rd Qu.: 63001 3rd Qu.:2.000
Max. :70.00 Max. :139679 Max. :5.000
subscribe Segment
Length:300 Length:300
Class :character Class :character
Mode :character Mode :character

Welcome to Stack Overflow. It's good practice when posting a question to provide actual data with a reproducible example so contributors can help you. The reprex package is recommended for this in R.
I'll give you an answer based on what I think you want to achieve. I used the iris data set as an example.
library(tidyverse)
library(kableExtra)
vars <- iris %>% names()
iris %>%
filter(Species == "setosa") %>% # subset data
map_dfr(summary) %>% # apply summary to variables
add_column(vars = vars, .before = 1) # add variable names
#> # A tibble: 5 x 10
#> vars Min. `1st Qu.` Median Mean `3rd Qu.` Max. setosa versicolor virginica
#> <chr> <tab> <table> <tabl> <tab> <table> <tab> <int> <int> <int>
#> 1 Sepa.. 4.3 4.8 5.0 5.006 5.200 5.8 NA NA NA
#> 2 Sepa.. 2.3 3.2 3.4 3.428 3.675 4.4 NA NA NA
#> 3 Peta.. 1.0 1.4 1.5 1.462 1.575 1.9 NA NA NA
#> 4 Peta.. 0.1 0.2 0.2 0.246 0.300 0.6 NA NA NA
#> 5 Spec.. NA NA NA NA NA NA 50 0 0
For more detail on the process, check the documentation of the functions used in the pipeline: filter(), map_dfr() and add_column().
For the kableExtra output, add kbl() %>% kable_styling() at the end of the pipeline.
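As a minimal sketch of the same idea in base R, the summaries of the numeric columns can be collected into a plain data frame, which is the shape that table functions expect. Here iris stands in for segmentdata, and the kableExtra call is left as a comment because it assumes the package is installed and the code runs inside an R Markdown chunk.

```r
# Summarise only the numeric columns of one subset (iris stands in for segmentdata)
setosa <- subset(iris, Species == "setosa")
num_cols <- setosa[sapply(setosa, is.numeric)]

# sapply() returns a matrix with statistics in rows; t() puts variables in rows
tab <- as.data.frame(t(sapply(num_cols, summary)))
tab

# In an R Markdown chunk the table could then be rendered with, for example:
# kableExtra::kable_styling(kableExtra::kbl(tab, digits = 2))
```

Character or factor columns are dropped here on purpose, since their summaries do not share the six-number layout.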

Related

step_BoxCox() with negative data

My understanding is that step_BoxCox() requires a strictly positive variable. However, when I applied the step to data that has some negative values, I didn't get an error or a warning, and the output had no NA values.
I don't know what is wrong: is my understanding flawed, or am I using the wrong syntax?
library(recipes)
library(skimr)
# create dummy data
set.seed(123)
n <- 2e3
x1 <- rpois(n, lambda = 5) # has some zero vals
x2 <- rnorm(n) # has some -ve vals
x3 <- x1 + 10 # is strictly positive
y <- x1 + x2
data <- tibble(x1, x2, x3, y)
# a BoxCox recipe
rec <- recipe(y ~ ., data = data) %>%
step_BoxCox(all_predictors())
rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 3
#>
#> Operations:
#>
#> Box-Cox transformation on all_predictors()
# bake
processed <- rec %>%
prep() %>%
bake(new_data = NULL)
# check output
summary(data)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :10.00 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:13.00 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :15.00 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :14.98 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:16.00 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :24.00 Max. :15.225
summary(processed)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :2.076 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:2.285 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :2.398 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :2.388 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:2.448 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :2.756 Max. :15.225
sum(is.na(processed$x2))
#> [1] 0
skim(processed)
Created on 2021-04-29 by the reprex package (v0.3.0)
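In the summaries above, x1 (which contains zeros) and x2 (which contains negative values) are identical before and after baking, while x3 has changed. That suggests the step left the non-positive columns untransformed rather than raising an error. The sketch below is a plain base R version of the Box-Cox formula for a fixed lambda, written only to illustrate why the transform needs strictly positive input. The function name box_cox is hypothetical and this is not how recipes implements the step.

```r
# Plain Box-Cox transform for a fixed lambda (illustration only, not the recipes code)
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) {
    stop("Box-Cox is only defined for strictly positive values")
  }
  # lambda == 0 is the log transform; otherwise a scaled power transform
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

box_cox(c(10, 15, 24), lambda = 0)  # same values as log(c(10, 15, 24))

# Logs and fractional powers of negative numbers are not real numbers in R
suppressWarnings(log(-1))
(-1)^0.5
```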

Convert tapply summary result to data frame [duplicate]

My code is:
Normality <- tapply(input$TotalAuthBdNet.USD., input$Country, summary)
The output displayed is:
$Albania
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 1.066e+04 2.730e+04 3.403e+07 5.015e+04 2.720e+09
$Angola
Min. 1st Qu. Median Mean 3rd Qu. Max.
5405 15323 52522 486451 170000 4513196
$`Antigua and Barbuda`
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
22622 22622 22622 22622 22622 22622 2
$Argentina
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 15814 45000 212800 193626 4080293 15
Country names should be in rows, and each country has these statistics. I want the output as:
Country Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
Albania 0.000e+00 1.066e+04 2.730e+04 3.403e+07 5.015e+04 2.720e+09
Angola 5405 15323 52522 486451 170000 4513196
Argentina 0 15814 45000 212800 193626 4080293 15
The country names come from the input file.
A simple rbind() would do, e.g. with the mpg data set from ggplot2:
do.call(rbind, tapply(ggplot2::mpg$year, ggplot2::mpg$model, summary))
You can also directly call aggregate so you don't need the extra step:
aggregate(Sepal.Length ~ Species, iris, summary)
# Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
# 1 setosa 4.300 4.800 5.000 5.006 5.200 5.800
# 2 versicolor 4.900 5.600 5.900 5.936 6.300 7.000
# 3 virginica 4.900 6.225 6.500 6.588 6.900 7.900
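One caveat with the rbind approach: summary() adds an NA's element only for groups that contain missing values, as the Antigua and Barbuda and Argentina output in the question shows, so the per-group vectors can differ in length and rbind() cannot line them up cleanly. A minimal sketch of one way around this is a helper that always returns seven elements. The data frame and the name summary_row below are hypothetical stand-ins for the asker's input.

```r
# Hypothetical data standing in for input$Country and input$TotalAuthBdNet.USD.
input <- data.frame(
  Country = c("Albania", "Albania", "Albania", "Angola", "Angola", "Angola"),
  Amount  = c(0, 10, 20, 5, NA, 15)
)

# Always return the same seven statistics, with an explicit NA count
summary_row <- function(x) {
  v <- x[!is.na(x)]
  q <- quantile(v, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c("Min." = q[1], "1st Qu." = q[2], "Median" = q[3], "Mean" = mean(v),
    "3rd Qu." = q[4], "Max." = q[5], "NA's" = sum(is.na(x)))
}

# One row per country, one column per statistic
tab <- do.call(rbind, lapply(split(input$Amount, input$Country), summary_row))
tab
```

Wrapping the result in as.data.frame(tab) gives a data frame with the countries as row names.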

Why is summary() produced from data.table output not printing to file?

I am having trouble writing the output of a summary function to a file. I have a column "bin" with three factor levels and want to return a five-number summary for each level. The five-number summary prints to the screen but won't write to file. The message I get is:
Empty data.table (0 rows) of 1 col: bin
Data:
A B info C bin
1: 10-60494 0.66392100 0.001833330 1 MAF0.01
2: rs148087467 0.35274000 0.000716240 1 MAF0.01
3: rs187110906 0.40586900 0.004488040 1 MAF0.01
4: rs192025213 0.00743299 0.000000000 1 MAF0.01
5: rs115033199 0.32829300 0.000614316 1 MAF0.01
6: rs183305313 0.51721200 0.002892520 1 MAF0.01
s <- df2[, print(summary(info)), by='bin']
print(s)
write.table(as.data.frame(s),
quote=FALSE,file=paste(i,"sum_out.txt",sep=''))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0009998 0.0371300 0.2016000 0.2700000 0.4477000 1.0000000
You are getting zero rows because the only thing you do in j is print the outcome of the summary command.
Considering the following example data:
set.seed(2018)
dt <- data.table(bin = rep(c('A','B'), 5), val = rnorm(10,3,1))
Now when you do (like in your question):
s <- dt[, print(summary(val)), by = bin]
the summary statistics are printed to the console but it results in an empty data.table:
> s <- dt[, print(summary(val)), by = bin]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.389 2.577 2.936 3.547 4.735 5.099
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 2.735 3.271 2.991 3.637 3.863
> s
Empty data.table (0 rows) of 1 col: bin
Removing the print-command doesn't help:
> dt[, summary(val), by = bin]
bin V1
1: A 2.389
2: A 2.577
3: A 2.936
4: A 3.547
5: A 4.735
6: A 5.099
7: B 1.450
8: B 2.735
9: B 3.271
10: B 2.991
11: B 3.637
12: B 3.863
because summary() returns a table object, which data.table treats as a vector.
Instead of using print, you should use as.list to get the elements of summary as columns in a data.table:
s <- dt[, as.list(summary(val)), by = bin]
now the summary statistics are included in the resulting data.table:
> s
bin Min. 1st Qu. Median Mean 3rd Qu. Max.
1: A 2.389413 2.577016 2.935571 3.547351 4.735284 5.099471
2: B 1.450122 2.735289 3.270881 2.991340 3.637056 3.863351
Because the summary statistics are stored in the non-empty data.table s, you can write s to a file with, for example, fwrite() (the fast file writer from the data.table package).
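Putting the pieces together, a minimal sketch of the write step could look like the following. It reuses the example data from this answer; the temporary output path is a placeholder for whatever file name the loop in the question builds.

```r
library(data.table)

set.seed(2018)
dt <- data.table(bin = rep(c("A", "B"), 5), val = rnorm(10, 3, 1))

# One row per bin, one column per summary statistic
s <- dt[, as.list(summary(val)), by = bin]

# Placeholder path; replace with e.g. paste0(i, "sum_out.txt")
out_file <- tempfile(fileext = ".txt")
fwrite(s, out_file, sep = "\t", quote = FALSE)

# Read the file back to check what was written
fread(out_file)
```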
This can also be achieved using sapply(). Here is an example using the iris data frame:
lv <- levels(iris$Species)
# one summary per species; t() puts species in rows and statistics in columns
result <- data.frame(t(sapply(lv, function(l) summary(iris$Petal.Width[iris$Species == l]))))
> result
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
setosa 0.1 0.2 0.2 0.246 0.3 0.6
versicolor 1.0 1.2 1.3 1.326 1.5 1.8
virginica 1.4 1.8 2.0 2.026 2.3 2.5

How to convert lapply output to a single matrix in R

I have a list of data frames, organized by year. I am using lapply to get the summary for a single variable in each data frame. The output follows the list and gives a summary for each year, one by one. However, I want the output in the form of a single table with years for rows. How do I do this? An example using the iris dataset shows my problem:
x <- split(iris$Sepal.Length, iris$Species)
lapply(x, summary)
And the output is:
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 4.800 5.000 5.006 5.200 5.800
Similarly for the other two.
I want the output organized as a single table like with:
> sapply(x, summary)
setosa versicolor virginica
Min. 4.300 4.900 4.900
1st Qu. 4.800 5.600 6.225
Median 5.000 5.900 6.500
Mean 5.006 5.936 6.588
3rd Qu. 5.200 6.300 6.900
Max. 5.800 7.000 7.900
But with setosa, versicolor, virginica (or years in my case) on the left and Min... Max up top. I can flip the axes around in ggplot, but reading the table as-is is more intuitive with the years on the left. I came across a number of discussions about converting lapply output but the ones I came across were all measuring a single stat like mean or median. Thanks.
This seems like a good time to use by(). It eliminates the need for the call to split(), is all done in one line, and returns a matrix.
with(iris, do.call(rbind, by(Sepal.Length, Species, summary)))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# setosa 4.3 4.800 5.0 5.006 5.2 5.8
# versicolor 4.9 5.600 5.9 5.936 6.3 7.0
# virginica 4.9 6.225 6.5 6.588 6.9 7.9
If you still wish to use the manual split-apply-combine method, it would be
do.call(rbind, lapply(x, summary))
If you have a large data.frame, I recommend not splitting it into pieces but using data.table to group by year. With the iris data set, this could be done as follows:
library(data.table)
setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
# Species Min. 1st Qu. Median Mean 3rd Qu. Max.
#1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
as.list() ensures the output of summary() appears column-wise as requested.
The result is a data.table (not a matrix) which can be used directly in a subsequent ggplot() call.
Note that copy(iris) is only required here because the iris data set is locked to prevent modifying its variable bindings. With your own data.frame df you would simply use setDT(df) to coerce to data.table without copying.
Add-on
The OP mentioned that he uses the result for plotting with ggplot2. Now, ggplot2 works best when data are provided in long format. Reshaping a data.table from wide to long format can be conveniently done with melt()
wideDT <- setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
longDT <- melt(wideDT, id.vars = "Species")
longDT
# Species variable value
# 1: setosa Min. 4.300
# 2: versicolor Min. 4.900
# 3: virginica Min. 4.900
# 4: setosa 1st Qu. 4.800
# 5: versicolor 1st Qu. 5.600
# 6: virginica 1st Qu. 6.225
# 7: setosa Median 5.000
# 8: versicolor Median 5.900
# 9: virginica Median 6.500
#10: setosa Mean 5.006
#11: versicolor Mean 5.936
#12: virginica Mean 6.588
#13: setosa 3rd Qu. 5.200
#14: versicolor 3rd Qu. 6.300
#15: virginica 3rd Qu. 6.900
#16: setosa Max. 5.800
#17: versicolor Max. 7.000
#18: virginica Max. 7.900

What is the **tidyverse** method for splitting a df by multiple columns?

I would like to split a dataframe by multiple columns so that I can see the summary() output for each subset of the data.
Here's a way to do that using split() from base:
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
split(list(.$GRP_A, .$GRP_B)) %>%
map(summary)
#> $A.1
#> mpg cyl disp GRP_A
#> Min. :10.40 Min. :4.0 Min. :108.0 Length:10
#> 1st Qu.:14.97 1st Qu.:4.5 1st Qu.:151.9 Class :character
#> Median :18.50 Median :7.0 Median :259.3 Mode :character
#> Mean :17.61 Mean :6.4 Mean :283.4
#> 3rd Qu.:20.85 3rd Qu.:8.0 3rd Qu.:430.0
#> Max. :24.40 Max. :8.0 Max. :472.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $B.1
#> mpg cyl disp GRP_A
#> Min. :15.00 Min. :4.0 Min. : 75.7 Length:5
#> 1st Qu.:21.00 1st Qu.:4.0 1st Qu.: 78.7 Class :character
#> Median :21.50 Median :4.0 Median :120.1 Mode :character
#> Mean :24.06 Mean :5.2 Mean :147.1
#> 3rd Qu.:30.40 3rd Qu.:6.0 3rd Qu.:160.0
#> Max. :32.40 Max. :8.0 Max. :301.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $A.2
#> mpg cyl disp GRP_A
#> Min. :15.20 Min. :4.000 Min. : 95.1 Length:9
#> 1st Qu.:16.40 1st Qu.:6.000 1st Qu.:160.0 Class :character
#> Median :18.10 Median :8.000 Median :275.8 Mode :character
#> Mean :19.84 Mean :6.667 Mean :234.0
#> 3rd Qu.:21.00 3rd Qu.:8.000 3rd Qu.:275.8
#> Max. :30.40 Max. :8.000 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
#>
#> $B.2
#> mpg cyl disp GRP_A
#> Min. :13.30 Min. :4 Min. : 71.1 Length:8
#> 1st Qu.:14.97 1st Qu.:4 1st Qu.:125.3 Class :character
#> Median :20.55 Median :6 Median :201.5 Mode :character
#> Mean :20.99 Mean :6 Mean :213.5
#> 3rd Qu.:23.93 3rd Qu.:8 3rd Qu.:315.5
#> Max. :33.90 Max. :8 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
How can I achieve this same result using a tidyverse verb? My initial thought was to use purrr::by_slice(), but apparently that has been deprecated.
dplyr 0.8.0 has introduced the verb that you were looking for: group_split()
From the documentation:
group_split() works like base::split() but
it uses the grouping structure from group_by() and therefore is subject to the data mask
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
group_keys() explains the grouping structure, by returning a data
frame that has one row per group and one column per grouping variable.
For your example:
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_split(GRP_A, GRP_B) %>%
map(summary)
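Because group_split() does not name the list elements, the output loses the A.1, B.1, ... labels that base split() produced. A minimal sketch of one way to restore them, assuming dplyr 0.8.0 or later and purrr are installed, is to take the labels from group_keys() on the same grouped data. The seed and the "." separator are arbitrary choices made here for illustration.

```r
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed so the random groups are reproducible
grouped <- mtcars %>%
  select(1:3) %>%
  mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
         GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
  group_by(GRP_A, GRP_B)

# One row per group, in the same order as the elements of group_split()
keys <- group_keys(grouped)

summaries <- grouped %>%
  group_split() %>%
  map(summary) %>%
  set_names(paste(keys$GRP_A, keys$GRP_B, sep = "."))

names(summaries)
```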
EDIT: this answer is now outdated. See @MartijnVanAttekum's solution above.
The "tidy" solution seems to be a combination of "mutate + list-cols + purrr" according to Hadley.
library(tidyverse)
library(magrittr)
# group, nest, create a new col leveraging purrr::map()
mt_summary <-
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_by(GRP_A, GRP_B) %>%
nest() %>%
mutate(SUMMARY = map(data, .f = summary))
# check the structure
mt_summary
#> # A tibble: 4 × 4
#> GRP_A GRP_B data SUMMARY
#> <chr> <int> <list> <list>
#> 1 A 1 <tibble [11 × 3]> <S3: table>
#> 2 B 2 <tibble [9 × 3]> <S3: table>
#> 3 A 2 <tibble [7 × 3]> <S3: table>
#> 4 B 1 <tibble [5 × 3]> <S3: table>
# extract the summaries
extract2(mt_summary, "SUMMARY") %>%
set_names(paste0(extract2(mt_summary, "GRP_A"),
extract2(mt_summary, "GRP_B")))
#> $A1
#> mpg cyl disp
#> Min. :10.40 Min. :4.000 Min. : 75.7
#> 1st Qu.:15.25 1st Qu.:4.000 1st Qu.:120.9
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.43 Mean :6.182 Mean :229.0
#> 3rd Qu.:25.85 3rd Qu.:8.000 3rd Qu.:309.5
#> Max. :30.40 Max. :8.000 Max. :460.0
#>
#> $B2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 78.7
#> 1st Qu.:17.80 1st Qu.:4.000 1st Qu.:120.3
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.84 Mean :6.222 Mean :225.9
#> 3rd Qu.:21.50 3rd Qu.:8.000 3rd Qu.:351.0
#> Max. :32.40 Max. :8.000 Max. :400.0
#>
#> $A2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 71.1
#> 1st Qu.:18.90 1st Qu.:4.000 1st Qu.:114.5
#> Median :21.40 Median :6.000 Median :145.0
#> Mean :21.79 Mean :5.429 Mean :176.0
#> 3rd Qu.:22.10 3rd Qu.:6.000 3rd Qu.:241.5
#> Max. :33.90 Max. :8.000 Max. :304.0
#>
#> $B1
#> mpg cyl disp
#> Min. :10.40 Min. :4.0 Min. :140.8
#> 1st Qu.:13.30 1st Qu.:8.0 1st Qu.:275.8
#> Median :14.30 Median :8.0 Median :350.0
#> Mean :15.62 Mean :7.2 Mean :319.7
#> 3rd Qu.:17.30 3rd Qu.:8.0 3rd Qu.:360.0
#> Max. :22.80 Max. :8.0 Max. :472.0

Resources