Convert tapply summary result to data frame [duplicate] - r

My code is:
Normality <- tapply(input$TotalAuthBdNet.USD., input$Country, summary)
The output displayed is:
$Albania
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 1.066e+04 2.730e+04 3.403e+07 5.015e+04 2.720e+09
$Angola
Min. 1st Qu. Median Mean 3rd Qu. Max.
5405 15323 52522 486451 170000 4513196
$`Antigua and Barbuda`
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
22622 22622 22622 22622 22622 22622 2
$Argentina
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 15814 45000 212800 193626 4080293 15
Each list element is named after a country and holds that country's summary statistics. I want the output as a single table with one row per country:
Country    Min.       1st Qu.    Median     Mean       3rd Qu.    Max.       NA's
Albania    0.000e+00  1.066e+04  2.730e+04  3.403e+07  5.015e+04  2.720e+09
Angola     5405       15323      52522      486451     170000     4513196
Argentina  0          15814      45000      212800     193626     4080293    15
The country names are taken from the input file.

A simple rbind would do, e.g.:
do.call(rbind, tapply(mpg$year, mpg$model, summary))
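One caveat with this approach: when only some groups contain missing values, their summaries carry an extra NA's element, and rbind recycles the shorter vectors with a warning. A minimal sketch of one way to pad every summary to the same seven names first, using a small made-up data frame (df, g and v are hypothetical names, not from the question):

```r
# Hypothetical data: group "b" contains one NA, group "a" does not
df <- data.frame(
  g = c("a", "a", "a", "b", "b", "b"),
  v = c(1, 2, 3, 4, NA, 6)
)

s <- tapply(df$v, df$g, summary)

stat_names <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.", "NA's")

rows <- lapply(s, function(x) {
  x <- unclass(x)                 # plain named numeric vector
  out <- x[stat_names]            # names that are absent come back as NA
  names(out) <- stat_names
  if (is.na(out[["NA's"]])) out[["NA's"]] <- 0   # no NA's entry means nothing missing
  out
})

# One row per group, with the group label as an ordinary column
res <- data.frame(Country = names(s), do.call(rbind, rows),
                  check.names = FALSE, row.names = NULL)
res
```

Groups without missing values then get 0 in the NA's column rather than a recycled value.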

You can also directly call aggregate so you don't need the extra step:
aggregate(Sepal.Length ~ Species, iris, summary)
# Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
# 1 setosa 4.300 4.800 5.000 5.006 5.200 5.800
# 2 versicolor 4.900 5.600 5.900 5.936 6.300 7.000
# 3 virginica 4.900 6.225 6.500 6.588 6.900 7.900
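Note that the result of this aggregate() call has only two columns: Species and a single matrix column named Sepal.Length whose six columns hold the statistics. The long names in the printout are generated at print time. If ordinary columns are needed (for example before writing the table out), one common idiom is to rebuild the data frame; a minimal sketch:

```r
agg <- aggregate(Sepal.Length ~ Species, iris, summary)
ncol(agg)                      # 2: Species plus one matrix column
is.matrix(agg$Sepal.Length)    # TRUE

# Rebuilding through data.frame() splits the matrix into separate columns
flat <- do.call(data.frame, agg)
ncol(flat)                     # 7
names(flat)
```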


How to extract core values for summary() with a data frame? [duplicate]

I have this admission_table containing ADMIT, GRE, GPA and RANK.
> head(admission_table)
ADMIT GRE GPA RANK
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
I'm trying to convert the summary of this table into data.frame. I want to have ADMIT, GRE, GPA and RANK as my column headers.
> summary(admission_table)
ADMIT GRE GPA RANK
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
> as.data.frame(summary(admission_table))
Var1 Var2 Freq
1 ADMIT Min. :0.0000
2 ADMIT 1st Qu.:0.0000
3 ADMIT Median :0.0000
4 ADMIT Mean :0.3175
5 ADMIT 3rd Qu.:1.0000
6 ADMIT Max. :1.0000
7 GRE Min. :220.0
8 GRE 1st Qu.:520.0
9 GRE Median :580.0
10 GRE Mean :587.7
11 GRE 3rd Qu.:660.0
12 GRE Max. :800.0
13 GPA Min. :2.260
14 GPA 1st Qu.:3.130
15 GPA Median :3.395
16 GPA Mean :3.390
17 GPA 3rd Qu.:3.670
18 GPA Max. :4.000
19 RANK Min. :1.000
20 RANK 1st Qu.:2.000
21 RANK Median :2.000
22 RANK Mean :2.485
23 RANK 3rd Qu.:3.000
24 RANK Max. :4.000
This is the only result I get when I convert the summary to a data.frame. I want the data frame to have exactly the same layout as the summary table, because I then want to insert it into an Oracle database using this line of code:
dbWriteTable(connection,name="SUM_ADMISSION_TABLE",value=as.data.frame(summary(admission_table)),row.names = FALSE, overwrite = TRUE ,append = FALSE)
Is there any way to do so?
You can consider unclass, I suppose (mydf here holds just the six rows shown by head() in the question):
data.frame(unclass(summary(mydf)), check.names = FALSE, stringsAsFactors = FALSE)
# ADMIT GRE GPA RANK
# 1 Min. :0.0000 Min. :380.0 Min. :2.930 Min. :1.000
# 2 1st Qu.:0.2500 1st Qu.:550.0 1st Qu.:3.047 1st Qu.:2.250
# 3 Median :1.0000 Median :650.0 Median :3.400 Median :3.000
# 4 Mean :0.6667 Mean :626.7 Mean :3.400 Mean :2.833
# 5 3rd Qu.:1.0000 3rd Qu.:735.0 3rd Qu.:3.655 3rd Qu.:3.750
# 6 Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
str(.Last.value)
# 'data.frame': 6 obs. of 4 variables:
# $ ADMIT: chr "Min. :0.0000 " "1st Qu.:0.2500 " "Median :1.0000 " "Mean :0.6667 " ...
# $ GRE : chr "Min. :380.0 " "1st Qu.:550.0 " "Median :650.0 " "Mean :626.7 " ...
# $ GPA : chr "Min. :2.930 " "1st Qu.:3.047 " "Median :3.400 " "Mean :3.400 " ...
# $ RANK: chr "Min. :1.000 " "1st Qu.:2.250 " "Median :3.000 " "Mean :2.833 " ...
Note that there is a lot of excessive whitespace there, in both the names and the values.
However, it might be sufficient to do something like:
do.call(cbind, lapply(mydf, summary))
# ADMIT GRE GPA RANK
# Min. 0.0000 380.0 2.930 1.000
# 1st Qu. 0.2500 550.0 3.048 2.250
# Median 1.0000 650.0 3.400 3.000
# Mean 0.6667 626.7 3.400 2.833
# 3rd Qu. 1.0000 735.0 3.655 3.750
# Max. 1.0000 800.0 4.000 4.000
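The result of do.call(cbind, ...) is a numeric matrix with the statistic names as row names. A possible way to turn it into a data frame with the statistic as an ordinary column, which may suit dbWriteTable better than row names. This sketch assumes mydf holds the six rows printed by head() in the question; the column name Statistic is my own choice:

```r
# The six rows shown by head(admission_table) in the question
mydf <- data.frame(
  ADMIT = c(0, 1, 1, 1, 0, 1),
  GRE   = c(380, 660, 800, 640, 520, 760),
  GPA   = c(3.61, 3.67, 4.00, 3.19, 2.93, 3.00),
  RANK  = c(3, 3, 1, 4, 4, 2)
)

m <- do.call(cbind, lapply(mydf, summary))   # 6 x 4 numeric matrix

# Move the row names into a column and reset the row names to 1..6
out <- data.frame(Statistic = rownames(m), m, row.names = NULL)
str(out)
```

The value columns stay numeric here, unlike the character columns produced by the unclass(summary(mydf)) approach.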
Another way to output a dataframe is:
as.data.frame(apply(mydf, 2, summary))
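This works only if all of the selected columns are numeric.
It may also throw an Error in dimnames(x) if there are columns with NA's: when only some columns contain NA's, their summaries have an extra element, so apply() returns a list rather than a matrix. It's worth checking the apply() result on its own first, without the as.data.frame() call.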
None of these solutions actually capture the printed output of the summary function. The tidy() function (presumably the one from the broom package) extracts the elements from a summary object and makes a plain data.frame, so it does not preserve other features or formatting.
If you want the exact output of the summary function in a data frame, you can do:
output <- capture.output(summary(thisModel))
output_df <- as.data.frame(output)
This retains all of the new lines and is suitable for writing to XLSX, etc., which will result in the output appropriately spaced across rows.
If you want this output collapsed into a single cell, you can do:
output_collapsed <- paste(output, collapse = "\n")
output_df <- as.data.frame(output_collapsed)

Summarize the same variables from multiple dataframes in one table

I have voter and party-data from several datasets that I further separated into different dataframes and lists to make it comparable. I could just use the summary command on each of them individually then compare manually, but I was wondering whether there was a way to get them all together and into one table?
Here's a sample of what I have:
> summary(eco$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 4 4 4 4 5
> summary(ecovoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.744 5.000 10.000 26
> summary(lef$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.692 4.000 7.000
> summary(lefvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 3.000 3.612 5.000 10.000 332
> summary(soc$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 4.000 4.143 5.000 6.000
> summary(socvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.674 5.000 10.000 346
Is there a way I can summarize these lists (ecovoters, lefvoters, socvoters etc) and the dataframe variables (eco$rilenew, lef$rilenew, soc$rilenew etc) together and have them in one table?
You could put everything into a list and summarize with a small custom function.
L <- list(eco$rilenew, ecovoters, lef$rilenew,
          lefvoters, soc$rilenew, socvoters)
t(sapply(L, function(x) {
  s <- summary(x)
  length(s) <- 7
  names(s)[7] <- "NA's"
  s[7] <- ifelse(!any(is.na(x)), 0, s[7])
  return(s)
}))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 0.9820673 3.3320662 3.958665 3.949512 4.625109 7.229069 0
[2,] -4.8259384 0.5028293 3.220546 3.301452 6.229384 9.585749 26
[3,] -0.3717391 2.3280366 3.009360 3.013908 3.702156 6.584659 0
[4,] -2.6569493 1.6674330 3.069440 3.015325 4.281100 8.808432 332
[5,] -2.3625651 2.4964361 3.886673 3.912009 5.327401 10.349040 0
[6,] -2.4719404 1.3635785 2.790523 2.854812 4.154936 8.491347 346
Data
set.seed(42)
eco <- data.frame(rilenew=rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA
lef <- data.frame(rilenew=rnorm(900, 3, 1))
lefvoters <- rnorm(700, 3, 2)
lefvoters[sample(length(lefvoters), 332)] <- NA
soc <- data.frame(rilenew=rnorm(900, 4, 2))
socvoters <- rnorm(700, 3, 2)
socvoters[sample(length(socvoters), 346)] <- NA
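The rows of the matrix above are labelled [1,] to [6,] because L has no names. A small variation, sketched here on made-up vectors rather than the data above, names the list elements so that sapply() carries them through to the row names. It also tests for the NA's name directly instead of setting the length to 7:

```r
set.seed(1)
# Hypothetical stand-ins for eco$rilenew and ecovoters
L <- list(
  eco_rilenew = rnorm(20, 4, 1),
  ecovoters   = c(rnorm(18, 4, 4), NA, NA)
)

tab <- t(sapply(L, function(x) {
  s <- unclass(summary(x))                   # plain named numeric vector
  if (!"NA's" %in% names(s)) s["NA's"] <- 0  # zero count when nothing is missing
  s
}))
tab
```

Wrapping the result in as.data.frame(tab) gives a data frame with the same row labels.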
You can use map() from purrr (part of the tidyverse) to get a list of summaries; if you want the result as a data frame, plyr::ldply can convert the list:
ll = map(L, summary)
ll
plyr::ldply(ll, rbind)
> ll = map(L, summary)
> ll
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9821 3.3321 3.9587 3.9495 4.6251 7.2291
[[2]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.331 1.347 3.726 3.793 6.653 16.845 26
[[3]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3717 2.3360 3.0125 3.0174 3.7022 6.5847
[[4]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.657 1.795 3.039 3.013 4.395 9.942 332
[[5]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.363 2.503 3.909 3.920 5.327 10.349
[[6]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-3.278 1.449 2.732 2.761 4.062 8.171 346
> plyr::ldply(ll, rbind)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 0.9820673 3.332066 3.958665 3.949512 4.625109 7.229069 NA
2 -4.3312551 1.346532 3.725708 3.793431 6.652917 16.844796 26
3 -0.3717391 2.335959 3.012507 3.017438 3.702156 6.584659 NA
4 -2.6569493 1.795307 3.038905 3.012928 4.395338 9.941819 332
5 -2.3625651 2.503324 3.908727 3.920050 5.327401 10.349040 NA
6 -3.2779863 1.448814 2.732515 2.760569 4.061854 8.170793 346

R summary() gives incorrect values for too many NAs

The Setup
I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary() on it, I get a mean and maximum that are wrong (in disagreement with mean() and max()).
> summary(data, digits = 10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 1 1 1 1 4365239
When mean() is called separately, it returns a reasonable value:
> mean(data, na.rm = T)
[1] 0.6804823
Characterization of the problem
It looks like this problem is generic to any vector with more than 3162277 NA values in it.
With just under the cutoff:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.5 0.5 1.0 1.0 3162277
And just over:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
It doesn't seem to matter how many non-missing values there are either.
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.2 0.5 0.5 0.8 1.0 3162277
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
Research
In searching for an answer, I came across the well-known rounding error, but that doesn't affect this behavior (see the first code chunk).
I thought this might be some sort of bizarre quirk of my environment/machine/planetary alignment, so I had my sister run the same code. She got the same results on her machine.
Closing remarks
Clearly, this isn't a critical problem because the mean() and max() functions can be used instead of summary(), but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.
EDIT: I said mean and max throughout the post, but the max is fine. The 1st quartile, median, and 3rd quartile differ.
Here's some example data:
x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
mean(x, na.rm=TRUE)
#[1] 0.3181818
The issue can be traced back to zapsmall() as it does some rounding in a line that essentially does:
c(out)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06
round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
The critical turning point is the step from 3162277 to 3162278 NA values (10^6.5 is about 3162277.66): the digits argument passed to round() drops below 0.5, so the number of decimal places kept goes from 1 to 0.
dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876
dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896
out[7] <- 3162277
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.0 0.0 0.0 0.3 1.0 1.0 3162277
out[7] <- 3162278
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 3162278
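The threshold can be checked with a little arithmetic: the digits value reaches 0.5 when the NA count equals 10^(7 - 0.5), assuming the default getOption("digits") of 7. A short sketch, with digits hard-coded so the result does not depend on session options:

```r
digits <- 7                       # assumed default of getOption("digits")
n_na   <- c(3162277, 3162278)

10^(digits - 0.5)                 # about 3162277.66, between the two counts

d <- pmax(0, digits - log10(n_na))
d                                 # just above and just below 0.5

# Nearest whole number of decimal places, matching the behaviour shown above
places <- round(d)
places

round(0.3181818, places[1])       # mean kept to one decimal place
round(0.3181818, places[2])       # mean rounded to a whole number
```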
An update to @thelatemail's answer:
R (version 4.1.2, at least) will do this for many fewer than 3162277 NA's if your other values are small. For instance, with only 10 NA's,
bar <- c(1:10/100, rep(x=NA, times=10))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10.0000
summary() gets it right. But with 10^4 NA's there's some bad rounding,
bar <- c(1:10/100, rep(x=NA, times=1E4))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.1 0.1 0.1 0.1 10000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 10000.0000
and with 10^5 NA's everything's rounded to zero:
bar <- c(1:10/100, rep(x=NA, times=1E5))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 0 0 100000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 100000.0000
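As the print.default() calls show, only the printed representation is affected; the values stored in the summary object are intact. If the numbers are needed programmatically, one option (a sketch, not the only way) is to drop the class and index by name:

```r
bar <- c(1:10 / 100, rep(NA, 1e5))

s <- unclass(summary(bar))   # plain named numeric vector, no special print method
s[["Mean"]]
s[["NA's"]]

# Calling the individual functions also avoids the display rounding
mean(bar, na.rm = TRUE)
quantile(bar, na.rm = TRUE)
```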

Summary of each column of data.table or data.frame for xtable

I want to use the summary of each column of a data.table or data.frame in a Sweave document with the xtable package. Here is an MWE.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
library(xtable)
lapply(iris, summary)
$Sepal.Length
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
$Sepal.Width
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.800 3.000 3.057 3.300 4.400
$Petal.Length
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.600 4.350 3.758 5.100 6.900
$Petal.Width
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.100 0.300 1.300 1.199 1.800 2.500
$Species
setosa versicolor virginica
50 50 50
xtableList(lapply(iris, summary))
Error in xtable.table(x[[i]], caption = caption, label = label, align = align, :
xtable.table is not implemented for tables of > 2 dimensions
I wonder how to get the summary of each column as a separate table for use with Sweave or knitr. Thanks in advance.
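No answer to this question is included here. One possible direction, sketched in base R only: convert each column's summary into a two-column data frame, since plain data frames are something xtable() accepts. The column names Statistic and Value are my own choice, and the xtable call mentioned in the comment is a suggestion rather than something taken from the thread:

```r
# One small data frame per column of iris
tabs <- lapply(iris, function(col) {
  s <- summary(col)
  data.frame(Statistic = names(s), Value = as.vector(s))
})

tabs$Sepal.Length   # six rows: Min. through Max.
tabs$Species        # three rows: one count per factor level

# Each element is an ordinary data frame, so it could then be passed on
# to a table generator, e.g. print(xtable::xtable(tabs$Sepal.Length))
```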

Restructure output of R summary function

Is there an easy way to change the output format for R's summary function so that the results print in a column instead of row? R does this automatically when you pass summary a data frame. I'd like to print summary statistics in a column when I pass it a single vector. So instead of this:
>summary(vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 6.699 6.000 559.000
It would look something like this:
>summary(vector)
Min. 1.000
1st Qu. 1.000
Median 2.000
Mean 6.699
3rd Qu. 6.000
Max. 559.000
Sure. Treat it as a data.frame:
set.seed(1)
x <- sample(30, 100, TRUE)
summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 10.00 15.00 16.03 23.25 30.00
summary(data.frame(x))
# x
# Min. : 1.00
# 1st Qu.:10.00
# Median :15.00
# Mean :16.03
# 3rd Qu.:23.25
# Max. :30.00
For slightly more usable output, you can use data.frame(unclass(.)):
data.frame(val = unclass(summary(x)))
# val
# Min. 1.00
# 1st Qu. 10.00
# Median 15.00
# Mean 16.03
# 3rd Qu. 23.25
# Max. 30.00
Or you can use stack:
stack(summary(x))
# values ind
# 1 1.00 Min.
# 2 10.00 1st Qu.
# 3 15.00 Median
# 4 16.03 Mean
# 5 23.25 3rd Qu.
# 6 30.00 Max.
