R - Discrepancy in summary(data) and summary(data$variable)

I have a data set with 61 observations and 2 variables. When I call summary() on the whole data frame, the quantiles, median, mean and max of the second variable sometimes differ from what I get when I summarize the second variable alone. Why is that?
data <- read.csv("testdata.csv")
head(data)
# Group.1 x
# 1 10/1/12 0
# 2 10/2/12 126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420
summary(data)
# Group.1 x
# 10/1/12 : 1 Min. : 0
# 10/10/12: 1 1st Qu.: 6778
# 10/11/12: 1 Median :10395
# 10/12/12: 1 Mean : 9354
# 10/13/12: 1 3rd Qu.:12811
# 10/14/12: 1 Max. :21194
# (Other) :55
summary(data[2])
# x
# Min. : 0
# 1st Qu.: 6778
# Median :10395
# Mean : 9354
# 3rd Qu.:12811
# Max. :21194
# The following code yields a different result:
summary(data$x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0 6778 10400 9354 12810 21190

@r2evans' comment is correct: the discrepancy is caused by differences between summary.data.frame and summary.default.
The default value of digits for both methods is max(3L, getOption("digits") - 3L). If you haven't changed your options, this will evaluate to 4L. However, the two methods use their digits argument differently when formatting the output, which is the reason for the differences in the two methods' output. From ?summary:
digits: integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
Say we have the vector of x's summary statistics in the question:
q <- append(quantile(data$x), mean(data$x), after = 3L)
q
## 0% 25% 50% 75% 100%
## 0.00 6778.00 10395.00 9354.23 12811.00 21194.00
In summary.default the output is formatted with signif, which rounds its input to the supplied number of significant digits:
signif(q, digits = 4L)
## 0% 25% 50% 75% 100%
## 0 6778 10400 9354 12810 21190
summary.data.frame, in contrast, uses format, which treats its digits argument only as a suggestion (see ?format) for the number of significant digits to display:
format(q, digits = 4L)
## 0% 25% 50% 75% 100%
## " 0" " 6778" "10395" " 9354" "12811" "21194"
Thus, with the default digits value of 4, summary.default(data$x) rounds the 5-digit quantiles to 4 significant digits, whereas summary.data.frame(data[2]) displays the 5-digit quantiles without rounding.
If you explicitly supply the digits argument as larger than 4, you'll get identical results:
summary(data[2], digits = 5L)
## x
## Min. : 0.0
## 1st Qu.: 6778.0
## Median :10395.0
## Mean : 9354.2
## 3rd Qu.:12811.0
## Max. :21194.0
summary(data$x, digits = 5L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 6778.0 10395.0 9354.2 12811.0 21194.0
An extreme example of how the two methods differ with the default digits:
df <- data.frame(a = 1e5 + 0:100)
summary(df$a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 100000 100000 100000 100100 100100
summary(df)
## a
## Min. :100000
## 1st Qu.:100025
## Median :100050
## Mean :100050
## 3rd Qu.:100075
## Max. :100100
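To check which of the two printed summaries is exact, the statistics can be computed directly and then rounded by hand. This is only a minimal sketch that reuses the df from the example above; the names stats, rounded and shown are illustrative and do not appear in the original answer.

```r
# Same data as in the example above
df <- data.frame(a = 1e5 + 0:100)

# Compute the statistics directly, without going through summary()
stats <- c(quantile(df$a), Mean = mean(df$a))

# signif() rounds to 4 significant digits, so the trailing digits are lost
rounded <- signif(stats, digits = 4L)

# format() only treats digits as a suggestion and never drops digits
# from the integer part
shown <- format(stats, digits = 4L)

print(stats)
print(rounded)
print(shown)
```

Comparing rounded with shown reproduces the two behaviours described above without depending on how a particular R version prints summary objects.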

Related

Why is summary() produced from data.table output not printing to file?

I get an error when writing the output of a summary function to a file. I have a column "bin" with three factor levels and want to return a five-number summary for each level. The five-number summary prints to the screen but is not written to the file. The message I get is:
Empty data.table (0 rows) of 1 col: bin
Data:
A B info C bin
1: 10-60494 0.66392100 0.001833330 1 MAF0.01
2: rs148087467 0.35274000 0.000716240 1 MAF0.01
3: rs187110906 0.40586900 0.004488040 1 MAF0.01
4: rs192025213 0.00743299 0.000000000 1 MAF0.01
5: rs115033199 0.32829300 0.000614316 1 MAF0.01
6: rs183305313 0.51721200 0.002892520 1 MAF0.01
s <- df2[, print(summary(info)), by='bin']
print(s)
write.table(as.data.frame(s),
quote=FALSE,file=paste(i,"sum_out.txt",sep=''))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0009998 0.0371300 0.2016000 0.2700000 0.4477000 1.0000000
You are getting zero rows because the only thing you do in j is print the outcome of the summary command.
Considering the following example data:
library(data.table)
set.seed(2018)
dt <- data.table(bin = rep(c('A','B'), 5), val = rnorm(10,3,1))
Now when you do (like in your question):
s <- dt[, print(summary(val)), by = bin]
the summary statistics are printed to the console but it results in an empty data.table:
> s <- dt[, print(summary(val)), by = bin]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.389 2.577 2.936 3.547 4.735 5.099
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 2.735 3.271 2.991 3.637 3.863
> s
Empty data.table (0 rows) of 1 col: bin
Removing the print-command doesn't help:
> dt[, summary(val), by = bin]
bin V1
1: A 2.389
2: A 2.577
3: A 2.936
4: A 3.547
5: A 4.735
6: A 5.099
7: B 1.450
8: B 2.735
9: B 3.271
10: B 2.991
11: B 3.637
12: B 3.863
because summary returns a table object, which data.table treats as a vector.
Instead of using print, you should use as.list to get the elements of summary as columns in a data.table:
s <- dt[, as.list(summary(val)), by = bin]
now the summary statistics are included in the resulting data.table:
> s
bin Min. 1st Qu. Median Mean 3rd Qu. Max.
1: A 2.389413 2.577016 2.935571 3.547351 4.735284 5.099471
2: B 1.450122 2.735289 3.270881 2.991340 3.637056 3.863351
Because the summary statistics are stored in the non-empty data.table s, you can write s to a file with, for example, fwrite (the fast write function from the data.table package).
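The whole round trip can be sketched as follows. This assumes the data.table package is installed; out_file is a hypothetical temporary path chosen for illustration, not the file name from the question.

```r
library(data.table)

set.seed(2018)
dt <- data.table(bin = rep(c("A", "B"), 5), val = rnorm(10, 3, 1))

# One row per bin, one column per summary statistic
s <- dt[, as.list(summary(val)), by = bin]

# Hypothetical output location; replace with a real path as needed
out_file <- tempfile(fileext = ".txt")
fwrite(s, out_file, sep = "\t")

# Read the file back to confirm that the rows were written
written <- fread(out_file)
print(written)
```

Reading the file back with fread is only a sanity check; in a real script the fwrite call alone is enough.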
This can be achieved using sapply() - here is an example using the iris data frame:
levels <- unique(iris$Species)
result <- data.frame(t(sapply(levels, function (x) summary(subset(iris, Species == levels[x])$Petal.Width))))
> result
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
1 0.1 0.2 0.2 0.246 0.3 0.6
2 1.0 1.2 1.3 1.326 1.5 1.8
3 1.4 1.8 2.0 2.026 2.3 2.5
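A shorter base-R variant of the same idea is sketched below. It assumes that splitting the column by group with split() is acceptable, and it keeps the species names as row names; out_file is a hypothetical temporary path.

```r
# Split the measurements by group, summarise each piece, and stack the results
by_species <- split(iris$Petal.Width, iris$Species)
result <- t(sapply(by_species, summary))
print(result)

# The result is a plain numeric matrix, so write.table() can save it directly
out_file <- tempfile(fileext = ".txt")
write.table(result, file = out_file, quote = FALSE, sep = "\t")
```

Because split() names each piece after its factor level, the rows of result are labelled setosa, versicolor and virginica rather than 1, 2 and 3.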

Remove the text in the cells having only those numbers and name the row names

Using trees dataset.
data(trees)
Each cell of the summary output contains the label of the statistic (Min., Max., 1st Qu. and so on) together with its value. I want the cells to contain only the numbers, with the labels used as row names for the whole dataset.
Need Output like this
We can apply summary on each of the columns separately by looping with sapply.
data(trees)
sapply(trees, summary)
# Girth Height Volume
# Min. 8.30 63 10.20
# 1st Qu. 11.05 72 19.40
# Median 12.90 76 24.20
# Mean 13.25 76 30.17
# 3rd Qu. 15.25 80 37.30
# Max. 20.60 87 77.00
The OP's output may have resulted from applying the summary directly on the whole dataset.
summary(trees)
# Girth Height Volume
# Min. : 8.30 Min. :63 Min. :10.20
# 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
# Median :12.90 Median :76 Median :24.20
# Mean :13.25 Mean :76 Mean :30.17
# 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
# Max. :20.60 Max. :87 Max. :77.00
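The practical difference between the two results is that sapply(trees, summary) gives a numeric matrix, while summary(trees) gives a table of pre-formatted strings. A small sketch of how the matrix can be used afterwards (the names stats and stats_df are illustrative):

```r
data(trees)

# Numeric matrix: statistics as row names, variables as column names
stats <- sapply(trees, summary)

# Individual values can be looked up by row and column name
stats["Mean", "Height"]

# Convert to a data frame if the row names should travel with the data
stats_df <- as.data.frame(stats)
print(stats_df)
```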

Create summary statistics based on condition

Normally to make summary statistics on a condition I would say
summary(data$how_fast[data$weight == 'Medium' & data$height == 'High'], basic = T)
But what I would like is to output all of the summary statistics for every variable.
summary(data[data$weight == 'Medium' & data$height == 'High', ], basic = T)
So we'd get summary statistics not just for $how_fast, but also for other variable like $start_speed or $medals.
Ideally, it'd be stored in an awesome table (although I believe you can do this using the rtf package).
by applies a function to subsets of a data frame that are defined by one or more grouping variables. The output is an array with dimensionality based on your grouping.
dat <- data.frame(A = rep(1:2, each = 10),
B = rep(1:2, times = 10), C = rpois(20, 1))
by(data = dat, INDICES = dat[c("A", "B")], FUN = summary, basic = TRUE)
# A: 1
# B: 1
# A B C
# Min. :1 Min. :1 Min. :0.0
# 1st Qu.:1 1st Qu.:1 1st Qu.:0.0
# Median :1 Median :1 Median :0.0
# Mean :1 Mean :1 Mean :0.6
# 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1.0
# Max. :1 Max. :1 Max. :2.0
# -------------------------------------------------------------
# ...
This lets you summarize for all groupings in a data.frame. To just apply for a single subset you could use lapply.
lapply(X = dat[dat$A == 1 & dat$B == 1, ], # element-wise &, not the scalar &&
FUN = summary, basic = TRUE)
# $A
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $B
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $C
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0 0.0 0.0 0.6 1.0 2.0
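To get a single table covering every variable for one subset, as the question asks, the rows can be filtered first and the whole data frame passed to summary(). A minimal sketch; set.seed(1) and the names keep and subset_summary are assumptions added here so that the example is reproducible.

```r
set.seed(1)  # assumption: any seed works, it only makes column C reproducible
dat <- data.frame(A = rep(1:2, each = 10),
                  B = rep(1:2, times = 10),
                  C = rpois(20, 1))

# Element-wise & builds one TRUE/FALSE value per row
keep <- dat$A == 1 & dat$B == 1

# summary() on the filtered data frame reports every column at once
subset_summary <- summary(dat[keep, ])
print(subset_summary)
```

The comma inside dat[keep, ] matters: without it the logical vector would be used to select columns rather than rows.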

force summary() to report the number of NA's even if none

I have many numeric vectors, some have NA's, some don't. Here is an example with two vectors:
x1 <- c(1,2,3,2,2,4)
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
x2 <- c(1,2,3,2,2,4,NA)
summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 2.000 2.000 2.333 2.750 4.000 1
In the end, I want to rbind all the summaries:
rbind(summary(x1), summary(x2))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 1
[2,] 1 2 2 2.333 2.75 4 1
Warning message:
In rbind(summary(x1), summary(x2)) :
number of columns of result is not a multiple of vector length (arg 1)
Is there a way to force summary to count NA's without an error or a warning?
All my trials failed:
summary(x1, na.rm=FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(x1, useNA="always")
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(addNA(x1))
1 2 3 4 <NA>
1 3 1 1 0
I also tried the following, but it is a bit of a hack:
tmp <- rbind(summary(x1[complete.cases(x1)]), summary(x2[complete.cases(x2)]))
tmp <- cbind(tmp, c(sum(is.na(x1)), sum(is.na(x2))))
colnames(tmp)[ncol(tmp)] <- "NA's"
tmp
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 0
[2,] 1 2 2 2.333 2.75 4 1
I have not found a way to force summary to display NA's. However, you could write a custom function that returns what you want:
my_summary <- function(v) {
  if (!any(is.na(v))) {
    res <- c(summary(v), "NA's" = 0)
  } else {
    res <- summary(v)
  }
  return(res)
}
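Applied to the two vectors from the question, a helper of this kind lets the summaries be stacked without a warning. The sketch below is a compact variant rather than the exact function above: the name summary_with_na is hypothetical, and unclass() is used on the assumption that plain named numeric vectors combine most predictably.

```r
# Hypothetical helper: always returns seven named values, including "NA's"
summary_with_na <- function(v) {
  s <- unclass(summary(v))                      # plain named numeric vector
  if (!("NA's" %in% names(s))) s <- c(s, "NA's" = 0)
  s
}

vectors <- list(x1 = c(1, 2, 3, 2, 2, 4),
                x2 = c(1, 2, 3, 2, 2, 4, NA))

# One row per vector, one column per statistic
result <- do.call(rbind, lapply(vectors, summary_with_na))
print(result)
```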
The problem is that you are combining vectors of different lengths, so you can extend the shorter one to the length of the longer one. When you combine them, the missing entry becomes NA, which is easy to replace with zero.
s1 <- summary(x1)
s2 <- summary(x2)
length(s1) <- length(s2)
s <- rbind(s2,s1)
s[is.na(s)] <- 0
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
s2 1 2 2 2.333 2.75 4 1
s1 1 2 2 2.333 2.75 4 0
The previous solutions ignore the fact that summary() also works on data.frames and matrices. I usually handle this with a recursive function definition, although the result is not exactly the same as that of the original summary() function.
summaryna <- function(x, ...) {
  # Recursive function definition in case of matrix or data.frame.
  if (is.matrix(x)) {
    return(apply(x, 2, function(x) summaryna(x, ...)))
  } else if (is.data.frame(x)) {
    return(sapply(x, function(x) summaryna(x, ...)))
  }
  # This is the actual function.
  res <- summary(x, ...)
  if (length(res) < 7) res <- c(res, "NA's" = 0)
  return(res)
}

Summary method results do not seem to be accurate for vectors

This is puzzling me. When you run summary() on a vector of integers you don't seem to get accurate results. The numbers seem to be rounded off. I tried this on three different machines with different OS's and the results are the same.
For a vector:
>a <- 0:628846
>str(a)
int [1:628847] 0 1 2 3 4 5 6 7 8 9 ...
>summary(a)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 157200 314400 314400 471600 628800
>max(a)
[1] 628846
For a data.frame:
> b <- data.frame(b = 0:628846)
> str(b)
'data.frame': 628847 obs. of 1 variable:
$ b: int 0 1 2 3 4 5 6 7 8 9 ...
> summary(b)
b
Min. : 0
1st Qu.:157212
Median :314423
Mean :314423
3rd Qu.:471635
Max. :628846
> summary(b$b)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 157200 314400 314400 471600 628800
Why are these results different?
The object a has class integer, while b has class data.frame. A data frame is a list with certain properties and with class data.frame (http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames). Many functions, including summary, handle objects of different classes differently (for example, summary on an object of class lm gives you something completely different). If you want to apply summary to every component of b, you could use lapply:
> a <- 0:628846
> b <- data.frame(b = 0:628846)
> class(a)
[1] "integer"
> class(b)
[1] "data.frame"
> names(b)
[1] "b"
> length(b)
[1] 1
> summary(b[[1]]) # b[[1]] gives the first component of the list b
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 157200 314400 314400 471600 628800
> class(b$b)
[1] "integer"
> summary(b$b)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 157200 314400 314400 471600 628800
> lapply(b,summary)
$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 157200 314400 314400 471600 628800
>
> # example of summary on a linear model
> x <- rnorm(100)
> y <- x + rnorm(100)
> my.lm <- lm(y~x)
> class(my.lm)
[1] "lm"
> summary(my.lm)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.6847 -0.5460 0.1175 0.6610 2.2976
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04122 0.09736 0.423 0.673
x 1.14790 0.09514 12.066 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9735 on 98 degrees of freedom
Multiple R-squared: 0.5977, Adjusted R-squared: 0.5936
F-statistic: 145.6 on 1 and 98 DF, p-value: < 2.2e-16
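The class-based behaviour described above can be made visible with a toy class. This is a sketch only: reading_log and summary.reading_log are hypothetical names invented for the illustration, while getS3method() is the standard utils function for looking up a method.

```r
a <- 0:628846
b <- data.frame(b = a)

# summary() is an S3 generic: the class of its argument selects the method
class(a)   # "integer", no integer method exists, so summary.default is used
class(b)   # "data.frame", so summary.data.frame is used
is.function(getS3method("summary", "data.frame"))

# Hypothetical class, defined only to show how dispatch picks a method
summary.reading_log <- function(object, ...) {
  c(n = length(object$values), largest = max(object$values))
}
log_obj <- structure(list(values = a), class = "reading_log")
summary(log_obj)

# The underlying statistics are exact regardless of how summary() prints them
max(a)
quantile(a, 0.25)
```

Calling max() or quantile() directly is a simple way to confirm that only the printed representation is rounded, not the data.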
