I am getting an error when writing the output of a summary function to a file. I have a column "bin" with three factor levels and want to return the five-number summary for each level. The five-number summary prints to the screen but won't write to the file. The error reported is:
Empty data.table (0 rows) of 1 col: bin
Data:
A B info C bin
1: 10-60494 0.66392100 0.001833330 1 MAF0.01
2: rs148087467 0.35274000 0.000716240 1 MAF0.01
3: rs187110906 0.40586900 0.004488040 1 MAF0.01
4: rs192025213 0.00743299 0.000000000 1 MAF0.01
5: rs115033199 0.32829300 0.000614316 1 MAF0.01
6: rs183305313 0.51721200 0.002892520 1 MAF0.01
s <- df2[, print(summary(info)), by='bin']
print(s)
write.table(as.data.frame(s),
quote=FALSE,file=paste(i,"sum_out.txt",sep=''))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0009998 0.0371300 0.2016000 0.2700000 0.4477000 1.0000000
The reason you are getting zero rows is because the only thing you do in j is print the outcome of the summary command.
Considering the following example data:
set.seed(2018)
dt <- data.table(bin = rep(c('A','B'), 5), val = rnorm(10,3,1))
Now when you do (like in your question):
s <- dt[, print(summary(val)), by = bin]
the summary statistics are printed to the console but it results in an empty data.table:
> s <- dt[, print(summary(val)), by = bin]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.389 2.577 2.936 3.547 4.735 5.099
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 2.735 3.271 2.991 3.637 3.863
> s
Empty data.table (0 rows) of 1 col: bin
Removing the print-command doesn't help:
> dt[, summary(val), by = bin]
bin V1
1: A 2.389
2: A 2.577
3: A 2.936
4: A 3.547
5: A 4.735
6: A 5.099
7: B 1.450
8: B 2.735
9: B 3.271
10: B 2.991
11: B 3.637
12: B 3.863
because summary returns a table-object which is treated as a vector by data.table.
Instead of using print, you should use as.list to get the elements of summary as columns in a data.table:
s <- dt[, as.list(summary(val)), by = bin]
Now the summary statistics are included in the resulting data.table:
> s
bin Min. 1st Qu. Median Mean 3rd Qu. Max.
1: A 2.389413 2.577016 2.935571 3.547351 4.735284 5.099471
2: B 1.450122 2.735289 3.270881 2.991340 3.637056 3.863351
Because the summary statistics are stored in the non-empty data.table s, you can write s to a file with, for example, fwrite (the fast write function of the data.table package).
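As a minimal sketch of that last step, assuming the data.table package is installed: the temporary file and the name out_file are made up for illustration and stand in for a real output path.

```r
library(data.table)

set.seed(2018)
dt <- data.table(bin = rep(c("A", "B"), 5), val = rnorm(10, 3, 1))

# One row per bin, one column per summary statistic
s <- dt[, as.list(summary(val)), by = bin]

# Hypothetical output location; replace with your own path
out_file <- tempfile(fileext = ".txt")
fwrite(s, file = out_file, sep = "\t")

# Read the file back to check what was written
back <- fread(out_file)
print(back)
```

Because s is an ordinary data.table, write.table(s, ...) from the question should work on it as well.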
This can be achieved using sapply() - here is an example using the iris data frame:
levels <- unique(iris$Species)
result <- data.frame(t(sapply(levels, function (x) summary(subset(iris, Species == x)$Petal.Width))))
> result
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
1 0.1 0.2 0.2 0.246 0.3 0.6
2 1.0 1.2 1.3 1.326 1.5 1.8
3 1.4 1.8 2.0 2.026 2.3 2.5
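A similar per-group table can be sketched in base R with tapply(), which keeps the species names as row names. The temporary file is only a placeholder for a real path, and out_file is a name chosen for this example.

```r
# Summary of Petal.Width for each species; tapply() returns one summary per group
by_species <- tapply(iris$Petal.Width, iris$Species, summary)

# Stack the per-group summaries into a matrix (rows = species, columns = statistics)
result <- do.call(rbind, by_species)
print(result)

# Write the table to a tab-separated file; col.names = NA leaves a blank
# header cell above the row names
out_file <- tempfile(fileext = ".txt")
write.table(result, file = out_file, quote = FALSE, sep = "\t", col.names = NA)
```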
I have a data set with 61 observations and 2 variables. When I summarize the whole data set, the quantiles, median, mean and max of the second variable are sometimes different from the result I get when I summarize the second variable alone. Why is that?
data <- read.csv("testdata.csv")
head(data)
# Group.1 x
# 1 10/1/12 0
# 2 10/2/12 126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420
summary(data)
# Group.1 x
# 10/1/12 : 1 Min. : 0
# 10/10/12: 1 1st Qu.: 6778
# 10/11/12: 1 Median :10395
# 10/12/12: 1 Mean : 9354
# 10/13/12: 1 3rd Qu.:12811
# 10/14/12: 1 Max. :21194
# (Other) :55
summary(data[2])
# x
# Min. : 0
# 1st Qu.: 6778
# Median :10395
# Mean : 9354
# 3rd Qu.:12811
# Max. :21194
# The following code yields a different result:
summary(data$x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0 6778 10400 9354 12810 21190
@r2evans' comment is correct in that the discrepancy is caused by differences between summary.data.frame and summary.default.
The default value of digits for both methods is max(3L, getOption("digits") - 3L). If you haven't changed your options, this will evaluate to 4L. However, the two methods use their digits argument differently when formatting the output, which is the reason for the differences in the two methods' output. From ?summary:
digits: integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
Say we have the vector of x's summary statistics in the question:
q <- append(quantile(data$x), mean(data$x), after = 3L)
q
## 0% 25% 50% 75% 100%
## 0.00 6778.00 10395.00 9354.23 12811.00 21194.00
In summary.default the output is formatted using signif, which rounds its input to the supplied number of significant digits:
signif(q, digits = 4L)
## 0% 25% 50% 75% 100%
## 0 6778 10400 9354 12810 21190
While summary.data.frame uses format, which treats its digits argument only as a suggestion (see ?format) for the number of significant digits to display:
format(q, digits = 4L)
## 0% 25% 50% 75% 100%
## " 0" " 6778" "10395" " 9354" "12811" "21194"
Thus, with the default digits value of 4, summary.default(data$x) rounds the 5-digit quantiles to 4 significant digits, but summary.data.frame(data[2]) displays the 5-digit quantiles without rounding.
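The difference can be checked by calling signif() and format() directly, independently of what a particular R version's summary() does internally. The sketch below uses three of the five-digit statistics from the question; the variable names are chosen for illustration.

```r
# Three of the five-digit statistics from the question
q <- c(10395, 12811, 21194)

# signif() changes the values themselves, keeping 4 significant digits
rounded <- signif(q, digits = 4)
print(rounded)

# format() only uses digits to decide how many decimal places to show;
# the integer part of each number is never rounded away
shown <- format(q, digits = 4)
print(shown)
```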
If you explicitly supply the digits argument as larger than 4, you'll get identical results:
summary(data[2], digits = 5L)
## x
## Min. : 0.0
## 1st Qu.: 6778.0
## Median :10395.0
## Mean : 9354.2
## 3rd Qu.:12811.0
## Max. :21194.0
summary(data$x, digits = 5L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 6778.0 10395.0 9354.2 12811.0 21194.0
As an extreme example of the differences of the two methods with the default digits:
df <- data.frame(a = 1e5 + 0:100)
summary(df$a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 100000 100000 100000 100100 100100
summary(df)
## a
## Min. :100000
## 1st Qu.:100025
## Median :100050
## Mean :100050
## 3rd Qu.:100075
## Max. :100100
Normally, to get summary statistics subject to a condition, I would write
summary(data$how_fast[data$weight == 'Medium' & data$height == 'High'], basic = T)
But what I would like is to output all of the summary statistics for every variable.
summary(data[data$weight == 'Medium' & data$height == 'High', ], basic = T)
So we'd get summary statistics not just for $how_fast, but also for other variables like $start_speed or $medals.
Ideally, it would be stored in a nicely formatted table (although I believe you can do this using the rtf package).
by lets you apply functions to data frames. The output is an array with dimensionality based on your grouping.
dat <- data.frame(A = rep(1:2, each = 10),
B = rep(1:2, times = 10), C = rpois(20, 1))
by(data = dat, INDICES = dat[c("A", "B")], FUN = summary, basic = TRUE)
# A: 1
# B: 1
# A B C
# Min. :1 Min. :1 Min. :0.0
# 1st Qu.:1 1st Qu.:1 1st Qu.:0.0
# Median :1 Median :1 Median :0.0
# Mean :1 Mean :1 Mean :0.6
# 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1.0
# Max. :1 Max. :1 Max. :2.0
# -------------------------------------------------------------
# ...
This lets you summarize all groupings in a data.frame. To summarize just a single subset, you could use lapply. Note the element-wise & in the row filter: && is meant for single values and would not select rows here.
lapply(X = dat[dat$A == 1 & dat$B == 1, ],
FUN = summary, basic = TRUE)
# $A
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $B
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $C
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0 0.0 0.0 0.6 1.0 2.0
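A small sketch of the single-subset case with a fixed seed, collecting the per-variable summaries into one matrix with sapply(). The seed and the names keep, sub and tab are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
dat <- data.frame(A = rep(1:2, each = 10),
                  B = rep(1:2, times = 10),
                  C = rpois(20, 1))

# Element-wise & gives one TRUE/FALSE per row; && is meant for single values
keep <- dat$A == 1 & dat$B == 1
sub <- dat[keep, ]

# One column of summary statistics per variable
tab <- sapply(sub, summary)
print(tab)
```

Since tab is a plain numeric matrix, it can be passed on to write.table() or a table-formatting package.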
I have many numeric vectors, some have NA's, some don't. Here is an example with two vectors:
x1 <- c(1,2,3,2,2,4)
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
x2 <- c(1,2,3,2,2,4,NA)
summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 2.000 2.000 2.333 2.750 4.000 1
In the end, I want to rbind all the summaries:
rbind(summary(x1), summary(x2))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 1
[2,] 1 2 2 2.333 2.75 4 1
Warning message:
In rbind(summary(x1), summary(x2)) :
number of columns of result is not a multiple of vector length (arg 1)
Is there a way to force summary to count NA's without an error or warning?
All my trials failed:
summary(x1, na.rm=FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(x1, useNA="always")
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(addNA(x1))
1 2 3 4 <NA>
1 3 1 1 0
I also tried the following, but it is a bit of a hack:
tmp <- rbind(summary(x1[complete.cases(x1)]), summary(x2[complete.cases(x2)]))
tmp <- cbind(tmp, c(sum(is.na(x1)), sum(is.na(x2))))
colnames(tmp)[ncol(tmp)] <- "NA's"
tmp
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 0
[2,] 1 2 2 2.333 2.75 4 1
I have not found a way to force summary to display NA's. However, you could write a custom function that returns what you want:
my_summary <- function(v) {
  if (!any(is.na(v))) {
    res <- c(summary(v), "NA's" = 0)
  } else {
    res <- summary(v)
  }
  return(res)
}
Because the problem is that you are combining vectors of different lengths, you can set the length of the shorter one to the length of the longer one. When you combine them, this generates NAs for the missing entries, which we can easily replace with zeros.
s1 <- summary(x1)
s2 <- summary(x2)
length(s1) <- length(s2)
s <- rbind(s2,s1)
s[is.na(s)] <- 0
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
s2 1 2 2 2.333 2.75 4 1
s1 1 2 2 2.333 2.75 4 0
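For many vectors, the same idea can be sketched by computing the NA count explicitly for every vector, so that all rows have seven entries from the start. The names vecs and summary_with_na are made up for this example.

```r
x1 <- c(1, 2, 3, 2, 2, 4)
x2 <- c(1, 2, 3, 2, 2, 4, NA)
vecs <- list(x1 = x1, x2 = x2)  # hypothetical container for the "many vectors"

# Six summary statistics on the non-missing values, plus an explicit NA count
summary_with_na <- function(v) {
  c(summary(v[!is.na(v)]), "NA's" = sum(is.na(v)))
}

# One row per vector; every row has the same seven columns, so rbind() does not warn
res <- do.call(rbind, lapply(vecs, summary_with_na))
print(res)
```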
The previous solutions ignore the fact that summary() also works for data frames and matrices. I usually handle this with a recursive function definition, although the result is not exactly the same as that of the original summary() function.
summaryna <- function(x, ...) {
  # Recursive function definition in case of matrix or data.frame.
  if (is.matrix(x)) {
    return(apply(x, 2, function(x) summaryna(x, ...)))
  } else if (is.data.frame(x)) {
    return(sapply(x, function(x) summaryna(x, ...)))
  }
  # This is the actual function.
  sum <- summary(x, ...)
  if (length(sum) < 7) sum <- c(sum, "NA's" = 0)
  return(sum)
}
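For illustration, here is how such a recursive function behaves on a small made-up data frame with numeric columns only. The function is repeated so the snippet runs on its own, with the local variable renamed to s so that it does not mask base::sum; the data frame df is invented for the example.

```r
summaryna <- function(x, ...) {
  # Recurse over columns for matrices and data frames
  if (is.matrix(x)) {
    return(apply(x, 2, function(col) summaryna(col, ...)))
  } else if (is.data.frame(x)) {
    return(sapply(x, function(col) summaryna(col, ...)))
  }
  # Vector case: pad with a zero NA count when summary() reports none
  s <- summary(x, ...)
  if (length(s) < 7) s <- c(s, "NA's" = 0)
  s
}

df <- data.frame(a = c(1, 2, NA, 4), b = c(5, 6, 7, 8))  # made-up data
res <- summaryna(df)
print(res)
```

Every column ends up with seven entries, so the result simplifies to a matrix with one column per variable.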
I have the following code in R:
getmonitor <- function(id, directory, summarize = FALSE) {
  a <- "C:/Users/UNI/Documents/Coursera/archivosR/"
  b <- paste(a, directory, "/", sprintf("%03d", as.numeric(id)), ".csv", sep = "")
  c <- read.csv(b)
  if (summarize) {
    print(summary(c))
  } else {
    return(c)
  }
}
What I want is this: if summarize = FALSE, the function just returns the file contents, and that works fine. However, if summarize = TRUE, the function returns the summary. The printed summary is correct, but if I assign the result of the function (called with summarize = TRUE) to a variable and call head() on it, I get the summary, whereas I want head() to show the file contents.
In your code, passing TRUE to summarize bypasses the return(c) statement, so the function returns the value of the last expression it evaluated, print(summary(c)), which is the summary. Since you want to return the data frame c in all cases, you just need to put return(c) outside your if/else structure.
Modifying your if/else structure to:
getmonitor <- function(id, directory, summarize = FALSE) {
  a <- "C:/Users/UNI/Documents/Coursera/archivosR/"
  b <- paste(a, directory, "/", sprintf("%03d", as.numeric(id)), ".csv", sep = "")
  c <- read.csv(b)
  if (summarize)
    print(summary(c))
  return(c)
}
Gives me the desired output:
> data <- getmonitor(1,"specdata",TRUE)
Date sulfate nitrate ID
2003-01-01: 1 Min. : 0.613 Min. :0.1180 Min. :1
2003-01-02: 1 1st Qu.: 2.210 1st Qu.:0.2835 1st Qu.:1
2003-01-03: 1 Median : 2.870 Median :0.4530 Median :1
2003-01-04: 1 Mean : 3.881 Mean :0.5499 Mean :1
2003-01-05: 1 3rd Qu.: 4.730 3rd Qu.:0.6635 3rd Qu.:1
2003-01-06: 1 Max. :19.100 Max. :1.8300 Max. :1
(Other) :1455 NA's :1344 NA's :1339
> head(data)
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
2 2003-01-02 NA NA 1
3 2003-01-03 NA NA 1
4 2003-01-04 NA NA 1
5 2003-01-05 NA NA 1
6 2003-01-06 NA NA 1
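The difference between the two versions can be reproduced without the CSV files. The sketch below uses a toy data frame and two hypothetical functions, returns_summary and returns_data, that mirror the question's and the answer's control flow.

```r
# Toy data frame standing in for the CSV file read by getmonitor()
toy <- data.frame(sulfate = c(NA, 2.2, 3.9), nitrate = c(0.1, NA, 0.5))

# Question's pattern: with summarize = TRUE the last evaluated expression is print(...)
returns_summary <- function(df, summarize = FALSE) {
  if (summarize) {
    print(summary(df))
  } else {
    return(df)
  }
}

# Answer's pattern: print as a side effect, then always return the data
returns_data <- function(df, summarize = FALSE) {
  if (summarize) print(summary(df))
  return(df)
}

a <- returns_summary(toy, summarize = TRUE)
b <- returns_data(toy, summarize = TRUE)
class(a)  # "table": the summary object handed back by print()
class(b)  # "data.frame"
```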
I ran this:
GroupSummary <- function(x){
  for (i in x) {
    if(i>0){
      p <-c(summary(x))
      r <- c(p)
    } else {if(i<0){
      n <-c(summary(x))
      r <- c(n)
    } else {stop}
    }
    return(r)
  }
}
x <- c(-1,-2,-3,-4,-5,-6,-7,-8,-9,-10,1,2,3,4,5,6,7,8,9,10)
GroupSummary(x)
I end up getting this as a result:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.00 -5.25 0.00 0.00 5.25 10.00
I am trying to separate it into two groups, one for positive numbers and another for negative numbers, rather than combining both. Where did I go wrong in the code I wrote? Any hints or help are welcome, thank you.
Using the built-in fivenum, you can obtain:
tapply(x,x>0,fivenum)
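A short sketch of what that call returns, using the vector from the question: tapply() splits x by the logical condition and applies fivenum() to each part, so no explicit loop is needed.

```r
x <- c(-(1:10), 1:10)

# Group by the logical condition x > 0: FALSE = negative, TRUE = positive
res <- tapply(x, x > 0, fivenum)

res[["FALSE"]]  # five-number summary of the negative values
res[["TRUE"]]   # five-number summary of the positive values
```

fivenum() reports hinges rather than the quantiles used by summary(), so its second and fourth values can differ slightly from summary()'s quartiles.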
May I advocate for aggregate?
> x <- c(-(1:10),1:10)
> aggregate(x, by=list(positive=x>0), summary)
positive x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 FALSE -10.00 -7.75 -5.50 -5.50 -3.25 -1.00
2 TRUE 1.00 3.25 5.50 5.50 7.75 10.00
> aggregate(x, by=list(positive=x>0), fivenum)
positive x.1 x.2 x.3 x.4 x.5
1 FALSE -10.0 -8.0 -5.5 -3.0 -1.0
2 TRUE 1.0 3.0 5.5 8.0 10.0
This may be the function you need. By default it groups by negative/positive, but you can supply any index (the default ind=NULL creates the positive/negative index). The vectors x and ind must have the same length, so execution stops (via stop) if this condition doesn't hold.
groupSummary = function(x, ind=NULL) {
  if(is.null(ind)) {
    ind = character(length(x))
    ind[x>=0] = "positive"
    ind[x<0] = "negative"
  }
  if(length(x)!=length(ind)) stop("'x' and 'ind' must have the same length.")
  out = do.call(rbind, tapply(x,INDEX=ind,FUN=summary))
  return(out)
}
groupSummary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
negative -10 -7.75 -5.5 -5.5 -3.25 -1
positive 1 3.25 5.5 5.5 7.75 10
set.seed(123) # to get the same output for 'colors' index
colors = sample(c("red", "blue", "green"), length(x), replace=TRUE)
groupSummary(x, colors)
Min. 1st Qu. Median Mean 3rd Qu. Max.
blue -9 -5.00 -1 -3.0000 0.0 1
green -10 -6.50 -4 -0.9091 5.0 10
red -3 -0.75 4 3.1670 6.5 9
groupSummary(x, ind=1:3)
Error in groupSummary(x, ind = 1:3) :
'x' and 'ind' must have the same length.