Creating a three-way table of summary statistics in R

Example Data
I have 100 rows of patient data stored in the object example. For each patient, we know which of five possible hospitals they were treated at, the time period in which they were treated, and how many lymph nodes they had.
set.seed(50)
example <- data.frame(
Hospital = sample(as.factor(c("Hospital 1", "Hospital 2", "Hospital 3", "Hospital 4", "Hospital 5")), size = 100, replace = TRUE),
Time = sample(as.factor(c("2000-2002", "2003-2005", "2006-2008")), size = 100, replace = TRUE),
Nodes = sample(20:100, size = 100, replace = TRUE))
I know that I can view the summary statistics for the number of lymph nodes by hospital like so. (Note that I have appended the "n" as the rightmost column; I am not sure whether there is a more elegant way to do this.)
cbind(do.call(rbind, by(example$Nodes, example$Hospital, summary)), table(example$Hospital, useNA = "no"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
Hospital 1 20 34.25 54.0 55.55 77.75 90 22
Hospital 2 22 38.75 60.5 56.25 71.75 94 20
Hospital 3 22 37.00 51.0 57.12 81.00 96 17
Hospital 4 25 39.75 55.5 57.11 72.25 97 28
Hospital 5 26 42.00 50.0 57.00 77.00 99 13
Similarly, I can view them for the time period like so:
cbind(do.call(rbind, by(example$Nodes, example$Time, summary)), table(example$Time, useNA = "no"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
2000-2002 20 40.00 57.0 58.84 77 97 37
2003-2005 20 33.75 45.5 52.94 78 99 36
2006-2008 23 39.50 61.0 58.33 72 98 27
Question
I would like to create a three-way table in which the leftmost, outermost row identifiers are the five hospitals, each sub-stratified by time period. I want the columns to be the summary statistics for the number of lymph nodes. I have a feeling that xtabs() or ftable() might help, but I have no idea how to apply them to my problem. In fact, typing ftable(example) gives me a table with the row structure I want, but the columns are not what I want. Thanks!
Edit #1 - In response to Ananda's comment below
Wow, yes that is almost exactly what I am looking for. My preference, however, would be for it to be in this format (with the numbers filled in, of course):
Nodes
Min. 1st Qu. Median Mean 3rd Qu. Max. n
Hospital Time
Hospital 1 2000-2002
2003-2005
2006-2008
Hospital 2 2000-2002
2003-2005
2006-2008
....and so forth....

Ordering the data frame returned by the aggregate() call that @AnandaMahto mentioned gets you very close to what you need, but without the nested labels:
dF <- aggregate(Nodes~Hospital+Time, example, summary)
dF <- dF[order(dF[, 1]), ]
Hospital Time Nodes.Min. Nodes.1st Qu. Nodes.Median Nodes.Mean Nodes.3rd Qu.
1 Hospital 1 2000-2002 20.00 25.00 34.00 33.29 38.00
6 Hospital 1 2003-2005 20.00 41.50 77.00 62.86 85.50
11 Hospital 1 2006-2008 35.00 60.50 70.50 68.62 80.75
2 Hospital 2 2000-2002 24.00 40.75 65.50 60.70 80.75
7 Hospital 2 2003-2005 22.00 22.00 26.00 33.75 37.75
12 Hospital 2 2006-2008 45.00 60.25 61.50 63.83 68.00
3 Hospital 3 2000-2002 40.00 63.00 74.00 72.80 91.00
8 Hospital 3 2003-2005 22.00 36.75 66.00 60.50 81.75
13 Hospital 3 2006-2008 23.00 29.50 37.00 40.67 46.75
4 Hospital 4 2000-2002 30.00 55.75 64.50 68.17 90.00
9 Hospital 4 2003-2005 25.00 38.25 42.00 49.36 59.50
14 Hospital 4 2006-2008 27.00 36.00 45.00 45.00 54.00
5 Hospital 5 2000-2002 26.00 39.00 52.00 51.67 64.50
10 Hospital 5 2003-2005 34.00 42.00 50.00 55.40 52.00
15 Hospital 5 2006-2008 30.00 42.00 48.00 61.80 91.00
Nodes.Max.
1 53.00
6 89.00
11 90.00
2 94.00
7 61.00
12 85.00
3 96.00
8 95.00
13 70.00
4 97.00
9 89.00
14 63.00
5 77.00
10 99.00
15 98.00
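To get closer to the nested layout requested in the edit, including the n column, one option is to have the summary function return the count as well, flatten the matrix column that aggregate() produces, and blank out the repeated hospital labels. This is only a minimal sketch in base R: the helper name stats and the short column labels (Q1, Q3) are my own choices, and the exact numbers depend on the R version because sample() changed in R 3.6.

```r
set.seed(50)
example <- data.frame(
  Hospital = factor(sample(paste("Hospital", 1:5), size = 100, replace = TRUE)),
  Time = factor(sample(c("2000-2002", "2003-2005", "2006-2008"), size = 100, replace = TRUE)),
  Nodes = sample(20:100, size = 100, replace = TRUE))

# Hypothetical helper: six summary statistics plus the group size
stats <- function(x) {
  q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(Min = q[1], Q1 = q[2], Median = q[3], Mean = mean(x),
    Q3 = q[4], Max = q[5], n = length(x))
}

res <- aggregate(Nodes ~ Hospital + Time, data = example, FUN = stats)
res <- res[order(res$Hospital, res$Time), ]

# aggregate() stores the statistics in a single matrix column; expand it
flat <- do.call(data.frame, res)
rownames(flat) <- NULL

# Show each hospital label only on its first row to mimic nesting
flat$Hospital <- as.character(flat$Hospital)
flat$Hospital[duplicated(flat$Hospital)] <- ""
print(flat, row.names = FALSE)
```

The blanked labels are purely cosmetic, so keep an unmodified copy of the data frame if you need to compute on it afterwards.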

Related

Different output when printed vs saved as a data frame

I'm trying to create a summary data frame which has averages, mins and maxes (see original post here). I can print the output I want, but whenever I try to save it to a data frame, I only get the averages, not the mins and maxes. I've tried updating R and tidyr, and I can't think of anything else that would cause this. I've tried using as.data.frame() and that doesn't help.
#example df
df <- read.table(header=TRUE, text="shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8")
#create a grouping to allow me to average out group values
shop_group = sub("-.*", "", df$shop)
#print a summary table (works fine)
aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
#generate a summary data frame (doesn't work, only gives me the averages, not the mins and maxes)
summ_df= aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
It works fine for me:
> aggregate(df[,2:4], list(shop_group),
+ FUN = function(x) summary(x)[c(4,1,6)])
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00
beds.Mean beds.Min. beds.Max.
1 34.33 6.00 65.00
2 30.00 16.00 43.00
3 18.00 8.00 23.00
4 54.00 11.00 79.00
> summ_df= aggregate(df[,2:4], list(shop_group),
+ FUN = function(x) summary(x)[c(4,1,6)])
> summ_df
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00
beds.Mean beds.Min. beds.Max.
1 34.33 6.00 65.00
2 30.00 16.00 43.00
3 18.00 8.00 23.00
4 54.00 11.00 79.00
>
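A possible explanation, offered as a guess since I cannot see your session: when FUN returns more than one value, aggregate() stores the results for each variable in a single matrix column. Printing expands those matrices, but the data frame itself has only four columns, so tools that show one value per column may appear to show only the averages. The sketch below checks this and flattens the result with do.call(data.frame, ...). It uses a plain named vector in place of summary(x)[c(4,1,6)], and flat_df is just an illustrative name.

```r
df <- read.table(header = TRUE, text = "shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8")

shop_group <- sub("-.*", "", df$shop)

summ_df <- aggregate(df[, 2:4], list(shop_group),
                     FUN = function(x) c(Mean = mean(x), Min = min(x), Max = max(x)))

dim(summ_df)               # 4 rows, 4 columns: Group.1 plus one matrix per variable
is.matrix(summ_df$tables)  # TRUE: Mean, Min and Max live inside one column

# Expand each matrix column into ordinary columns
flat_df <- do.call(data.frame, summ_df)
dim(flat_df)               # 4 rows, 10 columns
```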

summarising data frame rows based on name prefix

I'd like to create a summary data frame that gathers all rows based on a text prefix, with an average, max and min for each variable. So in the example below, I'd like to summarise the average, min and max values for the "jim" shops, "jen" shops, etc., as well as the same values for all of the furniture in each group of shops.
shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8
Note that shop suffixes vary (1, 2, 3 or a, b, c, etc.) and that names can have a variable number of letters (jim vs. paul). I'd like my output to resemble:
shop_group tables_av tables_min tables_max chairs_av chairs_min chairs_max beds_av beds_min beds_max furniture_av furniture_min furniture_max
jim 14.67 2.00 32.00 33.67 4.00 63.00 30.00 16.00 43.00 78.33 30.00 109.00
jen 46.67 32.00 73.00 88.67 85.00 91.00 34.33 6.00 65.00 169.67 154.00 185.00
sam 41.33 18.00 72.00 51.33 48.00 57.00 54.00 11.00 79.00 146.67 77.00 201.00
paul 51.00 34.00 76.00 29.67 20.00 49.00 18.00 8.00 23.00 98.67 62.00 119.00
Thanks in advance...
Just construct the shop group, and use aggregate together with summary to get the output you were seeking.
shop_group = sub("-.*", "", df$shop)
aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max. beds.Mean beds.Min. beds.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00 34.33 6.00 65.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00 30.00 16.00 43.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00 18.00 8.00 23.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00 54.00 11.00 79.00
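The output above does not include the furniture columns from the question. One way to add them, sketched here under the assumption that "furniture" means the per-shop total of tables, chairs and beds, is to compute that total first and aggregate it along with the other columns. The helper name stats and the final renaming step are illustrative choices, not part of the original answer.

```r
df <- read.table(header = TRUE, text = "shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8")

# Assumed definition: total furniture per shop
df$furniture <- rowSums(df[, c("tables", "chairs", "beds")])

shop_group <- sub("-.*", "", df$shop)
stats <- function(x) c(av = mean(x), min = min(x), max = max(x))

out <- aggregate(df[, -1], list(shop_group = shop_group), FUN = stats)
out <- do.call(data.frame, out)            # flatten the matrix columns
names(out) <- sub("\\.", "_", names(out))  # tables.av becomes tables_av
out
```

For the jim group this gives a furniture minimum of 30 and maximum of 109, which matches the desired output in the question.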
I still don't know why the data frame wasn't outputting properly, but just in case someone else is having the same problem, here is the solution that a colleague came up with:
library(tibble)
library(dplyr)
library(magrittr)
df %>%
mutate(shop = gsub("-.*", "", shop)) %>%
group_by(shop) %>%
summarise_each(funs(mean, min, max)) -> summary_df

Using Summary function inside Data.table

I am learning data.table by working through examples, and I am stuck on a scenario of my own.
I am using the cars dataset, converted to a data.table, to try out my commands.
library(data.table)
> cars.dt=data.table(cars)
> cars.dt[1:5]
speed dist
1: 4 2
2: 4 10
3: 7 4
4: 7 22
5: 8 16
.
.
I wanted to calculate the summary statistics for each group of speed and store them in separate columns, but the values end up in multiple rows instead, e.g.:
> cars.dt[, summary(dist), by="speed"]
speed V1
1: 4 2
2: 4 4
3: 4 6
4: 4 6
5: 4 8
---
110: 25 85
111: 25 85
112: 25 85
113: 25 85
114: 25 85
I was expecting the below output and I am unable to achieve it.
speed Min. 1st Qu. Median Mean 3rd Qu. Max.
1: 4 2 4 6 6 8 10
2: 7 4.0 8.5 13.0 13.0 17.5 22.0
3: 8 16 16 16 16 16 16
4: 9 10 10 10 10 10 10
5: 10 18 22 26 26 30 34
6: 11 17.00 19.75 22.50 22.50 25.25 28.00
7: 12 14.0 18.5 22.0 21.5 25.0 28.0
8: 13 26 32 34 35 37 46
9: 14 26.0 33.5 48.0 50.5 65.0 80.0
10: 15 20.00 23.00 26.00 33.33 40.00 54.00
11: 16 32 34 36 36 38 40
12: 17 32.00 36.00 40.00 40.67 45.00 50.00
13: 18 42.0 52.5 66.0 64.5 78.0 84.0
14: 19 36 41 46 50 57 68
15: 20 32.0 48.0 52.0 50.4 56.0 64.0
16: 22 66 66 66 66 66 66
17: 23 54 54 54 54 54 54
18: 24 70.00 86.50 92.50 93.75 99.75 120.00
19: 25 85 85 85 85 85 85
I tried the below command but the output was not in a data.table
> cars.dt[, print(summary(dist)), by="speed"]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 4 6 6 8 10
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.0 8.5 13.0 13.0 17.5 22.0
...
Min. 1st Qu. Median Mean 3rd Qu. Max.
70.00 86.50 92.50 93.75 99.75 120.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
85 85 85 85 85 85
Empty data.table (0 rows) of 1 col: speed
I am unable to use functions that return multiple values together with the by clause.
If anyone has an idea of how to write this, it would be much appreciated.
Please also let me know whether this is possible in data.table at all.
Try:
dt1 <- cars.dt[, as.list(summary(dist)), by="speed"]
head(dt1)
# speed Min. 1st Qu. Median Mean 3rd Qu. Max.
#1: 4 2 4.00 6.0 6.0 8.00 10
#2: 7 4 8.50 13.0 13.0 17.50 22
#3: 8 16 16.00 16.0 16.0 16.00 16
#4: 9 10 10.00 10.0 10.0 10.00 10
#5: 10 18 22.00 26.0 26.0 30.00 34
#6: 11 17 19.75 22.5 22.5 25.25 28
You could also consider summaryBy from doBy to have some control over the summary functions to output.
library(doBy)
dt2 <- summaryBy(.~speed, cars.dt, FUN=c(min, median, mean, max))
head(dt2,2)
# speed dist.min dist.median dist.mean dist.max
#1: 4 2 6 6 10
#2: 7 4 13 13 22
I guess the difference between list() and as.list() is easiest to see without the grouping variable:
list(summary(cars.dt$speed)) #this gets a `list` with one `list element`
#[[1]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.0 12.0 15.0 15.4 19.0 25.0
as.list(summary(cars.dt$speed)) #whereas this is also a list with multiple elements
# $Min.
#[1] 4
#$`1st Qu.`
#[1] 12
#$Median
#[1] 15
#$Mean
#[1] 15.4
#$`3rd Qu.`
#[1] 19
#$Max.
#[1] 25
This is the same difference as between list(1:5) and as.list(1:5).
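The distinction can be checked in base R without data.table. In data.table, each element of a list returned in j becomes a column, so a six-element list yields six columns, whereas a one-element list holding the whole summary does not. A small sketch (the variable name s is arbitrary):

```r
s <- summary(cars$dist)

length(list(s))      # 1: the whole summary wrapped as a single element
length(as.list(s))   # 6: one element per statistic
names(as.list(s))    # the statistic names, which become the column names

length(list(1:5))    # 1
length(as.list(1:5)) # 5
```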

Quantiling reps performance

This is a simple question from an R rookie.
I need to assign sales reps to performance quantiles based on their annual sales, but the data is provided quarterly.
Can someone help me optimize the code?
Rep Quarter Sales
1 100 1 25
2 100 2 32
3 100 3 40
4 100 4 52
5 101 1 40
6 101 2 23
7 101 3 37
8 101 4 61
This is really guessing, because your question is way too vague - but it sounds like you could use aggregate.
set.seed(1)
example <- data.frame(Rep=rep(100:104,each=4),
Quarter=rep(1:4,5),
Sales=sample(100,20,replace=TRUE))
> head(example)
Rep Quarter Sales
1 100 1 27
2 100 2 38
3 100 3 58
4 100 4 91
5 101 1 21
6 101 2 90
> aggregate(example$Sales,by=list(Rep=example$Rep),summary)
Rep x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 100 27.00 35.25 48.00 53.50 66.25 91.00
2 101 21.00 55.50 78.50 68.25 91.25 95.00
3 102 7.00 15.25 19.50 27.25 31.50 63.00
4 103 39.00 47.25 59.50 58.75 71.00 77.00
5 104 39.00 63.75 75.00 72.25 83.50 100.00
Or using the formula method (thanks @Ferdinand):
> aggregate(Sales ~ Rep,summary,data=example)
Rep Sales.Min. Sales.1st Qu. Sales.Median Sales.Mean Sales.3rd Qu. Sales.Max.
1 100 27.00 35.25 48.00 53.50 66.25 91.00
2 101 21.00 55.50 78.50 68.25 91.25 95.00
3 102 7.00 15.25 19.50 27.25 31.50 63.00
4 103 39.00 47.25 59.50 58.75 71.00 77.00
5 104 39.00 63.75 75.00 72.25 83.50 100.00
You can also plot the data very easily using boxplot:
boxplot(example$Sales~example$Rep)
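If the goal really is to bin reps by annual sales, one reading of the question is: sum the quarterly figures per rep, then cut the totals at their quartiles. The sketch below does that with the eight rows from the question. The object names sales and annual and the labels Q1 to Q4 are my own, and cut() will fail if tied totals produce duplicate breaks, so treat it as a starting point only.

```r
sales <- data.frame(Rep = rep(c(100, 101), each = 4),
                    Quarter = rep(1:4, 2),
                    Sales = c(25, 32, 40, 52, 40, 23, 37, 61))

# Annual total per rep
annual <- aggregate(Sales ~ Rep, data = sales, FUN = sum)

# Quartile breaks of the annual totals, then assign each rep to a bin
breaks <- quantile(annual$Sales, probs = seq(0, 1, by = 0.25))
annual$Quartile <- cut(annual$Sales, breaks = breaks,
                       include.lowest = TRUE, labels = paste0("Q", 1:4))
annual
```

With only two reps the result is trivial (the lower total lands in Q1 and the higher in Q4), but the same code applies unchanged to a full list of reps.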

R format decimal places of result in table

I would like to round the output of the table of means to one decimal place.
aov_two <- aov(Mass ~ Distance + Colony + Distance:Colony, data = seed_ant)
summary(aov_two)
B <- model.tables(aov_two, "means")
The resulting table of means is:
B
Tables of means
Grand mean
54.2461
Distance
0 5 10
56.53 53.19 52.25
rep 221.00 217.00 139.00
Colony
101 2 23 25 28 3 4 X
64.91 41.84 51.44 60.55 50.83 45.32 54.25 60.85
rep 82.00 71.00 52.00 76.00 59.00 77.00 75.00 85.00
Distance:Colony
Colony
Distance 101 2 23 25 28 3 4 X
0 61.79 41.04 51.97 74.52 53.45 41.33 53.26 72.04
rep 29.00 24.00 29.00 29.00 29.00 27.00 27.00 27.00
5 70.33 40.45 52.61 55.20 49.47 47.75 54.18 57.15
rep 27.00 29.00 23.00 25.00 30.00 28.00 28.00 27.00
10 62.23 44.50 48.05 46.59 55.30 53.42
rep 26.00 18.00 0.00 22.00 0.00 22.00 20.00 31.00
How can I round all those numbers to one decimal place?
It's rather difficult to read the data.frames in your question, but try:
df <- round(df, 1)
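Note that the object returned by model.tables() is not a data frame: it is a list whose tables component holds one array of means per term. A possible approach, sketched here with the built-in npk dataset as a stand-in because seed_ant is not available, is to round each of those arrays. The names fit and rounded are illustrative, and I have not tried this on an unbalanced design like the one in the question.

```r
# Stand-in model on a built-in dataset (the question's seed_ant data is not available)
fit <- aov(yield ~ N * P, data = npk)
B <- model.tables(fit, "means")

# B$tables is a list: the grand mean plus one table of means per term
rounded <- lapply(B$tables, round, 1)
rounded
```

This rounds the stored values themselves, so any later calculation on rounded uses the rounded numbers rather than the full-precision means.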

Resources