R format decimal places of result in table - r

I would like to round the output to one decimal place in mean table
aov_two <- aov(Mass ~ Distance + Colony + Distance:Colony, data = seed_ant)
summary(aov_two)
B <- model.tables(aov_two, "means")
model table of mean result is,
B
Tables of means
Grand mean
54.2461
Distance
0 5 10
56.53 53.19 52.25
rep 221.00 217.00 139.00
Colony
101 2 23 25 28 3 4 X
64.91 41.84 51.44 60.55 50.83 45.32 54.25 60.85
rep 82.00 71.00 52.00 76.00 59.00 77.00 75.00 85.00
Distance:Colony
Colony
Distance 101 2 23 25 28 3 4 X
0 61.79 41.04 51.97 74.52 53.45 41.33 53.26 72.04
rep 29.00 24.00 29.00 29.00 29.00 27.00 27.00 27.00
5 70.33 40.45 52.61 55.20 49.47 47.75 54.18 57.15
rep 27.00 29.00 23.00 25.00 30.00 28.00 28.00 27.00
10 62.23 44.50 48.05 46.59 55.30 53.42
rep 26.00 18.00 0.00 22.00 0.00 22.00 20.00 31.00
how can I round all those numbers to one decimal place?

It's rather difficult to read the data.frames in your question, but try:
df <- round(df, 1)

Related

Different output when printed vs saved as a data frame

I'm trying to create a summary data frame which has averages, mins and maxes (see original post here). I can print the output I want, but whenever I try to save it to a data frame, I only get the averages, not the mins and maxes. I've tried updating R and tidyr, and I can't think of anything else that would cause this. I've tried using as.data.frame() and that doesn't help.
#example df
df <- read.table(header=TRUE, text="shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8")
#create a grouping to allow me to average out group values
shop_group = sub("-.*", "", df$shop)
#print a summary table (works fine)
aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
#generate a summary data frame (doesn't work, only gives me the averages, not the mins and maxes)
summ_df= aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
it works fine..
> aggregate(df[,2:4], list(shop_group),
+ FUN = function(x) summary(x)[c(4,1,6)])
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00
beds.Mean beds.Min. beds.Max.
1 34.33 6.00 65.00
2 30.00 16.00 43.00
3 18.00 8.00 23.00
4 54.00 11.00 79.00
> summ_df= aggregate(df[,2:4], list(shop_group),
+ FUN = function(x) summary(x)[c(4,1,6)])
> summ_df
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00
beds.Mean beds.Min. beds.Max.
1 34.33 6.00 65.00
2 30.00 16.00 43.00
3 18.00 8.00 23.00
4 54.00 11.00 79.00
>

summarising data frame rows based on name prefix

I'd like to create a summary data frame that gathers all rows based on a text prefix, with an average, max and min for each variable. So in the example below, I'd like to summarise the average, min and max values for the "Jim" shops, "Jen, shops etc, as well as the same values for all of the furniture in each group of shops.
shop tables chairs beds
jim-1 2 63 31
jim-2a 10 4 16
jim-2b 32 34 43
jen-1 32 90 32
jen-2 73 91 6
jen-3 35 85 65
sam-a 72 57 72
sam-b 18 48 11
sam-c 34 49 79
paul-1 43 49 23
paul-2 76 20 23
paul-2a 34 20 8
Note that some shops are 1,2,3 or a,b,c etc and that there can be a variable number of letters in a name (jim vs paul). I'd like my output to resemble:
shop_group tables_av tables_min tables_max chairs_av chairs_min chairs_max beds_av beds_min beds_max furniture_av furniture_min furniture_max
jim 14.67 2.00 32.00 33.67 4.00 63.00 30.00 16.00 43.00 78.33 30.00 109.00
jen 46.67 32.00 73.00 88.67 85.00 91.00 34.33 6.00 65.00 169.67 154.00 185.00
sam 41.33 18.00 72.00 51.33 48.00 57.00 54.00 11.00 79.00 146.67 77.00 201.00
paul 51.00 34.00 76.00 29.67 20.00 49.00 18.00 8.00 23.00 98.67 62.00 119.00
Thanks in advance...
Just construct the shop group, and use aggregate together with summary to get the output you were seeking.
shop_group = sub("-.*", "", df$shop)
aggregate(df[,2:4], list(shop_group),
FUN = function(x) summary(x)[c(4,1,6)])
Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max. beds.Mean beds.Min. beds.Max.
1 jen 46.67 32.00 73.00 88.67 85.00 91.00 34.33 6.00 65.00
2 jim 14.67 2.00 32.00 33.67 4.00 63.00 30.00 16.00 43.00
3 paul 51.00 34.00 76.00 29.67 20.00 49.00 18.00 8.00 23.00
4 sam 41.33 18.00 72.00 51.33 48.00 57.00 54.00 11.00 79.00
I still don't know why the data frame wasn't outputting properly, but just in case someone else is having the same problem, here is the solution that a colleague came up with:
library(tibble)
library(dplyr)
library(magrittr)
df %>%
mutate(shop = gsub("-.*", "", shop)) %>%
group_by(shop) %>%
summarise_each(funs(mean, min, max)) -> summary_df

Merging two dataframes in R with date

I have the following 2 dataframes:
> bvg1
Parameters X18.Oct.14 X19.Oct.14 X20.Oct.14 X21.Oct.14 X22.Oct.14 X23.Oct.14 X24.Oct.14
1 24K Equivalent Plan 29.00 29.60 33.80 36.60 35.30 31.90 29.00
2 24K Equivalent Act 28.80 31.00 35.40 35.90 34.70 33.40 31.90
3 Plan Rep WS 2463.00 2513.00 2869.00 3115.00 2999.00 2714.00 2468.00
4 Act Rep WS 2447.00 2633.00 3013.00 3054.00 2953.00 2842.00 2714.00
5 Rep WS Var -16.00 120.00 144.00 -61.00 -46.00 128.00 246.00
6 Plan Rep Intakes 568.00 461.00 1159.00 1146.00 1126.00 1124.00 1106.00
7 Act Rep Intakes 707.00 494.00 1106.00 1096.00 1274.00 1087.00 1101.00
8 Rep Intakes Var 139.00 33.00 -53.00 -50.00 148.00 -37.00 -5.00
9 Plan Rep Comps_DL 468.00 54.00 836.00 1190.00 1327.00 1286.00 1108.00
10 Act Rep Comps_DL 471.00 70.00 995.00 1137.00 1323.00 1150.00 1073.00
11 Rep Comps Var_DL 3.00 16.00 159.00 -53.00 -4.00 -136.00 -35.00
12 Plan Rep Mandays_DL 148.00 19.00 260.00 368.00 412.00 398.00 345.00
13 Act Rep Mandays_DL 147.00 19.00 303.00 359.00 423.00 374.00 348.00
14 Rep Mandays Var_DL -1.00 1.00 43.00 -9.00 12.00 -24.00 3.00
15 Plan FVR Mandays_DL 0.00 0.00 4.00 18.00 18.00 18.00 18.00
16 Act FVR Mandays_DL 0.00 0.00 4.00 7.00 8.00 8.00 7.00
17 FVR Mandays Var_DL 0.00 0.00 0.00 -11.00 -10.00 -10.00 -11.00
18 Plan Rep Prod_DL 3.16 2.88 3.21 3.23 3.22 3.23 3.21
19 Act Rep Prod_DL 3.21 3.62 3.28 3.16 3.12 3.07 3.08
20 Rep Prod Var_DL 0.05 0.74 0.07 -0.07 -0.10 -0.16 -0.13
> bvg2
Parameters X18.Oct X19.Oct X20.Oct X21.Oct X22.Oct X23.Oct X24.Oct
1 24K Equivalent Plan 30.50 31.30 35.10 36.10 33.60 28.80 25.50
2 24K Equivalent Act 31.40 33.40 36.60 38.10 36.80 34.40 32.10
3 Plan Rep WS 3419.00 3509.00 3933.00 4041.00 3764.00 3220.00 2859.00
4 Act Rep WS 3514.00 3734.00 4098.00 4271.00 4122.00 3852.00 3591.00
5 Rep WS Var 95.00 225.00 165.00 230.00 358.00 632.00 732.00
6 Plan Rep Intakes 813.00 613.00 1559.00 1560.00 1506.00 1454.00 1410.00
7 Act Rep Intakes 964.00 602.00 1629.00 1532.00 1657.00 1507.00 1439.00
8 Rep Intakes Var 151.00 -11.00 70.00 -28.00 151.00 53.00 29.00
9 Plan Rep Comps_DL 675.00 175.00 1331.00 1732.00 1938.00 1706.00 1493.00
10 Act Rep Comps_DL 718.00 224.00 1389.00 1609.00 1848.00 1698.00 1537.00
11 Rep Comps Var_DL 43.00 49.00 58.00 -123.00 -90.00 -8.00 44.00
12 Plan Rep Mandays_DL 203.00 58.00 428.00 541.00 605.00 536.00 475.00
13 Act Rep Mandays_DL 215.00 63.00 472.00 542.00 608.00 556.00 523.00
14 Rep Mandays Var_DL 12.00 5.00 44.00 2.00 3.00 20.00 48.00
15 Plan FVR Mandays_DL 0.00 0.00 1.00 12.00 2.00 32.00 57.00
16 Act FVR Mandays_DL 0.00 0.00 2.00 2.00 5.00 5.00 5.00
17 FVR Mandays Var_DL 0.00 0.00 1.00 -10.00 3.00 -27.00 -52.00
18 Plan Rep Prod_DL 3.33 3.03 3.11 3.20 3.20 3.18 3.14
19 Act Rep Prod_DL 3.34 3.56 2.94 2.97 3.04 3.05 2.94
20 Rep Prod Var_DL 0.01 0.53 -0.17 -0.23 -0.16 -0.13 -0.20
It is a time series data i.e. 24K Equivalent Plan was 29 on 18th Oct, 29.60 on 19th Oct and 33.80 on 20th Oct. First dataframe have data for one business unit and second dataframe have the data for a different business unit.
I want to merge dataframes into 1 and want to analyse the variance i.e. where they differ in values. Draw ggplots like 2 histograms showing the difference, timeseries plots etc.
I have tried the following:
I can merge the two dataframes by:
joined = rbind(bvg1, bvg2)
however, i can't identify the record whether it belongs to bvg1 or bvg2 df.
if i add an additional column i.e.
bvg1$id = "bvg1"
bvg2$id = "bvg2"
then merge command doesn't work and gives the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Any sample code would be highly appreciated.
You can match the column names of the two datasets by stripping the . followed by the digits in the bvg1. This can be done using regex. In the below code, a lookbehind regex is used. It matches the lookbehind (?<=[A-Za-]) i.e. an alphabet followed by . followed by one or more elements .* to the end of string $ and remove those "".
colnames(bvg1) <-gsub("(?<=[A-Za-z])\\..*$", "", colnames(bvg1), perl=TRUE)
res <- rbind(bvg1, bvg2)
dim(res)
#[1] 40 9
head(res,3)
# Parameters X18.Oct X19.Oct X20.Oct X21.Oct X22.Oct X23.Oct X24.Oct
#1 24K Equivalent Plan 29.0 29.6 33.8 36.6 35.3 31.9 29.0
#2 24K Equivalent Act 28.8 31.0 35.4 35.9 34.7 33.4 31.9
#3 Plan Rep WS 2463.0 2513.0 2869.0 3115.0 2999.0 2714.0 2468.0
# id
#1 bvg1
#2 bvg1
#3 bvg1

Creating a three-way table of summary statistics in R

Example Data
I have 100 rows of patient data stored in the object example. For each patient, we know which one of five possible hospitals at which they were treated, the time period in which they were treated, and how many lymph nodes they had.
set.seed(50)
example <- data.frame(
Hospital = sample(as.factor(c("Hospital 1", "Hospital 2", "Hospital 3", "Hospital 4", "Hospital 5")), size = 100, replace = TRUE),
Time = sample(as.factor(c("2000-2002", "2003-2005", "2006-2008")), size = 100, replace = TRUE),
Nodes = sample(20:100, size = 100, replace = TRUE))
I know that I can view the summary statistics for the number of lymph nodes like so... (Note that I have appended the "n" to the rightward-most column, not sure if there is a more eloquent way to do this.)
cbind(do.call(rbind, by(example$Nodes, example$Hospital, summary)), table(example$Hospital, useNA = "no"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
Hospital 1 20 34.25 54.0 55.55 77.75 90 22
Hospital 2 22 38.75 60.5 56.25 71.75 94 20
Hospital 3 22 37.00 51.0 57.12 81.00 96 17
Hospital 4 25 39.75 55.5 57.11 72.25 97 28
Hospital 5 26 42.00 50.0 57.00 77.00 99 13
Similarly, I can view them for the time period like so:
cbind(do.call(rbind, by(example$Nodes, example$Time, summary)), table(example$Time, useNA = "no"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
2000-2002 20 40.00 57.0 58.84 77 97 37
2003-2005 20 33.75 45.5 52.94 78 99 36
2006-2008 23 39.50 61.0 58.33 72 98 27
Question
I would like to create a 3-way table table in which the leftward, outermost row identifiers are the five hospitals, further sub-stratified by time period. I want the columns to be the summary statistics for the number of lymph nodes. I have a feeling the xtabs() or ftable() might help, but have no idea how to apply them to my problem. In fact, typing ftable(example) gives me a table that is structured how I would want it to be, but the columns are not what I want. Thanks!
Edit #1 - In response to Ananda's comment below
Wow, yes that is almost exactly what I am looking for. My preference, however, would be for it to be in this format (with the numbers filled in, of course):
Nodes
Min. 1st Qu. Median Mean 3rd Qu. Max. n
Hospital Time
Hospital 1 2000-2002
2003-2005
2006-2008
Hospital 2 2000-2002
2003-2005
2006-2008
....and so forth....
Ordering the dataframe that results from the aggregate() function that #AnandaMahto mentioned above would provide something very close to what you need, but without the nested values:
dF <- aggregate(Nodes~Hospital+Time, example, summary)
dF <- dF[order(dF[, 1]), ]
Hospital Time Nodes.Min. Nodes.1st Qu. Nodes.Median Nodes.Mean Nodes.3rd Qu.
1 Hospital 1 2000-2002 20.00 25.00 34.00 33.29 38.00
6 Hospital 1 2003-2005 20.00 41.50 77.00 62.86 85.50
11 Hospital 1 2006-2008 35.00 60.50 70.50 68.62 80.75
2 Hospital 2 2000-2002 24.00 40.75 65.50 60.70 80.75
7 Hospital 2 2003-2005 22.00 22.00 26.00 33.75 37.75
12 Hospital 2 2006-2008 45.00 60.25 61.50 63.83 68.00
3 Hospital 3 2000-2002 40.00 63.00 74.00 72.80 91.00
8 Hospital 3 2003-2005 22.00 36.75 66.00 60.50 81.75
13 Hospital 3 2006-2008 23.00 29.50 37.00 40.67 46.75
4 Hospital 4 2000-2002 30.00 55.75 64.50 68.17 90.00
9 Hospital 4 2003-2005 25.00 38.25 42.00 49.36 59.50
14 Hospital 4 2006-2008 27.00 36.00 45.00 45.00 54.00
5 Hospital 5 2000-2002 26.00 39.00 52.00 51.67 64.50
10 Hospital 5 2003-2005 34.00 42.00 50.00 55.40 52.00
15 Hospital 5 2006-2008 30.00 42.00 48.00 61.80 91.00
Nodes.Max.
1 53.00
6 89.00
11 90.00
2 94.00
7 61.00
12 85.00
3 96.00
8 95.00
13 70.00
4 97.00
9 89.00
14 63.00
5 77.00
10 99.00
15 98.00

Quantiling reps performance

This is a simple question from an R rookie.
I need to quantile reps performance based on annual sales but the data is provided in quarterly fashion.
Can someone help me optimize the code.
Rep Quarter Sales
1 100 1 25
2 100 2 32
3 100 3 40
4 100 4 52
5 101 1 40
6 101 2 23
7 101 3 37
8 101 4 61
This is really guessing, because your question is way too vague - but it sounds like you could use aggregate.
set.seed(1)
example <- data.frame(Rep=rep(100:104,each=4),
Quarter=rep(1:4,5),
Sales=sample(100,20,replace=TRUE))
> head(example)
Rep Quarter Sales
1 100 1 27
2 100 2 38
3 100 3 58
4 100 4 91
5 101 1 21
6 101 2 90
> aggregate(example$Sales,by=list(Rep=example$Rep),summary)
Rep x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1 100 27.00 35.25 48.00 53.50 66.25 91.00
2 101 21.00 55.50 78.50 68.25 91.25 95.00
3 102 7.00 15.25 19.50 27.25 31.50 63.00
4 103 39.00 47.25 59.50 58.75 71.00 77.00
5 104 39.00 63.75 75.00 72.25 83.50 100.00
Or using the formula method (thanks #Ferdinand):
> aggregate(Sales ~ Rep,summary,data=example)
Rep Sales.Min. Sales.1st Qu. Sales.Median Sales.Mean Sales.3rd Qu. Sales.Max.
1 100 27.00 35.25 48.00 53.50 66.25 91.00
2 101 21.00 55.50 78.50 68.25 91.25 95.00
3 102 7.00 15.25 19.50 27.25 31.50 63.00
4 103 39.00 47.25 59.50 58.75 71.00 77.00
5 104 39.00 63.75 75.00 72.25 83.50 100.00
You can also plot the data very easily using boxplot:
boxplot(example$Sales~example$Rep)

Resources