R, analyzing a data set with a large parameter space and replicates

I've run experiments in which, for each parameter combination, I collect the average forces and torques (in the x, y, and z directions). I run four replicates for each parameter combination, and I have 432 parameter combinations in total.
The actual dataset is a bit too big to include here, so I've made a subset for testing purposes and uploaded it to dropbox, along with the relevant R script.
Here is a heavily pared-down version:
> data2[1:20,1:8]
# A tibble: 20 x 8
`Foil Color` `Flow Speed (rpm)` `Frequency (Hz)` StepTime Maxpress Minpress `Minpress Percentage` FxMean
<fctr> <fctr> <fctr> <fctr> <fctr> <int> <fctr> <dbl>
1 Black 0 0.25 250 50 0 0 0.014537062
2 Black 0 0.25 250 50 0 0 0.014870256
3 Black 0 0.25 250 50 0 0 0.013180870
4 Black 0 0.25 250 50 0 0 0.013448804
5 Black 0 0.25 250 50 3 0.05 0.012996979
6 Black 0 0.25 250 50 3 0.05 0.012115166
7 Black 0 0.25 250 50 3 0.05 0.012427347
8 Black 0 0.25 250 50 3 0.05 0.012561253
9 Black 0 0.25 250 50 5 0.1 0.012480644
10 Black 0 0.25 250 50 5 0.1 0.011603403
11 Black 0 0.25 250 50 5 0.1 0.011427116
12 Black 0 0.25 250 50 5 0.1 0.011545803
13 Black 0 0.25 250 50 13 0.25 0.009891865
14 Black 0 0.25 250 50 13 0.25 0.008465604
15 Black 0 0.25 250 50 13 0.25 0.009089619
16 Black 0 0.25 250 50 13 0.25 0.008560160
17 Black 0 0.25 250 75 0 0 0.025101186
18 Black 0 0.25 250 75 0 0 0.023611920
19 Black 0 0.25 250 75 0 0 0.026276007
20 Black 0 0.25 250 75 0 0 0.026593895
I am trying to group the data by the parameter combinations and calculate the average FxMean, sd, and se for each group of 4 replicates.
I have tried to follow tutorials and other examples where people try to summarize the data (example), but it doesn't work for me. It normally spits out an array that looks nothing like what I need.
For example:
fx_data2 <- ddply(data_csv, c(data_csv$`Frequency (Hz)`, data_csv$`Flow Speed (rpm)`, data_csv$StepTime, data_csv$Maxpress, data_csv$`Minpress Percentage`), summarise,
                  N = length(data_csv$FxMean),
                  mean = mean(data_csv$FxMean),
                  sd = sd(data_csv$FxMean),
                  se = sd / sqrt(N)
)
fx_data3 <- summaryBy(FxMean ~freq + foilColor+maxP+minPP, data=data_csv, FUN=c(length, mean, sd))
fx_data2 looks just...abysmal.
head(fx_data2)
....
Foil Color.2530 Foil Color.2531 Foil Length.2512 Foil Length.2513 Foil
Length.2514 Foil Length.2515 Flow Speed (rpm).2544 Flow Speed (rpm).2545
Flow Speed (rpm).2546 Flow Speed (rpm).2547 Frequency (Hz).800 Frequency
(Hz).801 Frequency (Hz).802 Frequency (Hz).803 Foil Color.2532 Foil Color.2533
Foil Color.2534 Foil Color.2535 Foil Length.2516 Foil Length.2517 Foil
Length.2518 Foil Length.2519 Flow Speed (rpm).2548 Flow Speed (rpm).2549
Flow Speed (rpm).2550 Flow Speed (rpm).2551 Frequency (Hz).804 Frequency
(Hz).805 Frequency (Hz).806 Frequency (Hz).807 Foil Color.2536 Foil Color.2537
I mean. I have no idea what's going on with that. The dimensions are 24x8724. Just...what.
and fx_data3 looks like this:
> fx_data3
FxMean.length FxMean.mean FxMean.sd
1 1744 0.01379712 0.01423244
>
Ideally, the result would look like the original data set, but with each parameter combination compressed to a single line, and the values on the far right would be the mean, sd, and se of FxMean, FxStDev, etc. across the four replicates.
I've been struggling with this for a few days. I'd greatly appreciate some help.
Thank you,
Zane

url <- "https://www.dropbox.com/sh/vhf39uz4pol7sgl/AAAJ9Fr6OTEIgb_ZeSno-X5ea?dl=1"
download.file(url, destfile = "from-SO-via-dropbox")
unzip("from-SO-via-dropbox")
df <- readr::read_csv("Data_subset.csv")
library(dplyr)
df %>%
  group_by(`Frequency (Hz)`, `Foil Color`, StepTime, Maxpress, `Minpress Percentage`) %>%
  summarise_at(vars(FxMean), funs(N = length, mean, sd, se = sd(.) / sqrt(N)))
# # A tibble: 13 x 9
# # Groups: Frequency (Hz), Foil Color, StepTime, Maxpress [?]
# `Frequency (Hz)` `Foil Color` StepTime Maxpress `Minpress Percentage` N mean sd se
# <dbl> <chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.25 Black 250 50 0.00 4 0.014009248 0.0008206156 0.0004103078
# 2 0.25 Black 250 50 0.05 4 0.012525186 0.0003658681 0.0001829340
# 3 0.25 Black 250 50 0.10 4 0.011764241 0.0004832082 0.0002416041
# 4 0.25 Black 250 50 0.25 4 0.009001812 0.0006538297 0.0003269149
# 5 0.25 Black 250 75 0.00 4 0.025395752 0.0013514463 0.0006757231
# 6 0.25 Black 250 75 0.05 4 0.020794212 0.0028703242 0.0014351621
# 7 0.25 Black 250 75 0.10 4 0.018409500 0.0037305138 0.0018652569
# 8 0.25 Black 250 75 0.25 4 0.016193536 0.0016200530 0.0008100265
# 9 0.25 Black 250 100 0.00 4 0.035485324 0.0052513208 0.0026256604
# 10 0.25 Black 250 100 0.05 4 0.050097709 0.0024123653 0.0012061827
# 11 0.25 Black 250 100 0.10 4 0.051378181 0.0049857712 0.0024928856
# 12 0.25 Black 250 100 0.25 4 0.039374874 0.0031421884 0.0015710942
# 13 0.50 Black 250 50 0.00 2 0.014778494 0.0004683882 0.0003312005
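As a side note, funs() has since been superseded in newer dplyr releases. A roughly equivalent sketch using across() (assuming dplyr >= 1.0; the grouping columns are the ones from the question):
library(dplyr)
df %>%
  group_by(`Frequency (Hz)`, `Foil Color`, StepTime, Maxpress, `Minpress Percentage`) %>%
  summarise(N = n(),
            across(FxMean, list(mean = mean, sd = sd)),
            .groups = "drop") %>%
  mutate(FxMean_se = FxMean_sd / sqrt(N))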

Which parameters do you want to group_by()? Just insert them into the code snippet below in place of param1, param2, etc.
You could use dplyr:
library(dplyr)
data %>%
  group_by(param1, param2, param3) %>%
  summarise(mean = mean(FxMean),
            sd = sd(FxMean),
            se = sd / sqrt(n()))

Related

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
                   Employee = sample(1:20, 20),
                   n_Projects = sample(50:100, 20, replace=TRUE),
                   Performance = round(sample(1:100, 20, replace=TRUE)/100, 2),
                   stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each Boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to that Employee's n_Projects value over the total n_Projects for that Boss.
For example, Boss 1 has a total of 604 n_Projects, where Employee 1 has a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), Employee 3 a weighted Performance of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted Performances is the Boss's performance, which for Boss 1 is 0.52. So the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve data-manipulations skills? (you can see in my profile that I have asked a lot of questions like this, but still I'm not able to solve them on my own)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide by 'total_projects':
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and #akrun's answer:
You must have received the following error message:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you wrote for Weight_Project does not yield a single value for each Boss, but 7 (one per row). summarise is there to collapse several values into one (means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise the result into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
  group_by(Boss) %>%
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate step preserves the total number of rows in df, but sum(n_Projects) is calculated within each Boss thanks to group_by.
Once each row has a project weight (which depends on the Boss), you can calculate the weighted mean, which is a summary value, with summarise.
A more compact way that still makes the weighting explicit would be:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))

# Reordering to minimise parentheses, which is #akrun's answer
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
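Base R's weighted.mean() expresses the same idea even more compactly; a small sketch (not part of the original answers, just an equivalent formulation):
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Performance = weighted.mean(Performance, n_Projects))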

big dataframe: "repeated" t-test between groups for thousands of factors

I have read a lot of posts related to data wrangling and "repeated" t-tests, but I can't figure out how to achieve this in my case.
You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0
I have a big dataframe of gene expression like:
> b<-read.delim("dataset.example.stckovflw.txt")
> head(b)
animal gen condition tissue LogFC
1 animalcontrol1 kjhss1 control brain 7.129283
2 animalcontrol1 sdth2 control brain 7.179909
3 animalcontrol1 sgdhstjh20 control brain 9.353147
4 animalcontrol1 jdygfjgdkydg21 control brain 6.459432
5 animalcontrol1 shfjdfyjydg22 control brain 9.372865
6 animalcontrol1 jdyjkdg23 control brain 9.541097
> str(b)
'data.frame': 21507 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787 360 863 385 133 888 563 738 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 7.13 7.18 9.35 6.46 9.37 ...
Each group has 5 animals, and each animal has many gens quantified. (However, each animal may have a different set of quantified gens, although many of the gens will be in common between animals and groups.)
I would like to perform a t-test for each gen between my treated groups (A, B, C or D) and the controls. The data should be presented as a table containing the p-value for each gen in each group.
Because I have so many gens (thousands), I cannot subset each gen by hand.
Do you know how I could automate the procedure?
I was thinking about a loop, but I am not sure whether it could achieve what I want or how to proceed.
Also, I was looking more at these posts using the apply function : Apply t-test on many columns in a dataframe split by factor and Looping through t.tests for data frame subsets in r
################ Additional information after reading the first comments and answers:
#andrew_reece: Thank you very much for this. It is almost exactly what I was looking for. However, I can't find a way to do it with a t-test. ANOVA is interesting, but then I need to know which of the treated groups is/are significantly different from my controls. I would also need to know which treated groups are significantly different from each other, "two by two".
I have been trying to use your code, changing the aov(..) to t.test(...). For that, I first take subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when I export the result table to a csv file, the table is not understandable (no gen names, no p-values, etc., only numbers). I will keep searching for a way to do it properly, but for now I'm stuck.
#42:
Thank you very much for these tips. This is just an example dataset; let's assume we do have to use individual t-tests.
This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write my code, but it is somewhat outside the initial topic. Also, I don't know how to summarize, in a less tedious way, the shared genes detected in each combination of conditions, so I have simplified to only 3 conditions.
# Visualisation of shared genes by VennDiagrams :
# let's simplify and consider only 3 conditions :
b<-read.delim("dataset.example.stckovflw.txt")
b<- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")
b1<-table(b$gen, b$condition)
b1
b2 <- subset(data.frame(b1, "control" > 2 | "treatmentA" > 2 | "treatmentB" > 2))
b3<-subset(b2, Freq>2) # select only genes that have been quantified in at least 2 animals per group
b3
b4 <- within(b3, {
  Freq <- ifelse(Freq > 1, 1, 0)
}) # for those observations, we consider the gene detected, so we recode Freq to 1 regardless of its frequency of occurrence (>2)
b4
b5<-table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")
# make an intermediate .txt file (just add the first column's title manually)
# so now we have info
bb5<-read.delim("bb5.txt")
nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))
library(grid)
library(futile.logger)
library(VennDiagram)
venn.plot <- draw.triple.venn(area1 = 1005,
area2 = 927,
area3 = 943,
n12 = 843,
n23 = 861,
n13 = 866,
n123 = 794,
category = c("controls", "treatmentA", "treatmentB"),
fill = c("red", "yellow", "blue"),
cex = 2,
cat.cex = 2,
lwd = 6,
lty = 'dashed',
fontface = "bold",
fontfamily = "sans",
cat.fontface = "bold",
cat.default.pos = "outer",
cat.pos = c(-27, 27, 135),
cat.dist = c(0.055, 0.055, 0.085),
cat.fontfamily = "sans",
rotation = 1);
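As an aside, the manual CSV/TXT round-trip (writing b5.csv, editing, and reading back bb5.txt) can probably be avoided; a small sketch converting the contingency table directly, assuming bb5 should carry the same counts:
# same information as bb5.txt, without writing and re-reading files
bb5 <- as.data.frame.matrix(table(b4$Var1, b4$Var2))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))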
Update (per OP comments):
Pairwise comparisons across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others; this is just one way to demonstrate a proof of concept.)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef) %>%
  select(-term)
results
# A tibble: 7,118 x 6
gen comparison estimate conf.low conf.high adj.p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 kjhss1 treatmentA-control 1.58 -20.3 23.5 0.997
2 kjhss1 treatmentC-control -3.71 -25.6 18.2 0.962
3 kjhss1 treatmentD-control 0.240 -21.7 22.2 1.000
4 kjhss1 treatmentC-treatmentA -5.29 -27.2 16.6 0.899
5 kjhss1 treatmentD-treatmentA -1.34 -23.3 20.6 0.998
6 kjhss1 treatmentD-treatmentC 3.95 -18.0 25.9 0.954
7 sdth2 treatmentC-control -1.02 -21.7 19.7 0.991
8 sdth2 treatmentD-control 3.25 -17.5 24.0 0.909
9 sdth2 treatmentD-treatmentC 4.27 -16.5 25.0 0.849
10 sgdhstjh20 treatmentC-control -7.48 -30.4 15.5 0.669
# ... with 7,108 more rows
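If individual t-tests really are required (as discussed in the comments), the same nest/map pattern can be reused with stats::pairwise.t.test(), which compares every pair of conditions for each gen. A sketch, not from the original answer (broom::tidy() on a pairwise.htest returns one row per comparison):
results_t <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ pairwise.t.test(.x$LogFC, .x$condition, p.adjust.method = "none")),
    pvals = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(pvals)
results_t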
Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).
But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.
For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.
require(tidyverse)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ aov(LogFC ~ condition, data = .x)),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef)
A few cosmetic trimmings to get closer to your original vision (of just a table with gen and p-values), although note that this really leaves a lot of important information out and I'm not advising you actually limit your results in this way.
results %>%
  filter(term != "Residuals") %>%
  select(gen, df, statistic, p.value)
# A tibble: 1,111 x 4
gen df statistic p.value
<chr> <dbl> <dbl> <dbl>
1 kjhss1 3. 0.175 0.912
2 sdth2 2. 0.165 0.850
3 sgdhstjh20 2. 0.440 0.654
4 jdygfjgdkydg21 2. 0.267 0.770
5 shfjdfyjydg22 2. 0.632 0.548
6 jdyjkdg23 2. 0.792 0.477
7 fckjhghw24 2. 0.790 0.478
8 shsnv25 2. 1.15 0.354
9 qeifyvj26 2. 0.588 0.573
10 qsiubx27 2. 1.14 0.359
# ... with 1,101 more rows
Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:
table(b$animal, b$condition)
control treatmentA treatmentB treatmentC treatmentD
animalcontrol1 1005 0 0 0 0
animalcontrol2 857 0 0 0 0
animalcontrol3 959 0 0 0 0
animalcontrol4 928 0 0 0 0
animalcontrol5 1005 0 0 0 0
animaltreatmentA1 0 927 0 0 0
animaltreatmentA2 0 883 0 0 0
animaltreatmentA3 0 908 0 0 0
animaltreatmentA4 0 861 0 0 0
animaltreatmentA5 0 927 0 0 0
animaltreatmentB1 0 0 943 0 0
animaltreatmentB2 0 0 841 0 0
animaltreatmentB3 0 0 943 0 0
animaltreatmentB4 0 0 910 0 0
animaltreatmentB5 0 0 943 0 0
animaltreatmentC1 0 0 0 742 0
animaltreatmentC2 0 0 0 724 0
animaltreatmentC3 0 0 0 702 0
animaltreatmentC4 0 0 0 698 0
animaltreatmentC5 0 0 0 742 0
animaltreatmentD1 0 0 0 0 844
animaltreatmentD2 0 0 0 0 776
animaltreatmentD3 0 0 0 0 812
animaltreatmentD4 0 0 0 0 783
animaltreatmentD5 0 0 0 0 844
Agree you need to "automate" this in some fashion, but I think you are in need of a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants. I think you should be discussing this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:
str( b[ b$gen == "dghwg1041", ])
'data.frame': 13 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 4.34 2.98 4.44 3.87 2.65 ...
You do have a fair number with "complete representation":
gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length
n missing distinct Info Mean Gmd .05 .10
21507 0 18 0.976 20.32 4.802 13 14
.25 .50 .75 .90 .95
18 20 24 25 25
Value 5 8 9 10 12 13 14 15 16 17
Frequency 100 48 288 270 84 624 924 2220 64 527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025
Value 18 19 20 21 22 23 24 25
Frequency 666 2223 3840 42 220 1058 3384 4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229
You might start by looking at all the "gen"s that have complete data:
head( gen_tbl[ gen_tbl == 25 ], 25)
#------------------
dghwg1131 dghwg546 dghwg591 dghwg636 dghwg681
25 25 25 25 25
dghwg726 dgkuck196 dgkuck286 dgkuck421 dgkuck691
25 25 25 25 25
dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692
25 25 25 25 25
dgkukdgse737 djh592 djh637 djh682 djh727
25 25 25 25 25
dkgkjd327 dkgkjd642 dkgkjd687 dkgkjd732 fckjhghw204
25 25 25 25 25
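(gen_tbl is not defined in the excerpt above; presumably it is just the per-gen count table, something like the following.)
gen_tbl <- table(b$gen)  # counts of observations per gen (assumed definition)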

Sizing scatter plot point means proportional to sample size

I am creating a scatter plot using ggplot2 and would like to size my point means proportional to the sample size used to calculate the mean. This is my code, where I used fun.y to calculate the mean by group Trt:
branch1 %>%
ggplot() + aes(x=Branch, y=Flow_T, group=Trt, color=Trt) +
stat_summary(aes(group=Trt), fun.y=mean, geom="point", size=)
I am relatively new to R, but my guess is to use size in the aes function to resize my points. I thought it might be a good idea to extract the sample sizes used in fun.y=mean and create a new variable that could be passed to size; however, I am not sure how to do that.
Any help will be greatly appreciated! Cheers.
EDIT
Here's my data for reference:
Plant Branch Pod_B Flow_Miss Pod_A Flow_T Trt Dmg
<int> <dbl> <int> <int> <int> <dbl> <fct> <int>
1 1 1.00 0 16 20 36.0 Early 1
2 1 2.00 0 1 17 18.0 Early 1
3 1 3.00 0 0 17 17.0 Early 1
4 1 4.00 0 3 14 17.0 Early 1
5 1 5.00 5 2 4 11.0 Early 1
6 1 6.00 0 3 7 10.0 Early 1
7 1 7.00 0 4 6 10.0 Early 1
8 1 8.00 0 13 6 19.0 Early 1
9 1 9.00 0 2 7 9.00 Early 1
10 1 10.0 0 2 3 5.00 Early 1
EDIT 2:
Here is a graph of what I'm trying to achieve with proportional sizing by sample size n per Trt (treatment), where the mean is calculated per Trt and Branch number. I'm wondering if I should make Branch a categorical variable.
Plot without Proportional Sizing
If I understood you correctly you would like to scale the size of points based on the number of points per Trt group.
How about something like this? Note that I appended your sample data, because Trt contains only Early entries.
df %>%
  group_by(Trt) %>%
  mutate(ssize = n()) %>%
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt, size = ssize)) +
  geom_point();
Explanation: We group by Trt, then calculate the number of samples per group, ssize, and plot with aes(..., size = ssize) to ensure that the size of the points scales with ssize. You don't need the group aesthetic here.
Update
To scale points according to the mean of Flow_T per Trt we can do:
df %>%
  group_by(Trt) %>%
  mutate(
    ssize = n(),
    mean.Flow_T = mean(Flow_T)) %>%
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt, size = mean.Flow_T)) +
  geom_point();
Sample data
# Sample data
df <- read.table(text =
"Plant Branch Pod_B Flow_Miss Pod_A Flow_T Trt Dmg
1 1 1.00 0 16 20 36.0 Early 1
2 1 2.00 0 1 17 18.0 Early 1
3 1 3.00 0 0 17 17.0 Early 1
4 1 4.00 0 3 14 17.0 Early 1
5 1 5.00 5 2 4 11.0 Early 1
6 1 6.00 0 3 7 10.0 Early 1
7 1 7.00 0 4 6 10.0 Early 1
8 1 8.00 0 13 6 19.0 Early 1
9 1 9.00 0 2 7 9.00 Early 1
10 1 10.0 0 2 3 5.00 Early 1
11 1 10.0 0 2 3 20.00 Late 1", header = T)
Using #Maurits Evers's help, I created my desired graph by making Branch a factor. The following is my code as well as my intended graph:
branch1$Branch <- as.factor(branch1$Branch)
branch1$Flow_T <- as.numeric(branch1$Flow_T)
branch1 %>%
  group_by(Trt, Branch) %>%
  mutate(ssize = n()) %>%
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt)) +
  stat_summary(aes(size = ssize), fun.y = mean, geom = "point")
Final Plot
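One small note: in ggplot2 3.3.0 and later, the fun.y argument of stat_summary() was renamed to fun, so on a current install the final call would read (same plot, just the updated argument name):
branch1 %>%
  group_by(Trt, Branch) %>%
  mutate(ssize = n()) %>%
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt)) +
  stat_summary(aes(size = ssize), fun = mean, geom = "point")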

Selecting subsets of a grouped variable

The data I used can be found here (the "sq.txt" file).
Below is a summary of the data:
> summary(sq)
behaviour date squirrel time
resting :983 2017-06-28: 197 22995 : 127 09:30:00: 17
travelling :649 2017-06-26: 160 22758 : 116 08:00:00: 16
feeding :344 2017-06-30: 139 23080 : 108 16:25:00: 15
OOS :330 2017-07-18: 110 23089 : 100 08:11:00: 13
vocalization:246 2017-06-27: 99 23079 : 97 08:31:00: 13
social : 53 2017-06-29: 96 22865 : 95 15:24:00: 13
(Other) : 67 (Other) :1871 (Other):2029 (Other) :2585
Each squirrel has a number of observations that correspond to a number of different behaviours (behaviour).
For example, squirrel 22995 was observed 127 times. These 127 observations correspond to different behaviour categories: 7 feeding, 1 territorial, 55 resting, etc. I then need to divide the number of each behaviour by the total number of observations (i.e. feeding = 7/127, territorial = 1/127, resting = 55/127, etc.) to get proportions of time spent doing each behaviour.
I already have grouped my observations by squirrel using the dplyr package.
Is there a way, using dplyr, for me to calculate proportions for one column (behaviour) based on the total observations for a column (squirrel) where the values have been grouped?
Something like this?
sq %>%
  count(squirrel, behaviour) %>%
  group_by(squirrel) %>%
  mutate(p = n/sum(n)) %>%
  # add this line to see the result for squirrel 22995
  filter(squirrel == 22995)
# A tibble: 8 x 4
# Groups: squirrel [1]
squirrel behaviour n p
<int> <chr> <int> <dbl>
1 22995 feeding 7 0.0551
2 22995 nest_building 4 0.0315
3 22995 OOS 9 0.0709
4 22995 resting 55 0.433
5 22995 social 6 0.0472
6 22995 territorial 1 0.00787
7 22995 travelling 32 0.252
8 22995 vocalization 13 0.102
EDIT:
If you want to include zero counts for squirrels where a behaviour was not observed, one way is to use tidyr::complete(). That generates NA by default, which you may want to replace with zero.
library(dplyr)
library(tidyr)
sq %>%
  count(squirrel, behaviour) %>%
  complete(squirrel, behaviour) %>%
  group_by(squirrel) %>%
  mutate(p = n/sum(n, na.rm = TRUE)) %>%
  replace_na(list(n = 0, p = 0)) %>%
  filter(squirrel == 22995)
# A tibble: 11 x 4
# Groups: squirrel [1]
squirrel behaviour n p
<int> <chr> <dbl> <dbl>
1 22995 dead 0 0
2 22995 feeding 7.00 0.0551
3 22995 grooming 0 0
4 22995 nest_building 4.00 0.0315
5 22995 OOS 9.00 0.0709
6 22995 resting 55.0 0.433
7 22995 social 6.00 0.0472
8 22995 territorial 1.00 0.00787
9 22995 travelling 32.0 0.252
10 22995 vigilant 0 0
11 22995 vocalization 13.0 0.102
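If a wide layout is preferred (one row per squirrel, one column per behaviour proportion), the same counts can be spread out with tidyr::pivot_wider(). A sketch, assuming a reasonably recent tidyr (a single values_fill value needs tidyr >= 1.1):
library(dplyr)
library(tidyr)
sq %>%
  count(squirrel, behaviour) %>%
  group_by(squirrel) %>%
  mutate(p = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = behaviour, values_from = p, values_fill = 0)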

How to calculate survival probabilities in R?

I am trying to fit a parametric survival model. I think I managed to do so. However, I could not succeed in calculating the survival probabilities:
library(survival)
zaman <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,
56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
test <- c(rep(1,17),rep(0,16))
WBC <- c(2.3,0.75,4.3,2.6,6,10.5,10,17,5.4,7,9.4,32,35,100,
100,52,100,4.4,3,4,1.5,9,5.3,10,19,27,28,31,26,21,79,100,100)
status <- c(rep(1,33))
data <- data.frame(zaman,test,WBC)
surv3 <- Surv(zaman[test==1], status[test==1])
fit3 <- survreg( surv3 ~ log(WBC[test==1]),dist="w")
On the other hand, I have no problem at all calculating the survival probabilities using the Kaplan-Meier estimator:
fit2 <- survfit(Surv(zaman[test==0], status[test==0]) ~ 1)
summary(fit2)$surv
Any idea why?
You can get predictions from a survreg object with predict:
predict(fit3)
If you're interested in combining this with the original data, and also in the residuals and standard errors of the predictions, you can use the augment function in my broom package:
library(broom)
augment(fit3)
A full analysis might look something like:
library(survival)
library(broom)
data <- data.frame(zaman, test, WBC, status)
subdata <- data[data$test == 1, ]
fit3 <- survreg( Surv(zaman, status) ~ log(WBC), subdata, dist="w")
augment(fit3, subdata)
With the output:
zaman test WBC status .fitted .se.fit .resid
1 65 1 2.30 1 115.46728 43.913188 -50.467281
2 156 1 0.75 1 197.05852 108.389586 -41.058516
3 100 1 4.30 1 85.67236 26.043277 14.327641
4 134 1 2.60 1 108.90836 39.624106 25.091636
5 16 1 6.00 1 73.08498 20.029707 -57.084979
6 108 1 10.50 1 55.96298 13.989099 52.037022
7 121 1 10.00 1 57.28065 14.350609 63.719348
8 4 1 17.00 1 44.47189 11.607368 -40.471888
9 39 1 5.40 1 76.85181 21.708514 -37.851810
10 143 1 7.00 1 67.90395 17.911170 75.096054
11 56 1 9.40 1 58.99643 14.848751 -2.996434
12 26 1 32.00 1 32.88935 10.333303 -6.889346
13 22 1 35.00 1 31.51314 10.219871 -9.513136
14 1 1 100.00 1 19.09922 8.963022 -18.099216
15 1 1 100.00 1 19.09922 8.963022 -18.099216
16 5 1 52.00 1 26.09034 9.763728 -21.090343
17 65 1 100.00 1 19.09922 8.963022 45.900784
In this case, the .fitted column holds the model's fitted values (predicted survival times on the response scale), not survival probabilities.
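If actual survival probabilities at a given time are what's needed, one way is to use the Weibull parameterisation that survreg fits. A sketch, not from the original answer, where t0 = 50 is an arbitrary example time point:
# S(t0 | x) = P(T > t0) under the fitted Weibull AFT model
lp <- predict(fit3, newdata = subdata, type = "lp")  # linear predictor per subject
t0 <- 50                                             # example time point (arbitrary)
pweibull(t0, shape = 1 / fit3$scale, scale = exp(lp), lower.tail = FALSE)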
