I am creating a scatter plot using ggplot2 and would like to size the plotted means in proportion to the sample size used to calculate each mean. This is my code, where I use fun.y to calculate the mean by the group Trt:
branch1 %>%
ggplot() + aes(x=Branch, y=Flow_T, group=Trt, color=Trt) +
stat_summary(aes(group=Trt), fun.y=mean, geom="point", size=)
I am relatively new to R, but my guess is that I should use size inside the aes function to resize my points. I thought it might be a good idea to extract the sample sizes used by fun.y=mean and create a new column that could be passed to size; however, I am not sure how to do that.
Any help will be greatly appreciated! Cheers.
EDIT
Here's my data for reference:
Plant Branch Pod_B Flow_Miss Pod_A Flow_T Trt Dmg
<int> <dbl> <int> <int> <int> <dbl> <fct> <int>
1 1 1.00 0 16 20 36.0 Early 1
2 1 2.00 0 1 17 18.0 Early 1
3 1 3.00 0 0 17 17.0 Early 1
4 1 4.00 0 3 14 17.0 Early 1
5 1 5.00 5 2 4 11.0 Early 1
6 1 6.00 0 3 7 10.0 Early 1
7 1 7.00 0 4 6 10.0 Early 1
8 1 8.00 0 13 6 19.0 Early 1
9 1 9.00 0 2 7 9.00 Early 1
10 1 10.0 0 2 3 5.00 Early 1
EDIT 2:
Here is a graph of what I'm trying to achieve with proportional sizing by sample size n per Trt (treatment), where the mean is calculated per Trt and Branch number. I'm wondering if I should make Branch a categorical variable.
Plot without Proportional Sizing
If I understood you correctly, you would like to scale the size of points based on the number of points per Trt group.
How about something like this? Note that I appended a row to your sample data, because Trt otherwise contains only Early entries.
df %>%
group_by(Trt) %>%
mutate(ssize = n()) %>%
ggplot(aes(x = Branch, y = Flow_T, colour = Trt, size = ssize)) +
geom_point()
Explanation: We group by Trt, then calculate the number of samples per group, ssize, and plot with aes(..., size = ssize) so that point size scales with ssize. You don't need the group aesthetic here.
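As an aside, dplyr's add_count() is shorthand for the same group_by() + mutate(n()) pattern; a minimal sketch (the name argument needs a reasonably recent dplyr):
library(dplyr)
library(ggplot2)
df %>%
  add_count(Trt, name = "ssize") %>% # same as group_by(Trt) %>% mutate(ssize = n()) %>% ungroup()
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt, size = ssize)) +
  geom_point()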
Update
To scale points according to the mean of Flow_T per Trt we can do:
df %>%
group_by(Trt) %>%
mutate(
ssize = n(),
mean.Flow_T = mean(Flow_T)) %>%
ggplot(aes(x = Branch, y = Flow_T, colour = Trt, size = mean.Flow_T)) +
geom_point()
Sample data
# Sample data
df <- read.table(text =
"Plant Branch Pod_B Flow_Miss Pod_A Flow_T Trt Dmg
1 1 1.00 0 16 20 36.0 Early 1
2 1 2.00 0 1 17 18.0 Early 1
3 1 3.00 0 0 17 17.0 Early 1
4 1 4.00 0 3 14 17.0 Early 1
5 1 5.00 5 2 4 11.0 Early 1
6 1 6.00 0 3 7 10.0 Early 1
7 1 7.00 0 4 6 10.0 Early 1
8 1 8.00 0 13 6 19.0 Early 1
9 1 9.00 0 2 7 9.00 Early 1
10 1 10.0 0 2 3 5.00 Early 1
11 1 10.0 0 2 3 20.00 Late 1", header = TRUE)
Using @Maurits Evers's help, I created my desired graph by making Branch a factor. The following is my code as well as my intended graph:
branch1$Branch <- as.factor(branch1$Branch)
branch1$Flow_T <- as.numeric(branch1$Flow_T)
branch1 %>%
group_by(Trt, Branch) %>%
mutate(ssize = n()) %>%
ggplot(aes(x = Branch, y = Flow_T, colour = Trt)) +
stat_summary(aes(size=ssize), fun.y=mean, geom="point")
Final Plot
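A side note on the code above: since ggplot2 3.3.0, fun.y is deprecated in favour of fun, so on a current installation the stat_summary() call would be written like this (a sketch, otherwise identical):
branch1 %>%
  group_by(Trt, Branch) %>%
  mutate(ssize = n()) %>%
  ggplot(aes(x = Branch, y = Flow_T, colour = Trt)) +
  stat_summary(aes(size = ssize), fun = mean, geom = "point")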
Use the data below to make the cumsum_a column look like the should column.
Data to start with:
> demo
th seq group
1 20.1 1 10
2 24.1 2 10
3 26.1 3 10
4 1.1 1 20
5 2.1 2 20
6 4.1 3 20
The "should" column below is the goal.
demo <- data.frame(th = c(20.1, 24.1, 26.1, 1.1, 2.1, 4.1),
                   seq = c(1:3, 1:3),
                   group = c(rep(10, 3), rep(20, 3)))
library(magrittr)
library(dplyr)
demo %>%
  group_by(group) %>%
  mutate(cumsum_a = cumsum(group^seq * (th / cummax(th)))) %>%
  ungroup() %>%
  mutate(
    cumsum_m = c( # As an example only, this manually does exactly what cumsum_a is doing (which is wrong)
10^1*20.1/20.1, #good
10^1*20.1/20.1 + 10^2*24.1/24.1, #different denominators, bad
10^1*20.1/20.1 + 10^2*24.1/24.1 + 10^3*26.1/26.1, #different denominators, bad
20^1*1.1/1.1, #good
20^1*1.1/1.1 + 20^2*2.1/2.1, #different denominators, bad
20^1*1.1/1.1 + 20^2*2.1/2.1 + 20^3*4.1/4.1 #different denominators, bad
),
should=c( #this is exactly the kind of calculation I want
10^1*20.1/20.1, #good
10^1*20.1/24.1 + 10^2*24.1/24.1, #good
10^1*20.1/26.1 + 10^2*24.1/26.1 + 10^3*26.1/26.1, #good
20^1*1.1/1.1, #good
20^1*1.1/2.1 + 20^2*2.1/2.1, #good
20^1*1.1/4.1 + 20^2*2.1/4.1 + 20^3*4.1/4.1 #good
)
)
Most simply put, the denominators need to be the same within each row: 24.1 and 24.1 instead of 20.1 and 24.1 on the second row of cumsum_m, and likewise in the underlying calculation for cumsum_a.
Here are the new columns, where should is what cumsum_a or cumsum_m should be.
th seq group cumsum_a cumsum_m should
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 20.1 1 10 10 10 10
2 24.1 2 10 110 110 108.
3 26.1 3 10 1110 1110 1100.
4 1.1 1 20 20 20 20
5 2.1 2 20 420 420 410.
6 4.1 3 20 8420 8420 8210.
You can use the following solution:
purrr::accumulate takes a two-argument function: the first argument, written .x or ..1, is the accumulated value from the previous iterations, and .y represents the current value of our vector (2:n()). So the first accumulated value is the first element of group, which I supplied as the .init argument.
Since you would like to change the denominator used by the previous iterations/calculations, I multiply the previous result .x by the ratio of the previous cmax value to the current cmax value.
I think the rest is pretty clear but if you have any more question about it just let me know.
library(dplyr)
library(purrr)
demo %>%
group_by(group) %>%
mutate(cmax = cummax(th),
should = accumulate(2:n(), .init = group[1],
~ (.x * cmax[.y - 1] / cmax[.y]) + (group[.y] ^ seq[.y]) * (th[.y] / cmax[.y])))
# A tibble: 6 x 5
# Groups: group [2]
th seq group cmax should
<dbl> <int> <dbl> <dbl> <dbl>
1 20.1 1 10 20.1 10
2 24.1 2 10 24.1 108.
3 26.1 3 10 26.1 1100.
4 1.1 1 20 1.1 20
5 2.1 2 20 2.1 410.
6 4.1 3 20 4.1 8210.
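As a side note, every term in row i of should shares the same denominator cmax[i], so you can factor that denominator out of a plain cumulative sum and skip accumulate entirely; a sketch that reproduces the same should column:
library(dplyr)
demo %>%
  group_by(group) %>%
  mutate(should = cumsum(group^seq * th) / cummax(th)) # divide the running sum by the current cummax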
I'm working with processed data from motion sensors and would like some help manipulating the dataset. My variables include "Time" (in milliseconds) and "Shoulder Flexion" (in degrees). I want to create a new variable that flags every time there is an absolute change of 3 degrees in the "Shoulder Flexion" variable. Something like:
newdata <- mydata %>%
mutate(changevariable = ifelse(Shoulder.Flexion == [absolute change of 3 degrees], "1", "0"))
where each flag/"1" marks a change of ±3 degrees.
An example of my dataset:
structure(list(Time = c(0, 0.0078125, 0.015625, 0.023438, 0.03125,
0.039062, 0.046875, 0.054688, 0.0625, 0.070312, 0.078125, 0.085938,
0.09375, 0.10156, 0.10938, 0.11719, 0.125, 0.13281, 0.14062,
0.14844), Shoulder.Flexion = c(-9.4721, -12.098, -12.51, 12.253,
11.815, 11.385, 11.03, 10.766, 10.586, 10.472, 10.408, 10.381,
10.383, 10.407, 10.453, 10.521, 10.605, 10.695, 10.778, 10.846
)), row.names = c(NA, 20L), class = "data.frame")
The point of this is to help me count the number of times a subject rotates 3 degrees within a specified time interval. Any help would be much appreciated.
Thanks in advance.
Depending on the amount of data you have, you could consider a loop. I try to avoid for loops in general, and a data.table solution would probably be preferable, but here is a straightforward version:
mydata$Shoulder.Flexion <- c(9, 10, 10.5, 11.7, 12.1, 13, 13.5, 14, 15.9, 16.2, 17.4, 18.6, 19, 18.5, 17.5, 17, 15, 14, 13, 12) # sample data with larger swings
ref <- mydata[1, 2] # reference angle: the last flagged value
mydata$changevariable <- 0
for (i in 2:nrow(mydata)) {
  if (abs(mydata[i, 2] - ref) >= 3) { # 3-degree change relative to the reference
    mydata[i, "changevariable"] <- 1
    ref <- mydata[i, 2] # reset the reference to the current angle
  }
}
Output
Time Shoulder.Flexion changevariable
1 0.0000000 9.0 0
2 0.0078125 10.0 0
3 0.0156250 10.5 0
4 0.0234380 11.7 0
5 0.0312500 12.1 1
6 0.0390620 13.0 0
7 0.0468750 13.5 0
8 0.0546880 14.0 0
9 0.0625000 15.9 1
10 0.0703120 16.2 0
11 0.0781250 17.4 0
12 0.0859380 18.6 0
13 0.0937500 19.0 1
14 0.1015600 18.5 0
15 0.1093800 17.5 0
16 0.1171900 17.0 0
17 0.1250000 15.0 1
18 0.1328100 14.0 0
19 0.1406200 13.0 0
20 0.1484400 12.0 1
Edit:
It is still unclear what is desired. It would help to have a "final" desired dataframe that includes changevariable for your sample data. It also might help to have different sample data with more changes that exceed 3 degrees.
Here is another version in base R that calculates the differences between rows, takes the absolute value of those differences, and then steps through them to build a cumulative sum. When the running sum reaches 3, changevariable is set to 1 and the sum is reset to zero.
Let me know if this is closer:
mydata$diff <- ave(mydata$Shoulder.Flexion, FUN = function(x) c(0, abs(diff(x)))) # absolute row-to-row change
total <- 0
mydata$changevariable <- 0
for (i in 2:nrow(mydata)) {
  total <- total + mydata[i, "diff"]
  if (total >= 3) { # cumulative change has reached 3 degrees
    mydata[i, "changevariable"] <- 1
    total <- 0 # reset the running sum
  }
}
Time Shoulder.Flexion diff changevariable
1 0.0000000 -9.4721 0.0000 0
2 0.0078125 -12.0980 2.6259 0
3 0.0156250 -12.5100 0.4120 1
4 0.0234380 12.2530 24.7630 1
5 0.0312500 11.8150 0.4380 0
6 0.0390620 11.3850 0.4300 0
7 0.0468750 11.0300 0.3550 0
8 0.0546880 10.7660 0.2640 0
9 0.0625000 10.5860 0.1800 0
10 0.0703120 10.4720 0.1140 0
11 0.0781250 10.4080 0.0640 0
12 0.0859380 10.3810 0.0270 0
13 0.0937500 10.3830 0.0020 0
14 0.1015600 10.4070 0.0240 0
15 0.1093800 10.4530 0.0460 0
16 0.1171900 10.5210 0.0680 0
17 0.1250000 10.6050 0.0840 0
18 0.1328100 10.6950 0.0900 0
19 0.1406200 10.7780 0.0830 0
20 0.1484400 10.8460 0.0680 0
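To get from these flags to what the question ultimately asks for (the number of 3-degree changes per specified time interval), you can bin Time and sum the flags per bin. A sketch, assuming a hypothetical window width of 0.05; replace it with your interval:
library(dplyr)
mydata %>%
  mutate(window = cut(Time, breaks = seq(0, max(Time) + 0.05, by = 0.05),
                      include.lowest = TRUE)) %>% # include.lowest keeps Time == 0 in the first bin
  group_by(window) %>%
  summarise(n_changes = sum(changevariable))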
Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to be given a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects; project 1 has a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), project 3 a weighted Performance of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted Performance values is the boss's performance, which for Boss 1 is 0.52. So, the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve data-manipulations skills? (you can see in my profile that I have asked a lot of questions like this, but still I'm not able to solve them on my own)
We can get the sum of the product of n_Projects and Performance and divide it by total_projects:
library(dplyr)
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance / total_projects
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and @akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you wrote for Weight_Project does not yield a single value for each Boss, but 7 (one per row of the group). summarise is there to condense several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise the result into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once each row has a project weight (which depends on the boss), you can calculate the weighted mean, which is a summary value, with summarise.
A more compact way that still makes the weighting explicit would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which gives @akrun's answer
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
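Finally, note that this weighted mean is exactly what base R's weighted.mean() computes (sum(w * x) / sum(w)), so an equivalent sketch is:
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Performance = weighted.mean(Performance, n_Projects)) # weights are the project counts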
I'm trying to create a histogram using ggplot2 in R.
This is the code I'm using:
library(tidyverse)
dat_male$explicit_truncated <- trunc(dat_male$explicit_mean)
means2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), mean, na.rm=TRUE)
colnames(means2) <- c("explicit", "id", "IAT_D")
sd2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), sd, na.rm=TRUE)
length2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), length)
se2 <- sd2$x / sqrt(length2$x)
means2$lo <- means2$IAT_D - 1.6*se2
means2$hi <- means2$IAT_D + 1.6*se2
ggplot(data = means2, aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=lo,ymax=hi, width=.2), position=position_dodge(0.9), data=means2) +
xlab("Explicit attitude score") +
ylab("D-score")
For some reason I get the following warning message:
Removed 3 rows containing missing values (geom_bar).
And I get the following histogram:
I really have no clue what is going on.
Please let me know if you need to see anything else of my code, I'm never really sure what to include.
dat_male is a dataset that looks like this (I have only included the variables that I mentioned in this question, as the dataset contains 68 variables):
id explicit_mean IAT_D explicit_truncated
5 1 3.1250 0.366158652 3
6 1 3.3125 0.373590066 3
9 1 3.6250 0.208096230 3
11 1 3.1250 0.661983618 3
15 1 2.3125 0.348246184 2
19 1 3.7500 0.562406383 3
28 1 2.5625 -0.292888526 2
35 1 4.3750 0.560039531 4
36 1 3.8125 -0.117455439 3
37 1 3.1250 0.074375196 3
46 1 2.5625 0.488265849 2
47 1 4.2500 -0.131005579 4
53 1 2.0625 0.193040876 2
55 1 2.6875 0.875420303 2
62 1 3.8750 0.579146056 3
63 1 3.3125 0.666095380 3
66 1 2.8125 0.115607820 2
68 1 4.3750 0.259929946 4
80 1 3.0000 0.502709149 3
means2 is a dataset I have used to calculate means, and that looks like this:
explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864
Now that I see it in front of me, it probably has something to do with the NaNs?
From your dataset it seems like everything is alright. The warnings you get indicate that your data.frame has missing values (i.e. NaN and NA).
I actually got two warning messages:
Warning messages:
1: Removed 1 rows containing missing values
(geom_bar).
2: Removed 2 rows containing missing values
(geom_errorbar).
Regarding the plot: because IAT_D is NaN for explicit = 0, no bar is drawn for that value. Similarly, because lo and hi are NA for one row (explicit = 1, id = 1), you don't get the corresponding error bar.
Dataset:
means2 <- read.table(text = " explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864",
header = TRUE)
Plot:
means2 %>%
ggplot(aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=lo,ymax=hi, width=.2),
position=position_dodge(0.9)) +
xlab("Explicit attitude score") +
ylab("D-score")
I am trying to fit a parametric survival model. I think I managed to do so. However, I could not succeed in calculating the survival probabilities:
library(survival)
zaman <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,
56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
test <- c(rep(1,17),rep(0,16))
WBC <- c(2.3,0.75,4.3,2.6,6,10.5,10,17,5.4,7,9.4,32,35,100,
100,52,100,4.4,3,4,1.5,9,5.3,10,19,27,28,31,26,21,79,100,100)
status <- c(rep(1,33))
data <- data.frame(zaman,test,WBC)
surv3 <- Surv(zaman[test==1], status[test==1])
fit3 <- survreg( surv3 ~ log(WBC[test==1]),dist="w")
On the other hand, there is no problem at all when calculating the survival probabilities using Kaplan-Meier estimation:
fit2 <- survfit(Surv(zaman[test==0], status[test==0]) ~ 1)
summary(fit2)$surv
Any idea why?
You can get predictions from a survreg object with predict (by default, fitted values on the response scale):
predict(fit3)
If you're interested in combining these with the original data, and also in the residuals and standard errors of the predictions, you can use the augment function in my broom package:
library(broom)
augment(fit3)
A full analysis might look something like:
library(survival)
library(broom)
data <- data.frame(zaman, test, WBC, status)
subdata <- data[data$test == 1, ]
fit3 <- survreg( Surv(zaman, status) ~ log(WBC), subdata, dist="w")
augment(fit3, subdata)
With the output:
zaman test WBC status .fitted .se.fit .resid
1 65 1 2.30 1 115.46728 43.913188 -50.467281
2 156 1 0.75 1 197.05852 108.389586 -41.058516
3 100 1 4.30 1 85.67236 26.043277 14.327641
4 134 1 2.60 1 108.90836 39.624106 25.091636
5 16 1 6.00 1 73.08498 20.029707 -57.084979
6 108 1 10.50 1 55.96298 13.989099 52.037022
7 121 1 10.00 1 57.28065 14.350609 63.719348
8 4 1 17.00 1 44.47189 11.607368 -40.471888
9 39 1 5.40 1 76.85181 21.708514 -37.851810
10 143 1 7.00 1 67.90395 17.911170 75.096054
11 56 1 9.40 1 58.99643 14.848751 -2.996434
12 26 1 32.00 1 32.88935 10.333303 -6.889346
13 22 1 35.00 1 31.51314 10.219871 -9.513136
14 1 1 100.00 1 19.09922 8.963022 -18.099216
15 1 1 100.00 1 19.09922 8.963022 -18.099216
16 5 1 52.00 1 26.09034 9.763728 -21.090343
17 65 1 100.00 1 19.09922 8.963022 45.900784
In this case, the .fitted column contains the predicted values on the response scale (predicted survival times), not survival probabilities.
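If what you are after is survival probabilities, predict.survreg can invert the fitted survival curve via type = "quantile"; a sketch, where p is the probability of failure by the returned time (so the survival probability is 1 - p):
pct <- seq(0.01, 0.99, by = 0.01)
ptime <- predict(fit3, type = "quantile", p = pct, se = TRUE)
# ptime$fit[1, ] holds the times at which survival for observation 1 drops to 1 - pct;
# plotting them traces that observation's fitted survival curve
plot(ptime$fit[1, ], 1 - pct, type = "l",
     xlab = "Time", ylab = "Survival probability")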