I want to plot a very simple boxplot like this in R:
desired graph
The model is a generalized linear model with a log link (Gamma distributed; jh_conc is a hormone concentration variable) of a continuous dependent variable (jh_conc) on a categorical grouping variable (group: type of bee).
The script I have so far is:
> jh=read.csv("data_jh_titer.csv",header=T)
> jh
group jh_conc
1 Queens 6.38542714
2 Queens 11.22512563
3 Queens 7.74472362
4 Queens 11.56834171
5 Queens 3.74020100
6 Virgin Queens 0.06080402
7 Virgin Queens 0.12663317
8 Virgin Queens 0.08090452
9 Virgin Queens 0.04422111
10 Virgin Queens 0.14673367
11 Workers 0.03417085
12 Workers 0.02449749
13 Workers 0.02927136
14 Workers 0.01648241
15 Workers 0.02150754
fit1=glm(jh_conc~group,family=Gamma(link=log), data=jh)
ggplot(jh, aes(group, jh_conc))+
geom_boxplot(aes(fill=group))+
coord_trans(y="log")
the resulting plot looks like this:
My question is: what (geom) extensions can I use to split the y-axis and rescale the two parts differently? Also, how do I add the black circles (averages, which are calculated on a log scale and then back-transformed to the original scale) and the horizontal lines, which are significance levels based on post hoc tests performed on log-transformed data (**: p < 0.01, ***: p < 0.001)?
You can't create a broken numeric axis in ggplot2 by design, mainly because it visually distorts the data/differences being represented and is considered misleading.
You can, however, use scale_y_log10() + annotation_logticks() to help condense data across a wide range of values or to better show heteroskedastic data. You can also use annotate() to build the stars and bars that represent your p-values.
You can also easily grab information from a model using its named attributes; here we care about fit2$coef:
# make a zero intercept version for easy plotting
fit2 <- glm(jh_conc ~ 0 + group, family = Gamma(link = log), data = jh)
# extract relevant group means and use exp() to scale back
means <- data.frame(group = gsub("group", "",names(fit2$coef)), means = exp(fit2$coef))
ggplot(jh, aes(group, jh_conc)) +
geom_boxplot(aes(fill=group)) +
# plot the circles from the model extraction (means)
geom_point(data = means, aes(y = means),size = 4, shape = 21, color = "black", fill = NA) +
# use this instead of coord_trans
scale_y_log10() + annotation_logticks(sides = "l") +
# use annotate "segment" to draw the horizontal lines
annotate("segment", x = 1, xend = 2, y = 15, yend = 15) +
# use annotate "text" to add your pvalue *'s
annotate("text", x = 1.5, y = 15.5, label = "**", size = 4) +
annotate("segment", x = 1, xend = 3, y = 20, yend = 20) +
annotate("text", x = 2, y = 20.5, label = "***", size = 4) +
annotate("segment", x = 2, xend = 3, y = .2, yend = .2) +
annotate("text", x = 2.5, y = .25, label = "**", size = 4)
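The question asks for averages computed on the log scale and then back-transformed, which is not quite the same thing as exp(fit2$coef). The sketch below (base R only, using the data printed in the question; the names arith_means, geo_means and glm_means are just illustrative) compares the two, so you can decide which one the circles should show. For a group-only model like this one, the GLM coefficients should reproduce the ordinary group means, while the back-transformed log means are geometric means and come out smaller.

```r
# Data copied from the question
jh <- data.frame(
  group = rep(c("Queens", "Virgin Queens", "Workers"), each = 5),
  jh_conc = c(6.38542714, 11.22512563, 7.74472362, 11.56834171, 3.74020100,
              0.06080402, 0.12663317, 0.08090452, 0.04422111, 0.14673367,
              0.03417085, 0.02449749, 0.02927136, 0.01648241, 0.02150754)
)

# Arithmetic mean per group on the original scale
arith_means <- tapply(jh$jh_conc, jh$group, mean)

# Mean of the log values per group, back-transformed (geometric mean)
geo_means <- exp(tapply(log(jh$jh_conc), jh$group, mean))

# Group means implied by the zero-intercept Gamma GLM from the answer
fit2 <- glm(jh_conc ~ 0 + group, family = Gamma(link = "log"), data = jh)
glm_means <- exp(coef(fit2))
names(glm_means) <- sub("^group", "", names(glm_means))

# Side-by-side comparison of the three sets of means
print(cbind(arith_means, geo_means, glm_means))
```

If the geometric means are what you want, a data frame built from geo_means could be passed to geom_point() in place of means.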
I have a dataset that looks like this with 3 more levels for scarification. Germination is my response variable.
scarification  time  germination
Water          0     0
Water          2     0
Water          4     8
Water          8     23
Ethanol        0     0
Ethanol        2     18
Ethanol        4     19
Ethanol        8     22
I have made a glm for the data and plotted the fitted values, and done pairwise contrasts using emmeans. I'd like to add letters to my bar chart to indicate letters of significance, but am having trouble extracting cld data as the cld function does not work with emmGrid objects, and the variable names used in emmeans are different to those used in the plot. I have tried renaming the variables but that does not work. I have also tried using geom_signif but that does not seem to work either.
geom_signif(comparisons = em,
+ test = "emmeans",
+ map_signif_level = TRUE)
Warning message:
Computation failed in `stat_signif()`
Caused by error in `mapped_discrete()`:
! Can't convert `x` <list> to <double>.
Here is the code I have so far
#make a glm
summary(mod_8 <- glm(cbind(germination, total - germination) ~ scarification*time, data = df, family = binomial))
# make a new df with the predicted values from the model, specifying for stratification to just do 0, 2, 4, and 8 from the continuous variable
mydf <- ggpredict(mod_8, terms = c("time [0,2,4,8]", "scarification"))
#add time as a factor to the new df
mydf$x_fact <- as.factor(mydf$x)
#get contrast values
em <- emmeans(mod_8, ~scarification + time,
at = list(time = c(0, 2, 4, 8)),
trans = "response") %>%
contrast(interaction = c("pairwise", "pairwise"),
by = "time")
#make a grouped bar chart with scarification (group) on the x axis, predicted on the y axis, and grouped by the factor version of time (x_fact)
ggplot(mydf, aes(x = group, y = predicted, fill = x_fact)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), position = position_dodge(width = 0.9)) +
labs(x = "Scarification", y = "Predicted Germination Proportion", fill = "Time") +
ggtitle("Grouped Bar Chart of Germination by Scarification and Time")
If anyone has any ideas I would appreciate it.
For a class, I'm doing hypothesis testing with Bayesian models, and I want to make a fancy graphic with ggplot showing the two hypothesis regions in two different colours.
Normal distribution
I would like to fill region H1 with a different colour from region H0.
My code is:
#Param of normal distribution
param1 <- 1.74
param2 <- 0.000617
#Normal simulation
sim_posteriori <- data.frame(rnorm(1000, param1, sqrt(param2)), rep('Posteriori', 1000))
names(sim_posteriori) <- c('Datos', 'Grupo')
#Hypothesis contrast
# P(H0) -> mu <= 1.75
pnorm(1.75, param1, sqrt(param2))
# P(H1) -> mu > 1.75
1 - pnorm(1.75, param1, sqrt(param2))
#Plot
sim_posteriori %>% ggplot(aes(Datos)) +
stat_ecdf(fill = '#F2C14E95', geom = 'density') +
geom_vline(aes(xintercept = 1.75), lty = 2, size = 1) +
labs(title = 'Distribución posteriori y acumulada') +
xlab('Altura(en metros)') +
ylab('Densidad') +
theme_minimal() +
annotate('text', x = 1.735, y = 0.25, label = 'Región H1') +
annotate('text', x = 1.79, y = 0.25, label = 'Región H0')
If you find yourself wondering how to get ggplot to do a complex manipulation of your data with its various stat_ functions, you're probably approaching your problem in the wrong way. These functions exist to make it easy to carry out common simple transformations, but we need to remember that ggplot is a tool for plotting, not for wrangling data, so if the stat_ functions aren't quite what you are looking for, it's normally best to just prepare the data you actually want to plot, then plot it.
In this case it is pretty trivial to create your own ECDF in a data frame outside of ggplot, label which parts of it are above and below your threshold, then use geom_area to plot it:
h <- sort(sim_posteriori$Datos)
df <- data.frame(x = h, y = seq_along(h)/length(h), region = h > 1.75)
ggplot(df, aes(x, y, fill = region)) +
geom_area() +
geom_vline(aes(xintercept = 1.75), lty = 2, size = 1) +
scale_fill_manual(values = c('#F2C14E95', '#C14E4295'), guide = "none") +
labs(title = 'Distribución posteriori y acumulada',
x = 'Altura(en metros)', y = 'Densidad') +
theme_minimal() +
annotate('text', x = 1.735, y = 0.25, label = 'Región H1') +
annotate('text', x = 1.79, y = 0.25, label = 'Región H0')
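As a quick sanity check on the hand-built ECDF, the base R sketch below rebuilds the same data frame and compares it with R's own ecdf(). The seed and the name datos are assumptions added here only so the example is reproducible on its own.

```r
# Same parameters as in the question; the seed is an assumption for reproducibility
set.seed(42)
param1 <- 1.74
param2 <- 0.000617
datos <- rnorm(1000, param1, sqrt(param2))

# Hand-built ECDF, as in the answer above
h <- sort(datos)
df <- data.frame(x = h, y = seq_along(h) / length(h), region = h > 1.75)

# The manual cumulative proportions should agree with stats::ecdf()
manual_matches_ecdf <- isTRUE(all.equal(df$y, ecdf(datos)(h)))
print(manual_matches_ecdf)

# Share of simulated draws above the threshold versus the exact tail probability
simulated_h1 <- mean(df$region)
exact_h1 <- 1 - pnorm(1.75, param1, sqrt(param2))
print(c(simulated = simulated_h1, exact = exact_h1))
```

The region column is TRUE exactly where the draw exceeds 1.75, so the simulated share should sit close to the 1 - pnorm(...) value computed in the question.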
I am trying to plot the relative frequency of 1D data from 3 clusters. What I want is a single histogram that uses color to distinguish between the 3 clusters, and I want the height of each bin to represent the relative frequency of that value range for a particular cluster.
The code is as follows:
library(mvtnorm)
library(gtools)
library(ggplot2)
K = 3 # number of clusters
p_p = c(0.25, 0.25, 0.5) # population weights
theta_p = c(2, 5, 15) # population gamma params - shape
phi_p = c(2,2, 5) # population gamma params - scale
N_p = c(25, 25, 50) # sample size within each cluster
set.seed(1) # set seed so that the results are the same each time
y <- numeric()
## We will now sample data from all three clusters
y[1:N_p[1]] <- rgamma(N_p[1], theta_p[1], phi_p[1])
y[(N_p[1]+1): (N_p[1]+N_p[2])] <- rgamma(N_p[2], theta_p[2], phi_p[2])
y[(N_p[1]+N_p[2]+1): sum(N_p)] <- rgamma(N_p[3], theta_p[3], phi_p[3])
Data = data.frame(y = y, source = as.factor(c(rep(1,25), rep(2,25), rep(3,50))))
ggplot(Data, aes(x=y, color = source))+
geom_histogram(aes(y=..count../sum(..count..)),fill="white", position="dodge", binwidth = 0.5) +
theme(legend.position="top")+labs(title="Samples against Theoretical Dist",y="Frequency", x="Sample Value")
length(which(y[1:25]<=0.5))/length(y)
length(which(y[1:25]<=0.5))/length(y[0:25])
Now, what I want is for the first red histogram bar to have a height equal to length(which(y[1:25]<=0.5))/length(y[0:25]). I would understand if I was getting length(which(y[1:25]<=0.5))/length(y) instead, and I could work around that.
However, I'm getting a height of around 0.12, which doesn't match either of these values and has me thinking I am completely misunderstanding ..count.. and sum(..count..).
The issue isn't your understanding of ..count.. but your assumption about how binwidth works. You have assumed that setting it to 0.5 will put the breaks at 0, 0.5, 1, 1.5 etc., but in fact it starts them at the lowest value of the range of your data. So the height of your first bar is length(which(y[1:25] <= (min(y) + 0.5)))/length(y), which is 0.13.
You can specify breaks in geom_histogram to work round this limitation:
ggplot(Data, aes(x = y, color = source)) +
geom_histogram(aes(y = stat(count)/length(y)), fill = "white",
position = "dodge", breaks = seq(0, 6, 0.5)) +
theme(legend.position = "top") +
labs(title = "Samples against Theoretical Dist",
y = "Frequency", x = "Sample Value")
Now each bar is 1/100th of the count since the vector is 100 long.
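If what you are after is the within-cluster relative frequency (dividing by 25, 25 and 50 rather than by 100), one option is to do the binning yourself before plotting. This is only a sketch in base R: the names breaks, bins, counts, rel_within and rel_overall are made up for illustration, and the upper break is rounded up from the data rather than fixed at 6.

```r
# Recreate the sample from the question
set.seed(1)
N_p <- c(25, 25, 50)       # sample size within each cluster
theta_p <- c(2, 5, 15)     # second argument passed to rgamma (shape)
phi_p <- c(2, 2, 5)        # third argument passed to rgamma (rate when positional)
y <- c(rgamma(N_p[1], theta_p[1], phi_p[1]),
       rgamma(N_p[2], theta_p[2], phi_p[2]),
       rgamma(N_p[3], theta_p[3], phi_p[3]))
Data <- data.frame(y = y, source = factor(rep(1:3, times = N_p)))

# Breaks anchored at 0 in steps of 0.5, extended far enough to cover max(y)
breaks <- seq(0, ceiling(max(Data$y) * 2) / 2, by = 0.5)
bins <- cut(Data$y, breaks = breaks, include.lowest = TRUE)

# Counts per bin (rows) and cluster (columns)
counts <- table(bins, Data$source)

# Relative frequency within each cluster: every column sums to 1
rel_within <- prop.table(counts, margin = 2)

# Relative frequency over the whole sample: all cells together sum to 1
rel_overall <- counts / sum(counts)

# First bin of cluster 1 under both normalisations
print(c(within = rel_within[1, "1"], overall = rel_overall[1, "1"]))
```

The rel_within table could then be converted with as.data.frame() and drawn with geom_col(position = "dodge") instead of geom_histogram().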
I have a dataset with 29 columns and 2500 rows resulting from a test. Three columns need to be represented on a plot: the first two are simple X,Y coordinate pairs representing actual X,Y positions on an image used in the test, and the third is a response from the participants giving a simple yes or no answer (recorded as 1 and -1 respectively).
Each X,Y coordinate was used name times in the test, and I'm trying to get an overall bias for each point. The values can be found by a simple sum of the Y,N answers. My problem is that I can't plot the "sum" of the answers, only the density of the yes and no separately. I need to show the bias towards yes and no overall for each point, so having two plots or simply plotting the two sets of results on the same plot is of little value.
In the code I'm using the X value is audioDim1a and the Y value is audioDim2. There are 2 DFs used which have been reduced - one to include all the Y answers and the other all the N answers.
this code uses the two N & Y data frames
ggplot() +
xlim(0, 110) + ylim(0, 150) +
stat_density_2d(data = test_plot_N,
  aes(audioDim1a, audioDim2, alpha="density", fill = "density"),
  geom = "polygon", size = 0.2, contour = T, n = 150, h = 20,
  bins = 10, colour = "purple") +
stat_density_2d(data = test_plot_Y,
  aes(audioDim1a, audioDim2, alpha="density", fill = "density"),
  geom = "polygon", size = 0.2, contour = T, n = 150, h = 20,
  bins = 10, colour = "green") +
geom_point(data = test_plot_N, aes(audioDim1a, audioDim2),
  colour = "blue", size = 1)
If I use a dataset (see below) with the Y and N combined I hoped to get the situation where if the number of Y and N answers was equal the density would result in a 0 plot and thus the contour fill would be clear/white. This does not happen as it seems to simply show a count of responses rather than an arithmetic sum.
ggplot() +
xlim(0, 110) + ylim(0, 150) +
stat_density_2d(data = test_plot,
  aes(audioDim1a, audioDim2, alpha="density", fill = "density"),
  geom = "polygon", size = 0.2, contour = T, n = 150, h = 20,
  bins = 10, colour = "purple") +
geom_point(data = test_plot_N, aes(audioDim1a, audioDim2), colour = "blue", size = 1)
Do I need to supply the data set and the full R code I'm using?
Any help would be really appreciated.
I have a dataset with 29 columns and 2500 rows resulting from a test. Three columns need to be represented on a plot, the first two are simple X, Y coordinate pairs representing actual X, Y positions on an image used in the test, the third is a response from the participants giving a simple yes or no answer (recorded as 1 and -1 respectively).
Each X, Y coordinate was used name times in the test, and I'm trying to get an overall bias for each point. The values can be found by a simple sum of the Y, N answers. My problem is that I can't plot the "sum" of the answers, only the density of the yes and no separately. I need to show the bias towards yes and no overall for each point, so having two plots or simply plotting the two sets of results on the same plot is of little value.
This is a suitable reproducible bit of code that has the same characteristics... Much of my data seems in favour of the Y answer, but I want to 'see' whether this is related to the X, Y coordinate position (distance away from a separate fixed point).
the code below does a reasonable job of simulating the three main columns of interest...
my.data <- data.frame(x = (rep((1:5), 5)),
y = (rep((1:5), 5)),
z = sample(c(rep(-1, 10), rep(1, 40))))
ggplot(data = my.data, aes(x = x, y = y, z = z)) +
xlim(-1, 6) + ylim(-1, 6) +
stat_density_2d(aes(alpha = "density", fill = "density"),
geom = "polygon", size = 0.2, contour = T,
n = 300, h = 1.5, bins = 20, colour = "purple") +
geom_point(data = my.data, aes(x, y), colour = "blue", size = 1)
Any help would be really appreciated - I've been stuck on this for 3 weeks now.
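For reference, the "simple sum of the Y, N answers" per point described above can be computed before any plotting. The sketch below is an assumption-laden illustration, not a tested solution for the real data: it swaps the simulated x/y columns for a full 5 x 5 grid (via expand.grid) so that every coordinate pair occurs twice, fixes a seed, and uses the made-up name bias for the result.

```r
# Simulated stand-in for the real data (assumption: a full 5 x 5 grid,
# each point used twice, with 10 "no" (-1) and 40 "yes" (1) answers)
set.seed(123)
grid <- expand.grid(x = 1:5, y = 1:5)
my.data <- grid[rep(seq_len(nrow(grid)), times = 2), ]
my.data$z <- sample(c(rep(-1, 10), rep(1, 40)))

# Sum the answers for every x,y pair: positive means a bias towards "yes",
# negative a bias towards "no", and 0 means the answers cancel out
bias <- aggregate(z ~ x + y, data = my.data, FUN = sum)

print(head(bias))
```

A data frame like bias could then be handed to ggplot with something like geom_tile(aes(x, y, fill = z)) and scale_fill_gradient2(), whose default midpoint of 0 would leave balanced points white.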