Proper display of confidence interval in R using ggplot


I'm trying to make a plot that shows two measurements (prr and EBGM) for different adverse reactions of different drugs, grouped by age category, like so:
library(ggplot2)
strata <- factor(c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), levels=c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), order=T)
Data <- data.frame(
strata = sample(strata, 200, replace=T),
drug=sample(c("ibuprofen", "clarithromycin", "fluticasone"), 200, replace=T), # 20 drugs
reaction=sample(c("Liver Injury", "Sepsis", "Acute renal failure", "Anaphylaxis"), 200, replace=T),
measurement=sample(c("prr", "EBGM"), 200, replace=T),
value_measurement=sample(runif(16), 200, replace=T),
lower_CI=sample(runif(6), 200, replace=T),
upper_CI=sample(runif(5), 200, replace=T)
)
g <- ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement))+
geom_histogram(stat="identity", position="dodge")+
facet_wrap(~reaction)+
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI), position="dodge", stat="identity")
ggsave(file="meh.png", plot=g)
The upper and lower CI are the confidence interval limits of the measurement. Since each measurement has its own confidence interval, I want each bar to show its corresponding interval, but that is not what I get.
Any ideas how to place those nasty conf intervals properly? Thank you!
Later edit: in the original data, each drug has many rows, one per adverse reaction and age category, and each of these combinations has 2 measurements (prr or EBGM) with the corresponding confidence interval. This is not reflected in the data simulation.

The problem is that each of your bars is really multiple bars plotted over each other, because you have more than one row of data for each combination of reaction, strata, and measurement. (You're getting multiple error bars for the same reason.)
You can see this in the code below, where I've changed geom_histogram to geom_bar and added alpha=0.3 and colour="grey40" to show the multiple overlapping bars. I've also commented out the error bars.
ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement)) +
geom_bar(stat="identity", position="dodge", alpha=0.3, colour="grey40") +
facet_wrap(~reaction) #+
# geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI),
# position="dodge", stat="identity")
You can fix this by adding another column to your data that adds a grouping category by which you can separate these bars. For example, in the code below we add a new column called count that just assigns numbers 1 through n for each row of data within each combination of reaction and strata. We sort by measurement so that each measurement type will be kept together in the count sequence.
library(dplyr)
Data = Data %>% group_by(reaction, strata) %>%
arrange(measurement) %>%
mutate(count = 1:n())
Now plot the data:
ggplot(Data, aes(x=strata, y=value_measurement,
fill=measurement, group=count)) +
geom_bar(stat="identity", position=position_dodge(0.7), width=0.6) +
facet_wrap(~reaction, ncol=1) +
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI, group=count),
position=position_dodge(0.7), stat="identity", width=0.3)
Now you can see the separate bars, along with their error bars (which are weird, but only because they're fake data).
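The question's later edit says the real data is structured differently from the simulation. A minimal sketch, assuming the real data has exactly one row per combination of reaction, strata, and measurement: in that case no extra count column is needed, and sharing one position_dodge() between the bars and the error bars lines them up. The names one_row and dodge are made up for illustration, and the values are simulated.

```r
library(ggplot2)

set.seed(1)
strata_levels <- c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics")

# Assumption: one row per reaction x strata x measurement
one_row <- expand.grid(
  strata = factor(strata_levels, levels = strata_levels),
  reaction = c("Liver Injury", "Sepsis", "Acute renal failure", "Anaphylaxis"),
  measurement = c("prr", "EBGM"),
  stringsAsFactors = FALSE
)

# Simulated values, with the interval limits placed on either side of the estimate
one_row$value_measurement <- runif(nrow(one_row), 1, 2)
one_row$lower_CI <- one_row$value_measurement - runif(nrow(one_row), 0.1, 0.5)
one_row$upper_CI <- one_row$value_measurement + runif(nrow(one_row), 0.1, 0.5)

# Use the same dodge width for both layers so each error bar sits on its own bar
dodge <- position_dodge(width = 0.9)

p <- ggplot(one_row, aes(x = strata, y = value_measurement, fill = measurement)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = lower_CI, ymax = upper_CI), position = dodge, width = 0.3) +
  facet_wrap(~reaction)
p
```

If the real data does contain several rows per combination (for example one per drug), faceting or filtering by drug first would be one way to get back to a single row per bar.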

Related

Subsetting data for ggplot2

I have data saved in multiple datasets, each consisting of four variables. Imagine something like a data.table dt consisting of the variables Country, Male/Female, Birthyear, Weighted Average Income. I would like to create a graph where you see only one country's weighted average income by birthyear and split by male/female. I've used the facet_grid() function to get a grid of graphs for all countries as below.
ggplot() +
geom_line(data = dt,
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)
However, I've tried isolating the graphs for just one country, but the below code doesn't seem to work. How can I subset the data correctly?
ggplot() +
geom_line(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)

For your specific case, the problem is that you are not quoting Male/Female and Weighted Average Income. Also, your data and basic aesthetics should likely go in ggplot() rather than geom_line(). Putting them in geom_line() isolates them to that single layer, so you would have to repeat them in every layer of your plot if you were to add, for example, geom_smooth.
So to fix your problem you could do
library(tidyverse)
plot <- ggplot(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = `Weighted Average Income`,
col = `Weighted Average Income`)) + # backticks quote names containing spaces; !!sym("x") also works
geom_line() +
facet_grid(Country ~ `Male/Female`) # backticks again, since the name contains "/"
plot
Now ggplot2 actually has a (lesser known) builtin functionality for changing your data, so if you wanted to compare this to the plot with all of your countries included you could do:
plot %+% dt # `%+%` is used to change the data used by one or more layers. See help("+.gg")
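As a self-contained check of the quoting fix, here is a minimal sketch on made-up data. It uses a plain data.frame rather than a data.table, so the subset is written with dt$Country; the income values are invented purely for illustration.

```r
library(ggplot2)

# Hypothetical data; check.names = FALSE keeps the column names with spaces and "/"
dt <- data.frame(
  Country = rep(c("Germany", "France"), each = 6),
  `Male/Female` = rep(rep(c("Male", "Female"), each = 3), times = 2),
  Birthyear = rep(1980:1982, times = 4),
  `Weighted Average Income` = seq(30000, 41000, by = 1000),
  check.names = FALSE
)

# Keep only the rows for one country
germany <- dt[dt$Country == "Germany", ]

# Backticks let aes() and facet_grid() refer to the non-syntactic column names
p <- ggplot(germany, aes(x = Birthyear, y = `Weighted Average Income`)) +
  geom_line() +
  facet_grid(Country ~ `Male/Female`)
p
```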

I have 7 different data points for virus concentration collected at 3 different time points. How do I graph this with error bars in R?

I collected seven different samples containing varying concentrations of the Kunjin Virus.
3 samples are from the 24 hour time point: 667, 1330, 1670
2 samples are from the 48 hour time point: 323000, 590000
2 samples are from the 72 hour time point: 3430000, 4670000
How do I create a dotplot reflecting this data including error bars in R? I'm using ggplot2.
My code so far is:
data1 <-data.frame(hours, titer)
ggplot(data1, aes(x=hours, y=titer, colour = hours)) + geom_point()

I would suggest the following approach. If you want error bars, you can compute them from the mean and standard deviation of each group, as sketched in the code below. I have used one standard deviation, but you can use any other multiple. Since you want to see the different samples, I have also used facet_wrap(). Here is the code:
library(ggplot2)
library(dplyr)
#Data
df <- data.frame(sample=c(rep('24 hour',3),rep('48 hour',2),rep('72 hour',2)),
titer=c(667, 1330, 1670,323000, 590000,3430000, 4670000),
stringsAsFactors = F)
#Compute error bars
df <- df %>% group_by(sample) %>% mutate(Mean=mean(titer),SD=sd(titer))
#Plot
ggplot(df,aes(x=sample,y=titer,color=sample,group=sample))+
geom_errorbar(aes(ymin=Mean-SD,ymax=Mean+SD),color='black')+
geom_point()+
scale_y_continuous(labels = scales::comma)+
facet_wrap(.~sample,scales='free')
If you want a common y-axis scale across the panels, you can try this:
#Plot 2
ggplot(df,aes(x=sample,y=titer,color=sample,group=sample))+
geom_errorbar(aes(ymin=Mean-SD,ymax=Mean+SD),color='black')+
geom_point()+
scale_y_continuous(labels = scales::comma)+
facet_wrap(.~sample,scales = 'free_x')
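The mutate() call above repeats the mean and SD on every row, so each error bar is drawn once per data point. A possible variant, sketched here with summarise() instead, keeps one row per time point and draws each error bar once. The name summary_df is made up for this example.

```r
library(ggplot2)
library(dplyr)

df <- data.frame(
  sample = c(rep("24 hour", 3), rep("48 hour", 2), rep("72 hour", 2)),
  titer = c(667, 1330, 1670, 323000, 590000, 3430000, 4670000)
)

# One row per time point with its mean, standard deviation and sample size
summary_df <- df %>%
  group_by(sample) %>%
  summarise(Mean = mean(titer), SD = sd(titer), n = n())
print(summary_df)

# Raw points from df, error bars (mean +/- 1 SD) from the summary table
p <- ggplot(df, aes(x = sample, y = titer, colour = sample)) +
  geom_errorbar(data = summary_df,
                aes(x = sample, ymin = Mean - SD, ymax = Mean + SD),
                inherit.aes = FALSE, width = 0.2) +
  geom_point() +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(. ~ sample, scales = "free")
p
```

With only two or three samples per time point, the standard deviation is a rough estimate, so the bars are best read as a visual guide.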

R ggplot Histogram group shows sum of two groups

I tried to plot the distribution of my test and train data set in a histogram and found something curious:
Background:
I have a test set with 50 rows and a training set with 100 rows each with the same column structure.
I'd normally plot the data like that:
plot2 <- ggplot(data=Donald_1) +
geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
bins=20, alpha=0.7)
which results in the right histogram shown below. I then wondered how it could be that test has a higher count than training as the test set is only 50 rows instead of 100. And it seems as if the test bars show the sum of the test and training bars of the left plot.
Then I tried:
plot1 <- ggplot() +
geom_histogram(data=Donald_1 %>% filter(Group == "Training"),
aes_string(x="Alter", y="..count..", fill = "Group"),
bins=20, alpha=0.7) +
geom_histogram(data=Donald_1 %>% filter(Group == "Test"),
aes_string(x="Alter", y="..count..", fill="Group"),
bins=20, alpha=0.7)
which results in the left plot shown below, and that result makes more sense to me.
I now wonder, why the first attempt doesn't result in the same plot as the second attempt. Am I missing something obvious here?

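In your data frame, the column "Group" contains both values, Training and Test.
With fill mapped to Group, ggplot draws one histogram with two groups, and geom_histogram stacks the groups by default (position="stack"). The Test bars therefore sit on top of the Training bars, which is why they appear to show the sum of both.
Your second plot draws two separate histograms on the same panel, each starting at zero, and the transparency (alpha) lets you see where they overlap.
You may prefer this version, which places the bars side by side: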
plot3 <- ggplot(data=Donald_1) +
geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
bins=20, alpha=0.7, position="dodge")
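To see the difference between stacking and overlaying in numbers, here is a minimal sketch on simulated data. The Donald_1 data frame below is invented (100 training rows, 50 test rows), and layer_data() is used to inspect what ggplot actually draws.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the real data: 100 training rows and 50 test rows
Donald_1 <- data.frame(
  Alter = c(rnorm(100, 40, 10), rnorm(50, 40, 10)),
  Group = rep(c("Training", "Test"), c(100, 50))
)

# Default behaviour: the groups are stacked on top of each other
p_stack <- ggplot(Donald_1, aes(x = Alter, fill = Group)) +
  geom_histogram(bins = 20, alpha = 0.7)

# position = "identity": every bar starts at zero, so the groups overlap
p_overlay <- ggplot(Donald_1, aes(x = Alter, fill = Group)) +
  geom_histogram(bins = 20, alpha = 0.7, position = "identity")

stacked <- layer_data(p_stack)
overlaid <- layer_data(p_overlay)

# In the stacked plot some bars start above zero (they sit on the other group)
any(stacked$ymin > 0)
# In the overlaid plot every bar starts at zero and its height is the raw count
all(overlaid$ymin == 0)
```

Using position = "identity" in a single geom_histogram call should reproduce the two-layer plot from the question without filtering the data twice.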

ANOVA significance visualisation of replicate experiment data (ggplot)

I am struggling to get significance values for my replicate experiment data. The experiment was done in duplicate for each species, and I want to compare how significant the differences between species are at each time point. I am trying to do a two-way ANOVA...
library(ggplot2)
library(reshape)
library(dplyr)
library(tidyr) # provides separate()
abs2.melt<-melt(abs2,
id.vars='Time',
measure.vars=c('WT','WT.1','DsigB','DsigB.1','DrsbR','DrsbR.1'))
print(abs2.melt)
abs2.melt.mod<-abs2.melt %>%
separate(col=variable,into=c('Species'),sep='\\.')
print(abs2.melt.mod)
ggplot(abs2.melt.mod,aes(x=Time,y=value,group=Species))+
stat_summary(
fun =mean,
geom="line",
aes(color=Species))+
stat_summary(
fun=mean,
geom="point")+
stat_summary(
fun.data=mean_cl_boot,
geom='errorbar',
width=2)+
theme_bw()+
xlab("Time")+
ylab("OD600")+
labs(title="Growth Curve of Mutant Strains")
summary(abs2.melt.mod)
print(abs2.melt.mod)
###SD and mean values
as.data.frame<-abs2.melt.mod %>% group_by(Species,Time) %>%
summarize(mean.val=mean(value), sd.val=sd(value))
anova1<-aov(value~Species,data=abs2.melt.mod)
##statistical significance?
print(as.data.frame)
anova1<-aov(Time~Species+value,data=abs2.melt.mod)
summary(anova1)

Simulate something that looks like your data
set.seed(111)
df = expand.grid(rep=1:3,Time=1:5,Species=letters[1:3])
df$value = 0.5*df$Time + rnorm(nrow(df))
df$Time = factor(df$Time)
Then we plot, allowing comparison for each time point:
library(ggplot2) # mean_sdl also needs the Hmisc package installed
ggplot(df,aes(x=Time,y=value,col=Species)) +
stat_summary(fun.data="mean_sdl",position=position_dodge(width=0.5))
Or use error bars, which I think look worse:
ggplot(df,aes(x=Time,y=value,col=Species))+
stat_summary(fun.data="mean_sdl",position=position_dodge(width=0.5),
geom="errorbar",width=0.4)
Since you have only a few data points, there is no point in drawing a boxplot, so you can try something like the above.
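For the two-way ANOVA itself, a minimal sketch on the same simulated data might look like the following. It assumes value is the response and that Time is treated as a factor; whether that suits the real growth-curve data is a judgment call, and with only duplicate measurements the test will have little power.

```r
# Same simulated data as above
set.seed(111)
df <- expand.grid(rep = 1:3, Time = 1:5, Species = letters[1:3])
df$value <- 0.5 * df$Time + rnorm(nrow(df))
df$Time <- factor(df$Time)

# Two-way ANOVA: value is the response; Species, Time and their interaction are the terms
fit <- aov(value ~ Species * Time, data = df)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Pairwise comparisons between species (Tukey's honest significant differences)
tukey <- TukeyHSD(fit, which = "Species")
print(tukey)
```

Note that the response goes on the left of the formula, so value ~ Species * Time rather than Time ~ Species + value.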

How to split x-axis as decile in R and make ggplot

Hi I am wondering how to split x-axis as decile in R and make ggplot?
I currently have age range data and NO2 pollution data. The two datasets share the same geographic reference, named ward. I wish to plot my demographic data in quantiles containing equal numbers of wards (298 wards in total).
I tried the quantile regression in R where I used the following:
library(SparseM)
library(quantreg)
mydata<- read.csv("M:/Desktop10/Test2.csv")
attach(mydata)
Y <- cbind(NO2.value)
X <- cbind(age.0.to.4, age..5.to.9, age.10.to.14, age.15.to.19, age.20.to.24, age.25.to.29, age.30.to.44, age.45.to.59, age.60.to.64, age.65.to.74, age.75.to.84, age.85.to.89, age.above.90)
quantreg.all <- rq(Y ~ X, tau = seq(0.05, 0.95, by = 0.05), data=mydata)
quantreg.plot <- summary(quantreg.all)
plot(quantreg.plot)
But what I get is not what I expected, as the y-axis is not the NO2 data.
The ideal plot is attached:
Many thanks for your help and suggestions.

If I understand your question, I think the cut function combined with the quantile function will create the deciles. Here's an example with fake data.
In the code below, we use the cut function to split the data into deciles and we use the quantile function to set the breaks argument for cut. This tells cut to group the data into 10 groups of equal size, from smallest values of NO2 to largest.
group_by(age) means we create the deciles separately for each age group. This means that there are equal numbers of subjects within each decile in a given age group, but the NO2 cutoff values for each decile are different for different age groups. To create deciles over the data as a whole, just remove group_by(age). This will result in the same NO2 cutoff values for each decile across all age groups, but within a given age group, the number of subjects will not be the same in each decile.
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(NO2=c(runif(600, 0, 10), runif(400, 1, 11)),
age=rep(c("0-10","11-20"), c(600,400)))
# Create decile groups
dat = dat %>%
group_by(age) %>%
mutate(decile = cut(NO2, breaks=quantile(NO2, probs=seq(0,1,0.1)),
labels=10:1, include.lowest=TRUE),
decile = fct_rev(decile))
Now we plot using ggplot2. The stat_summary function returns the mean for each decile in each age group.
ggplot(dat, aes(decile, NO2, colour=age, group=age)) +
stat_summary(fun=mean, geom="line") + # fun replaces fun.y, deprecated since ggplot2 3.3.0
stat_summary(fun=mean, geom="point") +
expand_limits(y=0) +
theme_bw()
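To confirm that cut() combined with quantile() really produces groups of equal size within each age group, the deciles can be counted. This is a small sketch on the same fake data; for simplicity it labels the deciles 1 to 10 from lowest to highest NO2, and the name counts is made up for the example.

```r
library(dplyr)

# Same fake data as above
set.seed(2)
dat <- data.frame(NO2 = c(runif(600, 0, 10), runif(400, 1, 11)),
                  age = rep(c("0-10", "11-20"), c(600, 400)))

# Deciles computed separately within each age group
dat <- dat %>%
  group_by(age) %>%
  mutate(decile = cut(NO2, breaks = quantile(NO2, probs = seq(0, 1, 0.1)),
                      labels = 1:10, include.lowest = TRUE)) %>%
  ungroup()

# Number of observations per decile within each age group
counts <- dat %>% count(age, decile)
print(counts, n = 20)
```

With 600 and 400 observations and no tied values, each decile should hold 60 and 40 observations respectively. Removing group_by(age) would give shared cutoffs but unequal counts within each age group, as described above.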
