make barplot of groups with error bars - r

I would like to make a barplot of the columns V2 and length, and also plot the standard deviation of the length values for each group.
> head(Length_filter3)
V1 V2 V3 length
1 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGG 30
2 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGG 31
3 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 35
4 URS0000008112A tRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACTG 36
5 URS000000812A tRNA AAATGTGGGAAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 44
6 URS0000008121EA tRNA AACTCGACTGCATAATTTGTGGTAGTGGG 29
ggplot(Length_filter3, aes(V2,length)) + geom_bar(stat="identity")

I am assuming that you are looking to create some sort of summary statistic, such as average, rather than trying to plot the total length of all of the RNA types (for which there would be no error bar to speak of).
If it has to be a bar plot, you will likely need to calculate the values yourself. Here, I am manually calculating the ranges I want from the iris data (using dplyr):
library(dplyr)
library(ggplot2)

summarizedData <-
  iris %>%
  group_by(Species) %>%
  summarise(
    mean = mean(Petal.Length)
    , sd = sd(Petal.Length)
    , low = mean + sd / sqrt(n()) * qt(0.025, n() - 1)
    , high = mean + sd / sqrt(n()) * qt(0.975, n() - 1)
  )
ggplot(
  summarizedData
  , aes(x = Species
        , y = mean
        , ymax = high
        , ymin = low)
) +
  geom_bar(stat = "identity") +
  geom_linerange()
Alternatively, you can let ggplot do the work for you, particularly if you are willing to use points and error bars instead of a bar plot (I tend to prefer it this way):
ggplot(
  iris
  , aes(x = Species
        , y = Petal.Length)
) +
  # mean_cl_normal computes a normal-approximation confidence interval
  # and requires the Hmisc package to be installed
  stat_summary(fun.data = mean_cl_normal)
You can combine these approaches if you like as well.
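For instance, a minimal sketch of that combination, keeping the bars while letting stat_summary compute both the bar heights and the error bars (again, mean_cl_normal relies on Hmisc being installed; in ggplot2 versions before 3.3 the fun argument was called fun.y):
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  # bar height = group mean
  stat_summary(fun = mean, geom = "bar") +
  # whiskers = normal-approximation 95% confidence interval
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.3)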

Try a summary function in ggplot, e.g. stat_summary(), choosing the standard deviation.
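A minimal sketch of that idea, assuming mean bars with whiskers of one standard deviation via mean_sdl (which also relies on Hmisc and returns the mean plus/minus mult times the SD; mult defaults to 2):
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  stat_summary(fun = mean, geom = "bar") +
  # mult = 1 gives mean +/- 1 SD instead of the default 2 SD
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.3)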

Related

How to obtain size of cluster of pixels in R

I have a picture with two colors. The red pixels form clusters. I would like to know the maximum dimension of each cluster, to compare it with an acceptable tolerance. How can I do this? Is there a function to perform it?
For this kind of image analysis, you can check out EBImage:
install.packages("BiocManager")
BiocManager::install("EBImage")
Your workflow might look something like this. First, load the packages and read in your image. We'll also display it to show we're on the right track:
library(EBImage)
library(ggplot2)
dots <- readImage("https://i.stack.imgur.com/3RU7u.png")
display(dots, method = "raster")
Now we can use the computeFeatures functions to get the centroids and maximum diameter of each cluster:
dots_bw <- getFrame(dots, 1)
labelled_dots <- bwlabel(dots_bw)
df <- as.data.frame(cbind(computeFeatures.moment(labelled_dots)[, 1:2],
                          computeFeatures.shape(labelled_dots)[, 5:6]))
df
#> m.cx m.cy s.radius.min s.radius.max
#> 1 65.73316 25.69588 11.095535 40.69698
#> 2 156.24181 129.77241 19.377341 33.83485
#> 3 483.60853 155.23006 9.419478 16.28808
#> 4 277.21467 409.62152 20.411710 28.77508
#> 5 397.36817 607.47749 8.424518 18.53617
#> 6 224.93790 623.28266 8.530353 15.26678
Now we want to find out which dimension matches which blob, so let's plot the raster in ggplot, and write the maximum pixel dimension above each blob.
img_df <- reshape2::melt(as.matrix(as.raster(as.array(dots))))
ggplot(img_df, aes(Var1, Var2, fill = value)) +
  geom_raster() +
  scale_fill_identity() +
  scale_y_reverse() +
  geom_text(inherit.aes = FALSE, data = df, color = "white",
            aes(x = m.cx, y = m.cy, label = round(s.radius.max, 1))) +
  coord_equal()
If you would rather have the total number of pixels than the maximum diameter in pixels, you can also get this from computeFeatures.
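A minimal sketch, reusing the labelled image from above (in computeFeatures.shape's output, the s.area column holds the pixel count of each labelled object):
# s.area = number of pixels in each labelled cluster
areas <- computeFeatures.shape(labelled_dots)[, "s.area"]
areas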

ggplot multiline by rows and columns with legend + sd

I have two data frames.
data_mean:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.04548814 0.03679184 0.04804329 0.01774598 0.1873583 0.03798713 0.02515220 0.04006423 0.2460139
season_2 0.07529072 0.08048696 0.06041415 0.03461997 0.1473725 0.04371079 0.02451824 0.03729869 0.2325734
season_3 0.19094684 0.05399267 0.09108074 0.05326579 0.1107565 0.04996543 0.02944363 0.04424125 0.2061001
season_4 0.16393195 0.02916149 0.05763407 0.03526731 0.1212815 0.05466920 0.02825975 0.06150540 0.2232308
season_5 0.08309387 0.05862481 0.07578285 0.03620725 0.1433460 0.07038242 0.03102652 0.05434440 0.1553574
and data_SD:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.009414762 0.009205625 0.003816925 0.002548717 0.01475648 0.003631448 0.001490306 0.002462043 0.007583638
season_2 0.024247471 0.016402706 0.004980897 0.004745206 0.01393021 0.004178247 0.001597244 0.002553933 0.007538909
season_3 0.030626971 0.012865086 0.006913778 0.005980786 0.01313423 0.004754663 0.001875462 0.002952610 0.007434868
season_4 0.034040440 0.009705439 0.004927881 0.004330766 0.01350788 0.004751983 0.001753364 0.003384793 0.007406657
season_5 0.020016015 0.014591767 0.005815301 0.004232419 0.01499951 0.005411255 0.001875151 0.003234048 0.006308964
I want to draw one ggplot with the values on Y and the seasons on X, one line per territory (colored by territory), plus a shaded area between mean - SD and mean + SD. I would also like to draw a second ggplot, split by territory, again with the shaded area between mean - SD and mean + SD.
Thank you for your help!
The most effective way to work with ggplot is to reshape your data into long form, typically into a single data frame. Below, I use dplyr::left_join and tidyr::gather to transform and combine the two provided tables into the form that ggplot works best with.
library(tidyverse)
data_combined <-
  left_join(
    data_mean %>% gather(territory, mean, -season),
    data_SD %>% gather(territory, SD, -season)
  )
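Note that this assumes season is a regular column; if, as in the tables printed above, the seasons are stored as row names, convert them to a column first:
data_mean <- tibble::rownames_to_column(data_mean, "season")
data_SD <- tibble::rownames_to_column(data_SD, "season")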
Then it's pretty straightforward to assign the different columns to different aesthetics of the chart:
ggplot(data_combined,
       aes(x = season, y = mean, color = territory, group = territory)) +
  geom_ribbon(aes(ymin = mean - SD, ymax = mean + SD, fill = territory),
              alpha = 0.3, color = NA) +
  geom_line()
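For the second plot the asker describes (one panel per territory), a sketch under the same assumptions is to add a facet layer to the plot above:
# one small panel per territory, each with its own ribbon and line
last_plot() + facet_wrap(~ territory)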

Shading area between confidence interval with ggplot for a piece of time series

I have a time series of 190 observations. I am using the first 180 to run a Kalman filter and the last 10 to compare the forecasts of different models. I am trying to plot the last observations of my dataset, from 170 to 190, and then the forecasted level from 181 to 190 together with its confidence interval, using ggplot. So in total I have four series: the observations (21 elements), and the forecasts and the two confidence bounds (10 elements each).
These are the data
track_28_only_95bpm$bpm[170:190]
[1] 154.2031 150.0625 158.8750 153.0000 147.1250 148.1797 151.0000 150.0000
153.8125 155.0000
[11] 151.7375 157.1875 155.7500 160.2500 151.3906 149.0000 149.7500 155.1328
163.0000 162.7500
[21] 160.0000
f_compl[181:190]
152.3169 156.2046 155.8417 159.3604 152.9990 149.8070 149.7615 154.0488
161.1935 162.4359
upper_compl[181:190]
160.9422 164.8298 164.4670 167.9856 161.6243 158.4323 158.3868 162.6741
169.8188 171.0612
lower_compl[181:190]
143.6917 147.5793 147.2165 150.7351 144.3737 141.1818 141.1362 145.4235
152.5683 153.8106
Here is my code.
df_obs_compl = data.frame(time = seq(170, 190, length = 21),
                          M = track_28_only_95bpm$bpm[170:190],
                          legend = "observations")
df_f = data.frame(time = seq(181, 190, length = 10),
                  M = f_compl[181:190],
                  legend = "forecast")
df_u_compl = data.frame(time = seq(181, 190, length = 10),
                        M = upper_compl[181:190],
                        legend = "upper")
df_l_compl = data.frame(time = seq(181, 190, length = 10),
                        M = lower_compl[181:190],
                        legend = "lower")
df_compl = rbind(df_obs_compl, df_f, df_u_compl, df_l_compl)
ggplot(df_compl, aes(x = time, y = M, color = legend)) +
  geom_line() +
  scale_colour_manual(values = c(observations = 'black', forecast = 'red',
                                 upper = 'blue', lower = 'blue'))
With this I am able to achieve what I was saying, i.e. I plot 20 observations together with 10 forecasts and their confidence interval. Now, I'd like the confidence interval to be shaded, and I have tried the following piece of code.
ggplot(df_compl, aes(x = time, y = M)) +
  geom_line(colour = 'black') +
  geom_smooth(aes(x = time, y = M, ymax = upper, ymin = lower),
              colour = 'red')
However, I get the following error:
Error: Aesthetics must be either length 1 or the same as the data (51): x,
y, ymax, ymin
Any idea how I can fix this? Or any other method to obtain shaded confidence intervals for the forecasts?
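A sketch of one way out, assuming the vectors shown above: the error occurs because upper and lower are standalone vectors whose length does not match the 51 rows of the plotted data frame, and geom_smooth is meant for model fits anyway. Putting the forecast and its bounds into one wide data frame lets geom_ribbon draw the shading:
# wide data frame: one row per forecast time point
df_fc <- data.frame(time  = 181:190,
                    M     = f_compl[181:190],
                    upper = upper_compl[181:190],
                    lower = lower_compl[181:190])
ggplot(df_obs_compl, aes(x = time, y = M)) +
  geom_line(colour = "black") +
  # shaded confidence band for the forecast period
  geom_ribbon(data = df_fc, aes(ymin = lower, ymax = upper),
              fill = "blue", alpha = 0.2) +
  geom_line(data = df_fc, colour = "red")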

Find the y-coordinate at intersection of two curves when x is known

Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
The objective of this post is to determine how to find the y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price. The cumulative frequency for the bargain price will be inverted to become not.bargain. The intersection of bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap --- the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
  step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
  price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
  price.bargain <- price.too.cheap + step.to.bargain
  df.temp <- cbind(price.too.cheap,
                   price.bargain)
  df.temp <- as.data.frame(df.temp)
  return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
# adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
  cc.temp <- as.data.frame(table(p.vector))
  cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = TRUE)
  cc.extracted <- cc.merged[, "Freq"]
  cc.extracted[is.na(cc.extracted)] <- 0
  return(cc.extracted)
}
so.get.df.price <- function(df){
  # creates cumulative frequencies for three variables
  # using the price points provided by respondents
  # extract and sort all unique price points
  # Thanks to akrun for their help with this step
  price.point <- sort(unique(unlist(round(df, 2))))
  # create a new data frame to work with, having a row for each price point
  dfp <- as.data.frame(price.point)
  # create cumulative frequencies (as percentages) for each variable
  dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
  dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
  dfp$not.bargain.share <- 1 - dfp$bargain.share # bargain inverted so curves will intersect
  return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE) ~ so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point)
# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies:
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  geom_line(aes(y = too.cheap.share, colour = "too.cheap.share")) +
  geom_line(aes(y = not.bargain.share, colour = "not.bargain.share")) +
  scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
  # model for the curves
  1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3;
# I was building as I went and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) -
                                so.f(x, so.cof.NBr[1], so.cof.NBr[2]),
                    c(0, 50), tol = 0.01)$root
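As an aside (a sketch, not part of the original approach): because both curves are logistic in x, the intersection also has a closed form, since 1/(1+exp(-(a1+b1*x))) = 1/(1+exp(-(a2+b2*x))) exactly when a1 + b1*x = a2 + b2*x, giving x = (a2 - a1) / (b1 - b2):
# closed-form intersection of the two logistic curves, no root-finding needed
so.pmc.x.closed <- (so.cof.NBr[1] - so.cof.TCh[1]) / (so.cof.TCh[2] - so.cof.NBr[2])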
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see the solution below): a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
But the resulting curve is the one fitted by so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_hline(aes(yintercept = so.pmc.y))
This yields a plot in which the horizontal line passes through the intersection of the two curves, graphically depicting the correct answer.

bar plot grouped multi-column data with confidence intervals with ggplot2

I'm trying to create a bar plot of grouped multi-column data and to add confidence intervals to each bar. So far, I have done almost all tasks with the help of several entries in different blogs and platforms like Stack Overflow.
My data sgr_sum_v3 looks like this:
treatment mean_C16_0 sd_C16_0 mean_C18_0 sd_C18_0 mean_LIN sd_LIN mean_ALA sd_ALA
ALA 92500.0 1492.0 14406.7 1291.5 740.2 77.7 3399.2 436.4
ALA+ARA 71538.3 3159.0 14088.7 1101.0 582.3 91.5 2089.3 439.6
ALA+EPA 82324.6 2653.3 10745.2 1244.2 658.3 19.2 2629.3 134.7
ALA+EPA+LIN+ARA 68422.9 2097.2 10818.2 721.8 969.9 24.0 2154.0 124.5
ALA+LIN 87489.0 3150.6 15951.9 888.2 1173.0 279.1 2010.6 519.4
ARA 65571.7 2635.6 11174.7 1851.9 589.0 7.0 1640.9 163.7
control 107313.4 10828.0 22087.0 6217.7 783.8 38.6 2417.5 59.2
EPA 76621.3 1863.7 9947.7 156.4 654.6 31.0 1946.8 56.6
EPA+ARA 70312.3 2187.3 10896.8 148.6 716.8 24.4 2144.0 251.4
EPA+LIN 79388.5 4866.9 10080.4 613.3 1449.9 41.7 1862.9 235.4
LIN 87398.4 2213.9 11961.6 798.8 1909.3 100.2 1939.1 82.5
LIN+ARA 71437.1 1220.1 12612.0 1190.8 1134.6 333.6 1628.6 508.1
Scen 138102.2 22228.4 24893.0 1259.9 4259.4 612.0 23417.2 3946.5
Basically different treatments with mean values and standard deviations of some measured values.
To get the plot running I basically adapted the code from this post:
Creating grouped bar-plot of multi-column data in R
from joran for the multi-column problem, and the code from this post:
Grouped barplot in R with error bars
from Colonel Beauvel for the confidence intervals.
Here is my code:
library(reshape2)
dfm <- melt(sgr_sum_v3[, c('treatment', 'mean_ALA', 'mean_LIN')], id.vars = 1)
ggplot(data = dfm, aes(x = treatment, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = value - 1000, ymax = value + 1000),
                width = .2, position = position_dodge(.9))
Now my problem is that, with the multi-column problem solved by the melt function, I don't have my standard deviations available to build real error bars (so far I just inserted 1000 to see if it works).
Do you have suggestions on how to solve this, or even on how to get the multi-column plot running with the original data (without melting), which would make the CI problem pretty straightforward?
Thanks in advance for your help :)
Even though my question is already pretty old and was solved in the meantime, I will answer it in a more comprehensive way, as @dende85 recently asked for the complete code. The following code is not exactly based on the data above; I created it for a small R lecture for my students. However, I'm pretty sure this could be handled more easily.
So here's the answer:
First, I create two data sets: one for mean values and one for sd. In this case I only chose a subset of the treatments, using [1:4]:
my_bar_data_mean <- data.frame(treatment = levels(my_data$treatment)[1:4])
my_bar_data_sd <- data.frame(treatment = levels(my_data$treatment)[1:4])
Then I used aggregate() to calculate mean and sd for all groups for all (in this case 3) parameters of interest:
# BL
my_bar_data_mean$BL_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean,
                                      na.rm = TRUE)[, 8]
my_bar_data_sd$BL_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd,
                                  na.rm = TRUE)[, 8]
# BW
my_bar_data_mean$BW_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean,
                                      na.rm = TRUE)[, 9]
my_bar_data_sd$BW_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd,
                                  na.rm = TRUE)[, 9]
# SL
my_bar_data_mean$SL_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean,
                                      na.rm = TRUE)[, 10]
my_bar_data_sd$SL_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd,
                                  na.rm = TRUE)[, 10]
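As an aside, a more compact sketch of the same computation (the column names BL, BW, SL are hypothetical stand-ins for the columns indexed 8-10 above): aggregate's formula interface can handle all three parameters in one call per statistic:
# hypothetical column names; the formula interface drops NA rows by default
means <- aggregate(cbind(BL, BW, SL) ~ treatment, data = my_data, FUN = mean)
sds   <- aggregate(cbind(BL, BW, SL) ~ treatment, data = my_data, FUN = sd)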
Now, we need to reshape the data.frame. Therefore, we need some packages:
library(Hmisc)
library(car)
library(reshape2)
We create a new data.frame and reshape our data with the help of the melt()-function. Note that we still have two data.frames: one for mean and one for sd:
dfm <- melt(my_bar_data_mean)
temp <- melt(my_bar_data_sd)
Now we can see that our variables are gathered vertically. We just have to add the value column of the temp data.frame to the first data.frame as a new column called sd:
dfm$sd <- temp$value
Now, we just have to plot everything:
ggplot(dfm, aes(variable, value, fill = treatment)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_classic() +
  geom_errorbar(aes(ymin = value - sd, ymax = value + sd),
                width = 0.4, position = position_dodge(.9))
You can simply add the error bars using geom_errorbar and using the columns value and sd for min and max of your whiskers. Don't forget to set position = position_dodge(.9) for the error bars as well.
You can also change whether your response variables are plotted as dodged bars split by treatment, or vice versa, by simply exchanging variable and treatment in the first line (ggplot(aes())); see the sketch below.
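For instance (same data, just the aesthetics swapped):
# treatments on the x-axis, one dodged bar per response variable
ggplot(dfm, aes(treatment, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_classic() +
  geom_errorbar(aes(ymin = value - sd, ymax = value + sd),
                width = 0.4, position = position_dodge(.9))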
I hope this helps.
