bar plot grouped multi-column data with confidence intervals with ggplot2 - r

I'm trying to create a bar plot of grouped multi-column data and to add confidence intervals to each bar. So far, I have managed almost all of it with the help of several entries on different blogs and platforms like Stack Overflow.
My data sgr_sum_v3 looks like this:
treatment mean_C16_0 sd_C16_0 mean_C18_0 sd_C18_0 mean_LIN sd_LIN mean_ALA sd_ALA
ALA 92500.0 1492.0 14406.7 1291.5 740.2 77.7 3399.2 436.4
ALA+ARA 71538.3 3159.0 14088.7 1101.0 582.3 91.5 2089.3 439.6
ALA+EPA 82324.6 2653.3 10745.2 1244.2 658.3 19.2 2629.3 134.7
ALA+EPA+LIN+ARA 68422.9 2097.2 10818.2 721.8 969.9 24.0 2154.0 124.5
ALA+LIN 87489.0 3150.6 15951.9 888.2 1173.0 279.1 2010.6 519.4
ARA 65571.7 2635.6 11174.7 1851.9 589.0 7.0 1640.9 163.7
control 107313.4 10828.0 22087.0 6217.7 783.8 38.6 2417.5 59.2
EPA 76621.3 1863.7 9947.7 156.4 654.6 31.0 1946.8 56.6
EPA+ARA 70312.3 2187.3 10896.8 148.6 716.8 24.4 2144.0 251.4
EPA+LIN 79388.5 4866.9 10080.4 613.3 1449.9 41.7 1862.9 235.4
LIN 87398.4 2213.9 11961.6 798.8 1909.3 100.2 1939.1 82.5
LIN+ARA 71437.1 1220.1 12612.0 1190.8 1134.6 333.6 1628.6 508.1
Scen 138102.2 22228.4 24893.0 1259.9 4259.4 612.0 23417.2 3946.5
Basically different treatments with mean values and standard deviations of some measured values.
To get the plot running I basically adapted the code from this post:
Creating grouped bar-plot of multi-column data in R
by joran for the multi-column problem, and the code from this post:
Grouped barplot in R with error bars
by Colonel Beauvel for the confidence intervals.
Here is my code:
library(reshape2)
library(ggplot2)
dfm <- melt(sgr_sum_v3[, c('treatment', 'mean_ALA', 'mean_LIN')], id.vars = 1)
ggplot(data = dfm, aes(x = treatment, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = value - 1000, ymax = value + 1000), width = .2, position = position_dodge(.9))
Now my problem is that, while melt() solves the multi-column problem, the melted data frame no longer contains my standard deviations, so I can't draw real error bars (so far I just inserted 1000 to see if it works).
Do you have suggestions how to solve this, or even how to get the multi-column plot running with the original, unmelted data, which would make the confidence-interval problem pretty straightforward?
Thanks in advance for your help :)

Even though my question is already pretty old and was solved in the meantime, I will answer it in a more comprehensive way, as @dende85 recently asked for the complete code. The following code is not exactly for the data above; I created it for a small R lecture for my students. However, I'm pretty sure this could be handled more simply.
So here's the answer:
First, I create two data frames: one for the mean values and one for the standard deviations. In this case I only chose a subset of the treatments, via [1:4]:
my_bar_data_mean <- data.frame(treatment = levels(my_data$treatment)[1:4])
my_bar_data_sd <- data.frame(treatment = levels(my_data$treatment)[1:4])
Then I used aggregate() to calculate mean and sd for all groups for all (in this case 3) parameters of interest:
# BL
my_bar_data_mean$BL_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean, na.rm = TRUE)[, 8]
my_bar_data_sd$BL_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd, na.rm = TRUE)[, 8]
# BW
my_bar_data_mean$BW_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean, na.rm = TRUE)[, 9]
my_bar_data_sd$BW_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd, na.rm = TRUE)[, 9]
# SL
my_bar_data_mean$SL_mean <- aggregate(my_data,
                                      by = list(my_data$treatment),
                                      FUN = mean, na.rm = TRUE)[, 10]
my_bar_data_sd$SL_sd <- aggregate(my_data,
                                  by = list(my_data$treatment),
                                  FUN = sd, na.rm = TRUE)[, 10]
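As hinted above, this can probably be handled more simply: aggregate()'s formula interface computes several columns in one call. A sketch, with hypothetical column names BL, BW and SL standing in for columns 8 to 10 of my_data:
# one aggregate() call per statistic instead of one call per column
# (BL, BW, SL are hypothetical names for the parameters of interest)
my_bar_data_mean <- aggregate(cbind(BL, BW, SL) ~ treatment, data = my_data, FUN = mean)
my_bar_data_sd   <- aggregate(cbind(BL, BW, SL) ~ treatment, data = my_data, FUN = sd)
Note that the resulting columns keep the original names (BL, BW, SL), so rename them accordingly if you follow the melt steps below.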
Now we need to reshape the data.frames. For this we load some packages (melt() comes from reshape2; Hmisc and car are not strictly needed for this step):
library(Hmisc)
library(car)
library(reshape2)
We create a new data.frame and reshape our data with the help of the melt()-function. Note that we still have two data.frames: one for mean and one for sd:
dfm <- melt(my_bar_data_mean)
temp <- melt(my_bar_data_sd)
Now we can see that our variables are gathered vertically. We just add the values of the temp data.frame as a new column called sd to the first data.frame (this works because melt() returns the rows in the same order for both data.frames, which share an identical structure):
dfm$sd <- temp$value
Now, we just have to plot everything:
ggplot(dfm, aes(variable, value, fill = treatment)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_classic() +
  geom_errorbar(aes(ymin = value - sd, ymax = value + sd), width = 0.4, position = position_dodge(.9))
You can simply add the error bars using geom_errorbar, with the value and sd columns giving the min and max of your whiskers. Don't forget to set position = position_dodge(.9) for the error bars as well.
You can also switch whether the dodged bars are split by response variable or by treatment: simply exchange variable and treatment in the first line (inside aes()), as shown below.
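For example, the swapped mapping of the plot above would be:
ggplot(dfm, aes(treatment, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_classic() +
  geom_errorbar(aes(ymin = value - sd, ymax = value + sd), width = 0.4, position = position_dodge(.9))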
I hope this helps.

Related

Filter datapoints in a scatterplot using a linear equation

I am writing a function to filter data points out of my plot based on a linear equation.
I currently have the following function (with another function nested inside it):
MD_filter <- function(dataframe, mz_col){
  # In-function MD calculation
  MZ <- mz_col
  MZR <- trunc(mz_col, digits = 0) # either floor() or trunc() can be used for this part
  MD <- as.numeric(MZ - MZR)
  dataframe <- dataframe %>%
    dplyr::mutate(MD) %>%
    dplyr::select(MD, everything())
  # fit data to m/z defect maxima equation
  f <- function(x){ # this could be problem 1, maybe resolved by leaving x...
    y <- 0.00112*x + 0.01953
    return(y)
  }
  fit <- as.data.frame(t(apply(dataframe, 1, f))) # t() transforms df to matrix...?
  filtered <- dataframe[which(dataframe$MD <= fit$MZ), ]
  # keep rows in dataframe if MD is less than or equal to the fitted value (MZ after the equation)
  # as "fit" calculated the maximum MD value for each MZ value in the MZ column, we subset fit$MZ,
  # as this contains the dataframe MZ values
  # the MD calculated at the very start needs to be lower than the equivalent MZ value of the fitted dataframe
  filtered <- write.table(filtered, "feature_list_mz_defect_filtered.txt", sep = "\t", col.names = NA)
  # now we have the pre-filter dataframe (dataframe) and the post-filter df (filtered)
  # 2 different plots (highlight to be removed as well, so we need a 3rd eventually)
  MD_plot <- ggplot(dataframe, aes(x = MZ, y = MD)) +
    geom_point() + # THE FOLLOWING PART DOES NOT WORK YET
    ggtitle(paste("Unfiltered MD data - ", dataframe))
  # stat_smooth(method = "lm", se = FALSE) for a linear line through the plot, but may not be necessary to show
  return(MD_plot) # while I do get a plot, I have not yet gotten the equation; I could use my earlier approach maybe
  MD_plot_2 <- ggplot(filtered, aes(x = MZ, y = MD)) + # "filtered" is basically the second dataframe,
    # which subsets data points with a y value (the MD) below the linear-equation MD
    geom_point() + # THE FOLLOWING PART DOES NOT WORK YET
    ggtitle(paste("Filtered MD data - ", dataframe))
  # stat_smooth(method = "lm", se = FALSE) for a linear line through the plot, but may not be necessary to show
  return(MD_plot_2)
}
The function works as follows:
The argument inputs are a dataframe and a specific column inside that dataframe, which I call mz_col.
From this column a second column, the MD column, is generated.
From here on out I want to make two plots:
ggplot 1: A plot with the mz_col (MZ) values on the X axis and the MD values on the Y axis
ggplot 2: EXACTLY the same as ggplot 1, but with data points filtered out wherever MD exceeds the linear equation y = 0.00112*x + 0.01953 (as visible in the code). This line is basically my upper filter limit in the plot; everything above it should be gone.
I've tried many different solutions: I swapped the "x" argument for mz_col, among many other attempts such as using plot() instead of ggplot(). Currently I'm getting no plot.
Basically my question is: how do I fix my function so I can get my two plots? The first plot is no real issue, it already works, but the second plot just won't filter out data points based on my linear equation.
Thanks in advance! I'm quite new to SO and R, so I apologize if anything is unclear. Please let me know if any clarification is needed.
Reproducible sample data:
structure(list(mz = c(
446.0394, 346.043, 199.0446, 199.0464, 97.057, 657.0595, 879.0606,
978.0631, 199.0684, 199.0707, 199.0724, 86.0748, 199.0761, 196.0789,
199.0819, 199.0852, 199.0878, 199.089, 908.0935, 147.0958, 199.0999,
199.1299, 199.1322, 199.1384, 199.1398, 199.1434, 124.1475, 199.1513,
187.156, 199.1686, 199.1766, 199.1797, 199.1879, 199.1924, 187.1959,
479.1981, 169.1999, 109.2049, 399.2092, 299.2125, 159.2146, 199.2242,
356.2405, 69.2423, 956.4337, 978.5537, 199.5695, 676.5769, 199.5851,
500.6021, 260.6039, 270.6081, 200.6114, 200.6131, 200.6172, 200.6221,
200.6315, 200.6402, 200.6476, 200.766, 200.8591, 200.8732, 200.8768,
200.89, 200.8937, 200.8972, 200.9067, 200.9127, 200.9147, 200.9231,
200.9253, 200.9288, 200.9324, 200.935, 200.9468, 200.9515, 200.9536,
200.9557, 200.9568, 200.9594, 200.9661, 200.968, 200.9729, 200.9745,
200.9819, 200.9837, 200.9858, 200.9937)), row.names = c(NA, -88L), class = c("tbl_df", "tbl", "data.frame"))
I got a bit lost trying to follow your code, but based on your description, does the following work for you?
library(dplyr)
library(ggplot2)
MD_filter <- function(dataframe, mz_col, a = 0.01953, b = 0.00112){
  # rename column so that the rest of the function doesn't depend on the inputted column name
  dataframe[["MZ"]] <- dataframe[[mz_col]]
  # process dataframe
  dataframe <- dataframe %>%
    select(MZ) %>%
    mutate(MD = MZ - trunc(MZ, digits = 0),
           MD.limit = a + b*MZ)
  p1 <- ggplot(dataframe,
               aes(x = MZ, y = MD)) +
    geom_point() +
    geom_smooth(method = "lm", se = F) +
    ggtitle("Unfiltered MD data")
  # %+% replaces the data in an existing ggplot while keeping all other plot specifications
  p2 <- p1 %+% filter(dataframe, MD <= MD.limit) +
    expand_limits(y = range(dataframe[["MD"]])) + # optional (if you want the same
                                                  # y-axis range for both plots)
    ggtitle("Filtered MD data")
  cowplot::plot_grid(p1, p2, nrow = 1)
}
Data & usage
dd <- structure(list(mz = c(
446.0394, 346.043, 199.0446, 199.0464, 97.057, 657.0595, 879.0606,
978.0631, 199.0684, 199.0707, 199.0724, 86.0748, 199.0761, 196.0789,
199.0819, 199.0852, 199.0878, 199.089, 908.0935, 147.0958, 199.0999,
199.1299, 199.1322, 199.1384, 199.1398, 199.1434, 124.1475, 199.1513,
187.156, 199.1686, 199.1766, 199.1797, 199.1879, 199.1924, 187.1959,
479.1981, 169.1999, 109.2049, 399.2092, 299.2125, 159.2146, 199.2242,
356.2405, 69.2423, 956.4337, 978.5537, 199.5695, 676.5769, 199.5851,
500.6021, 260.6039, 270.6081, 200.6114, 200.6131, 200.6172, 200.6221,
200.6315, 200.6402, 200.6476, 200.766, 200.8591, 200.8732, 200.8768,
200.89, 200.8937, 200.8972, 200.9067, 200.9127, 200.9147, 200.9231,
200.9253, 200.9288, 200.9324, 200.935, 200.9468, 200.9515, 200.9536,
200.9557, 200.9568, 200.9594, 200.9661, 200.968, 200.9729, 200.9745,
200.9819, 200.9837, 200.9858, 200.9937)),
row.names = c(NA, -88L),
class = c("tbl_df", "tbl", "data.frame"))
MD_filter(dd, "mz")
# MD_filter(dd, "mz", a = 0.02, b = 0.001) # if you want to change the limit

Plotting a data.table with multiple columns (all with rows 1:7) via ggplot with a single geom_point(), using aesthetics to color them differently

I intend to compare timings between two algorithm-based functions, f1 and f2, via microbenchmark. They work on rpois-simulated datasets whose sizes come from the [1:7] vector given by 10^seq(1,4,by=0.5), i.e.:
[1] 10.00000 31.62278 100.00000 316.22777 1000.00000 3162.27766 10000.00000
I am working on plotting them as well, with all of the information required from microbenchmark (i.e. min, lq, mean, median, uq and max; yes, all of them are required, just not expr and neval). I want this via ggplot on a log-log scale, with a single geom_point() and aesthetics giving each measure a different colour. Here is my code for that:
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)
datasetsizes<-c(10^seq(1,4,by=0.5))
f1_min<-integer(length(datasetsizes))
f1_lq<-integer(length(datasetsizes))
f1_mean<-integer(length(datasetsizes))
f1_median<-integer(length(datasetsizes))
f1_uq<-integer(length(datasetsizes))
f1_max<-integer(length(datasetsizes))
f2_min<-integer(length(datasetsizes))
f2_lq<-integer(length(datasetsizes))
f2_mean<-integer(length(datasetsizes))
f2_median<-integer(length(datasetsizes))
f2_uq<-integer(length(datasetsizes))
f2_max<-integer(length(datasetsizes))
for(loopvar in 1:length(datasetsizes)){
  s <- summary(microbenchmark(
    f1(rpois(datasetsizes[loopvar], 10), max.segments = 3L),
    f2(rpois(datasetsizes[loopvar], 10), maxSegments = 3)))
  f1_min[loopvar] <- s$min[1]
  f2_min[loopvar] <- s$min[2]
  f1_lq[loopvar] <- s$lq[1]
  f2_lq[loopvar] <- s$lq[2]
  f1_mean[loopvar] <- s$mean[1]
  f2_mean[loopvar] <- s$mean[2]
  f1_median[loopvar] <- s$median[1]
  f2_median[loopvar] <- s$median[2]
  f1_uq[loopvar] <- s$uq[1]
  f2_uq[loopvar] <- s$uq[2]
  f1_max[loopvar] <- s$max[1]
  f2_max[loopvar] <- s$max[2]
}
algorithm <- data.table(f1_min, f2_min,
                        f1_lq, f2_lq,
                        f1_mean, f2_mean,
                        f1_median, f2_median,
                        f1_uq, f2_uq,
                        f1_max, f2_max, datasetsizes)
ggplot(algorithm, aes(x=algorithm,y=datasetsizes)) + geom_point(aes(color=algorithm)) + labs(x="N", y="Runtime") + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
I debugged my code at each step, and up to the assignment of the computed values to the data.table named 'algorithm' it works fine.
Here are the computed runs, which are passed into the data.table as [1:7] vectors along with datasetsizes (1:7 as well) at the end:
> algorithm
f1_min f2_min f1_lq f2_lq f1_mean f2_mean f1_median f2_median f1_uq f2_uq f1_max f2_max datasetsizes
1: 86.745000 21.863000 105.080000 23.978000 113.645630 24.898840 113.543500 24.683000 120.243000 25.565500 185.477000 39.141000 10.00000
2: 387.879000 52.893000 451.880000 58.359000 495.963480 66.070390 484.672000 62.061000 518.876500 66.116500 734.149000 110.370000 31.62278
3: 1608.287000 341.335000 1845.951500 382.062000 1963.411800 412.584590 1943.802500 412.739500 2065.103500 443.593500 2611.131000 545.853000 100.00000
4: 5.964166 3.014524 6.863869 3.508541 7.502123 3.847917 7.343956 3.851285 7.849432 4.163704 9.890556 5.096024 316.22777
5: 23.128505 29.687534 25.348581 33.654475 26.860166 37.576444 26.455269 37.080149 28.034113 41.343289 35.305429 51.347386 1000.00000
6: 79.785949 301.548202 88.112824 335.135149 94.248141 370.902821 91.577462 373.456685 98.486816 406.472393 135.355570 463.908240 3162.27766
7: 274.367776 2980.122627 311.613125 3437.044111 337.287131 3829.503738 333.544669 3820.517762 354.347487 4205.737045 546.996092 4746.143252 10000.00000
microbenchmark computes the values fine, as expected, but ggplot throws this error:
Don't know how to automatically pick scale for object of type data.table/data.frame. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (7): colour, x
I am not able to resolve this. Can anyone tell me what is wrong and correct the plotting procedure?
Also, on a side note: I had to extract all the values (min, lq, mean, median, uq, max) from the computed benchmark separately, since I can't use the summary directly as a data.table; it contains the expr (expression) and neval columns. I was able to eliminate one of the columns using
algorithm[,!"expr"] or algorithm[,!"neval"]
but I can't eliminate the two of them together, i.e.
algorithm[,!"expr",!"neval"] or algorithm[,!("expr","neval")] or algorithm[,!"expr","neval"]
and all such combinations don't work (they throw an 'invalid argument type' error).
Any possible workaround or solution to this and the plotting (main thing) would be highly appreciated!
Your problem lies mainly with the fact that you're referring to an algorithm column in the ggplot formula that does not exist in your object.
From what you gave, I could do the following :
algorithm$algorithm <- 1:nrow(algorithm)
ggplot(algorithm, aes(x=algorithm,y=datasetsizes)) + geom_point(aes(color=algorithm)) + labs(x="N", y="Runtime") +
scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
and plot this fine:
EDIT : let's clean this up a bit...
As per OP's request, I've cleaned up his code a bit.
There are a lot of things you can work on to improve on your code's readability, but I'm focusing more on the practical aspect here.
Basically, join your variables together in a table if you know they'll end up as such.
There are a bunch of tricks you can use to assign the values to the correct spots, a few of which you'll see in the code below.
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)

datasetsizes <- c(10^seq(1, 4, by = 0.5))
l <- length(datasetsizes)
# make a vector with your different conditions
conds <- c('f1', 'f2')
# initializing a table from the get-go is much cleaner
# than doing everything in separate variables
dat <- data.frame(
  datasetsizes = rep(datasetsizes, each = length(conds)), # make replicates for each condition
  cond = rep(NA, l*length(conds))
)
dat[, c("min", "lq", "mean", "median", "uq", "max")] <- 0
dat$cond <- factor(dat$cond, levels = conds)
head(dat)

for(i in 1:l){ # for the love of god, don't use something as long as 'loopvar' as an iterator
  # I don't have f1 & f2 so I did what I could...
  s <- summary(microbenchmark(
    "f1" = rpois(datasetsizes[i], 10),
    "f2" = {length(rpois(datasetsizes[i], 10))}))
  dat[which(dat$datasetsizes == datasetsizes[i]), # select rows of current dataset size
      c("cond", "min", "lq", "mean", "median", "uq", "max")] <- s[, !colnames(s) %in% c("neval")]
}
dat <- data.table(dat)
ggplot(dat, aes(x = datasetsizes, y = mean)) +
  geom_point(aes(color = cond)) +
  geom_line(aes(color = cond)) + # added to see a clear difference between conds
  labs(x = "N", y = "Runtime") +
  scale_x_continuous(trans = 'log10') +
  scale_y_continuous(trans = 'log10')
This gives the following plot.
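As for the side note about dropping expr and neval together: data.table accepts a character vector of column names with !, or can delete columns by reference with := NULL. A sketch, assuming the summary has first been converted to a data.table:
s_dt <- as.data.table(s)            # s is the summary() result from the loop above
s_dt[, !c("expr", "neval")]         # returns a data.table without both columns
s_dt[, c("expr", "neval") := NULL]  # or remove them in place, by reference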

ggplot multiline by rows and columns with legend + sd

I have 2 data.frames:
data_mean:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.04548814 0.03679184 0.04804329 0.01774598 0.1873583 0.03798713 0.02515220 0.04006423 0.2460139
season_2 0.07529072 0.08048696 0.06041415 0.03461997 0.1473725 0.04371079 0.02451824 0.03729869 0.2325734
season_3 0.19094684 0.05399267 0.09108074 0.05326579 0.1107565 0.04996543 0.02944363 0.04424125 0.2061001
season_4 0.16393195 0.02916149 0.05763407 0.03526731 0.1212815 0.05466920 0.02825975 0.06150540 0.2232308
season_5 0.08309387 0.05862481 0.07578285 0.03620725 0.1433460 0.07038242 0.03102652 0.05434440 0.1553574
and data_SD:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.009414762 0.009205625 0.003816925 0.002548717 0.01475648 0.003631448 0.001490306 0.002462043 0.007583638
season_2 0.024247471 0.016402706 0.004980897 0.004745206 0.01393021 0.004178247 0.001597244 0.002553933 0.007538909
season_3 0.030626971 0.012865086 0.006913778 0.005980786 0.01313423 0.004754663 0.001875462 0.002952610 0.007434868
season_4 0.034040440 0.009705439 0.004927881 0.004330766 0.01350788 0.004751983 0.001753364 0.003384793 0.007406657
season_5 0.020016015 0.014591767 0.005815301 0.004232419 0.01499951 0.005411255 0.001875151 0.003234048 0.006308964
I want to draw one ggplot with Y = value, X = season and one line per territory, each line in a different color (color = territory), plus a shaded area between the lower and upper SD bounds. And I want to draw another ggplot with one panel per territory, again with the area between the SD bounds.
Thank you for your help!
The most effective way to work with ggplot is to reshape your data into long form, typically into a single data frame. Below, I use dplyr::left_join and tidyr::gather to transform and combine the two provided tables into the form that ggplot works best with.
library(tidyverse)
data_combined <-
left_join(
data_mean %>% gather(territory, mean, -season),
data_SD %>% gather(territory, SD, -season)
)
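Note this assumes season is an actual column; since your tables show the seasons as row names, you may first need something like tibble::rownames_to_column("season"). After the join, data_combined is in long form, one row per season-territory pair; its first rows (values from your tables) look like:
# season     territory    mean        SD
# season_1   territory_1  0.04548814  0.009414762
# season_2   territory_1  0.07529072  0.024247471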
Then it's pretty straightforward to assign the different columns to different aesthetics of the chart:
ggplot(data_combined,
aes(x = season, y = mean, color = territory, group = territory)) +
geom_ribbon(aes(ymin = mean - SD, ymax = mean + SD, fill = territory),
alpha = 0.3, color = NA) +
geom_line()
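For your second plot (one panel per territory), faceting the same long data frame should work; a minimal sketch:
ggplot(data_combined, aes(x = season, y = mean, group = 1)) +
  geom_ribbon(aes(ymin = mean - SD, ymax = mean + SD), alpha = 0.3) +
  geom_line() + # group = 1 lets the line connect points across the discrete seasons
  facet_wrap(~ territory)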

Find the y-coordinate at intersection of two curves when x is known

Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
The objective of this post is to determine how to find the y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap: the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)

# function that generates the data
so.create.test.dataset <- function(n, mean){
  step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
  price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
  price.bargain <- price.too.cheap + step.to.bargain
  df.temp <- cbind(price.too.cheap,
                   price.bargain)
  df.temp <- as.data.frame(df.temp)
  return(df.temp)
}

# create 389 "observations" where too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
# adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
  cc.temp <- as.data.frame(table(p.vector))
  cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
  cc.extracted <- cc.merged[, "Freq"]
  cc.extracted[is.na(cc.extracted)] <- 0
  return(cc.extracted)
}

so.get.df.price <- function(df){
  # creates cumulative frequencies for three variables
  # using the price points provided by respondents
  # extract and sort all unique price points
  # thanks to akrun for their help with this step
  price.point <- sort(unique(unlist(round(df, 2))))
  # create a new data frame to work with, having a row for each price point
  dfp <- as.data.frame(price.point)
  # create cumulative frequencies (as percentages) for each variable
  dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
  dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
  dfp$not.bargain.share <- 1 - dfp$bargain.share # bargain inverted so the curves will intersect
  return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE) ~ so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point, lwd = 2)

# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd = 2)
# thanks to John Fox & Sanford Weisberg, "An R Companion to Applied Regression", second edition
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  geom_line(aes(y = so.df.price$too.cheap.share, colour = "too.cheap.share")) +
  geom_line(aes(y = so.df.price$not.bargain.share, colour = "not.bargain.share")) +
  scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
  # model for the curves
  1 / (1 + exp(-(a + b * x)))
}
# note: this function may also be used in step 3;
# I was building as I went, and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]),
                    c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below):
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd = 2)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
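As a sanity check, the same value should also come straight from the fitted model via predict(), without copying coefficients by hand (assuming so.temp.nls is the not.bargain fit from above):
# evaluate the nls fit at the intersection point
predict(so.temp.nls, newdata = data.frame(price.point = so.pmc.x))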
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_hline(aes(yintercept = so.pmc.y))
Yielding the following...
which graphically depicts the correct answer.

make barplot of groups with error bars

I would like to make a barplot of the columns V2 and length. I would also like to plot the standard deviation of the length values for each group.
> head(Length_filter3)
V1 V2 V3 length
1 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGG 30
2 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGG 31
3 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 35
4 URS0000008112A tRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACTG 36
5 URS000000812A tRNA AAATGTGGGAAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 44
6 URS0000008121EA tRNA AACTCGACTGCATAATTTGTGGTAGTGGG 29
ggplot(Length_filter3, aes(V2,length)) + geom_bar(stat="identity")
I am assuming that you are looking to create some sort of summary statistic, such as average, rather than trying to plot the total length of all of the RNA types (for which there would be no error bar to speak of).
If it has to be a bar plot, you will likely need to calculate the values yourself. Here, I am manually calculating the ranges I want from the iris data (using dplyr):
library(dplyr)
library(ggplot2)

summarizedData <-
  iris %>%
  group_by(Species) %>%
  summarise(
    mean = mean(Petal.Length)
    , sd = sd(Petal.Length)
    , low = mean + sd/(sqrt(n())) * qt(0.025, n()-1)
    , high = mean + sd/(sqrt(n())) * qt(0.975, n()-1)
  )

ggplot(
  summarizedData
  , aes(x = Species
        , y = mean
        , ymax = high
        , ymin = low)
) +
  geom_bar(stat = "identity") +
  geom_linerange()
Alternatively, you can let ggplot do the work for you, particularly if you are willing to use points and error bars instead of a bar plot (I tend to prefer it this way)
ggplot(
  iris
  , aes(x = Species
        , y = Petal.Length)
) +
  stat_summary(fun.data = mean_cl_normal) # mean_cl_normal requires the Hmisc package
You can combine these approaches if you like as well.
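For instance, a sketch of such a combination, letting stat_summary() compute both the mean bars and normal-theory confidence intervals (mean_cl_normal needs the Hmisc package installed; fun = replaces the older fun.y = in ggplot2 >= 3.3):
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  stat_summary(fun = mean, geom = "bar") + # bars at the group means
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) # 95% CI whiskers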
Alternatively, try a FUN-style summary function in ggplot, choosing the standard deviation.
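If that suggestion means stat_summary() with a standard-deviation summary, a sketch with the original data might look like this (mean_sdl computes mean +/- mult * sd and also relies on the Hmisc package):
ggplot(Length_filter3, aes(x = V2, y = length)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.2)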
