ggplot multiline by rows and columns with legend + sd - r

I have two data frames:
data_mean:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.04548814 0.03679184 0.04804329 0.01774598 0.1873583 0.03798713 0.02515220 0.04006423 0.2460139
season_2 0.07529072 0.08048696 0.06041415 0.03461997 0.1473725 0.04371079 0.02451824 0.03729869 0.2325734
season_3 0.19094684 0.05399267 0.09108074 0.05326579 0.1107565 0.04996543 0.02944363 0.04424125 0.2061001
season_4 0.16393195 0.02916149 0.05763407 0.03526731 0.1212815 0.05466920 0.02825975 0.06150540 0.2232308
season_5 0.08309387 0.05862481 0.07578285 0.03620725 0.1433460 0.07038242 0.03102652 0.05434440 0.1553574
and data_SD:
territory_1 territory_2 territory_3 territory_4 territory_5 territory_6 territory_7 territory_8 territory_9
season_1 0.009414762 0.009205625 0.003816925 0.002548717 0.01475648 0.003631448 0.001490306 0.002462043 0.007583638
season_2 0.024247471 0.016402706 0.004980897 0.004745206 0.01393021 0.004178247 0.001597244 0.002553933 0.007538909
season_3 0.030626971 0.012865086 0.006913778 0.005980786 0.01313423 0.004754663 0.001875462 0.002952610 0.007434868
season_4 0.034040440 0.009705439 0.004927881 0.004330766 0.01350788 0.004751983 0.001753364 0.003384793 0.007406657
season_5 0.020016015 0.014591767 0.005815301 0.004232419 0.01499951 0.005411255 0.001875151 0.003234048 0.006308964
I want to draw one ggplot with Y = values and X = seasons, with one line per territory, colored by territory (color(territory)), plus a shaded area between the lower and upper SD bounds (mean - SD and mean + SD). I also want to draw another version: one plot per territory, again with the area between the lower and upper SD bounds.
Thank you for your help!

The most effective way to work with ggplot is to reshape your data into long form, typically into a single data frame. Below, I use dplyr::left_join and tidyr::gather to transform and combine the two provided tables into the form that ggplot works best with.
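library(tidyverse)
# In the data as printed, the season labels are row names,
# so move them into a "season" column before reshaping
data_combined <-
  left_join(
    data_mean %>% rownames_to_column("season") %>% gather(territory, mean, -season),
    data_SD %>% rownames_to_column("season") %>% gather(territory, SD, -season),
    by = c("season", "territory")
  )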
Then it's pretty straightforward to assign the different columns to different aesthetics of the chart:
ggplot(data_combined,
aes(x = season, y = mean, color = territory, group = territory)) +
geom_ribbon(aes(ymin = mean - SD, ymax = mean + SD, fill = territory),
alpha = 0.3, color = NA) +
geom_line()
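For the second request (one plot per territory), the same long data frame can be faceted. This is a minimal sketch, assuming ggplot2 and tidyr are installed. The two small data frames are cut-down stand-ins for data_mean and data_SD (two territories, three seasons, values rounded from the question), and the names to_long and p_facets are only illustrative.

```r
library(ggplot2)
library(tidyr)

# Cut-down stand-ins for the question's tables (season labels as row names)
data_mean <- data.frame(
  territory_1 = c(0.045, 0.075, 0.191),
  territory_2 = c(0.037, 0.080, 0.054),
  row.names = c("season_1", "season_2", "season_3")
)
data_SD <- data.frame(
  territory_1 = c(0.009, 0.024, 0.031),
  territory_2 = c(0.009, 0.016, 0.013),
  row.names = c("season_1", "season_2", "season_3")
)

# Hypothetical helper: move row names into a column, then reshape to long form
to_long <- function(df, value_name) {
  df$season <- rownames(df)
  pivot_longer(df, cols = -season, names_to = "territory", values_to = value_name)
}

# One row per season/territory pair, with both mean and SD
data_combined <- merge(to_long(data_mean, "mean"), to_long(data_SD, "SD"),
                       by = c("season", "territory"))

# One panel per territory; the ribbon spans mean - SD to mean + SD
p_facets <- ggplot(data_combined, aes(x = season, y = mean, group = territory)) +
  geom_ribbon(aes(ymin = mean - SD, ymax = mean + SD), alpha = 0.3) +
  geom_line() +
  facet_wrap(~ territory)
```

Calling print(p_facets) draws the panels. Adding scales = "free_y" to facet_wrap may help when territories differ a lot in magnitude.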

Related

How to use ggplot with prop.table(table(x))?

First, I have a data set with two categorical variables, like this:
nombre <- c("A","B","C","A","D","F","F","H","I","J")
sexo <- c(rep("man",4),rep("woman",6))
edad <- c (25,14,25,76,12,90,65,45,56,43)
pais <- c(rep("spain",3),rep("italy",4),rep("portugal",3))
data <- data.frame(nombre=nombre,sexo=sexo,edad=edad,pais=pais)
If I use:
prop.table(table(data$sexo,data$pais), margin=1)
I can see the relative frequency of the levels, for example for Italy (man = 0.25, woman = 0.5),
but when I try to plot the prop.table(table(x)) I get something different:
ggplot(as.data.frame(prop.table(table(data),margin=1)), aes(x=pais ,y =Freq, fill=sexo))+geom_bar(stat="identity")
The Y axis runs from 0 to 3, and for example the Italy bar shows woman = 2 and man = 2.5.
I don't need that (and I don't know what it is showing). I want the same values that I got from prop.table(table(x)).
I think the problem is related to margin=1.
Thank you!
You need to build the same table that you passed to prop.table, using only the two columns of interest:
tab = prop.table(table(data$sexo,data$pais), margin=1)
tab = as.data.frame(tab)
Then plot:
ggplot(tab,aes(x=Var2,y=Freq,fill=Var1)) + geom_col()
Or simply:
barplot(prop.table(table(data$sexo,data$pais), margin=1))
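To see why the original plot went wrong, it may help to compare the two tables directly. The sketch below uses only base R and the question's data. Naming the table dimensions (sexo, pais) is my own addition, so that the data frame gets readable column names instead of Var1 and Var2.

```r
nombre <- c("A", "B", "C", "A", "D", "F", "F", "H", "I", "J")
sexo <- c(rep("man", 4), rep("woman", 6))
edad <- c(25, 14, 25, 76, 12, 90, 65, 45, 56, 43)
pais <- c(rep("spain", 3), rep("italy", 4), rep("portugal", 3))
data <- data.frame(nombre = nombre, sexo = sexo, edad = edad, pais = pais)

# table(data) cross-tabulates all four columns, so margin = 1 refers to nombre
full <- table(data)
length(dim(full))  # 4 dimensions: nombre, sexo, edad, pais

# Tabulating only the two columns of interest makes margin = 1 refer to sexo
tab <- prop.table(table(sexo = data$sexo, pais = data$pais), margin = 1)
tab_df <- as.data.frame(tab)  # columns: sexo, pais, Freq
```

With the named dimensions, the plot call can use aes(x = pais, y = Freq, fill = sexo) directly.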
You're probably looking for something like position = "dodge".
If I run the following on your data:
P <- prop.table(table(data$sexo,data$pais), margin=1)
ggplot(as.data.frame(P), aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat="identity", position = "dodge")
I get the following graph:

Find the y-coordinate at intersection of two curves when x is known

Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
The objective of this post is to determine how to find y-coordinate.
Details
These data replicate the prices at which respondents find the product too.cheap (i.e., they question its quality) and the prices at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of not.bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap: the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
price.bargain <- price.too.cheap + step.to.bargain
df.temp <- cbind(price.too.cheap,
price.bargain)
df.temp <- as.data.frame(df.temp)
return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
#adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
cc.temp <- as.data.frame(table(p.vector))
cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
cc.extracted <- cc.merged[,"Freq"]
cc.extracted[is.na(cc.extracted)] <- 0
return(cc.extracted)
}
so.get.df.price<-function(df){
# creates cumulative frequencies for three variables
# using the price points provided by respondents
# extract and sort all unique price points
# Thanks to akrun for their help with this step
price.point <- sort(unique(unlist(round(df, 2))))
#create a new data frame to work with having a row for each price point
dfp <- as.data.frame(price.point)
# Create cumulative frequencies (as percentages) for each variable
dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
dfp$not.bargain.share <- 1 - dfp$bargain.share# bargain inverted so curves will intersect
return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE)~so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  geom_line(aes(y = too.cheap.share, colour = "too.cheap.share")) +
  geom_line(aes(y = not.bargain.share, colour = "not.bargain.share")) +
  scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
# model for the curves
1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3
#I was building as I went and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]), c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below).
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_hline(aes(yintercept = so.pmc.y))
Yielding the following...
which graphically depicts the correct answer.
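The point of the fix is that x and y must come from the same pair of coefficient sets. The sketch below illustrates this in base R with made-up coefficients (co_tch and co_nbr are hypothetical stand-ins for coef() of the two nls fits, not values from the question). For two logistic curves of this form, the intersection also has a closed form, which gives an independent check on uniroot().

```r
# Logistic model for the curves, as in the question
f <- function(x, a, b) 1 / (1 + exp(-(a + b * x)))

# Hypothetical coefficients standing in for coef() of the two nls fits
co_tch <- c(a = 8, b = -0.8)     # decreasing curve ("too cheap")
co_nbr <- c(a = -10, b = 0.75)   # increasing curve ("not bargain")

# x at the intersection, found numerically from the same coefficients
pmc_x <- uniroot(
  function(x) f(x, co_tch[["a"]], co_tch[["b"]]) - f(x, co_nbr[["a"]], co_nbr[["b"]]),
  interval = c(0, 50), tol = 1e-9
)$root

# y from either curve; both agree because x solves f1(x) == f2(x)
pmc_y  <- f(pmc_x, co_tch[["a"]], co_tch[["b"]])
pmc_y2 <- f(pmc_x, co_nbr[["a"]], co_nbr[["b"]])

# Closed-form check: the curves meet where a1 + b1 * x == a2 + b2 * x
x_closed <- (co_nbr[["a"]] - co_tch[["a"]]) / (co_tch[["b"]] - co_nbr[["b"]])
```

If the coefficients used in uniroot() differ from those of the plotted curves, pmc_y will not sit on the plotted intersection, which is the symptom described in the question.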

bar plot grouped multi-column data with confidence intervals with ggplot2

I'm trying to create a bar plot of grouped multi-column data and to add confidence intervals to each bar. So far, I have managed almost everything with the help of several entries on different blogs and platforms like Stack Overflow.
My data sgr_sum_v3 looks like this:
treatment mean_C16_0 sd_C16_0 mean_C18_0 sd_C18_0 mean_LIN sd_LIN mean_ALA sd_ALA
ALA 92500.0 1492.0 14406.7 1291.5 740.2 77.7 3399.2 436.4
ALA+ARA 71538.3 3159.0 14088.7 1101.0 582.3 91.5 2089.3 439.6
ALA+EPA 82324.6 2653.3 10745.2 1244.2 658.3 19.2 2629.3 134.7
ALA+EPA+LIN+ARA 68422.9 2097.2 10818.2 721.8 969.9 24.0 2154.0 124.5
ALA+LIN 87489.0 3150.6 15951.9 888.2 1173.0 279.1 2010.6 519.4
ARA 65571.7 2635.6 11174.7 1851.9 589.0 7.0 1640.9 163.7
control 107313.4 10828.0 22087.0 6217.7 783.8 38.6 2417.5 59.2
EPA 76621.3 1863.7 9947.7 156.4 654.6 31.0 1946.8 56.6
EPA+ARA 70312.3 2187.3 10896.8 148.6 716.8 24.4 2144.0 251.4
EPA+LIN 79388.5 4866.9 10080.4 613.3 1449.9 41.7 1862.9 235.4
LIN 87398.4 2213.9 11961.6 798.8 1909.3 100.2 1939.1 82.5
LIN+ARA 71437.1 1220.1 12612.0 1190.8 1134.6 333.6 1628.6 508.1
Scen 138102.2 22228.4 24893.0 1259.9 4259.4 612.0 23417.2 3946.5
Basically different treatments with mean values and standard deviations of some measured values.
To get the plot running I basically adapted the code from this post:
Creating grouped bar-plot of multi-column data in R
from joran for the multi column problem and the code from this post:
Grouped barplot in R with error bars
from Colonel Beauvel for the confidence intervals.
Here is my code:
library(reshape2)
library(ggplot2)
# melt() returns a new data frame, so no empty data.frame() is needed first
dfm <- melt(sgr_sum_v3[,c('treatment', 'mean_ALA', 'mean_LIN')], id.vars = 1)
ggplot(data=dfm, aes(x=treatment, y = value, fill = variable))+
geom_bar(stat = "identity", position = "dodge")+
geom_errorbar(aes(ymin = value - 1000, ymax = value + 1000), width = .2, position = position_dodge(.9))
Now my problem is that, because the multi-column problem is solved with the melt function, I no longer have my standard deviations available to draw real error bars (so far I have just inserted 1000 to see if it works).
Do you have suggestions for how to solve this, or even how to get the multi-column plot running with the original data (without melting), which would make the confidence interval problem pretty straightforward?
Thanks in advance for your help :)
Even though my question is already pretty old and has been solved in the meantime, I will answer it more comprehensively, since @dende85 recently asked for the complete code. The following code does not use exactly the data above; I created it for a small R lecture for my students. I'm pretty sure this could be handled more simply, though.
So here's the answer:
First, I create two data sets, one for mean values and one for sd. In this case I only chose a subset of the treatments using [1:4]:
my_bar_data_mean <- data.frame(treatment = levels(my_data$treatment)[1:4])
my_bar_data_sd <- data.frame(treatment = levels(my_data$treatment)[1:4])
Then I used aggregate() to calculate mean and sd for all groups for all (in this case 3) parameters of interest:
#BL
my_bar_data_mean$BL_mean <- aggregate(my_data,
by = list(my_data$treatment),
FUN = mean,
na.rm = TRUE)[, 8]
my_bar_data_sd$BL_sd <- aggregate(my_data,
by = list(my_data$treatment),
FUN = sd,
na.rm = TRUE)[, 8]
# BW
my_bar_data_mean$BW_mean <- aggregate(my_data,
by = list(my_data$treatment),
FUN = mean,
na.rm = TRUE)[, 9]
my_bar_data_sd$BW_sd <- aggregate(my_data,
by = list(my_data$treatment),
FUN = sd,
na.rm = TRUE)[, 9]
# SL
my_bar_data_mean$SL_mean <- aggregate(my_data,
by = list(my_data$treatment),
FUN = mean,
na.rm = TRUE)[, 10]
my_bar_data_sd$SL_sd <- aggregate(my_data,
by = list(my_data$treatment),
FUN = sd,
na.rm = TRUE)[, 10]
Now, we need to reshape the data.frame. Therefore, we need some packages:
library(Hmisc)
library(car)
library(reshape2)
We create a new data.frame and reshape our data with the help of the melt()-function. Note that we still have two data.frames: one for mean and one for sd:
# melt() returns a new data frame, so no empty data.frame() is needed first
dfm <- melt(my_bar_data_mean, id.vars = "treatment")
temp <- melt(my_bar_data_sd, id.vars = "treatment")
Now we can see that our variables are stacked vertically. Because both data frames were melted in the same order, we just have to add the value column of the temp data.frame as a new column called sd to the first data.frame:
dfm$sd <- temp$value
Now, we just have to plot everything:
ggplot(dfm, aes(variable, value, fill=treatment))+
geom_bar(stat="identity", position = "dodge")+
theme_classic() +
geom_errorbar(aes(ymin = value - sd, ymax = value + sd), width=0.4, position = position_dodge(.9))
You can simply add the error bars using geom_errorbar, with the columns value and sd defining the min and max of your whiskers. Don't forget to set position = position_dodge(.9) for the error bars as well.
You can also switch between plotting your response variables as dodged bars split by treatment, or vice versa, by exchanging variable and treatment in the first line (ggplot(aes())).
I hope this helps.
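For the wide table in the question itself, where each compound has a mean_ and an sd_ column, an alternative is to reshape both at once with tidyr::pivot_longer and the special ".value" name. This is a swapped-in technique rather than the approach of the answer above. It is a minimal sketch, assuming tidyr (1.0 or later) and ggplot2 are installed; the data frame holds three rows and two compounds copied from the question, and the names long, compound and p_bars are illustrative.

```r
library(tidyr)
library(ggplot2)

# Three rows and two compounds copied from the question's sgr_sum_v3
sgr_sum_v3 <- data.frame(
  treatment = c("ALA", "ARA", "LIN"),
  mean_LIN = c(740.2, 589.0, 1909.3),
  sd_LIN   = c(77.7, 7.0, 100.2),
  mean_ALA = c(3399.2, 1640.9, 1939.1),
  sd_ALA   = c(436.4, 163.7, 82.5)
)

# ".value" means the first name part (mean or sd) becomes the output column name,
# while the second part goes into the new "compound" column
long <- pivot_longer(
  sgr_sum_v3,
  cols = -treatment,
  names_to = c(".value", "compound"),
  names_pattern = "^(mean|sd)_(.*)$"
)

# Dodged bars with error bars of mean +/- sd, dodged by the same amount
p_bars <- ggplot(long, aes(x = treatment, y = mean, fill = compound)) +
  geom_col(position = position_dodge(0.9)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                width = 0.2, position = position_dodge(0.9))
```

Because mean and sd end up in the same row, there is no need to keep two melted data frames aligned by row order.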

make barplot of groups with error bars

I would like to make a barplot of the columns V2 and length. I would also like to plot the standard deviation of the values in length for each group.
> head(Length_filter3)
V1 V2 V3 length
1 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGG 30
2 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGG 31
3 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 35
4 URS0000008112A tRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACTG 36
5 URS000000812A tRNA AAATGTGGGAAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 44
6 URS0000008121EA tRNA AACTCGACTGCATAATTTGTGGTAGTGGG 29
ggplot(Length_filter3, aes(V2,length)) + geom_bar(stat="identity")
I am assuming that you are looking to create some sort of summary statistic, such as average, rather than trying to plot the total length of all of the RNA types (for which there would be no error bar to speak of).
If it has to be a bar plot, you will likely need to calculate the values yourself. Here, I am manually calculating the ranges I want from the iris data (using dplyr):
library(dplyr)
library(ggplot2)
summarizedData <-
iris %>%
group_by(Species) %>%
summarise(
mean = mean(Petal.Length)
, sd = sd(Petal.Length)
, low = mean + sd/(sqrt(n())) * qt(0.025, n()-1 )
, high = mean + sd/(sqrt(n())) * qt(0.975, n()-1 )
)
ggplot(
summarizedData
, aes(x = Species
, y = mean
, ymax = high
, ymin = low)
) +
geom_bar(stat = "identity") +
geom_linerange()
Alternatively, you can let ggplot do the work for you, particularly if you are willing to use points and error bars instead of a bar plot (I tend to prefer it this way). Note that mean_cl_normal requires the Hmisc package to be installed:
ggplot(
iris
, aes(x = Species
, y = Petal.Length)
) +
stat_summary(fun.data = mean_cl_normal)
You can combine these approaches if you like as well.
Try a summary function in ggplot (for example via stat_summary), choosing the standard deviation.
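Applied to the asker's columns, the manual approach could look like the sketch below. It assumes dplyr and ggplot2 are installed, uses only the six rows shown by head(Length_filter3), and draws the standard deviation (as requested) rather than a confidence interval. The names length_summary, mean_length, sd_length and p_len are illustrative.

```r
library(dplyr)
library(ggplot2)

# The six rows shown by head(Length_filter3); only the columns needed here
Length_filter3 <- data.frame(
  V2 = c("snRNA", "snRNA", "snRNA", "tRNA", "tRNA", "tRNA"),
  length = c(30, 31, 35, 36, 44, 29)
)

# Mean and standard deviation of length per RNA type
length_summary <- Length_filter3 %>%
  group_by(V2) %>%
  summarise(mean_length = mean(length), sd_length = sd(length))

# Bars show the group mean; error bars span mean - sd to mean + sd
p_len <- ggplot(length_summary, aes(x = V2, y = mean_length)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_length - sd_length,
                    ymax = mean_length + sd_length),
                width = 0.2)
```

print(p_len) draws the plot.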

Position dodge with geom_point(), x=continuous, y=factor

I have made a function that can plot the loadings from many factor analyses at once, also when their variables do not overlap perfectly (or at all). It works fine, but sometimes factor loadings are identical across analyses which means that the points get plotted on top of each other.
library(pacman)
p_load(devtools, psych, stringr, plotflow)
source_url("https://raw.githubusercontent.com/Deleetdk/psych2/master/psych2.R")
loadings.plot2 = function(fa.objects, fa.names=NA) {
fa.num = length(fa.objects) #number of fas
#check names are correct or set automatically
if (length(fa.names)==1 && is.na(fa.names[1])) { # && short-circuits, so a longer names vector is not passed to if()
fa.names = str_c("fa.", 1:fa.num)
}
if (length(fa.names) != fa.num) {
stop("Names vector does not match the number of factor analyses.")
}
#merge into df
d = data.frame() #to merge into
for (fa.idx in 1:fa.num) { #loop over fa objects
loads = fa.objects[[fa.idx]]$loadings
rnames = rownames(loads)
loads = as.data.frame(as.vector(loads))
rownames(loads) = rnames
colnames(loads) = fa.names[fa.idx]
d = merge.datasets(d, loads, 1)
}
#reshape to long form
d2 = reshape(d,
varying = 1:fa.num,
direction="long",
ids = rownames(d))
d2$time = as.factor(d2$time)
d2$id = as.factor(d2$id)
colnames(d2)[2] = "fa"
print(d2)
#plot
g = ggplot(reorder_by(id, ~ fa, d2), aes(x=fa, y=id, color=time, group=time)) +
geom_point(position=position_dodge()) +
xlab("Loading") + ylab("Indicator") +
scale_color_discrete(name="Analysis",
labels=fa.names)
return(g)
}
#Some example plots
fa1 = fa(iris[-5])
fa2 = fa(iris[-c(1:50),-5])
fa3 = fa(ability)
fa4 = fa(ability[1:50,])
loadings.plot2(list(fa1,fa1,fa2))
Here I've plotted the same object twice just to show the effect. The plot has no red points because the green ones from fa.2 are on top. Instead, I want them to be dodged on the y-axis. However, position="dodge" with various settings does not appear to make a difference.
position="jitter" does work, but it is random, so it sometimes fails to separate the points and makes the plot chaotic to look at.
How do I make the points dodged on the y-axis?
Apparently, you can only dodge sideways, but there is a workaround. The trick is to flip your x and y, do the position_dodge, and then do a coord_flip().
g = ggplot(data = reorder_by(id, ~ fa, d2), aes(x=id, y=fa, color=time, group=time)) +
geom_point(position=position_dodge(width = .5)) +
# x is now the indicator (id); coord_flip() moves it to the vertical axis
xlab("Indicator") + ylab("Loading") +
scale_color_discrete(name="Analysis",
labels=fa.names) +
coord_flip()
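A self-contained illustration of the same workaround, without the factor-analysis helpers, might look as follows. It assumes only ggplot2; the data frame d holds made-up loadings (not results from the question), in which analyses fa.1 and fa.2 are identical so that their points would otherwise overlap.

```r
library(ggplot2)

# Hypothetical loadings: fa.1 and fa.2 are identical, so their points overlap
d <- data.frame(
  id = rep(c("item_a", "item_b", "item_c"), times = 3),
  time = factor(rep(c("fa.1", "fa.2", "fa.3"), each = 3)),
  fa = c(0.8, 0.6, 0.4,  0.8, 0.6, 0.4,  0.7, 0.5, 0.3)
)

# Dodge along the discrete x axis (id), then flip so id is shown vertically
p_dodge <- ggplot(d, aes(x = id, y = fa, color = time, group = time)) +
  geom_point(position = position_dodge(width = 0.5)) +
  xlab("Indicator") + ylab("Loading") +
  coord_flip()
```

After print(p_dodge), the points for fa.1 and fa.2 sit at the same loading but at slightly different heights within each indicator row.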
Possible duplicate
In the linked post, the right answer states that one must use position_jitter() instead of position_dodge(). It has worked for me.

Resources