Plot median values on top of a density distribution


I'm trying to plot the median values of some data on a density distribution using the ggplot2 R library. I would like to print the median values as text on top of the density plot.
You'll see what I mean with an example (using the "diamonds" default dataframe):
I'm printing three items: the density plot itself, a vertical line showing the median price of each cut, and a text label with that value. But, as you can see, the median prices overlap on the "y" axis (this aesthetic is mandatory in the geom_text() function).
Is there any way to dynamically assign a "y" value to each median price, so as to print them at different heights? For example, at the maximum density value of each "cut".
So far I've got this:
library(ggplot2)
library(plyr)
# input dataframe
dia <- diamonds
# calculate the median of each numeric variable per cut:
dia_me <- ddply(dia, .(cut), numcolwise(median))
ggplot(dia, aes(x=price, y=..density.., color = cut, fill = cut), legend=TRUE) +
  labs(title="diamond price per cut") +
  geom_density(alpha = 0.2) +
  geom_vline(data=dia_me, aes(xintercept=price, colour=cut),
             linetype="dashed", size=0.5) +
  scale_x_log10() +
  geom_text(data = dia_me, aes(label = price, y=1, x=price))
(I'm assigning a constant value to the y aesthetics in the geom_text function because it's mandatory)

This might be a start (although the colors make it hard to read). My idea was to add a y position to the data frame used to draw the median lines. It's a bit arbitrary, but I wanted the y positions to lie between 0.2 and 1 so that they fit nicely on the plot, so I generated them with seq(). I then tried to order them by median price, which is also arbitrary and didn't help much.
# scatter y-pos over plot
dia_me$y_pos <- seq(0.2, 1, length.out = nrow(dia_me))[order(dia_me$price, decreasing = TRUE)]
ggplot(dia, aes(x=price, y=..density.., color = cut, fill = cut), legend=TRUE) +
  labs(title="diamond price per cut") +
  geom_density(alpha = 0.2) +
  geom_vline(data=dia_me, aes(xintercept=price, colour=cut),
             linetype="dashed", size=0.5) +
  scale_x_log10() +
  geom_text(data = dia_me, aes(label = price, y=y_pos, x=price))
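
The question also asks about placing each label at the maximum density of its cut. A minimal sketch of that idea follows; it is not part of the original answer. It assumes that calling base R's density() on log10(price) approximates what geom_density() draws after scale_x_log10() (same default kernel and bandwidth rule, but a slightly different evaluation grid, so the peaks match only approximately). The names peak_by_cut and p are made up for this illustration.

```r
library(ggplot2)

# Median price per cut (base R alternative to the plyr call)
dia_me <- aggregate(price ~ cut, data = diamonds, FUN = median)

# Approximate peak of each cut's density curve. The log10 transform mirrors
# scale_x_log10(), which is applied before the density is estimated.
peak_by_cut <- sapply(split(log10(diamonds$price), diamonds$cut),
                      function(x) max(density(x)$y))
dia_me$y_pos <- unname(peak_by_cut[as.character(dia_me$cut)])

p <- ggplot(diamonds, aes(x = price, colour = cut, fill = cut)) +
  geom_density(alpha = 0.2) +
  geom_vline(data = dia_me, aes(xintercept = price, colour = cut),
             linetype = "dashed") +
  scale_x_log10() +
  # one label per cut, drawn just above that cut's (approximate) peak
  geom_text(data = dia_me, aes(x = price, y = y_pos, label = price),
            vjust = -0.5, show.legend = FALSE)
p
```

Because several medians are close together on the x axis, the labels may still crowd horizontally even though their heights now differ.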

Related

Generating Statistics Summary from a ggplot in R

I'm an R novice working on a project with a script provided by my professor, and I'm having trouble getting a mean for my data that matches the box plot I created. The mean in this plot is below 300 kg per stem, but when I use
ggsummarystats( DBHdata, x = "location", y = "biomassKeith_and_Camphor", ggfunc = ggboxplot, add = "jitter" )
or
tapply(DBHdata$biomassBrown_and_Camphor, DBHdata$location, mean)
I end up with means over 600 kg/stem. Is there a way to produce summary statistics in the code for my box plot?
Box and Whisker plot of kg per stem

Boxplots show the median, not the mean, so this could explain the discrepancy you are seeing in your calculations.

Additionally, the data appear to be strongly skewed towards large values, so a mean of over 600 despite medians of around 200 is not surprising.
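
As a small illustration of how skew pulls the mean above the median, here is a sketch with made-up biomass values (they are hypothetical and not taken from the asker's DBHdata):

```r
# Hypothetical biomass values (kg per stem): six typical stems and one very large one
stems <- c(150, 180, 200, 210, 250, 300, 4000)

stem_median <- median(stems)  # the middle value, unaffected by the outlier
stem_mean   <- mean(stems)    # pulled upwards by the single large stem

stem_median
stem_mean
```

The median stays with the bulk of the data, while one large stem is enough to move the mean well above anything the box itself covers.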

As others have pointed out, a boxplot shows the median by default.
If you want to get the mean with ggsummarystats (from the ggpubr package), you can change the statistics it reports with the summaries argument, as such:
ggsummarystats(DBHdata, x = "location", y = "biomassKeith_and_Camphor",
               ggfunc = ggboxplot, add = "jitter",
               summaries = c("n", "median", "iqr", "mean"))
This adds the mean alongside the standard output of n, median, and interquartile range (iqr).

I'm not sure I understand your question correctly, but first try calculating the group means with aggregate and then adding text labels with the means.
Sample code:
means <- aggregate(weight ~ group, PlantGrowth, mean)
library(ggplot2)
ggplot(PlantGrowth, aes(x=group, y=weight, fill=group)) +
  geom_boxplot() +
  stat_summary(fun=mean, colour="darkred", geom="point",
               shape=18, size=3, show.legend=FALSE) +
  geom_text(data = means, aes(label = weight, y = weight + 0.08))
Sample data:
data(PlantGrowth)

overlay the grand mean and se into a scatter dot

I have a dot plot created with ggplot, in which I plot every subject's individual responses. The subjects are organized into 3 groups in the plot, and I have also estimated and plotted the mean and SE for each subject. Now I want to add the grand mean and SE for each group to the same plot.
This is how I created the first plot:
mazeSRDataS1_Errorplot <- ggplot(mazeSRDataS1, aes(Errorfixed, GroupSub,
                                                   colour = as.factor(Group))) +
  geom_point() +
  mytheme3 +
  ggtitle("mazeSR-S1 Error plot") +
  labs(y = "Subject ID", x = "Error (degrees)", colour = "Group") +
  scale_colour_manual(values = c("brown4", "slategray3", "tan1"))
mazeSRDataS1_Errorplot +
  stat_summary(fun = mean, position = 'dodge', shape = 1, size = 0.5, colour = 'black') +
  stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', colour = 'black')
This is how I plotted the grand mean and SE for each group (I first aggregated the data and computed the mean and SE for each group):
ggplot(meanSEErrorMazeSR1, aes(x=Error, y=Group, colour=Group)) +
  geom_errorbar(aes(xmin=Error-se, xmax=Error+se), width=.1, position='dodge') +
  geom_line(position='dodge') + geom_point(position='dodge')
But, how do I merge these plots and overlay the one over the other?
Thank you in advance!!

You can add y-axis positions to the aggregated data to specify where on the first plot the summaries should be drawn, and then add another geom_errorbar(data = ...) that uses the aggregated data, e.g.:
library(dplyr)
# since you didn't provide a reproducible example, you'll need to figure out the best positions yourself here
meanSEErrorMazeSR1 <-
  meanSEErrorMazeSR1 %>%
  mutate(y_position = c(30, 90, 150))
mazeSRDataS1_Errorplot +
  geom_errorbar(data = meanSEErrorMazeSR1, aes(y = y_position, xmin=Error-se, xmax=Error+se), width=.1)
You can toy around with different y-values to position the error bars. In your case, because the y-axis is discrete (it is based on Subject IDs), the y-values correspond to the order of the subjects on the plot: the y_position = c(30, 90, 150) above corresponds to the 30th, 90th, and 150th subject, respectively.
Note also that the argument position='dodge' is not needed because you're not using a group aesthetic!
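
Since the question's data are not available, here is a self-contained sketch of the same overlay idea on the built-in PlantGrowth data. It is an illustration under assumptions rather than the asker's setup: each row stands in for one response, group_stats is a made-up name for the aggregated table, and horizontal error bars drawn with geom_errorbar() assume ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Aggregated table: one row per group with its mean and standard error
group_stats <- data.frame(
  group = levels(PlantGrowth$group),
  mean  = as.numeric(tapply(PlantGrowth$weight, PlantGrowth$group, mean)),
  se    = as.numeric(tapply(PlantGrowth$weight, PlantGrowth$group,
                            function(x) sd(x) / sqrt(length(x))))
)

p <- ggplot(PlantGrowth, aes(x = weight, y = group, colour = group)) +
  # individual responses, jittered vertically so they don't overlap
  geom_point(position = position_jitter(height = 0.15, width = 0)) +
  # group mean +/- SE from the aggregated table, overlaid in black;
  # inherit.aes = FALSE stops the layer looking for columns it doesn't have
  geom_errorbar(data = group_stats,
                aes(y = group, xmin = mean - se, xmax = mean + se),
                width = 0.2, colour = "black", inherit.aes = FALSE) +
  geom_point(data = group_stats, aes(x = mean, y = group),
             colour = "black", shape = 18, size = 4, inherit.aes = FALSE)
p
```

Passing a second data frame to an individual layer is what lets one plot combine raw points with pre-computed summaries.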

ggplot: why is the y-scale larger than the actual values for each response?

Likely a dumb question, but I cannot seem to find a solution: I am trying to graph a categorical variable on the x-axis (3 groups) and a continuous variable (a percentage from 0 to 100) on the y-axis. When I do so, I have to either specify stat = "identity" in geom_bar or use geom_col.
However, the values still show up at 4000 on the y-axis, even after following the comments from Y-scale issue in ggplot and from Why is the value of y bar larger than the actual range of y in stacked bar plot?.
Here is how the graph keeps coming out:
I also double checked that the x variable is a factor and the y variable is numeric. Why would this still be coming out at 4000 instead of 100, like a percentage?
EDIT:
The y-values are simply responses from participants. I have a large dataset (N = 600) and the y-values are percentages from 0-100 given by each participant. So, in each group (N = 200 per group), I have a value for the percentage. I wanted to visually compare the three groups based on the percentages they gave.
This is the code I used to plot the graph.
df$group <- as.factor(df$group)
df$confid <- as.numeric(df$confid)
library(ggplot2)
plot <- ggplot(df, aes(group, confid)) +
  geom_col() +
  ylab("confid %") +
  xlab("group")

Are you perhaps trying to plot the mean percentage in each group? Otherwise, it is not clear how a bar plot could easily represent what you are looking for. You could perhaps add error bars to give an idea of the spread of responses.
Suppose your data looks like this:
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
confid = sample(40, 600, TRUE))
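Using your plotting code, we get very similar results to yours. geom_col draws one bar segment per row and stacks them, so each bar's height is the sum of all 200 responses in that group rather than a single percentage:
library(ggplot2)
plot <- ggplot(df, aes(group, confid)) +
  geom_col() +
  ylab("confid %") +
  xlab("group")
plot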
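
The bar heights can be checked numerically. The sketch below rebuilds the same simulated data and compares per-group sums (what the stacked geom_col bars reach) with per-group means; group_sums and group_means are names introduced only for this check.

```r
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
                 confid = sample(40, 600, TRUE))

# Height of each stacked bar: the sum of the 200 responses in the group
group_sums <- tapply(df$confid, df$group, sum)

# What a "typical response" bar would show instead
group_means <- tapply(df$confid, df$group, mean)

group_sums
group_means
```

Each sum is 200 times the corresponding mean, which is why the y-axis runs into the thousands.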
However, if we use stat_summary, we can instead plot the mean and standard error for each group:
ggplot(df, aes(group, confid)) +
  stat_summary(geom = "bar", fun = mean, width = 0.6,
               fill = "deepskyblue", color = "gray50") +
  geom_errorbar(stat = "summary", width = 0.5) +
  geom_point(stat = "summary") +
  ylab("confid %") +
  xlab("group")

R-ggplot plot median with ranked values

I'm trying to make a plot of the median of my ranked data. And under this plot I'm trying to plot the ranked value.
example data:
test=data.frame(a=rep(seq(-5,5,by=0.1),each=1,length.out=101),b=runif(101, min=-5, max=5))
test$range=rep(seq(1, 101, by=1), each=1,length.out=length(test[,1]))
So I'm trying to plot only the median.
I tried :
ggplot(data = test) +
  stat_summary(mapping = aes(x = range, y = b),
               fun.y = median)
But I got a Warning message : Removed 101 rows containing missing values (geom_pointrange).
I got it with this command :
ggplot(test, aes(x = range, y = b, color = b)) +
  geom_line(size = 0.5) +
  geom_smooth(aes(color=..y..), size=1.5, method = "loess", se=FALSE) +
  scale_colour_gradient2(low = "green", mid = "yellow", high = "red",
                         midpoint=median(test$b))
but it's not exactly what I want, I want only the median.
I also want to plot the values of test$a under this plot, but I have no idea how to do this:
Thank you !

So I'm confused by some things. First, the first plot you show has b on the y-axis, yet the code implies you're plotting a on the y-axis. So do you want the median of a or b? Also, I don't understand what the range variable is supposed to do.
That said, maybe this will be of some help. I assumed the variable of interest was b. We can make something resembling your min-to-max illustration by plotting b as a function of its rank. Next we can add a horizontal line at the height of the median.
ggplot(test, aes(rank(b), b)) +
  geom_line() +
  geom_hline(yintercept = median(test$b))
Which gave me a plot like this:
Hope this was of some help!

ggplot2 - box plot questions on misalignment

I am having the following challenges making a plot:
I am trying to do a 'grouped' box plot, but the box plots are not showing up near the corresponding x-axis group, so it's not easy to see which group each plot belongs to.
I am trying to add in an icon for the 'mean' value which right now is a triangle. However these aren't moving with the grouped boxplots.
I don't want the triangle icon for the mean value to show up in the legend - I can't figure out how to remove this.
Whenever I try to add text I just want one value either for the median or the mean - not something repeated 50x.
Boxplot
library(ggplot2)
library(ggthemes)
library(RColorBrewer)
library(reshape2)  # provides the tips dataset
ggplot(tips, aes(x = day, y = total_bill, fill=sex)) + #grouping factor, y variable
  geom_boxplot(position = position_dodge(width = 1.2)) + # how to color
  labs(title ='Barchart Plot', x=' xaxis label',y='ylabel') +
  scale_fill_brewer(palette="Dark2")+ #use Dark2, Paired, Set1
  theme(axis.text.x = element_text(colour="black",size=14,angle=45,
                                   hjust=.5,vjust=.5,face="bold"),
        axis.text.y = element_text(colour="grey20",size=16,angle=45,
                                   hjust=1,vjust=0,face="plain"),
        axis.title.x = element_text(colour="grey20",size=12,angle=45,
                                    hjust=.5,vjust=0,face="plain"),
        axis.title.y = element_text(colour="blue",size=16,angle=90,
                                    hjust=0.5,vjust=.5,face='bold')) +
  stat_summary(fun.y=mean, geom="point", shape=17, size=4) +
  theme_base() +
  geom_text(label='just the mean or median please - number only')
