displaying stat_summary accurately on violin plots - r

I just started using ggplot2 on R and have a violin plot question.
I have a data set that can be accessed here: data.
The data comes from a study of making estimations. The variables of interest are the question.no (questions), condition, estimate.no (tr.est1 or tr.est2) and estimate.
The code below makes the plot look almost the way I want it to look at least for one question, yet the median dots generated by stat_summary() are displayed in between the "violins."
v.data<-read.csv("data.csv")
# loop through each question number
d_ply(v.data, c("question.no"), function(d.plot){
q.no <- v.data$question.no
plot.q <- ggplot(d.plot,aes(condition, estimate, fill=estimate.no)) +
geom_violin() +
stat_summary(fun.y="median", geom="point") +
scale_y_continuous('Change Scores') +
scale_x_discrete("Conditions")
ggsave(filename=paste(q.no,".png",sep=""))
})
My Question: How can I make the median dots display correctly on the "violins" rather than in between them?
I searched the previous questions asked on ggplot2 on this site and looked at the ggplot2 documentation as well as other R forums but have not been able to find anything relevant.
I would appreciate any comments and suggestions as to how I can fix it. Also, if the questions I ask are already answered somewhere else, I would appreciate the links to the threads,too. Many thanks in advance.

stat_summary is limited to the variable that determines your x-axis. One way to convey the information you want would be to replace condition in your call to aes with interaction(condition, estimate.no).

Plotluck is a library based on ggplot2 that aims at automating the choice of plot type based on characteristics of 1-3 variables. For your data set, the command plotluck(v.data, condition, estimate, question.no) generates the following plot:
Note that the library chose to scale y logarithmically. You can override this behavior with plotluck(v.data,condition,estimate,question.no,opts=plotluck.options(trans.log.thresh=1E20)) but it doesn't display well, and the median points look like they are all on the zero line.

Related

Pictured link is my coding. How do I make a proper good graph?

Okay so I have an assignment where I need to conduct a graph that best represents the before and after affects of two streams. The graph(s) have to contain means and standard error for each stream in each year.. I cannot figure the proper coding for the graph. I continue to get errors and bad graphs. I will attach a sample of what the data looks like too.
A sample of the data, it changes to after at 51
Try to post a reproducible error or specification of your problem.
As far as I can analyze your problem, you maybe should not create b4, because it does not seem to be an effective subset. If you want to assemble certain plots, you can use plot_grid from cowplot.
Otherwise you can add facet_wrap(~ VARIABLE_NAME) to ggplot in order to create many plots divided by deviating observations in the specified variable.
If you are not happy with the visual outcome and result of your graph, you can choose another theme, e.g. theme_bw() which can be simply added to your ggplot function. You can add and change further labels with labs() and theme().

Change colors in r plot

I am currently trying to plot some data and don't manage to obtain a nice result. I have a set of 51 individuals with each a specific value (Pn) and split within 14 groups. The closest thing I end up with is this kind of plot. I obtain it thanks to the simple code bellow, starting by ordering my values for the Individuals :
Individuals <- factor(Individuals,levels=Individuals[order(Pn)])
dotchart(Pn,label=Individuals,color=Groups)
The issue is that I only have 9 colors on this plot (so I lost information somehow) and I can't manage to find a way to apply manually one color per group.
I've also try to use the ggplot2 package by reading it could give nice looking things. In that case I can't manage to order properly the Individuals (the previous sorting doesn't seem to have any effect here), plus I end up with only different type of blue for the group representation which is not an efficient way to represent the information given by my data set. The plot I get is accessible here and I used the following code:
ggplot(data=gps)+geom_point(mapping=aes(x=Individuals, y=Pn, color=Groups))
I apologize if this question seems redundant but I couldn't figure a solution on my own, even following some answer given to others...
Thank you in advance!
EDIT: Using the RColorBrewer as suggested bellow sorted out the issue with the colors when I use the ggplot2 package.
I believe you are looking for the scale_color_manual() function within ggplot2. You didn't provide a reproducible example, but try something along the lines of this:
ggplot(data=gps, mapping=aes(x=Individuals, y=Pn, color=Groups))+
geom_point() +
scale_color_manual(values = c('GROUP1' = 'color_value_1',
'GROUP2' = 'color_value_2',
'GROUP3' = 'color_value_3'))
Replace GROUPX with the values inside your Group column, and replace color_value_x with whatever colors you want to use.
A good resource for further learning about ggplot2 is chapter 3 of R For Data Science, which you can read here: http://r4ds.had.co.nz/data-visualisation.html
I can't be sure without looking at your data, but it looks like Groups may be a numeric value. Try this:
gps$Groups <- as.factor(gps$Groups)
library(RColorBrewer)
ggplot(data=gps)+
geom_point(mapping=aes(x=Individuals, y=Pn, color=Groups))+
scale_colour_brewer(palette = "Set1")

Breaking value axis using ggplot2 [duplicate]

This question already has answers here:
Using ggplot2, can I insert a break in the axis?
(10 answers)
Closed 3 years ago.
I have used Thinkcell, and one of its cool features is that it breaks very long y-axis to fit the graph. I am not sure whether we can do this with ggplot2. I am a beginner in ggplot2. So, I'd appreciate any thoughts.
For example:
Series <- c(1:6)
Values <- c(899, 543, 787, 35323, 121, 234)
df_val_break <- data.frame(Series, Values)
ggplot(data=df_val_break, aes(x=Series, y=Values)) +
geom_bar(stat="identity")
This creates a graph like this:
However, I want a graph that looks something like this:
However, it seems that broken axis is not supported in ggplot2 because it's misleading (Source: Using ggplot2, can I insert a break in the axis?). This thread suggests a couple of things--faceting and tables.
While I like tables, but I don't like faceting because my categorical variable "Series" are closely related. Moreover, I'd prefer Excel for drawing tables--it's fast.
I have two questions:
Question 1: One of the options I liked is at https://stats.stackexchange.com/questions/1764/what-are-alternatives-to-broken-axes. The graph is at
.
I am unable to replicate similar graph because of the scaling issue.
Question 2: This is a minor question just in case there were new packages introduced that might help us to do this. (The linked SO thread above is older than 5 years. ) Are there any other options on the table?
Update: I don't think my question is duplicate for two reasons: a) I have already gone through the indicated thread, and have referenced here explaining that I am looking for a solution that looks like the third graph in my post. Specifically, I am looking to plot both the graphs--one with shorter scales and the other with 1/20 scale in one graph. I am unable to do this using ggplot2 because of scale issue. Either both the sub-graphs get scaled to 1/nth or one of them get scaled to normal range. I believe this version is much relatable for non-technical audience who don't understand log and Inverse transformation.
I took a stab at this one. I'm a beginner so I am not sure whether this can be improved further in terms of placement of text. I struggled with fitting both high growth rate series and low growth rate series in one graph because of different scales. So, I used facetting.
Here's the code:
ggplot(data = df_val_break,aes(x=Series,y=Values)) +
geom_bar(stat = "identity") +
facet_wrap(~Modified) +
geom_text(data = df_val_break[df_val_break$Modified=="HIGH_GROWTH",], aes(label = "x20 growth rate"),hjust=0.5, vjust=0)
ggsave("post.png")
Here's the output:
There are quite a few issues that I see:
a) High_growth rate graph has Series 2 and Series 6 on the x-axis, although we don't need them. I don't know how to turn them off.
b) geom_text overlaps with the bar. This looks a little annoying.
c) I'd believe that the graph is a little misleading, especially for HIGH_GROWTH section because the y-axis isn't scaled with LOW_GROWTH I was originally thinking of showing two different y-axis--one scaled by 1/20 and the other unscaled.

ggplot2 geom_violin with 0 variance

I started to really like violin plots, since they give me a much better feel that box plots when you have funny distributions. I like to automatize a lot of stuff, and thus ran into a problem:
When one variable has 0 variance, the boxplot just gives you a line at that point. Geom_violin however, terminates with an error. What behavior would I like? Well, either put in a line or nothing, but please give me the distributions for the other variables.
Ok, quick example:
dff=data.frame(x=factor(rep(1:2,each=100)),y=c(rnorm(100),rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin()
yields
Error in `$<-.data.frame`(`*tmp*`, "n", value = 100L) :
replacement has 1 row, data has 0
However, what works is:
ggplot(dff,aes(x=x,y=y)) + geom_boxplot()
Update:
The issue is resolved as of yesterday: https://github.com/hadley/ggplot2/issues/972
Update 2:
(from question author)
Wow, Hadley himself responded! geom_violin now behaves consistently with geom_density and base R density.
However, I don't think the behavior is optimal yet.
(1) The 'zero' problem
Just run it with my original example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rnorm(100), rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Yielding this:
Is the plot on the right an appropriate representation of 'all zeroes'? I don't think so. It is better to have trimming that produces a single line to show that there is no variation in the data.
Workaround solution: Add a + geom_boxplot()
(2) I may actually want TRIM=TRUE.
Example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rgamma(100,1,1), rep(0,100) ))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Now I have non-zero data, and standard kernel density estimates don't handle this correctly. With trim=T I can quickly see that the data is strictly positive.
I am not arguing that the current behavior is 'wrong', since it's in line with other functions. However, geom_violin may be used in different contexts, for exploring different data.frames with heterogeneous data types (positive+skewed or not, for instance).
Three options for dealing with this until the ggplot2 issue is resolved:
As a quick hack, you can set one of the y-values to 0.0001 (instead of zero) and geom_violin will work.
Check out the vioplot package if you're not set on using ggplot2. vioplot doesn't throw an error when you feed it a bunch of identical values.
The Hmisc package includes a panel.bpplot (box-percentile plot) function that can create violin plots with the bwplot function from the lattice package. See the Examples section of ?panel.bpplot. It produces a single line when you feed it a vector of identical values.

Indicating the statistically significant difference in bar graph USING R

This is a repeat of a question originally asked here: Indicating the statistically significant difference in bar graph but asked for R instead of python.
My question is very simple. I want to produce barplots in R, using ggplot2 if possible, with an indication of significant difference between the different bars, e.g. produce something like this. I have had a search around but can't find another question asking exactly the same thing.
I know that this is an old question and the answer by Didzis Elferts already provides one solution for the problem. But I recently created a ggplot-extension that simplifies the whole process of adding significance bars: ggsignif
Instead of tediously adding the geom_path and annotate to your plot you just add a single layer geom_signif:
library(ggplot2)
library(ggsignif)
ggplot(iris, aes(x=Species, y=Sepal.Length)) +
geom_boxplot() +
geom_signif(comparisons = list(c("versicolor", "virginica")),
map_signif_level=TRUE)
Full documentation of the package is available at CRAN.
You can use geom_path() and annotate() to get similar result. For this example you have to determine suitable position yourself. In geom_path() four numbers are provided to get those small ticks for connecting lines.
df<-data.frame(group=c("A","B","C","D"),numb=c(12,24,36,48))
g<-ggplot(df,aes(group,numb))+geom_bar(stat="identity")
g+geom_path(x=c(1,1,2,2),y=c(25,26,26,25))+
geom_path(x=c(2,2,3,3),y=c(37,38,38,37))+
geom_path(x=c(3,3,4,4),y=c(49,50,50,49))+
annotate("text",x=1.5,y=27,label="p=0.012")+
annotate("text",x=2.5,y=39,label="p<0.0001")+
annotate("text",x=3.5,y=51,label="p<0.0001")
I used the suggested method from above, but I found the annotate function easier for making lines than the geom_path function. Just use "segment" instead of "text". You have to break things up by segment and define starting and ending x and y values for each line segment.
example for making 3 lines segments:
annotate("segment", x=c(1,1,2),xend=c(1,2,2), y= c(125,130,130), yend=c(130,130,125))

Resources