Indicating the statistically significant difference in bar graph USING R

This is a repeat of a question originally asked here: Indicating the statistically significant difference in bar graph but asked for R instead of python.
My question is very simple. I want to produce barplots in R, using ggplot2 if possible, with an indication of significant difference between the different bars, e.g. produce something like this. I have had a search around but can't find another question asking exactly the same thing.

I know that this is an old question and the answer by Didzis Elferts already provides one solution for the problem. But I recently created a ggplot-extension that simplifies the whole process of adding significance bars: ggsignif
Instead of tediously adding the geom_path and annotate to your plot you just add a single layer geom_signif:
library(ggplot2)
library(ggsignif)
ggplot(iris, aes(x=Species, y=Sepal.Length)) +
geom_boxplot() +
geom_signif(comparisons = list(c("versicolor", "virginica")),
map_signif_level=TRUE)
Full documentation of the package is available at CRAN.

You can use geom_path() and annotate() to get a similar result. For this example you have to determine suitable positions yourself. In geom_path(), four x and four y values are provided for each bracket, so that the connecting lines get small ticks at both ends.
df<-data.frame(group=c("A","B","C","D"),numb=c(12,24,36,48))
g<-ggplot(df,aes(group,numb))+geom_bar(stat="identity")
g+geom_path(x=c(1,1,2,2),y=c(25,26,26,25))+
geom_path(x=c(2,2,3,3),y=c(37,38,38,37))+
geom_path(x=c(3,3,4,4),y=c(49,50,50,49))+
annotate("text",x=1.5,y=27,label="p=0.012")+
annotate("text",x=2.5,y=39,label="p<0.0001")+
annotate("text",x=3.5,y=51,label="p<0.0001")
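The positions in the code above are typed in by hand. As a rough sketch of how they could be derived from the bar heights instead, the snippet below builds the same brackets with a small helper. The helper make_bracket() and its gap and tick arguments are hypothetical names introduced for illustration, not part of ggplot2.

```r
library(ggplot2)

df <- data.frame(group = c("A", "B", "C", "D"), numb = c(12, 24, 36, 48))

# Hypothetical helper: the four corner points of a bracket joining bars i and
# i + 1, starting `gap` units above the taller bar, with ticks of height `tick`.
make_bracket <- function(i, heights, gap = 1, tick = 1) {
  top <- max(heights[i], heights[i + 1]) + gap + tick
  data.frame(
    id = i,
    x  = c(i, i, i + 1, i + 1),
    y  = c(top - tick, top, top, top - tick)
  )
}

# One bracket per pair of neighbouring bars.
brackets <- do.call(rbind, lapply(1:3, make_bracket, heights = df$numb))

# Labels are centred over each bracket and placed one unit above it.
labels <- data.frame(
  x     = 1:3 + 0.5,
  y     = as.numeric(tapply(brackets$y, brackets$id, max)) + 1,
  label = c("p=0.012", "p<0.0001", "p<0.0001")
)

g <- ggplot(df, aes(group, numb)) +
  geom_col() +
  geom_path(data = brackets, aes(x, y, group = id), inherit.aes = FALSE) +
  geom_text(data = labels, aes(x, y, label = label), inherit.aes = FALSE)
```

With the default gap and tick of 1, this reproduces the hand-picked coordinates used above. The p-value labels are still supplied manually.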

I used the suggested method from above, but I found the annotate function easier for making lines than the geom_path function. Just use "segment" instead of "text". You have to break things up by segment and define starting and ending x and y values for each line segment.
Example for making three line segments:
annotate("segment", x=c(1,1,2),xend=c(1,2,2), y= c(125,130,130), yend=c(130,130,125))
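To show the segment approach in a complete plot, here is a minimal sketch that reuses the four-bar data frame from the earlier answer. The coordinates are assumptions chosen to sit just above the first two bars.

```r
library(ggplot2)

df <- data.frame(group = c("A", "B", "C", "D"), numb = c(12, 24, 36, 48))

# Three segments forming one bracket between bars 1 and 2:
# left tick, horizontal line, right tick.
bracket <- data.frame(
  x    = c(1, 1, 2),
  xend = c(1, 2, 2),
  y    = c(25, 26, 26),
  yend = c(26, 26, 25)
)

g <- ggplot(df, aes(group, numb)) +
  geom_col() +
  annotate("segment",
           x = bracket$x, xend = bracket$xend,
           y = bracket$y, yend = bracket$yend) +
  annotate("text", x = 1.5, y = 27, label = "p=0.012")
```

Each segment ends where the next one starts, which is what makes the three pieces read as a single bracket.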


Remembering steps in R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed last year.
I am a beginner in R, so this might seem irrelevant, but can anyone tell me how to remember syntax, such as the arguments of ggplot, the tidyverse, or any other package?
There are a few ways to do that. You can start typing the function name and press TAB; its arguments will appear in a pop-up. You can also check the cheatsheets; here are some examples: https://www.rstudio.com/resources/cheatsheets/
Or you can check the help topic by typing ? before the function name, for example: ?ggplot
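Besides TAB completion and the help pages, the arguments of a function can also be inspected from the console. A small sketch using base R only, with sd() as an arbitrary example function:

```r
# List the argument names of a function without opening the help page.
arg_names <- names(formals(sd))   # sd() comes from the stats package
print(arg_names)

# Show the full signature, including default values.
print(args(sd))

# Open the help page; either line below does the same as typing ?sd.
# help("sd")
# ?sd
```

The same calls work for package functions once the package is loaded, for example names(formals(ggplot2::geom_point)).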
OP, your question does not relate to coding per se - no problem to solve via issues with code - so it's not really supposed to be on SO. With that being said, it is a viable question and very daunting to approach using ggplot2 to create plots when you really don't have the background for doing so. Consequently, I think you still deserve a good answer, so here are some principles to help out a new user.
Know where to get information
The biggest help to offer is to practice. You will become more familiar with usage, but even "the pros" forget the argument syntax and what stuff does. In this case, the following is helpful:
Use RStudio. The base R terminal is fully capable; however, RStudio brings a ton of conveniences that make programming in R so much easier. Tooltips are an important part of how I create and use functions in R. If you start typing out a function, you'll be presented with a short list of arguments:
What's more, you can start typing an argument and you'll get a description from the help directly within RStudio:
Check the help for functions. This one should be obvious, but I am constantly checking the help for functions on CRAN. This is easily done in RStudio by typing ? before the function. So, if I need to know the arguments and syntax for geom_point(), I'll type ?geom_point into the console and you'll get the documentation directly within RStudio.
Online Resources. A quick search online can give you a lot of information (maybe even this answer). There are a lot of other resources too.
Become familiar with the Principles of plotting in ggplot2
Knowing where to get information is helpful, but sometimes you feel so lost that you don't even know what information you actually are looking to get. This is the crux of many of the questions here on SO related to ggplot2, which is: "how can I change my axes?", "How do I change colors in the plot?", or "How can I get a legend to show x, y, or z?". Sometimes you can google, but often it's not even clear what you are looking to find.
This is where a fundamental understanding of how to create a plot in ggplot2 becomes useful. I'll go through how I always approach plotting in ggplot2 and hopefully this will help you out a bit.
Step 1 - prepare data
Preparing your data for plotting is exceptionally useful, and sometimes difficult to do. It's a bit beyond what I intend to communicate here, but a mandatory piece of reading would be regarding Tidy Data Principles.
Step 2 - Think about Mapping
Mapping is often overlooked in the process, but in short, this is how the columns of your dataset relate to the plot. It's easy to say "this column will be my x axis" and "this column will be my y axis", but you should also be clear on whether the values of other columns will relate to color, fill, size, shape, and so on. Thinking this way, it will soon be quite obvious why you would want to get Step 1 correct above, because only tidy data can be used directly in a mapping without issue.
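As a small illustration of Steps 1 and 2 together, the sketch below reshapes a made-up wide table into long ("tidy") form so that every aesthetic maps to exactly one column. The data frame and its column names are hypothetical.

```r
library(ggplot2)

# Hypothetical wide table: one column per measured series.
wide <- data.frame(day = 1:3, a = c(2, 4, 6), b = c(1, 3, 5))

# Reshape to long form with base R: one row per observation.
long <- data.frame(
  day    = rep(wide$day, times = 2),
  series = rep(c("a", "b"), each = nrow(wide)),
  value  = c(wide$a, wide$b)
)

# x, y and color each map to a single column of the long table.
p <- ggplot(long, aes(x = day, y = value, color = series)) +
  geom_line()
```

In the wide layout there is no single column to hand to color=, which is why the reshaping step comes first.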
Step 3 - The Fundamental ggplot() call
The first step in plotting will be your first call to ggplot(). Here you need to assign the data, for example via df %>% ggplot(...) or ggplot(data=df, ...). This is also typically where you would set up at least your x and y axes via mapping. You can stop there (x and y axes), or you can specify the other aesthetics in the mapping here too. On its own, this call just "sets up" the plot. If we plot the result, you get the following:
p <- ggplot(mtcars, aes(disp, mpg))
p
Step 4 - Add your geoms
A "geom" (short for "geometry") describes the shapes and "things" on your plot that will be positioned on the x and y axes. You can add any number, but in this example, we'll add points. If all you want to do is plot the observations at the x and y axes, you just need to add geom_point() and that should be enough:
p + geom_point()
Step 5 - Adding Legends
Note we don't have a legend yet. This is because there are no aesthetics mapped other than x and y. ggplot2 creates legends automatically when you specify in the mapping (via aes()) a characteristic way of differentiating the way we draw a geom. In other words, we can describe color= within aes() and that will initiate the creation of a legend. You can do the same with other aesthetics too.
p + geom_point(aes(color=cyl))
This creates a legend type depending on the type of data mapped. So, a colorbar legend is created here because the column mtcars$cyl is numeric. If we use a non-numeric column, you get a discrete legend:
p + geom_point(aes(color=rownames(mtcars)))
There's advanced stuff too... but not covered here.
Step 6 - Adjusting the Scales
All you do when you specify a mapping (i.e. aes(color = ...)) is define how the data is mapped to that aesthetic. This does not specify the actual color to be used. If you don't specify one, the default coloring or sizing is used, but sometimes you want to change that. You can do that via the scale_*_*() functions, of which there are many depending on your application. For information on color scales, you can see this answer here, but suffice it to say this is quite a complicated part of plotting that depends greatly on what you want to do. Many of the scale_*_*() functions are structured similarly, so you can probably get an idea of what is possible from that answer. Here's an example of how we can adjust the color with one of these functions:
p + geom_point(aes(color=cyl)) +
scale_color_gradient(low="red", high="green")
Step 7 - Adjusting Labels
Here I usually add the plot labels and axis labels. You can conveniently use ylab(), xlab() or ggtitle() to assign axis labels and the title, or just define them all together with labs(y = ..., x = ..., title = ...). You can also use this time to format and arrange things associated with legends and scales (tick marks and whatnot) via guides(...) (for legends) or the scale_x_*() and scale_y_*() functions (for tick marks on axes).
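A minimal sketch of this step on its own, continuing the mtcars example. The label text and tick positions are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point(aes(color = cyl))

# Set all labels in one call, then control the x-axis tick marks.
p_labeled <- p +
  labs(x = "Displacement (cu. in.)",
       y = "Miles per gallon",
       color = "Cylinders",
       title = "Fuel economy vs. displacement") +
  scale_x_continuous(breaks = seq(100, 500, by = 100))
```

Naming color= inside labs() sets the legend title, because the legend belongs to the color aesthetic.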
Step 8 - Theme Elements
Finally, you can change the overall look with various ggplot themes. An account of default themes is given here, but you can extend that with the ggthemes package to get more. You might want to just change a few specific elements of size, color, linetype, etc. on the plot. You can address these specific elements via theme(). A helpful list of theme elements is given here.
So, putting it all together you have:
# initial call
ggplot(mtcars, aes(disp, mpg)) +
# geoms
geom_point(aes(color=cyl), size=3) +
# define the color scale
scale_color_viridis_c() +
# define labels and ticks and stuff
# axis
scale_x_continuous(breaks = seq(0, 600, by=50)) +
# legend ticks
guides(color=guide_colorbar(ticks.colour = "black", ticks.linewidth = 3)) +
# Labels
labs(x="Disp", y="Miles per gallon (mpg)", color = "# of \ncylinders", title="Ugly Plot 1.0") +
# theme and theme elements
theme_bw() +
theme(
panel.background = element_rect(fill="gray90"),
panel.grid.major = element_line(color="gray20", linetype=2, size=0.2),
panel.grid.minor = element_line(color="gray70", linetype=2, size=0.1),
axis.text = element_text(size=12, face = "bold"),
axis.text.x = element_text(angle=30, hjust=1)
)
It's a lot of steps, but I break it down like that basically every time. When plot code gets large, I break up the chunks much in that manner above to help clear my mind on how to create the plot.

Breaking value axis using ggplot2 [duplicate]

This question already has answers here:
Using ggplot2, can I insert a break in the axis?
(10 answers)
Closed 3 years ago.
I have used Thinkcell, and one of its cool features is that it breaks very long y-axis to fit the graph. I am not sure whether we can do this with ggplot2. I am a beginner in ggplot2. So, I'd appreciate any thoughts.
For example:
Series <- c(1:6)
Values <- c(899, 543, 787, 35323, 121, 234)
df_val_break <- data.frame(Series, Values)
ggplot(data=df_val_break, aes(x=Series, y=Values)) +
geom_bar(stat="identity")
This creates a graph like this:
However, I want a graph that looks something like this:
However, it seems that broken axis is not supported in ggplot2 because it's misleading (Source: Using ggplot2, can I insert a break in the axis?). This thread suggests a couple of things--faceting and tables.
I like tables, but I don't like faceting because the levels of my categorical variable "Series" are closely related. Moreover, I'd prefer Excel for drawing tables; it's fast.
I have two questions:
Question 1: One of the options I liked is at https://stats.stackexchange.com/questions/1764/what-are-alternatives-to-broken-axes. I am unable to replicate a similar graph because of the scaling issue.
Question 2: This is a minor question just in case there were new packages introduced that might help us to do this. (The linked SO thread above is older than 5 years. ) Are there any other options on the table?
Update: I don't think my question is a duplicate, for two reasons: a) I have already gone through the indicated thread and have referenced it here, explaining that I am looking for a solution that looks like the third graph in my post. Specifically, I am looking to plot both graphs, one with the shorter scale and the other at 1/20 scale, in one graph. I am unable to do this using ggplot2 because of the scale issue: either both sub-graphs get scaled to 1/nth, or one of them gets scaled to the normal range. b) I believe this version is much more relatable for a non-technical audience who don't understand log and inverse transformations.
I took a stab at this one. I'm a beginner, so I am not sure whether this can be improved further in terms of placement of text. I struggled with fitting both the high growth rate series and the low growth rate series in one graph because of the different scales. So, I used faceting.
Here's the code:
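# Assumption: the post does not show how the Modified column was built;
# here the very large value is flagged with an arbitrary threshold.
df_val_break$Modified <- ifelse(df_val_break$Values > 10000, "HIGH_GROWTH", "LOW_GROWTH")
ggplot(data = df_val_break,aes(x=Series,y=Values)) +
geom_bar(stat = "identity") +
facet_wrap(~Modified) +
geom_text(data = df_val_break[df_val_break$Modified=="HIGH_GROWTH",], aes(label = "x20 growth rate"),hjust=0.5, vjust=0)
ggsave("post.png")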
Here's the output:
There are quite a few issues that I see:
a) High_growth rate graph has Series 2 and Series 6 on the x-axis, although we don't need them. I don't know how to turn them off.
b) geom_text overlaps with the bar. This looks a little annoying.
c) I believe that the graph is a little misleading, especially for the HIGH_GROWTH section, because its y-axis isn't scaled with LOW_GROWTH. I was originally thinking of showing two different y-axes, one scaled by 1/20 and the other unscaled.
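Issues a) and b) might be addressed along the following lines. This is only a sketch: the way Modified is derived is an assumption (the post does not show it), and treating Series as a factor with free facet scales is one option among several. It does not resolve the concern in c).

```r
library(ggplot2)

df_val_break <- data.frame(
  Series = 1:6,
  Values = c(899, 543, 787, 35323, 121, 234)
)

# Assumption: flag the very large value with an arbitrary threshold.
df_val_break$Modified <- ifelse(df_val_break$Values > 10000,
                                "HIGH_GROWTH", "LOW_GROWTH")

p <- ggplot(df_val_break, aes(x = factor(Series), y = Values)) +
  geom_col() +
  # Free scales: each panel shows only its own Series levels (issue a)
  # and gets its own y-axis range.
  facet_wrap(~Modified, scales = "free") +
  # A negative vjust lifts the label above the top of the bar (issue b).
  geom_text(data = df_val_break[df_val_break$Modified == "HIGH_GROWTH", ],
            aes(label = "x20 growth rate"), vjust = -0.5) +
  # Extra headroom at the top so the label is not clipped.
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(x = "Series")
```

expansion() requires ggplot2 3.3.0 or later.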

ggplot2 geom_violin with 0 variance

I have started to really like violin plots, since they give me a much better feel than box plots when you have funny distributions. I like to automate a lot of stuff, and thus ran into a problem:
When one variable has 0 variance, the boxplot just gives you a line at that point. geom_violin, however, terminates with an error. What behavior would I like? Well, either put in a line or nothing, but please give me the distributions for the other variables.
Ok, quick example:
dff=data.frame(x=factor(rep(1:2,each=100)),y=c(rnorm(100),rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin()
yields
Error in `$<-.data.frame`(`*tmp*`, "n", value = 100L) :
replacement has 1 row, data has 0
However, what works is:
ggplot(dff,aes(x=x,y=y)) + geom_boxplot()
Update:
The issue is resolved as of yesterday: https://github.com/hadley/ggplot2/issues/972
Update 2:
(from question author)
Wow, Hadley himself responded! geom_violin now behaves consistently with geom_density and base R density.
However, I don't think the behavior is optimal yet.
(1) The 'zero' problem
Just run it with my original example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rnorm(100), rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Yielding this:
Is the plot on the right an appropriate representation of 'all zeroes'? I don't think so. It is better to have trimming that produces a single line to show that there is no variation in the data.
Workaround solution: Add a + geom_boxplot()
(2) I may actually want trim=TRUE.
Example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rgamma(100,1,1), rep(0,100) ))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Now I have non-negative data, and standard kernel density estimates don't handle this correctly. With trim=TRUE I can quickly see that the data is strictly positive.
I am not arguing that the current behavior is 'wrong', since it's in line with other functions. However, geom_violin may be used in different contexts, for exploring different data.frames with heterogeneous data types (positive+skewed or not, for instance).
Three options for dealing with this until the ggplot2 issue is resolved:
As a quick hack, you can set one of the y-values to 0.0001 (instead of zero) and geom_violin will work.
Check out the vioplot package if you're not set on using ggplot2. vioplot doesn't throw an error when you feed it a bunch of identical values.
The Hmisc package includes a panel.bpplot (box-percentile plot) function that can create violin plots with the bwplot function from the lattice package. See the Examples section of ?panel.bpplot. It produces a single line when you feed it a vector of identical values.
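Another possible workaround that stays within ggplot2, combining the ideas above: draw violins only for groups that actually vary and fall back to a boxplot (which collapses to a line) for the rest. This is a sketch, not the behavior of any ggplot2 option; the variable names group_var and has_spread are made up for the example.

```r
library(ggplot2)

set.seed(1)
dff <- data.frame(
  x = factor(rep(1:2, each = 100)),
  y = c(rnorm(100), rep(0, 100))
)

# Per-group variance; groups with zero variance cannot get a density estimate.
group_var  <- tapply(dff$y, dff$x, var)
has_spread <- names(group_var)[group_var > 0]

p <- ggplot(dff, aes(x = x, y = y)) +
  # Violins only where the data varies.
  geom_violin(data = dff[dff$x %in% has_spread, ]) +
  # Constant groups are drawn as a boxplot, i.e. a single line.
  geom_boxplot(data = dff[!dff$x %in% has_spread, ])
```

Because subsetting keeps the factor levels of x, both groups keep their positions on the axis.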

How to make an R barplot with a log y-axis scale?

This should be a simple question... I'm just trying to make a barplot from a vector in R, but want the values to be shown on a log scale, with y-axis tick marks and labelling. I can make the normal barplot just fine, but when I try to use log or labelling, things go south.
Here is my current code:
samples <- c(10,2,5,1,2,2,10,20,150,23,250,2,1,500)
barplot(samples)
Ok, this works. Then I try to use the log="" argument described in the barplot manual, and it never works. Here are some stupid attempts I have tried:
barplot(samples, log="yes")
barplot(samples, log="TRUE")
barplot(log=samples)
Can someone please help me out here? Also, the labelling would be great too. Thanks!
The log argument wants a one- or two-character string specifying which axes should be logarithmic. No, it doesn't make any sense for the x-axis of a barplot to be logarithmic, but this is a generic mechanism used by all of "base" graphics - see ?plot.default for details.
So what you want is
barplot(samples, log="y")
I can't help you with tick marks and labeling, I'm afraid; I threw over base graphics for ggplot years ago and never looked back.
This should get you started fiddling around with ggplot2.
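For the tick marks and labels in base graphics, something along these lines may work. It is a sketch: the tick positions and the ylim range are assumptions chosen to suit this particular vector.

```r
samples <- c(10, 2, 5, 1, 2, 2, 10, 20, 150, 23, 250, 2, 1, 500)

# Tick positions picked by hand for this data's range.
ticks <- c(1, 10, 100, 500)

# Suppress the default axes; barplot() returns the x midpoints of the bars.
# A lower ylim below 1 keeps the bars of height 1 visible on the log scale.
mids <- barplot(samples, log = "y", axes = FALSE, ylim = c(0.5, 1000),
                ylab = "samples (log scale)")

# Draw the y-axis with the chosen ticks and label each bar on the x-axis.
axis(2, at = ticks, labels = ticks, las = 1)
axis(1, at = mids, labels = seq_along(samples), tick = FALSE)
```

All values must be strictly positive for a log axis; a zero in samples would make barplot() fail.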
d<-data.frame(samples)
ggplot(data=d, aes(x=factor(1:length(samples)),y=samples)) +
geom_bar(stat="identity") +
scale_y_log10()
Within the scale_y_log10() function you can define breaks, labels, and more. Similarly, you can label the x-axis. For example
ggplot(data=d, aes(x=factor(1:length(samples)),y=samples)) +
geom_bar(stat="identity") +
scale_y_log10(breaks=c(1,5,10,50,100,500,1000),
labels=c(rep("label",7))) +
scale_x_discrete(labels=samples)

displaying stat_summary accurately on violin plots

I just started using ggplot2 on R and have a violin plot question.
I have a data set that can be accessed here: data.
The data comes from a study of making estimations. The variables of interest are the question.no (questions), condition, estimate.no (tr.est1 or tr.est2) and estimate.
The code below makes the plot look almost the way I want it to look at least for one question, yet the median dots generated by stat_summary() are displayed in between the "violins."
library(plyr)     # provides d_ply()
library(ggplot2)
v.data<-read.csv("data.csv")
# loop through each question number
d_ply(v.data, c("question.no"), function(d.plot){
# use the question number of the current subset, not the whole column
q.no <- d.plot$question.no[1]
plot.q <- ggplot(d.plot,aes(condition, estimate, fill=estimate.no)) +
geom_violin() +
stat_summary(fun.y="median", geom="point") +
scale_y_continuous('Change Scores') +
scale_x_discrete("Conditions")
ggsave(filename=paste(q.no,".png",sep=""), plot=plot.q)
})
My Question: How can I make the median dots display correctly on the "violins" rather than in between them?
I searched the previous questions asked on ggplot2 on this site and looked at the ggplot2 documentation as well as other R forums but have not been able to find anything relevant.
I would appreciate any comments and suggestions as to how I can fix it. Also, if the questions I ask are already answered somewhere else, I would appreciate links to those threads, too. Many thanks in advance.
stat_summary is limited to the variable that determines your x-axis. One way to convey the information you want would be to replace condition in your call to aes with interaction(condition, estimate.no).
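A sketch of that suggestion on simulated data, since the asker's data.csv is not available here. The data frame v.sim and its values are made up; only the column names follow the question.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for the asker's data.
v.sim <- data.frame(
  condition   = rep(c("c1", "c2"), each = 100),
  estimate.no = rep(c("tr.est1", "tr.est2"), times = 100),
  estimate    = rnorm(200)
)

# One x position per condition/estimate.no combination.
v.sim$cell <- interaction(v.sim$condition, v.sim$estimate.no)

p <- ggplot(v.sim, aes(cell, estimate, fill = estimate.no)) +
  geom_violin() +
  # `fun` is the argument name in ggplot2 >= 3.3.0 (older versions use fun.y).
  stat_summary(fun = median, geom = "point")
```

With every violin at its own x position, each median point lands on its violin. An alternative that keeps condition on the x-axis may be to give stat_summary() the same dodge as the violins, for example position = position_dodge(width = 0.9); that width is an assumption based on the default violin width.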
Plotluck is a library based on ggplot2 that aims at automating the choice of plot type based on characteristics of 1-3 variables. For your data set, the command plotluck(v.data, condition, estimate, question.no) generates the following plot:
Note that the library chose to scale y logarithmically. You can override this behavior with plotluck(v.data,condition,estimate,question.no,opts=plotluck.options(trans.log.thresh=1E20)) but it doesn't display well, and the median points look like they are all on the zero line.
