ggplot2 geom_violin with 0 variance - r

I started to really like violin plots, since they give me a much better feel than box plots when you have funny distributions. I like to automate a lot of stuff, and thus ran into a problem:
When one variable has 0 variance, the boxplot just gives you a line at that point. geom_violin, however, terminates with an error. What behavior would I like? Well, either put in a line or nothing, but please give me the distributions for the other variables.
Ok, quick example:
library(ggplot2)
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rnorm(100), rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin()
yields
Error in `$<-.data.frame`(`*tmp*`, "n", value = 100L) :
replacement has 1 row, data has 0
However, what works is:
ggplot(dff,aes(x=x,y=y)) + geom_boxplot()
Update:
The issue is resolved as of yesterday: https://github.com/hadley/ggplot2/issues/972
Update 2:
(from question author)
Wow, Hadley himself responded! geom_violin now behaves consistently with geom_density and base R density.
However, I don't think the behavior is optimal yet.
(1) The 'zero' problem
Just run it with my original example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rnorm(100), rep(0,100)))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Yielding this:
Is the plot on the right an appropriate representation of 'all zeroes'? I don't think so. It is better to have trimming that produces a single line to show that there is no variation in the data.
Workaround solution: Add a + geom_boxplot()
(2) I may actually want trim=TRUE.
Example:
dff=data.frame(x=factor(rep(1:2, each=100)), y=c(rgamma(100,1,1), rep(0,100) ))
ggplot(dff,aes(x=x,y=y)) + geom_violin(trim=FALSE)
Now I have non-negative data, and standard kernel density estimates don't handle this correctly: with trim=FALSE the violin extends below zero. With trim=TRUE I can quickly see that the data is strictly positive.
I am not arguing that the current behavior is 'wrong', since it's in line with other functions. However, geom_violin may be used in different contexts, for exploring different data.frames with heterogeneous data types (positive+skewed or not, for instance).
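A minimal sketch of how the two points above could be combined, assuming a ggplot2 version that already includes the fix from the linked issue: trim=TRUE keeps each violin within the range of its data, and a narrow geom_boxplot() overlay marks the zero-variance group with a line. The width of 0.1 is an arbitrary choice for illustration, and how the constant group itself is drawn may differ between ggplot2 versions.

```r
library(ggplot2)

set.seed(1)  # reproducible draws
dff <- data.frame(x = factor(rep(1:2, each = 100)),
                  y = c(rgamma(100, 1, 1), rep(0, 100)))

p <- ggplot(dff, aes(x = x, y = y)) +
  geom_violin(trim = TRUE) +   # violins stop at the observed min and max
  geom_boxplot(width = 0.1)    # collapses to a line for the all-zero group
```

Call print(p) (or type p at the console) to draw the plot.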

Three options for dealing with this until the ggplot2 issue is resolved:
As a quick hack, you can set one of the y-values to 0.0001 (instead of zero) and geom_violin will work.
Check out the vioplot package if you're not set on using ggplot2. vioplot doesn't throw an error when you feed it a bunch of identical values.
The Hmisc package includes a panel.bpplot (box-percentile plot) function that can create violin plots with the bwplot function from the lattice package. See the Examples section of ?panel.bpplot. It produces a single line when you feed it a vector of identical values.
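The quick hack from the first option could be automated roughly as follows. This is only a sketch: nudge_constant_groups is a made-up helper name, not part of ggplot2 or any package, and the default eps of 1e-4 simply mirrors the 0.0001 mentioned above.

```r
# Hypothetical helper (not part of ggplot2): for every group whose values are
# all identical, shift the first value by `eps` so the group has non-zero variance.
nudge_constant_groups <- function(df, group, value, eps = 1e-4) {
  for (g in unique(df[[group]])) {
    idx <- which(df[[group]] == g)
    if (length(idx) > 1 && length(unique(df[[value]][idx])) == 1) {
      df[[value]][idx[1]] <- df[[value]][idx[1]] + eps
    }
  }
  df
}

set.seed(1)
dff <- data.frame(x = factor(rep(1:2, each = 100)),
                  y = c(rnorm(100), rep(0, 100)))
dff_fixed <- nudge_constant_groups(dff, "x", "y")

# Per-group standard deviations: group 2 is no longer exactly zero.
tapply(dff_fixed$y, dff_fixed$x, sd)

# With ggplot2 loaded, the plot from the question can then be drawn with:
# ggplot(dff_fixed, aes(x = x, y = y)) + geom_violin()
```

Only one value per constant group is touched, so the distortion of the data is limited to a single point per group.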

Related

Pyramid Plot using ggplot2 in R

I need help in removing/converting the negative values in the x-axis of the pyramid plot.
I was able to build the pyramid through https://walker-data.com/census-r/exploring-us-census-data-with-visualization.html subheading 4.5.2. I keep getting an error saying function not found: number_format.
I think scale_x_continuous is where I'd be able to change the - sign, but it returns an error every time.
So from the context of the plot, I am assuming that you don't want to convert the values themselves, but rather convert the x-axis so that the male and female sides of the pyramid are mirrored?
In which case, you can use abs. The function abs(x) computes the absolute value of x, which would remove the negative sign from your data's values.
Without seeing a reproducible example of your code (which should be included when you ask questions like this on Stack Overflow, see the package {reprex} for help with this), it's a little difficult to be sure exactly what you need to change to make the code work, but I think you should be on the right track with scale_x_continuous.
With regard to the error that you're receiving, it suggests that you haven't loaded the package that provides that function, {scales}, as suggested by stefan in the comments (and, as also noted there, scales::label_number has superseded scales::number_format, so you should use the former).
If you're using the scale_x_continuous code from the second plot in Section 4.5.2 of the link you have shared:
utah_pyramid +
  scale_x_continuous(
    labels = ~ number_format(scale = .001, suffix = "k")(abs(.x)),
    limits = 140000 * c(-1, 1)
  )
The number_format function isn't the part of the code that produces the absolute values; it converts the scale of the values to thousands. It is the abs(.x) part that removes the negative sign.
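As a hedged sketch of the same idea with the newer function: the labeller below uses scales::label_number() in place of number_format() and applies abs() before formatting. The name pyramid_labels is made up for this example, and utah_pyramid is the plot object from the linked chapter, so that part is left commented out.

```r
library(scales)

# Hypothetical helper: format axis breaks in thousands, dropping the sign.
pyramid_labels <- function(x) {
  label_number(scale = 0.001, suffix = "k")(abs(x))
}

# Negative and positive breaks get the same unsigned label style.
pyramid_labels(c(-100000, 0, 50000))

# Usage with the plot from the linked chapter (not defined here):
# utah_pyramid +
#   scale_x_continuous(labels = pyramid_labels, limits = 140000 * c(-1, 1))
```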

Pictured link is my coding. How do I make a proper good graph?

Okay, so I have an assignment where I need to construct a graph that best represents the before and after effects of two streams. The graph(s) have to contain means and standard errors for each stream in each year. I cannot figure out the proper code for the graph. I keep getting errors and bad graphs. I will attach a sample of what the data looks like too.
A sample of the data, it changes to after at 51
Try to post a reproducible error or specification of your problem.
As far as I can tell from your problem, you probably should not create b4, because it does not seem to be an effective subset. If you want to combine several plots, you can use plot_grid from cowplot.
Otherwise you can add facet_wrap(~ VARIABLE_NAME) to ggplot in order to create one panel per distinct value of the specified variable.
If you are not happy with the visual outcome and result of your graph, you can choose another theme, e.g. theme_bw() which can be simply added to your ggplot function. You can add and change further labels with labs() and theme().
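Since the original data are only available as a picture, here is a rough sketch on simulated data of how means and standard errors per stream and year could be computed and plotted. The column names stream, year and value, the year values and the simulated numbers are all assumptions made up for this example.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the assignment data (all names and values invented).
streams <- data.frame(
  stream = rep(c("A", "B"), each = 40),
  year   = rep(rep(c(2019, 2020), each = 20), times = 2),
  value  = rnorm(80, mean = 10, sd = 2)
)

# Mean and standard error (sd / sqrt(n)) for each stream in each year.
summ <- aggregate(value ~ stream + year, data = streams,
                  FUN = function(x) c(mean = mean(x), se = sd(x) / sqrt(length(x))))
summ <- do.call(data.frame, summ)  # flatten into columns value.mean and value.se

p <- ggplot(summ, aes(x = factor(year), y = value.mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = value.mean - value.se,
                    ymax = value.mean + value.se), width = 0.2) +
  facet_wrap(~ stream) +                  # one panel per stream
  labs(x = "Year", y = "Mean (+/- SE)") +
  theme_bw()
```

Summarising first and plotting the summary table keeps the error bars easy to check against the numbers in summ.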

Forest plot from cox object

Please be tolerant :) I am a novice R user and I am using the code and sample data to learn how to make the forest plot that was shown in the previous post -
Optimal/efficient plotting of survival/regression analysis results
I was wondering whether it is possible to set a user-defined x-axis scale with the code shown there? Up to now the x-axis scale is defined somehow automatically.
Thank you for any tips.
I'm unimpressed with the precision of the documentation since one might assume that the limits argument would be values on the relative risk scale rather than on the log-transformed scale. One gets a ridiculous result if that is done. That quibble notwithstanding, it's relatively easy to use that parameter to create an expanded plot:
install.packages('devtools') # then use it to get current package
# executing the install and load of the package referenced at the top of that answer
print(forest_model(lung_cox, limits=log( c(.5, 50) ) ))
Trying for a lower limit of 0 on the relative risk scale is not sensible, since it would imply a -Inf value on the log-transformed scale. Trying for a lower value, say log(0.001), confuses the pretty printing of the scale in my tests.
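A small base-R illustration of the scale issue described above, assuming (as this answer reports) that limits is interpreted on the log scale. The variable names are just for this sketch.

```r
# Desired axis range on the relative risk (hazard ratio) scale.
rr_limits <- c(0.5, 50)

# The same range on the log scale, which is what would be passed as `limits`.
log_limits <- log(rr_limits)
log_limits

# Back-transforming recovers the relative risk values.
exp(log_limits)

# A lower limit of 0 on the relative risk scale has no finite log.
log(0)  # -Inf
```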

Indicating the statistically significant difference in bar graph USING R

This is a repeat of a question originally asked here: Indicating the statistically significant difference in bar graph but asked for R instead of python.
My question is very simple. I want to produce barplots in R, using ggplot2 if possible, with an indication of significant difference between the different bars, e.g. produce something like this. I have had a search around but can't find another question asking exactly the same thing.
I know that this is an old question and the answer by Didzis Elferts already provides one solution for the problem. But I recently created a ggplot-extension that simplifies the whole process of adding significance bars: ggsignif
Instead of tediously adding the geom_path and annotate to your plot you just add a single layer geom_signif:
library(ggplot2)
library(ggsignif)
ggplot(iris, aes(x=Species, y=Sepal.Length)) +
geom_boxplot() +
geom_signif(comparisons = list(c("versicolor", "virginica")),
map_signif_level=TRUE)
Full documentation of the package is available at CRAN.
You can use geom_path() and annotate() to get a similar result. For this example you have to determine suitable positions yourself. In geom_path() four numbers are provided to get those small ticks for connecting lines.
df<-data.frame(group=c("A","B","C","D"),numb=c(12,24,36,48))
g<-ggplot(df,aes(group,numb))+geom_bar(stat="identity")
g+geom_path(x=c(1,1,2,2),y=c(25,26,26,25))+
geom_path(x=c(2,2,3,3),y=c(37,38,38,37))+
geom_path(x=c(3,3,4,4),y=c(49,50,50,49))+
annotate("text",x=1.5,y=27,label="p=0.012")+
annotate("text",x=2.5,y=39,label="p<0.0001")+
annotate("text",x=3.5,y=51,label="p<0.0001")
I used the suggested method from above, but I found the annotate function easier for making lines than the geom_path function. Just use "segment" instead of "text". You have to break things up by segment and define starting and ending x and y values for each line segment.
Example for making three line segments:
annotate("segment", x=c(1,1,2),xend=c(1,2,2), y= c(125,130,130), yend=c(130,130,125))
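For completeness, a sketch of how the "segment" approach might look on the small A to D data frame from the earlier answer. The y positions are picked by eye for that data, and geom_col() is used here as the equivalent of geom_bar(stat="identity").

```r
library(ggplot2)

df <- data.frame(group = c("A", "B", "C", "D"), numb = c(12, 24, 36, 48))

p <- ggplot(df, aes(group, numb)) +
  geom_col() +
  # Three segments form one bracket between bars A and B:
  # left tick, horizontal connector, right tick.
  annotate("segment",
           x = c(1, 1, 2), xend = c(1, 2, 2),
           y = c(25, 26, 26), yend = c(26, 26, 25)) +
  annotate("text", x = 1.5, y = 27, label = "p=0.012")
```

Further brackets can be added with additional annotate("segment", ...) calls, one per pair of bars.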

displaying stat_summary accurately on violin plots

I just started using ggplot2 on R and have a violin plot question.
I have a data set that can be accessed here: data.
The data comes from a study of making estimations. The variables of interest are the question.no (questions), condition, estimate.no (tr.est1 or tr.est2) and estimate.
The code below makes the plot look almost the way I want it to look at least for one question, yet the median dots generated by stat_summary() are displayed in between the "violins."
library(plyr)     # provides d_ply
library(ggplot2)
v.data<-read.csv("data.csv")
# loop through each question number
d_ply(v.data, c("question.no"), function(d.plot){
  # question number of the current subset (v.data$question.no would be the whole column)
  q.no <- unique(d.plot$question.no)
  plot.q <- ggplot(d.plot,aes(condition, estimate, fill=estimate.no)) +
    geom_violin() +
    stat_summary(fun.y="median", geom="point") +
    scale_y_continuous('Change Scores') +
    scale_x_discrete("Conditions")
  ggsave(filename=paste(q.no,".png",sep=""), plot=plot.q)
})
My Question: How can I make the median dots display correctly on the "violins" rather than in between them?
I searched the previous questions asked on ggplot2 on this site and looked at the ggplot2 documentation as well as other R forums but have not been able to find anything relevant.
I would appreciate any comments and suggestions as to how I can fix it. Also, if the questions I ask are already answered somewhere else, I would appreciate the links to the threads,too. Many thanks in advance.
stat_summary is limited to the variable that determines your x-axis. One way to convey the information you want would be to replace condition in your call to aes with interaction(condition, estimate.no).
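A sketch of that suggestion on simulated data, since data.csv is not available here. Only the column names follow the question; the values and the condition labels c1 and c2 are invented. The second plot is an alternative that is not part of this answer: it keeps condition on the x-axis and dodges the median points by 0.9, the default width of geom_violin().

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for data.csv (values and condition labels are made up).
v.sim <- data.frame(
  condition   = rep(c("c1", "c2"), each = 100),
  estimate.no = rep(c("tr.est1", "tr.est2"), times = 100),
  estimate    = rnorm(200)
)

# Option 1 (from the answer): one x position per condition/estimate.no pair.
p1 <- ggplot(v.sim, aes(interaction(condition, estimate.no), estimate,
                        fill = estimate.no)) +
  geom_violin() +
  # `fun` needs ggplot2 >= 3.3.0; older versions use fun.y instead
  stat_summary(fun = median, geom = "point")

# Option 2 (an alternative, not from the answer): dodge the points by the
# same width that geom_violin() uses by default.
p2 <- ggplot(v.sim, aes(condition, estimate, fill = estimate.no)) +
  geom_violin() +
  stat_summary(fun = median, geom = "point",
               position = position_dodge(width = 0.9))
```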
Plotluck is a library based on ggplot2 that aims at automating the choice of plot type based on characteristics of 1-3 variables. For your data set, the command plotluck(v.data, condition, estimate, question.no) generates the following plot:
Note that the library chose to scale y logarithmically. You can override this behavior with plotluck(v.data,condition,estimate,question.no,opts=plotluck.options(trans.log.thresh=1E20)) but it doesn't display well, and the median points look like they are all on the zero line.
