geom_violin automatically ignores values having low variation - r

I'm having a problem with geom_violin. To make my problem reproducible, I created a toy dataset.
Let's say this is my original data:
require(jsonlite)
data <- fromJSON("[{\"Season\":\"Spring\",\"Maximum.Profit\":2520,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":1710,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":2500,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":2850,\"Hidden\":\"Yes\"},{\"Season\":\"Spring\",\"Maximum.Profit\":3500,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":5740,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":5100,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":1710,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":3500,\"Hidden\":\"Yes\"},{\"Season\":\"Summer\",\"Maximum.Profit\":8000,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":4920,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":720,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":13740,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":2600,\"Hidden\":\"Yes\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":3810,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":-1260,\"Hidden\":\"No\"}]")
Here is my code for visualizing it:
require(ggplot2)
p <- ggplot(data, aes(x=Season, y=Maximum.Profit))
p <- p + geom_violin(aes(color=Hidden)) + geom_boxplot(aes(fill=Hidden))
Unlike the boxplot, geom_violin ignored the Hidden-"Yes" in all Seasons. I realized there was only a single value in each of these cases (Season_Hidden): "Autumn_Yes", "Spring_Yes", "Summer_Yes". So I added one more value for each. I tried not to create identical values, so I made them a little bit different, too. You can have a look at 3 lines at the bottom of data2
data2 <- fromJSON("[{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":2520},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":1710},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":2500},{\"Season\":\"Spring\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2850},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":3500},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":5740},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":5100},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":1710},{\"Season\":\"Summer\",\"Hidden\":\"Yes\",\"Maximum.Profit\":3500},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":8000},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":4920},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":720},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":13740},{\"Season\":\"Autumn\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2600},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":3810},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":-1260},{\"Season\":\"Autumn\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2607.2},{\"Season\":\"Spring\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2857.2},{\"Season\":\"Summer\",\"Hidden\":\"Yes\",\"Maximum.Profit\":3507.2}]")
But data2 produced the same figure as data. So I forced it a little bit more:
p <- ggplot(rbind(data2, data2), aes(x=Season, y=Maximum.Profit))
p <- p + geom_violin(aes(color=Hidden), scale="width", position=position_dodge(width=1))
p <- p + geom_boxplot(aes(fill=Hidden), position=position_dodge(width=1), width=0.2)
(The additional settings for geom_violin and geom_boxplot are not important. I just put them there to make it prettier.)
Now this is the picture that I want but I don't want to do it in a sneaky way, such as using rbind(data2, data2) instead of data in the previous example.
Does anyone know a better and more stable solution for this issue? How can I make geom_violin NOT ignore low-variance values or, at least, leave one side blank so that it doesn't mess things up when combined with other geoms (a boxplot in this case)?
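For reference, the group sizes behind the missing violins can be checked directly. The sketch below rebuilds the question's values in a hypothetical data frame named toy (so that jsonlite is not needed), cross-tabulates Season against Hidden, and shows that base R's density() with its default bandwidth rule rejects a single observation. It is a diagnostic only, not a fix, and it assumes that the violin layer relies on a comparable kernel density estimate per group.

```r
# Toy data with the same Season/Hidden layout as `data` in the question
toy <- data.frame(
  Season = rep(c("Spring", "Summer", "Autumn"), times = c(5, 5, 6)),
  Hidden = c("No", "No", "No", "Yes", "No",
             "No", "No", "No", "Yes", "No",
             "No", "No", "No", "Yes", "No", "No"),
  Maximum.Profit = c(2520, 1710, 2500, 2850, 3500,
                     5740, 5100, 1710, 3500, 8000,
                     4920, 720, 13740, 2600, 3810, -1260)
)

# Observations per Season x Hidden cell: each "Yes" cell holds exactly one value
counts <- table(toy$Season, toy$Hidden)
print(counts)

# With the default bandwidth rule, density() refuses a single observation
res <- tryCatch(density(2850), error = function(e) e)
print(inherits(res, "error"))
```

Any cell with a count of 1 is a candidate for a missing violin, whereas the boxplot can still draw a single flat line for it.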

Related

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" values (or whichever values are non-numeric). Critically, you want to keep the binning associated with a histogram, which means you are not looking to simply make your scale discrete and represent the histogram with a typical barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to be NAs. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and column: df$n = 20 in this case.
library(dplyr)
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Some elements need to be removed from the NA barplot: namely, the entire y axis and the x-axis title (which can't be NULL, or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room, so its width needs to be cut down. We address this via the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it depends on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
library(cowplot) # provides plot_grid()
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
library(dplyr) # for %>%, group_by(), summarise(), mutate()
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

Animation, adding geom

I want to create some kind of animation with ggplot2 but it doesn't work as I want to. Here is a minimal example.
print(p <- qplot(c(1, 2),c(1, 1))+geom_point())
print(p <- p + geom_point(aes(c(1, 2),c(2, 2))))
print(p <- p + geom_point(aes(c(1, 2),c(3, 3))))
Adding extra points by hand is no problem. But now I want to do it in some loop to get an animation.
for(i in 4:10){
Sys.sleep(.3)
print(p <- p + geom_point(aes(c(1, 2),c(i, i))))
}
But now only the new points added are shown, and points of the previous iterations are deleted. I want the old ones still to be visible. How can I do this?
Either of these will do what you want, I think.
# create df dynamically
for (i in 1:10) {
df <- data.frame(x=rep(1:2,i),y=rep(1:i,each=2))
Sys.sleep(0.3)
print(ggplot(df, aes(x,y))+geom_point() + ylim(0,10))
}
# create df at the beginning, then subset in the loop
df <- data.frame(x=rep(1:2,10), y=rep(1:10,each=2))
for (i in 1:10) {
Sys.sleep(0.3)
print(ggplot(df[1:(2*i),], aes(x,y))+geom_point() +ylim(0,10))
}
Also, your code will cause the y-axis limits to change for each plot. Using ylim(...) keeps all the plots on the same scale.
EDIT Response to OP's comment.
One way to create animations is using the animation package. Here's an example.
library(ggplot2)
library(animation)
ani.record(reset = TRUE) # clear history before recording
df <- data.frame(x=rep(1:2,10), y=rep(1:10,each=2))
for (i in 1:10) {
plot(ggplot(df[1:(2*i),], aes(x,y))+geom_point() +ylim(0,10))
ani.record() # record the current frame
}
## now we can replay it, with an appropriate pause between frames
oopts = ani.options(interval = 0.5)
ani.replay()
This will "record" each frame (using ani.record(...)) and then play it back at the end using ani.replay(...). Read the documentation for more details.
Regarding the question about why your code fails, the simple answer is: "this is not the way ggplot is designed to be used." The more complicated answer is this: ggplot is based on a framework which expects you to identify a default dataset as a data frame, and then associate (map) various aspects of the graph (aesthetics) with columns in the data frame. So if you have a data frame df with columns A and B, and you want to plot B vs. A, you would write:
ggplot(data=df, aes(x=A, y=B)) + geom_point()
This code identifies df as the dataset, and maps the aesthetic x (the horizontal axis) with column A and y with column B. Taking advantage of the default order of the arguments, you could also write:
ggplot(df, aes(A,B)) + geom_point()
It is possible to specify things other than column names in aes(...), but this can and often does lead to unexpected (even bizarre) results. Don't do it!
The reason, basically, is that ggplot does not evaluate the arguments to aes(...) immediately, but rather stores them as expressions in a ggplot object, and evaluates them when you plot or print that object. This is why, for example, you can add layers to a plot and ggplot is able to dynamically rescale the x- and y-limits, something that does not work with plot(...) in base R.
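The delayed evaluation described above can be made visible with a small sketch. It assumes a reasonably recent ggplot2 in which layer_data() is available, and it repeats the pattern from the question on an otherwise empty plot: each layer's aes() refers to the loop variable i, which is only looked up when the plot is built.

```r
library(ggplot2)

p <- ggplot()
for (i in 1:3) {
  # c(i, i) is stored as an unevaluated expression, not as the current numbers
  p <- p + geom_point(aes(x = c(1, 2), y = c(i, i)))
}

# After the loop i is 3, so each of the three layers is expected to resolve y to 3
ys <- sapply(1:3, function(k) unique(as.numeric(layer_data(p, k)$y)))
print(ys)
```

Supplying the values through a data frame, as in the loops above, fixes them at the moment the plot is created, which is why the old points stay visible there.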

Is it possible to create 3 series (2 lines and one point) faceted plot in ggplot?

I am trying to rewrite, in ggplot2, some code that I originally wrote with the base graphics package in R.
The graph I obtained using the base graphics package is as follows:
I was wondering whether this type of graph is possible to create in ggplot2. I think we could create this kind of graph by using panels, but I was wondering whether it is possible to use faceting for this kind of plot. The major difficulty I encountered is that the maximum and minimum series have the same length, whereas the observed data is not continuous and its intervals are quite different.
Any thoughts on arranging the data for this type of plot would be very helpful. Thank you so much.
Jdbaba,
From your comments, you mentioned that you'd like the geom_point to show just the point (.) in the legend. As far as I know, this is not yet implemented directly in ggplot2. However, there's a fix/work-around given by @Aniko in this post. It's a bit tricky but brilliant! And it works great. Here's a version that I tried out. Hope it is what you expected.
# bind both your data.frames
df <- rbind(tempcal, tempobs)
p <- ggplot(data = df, aes(x = time, y = data, colour = group1,
linetype = group1, shape = group1))
p <- p + geom_line() + geom_point()
p <- p + scale_shape_manual("", values=c(NA, NA, 19))
p <- p + scale_linetype_manual("", values=c(1,1,0))
p <- p + scale_colour_manual("", values=c("#F0E442", "#0072B2", "#D55E00"))
p <- p + facet_wrap(~ id, ncol = 1)
p
The idea is to first create a plot with all necessary attributes set in the aesthetics section, plot what you want, and then change the settings manually using the scale_*_manual functions. For example, you can unset lines with a 0 in scale_linetype_manual. Similarly, you can unset points for the line series with NA in scale_shape_manual. Here, the first two values are for group1 = maximum and minimum, and the last is for observed. So we set the shape to NA for maximum and minimum, and the linetype to 0 for observed.
And this is the plot:
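Since tempcal and tempobs come from external files, here is a self-contained sketch of the same technique. The two data frames below are made-up stand-ins (the values, the site1/site2 ids and the row counts are assumptions for illustration only); the column names follow the code above. The group1 values sort alphabetically as maximum, minimum, observed, which is the order the manual scale values are matched against.

```r
library(ggplot2)

# Hypothetical stand-ins: two continuous series per panel plus sparse observations
set.seed(1)
tempcal <- data.frame(
  id     = rep(c("site1", "site2"), each = 20),
  time   = rep(1:10, times = 4),
  group1 = rep(rep(c("maximum", "minimum"), each = 10), times = 2),
  data   = rep(c(20, 10, 22, 12), each = 10) + rnorm(40)
)
tempobs <- data.frame(
  id     = rep(c("site1", "site2"), each = 3),
  time   = c(2, 5, 9, 1, 4, 8),
  group1 = "observed",
  data   = c(15, 16, 14, 17, 18, 16)
)
df <- rbind(tempcal, tempobs)

p <- ggplot(df, aes(x = time, y = data, colour = group1,
                    linetype = group1, shape = group1)) +
  geom_line() + geom_point() +
  scale_shape_manual("", values = c(NA, NA, 19)) +   # no points for the two line series
  scale_linetype_manual("", values = c(1, 1, 0)) +   # no line for the observed series
  scale_colour_manual("", values = c("#F0E442", "#0072B2", "#D55E00")) +
  facet_wrap(~ id, ncol = 1)
print(p)
```

ggplot2 may warn about removed rows here, because the NA shapes are treated as missing values; that is expected with this work-around.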
Solution found:
Thanks to Arun and Andrie
Just in case somebody needs the solution of this sort of problem.
The code I used was as follows:
library(ggplot2)
tempcal <- read.csv("temp data ggplot.csv",header=T, sep=",")
tempobs <- read.csv("temp data observed ggplot.csv",header=T, sep=",")
p <- ggplot(tempcal, aes(x=time, y=data)) +
geom_line(aes(color=group1)) +
geom_point(data=tempobs, aes(colour=group1)) +
facet_wrap(~id)
p
The datasets used were https://www.dropbox.com/s/95sdo0n3gvk71o7/temp%20data%20observed%20ggplot.csv
https://www.dropbox.com/s/4opftofvvsueh5c/temp%20data%20ggplot.csv
The plot obtained was as follows:
Jdbaba

R: Plot multiple box plots using columns from data frame

I would like to plot an INDIVIDUAL box plot for each unrelated column in a data frame. I thought I was on the right track with boxplot.matrix from the sfsmisc package, but it seems to do the same as boxplot(as.matrix(plotdata)), which is to plot everything in a shared boxplot with a shared scale on the axis. I want (say) 5 individual plots.
I could do this by hand like:
par(mfrow=c(2,2))
boxplot(data$var1)
boxplot(data$var2)
boxplot(data$var3)
boxplot(data$var4)
But there must be a way to use the data frame columns?
EDIT: I used iterations, see my answer.
You could use the reshape package to simplify things
data <- data.frame(v1=rnorm(100),v2=rnorm(100),v3=rnorm(100), v4=rnorm(100))
library(reshape)
meltData <- melt(data)
boxplot(data=meltData, value~variable)
or even then use ggplot2 package to make things nicer
library(ggplot2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales="free")
From ?boxplot we see that we have the option to pass multiple vectors of data as elements of a list, and we will get multiple boxplots, one for each vector in our list.
So all we need to do is convert the columns of our matrix to a list:
m <- matrix(1:25,5,5)
boxplot(x = as.list(as.data.frame(m)))
If you really want separate panels each with a single boxplot (although, frankly, I don't see why you would want to do that), I would instead turn to ggplot and faceting:
library(reshape) # provides melt(), as in the answer above
m1 <- melt(as.data.frame(m))
library(ggplot2)
ggplot(m1,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
I used iteration to do this. I think perhaps I wasn't clear in the original question. Thanks for the responses nonetheless.
par(mfrow=c(2,5))
for (i in seq_along(plotdata)) {
boxplot(plotdata[,i], main=names(plotdata)[i], type="l")
}

geom_boxplot() from ggplot2 : forcing an empty level to appear

I can't find a way to ask ggplot2 to show an empty level in a boxplot without imputing my dataframe with actual missing values.
Here is reproducible code :
# fake data
dftest <- expand.grid(time=1:10,measure=1:50)
dftest$value <- rnorm(dim(dftest)[1],3+0.1*dftest$time,1)
# and let's suppose we didn't observe anything at time 2
# doesn't work even when forcing with factor(..., levels=...)
p <- ggplot(data=dftest[dftest$time!=2,],aes(x=factor(time,levels=1:10),y=value))
p + geom_boxplot()
# only way seems to have at least one actual missing value in the dataframe
dftest2 <- dftest
dftest2[dftest2$time==2,"value"] <- NA
p <- ggplot(data=dftest2,aes(x=factor(time),y=value))
p + geom_boxplot()
So I guess I'm missing something. This is not a problem when dealing with a balanced experiment where these missing data might be explicit in the dataframe. But with observed data in a cohort for example, it means imputing the data with missing values for unobserved combinations...
Thanks for your help.
You can control the breaks in a suitable scale function, in this case scale_x_discrete. Make sure you use the argument drop=FALSE:
p <- ggplot(data=dftest[dftest$time!=2,],aes(x=factor(time,levels=1:10),y=value))
p + geom_boxplot() +
scale_x_discrete("time", breaks=factor(1:10), drop=FALSE)
I like to do my data manipulation in advance of sending it to ggplot. I think this makes the code more readable. This is how I would do it myself, but the results are the same. Note, however, that the ggplot scale gets much simpler, since you don't have to specify the breaks:
dfplot <- dftest[dftest$time!=2, ]
dfplot$time <- factor(dfplot$time, levels=1:10)
ggplot(data=dfplot, aes(x=time ,y=value)) +
geom_boxplot() +
scale_x_discrete("time", drop=FALSE)
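Why drop=FALSE has something to keep can be checked without plotting at all. The following base R sketch reuses the question's fake data (the seed is added here only for reproducibility) and confirms that, after filtering, the factor still carries level "2" even though no rows use it.

```r
# Rebuild the fake data from the question
set.seed(42)
dftest <- expand.grid(time = 1:10, measure = 1:50)
dftest$value <- rnorm(nrow(dftest), 3 + 0.1 * dftest$time, 1)

# Drop every observation at time 2, but declare all ten levels explicitly
dfplot <- dftest[dftest$time != 2, ]
dfplot$time <- factor(dfplot$time, levels = 1:10)

print(levels(dfplot$time))   # "2" is still a level
print(table(dfplot$time))    # ... with a count of 0
```

It is this unused level that scale_x_discrete(drop=FALSE) keeps on the axis; calling droplevels() on the column beforehand would remove it again.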
