geom_boxplot() from ggplot2: forcing an empty level to appear

I can't find a way to ask ggplot2 to show an empty level in a boxplot without imputing my dataframe with actual missing values.
Here is reproducible code :
library(ggplot2)
# fake data
dftest <- expand.grid(time=1:10,measure=1:50)
dftest$value <- rnorm(dim(dftest)[1],3+0.1*dftest$time,1)
# and let's suppose we didn't observe anything at time 2
# doesn't work even when forcing with factor(..., levels=...)
p <- ggplot(data=dftest[dftest$time!=2,],aes(x=factor(time,levels=1:10),y=value))
p + geom_boxplot()
# only way seems to have at least one actual missing value in the dataframe
dftest2 <- dftest
dftest2[dftest2$time==2,"value"] <- NA
p <- ggplot(data=dftest2,aes(x=factor(time),y=value))
p + geom_boxplot()
So I guess I'm missing something. This is not a problem when dealing with a balanced experiment where these missing data might be explicit in the dataframe. But with observed data in a cohort for example, it means imputing the data with missing values for unobserved combinations...
Thanks for your help.

You can control the breaks in a suitable scale function, in this case scale_x_discrete. Make sure you use the argument drop=FALSE:
p <- ggplot(data=dftest[dftest$time!=2,],aes(x=factor(time,levels=1:10),y=value))
p + geom_boxplot() +
scale_x_discrete("time", breaks=factor(1:10), drop=FALSE)
I like to do my data manipulation in advance of sending it to ggplot. I think this makes the code more readable. This is how I would do it myself, but the results are the same. Note, however, that the ggplot scale gets much simpler, since you don't have to specify the breaks:
dfplot <- dftest[dftest$time!=2, ]
dfplot$time <- factor(dfplot$time, levels=1:10)
ggplot(data=dfplot, aes(x=time ,y=value)) +
geom_boxplot() +
scale_x_discrete("time", drop=FALSE)
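As a quick sanity check, here is a minimal sketch that rebuilds the data from the question and confirms that the empty level survives both in the factor and on the x scale. The seed and the use of layer_scales() to inspect the trained scale are my additions, not part of the original answer:

```r
library(ggplot2)

set.seed(1)  # seed added only to make the sketch reproducible
dftest <- expand.grid(time = 1:10, measure = 1:50)
dftest$value <- rnorm(nrow(dftest), 3 + 0.1 * dftest$time, 1)

# Drop the rows for time 2, but keep "2" as a declared factor level
dfplot <- dftest[dftest$time != 2, ]
dfplot$time <- factor(dfplot$time, levels = 1:10)

# Level "2" is still declared even though it has no rows
level_counts <- table(dfplot$time)
print(level_counts)

p <- ggplot(dfplot, aes(x = time, y = value)) +
  geom_boxplot() +
  scale_x_discrete("time", drop = FALSE)

# With drop = FALSE the trained x scale should still list the empty level
x_limits <- layer_scales(p)$x$get_limits()
print(x_limits)
```

If the factor had been created without levels=1:10, there would be no level "2" for drop=FALSE to preserve, which is why the levels have to be declared explicitly.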

Related

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" values (or whichever values are non-numeric). Critically, you want to keep the "binning" associated with a histogram, which means you are not looking to simply make your scale discrete and represent the histogram with a typical barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than the one you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need it to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to NAs. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and one column: df_na$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # provides plot_grid(), used further down
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
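To make the coercion step concrete, here is a small base-R sketch on a made-up vector (time_raw and its values are hypothetical, chosen only for illustration). It shows that any entry that cannot be parsed as a number becomes NA, so counting the NAs gives the height of the ">10" bar:

```r
# Hypothetical toy vector mixing numeric strings and the ">10" marker
time_raw <- c("1", "2.5", ">10", "7", ">10")

# as.numeric() warns about the coercion; the warning is expected here
time_num <- suppressWarnings(as.numeric(time_raw))
print(time_num)

# Number of values that could not be parsed, i.e. the ">10" entries
n_over <- sum(is.na(time_num))
print(n_over)
```

Note that this treats every non-numeric entry as ">10". If the column could contain other unparseable values, it would be safer to count time_raw == ">10" before coercing.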
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
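For the last point, it may help to see what ggplot_build() actually exposes. The following is a minimal sketch on a stand-in data frame (demo, p_demo and bins_df are names I made up for this illustration); it only relies on the computed layer data having one row per bin with a count column:

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in for the numeric part of df
demo <- data.frame(time = runif(100, 0, 10))

p_demo <- ggplot(demo, aes(x = time)) +
  geom_histogram(bins = 12)

# ggplot_build() returns the computed data for each layer;
# layer 1 holds one row per bin, with the bin height in `count`
bins_df <- ggplot_build(p_demo)$data[[1]]

tallest_bin <- max(bins_df$count)  # candidate for the upper y limit
total <- sum(bins_df$count)        # should equal the number of observations
print(c(tallest_bin = tallest_bin, total = total))
```

Because the tallest bin depends on the number of bins, the upper limit has to be recomputed whenever bins changes, which is why the final code derives it from the built plot rather than hard-coding it.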
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

R - ggplot2 - difference between ggplot(data, aes(x=variable...)) and ggplot(data, aes(x=data$variable...)) [duplicate]

This question already has an answer here:
Issue when passing variable with dollar sign notation ($) to aes() in combination with facet_grid() or facet_wrap()
I have recently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset, the following two pieces of code are completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get the approximation of the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
And plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck", that in case of plot 1 and 2 the filling is identical, or I have stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using aes(data$variable) inside ggplot() is never good, never recommended, and should never be used. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard-evaluating R function can only see objects that exist in the environment it is called from (here, the global environment). If I have data named mydata with a column named col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values. ggplot will look for a column in your data named col1, and use that. mydata$col1 is just a vector: it's the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot is splitting your data up into pieces and doing stuff. To do this effectively, it needs to be able to identify the data and column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column), and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
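The splitting argument can be illustrated without ggplot at all. The sketch below is only an analogy in base R, not ggplot2's actual internals: it splits a small data frame by a grouping column (grp is a hypothetical column added for the example) and then evaluates the two kinds of expression inside each piece:

```r
mydata <- data.frame(col1 = 1:6, grp = rep(c("A", "B"), each = 3))

# Split the data into pieces, roughly as faceting would
pieces <- split(mydata, mydata$grp)

# A bare column name is looked up inside each piece,
# so it yields only the rows belonging to that piece
by_name <- lapply(pieces, function(d) eval(quote(col1), d))

# mydata$col1 ignores the piece and always returns the full column
by_dollar <- lapply(pieces, function(d) eval(quote(mydata$col1), d))

print(lengths(by_name))
print(lengths(by_dollar))
```

The by_name results line up with the rows of each piece, while the by_dollar results have the length of the whole data frame, so they no longer correspond to the rows they are supposed to describe. That mismatch is the kind of thing that scrambles the fill colours in the faceted plot from the question.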

geom_violin automatically ignores values having low variation

I'm having a problem with geom_violin. To make my problem reproducible, I created a toy dataset.
Let's say this is my original data:
require(jsonlite)
data <- fromJSON("[{\"Season\":\"Spring\",\"Maximum.Profit\":2520,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":1710,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":2500,\"Hidden\":\"No\"},{\"Season\":\"Spring\",\"Maximum.Profit\":2850,\"Hidden\":\"Yes\"},{\"Season\":\"Spring\",\"Maximum.Profit\":3500,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":5740,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":5100,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":1710,\"Hidden\":\"No\"},{\"Season\":\"Summer\",\"Maximum.Profit\":3500,\"Hidden\":\"Yes\"},{\"Season\":\"Summer\",\"Maximum.Profit\":8000,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":4920,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":720,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":13740,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":2600,\"Hidden\":\"Yes\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":3810,\"Hidden\":\"No\"},{\"Season\":\"Autumn\",\"Maximum.Profit\":-1260,\"Hidden\":\"No\"}]")
Here is my code for visualizing it:
require(ggplot2)
p <- ggplot(data, aes(x=Season, y=Maximum.Profit))
p <- p + geom_violin(aes(color=Hidden)) + geom_boxplot(aes(fill=Hidden))
Unlike the boxplot, geom_violin ignored the Hidden-"Yes" in all Seasons. I realized there was only a single value in each of these cases (Season_Hidden): "Autumn_Yes", "Spring_Yes", "Summer_Yes". So I added one more value for each. I tried not to create identical values, so I made them a little bit different, too. You can have a look at 3 lines at the bottom of data2
data2 <- fromJSON("[{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":2520},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":1710},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":2500},{\"Season\":\"Spring\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2850},{\"Season\":\"Spring\",\"Hidden\":\"No\",\"Maximum.Profit\":3500},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":5740},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":5100},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":1710},{\"Season\":\"Summer\",\"Hidden\":\"Yes\",\"Maximum.Profit\":3500},{\"Season\":\"Summer\",\"Hidden\":\"No\",\"Maximum.Profit\":8000},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":4920},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":720},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":13740},{\"Season\":\"Autumn\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2600},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":3810},{\"Season\":\"Autumn\",\"Hidden\":\"No\",\"Maximum.Profit\":-1260},{\"Season\":\"Autumn\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2607.2},{\"Season\":\"Spring\",\"Hidden\":\"Yes\",\"Maximum.Profit\":2857.2},{\"Season\":\"Summer\",\"Hidden\":\"Yes\",\"Maximum.Profit\":3507.2}]")
But this data2 created the same figure as data. So I forced it a little bit more:
p <- ggplot(rbind(data2, data2), aes(x=Season, y=Maximum.Profit))
p <- p + geom_violin(aes(color=Hidden), scale="width", position=position_dodge(width=1))
p <- p + geom_boxplot(aes(fill=Hidden), position=position_dodge(width=1), width=0.2)
(The additional settings for geom_violin and geom_boxplot are not important. I just put them there to make it prettier.)
Now this is the picture that I want but I don't want to do it in a sneaky way, such as using rbind(data2, data2) instead of data in the previous example.
Does anyone know a better and more stable solution for this issue? How to make geom_violin NOT ignore low-variance values, or at least, leave one side blank so that it won't mess up when combining with other geometry (boxplot in this case)

how do I stop ggplot automatically arranging my graph?

I made a grouped barchart in R using the ggplot package. I used the following code:
ggplot(completedDF,aes(year,value,fill=variable)) + geom_bar(position=position_dodge(),stat="identity")
And the graph looks like this:
The problem is that I want the 1999-2008 data to be at the end.
Is there any way to move it?
Thanks, any help appreciated.
ggplot will follow the order of the levels in a factor. If you didn't order your factor, then it is assumed that the order is alphabetical.
If you want your "1999-2008" modality to be at the end, just reorder your factor using
completed$year <- factor(x=completed$year,
levels=c("1999-2002", "2002-2005", "2005-2008", "1999-2008"))
For example :
library(ggplot2)
# Create a sample data set
set.seed(2014)
years_labels <- c( "1999-2008","1999-2002", "2002-2005", "2005-2008")
variable_labels <- c("pointChangeVector", "nonPointChangeVector",
"onRoadChangeVector", "nonRoadChangeVecto")
years <- rbinom(n=1000, size=3,prob=0.3)
variables <- rbinom(n=1000, size=3,prob=0.3)
year <- factor(x=years , levels=0:3, labels=years_labels)
variable <- factor(x=variables , levels=0:3, labels=variable_labels)
completed <- data.frame( year, variable)
# Plot
ggplot(completed,aes(x=year, fill=variable)) + geom_bar(position=position_dodge())
# change the order
completed$year <- factor(x=completed$year,
levels=c("1999-2002", "2002-2005", "2005-2008", "1999-2008"))
ggplot(completed,aes(x=year, fill=variable)) + geom_bar(position=position_dodge())
Furthermore, the other benefit of using this is that your results will also be in a good order for other functions like summary or plot.
Does it help?
Yeah, this is a real problem in ggplot. It always changes the order of non-numeric values.
The easiest way to solve it is to add scale_x_discrete in this way:
p <- ggplot(completedDF,aes(year,value,fill=variable))
p <- p + geom_bar(position=position_dodge(),stat="identity")
p <- p + scale_x_discrete(limits = c("1999-2002","2002-2005","2005-2008","1999-2008"))
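Both answers rest on the same fact: a discrete axis follows the order of the factor levels (or the scale limits), and the default level order is the sorted order of the values. Here is a minimal base-R sketch of that behaviour, using the year labels from the question (the variable names year, wanted and year_f are mine):

```r
year <- c("1999-2008", "1999-2002", "2002-2005", "2005-2008")

# Default: levels are the sorted unique values, so "1999-2008"
# lands right after "1999-2002"
default_levels <- levels(factor(year))
print(default_levels)

# Explicit levels put "1999-2008" at the end
wanted <- c("1999-2002", "2002-2005", "2005-2008", "1999-2008")
year_f <- factor(year, levels = wanted)
print(levels(year_f))
```

Reordering the factor changes the data itself, so every later plot and summary inherits the order. Setting limits in scale_x_discrete() only affects the one plot it is added to.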

Simple analog for plotting a line from a table object in ggplot2

I have been unable to find a simple analog for plotting a line graph from a table object in ggplot2. Given the elegance and utility of the package, I feel I must be missing something quite obvious. As an illustration consider a data frame with yearly observations:
dat<-data.frame(year=sample(c("2001":"2010"),1000, replace=T))
And a quick time series plot in base R:
plot(table(dat$year), type="l")
Switching to qplot, returns the error "attempt to apply a non-function":
qplot(table(dat$year), geom="line")
ggplot2 requires a data frame. Fair enough. But this returns the same error.
qplot(year, data=dat, geom="line")
After some searching and fiddling, I abandoned qplot, and came up with the following approach which involves specifying a line geometry, binning the counts, and dropping final values to avoid plotting zeros.
ggplot(dat, aes(year) ) + geom_line(stat = "bin", binwidth=1, drop=TRUE)
It seems like rather a long walk around the block. And it is still not entirely satisfactory, since the bins don't align precisely with the mid-year values on the x-axis. Where have I gone wrong?
Maybe still more complicated than you want, but:
qplot(Var1,Freq,data=as.data.frame(table(dat$year)),geom="line",group=1)
(the group=1 is necessary because the Year variable (Var1) is returned as a factor ...)
If you didn't need it as a one-liner you could use ytab <- as.data.frame(table(dat$year)) first to extract the table and convert it to a data frame ...
Following Brian Diggs's answer, if you're willing to construct a bit more fortify machinery you can condense this a bit more:
A utility function that converts a factor to numeric if possible:
conv2num <- function(x) {
xn <- suppressWarnings(as.numeric(as.character(x)))
if (!all(is.na(xn))) xn else x
}
And a fortify method that turns the table into a data frame and then tries to make the columns numeric:
fortify.table <- function(x,...) {
z <- as.data.frame(x)
facs <- sapply(z,is.factor)
z[facs] <- lapply(z[facs],conv2num)
z
}
Now this works almost as you would like it to:
qplot(Var1,Freq,data=table(dat$year),geom="line")
(It would be nice/easier if there were a table option to preserve the numeric nature of cross-classifying factors ...)
Expanding on Ben's answer, the "standard" approach would be to create the data frame from the table, at which point you can convert the years back into numbers.
ytab <- as.data.frame(table(dat$year))
ytab$Var1 <- as.numeric(as.character(ytab$Var1))
Then either of the following will work:
ggplot(ytab, aes(Var1, Freq)) + geom_line()
qplot(Var1, Freq, data=ytab, geom="line")
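The as.character() step in the conversion is easy to overlook, so here is a short sketch of why it is needed. It assumes the same kind of dat as in the question (the seed is my addition) and compares the two conversions side by side:

```r
set.seed(1)  # seed added for reproducibility
dat <- data.frame(year = sample(2001:2010, 1000, replace = TRUE))

# table() stores the years as names, so Var1 comes back as a factor
ytab <- as.data.frame(table(dat$year))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...)
codes <- as.numeric(ytab$Var1)

# Going through as.character() first recovers the original year values
years <- as.numeric(as.character(ytab$Var1))

print(head(data.frame(codes, years, Freq = ytab$Freq)))
```

Plotting against codes would still draw a line, but the x-axis would be labelled with level positions instead of the years themselves.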
The other approach is to create a fortify function which will transform the table into a data frame, and use that.
fortify.table <- as.data.frame.table
Then you can pass the table directly instead of a data frame. But Var1 is now still a factor and so you need group=1 to connect the line across years.
ggplot(table(dat$year), aes(Var1, Freq)) + geom_line(aes(group=1))
qplot(Var1, Freq, data=table(dat$year), geom="line", group=1)

Resources