Boxplots by base R and ggplot2 do not match - r

I have a simple dataset. When I draw a boxplot of the data with base R and with ggplot2, the two plots do not match. The base R boxplot is the one that agrees with the output of summary().
library(tidyverse)
library(ggplotify)
library(patchwork)
df <- read.csv("test_boxplot_data.csv")
summary(df)
p1 <- as.ggplot(~boxplot(df$y, outline=FALSE))
p2 <- ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) + ylim(0,100)
p1 + p2 + plot_layout(ncol = 2)
Generated plot kept here.
Any clue what is happening? It is also surprising that ggplot2 throws the warning "Removed 845 rows containing non-finite values (stat_boxplot)" even though there are no NAs in the data.

The warning "Removed 845 rows containing non-finite values (stat_boxplot)" is the clue. The data contains 845 points > 100, and ylim(0,100) replaces them with NA, so they are dropped before the boxplot statistics are computed.
From the first line of help for ylim():
"This is a shortcut for supplying the limits argument to the individual scales. By default, any values outside the limits specified are replaced with NA. Be warned that this will remove data outside the limits and this can produce unintended results. For changing x or y axis limits without dropping data observations, see coord_cartesian()."
This should provide the desired graph:
ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim=c(0,100))
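To see the difference numerically, here is a minimal sketch on made-up data (the original CSV is not available, so the exponential sample below is purely an assumption). It uses layer_data() from ggplot2 to pull out the statistics each plot actually computes and compares the medians.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for test_boxplot_data.csv: skewed values, many above 100
df <- data.frame(y = rexp(2000, rate = 1 / 60))

p_ylim  <- ggplot(df, aes(y = y)) + geom_boxplot(outlier.shape = NA) + ylim(0, 100)
p_coord <- ggplot(df, aes(y = y)) + geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(0, 100))

# layer_data() returns the per-layer statistics ggplot2 computed for the plot
stats_ylim  <- suppressWarnings(layer_data(p_ylim))   # warning: rows removed
stats_coord <- layer_data(p_coord)

# Median drawn by each plot, next to the median of the full data
medians <- c(ylim  = stats_ylim$middle,
             coord = stats_coord$middle,
             full  = median(df$y))
medians
```

With ylim() the median is computed on the truncated data only, while coord_cartesian() zooms the view and leaves the statistics untouched.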

Related

Can't specify x-axis with scale_x_discrete in ggplot2: "rows containing non-finite values"

I want to plot a bar chart and use scale_x_discrete to manually describe the bars. The x variable has 9 levels, the group-variable has 3 levels.
ggplot(data, aes(x = education, group = food, fill = food)) +
geom_bar() +
scale_x_discrete(limits=c("A","B","C","D","E","F","G","H","I"))
I did the same plot with the two other x-variables and it worked perfectly. This one works without the scale_x_discrete function, but as soon as I use it to give shorter names to the bars, I get the following warning:
Warning message: Removed 979 rows containing non-finite values (stat_count).
There are no NAs, but for some levels of x there is only one level of the group-variable "food", which isn't the case for the other variables I plotted, so that might be part of the problem.
What could be a solution?
Sample data:
education=c("A","B","C","D","E","F","G","H","I")
food=c("Soup","Vegan","Meat","Raw","Soup","Vegan","Meat","Raw", "Vegan" )
data=cbind(education,food)
data<-data.frame(data) # make sure that your data is a df
Sample code:
ggplot(data, aes(x = education, group = food, fill = food)) +
geom_bar() +
#scale_x_discrete(limits=c("A","B","C","D","E","F","G","H","I"))+
scale_x_discrete(labels=c("A","B","C","D","E","F","G","H","I"))+
labs(y="Count", x="Education", fill="Food type")+
theme_bw()
Output:
Using scale_x_discrete(labels=c("A","B","C","D","E","F","G","H","I")) (instead of limits ) did work. Thanks to AndS. who posted this solution in a comment above.
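The two arguments do different jobs: limits selects which values are kept on the axis, while labels only changes how kept values are displayed. The sketch below uses hypothetical level names ("Primary", "Secondary", "Tertiary") that are not from the question. It assumes, as one plausible reading of the warning, that the strings passed to limits did not all match the values in the data.

```r
library(ggplot2)

# Hypothetical data: real level names are longer than the desired tick labels
d <- data.frame(
  education = c("Primary", "Secondary", "Secondary", "Tertiary"),
  food      = c("Soup", "Vegan", "Meat", "Raw")
)

base <- ggplot(d, aes(x = education, fill = food)) + geom_bar()

# limits: values not listed here are turned into NA and removed by stat_count
p_limits <- base + scale_x_discrete(limits = c("Primary", "Secondary"))

# labels: every value is kept; a named vector maps data value -> displayed text
p_labels <- base + scale_x_discrete(
  labels = c(Primary = "A", Secondary = "B", Tertiary = "C")
)

bars_limits <- nrow(suppressWarnings(layer_data(p_limits)))
bars_labels <- nrow(layer_data(p_labels))
c(limits = bars_limits, labels = bars_labels)
```

Using a named vector for labels avoids relying on the order of the levels.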

Problem with Density Plot and Normal Density Plot in R

I am asked to plot the density of the residuals of my dataset against the density of a normal distribution.
When I do this with ggplot2, it shows me my residual density plot, but not the normal distribution. Additionally, two error messages occur:
Removed 134 rows containing non-finite values (stat_density).
Removed 403 row(s) containing missing values (geom_path).
Could somebody explain to me why the normal plot is not shown?
Please find my code below:
p1 <- ggplot(steak_eaters)+ geom_density(aes(x= resid))
p1 <- p1 + stat_function(fun =dnorm, n=403, args= list(mean= mean(steak_eaters$resid), sd = sd(steak_eaters$resid)), color= "red") + theme_stata()
plot(p1)
I don't have access to your steak_eaters dataset, but here is an example with the geom_function function and a built-in dataset:
library(ggplot2)
ggplot(diamonds, aes(x=depth)) +
geom_density(color="blue") +
geom_function(fun=function(x)
dnorm(x,
mean=mean(diamonds$depth),
sd=sd(diamonds$depth)),
color="red")
The messages you quote are warnings rather than errors. You probably have missing or infinite values in the dataset, and ggplot2 drops those rows before plotting.
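If the resid column does contain NAs, that would also explain the missing red curve: mean() and sd() return NA when any value is missing, so dnorm() yields NaN at every one of the n=403 evaluation points, which matches the "Removed 403 row(s)" warning from geom_path. A minimal sketch, using a simulated stand-in for steak_eaters (an assumption, since the real data is not shown):

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for steak_eaters: residuals with some missing values
steak_like <- data.frame(resid = c(rnorm(300), rep(NA, 50)))

naive_mean <- mean(steak_like$resid)   # NA, because the column contains NAs

# Drop the missing values explicitly when estimating the parameters
m <- mean(steak_like$resid, na.rm = TRUE)
s <- sd(steak_like$resid, na.rm = TRUE)

p <- ggplot(steak_like, aes(x = resid)) +
  geom_density(na.rm = TRUE) +    # na.rm = TRUE silences the removal warning
  stat_function(fun = dnorm, args = list(mean = m, sd = s), colour = "red")
p
```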

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to an ordinary histogram, a bar representing the ">10" values (that is, the non-numeric values). Critically, you want to keep the binning of a histogram, which means you are not looking to simply switch to a discrete scale and draw the histogram as a plain barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram we need it to be numeric. We're simply going to force it to be numeric and accept that the ">10" values will be coerced to NA. This is fine, since in the end we're just going to count those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for the bar representing our NAs (">10"), using the count() function, which returns a dataframe consisting of one row and one column: df_na$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # provides plot_grid(), used below
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
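As a quick check of the coercion and counting step, here is the same idea on a tiny hypothetical vector (toy and toy_na are illustrative names, not part of the answer's data):

```r
library(dplyr)

# Hypothetical toy input: numbers stored as text plus two censored ">10" entries
toy <- data.frame(time = c("1", "2.5", "7", ">10", ">10"))

# as.numeric() warns and returns NA for anything it cannot parse as a number
toy$time <- suppressWarnings(as.numeric(toy$time))

# count() with no grouping columns returns one row with a single column, n
toy_na <- count(subset(toy, is.na(time)))
toy_na$n
```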
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

How can I make geom_area() leave a gap for missing values?

When I plot using geom_area() I expect it to perform a lot like geom_bar(), but I'm a little perplexed by this behavior for missing values.
require(dplyr)
require(ggplot2)
set.seed(1)
test <- data.frame(x=rep(1:10,3), y=abs(rnorm(30)), z=rep(LETTERS[1:3],10)) %>% arrange(x,z)
# I also have no idea why geom_area needs the data.frame to be sorted first.
test[test$x==4,"y"] <- NA
ggplot(test, aes(x, y, fill=z)) + geom_bar(stat="identity", position="stack")
Produces this stacked bar chart.
However, if I change to geom_area() it interpolates across the missing values.
> ggplot(test, aes(x, y, fill=z)) + geom_area(stat="identity", position="stack")
Warning message:
Removed 3 rows containing missing values (position_stack).
If I add in na.rm=FALSE or na.rm=TRUE it makes no difference.
ggplot(test, aes(x, y, fill=z)) + geom_area(stat="identity", position="stack", na.rm=TRUE)
Warning message:
Removed 3 rows containing missing values (position_stack)
ggplot(test, aes(x, y, fill=z)) + geom_area(stat="identity", position="stack", na.rm=FALSE)
Warning message:
Removed 3 rows containing missing values (position_stack).
Obviously, whatever I'm trying isn't working. How can I show a gap in the series with geom_area()?
It seems that the problem has to do with how the values are stacked. The warning message tells you that the rows containing missing values were removed, so there is simply no gap present in the data that you are plotting.
However, geom_ribbon, of which geom_area is a special case, leaves gaps for missing values. geom_ribbon plots an area as well, but you have to specify the maximum and minimum y-values. So the trick can be done by calculating these values manually and then plotting with geom_ribbon(). Starting with your data frame test, I create the ymin and ymax data as follows:
test$ymax <-test$y
test$ymin <- 0
zl <- levels(factor(test$z)) # factor() makes this work when z is a character column (the default since R 4.0)
for ( i in 2:length(zl) ) {
zi <- test$z==zl[i]
zi_1 <- test$z==zl[i-1]
test$ymin[zi] <- test$ymax[zi_1]
test$ymax[zi] <- test$ymin[zi] + test$ymax[zi]
}
and then plot with geom_ribbon:
ggplot(test, aes(x=x,ymax=ymax,ymin=ymin, fill=z)) + geom_ribbon()
This gives the following plot:
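The same ymin/ymax columns can also be computed without the loop. The following is a sketch of an alternative, not the answerer's method: it assumes dplyr is available and stacks the series within each x using cumsum(), so the NA rows at x = 4 stay NA and geom_ribbon() leaves the gap.

```r
library(dplyr)
library(ggplot2)

# Rebuild the question's data so the example is self-contained
set.seed(1)
test <- data.frame(x = rep(1:10, 3), y = abs(rnorm(30)), z = rep(LETTERS[1:3], 10)) %>%
  arrange(x, z)
test[test$x == 4, "y"] <- NA

stacked <- test %>%
  group_by(x) %>%
  arrange(z, .by_group = TRUE) %>%        # stacking order within each x: A, B, C
  mutate(ymax = cumsum(y),                # top of each band
         ymin = ymax - y) %>%             # bottom of each band
  ungroup()

p <- ggplot(stacked, aes(x = x, ymin = ymin, ymax = ymax, fill = z)) +
  geom_ribbon()
p
```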

stat_qq removes values when setting group

I am trying to make a QQ-plot in ggplot2, where a select few of the points should have a different shape. But when I map the shape to a variable in the aesthetics, stat_qq includes this variable to split the data (there are 2x3 factors involved).
Here is a reproducible example:
library(ggplot2)
set.seed(331)
df <- do.call(rbind, replicate(10, {expand.grid(method=factor(letters[1:3]), model=factor(LETTERS[1:2]))}, simplify=FALSE ))
df$x <- runif(nrow(df))
df$y <- rnorm(nrow(df), sd=0.2) + 1*as.integer(df$method)
df$top <- FALSE
df <- df[order(df$y, decreasing=TRUE),]
df$top[which(df$method=='a')[1:10]] <- TRUE
So far, I have managed to make a simple QQ-plot:
ggplot(df, aes(sample=y, colour=method)) + stat_qq() + facet_grid(.~model)
This is basically what I want, except that a handful of the points in method 'a' should have a different shape, as indicated by the variable 'top'.
From the code, we know that these correspond to the top 5 values in method 'a' in each model; i.e. that the five leftmost of the red dots in each facet should have a different shape.
Here I have attempted to add it as an aesthetics:
ggplot(df, aes(sample=y, colour=method, shape=top)) + stat_qq() + facet_grid(.~model)
Now, it is quite clear that stat_qq has included the variable 'top' to split the data set, as the top 5 data points are plotted parallel to the non-top points.
This is not as intended.
How can I instruct stat_qq how to group the data?
I could try the group-aesthetic:
ggplot(df, aes(sample=y, colour=method, shape=top, group=method)) + stat_qq() + facet_grid(.~model)
Warning messages:
1: Removed 10 rows containing missing values (geom_point).
2: Removed 10 rows containing missing values (geom_point).
But for some reason, this entirely removes all data points connected to the model.
Any ideas how to overcome this?
Since you want to violate one of the fundamental concepts of ggplot2 it would be easier to do the calculations outside of ggplot:
library(plyr)
df <- ddply(df, .(model, method),
transform, theo=qqnorm(y, plot.it=FALSE)[["x"]])
ggplot(df, aes(x=theo, y=y, colour=method, shape=top)) +
geom_point() + facet_grid(.~model)
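If you would rather not load plyr, the same per-group calculation can be sketched in base R with ave(). This is an alternative under the same assumption as above (theoretical quantiles computed per model and method, ignoring top); the data setup simply repeats the question's example.

```r
library(ggplot2)

# Rebuild the question's data so the example is self-contained
set.seed(331)
df <- do.call(rbind, replicate(10, expand.grid(method = factor(letters[1:3]),
                                               model  = factor(LETTERS[1:2])),
                               simplify = FALSE))
df$y <- rnorm(nrow(df), sd = 0.2) + as.integer(df$method)
df$top <- FALSE
df <- df[order(df$y, decreasing = TRUE), ]
df$top[which(df$method == "a")[1:10]] <- TRUE

# ave() applies FUN within each model x method group and keeps the row order;
# qqnorm(..., plot.it = FALSE)$x gives the theoretical quantile for each value
df$theo <- ave(df$y, df$model, df$method,
               FUN = function(v) qqnorm(v, plot.it = FALSE)$x)

p <- ggplot(df, aes(x = theo, y = y, colour = method, shape = top)) +
  geom_point() + facet_grid(. ~ model)
p
```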

Resources