I have created a boxplot of some data using ggplot2 in which I am displaying the data points as dots along the vertical axis of the plot.
bp2 <- ggplot(DBS, aes(DBS_Electrode,Proximal_Lead_Bowing, color=DBS_Electrode)) +
geom_boxplot() + geom_dotplot(binaxis="y", stackdir="center", fill="white",
dotsize=0.5) + theme_classic()
bp2 + scale_color_manual(values=c("goldenrod3","gray62","dodgerblue1")) +
theme(legend.position = "none") + xlab("") + ylab("Proximal Lead Bowing (mm)")
It appears that my output is rounding the data points to the nearest tenth: along the axis of each boxplot there are several places where multiple points are displayed at the same level on the Y-axis (see plot http://rpubs.com/Gopher16/441664). This misrepresents the data, as no two data points have exactly the same measure of proximal lead bowing (the data were measured to the nearest thousandth). How can I change this output so that every data point is displayed at its own height along each boxplot (i.e. read to the nearest thousandth rather than rounded to the nearest tenth, so that no points are displayed at the same level on the Y-axis)?
First let's make this reproducible, and thus a more useful example for future readers, by using a built-in data set:
ggplot(iris, aes(Species, Sepal.Length)) +
geom_boxplot() +
geom_dotplot(binaxis = "y", stackdir = "center", fill = "white", dotsize = 0.5) +
theme_classic()
This exhibits the behavior you find unwanted: geom_dotplot() bins the points, making multiple points appear horizontally adjacent to each other even though their Sepal.Length values differ.
You could pass binwidth = 0.01 (or another small value) to geom_dotplot(), but that only reduces the problem and introduces other issues: the dot diameter is tied to binwidth, so the dots shrink unless you also raise dotsize.
You might want geom_jitter instead:
ggplot(iris, aes(Species, Sepal.Length)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
theme_classic()
This preserves the small differences in the unique y-values, which seems to be your chief concern.
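One caveat: when height is not given, geom_jitter() also jitters vertically, by default 40% of the resolution of the data, so the plotted y-values shift slightly. A minimal sketch, assuming you want the y-values drawn exactly as measured, sets height = 0. Hiding the boxplot's own outlier markers with outlier.shape = NA is optional; it only avoids drawing outliers twice.

```r
library(ggplot2)

# Jitter horizontally only; height = 0 leaves every y-value exactly as measured
p <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) +      # outliers are already shown by the jitter layer
  geom_jitter(width = 0.2, height = 0) +
  theme_classic()

# The y-values that the jitter layer (layer 2) will draw
jittered_y <- layer_data(p, 2)$y

# print(p) to draw the plot
```

Comparing jittered_y with iris$Sepal.Length is a quick way to confirm that only the horizontal positions were changed.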
I haven't found anyone else with this issue. Here is my plot:
facet plot
Why are there different alpha values for each facet?
As you can see, the alpha value of the geom_rect() elements seems to scale with the y-axis or number of observations, maybe because I have set these to "free_y" in the facet_wrap() argument. How can I prevent this from happening?
Here is my code:
plot_data %>%
ggplot(aes(Date, n)) +
geom_rect(data= plot_data, inherit.aes = FALSE,
aes(xmin=current_date - lubridate::weeks(1), xmax=current_date, ymin=-Inf, ymax=+Inf),
fill='pink', alpha=0.2) +
geom_col() +
facet_wrap(~Type, scales = "free_y") +
xlab("Date") +
ylab("Count") +
theme_bw() +
scale_y_continuous(breaks = integer_breaks()) +
scale_alpha_manual(values = 0.2) +
theme(axis.text.x=element_text(angle=90, hjust=1))
Cheers!
TL;DR - This is probably due to overplotting. You have 5 rect geoms drawn in the facet, but probably more than 5 observations in your dataset. The fix is to summarize your data and have geom_rect() draw from the summarized dataset.
Since OP did not provide an example dataset, we can only guess at the reason, but likely what's happening here is due to overplotting. geom_rect() behaves like all other geoms, which is to say that ggplot2 will draw or add to any geom layer with every observation (row) in the original dataset. If the geoms are drawn across facets and overlap in position, then you'll get overplotting. You can notice that this is happening based on:
Different alpha appearing on each facet, even though it should be constant based on the code, and
The fact that in order to get the rectangles to look like "light red", OP had to use pink color and an alpha value of 0.2... which shouldn't look like that if there was only one rect drawn.
Representative Example of the Issue
Here's an example that showcases the problem and how you can fix it using mtcars:
library(ggplot2)
df <- mtcars
p <- ggplot(df, aes(disp, mpg)) + geom_point() +
facet_wrap(~cyl) +
theme_bw()
p + geom_rect(
aes(xmin=200, xmax=300, ymin=-Inf, ymax=Inf),
alpha=0.01, fill='red')
As in OP's case, we expect all rectangles to have the same alpha value, but they do not. Also, note that the alpha value is ridiculously low (0.01) for the color you see there. What's going on should be more obvious if we check the number of observations in mtcars that fall within each facet:
> library(dplyr)
> mtcars %>% group_by(cyl) %>% tally()
# A tibble: 3 x 2
    cyl     n
  <dbl> <int>
1     4    11
2     6     7
3     8    14
The facet for cyl==6 has the fewest observations (7), followed by cyl==4 (11) and cyl==8 (14). This corresponds precisely to the alpha values we see for the geoms in the plot, so this is what's going on. For each observation, a rectangle is drawn at the same position, so there are 11 rectangles drawn in the left facet, 7 in the middle facet, and 14 in the right facet.
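To check this rather than infer it from the picture, you can count how many rectangles ggplot2 actually builds for each panel. The sketch below rebuilds the plot from above and inspects the rect layer with layer_data(). The effective_alpha line is an assumption on my part: it applies the standard alpha-compositing rule, where n stacked layers of opacity a combine to 1 - (1 - a)^n.

```r
library(ggplot2)

p_rect <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  geom_rect(aes(xmin = 200, xmax = 300, ymin = -Inf, ymax = Inf),
            alpha = 0.01, fill = "red")

# One row per rectangle that will be drawn; layer 2 is geom_rect()
rect_rows <- layer_data(p_rect, 2)
rects_per_panel <- table(rect_rows$PANEL)
rects_per_panel

# Approximate combined opacity of the stacked rectangles in each panel
effective_alpha <- 1 - (1 - 0.01)^as.integer(rects_per_panel)
round(effective_alpha, 3)
```

The per-panel counts should match the tally() output, which is why the panel with the most observations shows the darkest rectangle.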
Fixing the Issue: Summarize the Data
To fix the issue, you should summarize your data and use the summarized dataset for plotting the rectangles.
summary_df <- df %>%
group_by(cyl) %>%
summarize(mean_d = mean(disp))
p + geom_rect(
data = summary_df,
aes(x=1, y=1, xmin=mean_d-50, xmax=mean_d+50, ymin=-Inf, ymax=Inf),
alpha=0.2, fill='red')
Since summary_df has only 3 observations (one for each group of cyl), the rectangles are drawn correctly and now alpha=0.2 with fill="red" gives the expected result. One thing to note here is that we still have to define x and y in the aes(). I set them both to 1 because although geom_rect() doesn't use them, ggplot2 still expects to find them in the dataset summary_df because we stated that they are assigned to that plot globally up in ggplot(df, aes(x=..., y=...)). The fix is to either move the aes() declaration into geom_point() or just assign both to be constant values in geom_rect().
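Another option, sketched here on the assumption that the rectangle layer needs none of the global mappings, is to set inherit.aes = FALSE on geom_rect(). The layer then ignores the x and y declared in ggplot(), so no dummy values are required. The object name p_fixed is just for illustration.

```r
library(ggplot2)
library(dplyr)

# One row per facet, as in the fix above
summary_df <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_d = mean(disp))

p_fixed <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  theme_bw() +
  geom_rect(
    data = summary_df,
    inherit.aes = FALSE,   # do not look for disp/mpg in summary_df
    aes(xmin = mean_d - 50, xmax = mean_d + 50, ymin = -Inf, ymax = Inf),
    alpha = 0.2, fill = "red")

# Number of rectangles built for the rect layer (layer 2): one per facet
nrow(layer_data(p_fixed, 2))
```

This keeps the global aes() in place for geom_point() while letting the rectangle layer carry only the aesthetics it actually uses.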
I plotted a grouped boxplot and am trying to change the background color for each panel. I can use the panel.background theme element to change the whole plot background, but how can this be done for an individual panel? I found a similar question here, but I failed to adapt the code to my plot.
Top few lines of my input data look like
Code
p<-ggplot(df, aes(x=Genotype, y=Length, fill=Treatment)) + scale_fill_manual(values=c("#69b3a2", "#CF7737"))+
geom_boxplot(width=2.5)+ theme(text = element_text(size=20),panel.spacing.x=unit(0.4, "lines"),
axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y = element_text(angle=90, hjust=1,colour="black")) +
labs(x = "Genotype", y = "Petal length (cm)")+
facet_grid(~divide,scales = "free", space = "free")
p+theme(panel.background = element_rect(fill = "#F6F8F9", colour = "#E7ECF1"))
Unfortunately, like the other theme elements, the fill aesthetic of element_rect() cannot be mapped to data. You cannot just send a vector of colors to fill either (create your own mapping of sorts). In the end, the simplest solution probably is going to be very similar to the answer you linked to in your question... with a bit of a twist here.
I'll use mtcars as an example. Note that I'm converting some of the continuous variables in the dataset to factors so that we can create some more discrete values.
It's important to note, the rect geom is drawn before the boxplot geom, to ensure the boxplot appears on top of the rect.
ggplot(mtcars, aes(factor(carb), disp)) +
geom_rect(
aes(fill=factor(carb)), alpha=0.5,
xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
geom_boxplot() +
facet_grid(~factor(carb), scales='free_x') +
theme_bw()
All done... but not quite. Something is wrong and you might notice this if you pay attention to the boxes on the legend and the gridlines in the plot panels. It looks like the alpha value is incorrect for some facets and okay for others. What's going on here?
Well, this has to do with how geom_rect works. It's drawing a box on each plot panel, but just like the other geoms, it's mapped to the data. Even though the x and y aesthetics for the geom_rect are not actually used to draw the rectangle, they determine how many rectangles are drawn. This means that the number of rectangles drawn in each facet corresponds to the number of rows in the dataset that belong to that facet. If 3 observations exist, 3 rectangles are drawn. If 20 observations exist for one facet, 20 rectangles are drawn, etc.
So, the fix is to supply a dataframe that contains one observation each for every facet. We have to then make sure that we supply any and all other aesthetics (x and y here) that are included in the ggplot call, or we will get an error indicating ggplot cannot "find" that particular column. Remember, even if geom_rect doesn't use these for drawing, they are used to determine how many observations exist (and therefore how many to draw).
rect_df <- data.frame(carb=unique(mtcars$carb)) # supply one of each type of carb
# have to give something to disp
rect_df$disp <- 0
ggplot(mtcars, aes(factor(carb), disp)) +
geom_rect(
data=rect_df,
aes(fill=factor(carb)), alpha=0.5,
xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
geom_boxplot() +
facet_grid(~factor(carb), scales='free_x') +
theme_bw()
That's better.
time_pic <- ggplot(data_box, aes(x=Kind, y=TimeTotal, fill=Sitting_Position)) +
geom_boxplot()
print(time_pic)
time_pic+labs(title="", x="", y = "Time (Sec)")
I ran the code above to get the following image, but I don't know how to add the average value for each boxplot to it.
updated.
I tried this.
means <- aggregate(TimeTotal ~ Sitting_Position*Kind, data_box, mean)
ggplot(data=data_box, aes(x=Kind, y=TimeTotal, fill=Sitting_Position)) +
geom_boxplot() +
stat_summary(fun=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE) +
geom_text(data = means, aes(label = TimeTotal, y = TimeTotal + 0.08))
This is what it looks like now. Two dots are on the same line. And two values are overlapping with each other.
As others said, you can share your dataset for more specific help, but in this case I think the point can be made using a dummy dataset. I'm creating one that looks pretty similar to your own in terms of naming, so theoretically you can just plug in this code and it could work.
The biggest thing you need here is to control how ggplot2 is separating the separate boxplots for the data_box$Sitting_Position that share the same data_box$Kind. The process of separating and spreading the boxes around that x= axis value is called "dodging". When you supply a fill= or color= (or other) aesthetic in aes() for that geom, ggplot2 knows enough that it will assume you also want to group the data according to that value. So, your initial ggplot() call has in aes() that fill=Sitting_Position, which means that geom_boxplot() "works" - it creates the separate boxes that are colored differently and which are "dodged" properly.
When you create the points and the text, ggplot2 has no idea that you want to "dodge" this data, and even if you did want to dodge, on what basis to use for the dodge, since the fill= aesthetic doesn't make sense for a text or point geom. How to fix this? The answer is to:
Supply a group= aesthetic, which can override the grouping of a fill= or color= aesthetic, but which also can serve as a basis for the dodging for geoms that do not have a similar aesthetic.
Specify more clearly how you want to dodge. This will be important for accurate positioning of all things you want to dodge. Otherwise, you will have things dodged, but maybe not the same distance.
Here's how I combined all that:
library(ggplot2)

# the datasets
set.seed(1234)
data_box <- data.frame(
Kind=c(rep('Model-free AR',100),rep('Real-world',100)),
TimeTotal=c(rnorm(50,5.5,1),rnorm(50,5.43,1.1),rnorm(50,4.9,1),rnorm(50,4.7,0.2)),
Sitting_Position=rep(c(rep('face to face',50),rep('side by side',50)),2)
)
means <- aggregate(TimeTotal ~ Sitting_Position*Kind, data_box, mean)
# the plot
ggplot(data_box, aes(x=Kind, y=TimeTotal)) + theme_bw() +
# specifying dodge here and width to avoid overlapping boxes
geom_boxplot(
aes(fill=Sitting_Position),
position=position_dodge(0.6), width=0.5
) +
# note group aesthetic and same dodge call for next two objects
stat_summary(
aes(group=Sitting_Position),
position=position_dodge(0.6),
fun=mean,
geom='point', color='darkred', shape=18, size=3,
show.legend = FALSE
) +
geom_text(
data=means,
aes(label=round(TimeTotal,2), y=TimeTotal + 0.18, group=Sitting_Position),
position=position_dodge(0.6)
)
Giving you this:
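As a variation on the plot above, the labels can also be computed by stat_summary() itself, so there is no separate means data frame to keep in sync. This is only a sketch: it assumes ggplot2 3.3.0 or later for after_stat(), it rebuilds the same dummy data_box, and vjust = -1 is an arbitrary offset you may need to adjust.

```r
library(ggplot2)

set.seed(1234)
data_box <- data.frame(
  Kind = c(rep("Model-free AR", 100), rep("Real-world", 100)),
  TimeTotal = c(rnorm(50, 5.5, 1), rnorm(50, 5.43, 1.1),
                rnorm(50, 4.9, 1), rnorm(50, 4.7, 0.2)),
  Sitting_Position = rep(c(rep("face to face", 50), rep("side by side", 50)), 2)
)

dodge <- position_dodge(0.6)  # the same dodge width for every layer

p_means <- ggplot(data_box, aes(x = Kind, y = TimeTotal)) +
  theme_bw() +
  geom_boxplot(aes(fill = Sitting_Position), position = dodge, width = 0.5) +
  stat_summary(aes(group = Sitting_Position), position = dodge, fun = mean,
               geom = "point", color = "darkred", shape = 18, size = 3) +
  # the label is taken from the summarised y, i.e. the group mean
  stat_summary(aes(group = Sitting_Position, label = after_stat(round(y, 2))),
               position = dodge, fun = mean, geom = "text", vjust = -1)

# The labels that layer 3 will draw
mean_labels <- layer_data(p_means, 3)$label

# print(p_means) to draw the plot
```

Because the group aesthetic and the dodge width are the same in all three layers, the points and labels line up with their boxes.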
Is there a way to equalise the size of geom_points throughout multiple plots, so that they are easily comparable?
ie. I want the size of a 100 value to be equal throughout the plots, regardless of the minimum and maximum value that makes up the size values. As seen below, the size of geom_points are the same, but they represent different values.
graph <- ggplot(mar, aes(x=long, y=lat)) + xlab("Longitude") + ylab("Latitude")
graph + theme_grey() + geom_point(aes(size=distance$NEAR_DIST)) + scale_size_area() + labs(size = "Distance from predicted LCP Roman\nroad to known Roman road (m)")
Thanks!
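You could achieve that by giving the size scale the same limits in every plot:
library(ggplot2)

# Two example datasets whose size variables cover very different ranges
df1 <- data.frame(x = 1:20, y = runif(20, 1, 10), size = runif(20, 1, 10))
df2 <- data.frame(x = 1:20, y = runif(20, 1, 10), size = runif(20, 31, 40))
# Shared upper limit; the lower limit is 0 so that small values are not dropped
maximum <- max(c(df1$size, df2$size))
graph <- ggplot(df1, aes(x = x, y = y, size = size)) + geom_point() +
  scale_size_area(limits = c(0, maximum))
graph2 <- ggplot(df2, aes(x = x, y = y, size = size)) + geom_point() +
  scale_size_area(limits = c(0, maximum))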
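To check that shared limits do what is intended, the sketch below puts one point with the same size value (5, an arbitrary choice) into both datasets and compares the point sizes ggplot2 computes for it. The helper shared_size() is hypothetical, introduced only so that both plots get an identical scale. The lower limit is 0 here because scale_size_area() maps a value of 0 to zero area.

```r
library(ggplot2)

set.seed(42)
# The first row of each dataset has size = 5; the rest differ widely
df1 <- data.frame(x = 1:20, y = runif(20, 1, 10), size = c(5, runif(19, 1, 10)))
df2 <- data.frame(x = 1:20, y = runif(20, 1, 10), size = c(5, runif(19, 31, 40)))

maximum <- max(c(df1$size, df2$size))

# Hypothetical helper: returns a fresh size scale with the shared limits
shared_size <- function() scale_size_area(limits = c(0, maximum))

g1 <- ggplot(df1, aes(x = x, y = y, size = size)) + geom_point() + shared_size()
g2 <- ggplot(df2, aes(x = x, y = y, size = size)) + geom_point() + shared_size()

# Drawn size of the point at x = 1 (data value 5) in each plot
size_in_g1 <- subset(layer_data(g1), x == 1)$size
size_in_g2 <- subset(layer_data(g2), x == 1)$size
c(size_in_g1, size_in_g2)
```

If the two numbers agree, a given data value is drawn at the same size in both plots, whatever else each dataset contains.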
Hope this helps!
The most commonly cited example of how to visualize a logistic fit using ggplot2 seems to be something very much like this:
data("kyphosis", package="rpart")
ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"))
This visualisation works great if you don't have too much overlapping data, and the first suggestion for crowded data seems to be to use injected jitter in the x and y coordinates of the points then adjust the alpha value of the points. When you get to the point where individual points aren't useful but distributions of points are, is it possible to use geom_density(), geom_histogram(), or something else to visualise the data but continue to split the categorical variable along the y-axis as it is done with geom_point()?
From what I have found, geom_density() and geom_histogram() can easily be split/grouped by the categorical variable and both levels can easily be reversed using scale_y_reverse() but I can't figure out if it is even possible to move only one of the categorical variable distributions to the top of the plot. Any help/suggestions would be appreciated.
The annotate() function in ggplot2 allows you to add geoms to a plot with properties that "are not mapped from the variables of a data frame, but are instead passed in as vectors," meaning that you can add layers that are unrelated to your data frame. In this case your two density curves are related to the data frame (since the variables are in it), but because you're trying to position them differently, using annotate() is useful.
Here's one way to go about it:
data("kyphosis", package="rpart")
model.only <- ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
stat_smooth(method="glm", method.args=list(family="binomial"))
absents <- subset(kyphosis, Kyphosis=="absent")
presents <- subset(kyphosis, Kyphosis=="present")
dens.absents <- density(absents$Age)
dens.presents <- density(presents$Age)
scaling.factor <- 10 # Make the density plots taller
model.only + annotate("line", x=dens.absents$x, y=dens.absents$y*scaling.factor) +
annotate("line", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1)
This adds two annotated layers with scaled density plots for each of the kyphosis groups. For the presents variable, y is scaled and increased by 1 to shift it up.
You can also fill the density plots instead of just using a line. Instead of annotate("line"...) you need to use annotate("polygon"...), like so:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green", colour="black", alpha=0.4)
Technically you could use annotate("density"...), but that won't work when you shift the present plot up by one. Instead of shifting, it fills the whole plot:
model.only + annotate("density", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red") +
annotate("density", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green")
The only way around that problem is to use a polygon instead of a density geom.
One final variant: flipping the top density plot along y-axis = 1:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=(1 - dens.presents$y*scaling.factor), fill="green", colour="black", alpha=0.4)
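The scaling factor of 10 used above was picked by eye. A small sketch of one way to derive it instead, assuming you want the tallest density peak to reach a chosen height on the 0 to 1 probability axis (target.height is a hypothetical name, and 0.4 an arbitrary value):

```r
library(ggplot2)
data("kyphosis", package = "rpart")

dens.absents  <- density(subset(kyphosis, Kyphosis == "absent")$Age)
dens.presents <- density(subset(kyphosis, Kyphosis == "present")$Age)

# Hypothetical target: the tallest density peak should be 0.4 y-units high
target.height  <- 0.4
scaling.factor <- target.height / max(dens.absents$y, dens.presents$y)

scaled.absents  <- dens.absents$y * scaling.factor       # sits on y = 0
scaled.presents <- dens.presents$y * scaling.factor + 1  # shifted up to y = 1

p_dens <- ggplot(kyphosis, aes(x = Age, y = as.numeric(Kyphosis) - 1)) +
  stat_smooth(method = "glm", method.args = list(family = "binomial")) +
  annotate("polygon", x = dens.absents$x, y = scaled.absents,
           fill = "red", colour = "black", alpha = 0.4) +
  annotate("polygon", x = dens.presents$x, y = scaled.presents,
           fill = "green", colour = "black", alpha = 0.4)

# print(p_dens) to draw the plot
```

Both curves share one factor, so their relative heights stay comparable; scaling each curve separately would hide that one group's density is more peaked than the other's.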
I am not sure I get your point, but here is an attempt:
dat <- rbind(kyphosis,kyphosis)
dat$grp <- factor(rep(c('smooth','dens'),each = nrow(kyphosis)),
levels = c('smooth','dens'))
ggplot(dat,aes(x=Age)) +
facet_grid(grp~.,scales = "free_y") +
#geom_point(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1)) +
stat_smooth(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1),
method="glm", method.args=list(family="binomial")) +
geom_density(data=subset(dat,grp=='dens'))