ggplot2 scatter plot colors are confused - r

When you run this code, you will see that the facet for B has a red point, but it clearly should be blue. How do you set the colors properly, given the data frame "d"?
Thank you.
d = data.frame(x = c(1,2,3),y = c(4,5,6), color = c("red","blue","red"), group = c("A","B","A"))
d
ggplot(data= d, aes(x = x, y = y ) ) +geom_point( color = d$color)+
facet_wrap(~group)

Unlike base plots, ggplot doesn't expect you to have a column of color names in your data. It expects you to have a column that defines the variable that you want to color by, and optionally specify the mapping between that vector's values and custom colors (if you don't like the defaults).
In your data, the color column seems to be based off of the group column. This would be the canonical ggplot way to create your plot (notice that the color column is not used):
ggplot(data = d, aes(x = x, y = y, color = group)) +
geom_point() +
facet_wrap(~group)
Note that you do not need to facet and color by the same column, e.g.,
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point() +
facet_wrap(~ am)
The key point is that you map a column to color inside aes(). When facets are involved, ggplot does potentially complicated splitting of the data behind the scenes. This data manipulation is based on the data provided to the data argument and the column names provided inside aes().
If you specify data$column you are passing just a vector. You have taken it from your data frame, but ggplot doesn't know that - it could have come from anywhere. This will cause mistakes in the subsetting done for the facets. You need to use aes(color = column) (note the lack of data$ - use just the column name inside aes()), and ggplot will look for a column of that name in the data and know how to correctly filter the data for each facet.

This is one way:
ggplot(data= d, aes(x = x, y = y ) ) +
geom_point(aes(color = color))+
facet_wrap(~group) +
scale_color_manual(values = c('red' = 'red','blue' = 'blue'))
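If the color column already holds valid R color names, another option is scale_color_identity(), which tells ggplot to use the mapped values as they are. A minimal sketch, assuming the same data frame d as in the question:

```r
library(ggplot2)

# Same data as in the question: the color column holds literal color names.
d <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6),
                color = c("red", "blue", "red"),
                group = c("A", "B", "A"))

# Map the column inside aes() so ggplot can subset it per facet,
# then use the identity scale so the names are used as the actual colors.
p <- ggplot(d, aes(x = x, y = y, color = color)) +
  geom_point() +
  facet_wrap(~group) +
  scale_color_identity()
p
```

Because the column is mapped rather than passed as d$color, the single point in facet B keeps its own "blue" value. The identity scale draws no legend by default.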

Related

Wrong matching of colours in ggplot using scale color discrete [duplicate]

I am new to ggplot. I am trying to understand how to use ggplot. I am reading Wickham's book and still trying to wrap my head around mapping vs. setting color.
A) Discrete case
Here's what I did:
library(dplyr)
library(ggplot2)
test<-filter(mpg,year==2008)
test<-test[1:10,]
grid <- data_frame(displ = seq(min(mpg$displ), max(mpg$displ), length = 50))
mod <- loess(hwy ~ displ, data = mpg)
grid$hwy <- predict(mod, newdata = grid)
a) Use discrete values and then use (aes(color = "xyz"))
ggplot(mpg,aes(displ,hwy)) +
geom_point() +
geom_text(data = test,aes(label=trans,color = "blue"))
This just adds a legend with the label "blue". Why does this happen?
b) Supply color = "blue" outside of aesthetics.
ggplot(mpg,aes(displ,hwy)) +
geom_point() +
geom_text(data = test,aes(label=trans),color = "blue")
This works and changes the color to "blue".
B) Continuous case
a) Use (aes(color = "xyz"))
Here's what I did:
ggplot(mpg,aes(displ,hwy)) +
geom_point() +
geom_line(data = grid, aes(colour = "green"),size=1.5)
As with the case a) for discrete case, this adds a pink line with the text "green"
b) Supply color outside of aesthetics.
ggplot(mpg,aes(displ,hwy)) +
geom_point() +
geom_line(data = grid, colour = "green",size=1.5)
Here, the color of the line does change to "Green" and I have lost the label.
So, I do not understand the value of aes(colour = "xyz"). All it does is add a label, doesn't it? Why would we use it?
Data - data columns or transformations of data columns, go inside aes(). When you do aes(color = 'blue'), it's as if your data had an unnamed column that had the character string "blue" in every row.
ggplot(mpg,aes(displ,hwy)) +
geom_point() +
geom_text(data = test, aes(label = trans, color = "blue"))
In this context, "blue" is not a color - it is just a character string. You will get an identical result (except for the label) if you use color = "green", color = "bleu", or color = "look at this long long label" - if these are inside aes().
A character string - even if it only has one value - will be coerced to a factor and treated as a discrete variable.
This can be confusing if you don't follow the general rule: don't put constants inside aes() - only put mappings to actual data columns.
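The difference between mapping and setting can be checked directly. This is a small sketch using the mpg data that ships with ggplot2; layer_data() returns the values ggplot actually draws with, and the variable names are only illustrative:

```r
library(ggplot2)

# Constant inside aes(): treated as a one-level discrete variable,
# so the string is only a legend label, not a color.
p_blue  <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
p_green <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "green"))

# Constant outside aes(): set directly as the point color, no legend.
p_set <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

# Colors that end up on the points in each plot.
mapped_blue  <- unique(layer_data(p_blue)$colour)
mapped_green <- unique(layer_data(p_green)$colour)
set_blue     <- unique(layer_data(p_set)$colour)

# Both mapped strings end up with the same default color.
identical(mapped_blue, mapped_green)
# The color that was set directly.
set_blue
```

The two mapped plots differ only in their legend label, while the third plot uses the color that was actually requested.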
You seem to be confused about continuous vs discrete color scales. What you label as "continuous case" is still discrete when it comes to color. Using geom_point or geom_line, a smoothed geom, or any other geom doesn't make color discrete or continuous. The only thing that matters for choosing a discrete or continuous color scale is the type (class) of data that is mapped to color. If it is numeric, the default color scale will be continuous. If it is not numeric, the default color scale will be discrete.
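As a rough illustration of that last point, the same column can produce either kind of scale depending on its class. This sketch assumes the mpg data from ggplot2, where cyl is stored as an integer:

```r
library(ggplot2)

# cyl is numeric, so the default color scale is a continuous gradient.
p_cont <- ggplot(mpg, aes(displ, hwy, colour = cyl)) + geom_point()

# Wrapping it in factor() makes it non-numeric, so the default scale is discrete.
p_disc <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) + geom_point()

# The geom is the same in both plots; only the class of the mapped data differs.
cols_cont <- sort(unique(layer_data(p_cont)$colour))
cols_disc <- sort(unique(layer_data(p_disc)$colour))
```

Both plots use geom_point(), yet the colors they draw with come from different scales.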

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making a mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.
I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:
@Maninder: there are a few things you need to look at.
First of all: The grammar of graphics of ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason your code is not working is that you mix up the layer calls and do not clearly specify which data belongs to the scatter layer and which to the line layer.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting the Antisocial data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter layer, i.e. geom_point().
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I'll skip showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes(), I recommend keeping the data = and aes() calls inside each geom_xxxx(). This avoids confusion.
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sharing the same position (overplotting).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully comprehend what you mean by that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and linetype. You can play around with the colours, etc. to give you the aesthetics you aim for.
Please note that for the grouped boxplot you also have to convert your integer variable YOI into a categorical factor, which I did with as.factor(). Boxplots use fill for the body (colour only sets the outline). In this setup, you also need to supply a group value to geom_line() (I just assigned it to 1, but that is arbitrary - in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
Hope this gets you going!
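If "mean ASB vs YOI grouped by Race" means one mean per YOI and Race combination, a sketch could look like the following. The data frame fake is a hypothetical stand-in for Antisocial.csv (the real file is not reproduced here); only the column names Race, YOI and ASB are taken from the question.

```r
library(ggplot2)

# Hypothetical stand-in data with the same column names as the question.
set.seed(1)
fake <- data.frame(
  Race = rep(c(0, 1), each = 30),
  YOI  = rep(c(90, 92, 94), times = 20),
  ASB  = runif(60, min = 0, max = 6)
)

# One mean per YOI and Race combination (base R, no extra packages).
means <- aggregate(ASB ~ YOI + Race, data = fake, FUN = mean)
names(means)[names(means) == "ASB"] <- "ASB_mean"

# Map Race (as a factor) to colour so each group gets its own points and line.
p <- ggplot(means, aes(x = YOI, y = ASB_mean, colour = factor(Race))) +
  geom_point(size = 2) +
  geom_line() +
  labs(colour = "Race")
p
```

Mapping a discrete variable to colour also defines the groups, so geom_line() draws one line per Race without a separate group aesthetic.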

Boxplot ggplot2: Show mean value and number of observations in grouped boxplot

I wish to add the number of observations to this boxplot, not by group but separated by factor. Also, I wish to display the number of observations in addition to the x-axis label that it looks something like this: ("PF (N=12)").
Furthermore, I would like to display the mean value of each box inside of the box, displayed in millions in order not to have a giant number for each box.
Here is what I have got:
give.n <- function(x){
return(c(y = median(x)*1.05, label = length(x)))
}
mean.n <- function(x){x <- x/1000000
return(c(y = median(x)*0.97, label = round(mean(x),2)))
}
ggplot(Soils_noctrl) +
geom_boxplot(aes(x=Slope,y=Events.g_Bacteria, fill = Detergent),
varwidth = TRUE) +
stat_summary(aes(x = Slope, y = Events.g_Bacteria), fun.data = give.n, geom = "text",
fun = median,
position = position_dodge(width = 0.75))+
ggtitle("Cell Abundance")+
stat_summary(aes(x = Slope, y = Events.g_Bacteria),
fun.data = mean.n, geom = "text", fun = mean, colour = "red")+
facet_wrap(~ Location, scale = "free_x")+
scale_y_continuous(name = "Cell Counts per Gram (Millions)",
breaks = round (seq(min(0),
max(100000000), by = 5000000),1),
labels = function(y) y / 1000000)+
xlab("Sample")
And so far it looks like this:
As you can see, the mean value is at the bottom of the plot, and the numbers of observations are inside the boxes but not separated by group.
Thank you for your help! Cheers
TL;DR - you need to supply a group= aesthetic, since ggplot2 does not know on which column data it is supposed to dodge the text geom.
Unfortunately, we don't have your data, but here's an example set that can showcase the rationale here and the function/need for group=.
set.seed(1234)
df1 <- data.frame(detergent=c(rep('EDTA',15),rep('Tween',15)), cells=c(rnorm(15,10,1),rnorm(15,10,3)))
df2 <- data.frame(detergent=c(rep('EDTA',20),rep('Tween',20)), cells=c(rnorm(20,1.3,1),rnorm(20,4,2)))
df3 <- data.frame(detergent=c(rep('EDTA',30),rep('Tween',30)), cells=c(rnorm(30,5,0.8),rnorm(30,3.3,1)))
df1$smp='Sample1'
df2$smp='Sample2'
df3$smp='Sample3'
df <- rbind(df1,df2,df3)
Instead of using stat_summary(), I'm just going to create a separate data frame to hold the mean values I want to include as text on my plot:
library(dplyr)  # needed for %>%, group_by() and summarize()
summary_df <- df %>% group_by(smp, detergent) %>% summarize(m=mean(cells))
Now, here's the plot and use of geom_text() with dodging:
p <- ggplot(df, aes(x=smp, y=cells)) +
geom_boxplot(aes(fill=detergent))
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2)),
color='blue', position=position_dodge(0.8)
)
You'll notice the numbers are all separated along y= just fine, but the "dodging" is not working. This is because we have not supplied any information on how to do the dodging. In this case, the group= aesthetic can be supplied to let ggplot2 know which column to dodge by:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), group=detergent),
color='blue', position=position_dodge(0.8)
)
You don't have to supply the group= aesthetic if you supply another aesthetic such as color= or fill=. In cases where you give both a color= and group= aesthetic, the group= aesthetic will override any of the others for dodging purposes. Here's an example of the same, but where you don't need a group= aesthetic because I've moved color= up into the aes() (changing fill to greyscale so that you can see the text):
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), color=detergent),
position=position_dodge(0.8)
) + scale_fill_grey()
FUN FACT: Dodging still works even if you supply geom_text() with a nonsensical aesthetic that would normally work for dodging, such as fill=. You get a warning message Ignoring unknown aesthetics: fill, but the dodging still works:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), fill=detergent),
position=position_dodge(0.8)
)
# gives you the same plot as if you just supplied group=detergent, but with black text
In your case, changing your stat_summary() line to this should work:
stat_summary(aes(x = Slope, y = Events.g_Bacteria, group = Detergent),...
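For the axis labels of the form "PF (N=12)", one possible approach is to count the observations per x category and pass the resulting labels to scale_x_discrete(). This is only a sketch on made-up data (df, smp, detergent and cells are hypothetical names, not the asker's Soils_noctrl), and it also shows stat_summary() with the group aesthetic:

```r
library(ggplot2)

# Hypothetical example data.
set.seed(1234)
df <- data.frame(
  smp       = rep(c("Sample1", "Sample2"), times = c(30, 40)),
  detergent = c(rep(c("EDTA", "Tween"), each = 15),
                rep(c("EDTA", "Tween"), each = 20)),
  cells     = rnorm(70, mean = 10, sd = 2)
)

# Count observations per x-axis category and build "name (N=...)" labels.
n_per_smp <- table(df$smp)
x_labels  <- setNames(
  paste0(names(n_per_smp), " (N=", as.vector(n_per_smp), ")"),
  names(n_per_smp)
)

p <- ggplot(df, aes(x = smp, y = cells)) +
  geom_boxplot(aes(fill = detergent)) +
  # group = detergent tells position_dodge() how to split the text labels
  stat_summary(aes(group = detergent, label = round(after_stat(y), 2)),
               fun = mean, geom = "text",
               position = position_dodge(width = 0.75)) +
  scale_x_discrete(labels = x_labels)
p
```

The named vector lets scale_x_discrete() match each label to its category by name, so the order of the categories does not matter.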

Best way to calculate number of facets in geom_hline/_vline

When I combine geom_vline() with facet_grid() like so:
DATA <- data.frame(x = 1:6,y = 1:6, f = rep(letters[1:2],3))
ggplot(DATA,aes(x = x,y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(xintercept = 2:3,
colour =c("goldenrod3","dodgerblue3"))
I get an error message stating Error: Aesthetics must be either length 1 or the same as the data (4): colour because there are two lines in each facet and there are two facets. One way to get around this is to use rep(c("goldenrod3","dodgerblue3"),2), but this requires that every time I change the faceting variables, I also have to calculate the number of facets and replace the magic number (2) in the call to rep(), which makes re-using ggplot code so much less nimble.
Is there a way to get the number of facets directly from ggplot for use in this situation?
You could put the xintercept and colour info into a data.frame to pass to geom_vline and then use scale_color_identity.
ggplot(DATA, aes(x = x, y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(data = data.frame(xintercept = 2:3,
colour = c("goldenrod3","dodgerblue3") ),
aes(xintercept = xintercept, color = colour) ) +
scale_color_identity()
This side-steps the issue of figuring out the number of facets, although that could be done by pulling out the number of unique values in the faceting variable with something like length(unique(DATA$f)).
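To confirm that the data-frame approach repeats the lines in every facet without any rep() call, the built layer can be inspected with layer_data(). A minimal sketch, where vlines is just an illustrative name for the helper data frame:

```r
library(ggplot2)

DATA <- data.frame(x = 1:6, y = 1:6, f = rep(letters[1:2], 3))

# Line positions and colours live in their own data frame. It has no column f,
# so facet_grid() repeats these rows in every panel.
vlines <- data.frame(xintercept = 2:3,
                     colour = c("goldenrod3", "dodgerblue3"))

p <- ggplot(DATA, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(f ~ .) +
  geom_vline(data = vlines, aes(xintercept = xintercept, colour = colour)) +
  scale_colour_identity()

# Number of vertical lines drawn in each panel.
lines_per_panel <- table(layer_data(p, 2)$PANEL)
lines_per_panel
```

Changing the faceting formula later does not require touching the geom_vline() call.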

Which aesthetics go in ggplot( ) and which in geom_xx( )

How should I decide when to put parameters in ggplot( ) or in the geom_xx( )? Does it matter if the parameter is set to a constant or to a column from the data frame? What other factors (R pun unintentional) should be considered?
This seems to work fine, but has a legend which lists the transparency.
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, alpha = 0.6))+geom_point(size = 4)
This is a slight improvement because the legend has been removed, but seems to be the same otherwise.
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl))+geom_point(size = 4, alpha = 0.6)
I understand that parameters should be defined once when possible if they apply to all geoms, and separately when they should apply to only one geom or should override settings from ggplot(). There's a helpful list of aesthetics in "Is there a table or catalog of aesthetics for ggplot2?" and a nice graphical representation here.
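As a hedged illustration of the inheritance behaviour described above: mappings given in ggplot() are inherited by every layer, while mappings given inside a geom apply to that layer only. The sketch below uses mtcars and adds a geom_line() layer purely to make the difference visible:

```r
library(ggplot2)

# Mapping in ggplot(): inherited by every layer, so the lines are coloured too.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 4, alpha = 0.6) +   # constants are set outside aes()
  geom_line()

# Mapping in geom_point() only: the line layer does not see it.
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)), size = 4, alpha = 0.6) +
  geom_line()

# Count the distinct colours used by the line layer (layer 2) in each plot.
n_line_colours_global <- length(unique(layer_data(p_global, 2)$colour))
n_line_colours_local  <- length(unique(layer_data(p_local, 2)$colour))
```

In the first plot the line layer is split into one coloured line per cyl value; in the second it stays a single line in the default colour.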

Resources