I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).
So far, I have something like this:
# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
geom_boxplot(???, alpha=0.5)
The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?
You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:
library(ggplot2)
library(dplyr)
ggplot(mtcars %>% group_by(carb) %>%
         mutate(medMPG = median(mpg)),
       aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
  geom_boxplot(aes(fill=medMPG)) +
  # fun.y was renamed to fun in ggplot2 3.3.0
  stat_summary(fun=mean, colour="darkred", geom="point") +
  scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:
scale_fill_gradient(limits=c(13.3, 39.8),
low=hcl(15,100,75), high=hcl(195,100,75))
This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
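One way to set the limits programmatically is to collect the per-group medians from every data set first and take their overall range. The sketch below assumes the data sets live in a named list; `datasets`, the shifted copy of mtcars, and the helper `shared_fill()` are all made up for illustration and are not part of ggplot2:

```r
library(ggplot2)
library(dplyr)

# Hypothetical collection of data sets that should share one colour mapping.
# The second one is just mtcars with mpg shifted, for illustration only.
datasets <- list(a = mtcars, b = transform(mtcars, mpg = mpg + 10))

# Median of mpg per carb group, for one data set
group_medians <- function(d) tapply(d$mpg, d$carb, median)

# Overall range of the group medians across all data sets
fill_limits <- range(unlist(lapply(datasets, group_medians)))

# Helper (not a ggplot2 function) that returns the same fill scale every time
shared_fill <- function() {
  scale_fill_gradient(limits = fill_limits,
                      low = hcl(15, 100, 75), high = hcl(195, 100, 75))
}

# Build one boxplot per data set, all using the shared limits
plots <- lapply(datasets, function(d) {
  ggplot(d %>% group_by(carb) %>% mutate(medMPG = median(mpg)),
         aes(x = reorder(carb, mpg, FUN = median), y = mpg)) +
    geom_boxplot(aes(fill = medMPG)) +
    shared_fill()
})
```

Because the limits are derived from the data, they update automatically if a data set changes or a new one is added to the list.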
I built on eipi10's solution and obtained the following code which does what I want:
# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL) # using reshape2
g <- ggplot(DT.m %>% group_by(variable) %>%
              mutate(median.gini = median(value)),
            aes(x = reorder(variable, value, FUN=median), y = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_boxplot(aes(fill=median.gini)) +
  # fun.y was renamed to fun in ggplot2 3.3.0
  stat_summary(fun=mean, colour="darkred", geom="point") +
  scale_fill_gradientn(colours = heat.colors(9)) +
  ylab("Gini impurity") +
  xlab("Feature") +
  guides(fill=guide_colourbar(title="Median\nGini\nimpurity"))
plot(g)
Later, for the second plot:
medians <- lapply(results, median)
color <- colorRampPalette(colors = heat.colors(9))(1000)[
  cut(unlist(medians), 1000, labels = FALSE)]
color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!
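Note that cut() bins the medians into 1000 intervals, so the colours are a close approximation of the boxplot fill rather than the same interpolation. A sketch of an alternative, assuming the scales package (which ggplot2 depends on) and, as far as I know, mirroring what scale_fill_gradientn() does internally. The small random data frame is a made-up stand-in for the real `results`:

```r
library(scales)

set.seed(1)
# Made-up stand-in for the 55-column "results" data.frame
results <- as.data.frame(replicate(5, runif(50)))

# One median per column, as a named numeric vector
medians <- vapply(results, median, numeric(1))

# Same palette as scale_fill_gradientn(colours = heat.colors(9))
pal <- gradient_n_pal(heat.colors(9))

# Rescale the medians to [0, 1] over the limits the boxplot used
# (by default, the range of the mapped medians), then look up the colours
limits <- range(medians)
node_colours <- pal(rescale(medians, to = c(0, 1), from = limits))
```

If the boxplot sets explicit limits on its fill scale, the same `limits` vector should be used here so that both plots agree.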
Edited with sample data:
When I try to plot a grouped boxplot together with jittered points using position=position_jitterdodge(), and add an additional group indicated by e.g. shape, I end up with a graph where the jittered points are misaligned within the individual groups:
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()
Is there a way to suppress that additional separation?
Thank you!
OP, I think I get what you are trying to explain. It seems the points are grouped according to age as well, rather than being treated as one set for each group. The reason is that you have not specified what to group together. To jitter the points, ggplot2 first groups them according to some aesthetic and then applies the jitter within each group. If you don't specify the grouping, ggplot2 makes a guess as to how you want to group the points.
In this case, it is grouping according to age and group, since both are defined to be used in the aesthetics (x=, fill=, and color= are assigned to group and shape= is assigned to age).
To specify that you only want to group the points by the column group, you can use the group= aesthetic. (I'm reposting your data with a seed so you see the same thing.)
set.seed(8675309)
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group, group=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()
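To see why the points split four ways without group=, it helps to know that ggplot2's default group is, roughly, the interaction of all discrete variables mapped in the layer. The check below only imitates that default with base R on the same data; it does not call any ggplot2 internals:

```r
set.seed(8675309)
n <- 16
data <- data.frame(
  age = factor(rep(c('young', 'old'), each = 8)),
  group = rep(LETTERS[1:2], n / 2),
  yval = rnorm(n)
)

# Imitation of the default grouping: every combination of group and age
default_groups <- interaction(data$group, data$age)

# Grouping requested explicitly with aes(group = group)
explicit_groups <- factor(data$group)

nlevels(default_groups)   # 4: two dodge positions per box
nlevels(explicit_groups)  # 2: one cloud of points per box
```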
Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.
I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
library(dplyr)    # %>%, group_by(), summarise()
library(forcats)  # as_factor()
library(ggplot2)
# Mean ASB and mean YOI per Race
g_Antisocial <- Antisocial %>%
  group_by(Race) %>%
  summarise(ASB = mean(ASB),
            YOI = mean(YOI))
Antisocial %>%
  ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
  geom_point(alpha = .4) +
  geom_point(data = g_Antisocial, size = 4) +
  theme_bw() +
  guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:
@Maninder: there are a few things you need to look at.
First of all: the grammar of graphics behind ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason your code is not working is that you mix up the layer calls and do not clearly specify which data and aesthetics belong to the scatter plot and which to the line.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting Antisocial data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter layer, i.e. geom_point().
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I'll skip showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes, I recommend keeping the data= and aes() calls inside each geom_xxxx. This avoids confusion.
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sharing the same position (and being overplotted).
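A minimal sketch of the jittered version. Since the CSV is not available here, the data frame below is a synthetic stand-in with the same column names (Race, YOI, ASB); the values and the jitter amounts are arbitrary assumptions:

```r
library(ggplot2)

set.seed(42)
# Synthetic stand-in for Antisocial.csv: 50 rows per Race/YOI combination
Antisocial <- expand.grid(Race = 0:1, YOI = c(90, 92, 94), rep = 1:50)
Antisocial$ASB <- sample(0:6, nrow(Antisocial), replace = TRUE)

# Small random offsets separate points that share the same position
p_jitter <- ggplot() +
  geom_jitter(data = Antisocial,
              aes(x = YOI, y = ASB, colour = as.factor(Race)),
              width = 0.3, height = 0.2, alpha = 0.5) +
  labs(colour = "Race")
p_jitter
```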
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully understand what you mean by that.
I take it that the mean ASB you calculated over the whole population (your Table_1) is your reference, and that you would like to see how the Race groups compare against this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and linetype. You can play around with the colours, etc. to give you the aesthetics you aim for.
Please note that for the grouped boxplot you also have to treat your integer variable YOI as categorical, so I coerced it into a factor. Boxplots use fill for the body (colour sets only the outline). In this setup you also need to supply a group value to geom_line() (I just set it to 1, which is arbitrary; in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
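If "mean ASB vs YOI grouped by Race" instead means one mean per YOI and Race, a sketch along these lines may be closer. It assumes that reading of the question; the data frame is a synthetic stand-in for the CSV, and `Table_2` is a name made up here:

```r
library(ggplot2)

set.seed(42)
# Synthetic stand-in for Antisocial.csv: 50 rows per Race/YOI combination
Antisocial <- expand.grid(Race = 0:1, YOI = c(90, 92, 94), rep = 1:50)
Antisocial$ASB <- sample(0:6, nrow(Antisocial), replace = TRUE)

# Mean ASB for every combination of YOI and Race
Table_2 <- aggregate(ASB ~ YOI + Race, data = Antisocial, FUN = mean)
names(Table_2)[names(Table_2) == "ASB"] <- "ASB_mean"

# One line (and one set of points) per Race
p_means <- ggplot(Table_2, aes(x = YOI, y = ASB_mean,
                               colour = as.factor(Race), group = Race)) +
  geom_line() +
  geom_point(size = 2) +
  labs(colour = "Race")
p_means
```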
Hope this gets you going!
I want to create a grid of density distribution plots, with a dashed vertical line at the mean, for multiple variables I have in a dataset. Using mtcars dataset as an example, the code for a single variable plot would be:
ggplot(mtcars, aes(x = mpg)) + geom_density() + geom_vline(aes(xintercept =
mean(mpg)), linetype = "dashed", size = 0.6)
I am unclear about how I alter this to make it loop over specified variables in my dataset and produce a grid with the plots of each one. It seems like it would involve some combination of adding facet_grid and the "vars" argument but I have tried a number of combinations with no success.
It seems like in all the examples I can find online, facet_grid splits the plots by subsets of a variable, while keeping the same x and y for each plot, but I want to have the plot of x vary in each graph and the y is the density of values.
In trying to solve this, it is also my understanding that the new release of ggplot includes something involving "quasiquotation" which may help solve my problem (https://www.tidyverse.org/articles/2018/07/ggplot2-tidy-evaluation/) but again, I couldn't quite figure out how to apply the examples provided here to my own issue.
Consider reshaping the data into long format, then plotting with facets. Here both the x and y scales are set free, since the values differ in magnitude across the columns.
rdf <- reshape(mtcars, varying = names(mtcars), v.names = "value",
times = names(mtcars), timevar = "variable",
new.row.names = 1:1000, direction = "long")
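# mean(value) inside aes() would pool all variables, so compute per-variable means first
rdf_means <- aggregate(value ~ variable, data = rdf, FUN = mean)
ggplot(rdf, aes(x = value)) + geom_density() +
  geom_vline(data = rdf_means, aes(xintercept = value), linetype = "dashed", size = 0.6) +
  facet_grid(~variable, scales="free")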
I'm trying to add a legend to a plot that I've created using ggplot. I load the data in from two csv files, each of which has two columns of 8 rows (not including the header).
I construct a data frame from each file which include a cumulative total, so the dataframe has three columns of data (bv, bin_count and bin_cumulative), 8 rows in each column and every value is an integer.
The two data sets are then plotted as follows. The display is fine but I can't figure out how to add a legend to the resulting plot as it seems the ggplot object itself should have a data source but I'm not sure how to build one where there are multiple columns with the same name.
library(ggplot2)
i2d <- data.frame(bv=c(0,1,2,3,4,5,6,7), bin_count=c(0,0,0,2,1,2,2,3), bin_cumulative=cumsum(c(0,0,0,2,1,2,2,3)))
i1d <- data.frame(bv=c(0,1,2,3,4,5,6,7), bin_count=c(0,1,1,2,3,2,0,1), bin_cumulative=cumsum(c(0,1,1,2,3,2,0,1)))
c_data_plot <- ggplot() +
geom_line(data = i1d, aes(x=i1d$bv, y=i1d$bin_cumulative), size=2, color="turquoise") +
geom_point(data = i1d, aes(x=i1d$bv, y=i1d$bin_cumulative), color="royalblue1", size=3) +
geom_line(data = i2d, aes(x=i2d$bv, y=i2d$bin_cumulative), size=2, color="tan1") +
geom_point(data = i2d, aes(x=i2d$bv, y=i2d$bin_cumulative), color="royalblue3", size=3) +
scale_x_continuous(name="Brightness", breaks=seq(0,8,1)) +
scale_y_continuous(name="Count", breaks=seq(0,12,1)) +
ggtitle("Combine plot of BV cumulative counts")
c_data_plot
I'm fairly new to R and would much appreciate any help.
Per comments, I've edited the code to reproduce the dataset after it's loaded into the dataframes.
Regarding producing a single data frames, I'd welcome advice on how to achieve that - I'm still struggling with how data frames work.
First, we organize the data by combining i1d and i2d. I've added a column data which stores the name of the original dataset.
# restructure data
i1d$data <- 'i1d'
i2d$data <- 'i2d'
i12d <- rbind.data.frame(i1d, i2d)
Then, we create the plot, using syntax that is more common to ggplot2:
# create plot
ggplot(i12d, aes(x = bv, y = bin_cumulative))+
geom_line(aes(colour = data), size = 2)+
geom_point(colour = 'royalblue', size = 3)+
scale_x_continuous(name="Brightness", breaks=seq(0,8,1)) +
scale_y_continuous(name="Count", breaks=seq(0,12,1)) +
ggtitle("Combine plot of BV cumulative counts")+
theme_bw()
If we specify x and y within the ggplot function, we do not need to keep rewriting them in the various geoms we add to the plot. After the first three lines I copied and pasted what you had, so that the formatting would match your expectation. I also added theme_bw, because I think it's more visually appealing. We also specify colour in aes using a variable (data) from our data.frame; mapping a variable inside aes is what makes ggplot2 draw a legend.
If we want to take this a step further, we can use the scale_colour_manual function to specify the colors attributed to the different values of the data column in the data.frame i12d:
ggplot(i12d, aes(x = bv, y = bin_cumulative))+
geom_line(aes(colour = data), size = 2)+
geom_point(colour = 'royalblue', size = 3)+
scale_x_continuous(name="Brightness", breaks=seq(0,8,1)) +
scale_y_continuous(name="Count", breaks=seq(0,12,1)) +
ggtitle("Combine plot of BV cumulative counts")+
theme_bw()+
scale_colour_manual(values = c('i1d' = 'turquoise',
'i2d' = 'tan1'))
Folks,
I am plotting histograms using geom_histogram and I would like to label each histogram with the mean value (I am using mean for the sake of this example). The issue is that I am drawing multiple histograms in one facet and I get labels overlapping. This is an example:
library(ggplot2)
df <- data.frame (type=rep(1:2, each=1000), subtype=rep(c("a","b"), each=500), value=rnorm(4000, 0,1))
plt <- ggplot(df, aes(x=value, fill=subtype)) + geom_histogram(position="identity", alpha=0.4)
plt <- plt + facet_grid(. ~ type)
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), data = df, size = 4, hjust=-0.1, vjust=2)
Result is:
The problem is that the labels for Subtypes a and b are overlapping. I would like to solve this.
I have tried the position, both dodge and stack, for example:
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), position="stack", data = df, size = 4, hjust=-0.1, vjust=2)
This did not help. In fact, it issued warning about the width.
Would you pls help ?
Thx,
Riad.
I think you could precalculate mean values before plotting in new data frame.
library(plyr)
df.text<-ddply(df,.(type,subtype),summarise,mean.value=mean(value))
df.text
  type subtype   mean.value
1    1       a -0.003138127
2    1       b  0.023252169
3    2       a  0.030831337
4    2       b -0.059001888
Then use this new data frame in geom_text(). To ensure that values do not overlap you can provide two values in vjust= (as there are two values in each facet).
ggplot(df, aes(x=value, fill=subtype)) +
geom_histogram(position="identity", alpha=0.4)+
facet_grid(. ~ type)+
geom_text(data=df.text,aes(label=paste("mean=",mean.value),
colour=subtype,x=-Inf,y=Inf), size = 4, hjust=-0.1, vjust=c(2,4))
Just to expand on @Didzis:
You actually have two problems here. First, the text overlaps, but more importantly, when you use aggregating functions in aes(...), as in:
geom_text(aes(label = paste("mean=", mean(value)), ...
ggplot does not respect the subsetting implied by the facets (or by the groups, for that matter). So mean(value) is based on the full dataset regardless of faceting or grouping. As a result, you have to use an auxiliary table, as @Didzis shows.
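A quick way to check this is to build the plot and inspect the computed layer data. The sketch below uses a stripped-down version of the plot (the y = 0 position is an arbitrary choice for illustration) and ggplot2's layer_data() helper:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(type = rep(1:2, each = 1000),
                 subtype = rep(c("a", "b"), each = 500),
                 value = rnorm(4000, 0, 1))

# The label is an aggregate computed inside aes()
p <- ggplot(df, aes(x = value)) +
  facet_grid(. ~ type) +
  geom_text(aes(label = mean(value), y = 0))

# Every row, in every facet, carries the same label: the overall mean
label_values <- unique(layer_data(p)$label)
label_values
mean(df$value)
```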
BTW:
df.text <- aggregate(df$value,by=list(type=df$type,subtype=df$subtype),mean)
gets you the means and does not require plyr.
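One caveat: with the by= form, the column of means is named x rather than mean.value, so the geom_text() call would have to use label=paste("mean=", x). A sketch of the formula interface with a rename, which keeps the earlier plotting code unchanged (the seed is added here only for reproducibility):

```r
set.seed(1)
df <- data.frame(type = rep(1:2, each = 1000),
                 subtype = rep(c("a", "b"), each = 500),
                 value = rnorm(4000, 0, 1))

# Mean of value for each type/subtype combination, base R only
df.text <- aggregate(value ~ type + subtype, data = df, FUN = mean)

# Rename the result column so it matches the geom_text() call
names(df.text)[names(df.text) == "value"] <- "mean.value"
df.text
```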