ggplot violin plot, specify different colours by group? - r

I have a matrix of 9 columns and I want to create a violin plot using ggplot2. I would like to have different colours for groups of three columns, basically increasing order of "grayness". How can I do this?
I have tried passing lists of colours to the "fill=" option, but it does not work. See my example below. At the moment it uses "gray80" for every violin, but I want to be able to specify the colour of each violin so that I can set one colour per group of 3.
library(ggplot2)
dat <- matrix(rnorm(100*9),ncol=9)
# Violin plots for columns
mat <- reshape2::melt(data.frame(dat), id.vars = NULL)
pp <- ggplot(mat, aes(x = variable, y = value)) + geom_violin(scale="width",adjust = 1,width = 0.5,fill = "gray80")
pp

We can add a new column, called variable_grouping to your data, and then specify fill in aes:
mat <- reshape2::melt(data.frame(dat), id.vars = NULL)
mat$variable_grouping <- ifelse(mat$variable %in% c('X1', 'X2', 'X3'), 'g1',
ifelse(mat$variable %in% c('X4','X5','X6'),
'g2', 'g3'))
ggplot(mat, aes(x = variable, y = value, fill = variable_grouping)) +
geom_violin(scale="width",adjust = 1,width = 0.5)
You can control the groupings using the ifelse statement. scale_fill_manual can be used to specify the different colors used to fill the violins.
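As a minimal sketch of the scale_fill_manual step for the "increasing grayness" the question asks for: the specific gray shades and the use of base R stack() instead of reshape2::melt are assumptions made for illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(1)
dat <- matrix(rnorm(100 * 9), ncol = 9)

# Long format with base R: stack() returns columns `values` and `ind`
mat <- stack(data.frame(dat))
names(mat) <- c("value", "variable")

# Columns 1-3 -> g1, 4-6 -> g2, 7-9 -> g3
mat$variable_grouping <- paste0("g", ceiling(as.integer(mat$variable) / 3))

# One gray per group, from light to dark (shades chosen arbitrarily)
grays <- c(g1 = "gray90", g2 = "gray60", g3 = "gray30")

pp <- ggplot(mat, aes(x = variable, y = value, fill = variable_grouping)) +
  geom_violin(scale = "width", adjust = 1, width = 0.5) +
  scale_fill_manual(values = grays)
pp
```

Naming the colour vector ties each shade to a group label, so the mapping does not depend on the order in which the groups are sorted.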

Related

How to plot an Observed and Simulated timeseries as lines and points in ggplot?

I have timeseries of 4 simulated variables, with its 4 observed variables (observed variables have less data than simulated variables) as attached in the following link:
https://www.dropbox.com/s/sumgi6pqmjx70dl/nutrients2.csv?dl=0
I used the following code. The data is stored in the "data2" object.
data2 <- read.table("C:/Users/Downloads/nutrients2.csv", header=T, sep=",")
library(ggplot2)
library(lubridate)
data2$Date <- dmy(data2$Date)
library(reshape2)
data2 <- melt(data2, id=c("Date","Type"))
seg2 <- ggplot(data = data2, aes(x = Date, y = value, group = Type, colour = Type)) +
geom_line() +
facet_wrap(~ variable, scales = "free")
seg2
This gives the following plot, with all variables drawn as lines:
[Figure: plot obtained]
I need the observed data drawn as points instead of interrupted lines, as in this example:
[Figure: plot desired]
How can I get a plot like this in ggplot, with simulated variables as lines and observed variables as points?
One possible solution is to subset your dataset for geom_line and geom_point in order to use only sim and obs data respectively.
Then, if you pass shape = Type in your aes, you can remove dots for sim data in your legend by using scale_shape_manual:
(NB: I used the melt function from the data.table package because I found it more efficient for big datasets than the melt function from reshape2.)
library(lubridate)
df$Date <- dmy(df$Date)
library(data.table)
dt.m <- melt(setDT(df),measure = list(c("Nitrate","Ammonium","DIP","Chla")), value.name = "Values", variable.name = "Element")
library(ggplot2)
ggplot(dt.m, aes(x = Date, y = Values, group = Type, color = Type, shape = Type))+
geom_line(data = subset(dt.m, Type == "sim"))+
geom_point(data = subset(dt.m, Type == "obs"))+
scale_shape_manual(values = c(16,NA))+
facet_wrap(~Element, scales = "free")
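Since the linked CSV may not be available, here is a small self-contained sketch of the same idea on made-up data. The data frame toy and its values are hypothetical; only the column names Date, Type and Values follow the answer.

```r
library(ggplot2)

set.seed(42)
dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 60)

# Hypothetical simulated series: one value per day
sim <- data.frame(Date = dates, Type = "sim", Values = cumsum(rnorm(60)))

# Hypothetical observations: every 10th day, with some noise added
obs <- sim[seq(1, 60, by = 10), ]
obs$Type <- "obs"
obs$Values <- obs$Values + rnorm(nrow(obs), sd = 0.5)

toy <- rbind(sim, obs)

# Each layer receives only the rows it should draw
p <- ggplot(toy, aes(x = Date, y = Values, colour = Type)) +
  geom_line(data = subset(toy, Type == "sim")) +
  geom_point(data = subset(toy, Type == "obs"))
p
```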

ggplot not respecting order of colours in scale_fill_manual()?

I am creating some violin plots and want to colour them. It works for a matrix of dimension 9, as in my previous question
ggplot violin plot, specify different colours by group?
but when I increase the dimension to 15, the order of the colours is not respected. Why is this happening?
Here it works (9 columns):
library(ggplot2)
dat <- matrix(rnorm(250*9),ncol=9)
# Violin plots for columns
mat <- reshape2::melt(data.frame(dat), id.vars = NULL)
mat$variable_grouping <- as.character(sort(rep(1:9,250)))
pp <- ggplot(mat, aes(x = variable, y = value, fill = variable_grouping)) +
geom_violin(scale="width",adjust = 1,width = 0.5) + scale_fill_manual(values=rep(c("red","green","blue"),3))
pp
Here it does not work (15 columns):
library(ggplot2)
dat <- matrix(rnorm(250*15),ncol=15)
# Violin plots for columns
mat <- reshape2::melt(data.frame(dat), id.vars = NULL)
mat$variable_grouping <- as.character(sort(rep(1:15,250)))
pp <- ggplot(mat, aes(x = variable, y = value, fill = variable_grouping)) +
geom_violin(scale="width",adjust = 1,width = 0.5) + scale_fill_manual(values=rep(c("red","green","blue"),5))
pp
This is related to factor levels. Since variable_grouping is a character vector, ggplot2 converts it to a factor for plotting. It uses the default factor order, which sorts the strings alphabetically, so "10" comes before "2". In your example 10-15 therefore all come before 2 in the legend, and the colours are assigned in that order.
You can manually set the factor order to avoid the default order. I use forcats::fct_inorder() for this because it's convenient in this case where you want the order of the factor to match the order of the variable. Note you can also use factor() directly and set the level order via the levels argument.
ggplot(mat, aes(x = variable, y = value, fill = forcats::fct_inorder(variable_grouping))) +
geom_violin(scale="width", adjust = 1, width = 0.5) +
scale_fill_manual(values=rep(c("red","green","blue"),5))
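The sort order behind this can be checked directly in base R. The sketch below also shows the factor() alternative with explicit levels; the variable names grp, default_levels and ordered_levels are made up for the illustration.

```r
grp <- as.character(1:15)

# Default: levels are the sorted strings, so "10" to "15" land before "2"
default_levels <- levels(factor(grp))
print(default_levels)

# Explicit levels in order of first appearance, without forcats
ordered_levels <- levels(factor(grp, levels = unique(grp)))
print(ordered_levels)
```

Applied to the example data, that would be mat$variable_grouping <- factor(mat$variable_grouping, levels = unique(mat$variable_grouping)) before plotting.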
You can also name the color vector. The names must match the values of the fill variable, which are "1" to "15" here. For example:
my_values <- rep(c("red","green","blue"),5)
names(my_values) <- as.character(1:15)
... +
scale_fill_manual(values=my_values)

Draw heatmap tiles for all combination of x-y

I asked a question about the heatmap which was solved here: custom colored heatmap of categorical variables. I defined my scale_fill_manual for all combinations as suggested in the accepted answer.
Based on this question, I would like to know how to tell ggplot2 to plot a heatmap with all combinations of the variables, not just the ones present in the data frame (they are already defined in scale_fill_manual but do not show in the final plot).
How can I do this?
The current plotting code:
df <- data.frame(X = LETTERS[1:3],
Likelihood = c("Almost Certain","Likely","Possible"),
Impact = c("Catastrophic", "Major","Moderate"),
stringsAsFactors = FALSE)
df$color <- paste0(df$Likelihood,"-",df$Impact)
ggplot(df, aes(Impact, Likelihood)) + geom_tile(aes(fill = color),colour = "white") + geom_text(aes(label=X)) +
scale_fill_manual(values = c("Almost Certain-Catastrophic" = "red","Likely-Major" = "yellow","Possible-Moderate" = "blue"))
scale_fill_manual contains all combinations of Impact and Likelihood with their respective colors.
Similar to @aosmith, I tried expand.grid to get a finite set of combinations, but tidyr::complete() works nicely as well. Add the colors and letters, and fill using a set color range.
df <- data.frame(Likelihood = c("Almost Certain","Likely","Possible"),
Impact = c("Catastrophic", "Major","Moderate"),
stringsAsFactors = FALSE)
df2 <- tidyr::complete(df, Likelihood, Impact) # no pipe needed; alternative: expand.grid(df)
df2$X <- LETTERS[1:9] # Add letters here
df2$color <- paste0(df2$Likelihood,"-",df2$Impact) # Add colors
ggplot(df2, aes(Impact, Likelihood)) + geom_tile(aes(fill = color),colour = "white") + geom_text(aes(label=X)) +
scale_fill_manual(values = RColorBrewer::brewer.pal(9,"Pastel1"))
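For comparison, here is a sketch of the expand.grid route mentioned above. It is a variation, not the answer's code: it keeps the letter labels only on the cells that exist in the original data, whereas the answer labels all nine. The names grid, obs, full and pal, and the choice of hcl.colors for the palette, are assumptions for illustration.

```r
library(ggplot2)

likelihood <- c("Almost Certain", "Likely", "Possible")
impact <- c("Catastrophic", "Major", "Moderate")

# All 9 Likelihood x Impact combinations
grid <- expand.grid(Likelihood = likelihood, Impact = impact,
                    stringsAsFactors = FALSE)
grid$color <- paste0(grid$Likelihood, "-", grid$Impact)

# The 3 observed cells with their labels
obs <- data.frame(X = LETTERS[1:3], Likelihood = likelihood, Impact = impact,
                  stringsAsFactors = FALSE)

# Left join: unobserved combinations get X = NA
full <- merge(grid, obs, by = c("Likelihood", "Impact"), all.x = TRUE)

# One colour per combination, named so the mapping is explicit
pal <- setNames(grDevices::hcl.colors(9, "Pastel 1"), full$color)

p <- ggplot(full, aes(Impact, Likelihood)) +
  geom_tile(aes(fill = color), colour = "white") +
  geom_text(aes(label = X), na.rm = TRUE) +  # skip tiles without a label
  scale_fill_manual(values = pal)
p
```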

Coloring a geom_histogram by gradient

I'm trying to plot a geom_histogram where the bars are colored by a gradient.
This is what I'm trying to do:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
ggplot(df,aes_string(x="val",y="..count..+1",fill="val"))+geom_histogram(binwidth=1,pad=TRUE)+scale_y_log10()+scale_fill_gradient2("val",low="darkblue",high="darkred")
But getting:
Any idea how to get it colored by the defined gradient?
Not sure you can fill by val because each bar of the histogram represents a collection of points.
You can, however, fill by categorical bins using cut. For example:
ggplot(df, aes(val, fill = cut(val, 100))) +
geom_histogram(show.legend = FALSE)
Just for completeness.
If you want to choose the colours of the gradient manually, here is what I suggest:
data:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
colors:
bins <- 10
cols <- c("darkblue","darkred")
colGradient <- colorRampPalette(cols)
cut.cols <- colGradient(bins)
cuts <- cut(df$val,bins)
names(cuts) <- sapply(cuts,function(t) cut.cols[which(as.character(t) == levels(cuts))])
plot:
ggplot(df,aes(val,fill=cut(val,bins))) +
geom_histogram(show.legend=FALSE) +
scale_color_manual(values=cut.cols,labels=levels(cuts)) +
scale_fill_manual(values=cut.cols,labels=levels(cuts))
Instead of binning manually another option would be to make use of the bins computed by stat_bin by mapping ..x.. (or factor(..x..) in case of a discrete scale) or after_stat(x) on the fill aesthetic.
An issue with computing the bins manually is that we end up with multiple groups per bin for which the count has to be computed (even if the count is zero most of the time) and which get stacked on top of each other in the histogram. Especially, this gets problematic if one would add labels of counts to the histogram as can be seen in this post, because in that case one ends up with multiple labels per bin.
library(ggplot2)
set.seed(1)
df <- data.frame(id = paste("ID", 1:1000, sep = "."), val = rnorm(1000), stringsAsFactors = F)
ggplot(df, aes(x = val, y = ..count.. + 1, fill = ..x..)) +
geom_histogram(binwidth = .1, pad = TRUE) +
scale_y_log10() +
scale_fill_gradient2(name = "val", low = "darkblue", high = "darkred")
#> Warning: Duplicated aesthetics after name standardisation: pad
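The dot-dot notation (..count.., ..x..) was deprecated in ggplot2 3.4.0. A sketch of the same plot written with after_stat() follows; it assumes a ggplot2 version that provides after_stat(), and it drops the pad argument that triggered the warning above.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(id = paste("ID", 1:1000, sep = "."), val = rnorm(1000),
                 stringsAsFactors = FALSE)

# fill is mapped to the bin centre computed by stat_bin,
# so each bar gets exactly one colour
p <- ggplot(df, aes(x = val, y = after_stat(count) + 1, fill = after_stat(x))) +
  geom_histogram(binwidth = 0.1) +
  scale_y_log10() +
  scale_fill_gradient2(name = "val", low = "darkblue", high = "darkred")
p
```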

Correcting the legend when plotting data from two data frames, some of the lines have symbols

The following code produces three plots. The first plot uses data from df_fault and draws lines with symbols, and it is fine. The second plot draws lines from df_maint, and that plot is fine also. The problem is with the third plot, which combines the lines with symbols from df_fault with the lines from df_maint. The legend is incorrect, and there are two legends, one for lines and one for symbols. How can I get one correct legend with four entries?
Create some sample data
library(zoo)
library(ggplot2)
rDates <- function(N, st="2012/01/01", et="2012/12/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
first_maint <- as.POSIXct(strptime("2014/01/01", "%Y/%m/%d"))
last_maint <- as.POSIXct(strptime("2014/12/31", "%Y/%m/%d"))
first_fault <- as.POSIXct(strptime("2014/05/01", "%Y/%m/%d"))
last_fault <- as.POSIXct(strptime("2014/07/31", "%Y/%m/%d"))
set.seed(31)
nMDates=40
nFDates=10
rMaintDates <- rDates(nMDates,first_maint,last_maint)
rFaultDates <- rDates(nFDates,first_fault,last_fault)
df_fault <- data.frame(date = rFaultDates,
type = "Non-Op",
ci = runif(nFDates,.7,1.8),stringsAsFactors=FALSE)
df_fault$type[sample(1:nFDates,3)] = "Advisory"
z_hr <- zoo(c(0,0,9.9,9.9),c(first_maint,first_fault,last_fault,last_maint))
z_maint <- zoo(,rMaintDates[c(-1,-nMDates)])
z_hr_maint_a <- merge(z_hr,z_maint)
z_hr_maint <- na.approx(z_hr_maint_a)
z_repair <- zoo(c(0,3000,5000,8000),c(first_maint,first_fault,last_fault,last_maint))
z_repair_maint_a <- merge(z_repair,z_maint)
z_repair_maint <- na.approx(z_repair_maint_a)
df_maint <- data.frame(date=index(z_hr_maint),
hrs=coredata(z_hr_maint)/9.8,
repairs=coredata(z_repair_maint)/8000)
Plot the sample data, these examples work
rpr_title = "repairs/8000"
flt_title = "hrs/9.8"
(gp2 <- ggplot(data=df_fault,aes(x=date, y=ci, color=type)) +
labs(x="Date (2014)", y="CI Amplitude",title="Sample, this plot is fine, df_fault") +
geom_line(aes(group=type,shape=type))+
geom_point(aes(group=type,shape=type),size=4)+
theme(plot.title=element_text( size=12),
axis.title=element_text( size=8)) )
(gp2a <- ggplot() + geom_line(data=df_maint,aes(x=date,y=repairs,color=rpr_title))+
geom_line(data=df_maint,aes(x=date,y=hrs,color=flt_title))+
labs(x="Date (2014)", y="CI Amplitude",title="Sample, this plot is fine, df_maint ")
)
This plot shows the fault data
This plot shows the maintenance and usage data
I would like to combine the above two plots into one plot, with four legend entries. Here is my current attempt, but the legend isn't correct
(gp2b <- gp2 + geom_line(data=df_maint,aes(x=date,y=repairs,color=rpr_title))+
geom_line(data=df_maint,aes(x=date,y=hrs,color=flt_title))+
labs(x="Date (2014)", y="CI Amplitude",title="Sample, this plot the legend is wrong")
)
This plot, there are two legends, and neither one is correct. The first "type" legend has the wrong symbols on the line, showing a circle symbol for all the lines. The second "type" legend shows two black symbols, so the colors are incorrect. I would like the 2nd legend removed, and the 1st legend correctly showing lines and colors. Also, it would be nice if the lines without symbols could be wider. The legend line/symbol for "Advisory" is correct. The legend entry for "Non-op" should have a triangle instead of a circle. The legend entries for "hrs/9.8" and "repairs/8000" should only have a line, no symbol.
Brandon's suggestion of using melt helps, but the plot below still doesn't have the legend correct...
names(df_fault)[2:3] <- c("variable","value") # for rbind
dat <- melt(df_maint, c("date")) # melted
dat <- rbind(dat, df_fault)
p1 <- ggplot(dat, aes(date,value, group = variable, color = variable)) + geom_line()
p1 + geom_point(data =
dat[dat$variable %in% c("Advisory","Non-Op"),],
aes(date,value, group = variable, color = variable, shape=variable)) +
scale_colour_discrete(name ="Fleet",
breaks=c("hrs", "repairs","Advisory","Non-Op"),
labels=c("usage hrs", "maint repairs","Advisory Faults","Non-Op Faults")) +
scale_shape_discrete(name ="Fleet",
breaks=c("hrs", "repairs","Advisory","Non-Op"),
labels=c("usage hrs", "maint repairs","Advisory Faults","Non-Op Faults"),
guide = "none")
Postscript: I want to mention that it took some effort to apply the above procedure to an actual data set. Here's a summary of the process.
1) Identify the x axis variables, and grouping variables.
2) In the two data frames, rename the x axis variable and group variables to the same names
3) Use melt twice (the example only used it once) to generate a melted data frame. Use the x axis and group variables as id.vars. Specify the variable that you want to plot as measure.vars.
3b) Do head on the melted data frames. You need to see the X axis variable names and the grouping variable names, followed by the field variable and values. The field variable has text values corresponding to the different y axis names.
4) Use rbind to combine the two melted dataframes
5) Do head on both steps 3 and 4 so you understand the storage of the data
6) Plot the lines for all the data. Include the modification of the legend title in this step, using + guides(color=guide_legend(title="Fleet")). I don't see this command in the example.
7) Create a subset from the melted data frame of the data that will have symbols. Add the symbols, but don't add the 2nd legend from symbols +scale_shape_discrete(name ="Fleet", guide = "none") in the example.
8) Adjust the legend line symbols using + guides(colour = guide_legend(override.aes = list(shape = c(32,32,16,17))))
9) Once you can see a nominal plot of lines with some symbols and the correct legend, you may need to repeat the above process after sorting the combined melted data frame in order to get the correct lines / symbols in front. You may want to sort on variable, and the x axis fields.
By adding guides, and specifying the shape as no shape (32), and matching the other symbols (16, 17), the plot comes out correct
p1 <- ggplot(dat, aes(date,value, group = variable, color = variable)) + geom_line(size=1)
p1 + geom_point(data =
dat[dat$variable %in% c("Advisory","Non-Op"),],
aes(date,value, group = variable, color = variable, shape=variable),size=3) +
scale_colour_discrete(name ="Fleet",
breaks=c("hrs", "repairs","Advisory","Non-Op"),
labels=c("usage hrs", "maint repairs","Advisory Faults","Non-Op Faults")) +
scale_shape_discrete(name ="Fleet",
guide = "none") +
guides(colour = guide_legend(override.aes = list(shape = c(32,32,16,17))))
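The shape vector in override.aes is applied in legend order, so it helps to make that order explicit. Below is a self-contained sketch on made-up data that fixes the order through factor levels instead of breaks; the data frames lines_df and faults_df and their values are hypothetical, and NA is used for "no symbol" in place of shape 32.

```r
library(ggplot2)

set.seed(31)
# Hypothetical line-only series
lines_df <- data.frame(
  date = rep(1:20, 2),
  variable = rep(c("hrs", "repairs"), each = 20),
  value = c(seq(0, 1, length.out = 20), seq(0, 0.6, length.out = 20))
)
# Hypothetical series drawn as lines with symbols
faults_df <- data.frame(
  date = sample(5:15, 8),
  variable = rep(c("Advisory", "Non-Op"), each = 4),
  value = runif(8, 0.7, 1.8)
)
dat <- rbind(lines_df, faults_df)

# Fix the legend order so the override.aes positions are unambiguous
dat$variable <- factor(dat$variable,
                       levels = c("hrs", "repairs", "Advisory", "Non-Op"))

p <- ggplot(dat, aes(date, value, colour = variable, group = variable)) +
  geom_line() +
  geom_point(data = subset(dat, variable %in% c("Advisory", "Non-Op")),
             aes(shape = variable), size = 3) +
  scale_shape_manual(values = c("Advisory" = 16, "Non-Op" = 17),
                     guide = "none") +
  guides(colour = guide_legend(
    title = "Fleet",
    override.aes = list(shape = c(NA, NA, 16, 17))  # no symbol for line-only keys
  ))
p
```

Without the explicit levels, the legend would follow the default sorted order of the variable names, and the shape vector would have to be rearranged to match.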
When in doubt, melt. See the example below:
library(reshape2)
library(ggplot2)
names(df_fault)[2:3] <- c("variable","value") # for rbind
dat <- melt(df_maint, c("date")) # melted
dat <- rbind(dat, df_fault)
p1 <- ggplot(dat, aes(date,value, group = variable, color = variable)) + geom_line()
p1 + geom_point(data =
dat[dat$variable %in% c("Advisory","Non-Op"),],
aes(date,value, group = variable, color = variable, shape=variable)) +
scale_shape(guide = "none")
Notice that I specified "data" in my geom_point() call. Each scale_ has a method for removing the guide by setting it to "none".
