"Dotplot" visualisation with factors - r

I am not sure how to approach this. I want to create a "dotplot" style plot in R from a data frame of categorical variables (factors) such that for each column of the df I plot a column of dots, each coloured according to the factors. For example,
my_df <- cbind(c('sheep','sheep','cow','cow','horse'),c('sheep','sheep','sheep','sheep',NA),c('sheep','cow','cow','cow','cow'))
I then want to end up with a 3 x 5 grid of dots, each coloured according to sheep/cow/horse (well, one missing because of the NA).

Do you mean something like this:
my_df <- cbind(c('sheep','sheep','cow','cow','horse'),
c('sheep','sheep','sheep','sheep',NA),
c('sheep','cow','cow','cow','cow'))
df <- data.frame(my_df) # make it as data.frame
df$id <- row.names(df) # add an id
library(reshape2)
melt_df <-melt(df,'id') # melt it
library(ggplot2) # now the plot
p <- ggplot(melt_df, aes(x = variable, fill = value))
p + geom_dotplot(stackgroups = TRUE, binwidth = 0.3, binpositions = "all")
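If a strict 3 x 5 grid is what you are after, a rough alternative is to plot each cell as a point at its (column, row) position. This is only a sketch: the names grid_df, row, column and animal are made up for illustration, and the missing cell is simply dropped so it leaves a gap.

```r
library(ggplot2)

# Same example data as above, kept as a character matrix
my_df <- cbind(c('sheep','sheep','cow','cow','horse'),
               c('sheep','sheep','sheep','sheep',NA),
               c('sheep','cow','cow','cow','cow'))

# Long format: one row per cell, with its row and column position
# (row(), col() and as.vector() all walk the matrix column by column)
grid_df <- data.frame(
  row    = as.vector(row(my_df)),
  column = factor(as.vector(col(my_df))),
  animal = as.vector(my_df)
)

# Drop the NA cell so it shows up as an empty spot in the grid
grid_df <- grid_df[!is.na(grid_df$animal), ]

p <- ggplot(grid_df, aes(x = column, y = row, colour = animal)) +
  geom_point(size = 5) +
  scale_y_reverse()   # put row 1 at the top, as in the data
print(p)
```

Unlike geom_dotplot, this keeps every dot at the row it came from, so the gap appears exactly where the NA was.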

Related

add one legend with all variables for combined graphs

I'm trying to plot two graphs side-by-side with one common legend that incorporates all the variables between both graphs (some vars are different between the graphs).
Here's a mock example of what I've been attempting:
#make relative abundance values for n rows
makeData <- function(n){
x <- runif(n, 0, 1)
x / sum(x) # return values that sum to 1
}
#make random matrices filled with relative abundance values
makeDF <- function(col, rw){
df <- matrix(ncol=col, nrow=rw)
for(i in 1:ncol(df)){
df[,i] <- makeData(nrow(df))
}
return(df)
}
#create df1 and assign col names
df1 <- makeDF(4, 5)
colSums(df1) #verify relative abundance values = 1
df1 <- as.data.frame(df1)
colnames(df1) <- c("taxa","s1", "s2", "s3")
df1$taxa <- c("ASV1", "ASV2", "ASV3", "ASV4", "ASV5")
#repeat for df2
df2 <- makeDF(4,5)
df2 <- as.data.frame(df2)
colnames(df2) <- c("taxa","s1", "s2", "s3")
df2$taxa <- c("ASV1", "ASV5", "ASV6", "ASV7", "ASV8")
# convert wide data format to long format -- for plotting
library(reshape2)
makeLong <- function(df){
df.long <- melt(df, id.vars="taxa",
measure.vars=grep("s\\d+", names(df), val=T),
variable.name="sample",
value.name="value")
return(df.long)
}
df1 <- makeLong(df1)
df2 <- makeLong(df2)
#generate distinct colours for each asv
taxas <- union(df1$taxa, df2$taxa)
library("RColorBrewer")
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
colpals <- qual_col_pals[c("Set1", "Dark2", "Set3"),] #select colour palettes
col_vector = unlist(mapply(brewer.pal, colpals$maxcolors, rownames(colpals)))
taxa.col=sample(col_vector, length(taxas))
names(taxa.col) <- taxas
# plot using ggplot
library(ggplot2)
plotdf2 <- ggplot(df2, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col)
plotdf1 <- ggplot(df1, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col)
#combine plots to one figure and merge legend
library(ggpubr)
ggpubr::ggarrange(plotdf1, plotdf2, ncol=2, nrow=1, common.legend = T, legend="bottom")
(if you have suggestions on how to generate better mock data, by all means!)
When I run my code, I am able to get the two graphs in one figure, but the legend does not incorporate all variables from both plots:
I ideally would like to avoid having repeat variables in the legend, such as:
From what I've searched online, the legend only works when the variables are the same between graphs, but in my case I have similar and different variables.
Thanks for any help!
Maybe this is what you are looking for:
Convert your taxa variables to factor with the levels equal to your taxas variable, i.e. to include all levels from both datasets.
Add argument drop=FALSE to both scale_fill_manual to prevent dropping of unused factor levels.
Note: I only added the relevant parts of the code and set the seed to 42 at the beginning of the script.
set.seed(42)
df1$taxa <- factor(df1$taxa, taxas)
df2$taxa <- factor(df2$taxa, taxas)
# plot using ggplot
library(ggplot2)
plotdf2 <- ggplot(df2, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity") +
scale_fill_manual("ASV", values = taxa.col, drop = FALSE)
plotdf1 <- ggplot(df1, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col, drop = FALSE)
#combine plots to one figure and merge legend
library(ggpubr)
ggpubr::ggarrange(plotdf1, plotdf2, ncol=2, nrow=1, common.legend = T, legend="bottom")

How to set x and y limits to same values?

This is some data I made. I have two data frames with two variables each.
var1 <- (1:10)*(rnorm(10,2,0.1))
var2 <- (6:15)*(rnorm(10,1,0.1))
df1 <- as.data.frame(cbind(var1,var2))
var3 <- (1:10)*(rnorm(10,3,0.1))
var4 <- (6:15)*(rnorm(10,1.5,0.1))
df2 <- as.data.frame(cbind(var3,var4))
There is a loop for plotting the first variable of df1 and df2, and the second of df1 and df2 too.
plot_list = list()
for(i in 1:ncol(df1)){
p=ggplot(df1,
aes_string(x=df1[,i],
y=df2[,i]))+
geom_point()
plot_list[[i]] = p
}
library(gridExtra)
do.call("grid.arrange", c(plot_list[c(1:2)], ncol=1))
And this is the plot I got.
So far so good. But I would like x and y within each plot to have the same limits, based on the min and max. For example, in the upper plot both x and y should go from ~5 to ~30. In the lower plot both x and y should go from ~6 to ~24. I could set the limits manually, but I need to do this for many plots.
Is there any way to set the x and y limits for each plot based on min and max observed in any of the axis?
Thanks for the help.
In general, I’d suggest that the data for each plot should be in its own data.frame. Having a single data.frame and using facets is an option, but facets make it difficult to specify different limits for each plot. I’ve therefore gone with a grid.arrange solution similar to yours.
library(ggplot2)
library(purrr)
var1 <- (1:10)*(rnorm(10,2,0.1))
var2 <- (6:15)*(rnorm(10,1,0.1))
var3 <- (1:10)*(rnorm(10,3,0.1))
var4 <- (6:15)*(rnorm(10,1.5,0.1))
df1 <- data.frame(x = var1, y = var3)
df2 <- data.frame(x = var2, y = var4)
plots <- map(
list(df1, df2),
function(data) {
ggplot(data, aes(x, y)) +
geom_point() +
coord_fixed(xlim = range(c(data$x, data$y)), ylim = range(c(data$x, data$y)))
})
gridExtra::grid.arrange(grobs = plots, nrow = 2)

ggplot boxplot: How to order x axis according to a third variable?

I have a simple dataframe containing three columns:
ST_CODE | VALUE | HEIGHT
... ... ...
factor continuous continuous
I want a VALUE boxplot for each ST_CODE, but I want the order on the x axis to be determined by the ascending order of HEIGHT.
This is the code:
ggplot(ozone, aes(x = ST_CODE, y = VALUE)) +
geom_boxplot(notch=TRUE)
Ordering ozone inside the ggplot function by doing ozone[order(ozone$HEIGHT),] was useless, because the order is determined by ST_CODE. What should I do?
Here's the dataset: https://www.dropbox.com/s/kf0jcv50oaa5my9/ozone_example.csv?dl=0
I have found this question, but I didn't really get it: Rearrange x axis according to a variable in ggplot
The solution should be to order the levels of the factor variable ST_CODE according to the HEIGHT column.
Until you provide example data this is my best guess :-)
Edit 1: I have added read.csv to read your data and I would say it works. To make it easier to check the result I have used only the first 1000 rows (which contain only three different ST_CODEs).
library(ggplot2)
# example data
# data <- data.frame( ST_CODE = rep(c("A", "B", "C"), 2), VALUE = rep(3:1, 2), HEIGHT = rep(c(2, 1, 3), 2))
# data
# Your data
data <- read.csv("ozone_example.csv")
data <- data[1:1000,]
table(data$ST_CODE, data$HEIGHT) # indicates how to order ST_CODEs
# plot (not sorted by HEIGHT)
ggplot(data, aes(x = ST_CODE, y = VALUE)) +
geom_boxplot(notch=TRUE)
# Plot sorted by HEIGHT by changing the factor level order
ordered.data <- data[order(data$HEIGHT),]
data$ST_CODE <- factor(data$ST_CODE, levels = unique(ordered.data$ST_CODE))
ggplot(data, aes(x = ST_CODE, y = VALUE)) +
geom_boxplot(notch=TRUE)
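A shorter route that may also work is base R's reorder(), which sorts factor levels by a summary of another variable. The sketch below uses a small made-up data frame (ozone_demo) rather than the real file, and assumes HEIGHT is constant within each ST_CODE, so taking the mean just picks that value.

```r
library(ggplot2)

# Hypothetical stand-in for the ozone data: three stations at different heights
ozone_demo <- data.frame(
  ST_CODE = rep(c("A", "B", "C"), each = 4),
  VALUE   = c(3, 4, 5, 6, 1, 2, 3, 4, 7, 8, 9, 10),
  HEIGHT  = rep(c(200, 50, 900), each = 4)
)

# reorder() sorts the levels by the mean HEIGHT of each station
ozone_demo$ST_CODE <- reorder(ozone_demo$ST_CODE, ozone_demo$HEIGHT, FUN = mean)
levels(ozone_demo$ST_CODE)   # "B" "A" "C", i.e. ascending HEIGHT

p <- ggplot(ozone_demo, aes(x = ST_CODE, y = VALUE)) +
  geom_boxplot()
print(p)
```

The same call can go directly inside aes(), e.g. aes(x = reorder(ST_CODE, HEIGHT), y = VALUE), if you would rather not modify the data frame.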

Multiplot of multiplots in ggplot2

I recently discovered the multiplot function from the Rmisc package to produce stacked plots using ggplot2 plots/objects. What I am trying to do now is to create a multiplot of multiplots. Unfortunately, unlike the ggplot function, multiplot does not produce objects, so my issue cannot be resolved by simply nesting multiplot.
I will create a dataframe to make my point clear. In my dataframe named df, I have 3 columns: period, group and value. A certain value is recorded for each of 3 groups over 10 periods. (Note: I don't use a seed number below despite the use of the sample function because the focus is not numerical, it is graphical)
# Create a data frame for illustration purposes
df <- data.frame(period = rep(1:10, 3),
group = rep(LETTERS[1:3], each = 10),
value = sample(100, 30, replace = TRUE))
I then add a fourth column to df, which is the exponential transformation of the value column.
df$exp.value = exp(df$value)
I would like to create stacked plots allowing me to compare the values in each group to their exponential counterparts.
# Split dataframe by group
df_split <- split(df, df$group)
# Plots of values in each group
plots <- lapply(df_split, function(i){
ggplot(data = i, aes(x = period, y = value)) + geom_line()
})
# Plots of exponential values in each group
plots_exp <- lapply(df_split, function(i){
ggplot(data = i, aes(x = period, y = exp.value)) + geom_line()
})
plots and plots_exp are both lists of 3 elements each containing ggplot objects. The first element of each list corresponds to group A, the second element corresponds to group B and the third element corresponds to group C.
In order to compare each group's values to the exponential values, I can use the multiplot function. Following is an example with group A:
multiplot(plots[[1]], plots_exp[[1]], cols = 1)
How can I create a grid which will include the multiplot above as well as the ones for groups B and C? As if the code included ... + facet_grid(. ~ group)?
We can use cowplot package:
library(cowplot)
plot_grid(plots[[1]], plots_exp[[1]],
plots[[2]], plots_exp[[2]],
plots[[3]], plots_exp[[3]],
labels = c("A", "A", "B", "B", "C", "C"),
ncol = 1, align = "v")
We can output to a pdf looping through plots and plots_exp list objects. Every page will contain 2 plots. This is a better option when we have a lot of groups:
pdf("myPlots.pdf")
for (i in seq_along(plots)) {
print(plot_grid(plots[[i]], plots_exp[[i]], ncol = 1, align = "v")) # print explicitly so each page is drawn even when the script is sourced
}
dev.off()
Another option is to prepare the data for ggplot and use facet as usual:
library(dplyr)
library(tidyr)
library(ggplot2)
gather(df, valueType, value, -c(group, period)) %>%
mutate(myGroup = paste(group, valueType)) %>%
ggplot(aes(period, value)) +
geom_line() +
facet_grid(myGroup ~ ., scales = "free_y")

barplot of percentages per category, per variable

Given the following example data:
df<-data.frame(cbind(cntry<- c("BE","ES","IN","GE","BE","ES","GE",NA,"IN","IN"),
gndr<- c(NA,1,2,2,2,2,1,1,1,2),
plcvcrcR<-c(0,1,NA,0,0,1,1,1,0,0),
plcpvcrR<-c(0,1,1,1,NA,0,0,0,0,0),
plccbrgR<- c(0,1,0,NA,0,1,0,1,1,0),
plcarcrR<-c(1,0,0,NA,1,0,1,0,0,0),
plcrspcR<-c(1,1,0,0,0,0,0,1,1,NA)))
colnames(df)<- c("cntry", "gndr", "plcvcrcR", "plcpvcrR", "plccbrgR", "plcarcrR", "plcrspcR")
df
How could I make barplots showing, for example, for each gender (gndr) the percentage of 1-values on the variables plcpvcrR, plccbrgR, plcarcrR? Preferably the bars for each gender are grouped, with a different colour for each variable.
Something like this image, where one colour refers to the question, and the group to the gender (without the confidence interval):
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSsAlUJsqdhxXHiY35FxFmVx3BREVji_ca24w9ub_OYEfZ3O50X5Q
I have experimented with the following function, of which I am aware it contains many flaws:
barplot(((colSums(df[c(3:5)], na.rm=TRUE)/nrow(df[c(3:5)]))*100)~gndr)
I'd do something like this:
require(ggplot2)
require(reshape2)
require(scales)
require(plyr)
# remove NA from gndr
df <- df[!is.na(df$gndr), ]
# now get percentages
df.o <- ddply(df, .(gndr), summarise,
plcpvcrR = sum(plcpvcrR == 1, na.rm = T)/sum(!is.na(plcpvcrR)),
plccbrgR = sum(plccbrgR == 1, na.rm = T)/sum(!is.na(plccbrgR)),
plcrspcR = sum(plcrspcR == 1, na.rm = T)/sum(!is.na(plcrspcR)))
# melt it:
df.m <- melt(df.o, id.var = "gndr")
# plot it:
ggplot(data = df.m, aes(x=gndr)) + geom_bar(aes(weight=value, fill=variable),
position = "dodge") + scale_y_continuous(labels=percent)
There may be an easier or more straightforward way to get the percentages. Here's the plot:
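One possibly simpler way to get the percentages, sketched here with base R's aggregate() instead of plyr: for 0/1 answers, the mean of the non-missing values is the share of 1s. This is a sketch under assumptions: the data frame is rebuilt column by column so the answers stay numeric, it uses the three variables named in the question, and the names vars, share and share_long are illustrative.

```r
library(ggplot2)

# Example data rebuilt column by column so the 0/1 answers stay numeric
df <- data.frame(
  gndr     = c(NA, 1, 2, 2, 2, 2, 1, 1, 1, 2),
  plcpvcrR = c(0, 1, 1, 1, NA, 0, 0, 0, 0, 0),
  plccbrgR = c(0, 1, 0, NA, 0, 1, 0, 1, 1, 0),
  plcarcrR = c(1, 0, 0, NA, 1, 0, 1, 0, 0, 0)
)
df <- df[!is.na(df$gndr), ]   # drop rows with unknown gender

# Share of 1s per gender: mean of the non-missing 0/1 values
vars  <- c("plcpvcrR", "plccbrgR", "plcarcrR")
share <- aggregate(df[vars], by = list(gndr = df$gndr), FUN = mean, na.rm = TRUE)

# Long format: one row per gender and question
share_long <- data.frame(
  gndr     = factor(rep(share$gndr, times = length(vars))),
  variable = rep(vars, each = nrow(share)),
  value    = unlist(share[vars], use.names = FALSE)
)

p <- ggplot(share_long, aes(x = gndr, y = value, fill = variable)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
print(p)
```

geom_col() draws the precomputed values directly, so no weight aesthetic is needed.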
