Dot plot of multiple X and Y variables? - r

I am using a gene expression dataset from ~100 cells.
I want to generate a dot plot indicating which cells are expressing which genes, like below, excluding the color delineations.
I have tried ggplot solutions, but (from what I can tell) ggplot2 cannot graph numerous variables on each axis. I've looked into more complex packages like Seurat and cRegulome (the image above is from cRegulome), but these put more information into the graphical output than I want.
Below is an example of the type of data frame I am working with.
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
How can I manipulate this dataset to get the graphical output I want?

You can do this by reshaping the data and using geom_point. Map the size aesthetic to your count variable and it will work well. The legend is currently a bit nonsensical but can be manually tweaked if you do not have any other sizes than 0 and 1.
library(tidyverse)
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
fake_data %>%
rownames_to_column(var = "gene") %>%
gather(cell, count, -gene) %>%
ggplot() +
geom_point(aes(x = gene, y = cell, size = count))
Created on 2019-08-02 by the reprex package (v0.3.0)
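If only presence/absence matters, one way to sidestep the odd size legend is to drop the zero rows before plotting and draw every remaining point at a fixed size. The following is a minimal sketch under that assumption; it uses pivot_longer() in place of gather(), and the point size and axis-text angle are arbitrary choices, not part of the answer above.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# Same toy data as in the question
fake_data <- data.frame(
  Cell_A = c(0, 0, 1, 0, 1, 0, 1, 0),
  Cell_B = c(1, 1, 1, 0, 0, 0, 1, 0),
  Cell_C = c(1, 0, 1, 0, 0, 1, 0, 1),
  Cell_D = c(0, 0, 0, 1, 1, 1, 1, 0),
  Cell_E = c(1, 1, 1, 1, 1, 0, 1, 1),
  Cell_F = c(0, 0, 0, 0, 0, 1, 1, 0),
  Cell_G = c(1, 1, 1, 1, 1, 1, 1, 1),
  Cell_H = c(1, 1, 1, 1, 1, 1, 1, 1),
  row.names = paste0("Gene", 1:8)
)

# Reshape to one row per gene/cell pair, then keep only expressed pairs
expressed <- fake_data %>%
  rownames_to_column(var = "gene") %>%
  pivot_longer(-gene, names_to = "cell", values_to = "count") %>%
  filter(count > 0)

# One fixed-size dot per expressed pair, so no size legend is drawn
p <- ggplot(expressed, aes(x = gene, y = cell)) +
  geom_point(size = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

One caveat: a gene or cell with no expression at all would disappear from its axis after filtering, so with real data it may be worth converting gene and cell to factors and using scale_x_discrete(drop = FALSE) and scale_y_discrete(drop = FALSE).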

This solution is a base R solution that relies on matplot().
fake_data2 <- sweep(fake_data, 2, seq_len(length(fake_data)), FUN = '*')
fake_data2[fake_data2 == 0] <- NA_integer_
matplot(x = seq_along(Genes), y = as.matrix(fake_data2),
cex = colSums(fake_data) / 3, pch = 16, col = 1,
yaxt = 'n', xaxt = 'n', ann = FALSE)
axis(1, at = seq_along(Genes), Genes)
axis(2, at = seq_len(length(fake_data)), names(fake_data), las = 1)
You didn't provide enough detail on what point size you wanted. The size here is based on the number of 1 values in each column.
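The sweep() step is what turns the 0/1 matrix into plotting positions: each column is multiplied by its own column index, so a 1 in the second column becomes y = 2, and zeros are then set to NA so they are not drawn. A small sketch on a hypothetical 3 x 2 matrix (the values are made up for illustration) shows the intermediate result:

```r
# Hypothetical miniature of the expression matrix (values are illustrative)
m <- matrix(c(0, 1, 1,
              1, 0, 1),
            nrow = 3,
            dimnames = list(c("Gene1", "Gene2", "Gene3"),
                            c("Cell_A", "Cell_B")))

# Multiply column j by j, so every expressed entry carries its column index
pos <- sweep(m, 2, seq_len(ncol(m)), FUN = "*")

# Zeros mean "not expressed"; NA values are skipped when plotting
pos[pos == 0] <- NA
print(pos)
```

matplot() then plots each column of this matrix against the gene index, which places the points for one cell on a single horizontal line.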

Related

Setting per-column y axis limits with facet_grid

I am, in R and using ggplot2, plotting the development over time of several variables for several groups in my sample (days of the week, to be precise). An artificial sample (using long data suitable for plotting) is this:
library(tidyverse)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>% ggplot(mapping = aes(x = x, y = values)) + geom_line() + facet_grid(groups2 ~ groups1)
which gives
In this example, the first variable -- shown in the left column -- has unlimited range, while the second variable -- shown in the right column -- is weakly positive.
I would like to reflect this in my plot by allowing the Y axes to differ across the columns in this plot, i.e. set Y axis limits separately for the two variables plotted. However, in order to allow for easy visual comparison of the different groups for each of the two variables, I would also like to have the identical Y axes within each column.
I've looked at the scales option to facet_grid(), but it does not seem to be able to do what I want. Specifically,
passing scales = "free_y" allows the Y axes to vary across rows, while
passing scales = "free_x" allows the X axes to vary across columns, but
there is no option to allow the Y axes to vary across columns (nor, presumably, the X axes across rows).
As usual, my attempts to find a solution have yielded nothing. Thank you very much for your help!
I think the easiest would be to create a plot per facet column and bind them with something like {patchwork}. To get the facet look, you can still add a faceting layer.
library(tidyverse)
library(patchwork)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
set.seed(42) ## always better to set a seed before using random functions
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>%
group_split(groups1) %>%
map({
~ggplot(.x, aes(x = x, y = values)) +
geom_line() +
facet_grid(groups2 ~ groups1)
}) %>%
wrap_plots()
Created on 2023-01-11 with reprex v2.0.2
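A different technique, not the patchwork approach above, is to stay in a single plot: use facet_wrap() with free y scales and add an invisible geom_blank() layer that carries each column's overall minimum and maximum into every panel of that column. The sketch below assumes the same simulated data; the names limits, lo, hi and bound are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(42)
data <- tibble(
  x = rep(1:100, times = 14),
  groups1 = rep(1:2, each = 7 * 100),
  groups2 = rep(rep(1:7, times = 2), each = 100),
  values = c(rnorm(n = 700), rgamma(n = 700, shape = 2))
)

# Overall y range per column (groups1), repeated for every row facet (groups2)
limits <- data %>%
  group_by(groups1) %>%
  summarise(lo = min(values), hi = max(values)) %>%
  crossing(groups2 = 1:7) %>%
  pivot_longer(c(lo, hi), names_to = "bound", values_to = "values")

# geom_blank() draws nothing but still trains the scales, so every panel in a
# column is stretched to the same range while the two columns stay independent
p <- ggplot(data, aes(x = x, y = values)) +
  geom_line() +
  geom_blank(data = limits, aes(x = 1, y = values)) +
  facet_wrap(vars(groups2, groups1), ncol = 2, scales = "free_y")
print(p)
```

The trade-off is cosmetic: facet_wrap() repeats the axis and strip labels on every panel, whereas the patchwork version keeps the facet_grid() look.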

Plot multiple variables on same Barplot in R

I would really appreciate some help with this plot. I'm very new to R and struggling (after looking at many tutorials!) to understand how to plot the following:
This is my Table
The X axis is meant to have PatientID, the Y is cell counts for each patient
I've managed to do a basic plot for each variable individually, eg:
This is for 2 of the variables
And this gives me 2 separate graphs
Total cell counts
Cells counts for zone 1
I would like all the data represented on 1 graph...That means for each patient, there will be 4 bars (tot cell counts, and cell counts for each zone (1 - 3).
I don't understand whether I should be doing this as a combined plot or make the 4 different plots and then combine them together? I'm also very confused with how to actually code this. I've tried ggplot and I've done the regular Barplot in R (worked for 1 variable at a time but not sure how to do many variables). Some very step-by-step help would be so much appreciated here. TIA
Here's a way of doing it using the ggplot2 and tidyr packages from the tidyverse. The key steps are pivoting your data from "wide" to "long" format in order to make it usable with ggplot2. Afterwards, the ggplot call is pretty simple - more info here if you want a bit more explanation about stacked and bar plots in ggplot2, with an example that's pretty much identical to yours.
library(ggplot2)
library(tidyr)
# Reproducing your data
dat <- tibble(
patientID = c("a", "b", "c"),
tot_cells = c(2773, 3348, 4023),
tot_cells_zone1 = c(994, 1075, 1446),
tot_cells_zone2 = c(1141, 1254, 1349),
tot_cells_zone3 = c(961, 1075, 1426)
)
to_plot <- pivot_longer(dat, cols = starts_with("tot"), names_to = "Zone", values_to = "Count")
ggplot(to_plot, aes(x = patientID, y = Count, fill = Zone)) +
geom_bar(position="dodge", stat="identity")
Output:
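To see what the wide-to-long step does, it can help to print the reshaped table before plotting. This sketch reuses the placeholder counts from the code above and only inspects the result:

```r
library(tidyr)

# Placeholder data from the answer above, one row per patient
dat <- data.frame(
  patientID = c("a", "b", "c"),
  tot_cells = c(2773, 3348, 4023),
  tot_cells_zone1 = c(994, 1075, 1446),
  tot_cells_zone2 = c(1141, 1254, 1349),
  tot_cells_zone3 = c(961, 1075, 1426)
)

# Each "tot..." column becomes a (Zone, Count) pair, giving 4 rows per patient
to_plot <- pivot_longer(dat, cols = starts_with("tot"),
                        names_to = "Zone", values_to = "Count")
print(head(to_plot, 4))
```

After reshaping, every bar corresponds to exactly one row, which is why a single fill = Zone mapping is enough to get four dodged bars per patient.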
Thanks everyone for your help. I was able to make the plot as follows:
First, I made a new table from data I imported into R:
#Make new table of patientID and tot cell count
patientID <- c("a", "b", "c")
tot_cells <- c(tot_cells_a, tot_cells_b, tot_cells_c)
tot_cells_zone1 <- c(tot_cells_a_zone1, tot_cells_b_zone1, tot_cells_c_zone1)
tot_cells_zone2 <- c(tot_cells_a_zone2, tot_cells_b_zone2, tot_cells_c_zone2)
tot_cells_zone3 <- c(tot_cells_a_zone3, tot_cells_b_zone3, tot_cells_c_zone3)
tot_cells_table <- data.frame(tot_cells,
tot_cells_zone1,
tot_cells_zone2,
tot_cells_zone3)
rownames(tot_cells_table) <- c(patientID)
Then I plotted as such, first converting the data.frame to matrix :
#Plot "Total Microglia Counts per Patient"
tot_cells_matrix <- data.matrix(tot_cells_table, rownames.force = patientID)
par(mar = c(5, 4, 4, 10),
xpd = TRUE)
barplot(t(tot_cells_table[1:3, 1:4]),
col = c("red", "blue", "green", "magenta"),
main = "Total Microglia Counts per Patient",
xlab = "Patient ID", ylab = "Cell #",
beside = TRUE)
legend("topright", inset = c(- 0.4, 0),
legend = c("tot_cells", "tot_cells_zone1",
"tot_cells_zone2", "tot_cells_zone3"),
fill = c("red", "blue", "green", "magenta"))
And the graph looks like this:
Barplot of multiple variables
Thanks again for pointing me in the right direction!
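The code above depends on variables from the imported data (tot_cells_a and so on), so it cannot be run as posted. A self-contained sketch of the same base R idea, assuming the placeholder counts from the ggplot2 answer, builds the matrix directly with one row per variable and one column per patient, so no transpose is needed:

```r
# Placeholder counts (taken from the ggplot2 answer, not real data)
counts <- rbind(
  tot_cells = c(2773, 3348, 4023),
  tot_cells_zone1 = c(994, 1075, 1446),
  tot_cells_zone2 = c(1141, 1254, 1349),
  tot_cells_zone3 = c(961, 1075, 1426)
)
colnames(counts) <- c("a", "b", "c")

# beside = TRUE draws one group of bars per column (patient),
# with one bar per row (variable) inside each group
bar_mids <- barplot(counts, beside = TRUE,
                    col = c("red", "blue", "green", "magenta"),
                    main = "Total Microglia Counts per Patient",
                    xlab = "Patient ID", ylab = "Cell #",
                    legend.text = rownames(counts),
                    args.legend = list(x = "topleft", bty = "n"))
```

barplot() returns the bar midpoints, which is handy if labels need to be added above individual bars with text().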

Boxplots aren't colouring or plotting labels properly in R, why?

My Tukey test significant results LABELS and the colours plotted as box plots do not plot over each sample box plot. Why?
Seems like the labels are plotted at different y-axis along the same s1 (x-axis)?
Reproducible dataset here:
library(multcompView)
df <- data.frame('Sample'=c("s1","s1","s1","s1","s1","s2","s2","s2","s2","s2","s3","s3","s3","s3","s4","s4","s5","s5"), 'value'=c(-0.1098,-0.1435,-0.1046,-0.1308,-0.1523,-0.1219,-0.1114,-0.1328,-0.1589,-0.1567,-0.1395,-0.1181,-0.1448,-0.124,-0.1929,-0.1996,-0.1981,-0.1917))
anova_df <- aov(df$value ~ df$Sample )
tukey_df <- TukeyHSD(anova_df, 'df$Sample', conf.level=0.95)
# I need to group the treatments that are not different each other together.
TUKEY <- tukey_df
generate_label_df <- function(TUKEY, variable){
# Extract labels and factor levels from Tukey post-hoc
Tukey.levels <- TUKEY[[variable]][,4]
Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])
#I need to put the labels in the same order as in the boxplot :
Tukey.labels$Sample=rownames(Tukey.labels)
Tukey.labels=Tukey.labels[order(Tukey.labels$Sample) , ]
return(Tukey.labels)
}
# Apply the function on my dataset
LABELS <- generate_label_df(TUKEY , "df$Sample")
# A panel of colors to draw each group with the same color :
my_colors <- c(
rgb(143,199,74,maxColorValue = 255),
rgb(242,104,34,maxColorValue = 255),
rgb(111,145,202,maxColorValue = 255))
# Draw the basic boxplot
a <- boxplot(df$value ~ df$Sample , ylim=c(min(df$value) , 1.1*max(df$value)) , col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
# I want to write the letter over each box. Over is how high I want to write it.
over <- 0.1*max(a$stats[nrow(a$stats),] )
#Add the labels
text(c(1:nlevels(df$Sample)), a$stats[nrow(a$stats),]+over, LABELS[,1] , col=my_colors[as.numeric(LABELS[,1])] )
Current output:
Desired plot-like (colours and LABELS):
First, LABELS$Letters is a character vector. You can get as.numeric(LABELS[,1]) to work if you make it a factor first.
Second, your y-limit needs some work for negative values. There is a function you might find useful called extendrange which is used in many a plotting function.
The expression c(1:nlevels(df$Sample)) would also work if df$Sample were a factor, which it was not.
Also, if you are plotting text at a specific location, you can adjust the text using either text(..., pos = ) or text(..., adj = ) to shift the position.
LABELS$Letters <- factor(LABELS$Letters)
a <- boxplot(df$value ~ df$Sample , ylim = extendrange(df$value), col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
text(seq_along(a$names), apply(a$stats, 2, max), LABELS[,1], col=my_colors[as.numeric(LABELS[,1])], pos = 3)
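Two of the points above can be checked in isolation. The sketch below uses a hypothetical letter vector and the minimum and maximum of the example data to show why as.numeric() needs a factor, and why multiplying a negative maximum by 1.1 lowers the upper limit instead of raising it:

```r
# as.numeric() on plain letters yields NA; via factor() it yields group codes
letters_chr <- c("a", "a", "b")   # hypothetical Tukey letters
codes_direct <- suppressWarnings(as.numeric(letters_chr))
codes_factor <- as.numeric(factor(letters_chr))

# Smallest and largest value in the example data
value_range <- c(-0.1996, -0.1046)

# 1.1 * max moves the limit further below zero, cutting off the top boxes
naive_upper <- 1.1 * max(value_range)

# extendrange() pads both ends by a fraction of the range, whatever the sign
padded <- extendrange(value_range)

print(codes_direct)
print(codes_factor)
print(naive_upper)
print(padded)
```

Because the naive upper limit ends up below the largest observation, both the tallest boxes and any labels placed above them fall outside the plotting region.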
If you don't mind changing your workflow and using the tidyverse, this is how you could achieve your goal:
library(tidyverse)
# join df and LABELS into one data table
inner_join(df, LABELS, by = "Sample") %>%
# calculate max value for each Sample group (it will be used to place the labels)
group_by(Sample) %>%
mutate(placement = max(value)) %>%
ungroup() %>%
# make a plot
ggplot(aes(Sample, value, fill = Letters))+
geom_boxplot()+
geom_text(aes(y = placement, label = Letters, col = Letters), nudge_y = 0.01, size = 6)+
theme_minimal()+
theme(legend.position = "none")

pie charts in R where slices represent the frequency of values in the columns of the data set

I want to make pie charts for each column of my data frame, where the slices represent the frequency with which the values in the columns appear. For instance, the following will produce a data set with 3 columns, and will round the numbers to whole numbers.
test1<-rnorm(200,mean = 20, sd = 2)
test2<-rnorm(200,mean=20, sd =1)
test3<-rnorm(200,mean=20, sd =3)
testdata<-cbind(test1,test2,test3)
testdata <-round(testdata,0)
So I would need to have 3 pie charts, where the slices represent the number of times, in which a given value appears in the respective column (with the name of the column on top of the pie chart, if possible)
So far, I have tried pie(frame(testdata$test1)) but it works for creating a single pie chart, and my real data has 25 columns. On top of that, trying to pass a "main=" argument to name it, results in error.
Thank you in advance.
ggplot2 is the go-to library to make nice plots. To have 3 different pie-plots one needs to adjust the data a bit, which is done with some tidyverse-functions.
test1<-rnorm(200,mean = 20, sd = 2)
test2<-rnorm(200,mean=20, sd =1)
test3<-rnorm(200,mean=20, sd =3)
testdata<-cbind(test1,test2,test3)
testdata <-round(testdata,0)
library(ggplot2)
library(tidyverse)
plotdata <- testdata %>%
as_tibble() %>%
pivot_longer(names(.),names_to = "data1", values_to = "value") %>%
group_by(data1) %>%
count(value)
ggplot(plotdata, aes( x = "", y = n, fill = factor(value))) +
geom_col(width = 1, show.legend = TRUE) +
coord_polar("y", start = 0) +
facet_wrap(~data1)
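For comparison, a base R sketch is also possible: tabulate each column with table() and call pie() once per column, passing the column name as main. The seed is an assumption added here for reproducibility, and the 1 x 3 layout would need adjusting for 25 columns (for example par(mfrow = c(5, 5))).

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testdata <- round(cbind(
  test1 = rnorm(200, mean = 20, sd = 2),
  test2 = rnorm(200, mean = 20, sd = 1),
  test3 = rnorm(200, mean = 20, sd = 3)
), 0)

# One frequency table per column
freqs <- lapply(as.data.frame(testdata), table)

# One pie per column, titled with the column name
op <- par(mfrow = c(1, 3), mar = c(1, 1, 2, 1))
for (nm in names(freqs)) {
  pie(freqs[[nm]], main = nm)
}
par(op)
```

Here main works because the data passed to pie() is a plain frequency table, so the remaining arguments are matched as usual.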

Detect outer rows in the dataset

I have a data set that contains the positions of objects:
so <- data.frame(x = rep(c(1:5), each = 5), y = rep(1:5, 5))
so1 <- so %>% mutate(x = x + 5, y = y +2)
so2 <- rbind(so, so1) %>% mutate(x = x + 13, y = y + 7)
so3 <- so2 %>% mutate(x = x + 10)
ggplot(aes(x = x, y = y), data = rbind(so, so1, so2, so3)) + geom_point()
What I want to know is whether there is a method in R that can detect that an object is located in the outer row of the data set, as I have to exclude such objects from the analysis. I want to exclude the objects shown in red in the picture.
So far I have used min, max and ifelse, but this is tedious and I could not create something that generalises to different data sets with different layouts of x and y.
Is there a package that does this? And/or is it possible to solve such a problem?
You could perhaps use a "spatial" approach?
Visualizing your data as a spatial object, your problem would become to remove the borders of your patches...
This can be done quite straightforwardly using the package raster: find the boundaries and mask your data accordingly.
library(dplyr)
library(raster)
# Your reproducible example
myDF = rbind(so,so1,so2,so3)
myDF$z = 1 # there may actually be more 'z' variables
# Rasterize your data
r = rasterFromXYZ(myDF) # if there are more vars, this will be a RasterBrick
par(mfrow=c(2,2))
plot(r, main='Original data')
# Here I artificially add 1 row above and below and 1 column left and right.
# This is a trick needed to make sure to also remove the cells that are
# located at the border of your raster with `boundaries` in the next step.
newextent = extent(r) + c(-res(r)[1], res(r)[1], -res(r)[2], res(r)[2] )
r = extend(r, newextent)
plot(r, main='Artificially extended')
plot(rasterToPoints(r, spatial=T), add=T, col='blue', pch=20, cex=0.3)
# Get the cells to remove, i.e. the boundaries
bounds = boundaries(r[[1]], asNA=T) #[[1]]: in case r is a RasterBrick
plot(bounds, main='Cells to remove (where 1)')
plot(rasterToPoints(bounds, spatial=T), add=T, col='red', pch=20, cex=0.3)
# Then mask your data (i.e. subset to remove boundaries)
subr = mask(r, bounds, maskvalue=1)
plot(subr, main='Resulting data')
plot(rasterToPoints(subr, spatial=T), add=T, col='blue', pch=20, cex=0.3)
# This is your new data (the added NA's are not translated so it's OK)
myDF2 = rasterToPoints(subr)
Would it help you?
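If installing a spatial package is not an option, a rough base R alternative (a different technique from the raster approach above) is to call a point "outer" when at least one of its eight grid neighbours is missing. The sketch assumes integer coordinates on a unit grid, as in the example data; flag_outer is a hypothetical helper name.

```r
# Example layout from the question, rebuilt without dplyr
so  <- data.frame(x = rep(1:5, each = 5), y = rep(1:5, 5))
so1 <- transform(so, x = x + 5, y = y + 2)
so2 <- transform(rbind(so, so1), x = x + 13, y = y + 7)
so3 <- transform(so2, x = x + 10)
pts <- rbind(so, so1, so2, so3)

# Hypothetical helper: TRUE for points with at least one missing neighbour
flag_outer <- function(pts) {
  present <- paste(pts$x, pts$y)
  offsets <- expand.grid(dx = -1:1, dy = -1:1)
  offsets <- offsets[!(offsets$dx == 0 & offsets$dy == 0), ]
  all_present <- rep(TRUE, nrow(pts))
  for (i in seq_len(nrow(offsets))) {
    neighbour <- paste(pts$x + offsets$dx[i], pts$y + offsets$dy[i])
    all_present <- all_present & neighbour %in% present
  }
  !all_present
}

pts$outer <- flag_outer(pts)
inner <- pts[!pts$outer, ]

# Outer points in red, points to keep in black
plot(pts$x, pts$y, pch = 16, col = ifelse(pts$outer, "red", "black"))
```

Where two patches touch, the result depends on whether diagonal neighbours are counted, so it is worth checking the plot against the intended design before dropping any rows.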
