Adding arbitrary labels to each group in a grouped scatterplot in ggplot2 - r

I have a list of matrices I wish to plot. Each element in the list ultimately represents a facet to be plotted. Each matrix element has dimensions Row * Col, where all values in a row are grouped and plotted as a scatterplot (i.e. X-axis is categorical for the row names, Y-axis is the value, Col. values per row).
Additionally, I would like to add the CV for the distribution of points at each X.
X1 X2 value L1
a subject1 8.026494 facet1
b subject1 7.845277 facet1
c subject1 8.189731 facet1
(10 categorical groupings - a-j)
a subject2 5.148875 facet1
b subject2 8.023356 facet1
(33 subjects plotted for each categorical grouping)
a subject1 5.148875 facet2
b subject1 8.023356 facet2
(multiple facets (in my specific case, 50) with identical categorical grouping and subject names)
I managed to plot this to my satisfaction with the following:
p <- (qplot(X1, value, data=melt(df), colour=X2)
+ facet_wrap(~Probeset, ncol=10, nrow=5, scales="free_x"))
However, I would like to add the CV of each grouping of points along the X-axis as a label hovering above the group. I tried variations on this:
p + geom_text(aes(x=X1, y="arbitrary value at the top of the Y-axis scale", label="vector of labels"))
But none of them behaved as I wish. How would I go about getting the CV of each group above the group of points, as a label?
Thank you in advance!

Since there is no X2 corresponding to each label, you have to put the labels into a separate data set, and supply it in the data argument of geom_text. Using a reproducible example:
library(ggplot2)
library(plyr) # provides ddply() and .()
#create data with the desired structure
dd <- expand.grid(facet=LETTERS[1:4], group=letters[1:5], subject=factor(1:10))
dd$value <- exp(rnorm(nrow(dd)))
#calculate CV's
ddcv <- ddply(dd, .(facet,group),
function(x)c(CV=sd(x$value)/mean(x$value), maxX=max(x$value)))
ddcv$CV <- round(ddcv$CV,1)
#make plots
p <- qplot(group, value, colour=subject, data=dd) + facet_wrap(~facet)
p + geom_text(aes(x=group, y=maxX+1, label=CV), colour="black", data=ddcv)
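The summary step above can also be done without plyr. The following is a minimal base R sketch, assuming the same simulated `dd` structure as in the answer; the intermediate names `cv_df` and `max_df` are only illustrative:

```r
# Simulated data with the same structure as in the answer above
set.seed(1)
dd <- expand.grid(facet = LETTERS[1:4], group = letters[1:5], subject = factor(1:10))
dd$value <- exp(rnorm(nrow(dd)))

# Coefficient of variation (sd / mean) per facet and group
cv_df <- aggregate(value ~ facet + group, data = dd,
                   FUN = function(v) sd(v) / mean(v))
names(cv_df)[3] <- "CV"

# Largest value per facet and group, used to place the label above the points
max_df <- aggregate(value ~ facet + group, data = dd, FUN = max)
names(max_df)[3] <- "maxX"

# One row per facet/group combination, holding both summaries
ddcv <- merge(cv_df, max_df, by = c("facet", "group"))
ddcv$CV <- round(ddcv$CV, 1)
head(ddcv)
```

The resulting `ddcv` has one row per facet and group, so it can be supplied to the `data` argument of `geom_text` in the same way as the plyr version.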

Related

How can I set my scatterplot to color specific groupings of dots?

I'm trying to color this correlation scatterplot with 3 different colors: black for normal cell lines, red for cancer cell lines, and blue for tumors. I set up my ggplot and labeled the points, but because each cell line being compared has a specific name, I cannot group the data with dplyr or Excel. I tried dividing my data into additional data frames and using geom_point to color specifically those data frames, but that is not working. I'd appreciate any help I can get.
#create master frame
df <- my_file
names <- rownames(df)
df <- as.data.frame(df)
class(df)
normals <- df[1:21]
cell_lines <- df[22:60]
tumors <- df[61:219]
row.names(df) <- c("PTPRO", "PPP1R14C", "IRF8", "TSPAN8","NKD1")
#create new data frame for 2 specific sets
row_to_keep <- c(TRUE,TRUE,FALSE,FALSE,FALSE)
df1 <- df[row_to_keep,]
# create scatter plot
df1=data.frame(t(df)) #switch rows and columns
ggplot(df1,aes(x=PTPRO,y=PPP1R14C))+
geom_point() + #scatter plot
coord_trans(y='log10',x='log10') + #logarithmic scale
geom_text( #label names
label=rownames(df1),
nudge_x=0.25,nudge_y=0.25,
check_overlap=T,
)
Here is my current graph
I want it to look something like this reproducible example described here at https://r-charts.com/correlation/scatter-plot-group-ggplot2/
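One possible approach, sketched here under the assumption that the sample columns really are ordered as in the question (1 to 21 normals, 22 to 60 cell lines, 61 to 219 tumors), is to add a grouping column to the transposed data frame rather than splitting it. The matrix `expr` below is made-up stand-in data, and the column name `type` is hypothetical:

```r
# Made-up stand-in for the asker's data: 5 genes (rows) x 219 samples (columns)
set.seed(42)
expr <- matrix(rexp(5 * 219) + 0.1, nrow = 5,
               dimnames = list(c("PTPRO", "PPP1R14C", "IRF8", "TSPAN8", "NKD1"),
                               paste0("sample", 1:219)))

# Switch rows and columns so that each row is one sample
df1 <- data.frame(t(expr))

# Label every sample by its position: 21 normals, 39 cell lines, 159 tumors
df1$type <- factor(rep(c("normal", "cell line", "tumor"), times = c(21, 39, 159)),
                   levels = c("normal", "cell line", "tumor"))
table(df1$type)
```

With such a column in place, mapping `colour = type` inside `aes()` and adding `scale_colour_manual(values = c(normal = "black", "cell line" = "red", tumor = "blue"))` to the existing ggplot call should colour the points by group.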

How to split x-axis as decile in R and make ggplot

Hi, I am wondering how to split the x-axis into deciles in R and plot the result with ggplot.
I currently have age range data and NO2 pollution data. The two datasets share the same geographic reference, named ward. I wish to plot my demographic data in quantiles containing equal numbers of wards (298 in total).
I tried the quantile regression in R where I used the following:
library(SparseM)
library(quantreg)
mydata<- read.csv("M:/Desktop10/Test2.csv")
attach(mydata)
Y <- cbind(NO2.value)
X <- cbind(age.0.to.4, age..5.to.9, age.10.to.14, age.15.to.19, age.20.to.24, age.25.to.29, age.30.to.44, age.45.to.59, age.60.to.64, age.65.to.74, age.75.to.84, age.85.to.89, age.above.90)
quantreg.all <- rq(Y ~ X, tau = seq(0.05, 0.95, by = 0.05), data=mydata)
quantreg.plot <- summary(quantreg.all)
plot(quantreg.plot)
But what I get is not what I expected, as the y-axis is not the NO2 data.
The ideal plot is attached:
Many thanks for your help and suggestions.
If I understand your question, I think the cut function combined with the quantile function will create the deciles. Here's an example with fake data.
In the code below, we use the cut function to split the data into deciles and we use the quantile function to set the breaks argument for cut. This tells cut to group the data into 10 groups of equal size, from smallest values of NO2 to largest.
group_by(age) means we create the deciles separately for each age group. This means that there are equal numbers of subjects within each decile in a given age group, but the NO2 cutoff values for each decile are different for different age groups. To create deciles over the data as a whole, just remove group_by(age). This will result in the same NO2 cutoff values for each decile across all age groups, but within a given age group, the number of subjects will not be the same in each decile.
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(NO2=c(runif(600, 0, 10), runif(400, 1, 11)),
age=rep(c("0-10","11-20"), c(600,400)))
# Create decile groups
dat = dat %>%
  group_by(age) %>%
  mutate(decile = cut(NO2, breaks=quantile(NO2, probs=seq(0,1,0.1)),
                      labels=10:1, include.lowest=TRUE),
         decile = fct_rev(decile))
Now we plot using ggplot2. The stat_summary function returns the mean for each decile in each age group.
ggplot(dat, aes(decile, NO2, colour=age, group=age)) +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun=mean, geom="line") +
  stat_summary(fun=mean, geom="point") +
  expand_limits(y=0) +
  theme_bw()
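To see what the cut/quantile combination does on its own, here is a minimal base R sketch on made-up data without any grouping. The names `no2`, `breaks` and `decile` are illustrative only, and the labels run from 1 (lowest) to 10 (highest) rather than being reversed as above:

```r
# Made-up NO2 values for 1000 wards
set.seed(2)
no2 <- runif(1000, 0, 10)

# Decile boundaries: the 0%, 10%, ..., 100% quantiles of the data
breaks <- quantile(no2, probs = seq(0, 1, 0.1))

# Assign each value to its decile; include.lowest keeps the minimum in group 1
decile <- cut(no2, breaks = breaks, labels = 1:10, include.lowest = TRUE)

# Group sizes and the mean NO2 within each decile
table(decile)
decile_means <- tapply(no2, decile, mean)
decile_means
```

Because the breaks come from the data's own quantiles, the groups come out (nearly) equal in size, and the per-decile means are what `stat_summary` plots.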

Plot sample-vs-sample gene expression levels in R

I have a data set containing gene expression data for various genes, across 24 different samples. In my current dataframe, each row is a gene and each column is a sample.
I want to create a dot plot where each dot is a gene, the y-axis represents the expression of that gene in sample A, and the x-axis represents the expression of the same gene in sample B.
I have tried to search for this but don't know what such a plot is called or how I can find it. Most of my other plots are plotted with ggplot2, but it does not matter what package is used to solve the problem.
Example data:
sample_A<-c(2,3,1)
sample_B<-c(-1,4,-3)
genes <- c("gene1","gene2","gene3")
df<-data.frame(sample_A,sample_B,row.names = genes)
Data frame:
sample_A sample_B
gene1 2 -1
gene2 3 4
gene3 1 -3
geom_point with ggplot2 is probably what you're looking for. The dots can also be labelled using geom_label.
library(ggplot2)
p <- ggplot(df, aes(x = sample_B, y = sample_A)) +
  geom_point() +
  geom_label(aes(label = rownames(df)))
p # print the plot

How to plot a variable with selected number of rows using ggplot2?

A sample of my dataframe (speed) is as below with 45122 observations.
A B C
1 0.06483121 0.08834364 0.05814113
2 0.06904103 0.13169238 0.06082291
3 0.05556961 0.09767185 0.06039383
4 0.06483121 0.13388726 0.05996474
5 0.06651514 0.11632827 0.04891578
6 0.06904103 0.11687699 0.05953565
...
......
45122 0.06212749 0.08307191 0.07422524
I can create a simple plot of any number of observations I like using the code below
(a temporal cyclic pattern, with speed on the y axis and 0 to 500 on the x axis):
plot(speed[1:500,3], type="l", ylab="speed", xlab="unit time")
I am trying to do the same with ggplot2, but it gives me a histogram.
How do I make a similar plot using ggplot2?
We subset the first 500 rows and the third variable ('C') using [. Note that we have to add drop=FALSE as the default is drop=TRUE. According to ?"[", if drop=TRUE, the result is coerced to the lowest possible dimension, i.e. in this case a vector.
speed1 <- speed[1:500,3, drop=FALSE]
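The effect of `drop` can be checked on a small example. This is a minimal sketch using a made-up data frame, `speed_demo`, that stands in for the real `speed` data:

```r
# Small made-up data frame with columns A, B, C
set.seed(24)
speed_demo <- as.data.frame(matrix(abs(rnorm(30)), ncol = 3,
                                   dimnames = list(NULL, LETTERS[1:3])))

# Default drop=TRUE: selecting one column collapses the result to a vector
as_vector <- speed_demo[1:5, 3]

# drop=FALSE: the result stays a data frame and keeps the column name
as_frame <- speed_demo[1:5, 3, drop = FALSE]

is.data.frame(as_vector)
is.data.frame(as_frame)
names(as_frame)
```

Keeping the data frame matters here because `ggplot()` expects a data frame and the column name `C` is used inside `aes()`.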
We specify the 'x' (1:nrow(speed1)) and 'y' variables in the aes, use geom_line() for a line plot and xlab and ylab to specify the labels for 'x axis' and 'y axis'.
library(ggplot2)
ggplot(speed1, aes(x=1:nrow(speed1), y=C))+
geom_line() +
ylab('speed') +
xlab('unit time')
data
set.seed(24)
speed <- as.data.frame(matrix(abs(rnorm(45122*3)), ncol=3,
dimnames=list(NULL, LETTERS[1:3])))

Plotting gene expression data with means in a randomized experiment

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.
Example gene:
IL10_BL
IL10_1Y
IL10_fold
Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0.
100 patients have been randomized to either a statin or diet regime.
I would like to do the following plot:
- Y axis should display the mean gene expression with 95% confidence limit
- X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice if the dots for the same gene were connected with a line.
I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would greatly appreciate any help (or link to a similar thread) to do this.
First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, or fold.
(your command creates an empty column you might need to get rid of first: genes$X = NULL)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order that you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
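The splitting and ordering steps can be tried in isolation. Below is a minimal sketch on a plain character vector, using `sapply` in place of `as.character(lapply(...))`; the vector `variable` simply mimics the column names from the question:

```r
# Column names as they would appear in the melted `variable` column
variable <- c("IL10_BL", "IL10_1Y", "IL10_fold", "IL6_BL", "IL6_1Y", "IL6_fold")

# Split each name at the underscore into a (gene, measurement) pair
parts <- strsplit(variable, split = "_", fixed = TRUE)

# First piece is the gene, second piece is the measurement
gene <- sapply(parts, `[`, 1)
measurement <- sapply(parts, `[`, 2)

# Fix the order of the measurement levels so the x axis reads BL, 1Y, fold
measurement <- factor(measurement, levels = c("BL", "1Y", "fold"))

data.frame(variable, gene, measurement)
```

Without the explicit `levels`, the factor would be ordered alphabetically, which puts 1Y before BL on the axis.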
Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).
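ggplot(genes.long, aes(measurement, expression)) +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom='point') +
  # mean_cl_boot needs the Hmisc package to be installed
  stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
  facet_grid(.~treatment+gene)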
You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.
