Plotting gene expression data with means in a randomized experiment - r

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.
Example gene:
IL10_BL
IL10_1Y
IL10_fold
Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0.
100 patients have been randomized to either a statin or diet regime.
I would like to do the following plot:
- Y axis should display the mean gene expression with 95% confidence limit
- X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice of the dots for the same gene would be connected with a line.
I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would greatly appreciate any help (or link to a similar thread) to do this.

First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, orfold.
(your command creates an empty column you might need to get rid of first: genes$X = NULL)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order that you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).
ggplot(genes.long, aes(measurement, expression)) +
stat_summary(fun.y = mean, geom='point') +
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~treatment+gene)
You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.

Related

Is something wrong with my ggplot2 or R code to plot an ordered ancestry stacked barplot?

I want to create an organized stacked barplot where bars with similar proportions appear together. I have a data frame of 10,000 individuals and each individual comes from three populations. Here is my data.
library(MCMCpack)
library(ggplot2)
n = 10000
alpha = c(0.1, 0.1, 0.1)
q <- as.data.frame(rdirichlet(n,alpha))
head(q)
individuals <- c(1:nrow(q))
q <- cbind(q, individuals)
head(q)
V1 V2 V3 individuals
1 0.0032720232 3.381345e-08 0.996727943 1
2 0.3354060035 4.433923e-01 0.221201688 2
3 0.0004121665 9.661220e-01 0.033465842 3
4 0.9966997182 3.234048e-03 0.000066234 4
5 0.7789280208 2.090134e-01 0.012058562 5
6 0.0005048727 9.408364e-02 0.905411485 6
# long format for ggplot2 plotting
qm <- gather(q, key, value, -individuals)
colnames(qm) <- c("individuals", "ancestry", "proportions")
head(qm)
individuals ancestry proportions
1 1 V1 0.0032720232
2 2 V1 0.3354060035
3 3 V1 0.0004121665
4 4 V1 0.9966997182
5 5 V1 0.7789280208
6 6 V1 0.0005048727
Without any kind of ordering of data, I plotted the stacked barplot as:
ggplot(qm) + geom_bar(aes(x = individuals, y = proportions, fill= ancestry), stat="identity")
I have two questions:
(1) I don't know how to make these individuals with similar proportions cluster together, and I have tried many solutions on stack exchange already but can't get them to work on my dataset!
(2) For some reason, it seems like when I implement the code to order individuals by decreasing/increasing proportions in one ancestry, the code sometimes works on toy datasets of lower dimensions I create, but when I try to plot 10,000 individuals, the code doesn't work anymore! Is this a problem in ggplot2 or am I doing something wrong? I would appreciate any answer to this thread to also plot n = 10,000 stacked barplots.
(3) Not sure if I'm imagining this, but in my stacked barplot, it seems like R is clustering the stacked bar plots in some order unknown to me -- because I can see regular gaps between the stacked plots. In reality, there should be no gaps and I'm not sure why this is happening.
I would appreciate any help since I have already worked on this code for an embarrassingly long amount of time!!
Since, the variance of proportions within the ancestry is very high, the bars look like clustered with other ancestry. It is plotted in the right way. However, we couldn't distinguish the difference because the number of individuals is high.
If you think that the proportions on your data set would not lose it's meaning and could be interpreted in the same way if they're transformed intro exponential or log values, you can try it.
The stacked bar with exponential of the proportions:
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity")
If you don't want have gaps between the bars, set widht to 1.
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity",
width=1)

ggplot2 geom_point three variables to plot x = number ocurrences y = sum of each category and colors for nationalities - R

I would like to plot a geom_point with ggplot2 to represent y = sum of amount ($) per nationality, x = number of times per nationality and colors to differentiate nationalities.
I grouped top 20 nationalities using this:
DF$groupnat <- fct_lump(DF$customer_country, 20)
With that I could group top 20 countries and the rest will be group in the value others
I know how to do individually but when I try to summarize all in one graph something fail. I'm not able to sum amount per nationality and same with numbers of times.
Here is an example of the Data set I'm using:
dataframe -> DF
The idea is to plot a geom_point graph involving 3 variables: x = number of observations per nationality (make a count for each nationality of the variable groupnat), y= total amount (sum of amount per nationality (groupnat)) and differentiate each nationality with a different color.
Thanks in advance!

Colouring a PCA plot by clusters in R

I have some biological data that looks like this, with 2 different types of clusters (A and B):
Cluster_ID A1 A2 A3 B1 B2 B3
5 chr5:100947454..100947489,+ 3.31322 7.52365 3.67255 21.15730 8.732710 17.42640
12 chr5:101227760..101227782,+ 1.48223 3.76182 5.11534 15.71680 4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040 6.85094 15.551400 7.18341
I clean up the data:
CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)
#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]
#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)
The data size is reduced from 6981 clusters to 599 after this.
I then go on to apply PCA:
#Get data dimensions
dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)
I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.
This is what I have tried, to erroneous results:
qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))
The errors appear as such:
> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")
Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]
> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data
> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
Error: ggplot2 doesn't know how to deal with data of class
Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principle components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data-set.
pca <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)
So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principle components in the columns. So then we run k-means clustering based on the first 2 PCs, and extract the cluster numbers into clust. Then we use ggibplot(...) to make the plot.

Plotting confidence intervals in ggplot

I'd like to do the following plot using ggplot:
Here is an example of the structure of my df (sort of, draw not to scale with the data):
example.df = data.frame(mean = c(0.3,0.8,0.4,0.65,0.28,0.91,0.35,0.61,0.32,0.94,0.1,0.9,0.13,0.85,0.7,1.3),
std.dev = c(0.01,0.03,0.023,0.031,0.01,0.012,0.015,0.021,0.21,0.13,0.023,0.051,0.07,0.012,0.025,0.058),
class = c("1","2","1","2","1","2","1","2","1","2","1","2","1","2","1","2"),
group = c("group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2"))
This data frame consists of 16 replicates, each with a given mean and a given standard deviation.
For each replicate I'd like to plot the confidence intervals, where the big dot in my figure example is the mean estimate, and the length of the bar is twice the standard deviation.
Also I'd like to plot two different replicates in the same line but with different coloring, coloring it by class, red is class 1 and blue is class 2.
Finally, I'd like to divide the whole plot into two panels (in the same row) corresponding to the two different groups.
I tried looking into this site, http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ but couldn't figure out how to automate this for any data frame of this structure, with X number of groups (in this case 2), and K replicates per group (in this case 8, 4 of class 1 and 4 of class 2).
Is there a good way to do this using ggplot or standard r pkg libraries?
I suppose that sample data frame you provided isn't build in appropriate way because all values in group1 have class 1, and in group2 all are class 2. So I made new data frame, added also new column named replicate that shows number of replicate (four replicates (with two class values) in each group).
example.df = data.frame(mean = c(0.3,0.8,0.4,0.65,0.28,0.91,0.35,0.61,0.32,0.94,0.1,
0.9,0.13,0.85,0.7,1.3),
std.dev = c(0.01,0.03,0.023,0.031,0.01,0.012,0.015,0.021,0.21,
0.13,0.023,0.051,0.07,0.012,0.025,0.058),
class = c("1","2","1","2","1","2","1","2","1","2","1",
"2","1","2","1","2"),
group = rep(c("group1","group2"),each=8),
replicate=rep(rep(1:4,each=2),time=2))
Now you can use geom_pointrange() to get points with confidence intervals and facet_wrap() to make plot for each group.
ggplot(example.df,aes(factor(replicate),
y=mean,ymin=mean-2*std.dev,ymax=mean+2*std.dev,color=factor(class)))+
geom_pointrange()+facet_wrap(~group)

Nested tables and calculating summary statistics with confidence intervals in R

This question is about the statistical program R.
Data
I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.
Here are some example data (thanks to Roland for this):
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
Question 1
I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.
Example
The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.
gender height_category n median_freckles LCI UCI
male tall # # # #
short # # # #
female tall # # # #
short # # # #
Question 2
Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.
What I've tried
I know that I can make a nested table using the MASS library and xtabs command:
ftable(xtabs(formula = ~ gender + height_category, data = study_data))
However, I'm not sure how to incorporate calculating the median of the number of freckles into this command and then getting it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but am not sure how to do this given that I can't calculate the data that I need in the first place.
You should really provide a reproducible example. Anyway, you may find library(plyr) helpful. Be careful with these confidence intervals because the Central Limit Theorem doesn't apply if n < 30.
library(plyr)
ddply(df, .(gender, height_category), summarize,
n=length(freckles), median_freckles=median(freckles),
LCI=qt(.025, df=length(freckles) - 1)*sd(freckles)/length(freckles)+mean(freckles),
UCI=qt(.975, df=length(freckles) - 1)*sd(freckles)/length(freckles)+mean(freckles))
EDIT: I forgot to add the bit on the plot. Assuming we save the previous result as tab:
library(ggplot2)
library(reshape)
m.tab <- melt(tab, id.vars=c("gender", "height_category"))
dodge <- position_dodge(width=0.9)
ggplot(m.tab, aes(fill=height_category, x=gender, y=median_freckles))+
geom_bar(position=dodge) + geom_errorbar(aes(ymax=UCI, ymin=LCI), position=dodge, width=0.25)
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
library(plyr)
res <- ddply(DF,.(gender,height_category),summarise,
n=length(na.omit(freckles)),
median_freckles=quantile(freckles,0.5,na.rm=TRUE),
LCI=quantile(freckles,0.025,na.rm=TRUE),
UCI=quantile(freckles,0.975,na.rm=TRUE))
library(ggplot2)
p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
group=height_category,fill=height_category)) +
geom_bar(stat="identity",position="dodge") +
geom_errorbar(position="dodge")
print(p1)
#a better plot that doesn't require to precalculate the stats
library(hmisc)
p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
print(p2)

Resources