Colouring a PCA plot by clusters in R - r

I have some biological data that looks like this, with 2 different types of clusters (A and B):
Cluster_ID A1 A2 A3 B1 B2 B3
5 chr5:100947454..100947489,+ 3.31322 7.52365 3.67255 21.15730 8.732710 17.42640
12 chr5:101227760..101227782,+ 1.48223 3.76182 5.11534 15.71680 4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040 6.85094 15.551400 7.18341
I clean up the data:
CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)
#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]
#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)
The data size is reduced from 6981 clusters to 599 after this.
I then go on to apply PCA:
#Get data dimensions
dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)
I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.
This is what I have tried, to erroneous results:
qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))
The errors appear as such:
> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")
Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]
> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data
> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
Error: ggplot2 doesn't know how to deal with data of class

Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principle components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data-set.
pca <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)
So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principle components in the columns. So then we run k-means clustering based on the first 2 PCs, and extract the cluster numbers into clust. Then we use ggibplot(...) to make the plot.

Related

Plotting two principal component score vectors, using a different color to indicate three unique classes

After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.
I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20)
#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
dim(x)
[1] 60 50
dim(xymatrix)
[1] 60 51
pca=prcomp(xymatrix, scale=TRUE)
How should I correctly plot and color this principal component analysis as noted above? Thank you.
If I understand your question correctly, ggparcoord in Gally package would help you.
library(GGally)
y <- rep(c(1,2,3), 20)
# matrix of 50 variables i.e. 50 columns and 60 rows
# i.e. 60x50 dimensions (=3000 table cells)
x <- matrix(rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale=TRUE)
# Principal components score and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))
# Plot the first two principal component scores of each samples
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
However, I think it makes more sense to do PCA on x rather than xymatrix that includes the target y. So the following codes should be more appropriate in your case.
pca <- prcomp(x, scale=TRUE)
pc_label <- data.frame(pca$x, y=as.factor(y))
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
If you want a scatter plot of first two principal component scores, you can do it using ggplot.
library(ggplot2)
ggplot(data=pc_label) +
geom_point(aes(x=PC1, y=PC2, colour=y))
Here's a base R solution, to show how simply this can be done. First do the PCA on the x matrix only and from the resulting object get a matrix of the transformed variables which we'll call PCs.
x <- matrix(rnorm(3000), ncol=50)
pca <- prcomp(x, scale=TRUE)
PCs <- as.matrix(pca$x)
Now we can make vector of colour names based on your y for the labels.
col.labs <- rep(c("Green", "Blue", "Red"), 20)
Now just plot as a scatter, passing the colour vector to col.
plot(PCs[, 1], PCs[, 2], col=col.labs, pch=19, xlab = "Scores on PC1", ylab="Scores on PC2")

Error message while plotting density functions in ggplot

I had a data frame with 750 observations and 250 columns, and I would like to plot two density plots on top of each other. In one case, a particular factor is present, in the other it isn't (commercial activities against non-commercial activities).
I created a subset of the data
CommercialActivityData <- subset(MbadSurvey, Q2== 1)
NonCommercialActivityData <- subset(MbadSurvey, Q2== 2)
I then tried to plot this as follows
p1 <- ggplot(CommercialActivityData, aes(x = water_use_PP)) + geom_density()
p1
However, when I do, I get the following error message
Error: Aesthetics must be either length 1 or the same as the data (51): x
I have 51 data values where there is commercial, and 699 where there isn't.
EDIT: new code!!
I don't have access to your data set so I have simulated your data:
# Creating the data frame
MbadSurvey <- data.frame("water_use_PP"=runif(1000,1,100),
"Q2"=as.factor(round(runif(1000,1,2),0)))
# Requiring the package
require(ggplot2)
# Creating 3 different density plots based on the Species
p1 <- ggplot(MbadSurvey, aes(x = water_use_PP,colour = Q2)) + geom_density()
p1
NOTE: The variable Q2 must be a factor!

Plotting gene expression data with means in a randomized experiment

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.
Example gene:
IL10_BL
IL10_1Y
IL10_fold
Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0.
100 patients have been randomized to either a statin or diet regime.
I would like to do the following plot:
- Y axis should display the mean gene expression with 95% confidence limit
- X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice of the dots for the same gene would be connected with a line.
I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would greatly appreciate any help (or link to a similar thread) to do this.
First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, orfold.
(your command creates an empty column you might need to get rid of first: genes$X = NULL)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order that you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).
ggplot(genes.long, aes(measurement, expression)) +
stat_summary(fun.y = mean, geom='point') +
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~treatment+gene)
You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.

Plot multiple scatter matrices in one plot

I feel like this question has been asked many times before, however from the questions I've looked at, none of the solutions so far have worked for me.
I wish to plot the values of two correlation matrices as scatter plots, next to each other in one plot with the same y range (from 0 - 1).
My original data are time series spanning several years of 112 companies, which I've split into two subsets, period A & period B. The original data is a zoo object.
I have then created the correlation matrices for both periods:
corr_A <- cor(series_A)
corr_B <- cor(series_B)
For further data analysis, I have removed the double entries:
corr_A[lower.tri(corr_A, diag=TRUE)] <- NA
corr_A <- as.vector(corr_A)
corr_A <- corr_A[!is.na(corr_A)]
corr_B[lower.tri(corr_B, diag=TRUE)] <- NA
corr_B <- as.vector(corr_B)
corr_B <- corr_B[!is.na(corr_B)]
As a result, I have two vectors, each with a length of 6216 (111 + 110 + 109 + .. + 1 = 6216).
I have then combined these vectors into a matrix:
correlation <- matrix(c(corr_A, corr_B), nrow=6216)
colnames(correlation) <- c("period_A", "period_B")
I would now like to plot this matrix so the result looks similar to this picture:
I've tried to plot using xyplot from lattice:
xyplot(period_A + period_B ~ X, correlation)
However, in the resulting plot, the two scatter plots are stacked over each other:
I have also tried changing the matrix itself - instead of using 6216 rows, I have used 12432 rows, and then index'd the first 6512 rows as "period_A" and the last 6512 rows as "period_B" - the resulting plot looks quite similar to my desired plot:
Is there any way I can create my desired plot using xyplot? Or are there any other (ggplot, car) methods of generating the plot?
Edit (added sample data for reproducible example):
head(correlation) #data frame with 6216 rows, 3 columns
X period_A period_B
1 0.5 0.4
2 0.3 0.6
3 0.2 0.4
4 0.6 0.6
I've finally found a solution:
https://stats.stackexchange.com/questions/63203/boxplot-equivalent-for-heavy-tailed-distributions
First of all, we stack the data.
correlation <- stack(correlation)
Then, we use stripplot (from lattice) combiend with jitter=TRUE to create the desired plot.
stripplot(correlation$values ~ correlation$ind, jitter=T)
The resulting plot looks exactly like my desired plot and can be manipulated using the standard lattice/plot commands (ylab, xlab, xlim, ylim,scales, etc.).

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
the success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
Using #BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

Resources