Using R to perform a PCA on melted variables

I have a dataset where I've measured gene expression of 21 genes and also measured the output of 3 other assays. I have measured these for 8 different clones. I have also measured these on 5 different days.
However, I haven't measured every gene or assay on every day, or for every clone, so I have datasets of varying lengths. To combine them easily into one large dataset for a PCA, I melted each dataset and then row-bound them. I then standardized all the values. I now have a dataset that looks like the one below.
What I want to do is a PCA in which each level of "group" is treated as a variable. Then I'd like to create graphs where different colors of data points represent different "clones" or "days". I've pasted my sad attempt to get that working below. Any help would be appreciated!
set.seed(1)
# Creates variables for a dataset
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each=100),rep(c("1B","2C","3A"), each=200)))
day <- sample(c(rep(1,225),rep(2,25),rep(3,600),rep(4,25),rep(5,225)))
group <- sample(c(rep(paste0("gene",1:21), each=42),rep("assay1",90),rep("assay2",80),rep("assay3",48)))
value = rnorm(1100, mean=0, sd=3)
# Create data frame from variables
df <- data.frame(clone,day,group,value)
df$day <- as.factor(df$day)
# Create PCA data
df_PCA <- prcomp(clone + day + group ~ value, data = df, scale = FALSE)
# Graphing results of PCA
par(mfrow=c(2,3))
plot(df_PCA$x[,1:2], col=clone)
plot(df_PCA$x[,1:2], col=day)
plot(df_PCA$x[,1:3], col=clone)
plot(df_PCA$x[,1:3], col=day)
plot(df_PCA$x[,2:3], col=clone)
plot(df_PCA$x[,2:3], col=day)
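One possible way to get the attempt above running, offered as a sketch rather than a definitive answer: prcomp() expects a numeric matrix with one row per sample and one column per variable, so the melted data first has to be cast back to wide form. The sketch assumes that each clone/day combination is one sample, that repeated measurements in a cell can be averaged, and that unmeasured cells can be filled with the column mean. That last step is a crude assumption made only so that prcomp() will run. The names wide, sample_id, clone_f and day_f are illustrative.

```r
set.seed(1)
# Recreate the example data from the question
clone <- sample(c(rep(c("1A", "2A", "2B", "3B", "3C"), each = 100),
                  rep(c("1B", "2C", "3A"), each = 200)))
day   <- sample(c(rep(1, 225), rep(2, 25), rep(3, 600), rep(4, 25), rep(5, 225)))
group <- sample(c(rep(paste0("gene", 1:21), each = 42),
                  rep("assay1", 90), rep("assay2", 80), rep("assay3", 48)))
value <- rnorm(1100, mean = 0, sd = 3)
df <- data.frame(clone, day, group, value)

# One row per clone/day sample, one column per gene or assay;
# cells measured more than once are averaged, unmeasured cells become NA
sample_id <- paste(df$clone, df$day, sep = "_day")
wide <- tapply(df$value, list(sample_id, df$group), mean)

# ASSUMPTION: fill unmeasured cells with the column mean (crude imputation)
for (j in seq_len(ncol(wide))) {
  wide[is.na(wide[, j]), j] <- mean(wide[, j], na.rm = TRUE)
}

pca <- prcomp(wide, center = TRUE, scale. = FALSE)

# Recover clone and day for each row of the PCA scores
ids <- do.call(rbind, strsplit(rownames(wide), "_day", fixed = TRUE))
clone_f <- factor(ids[, 1])
day_f   <- factor(ids[, 2])

# Colours must be numeric (or colour names), so convert the factors
par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = as.integer(clone_f), pch = 19, main = "Coloured by clone")
legend("topright", legend = levels(clone_f),
       col = seq_along(levels(clone_f)), pch = 19, cex = 0.7)
plot(pca$x[, 1:2], col = as.integer(day_f), pch = 19, main = "Coloured by day")
legend("topright", legend = levels(day_f),
       col = seq_along(levels(day_f)), pch = 19, cex = 0.7)
```

With this much missing data the imputed cells will dominate some samples, so the resulting plot should be read with caution.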

Related

When setting your obsCovs for the function pcount (package unmarked) how does R "know" which obsCov observation corresponds to each y value?

I'm relatively new to R, particularly to this package. I am running N-mixture models to assess detection probability and abundance. I have abundance data, site covariates and observation covariates. There are three repeated observations (rounds) per site. The observation covariates are set up as columns (three columns per covariate, one for each round). The rows are individual sites. The abundance data are formatted similarly, with each column representing a different round. I've copied my code below.
y.abun2<-COYE[2:4]
obsCovs.ss <- list(temp=Covariate2021[3:5], Date=Covariate2021[13:15], Cloud=Covariate2021[17:19], Wind=Covariate2021[21:23],Observ=Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29,30,31,32)]
coyeabund<-unmarkedFramePCount(y=y.abun2, siteCovs = siteCovs.ss,
obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund@siteCovs$TreeCover <-
scale(coyeabund@siteCovs$TreeCover)
Moving on to my model I use this code:
abun.coye.full<-pcount(~TreeCover+temp+Date+Cloud+Wind+Observ ~ HHSDI+ProportionNH+Quality, coyeabund,mixture="NB", K=132,se=TRUE)
Is the model matching the observation covariates to the abundance measurements for each round? (i.e., is it able to tell that temp column 5 corresponds to the third round of abundance measurements?)
The models seem fine so far but I am so new at this I want to confirm that I haven't gone astray.
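As far as I understand the unmarked documentation, the matching is purely positional: when obsCovs is a list, each element is expected to be a sites-by-visits table with the same dimensions as y, and column j of each table is paired with column j of y. The base-R sketch below uses made-up numbers (not the real COYE or Covariate2021 data) to show that pairing by flattening both tables in site-major order. The object names are illustrative only.

```r
set.seed(42)
n_sites  <- 4
n_visits <- 3

# Made-up counts: one row per site, one column per round
y <- matrix(rpois(n_sites * n_visits, lambda = 3), nrow = n_sites,
            dimnames = list(NULL, paste0("round", 1:3)))

# Made-up observation covariate with the same site-by-round layout
temp <- matrix(round(rnorm(n_sites * n_visits, mean = 20, sd = 3), 1),
               nrow = n_sites,
               dimnames = list(NULL, paste0("temp", 1:3)))

# The two tables can only be paired by position if their shapes agree
stopifnot(identical(dim(y), dim(temp)))

# Flatten both in site-major order: site 1 rounds 1-3, site 2 rounds 1-3, ...
long <- data.frame(
  site  = rep(seq_len(n_sites), each = n_visits),
  visit = rep(seq_len(n_visits), times = n_sites),
  y     = as.vector(t(y)),
  temp  = as.vector(t(temp))
)
long
```

So Covariate2021[3:5] lines up with rounds 1 to 3 only if those three columns are stored in round order, which is worth checking by eye before trusting the model.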

PCA on Transposed Dataframe

I have a data set of 4 variables and 16 observations. I want to apply PCA over the first column. I have transposed my data frame so that I can run prcomp. I am using prcomp because it does not require more observations than variables.
#Transpose the dataframe
n <- SkillCount_df$Skill
t_SkillCount_df <- as.data.frame(t(SkillCount_df[, -1]))
colnames(t_SkillCount_df) <- n
str(t_SkillCount_df)
summary(t_SkillCount_df)
#Apply PCA
pca_salary <- prcomp(t_SkillCount_df, scale. = T, center = T)
The code runs without errors, but instead of getting 16 PCs I get 3.
The summary and structure of the data seem fine. All values are numeric, and the 16 variables I want to apply PCA to are the column names.
Can someone help me figure out what I'm doing wrong?
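A likely explanation, sketched with made-up data because SkillCount_df is not shown: after transposing, the matrix has only 3 rows (the 4 original columns minus Skill) and 16 columns. prcomp() returns at most min(nrow, ncol) components, and centering removes one more degree of freedom, so with 3 rows only 2 components carry any variance. The object name toy is hypothetical.

```r
set.seed(1)
# Stand-in for the transposed data: 3 observations (rows), 16 skills (columns)
toy <- matrix(rnorm(3 * 16), nrow = 3,
              dimnames = list(paste0("obs", 1:3), paste0("skill", 1:16)))

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Number of components equals min(nrow, ncol), here 3
length(pca_toy$sdev)
dim(pca_toy$rotation)

# The last standard deviation is numerically zero because centering
# leaves only nrow - 1 = 2 independent directions
pca_toy$sdev
```

Getting 16 components with non-zero variance would need at least 17 observations on those 16 variables.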

How to merge unsupervised hierarchical clustering result with the original data

I carried out an unsupervised hierarchical cluster analysis in R. My data are numbers in 3 columns and around 120,000 rows. I used cutree and identified 6 clusters. Now I need to return these clusters to the original data, i.e. add another column indicating the cluster group (1 of 6). How can I do that?
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 6 groups
sub_grp <- cutree(hc5, k = 6)
# Number of members in each cluster
table(sub_grp)
I need this because my data have spatial links, so I would like to map the clusters back to their locations on a map.
I appreciate your help.
The variable sub_grp is just a vector of cluster assignments so you can just add it to the data frame:
data(iris) # Data frame available in base R.
str(iris)
d <- dist(iris[, -5]) # Column 5 is the species name so we drop it
hc5 <- hclust(d, method="ward.D2")
sub_grp <- cutree(hc5, k=3)
str(sub_grp)
iris$grp <- sub_grp
str(iris)
aggregate(iris[, 1:4], by=list(iris$grp), mean)
xtabs(~grp+Species, iris)
The last two commands compute the means by group for the 4 numeric variables and cross-tabulate the cluster assignments with the known species. You don't actually need to add the cluster assignment to the data frame. R lets you combine variables from different objects as long as they have the same length and are in the same row order.
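To illustrate that last point, here is a small sketch of the same summaries using sub_grp directly, without adding a column to iris. It relies on cutree() returning one assignment per row of the data passed to dist(), in the same order.

```r
data(iris)
d <- dist(iris[, 1:4])                 # distances on the 4 numeric columns
hc5 <- hclust(d, method = "ward.D2")
sub_grp <- cutree(hc5, k = 3)

# One cluster label per original row, in the original row order
stopifnot(length(sub_grp) == nrow(iris))

# Group means, using the free-standing vector as the grouping variable
grp_means <- aggregate(iris[, 1:4], by = list(grp = sub_grp), FUN = mean)
grp_means

# Cross-tabulate clusters against the known species
table(cluster = sub_grp, species = iris$Species)
```

If the rows of the data frame are reordered or filtered after clustering, the vector no longer lines up, which is one reason to store it as a column.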

Clustering and Heatmap on microarray data using R

I have a file with the results of a microarray expression experiment. The first column holds the gene names. The next 15 columns are 7 samples from the post-mortem brain of people with Down's syndrome, and 8 from people not having Down's syndrome. The data are normalized. I would like to know which genes are differentially expressed between the groups.
There are two groups and the data is nearly normally distributed, so a t-test has been performed for each gene. The p-values were added in another column at the end. Afterwards, I did a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap with gene names on the rows and some meaningful names on the samples (columns)
I have written this code for the moment:
ds <- read.table("down_syndroms.txt", header=T)
names(ds) <- c("Gene",paste0("Down",1:7),paste0("Control",1:8), "pvalues")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with an FDR <= 0.05
sum(pvadj<=0.05)
[1] 5641
# Cluster the data
ds_matrix<-as.matrix(ds[,2:18])
ds_dist_matrix<-dist(ds_matrix)
my_clustering<-hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace='none', margins=c(12,12))
The heatmap I have done doesn't look the way I would like. Also, I think I should remove the pvalues from it. Besides, R usually crashes when I try to plot the clustering (probably due to the big size of the data file, with more than 22 thousand genes).
How could I do a better looking tree (clustering) and heatmap?
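One way this might be approached, as a sketch on simulated data because the real file is not available: keep only the genes that pass the FDR cut-off, leave the p-value column out of the matrix, and cluster that much smaller matrix. Everything about the simulated data (200 genes, an added shift of 4 in the first 20 genes) is an assumption for illustration. Base heatmap() is used to keep the example dependency-free; heatmap.2() from gplots accepts the same matrix and also has a scale = "row" option.

```r
set.seed(123)
n_genes <- 200
samples <- c(paste0("Down", 1:7), paste0("Control", 1:8))

# ASSUMPTION: simulated expression matrix, genes in rows, samples in columns
expr <- matrix(rnorm(n_genes * 15), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), samples))
expr[1:20, 1:7] <- expr[1:20, 1:7] + 4   # simulated group effect in 20 genes

# Per-gene t-test and Benjamini-Hochberg correction, as in the question
is_down <- grepl("^Down", colnames(expr))
pvals <- apply(expr, 1, function(x) t.test(x[is_down], x[!is_down])$p.value)
pvadj <- p.adjust(pvals, method = "BH")

# Keep only significant genes; the p-values never enter the matrix
sig <- expr[pvadj <= 0.05, , drop = FALSE]

# Cluster the samples on the significant genes and compare with the groups
sample_tree <- hclust(dist(t(sig)))
sample_clusters <- cutree(sample_tree, k = 2)
table(cluster = sample_clusters, group = ifelse(is_down, "Down", "Control"))

# Heatmap with gene names on rows and sample names on columns,
# scaling each gene so colours show relative expression
heatmap(sig, scale = "row", margins = c(8, 6))
```

Subsetting before clustering also avoids computing distances between all 22 thousand genes, which is probably what makes R struggle on the full data.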

R regression over columns with fixed deltas

I have a data frame in R, df, where each row, X, is a subject (N = 100) and each column, S, is that subject's score on a task in a given month over a span of two years. Thus I have a data frame of 100 subjects and 24 observations evenly spaced at 1-month intervals (ignoring month/day variance).
Question1: how do I fit a line (linear regression) to each subject? I have trouble understanding how to do this over columns, as opposed to rows within a column.
Question2: how do I fit a line (linear regression) to the whole data set? I ask because I would like to segment the dataset into groups A and B (i.e. a column is labeled as condition: {A,B}), and fit a line to each subset of subject over the 24 timepoints.
Apologies if this is a simple question.
I constructed a dataset based on your description. If this is useful, perhaps include it in your question itself.
df<- as.data.frame(matrix(rep(1:24,100)+rnorm(2400),nrow=100,byrow=T))
names(df)<- paste("S",1:24,sep="")
df$ID<-1:100
df$group <- as.factor(sample(c("A","B"),100,replace=T))
Now melt your data frame to get the S1 to S24 columns as a factor variable.
library(reshape2)
m<- melt(df,id.vars=c("ID","group"))
Then you can use the following kind of call to examine a linear model of time for a particular ID. You can use lapply to do this in one shot for all IDs.
summary(lm(value~as.numeric(variable), data=m, subset=ID==5))
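The lapply approach mentioned above could look something like this. It is a sketch that rebuilds the long data in base R (so it runs without reshape2) and stores time as a numeric column named time, which is my own naming rather than anything from the question.

```r
set.seed(1)
# Same simulated data as above: 100 subjects, 24 monthly scores
df <- as.data.frame(matrix(rep(1:24, 100) + rnorm(2400), nrow = 100, byrow = TRUE))
names(df) <- paste0("S", 1:24)
df$ID <- 1:100
df$group <- factor(sample(c("A", "B"), 100, replace = TRUE))

# Long format: columns S1..S24 stacked one after another
m <- data.frame(
  ID    = rep(df$ID, times = 24),
  group = rep(df$group, times = 24),
  time  = rep(1:24, each = 100),
  value = unlist(df[paste0("S", 1:24)], use.names = FALSE)
)

# One linear model per subject
fits <- lapply(split(m, m$ID), function(d) lm(value ~ time, data = d))

# Collect intercepts and slopes into a 100 x 2 matrix
coefs <- t(sapply(fits, coef))
head(coefs)
```

Each element of fits is a full lm object, so summary(fits[["5"]]) gives the same kind of output as the single-subject call.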
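And this will model all items as predicted by group. Note that lm() does not coerce the group factor to numeric; it dummy-codes it using treatment contrasts, so A is the reference level (absorbed into the intercept) and the groupB coefficient is the difference in mean value between B and A.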
summary(lm(value~group, data=m))
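Since the question asks for a line over the 24 time points for each group, a model that includes time may be closer to what is wanted. The sketch below is one possibility, not the only one: it uses a numeric time column (my own naming) and a time-by-group interaction, which gives each group its own intercept and slope.

```r
set.seed(1)
# Same simulated data as above
df <- as.data.frame(matrix(rep(1:24, 100) + rnorm(2400), nrow = 100, byrow = TRUE))
names(df) <- paste0("S", 1:24)
df$ID <- 1:100
df$group <- factor(sample(c("A", "B"), 100, replace = TRUE))

# Long format built in base R, with time as a number
m <- data.frame(
  ID    = rep(df$ID, times = 24),
  group = rep(df$group, times = 24),
  time  = rep(1:24, each = 100),
  value = unlist(df[paste0("S", 1:24)], use.names = FALSE)
)

# One model, separate line per group via the interaction
fit <- lm(value ~ time * group, data = m)
coef(fit)

# Equivalent per-group fits, for comparison
by_group <- lapply(split(m, m$group), function(d) coef(lm(value ~ time, data = d)))
by_group
```

The time coefficient is the slope for group A, and time:groupB is how much the slope for B differs from it.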
