How to perform a three-way PCA in R

I would like to perform a three-way principal component analysis in R, and though I have found a few articles explaining how it works and how to interpret results, I cannot find any useful guides online on how to do it in R.
My data consists of 230 samples, 250,000 variables and 50 annotations. Usually people just do a standard PCA using only one annotation on the following type of data:
Standard data:
var1 var2 var3 var4
Sample1 1/1 0/0 1/1 1/0
Sample2 1/0 1/1 1/1 1/0
Sample3 0/0 1/1 1/1 1/1
Sample4 0/0 0/0 1/1 0/0
Sample5 1/0 1/0 0/0 1/1
However, I would like to incorporate all of the annotation information into the analysis, so that all 50 matrices are used in combination. That way a combination of annotations may explain more of the variance between samples than a single annotation does alone, e.g. annotations 1 and 4 together explain more variance than annotation 1 alone.
Annotation 1:
var1 var2 var3 var4
Sample1 1/1 0/0 1/1 1/0
Sample2 1/0 1/1 1/1 1/0
Sample3 0/0 1/1 1/1 1/1
Sample4 0/0 0/0 1/1 0/0
Sample5 1/0 1/0 0/0 1/1
Annotation 2:
var1 var2 var3 var4
Sample1 missense none STOP synonymous
Sample2 missense missense STOP synonymous
Sample3 none missense STOP synonymous
Sample4 none none STOP none
Sample5 missense missense none synonymous
Annotation 3:
var1 var2 var3 var4
Sample1 0.30 0.00 0.01 0.04
Sample2 0.30 -0.24 0.01 0.04
Sample3 0.00 -0.24 0.01 0.04
Sample4 0.00 -0.24 0.01 0.00
Sample5 0.30 -0.24 0.00 0.04
Annotation 4:
var1 var2 var3 var4
Sample1 CTCF NONE NONE MAX
Sample2 CTCF NONE NONE MAX
Sample3 NONE NONE NONE MAX
Sample4 NONE NONE NONE NONE
Sample5 CTCF NONE NONE MAX
From what I have found, there are three packages that can do the Tucker 3-way PCA: ThreeWay, PTAk and rTensor. I have tried running ThreeWay, but the data structure it uses seems very ugly to work with. Maybe I could make this work, but the example in the ThreeWay article also generated an error, so I would prefer another package:
ThreeWay data structure:
var1_anno1 var1_anno2 var1_anno3 var2_anno1 var2_anno2
Sample1 1/1 missense 0.30 0/0 missense
Sample2 1/0 missense 0.30 1/1 missense
Sample3 0/0 none 0.30 1/1 missense
Sample4 0/0 none 0.30 0/0 none
Sample5 1/0 missense 0.30 1/0 missense
The PTAk package requires:
"a tensor (as an array) of order k, if non-identity metrics are used X is a list with data as the array and met a list of metrics"
It is not clear to me what this means. I looked into the tensor package to learn how to generate a tensor, but its example is very convoluted: it performs tons of multiplications on various tensors rather than explaining the basics of how to create a tensor from one's own data.
I would appreciate comments both on the weaknesses of this approach and on how to create tensors and analyse them using any of these packages.
Thanks

For all the functions PTAk(), FCAk(), PCAn() and CANDPARA() (for CANDECOMP/PARAFAC), the data input is an array, most likely generated with the function array().
When preparing the data to be read by array(), you need to remember that (as with matrix() and its rows and columns) there is an order to the indices, the first running faster than the next, etc.
So as.vector(), cbind(), rbind() and all the other data-manipulation functions can be used to prepare the data to be read by array(), possibly followed by abind() (package abind) to combine several arrays.
Some examples are given in the JSS paper
e.g. abind(x1, x2, x3, x4, ..., xp, along = 3) is indeed quite convenient for creating a 3-way array of dimension n x q x p from a series of p matrices x1, x2, ..., xp, each of dimension n x q.
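For example, a minimal sketch with made-up dimensions (n = 5, q = 3, p = 4):
library(abind)  # for abind()

# Four n x q matrices (5 x 3 here) standing in for the p slices
x1 <- matrix(rnorm(15), nrow = 5)
x2 <- matrix(rnorm(15), nrow = 5)
x3 <- matrix(rnorm(15), nrow = 5)
x4 <- matrix(rnorm(15), nrow = 5)

# Stack them into a 5 x 3 x 4 array (n x q x p)
X <- abind(x1, x2, x3, x4, along = 3)
dim(X)  # 5 3 4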

I ended up using the PTAk package for running the analysis.
To build the tensors, I used the two packages tensor and abind.
I built the tensors (a.k.a. multi-way arrays) by creating a vector from each sample's matrix and then re-defining its dimensions as a three-dimensional array. The function abind() was then used to merge the arrays from each individual into the final three-dimensional tensor.
library(abind)  # provides abind() for combining arrays

for (i in seq_along(list_of_sample_matrices)) {
# Convert the i-th matrix into a single-sample tensor (variables x annotations x 1)
single_sample_tensor <- array(as.vector(list_of_sample_matrices[[i]]), c(250000, 50, 1))
if (i == 1) {
# Initialise the all-sample tensor with the first sample
all_sample_tensor <- single_sample_tensor
} else {
# Append one sample at a time along the third dimension
all_sample_tensor <- abind(all_sample_tensor, single_sample_tensor, along = 3)
}
}
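With the tensor assembled, the decomposition itself is then a single call. A hedged sketch only; the nbPT/nbPT2 values below are illustrative, see ?PTAk for the exact arguments:
library(PTAk)

# Hedged sketch: decompose the 250000 x 50 x 230 tensor.
# nbPT and nbPT2 control how many principal tensors are extracted;
# the values below are illustrative, not tuned.
result <- PTAk(all_sample_tensor, nbPT = 2, nbPT2 = 1)
summary(result)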

Related

Issues using a confusion matrix

I'm trying to use the confusion matrix from library(caret) to determine which column is more accurate, and I'm running into trouble. I'm trying to see if column df$G5 is more accurate than df$G9 when compared to df$GE. The methods I've tried in the past aren't working and I'm not sure how to proceed with the matrix. The main error I keep running into is "Error: data and reference should be factors with the same levels".
df <-
C P I R A S GE G5 A5 G9 A9 AF
1 8 163302 rs141069412 CAT C NONE 1/1 1/1 1 <NA> NA 9.33843e-01
2 8 163366 rs34810249 T C NONE 0/1 0/1 1 1/0 1 2.07735e-01
3 8 163370 rs7844253 C G NONE 1/1 1/1 1 1/1 1 9.28438e-01
4 8 163387 rs3008286 C T NONE 0/1 0/1 1 0/1 1 7.17963e-01
5 8 163432 rs3008285 A G NONE 0/1 0/0 0 <NA> NA 1.02935e-01
6 8 163438 rs7844396 C T NONE 1/1 1/1 1 1/1 1 9.28281e-01
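A hedged sketch of the usual fix for that error (assuming library(caret)'s confusionMatrix() and that df$G5 is being compared against df$GE): make both columns factors over an identical level set before calling confusionMatrix().
library(caret)

# Sketch, not a verified fix: confusionMatrix() insists that data and
# reference are factors with the same levels, so build a common level set.
common_levels <- sort(union(unique(na.omit(df$G5)), unique(na.omit(df$GE))))
pred <- factor(df$G5, levels = common_levels)
ref  <- factor(df$GE, levels = common_levels)
confusionMatrix(pred, ref)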

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals on these 30 variables into 3 different groups.
Problem: despite extensive research, I could not find out how to extract the loading factors from PCA_loadings and give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual for further classification. Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, which I would sum up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I do not know how to assemble the 30 values from the loading factors into a score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
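If you also need the rank and the three categories from your expected results, here is one possible follow-up sketch (bearing in mind the sign caveat above):
# Rank individuals by PC1, then cut the ranks into three roughly
# equal-sized groups; whether low PC1 should be "bottom" depends on
# the signs of the loadings, so inspect them first.
df$rank <- rank(df$PC1)
df$category <- cut(df$rank, breaks = 3, labels = c("bottom", "middle", "top"))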
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.

2 numbers in R not equal despite being the same, fails in left_join

I have a strange problem: when trying to do a left_join from dplyr between two data frames, say table_a and table_b, which have the column C in common, I get lots of NAs, except where the values are zero in both, even though the values in the rows match far more often.
One thing I did notice was that in the C column of table_b, on which I would like to match, 0 is displayed as 0.0, whereas in table_a it is displayed simply as 0.
A sample is here
head(table_a) gives
likelihood_ols LR_statistic_ols decision_ols C
1 -1.51591 0.20246 0 -10
2 -1.51591 0.07724 0 -9
3 -1.51591 0.00918 0 -8
4 -1.51591 0.00924 0 -7
5 -1.51591 0.08834 0 -6
6 -1.51591 0.25694 0 -5
and the other one is here
head(table_b)
quantile C pctile
1 2.96406 0.0 90
2 4.12252 0.0 95
3 6.90776 0.0 99
4 2.78129 -1.8 90
5 3.92385 -1.8 95
6 6.77284 -1.8 99
Now, there are definitely overlaps between the C columns but only the zeroes are found, which is confusing.
When I subset the unique values in the C columns according to
a <- sort(unique(table_a$C)) and b <- sort(unique(table_b$C)) I get the following confusing output:
> a[2]
[1] -9
> b[56]
[1] -9
> a[2]==b[56]
[1] FALSE
What is going on here? I am reading in the values using read.csv, and the CSVs were generated once on CentOS and once on RedHat/Fedora, if that plays a role at all. I have tried forcing them to be tibbles, or reading them first as characters and then converting to numerics; I have also checked all of R's classes and the types discussed here, but to no avail: they all match.
What else could make them different and how do I tell R that they are so I can run my merge function?
Just because two floating point numbers print out the same doesn't mean they are identical.
A simple enough solution is to round, e.g.:
table_a$new_a_likelihood_ols <- signif(table_a$likelihood_ols, 6)
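Applied to the join problem above, a sketch (assuming the mismatch really is floating-point noise in the shared column C):
library(dplyr)

# Classic demonstration that equal-looking numbers can differ:
0.1 + 0.2 == 0.3               # FALSE
print(0.1 + 0.2, digits = 17)  # 0.30000000000000004

# Sketch of a fix: round both join columns to the same precision
# before joining, so tiny floating-point differences vanish.
table_a$C <- round(table_a$C, 6)
table_b$C <- round(table_b$C, 6)
joined <- left_join(table_a, table_b, by = "C")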

Community detection in very large networks

I have a very large network with 50,000 nodes. It's sparse. I want to find communities in it, using R. How do I do so? Thank you!
(I tried using igraph, but it won't work because the adjacency matrix is too large.)
The dataset looks like this:
1 0 0 1 0 0 0 1
0 1 0 1 0 1 1 0
...
It's 50,000 x 80.
I computed the correlation between every pair of rows, creating a correlation matrix that looks like this:
0.14 0.26 0.36
0.24 0.79 0.36
...
It's 50,000 x 50,000.
Then I put it into igraph:
output2 <- matrix(ifelse(runif(50000*80) < 0.2, 1, 0), 50000, 80)  # random binary sparse matrix
x2 <- graph.adjacency(cor(t(output2)), weighted = TRUE, diag = FALSE)
x2 <- delete.vertices(x2, which(degree(x2) < 1))
x2 <- as.undirected(x2)
b2 <- walktrap.community(x2)
k3 <- groups(b2)
However, igraph says it can't create the adjacency graph, as it's too large.
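A hedged sketch of one workaround (the threshold and block size below are arbitrary illustrations, not recommendations): compute the correlations in blocks of rows, keep only edges above a cutoff, and build the graph from the resulting edge list, so the dense 50,000 x 50,000 matrix never has to exist in memory at once.
library(igraph)

set.seed(1)
n <- 50000
output2 <- matrix(ifelse(runif(n * 80) < 0.2, 1, 0), n, 80)

threshold <- 0.8  # arbitrary cutoff, tune for your data
block <- 500      # rows of the correlation matrix computed at a time
edges <- list()

for (start in seq(1, n, by = block)) {
  idx <- start:min(start + block - 1, n)
  # correlations of this block of rows against all rows (block x n)
  cors <- cor(t(output2[idx, , drop = FALSE]), t(output2))
  hits <- which(cors > threshold, arr.ind = TRUE)
  # keep each undirected edge once and skip self-correlations
  keep <- idx[hits[, "row"]] < hits[, "col"]
  edges[[length(edges) + 1]] <- cbind(idx[hits[keep, "row"]], hits[keep, "col"])
}

g <- graph_from_edgelist(do.call(rbind, edges), directed = FALSE)
communities <- cluster_louvain(g)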

How to convert different numbers per words in different columns (unix)

I have a big file with 28 columns containing 3 different codes (0/0, 1/1 and 0/1) that I want to convert to words. The file has millions of lines, each one beginning with "Chr":
Chr10_102 T G 999 DP 38 DP4 37 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/0
Chr1_111 C T 999 DP 37 DP4 37 0/1 1/1 0/0 0/1 0/1 0/1 0/1 0/1 0/0 0/1 0/1 0/0 0/0 0/1 1/1 1/1 0/1 0/1 0/0 1/1 0/0 0/0 0/1 0/1 0/1 0/1 1/1 0/1 ...
I want to convert the codes in each of the 28 columns and all lines as follow:
0/0 to no_variant
1/1 to homo
0/1 to het
How can I do that? I have converted a file before, but then I had only one column with 2 codes (0/1 and 1/1); now I have 28 columns and 3 codes to convert. I used
awk '{if ($9=="0/1") {print $0,"het"} else{print $0}}' | awk '{if ($9=="1/1") {print $0,"hom"} else{print $0}}'
thanks very much
Clarissa
sed 's|0/0|no_variant|g; s|1/1|homo|g; s|0/1|het|g' file
As awk, that would be
awk '{gsub("0/0","no_variant"); gsub("1/1","homo"); gsub("0/1","het")} 1' file
If you need to go column-by-column for some reason, use a for-loop:
awk '
BEGIN {c["0/0"] = "no_variant"; c["0/1"] = "het"; c["1/1"] = "homo"}
{for (n=9; n<=NF; n++) if ($n in c) $n = c[$n]; print}
' file
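Either way, the first sample line above becomes:
Chr10_102 T G 999 DP 38 DP4 37 no_variant no_variant het no_variant no_variant no_variant no_variant no_variant no_variant het no_variant het no_variant het no_variant no_variant no_variant no_variant het no_variant no_variant no_variant no_variant het no_variant het no_variant no_variant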
