Multiple Matrix Operations in R with loop based on matrix name - r

I'm a novice R user, who's learning to use this coding language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I managed to get the following:
# kn.matrices<-with(patents, table(Class,year,firm))
# kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
# for (k in 1:37)
# kn.firms = assign(paste("firm", k ,sep=''),kn.matrices[,,k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk with k going from 1 until 37.
I would like to perform many operations in each of the firm-specific matrices (e.g. compare the numbers in app.yr 't' with the average of the 3 previous years across all rows) so I am looking for a way that allows me to loop the operations for every matrix named firm1,firm2,firm3...,firm37 and that generates new matrices with consistent naming, e.g. firm1.3yearcomparison
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following comments I'm trying to add a minimal reproducible example
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.

it looks like you already have a 3 dimensional array with all your data. You can basically view this as your 38 matrices all piled one on top of the other. You don't want to split this into 38 matrices and use loops. Instead, you can use R's apply function and extraction functions. Just view the help topic on the apply() family and it should show you how to do what you want. Here are a few basic examples
examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])

Related

Translating a for-loop to perhaps an apply through a list

I have a r code question that has kept me from completing several tasks for the last year, but I am relatively new to r. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a "for" loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n=100
d = NULL
col = c(0, .3, .5)
for (j in 1:length(col)){
X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
x=rmvnorm(n, mean=c(0,0), sigma=X.corr)
x1=x[,1]
x2=x[,2]
}
d = rbind(d, c(j))
Let me describe my code, so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables from the mvtnorm function with 3 different correlation levels per pass using 100 observations [toy data to get the coding correct]. d is a empty data frame. The 3 correlation levels will occur in the following way pass 1 uses correlation 0 then create the variables, and yes other code will occur; pass 2 uses correlation .3 to create 2 new variables, and then other code will occur; pass 3 uses correlation .5 to create 2 new variables, and then other code will occur. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize as presented here it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired by putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5). To reiterate, the for-loop gets the job done, but I believe there is a better way--perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.

How to remove some values from a 4-dimensional matrix?

I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R. I need to remove some values from the matrix. For example, for year 1 I need to remove simulations number 1 and 2; for year 2 I need to remove simulation number 5.
Can anyone suggest me how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged", there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1,1:2,,] <- NA or a[2,5,,] <- NA in your examples.
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...),
Year2 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...))
Then you could easily remove years or simulations within years (by setting them to NULL, but it would make indexing a little bit harder (e.g. "retrieve Simulation1 values for all years" would require an lapply or a loop across years).

How to create adjacency matrix for gene-gene interactions from RNA-Seq (circlize input)

I'm profiling tumor microenvironment and I want to show interactions between subpopulations that I found. I have a list of receptors and ligands for example, and I want to show that population A expresses ligand 1 and population C expresses receptor 1 so there's likely an interaction between these two populations through the expression of ligand-receptor 1.
I have been trying to use circlize to visualize these interactions by making a chordDiagram, but it requires an adjacency matrix as input and I do not understand how to create the matrix. The adjacency matrix is supposed to show the strength of the relationship between any two genes in my matrix. I have 6 unique populations of cells that can express any of the 485 ligands/receptors that I am interested in, and the goal is to show interactions between these populations through the ligands and receptors.
I found a tool to use in RStudio called BUS- gene.similarity: Calculate adjacency matrix for gene-gene interaction.
Maybe I am just using BUS incorrectly but it says: For gene expression data with M genes and N experiments, the adjacency matrix is in size of MxM. An adjacency matrix in size of MxM with rows and columns both standing for genes. Element in row i and column j indicates the similarity between gene i and gene j.
So, I made a matrix where each column is a subpopulation and each row is a ligand/receptor I want to show interactions with. The cells have expression values and it looks like this:
> head(Test)
A B C D E F
Adam10 440.755990 669.875468 748.7313995 702.991422 1872.033343 2515.074366
Adam17 369.813134 292.625603 363.0301707 434.905968 1183.152694 1375.424034
Agt 12.676036 28.269671 9.2428034 19.920561 121.587010 168.116735
Angpt1 22.807415 42.350205 25.5464603 16.010813 194.620550 99.383567
Angpt2 92.492760 186.167844 819.3679836 852.666499 669.642441 1608.748788
Angpt4 3.327743 0.693985 0.8292746 1.112826 5.463647 5.826927
Where A-F are my populations. Then I pass this matrix to BUS:
res<-gene.similarity(Test,measure="corr",net.trim="none")
Warning message:
In cor(mat) : the standard deviation is zero
But the output file which is supposed to be my adjacency matrix is full of NA's:
Adam10 Adam17
Adam10 1 NA
Adam17 NA 1
I thought maybe my matrix was too complex, so I compared only 2 cell populations with my ligands/receptors, but I get the exact same output.
I was expecting to get something like:
A:Adam10 A:Adam17
C:Adam10 6 1
E:Adam17 2 10
But, even if the res object gave me numbers instead of NA it does not maintain the identity of the population when making relationships amongst genes so it still would not produce my expected output.
I do not have to use BUS to make the matrix, so I don't necessarily need help troubleshooting that code, I just need SOME way to make an adjacency matrix.
I've never used circlize or Circos before so I apologize if my question is stupid.
Seems like you need to transform you matrix a little.
you can create a new matrix which has size (nrow(Test) x ncol(Text)) x (nrow(Test) x ncol(Text)), so in the example you gave, the new matrix will be 36x36, and the colnames and rownames will be the same which are A_Adam10, A_Adam17,..., A_Angpt4, B_Adam10,..., F_Angpt4.
With the help of a loop, you can load the similarity of each pair into the new matrix, and now you can plot the matrix. It's a little complicated, also takes a while to run the loop, but it's intuitive.
You're welcomed to check my github repo since I had a similar problem not too long ago, and I posted detailed code on there. I hope this will help you

Most efficient way to randomize a matrix in R or in Python

I'm working with a numeric matrix M in R which is quite big (11000 rows per 20 columns). On this matrix, I'm performing a lot of correlation tests
=> the function cor.test(M[i,], M[j,], method='spearman') where i and j are two rows from the matrix (all possible combinations are tested).
The problem as you know is that I'm doing too many tests to get a very reliable p-value returned by this test.
My strategy to overcome this limitation would be to generate a new probability distribution by Bootstrap on my matrix M: I would like to get 100 random matrices generated from M to do the multiple correlations on these matrices and choose the right cut-off for the p-value to get a FDR of 5%.
My question is:
What is the most efficient way to randomize my matrix?
Since it's quite time consumming (I suppose) it could be interresting if the solution could be parallelized.
Thank you in advance for all the usefull answers that you'll provide to me.
In python there is a function random.sample() in module random. If you store M as list of rows, randomly sampling n rows from matrix M without replacement would be like this
M_sample = random.sample(M,n)
However, for bootstrapping, you might want to do random sampling with replacement. To do this, you can use numpy.random.choice():
import numpy
M_sample = numpy.random.choice(M,n,replace=True)
In R, we use sample() to randomly decide the row indices to take, and then use row access to take the rows from the matrices. Randomly sampling n rows from matrix M without replacement is done as follows:
indices = sample(nrow(M), n,replace=FALSE)
M_sample = M[indices, ]
And for randomly sampling with replacement, replace the first line with this:
indices = sample(nrow(M), n,replace=TRUE)

Iterate process in R using range of vectors derived from matrix

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two sets of data matrices. "A":
[Ac1] [Ac2] ... [Ac500]
[Ac1] 25 30 ... 15
[Ar2] 7 54 ... 41
...
[cr25000]
and
"B" which is similar in the number of columns, but not the number of rows
[Bc1] [Bc2] ... [Bc500]
[Br1] 25 30 ... 15
[Br2] 7 54 ... 41
...
[Br20000]
I'm running an module ("npSeq") in R that uses the matrix A consistently as an input value, a horizontal vector that includes all of the values from a row in matrix B, ex [1]. The module returns a separate list of values. I will need to run the analysis independently for all of the rows in matrix B saving all of the returned lists which I will then need to combine.
However I would like to know if there is a way to automate the process so that the module runs using a vector derived from row [Br1], saves the returned list, and then runs the process again using the vector derived from row [Br2]. Repeating the process until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.
You can use lapply to loop over B's row indices:
result.list <- lapply(1:nrow(B), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.

Resources