Removing Isolated Values From Network Matrix with R

I have a matrix like the one below:
A=matrix(c(1,2,0,3,4,0,0,0,0),nrow=3,ncol=3,byrow=TRUE)
In this matrix the row and column names are the same. Every row/column name corresponds to an author who cites another. The 3rd row and 3rd column are zero. How can I shrink the matrix by removing isolated authors who neither receive citations nor cite anyone? In other words, how can I remove the n-th row and n-th column together when both are all zero?
In the tm (text mining) library I can do this on document-term matrices with removeSparseTerms.

If you want to store as sparse matrix (and automatically store only non-zero items) then you can do something like the following:
library(Matrix)
A <- as(A, "sparseMatrix")
A
# 3 x 3 sparse Matrix of class "dgCMatrix"
# [1,] 1 2 .
# [2,] 3 4 .
# [3,] . . .

Using colSums, rowSums, and [ in base R, this can be accomplished with
A[rowSums(A) > 0, colSums(A) > 0]
[,1] [,2]
[1,] 1 2
[2,] 3 4
This will drop any row or any column that is zero (no citing, no citations).
However, if the matrix is square, and the desire is to drop instances where both the column and the row are zero, you could use
keepem <- rowSums(A) > 0 | colSums(A) > 0
A[keepem, keepem]
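As a minimal sketch of the square-matrix case, the example below uses the matrix from the question with hypothetical author names (the original has no dimnames). It tests A != 0 rather than summing the raw values, which is an assumption on my part: it only matters if entries could be negative and cancel out, and drop = FALSE is added so the result stays a matrix even if a single author remains.

```r
# Hypothetical author labels; the question's matrix is unnamed.
authors <- c("a1", "a2", "a3")
A <- matrix(c(1, 2, 0,
              3, 4, 0,
              0, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(authors, authors))

# An author is kept if they cite someone (non-zero row)
# or are cited by someone (non-zero column).
keep <- rowSums(A != 0) > 0 | colSums(A != 0) > 0

# drop = FALSE keeps the matrix structure even for a 1 x 1 result.
A_small <- A[keep, keep, drop = FALSE]
A_small
```

Because the same logical mask is used for rows and columns, the result stays square and the surviving names line up.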

Related

Creating upper/lower triangular correlation matrix based on values from a group of text files?

Will try not to complicate things too much with my explanations, but I'm confused how to best go about filling a triangulated correlation matrix with no repeat values with existing correlation values derived from another package. This involves extracting specific values from a list of text files. This is what I have done so far:
# read in list of file names (they are named '1_1', '1_2' .. so on until '47_48' with no repeat values generated)
filenames <- read_table('/home/filenames.txt', col_names = 'file_id')
# create symmetrical matrix
M <- diag(48)
ct <- 1
for (sub in (filenames$file_id)) {
subj <- read.table(paste0(dat_dir, '/ht_', sub, '.HEreg'), sep="", fill=TRUE)
ht <- as.character(subj$V2[grep("rG", subj$V1)]) # wanting to extract the specific value in that column for each text file
M[ct,] <- as.numeric(ht) #input this value into the appropriate location
ct <- ct + 1
}
This obviously does not give me the triangulated output I would envision - I know there is an error with inputting the variable 'ht' into the matrix, but am not sure how to solve this moving forward. Ideally, the correlation value of file 1_1 should be inserted in row 1, col 1, file 1_2 should be inserted in row 2, col 1, so on and so forth, and avoiding repeats (should be 0's)
Should I turn to nested loops?
Much help would be appreciated from this R newbie here, I hope I didn't complicate things unnecessarily!
I think the easiest way would be to read in all your values into a vector. You can do this using a variation of your existing loop.
Let us assume that your desired size correlation matrix is 5x5 (I know you have 48x48 judging by your code, but to keep the example simple I will work with a smaller matrix).
Let us assume that you have read all of your correlation values into the vector x in column major order (same as R uses), i.e. the first element of x is row 2 column 1, second element is row 3 column 1 etc. I am further assuming that you are creating a symmetric correlation matrix, i.e. you have ones on the diagonal, which is why the indexing starts the way it does, because of your use of the diag() function. Let's assume your vector x contains the following values:
x <- 1:10
I know that these are not correlations, but they will make it easy to see how we fill the matrix, i.e. which vector element goes into which position in the resulting matrix.
Now, let us create the identity matrix and zero matrices for the upper and lower triangular correlations (off diagonal).
# Assuming 5x5 matrix
n_elements <- 5
m <- diag(n_elements)
m_upper <- m_lower <- matrix(0, n_elements, n_elements)
To quickly fill the lower triangular matrix, we can use lower.tri().
m_lower[lower.tri(m_lower, diag = FALSE)] <- x
This will yield the following output:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 2 5 0 0 0
[4,] 3 6 8 0 0
[5,] 4 7 9 10 0
As you can see, we have successfully filled the lower triangular. Also note the order in which the elements of the vector are filled into the matrix. This is crucial for your results to be correct. The upper triangular is simply the transpose of the lower triangular, and then we can add our three matrices together to form your symmetric correlation matrix.
m_upper <- t(m_lower)
M <- m_lower + m + m_upper
Which yields the desired output:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 2 3 4
[2,] 1 1 5 6 7
[3,] 2 5 1 8 9
[4,] 3 6 8 1 10
[5,] 4 7 9 10 1
As you see, there is no need to work with nested loops to fill these matrices. The only loop you need is the one that reads in the results from the files (which it appears you have a handle on). If you only want the triangulated output, you can simply stop at the lower triangular matrix above. If your vector of estimated correlations (in my example x) includes the diagonal elements, simply set diag = TRUE in the lower.tri() function and you are good to go.
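If the values do not arrive in column-major order, a different technique from the one above may be easier: parse the row and column out of each file id and assign by index pairs. The sketch below is only an illustration under assumptions: the ids and values are made up, and it assumes an id "i_j" should land in row max(i, j), column min(i, j), which matches the question's "file 1_2 goes in row 2, col 1".

```r
# Hypothetical ids and values standing in for what the loop reads from files.
ids  <- c("1_2", "1_3", "2_3", "1_4", "2_4", "3_4")
vals <- c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60)

# Split "i_j" into two integer index vectors.
parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))
i <- as.integer(parts[, 1])
j <- as.integer(parts[, 2])

# Start from the identity matrix (ones on the diagonal).
n <- max(i, j)
M <- diag(n)

# A two-column index matrix assigns one cell per (row, col) pair.
M[cbind(pmax(i, j), pmin(i, j))] <- vals
M
```

Indexing with a two-column matrix addresses individual cells, so the order in which the files are read no longer matters.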

Does the c command create a row vector or a column vector by default in R

In R, when I use a command like this:
b <-c(7,10)
b
Does it create a row vector (1 row, 2 cols) or a column vector (1 col, 2 rows) by default?
I can't tell from the displayed output.
I am an R beginner (as is obvious :))
Neither. A vector does not have a dimension attribute by default, it only has a length.
If you look at the documentation on matrix arithmetic, help("%*%"), you see that:
Multiplies two matrices, if they are conformable. If one argument is a
vector, it will be promoted to either a row or column matrix to make
the two arguments conformable. If both are vectors of the same length,
it will return the inner product (as a matrix).
So R will interpret a vector in whichever way makes the matrix product sensible.
Some examples to illustrate:
> b <- c(7,10)
> b
[1] 7 10
> dim(b) <- c(1,2)
> b
[,1] [,2]
[1,] 7 10
> dim(b) <- c(2,1)
> b
[,1]
[1,] 7
[2,] 10
> class(b)
[1] "matrix"
> dim(b) <- NULL
> b
[1] 7 10
> class(b)
[1] "numeric"
A matrix is just a vector with a dimension attribute. So adding an explicit dimension makes it a matrix, and R will do that in whichever way makes sense in context.
And an example of the behavior in the context of matrix multiplication:
> m <- matrix(1:2,1,2)
> m
[,1] [,2]
[1,] 1 2
> m %*% b
[,1]
[1,] 27
> m <- matrix(1:2,2,1)
> m %*% b
[,1] [,2]
[1,] 7 10
[2,] 14 20
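When the orientation matters, it can be made explicit instead of relying on promotion. A small sketch using only base R (the variable names are mine):

```r
b <- c(7, 10)

# A plain vector has no dim attribute at all.
is.null(dim(b))

# t() of a vector gives a 1 x 2 row matrix.
row_b <- t(b)

# as.matrix() of a vector gives a 2 x 1 column matrix.
col_b <- as.matrix(b)

# Two plain vectors of equal length: %*% returns the inner product
# as a 1 x 1 matrix.
inner <- b %*% b

# %o% gives the outer product, a 2 x 2 matrix.
outer_prod <- b %o% b
```

Checking dim() on each result is a quick way to see which shape R actually produced.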
You can treat a vector ( c() ) in R as a row or a column.
You can see this by
rbind(c(1,3,5),c(2,4,6))
cbind(c(1,2,3),c(4,5,6))
It is just a collection. By default, though, when a vector is converted to a data frame
data.frame(c(1,2,3))
it becomes a column, and the first [[ ]] index then selects the column of the table rather than the row, contrary to the usual linear algebra convention.
That is, to access "hello" after converting a vector to a data.frame,
an additional index is required:
a = data.frame(c("hello","F***ery"))
a[[1]][[1]]
This is where things get odd, because data frames do not jibe with strings: before R 4.0.0, data.frame() converted strings to factors by default (stringsAsFactors = TRUE), so "hello" is stored as an integer code with levels.
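To make the factor behavior visible regardless of R version, the sketch below sets stringsAsFactors explicitly in both directions. The column name x and the strings are placeholders of mine.

```r
# Strings stored as a factor: integer codes plus a levels attribute.
a <- data.frame(x = c("hello", "world"), stringsAsFactors = TRUE)
class(a$x)          # "factor"
as.integer(a$x)     # the underlying integer codes
levels(a$x)         # the labels the codes refer to

# Strings kept as plain character values.
b <- data.frame(x = c("hello", "world"), stringsAsFactors = FALSE)
class(b$x)          # "character"

# Either way, [[1]] picks the column and the second [[1]] picks the element.
first_a <- as.character(a[[1]][[1]])
first_b <- b[[1]][[1]]
```

Passing stringsAsFactors explicitly avoids depending on the default, which differs between R versions.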
The c function creates an "atomic" vector, in the words of Norman Matloff in The Art of R Programming:
atomic vectors, since their components cannot be broken down into
smaller components.
It can be seen as a "concatenation" (in fact c stands for concatenate) of elements, indexed by their positions and so no dimensions (in a spatial sense), but just a continuous index that goes from 1 to the length of the object itself.

How to use row and column index in matrix value calculation in R without looping?

How can I produce a matrix where the entries are, say, the product of the row index and the column index? For example:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
NB: this is not specific to multiplication. I actually need it to raise each entry to a power (row index - column index), and was looking to not have to induce loops (as I suspect there is a more R-friendly way).
Thanks!
M <- matrix(NA, 3,3)
Mrcprod <- row(M)*col(M)
Use the outer product of 1:3 and 1:3
outer(1:3,1:3)
# or
1:3 %o% 1:3
If you need the difference of the row indices and column indices, use outer again
outer(1:3,1:3,"-")
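For the actual goal in the question (raising each entry to the power row index minus column index), the same building blocks can be combined directly. This is a sketch with a made-up matrix of 2s so the result is easy to check by eye:

```r
# Hypothetical input: every entry is 2.
M <- matrix(2, 3, 3)

# row(M) and col(M) are matrices of row and column indices,
# so the exponent is computed element-wise without a loop.
P <- M^(row(M) - col(M))
P

# outer() also accepts an arbitrary vectorized function of the two indices.
prod_idx <- outer(1:3, 1:3, function(i, j) i * j)
prod_idx
```

The function passed to outer() must be vectorized, since it is called once with the full expanded index vectors rather than once per cell.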

Creating a sequence of numbers for indexing matrices

I am trying to create an index vector for a programming problem. The idea is to be able to index the elements of a matrix so I can replace just these elements with another matrix.
nstks<- 2
stk<-1:nstks
nareas<-3
area<-1:nareas
eff<-c(10,10,10)
x<-matrix(1:6,nrow=nstks,ncol=nareas)
h<-matrix(0,nrow=length(eff)+nstks,ncol=nareas)
for(i in 1:nareas) h[i,i]<-1
This returns a 5 by 3 matrix with 1s on the diagonal of the first 3 rows. Now I want to replace the 4th and 5th rows with a 2 by 3 matrix returned by another function. One way I figured is to index the h matrix by:
hlen<-c(nareas + stk,(nareas+ stk +(nareas +nstks)),(nareas+stk +(nareas+nstks)+(nareas+nstks)))
h[hlen] <- x
This replaces the 4,5,9,10,14,15th elements of h with the elements of x in order.
However, I need to make this flexible for differing numbers of nstks and nareas. As an example, for nareas=4 and nstks=3, I need to spit out a vector: c(5,6,7,12,13,14,19,20,21,26,27,28)
To clarify: I need to create the jacobian matrix for a constrained optimization problem. The dimensions of the jacobian vary depending on the number of constraints, and number of variables. I want to write a function that will give the jacobian matrix for any specified number of dimensions.
The variable is eff, which has the same length as nareas. There are non-negativity constraints on eff, which are reflected in the first nareas*nareas sub matrix being a diagonal identity matrix. The last rows of the matrix reflect the constraint on the number of fish that can be caught, by stock. So, for one stock, there will only be 1 additional row, 2 stocks, 2 additional rows etc. etc.
I need to replace the elements in these last rows by the elements given by another matrix. In the example, x is just for illustration. The actual x is given by a function but will have these same dimensions. Does this clarify things?
Any ideas?
Thanks!
I believe I can use:
h[(length(eff)+1): (length(eff)+nstks),1:nareas]<-x
I was making it too complicated as usual. Thanks for the help.
Instead of trying to find the indices for the values you need to replace with a sub-matrix returned from another function, can you not just place in the sub-matrix directly?
E.g. if you have:
x <- matrix(c(0, 1, 1, 0, 0, 1), ncol=3)
x
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 1
Identify the part of h you want to drop the sub-matrix in:
h[4:5, 1:3] <- x
h
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
[4,] 0 1 0
[5,] 1 0 1
Or, if x is a vector,
x <- c(0, 1, 1, 0, 0, 1)
x <- matrix(x, ncol=3, byrow=TRUE)
h[4:5, 1:3] <- x
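To make this flexible for any nareas and nstks, one possible sketch is to build h by stacking the identity block on top of x with rbind, which avoids index arithmetic altogether. The x below is a placeholder for whatever the other function returns. If the linear indices are still wanted, which(row(h) > nareas) recovers them in column-major order.

```r
nareas <- 4
nstks  <- 3

# Placeholder for the nstks x nareas matrix returned by another function.
x <- matrix(seq_len(nstks * nareas), nrow = nstks, ncol = nareas)

# Identity block for the non-negativity constraints, then the stock rows.
h <- rbind(diag(nareas), x)

# Linear (column-major) indices of the cells below the identity block.
idx <- which(row(h) > nareas)
idx
```

For nareas = 4 and nstks = 3 this gives 5, 6, 7, 12, 13, 14, 19, 20, 21, 26, 27, 28, the vector asked for in the question, so h[idx] <- x would be an equivalent way to fill those rows.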

Adding values to a matrix using index vectors that include row and column names

Suppose I have a really big matrix of sparse data, but I'm only interested in looking at a sample of it, making it even more sparse. Suppose I also have a data frame of triples including columns for row/column/value of the data (imported from a csv file). I know I can use the sparseMatrix() function of library(Matrix) to create a sparse matrix using
sparseMatrix(i=df$row,j=df$column,x=df$value)
However, because of my values I end up with a sparse matrix that's millions of rows by tens of thousands of columns (most of which are empty because my subset is excluding most of the rows and columns). All of those zero rows and columns end up skewing some of my functions (take clustering for example -- I end up with one cluster that includes the origin when the origin isn't even a valid point).
I'd like to perform the same operation, but using i and j as rownames and colnames. I've tried creating a dense matrix, sampling down to the max size and adding values using
denseMatrix <- matrix(0,nrows,ncols,dimnames=c(df$row,df$column))
denseMatrix[as.character(df$row),as.character(df$column)]=df$value
(actually I've been setting it equal to 1 because I'm not interested in the value in this case) but I've been finding it fills in the entire matrix because it takes the cross of all the rows and columns rather than just row1*col1, row2*col2...
Does anybody know a way to accomplish what I'm trying to do? Alternatively, I'd be fine with filling in a sparse matrix and simply having it somehow discard all of the zero rows and columns to compact itself into a denser form (but I'd like to maintain some reference back to the original row and column numbers).
I appreciate any suggestions!
Here's an example:
> rows<-c(3,1,3,5)
> cols<-c(2,4,6,6)
> mtx<-sparseMatrix(i=rows,j=cols,x=1)
> mtx
5 x 6 sparse Matrix of class "dgCMatrix"
[1,] . . . 1 . .
[2,] . . . . . .
[3,] . 1 . . . 1
[4,] . . . . . .
[5,] . . . . . 1
I'd like to get rid of columns 1, 3 and 5 as well as rows 2 and 4. This is a pretty trivial example, but imagine if instead of having row numbers 1, 3 and 5 they were 1000, 3000 and 5000. Then there would be a lot more empty rows between them. Here's what happens when I use a dense matrix with named rows/columns
> dmtx<-matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx
2 4 6
1 0 0 0
3 0 0 0
5 0 0 0
> dmtx[as.character(rows),as.character(cols)]=1
> dmtx
2 4 6
1 1 1 1
3 1 1 1
5 1 1 1
When you say "get rid of" certain columns/rows, do you mean just this:
> mtx[-c(2,4), -c(1,3,5)]
3 x 3 sparse Matrix of class "dgCMatrix"
[1,] . 1 .
[2,] 1 . 1
[3,] . . 1
Subsetting works, so you just need a way of finding out which rows and columns are empty? If that is correct, then you can use colSums() and rowSums() as these have been enhanced by the Matrix package to have appropriate methods for sparse matrices. This should preserve the sparseness during the operation
> dimnames(mtx) <- list(letters[1:5], LETTERS[1:6])
> mtx[which(rowSums(mtx) != 0), which(colSums(mtx) != 0)]
3 x 3 sparse Matrix of class "dgCMatrix"
B D F
a . 1 .
c 1 . 1
e . . 1
or, perhaps safer
> mtx[rowSums(mtx) != 0, colSums(mtx) != 0]
3 x 3 sparse Matrix of class "dgCMatrix"
B D F
a . 1 .
c 1 . 1
e . . 1
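Since the question also asks for a reference back to the original row and column numbers, one option is to keep the surviving indices as plain vectors next to the compacted matrix. This is a sketch on the question's small example; the names orig_rows, orig_cols and compact are mine, and it counts non-zero entries (mtx != 0) rather than summing values, which is an assumption that guards against positive and negative values cancelling.

```r
library(Matrix)

rows <- c(3, 1, 3, 5)
cols <- c(2, 4, 6, 6)
mtx  <- sparseMatrix(i = rows, j = cols, x = 1)

# Original positions of the rows and columns that hold at least one entry.
orig_rows <- which(rowSums(mtx != 0) > 0)
orig_cols <- which(colSums(mtx != 0) > 0)

# Compacted matrix; drop = FALSE keeps it a matrix in the 1-row/1-column case.
compact <- mtx[orig_rows, orig_cols, drop = FALSE]

# compact[r, c] corresponds to mtx[orig_rows[r], orig_cols[c]].
compact
```

The two index vectors are much smaller than the matrix, so keeping them around costs little even when the original dimensions are in the millions.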
Your code almost works, you just need to cbind together the row names and column names. Each row of the resulting matrix is then treated as a pair instead of treating the rows and the columns separately.
> dmtx <- matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx[cbind(as.character(rows),as.character(cols))] <- 1
> dmtx
2 4 6
1 0 1 0
3 1 0 1
5 0 0 1
This may be faster if you use factors.
> rowF <- factor(rows)
> colF <- factor(cols)
> dmtx <- matrix(0, nlevels(rowF), nlevels(colF),
dimnames=list(levels(rowF), levels(colF)))
> dmtx[cbind(rowF,colF)] <- 1
> dmtx
2 4 6
1 0 1 0
3 1 0 1
5 0 0 1
You can also use these factors in a call to sparseMatrix.
> sparseMatrix(i=as.integer(rowF), j=as.integer(colF), x=1,
+ dimnames = list(levels(rowF), levels(colF)))
3 x 3 sparse Matrix of class "dgCMatrix"
2 4 6
1 . 1 .
3 1 . 1
5 . . 1
Note that one of the other solutions may be faster; converting to factors can be slow if there's a lot of data.
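A possible alternative to factors is match() against the sorted unique ids, which yields the same compact integer positions. This is a sketch on the same toy data, not a benchmarked recommendation; ur and uc are names I made up.

```r
rows <- c(3, 1, 3, 5)
cols <- c(2, 4, 6, 6)

# Sorted unique ids define the compact row/column order.
ur <- sort(unique(rows))
uc <- sort(unique(cols))

dmtx <- matrix(0, length(ur), length(uc), dimnames = list(ur, uc))

# match() maps each original id to its position in the compact matrix;
# cbind() turns the two position vectors into (row, col) pairs.
dmtx[cbind(match(rows, ur), match(cols, uc))] <- 1
dmtx
```

The vectors ur and uc double as the lookup table back to the original row and column numbers.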
Your first issue stems from the fact that the coordinate list (COO) has non-contiguous values for the row and column indices. When faced with this, or even when dealing with most sparse matrices, I tend to reorder the rows and columns by their support.
You can do this in two ways:
1. Produce the sparse matrix and then take colSums and rowSums of the logical version of your matrix (non-zero entries set to TRUE) to get the support values, or
2. Use a function like table or bigtabulate (from the bigmemory suite) to count how many times each value occurs in the coordinate list. (My preference is bigtabulate.)
Once you have the support, you can use the rank function (actually, rank(-1 * support, ties = "first")) to map the original indices to new ones, based on their ranks.
At this point, if you create the matrix with sparseMatrix, it will only produce a matrix with dimensions such that all of your rows and columns have support. It will not map to anything larger.
This is similar to @GavinSimpson's approach, though his method only drops the missing rows and columns, while my approach reorders to put the maximum density in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns. In order to map back to the original indices in my approach, simply create a pair of mappings: "original to ranked" and "ranked to original", and you can perfectly recreate the original data, if you choose.
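The pair of mappings can be sketched in a few lines of base R. The support vector here is invented for illustration (non-zero counts for four original rows), and the variable names are mine.

```r
# Hypothetical support: number of non-zero entries per original row.
support <- c(2, 0, 5, 1)

# "Original to ranked": position each original index gets after reordering,
# densest first; ties are broken by original order.
new_of_orig <- rank(-support, ties.method = "first")

# "Ranked to original": which original index sits at each new position.
orig_of_new <- order(new_of_orig)

new_of_orig   # 2 4 1 3
orig_of_new   # 3 1 4 2
```

Composing the two mappings in either direction returns 1, 2, 3, 4, which is what makes the reordering reversible.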
@Iterator's answer is very helpful for my application, but it's a pity that the response didn't include an example to illustrate the idea. Here is my implementation of the idea for reordering the rows and columns of a very large sparse matrix (e.g. with about one million rows and a few thousand columns, on a supercomputer with sufficient memory to load the sparse matrix).
library(Matrix)
sparseY <- sparseMatrix( i=sample(2000, 500, replace=TRUE), j=sample(1000,500, replace=TRUE), x=sample(10000,500) )
# visualize the original sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The original sparse matrix")
numObs <- length( sparseY@x )
# replace all non-zero entries with 1 to calculate #non-zero entries per row/column and use rank() to sort based on supports
logicalY <- sparseY; logicalY@x <- rep(1, numObs)
# calculate the number of observed entries per row/column
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)
colObsFreqs
rowObsFreqs
# get the rank of supports for rows and columns
colRanks <- rank( -1*colObsFreqs, ties="first" )
rowRanks <- rank( -1*rowObsFreqs, ties="first" )
# Sort the ranks from small to large
sortColInds <- sort(colRanks, index.return=TRUE)
sortRowInds <- sort(rowRanks, index.return=TRUE)
# reorder the original sparse matrix so that the maximum density data block is placed in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns.
sparseY <- sparseY[ sortRowInds$ix, sortColInds$ix ]
# visualize the reordered sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The sparse matrix after reordering")
logicalY <- sparseY; logicalY@x <- rep(1, numObs)
# Check whether the resulting sparse matrix is what's expected, i.e. with the maximum density data block placed in the upper left corner of the matrix
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)
colObsFreqs
rowObsFreqs
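A shorter variant of the same reordering, offered only as a sketch: order() with decreasing = TRUE should give the same kind of densest-first arrangement as rank() followed by sort(), and counting Y != 0 avoids touching the x slot directly. The matrix size, the seed and the names Y, Yr, row_ord and col_ord are arbitrary choices of mine.

```r
library(Matrix)

set.seed(1)
# Small random sparse matrix; dims fixed so empty rows/columns are possible.
Y <- sparseMatrix(i = sample(20, 15, replace = TRUE),
                  j = sample(10, 15, replace = TRUE),
                  x = 1, dims = c(20, 10))

# Order rows and columns by their number of non-zero entries, densest first.
row_ord <- order(rowSums(Y != 0), decreasing = TRUE)
col_ord <- order(colSums(Y != 0), decreasing = TRUE)

Yr <- Y[row_ord, col_ord]

# Support after reordering should be non-increasing along both dimensions.
rowSums(Yr != 0)
colSums(Yr != 0)
```

row_ord and col_ord are themselves the "ranked to original" mappings, so Yr[r, c] corresponds to Y[row_ord[r], col_ord[c]].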
