I'm trying to create a sparse matrix where each row has at most n entries, each an integer within a certain range, which I could then use as an adjacency matrix for social network analysis. For example, an 80x80 matrix where each row has 10 or fewer entries that are integers from 1 to 4. The goal is to represent the sort of data you would get from a social networking survey in which respondents select values between 1 and 4 to indicate their relationship with up to 10 of the possibilities/columns in the survey.
I can create a sparse matrix using the "rsparsematrix" function, and with the density argument I can approximate the required number of responses, but I can't control the number of responses per row, and I would have to do additional processing to convert the random values to integers within my desired range.
e.g. I could start with something like
M1 <- rsparsematrix(80, 80, density = .1, symmetric = FALSE)
A more promising approach (from https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/) is to generate the values as a data frame and then use "transform" to prepare them for casting into a matrix. This lets me control the integer values, but it still doesn't limit the number of responses per row.
Example code from the blog follows:
set.seed(11)
N = 10
data = data.frame(
  row = sample(1:3, N, replace = TRUE),
  col = sample(LETTERS, N, replace = TRUE),
  value = sample(1:3, N, replace = TRUE))
data = transform(data,
  row = factor(row),
  col = factor(col))
This could be tweaked to give the required 80x80 matrix, but it doesn't solve the problem of limiting the responses per row, and duplicate entries in the same row/column combination will produce out-of-range values, since duplicates are resolved by taking the sum.
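To see the summing problem concretely, here is a minimal sketch (the data frame dupes is made up for illustration, and xtabs stands in for the blog's casting step): two responses land on the same row/column cell and are added, producing a 6 even though individual values are capped at 4.
dupes <- data.frame(row = c(1, 1), col = c("A", "A"), value = c(2, 4))
xtabs(value ~ row + col, data = dupes)  # the single cell holds 2 + 4 = 6, out of range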
Any suggestions would be most appreciated.
As a bonus question, how would you then create random rows of null responses? For example, within the 80x80 matrix, how might you introduce 40 random rows with no values? As in the description above, this would correspond to missing survey data.
You can try to build the sparse matrix up using its row (i), column (j), and value (x) components. This involves sampling subject to your row and value constraints.
# constraints
values <- 1:4
maxValuesPerRow <- 10
nrow <- 80
ncol <- 80
# sample values: how many values each row gets (at most maxValuesPerRow)
set.seed(1)
nValuesForEachRow <- sample(maxValuesPerRow, nrow, replace=TRUE)
# create matrix
library(Matrix)
i <- rep(seq_len(nrow), nValuesForEachRow) # row
j <- unlist(lapply(nValuesForEachRow, sample, x=seq_len(ncol))) # which columns
x <- sample(values, sum(nValuesForEachRow), replace=TRUE) # values
sm <- sparseMatrix(i=i, j=j, x=x)
# check
dim(sm)
table(rowSums(sm>0))
table(as.vector(sm))
Note: you can't just sample all the columns at once, as below, since that can produce duplicate row/column pairs within a row; hence the per-row sampling via lapply above.
j <- sample(seq_len(ncol), sum(nValuesForEachRow), replace=TRUE)
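For the bonus question in the original post (rows of null responses), a minimal sketch building on the same i/j/x construction: sample 40 row indices and drop their triplets before building the matrix, passing dims so the emptied rows are kept.
# bonus: blank out 40 random rows to mimic respondents with no survey data
nullRows <- sample(nrow, 40)
keep <- !(i %in% nullRows)
sm2 <- sparseMatrix(i = i[keep], j = j[keep], x = x[keep], dims = c(nrow, ncol))
sum(rowSums(sm2 > 0) == 0)  # should be 40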
The code below will do what you want. It generates your random sparse matrix, rounds it to whole numbers, then, for every row with more than 10 entries, randomly sets entries to NA until only 10 remain. Finally, it replaces every non-NA entry with a random integer between 1 and 4.
library(Matrix)
M1 <- as.data.frame(as.matrix(rsparsematrix(80, 80, density = .1, symmetric = FALSE)))
M1 <- as.data.frame(round(M1))  # round all entries to whole numbers
M1 <- as.data.frame(sapply(M1, function(x) ifelse(x == 0, NA, x)))  # zeros become NA
rows <- which(apply(M1, 1, function(x) sum(!is.na(x))) > 10)  # rows with more than 10 entries
for (i in rows)
{
  filled <- which(!is.na(M1[i, ]))
  toNA <- setdiff(filled, sample(filled, 10, replace = FALSE))
  M1[i, toNA] <- NA
}
for (i in 1:nrow(M1))
{
  filled <- which(!is.na(M1[i, ]))
  M1[i, filled] <- sample(1:4, length(filled), replace = TRUE)
}
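A quick sanity check on the result, assuming the loops above ran on the same M1: every row should have at most 10 responses and all remaining values should lie in 1 to 4.
table(apply(M1, 1, function(x) sum(!is.na(x))))  # responses per row, all <= 10
table(unlist(M1), useNA = "ifany")  # values are NA or in 1:4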
I have a large data frame (50k rows by 5k columns). I would like to make a smaller data frame using the following rule: for a given k, 0 < k < n, select the largest set of columns such that at least k rows have non-NA values in all of these columns.
This seems like it might be too hard for a computer to do on a big data frame, but I'm hoping it is possible. I have written code for this operation.
It seems my way of doing this is too complex. It relies on (1) computing a list of all possible subsets of the set of columns, and then (2) checking how many shared rows they have. For even small numbers of columns (1) gets too slow (e.g. 45 seconds for 25 columns).
Question: Is it theoretically possible to get the largest set of columns sharing at least k non-na rows? If so, what is a more realistic approach?
@alexis_laz's elegant answer to a similar question takes an inverse approach to mine, examining all (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in them.
Taking combinations of n observations is difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and on my computer it fails to produce the combinations for sizes greater than 3. This makes me worry that it won't be feasible for large n and p with either approach.
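For scale, choose() gives the counts without materializing the combinations, which puts a quick check on the figure above:
choose(500, 3)  # 20708500, matching the count above
choose(500, 4)  # about 2.57e9, far too many to enumerate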
I have included an example matrix for reproducibility.
require(dplyr)
# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n*p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p
# matrix reporting whether a value is na
hasVal = 1-is.na(mat)
system.time(
# collect all possible subsets of the columns' indices
nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
recursive = FALSE,
use.names = FALSE)
)
# how many observations have all of the subset's variables
countObsWithVars = function(varsVec){
  selectedCols = as.matrix(hasVal[, varsVec])
  countInRow = apply(selectedCols, 1, sum)  # for each row, number of non-NA values
  numMatching = sum(countInRow == length(varsVec))  # rows that have all selected columns
  numMatching
}
system.time(
numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)
# collect results into a data.frame
df = data.frame(subSetNum = 1:length(numObsWithVars),
numObsWithVars = numObsWithVars,
numVarsInSubset = unlist(lapply(nameSubsets, length)),
varsInSubset = I(nameSubsets))
# find the largest set of columns for each number of rows
maxdf = df %>% group_by(numObsWithVars) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  arrange(numObsWithVars)
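As an aside, step (2) can at least be vectorized; here is a sketch of a drop-in replacement for countObsWithVars using rowSums, which avoids the per-row apply call (it speeds up the counting but does nothing about the combinatorial explosion in step (1)):
countObsWithVarsFast = function(varsVec){
  selectedCols = hasVal[, varsVec, drop = FALSE]
  sum(rowSums(selectedCols) == length(varsVec))  # rows with all selected columns present
}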
I have a large matrix from which I would like to randomly extract a smaller matrix. (I want to do this 1000 times, so ultimately it will be in a for loop.) Say for example that I have this 9x9 matrix:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
From this matrix, I would like a random 3x3 subset. The trick is that I do not want any of the row or column sums in the final matrix to be 0. Another important thing is that I need to know the original number of the rows and columns in the final matrix. So, if I end up randomly selecting rows 4, 5, and 7 and columns 1, 3, and 8, I want to have those identifiers easily accessible in the final matrix.
Here is what I've done so far.
First, I create a vector of row numbers and column numbers. I am trying to keep these attached to the matrix throughout.
r.num <- seq(from = 1, to = nrow(mat), by = 1)  # vector of row numbers
c.num <- seq(from = 0, to = ncol(mat), by = 1)  # vector of col numbers (0 labels the r.num column)
mat.1 <- cbind(r.num, mat)
mat.2 <- rbind(c.num, mat.1)
Now I have a 10x10 matrix with identifiers. I can select my rows by creating a random vector and subsetting the matrix.
rand <- sample(r.num,3)
temp1 <- rbind(mat.2[1,],mat.2[rand,]) #keep the identifier row
This works well! Now I want to randomly select 3 columns. This is where I am running into trouble. I tried doing it the same way.
rand2 <- sample(c.num,3)
temp2 <- cbind(temp1[,1],temp1[,rand2])
The problem is that I end up with some row and column sums that are 0. I can eliminate columns that sum to 0 first.
temp3 <- temp1[,which(colSums(temp1[2:nrow(temp1),])>0)]
cols <- which(colSums(temp1[2:nrow(temp1),2:ncol(temp1)])>0)
rand3 <- sample(cols,3)
temp4 <- cbind(temp3[,1],temp3[,rand3])
But I end up with an error message. For some reason, R does not like to subset the matrix this way.
So my question is, is there a better way to subset the matrix by the random vector "rand3" after the zero columns have been removed OR is there a better way to randomly select three complementary rows and columns such that there are none that sum to 0?
Thank you so much for your help!
If I understood your problem, I think this would work:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
# start with an all-zero matrix so the while condition is TRUE on entry
smallmatrix = matrix(0, nrow = 3, ncol = 3)
# keep drawing until no row or column of the sample sums to 0
while(any(apply(smallmatrix, 2, sum) == 0) | any(apply(smallmatrix, 1, sum) == 0)){
  cols = sample(ncol(mat), 3)
  rows = sample(nrow(mat), 3)
  smallmatrix = mat[rows, cols]
}
# keep the original row/column indices as dimnames
colnames(smallmatrix) = cols
rownames(smallmatrix) = rows
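Since the original post mentions repeating this 1000 times, one way to package it (the function name draw3x3 is made up here for illustration) is to wrap the rejection loop in a function and call it with replicate:
draw3x3 = function(mat){
  smallmatrix = matrix(0, nrow = 3, ncol = 3)
  while(any(rowSums(smallmatrix) == 0) | any(colSums(smallmatrix) == 0)){
    rows = sample(nrow(mat), 3)
    cols = sample(ncol(mat), 3)
    smallmatrix = mat[rows, cols]
  }
  rownames(smallmatrix) = rows  # original row indices
  colnames(smallmatrix) = cols  # original column indices
  smallmatrix
}
results = replicate(1000, draw3x3(mat), simplify = FALSE)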