bind together sparse model matrices by row names - r

I am trying to construct a large sparse matrix with a split-apply-combine approach by separately calling sparse.model.matrix() from the package Matrix on subsets of columns of a dataframe and then binding them together into a full matrix. I have to do this because of memory limitations (I can't call sparse.model.matrix on the whole df at once). This process works fine, and I get a list of sparse matrices, but these have different dimensions and when I try to bind them together, I can't.
ex:
data(iris)
set.seed(100)
iris$v6 <- sample(c("a","b","c",NA), 150, replace=TRUE)
iris$v7 <- sample(c("x","y",NA), 150, replace = TRUE)
sparse_m1 <- sparse.model.matrix(~., iris[,1:5])
sparse_m2 <- sparse.model.matrix(~.-1, iris[, 6:7])
dim(sparse_m1)
[1] 150 7
dim(sparse_m2)
[1] 71 4
cbind2(sparse_m1, sparse_m2)
Error: Matrices must have same number of rows in cbind2(sparse_m1, sparse_m2)
cbind(sparse_m1, sparse_m2)
Error: Matrices must have same number of rows in cbind2(..1, r)
The matrices have the same row names; some rows have simply been omitted from sparse_m2 because they had a missing value in at least one of the two columns. Is there any way to combine them?
I also tried using rbind.fill.matrix() from the plyr package, by first transposing and then calling it and then re-transposing, but then I lose column names since row names are ignored in rbind.fill.matrix.
Any ideas?

An old question still in need of an answer...
One approach is to create an empty Matrix of the required dimensions and then populate it:
# union of the row names, and all column names from both matrices
m12.dimnames <- list(union(rownames(sparse_m1), rownames(sparse_m2)), c(colnames(sparse_m1), colnames(sparse_m2)))
m12 <- Matrix(0, nrow = length(m12.dimnames[[1]]), ncol = length(m12.dimnames[[2]]), dimnames = m12.dimnames)
# fill each block by name; rows missing from sparse_m2 stay zero
m12[rownames(sparse_m1), colnames(sparse_m1)] <- sparse_m1
m12[rownames(sparse_m2), colnames(sparse_m2)] <- sparse_m2
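As an alternative sketch (not part of the original answer), the smaller matrix can be expanded to the full set of rows by matching row names and rebuilding it from its triplet form, after which a plain cbind() works. The names idx, trip and m2_full are made up for illustration, and the code assumes every row name of sparse_m2 also occurs in sparse_m1.

```r
library(Matrix)

data(iris)
set.seed(100)
iris$v6 <- sample(c("a", "b", "c", NA), 150, replace = TRUE)
iris$v7 <- sample(c("x", "y", NA), 150, replace = TRUE)

sparse_m1 <- sparse.model.matrix(~ ., iris[, 1:5])
sparse_m2 <- sparse.model.matrix(~ . - 1, iris[, 6:7])

# position of each sparse_m2 row inside sparse_m1, matched by row name
idx <- match(rownames(sparse_m2), rownames(sparse_m1))

# non-zero entries of sparse_m2 as (i, j, x) triplets
trip <- summary(sparse_m2)

# rebuild sparse_m2 with one row per row of sparse_m1;
# rows that were dropped because of NAs stay all-zero
m2_full <- sparseMatrix(i = idx[trip$i], j = trip$j, x = trip$x,
                        dims = c(nrow(sparse_m1), ncol(sparse_m2)),
                        dimnames = list(rownames(sparse_m1), colnames(sparse_m2)))

m12 <- cbind(sparse_m1, m2_full)
dim(m12)
```

This never materialises a dense block, which may matter if memory is the constraint that forced the split in the first place.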

I recently ran into the same issue. Nowadays you can use the Matrix.utils package:
install.packages("Matrix.utils")
library(Matrix.utils)
sparse_filled <- rBind.fill(sparse_m1, sparse_m2)

Related

Matrix subsetting by column name using the `subset` function

Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
Why does this code throw an error (dsts is a matrix)?
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced to a data frame):
subset(as.data.frame(dsts),i==1)
This one also works, because x is already defined in the workspace:
subset(dsts,x> 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?
The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you need matrix arithmetic. If you think of your columns as different attributes of a row of observations, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix, but it seems like just the wrong choice.
A data.frame is stored as a named list while a matrix is stored as a dimensioned vector. A list can be used as an environment, which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data type. You cannot always easily convert between the two without a loss of information.
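A small sketch of what "evaluating variable names in that context" means in practice. It rebuilds dsts from the question and uses eval() directly; this only illustrates the idea and is not the actual source of subset(). The names df and keep are introduced here for illustration.

```r
k <- 1:5
x <- seq(0, 10, length.out = 100)
dsts <- do.call(rbind, lapply(seq_along(k), function(i) cbind(x = x, distri = dchisq(x, k[i]), i)))

df <- as.data.frame(dsts)

# a data.frame is a list, so eval() can resolve the name `i` to the column df$i
keep <- eval(quote(i == 1), df)

# the logical vector can then be used for ordinary row indexing
nrow(df[keep, ])
```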
Things like $ also work with data.frames but not with matrices.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the subset() function.
dsts[ dsts[,"i"]==1, ]
This behavior has been part of R for a very long time. Any change to it is likely to break existing code that relies on variables being evaluated in a certain context. I think the problem lies with whoever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame().
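For illustration, here is the simulation from the question built with data.frame() instead of cbind(). This is a sketch; the only changes from the original snippet are the constructor and the explicit column name i.

```r
k <- 1:5
x <- seq(0, 10, length.out = 100)

# each piece is a data.frame, so rbind() returns a data.frame as well
dsts <- do.call(rbind, lapply(seq_along(k), function(i) {
  data.frame(x = x, distri = dchisq(x, k[i]), i = i)
}))

# subset() can now resolve `i` as a column name
first <- subset(dsts, i == 1)
nrow(first)
```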

R populate list with samples

I have a numeric vector stock_data containing thousands of floating point numbers. I know I can sample them using
sample(stock_data, sample_size)
I want to take 100 different samples and populate them in a list of samples.
How do I do that without using a loop to append the samples to a list?
I thought of creating a list replicating the stock data 100 times then using lapply on them.
I tried:
all_repl <- as.list(rep(stock_data,100))
all_samples <- lapply(all_repl, sample, size=100)
But all_repl doesn't contain a list of data, it contains a single numeric vector which has replicated the data 100 times.
Can anyone suggest what's wrong and point out a better method to do what I want?
We can use replicate
replicate(100, sample(stock_data, sample_size))
Use simplify=FALSE to get the output as a list. With a reproducible example:
replicate(5, sample(1:9, 5), simplify=FALSE)
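Applied to the question's setup, this might look as follows. The rnorm() call is only a made-up stand-in for the real stock_data, and sample_size = 100 is assumed from the question's own attempt.

```r
set.seed(42)
stock_data <- rnorm(5000)   # hypothetical stand-in for the real data
sample_size <- 100          # assumed sample size

# a list of 100 independent samples, each of length sample_size
all_samples <- replicate(100, sample(stock_data, sample_size), simplify = FALSE)

# equivalent lapply() form: iterate over a counter rather than over copies of the data
all_samples2 <- lapply(seq_len(100), function(j) sample(stock_data, sample_size))

length(all_samples)
```

Without simplify=FALSE, replicate() would return a sample_size x 100 matrix here, with one sample per column.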

Subsetting every x amount of columns as separate sites

I need a function that treats every x columns as a separate site. In df1 below there are 8 columns: 4 sites, each consisting of 2 variables. Previously, I have used the procedure from the answer to "Selecting column sequences and creating variables".
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8*10, replace=TRUE), ncol=8))
I then need to calculate a column sum so that a total for each variable is obtained.
colsums <- as.data.frame(t(colSums(df1)))
I subsequently split the dataframe using this technique...
lst1 <- setNames(lapply(split(1:ncol(colsums), as.numeric(gl(ncol(colsums),
2, ncol(colsums)))), function(i) colsums[,i]), paste0('site', 1:4))
list2env(lst1, envir=.GlobalEnv)
And organise into one dataframe...
Combined <- as.matrix(mapply(c,site1,site2,site3,site4))
rownames(Combined) <- c("Site.1","Site.2","Site.3","Site.4")
Whilst this technique has been great on smaller dataframes, where there is a substantial number of sites (>500), typing out each site in the mapply call takes up a lot of code, and some sites could get missed if I type them all in manually. Is there an easy way to overcome this after the colsums stage?
A matrix is a vector with dimensions. Matrices are stored in column-major order in R.
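A call such as matrix(unlist(colsums), nrow=2) should help you a lot: it gives one column per site and one row per variable (colsums is a data.frame, so it has to be unlisted first).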
NB.: Polluting the "global" environment is generally a bad idea.
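A sketch of the whole step without any per-site objects, assuming two variables per site as in the question. vars_per_site, totals and n_sites are names introduced here for illustration. With byrow = TRUE each site becomes a row, which matches the layout of Combined in the question.

```r
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8 * 10, replace = TRUE), ncol = 8))

vars_per_site <- 2                      # assumed number of columns per site
totals <- colSums(df1)                  # one total per column
n_sites <- length(totals) / vars_per_site

# fill row by row: row 1 gets totals 1-2, row 2 gets totals 3-4, and so on
Combined <- matrix(totals, ncol = vars_per_site, byrow = TRUE,
                   dimnames = list(paste0("Site.", seq_len(n_sites)), NULL))
Combined
```

The same call works unchanged for 500 sites, since nothing is typed out per site.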

Sorting list of matrices by the first column

I have a list containing 4 matrices, each with 21 random numbers in 3 columns and 7 rows.
I want to create a new list using the lapply function in which each matrix is sorted by the first column.
I tried:
#example data
set.seed(1)
list.a <- replicate(4, list(matrix(sample(1:99, 21), nrow=7)))
ordered <- order(list.a[,1])
lapply(list.a, function(x){[ordered,]})
but at the first step R gives me the error "incorrect number of dimensions". I don't know what to do. It works with one matrix, though.
Please help me. Thanks!
You were almost there - but you would need to iterate through the list to reorder each matrix.
It's easier to do this in one lapply statement:
lapply(list.a, function(x) x[order(x[,1]),])
Note that x in the function call represents the matrices in the list.
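A self-contained version with a quick check, as a sketch. The example data is built with simplify = FALSE, which yields the same list of four matrices as the question's code, and drop = FALSE is added only as a precaution so that a one-row matrix would stay a matrix.

```r
set.seed(1)
list.a <- replicate(4, matrix(sample(1:99, 21), nrow = 7), simplify = FALSE)

# order() is computed separately for each matrix, inside the function
sorted <- lapply(list.a, function(x) x[order(x[, 1]), , drop = FALSE])

# TRUE for every matrix whose first column is now in ascending order
vapply(sorted, function(m) !is.unsorted(m[, 1]), logical(1))
```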

Subset multiple matrices to the same number of rows and columns in R

I have created a list of multiple matrices called 'Mix' and I need to subset all the matrices in the list to the same rows and columns (5:31, 5:20). I tried this but it didn't work:
NM<-lapply(Mix, subset(c(5:31,5:20)))
I also tried :
NM<-lapply(Mix, subset(c[5:31],c[5:20]))
It still did not work. What would be the best options to subset all the matrices in Mix?
Posting as answer, to close out the question:
NM <-lapply(Mix, function(x) x[5:31,5:20])
#To merge them into one matrix
NM <- do.call("rbind",NM)
#To check size of each matrix/subset
do.call('rbind', lapply(Mix, dim))
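A runnable sketch with made-up data, since the contents of Mix are not shown in the question. It assumes every matrix has at least 31 rows and 20 columns; otherwise the indexing fails with a "subscript out of bounds" error.

```r
set.seed(1)
# hypothetical stand-in for Mix: three 40 x 25 matrices
Mix <- replicate(3, matrix(rnorm(40 * 25), nrow = 40), simplify = FALSE)

# keep rows 5 to 31 and columns 5 to 20 of every matrix
NM <- lapply(Mix, function(x) x[5:31, 5:20, drop = FALSE])

# one row per matrix: number of rows, number of columns
t(vapply(NM, dim, integer(2)))
```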
