How can I include a column of matrices in a data.frame?

I want to store the output from many regression models, including the regression coefficients and the information matrix from each model.
To store the results, it would be convenient if I could use a data frame with two columns: one for the regression coefficients and one for the information matrix. How can I create such a data frame?
res = data.frame(mu = I(matrix(0, m, n)), j = ???)
(It seems j should be an array in such a situation.)

You can do this, just not at the birth of the data.frame as you're trying; you can add it on later (as I show below). I've done the same thing on occasion, and thus far no R gods have attempted to destroy me. It may not be the best approach, but a data.frame is a list, so it can be done. Sometimes, though, the visual table format of the data.frame is nicer than a plain list.
dat <- data.frame(coeff = 1:10)
dat$mats <- lapply(1:10, function(i) matrix(1:4, 2))  # add a list column of matrices
dat[1, 2]
## [[1]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
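For the original question's layout, a minimal sketch that keeps the matrix column via I() and adds the information matrices as a list column (m and n are assumed dimensions, and the diag(n) matrices are just placeholders):
m <- 3; n <- 2                                    # assumed dimensions
res <- data.frame(mu = I(matrix(0, m, n)))        # matrix column, as in the question
res$j <- lapply(seq_len(m), function(i) diag(n))  # placeholder information matrices
res$j[[1]]                                        # retrieve the first model's matrix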

Data.frames work best when you have rectangular data, specifically a collection of atomic vectors of the same length. Trying to shove other data in there is not a good idea. Also, adding rows one by one to a data.frame is not an efficient operation. The general container for all objects in R is the list: it can hold anything, and you can name the elements whatever you like. I'm not sure why you think you need a data.frame; a sketch follows.
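For example, a minimal list-based sketch (the names fits, coef, and info are illustrative, not from the question):
## Store each model's results as a named list inside one big list
fits <- lapply(1:5, function(i) {
  list(coef = rnorm(3),  # regression coefficients (placeholder values)
       info = diag(3))   # information matrix (placeholder)
})
fits[[2]]$info           # the information matrix of the 2nd model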

Related

apply function to all meaningful combinations in a list and save in matrix format

I want to apply a function (the distance between distance matrices) to a list of distance matrices and extract the calculated values in matrix/table format.
I am using the ecodist package for the calculation and nested lapply calls to make all possible combinations of the MRM distance calculation.
Part 1:
library("ecodist")
#example data
data(graze)
#make list to get it looking like my data
grazelist <- as.list.data.frame(graze)
#all vs all distance combination
grazedist <- lapply(names(grazelist), function(z)
  lapply(names(grazelist), function(f)
    MRM(dist(grazelist[[z]]) ~ dist(grazelist[[f]]), nperm = 1)))
This makes all possible combinations in both directions, but I only need the combinations one way (half the matrix), since the calculation takes very long and the result for the second half is the same. Any idea how to solve this?
Part 2:
I would like to gather only the dist value under $coef per calculation, in matrix format, for follow-up processing. In this case it is 1.000000e+00.
R output looks like this:
> head(grazedist[[1]])
[[1]]
[[1]]$`coef`
                     dist(grazelist[[z]]) pval
Int                          8.881784e-16    1
dist(grazelist[[f]])         1.000000e+00    1

[[1]]$r.squared
  R2 pval
   1    1

[[1]]$F.test
           F       F.pval
3.753766e+18 1.000000e+00
I know how to get it as a txt or csv file for a single R result that is not buried in a list (called MRM_calculation here):
write.table(MRM_calculation$coef[2,1],file="file.txt")
But how can I collect all dist values of $coef in a data frame, table or directly matrix format from a list?
something like:
mapply(write.csv2, x=grazedist$coef,
file=paste(names(grazedist),"value.csv"))
I'm using larger matrices (1500 x 1500) in a list, but I hope the example data graze are sufficient as a reproducible example.
You could use combn, which for your example data generates only 351 combinations rather than the 729 (27 x 27) produced by the nested lapply. You can then apply the function to every combination using the FUN argument of combn, extract the dist value from coef, and write it to a data frame.
library(ecodist)
df <- data.frame(value = combn(names(grazelist), 2, function(x)
  MRM(dist(grazelist[[x[1]]]) ~ dist(grazelist[[x[2]]]), nperm = 1)$coef[[2]]))
and then write this to csv
write.csv(df, "/path/to/file/filename.csv", row.names = FALSE)
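If you also want to record which pair of matrices each value came from, one possible sketch (the pair column is an assumption, not part of the answer above):
pairs <- combn(names(grazelist), 2)  # same pair order as the combn call above
df$pair <- paste(pairs[1, ], pairs[2, ], sep = "_")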

Best way to feed which(,arr.ind=T) back into matrix in R?

I have extracted the array indices of some elements I want to look at, as follows:
mat = matrix(0, 10, 10)
arrInd = which(mat == 0, arr.ind = TRUE)
Then I do some more operations on this matrix and eventually end up with a vector of rows rowInd and a vector of columns colInd. I want to use these indices to insert values into another matrix, say mat2, but I can't seem to figure out a way to do this without looping or doing the modular arithmetic myself. I realize I could use something like
mat2[(colInd - 1) * nrow(mat2) + rowInd]
in order to transform back to 1-d indexing. But since R usually has built-in functions for this sort of thing, I was wondering if there is a more concise way. It would seem natural for such a handy data-manipulation function as which(, arr.ind = TRUE) to have a handy inverse.
I also tried using mat2[rowInd, colInd], but this did not work.
Have a read of R intro: indexing a matrix on the use of matrix indexing. which(, arr.ind = TRUE) returns a two-column matrix suitable for direct use in matrix indexing. For example:
A <- matrix(c(1L, 2L, 2L, 1L), 2)
iv <- which(A == 1L, arr.ind = TRUE)
#      row col
# [1,]   1   1
# [2,]   2   2
A[iv]
# [1] 1 1
If you have another matrix B whose values you want to update according to iv, just do
B[iv] <- replacement
Maybe for some reason you've separated the row index and column index into rowInd and colInd. In that case, just use
cbind(rowInd, colInd)
as the indexing matrix.
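For instance, a minimal sketch of that assignment, with placeholder indices and values:
B <- matrix(0, 10, 10)
rowInd <- c(1, 5, 7)                       # assumed example row indices
colInd <- c(2, 3, 9)                       # assumed example column indices
B[cbind(rowInd, colInd)] <- c(10, 20, 30)  # writes at (1,2), (5,3), (7,9)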

Merging Two Matrices

I've done a little digging for this, but most of the questions on here cover the cbind function and basic matrix concatenation. What I'm looking to do is a little more complicated.
Let's say, for example, I have an N x M matrix whose first column is a unique identifier for each of the rows (and which, luckily, in this instance is sorted by that identifier). For reasons that are inconsequential to this inquiry, I'm splitting the rows of this matrix into (n_i) x M matrices such that the sum of the n_i equals N.
I'm intending to run separate analysis on each of these sub-matrices and then combine the data together again with the usage of the unique identifier.
An example:
Let's say I have a matrix data which is 10 x M. After my split, I'll receive matrices subdata1 and subdata2. If you were to look at the contents of the matrices:
data[,1] = 1:10
subdata1[,1] = c(1,3,4,6,7)
subdata2[,1] = c(2,5,8,9,10)
I then manipulate the columns of subdata1 and subdata2, but preserve the information in the first column. I would like to combine these matrices again such that finaldata[,1] = 1:10, where finaldata is the result of the combination.
I realize I could use rbind and then sort the matrix, but for large matrices that is very inefficient.
I know R has some great functions out there for data management, is there a work around for this problem?
I may not fully understand your question, but as an example of general use, I would typically convert the matrices to data frames and then do something like this:
combi <- rbind(dataframe1, dataframe2)
If you know they are matrices, you can do this with multidimensional arrays:
X <- matrix(1:100, 10, 10)
s1 <- X[seq(1, 9, 2), ]
s2 <- X[seq(2, 10, 2), ]
XX <- array(NA, dim = c(2, 5, 10))
XX[1, , ] <- s1  # note the two commas, as it's a 3D array
XX[2, , ] <- s2
dim(XX) <- c(10, 10)
XX
This will copy each element of s1 and s2 into the appropriate slice of the array, then drop the extra dimension. There's a decent chance that rbind is actually faster, but this way you won't need to re-sort it.
Caveat: you need equal-sized splits for this approach; for unequal splits, see the sketch below.
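If the splits are unequal, one possible sketch is to use the identifier column itself as the row index (this assumes the identifiers are 1:N, as in the question):
finaldata <- matrix(NA, nrow = 10, ncol = ncol(subdata1))
finaldata[subdata1[, 1], ] <- subdata1  # rows land at positions 1, 3, 4, 6, 7
finaldata[subdata2[, 1], ] <- subdata2  # rows land at positions 2, 5, 8, 9, 10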

Build all possible 3-column matrices from 3 input matrices of different sizes

I have three different matrices:
m1, which has 12 rows and 5 columns;
m2, which has 12 rows and 4 columns; and
m3, which has 12 rows and 1 column.
I'm trying to build a series of 3-column matrices (p1 to p20) from this, such that in each p matrix:
p[,1] is taken from m1,
p[,2] is taken from m2, and
p[,3] is taken from m3.
I want the process to be exhaustive, so that I create all 20 possible 3-column matrices, so sampling m1, m2, and m3 (a solution I already tried) doesn't seem to work.
I tried half a dozen different for loops, but none of them accomplished what I wanted, and I played with some permutation functions, but couldn't figure out how to make them work in this context.
Ultimately, I'm trying to do this for an unknown number of input matrices, and since I'm still new to R, I have no other ideas about where to start. Any help the forum can offer will be appreciated.
## Example matrices
m1 <- matrix(1:4, nrow=2)
m2 <- matrix(1:6, nrow=2)
m3 <- matrix(1:2, nrow=2)
## A function that should do what you're after
f <- function(...) {
  mm <- list(...)
  ## all combinations of column indices, one index per input matrix
  ii <- expand.grid(lapply(mm, function(X) seq_len(ncol(X))))
  lapply(seq_len(nrow(ii)), function(Z) {
    mapply(FUN = function(X, Y) X[, Y], mm, ii[Z, ])
  })
}
## Try it out
f(m1)
f(m1,m2)
f(m1,m2,m3)
It looks like your problem can be split into two parts:
Create all valid combinations of indexes from 1:5, 1:4, and 1
Compute the matrices
For the first problem, consider a merge without common columns (also called a "cross join"):
merge(data.frame(a=1:5), data.frame(a=1:4), by=c())
Use a loop to construct a data frame as big as you need. EDIT: Or just use expand.grid, as suggested by Josh.
For the second problem, the alply function from the plyr package will be useful. It allows processing a matrix/data frame row by row and collects the results in a list (a list of matrices in your case):
alply(combinations, 1, function(x) { ... })
combinations is the data frame generated by expand.grid or the like. The function will be called once for each combination of indexes; x will contain a data frame with one row. The return values of that function will be collected into a list. A combined sketch is below.
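Putting the two parts together, a minimal sketch using the example matrices from the first answer (the index names i, j, and k are illustrative):
library(plyr)
## all valid column-index combinations for m1, m2, m3
combinations <- expand.grid(i = seq_len(ncol(m1)),
                            j = seq_len(ncol(m2)),
                            k = seq_len(ncol(m3)))
ps <- alply(combinations, 1, function(x)
  cbind(m1[, x$i], m2[, x$j], m3[, x$k]))  # one 3-column matrix per combination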

Mean of elements in a list of data.frames

Suppose I had a list of data.frames (of equal rows and columns)
dat1 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol=5))
all.dat <- list(dat1=dat1, dat2=dat2, dat3=dat3)
How can I return a single data.frame that holds the mean (or sum, etc.) of each element across the data.frames in the list (e.g., the mean of the element in the first row and first column of lists 1, 2, 3, and so on)? I have tried lapply and ldply from plyr, but these return the statistic for each data.frame within the list.
Edit: For some reason, this was retagged as homework. Not that it matters either way, but this is not a homework question. I just don't know why I can't get this to work. Thanks for any insight!
Edit2: For further clarification:
I can get the results using loops, but I was hoping there was a simpler and faster way, because my actual data is a list of 1000+ data.frames, each 12 rows by 100 columns.
z <- matrix(0, nrow(all.dat$dat1), ncol(all.dat$dat1))
for (l in 1:nrow(all.dat$dat1)) {
  for (m in 1:ncol(all.dat$dat1)) {
    z[l, m] <- mean(unlist(lapply(all.dat, `[`, i = l, j = m)))
  }
}
With a result of the means:
> z
            [,1]        [,2]        [,3]        [,4]       [,5]
[1,] -0.64185488  0.06220447 -0.02153806  0.83567173  0.3978507
[2,] -0.27953054 -0.19567085  0.45718399 -0.02823715  0.4932950
[3,]  0.40506666  0.95157856  1.00017954  0.57434125 -0.5969884
[4,]  0.71972821 -0.29190645  0.16257478 -0.08897047  0.9703909
[5,] -0.05570302  0.62045662  0.93427522 -0.55295824  0.7064439
I was wondering if there was a less clunky and faster way to do this. Thanks!
Here is a one-liner with plyr. You can replace mean with any other function you want.
ans1 = aaply(laply(all.dat, as.matrix), c(2, 3), mean)
You would have an easier time changing the data structure, combining the three two-dimensional matrices into a single three-dimensional array (using the abind library). Then the solution is more direct, using apply and specifying the dimensions to average over.
EDIT:
When I answered the question, it was tagged homework, so I just gave an approach. The original poster removed that tag, so I will take him/her at his/her word that it isn't.
library("abind")
all.matrix <- abind(all.dat, along=3)
apply(all.matrix, c(1,2), mean)
I gave one answer that uses a completely different data structure to achieve the result. This answer uses the data structure (list of data frames) given directly. I think it is less elegant, but wanted to provide it anyway.
Reduce(`+`, all.dat) / length(all.dat)
The logic is to add the data frames together element by element (which + will do with data frames), then divide by the number of data frames. Using Reduce is necessary since + can only take two arguments at a time (and addition is associative).
Another approach using only base functions to change the structure of the object:
listVec <- lapply(all.dat, c, recursive=TRUE)
m <- do.call(cbind, listVec)
Now you can calculate the mean with rowMeans or the median with apply:
means <- rowMeans(m)
medians <- apply(m, 1, median)
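Note that means here is a flat vector of length 25; assuming all the data.frames share the same dimensions, one way to restore the original shape is:
mean.mat <- matrix(means, nrow = nrow(dat1), ncol = ncol(dat1))  # back to 5 x 5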
I would take a slightly different approach:
library(plyr)
tmp <- ldply(all.dat) # convert to df
tmp$counter <- 1:5 # 1:12 for your actual situation
ddply(tmp, .(counter), function(x) colMeans(x[2:ncol(x)]))
Couldn't you just use nested lapply() calls?
This appears to give the correct result on my machine
mean.dat <- lapply(all.dat, function (x) lapply(x, mean, na.rm=TRUE))
