I've done a bit of digging for this, but most of the questions on here deal with the cbind function and basic matrix concatenation. What I'm looking to do is a little more complicated.
Let's say, for example, I have an NxM matrix whose first column is a unique identifier for each of the rows (and luckily in this instance is sorted by that identifier). For reasons which are inconsequential to this inquiry, I'm splitting the rows of this matrix into (n_i)xM matrices such that the sum of n_i = N.
I'm intending to run separate analyses on each of these sub-matrices and then combine the data together again using the unique identifier.
An example:
Let's say I have matrix data which is 10xM. After my split, I'll receive matrices subdata1 and subdata2. If you were to look at the contents of the matrices:
data[,1] = 1:10
subdata1[,1] = c(1,3,4,6,7)
subdata2[,1] = c(2,5,8,9,10)
I then manipulate the columns of subdata1 and subdata2, but preserve the information in the first column. I would like to combine these matrices again such that finaldata[,1] = 1:10, where finaldata is the result of the combination.
I realize I could use rbind and then sort the matrix, but for large matrices that is very inefficient.
I know R has some great functions out there for data management; is there a workaround for this problem?
I may not fully understand your question, but as an example of general use, I would typically convert the matrices to dataframes and then do something like this:
combi <- rbind(dataframe1, dataframe2)
If you know they are matrices, you can do this with multidimensional arrays:
X <- matrix(1:100, 10,10)
s1 <- X[seq(1, 9,2), ]
s2 <- X[seq(2,10,2), ]
XX <- array(NA, dim=c(2,5,10) )
XX[1, ,] <- s1 #Note two commas, as it's a 3D array
XX[2, ,] <- s2
dim(XX) <- c(10,10)
XX
This will copy each element of s1 and s2 into the appropriate slice of the array, then drop the extra dimension. There's a decent chance that rbind is actually faster, but this way you won't need to re-sort it.
Caveat: you need equal sized splits for this approach.
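If the identifiers really are the row positions 1:N, as in the example, a sketch of an alternative that avoids both sorting and the equal-split restriction is to write each sub-matrix straight back into place via its first column (this assumes the data, subdata1, and subdata2 objects from the question; for arbitrary ids you'd match() them against data[,1] first):
finaldata <- matrix(NA, nrow(data), ncol(data))  # empty shell, same shape as data
finaldata[subdata1[, 1], ] <- subdata1  # rows land at the positions named in column 1
finaldata[subdata2[, 1], ] <- subdata2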
I've got 2 dataframes, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one dataframe with every row in the other (i.e. 150x150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
#use outer to get all combinations of row numbers and apply a function to them
#22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              #you need a vectorized function
              #Vectorize takes care of that, but is just a hidden loop (slow for huge row numbers)
              FUN = Vectorize(function(i, j) cor(m1[i,], m2[j,])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
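Since cor correlates columns, transposing both matrices gives all row-by-row correlations in one vectorized call, which should be much faster than the outer/Vectorize approach. Continuing the example above:
cors2 <- cor(t(m1), t(m2))
dim(cors2)  # 150 150: one entry per pair of rows
hist(cors2)
all.equal(cors, cors2, check.attributes = FALSE)  # agrees with the outer() result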
I have three different matrices:
m1, which has 12 rows and 5 columns;
m2, which has 12 rows and 4 columns; and
m3, which has 12 rows and 1 column.
I'm trying to build a series of 3-column matrices (p1 to p20) from this, such that in each p matrix:
p[,1] is taken from m1,
p[,2] is taken from m2, and
p[,3] is taken from m3.
I want the process to be exhaustive, so that I create all 20 possible 3-column matrices, so sampling m1, m2, and m3 (a solution I already tried) doesn't seem to work.
I tried half a dozen different for loops, but none of them accomplished what I wanted, and I played with some permutation functions, but couldn't figure out how to make them work in this context.
Ultimately, I'm trying to do this for an unknown number of input matrices, and since I'm still new to R, I have no other ideas about where to start. Any help the forum can offer will be appreciated.
## Example matrices
m1 <- matrix(1:4, nrow=2)
m2 <- matrix(1:6, nrow=2)
m3 <- matrix(1:2, nrow=2)
## A function that should do what you're after
f <- function(...) {
    mm <- list(...)
    ii <- expand.grid(lapply(mm, function(X) seq_len(ncol(X))))
    lapply(seq_len(nrow(ii)), function(Z) {
        mapply(FUN=function(X, Y) X[,Y], mm, ii[Z,])
    })
}
## Try it out
f(m1)
f(m1,m2)
f(m1,m2,m3)
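With the example matrices above (2, 3, and 1 columns), the last call should return all 2 * 3 * 1 = 6 column combinations; for the 5-, 4-, and 1-column matrices in the question it would return the full 20:
length(f(m1, m2, m3))  # 6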
It looks like your problem can be split into two parts:
Create all valid combination of indexes from 1:5, 1:4 and 1
Compute the matrices
For the first problem, consider a merge without common columns (also called a "cross join"):
merge(data.frame(a=1:5), data.frame(b=1:4), by=c())
Use a loop to construct a data frame as big as you need. EDIT: Or just use expand.grid, as suggested by Josh.
For the second problem, the alply function from the plyr package will be useful. It allows processing a matrix/data frame row by row and collects the results in a list (a list of matrices in your case):
alply(combinations, 1, function(x) { ... })
combinations is the data frame generated by expand.grid or the like. The function will be called once for each combination of indexes, x will contain a data frame with one row. The return values of that function will be collected into a list.
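Putting the two parts together, here's a minimal sketch reusing the example matrices m1, m2, and m3 from the previous answer (the index-column names i, j, and k are just for illustration):
library(plyr)
## all valid column-index combinations
combinations <- expand.grid(i = seq_len(ncol(m1)),
                            j = seq_len(ncol(m2)),
                            k = seq_len(ncol(m3)))
## one 3-column matrix per row of the index grid
ps <- alply(combinations, 1, function(x) cbind(m1[, x$i], m2[, x$j], m3[, x$k]))
length(ps)  # 2 * 3 * 1 = 6 for the example matrices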
I have two dataframes and I would like to do independent 2-group t-tests on the rows (i.e. t.test(y1, y2) where y1 is a row in dataframe1 and y2 is the matching row in dataframe2).
What's the best way of accomplishing this?
EDIT:
I just found the format: dataframe1[i,] dataframe2[i,]. This will work in a loop. Is that the best solution?
The approach you outlined is reasonable; just make sure to preallocate your storage vector. I'd double-check that you really want to compare the rows instead of the columns. Most datasets I work with have each row as a unit of observation and the columns as the separate responses of interest. Regardless, it's your data - so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq(nrow(df1))){
    testresults[[j]] <- t.test(df1[j,], df2[j,])
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
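For example, to pull the p-value out of each htest object in that list:
pvals <- sapply(testresults, function(x) x$p.value)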
It would make more sense to have your data stored as columns.
You can transpose a data.frame by
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time
t.test_results <- mapply(t.test, x= df1_t, y = df2_t, SIMPLIFY = F)
Or you could use Map, which is a simple wrapper for mapply with SIMPLIFY = F (thus saving keystrokes!):
t.test_results <- Map(t.test, x = df1_t, y = df2_t)
I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the apply function I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one result per pixel (column)
out <- numeric(n)
for(i in 1:n) {
    out[i] <- snow_free(doy_stack[,i], snow_stack[,i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (they don't have to be exactly the same). For instance, one could run a statistical test row-wise with differing numbers of columns in each matrix, as sketched below.
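A quick sketch of that case, reusing X from above together with a new 10x8 matrix (only the row counts need to match):
set.seed(3)
Z <- matrix(runif(80), 10, 8)
## row-wise Welch t-tests between a 10x10 and a 10x8 matrix
pvals <- mapply(function(x, y) t.test(x, y)$p.value,
                as.data.frame(t(X)), as.data.frame(t(Z)))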
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (e.g. ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
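A tiny illustration of that raster algebra, with two made-up layers built from matrices (the names and grid size are hypothetical):
library(raster)
r1 <- raster(matrix(runif(100), 10))
r2 <- raster(matrix(runif(100), 10))
r3 <- r1^2 + r2  # arithmetic applies cell-wise across the whole rasters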
apply can work on higher-dimensional arrays, so if you first bind the two stacks into a 3D array, something like this might be what you are looking for:
arr <- array(c(doy_stack, snow_stack), dim = c(dim(doy_stack), 2))
apply(arr, c(1,2), function(x) r_part(x[1], x[2]))
merge is a very nice function: it merges matrices and data.frames, and returns a data.frame.
Having rather big character matrices, is there another good way to merge without data.frame conversion?
A small function to merge a named vector with a matrix or data.frame. Elements of the vector can link to multiple entries in the matrix:
expand <- function(v, m, by.m, v.name='v', ...) {
    df <- do.call(rbind, lapply(names(v), function(x) {
        pos <- which(m[,by.m] %in% v[x])
        cbind(x, m[pos,], ...)
    }))
    colnames(df)[1] <- v.name
    df
}
Example:
v <- rep(letters,each=3)[seq_along(letters)]
names(v) <- letters
m <- data.frame(a=unique(v),b=seq_along(unique(v)),stringsAsFactors=F)
expand(v,m,'a')
You can use a combination of match and cbind to do the equivalent of merge without conversion to data frame, a simple example:
st1 <- state.x77[ sample(1:50), ]
st2 <- as.matrix( USArrests )[ sample(1:50), ]
tmp1 <- match(rownames(st1), rownames(st2) )
st3 <- cbind( st1, st2[tmp1,] )
head(st3)
Keeping track of which columns you want, and merging with many-to-1 relationships or missing rows in one group, require a bit more thought but are still possible; a sketch of the missing-rows case follows.
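Continuing the example above: match() returns NA for rownames of st1 that are absent from st2, and indexing a matrix with NA yields a row of NAs, which mimics merge(..., all.x=TRUE):
st2small <- st2[1:40, ]  # drop 10 states from one side
idx <- match(rownames(st1), rownames(st2small))
st4 <- cbind(st1, st2small[idx, ])
sum(is.na(st4))  # the unmatched rows are filled with NAs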
No, not without either (a) overwriting the merge function or (b) creating a new merge.matrix() S3 function (this would be the right approach to the problem).
You can see in the merge help:
Value
A data frame.
Also, the merge.default function:
> merge.default
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
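A minimal sketch of what option (b) might look like, building on the match/cbind idea from the earlier answer (merge is already an S3 generic, so a merge.matrix method would dispatch automatically; the 1:1 rownames behavior here is illustrative, not an existing API):
merge.matrix <- function(x, y, ...) {
    ## 1:1 match on rownames; returns a matrix, not a data.frame
    idx <- match(rownames(x), rownames(y))
    cbind(x, y[idx, , drop=FALSE])
}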
There is now a merge.Matrix function in the Matrix.utils package. This works on combinations of matrices as well as capital M Matrices, data.frames, etc.
The match solution is nice, but as someone pointed out, it does not work on m:n relationships. It also does not implement the other features of merge, including all.x, all.y, etc.