I have the following data:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))  # the dataset
The following code creates 1000 bootstrapped datasets "x" and 1000 bootstrapped datasets "y" with 5 columns each.
colnums_boot <- replicate(1000,sample.int(200,10))
output <- lapply(1:1000, function(i) {
  Xprime <- X[, colnums_boot[1:5, i]]
  Yprime <- X[, colnums_boot[6:10, i]]
  xy <- list(x = Xprime, y = Yprime)
})
I obtained a list of lists of data frames (each element is an "xy" list) to which I would like to apply the code below, but I do not understand the list indexing operations.
Considering the first element of the output, [1], which has $x and $y, I would like to apply the code:
X= cor($x)
Y= cor($y) separately and then
sapply(1:10, function(row) cor(X[row,], Y[row,]))
which will give me a single value for each row, "r1", for list [1].
I would like to apply this to the entire list, obtaining r1, r2, and so on from list [1], list [2], up to list [1000], and combine the results into a data frame. It should be a 10 by 1000 data frame in the end.
I can't find the question where I wrote that Xprime, Yprime bit; I hope you didn't delete it...? If I remember correctly, I suggested this, since it is much more efficient to deal with matrices:
Z <- as.matrix(X)
Xprime2 <- array(,dim=c(10,5,1000))
Yprime2 <- array(,dim=c(10,5,1000))
Xprime2[] <- Z[,colnums_boot[1:5,]]
Yprime2[] <- Z[,colnums_boot[6:10,]]
Anyway, in your setup, as @KarlForner commented, this will get you the correlations between the X and Y columns:
lapply(output,function(ll) cor(ll$x,ll$y))
This is also potentially inefficient when bootstrapping, since you will be computing correlations among the same 200 vectors. I think it makes more sense to just compute them up front cor(X) and then grab the values from there...
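A minimal sketch of the "compute up front" idea, assuming the X and colnums_boot objects from the question. The names C, cors and direct are placeholders introduced here, not part of the original code.

```r
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow = 10))
colnums_boot <- replicate(1000, sample.int(200, 10))

# All pairwise column correlations, computed once (200 x 200)
C <- cor(X)

# For each bootstrap draw, pick out the 5 x 5 block of X'-vs-Y' correlations
cors <- lapply(seq_len(ncol(colnums_boot)), function(i) {
  C[colnums_boot[1:5, i], colnums_boot[6:10, i]]
})

# Sanity check against the direct computation for the first draw
direct <- cor(X[, colnums_boot[1:5, 1]], X[, colnums_boot[6:10, 1]])
all.equal(cors[[1]], direct)
```

Each element of cors is a 5 x 5 matrix, the same shape that cor(ll$x, ll$y) returns, but no correlation is computed more than once.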
As far as putting that into a data.frame, I'm not clear on what that would mean.
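One possible reading of the request, which may not be what was intended, is to correlate row r of x with row r of y (across the five sampled columns) for every bootstrap draw. Under that assumption, a sketch that yields a 10 by 1000 data frame could look like this; row_cors and result are names introduced here.

```r
set.seed(1)
# Work on a matrix so that a single row is a plain numeric vector
X <- as.matrix(data.frame(matrix(rnorm(2000), nrow = 10)))
colnums_boot <- replicate(1000, sample.int(200, 10))

row_cors <- sapply(seq_len(ncol(colnums_boot)), function(i) {
  xs <- X[, colnums_boot[1:5, i]]   # 10 x 5
  ys <- X[, colnums_boot[6:10, i]]  # 10 x 5
  # one correlation per row (gene), each based on 5 value pairs
  sapply(seq_len(nrow(X)), function(r) cor(xs[r, ], ys[r, ]))
})

result <- as.data.frame(row_cors)
dim(result)
```

Here each column of result holds the ten row-wise correlations for one bootstrap draw.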
So, I'm trying to generate random numbers from multivariate normal distributions with different means. I'm also trying to use the apply functions and not for loops, which is where the problem occurs. Here is my code:
library(MASS)
set.seed(123)
# X and Y means
Means<-cbind(c(.2,.2,.8),c(.2,.6,.8))
Means
Sigma<-matrix(c(.01,0,0,.01),nrow=2)
Sigma
data<-apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
data
Instead of getting two vectors with the X and Y points for the three means, I get three vectors with the X and Y points stacked. What is the best way to get the two vectors? I know I could unstack them manually, but I feel R should have some slick way of getting this done.
I'm not sure if it's what I would call 'slick', but if you really want to use apply (instead of lapply as previously mentioned), you can force apply to return your results as a list of matrices. Then it's just a matter of sticking the results together. I expect that this would be less error-prone than trying to rebuild a two-column matrix.
data <- apply(Means, 1, function(x) {
list(mvrnorm(n=10, mu=x, Sigma=Sigma))
})
data <- do.call('rbind', unlist(data, recursive=FALSE))
Try:
set.seed(42)
res1 <- lapply(seq_len(nrow(Means)), function(i) mvrnorm(n=10, mu=Means[i,], Sigma=Sigma))
Checking with the results of apply
set.seed(42)
res2 <- apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
dim(res2) <- c(10,2, 3)
res3 <-lapply(1:dim(res2)[3], function(i) res2[,,i])
all.equal(res3, res1, check.attributes=FALSE)
#[1] TRUE
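If the goal is a single two-column matrix (X in one column, Y in the other), one possible follow-up is to stack the per-mean matrices with rbind. The sketch below assumes the Means and Sigma from the question; stacked and the column labels "X" and "Y" are introduced here for illustration.

```r
library(MASS)

set.seed(42)
Means <- cbind(c(.2, .2, .8), c(.2, .6, .8))
Sigma <- matrix(c(.01, 0, 0, .01), nrow = 2)

# One 10 x 2 matrix of draws per row of Means
res1 <- lapply(seq_len(nrow(Means)), function(i) {
  mvrnorm(n = 10, mu = Means[i, ], Sigma = Sigma)
})

# Stack the three 10 x 2 matrices into one 30 x 2 matrix
stacked <- do.call(rbind, res1)
colnames(stacked) <- c("X", "Y")
dim(stacked)
```

Rows 1 to 10 of stacked come from the first mean, rows 11 to 20 from the second, and rows 21 to 30 from the third.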
I've done a little bit of digging for this result but most of the questions on here have information in regards to the cbind function, and basic matrix concatenation. What I'm looking to do is a little more complicated.
Let's say, for example, I have an NxM matrix whose first column is a unique identifier for each of the rows (and luckily in this instance is sorted by that identifier). For reasons which are inconsequential to this inquiry, I'm splitting the rows of this matrix into (n_i)xM matrices such that the sum of n_i = N.
I'm intending to run separate analysis on each of these sub-matrices and then combine the data together again with the usage of the unique identifier.
An example:
Let's say I have matrix data which is 10xM. After my split, I'll receive matrices subdata1 and subdata2. If you were to look at the contents of the matrices:
data[,1] = 1:10
subdata1[,1] = c(1,3,4,6,7)
subdata2[,1] = c(2,5,8,9,10)
I then manipulate the columns of subdata1 and subdata2, but preserve the information in the first column. I would like to combine these matrices again such that finaldata[,1] = 1:10, where finaldata is the result of the combination.
I realize now that I could use rbind and then sort the matrix, but for large matrices that is very inefficient.
I know R has some great functions for data management. Is there a workaround for this problem?
I may not fully understand your question, but as an example of general use, I would typically convert the matrices to dataframes and then do something like this:
combi <- rbind(dataframe1, dataframe2)
If you know they are matrices, you can do this with multidimensional arrays:
X <- matrix(1:100, 10,10)
s1 <- X[seq(1, 9,2), ]
s2 <- X[seq(2,10,2), ]
XX <- array(NA, dim=c(2,5,10) )
XX[1, ,] <- s1 #Note two commas, as it's a 3D array
XX[2, ,] <- s2
dim(XX) <- c(10,10)
XX
This will copy each element of s1 and s2 into the appropriate slice of the array, then drop the extra dimension. There's a decent chance that rbind is actually faster, but this way you won't need to re-sort it.
Caveat: you need equal sized splits for this approach.
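For unequal splits, one alternative is to preallocate the result and use the identifier column as a row index, which avoids both rbind and sorting. This is only a sketch, and it assumes the identifiers are exactly the row positions 1..N, as in the question's example. The column names id and value are made up for illustration.

```r
# Toy data: first column is the identifier, second is some payload
data <- cbind(id = 1:10, value = (1:10) * 10)

# Unequal or irregular splits are fine here
subdata1 <- data[c(1, 3, 4, 6, 7), , drop = FALSE]
subdata2 <- data[c(2, 5, 8, 9, 10), , drop = FALSE]

# Some per-split manipulation that leaves the id column untouched
subdata1[, "value"] <- subdata1[, "value"] + 1
subdata2[, "value"] <- subdata2[, "value"] + 2

# Preallocate, then write each split back into the rows named by its ids
finaldata <- matrix(NA_real_, nrow = nrow(data), ncol = ncol(data),
                    dimnames = list(NULL, colnames(data)))
finaldata[subdata1[, "id"], ] <- subdata1
finaldata[subdata2[, "id"], ] <- subdata2

finaldata[, "id"]
```

If the identifiers are not consecutive integers starting at 1, something like match(subdata1[, "id"], data[, "id"]) could supply the row positions instead.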
I have two dataframes as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) and form two groups of datasets (X' should have 5 genotypes and Y' should have 5 genotypes). Basically, in the end I will have thousand such datasets X' and Y' which will contain 5 random genotypes each from the full expression dataset.
I tried using replicate and apply, but it did not work.
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))
I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each X' and Y'):
colnums_boot <- replicate(1000,sample.int(200,10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(,dim=c(10,5,1000))
Xprimes[] <- Z[,colnums_boot[1:5,]]
Yprimes <- array(,dim=c(10,5,1000))
Yprimes[] <- Z[,colnums_boot[6:10,]]
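As a quick check of how the array is laid out, the sketch below confirms that slice i of Xprimes matches what the per-iteration subsetting would give. It assumes the X and colnums_boot defined above; the index i = 3 is arbitrary.

```r
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow = 10))
colnums_boot <- replicate(1000, sample.int(200, 10))

Z <- as.matrix(X)
Xprimes <- array(NA_real_, dim = c(10, 5, 1000))
# Z[, m] with a 5 x 1000 index matrix m yields a 10 x 5000 matrix,
# which fills the 10 x 5 x 1000 array in column-major order
Xprimes[] <- Z[, colnums_boot[1:5, ]]

i <- 3
# unname() drops the column names that Z carries over from the data frame
identical(Xprimes[, , i], unname(Z[, colnums_boot[1:5, i]]))
```

So Xprimes[, , i] can be used wherever Xprime for iteration i was needed.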
I've got 2 data frames, each with 150 rows and 10 columns, plus column and row IDs. I want to correlate every row in one data frame with every row in the other (i.e. 150x150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
#use outer to get all combinations of row numbers and apply a function to them
#22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              #you need a vectorized function
              #Vectorize takes care of that, but is just a hidden loop (slow for huge row numbers)
              FUN=Vectorize(function(i,j) cor(m1[i,],m2[j,])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
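This works because cor with two matrix arguments correlates the columns of the first with the columns of the second, so transposing turns rows into columns. A small sketch comparing it with the outer approach, using the same simulated data (cors_fast and cors_loop are names introduced here):

```r
set.seed(42)
m1 <- matrix(rnorm(1500), 150)
m2 <- matrix(runif(1500), 150)

# Entry [i, j] is the correlation of row i of m1 with row j of m2
cors_fast <- cor(t(m1), t(m2))

# Same result via an explicit (hidden) loop over all row pairs
cors_loop <- outer(seq_len(nrow(m1)), seq_len(nrow(m2)),
                   Vectorize(function(i, j) cor(m1[i, ], m2[j, ])))

dim(cors_fast)
all.equal(cors_fast, cors_loop)
```

hist(cors_fast) then plots the distribution of all 22500 values.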
I have three different matrices:
m1, which has 12 rows and 5 columns;
m2, which has 12 rows and 4 columns; and
m3, which has 12 rows and 1 column.
I'm trying to build a series of 3-column matrices (p1 to p20) from this, such that in each p matrix:
p[,1] is taken from m1,
p[,2] is taken from m2, and
p[,3] is taken from m3.
I want the process to be exhaustive, so that I create all 20 possible 3-column matrices, so sampling m1, m2, and m3 (a solution I already tried) doesn't seem to work.
I tried half a dozen different for loops, but none of them accomplished what I wanted, and I played with some permutation functions, but couldn't figure out how to make them work in this context.
Ultimately, I'm trying to do this for an unknown number of input matrices, and since I'm still new to R, I have no other ideas about where to start. Any help the forum can offer will be appreciated.
## Example matrices
m1 <- matrix(1:4, nrow=2)
m2 <- matrix(1:6, nrow=2)
m3 <- matrix(1:2, nrow=2)
## A function that should do what you're after
f <- function(...) {
  mm <- list(...)
  ii <- expand.grid(lapply(mm, function(X) seq_len(ncol(X))))
  lapply(seq_len(nrow(ii)), function(Z) {
    mapply(FUN=function(X, Y) X[,Y], mm, ii[Z,])
  })
}
## Try it out
f(m1)
f(m1,m2)
f(m1,m2,m3)
It looks like your problem can be split into two parts:
1. Create all valid combinations of indexes from 1:5, 1:4 and 1
2. Compute the matrices
For the first problem, consider a merge without common columns (also called a "cross join"):
merge(data.frame(a=1:5), data.frame(a=1:4), by=c())
Use a loop to construct a data frame as big as you need. EDIT: Or just use expand.grid, as suggested by Josh.
For the second problem, the alply function from the plyr package will be useful. It allows processing a matrix/data frame row by row and collects the results in a list (a list of matrices in your case):
alply(combinations, 1, function(x) { ... })
combinations is the data frame generated by expand.grid or the like. The function will be called once for each combination of indexes, and x will contain a data frame with one row. The return values of that function will be collected into a list.
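A base-R sketch of the two steps, with lapply standing in for alply so that no extra package is needed. The matrices are random placeholders with the dimensions from the question; the column names i1, i2, i3 and the result name p are introduced here.

```r
set.seed(1)
m1 <- matrix(rnorm(12 * 5), nrow = 12)  # 12 x 5
m2 <- matrix(rnorm(12 * 4), nrow = 12)  # 12 x 4
m3 <- matrix(rnorm(12 * 1), nrow = 12)  # 12 x 1

# Step 1: every combination of column indexes (5 * 4 * 1 = 20 rows)
combinations <- expand.grid(i1 = seq_len(ncol(m1)),
                            i2 = seq_len(ncol(m2)),
                            i3 = seq_len(ncol(m3)))

# Step 2: build one 12 x 3 matrix per combination
p <- lapply(seq_len(nrow(combinations)), function(k) {
  cbind(m1[, combinations$i1[k]],
        m2[, combinations$i2[k]],
        m3[, combinations$i3[k]])
})

length(p)
dim(p[[1]])
```

With plyr loaded, the lapply call could be swapped for alply(combinations, 1, ...) as described above; the function f in the earlier answer generalizes the same idea to any number of input matrices.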