Bootstrapping two datasets in R

I have two dataframes as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) to form two groups (X' with 5 genotypes and Y' with 5 genotypes). In the end I will have a thousand such datasets X' and Y', each containing 5 random genotypes from the full expression dataset.
I tried using replicate and apply, but it did not work:
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))

I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each X' and Y'):
colnums_boot <- replicate(1000,sample.int(200,10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of a matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
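Put together, a minimal sketch of that per-iteration approach (the body is just a placeholder; fill in whatever you actually compute from each pair):
results <- lapply(1:1000, function(i) {
  Xprime <- X[, colnums_boot[1:5, i]]
  Yprime <- X[, colnums_boot[6:10, i]]
  ## ... compute whatever you need from Xprime and Yprime here ...
  list(x = Xprime, y = Yprime)
})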
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(,dim=c(10,5,1000))
Xprimes[] <- Z[,colnums_boot[1:5,]]
Yprimes <- array(,dim=c(10,5,1000))
Yprimes[] <- Z[,colnums_boot[6:10,]]
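Each bootstrap replicate is then one slice along the third dimension, for example:
Xprimes[, , 1]   # the 10 x 5 matrix X' for the first replicate
Yprimes[, , 1]   # the matching Y'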

Related

R: Performance issue when finding maximum of split list

When trying to find the maximum values of a split list, I run into serious performance issues.
Is there a way I can optimize the following code:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind(y, x)
my_data <- data.frame(my_data)
# This is the critical part I would like to optimize
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
I want to get the rows where a given column hits its maximum for a given group (it should be easier to understand from the code).
I know that splitting into a list is probably the reason for the slow performance, but I don't know how to circumvent it.
This may not be immediately obvious, but base R has a function, max.col, that does something similar, except that it finds the position index of the maximum along each matrix row (not column). So if you transpose your original matrix x, you will be able to use this function.
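A tiny illustration of max.col on made-up data (just to show what it returns):
m <- rbind(c(1, 9, 3),
           c(7, 2, 5))
max.col(m)   # 2 1 -- the column index of each row's maximum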
Complexity steps in when you want to do max.col by group; the split-lapply convention is then needed. But if, after the transpose, we convert the matrix to a data frame, we can use split.default. (Note it is not split or split.data.frame: here the data frame is treated as a list (vector), so the split happens among the data frame columns.) Finally, we use an sapply to apply max.col by group, which cbinds the results into a matrix.
tx <- data.frame(t(x))
tx.group <- split.default(tx, y) ## note the `split.default`, not `split`
pos <- sapply(tx.group, max.col)
The resulting pos is something like a look-up table: it has 9000 rows and 100 columns (groups). pos[i, j] gives the index you want for the i-th column (of your original non-transposed matrix) and the j-th group. So your final extraction for the 50th column and all groups is
max_values <- Map("[[", tx.group, pos[50, ])
You generate the look-up table only once and can then make arbitrary extractions at any time.
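For example, the same look-up table immediately answers the question for, say, column 100 as well:
max_values_100 <- Map("[[", tx.group, pos[100, ])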
Disadvantage of this method:
After the split, data in each group are stored in a data frame rather than a matrix. That is, for example, tx.group[[1]] is a 9000 x 9 data frame. But max.col expects a matrix so it will convert this data frame into a matrix internally.
Thus, the major performance / memory overhead includes:
initial matrix transposition;
matrix to data frame conversion;
data frame to matrix conversion (per group).
I am not sure whether we can eliminate all of the above with functions from the matrixStats package; I look forward to seeing a solution along those lines.
But anyway, this answer is already much faster than what the OP originally does.
A solution using {dplyr}:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind.data.frame(y, x)
# This is the critical part I would like to optimize
system.time({
  my_data_split <- split(my_data, y)
  max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
})
# Using {dplyr} is 9 times faster, but you get results in a slightly different format
library(dplyr)
system.time({
  max_values2 <- my_data %>%
    group_by(y) %>%
    do(max_values = .[which.max(.[[50]]), ])
})
all.equal(max_values[[1]], max_values2$max_values[[1]], check.attributes = FALSE)
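If you later want those rows in one ordinary data frame rather than a list-column, something along these lines should work (a small follow-up, not part of the timing above):
max_values_df <- dplyr::bind_rows(max_values2$max_values)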

Resampling a dataset of x rows

I have a data set of x entries, and I need to resample it to y entries, with y being a number smaller than x. My data set is not a series of numbers but rather x rows, and I need the entire row of information when resampling.
I am aware of the sample() function but given that my dataset is not a vector I am unclear how the exact code should be written.
Any help would be appreciated!
The idea is that you want to sample the index of rows, then use that to pull back all of the columns for those rows, like so:
set.seed(4444) # for reproducibility
data(iris)
x <- nrow(iris)
y <- 7
irisSubset <- iris[sample(x,y),]
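The same pattern works on your own data frame; yourData below is just a placeholder for whatever object you have:
y <- 25                                            # target number of rows
mySample <- yourData[sample(nrow(yourData), y), ]  # keeps whole rows together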

applying a function to a list of dataframes

I have the following data:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))#### the dataset
The following code creates 1000 bootstrapped datasets "x" and 1000 bootstrapped datasets "y" with 5 columns each.
colnums_boot <- replicate(1000,sample.int(200,10))
output <- lapply(1:1000, function(i){
  Xprime <- X[, colnums_boot[1:5, i]]
  Yprime <- X[, colnums_boot[6:10, i]]
  xy <- list(x = Xprime, y = Yprime)
})
I obtained a list of lists of dataframes, "xy", to which I would like to apply the following code, but I do not understand the list indexing operations.
From the output "xy": considering the first list element, output[[1]], which has $x and $y, I would like to apply, separately,
X <- cor(output[[1]]$x)
Y <- cor(output[[1]]$y)
and then
sapply(1:10, function(row) cor(X[row,], Y[row,]))
which will give me a single value for each row, "r1", for list element [1].
I would like to apply this to the entire list and obtain r1, r2, ... from list[1], list[2] and so on up to 1000, and make it into a data frame in the end. It will be a ten-by-thousand data frame.
I can't find the question where I wrote that Xprime, Yprime bit; I hope you didn't delete it...? If I remember correctly, I suggested this, since it is much more efficient to deal with matrices:
Z <- as.matrix(X)
Xprime2 <- array(,dim=c(10,5,1000))
Yprime2 <- array(,dim=c(10,5,1000))
Xprime2[] <- Z[,colnums_boot[1:5,]]
Yprime2[] <- Z[,colnums_boot[6:10,]]
Anyway, in your setup, as @KarlForner commented, this will get you the correlations between the X and Y columns:
lapply(output,function(ll) cor(ll$x,ll$y))
This is also potentially inefficient when bootstrapping, since you would be computing correlations among the same 200 vectors over and over. I think it makes more sense to just compute them all up front with cor(X) and then grab the values from there...
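A rough sketch of that "compute once, then index" idea, using the Z and colnums_boot objects from above (all pairwise column correlations are computed a single time, then each bootstrap pair just indexes the relevant 5 x 5 block):
C <- cor(Z)   # 200 x 200 correlations among all genotype columns
cors_boot <- lapply(1:1000, function(i) {
  C[colnums_boot[1:5, i], colnums_boot[6:10, i]]
})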
As far as putting that into a data.frame, I'm not clear on what that would mean.
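If the ten-by-thousand object you describe is what you are after, here is one guess at a sketch. It assumes you intended row-wise correlation matrices, i.e. cor(t(...)), since cor(ll$x) itself is only 5 x 5 and has no rows 6 to 10:
r_mat <- sapply(output, function(ll) {
  Cx <- cor(t(ll$x))   # 10 x 10
  Cy <- cor(t(ll$y))   # 10 x 10
  sapply(1:10, function(row) cor(Cx[row, ], Cy[row, ]))
})
dim(r_mat)             # 10 x 1000
r_df <- as.data.frame(r_mat)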

dataframe (product) correlations in R

I've got 2 dataframes, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one dataframe with every row in the other (i.e. 150 x 150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
#use outer to get all combinations of row numbers and apply a function to them
#22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              # you need a vectorized function;
              # Vectorize takes care of that, but it is just a hidden loop (slow for huge row numbers)
              FUN = Vectorize(function(i, j) cor(m1[i, ], m2[j, ])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
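For the record, this returns the same 150 x 150 matrix as the outer() approach (row i of m1 against row j of m2), so the distribution can be plotted directly:
cors2 <- cor(t(m1), t(m2))
all.equal(cors, cors2, check.attributes = FALSE)   # should be TRUE
hist(cors2)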

regarding random number generation in a sequential sampling process

I have a data list, like
12345
23456
67891
-20000
200
600
20
...
Assume the size of this data set (i.e. the number of lines in the file) is N. I want to randomly draw m lines from this data file and output them into one file, and put the remaining N-m lines into another data file. I can randomly draw an index over m iterations to get those m lines. The issue that confuses me is how to ensure that the randomly drawn m lines are all different.
Is there a way to do that in R?
Yes, use sample(N, size=m, replace=FALSE) to get a random sample of m out of N without replacement. Or just sample(N, m) since replace=FALSE is the default.
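Applied to the file-splitting part of the question, a minimal sketch might look like this (the file names are just placeholders):
lines <- readLines("mydata.txt")     # placeholder input file
N <- length(lines)
m <- 3                               # however many lines you want to draw
keep <- sample(N, m)                 # m distinct line numbers, no repeats
writeLines(lines[keep],  "sampled.txt")
writeLines(lines[-keep], "remaining.txt")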
I'm not entirely sure I understand the question, but here is one way to sample without replacement from a vector and then split that vector into two based on the sampling. This could be easily extended to other data types (e.g., data.frame).
## Example data vector.
X <- c(12345, 23456, 67891, -20000, 200, 600, 20)
## Length of data.
N <- length(X)
## Sample from the data indices, without replacement.
sampled.idx <- sample(1:N, 2, replace=FALSE)
## Select the sampled data elements.
(sampled <- X[sampled.idx])
## Select the non-sampled data elements.
(rest <- X[!(1:N %in% sampled.idx)])
## Update: A better way to do the last step.
## Thanks to @PLapointe's comment below.
(rest <- X[-sampled.idx])
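The same indices extend directly to a data frame, which keeps whole rows together; here df is just a small assumed example built from the vector above:
df <- data.frame(value = X, id = seq_along(X))
(sampled.rows <- df[sampled.idx, ])
(rest.rows    <- df[-sampled.idx, ])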
