I have a data list, like
12345
23456
67891
-20000
200
600
20
...
Assume the size of this data set (i.e. the number of lines in the file) is N. I want to randomly draw m lines from this data file and write them to one file, and put the remaining N-m lines into another file. I could draw a random index in each of m iterations to get those m lines. What confuses me is how to ensure that the m randomly drawn lines are all different.
Is there a way to do that in R?
Yes, use sample(N, size=m, replace=FALSE) to get a random sample of m out of N without replacement. Or just sample(N, m) since replace=FALSE is the default.
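Applied to the file-splitting task in the question, this could look like the sketch below. The file names and the value of m are placeholders (temporary files are used so the example is self-contained), and it assumes one value per line.

```r
## Hypothetical input file; a temporary file stands in for the real data file.
infile <- tempfile(fileext = ".txt")
writeLines(c("12345", "23456", "67891", "-20000", "200", "600", "20"), infile)

lines <- readLines(infile)
N <- length(lines)
m <- 3                                   # number of lines to draw (assumed)

## Draw m distinct line numbers; sampling without replacement never repeats.
idx <- sample(N, m)

## Logical mask, so the split also works when m is 0.
picked <- seq_len(N) %in% idx

## Hypothetical output files for the sample and for the remainder.
sampled_file <- tempfile(fileext = ".txt")
rest_file    <- tempfile(fileext = ".txt")
writeLines(lines[picked],  sampled_file)   # sampled lines, in file order
writeLines(lines[!picked], rest_file)      # the remaining N - m lines
```

Using a logical mask keeps the sampled lines in their original file order; indexing with lines[idx] instead would keep them in the order they were drawn.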
I'm not entirely sure I understand the question, but here is one way to sample without replacement from a vector and then split that vector into two based on the sampling. This could be easily extended to other data types (e.g., data.frame).
## Example data vector.
X <- c(12345, 23456, 67891, -20000, 200, 600, 20)
## Length of data.
N <- length(X)
## Sample from the data indices, without replacement.
sampled.idx <- sample(1:N, 2, replace=FALSE)
## Select the sampled data elements.
(sampled <- X[sampled.idx])
## Select the non-sampled data elements.
(rest <- X[!(1:N %in% sampled.idx)])
## Update: A better way to do the last step.
## Thanks to @PLapointe's comment below.
(rest <- X[-sampled.idx])
I have a vector on which I want to do block resampling to get, say, 1000 samples of the same size as the vector, and then save all these samples in a list.
This is the code that performs normal resampling, i.e. randomly draws one observation per time, and saves the result in a list:
myvector <- c(1:200)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- sample(myvector, length(myvector), replace=TRUE)
}
I need code that does exactly the same thing, except that instead of drawing single observations it draws blocks of observations (let's use blocks of size 5).
I know there are packages that perform bootstrap operations, but I don't need statistics or confidence intervals or anything, just all the samples in a list. Both overlapping and non-overlapping blocks are ok, so the code for just one of the two procedures is enough. Of course, if you are so kind to give me the code for both it's appreciated. Thanks to anybody who can help me with this.
Not sure how you're wanting to store the final structure.
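The following takes a block dimension and draws blocks of that size from your vector (e.g. a 200-element vector with block size 5 gives 40 blocks of 5 randomly sampled elements each), then stores those blocks under a named entry of the final list. Using your example, the final result is a list with 1000 entries, each containing 40 blocks of sampled observations.
myvector <- c(1:200)
## Clear any existing seed so that a fresh one is generated on the next draw.
if (exists(".Random.seed", envir = globalenv())) rm(.Random.seed, envir = globalenv())
block_dimension <- 5
## Number of blocks needed to match the length of the original vector.
rep_num <- length(myvector) / block_dimension
res <- list()
for(i in 1:1000) {
  name <- paste('sample_', i, sep='')
  ## Each column of all_blocks is one block of block_dimension elements,
  ## drawn without replacement; they need not be adjacent in myvector.
  all_blocks <- replicate(rep_num, sample(myvector, block_dimension))
  ## Split the matrix back into a list of blocks.
  tmp <- split(all_blocks, ceiling(seq_along(all_blocks)/block_dimension))
  res[[name]] <- tmp
}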
Here are the first 6 sampled observations for the first entry:
How about the following? Note that you can use lapply, which should be slightly faster than filling the list in a for loop in this case.
As reference, here is the case where you sample individual observations.
# Sample individual observations
set.seed(2017);
mylist <- lapply(1:1000, function(x) sample(myvector, length(myvector), replace = TRUE));
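Next we sample (possibly overlapping) blocks of 5 consecutive observations.
# Sample blocks of n observations
n <- 5;
set.seed(2017);
mylist <- lapply(1:1000, function(x) {
  # Draw block start positions; the last valid start is length(myvector) - n + 1.
  idx <- sample(1:(length(myvector) - n + 1), length(myvector) / n, replace = TRUE);
  # Expand each start into n consecutive indices, keeping each block contiguous.
  idx <- c(t(sapply(0:(n - 1), function(i) idx + i)));
  myvector[idx];
})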
One solution, assuming blocks consist of contiguous elements of myvector, is to pre-define the blocks in rows of a data frame with start/end columns (e.g. blocks <- data.frame(start=seq(1,96,5),end=seq(5,100,5))). Create a set of sample indexes (with replacement) from [1:number of blocks] and concatenate values indexing from myvector using the start/end values from the defined blocks. You can add randomization within blocks as well, if you need to. This gives you control over the block contents, overlap, size, etc.
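A minimal sketch of that idea, assuming non-overlapping blocks of 5 over the 200-element vector from the question. The helper name resample_blocks and its n_draw argument are made up for illustration.

```r
myvector <- 1:200
block_size <- 5

## One row per block, holding its first and last index in myvector.
blocks <- data.frame(start = seq(1, length(myvector) - block_size + 1, by = block_size))
blocks$end <- blocks$start + block_size - 1

## Hypothetical helper: draw n_draw block numbers with replacement, then
## concatenate the elements of each drawn block.
resample_blocks <- function(x, blocks, n_draw) {
  picked <- sample(nrow(blocks), n_draw, replace = TRUE)
  unlist(lapply(picked, function(b) x[blocks$start[b]:blocks$end[b]]))
}

## Enough blocks per resample to match the length of the original vector.
n_draw <- length(myvector) / block_size
mylist <- lapply(1:1000, function(i) resample_blocks(myvector, blocks, n_draw))
```

For overlapping blocks, the start column could instead be seq_len(length(myvector) - block_size + 1), with the rest unchanged.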
I found a way to perform the task with non-overlapping blocks:
myvector <- c(1:200)
n <- 5
mymatrix <- matrix(myvector, nrow = length(myvector)/n, byrow = TRUE)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- as.vector(t(mymatrix[sample(nrow(mymatrix), size = length(myvector)/n, replace = TRUE),]))
}
So I have the problem below and my R code. (The Nile data is one of R's included datasets.)
Seed the random number generator.
Define an empty or 1000-element vector, "sample1," to write sample means to.
Write a for-loop to draw 1000 random samples, n = 25, and write the mean of each sample to the prepared vector.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
r <- vector()
r[i] <- data[sample(1:100,size=25,replace=1)]
sample1[i] <- mean(r[i])
}
and I am getting a warning message in my output saying:
Warning in r[i] <- data[sample(1:100, size = 25, replace = 1)]: number of items to replace is not a multiple of replacement length
Could anyone help me out?
As mentioned in the comments, the problem is that you are trying to assign a vector of 25 values to a single element of a vector, r[i]. The lengths don't match, so R keeps only the first value and issues the warning.
The fix in this case is simply to remove that intermediate assignment, as it's redundant. In general, if you need to store multiple vectors you can do that in a matrix, data frame or list, depending on whether the vectors are of known length, the same length, the same class, etc.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
d <- data[sample(1:100, size=25, replace=TRUE)]
sample1[i] <- mean(d)
}
Instead of using a for loop, in this case you can use replicate, which is a relative of lapply and its ilk.
set.seed(123)
sample2 <- replicate(1000, mean(data[sample(1:100, size=25, replace=TRUE)]))
# as you can see, the results are identical
head(sample1); head(sample2)
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
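If the individual samples are needed as well as their means, one option in line with the remark about matrices above is to keep every sample as a column. This is only a sketch; the names samples and sample3 are made up for illustration.

```r
data <- as.vector(Nile)
set.seed(123)

## Each column holds one sample of 25 observations drawn with replacement.
samples <- replicate(1000, sample(data, size = 25, replace = TRUE))

dim(samples)                    # 25 rows, 1000 columns
sample3 <- colMeans(samples)    # one mean per sample
```

This works because every sample has the same length; samples of differing lengths would need a list instead.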
I have a task to do using R. I need to make 10000 samples of a vector of 12 elements, each element between 1 and 7. I did this using:
dataSet = t(replicate(10000, sample(1:7, 12, r=T)))
Now I need to count the rows of this dataSet that contain all the values from 1:7.
How can I do that and is there a better way to represent the data than this?
One way would be (you need to use set.seed in order to make this reproducible)
indx <- 1:7
sum(apply(dataSet, 1, function(x) all(indx %in% x)))
## 2336
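A reproducible variant might look like the sketch below. The seed value is arbitrary, and the distinct-value check is a swapped-in alternative to the %in% test above; it relies on every entry already lying in 1:7.

```r
set.seed(42)   # arbitrary seed, chosen only to make the run repeatable
dataSet <- t(replicate(10000, sample(1:7, 12, replace = TRUE)))

## A row contains all of 1:7 exactly when it has 7 distinct values,
## because every entry is already known to lie in 1:7.
complete <- apply(dataSet, 1, function(x) length(unique(x)) == 7)

sum(complete)    # number of rows containing every value
mean(complete)   # the same count as a proportion of all rows
```

As for representation, a matrix with one sample per row is a reasonable choice here, since all samples have the same length and type.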
I have two dataframes as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) to form two groups (X' should have 5 genotypes and Y' should have 5 genotypes). In the end I will have a thousand such datasets X' and Y', each containing 5 random genotypes from the full expression dataset.
I tried using replicate and apply, but it did not work.
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))
I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each X' and Y'):
colnums_boot <- replicate(1000,sample.int(200,10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
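Putting those pieces together with the data kept as a matrix might look like this sketch. The helper get_round is made up for illustration, and no timing comparison is implied.

```r
set.seed(1)
Z <- matrix(rnorm(2000), nrow = 10)        # 10 genes x 200 genotypes
B <- 1000

## Column b holds the 10 distinct genotype columns drawn for round b.
colnums_boot <- replicate(B, sample.int(200, 10))

## Hypothetical helper: return the two 10 x 5 groups for round b.
get_round <- function(b) {
  list(Xprime = Z[, colnums_boot[1:5, b]],
       Yprime = Z[, colnums_boot[6:10, b]])
}

round1 <- get_round(1)
dim(round1$Xprime)   # 10 rows, 5 columns
```

Because the 10 columns in each round are drawn without replacement, X' and Y' never share a genotype within a round.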
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(,dim=c(10,5,1000))
Xprimes[] <- Z[,colnums_boot[1:5,]]
Yprimes <- array(,dim=c(10,5,1000))
Yprimes[] <- Z[,colnums_boot[6:10,]]
The sample code
population <- 10000
vec <- sample(1:6, population, replace=T)
output <- sample(1:vec, population, replace=T)
warning: numerical expression has 10000 elements: only the first used.
The sample is attempting to change the limits of the sample for each choice, so one iteration should randomly sample between 1:2, another could be between 1:6. The value of the maximum is defined in 'vec'
What is the correct way to structure this line such that it knows to create 'output' as a vector of length 10,000, with the proper references to the maximum values in 'vec'? Currently it is only using the first value of 'vec' for all 10000 samples in 'output'
Maybe use sapply to loop over vec:
out <- sapply(vec,sample,size = 1)
Another way: create a matrix in which column k holds samples drawn from 1:k. Then build a vector that takes one value from each row, choosing the column at random. I thought this might be faster, but both ways are very fast.
population <- 1e4
samp.mat <- sapply(1:6,sample.int,size=population,replace=TRUE)
indices <- cbind(seq_len(nrow(samp.mat)),sample.int(6,nrow(samp.mat),replace=TRUE))
out <- samp.mat[indices]
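The matrix approach above draws a fresh random column for each row rather than reusing vec. A variant that ties each draw to the corresponding entry of vec could look like the sketch below, assuming vec holds integers in 1:6 as in the question.

```r
population <- 1e4
vec <- sample(1:6, population, replace = TRUE)

## Column k of samp.mat holds draws from 1:k.
samp.mat <- sapply(1:6, sample.int, size = population, replace = TRUE)

## For row i, pick the column given by vec[i], so out[i] is a draw from 1:vec[i].
out <- samp.mat[cbind(seq_len(population), vec)]
```

Indexing a matrix with a two-column matrix of (row, column) pairs returns one element per pair, which is what makes this a single vectorized lookup.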