I have a vector on which I want to do block resampling to get, say, 1000 samples of the same size as the vector, and then save all these samples in a list.
This is the code that performs ordinary resampling, i.e. randomly draws one observation at a time, and saves the result in a list:
myvector <- c(1:200)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- sample(myvector, length(myvector), replace=TRUE)
}
I need code that does exactly the same thing, except that instead of drawing single observations it draws blocks of observations (let's use blocks of size 5).
I know there are packages that perform bootstrap operations, but I don't need statistics or confidence intervals or anything, just all the samples in a list. Both overlapping and non-overlapping blocks are fine, so code for either procedure is enough. Of course, if you are kind enough to give me code for both, it's appreciated. Thanks to anybody who can help me with this.
Not sure how you're wanting to store the final structure.
The following takes a block dimension, samples your vector by that block size (e.g. 200 element vector with block size 5 gives 40 observations of randomly sampled elements) and adds those blocks to an index of the final list. Using your example, the final result is a list with 1000 entries; each entry containing 40 randomly sampled observations.
myvector <- c(1:200)
rm(.Random.seed, envir=globalenv())
block_dimension <- 5
res = list()
for(i in 1:1000) {
name <- paste('sample_', i, sep='')
rep_num <- length(myvector) / block_dimension
all_blocks <- replicate(rep_num, sample(myvector, block_dimension))
tmp <- split(all_blocks, ceiling(seq_along(all_blocks)/block_dimension))
res[[name]] <- tmp
}
Here are the first 6 sampled observations for the first entry:
How about the following? Note that you can use lapply, which should be slightly faster than filling the list in a for loop in this case.
As reference, here is the case where you sample individual observations.
# Sample individual observations
set.seed(2017);
mylist <- lapply(1:1000, function(x) sample(myvector, length(myvector), replace = TRUE));
Next we sample blocks of 5 observations.
# Sample blocks of n observations
n <- 5;
set.seed(2017);
mylist <- lapply(1:1000, function(x) {
idx <- sample(1:(length(myvector) - n + 1), length(myvector) / n, replace = TRUE);
idx <- c(t(sapply(0:(n - 1), function(i) idx + i)));
myvector[idx];
})
One solution, assuming blocks consist of contiguous elements of myvector, is to pre-define the blocks in rows of a data frame with start/end columns (e.g. blocks <- data.frame(start=seq(1,96,5),end=seq(5,100,5))). Create a set of sample indexes (with replacement) from [1:number of blocks] and concatenate values indexing from myvector using the start/end values from the defined blocks. You can add randomization within blocks as well, if you need to. This gives you control over the block contents, overlap, size, etc.
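A minimal sketch of that approach, adapted to the 200-element myvector from the question (the blocks data frame and variable names are illustrative; blocks here are contiguous and non-overlapping):

```r
myvector <- 1:200
block_size <- 5
# one row per block, with start/end indices into myvector
blocks <- data.frame(start = seq(1, 196, by = block_size),
                     end   = seq(5, 200, by = block_size))
mylist <- lapply(1:1000, function(x) {
  # sample block indices with replacement, then expand to element indices
  picked <- sample(nrow(blocks), nrow(blocks), replace = TRUE)
  unlist(lapply(picked, function(b) myvector[blocks$start[b]:blocks$end[b]]))
})
```

Because each sample draws 40 blocks of 5 elements, every entry of mylist has the same length as myvector.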
I found a way to perform the task with non-overlapping blocks:
myvector <- c(1:200)
n <- 5
mymatrix <- matrix(myvector, nrow = length(myvector)/n, byrow = TRUE)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- as.vector(t(mymatrix[sample(nrow(mymatrix), size = length(myvector)/n, replace = TRUE),]))
}
Related
I'm trying to figure out a way to write a loop (or apply-like function) that takes a particular row from a matrix within a list, writes it to a matrix/data frame, takes the matching row from the next list element, and places it after the previous one in the same row.
This involves a nested list which has five matrices in the first part, five in the next part, and so on. Each chunk of matrices can vary in the number of columns, with a minimum of six and a maximum of eight. There are always four rows. The example below has 4x6 matrices in the first part and 4x8 matrices in the second part.
set.seed(100)
test.df <- data.frame(matrix(1:440,nrow=40,ncol=11))
mat1 <- matrix(rnorm(24),nrow=4,ncol=6)
mat2 <- matrix(rnorm(24),nrow=4,ncol=6)
mat3 <- matrix(rnorm(24),nrow=4,ncol=6)
mat4 <- matrix(rnorm(24),nrow=4,ncol=6)
mat5 <- matrix(rnorm(24),nrow=4,ncol=6)
mat6 <- matrix(rnorm(32),nrow=4,ncol=8)
mat7 <- matrix(rnorm(32),nrow=4,ncol=8)
mat8 <- matrix(rnorm(32),nrow=4,ncol=8)
mat9 <- matrix(rnorm(32),nrow=4,ncol=8)
mat10 <- matrix(rnorm(32),nrow=4,ncol=8)
test.list1 <- list(mat1,mat2,mat3,mat4,mat5)
test.list2 <- list(mat6,mat7,mat8,mat9,mat10)
list.f <- list(test.list1,test.list2)
res.mat <- matrix(nrow=2,ncol=40)
# (Edit) Example of expected results
res.mat[1,1:6] <- mat1[1,]
res.mat[1,9:14] <- mat2[1,]
res.mat[2,1:8] <- mat6[1,]
res.mat[2,9:16] <- mat7[1,]
res.mat
final.res <- cbind(test.df,res.mat)
final.res
The first row of each matrix in the first list of the nested list occupies the first row of res.mat. The first row of res.mat would have six entries, skip two columns (leaving the NAs), then the next six entries, and so on. Row two of res.mat would have eight entries, then the next eight, and so on. Once res.mat is populated, I would append it to test.df as the final result. test.df is filled with a series of numbers as placeholders.
I have more than two higher level list elements, but for reproducibility, I've only included two. When it handles all of my data, there would be 2916 list elements and 2916 rows in the final data frame. Any ideas on how I could accomplish this would be greatly appreciated, and please ask if I can clarify anything with the code or description.
In case anyone's interested, I did figure it out. Here's the solution:
format.dv <- function(list.f, test.df) {
nrow.f <- length(list.f)
res.mat <- matrix(nrow = nrow.f, ncol = 40)
for (cond in 1:nrow.f) {
min <- 1
for (ri in 1:5) {
data <- list.f[[cond]][[ri]]
ncol <- dim(data)[2]
adj <- 8 - ncol
max <- 8
# Any row number can be substituted for 1, depending on the desired row
pe.row <- data[1,]
res.mat[cond, min:((max*ri)-adj)] <- pe.row
min <- min+8
}
}
res.mat <- round(res.mat,4)
final.res <- cbind(test.df,res.mat)
final.res
}
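A more compact alternative (a sketch assuming the same nested-list structure as the question, padding each matrix's first row with NAs out to a fixed width of 8):

```r
set.seed(100)
# same shape as the question: one chunk of 4x6 matrices, one chunk of 4x8
list.f <- list(
  replicate(5, matrix(rnorm(24), nrow = 4, ncol = 6), simplify = FALSE),
  replicate(5, matrix(rnorm(32), nrow = 4, ncol = 8), simplify = FALSE)
)
res.mat <- t(sapply(list.f, function(chunk) {
  unlist(lapply(chunk, function(m) {
    row1 <- m[1, ]
    c(row1, rep(NA, 8 - length(row1)))  # pad the row out to 8 entries
  }))
}))
dim(res.mat)  # 2 x 40
```

Each chunk contributes 5 matrices x 8 slots = 40 entries, so the 6-column matrices naturally leave two NAs after their six values, matching the layout described above.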
dataset <- matrix(rnorm(100), 20, 5)
My dataset is a matrix of 100 returns, of 5 assets over 20 days.
I want to calculate the average return for each asset, over rows 1:10 and rows 11:20.
Then I want to store the computed returns in two vectors, which in turn are included in a list.
The following list should include the two vectors of returns computed between rows 1:10 and 11:20.
returns <- vector(mode="list", 2)
I have implemented a for-loop, as reported below, to calculate the mean of returns only between 1:10.
assets <- 5
r <- rep(0, assets) # this vector should include the returns over 1:10
for(i in 1:assets){
r[i] <- mean(dataset[1:10,i])
}
returns[[1]] <- r
How could I adapt this for-loop so that it also calculates the mean of returns over rows 11:20?
I have tried to "index" the rows of the dataset, in the following way.
time <- c(1, 10, 11, 20)
and then implement a double for-loop, but the lengths are different. Moreover, in this case I have trouble managing the vector "r", because I would then need two vectors rather than just one as before.
for(j in 1:length(time)){
for(i in 1:assets){
r[i] <- mean(dataset[1:10,i])
}}
returns[[1]] <- r
You don't even need a for loop; you can use colMeans:
returns <- vector(mode="list", 2)
returns[[1]] <- colMeans(dataset[1:10,])
returns[[2]] <- colMeans(dataset[11:20,])
Using a for loop, your solution could be something like the following
for(i in 1:assets){
returns[[1]] <- c(returns[[1]], mean(dataset[1:10,i]))
returns[[2]] <- c(returns[[2]], mean(dataset[11:20,i]))
}
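The same idea generalizes to any number of row windows with lapply (a sketch; the ranges list holds whatever row windows you need):

```r
dataset <- matrix(rnorm(100), 20, 5)
ranges <- list(1:10, 11:20)  # row windows to average over
# one vector of per-asset means per window, collected in a list
returns <- lapply(ranges, function(rows) colMeans(dataset[rows, ]))
```

Adding a third period is then just a matter of appending another window to ranges.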
I am trying to subset this data frame by pre-determined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre-determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is, if I have my row numbers stored in a variable, how do I place each [ith] one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
however that pastes as a character and it doesn't read my train.length.n variable... as below
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I have the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train/test sets... hence the for loop.
Consider staggered seq() and pass the number sequences to lapply to slice by rows. Also, for equal-length data frames, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
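If you want to keep the separately named index variables from the question, one sketch is to look them up by constructed name with mget(), which returns the matching objects in a list:

```r
df <- data.frame(data = 1:200)
train.length.1 <- 1:50;   test.length.1 <- 50:100
train.length.2 <- 50:100; test.length.2 <- 100:150

n.sets <- 2
# mget() fetches the variables whose names we build with paste0()
train.list <- lapply(mget(paste0("train.length.", 1:n.sets)),
                     function(idx) df[idx, , drop = FALSE])
test.list  <- lapply(mget(paste0("test.length.", 1:n.sets)),
                     function(idx) df[idx, , drop = FALSE])
```

Note that mget() looks in the calling environment, so run this where the train.length.n / test.length.n variables are defined.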
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
n <- nrow(data_frame)
r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
sub_frames <- split(data_frame,r)
return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide number of train/test splits to create in the loop
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question. (i.e. the first loop creates a train and test set with rows 1-50 and 50-100 respectively):
for(i in 1:num_splits) {
this_train <- chunked_frames[[i]]
this_test <- chunked_frames[[i + 1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
So I have my problem below and R code (the Nile data is one of R's included datasets):
1. Seed the random number generator.
2. Define an empty or 1000-element vector, "sample1", to write sample means to.
3. Write a for-loop to draw 1000 random samples, n = 25, and write the mean of each sample to the prepared vector.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
r <- vector()
r[i] <- data[sample(1:100,size=25,replace=1)]
sample1[i] <- mean(r[i])
}
and I am getting a warning message in my output saying:
Warning in r[i] <- data[sample(1:100, size = 25, replace = 1)]: number of items to replace is not a multiple of replacement length
Could anyone help me out?
As mentioned in the comments, the problem is that you are trying to add a vector to an element of a vector. The dimensions don't add up.
The fix in this case is quite simply to remove the step, as it's redundant. In general if you need to store multiple vectors you can do that in a matrix, data frame or list structure, depending on if the vectors are of known length, same length, same class etc.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
d <- data[sample(1:100, size=25, replace=TRUE)]
sample1[i] <- mean(d)
}
Instead of using a for loop, in this case you can use replicate, which is a relative of lapply and its ilk.
set.seed(123)
sample2 <- replicate(1000, mean(data[sample(1:100, size=25, replace=TRUE)]))
# as you can see, the results are identical
head(sample1); head(sample2)
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
I am generating a data vector to sample from, without replacement.
If the dataset I am generating from is large enough, the vector exceeds R's memory limits.
How can I represent these data in such a way that I can sample without replacement but can still handle huge datasets?
Generating the vector of counts:
counts <- vector()
for (i in 1:1024) {
counts <- c(counts, rep(i, times=data[i,]$readCount))
}
Sampling:
trial_fn <- function(counts) {
replicate(num_trials, sample(counts, size=trial_size, replace=F), simplify=F)
}
trials <- trial_fn(counts)
Error: cannot allocate vector of size 32.0 Mb
Is there a more sparse or compressed way I can represent this and still be able to sample without replacement?
If I understand correctly, your data has 1024 rows with different readCount.
The vector you build has the first readCount value repeated once, the second readCount repeated twice and so on.
Then you want to sample from this vector without replacement. So basically, you're sampling the first readCount with a probability of 1 / sum(1:1024), the second readCount with a probability of 2 / sum(1:1024) and so on, and each time you extract one value, it is removed from the set.
Of course the fastest and easiest approach is yours, but you can also do it with much less memory at a significant cost in speed. This can be done by giving extraction probabilities to the sample function, extracting one value at a time and manually "removing" the extracted value.
Here's an example :
# an example of your data
data <- data.frame(readCount=1:1024)
# custom function to sample
mySample <- function(values, size, nElementsPerValue){
nElementsPerValue <- as.integer(nElementsPerValue)
if(sum(nElementsPerValue) < size)
stop("Total number of elements per value is lower than the sample size")
if(length(values) != length(nElementsPerValue))
stop("nElementsPerValue must have the same length of values")
if(any(nElementsPerValue < 0))
stop("nElementsPerValue cannot contain negative numbers")
# remove values having zero elements inside
# remove values having zero elements inside
keep <- which(nElementsPerValue > 0)
values <- values[keep]
nElementsPerValue <- nElementsPerValue[keep]
# pre-allocate the result vector
res <- rep.int(0.0,size)
for(i in 1:size){
idx <- sample(1:length(values),size=1,replace=F,prob=nElementsPerValue)
res[i] <- values[idx]
# remove sampled value from nElementsPerValue
nElementsPerValue[idx] <- nElementsPerValue[idx] - 1
# if zero elements remove also from values
if(nElementsPerValue[idx] == 0){
values <- values[-idx]
nElementsPerValue <- nElementsPerValue[-idx]
}
}
return(res)
}
# just for reproducibility
set.seed(123)
# sample 100k values from readCount
system.time(
a <- mySample(data$readCount, 100000, 1:1024),
gcFirst=T)
# on my machine it gives :
# user system elapsed
# 10.63 0.00 10.67
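Another sketch that avoids both the huge vector and the one-value-at-a-time loop: sample positions from the implicit expanded vector without replacement, then map each position back to its value via cumulative counts with findInterval() (variable names are illustrative):

```r
set.seed(123)
values <- 1:1024
counts <- 1:1024                  # elements per value, as in the example data
total  <- sum(counts)             # length of the implicit expanded vector
# sample positions 1..total without replacement, never materializing the vector
pos  <- sample.int(total, size = 100000, replace = FALSE)
# position p belongs to the value whose cumulative count range contains p
samp <- values[findInterval(pos - 1, cumsum(counts)) + 1]
```

Because positions are drawn without replacement, no value can appear in samp more often than its count, which is exactly the without-replacement semantics of the expanded-vector approach, at a fraction of the memory.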