Warning Message with For Loop in R? - r

Here is my problem, followed by my R code (the Nile data is one of R's built-in datasets):
Seed the random number generator.
Define an empty or 1000-element vector, "sample1", to write sample means to.
Write a for loop to draw 1000 random samples, n = 25, and write the mean of each sample to the prepared vector.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
r <- vector()
r[i] <- data[sample(1:100,size=25,replace=1)]
sample1[i] <- mean(r[i])
}
and I am getting a warning message in my output saying:
Warning in r[i] <- data[sample(1:100, size = 25, replace = 1)]: number of items to replace is not a multiple of replacement length
Could anyone help me out?

As mentioned in the comments, the problem is that you are trying to assign a vector of 25 values to a single element of a vector, r[i]. The lengths don't match, so R keeps only the first value and issues the warning.
The fix in this case is quite simply to remove the step, as it's redundant. In general if you need to store multiple vectors you can do that in a matrix, data frame or list structure, depending on if the vectors are of known length, same length, same class etc.
data <- as.vector(Nile)
set.seed(123)
sample1 <- vector()
for(i in 1:1000){
d <- data[sample(1:100, size=25, replace=TRUE)]
sample1[i] <- mean(d)
}
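To see the original warning in isolation, here is a minimal sketch (the values 10, 20, 30 are made up for illustration). It captures the warning text with tryCatch and then shows that a list slot accepts a whole vector without complaint:

```r
# Assigning three values to one element of an atomic vector triggers the warning
r <- vector()
msg <- tryCatch({
  r[1] <- c(10, 20, 30)
  "no warning"
}, warning = function(w) conditionMessage(w))
print(msg)

# A list element can hold a whole vector, so no warning is raised here
r_list <- list()
r_list[[1]] <- c(10, 20, 30)
length(r_list[[1]])
```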
Instead of using a for loop, in this case you can use replicate, which is a relative of lapply and its ilk.
set.seed(123)
sample2 <- replicate(1000, mean(data[sample(1:100, size=25, replace=TRUE)]))
# as you can see, the results are identical
head(sample1); head(sample2)
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
#[1] 920.16 915.12 925.96 919.36 859.36 928.96
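If you do need to keep the samples themselves rather than just their means, one option is a preallocated matrix with one column per sample. This is only a sketch; the names x, n_samples, samples and sample_means are made up for illustration:

```r
x <- as.vector(Nile)
set.seed(123)
n_samples <- 1000

# One column per sample, 25 rows per column, allocated up front
samples <- matrix(NA_real_, nrow = 25, ncol = n_samples)
for (i in seq_len(n_samples)) {
  samples[, i] <- sample(x, size = 25, replace = TRUE)
}

# Column means give one mean per sample
sample_means <- colMeans(samples)
length(sample_means)
```

Preallocating avoids growing the object on every iteration, which matters more as the number of samples increases.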

Related

double for-loop to calculate the mean in a range of rows in a matrix

dataset <- matrix(rnorm(100), 20, 5)
My dataset is a matrix of 100 returns, of 5 assets over 20 days.
I want to calculate the average return for each asset, between rows 1:10 and rows 11:20.
Then I want to store the computed returns in two vectors, which are in turn stored in a list.
The following list should hold the two vectors of returns computed over rows 1:10 and 11:20.
returns <- vector(mode="list", 2)
I have implemented a for-loop, as reported below, to calculate the mean of returns only between 1:10.
assets <- 5
r <- rep(0, assets) # this vector should include the returns over 1:10
for(i in 1:assets){
r[i] <- mean(dataset[1:10,i])
}
returns[[1]] <- r
How could I manage this for-loop in order to calculate also the mean of returns between 11:20 rows?
I have tried to "index" the rows of the dataset, in the following way.
time <- c(1, 10, 11, 20)
and then implement a double for-loop, but the lengths are different. Moreover, in this case I have difficulty managing the vector "r", because I should now have two vectors rather than one as before.
for(j in 1:length(time)){
for(i in 1:assets){
r[i] <- mean(dataset[1:10,i])
}}
returns[[1]] <- r
You don't even need a for loop. You can use colMeans
returns <- vector(mode="list", 2)
returns[[1]] <- colMeans(dataset[1:10,])
returns[[2]] <- colMeans(dataset[11:20,])
Using a for loop, your solution could be something like the following
for(i in 1:assets){
returns[[1]] <- c(returns[[1]], mean(dataset[1:10,i]))
returns[[2]] <- c(returns[[2]], mean(dataset[11:20,i]))
}
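If the number of row ranges may grow, the same idea can be written once over a list of ranges. This is a sketch that assumes the ranges are kept in a list called row_ranges, a name made up for illustration:

```r
set.seed(1)
dataset <- matrix(rnorm(100), 20, 5)

# Each list element is one block of rows to average over
row_ranges <- list(1:10, 11:20)

# colMeans gives one mean per asset; drop = FALSE keeps the matrix shape
returns <- lapply(row_ranges, function(rows) {
  colMeans(dataset[rows, , drop = FALSE])
})
str(returns)
```

This replaces the outer loop over time periods with lapply and the inner loop over assets with colMeans.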

R - Block resampling and saving samples in a list

I have a vector on which I want to do block resampling to get, say, 1000 samples of the same size as the vector, and then save all these samples in a list.
This is the code that performs normal resampling, i.e. randomly draws one observation at a time, and saves the result in a list:
myvector <- c(1:200)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- sample(myvector, length(myvector), replace=TRUE)
}
I need code that does exactly the same thing, except that instead of drawing single observations it draws blocks of observations (let's use blocks of size 5).
I know there are packages that perform bootstrap operations, but I don't need statistics or confidence intervals or anything, just all the samples in a list. Both overlapping and non-overlapping blocks are ok, so the code for just one of the two procedures is enough. Of course, if you are so kind to give me the code for both it's appreciated. Thanks to anybody who can help me with this.
Not sure how you're wanting to store the final structure.
The following takes a block dimension, samples your vector by that block size (e.g. 200 element vector with block size 5 gives 40 observations of randomly sampled elements) and adds those blocks to an index of the final list. Using your example, the final result is a list with 1000 entries; each entry containing 40 randomly sampled observations.
myvector <- c(1:200)
rm(.Random.seed, envir=globalenv())
block_dimension <- 5
res = list()
for(i in 1:1000) {
name <- paste('sample_', i, sep='')
rep_num <- length(myvector) / block_dimension
all_blocks <- replicate(rep_num, sample(myvector, block_dimension))
tmp <- split(all_blocks, ceiling(seq_along(all_blocks)/block_dimension))
res[[name]] <- tmp
}
Here are the first 6 sampled observations for the first entry:
How about the following? Note that you can use lapply, which should be slightly faster than filling the list in a for loop in this case.
As reference, here is the case where you sample individual observations.
# Sample individual observations
set.seed(2017);
mylist <- lapply(1:1000, function(x) sample(myvector, length(myvector), replace = TRUE));
Next we sample blocks of 5 observations.
# Sample blocks of n observations
n <- 5;
set.seed(2017);
mylist <- lapply(1:1000, function(x) {
idx <- sample(1:(length(myvector) - n + 1), length(myvector) / n, replace = TRUE);  # "+ 1" so the last block (196:200) can also be drawn
idx <- c(t(sapply(0:(n - 1), function(i) idx + i)));
myvector[idx];
})
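The overlapping-block idea can also be wrapped in a small helper and checked. This is a sketch, not the answer's exact code: the function name block_resample is made up, and it uses outer() to build the indices instead of sapply():

```r
# Hypothetical helper: draw length(x) %/% n overlapping blocks of n contiguous values
block_resample <- function(x, n) {
  # Valid block starts run from 1 to length(x) - n + 1
  starts <- sample(seq_len(length(x) - n + 1), length(x) %/% n, replace = TRUE)
  # Column k of the outer() result holds starts[k] + 0:(n - 1)
  idx <- as.vector(outer(0:(n - 1), starts, `+`))
  x[idx]
}

set.seed(2017)
myvector <- 1:200
mylist <- lapply(1:1000, function(i) block_resample(myvector, 5))

# With myvector = 1:200, consecutive values inside each block differ by 1
all(diff(matrix(mylist[[1]], nrow = 5)) == 1)
```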
One solution, assuming blocks consist of contiguous elements of myvector, is to pre-define the blocks in rows of a data frame with start/end columns (e.g. blocks <- data.frame(start=seq(1,96,5),end=seq(5,100,5))). Create a set of sample indexes (with replacement) from [1:number of blocks] and concatenate values indexing from myvector using the start/end values from the defined blocks. You can add randomization within blocks as well, if you need to. This gives you control over the block contents, overlap, size, etc.
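A sketch of that approach, with the block table sized for the 200-element vector from the question (the names picked and one_sample are made up for illustration):

```r
myvector <- 1:200
n <- 5

# One row per non-overlapping block, with its start and end positions
blocks <- data.frame(start = seq(1, length(myvector) - n + 1, by = n),
                     end   = seq(n, length(myvector), by = n))

set.seed(2017)
# Draw block numbers with replacement, then concatenate the chosen blocks
picked <- sample(nrow(blocks), nrow(blocks), replace = TRUE)
one_sample <- unlist(lapply(picked, function(b) {
  myvector[blocks$start[b]:blocks$end[b]]
}))
length(one_sample)
```

Changing the start and end columns is enough to switch to overlapping blocks or a different block size.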
I found a way to perform the task with non-overlapping blocks:
myvector <- c(1:200)
n <- 5
mymatrix <- matrix(myvector, nrow = length(myvector)/n, byrow = TRUE)
mylist <- list()
for(i in 1:1000){
mylist[[i]] <- as.vector(t(mymatrix[sample(nrow(mymatrix), size = length(myvector)/n, replace = TRUE),]))
}

Split data to make train and test sets - for loop - insert variable to subset by row

I am trying to subset this data frame by predetermined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is: if I have my row numbers stored in variables, how do I place the i-th one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
however, that pastes as a character string and it doesn't read my train.length.n variable, as shown below
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I have the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train/test sets, hence the for loop.
Consider staggered seq() and pass the number sequences in lapply to slice by rows. Also, for equal-length dataframes, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
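To answer the literal question of how to use the i-th set of row numbers: one option is to keep the index vectors in a list rather than in separately numbered variables. The sketch below uses the row ranges from the question; the names train.idx and test.idx are made up. The last lines show get(), which looks up a variable by its pasted name if you must keep the numbered variables:

```r
df <- data.frame(data = 1:200)

# Index vectors stored in lists instead of train.length.1, train.length.2, ...
train.idx <- list(1:50, 50:100)
test.idx  <- list(50:100, 100:150)

# drop = FALSE keeps each result a data frame even with a single column
train.list <- lapply(train.idx, function(rows) df[rows, , drop = FALSE])
test.list  <- lapply(test.idx,  function(rows) df[rows, , drop = FALSE])

# Alternative: fetch a numbered variable by name
train.length.1 <- 1:50
rows <- get(paste0("train.length.", 1))
head(df[rows, , drop = FALSE])
```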
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
n <- nrow(data_frame)
r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
sub_frames <- split(data_frame,r)
return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide number of train/test splits to create in the loop
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question (i.e. the first iteration creates a train set and a test set with rows 1-50 and 51-100 respectively):
for(i in 1:num_splits) {
this_train <- chunked_frames[[i]]   # [[ ]] extracts the data frame itself; [ ] would return a one-element list
this_test <- chunked_frames[[i+1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
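If you want to keep every train/test pair instead of overwriting it on each iteration, a sketch along these lines stores them in two lists. It is self-contained, so it builds the chunks inline with split() rather than calling the function above; chunk_id and chunks are made-up names:

```r
df <- data.frame(data = 1:200)
chunk_size <- 50

# Chunk label for every row: 1 for rows 1-50, 2 for rows 51-100, and so on
chunk_id <- rep(seq_len(ceiling(nrow(df) / chunk_size)),
                each = chunk_size)[seq_len(nrow(df))]
chunks <- split(df, chunk_id)

num_splits <- 2
train.list <- vector("list", num_splits)
test.list  <- vector("list", num_splits)
for (i in seq_len(num_splits)) {
  train.list[[i]] <- chunks[[i]]       # chunk i is the training set
  test.list[[i]]  <- chunks[[i + 1]]   # the following chunk is the test set
}
sapply(train.list, nrow)
```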

Saving each iteration of a repeat loop to a vector - R

I've looked through previous help threads and haven't found something that has helped me with this specific problem. I know that a for loop would be a better way to generate the same data, but I'm interested in making this work with a repeat loop (mostly just as an exercise) and am struggling with the solution.
So I'm looping to create 3 iterations of 100 rnorm observations, changing the means each time from 5, to 25, to 45.
i <- 1
repeat{
x <- rnorm(100, mean = j, sd = 3)
j <- 5*i
i <- i + 4
if (j > 45) break
cat(x, "\n",j, "\n")
}
All of my tinkering to get a combined saved output for each iteration (for a total of 300 values) has failed. Help!
You can use lapply to get this:
lapply(c(5,25,45), function(x){
rnorm(100, mean = x, sd = 3)
})
This will give you a list with 3 elements, each containing 100 observations drawn from the respective normal distribution.
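Since the question asks for one combined output of 300 values, the list can be flattened with unlist. A short sketch, where the seed and the names samples and all_values are made up for illustration:

```r
set.seed(42)
samples <- lapply(c(5, 25, 45), function(m) {
  rnorm(100, mean = m, sd = 3)
})

# Flatten the three 100-element vectors into one vector of 300 values
all_values <- unlist(samples)
length(all_values)
```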
It depends on what data structure you want.
For lists it would be:
r = list()
repeat{
# ... generate x and j as in your loop, and break when you are done ...
r[[length(r)+1]] = list(x,j)
}
Then r[[1]][[1]] will be x from the first iteration and r[[1]][[2]] will be the corresponding j.
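Put together with the loop from the question, a runnable sketch might look like the following. It assumes that j should be computed before it is used and that the break check should come before sampling; both are rearrangements of the original loop, not part of it:

```r
set.seed(42)
r <- list()
i <- 1
repeat {
  j <- 5 * i                  # means 5, 25, 45 for i = 1, 5, 9
  if (j > 45) break           # stop before sampling with a mean above 45
  x <- rnorm(100, mean = j, sd = 3)
  r[[length(r) + 1]] <- list(x, j)
  i <- i + 4
}

# Combine the three samples into one vector of 300 values
all_x <- unlist(lapply(r, `[[`, 1))
length(all_x)
```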
Since you know how many observations you want to store, you can pre-allocate a matrix of that size, and store the data in it as it's generated.
# preallocate the space for the values you want to store
x <- matrix(nrow=100, ncol=3)
# save the three means in a vector
j_vals <- c(5,25,45)
# if you really need a repeat loop you can do it like so:
i <- 1
repeat {
# save the random sample in a column of the matrix x
x[,i] <- rnorm(100, mean = j_vals[i], sd = 3)
# print the random sample to the console (you can omit this)
cat(x[,i], "\n",j_vals[i], "\n")
i <- i+1
if (i > 3) break
}
You should get out a matrix x with the random samples stored in the columns. You can access each column like x[,1], x[,2] etc.

How to sample 1:x where x is a vector of random integers with length greater than 1

The sample code
population <- 10000
vec <- sample(1:6, population, replace=T)
output <- sample(1:vec, population, replace=T)
warning: numerical expression has 10000 elements: only the first used.
The sample is attempting to change the limits of the sample for each choice, so one iteration should randomly sample between 1:2, another could be between 1:6. The value of the maximum is defined in 'vec'
What is the correct way to structure this line such that it knows to create 'output' as a vector of length 10,000, with the proper references to the maximum values in 'vec'? Currently it is only using the first value of 'vec' for all 10000 samples in 'output'
Maybe use sapply to loop over vec:
out <- sapply(vec,sample,size = 1)
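This works because sample(x, size = 1) with a single number x draws from 1:x. The sketch below checks that every draw stays within its own limit. It also shows a fully vectorized alternative based on runif, which is a different technique from the answer's and is included only for comparison:

```r
set.seed(1)
population <- 10000
vec <- sample(1:6, population, replace = TRUE)

# One call to sample() per element of vec
out <- sapply(vec, sample, size = 1)
all(out >= 1 & out <= vec)

# Alternative (not from the answer): scale a uniform draw by each limit
out_vec <- floor(runif(population) * vec) + 1
all(out_vec >= 1 & out_vec <= vec)
```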
Another way: create a matrix where columns are samples using different numbers. Then build a vector that randomly takes a value from each row. I thought this might be faster, but both ways are very fast.
population <- 1e4
samp.mat <- sapply(1:6,sample.int,size=population,replace=TRUE)
indices <- cbind(seq_len(nrow(samp.mat)),sample.int(6,nrow(samp.mat),replace=TRUE))
out <- samp.mat[indices]
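The last step relies on indexing a matrix with a two-column matrix of (row, column) pairs, which returns one element per pair. A tiny sketch with made-up values:

```r
m <- matrix(1:6, nrow = 3)      # column 1 is 1:3, column 2 is 4:6
idx <- cbind(1:3, c(2, 1, 2))   # (row, column) pairs
m[idx]                          # picks m[1, 2], m[2, 1], m[3, 2]: 4 2 6
```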
