Results <- rep(0, 100)
number <- 1:500
for(i in 1:100) {
Results[i] <- sum(sample(number, 20, replace = FALSE))/20}
I want to use sample sizes of 20, 10 and 15 in the same loop. How can I do that in a single loop?
You can create a loop within a loop.
Continuing from what you started is awkward, because you have to make sure each value goes into the correct cell of your pre-allocated results structure. I would instead start with a data frame with 0 rows and use rbind to add the result of each iteration.
First create an empty data frame with 0 rows:
Results2 = data.frame(samplesize=c(), sum = c())
Set some parameters:
numbersToSampleFrom <- 1:500
samplesizes <- c(10, 20, 50)
NumberOfIterations <- 100
Loop it together:
for(s in seq_along(samplesizes)){
  for(i in 1:NumberOfIterations) {
    # note: the column is called "sum", but it holds the sample mean (sum / sample size)
    Results2 <- rbind(Results2, data.frame(samplesize = samplesizes[s],
      sum = sum(sample(numbersToSampleFrom, samplesizes[s], replace = FALSE))/samplesizes[s]))
  }
}
For each iteration it picks a sample size, does the sampling and creates a data frame with one row, which is then bound to the Results2 data frame.
There are lots of ways to do this; this one is an easy modification of your code (I think).
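As one of those other ways, here is a minimal sketch that avoids growing the data frame row by row. It assumes the same parameters as above; the name Results3, the column name "mean" and the seed are my own choices for illustration, not part of the original answer.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

numbersToSampleFrom <- 1:500
samplesizes <- c(10, 20, 50)
NumberOfIterations <- 100

# Build one data frame per sample size, then bind them once at the end.
Results3 <- do.call(rbind, lapply(samplesizes, function(n) {
  data.frame(
    samplesize = n,
    # replicate() repeats the sampling NumberOfIterations times
    mean = replicate(NumberOfIterations,
                     mean(sample(numbersToSampleFrom, n, replace = FALSE)))
  )
}))

head(Results3)
```

Binding a handful of data frames once is usually cheaper than calling rbind inside the innermost loop, which copies the accumulated result on every iteration.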
Related
I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol--those are arguments for the matrix() function.
This is how you initialize data2. You can see that it starts with 2 columns and 1 row: one column is named ncol and the other is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
#   ncol nrow
# 1    4   33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much as possible outside of the loop. This is guesswork without sample data, but these changes seem like a start at improving your code.
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
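For comparison, here is a loop-free sketch of the same kind of per-column summary. Everything in it is an assumption made for illustration: the toy data frame stands in for the asker's data, and geo_mean and geo_sd are hypothetical base-R helpers (exp of the mean and of the standard deviation of the logs) used in place of geoMean and geoSD so that the example runs without extra packages.

```r
# Hypothetical stand-in for the asker's data: three positive numeric columns.
set.seed(1)
data <- data.frame(x1 = rlnorm(10), x2 = rlnorm(10), x3 = rlnorm(10))

# Assumed definitions: geometric mean and geometric standard deviation.
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_sd   <- function(x) exp(sd(log(x), na.rm = TRUE))

# One row per column of `data`; vapply applies the helper to every column.
data2 <- data.frame(
  variable = colnames(data),
  GM  = unname(vapply(data, geo_mean, numeric(1))),
  GSD = unname(vapply(data, geo_sd, numeric(1)))
)

data2
```

Building the columns as whole vectors also keeps the numeric columns numeric, which assigning c(name, value, value) row by row would not, because c() coerces everything to character.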
I have this population:
MyPopulation <- c(1:100)
and I want to create a data frame of 40 columns and 5 rows. Each column has to be a random sample of MyPopulation, so I try this:
MySample <- data.frame(NoSample = c(1:5))
for (i in 1:40) {
MySample$i <- sample(MyPopulation,5)
}
The result is a data frame with only 1 more column (named i) with a random sample as values.
What am I doing wrong?
The easiest solution probably would be
MyPopulation <- c(1:100)
MySample <- data.frame(NoSample = c(1:5))
for (i in 1:40) {
MySample[,i+1] <- sample(MyPopulation,5)
}
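You cannot assign new columns that way, try MySample[paste(i)] = ...
That is, MySample$i always refers to a column literally named "i", because $ does not evaluate the loop variable. Indexing with a string built from i creates a new column on each iteration.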
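A minimal sketch of the string-indexing approach, using [[ ]] with a name built from the loop variable. The "S" prefix for the column names is my own assumption, added only so the names are easier to read.

```r
MyPopulation <- 1:100
MySample <- data.frame(NoSample = 1:5)

for (i in 1:40) {
  # paste0("S", i) is evaluated, so each iteration creates a new column
  # (S1, S2, ..., S40); MySample$i would reuse one column named "i".
  MySample[[paste0("S", i)]] <- sample(MyPopulation, 5)
}

dim(MySample)  # 5 rows, 41 columns (NoSample plus 40 samples)
```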
Maybe you can try replicate + as.data.frame
MySample <- as.data.frame(replicate(40,sample(MyPopulation,5)))
You can also create a single stream of random values and then state the number of columns in a matrix, with the row count being inferred:
m <- matrix(sample(1:1000, 200, replace = TRUE), ncol = 40)
df <- as.data.frame(m)
I have a data table that gives the length and composition of given vectors, for example:
library(data.table)
set.seed(1)
dt = data.table(length = c(100, 150),
n_A = c(30, 30),
n_B = c(20, 100),
n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
sample_vec <- sample( c( rep("A", dt$n_A[i]), # create a randomised vector with the given number of each component
rep("B", dt$n_B[i]),
rep("C", dt$n_C[i]) ) )
sample_vec_80 <- sample_vec[1:floor(length(sample_vec)*0.8)] # subset 80% of the vector
dt_80_list[[i]] <- data.table( length = length(sample_vec_80), # count the number of each component in the subset and output to list
n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
n_C = length(sample_vec_80[which(sample_vec_80 == "C")])
)
dt_20_list[[i]] <- data.table( length = dt$length[i] - dt_80_list[[i]]$length, # subtract the number of each component in the 80% to identify the number in the 20%
n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
n_C = dt$n_C[i] - dt_80_list[[i]]$n_C
)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with. It makes three main changes:
use .N and by= to count the number of "A","B","C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for-loop
## draw training data
dt_80 <- dcast(
dt[,row:=1:nrow(dt)
][, .(draw=sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*length) )
, by=row
][,.N,
by=.(row,draw)],
row~draw,value.var="N")[,length80:=A80+B80+C80]
## draw test data
dt_20 <- dt[dt_80,
.(A20=n_A-A80,
B20=n_B-B80,
C20=n_C-C80),on="row"][,length20:=A20+B20+C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here is my first idea. I did not post it initially because the code above is much faster, but this version might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, it might be of interest.
library(data.table)
library(Rfast)
## add row numbers
dt[,row:=1:nrow(dt)]
## sampling function
sampfunc <- function(n_A,n_B,n_C){
draw <- sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*(n_A+n_B+n_C))
out <- Rfast::Table(draw)
return(as.list(out))
}
## draw training data
dt_80 <- dt[,sampfunc(n_A,n_B,n_C),by=row]
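A different technique that may be worth trying on very large inputs, offered as a sketch rather than as the method used above: drawing 80% of the items without replacement means the counts of A, B and C follow a multivariate hypergeometric distribution, so they can be drawn directly with two vectorised calls to rhyper and no per-row vectors are ever built. The example uses a plain data.frame so that it runs without extra packages (an assumption on my part; the same columns could be added to a data.table), and the variable names k, A80, B80 and C80 are illustrative.

```r
set.seed(1)
dt <- data.frame(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))

k <- floor(0.8 * dt$length)  # size of the 80% subset for each row

# Number of A's among k draws from n_A "successes" and n_B + n_C "failures".
A80 <- rhyper(nrow(dt), m = dt$n_A, n = dt$n_B + dt$n_C, k = k)
# Given A80, the remaining k - A80 draws come from the B's and C's only.
B80 <- rhyper(nrow(dt), m = dt$n_B, n = dt$n_C, k = k - A80)
C80 <- k - A80 - B80

dt_80 <- data.frame(length = k, n_A = A80, n_B = B80, n_C = C80)
dt_20 <- dt - dt_80  # element-wise difference gives the 20% subset

dt_80
dt_20
```

Because rhyper recycles its m, n and k arguments, the cost no longer depends on the length of each vector, only on the number of rows.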
I am trying to randomly sample 10 individuals from a population and repeat 1000 times. Is this possible? Here is my code so far and I am not quite sure if I am on the right track. I keep receiving the error "number of items to replace is not a multiple of replacement length".
Here is my code:
B<-1000
for (i in 1:B){
FR3_Acropora_Sample[i]<-(sample(FR3_Acropora$Ratio,size=10,replace=TRUE))
}
Consider replicate (wrapper to sapply):
# MATRIX
sample_matrix <- replicate(B, sample(FR3_Acropora$Ratio, size=10, replace=TRUE))
# LIST
sample_list <- replicate(B, sample(FR3_Acropora$Ratio, size=10, replace=TRUE),
simplify = FALSE)
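To make the shape of the result concrete, here is a small self-contained sketch. FR3_Acropora is not available here, so the data frame below is a hypothetical stand-in with a made-up Ratio column; sample_means is also just an illustrative name.

```r
set.seed(123)
# Hypothetical stand-in for the asker's data.
FR3_Acropora <- data.frame(Ratio = runif(50))

B <- 1000
sample_matrix <- replicate(B, sample(FR3_Acropora$Ratio, size = 10, replace = TRUE))

dim(sample_matrix)  # 10 rows, 1000 columns: each column is one sample of 10

# Example of a per-sample statistic: the mean of each of the B samples.
sample_means <- colMeans(sample_matrix)
length(sample_means)
```

Keeping each sample as a column means per-sample summaries can be computed with colMeans or apply(sample_matrix, 2, ...) instead of another loop.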
I believe you can accomplish this as follows. I create a sample dataset of the numbers 1 through 50 (you'll skip this step, of course). I initialize a list of length 100, loop from 1 to 100, and assign a random sample to each empty slot. I can then access any sample with sampleList[[x]], where x is any number from 1 to 100.
x <- c(1:50)
sampleList <- vector(mode="list", length=100)
for (i in 1:100) {
sampleList[[i]] = sample(x, size = 10, replace = TRUE)
}
Using your variable names, this would look like:
B<-1000
FR3_Acropora_Sample <- vector(mode="list", length=B)
for (i in 1:B){
FR3_Acropora_Sample[[i]]=sample(FR3_Acropora$Ratio,size=10,replace=TRUE)
}
I want to sample 60 random rows 1000 times with the replace=TRUE and calculate the correlation coefficients between first and second columns in each sample.
I don't know how to sample rows randomly, so I tried to sample 60 numbers from 1:60 and match them to the row numbers.
The raw data is a 60x2 matrix called data1.
My code is
k <- list()
data.sam <- list()
set.seed(1)
for (j in 1:60){
for (i in 1:1000){
k[[i]] <- sample(1:60, 60, replace = TRUE)
}
data.sam[[i]][j,] <- data1[k[[i]][j],]
corr <- vector()
corr[i] <- cor(data.sam[[i]][,1],data.sam[[i]][,2])
}
And this error is shown:
Error in `*tmp*`[[i]] : subscript out of bounds
It doesn't look like the j variable is doing very much. Your indexing is already vectorized by k[[i]], so you don't need two explicit loops. Also, don't reset the corr variable inside the loop.
Instead, I might write:
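data1 <- matrix(rnorm(120), 60,2) # example data in place of the real 60x2 matrix
k <- list()
data.sam <- list()
corr <- numeric(1000) # initialize once, outside the loop
for (i in 1:1000){
k[[i]] <- sample(1:60, 60, replace = TRUE) # row indices for this resample
data.sam[[i]] <- data1[k[[i]],]
corr[i] <- cor(data.sam[[i]][,1],data.sam[[i]][,2])
}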
Which gives this:
hist(corr)
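If the intermediate samples are not needed afterwards, the same resampling can be written without storing them. This is a sketch under the same assumption as above (random example data in place of the real data1); idx is an illustrative name.

```r
set.seed(1)
data1 <- matrix(rnorm(120), 60, 2)  # example data, as in the answer above

corr <- replicate(1000, {
  # Draw 60 row indices with replacement, then correlate the two columns
  # of the resampled rows.
  idx <- sample(nrow(data1), replace = TRUE)
  cor(data1[idx, 1], data1[idx, 2])
})

summary(corr)
```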