How to make a combination of sample and lapply reproducible?

The following are functions for bootstrapping, but how can I make the result reproducible? I tried set.seed(), but that does not work because every time lapply calls boot.lm.vector, the function produces one simulated set and calculates the coefficients once. Is there anything in R that can act like a seed list, or any other way to make the result reproducible?
boot.lm.vector <- function(index, inputData) {
  # resample rows with replacement, then fit OLS via the normal equations
  d <- inputData[sample.int(nrow(inputData), replace = TRUE), ]
  a <- ncol(inputData) - 1
  X <- d[, 1:a]
  y <- d[, a + 1]
  solve(crossprod(X), crossprod(X, y))
}
rtest <- lapply(1:10000, FUN = boot.lm.vector, inputData = boot_set)
rtestdf <- plyr::ldply(rtest)

If you set the seed using the index inside your function, you should be able to reproduce the result. A dummy boot.lm.vector function is below:
## samples 1 item from inputData
boot.lm.vector <- function(index, inputData) {
  set.seed(index)  # seed tied to the iteration index
  sample(inputData, 1)
}
## iterating 5 times: use lapply as per your requirement
test <- sapply(1:5, FUN = boot.lm.vector, inputData = 1:10)
test
[1] 3 2 2 6 3 # reproducible result
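As an aside (my sketch, not from the thread, assuming the boot_set data from the question): because lapply evaluates its iterations sequentially in a single R process, you can also set one seed before the call instead of seeding inside the function; two runs with the same seed then produce identical bootstrap results.
set.seed(42)  # one seed up front covers the whole sequential lapply chain
rtest_a <- lapply(1:10000, FUN = boot.lm.vector, inputData = boot_set)
set.seed(42)
rtest_b <- lapply(1:10000, FUN = boot.lm.vector, inputData = boot_set)
identical(rtest_a, rtest_b)  # TRUE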

Related

How can I program n steps where each step is a dataframe created from the previous step?

I am trying to simulate changes in a dataframe through several steps, each depending on the previous one. Let's take a very simple example to illustrate my problem.
I create a dataframe with two columns:
a=runif(10)
b=runif(10)
data_1=data.frame(a,b)
data_1
a b
1 0.94922669 0.47418098
2 0.26702201 0.79179699
3 0.57398333 0.25158378
4 0.52724079 0.61531202
5 0.03999831 0.95233479
6 0.15171673 0.64564561
7 0.51353129 0.75676464
8 0.60312432 0.85318316
9 0.52900913 0.06297818
10 0.75459362 0.40209925
Then I would like to create n steps, where each step consists of creating a new dataframe at step i+1 as a function (let's call it "whatever") of the dataframe at step i: data_2 is a transformation of data_1, data_3 a transformation of data_2, etc.
iterations = function(nsteps)
{
  lapply(1:nsteps, function(i)
  {
    data_i+1 = whatever(data_i)
  })
}
Whatever function I use, I get an error message saying:
Error in whatever(data_i) : object 'data_i' not found
Can someone help me figure out what I am missing?
See if you can get some inspiration from the following example.
First, a whatever function to be applied to the previous dataframe.
whatever <- function(DF) {
  DF[[2]] <- DF[[2]] * 2  # e.g. double the second column
  DF
}
Now the function you want. I have added an extra argument, the dataframe x.
The function starts by creating the object to be returned. Each member of the list data_list will be a dataframe function of the previous dataframe.
iterations <- function(nsteps, x) {
  data_list <- vector("list", length = nsteps)
  data_list[[1]] <- x
  for (i in seq_len(nsteps)[-1]) {
    # each element is a function of the previous one
    data_list[[i]] <- whatever(data_list[[i - 1]])
  }
  names(data_list) <- sprintf("data_%d", seq_len(nsteps))
  data_list
}
And apply iterations to an example dataframe.
df1 <- data.frame(A = letters[1:10], X = 1:10)
iterations(10, df1)
You might be looking for a combination of assign, get, and paste (note that data_i must itself be retrieved by name, which is why get is needed):
assign(paste0("data_", i + 1), whatever(get(paste0("data_", i))))
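A further option (my sketch, reusing the whatever function above): base R's Reduce with accumulate = TRUE expresses "each step is a function of the previous one" directly and returns all intermediate dataframes as a list.
data_list <- Reduce(function(d, step) whatever(d),  # step index is unused
                    x = seq_len(9),                 # 9 transformations after data_1
                    init = data_1,
                    accumulate = TRUE)              # keep every intermediate result
names(data_list) <- sprintf("data_%d", seq_along(data_list))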

How to store data from for loop inside of for loop? (rolling correlation in r)

require(quantmod)
require(TTR)
iris2 <- iris[1:4]
b <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    a <- runCor(iris2[, i], iris2[, j], n = 21)
    b <- cbind(b, a)
  }
}
I want to calculate a rolling correlation of different columns within a dataframe and store the data separately by a column. Although the code above stores the data into variable b, it is not as useful as it is just dumping all the results. What I would like is to be able to create different dataframe for each i.
In this case, as I have 4 columns, what I would ultimately want are 4 dataframes, each containing 4 columns showing rolling correlations, i.e. df1 = corr of col 1 vs col 1,2,3,4, df2 = corr of col 2 vs col 1,2,3,4...etc)
I thought of using lapply or rollapply, but ran into the same problem.
d <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    c <- rollapply(iris2, 21, function(x) cor(x[, i], x[, j]), by.column = FALSE)  # rollapply is from the zoo package
    d <- cbind(d, c)
  }
}
Would really appreciate any inputs.
If you want to keep the expanded loop, how about a list of dataframes?
e <- vector("list", ncol(iris2))  # one matrix per column of iris2
for (i in 1:ncol(iris2)) {
  d <- matrix(0, nrow = nrow(iris2), ncol = ncol(iris2))
  for (j in 1:ncol(iris2)) {
    d[, j] <- runCor(iris2[, i], iris2[, j], n = 21)
  }
  e[[i]] <- d
}
It's also a good idea to allocate the amount of space you want with placeholders and put items into that space rather than use rbind or cbind.
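To make that advice concrete, here is a small sketch (with a hypothetical f() standing in for the real per-iteration work): growing an object copies it on every pass, while filling preallocated space does not.
n <- 1e4
f <- function(i) i^2                          # hypothetical placeholder work
b <- NULL
system.time(for (i in 1:n) b <- c(b, f(i)))   # grows b: copies it every pass
d <- numeric(n)
system.time(for (i in 1:n) d[i] <- f(i))      # fills preallocated space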
Although it is not good practice to create dataframes on the fly in R (you should prefer putting them in a list, as in the other answer), the way to do so is with the assign and get functions.
for (i in 1:ncol(iris2)) {
  d <- NULL
  for (j in 1:ncol(iris2)) {
    d <- cbind(d, runCor(iris2[, i], iris2[, j], n = 21))
  }
  # Assign the accumulated matrix to the name df1, df2...
  assign(paste0("df", i), d)
}
# to have access to the dataframe:
get("df1")
# or inside a loop
get(paste0("df", i))
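As a follow-up note (my addition): if you do create df1, df2, ... this way, you can still gather them back into a single list afterwards with mget.
df_list <- mget(paste0("df", 1:ncol(iris2)))  # collect df1..df4 into one list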
Since you stated your computation was slow, I wanted to provide you with a parallel solution. If you have a modern computer, it probably has 2 cores, if not 4 (or more!). You can easily check this via:
require(parallel) # for parallelization
detectCores()
Now the code:
require(quantmod)
require(TTR)
iris2 <- iris[,1:4]
Parallelization requires that the functions and variables be placed into a special environment that is created and destroyed with each process. That means a wrapper function must be created to define the variables and functions.
wrapper <- function(data, n) {
  # variables placed into environment
  force(data)
  force(n)
  # functions placed into environment
  # same inner loop as written in the earlier answer
  runcor <- function(data, n, i) {
    d <- matrix(0, nrow = nrow(data), ncol = ncol(data))
    for (j in 1:ncol(data)) {
      d[, j] <- TTR::runCor(data[, i], data[, j], n = n)
    }
    return(d)
  }
  # call function to loop over iterator i
  worker <- function(i) {
    runcor(data, n, i)
  }
  return(worker)
}
Now create a cluster on your local computer. This allows the multiple cores to run separately.
parallelcluster <- makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, 1:ncol(iris2),
                              wrapper(data = iris2, n = 21))
Remember to stop and close the cluster when finished:
stopCluster(parallelcluster)
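A usage note (my addition, assuming iris2 as defined above): models is a list with one rolling-correlation matrix per column, which you can label for easier access.
names(models) <- colnames(iris2)   # label each result by its source column
head(models[["Sepal.Length"]])     # rolling correlations of column 1 vs all columns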

vectorize replacement over 3d for loop

I would like to vectorize (or optimize in any way possible) the following 3D for loop:
dat: array with dim = c(n,n,m)
ref: matrix with dim = c(n,m)
for (i in 1:length(dat[, 1, 1])) {
  for (k in 1:length(dat[1, 1, ])) {
    dat[i, , k][dat[i, , k] > ref[i, k]] <- NA
  }
}
The array I am working with is 7e3 x 7e3 x 2e2 so the for loop above is impractically expensive. To boot, I will need to perform two or three very similar operations (on different arrays), so any saved time will be multiplied.
Example dat and ref arrays:
dat <- array(seq(1,75), dim=c(5,5,3))
ref <- cbind(seq(6,10), seq(36,40), seq(61,65))
You can use this instead. It creates a new_ref array which is conformable to dat, so you can compare them directly:
new_ref <- aperm(array(ref, dim(dat)[c(1, 3, 2)]), c(1, 3, 2))  # new_ref[i, j, k] == ref[i, k]
dat3 <- dat
dat3[dat3 > new_ref] <- NA
Comparison with your loop:
dat2 <- dat
for (i in 1:length(dat[, 1, 1])) {
  for (k in 1:length(dat[1, 1, ])) {
    dat2[i, , k][dat2[i, , k] > ref[i, k]] <- NA
  }
}
identical(dat2, dat3)
#[1] TRUE
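If the full logical mask dat > new_ref is too large to hold in memory at the array sizes mentioned in the question, a middle ground (my sketch) is to vectorize over slices of the third dimension, relying on R's column-major recycling of ref[, k]:
dat4 <- dat
for (k in seq_len(dim(dat)[3])) {
  s <- dat4[, , k]
  s[s > ref[, k]] <- NA   # ref[, k] recycles down columns, so s[i, j] is compared to ref[i, k]
  dat4[, , k] <- s
}
identical(dat2, dat4)
#[1] TRUE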

R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with columns c("x","y","density") over a range of c("s_size","reps"). reps = number of replicates; s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({  # start timer
    subsample_bound <- data.frame()
    # Perform subsampling of the general dataframe
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[, ]  # temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[, 1])),
                                     rep(reps, length(subsample_replicate[, 1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  })  # end timer
  colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
  subsample_bound
}  # end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples N rows from a dataframe; returns a dataframe with the same columns
# df: data frame
# N:  number of rows to sample
sample.df.rows <- function(N, df, ...) {
  df[sample(nrow(df), N, replace = FALSE, ...), ]
}
It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id,
                               MoreArgs = list(data = dat), SIMPLIFY = FALSE))
  out
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
This outputs data, unfortunately with the following warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
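An aside on that warning (my note, not from the original thread): it occurs because id uses times = s_size/int, which is not the number of subsample sizes that seq(1, s_size, by = int) actually generates when s_size is not a multiple of int (here seq(1, 206, by = 5) has 42 elements, but 206/5 = 41.2), so ss and id end up with different lengths. Computing the length explicitly keeps them in step:
steps <- seq(1, s_size, by = int)                 # the subsample sizes actually used
ss <- rep(steps, each = reps)
id <- rep(seq_len(reps), times = length(steps))   # lengths now always match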
I put very little thought into actually optimizing this, I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind, or c, you can be very sure that the resulting code is essentially the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1, s_size, by = int), each = reps)
id <- rep(seq_len(reps), times = s_size/int)
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind, mapply(foo, i = ss, j = id,
                             MoreArgs = list(data = dat), SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
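For a rough check of the speed-up (my sketch, assuming the question's dataframe is loaded as data_xyz), you can time the two versions side by side:
system.time(subsample_data <- subsample_loop(s_size = 206, reps = 5, int = 10))
system.time(out <- resampling_custom(dat = data_xyz, s_size = 206, int = 10, reps = 5))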

Simulating t-test p-values using a for loop

For this project I am required to use an R script to simulate the effectiveness of the t-test. A for loop must be used to carry out the following 2000 times:
Would the loop look something like this?
i <- 1
for (i <= 2001) {
  x <- rf(5, df1 = 5, df2 = 10)
  b <- df2
  p.value <- t.test(x, mu = (b/(b - 2)))$p.value
  i <- i + 1
}
In the way you wrote it, it would be a "while" loop.
For loops in R have the following syntax:
for (i in 1:2000) {
  df1 <- 5
  df2 <- 10
  x <- rf(5, df1 = df1, df2 = df2)
  b <- df2
  p.value <- t.test(x, mu = b/(b - 2))$p.value
}
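One caveat with this loop (my note): p.value is overwritten on each pass, so only the last of the 2000 values survives. A minimal fix is to preallocate a vector and index into it:
p.values <- numeric(2000)          # preallocate storage for all runs
for (i in 1:2000) {
  x <- rf(5, df1 = 5, df2 = 10)
  p.values[i] <- t.test(x, mu = 10/(10 - 2))$p.value
}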
Additionally, it might be more efficient to employ an "apply" construct, for example with replicate, and include the df as function arguments:
get.p.value <- function(df1, df2) {
  x <- rf(5, df1 = df1, df2 = df2)
  t.test(x, mu = df2/(df2 - 2))$p.value
}
p.values <- replicate(2000, get.p.value(df1 = 5, df2 = 10))
An apply construct is not always more efficient, but it does simplify recovering the p-values.
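To close the loop on simulating the effectiveness of the t-test (my sketch): with p.values collected above, the observed rejection rate at the 5% level is simply the proportion below 0.05.
mean(p.values < 0.05)   # proportion of the 2000 simulated tests that reject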
