R: run function over same dataframe multiple times

I’m looking to apply a function over an initial dataframe multiple times. As a simple example, take this data:
library(dplyr)
thisdata <- data.frame(vara = seq(from = 1, to = 20, by = 1),
                       varb = seq(from = 1, to = 20, by = 1))
And here is a simple function I would like to run over it:
simplefunc <- function(data) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
thisdata2 <- simplefunc(thisdata)
thisdata3 <- simplefunc(thisdata2)
So, how would I run this function, say, 10 times, without having to keep calling it manually (i.e. thisdata3, thisdata4, ...)? I'm mostly interested in the final dataframe after the replication, but it would be good to have a list of all the dataframes produced so I can run some diagnostics. Appreciate the help!

Dealing with multiple identically-structured data.frames individually is a difficult way to manage things, especially if the number of iterations is more than a few. A popular "best practice" is to deal with a "list of data.frames", something like:
n <- 10 # number of times you need to repeat the process
out <- vector("list", n)
out[[1]] <- thisdata
for (i in 2:n) out[[i]] <- simplefunc(out[[i-1]])
You can look at any interim value with
str(out[[10]])
# 'data.frame': 20 obs. of 2 variables:
# $ vara: num 1 2 3 4 5 6 7 8 9 10 ...
# $ varb: num 10 11 12 13 14 15 16 17 18 19 ...
and, as you might expect, the final result is in out[[n]].
This can be simplified slightly using Reduce, and adding a throw-away second argument to simplefunc:
simplefunc <- function(data, ...) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
out <- Reduce(simplefunc, 1:10, init = thisdata, accumulate = TRUE)
This effectively does:
tmp <- simplefunc(thisdata, 1)
tmp <- simplefunc(tmp, 2)
tmp <- simplefunc(tmp, 3)
# ...
(In fact, if you look at the source for Reduce, it's effectively doing my first suggestion above.)
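One detail worth noting: with accumulate = TRUE and an init value, Reduce includes the initial data in its output, so out has 11 elements here, not 10:
length(out)                    # 11
identical(out[[1]], thisdata)  # TRUE; out[[11]] is the final frame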
Note that if simplefunc has other arguments that cannot be dropped, perhaps:
simplefunc <- function(data, ..., otherarg, anotherarg) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
though then all other calls to simplefunc must pass those parameters by name instead of by position (position matching is the common default), because arguments that follow ... can only be matched by name.
Edit: if you cannot (or do not want to) edit simplefunc, you can always use an anonymous function to ignore the iterator/counter:
Reduce(function(x, ign) simplefunc(x), 1:10, init = thisdata, accumulate = TRUE)
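Whichever form you use, the accumulated list makes the diagnostics mentioned in the question straightforward; a minimal sketch (the mean of varb is just an example statistic):
out <- Reduce(function(x, ign) simplefunc(x), 1:10, init = thisdata, accumulate = TRUE)
sapply(out, function(df) mean(df$varb))  # one summary value per step
final <- out[[length(out)]]              # the dataframe after all 10 runs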

We can use a for loop
thisdata1 <- thisdata
for (i in 2:3) {
  assign(paste0('thisdata', i), value = simplefunc(get(paste0('thisdata', i - 1))))
}
NOTE: it is better not to create individual objects in the global environment when the operations can be done easily within a list, as in the first answer.

Related

how to use apply (or sapply) with columns of matrix or dataframe as function args

I know this is a bonehead newbie question, but I've been trying to figure it out for quite a while and need some input. Basically, I'm trying to learn how to use the apply family to avoid for loops, specifically how to set up the call so that columns of a matrix serve as arguments to the function. I'll use a simple call to the rbinom function as an example.
Example: this for loop works fine. The data are a set of integers and a set of probabilities
success <- rep(-1, times = 10)  # initialize result var
num <- sample.int(20, 10)       # get 10 random integers
p <- runif(10)                  # get 10 random probabilities
for (i in 1:10) {
  success[i] <- rbinom(n = 1, size = num[i], prob = p[i])  # number of successes in 1 trial
}
But how to do the same thing with the apply family? I first put the data into 2 columns of a matrix, thinking that was the right start. However, the following does NOT work, obviously due to my poor understanding of how to set up a call to apply.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
success <- apply(myData, rbinom, n=1, size=myData[,1], prob=myData[,2])
Any tips are greatly appreciated! I'm coming to R from Fortran, and trying to port over a lot of code that is loaded with DO loops, so I really need to get my head around this.
lapply, sapply, and apply only deal with one vector/list at a time; apply, for instance, calls its function on one row or column at a time, so it cannot pair up elements across columns in a single call. What you need is mapply or Map.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
mapply(rbinom, n = 1, myData[,1], myData[,2])
# [1] 5 4 11 8 3 3 17 8 0 11
Just like lapply returns a list, so does Map; similarly, just like sapply, mapply will return a vector or array if all return values are compatible, otherwise it returns a list as well.
These calls are equivalent:
sapply(1:3, function(z) z + 1)
mapply(function(z) z + 1, 1:3)
but mapply and Map allow an arbitrary number of lists/vectors, so for instance
func <- function(X, Y, Z) X^2 + 2*Y - Z
Map(func, 1:9, 11:19, 21:29)
## effectively the same as
list(
  func(1, 11, 21),
  func(2, 12, 22),
  func(3, 13, 23),
  ...,
  func(9, 19, 29)
)
The equivalent call of that with sapply for your data would be
sapply(seq_len(nrow(myData)), function(ind) {
  rbinom(n = 1, size = myData[ind, 1], prob = myData[ind, 2])
})
though I personally feel that mapply is easier to read.
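As a side note, the intermediate matrix isn't actually required here; mapply can take the original vectors directly (a small sketch reusing num and p from the question):
success <- mapply(rbinom, n = 1, size = num, prob = p)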

How to use grep function in for loop

I'm having trouble using the grep function within a for loop.
In my data set, I have several columns where only the last 5-6 letters change. With the loop I want to apply the same functions to all 16 situations.
Here is my code:
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
View(situations)
for (i in situations[1:16]) {
  ## Trust Skala
  a <- vector("numeric", length = 1L)
  b <- vector("numeric", length = 1L)
  a <- grep("Tru_1_[i]", colnames(cleandata))
  b <- grep("Tru_5_[i]", colnames(cleandata))
  cleandata[, c(a:b)] <- 8 - cleandata[, c(a:b)]
  attach(cleandata)
  cleandata$scale_tru_[i] <- (Tru_1_[i] + Tru_2_[i] + Tru_3_[i] + Tru_4_[i] + Tru_5_[i])/5
  detach(cleandata)
}
With the grep function I first want to find the column numbers of e.g. Tru_1_KKKTS and Tru_5_KKKTS. Then I want to reverse-code the items in those specific columns. The last part worked without the loop when I manually used grep for every single situation.
Here is the manual version:
# KKKTS
grep("Tru_1_KKKTS", colnames(cleandata)) #29 -> find the index of respective column
grep("Tru_5_KKKTS", colnames(cleandata)) #33
cleandata[,c(29:33)] <- 8-cleandata[c(29:33)] # trust scale ranges from 1 to 7 [8-1/2/3/4/5/6/7 = 7/6/5/4/3/2/1]
attach(cleandata)
cleandata$scale_tru_KKKTS <- (Tru_1_KKKTS + Tru_2_KKKTS + Tru_3_KKKTS + Tru_4_KKKTS + Tru_5_KKKTS)/5
detach(cleandata)
You can do:
Mean5 <- function(sit) {
cnames <- paste0("Tru_", 1:5, "_", sit)
rowMeans(cleandata[cnames])
}
cleandata[, paste0("scale_tru_", situations)] <- sapply(situations, FUN=Mean5)
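Note this only computes the five-item means; if you also need the reverse-coding step from your manual version, run it once beforehand (a sketch, assuming the same 8 - x recoding applies to every situation):
for (sit in situations) {
  cnames <- paste0("Tru_", 1:5, "_", sit)
  cleandata[cnames] <- 8 - cleandata[cnames]
}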
How about something like this? It's a bit more compact and you don't have to use attach:
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
for (i in situations[1:16]) {
  cols <- paste("Tru", 1:5, i, sep = "_")
  result <- paste("scale_tru", i, sep = "_")
  cleandata[cols] <- 8 - cleandata[cols]
  cleandata[result] <- rowMeans(cleandata[cols])
}
I took for granted that when you write a:b you mean all the columns between those two, which I assumed were the ones numbered 2 to 4.
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
# constructor for column names
get_col_names <- function(part) paste("Tru", 1:5, part, sep="_")
for (situation in situations) {
  # reverse-code the values in the columns in place
  cleandata[, get_col_names(situation)] <- 8 - cleandata[, get_col_names(situation)]
  # and calculate the average
  subdf <- cleandata[, get_col_names(situation)]
  cleandata[, paste0("scale_tru_", situation)] <- rowSums(subdf) / ncol(subdf)
}
By the way, you call it "scale" but your code shows an average/mean calculation.
(Scale without centering).

How can I program n steps where each step is a dataframe created from the previous step?

I am trying to simulate changes in a data frame through a series of steps, each depending on the previous one. Let's take a very simple example to illustrate my problem.
I create a dataframe with two columns
a=runif(10)
b=runif(10)
data_1=data.frame(a,b)
data_1
            a          b
1  0.94922669 0.47418098
2  0.26702201 0.79179699
3  0.57398333 0.25158378
4  0.52724079 0.61531202
5  0.03999831 0.95233479
6  0.15171673 0.64564561
7  0.51353129 0.75676464
8  0.60312432 0.85318316
9  0.52900913 0.06297818
10 0.75459362 0.40209925
Then, I would like to create n steps, where each step consists in creating a new dataframe at step i+1 as a function (let's call it "whatever") of the dataframe at step i: data_2 is a transformation of data_1, data_3 a transformation of data_2, etc.
iterations = function(nsteps)
{
  lapply(1:nsteps, function(i)
  {
    data_i+1 = whatever(data_i)
  })
}
Whatever the function I use, I have an error message saying:
Error in whatever(data_i) : object 'data_i' not found
Can someone help me figure out what I am missing?
See if you can get some inspiration from the following example.
First, a whatever function to be applied to the previous dataframe.
whatever <- function(DF) {
  DF[[2]] <- DF[[2]] * 2
  DF
}
Now the function you want. I have added an extra argument, the dataframe x.
The function starts by creating the object to be returned. Each member of the list data_list will be a dataframe function of the previous dataframe.
iterations <- function(nsteps, x){
  data_list <- vector("list", length = nsteps)
  data_list[[1]] <- x
  for(i in seq_len(nsteps)[-1]){
    data_list[[i]] <- whatever(data_list[[i - 1]])
  }
  names(data_list) <- sprintf("data_%d", seq_len(nsteps))
  data_list
}
And apply iterations to an example dataframe.
df1 <- data.frame(A = letters[1:10], X = 1:10)
iterations(10, df1)
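Since the result is a named list, both the final frame and any intermediate step are easy to pull out, for example:
all_steps <- iterations(10, df1)
all_steps$data_10      # the dataframe after the last step
str(all_steps$data_3)  # inspect an intermediate step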
You might be looking for a combination of assign and paste:
assign(paste("data_", i + 1, sep = ""), whatever(data_i))

How to append rows to an R data frame

I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a row to the end of the data frame. I started with the following code.
for (i in 1:10) {
  df$x = rbind(df$x, i)
  df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to avoid the additional line of code that would be required to update the indices on every iteration, and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
  df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
  for(i in 1:n){
    df$x[i] <- i
    df$y[i] <- toString(i)
  }
  df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
  x <- numeric(n)
  y <- character(n)
  for (i in 1:n) {
    x[i] <- i
    y[i] <- i
  }
  data.frame(x, y, stringsAsFactors = FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
#      expr         min          lq      median         uq         max neval
#  f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176     5
#  f3(1000)  149.417636  150.529011  150.827393  151.02230  160.637845     5
#  f4(1000)    7.872647    7.892395    7.901151    7.95077    8.049581     5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
  df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code there is one other problem: you should set stringsAsFactors = FALSE if you don't want the character column converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE). (Since R 4.0.0, FALSE is the default.)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
  df <- data.frame(x = numeric(), y = character())
  for(i in 1:n){
    df <- rbind(df, data.frame(x = i, y = toString(i)))
  }
  df
}
# use list
f2 <- function(n){
  df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
  for(i in 1:n){
    df[i,] <- list(i, toString(i))
  }
  df
}
# pre-allocate space
f3 <- function(n){
  df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
  for(i in 1:n){
    df$x[i] <- i
    df$y[i] <- toString(i)
  }
  df
}
system.time(f1(1000))
#   user  system elapsed
#   1.33    0.00    1.32
system.time(f2(1000))
#   user  system elapsed
#   0.19    0.00    0.19
system.time(f3(1000))
#   user  system elapsed
#   0.14    0.00    0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
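If you do want to grow the result without preallocating the data.frame, one more idiom worth knowing (not part of the benchmark above, so its relative timing here is an assumption; the f5 name just continues the numbering) is to collect the pieces in a list and rbind once at the end:
# collect rows in a list, bind once
f5 <- function(n){
  pieces <- vector("list", n)
  for(i in 1:n){
    pieces[[i]] <- data.frame(x = i, y = toString(i), stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)  # a single rbind instead of n incremental ones
}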
Suppose you simply don't know the size of the data.frame in advance. It can well be a few rows or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I came up with 4 distinct solutions:
1) rbindlist to the data.frame.
2) data.table's fast set operation, coupled with manually doubling the table when needed.
3) RSQLite, appending to the table held in memory.
4) data.frame's own ability to grow, plus a custom environment (which has reference semantics) to store the data.frame so it is not copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1 <- function(elems)
{
  return(as.data.table(elems))
}
append.1 <- function(dt, elems)
{
  return(rbindlist(list(dt, elems), use.names = TRUE))
}
access.1 <- function(dt)
{
  return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2 <- function(elems)
{
  return(as.data.table(elems))
}
append.2 <- function(dt, elems)
{
  n <- attr(dt, 'rowcount')
  if (is.null(n))
    n <- nrow(dt)
  if (n == nrow(dt))
  {
    tmp <- elems[1]
    tmp[[1]] <- rep(NA, n)
    dt <- rbindlist(list(dt, tmp), fill = TRUE, use.names = TRUE)
    setattr(dt, 'rowcount', n)
  }
  pos <- as.integer(match(names(elems), colnames(dt)))
  for (j in seq_along(pos))
  {
    set(dt, i = as.integer(n + 1), pos[[j]], elems[[j]])
  }
  setattr(dt, 'rowcount', n + 1)
  return(dt)
}
access.2 <- function(elems)
{
  n <- attr(elems, 'rowcount')
  return(as.data.table(elems[1:n, ]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy & paste of Karsten W.'s answer on a similar thread.
create.3 <- function(elems)
{
  con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
  RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
  return(con)
}
append.3 <- function(con, elems)
{
  RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append = TRUE)
  return(con)
}
access.3 <- function(con)
{
  return(RSQLite::dbReadTable(con, "t", row.names = NULL))
}
data.frame's own row-appending + custom environment.
create.4 <- function(elems)
{
  env <- new.env()
  env$dt <- as.data.frame(elems)
  return(env)
}
append.4 <- function(env, elems)
{
  env$dt[nrow(env$dt) + 1, ] <- elems
  return(env)
}
access.4 <- function(env)
{
  return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test <- function(id, n = 1000)
{
  n <- n - 1
  el <- list(a = 1, b = 2, c = 3, d = 4)
  o <- do.call(paste0('create.', id), list(el))
  s <- paste0('append.', id)
  for (i in 1:n)
  {
    o <- do.call(s, list(o, el))
  }
  return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n = 10 insertions.
I also added a 'placebo' function (with suffix 0) that doesn't do anything, just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr  function     time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n <= 100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation (as sketched below) and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
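A minimal sketch of that bracket-notation approach (essentially f2 from the benchmark above):
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for (i in 1:100) {
  df[nrow(df) + 1, ] <- list(i, toString(i))  # grow by one row per iteration
}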
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution with the newer tidyverse packages. So for people working with these packages, I want to add a solution to the previous answers, which are all quite interesting.
The biggest advantage of these packages is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
and the tibble package (a companion of tidyr and dplyr) offers the super-intuitive method add_row - it just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second, right after data.table (if you ignore the placebo), in the benchmark by @Adam Ryczkowski:
nr  function     time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
A more generic solution might be the following.
extendDf <- function(df, n) {
  withFactors <- sum(sapply(df, function(X) is.factor(X))) > 0
  nr <- nrow(df)
  colNames <- names(df)
  for (c in 1:length(colNames)) {
    if (is.factor(df[,c])) {
      col <- vector(mode = 'character', length = nr + n)
      col[1:nr] <- as.character(df[,c])
      col[(nr+1):(n+nr)] <- rep(col[1], n)  # to avoid extra levels
      col <- as.factor(col)
    } else {
      col <- vector(mode = mode(df[1,c]), length = nr + n)
      class(col) <- class(df[1,c])
      col[1:nr] <- df[,c]
    }
    if (c == 1) {
      newDf <- data.frame(col, stringsAsFactors = withFactors)
    } else {
      newDf[,c] <- col
    }
  }
  names(newDf) <- colNames
  newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
#       l i n c                   t
# 1  TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which holds the numbers 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 anywhere inside the vector, the command below comes in handy:
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
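For instance, a quick illustration of the after argument:
point <- c(1, 2, 3, 4, 5)
append(point, 6, after = 2)              # 1 2 6 3 4 5: inserted after position 2
append(point, 6, after = length(point))  # 1 2 3 4 5 6: appended at the end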
My solution is almost the same as the one in the original answer, but it didn't work for me.
So I gave names to the columns, and then it worked:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))

R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with columns c("x","y","density") at a variety of c("s_size","reps") combinations. reps = number of replicates, s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
   x y density
1  6 1       0
2  7 1   17600
3  8 1   11200
4 12 1   14400
5 13 1       0
6 14 1    8000
# Subsampling ###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({  # start timer
    subsample_bound = data.frame()
    # Perform subsampling of the general dataframe
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[,]  # temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[,1])),
                                     rep(reps, length(subsample_replicate[,1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  })  # end timer
  colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
  subsample_bound
}  # end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe with the same number of columns
#   df: data frame
#   N:  number of rows to be sampled
sample.df.rows <- function(N, df, ...) {
  df[sample(nrow(df), N, replace = FALSE, ...), ]
}
It's way too slow, and I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
outputs data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
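The warning most likely comes from ss and id ending up with different lengths: seq(1, s_size, by = int) has 42 elements when s_size = 206 and int = 5, so ss has 420 entries, while rep(seq_len(reps), times = s_size/int) uses the non-integer 206/5 (which rep truncates) and produces fewer. A sketch of a fix is to derive times from the actual sequence length:
resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)
  id <- rep(seq_len(reps), times = length(sizes))  # now length(id) == length(ss)
  do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}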
I put very little thought into actually optimizing this, I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind or c, you can be very sure that the resulting code will essentially be the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1, s_size, by = int), each = reps)
id <- rep(seq_len(reps), times = s_size/int)
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
