Apply different functions to different sets of columns by group in R

I have a data.table with the following features:
bycols: columns that divide the data into groups
nonvaryingcols: columns that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
datacols: columns to be aggregated / summarized (e.g. sum them within group)
I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above column sets as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.
Example data, where the different sets of columns are stored in character vectors.
require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
  g1 = rep( letters, 10 ),
  g2 = rep( c(LETTERS,LETTERS), each = 5 ),
  dat1 = runif( 260 ),
  dat2 = runif( 260 ),
  nv1 = rep( seq(130), 2),
  nv2 = rep( seq(130), 2)
)
Final data should look like:
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8403809 0.6713090 1 1
2: b A 0.4491883 0.4607716 2 2
3: c A 0.6083939 1.2031960 3 3
4: d A 1.5510033 1.2945761 4 4
5: e A 1.1302971 0.8573135 5 5
6: f B 1.4964821 0.5133297 6 6
I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.

As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
Here, for instance, and given the particular values in your example, you'd like a call that looks like:
test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
     by=c("g1", "g2")]
so the expression you'd like evaluated in the j-slot is
list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
Most of the following function is taken up with constructing just that expression:
f <- function(dt, bycols, datacols, nvcols) {
  e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
         sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
  e <- as.call(c(as.symbol("list"), e))
  dt[, eval(e), by = bycols]
}
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
## g1 g2 dat1 dat2 nv1 nv2
## 1: a A 0.8403809 0.6713090 1 1
## 2: b A 0.4491883 0.4607716 2 2
## 3: c A 0.6083939 1.2031960 3 3
## 4: d A 1.5510033 1.2945761 4 4
## 5: e A 1.1302971 0.8573135 5 5
## ---
## 126: v Z 0.5627018 0.4282380 126 126
## 127: w Z 0.7588966 1.4429034 127 127
## 128: x Z 0.7060596 1.3736510 128 128
## 129: y Z 0.6015249 0.4488285 129 129
## 130: z Z 1.5304034 1.6012207 130 130
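The same computing-on-the-language trick generalizes past sum and [. Here is a sketch of my own extension of the above (g and fun_by_col are hypothetical names, not from the answer), mapping each column name to an arbitrary one-argument function name:
g <- function(dt, bycols, fun_by_col) {
  # fun_by_col: named list mapping column name -> aggregation function name
  e <- lapply(names(fun_by_col),
              function(x) call(fun_by_col[[x]], as.symbol(x)))
  names(e) <- names(fun_by_col)
  e <- as.call(c(as.symbol("list"), e))  # list(dat1 = sum(dat1), dat2 = mean(dat2), ...)
  dt[, eval(e), by = bycols]
}
g(test, bycols, list(dat1 = "sum", dat2 = "mean", nv1 = "min", nv2 = "min"))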

Here's what I had come up with. It works, but very slowly.
test[, {
  cbind(
    as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
    .SD[, ..nonvaryingcols][1]
  )
}, by = bycols ]
Benchmarks
FunJosh <- function() {
  f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
  test[, {
    cbind(
      as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
      .SD[, ..nonvaryingcols][1]
    )
  }, by = bycols ]
}
FunEddi <- function() {
  cbind(
    test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
    test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
  )
}
library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE
microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
# expr min lq median uq max neval
# FunJosh() 2.749164 2.958478 3.098998 3.470937 6.863933 100
# FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240 100
# FunEddi() 5.877494 6.229739 6.528205 7.375939 112.895573 100
At least two orders of magnitude slower than @Josh O'Brien's solution. Edit: @eddi's solution is much faster as well, and shows that cbind wasn't optimal but could be fairly fast in the right hands. The slowdown was probably all the transforming and sapplying I was doing, rather than just directly using lapply.
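For what it's worth, the transforming and sapplying can also be avoided without computing on the language, by building the whole j result in one pass with c() and lapply — a sketch in the spirit of @eddi's answer, using the same test data as above:
test[, c(lapply(.SD[, ..datacols], sum),           # sum each data column
         lapply(.SD[, ..nonvaryingcols], `[`, 1)), # first value of each constant column
     by = bycols]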

Just for a bit of variety, here is a variant of @Josh O'Brien's solution that uses the bquote operator instead of call. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.
f <- function(dt, bycols, datacols, nvcols) {
  datacols = sapply(datacols, as.symbol)
  nvcols = sapply(nvcols, as.symbol)
  e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
        lapply(nvcols, function(x) bquote(.(x)[1])))
  e = as.call(c(as.symbol("list"), e))
  dt[, eval(e), by = bycols]
}
> f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8404 0.6713 1 1
2: b A 0.4492 0.4608 2 2
3: c A 0.6084 1.2032 3 3
4: d A 1.5510 1.2946 4 4
5: e A 1.1303 0.8573 5 5
---
126: v Z 0.5627 0.4282 126 126
127: w Z 0.7589 1.4429 127 127
128: x Z 0.7061 1.3737 128 128
129: y Z 0.6015 0.4488 129 129
130: z Z 1.5304 1.6012 130 130
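As a footnote: R 3.6.0 later added a splice argument to bquote, which should make the final as.call replaceable after all. A sketch (untested on older R; verify that argument names survive the splice on your version):
f2 <- function(dt, bycols, datacols, nvcols) {
  datacols = sapply(datacols, as.symbol)
  nvcols = sapply(nvcols, as.symbol)
  e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
        lapply(nvcols, function(x) bquote(.(x)[1])))
  # ..(e) splices the computed list of expressions into the list() call
  e = bquote(list(..(e)), splice = TRUE)
  dt[, eval(e), by = bycols]
}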

Related

Replace each observation in a data.frame with n copies [duplicate]

This question already has answers here: Repeat rows of a data.frame N times (10 answers). Closed 3 years ago.
I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) keeping the data types of the columns.
Example for N = 2:
Before:           After (N = 2):
  A B C             A B C
1 j i 100         1 j i 100
2 K P 101   -->   2 j i 100
                  3 K P 101
                  4 K P 101
So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...
My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but apply coerces the data.frame to a character matrix first, so my values all become characters and I get:
A B C
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"
df <- data.frame(a = 1:2, b = letters[1:2])
df[rep(seq_len(nrow(df)), each = 2), ]
A clean dplyr solution, taken from here
library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))
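For completeness, the same index trick carries over to data.table unchanged (a sketch; dt here is a hypothetical example table):
library(data.table)
dt <- data.table(x = 1:2, y = c("a", "b"))
dt[rep(seq_len(.N), each = 2)]  # each row twice; .N is the row count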
There is a lovely vectorized solution that repeats only certain rows n-times each, possible for example by adding an ntimes column to your data frame:
A B C ntimes
1 j i 100 2
2 K P 101 4
3 Z Z 102 1
Method:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))
Result:
A B C ntimes
1 Z Z 102 1
2 j i 100 2
3 j i 100 2
4 K P 101 4
5 K P 101 4
6 K P 101 4
7 K P 101 4
This is very similar to Josh O'Brien and Mark Miller's method:
df[rep(seq_len(nrow(df)), df$ntimes),]
However, that method appears quite a bit slower:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))
microbenchmark::microbenchmark(
  df[rep(seq_len(nrow(df)), df$ntimes),],
  as.data.frame(lapply(df, rep, df$ntimes)),
  times = 10
)
Result:
Unit: microseconds
expr min lq mean median uq max neval
df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757 10
as.data.frame(lapply(df, rep, df$ntimes)) 625.552 654.638 676.4067 668.094 681.929 799.893 10
If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:
library(mefa)
rep(mtcars,10)
or simply
mefa:::rep.data.frame(mtcars)
Adding to what @dardisco mentioned about mefa::rep.data.frame(), it's very flexible.
You can either repeat each row N times:
rep(df, each=N)
or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)
rep(df, times=N)
Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.
For reference, and adding to the answers citing mefa, it might be worth taking a look at the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:
> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f
The rep.row function seems to sometimes make lists for columns, which leads to bad memory hijinks. I have written the following which seems to work well:
library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}
My solution is similar to mefa:::rep.data.frame, but a little faster and it takes care of row names:
rep.data.frame <- function(x, times) {
  rnames <- attr(x, "row.names")
  x <- lapply(x, rep.int, times = times)
  class(x) <- "data.frame"
  if (!is.numeric(rnames))
    attr(x, "row.names") <- make.unique(rep.int(rnames, times))
  else
    attr(x, "row.names") <- .set_row_names(length(rnames) * times)
  x
}
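A quick usage sketch (assuming the definition above; df is a hypothetical example) showing that character row names are kept and made unique:
df <- data.frame(a = 1:2, b = letters[1:2], row.names = c("r1", "r2"))
rep.data.frame(df, 3)
# row names come back as r1, r2, r1.1, r2.1, r1.2, r2.2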
Compare solutions:
library(Lahman)
library(microbenchmark)
microbenchmark(
  mefa:::rep.data.frame(Batting, 10),
  rep.data.frame(Batting, 10),
  Batting[rep.int(seq_len(nrow(Batting)), 10), ],
  times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749 278.1066 356.3210 10 a
#> rep.data.frame(Batting, 10) 79.70335 82.8165 134.0974 87.2587 191.1713 307.4567 10 a
#> Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927 10 b
Try using, for example,
N = 2
rep(1:4, each = N)
as an index.
Another way to do this would be to first get row indices, append extra copies of the df, and then order by the indices:
df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]
Although the other solutions may be shorter, this method may be more advantageous in certain situations.

R - How to run average & max on different data.table columns based on multiple factors & return original colnames

I am changing my R code from data.frame + plyr to data.table as I need a faster and more memory-efficient way to handle a big data set. Unfortunately, my R skills are woefully limited and I've hit a wall for the whole day. I'd appreciate it if the SO experts here could enlighten me.
My Goals
Aggregate rows in my data.table based on 2 functions - average and max - run on selected columns (with column names passed via vector) while grouping by columns also passed via vector.
The resulting DT should contain the original column names.
There should not be unnecessary copying of the DT in order to conserve memory
My Test Code
DT = data.table( a=LETTERS[c(1,1,1:4)],b=4:9, c=3:8, d = rnorm(6),
e=LETTERS[c(rep(25,3),rep(26,3))], key="a" )
GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c( "c", "d")
What I tried but didn't work for me
DT[, list( b=max( b ), c=mean(c), d=mean(d) ), by=c( GrpVar1, GrpVar2 ) ]
# Hard-code col name - not what I want
DT[, list( max( get(VarToMax) ), mean( get(VarToAve) )), by=c( GrpVar1, GrpVar2 ) ]
# Col names become 'V1', 'V2', worse, 1 column goes missing - Not what I want either
DT[, list( get(VarToMax)=max( get(VarToMax) ),
get(VarToAve)=mean( get(VarToAve) ) ), by=c( GrpVar1, GrpVar2 ) ]
# Above code gave Error!
Additional Question
Based on my very limited understanding of DTs, the with = F argument should instruct R to parse the values of VarToMax and VarToAve, but running the code below leads to an error.
DT[, list( max(VarToMax), mean(VarToAve) ), by=c( GrpVar1, GrpVar2 ), with=F ]
# Error in `[.data.table`(DT, , list(max(VarToMax), mean(VarToAve)), by = c(GrpVar1, :
# object 'ansvals' not found
# In addition: Warning message:
# In mean.default(VarToAve) :
# argument is not numeric or logical: returning NA
Existing SO solutions can't help
Arun's solution was how I got to this point, but I am very stuck. His other solution using lapply and .SDcols involves creating two extra DTs, which does not meet my memory-conserving requirement.
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
I am SO confused over data.table! Any help would be most appreciated!
In a similar fashion to @David Arenburg, but using .SDcols in order to simplify the notation. I also show the code up to the merge.
DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]
merge(DTmaxs, DTaves)
## a e b c d
## 1: A Y 6 4 0.2230091
## 2: B Z 7 6 0.5909434
## 3: C Z 8 7 -0.4828223
## 4: D Z 9 8 -1.3591240
Alternatively, you can do this in one go by subsetting the .SD using the .. notation to look for VarToAve in the parent frame of .SD (as opposed to a column named VarToAve)
DT[, c(lapply(.SD[, ..VarToAve], mean),
       lapply(.SD[, ..VarToMax], max)),
   by = c(GrpVar1, GrpVar2)]
## a e c d b
## 1: A Y 4 0.2230091 6
## 2: B Z 6 0.5909434 7
## 3: C Z 7 -0.4828223 8
## 4: D Z 8 -1.3591240 9
Here's my humble attempt
DT[, as.list(c(setNames(max(get(VarToMax)), VarToMax),
               lapply(.SD[, ..VarToAve], mean))),
   c(GrpVar1, GrpVar2)]
# a e b c d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6 0.2508633
# 3: C Z 8 7 1.1966517
# 4: D Z 9 8 1.7291615
Or, for maximum efficiency, you could use a colMeans and eval(as.name()) combination instead of lapply and get:
DT[, as.list(c(setNames(max(eval(as.name(VarToMax))), VarToMax),
               colMeans(.SD[, ..VarToAve]))),
   c(GrpVar1, GrpVar2)]
# a e b c d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6 0.2508633
# 3: C Z 8 7 1.1966517
# 4: D Z 9 8 1.7291615

Split-apply on sequences

Every now and again I have the problem that I need to split a data.frame where one column is a (possibly unordered) sequence. The split should happen at those rows where a certain criterion is fulfilled in the sequence.
So assume this data.frame as a simple example:
dt <- data.frame( A = sort(sample( 1:300, 100 )) , B = rnorm(100) )
I want to split dt whenever a gap larger than 4 occurs in A, and calculate the mean of B within each chunk. What I do is introduce an id variable F by
dt[ , "F" ] <- c( 0, cumsum( diff( dt[, "A"] ) > 4) )
head(dt)
A B F
1 2 -0.8019945 0
2 6 -0.1948101 0
3 7 0.1961203 0
4 12 -0.2478185 1
5 13 1.2571841 1
6 14 2.1354909 1
and then
library(plyr)
ddply( dt, .(F), summarise,
       A.range = paste( range(A), collapse = "-" ),
       B.mean = mean( B )
)
F A.range B.mean
1 0 2-7 -0.26689475
2 1 12-17 0.57051336
3 2 25-25 0.29054572
My question is: Is there no such function in base or other packages (plyr, data.table, zoo, ...) that replaces the cumsum/diff trick and also gives me more flexibility in the splitting criterion?
I think you're doing it the right way. To make it slightly more efficient (from a programming perspective), you can call the cumsum/diff [or other function] directly in the ddply() call
ddply( dt, .(F=c( 0, cumsum( diff( dt[, "A"] ) > 4) )), summarise,
       A.range = paste( range(A), collapse = "-" ),
       B.mean = mean( B )
)
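If you are already on data.table, the same cumsum/diff trick works there too, with the grouping expression computed directly in by — a sketch assuming the dt defined above:
library(data.table)
DT <- as.data.table(dt)
DT[, .(A.range = paste(range(A), collapse = "-"), B.mean = mean(B)),
   by = .(F = cumsum(c(0, diff(A) > 4)))]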

R help on aggregation function

For my question I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49 - 43) / (2 + 3)?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the four different cells 49, 43, 2, and 3 inside FUN.
I tried x[[1]][1] and similar things but can't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
When you aggregate, the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example:
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there, look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b and c, were aggregated, but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate, you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- diffb$b / Y$c
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)
You have to do a little work to get the result back into a data.frame if you really want one.
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier.
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate is not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = diff(b) / sum(c)), by = a]
You can use the base by() function:
listOfRows <-
  by(data = DF,
     INDICES = DF$a,
     FUN = function(x){ data.frame(a = x$a[1], res = (x$b[1] - x$b[2])/(x$c[1] + x$c[2])) })
newDF <- do.call(rbind, listOfRows)
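Since plyr came up, dplyr (its successor) makes this a one-liner as well — a sketch, not from the original answers:
library(dplyr)
DF %>%
  group_by(a) %>%
  summarise(res = (b[1] - b[2]) / sum(c))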
