I have an extremely long nested list of size several million. Here are the first few entries:
d1
[[1]]
x Freq
1 NA 4
[[2]]
x Freq
1 0005073936 8
2 NA 4
[[3]]
x Freq
1 0005073936 14
I want to populate the vector "s_week" with the value of "x" that has the maximum frequency ("Freq") in each list entry. For instance, in the above case, the answer would be
s_week = c("NA", "0005073936", "0005073936")
Here's my attempt to populate this vector iteratively.
for(i in 1:length(d1)){
s_week[i]=as.character(d1[[i]]$x[which(d1[[i]]$Freq==max(d1[[i]]$Freq))][1])
}
However, this is excruciatingly slow and takes forever as the list has more than 100 million entries. I was wondering if there's a more elegant non-iterative solution using lapply or its variants?
Thanks in advance for the help!
Well, it also matters a great deal whether we use the $ operator or the [[ brackets for the extraction; otherwise the solution might actually be slower than a for loop. vapply is also worth a try: it is similar to sapply, but has a pre-specified type of return value (in our case character(1)) and thus might be faster.
vapply(H, function(item) item$x[which.max(item$Freq)], FUN.VALUE=character(1))
I did a benchmark for you. List H has length 1e5, entries have an average of 2.00 rows with SD 0.58, column x contains NA at random. I hope I got it more or less right.
H[3:5]
# [[1]]
# x Freq
# 1 <NA> 15
# 2 <NA> 7
#
# [[2]]
# x Freq
# 1 <NA> 8
# 2 <NA> 7
# 3 0000765808 14
#
# [[3]]
# x Freq
# 1 <NA> 9
# 2 0000618128 9
# 3 <NA> 5
sapply(H[[3]], class)
# x Freq
# "character" "numeric"
Benchmark
s_week <- NA
microbenchmark::microbenchmark(
vapply=s_week <- vapply(H, function(item) item$x[which.max(item$Freq)],
FUN.VALUE=character(1)),
sapply=s_week <- sapply(H, function(item) item$x[which.max(item$Freq)]),
lapply2=s_week <- unlist(lapply(H, function(x) x$x[which.max(x$Freq)])),
forloop={for(i in 1:length(H)) {
s_week[i]=as.character(H[[i]]$x[which(H[[i]]$Freq == max(H[[i]]$Freq))][1])
}},
vapply2=s_week <- vapply(H, function(item) item[["x"]][which.max(item[["Freq"]])],
FUN.VALUE=character(1)),
lapply=s_week <- unlist(lapply(H, function(item) item[["x"]][which.max(item[["Freq"]])])),
sapply2=s_week <- sapply(H, function(item) item[["x"]][which.max(item[["Freq"]])]),
times=20L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# vapply 508.1789 525.1708 589.4401 550.5763 577.3948 956.8675 20 a
# sapply 526.0700 552.1580 651.5795 586.8449 631.1057 1038.6949 20 a
# lapply2 528.9962 564.0170 594.9651 590.1182 618.8509 715.0774 20 a
# forloop 820.0938 890.6525 1004.3736 912.5017 1048.2990 1449.8975 20 b
# vapply2 1694.4961 1787.8798 2028.4530 1863.9924 1919.8244 3349.9039 20 c
# lapply 1700.2831 1851.8868 2102.6394 1938.5132 2161.0250 2964.7155 20 c
# sapply2 1752.4071 1883.6729 2069.3157 1971.4675 2074.1322 3216.9192 20 c
Note: Performed on an AMD FX(tm)-8350 Eight-Core Processor.
As it turns out, vapply with $ extraction seems to be the fastest. The for loop actually still seems to be faster than the lapply variant that uses [[ for extraction.
I've taken data.table::rbindlist out of the benchmark since it performed unexpectedly slowly. There might not really be an advantage since we don't start with data.table objects. (Or perhaps the code is somewhat flawed? I'm not too familiar with data.table. It also seems that some system process is constantly involved.)
library(data.table)
system.time(
s_week <- rbindlist(H, idcol=TRUE)[, .SD[which.max(Freq)], by=.id][, x]
)
# user system elapsed
# 41.26 15.93 35.44
I also found a tidyverse solution in the revision history that performed very slowly and therefore also didn't make it into my benchmark.
library(tidyverse)
system.time(
s_week <- map(H, ~ .x %>% slice(which.max(Freq)) %>% pull(x)) %>% unlist
)
# user system elapsed
# 70.59 0.18 72.12
Data
set.seed(42)
H <- replicate(1e5, {
n <- sample(1:3, 1, replace=TRUE)
data.frame(x=sprintf("%010d", sample(9:1e6, n)),
Freq=round(abs(rnorm(n, 6.2, 5)) + 1), stringsAsFactors=FALSE)
}, simplify=FALSE)
# create NA's at random positions in column x
H <- lapply(H, function(x) {
  s <- sample(1:nrow(x), sample(1:nrow(x), 1), replace=FALSE)
  if (length(s) != 0)
    x[s, 1] <- NA
  return(x)
})
Try:
unlist(lapply(d1, function(x) x[["x"]][which.max(x[["Freq"]])]))
As @jay.sf suggests, you may also rather use $ instead of [[:
unlist(lapply(d1, function(x) x$x[which.max(x$Freq)]))
Related
I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) keeping the data types of the columns.
Example for N = 2:
  A B   C              A B   C
1 j i 100          1   j i 100
2 K P 101   -->    2   j i 100
                   3   K P 101
                   4   K P 101
So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...
My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but this one transforms my values to characters and I get:
A B C
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"
df <- data.frame(a = 1:2, b = letters[1:2])
df[rep(seq_len(nrow(df)), each = 2), ]
A clean dplyr solution, taken from here
library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))
There is a lovely vectorized solution that repeats certain rows n times each, made possible, for example, by adding an ntimes column to your data frame:
A B C ntimes
1 j i 100 2
2 K P 101 4
3 Z Z 102 1
Method:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))
Result:
A B C ntimes
1 Z Z 102 1
2 j i 100 2
3 j i 100 2
4 K P 101 4
5 K P 101 4
6 K P 101 4
7 K P 101 4
This is very similar to Josh O'Brien and Mark Miller's method:
df[rep(seq_len(nrow(df)), df$ntimes),]
However, that method appears quite a bit slower:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))
microbenchmark::microbenchmark(
df[rep(seq_len(nrow(df)), df$ntimes),],
as.data.frame(lapply(df, rep, df$ntimes)),
times = 10
)
Result:
Unit: microseconds
expr min lq mean median uq max neval
df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757 10
as.data.frame(lapply(df, rep, df$ntimes)) 625.552 654.638 676.4067 668.094 681.929 799.893 10
If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:
library(mefa)
rep(mtcars,10)
or simply
mefa:::rep.data.frame(mtcars)
Adding to what @dardisco mentioned about mefa::rep.data.frame(): it's very flexible.
You can either repeat each row N times:
rep(df, each=N)
or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)
rep(df, times=N)
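A small self-contained illustration of both calls (assuming mefa is installed and dispatches rep on data frames as described above; df here is just a toy example):
library(mefa)
df <- data.frame(A = c("j", "K"), C = c(100, 101), stringsAsFactors = FALSE)
rep(df, each = 2)   # rows come out as j, j, K, K (each row repeated in place)
rep(df, times = 2)  # rows come out as j, K, j, K (whole data frame stacked twice)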
Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.
For reference and adding to the answers citing mefa, it might be worth taking a look at the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:
> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f
The rep.row function seems to sometimes produce list columns, which leads to bad memory hijinks. I have written the following, which seems to work well:
library(plyr)
rep.row <- function(r, n){
colwise(function(x) rep(x, n))(r)
}
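For illustration, a possible use of rep.row as defined above (hypothetical toy data):
df <- data.frame(A = "j", B = "i", C = 100, stringsAsFactors = FALSE)
rep.row(df[1, ], 3)
# returns a 3-row data frame; column classes (character, numeric) are preserved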
My solution is similar to mefa:::rep.data.frame, but a little faster, and it takes care of row names:
rep.data.frame <- function(x, times) {
rnames <- attr(x, "row.names")
x <- lapply(x, rep.int, times = times)
class(x) <- "data.frame"
if (!is.numeric(rnames))
attr(x, "row.names") <- make.unique(rep.int(rnames, times))
else
attr(x, "row.names") <- .set_row_names(length(rnames) * times)
x
}
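A quick illustrative check of the row-name handling (not part of the original benchmark):
rep.data.frame(mtcars[1:3, ], 2)
# character row names are kept and made unique:
# "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Mazda RX4.1", "Mazda RX4 Wag.1", ...
# data frames with default integer row names simply get fresh row numbers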
Compare solutions:
library(Lahman)
library(microbenchmark)
microbenchmark(
mefa:::rep.data.frame(Batting, 10),
rep.data.frame(Batting, 10),
Batting[rep.int(seq_len(nrow(Batting)), 10), ],
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749 278.1066 356.3210 10 a
#> rep.data.frame(Batting, 10) 79.70335 82.8165 134.0974 87.2587 191.1713 307.4567 10 a
#> Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927 10 b
Try using, for example,
N=2
rep(1:4, each = N)
as an index
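Spelled out (a sketch using the questioner's old.df, assuming it has 4 rows as in the snippet above):
N <- 2
old.df[rep(1:4, each = N), ]
# or, without hard-coding the row count:
old.df[rep(seq_len(nrow(old.df)), each = N), ]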
Another way to do this would be to first add row indices, append extra copies of the df, and then order by the indices:
df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]
Although the other solutions may be shorter, this method may be more advantageous in certain situations.
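The same idea generalized to an arbitrary N might look like this (a sketch, not part of the original answer):
N <- 3
df$index <- seq_len(nrow(df))                              # remember the original row order
df <- do.call(rbind, replicate(N, df, simplify = FALSE))   # N stacked copies
df <- df[order(df$index), , drop = FALSE]                  # group the copies of each row
df$index <- NULL                                           # drop the helper column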
I have two vectors, one (A) of about 100 million non-unique elements (integers), the other (B) of 1 million of the same, unique, elements. I am trying to get a list containing the indices of the repeated instances of each element of B in A.
A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3
# would result in this:
[[1]]
[1] 2 3 4 6 7
[[2]]
[1] 1 5 10
[[3]]
[1] 8 9
I first, naively, tried this:
b_indices <- lapply(B, function(b) which(A == b))
which is horribly inefficient and would apparently take years to complete.
The second thing I tried was to create a list of empty vectors, indexed with all elements of B, and to then loop through A, appending the index to the corresponding vector for each element in A. Although technically O(n), I'm not sure about the time to repeatedly append elements. This approach would apparently take ~ 2-3 days, which is still too slow...
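Roughly, the appending approach I have in mind looks like this (a sketch only, not my actual code):
b_indices <- vector("list", length(B))
names(b_indices) <- as.character(B)
for (i in seq_along(A)) {
  key <- as.character(A[i])
  # c() copies the vector on every append, which is where the time goes
  b_indices[[key]] <- c(b_indices[[key]], i)
}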
Is there anything that could work faster?
This is fast:
A1 <- order(A, method = "radix")
split(A1, A[A1])
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1] 1 5 10
#
#$`3`
#[1] 8 9
B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)
system.time({
A1 <- order(A, method = "radix")
res <- split(A1, A[A1])
})
# user system elapsed
#8.650 1.056 9.704
data.table is arguably the most efficient way of dealing with Big Data in R, and it would even let you avoid having to use that 1-million-length vector altogether!
require(data.table)
a <- data.table(x=rep(c("a","b","c"),each=3))
a[ , list( yidx = list(.I) ) , by = x ]
   x  yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9
Using your example data:
a <- data.table(x=c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]
   x      yidx
1: 2   1, 5,10
2: 1 2,3,4,6,7
3: 3       8,9
Add this to your benchmarks. I dare say it should be significantly faster than using the built-in functions, if you test it at scale. The bigger the data the better the relative performance of data.table in my experience.
In my benchmark it only takes about 46% as long as order on my Debian laptop and only 5% as long as order on my Windows laptop with 8GB RAM and a 2.x GHz CPU.
B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})
user system elapsed
4.25 0.22 4.50
We can also use dplyr
library(dplyr)
data_frame(A) %>%
mutate(B = row_number()) %>%
group_by(A) %>%
summarise(B = list(B)) %>%
.$B
#[[1]]
#[1] 2 3 4 6 7
#[[2]]
#[1] 1 5 10
#[[3]]
#[1] 8 9
On a smaller dataset of size 1e5, system.time gives
# user system elapsed
# 0.01 0.00 0.02
but with the larger example shown in the other post, it is slower. However, this is dplyr...
A quick example
a <- c(1,1,2)
b <- c(1000,200,20)
c <- c(10,20,10)
myframe <- data.frame(a,b,c)
> myframe
a b c
1 1 1000 10
2 1 200 20
3 2 20 10
I now want to aggregate the values of column c where the value of column a equals 1. The result should consequently be 30.
Just a word about the original data: the data frame has about 100,000 rows and 400 columns. The rows to aggregate appear about 10-30 times in the data.
Sum the values of c where a == 1.
with(myframe, sum(c[a == 1]))
# [1] 30
If you have a very big data set, maybe use data.table binary search (although it seems @Sven's solution will be efficient enough)
library(data.table)
setkey(setDT(myframe), a)[J(1), sum(c)]
# [1] 30
To illustrate the difference, one can show that for a data set of 1 million rows, binary search is faster by a factor of about 6:
set.seed(123)
n <- 1e6
a <- sample(1e3, n, replace = TRUE)
b <- sample(1e4, n, replace = TRUE)
c <- sample(1e2, n, replace = TRUE)
myframe <- data.frame(a,b,c)
myframe2 <- copy(myframe)
library(microbenchmark)
microbenchmark(Sven = with(myframe, sum(c[a == 1])),
David = setkey(setDT(myframe2), a)[J(1), sum(c)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# Sven 28.020912 30.171903 32.858967 31.464116 32.766395 71.02099 100
# David 3.696436 4.080331 5.719189 4.469356 6.167174 43.38575 100
The aggregate function can be used:
> aggregate(c~a, data=myframe, sum)
a c
1 1 30
2 2 10
data.table version:
> library(data.table)
> setDT(myframe)[,list(sum=sum(c)),by=a]
a sum
1: 1 30
2: 2 10
I have a data.table with the following features:
bycols: columns that divide the data into groups
nonvaryingcols: columns that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
datacols: columns to be aggregated / summarized (e.g. sum them within group)
I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above inputs as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.
Example data, where the different sets of columns are stored in character vectors.
require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
g1 = rep( letters, 10 ),
g2 = rep( c(LETTERS,LETTERS), each = 5 ),
dat1 = runif( 260 ),
dat2 = runif( 260 ),
nv1 = rep( seq(130), 2),
nv2 = rep( seq(130), 2)
)
Final data should look like:
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8403809 0.6713090 1 1
2: b A 0.4491883 0.4607716 2 2
3: c A 0.6083939 1.2031960 3 3
4: d A 1.5510033 1.2945761 4 4
5: e A 1.1302971 0.8573135 5 5
6: f B 1.4964821 0.5133297 6 6
I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.
As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
Here, for instance, and given the particular values in your example, you'd like a call that looks like:
test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
by=c("g1", "g2")]
so the expression you'd like evaluated in the j-slot is
list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
Most of the following function is taken up with constructing just that expression:
f <- function(dt, bycols, datacols, nvcols) {
e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
e<- as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
## g1 g2 dat1 dat2 nv1 nv2
## 1: a A 0.8403809 0.6713090 1 1
## 2: b A 0.4491883 0.4607716 2 2
## 3: c A 0.6083939 1.2031960 3 3
## 4: d A 1.5510033 1.2945761 4 4
## 5: e A 1.1302971 0.8573135 5 5
## ---
## 126: v Z 0.5627018 0.4282380 126 126
## 127: w Z 0.7588966 1.4429034 127 127
## 128: x Z 0.7060596 1.3736510 128 128
## 129: y Z 0.6015249 0.4488285 129 129
## 130: z Z 1.5304034 1.6012207 130 130
Here's what I had come up with. It works, but very slowly.
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
Benchmarks
FunJosh <- function() {
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
}
FunEddi <- function() {
cbind(
test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
)
}
library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE
microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
# expr min lq median uq max neval
# FunJosh() 2.749164 2.958478 3.098998 3.470937 6.863933 100
# FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240 100
# FunEddi() 5.877494 6.229739 6.528205 7.375939 112.895573 100
At least two orders of magnitude slower than @joshobrien's solution. Edit: @Eddi's solution is much faster as well, and shows that cbind wasn't optimal but could be fairly fast in the right hands. The slowdown might be all the transforming and sapplying I was doing rather than just directly using lapply.
Just for a bit of variety, here is a variant of @Josh O'Brien's solution that uses bquote instead of call. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.
f <- function(dt, bycols, datacols, nvcols) {
datacols = sapply(datacols, as.symbol)
nvcols = sapply(nvcols, as.symbol)
e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
lapply(nvcols, function(x) bquote(.(x)[1])))
e = as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
> f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8404 0.6713 1 1
2: b A 0.4492 0.4608 2 2
3: c A 0.6084 1.2032 3 3
4: d A 1.5510 1.2946 4 4
5: e A 1.1303 0.8573 5 5
---
126: v Z 0.5627 0.4282 126 126
127: w Z 0.7589 1.4429 127 127
128: x Z 0.7061 1.3737 128 128
129: y Z 0.6015 0.4488 129 129
130: z Z 1.5304 1.6012 130 130