I am trying to dynamically build and evaluate a function call from a string input and am stuck, again, on metaprogramming/evaluation (https://adv-r.hadley.nz/metaprogramming.html). I suspect this has been answered on SO, but I searched and could not work out a solution from other posts; if there is an existing answer, please let me know and flag this as a duplicate. Thank you for your time and help! Below is a reprex of the issue.
library(dplyr)
library(purrr)
library(rlang)
library(palmerpenguins)
# Create data to join with penguins
penguin_colors <-
tibble(
species = c("Adelie", "Chinstrap", "Gentoo"),
color = c("orange", "purple", "green")
)
# Create function to do specified join and print join type
foo <- function(JOINTYPE) {
# DOESN'T RUN
# JOINTYPE_join(penguins, penguin_colors, by = "species")
# call2(sym(paste0(JOINTYPE, "_join")), x = penguins, y = penguin_colors, by = "species")
print(JOINTYPE)
}
# Desired behavior of foo when JOINTYPE == "inner"
inner_join(penguins, penguin_colors, by = "species")
#> # A tibble: 344 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
#> # color <chr>
print("inner")
#> [1] "inner"
# Use function in for loop
for (JOINTYPE in c("inner", "left", "right")) {
foo(JOINTYPE)
}
#> [1] "inner"
#> [1] "left"
#> [1] "right"
# Use function in vectorised fashion
walk(c("inner", "left", "right"), foo)
#> [1] "inner"
#> [1] "left"
#> [1] "right"
Created on 2020-10-27 by the reprex package (v0.3.0)
One option is to use get() to retrieve the appropriate function:
join <- function(JOINTYPE) {
get( paste0(JOINTYPE, "_join") )
}
join("inner")(penguins, penguin_colors, by="species")
If using rlang, the more appropriate function here is rlang::exec:
join2 <- function(JOINTYPE, ...) {
rlang::exec( paste0(JOINTYPE, "_join"), ... )
}
join2("inner", penguins, penguin_colors, by="species")
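Either approach will happily look up any name that happens to end in "_join". A minimal sketch of a stricter variant follows, assuming you only want a fixed set of join types; the helper name join_by_type and the toy data frames are made up for illustration, and getExportedValue() is used so the lookup is restricted to dplyr's exports:

```r
library(dplyr)

# Hypothetical helper: validate the join type, then fetch the function from dplyr.
join_by_type <- function(x, y, type = c("inner", "left", "right", "full"), ...) {
  type <- match.arg(type)  # errors early on anything outside the allowed set
  join_fun <- getExportedValue("dplyr", paste0(type, "_join"))
  join_fun(x, y, ...)
}

# Toy data (made up) so the example runs without palmerpenguins.
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Emperor"),
  mass = c(3700, 5000, 23000)
)
colors <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  color = c("orange", "purple", "green")
)

# Run several join types in one go and compare the row counts.
results <- lapply(
  c(inner = "inner", left = "left", right = "right"),
  function(type) join_by_type(birds, colors, type = type, by = "species")
)
sapply(results, nrow)
```

A misspelled type such as "iner" then fails inside match.arg() with a message listing the allowed values, rather than with a less obvious "object not found" error.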
Related
penguins %>%
select(species,island,sex) %>%
rename(island_new=island) %>%
rename_with(penguins,toupper)
This code throws an error. Can someone help me fix it?
The pipe already passes its left-hand side as the first argument of rename_with(), so you don't need to pass penguins again. With the extra argument, penguins ends up in the .fn position and toupper in .cols, which is what triggers the error:
penguins %>%
select(species,island,sex) %>%
rename(island_new=island) %>%
rename_with(toupper)
# A tibble: 344 x 3
SPECIES ISLAND_NEW SEX
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen NA
5 Adelie Torgersen female
6 Adelie Torgersen male
7 Adelie Torgersen female
8 Adelie Torgersen male
9 Adelie Torgersen NA
10 Adelie Torgersen NA
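To make the argument matching concrete, here is a small sketch on a made-up one-row data frame (toy is not part of the original question). It shows that the piped and the direct call are the same thing, and how .cols can limit the renaming:

```r
library(dplyr)

# Made-up data, just large enough to show the column names changing.
toy <- data.frame(species = "Adelie", island = "Torgersen")

# The pipe supplies `toy` as the first argument, so these two calls are equivalent.
piped  <- toy %>% rename_with(toupper)
direct <- rename_with(toy, toupper)

# .cols restricts which columns the function is applied to.
partial <- toy %>% rename_with(toupper, .cols = island)
names(partial)
```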
I have the variable 'county' as a column, but when I try to aggregate using group_by() with across() like this:
testing4 <- testing2 %>%
group_by(across(-c(county, population))) %>%
summarise(pop=sum(population))
it gives me:
Error: Problem with `mutate()` input `..1`.
x Can't subset columns that don't exist.
x Column `county` doesn't exist.
Input `..1` is `across(-c(county, population))`.
i The error occurred in group 1: year = 1980, state = "AK", stfips = 2,
county = 2900.
Run `rlang::last_error()` to see where the error occurred.
However, when I do
testing3 <- testing2 %>%
group_by(year, state, stfips, race) %>%
summarise(pop = sum(population))
it runs fine.
Edit: Someone asked for dput(head(testing2))
dput(head(testing2))
structure(list(year = c(1980L, 1980L, 1980L, 1980L, 1980L, 1980L
), state = c("AK", "AK", "AK", "AL", "AL", "AL"), stfips = c(2L,
2L, 2L, 1L, 1L, 1L), county = c(2900L, 2900L, 2900L, 1001L, 1001L,
1001L), race = c(1L, 2L, 3L, 1L, 2L, 3L), population = c(318054L,
13960L, 72666L, 24876L, 7193L, 148L)), row.names = c(NA, -6L), groups =
structure(list(
year = c(1980L, 1980L), state = c("AK", "AL"), stfips = 2:1,
county = c(2900L, 1001L), .rows = structure(list(1:3, 4:6), ptype =
integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Inserting an ungroup() as the second step should work. For a minimal comparison, compare
testing2 %>% group_by(across(-county))
and
testing2 %>% ungroup() %>% group_by(across(-county))
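Applied to the full pipeline from the question, that could look like the sketch below. The data frame toy only mimics the shape of testing2 (the values are made up), and .groups = "drop" is an optional extra that returns an ungrouped result:

```r
library(dplyr)

# Toy data shaped like testing2 (values are made up), already grouped.
toy <- data.frame(
  year = 1980L,
  state = c("AK", "AK", "AL", "AL"),
  county = c(2900L, 2901L, 1001L, 1003L),
  race = 1L,
  population = c(100L, 50L, 30L, 20L)
) %>%
  group_by(year, state, county)

totals <- toy %>%
  ungroup() %>%                                 # drop the old grouping first
  group_by(across(-c(county, population))) %>%  # group by every remaining column
  summarise(pop = sum(population), .groups = "drop")

totals
```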
Welcome!
When you group a tibble, the functions applied after grouping operate on the grouped data, excluding the grouping variables (see ?group_by). You can access that per-group data inside such a function with cur_data() (see ?cur_data).
Step 1: when you use across() on a grouped tibble (as yours is), it works, for each group, on the data without the grouping variables. across() without the .fns argument (default NULL; see ?across) returns the selected variables unmodified, starting from input data that, in your case, no longer contains the old grouping variables. Hence you cannot refer to a grouping variable inside across() on a grouped tibble.
Step 2: by default, group_by() overrides any existing grouping (see the examples in ?group_by).
Combining the two, you do not need to list a variable you want to exclude if it is already a grouping variable. To re-group a tibble by other variables, exclude only the additional ones you do not want; the old grouping variables are already excluded when the new groups are computed. When the new group_by() (which evaluates across() by group, i.e. without the grouping variables) combines the results, it returns the whole tibble grouped by everything except the previous grouping variables and the ones you have just excluded.
A problem can arise if you want to re-group a grouped tibble, excluding other variables but keeping (a subset of) the old grouping variables. In that case, list the grouping variables you want to keep outside the call to across(), directly in group_by(). Unlike across(), group_by() itself does not compute anything on the per-group data, so it can still see the old grouping variables. The last group_by() then creates a tibble grouped by all the variables that are not old grouping variables and not newly excluded, plus the old ones listed outside across().
Here is a running (reproducible) example:
# install.packages("tidyverse")
# install.packages("palmerpenguins")
library(tidyverse)
library(palmerpenguins)
penguins
#> # A tibble: 344 x 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>%
group_by(species, island) %>%
group_by(across(-c(
starts_with("bill"),
starts_with("flipper"),
starts_with("body")
))) # species and island are already excluded
#> # A tibble: 344 x 8
#> # Groups: sex, year [9]
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>%
group_by(species, island) %>%
group_by(
across(-c(
starts_with("bill"),
starts_with("flipper"),
starts_with("body")
)),
species # "continue" to use species for grouping
)
#> # A tibble: 344 x 8
#> # Groups: sex, year, species [22]
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Created on 2020-09-07 by the reprex package (v0.3.0)
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices datasets utils methods base
#>
#> other attached packages:
#> [1] palmerpenguins_0.1.0 forcats_0.5.0 stringr_1.4.0
#> [4] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1
#> [7] tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
#> [10] tidyverse_1.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.0 xfun_0.16 haven_2.3.1 colorspace_1.4-1
#> [5] vctrs_0.3.4 generics_0.0.2 htmltools_0.5.0 yaml_2.2.1
#> [9] utf8_1.1.4 blob_1.2.1 rlang_0.4.7 pillar_1.4.6
#> [13] glue_1.4.2 withr_2.2.0 DBI_1.1.0 dbplyr_1.4.4
#> [17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
#> [21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 evaluate_0.14
#> [25] knitr_1.29 fansi_0.4.1 highr_0.8 broom_0.7.0
#> [29] Rcpp_1.0.5 renv_0.12.0 scales_1.1.1 backports_1.1.9
#> [33] jsonlite_1.7.0 fs_1.5.0 hms_0.5.3 digest_0.6.25
#> [37] stringi_1.4.6 grid_4.0.2 cli_2.0.2 tools_4.0.2
#> [41] magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
#> [45] xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
#> [49] rmarkdown_2.3 httr_1.4.2 R6_2.4.1 compiler_4.0.2
I have an example data frame:
df <- data.frame(x = 1:112, y = runif(112))
Is there a way to print a list of data frames with the first part of the list containing rows 1:10, the second 11:20, etc. up until the end (111:112)?
You could use split(), with rep() to create the groupings.
n <- 10
nr <- nrow(df)
split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))
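To see what the grouping vector does, the sketch below builds it separately and checks the chunk sizes. The names chunk_id and chunks are only for illustration, and set.seed() is there just to make the toy data reproducible:

```r
set.seed(1)  # only so the toy data is reproducible
df <- data.frame(x = 1:112, y = runif(112))

n  <- 10
nr <- nrow(df)

# One chunk id per row: 1 repeated n times, then 2, and so on, cut off at nr entries.
chunk_id <- rep(seq_len(ceiling(nr / n)), each = n, length.out = nr)
chunks <- split(df, chunk_id)

length(chunks)         # number of chunks
sapply(chunks, nrow)   # rows per chunk; only the last one is shorter
range(chunks[[12]]$x)  # rows covered by the last chunk
```

Using length.out = nr is what truncates the final group, so the grouping vector always has exactly one entry per row.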
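This can be solved with nesting using tidyr/dplyr. Note that this example uses iris and fixes the number of groups (10) rather than the number of rows per group: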
require(dplyr)
require(tidyr)
num_groups = 10
iris %>%
group_by((row_number()-1) %/% (n()/num_groups)) %>%
nest %>% pull(data)
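If you want a fixed number of rows per chunk, as in the question, the same nesting idea can be adapted as in the sketch below. The names rows_per_chunk and chunk are assumptions introduced here, not part of the answer above:

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = 1:112, y = runif(112))
rows_per_chunk <- 10  # hypothetical name for the chunk size

chunks <- df %>%
  # Integer division of the zero-based row index gives the chunk id: 0, 0, ..., 1, 1, ...
  group_by(chunk = (row_number() - 1) %/% rows_per_chunk) %>%
  nest() %>%     # one row per chunk, with the chunk's rows in the list-column `data`
  pull(data)     # extract the list of tibbles

length(chunks)
```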
Based on Rick's answer, here is a variant that avoids holding copies of all the chunks in memory at once. Instead, a callback is called with each chunk. The chunk size can be specified either as a number of rows or as a number of cells.
split_df <- function(x, ..., size_cells = NULL, size_rows = NULL, callback) {
stopifnot(is.function(callback))
if (is.null(size_rows)) {
size_rows <- max(floor(size_cells / ncol(x)), 1)
}
n_rows <- nrow(x)
n_chunks <- ceiling(n_rows / size_rows)
idx <- rep(seq.int(n_chunks), each = size_rows, length.out = n_rows)
split <- split(seq_len(n_rows), idx)
lapply(split, function(i) {
callback(x[i, , drop = FALSE])
NULL
})
invisible()
}
# 30 cells = 3 rows
split_df(palmerpenguins::penguins[1:10, ], size_cells = 30, callback = print)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 39.1 18.7 181 3750 male
#> 2 Adelie Torge… 39.5 17.4 186 3800 fema…
#> 3 Adelie Torge… 40.3 18 195 3250 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… NA NA NA NA <NA>
#> 2 Adelie Torge… 36.7 19.3 193 3450 fema…
#> 3 Adelie Torge… 39.3 20.6 190 3650 male
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 38.9 17.8 181 3625 fema…
#> 2 Adelie Torge… 39.2 19.6 195 4675 male
#> 3 Adelie Torge… 34.1 18.1 193 3475 <NA>
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 42 20.2 190 4250 <NA>
#> # … with 1 more variable: year <int>
# Specify number of rows instead
split_df(palmerpenguins::penguins[1:3, ], size_rows = 2, callback = print)
#> # A tibble: 2 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 39.1 18.7 181 3750 male
#> 2 Adelie Torge… 39.5 17.4 186 3800 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 40.3 18 195 3250 fema…
#> # … with 1 more variable: year <int>
Created on 2021-12-18 by the reprex package (v2.0.1)
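The callback does not have to print. As a rough sketch of the same idea, the hypothetical helper for_each_chunk() below (a compact stand-in, not the split_df() from above) walks over row ranges and lets the callback collect something small from each chunk:

```r
# Hypothetical compact variant of the callback idea: iterate over row ranges.
for_each_chunk <- function(x, size_rows, callback) {
  stopifnot(is.function(callback), size_rows >= 1)
  if (nrow(x) == 0) return(invisible(NULL))
  for (start in seq(1, nrow(x), by = size_rows)) {
    rows <- start:min(start + size_rows - 1, nrow(x))
    callback(x[rows, , drop = FALSE])  # only one chunk exists at a time
  }
  invisible(NULL)
}

df <- data.frame(x = 1:112, y = runif(112))

# Collect the size of each chunk instead of printing it.
chunk_sizes <- integer(0)
for_each_chunk(df, size_rows = 10, callback = function(chunk) {
  chunk_sizes <<- c(chunk_sizes, nrow(chunk))
})
chunk_sizes
```

The same pattern would work for writing each chunk to a file or sending it to a database, since the callback only ever sees one chunk.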
Another way is to use split() in combination with gl().
n <- 10
nr <- nrow(df)
split(df, gl(ceiling(nr/n), n, nr))
gl() creates a factor that split() can use directly.
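As a quick check, the sketch below compares the factor from gl() with the integer vector from the rep() approach for the 112-row case; by_gl and by_rep are names chosen here for illustration:

```r
n  <- 10
nr <- 112

# gl(number of levels, repetitions per level, total length)
by_gl  <- gl(ceiling(nr / n), n, nr)
by_rep <- rep(seq_len(ceiling(nr / n)), each = n, length.out = nr)

nlevels(by_gl)                        # number of chunks
identical(as.integer(by_gl), by_rep)  # same grouping, different type
table(by_gl)[["12"]]                  # size of the last, shorter chunk
```

Because split() converts a non-factor grouping vector to a factor internally, passing one from gl() skips that conversion step.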
Benchmark
n <- 1e5
df <- data.frame(x = 1:n, y = runif(n))
bench::mark(
"Rich Scriven" = {n <- 10
nr <- nrow(df)
split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))},
GKi = {n <- 10
nr <- nrow(df)
split(df, gl(ceiling(nr/n), n, nr))}
)
# expression min median `itr/sec` mem_alloc gc/se…¹ n_itr n_gc total…²
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:t>
#1 Rich Scriven 411ms 444ms 2.25 3.54MB 13.5 2 12 889ms
#2 GKi 412ms 423ms 2.37 2.03MB 15.4 2 13 845ms
Using gl() instead of rep() is marginally faster and uses less memory.