Sample from groups, but n varies per group in R

I am trying to randomly sample n rows from each group of a grouping variable, where n varies by group. For example:
library(dplyr)
iris <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(0, 8, by = 1)))
I have these factors, which are my grouped variable:
table(iris$len_bin)
(4,5] (5,6] (6,7] (7,8]
32 57 49 12
Is there a way to randomly sample only these groups n times, n being the number of times each element is present in this vector:
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
The result should look like:
# Groups: len_bin [4]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 5 2 3.5 1 versicolor (4,5]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
3 6.5 3 5.8 2.2 virginica (6,7]
I managed to do this with a for loop and using sample_n() based on the vector. I am assuming there must be a faster way. Can I define n within sample_n() for example?

In base R (dplyr is used only to create len_bin) you can do:
library(dplyr)
iris <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(4, 8, by = 1)))
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
# For each bin, draw as many rows as the bin appears in x (0 for absent bins)
l <- mapply(\(x, y) x[sample(nrow(x), y), ],
            split(iris, iris$len_bin),
            c(table(factor(x, levels = levels(iris$len_bin)))),
            SIMPLIFY = FALSE)
do.call(rbind.data.frame, l)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
#(4,5] 5.0 3.2 1.2 0.2 setosa (4,5]
#(5,6].17 5.4 3.9 1.3 0.4 setosa (5,6]
#(5,6].63 6.0 2.2 4.0 1.0 versicolor (5,6]
#(5,6].97 5.7 2.9 4.2 1.3 versicolor (5,6]
#(6,7] 6.9 3.1 5.1 2.3 virginica (6,7]
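For readers who would rather stay within dplyr, the same idea can be sketched as a join against a table of per-bin counts. This is only a minimal sketch, not part of the answer above: the names dat, counts, n_wanted and sampled are made up for illustration, and it assumes sampling without replacement, with no bin asked for more rows than it contains.

```r
library(dplyr)

set.seed(42)  # only to make the illustration reproducible

# Same binning as in the answer above; `dat` is a hypothetical name
dat <- iris %>%
  mutate(len_bin = as.character(cut(Sepal.Length, seq(4, 8, by = 1))))

x <- c("(4,5]", "(5,6]", "(5,6]", "(5,6]", "(6,7]")

# One row per requested bin, with the number of rows wanted from it
counts <- tibble(len_bin = x) %>% count(len_bin, name = "n_wanted")

sampled <- dat %>%
  inner_join(counts, by = "len_bin") %>%  # drops bins that are not in x
  group_by(len_bin) %>%
  # sample(n()) is a random permutation of 1..n(); keeping the values
  # <= n_wanted retains exactly n_wanted randomly chosen rows per bin
  filter(sample(n()) <= n_wanted) %>%
  ungroup() %>%
  select(-n_wanted)
```

The join carries the requested count into every row, so the per-group n never has to be passed to a sampling function at all.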

Related

How to detect if data.frame is grouped by dplyr from subfunction?

I have an R package where some functions are designed to be typically called within dplyr functions mutate or summarize.
newdata <- dplyr::mutate(group_by(olddata, col1), newcol = myfunc(col1))
However, sometimes users might forget to group their data before putting it into the mutate or summarize call.
newdata <- dplyr::mutate(olddata, newcol = myfunc(col1))
When the data frame is not grouped first, the package functions will produce largely nonsensical results. However, there won't be any errors or warnings per se, which could leave users uncertain about the cause of the issue.
I'd like to add a warning() within the myfunc code when myfunc detects that the input data isn't coming from a grouped data.frame. However, I can't figure out how myfunc could detect this. It appears that mutate passes only a vector to myfunc, so both dplyr::is.grouped_df and inherits(x, "grouped_df") return FALSE.
What I would like:
myfunc <- function(x) {if(comes.from.grouped.df) {print("grouped")} else {print("ungrouped")}}
mutate(olddata, newcol = myfunc(col1))
'ungrouped'
mutate(group_by(olddata, col1), newcol = myfunc(col1))
'grouped'
'grouped'
'grouped'

If you want your function to be used only within a specific context, and to emit a warning when the data frame is not grouped, you can do:
library(tidyverse)
myfunc <- function(x) {
  if(all(ls(envir = parent.frame()) == "~")) {
    ss <- sys.status()
    funcs <- sapply(ss$sys.calls, function(x) deparse(as.list(x)[[1]]))
    wf <- which(funcs == "mutate")
    if(length(wf) == 0) stop("`myfunc` must be called from inside `mutate`")
    wf <- max(wf)
    data <- eval(substitute(.data), ss$sys.frames[[wf]])
    if(!inherits(data, "grouped_df")) {
      warning("`myfunc` called on an ungrouped data frame / tibble.")
    }
    return(x^2)
  }
  stop("`myfunc` must be called from inside `mutate`")
}
Used outside mutate, we get an error:
myfunc(1:10)
#> Error in myfunc(1:10): `myfunc` must be called from inside `mutate`
With an ungrouped data frame or tibble we get a warning:
tibble(iris) %>%
mutate(x = myfunc(Sepal.Length))
#> Warning in myfunc(Sepal.Length): `myfunc` called on an ungrouped data frame /
#> tibble.
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
And it runs without complaint if the tibble is grouped:
tibble(iris) %>%
group_by(Species) %>%
mutate(x = myfunc(Sepal.Length))
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
Created on 2023-02-15 with reprex v2.0.2
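A possibly less fragile variant is to ask dplyr itself for the current group keys instead of walking the call stack. The sketch below is an assumption-laden illustration, not the approach of the answer above: myfunc_ctx is a hypothetical name, and it assumes dplyr 1.0.0 or later, where cur_group() is available and returns a one-row tibble with zero columns for ungrouped data.

```r
library(dplyr)

# Hypothetical variant of myfunc that asks dplyr for the current group keys
myfunc_ctx <- function(x) {
  # cur_group() throws an error when no dplyr verb is being evaluated
  keys <- tryCatch(dplyr::cur_group(), error = function(e) NULL)
  if (is.null(keys)) {
    stop("`myfunc_ctx` must be called from inside a dplyr verb", call. = FALSE)
  }
  # An ungrouped data frame has no grouping columns, hence zero key columns
  if (ncol(keys) == 0) {
    warning("`myfunc_ctx` called on an ungrouped data frame / tibble.",
            call. = FALSE)
  }
  x^2
}
```

Unlike the stack-walking version, this does not restrict the caller to mutate: any data-masking verb such as summarise would be accepted, which may or may not be what you want.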

Creating multiple rank columns dplyr

I have a df with several columns, and I'd like to rank them. I can do them one at a time like this:
iris.ranked <-
iris %>%
arrange(Sepal.Length) %>%
mutate(Sepal.Length = rank(Sepal.Length))
But there are lots of columns, and this is clunky. I'd rather feed in a list of columns and rank them all in one code chunk. I was thinking of something like this, but it does not work:
iris.ranked.all <-
iris %>%
mutate_at(
c('Sepal.Length',
'Sepal.Width',
'Petal.Width',
'Petal.Length'),
function(x) arrange(x) %>% rank()
)

Use mutate(across()) from dplyr:
library(dplyr)
iris |>
mutate(
across(
Sepal.Length:Petal.Width,
rank,
.names = "rank_{.col}")
)
# # A tibble: 150 x 9
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank_Sepal.Length rank_Sepal.Width rank_Petal.Length rank_Petal.Width
# <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 37 128. 18 20
# 2 4.9 3 1.4 0.2 setosa 19.5 70.5 18 20
# 3 4.7 3.2 1.3 0.2 setosa 10.5 101 8 20
# 4 4.6 3.1 1.5 0.2 setosa 7.5 89 31 20
# 5 5 3.6 1.4 0.2 setosa 27.5 134. 18 20
# 6 5.4 3.9 1.7 0.4 setosa 49.5 146. 46.5 45
# 7 4.6 3.4 1.4 0.3 setosa 7.5 120. 18 38
# 8 5 3.4 1.5 0.2 setosa 27.5 120. 31 20
# 9 4.4 2.9 1.4 0.2 setosa 3 52.5 18 20
# 10 4.9 3.1 1.5 0.1 setosa 19.5 89 31 3
# # ... with 140 more rows
Or if in fact you want to overwrite the columns as your question suggests, omit the .names argument:
iris |>
mutate(
across(
Sepal.Length:Petal.Width,
rank)
)
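Since the question supplies the columns as strings, here is a short sketch of the same across() call driven by a character vector. The names cols, iris_ranked and iris_ranked_desc are illustrative only, and the descending, ties-to-minimum ranking is just one example of passing extra arguments.

```r
library(dplyr)

# Column names supplied as strings, as in the question
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Width", "Petal.Length")

# all_of() selects exactly these columns and errors if one is missing
iris_ranked <- iris %>%
  mutate(across(all_of(cols), rank))

# A lambda allows extra arguments, e.g. descending ranks
# where tied values share the lowest rank
iris_ranked_desc <- iris %>%
  mutate(across(all_of(cols),
                ~ rank(-.x, ties.method = "min"),
                .names = "rank_{.col}"))
```

No arrange() step is needed: rank() returns each value's position in sorted order while leaving the rows where they are.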

Using dplyr verbs in a function with column labels as character vectors

I would like to create a function that takes a data frame and a character vector containing column names as input, and uses tidyverse quoting functions inside in a safe manner.
I believe I have a working example of what I want to do. I would like to know if there is a more elegant solution or whether I am thinking about this problem incorrectly (perhaps I shouldn't want to do this?). From what I can tell, in order to avoid variable scoping issues I need to wrap the column names in .data[[]] and turn each into an expression before unquoting it for the tidyverse NSE verbs.
Regarding previous questions, this answer is along the right lines, but I want to abstract the code into a function. A GitHub issue asks about this, but using rlang::syms won't work as far as I can tell, because combining the column labels with .data makes each one an expression, not a symbol. Here and here solve the problem but, as far as I can tell, either don't account for a subtle bug in which variables can leak in from the environment if they don't exist as column labels in the data frame, or don't work when the input is a vector of labels.
# Setup
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("rlang")))
# define iris with and without Sepal.Width column
iris <- tibble::as_tibble(iris)
df_with_missing <- iris %>% select(-Sepal.Width)
# This should not be findable by my function
Sepal.Width <- iris$Sepal.Width * -1
################
# Now let's try a function for which we programmatically define the column labels
programmatic_mutate_y <- function(df, col_names, safe = FALSE) {
  # Add .data[[]] to the col_names to make evaluation safer
  col_exprs <- rlang::parse_exprs(
    purrr::map_chr(
      col_names,
      ~ glue::glue(stringr::str_c('.data[["{.x}"]]'))
    )
  )
  output <- dplyr::mutate(df, product = purrr::pmap_dbl(list(!!!col_exprs), ~ prod(...)))
  output
}
################
# The desired output
testthat::expect_error(programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length")))
programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8
#> 2 4.9 3 1.4 0.2 setosa 14.7
#> 3 4.7 3.2 1.3 0.2 setosa 15.0
#> 4 4.6 3.1 1.5 0.2 setosa 14.3
#> 5 5 3.6 1.4 0.2 setosa 18
#> 6 5.4 3.9 1.7 0.4 setosa 21.1
#> 7 4.6 3.4 1.4 0.3 setosa 15.6
#> 8 5 3.4 1.5 0.2 setosa 17
#> 9 4.4 2.9 1.4 0.2 setosa 12.8
#> 10 4.9 3.1 1.5 0.1 setosa 15.2
#> # … with 140 more rows
Created on 2019-08-09 by the reprex package (v0.3.0)

I think you are making things complicated. With the _at variants, you can pass strings as arguments to almost every dplyr function. purrr::pmap_dbl() then maps the calculation over the rows.
programmatic_mutate_y_v1 <- function(df, col_names, safe = FALSE) {
df["product"] <- purrr::pmap_dbl(dplyr::select_at(df,col_names),prod)
return(df)
}
programmatic_mutate_y_v1(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows

We can turn col_names into a single expression with parse_expr and paste, then unquote when being evaluated in mutate:
library(dplyr)
library(rlang)
programmatic_mutate_y <- function(df, col_names){
mutate(df, product = !!parse_expr(paste(col_names, collapse = "*")))
}
Output:
> parse_expr(paste(c("Sepal.Width", "Sepal.Length"), collapse = "*"))
Sepal.Width * Sepal.Length
> programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length"))
> Error: object 'Sepal.Width' not found
> programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows
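Because parse_expr() produces bare symbols, a same-named object in the calling environment could in principle be picked up when the column is absent, which is the leak the question worries about. One way to sidestep that, sketched here under stated assumptions, is to select the columns by name with all_of() and multiply them directly. row_product is a hypothetical helper name, and the decoy variable mirrors the setup in the question.

```r
library(dplyr)

# Hypothetical helper: multiply the named columns row by row
row_product <- function(df, col_names) {
  # all_of() matches against column names only and errors on a missing one,
  # so a same-named object in the calling environment cannot leak in
  cols <- select(df, all_of(col_names))
  df[["product"]] <- Reduce(`*`, cols)
  df
}

iris_tbl <- as_tibble(iris)
df_with_missing <- select(iris_tbl, -Sepal.Width)
Sepal.Width <- iris$Sepal.Width * -1  # decoy that must not be picked up

out <- row_product(iris_tbl, c("Sepal.Width", "Sepal.Length"))
```

Calling row_product(df_with_missing, c("Sepal.Width", "Sepal.Length")) should fail with a missing-column error rather than silently using the decoy.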

R - Selecting Top Records but With a Grouping

Using the Iris dataframe I can pretty easily pull the first n = 100 records with:
m_data<-iris
m_data[1:100,]
But I am also interested in pulling the first 100 records based on a nice split of the Species. Assume for the moment that the first 100 records are all the same species - I would like to pull the data with a "first sampling" based on the varying Species instead.
Any suggestions are welcome. Thank you.

You can also do this with dplyr, here selecting the first 10 from each species:
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number() <= 10) # or slice(1:10)
#> # A tibble: 30 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 20 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).

Here's an alternative:
do.call(rbind, lapply(split(iris, iris$Species), head, 100))
This pulls up to the first 100 records for each Species; iris has only 50 rows per species, so here every row is returned.
You can use by instead of lapply:
do.call(rbind, by(iris, iris$Species, head, 100))
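If the goal is roughly 100 rows in total, split evenly across species, the per-group count can be derived from the target. The following is a minimal sketch assuming an even split is what "a nice split" means; total, per_group and balanced are illustrative names, and slice_head() requires dplyr 1.0.0 or later.

```r
library(dplyr)

total <- 100
# Rows to take per species so that the groups are (nearly) balanced
per_group <- ceiling(total / n_distinct(iris$Species))

balanced <- iris %>%
  group_by(Species) %>%
  slice_head(n = per_group) %>%  # first per_group rows of each species
  ungroup()
```

With three species this takes 34 rows each, so the result has 102 rows; trim it with head(balanced, total) if exactly 100 are required.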

How to pass a dataframe and uneven vectors as parameters in purrr map

I have a function with mixed data types. It takes a data frame and string variable as the input parameter.
library(dplyr)
myfunc <- function (dat=NULL,species=NULL,sepal_thres=NULL) {
dat %>%
filter(Species==species & Sepal.Length <= sepal_thres)
}
myfunc(dat=iris,species="virginica",sepal_thres=5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.9 2.5 4.5 1.7 virginica
But I want to apply it with list of vectors
species_vecs <- c("virginica","setosa")
sepal_thres_vecs <- c(5, 6)
purrr::pmap(list(dat=iris, species=species_vecs, sepal_thres=sepal_thres_vecs), myfunc)
I got this error:
Error: Element 2 has length 2, not 1 or 5.
What's the right way to do it?
Note that the species and sepal_thres parameters are taken from this combination:
> expand.grid(species_vecs,sepal_thres_vecs) %>% rename(species=Var1, sepal_thres=Var2)
species sepal_thres
1 virginica 5
2 setosa 5
3 virginica 6
4 setosa 6
but dat as parameter is fixed.

pmap will use recycling if you have a length-1 element as part of your bigger list. In this case, you can pass iris as a list element within the full list to use it for each species-sepal combination.
Note that pmap goes through list elements with multiple values in the order they appear. If you want every combination of the species and sepal vectors in pmap you would need to create and give the full vectors as list elements (i.e., you would have to do the crossing yourself).
purrr::pmap(list(dat = list(iris), species = rep(species_vecs, 2),
sepal_thres = rep(sepal_thres_vecs, each = 2) ), myfunc)
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 2.5 4.5 1.7 virginica
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 4.6 3.4 1.4 0.3 setosa
6 5.0 3.4 1.5 0.2 setosa
...
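The crossing can also be built with expand.grid() and handed to pmap() as a data frame, with iris passed as a fixed extra argument. This is a sketch rather than the answer's own code: grid and res are illustrative names, and it relies on the grid's column names matching the argument names of myfunc.

```r
library(dplyr)
library(purrr)

# Same function as in the question
myfunc <- function(dat = NULL, species = NULL, sepal_thres = NULL) {
  dat %>%
    filter(Species == species & Sepal.Length <= sepal_thres)
}

species_vecs <- c("virginica", "setosa")
sepal_thres_vecs <- c(5, 6)

# Every species/threshold combination, one per row;
# the column names match myfunc's argument names
grid <- expand.grid(species = species_vecs,
                    sepal_thres = sepal_thres_vecs,
                    stringsAsFactors = FALSE)

# pmap() walks the rows of grid; dat = iris is passed unchanged to each call
res <- pmap(grid, myfunc, dat = iris)
```

The result is a list of four data frames in the row order of grid, so res[[1]] corresponds to virginica with a threshold of 5.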

You can use this solution:
library(tidyverse)
expand.grid(species_vecs, sepal_thres_vecs) %>%
  rename(species = Var1, sepal_thres = Var2) %>%
  as_tibble() %>%
  mutate(sum = map2(as.character(species), sepal_thres, myfunc, dat = iris)) %>%
  unnest(sum)

You could use Vectorize:
input <- expand.grid(species_vecs, sepal_thres_vecs, stringsAsFactors = FALSE) %>% rename(species = Var1, sepal_thres = Var2)
# species sepal_thres
# 1 virginica 5
# 2 setosa 5
# 3 virginica 6
# 4 setosa 6
output <- Vectorize(myfunc, c("species", "sepal_thres"), SIMPLIFY = FALSE)(dat = iris, species = input[[1]], sepal_thres = input[[2]])
output[[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.9 2.5 4.5 1.7 virginica
output[[3]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.8 2.7 5.1 1.9 virginica
# 2 4.9 2.5 4.5 1.7 virginica
# 3 5.7 2.5 5.0 2.0 virginica
# 4 5.8 2.8 5.1 2.4 virginica
# 5 6.0 2.2 5.0 1.5 virginica
# 6 5.6 2.8 4.9 2.0 virginica
# 7 6.0 3.0 4.8 1.8 virginica
# 8 5.8 2.7 5.1 1.9 virginica
# 9 5.9 3.0 5.1 1.8 virginica
