R - Selecting Top Records but With a Grouping

R - Selecting Top Records but With a Grouping - r

Using the Iris dataframe I can pretty easily pull the first n = 100 records with:
m_data<-iris
m_data[1:100,]
But I am also interested in pulling the first 100 records based on a nice split of the Species. Assume for the moment that the first 100 records are all the same species - I would like to pull the data with a "first sampling" based on the varying Species instead.
Any suggestions are welcome. Thank you.

You can also do this with dplyr, here selecting the first 10 from each species:
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number() <= 10) # or slice(1:10)
#> # A tibble: 30 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 20 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).

Here's an alternative:
do.call(rbind, lapply(split(iris, iris$Species), head, 100))
This pulls the first 100 records from iris by Species
You can use by instead of lapply
do.call(rbind, by(iris, iris$Species, head, 100))

Related

How to detect if data.frame is grouped by dplyr from subfunction?

I have an R package where some functions are designed to be typically called within dplyr functions mutate or summarize.
newdata <- dplyr::mutate(group_by(olddata, col1), newcol = myfunc(col1))
However, sometimes users might forget to group their data before putting it into the mutate or summarize call.
newdata <- dplyr::mutate(olddata, newcol = myfunc(col1))
When the data frame is not grouped first, the package functions will produce largely nonsensical results. However, there won't be any errors or warnings per se, which could leave users uncertain about the cause of the issue.
I'd like to add a Warning() within the myfunc code when myfunc detects that the input data isn't coming from a grouped data.frame. However, I can't figure out how myfunc could detect if the data is coming from a grouped data.frame. It appears that mutate only passes a vector to myfunc, so both dplyr::is.grouped_df and inherits(x, "grouped_df") return false.
What I would like:
myfunc <- function(x) {if(comes.from.grouped.df) {print("grouped")} else {print("ungrouped")}}
mutate(olddata, newcol = myfunc(col1))
'ungrouped'
mutate(group_by(olddata, col1), newcol = myfunc(col1))
'grouped'
'grouped'
'grouped'

If you want your function used within a specific context, and emit a warning if the data frame is not grouped, then you can do:
library(tidyverse)
myfunc <- function(x) {
if(all(ls(envir = parent.frame()) == "~")) {
ss <- sys.status()
funcs <- sapply(ss$sys.calls, function(x) deparse(as.list(x)[[1]]))
wf <- which(funcs == "mutate")
if(length(wf) == 0) stop("`myfunc` must be called from inside `mutate`")
wf <- max(wf)
data <- eval(substitute(.data), ss$sys.frames[[wf]])
if(!inherits(data, "grouped_df")) {
warning("`myfunc` called on an ungrouped data frame / tibble.")
}
return(x^2)
}
stop("`myfunc` must be called from inside `mutate`")
}
Used outside mutate, we get an error:
myfunc(1:10)
#> Error in myfunc(1:10): `myfunc` must be called from inside `mutate`
With an ungrouped data frame or tibble we get a warning:
tibble(iris) %>%
mutate(x = myfunc(Sepal.Length))
#> Warning in myfunc(Sepal.Length): `myfunc` called on an ungrouped data frame /
#> tibble.
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
And it runs without complaint if the tibble is grouped:
tibble(iris) %>%
group_by(Species) %>%
mutate(x = myfunc(Sepal.Length))
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
Created on 2023-02-15 with reprex v2.0.2

Replace range of values in a column with a factor using dplyr

I have a question similar to this, but instead of two factors, I would like to create four factors. Replace range of values for factor with levels
How do I do that? I don't know how to share my own data table so I will use the iris dataset.
library(datasets)
data(iris)
Let's say I want to categorize Sepal.Length into 4 categories 4.3-4.9,5-6,6.1-7,7.1-7.9 and label each range as A,B,C,D (factors) in a new column. Can this be done using the dplyr package?
I came across several similar questions that use the "cut" function but I was not able to use it without getting an error message.

You can use cut inside mutate. Pass Sepal.Length as the first argument, the vector of cut points you want to use for the breaks argument (it should be length-5), and the labels you want to assign via the labels argument.
library(tidyverse)
iris %>%
as_tibble() %>%
mutate(newcol = cut(Sepal.Length, breaks = c(0, 1.9, 3.9, 5.9, 8),
labels = LETTERS[1:4]))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol
#> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa C
#> 2 4.9 3 1.4 0.2 setosa C
#> 3 4.7 3.2 1.3 0.2 setosa C
#> 4 4.6 3.1 1.5 0.2 setosa C
#> 5 5 3.6 1.4 0.2 setosa C
#> 6 5.4 3.9 1.7 0.4 setosa C
#> 7 4.6 3.4 1.4 0.3 setosa C
#> 8 5 3.4 1.5 0.2 setosa C
#> 9 4.4 2.9 1.4 0.2 setosa C
#> 10 4.9 3.1 1.5 0.1 setosa C
#> # ... with 140 more rows
Created on 2023-01-30 with reprex v2.0.2

Dplyr filter behaviour when using vector [duplicate]

This question already has answers here:
What is the difference between `%in%` and `==`?
(3 answers)
Closed last year.
I use dplyr quite a lot for data wrangling, but I never figured out dplyr filter behaviour when using filter(df, variable == c(value1, value2)
Lets use iris data set as an example.
library(dplyr)
data(iris)
# I want to filter by Species 'setosa' and 'versicolor'
# Solution 1
filter1 <- filter(iris, Species == 'setosa' | Species == 'versicolor')
nrow(filter1)
[1] 100 # expected result
# Solution 2
filter2 <- filter(iris, Species %in% c('setosa', 'versicolor'))
nrow(filter2)
[1] 100 # expected result
filter1 == filter2 # both solutions return the exact same result
#Solution 3
filter3 <- filter(iris, Species == c('setosa', 'versicolor'))
nrow(filter3)
[1] 50 # unexpected result
unique(filter3$Species)
[1] setosa versicolor
Levels: setosa versicolor virginica
Although Solution 3 is filtering for the intended species, as shown by unique(filter3$Species), it only returns half of the occurrences (50 compared to 100 in Solution 1and Solution2). I would appreciate some guidance on what is actually going on in Solution 3.

filter(iris, Species == c("versicolor", "setosa")) does not make sense in an intuitive way, because one Species is not a 2-tuple:
> "setosa" == c("setosa", "versicolor")
[1] TRUE FALSE
Interestingly, filter(iris, Species == c("setosa", "versicolor")) produce the same results: The first Species of the data frame will be returned, so descending sorting will give you versicolor:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
as_tibble()
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
iris %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(Species) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.4 3.2 4.5 1.5 versicolor
#> 2 5.5 2.3 4 1.3 versicolor
#> 3 5.7 2.8 4.5 1.3 versicolor
#> 4 4.9 2.4 3.3 1 versicolor
#> 5 5.2 2.7 3.9 1.4 versicolor
#> 6 5.9 3 4.2 1.5 versicolor
#> 7 6.1 2.9 4.7 1.4 versicolor
#> 8 6.7 3.1 4.4 1.4 versicolor
#> 9 5.8 2.7 4.1 1 versicolor
#> 10 5.6 2.5 3.9 1.1 versicolor
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.9 3.1 4.9 1.5 versicolor
#> 3 6.5 2.8 4.6 1.5 versicolor
#> 4 6.3 3.3 4.7 1.6 versicolor
#> 5 6.6 2.9 4.6 1.3 versicolor
#> 6 5 2 3.5 1 versicolor
#> 7 6 2.2 4 1 versicolor
#> 8 5.6 2.9 3.6 1.3 versicolor
#> 9 5.6 3 4.5 1.5 versicolor
#> 10 6.2 2.2 4.5 1.5 versicolor
#> # … with 40 more rows
Created on 2022-02-11 by the reprex package (v2.0.0)

Can you list an exception to tidyselect `everything()`

library(tidyverse)
iris %>% as_tibble() %>% select(everything())
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
Say I want to select everything in the iris data frame except Species. How do I list this one exception while utilizing tidyselect::everything()?
My actual pipe is below, and when
... %>%
group_by(`ID`) %>%
fill(everything, .direction = "updown") %>%
... %>%
and I get the following error:
Error: Column ID can't be modified because it's a grouping variable

You would do
iris %>% as_tibble() %>% select(-Species)
but assuming you have good reason not to want that, here's a way using everything()
iris %>% as_tibble() %>% select(setdiff(everything(), one_of("Species")))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 1.4 0.3
#> 8 5 3.4 1.5 0.2
#> 9 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
(or just iris %>% as_tibble() %>% select(setdiff(everything(), 5)) if it's acceptable)

Using dplyr verbs in a function with column labels as character vectors

I would like to create a function that takes a data frame and a character vector containing column names as input, and uses tidy verse quoting functions inside in a safe manner.
I believe I have a working example of what I want to do. I would like to know if there is a more elegant solution or I am thinking about this problem incorrectly (perhaps I shouldn't want to do this?). From what I can tell, in order to avoid variable scoping issues I need to wrap the column names in .data[[]] and make it an expression before unquoting for tidy verse NSE verbs.
Regarding previous questions this answer is along the right lines but I want to abstract the code into a function. A github issue
asks about this but using rlang::syms won't work as far as I can tell because the
combination of the column labels with .data makes it an expression not a symbol.
Here
and here
solve the problem but as far as I can tell don't account for a subtle bug in which the variables can leak
in from the environment if they don't exist as column labels in the dataframe or the solutions don't work for the input being a vector of labels.
# Setup
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("rlang")))
# define iris with and without Sepal.Width column
iris <- tibble::as_tibble(iris)
df_with_missing <- iris %>% select(-Sepal.Width)
# This should not be findable by my function
Sepal.Width <- iris$Sepal.Width * -1
################
# Now lets try a function for which we programmatically define the column labels
programmatic_mutate_y <- function(df, col_names, safe = FALSE) {
# Add .data[[]] to the col_names to make evalutation safer
col_exprs <- rlang::parse_exprs(
purrr::map_chr(
col_names,
~ glue::glue(stringr::str_c('.data[["{.x}"]]'))
)
)
output <- dplyr::mutate(df, product = purrr::pmap_dbl(list(!!!col_exprs), ~ prod(...)))
output
}
################
# The desired output
testthat::expect_error(programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length")))
programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8
#> 2 4.9 3 1.4 0.2 setosa 14.7
#> 3 4.7 3.2 1.3 0.2 setosa 15.0
#> 4 4.6 3.1 1.5 0.2 setosa 14.3
#> 5 5 3.6 1.4 0.2 setosa 18
#> 6 5.4 3.9 1.7 0.4 setosa 21.1
#> 7 4.6 3.4 1.4 0.3 setosa 15.6
#> 8 5 3.4 1.5 0.2 setosa 17
#> 9 4.4 2.9 1.4 0.2 setosa 12.8
#> 10 4.9 3.1 1.5 0.1 setosa 15.2
#> # … with 140 more rows
Created on 2019-08-09 by the reprex package (v0.3.0)

I think you are making things complicated. With the _at variant, you can use strings as arguments in almost every dplyr functions. purrr::pmap_dbl() is used to map calculation by rows.
programmatic_mutate_y_v1 <- function(df, col_names, safe = FALSE) {
df["product"] <- purrr::pmap_dbl(dplyr::select_at(df,col_names),prod)
return(df)
}
programmatic_mutate_y_v1(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows

We can turn col_names into a single expression with parse_expr and paste, then unquote when being evaluated in mutate:
library(dplyr)
library(rlang)
programmatic_mutate_y <- function(df, col_names){
mutate(df, product = !!parse_expr(paste(col_names, collapse = "*")))
}
Output:
> parse_expr(paste(c("Sepal.Width", "Sepal.Length"), collapse = "*"))
Sepal.Width * Sepal.Length
> programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length"))
> Error: object 'Sepal.Width' not found
> programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R - Selecting Top Records but With a Grouping - r

Here's an alternative: do.call(rbind, lapply(split(iris, iris$Species), head, 100)) This pulls the first 100 records from iris by Species You can use by instead of lapply do.call(rbind, by(iris, iris$Species, head, 100))

Related

How to detect if data.frame is grouped by dplyr from subfunction?

Replace range of values in a column with a factor using dplyr

Dplyr filter behaviour when using vector [duplicate]

Can you list an exception to tidyselect `everything()`

Using dplyr verbs in a function with column labels as character vectors

Categories

Resources