Iterating over values in character vector with dplyr functions - r

I have several variables (id.type and id.subtype in this example) I would like to check for distinct values in a tibble all.snags using the dplyr package. I would like them sorted and all values printed out in the console (a tibble typically prints only the first 10). The output would be equivalent to the following code:
distinct(all.snags,id.type) %>% arrange(id.type) %>% print(n = Inf)
distinct(all.snags,id.subtype) %>% arrange(id.subtype) %>% print(n = Inf)
I think this is better done by looping over the values in a vector, but I can't get it to work.
distinct.vars <- c("id.type","id.subtype")
for (i in distinct.vars) {
distinct(all.snags,distinct.vars[i]) %>%
arrange(distinct.vars[i]) %>%
print(n = Inf)
}

I think this function is what you want:
library(dplyr)
df = iris
print_distinct = function(df, columns) {
for (c in columns) {
print(df %>% distinct_(c) %>% arrange_(c))
}
}
print_distinct(df, c("Sepal.Length", "Sepal.Width"))
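The loop in the question fails because i already holds the column name, so distinct.vars[i] indexes the vector by a name it does not have and returns NA. The underscore verbs used above (distinct_, arrange_) are deprecated in current dplyr. Below is a minimal sketch of the same idea with tidy evaluation, assuming dplyr, rlang and tibble are installed; the data is converted to a tibble so that print(n = Inf) shows every row.

```r
library(dplyr)

# Print the sorted distinct values of each named column, showing all rows.
print_distinct <- function(df, columns) {
  for (col in columns) {
    df %>%
      tibble::as_tibble() %>%          # tibbles accept print(n = Inf)
      distinct(!!rlang::sym(col)) %>%  # turn the string into a column symbol
      arrange(!!rlang::sym(col)) %>%
      print(n = Inf)
  }
  invisible(df)
}

print_distinct(iris, c("Sepal.Length", "Sepal.Width"))
```

Returning the input invisibly is a design choice made here so that the function can still sit in the middle of a pipe.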

Related

How to get this function to work with the pipe in r?

I have created this function that quickly does some summarization operations (mean, median, geometric mean and arranges them in descending order). This is the function:
summarize_values <- function(tbl, variable){
tbl %>%
summarize(summarized_mean = mean({{variable}}),
summarized_median = median({{variable}}),
geom_mean = exp(mean(log({{variable}}))),
n = n()) %>%
arrange(desc(n))
}
I can do this and it works:
summarize_values(data, lifeExp)
However, I would like to be able to do this:
data %>%
select(year, lifeExp) %>%
summarize_values()
or something like this
data %>%
summarize_values(year, lifeExp)
What am I missing to make this work?
thanks
With the pipe, we don't need to specify the first argument (tbl), because the pipe supplies it. Pass only the column:
library(dplyr)
data %>%
summarize_values(lifeExp)
A reproducible example:
> mtcars %>%
summarize_values(gear)
  summarized_mean summarized_median geom_mean  n
1          3.6875                 4  3.619405 32
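To make the point concrete, here is a small sketch (using mtcars as stand-in data, since the question's data object is not shown) confirming that the piped and direct calls give the same result. The group_by() line is an assumption about what the asker may have wanted with year; summarize() respects existing groups, so grouping before the call produces one row per group.

```r
library(dplyr)

summarize_values <- function(tbl, variable) {
  tbl %>%
    summarize(
      summarized_mean   = mean({{ variable }}),
      summarized_median = median({{ variable }}),
      geom_mean         = exp(mean(log({{ variable }}))),
      n                 = n()
    ) %>%
    arrange(desc(n))
}

# The pipe passes its left-hand side as the first argument (tbl).
piped  <- mtcars %>% summarize_values(gear)
direct <- summarize_values(mtcars, gear)
all.equal(piped, direct)

# Grouping first gives one summary row per group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize_values(gear)
by_cyl
```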

How to use tidy eval NSE to expand an expression

I want to expand the !!! expression just like they do in dplyr-verbs e.g.
aggregate_expressions <- list(n = quote(n()))
do_something(iris, !!!(aggregate_expressions))
and say I want do_something to perform
do_something <- function(...) {
iris %>%
some_function( # expand the ... # ) # some_function(n = n())
}
which will do this but the n = n() is dynamic
do_something <- function(...) {
iris %>%
some_function(n = n())
}
I tried to trace the code for dplyr::summarise and I see that it calls enquos(...), which converts the ... to a list of quosures, but then how do I apply the quosures? I think I am meant to create the code summarise(n = n()) from the quosures and then evaluate it using eval_tidy, but I can't figure out how to generate the code. I know that passing ... to summarise works, but the actual use case is to pass it to summarise.disk.frame, which means I can't just reuse dplyr::summarise.
The actual case i not
For example in dplyr, the below works by expanding the aggregate_expression using !!!
aggregate_expressions <- list(n = quote(n()))
iris %>%
group_by(Species) %>%
summarise(!!!(aggregate_expressions))
Modify it like this:
do_something <- function(x) {
iris %>%
summarise(!!!x)
}
aggregate_expressions <- list(n = quote(n()))
do_something(aggregate_expressions)
## n
## 1 150
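If the function should keep the ... interface from the question, one possible sketch is to capture the dots with enquos() and splice them back with !!!. This only illustrates the capture-and-splice pattern with dplyr::summarise; whether summarise.disk.frame accepts spliced quosures in the same way is not shown in this thread and is left as an assumption.

```r
library(dplyr)
library(rlang)

do_something <- function(...) {
  exprs <- enquos(...)    # capture the dots as a list of quosures
  iris %>%
    summarise(!!!exprs)   # splice them back in as separate arguments
}

aggregate_expressions <- list(n = quote(n()))

# Splice a prepared list of expressions at the call site ...
res_spliced <- do_something(!!!aggregate_expressions)

# ... or write the expressions inline, as with any dplyr verb.
res_inline <- do_something(n = n(), mean_sl = mean(Sepal.Length))
```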

Calculate mode for each column in dataframe using lapply dplyr

I'm trying to create a function that essentially gets me the MODE, or MODE-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I may be missing and I'm looking for some assistance? I believe it has to do with the passing in of a variable into dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Run mtcars %>% lapply(. %>% (function(x) print(x))) to check: x is actually an entire column of mtcars, not its name. To get the column names, apply the function to names(mtcars), but then you also have to pass in the data frame you're working on.
To use the symbol returned by rlang::sym(x) inside a dplyr verb, unquote it with !! in front, i.e. !!rlang::sym(x).
rank is not a variable name, thus no need for rlang::sym here.
table should be mytable in second to last line of your function.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
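An alternative sketch of the same function uses the .data pronoun instead of !!rlang::sym(x), which some find easier to read when the column name arrives as a string. The name get_mode is hypothetical and chosen only for this example.

```r
library(dplyr)

# Return the rank-th most common value of column x (given as a string)
# together with its count.
get_mode <- function(df, x, rank = 1) {
  mytable <- count(df, .data[[x]], sort = TRUE)
  names(mytable) <- c("variable", "counts")
  slice(mytable, rank)
}

# Second most common value of cyl
res <- get_mode(mtcars, "cyl", rank = 2)

# One result per column, named by column
modes <- lapply(names(mtcars), function(x) get_mode(mtcars, x, rank = 2))
names(modes) <- names(mtcars)
```

Ties in the counts are broken by the order in which count() returns the rows, so the rank is not unique when several values are equally common.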
If we need this in a list, we can use map
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)

Trying to reuse dplyr code block in custom function; what am I missing?

Hadley says, "You should consider writing a function whenever you’ve copied and pasted a block of code more than twice"--I write this chain in dplyr often:
df %>%
group_by(col) %>%
summarise(n = n()) %>%
mutate(percent = round((n / sum(n)) * 100, 2)) %>%
arrange(desc(n))
I'd like to create a function that does this with two arguments: the data frame and the variable or column name. This is what I am trying now:
value_counts = function(df, col) {
group_by_(df, col) %>%
summarise_(n = n()) %>%
mutate_(percent = round((n / sum(n)) * 100, 2)) %>%
arrange(desc(n))
}
It doesn't work and I've tried some of the other recommendations on this site but I don't quite understand how they work, e.g.:
value_counts = function(df, col) {
group_by_(df, .dots = col) %>%
summarise_(n = n()) %>%
mutate_(percent = round((n / sum(n)) * 100, 2)) %>%
arrange(desc(n))
}
I really want to write a function that uses the pipes and relies on dplyr. I could continue writing the code that works over and over again, but I'd like to start writing useful functions in R to save time.
I'm a big fan of geom_text() and like having the info from dplyr in a data frame quick and easily so I can get lots of graphs made quickly!
Any resources I should read or links to follow would be useful. Thank you!
@jenesaisquoi's answer is great, but whenever I write dplyr-y functions, I try to write them in a similar style as in that package. I would like to have an SE and NSE pair of functions, where the NSE version lets you use bare variable names.
A few things to note.
I got rid of the pipes (%>%), which makes the functions slightly faster and easier to debug. Pipes are super convenient in interactive use, but I tend to avoid them while programming functions.
I used :: everywhere a package specific function is used. This means you now don't need to have dplyr loaded to use the function.
I saw no reason to limit use to only one column, so made a ... argument instead, that accept grouping by as many columns as needed.
See vignette("nse") for more details on the use of the lazyeval package, and the way dplyr deals with ... and .dots.
Functions
value_counts <- function(df, ...) {
value_counts_(df, .dots = lazyeval::lazy_dots(...))
}
value_counts_ <- function(df, ..., .dots) {
dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
df <- dplyr::group_by_(df, .dots = dots)
df <- dplyr::summarise(df, n = dplyr::n())
df <- dplyr::mutate(df, percent = round(n / sum(n) * 100, 2))
df <- dplyr::arrange(df, dplyr::desc(n))
return(df)
}
Examples
value_counts(mtcars, cyl)
value_counts(mtcars, cyl, vs)
value_counts_(mtcars, ~cyl)
value_counts_(mtcars, ~cyl, ~vs)
value_counts_(mtcars, .dots = list(~cyl, ~vs))
And you can easily pipe them together with other dplyr verbs:
library(dplyr)
mtcars %>%
filter(cyl != 4) %>%
value_counts()
Only replace the NSE functions with their standard equivalents when you are passing a column name as a string to them. In your case, that is only in the group_by_ function where col is a variable, assuming you want to call your function like value_counts(df, "some_column"). The intermediaries, n and percent aren't reliant on the variable, so don't need to be changed at all.
value_counts <- function(df, col) {
group_by_(df, col) %>%
summarise(n = n()) %>%
mutate(percent = round(n / sum(n) * 100, 2)) %>%
arrange(desc(n))
}
value_counts(iris, "Species")
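Both answers above rely on the underscore verbs and lazyeval, which are deprecated in current dplyr. As a hedged sketch of how the same function might look with tidy evaluation (assuming a dplyr version that supports {{ }}), the column can be passed as a bare name. The name value_counts_tidy is hypothetical, chosen to avoid clashing with the versions above.

```r
library(dplyr)

# Count rows per value of `col` and add each value's share in percent.
value_counts_tidy <- function(df, col) {
  df %>%
    group_by({{ col }}) %>%   # {{ }} forwards the bare column name
    summarise(n = n()) %>%
    mutate(percent = round(n / sum(n) * 100, 2)) %>%
    arrange(desc(n))
}

res <- value_counts_tidy(iris, Species)
res

# It also composes with other verbs in a pipe.
mtcars %>%
  filter(cyl != 4) %>%
  value_counts_tidy(cyl)
```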

Error when using dplyr inside of a function

I'm trying to put together a function that creates a subset from my original data frame, and then uses dplyr's SELECT and MUTATE to give me the number of large/small entries, based on the sum of the width and length of sepals/petals.
filter <- function (spp, LENGTH, WIDTH) {
d <- subset (iris, subset=iris$Species == spp) # This part seems to work just fine
large <- d %>%
select (LENGTH, WIDTH) %>% # This is where the problem arises.
mutate (sum = LENGTH + WIDTH)
big_samples <- which(large$sum > 4)
return (length(big_samples))
}
Basically, I want the function to return the number of large flowers. However, when I run the function I get the following error -
filter("virginica", "Sepal.Length", "Sepal.Width")
Error: All select() inputs must resolve to integer column positions.
The following do not:
* LENGTH
* WIDTH
What am I doing wrong?
You are running into NSE/SE problems, see the vignette for more info.
Briefly, dplyr uses non-standard evaluation (NSE) of names, so passing column names as strings into its functions fails unless you use the standard evaluation (SE) versions.
The SE versions of the dplyr functions end in _. You can see that select_ works nicely with your original arguments.
However, things get more complicated when using functions. We can use lazyeval::interp to convert most function arguments into column names, see the conversion of the mutate to mutate_ call in your function below and more generally, the help: ?lazyeval::interp
Try:
filter <- function (spp, LENGTH, WIDTH) {
d <- subset (iris, subset=iris$Species == spp)
large <- d %>%
select_(LENGTH, WIDTH) %>%
mutate_(sum = lazyeval::interp(~X + Y, X = as.name(LENGTH), Y = as.name(WIDTH)))
big_samples <- which(large$sum > 4)
return (length(big_samples))
}
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
filter_big <- function(spp, LENGTH, WIDTH) {
LENGTH <- enquo(LENGTH) # Create quosure
WIDTH <- enquo(WIDTH) # Create quosure
iris %>%
filter(Species == spp) %>%
select(!!LENGTH, !!WIDTH) %>% # Use !! to unquote the quosure
mutate(sum = (!!LENGTH) + (!!WIDTH)) %>% # Use !! to unquote the quosure
filter(sum > 4) %>%
nrow()
}
filter_big("virginica", Sepal.Length, Sepal.Width)
#> [1] 50
If quosures and quasiquotation are too much for you, use either .data[[ ]] (for column names passed as strings) or the rlang {{ }} operator ("curly curly", for bare column names) instead. See Hadley Wickham's 5-minute video on tidy evaluation and (maybe) the Tidy evaluation section of his Advanced R book for more information.
library(rlang)
library(dplyr)
filter_data <- function(df, spp, LENGTH, WIDTH) {
res <- df %>%
filter(Species == spp) %>%
select(.data[[LENGTH]], .data[[WIDTH]]) %>%
mutate(sum = .data[[LENGTH]] + .data[[WIDTH]]) %>%
filter(sum > 4) %>%
nrow()
return(res)
}
filter_data(iris, "virginica", "Sepal.Length", "Sepal.Width")
#> [1] 50
filter_rlang <- function(df, spp, LENGTH, WIDTH) {
res <- df %>%
filter(Species == spp) %>%
select({{LENGTH}}, {{WIDTH}}) %>%
mutate(sum = {{LENGTH}} + {{WIDTH}}) %>%
filter(sum > 4) %>%
nrow()
return(res)
}
filter_rlang(iris, "virginica", Sepal.Length, Sepal.Width)
#> [1] 50
Created on 2019-11-10 by the reprex package (v0.3.0)
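One caveat on the string-based version: recent tidyselect releases deprecate the .data pronoun inside select() and suggest all_of() for character vectors. A minimal sketch under that assumption follows; the names count_big, length_col and width_col are hypothetical. The .data pronoun remains fine inside mutate().

```r
library(dplyr)

# Count rows of one species whose two measurements sum to more than 4.
count_big <- function(df, spp, length_col, width_col) {
  df %>%
    filter(Species == spp) %>%
    select(all_of(c(length_col, width_col))) %>%  # strings go through all_of()
    mutate(sum = .data[[length_col]] + .data[[width_col]]) %>%
    filter(sum > 4) %>%
    nrow()
}

count_big(iris, "virginica", "Sepal.Length", "Sepal.Width")
```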

Resources