I'm trying to figure out how I can use optional arguments in an NSE function in my tidyverse workflow. This is a little toy function that I'd like to be able to build upon. I want to be able to operate on a grouped data frame; in this example, I'd like to gather the df, excluding whatever columns the df is grouped by (getting these successfully with groups(df)) and any other optional columns, coming in through .... quos has an argument .ignore_empty, but I'm not sure how to use it exactly. I might be misunderstanding what .ignore_empty does.
I know I can start the function off by checking for missing arguments, then setting up two different sets of piped operations for whether or not there are additional arguments given, but I'd prefer keeping this in a single pipeflow.
Data and the toy function:
library(tidyverse)
df <- structure(list(
town = c("East Haven", "Hamden", "New Haven","West Haven"),
region = c("Inner Ring", "Inner Ring", "New Haven", "Inner Ring"),
Asian = c(1123, 3285, 6042, 2214),
Black = c(693,13209, 42970, 10677),
Latino = c(3820, 6450, 37231, 10977),
Total = c(29015,61476, 130405, 54972),
White = c(22898, 37043, 40164, 28864)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -4L))
test_dots <- function(df, ...) {
grouping_vars <- groups(df)
gather_vars <- quos(..., .ignore_empty = "all")
df %>%
gather(key = variable, value = value, -c(!!!grouping_vars), -c(!!!gather_vars))
}
With a grouped df and a column name received as ...:
df %>%
group_by(town) %>%
test_dots(region) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: town [4]
#> town region variable value
#> <chr> <chr> <chr> <dbl>
#> 1 East Haven Inner Ring Asian 1123
#> 2 Hamden Inner Ring Asian 3285
#> 3 New Haven New Haven Asian 6042
#> 4 West Haven Inner Ring Asian 2214
#> 5 East Haven Inner Ring Black 693
#> 6 Hamden Inner Ring Black 13209
With a grouped df but nothing going into ...:
df %>%
select(-region) %>%
group_by(town) %>%
test_dots()
#> Error in -x: invalid argument to unary operator
I think the problem is that you are trying to negate an empty vector. If you are sure there will always be at least one grouping or gather variable, then you can do
test_dots <- function(df, ...) {
grouping_vars <- groups(df)
gather_vars <- quos(...)
vars <- quos(c(!!!grouping_vars), c(!!!gather_vars))
df %>%
gather(key = variable, value = value, -c(!!!vars))
}
I don't think the .ignore_empty has anything to do with it because that would just appear to control how quos works, not the gather().
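If an empty ... has to be supported in a single pipeline, one possible workaround is to work out the excluded column names as a character vector first and only then reshape. The sketch below is an assumption-laden variant, not the answer's method: test_dots2 and towns are hypothetical names, it swaps gather() for tidyr::pivot_longer(), and it resolves ... with tidyselect::eval_select().

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the question's data (two value columns only).
towns <- tibble(
  town   = c("East Haven", "Hamden", "New Haven", "West Haven"),
  region = c("Inner Ring", "Inner Ring", "New Haven", "Inner Ring"),
  Asian  = c(1123, 3285, 6042, 2214),
  Black  = c(693, 13209, 42970, 10677)
)

# Hypothetical variant of test_dots(): collect the names to exclude up front,
# so an empty `...` simply contributes nothing and no empty vector is negated.
test_dots2 <- function(df, ...) {
  grouping_names <- group_vars(df)  # character(0) when df is ungrouped
  extra_names <- names(tidyselect::eval_select(rlang::expr(c(...)), df))
  keep <- setdiff(names(df), c(grouping_names, extra_names))
  df %>%
    pivot_longer(cols = all_of(keep), names_to = "variable", values_to = "value")
}

# With a column passed through `...`
with_dots <- towns %>% group_by(town) %>% test_dots2(region)

# With nothing passed through `...`
no_dots <- towns %>% select(-region) %>% group_by(town) %>% test_dots2()
```

Note that pivot_longer() returns the rows in a different order from gather(): it keeps the values of each original row together instead of stacking one column after another.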
I have below data frame df3.
City    Income  Cost  Age
NY      1237    2432  43
NY      6352    8632  32
Boston  6487    2846  54
NJ      6547    7353  42
Boston  7564    7252  21
NY      9363    7563  35
Boston  3262    7352  54
NY      9473    8667  76
NJ      6234    4857  31
Boston  5242    7684  39
NJ      7483    4748  47
NY      9273    6573  53
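For anyone who wants to run the code in this thread, the table above can be typed in as a data frame. This is only a transcription of the twelve rows shown; the name df3 comes from the question, and the quick check at the end is an illustration in base R, not part of the original post.

```r
# Transcription of the df3 table (12 rows, 4 columns).
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35,
             54, 76, 31, 39, 47, 53)
)

# Quick sanity check: mean income of the NY rows.
ny_income_mean <- mean(df3$Income[df3$City == "NY"])
```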
I need to create a function 'ST' to get the mean and standard deviation when the city is given. As an example, if I give ST(NY), I should get a table like the one below.
variable  Mean  SD
Income    XX    XX
Cost      XX    XX
Age       XX    XX
XX are the values to 2 decimal places. I wrote a few pieces of code, but I am struggling to combine them into one function. Below is my code.
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), median,2)
ST <- function(c) {
if (df3$City == s)
dataframe (
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), mean,2),
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), sd,2)
else {
"NA"
}
}
ST(NJ)
No need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Candidly, even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to put things like that at the beginning of the function, organizing the code. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.
From ?summarize_at,
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.
I'll demo the use of across.
summarize can be given multiple (named) functions at once, I'll show that, too.
Your if (df3$City == .) is wrong for a few reasons, notably because if requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure) but the test is returning a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.
Your function is using objects that were neither passed to it nor defined within it, this is bad practice. Best practice is to pass the data and arguments in the function call.
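To make the point about if concrete, here is a small base-R sketch. The vector city is a made-up stand-in for df3$City, not the real column.

```r
# Toy vector standing in for df3$City (illustrative values only).
city <- c("NY", "NY", "Boston", "NJ")

cond <- city == "NY"
length(cond)   # one logical per element, so 4 here, not 1
any(cond)      # collapses to a single TRUE/FALSE, which `if` can use

# `if (cond)` is an error in R >= 4.2.0 because length(cond) > 1;
# older versions only warned and silently used cond[1].
has_ny <- if (any(cond)) "found NY" else "no NY"

# To pick rows rather than branch, index with the logical vector directly.
ny_only <- city[cond]
```

Indexing with the logical vector is what dplyr::filter does for a data frame, which is why it is the better tool here.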
ST <- function(X, city, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% city) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(everything(), names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
# variable mu sigma
# <chr> <dbl> <dbl>
# 1 Income 7140. 3550.
# 2 Cost 6773. 2576.
# 3 Age 47.8 17.7
Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:
NA inclusion works. Note that NA == NA returns NA (which trips up a lot of conditional processing if it is not handled correctly), whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).
It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
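A short base-R comparison of the two operators, using a made-up vector x, shows both benefits:

```r
x <- c("NY", NA, "Boston")

eq    <- x == "NY"                  # TRUE NA FALSE: the NA propagates
inn   <- x %in% "NY"                # TRUE FALSE FALSE: never NA
multi <- x %in% c("NY", "Boston")   # TRUE FALSE TRUE: several targets at once

na_eq <- NA == NA    # NA
na_in <- NA %in% NA  # TRUE
```

Inside filter() the difference for a plain city name is small, because filter() keeps only rows where the condition is TRUE and so drops the NA rows either way. It matters more when the result feeds if or a count.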
From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% cities) %>%
group_by(City) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(-City, names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value")) %>%
mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
# City variable mu sigma
# <chr> <chr> <dbl> <dbl>
# 1 Boston Income 5639. 1847.
# 2 Boston Cost 6284. 2299.
# 3 Boston Age 42 15.7
# 4 NY Income 7140. 3550.
# 5 NY Cost 6773. 2576.
# 6 NY Age 47.8 17.7
Edit: I added the rounding.
ST <- function(city_name) {
df3 %>%
filter(City == city_name) %>%
pivot_longer(cols = Income:Age, names_to = "variable") %>%
group_by(City, variable) %>%
summarise(mean = mean(value),
sd = sd(value), .groups = "drop")
}
ST("Boston")
# A tibble: 3 × 4
City variable mean sd
<chr> <chr> <dbl> <dbl>
1 Boston Age 42 15.7
2 Boston Cost 6284. 2299.
3 Boston Income 5639. 1847.
I'm doing an assignment for school in which we use a pre-loaded data frame (midwest, which ships with ggplot2 rather than dplyr) to manipulate data and display visualizations through shiny.
I'm getting the error "Problem with 'summarise()' input 'Illinois'" because "object 'IL' not found", even though IL is a value in a column that I thought I had grouped by.
Here's some of my code at the moment.
bar_chart <- function(midwest) {
data_summary <- midwest %>%
dplyr::group_by(state) %>%
summarize("Illinois" = mean(IL, na.rm = TRUE),
"Minnesota" = mean(MN, na.rm = TRUE),
"Indiana" = mean(IN, na.rm = TRUE),
"Ohio" = mean(OH, na.rm = TRUE),
"Wisconsin" = mean(WN, na.rm = TRUE))
A couple things to understand here. Groups specify a level of aggregation, in this case state. That means when we summarize, we summarize to that specified level of aggregation. We have a data set with multiple states, so when we group by state, that means we'll end up with one row for each state. The result is that you don't have to write a line of code for each state like you did in your provided example.
When we summarize, we need to specify a function which we'll use to summarize (i.e. roll-up) the data, as well as a column to apply it to. In this case you're using mean, so I'll use that as well, and we'll find the mean of poptotal for each state.
Finally, while you can use recode to replace factor levels, my little example below uses a left_join and R's built in table of state names and abbreviations to add it in - a nice little trick if you had all 50 states.
library(tidyverse)
data(midwest)
stateTable <- data.frame(state.abb, state.name)
midwest %>% group_by(state) %>%
summarize(poptotal = mean(poptotal)) %>%
left_join(. , stateTable, by = c( "state" = "state.abb"))
# A tibble: 5 x 3
state poptotal state.name
<chr> <dbl> <fct>
1 IL 112065. Illinois
2 IN 60263. Indiana
3 MI 111992. Michigan
4 OH 123263. Ohio
5 WI 67941. Wisconsin
I have to work with some data that is in recursive lists like this (simplified reproducible example below):
groups
#> $group1
#> $group1$countries
#> [1] "USA" "JPN"
#>
#>
#> $group2
#> $group2$countries
#> [1] "AUS" "GBR"
Code for data input below:
chars <- c("USA", "JPN")
chars2 <- c("AUS", "GBR")
group1 <- list(countries = chars)
group2 <- list(countries = chars2)
groups <- list(group1 = group1, group2 = group2)
groups
I'm trying to work out how to extract the vectors that are in the lists, without manually having to write a line of code for each group. The code below works, but my example has a large number of groups (and the number of groups will change), so it would be great to work out how to extract all of the vectors in a more efficient manner. This is the brute force way, that works:
countries1 <- groups$group1$countries
countries2 <- groups$group2$countries
In the example, the bottom level vector I'm trying to extract is always called countries, but the lists they're contained in change name, varying only by numbering.
Would there be an easy purrr solution? Or tidyverse solution? Or other solution?
Add some additional cases to your list
groups[["group3"]] <- list()
groups[["group4"]] <- list(foo = letters[1:2])
groups[["group5"]] <- list(foo = letters[1:2], countries = LETTERS[1:2])
Here's a function that maps any list to just the elements named "countries"; it returns NULL if there are no elements
fun = function(x)
x[["countries"]]
Map your original list to contain just the elements you're interested in
interesting <- Map(fun, groups)
Then transform these into a data.frame using a combination of unlist() and rep()
df <- data.frame(
country = unlist(interesting, use.names = FALSE),
name = rep(names(interesting), lengths(interesting))
)
Alternatively, use tidy syntax, e.g.,
interesting %>%
tibble(group = names(.), value = .) %>%
unnest("value")
The output is
# A tibble: 6 x 2
group value
<chr> <chr>
1 group1 USA
2 group1 JPN
3 group2 AUS
4 group2 GBR
5 group5 A
6 group5 B
If there are additional problems parsing individual elements of groups, then modify fun, e.g.,
fun = function(x)
as.character(x[["countries"]])
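Since the question asks about purrr specifically, the same idea can probably be written with purrr's extract-by-name shorthand. This is a sketch under the assumption that purrr, tidyr and tibble are installed; countries and country_tbl are illustrative names.

```r
library(purrr)
library(tidyr)
library(tibble)

# The question's list plus two of the awkward cases added above.
groups <- list(
  group1 = list(countries = c("USA", "JPN")),
  group2 = list(countries = c("AUS", "GBR")),
  group3 = list(),
  group4 = list(foo = letters[1:2])
)

countries <- groups %>%
  map("countries") %>%  # extract the element named "countries"; NULL where absent
  compact()             # drop the NULL entries

# Long format: one row per (group, country) pair.
country_tbl <- countries %>%
  enframe(name = "group", value = "country") %>%
  unnest(cols = country)
```

countries is a named list of character vectors, so countries$group1 gives the first vector without a separate line of code per group.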
This will put the output in a list which will handle any number of groups
countries <- unlist(groups, recursive = FALSE)
# \\D+ (rather than a greedy \\w+) keeps multi-digit suffixes such as group12 intact
names(countries) <- sub("^\\D+(\\d+)\\.(\\w+)$", "\\2\\1", names(countries), perl = TRUE)
> countries
$countries1
[1] "USA" "JPN"
$countries2
[1] "AUS" "GBR"
You can simply transform your nested list to a data.frame and then unnest the country column.
library(dplyr)
library(tidyr)
groups %>%
tibble(group = names(groups),
country = .) %>%
unnest(country) %>%
unnest(country)
#> # A tibble: 4 x 2
#> group country
#> <chr> <chr>
#> 1 group1 USA
#> 2 group1 JPN
#> 3 group2 AUS
#> 4 group2 GBR
Created on 2020-01-15 by the reprex package (v0.3.0)
Since the countries are hidden 2 layers deep, you have to run unnest twice. Otherwise I think this is straightforward.
If you actually want to have each vector as an object in your global environment, a combination of purrr::map2/walk and list2env will work. To make this work, we first have to give the country entries in the list individual names; otherwise list2env just overwrites the same object over and over again.
library(purrr)
groups <-
map2(groups, 1:length(groups), ~setNames(.x, paste0(names(.x), .y)))
walk(groups, ~list2env(. , envir = .GlobalEnv))
This would create the exact same results you are describing in your question. I am not sure though, if it is the best solution for a smooth workflow, since I don't know where you are going with this.
I'm getting a dplyr::bind_rows error. It's a very trivial problem, because I can easily get around it, but I'd like to understand the meaning of the error message.
I have the following data of some population groups for New England states, and I'd like to bind on a copy of these same values with the name changed to "New England," so that I can group by name and add them up, giving me values for the individual states, plus an overall value for the region.
df <- structure(list(name = c("CT", "MA", "ME", "NH", "RI", "VT"),
estimate = c(501074, 1057316, 47369, 76630, 141206, 27464)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
I'm doing this as part of a much larger flow of piped steps, so I can't just do bind_rows(df, df %>% mutate(name = "New England")). dplyr gives the convenient . shorthand for a data frame being piped from one function to the next, but I can't use that to bind the data frame to itself in a way I'd like.
What does work and gets me the output I want:
library(tidyverse)
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
bind_rows(mutate(., name = "New England")) %>%
group_by(name) %>%
summarise(estimate = sum(estimate))
#> # A tibble: 7 x 2
#> name estimate
#> <chr> <dbl>
#> 1 ct 501074
#> 2 ma 1057316
#> 3 me 47369
#> 4 New England 1851059
#> 5 nh 76630
#> 6 ri 141206
#> 7 vt 27464
But when I try to do the same thing with the . shorthand, I get this error:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(. %>% mutate(name = "New England"))
#> Error in bind_rows_(x, .id): Argument 2 must be a data frame or a named atomic vector, not a fseq/function
Like I said, doing it the first way is fine, but I'd like to understand the error because I write a lot of multi-step piped code.
As @aosmith noted in the comments, it's due to the way magrittr parses the dot in this case:
from ?'%>%':
Using the dot-place holder as lhs
When the dot is used as lhs, the
result will be a functional sequence, i.e. a function which applies
the entire chain of right-hand sides in turn to its input.
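A small illustration of what the quoted passage means in practice, using throwaway numeric vectors in place of the data frame. root_of_sum, a and b are illustrative names.

```r
library(magrittr)

# A pipeline whose left-hand side is a bare `.` is not evaluated: it is a function.
root_of_sum <- . %>% sum() %>% sqrt()
is.function(root_of_sum)  # TRUE
root_of_sum(c(16, 9))     # sqrt(16 + 9)

# Nested inside another call, the bare-dot form is still a function ...
a <- c(4, 9) %>% list(. %>% sqrt())

# ... while wrapping the dot in parentheses makes it an ordinary pipeline
# that is evaluated on the piped-in value.
b <- c(4, 9) %>% list((.) %>% sqrt())
```

In a, the second list element is a function, which is the situation bind_rows complained about. In b it is the numeric result.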
To avoid triggering this, any modification of the expression on the lhs will do:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows((.) %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows({.} %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(identity(.) %>% mutate(name = "New England"))
Here's a suggestion that avoids the problem altogether:
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
replicate(2,.,simplify = FALSE) %>%
map_at(2,mutate_at,"name",~"New England") %>%
bind_rows
# # A tibble: 12 x 2
# name estimate
# <chr> <dbl>
# 1 ct 501074
# 2 ma 1057316
# 3 me 47369
# 4 nh 76630
# 5 ri 141206
# 6 vt 27464
# 7 New England 501074
# 8 New England 1057316
# 9 New England 47369
# 10 New England 76630
# 11 New England 141206
# 12 New England 27464
So I am taking a column name which is guaranteed to exist in the data frame as the input, and want to calculate the mean of the given column (Let's suppose c is the parameter in the function).
EDIT: Tried recreating a scenario here.
states <- c("Washington", "Washington", "California", "California")
random.data <- c(10, 20, 30, 40)
data <- data.frame(states, random.data)
data.frame <- data %>%
group_by(states) %>%
summarise_(average = mean(paste0("random", ".data")))
When I tried doing this I got the following error:
Warning message:
In mean.default(paste0("random", ".data")) :
argument is not numeric or logical: returning NA
I have an idea about standard evaluation, but for some reason it is not working for this mean function.
We can use sym with !!
col <- paste0("random", ".data")
data %>%
group_by(states) %>%
summarise(average = mean(!! rlang::sym(col)))
##or use directly
#summarise(average = mean(!! rlang::sym(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
Or another option is get
data %>%
group_by(states) %>%
summarise(average = mean(get(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
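A third option that should also work, assuming a dplyr version that provides the .data pronoun, is to subset .data with the column name as a string. The data frame is rebuilt here as dat (a hypothetical name, chosen to avoid masking base R's data and data.frame) so the snippet runs on its own.

```r
library(dplyr)

dat <- data.frame(
  states = c("Washington", "Washington", "California", "California"),
  random.data = c(10, 20, 30, 40)
)

col <- paste0("random", ".data")

res <- dat %>%
  group_by(states) %>%
  summarise(average = mean(.data[[col]]))  # look the column up by its string name
```

Unlike get(), .data[[col]] only looks inside the data frame, so a variable of the same name in the calling environment cannot be picked up by accident.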