I'm getting a dplyr::bind_rows error. It's a very trivial problem, because I can easily get around it, but I'd like to understand the meaning of the error message.
I have the following data of some population groups for New England states, and I'd like to bind on a copy of these same values with the name changed to "New England," so that I can group by name and add them up, giving me values for the individual states, plus an overall value for the region.
df <- structure(list(name = c("CT", "MA", "ME", "NH", "RI", "VT"),
estimate = c(501074, 1057316, 47369, 76630, 141206, 27464)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
I'm doing this as part of a much larger flow of piped steps, so I can't just do bind_rows(df, df %>% mutate(name = "New England")). dplyr gives the convenient . shorthand for a data frame being piped from one function to the next, but I can't use that to bind the data frame to itself in a way I'd like.
What does work and gets me the output I want:
library(tidyverse)
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
bind_rows(mutate(., name = "New England")) %>%
group_by(name) %>%
summarise(estimate = sum(estimate))
#> # A tibble: 7 x 2
#> name estimate
#> <chr> <dbl>
#> 1 ct 501074
#> 2 ma 1057316
#> 3 me 47369
#> 4 New England 1851059
#> 5 nh 76630
#> 6 ri 141206
#> 7 vt 27464
But when I try to do the same thing with the . shorthand, I get this error:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(. %>% mutate(name = "New England"))
#> Error in bind_rows_(x, .id): Argument 2 must be a data frame or a named atomic vector, not a fseq/function
Like I said, doing it the first way is fine, but I'd like to understand the error because I write a lot of multi-step piped code.
As @aosmith noted in the comments, this is due to the way magrittr treats the dot in this case:
from ?'%>%':
Using the dot-place holder as lhs
When the dot is used as lhs, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input.
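A small sketch of what that passage means in practice. The name add_then_double is made up for illustration; add() and multiply_by() are magrittr's aliases for + and *.

```r
library(magrittr)

# A pipeline that starts with a bare dot is not evaluated;
# it is stored as a function (a "functional sequence").
add_then_double <- . %>% add(1) %>% multiply_by(2)

is.function(add_then_double)       # TRUE
inherits(add_then_double, "fseq")  # TRUE, the class named in the error message
add_then_double(3)                 # (3 + 1) * 2 = 8
```

This is why bind_rows() complained: its second argument was a function of this kind rather than a data frame.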
To avoid triggering this, any modification of the expression on the lhs will do:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows((.) %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows({.} %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(identity(.) %>% mutate(name = "New England"))
Here's a suggestion that avoids the problem altogether:
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
replicate(2,.,simplify = FALSE) %>%
map_at(2,mutate_at,"name",~"New England") %>%
bind_rows
# # A tibble: 12 x 2
# name estimate
# <chr> <dbl>
# 1 ct 501074
# 2 ma 1057316
# 3 me 47369
# 4 nh 76630
# 5 ri 141206
# 6 vt 27464
# 7 New England 501074
# 8 New England 1057316
# 9 New England 47369
# 10 New England 76630
# 11 New England 141206
# 12 New England 27464
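If this pattern comes up often, one option is to wrap it in a small helper so the pipe never needs the dot at all. This is only a sketch: bind_modified_copy is a hypothetical name, not a dplyr function, and the data frame is rebuilt here so the example runs on its own.

```r
library(dplyr)
library(stringr)

df <- tibble(
  name = c("CT", "MA", "ME", "NH", "RI", "VT"),
  estimate = c(501074, 1057316, 47369, 76630, 141206, 27464)
)

# Hypothetical helper: stack a data frame on top of a mutated copy of itself.
bind_modified_copy <- function(.data, ...) {
  bind_rows(.data, mutate(.data, ...))
}

result <- df %>%
  mutate(name = str_to_lower(name)) %>%
  bind_modified_copy(name = "New England") %>%
  group_by(name) %>%
  summarise(estimate = sum(estimate))

result  # 7 rows: the six states plus "New England" with the regional total
```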
I have the data frame df3 below.
City    Income  Cost  Age
NY        1237  2432   43
NY        6352  8632   32
Boston    6487  2846   54
NJ        6547  7353   42
Boston    7564  7252   21
NY        9363  7563   35
Boston    3262  7352   54
NY        9473  8667   76
NJ        6234  4857   31
Boston    5242  7684   39
NJ        7483  4748   47
NY        9273  6573   53
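For anyone who wants to run the examples, the table above could be entered as follows. This is a sketch using tibble::tribble, with the values copied from the table.

```r
library(tibble)

# The twelve rows of df3, one row per line.
df3 <- tribble(
  ~City,    ~Income, ~Cost, ~Age,
  "NY",        1237,  2432,   43,
  "NY",        6352,  8632,   32,
  "Boston",    6487,  2846,   54,
  "NJ",        6547,  7353,   42,
  "Boston",    7564,  7252,   21,
  "NY",        9363,  7563,   35,
  "Boston",    3262,  7352,   54,
  "NY",        9473,  8667,   76,
  "NJ",        6234,  4857,   31,
  "Boston",    5242,  7684,   39,
  "NJ",        7483,  4748,   47,
  "NY",        9273,  6573,   53
)
```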
I need to create a function 'ST' that returns the mean and standard deviation for a given city. For example, if I call ST(NY), I should get a table like the one below.
variable  Mean  SD
Income    XX    XX
Cost      XX    XX
Age       XX    XX
XX are values rounded to 2 decimal places. I wrote a few pieces of code, but I am struggling to combine them into one function. Below is my code.
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), median,2)
ST <- function(c) {
if (df3$City == s)
dataframe (
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), mean,2),
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), sd,2)
else {
"NA"
}
}
ST(NJ)
No need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Candidly, even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to put things like that at the beginning of the function, organizing the code. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.
From ?summarize_at,
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.
I'll demo the use of across.
summarize can be given multiple (named) functions at once, I'll show that, too.
Your if (df3$City == s) is wrong for a few reasons, notably because if requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure) but the test is returning a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.
Your function is using objects that were neither passed to it nor defined within it; this is bad practice. Best practice is to pass the data and arguments in the function call.
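To illustrate the length-1 point, here is a minimal sketch with a short stand-in vector in place of df3$City:

```r
city <- c("NY", "NY", "Boston", "NJ")  # stand-in for df3$City

is_ny <- city == "NY"
length(is_ny)   # 4: one TRUE/FALSE per row, not a single value

# `if (is_ny) ...` would be an error in R >= 4.2 (and a warning in earlier
# versions), because `if` needs exactly one TRUE or FALSE.
# Keeping the matching rows, which is what dplyr::filter does, is the usual fix:
city[is_ny]     # "NY" "NY"
```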
ST <- function(X, city, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% city) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(everything(), names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
# variable mu sigma
# <chr> <dbl> <dbl>
# 1 Income 7140. 3550.
# 2 Cost 6773. 2576.
# 3 Age 47.8 17.7
Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:
NA inclusion works. Note that NA == NA returns NA (which derails a lot of conditional processing if not handled correctly) whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).
It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
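Both points can be checked at the console; the vector city here is just a made-up example:

```r
NA == NA                       # NA
NA %in% NA                     # TRUE

city <- c("NY", NA, "Boston")
city == "NY"                   # TRUE    NA FALSE
city %in% "NY"                 # TRUE FALSE FALSE
city %in% c("NY", "Boston")    # TRUE FALSE  TRUE
```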
From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% cities) %>%
group_by(City) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(-City, names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value")) %>%
mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
# City variable mu sigma
# <chr> <chr> <dbl> <dbl>
# 1 Boston Income 5639. 1847.
# 2 Boston Cost 6284. 2299.
# 3 Boston Age 42 15.7
# 4 NY Income 7140. 3550.
# 5 NY Cost 6773. 2576.
# 6 NY Age 47.8 17.7
Edit: I added the rounding.
library(dplyr)
library(tidyr) # pivot_longer
ST <- function(city_name) {
  df3 %>% # the data frame from the question
    filter(City == city_name) %>%
    pivot_longer(cols = Income:Age, names_to = "variable") %>%
    group_by(City, variable) %>%
    summarise(mean = mean(value),
              sd = sd(value), .groups = "drop")
}
ST("Boston")
# A tibble: 3 × 4
City variable mean sd
<chr> <chr> <dbl> <dbl>
1 Boston Age 42 15.7
2 Boston Cost 6284. 2299.
3 Boston Income 5639. 1847.
I have the following dataset:
df<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_source1=c("Capsule","Powder,Metered","Tablet",NA),
DF_source2=c(NA,NA,"Tablet",NA),
DF_source3=c("Tablet, Extended Release","Liquid","Tablet",NA),
Route_source1=c("Oral","INHALATION","Oral",NA),
Route_source2=c(NA,"TOPICAL","Oral",NA),
Route_source3=c("Oral","IRRIGATION","oral",NA))
I want to know which DF_source matches DF, and additionally which associated Route I should take.
I want the output to look like this:
df_out<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_match=c("Tablet, Extended Release","Powder,Metered;Powder",NA,NA),
Route_match=c("Oral","INHALATION;TOPICAL",NA,NA),
DF_match_count=c(1,2,0,0),
DF_route_count=c(1,2,0,0))
I tried this, but I'm not sure how to pull the values for DF_match and Route_match:
df%>%mutate_at(vars(matches("(DF_source)")),
list(string_detect = ~str_detect(tolower(DF),tolower(str_replace_all(.,"/|,(\\s)?|(?<!,)\\s","|")))))
Any help would be appreciated, thanks!
I'm not entirely sure this is what you have in mind, but hope this might help.
Your end result appears not to match your example data (e.g. TOPICAL is missing).
This might be easier in a tidier form with pivot_longer.
Edit: If columns are factors, convert to character for str_detect in filter.
library(tidyverse)
library(stringr)
df %>%
mutate_if(is.factor, as.character) %>%
pivot_longer(cols = -c(identifer, DF), names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
filter(str_detect(DF_source, DF)) %>%
group_by(identifer) %>%
summarise(DF_match = paste(DF_source, collapse = ';'),
Route_match = paste(Route_source, collapse = ';'),
match_count = n()) %>%
right_join(df[,c("identifer", "DF")], by = "identifer") %>%
select(c(identifer, DF, DF_match, Route_match, match_count))
Output
# A tibble: 4 x 5
identifer DF DF_match Route_match match_count
<dbl> <chr> <chr> <chr> <int>
1 1 Tablet Tablet, Extended Release Oral 1
2 2 Powder Powder,Metered;Powder INHALATION;TOPICAL 2
3 3 Suspension NA NA NA
4 4 System NA NA NA
I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the string. That means that a single string, "A0101", would have a parent of "A010".
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
I am fairly certain there are more efficient ways to accomplish this, but the code below gets the ICD10 CM data from the icd.data package, uses the children detection system from the icd package, and then makes extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since the data do not include the chapters and sub-chapters of ICD10 as individual 2- or 1-digit codes.
Basically sub-chapters become 2 digit codes, chapters become 1 digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016
library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = F)
library(dplyr);library(stringr);library(tidyr)
code_df_new <- code_df %>%
mutate(parent = sapply(strsplit(code_vec,","), "[", 1)) %>%
separate(code_vec,
paste("code", 1:max(str_count(code_df$code_vec, ",")), sep ="."),
",",extra = "merge")
library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
filter(!is.na(value)) %>%
select(parent, child = value) %>%
arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)
subchaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(two_digit = substr(three_digit, 1, 2)) %>%
select(parent = two_digit, child = three_digit) %>%
distinct()
chaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(
two_digit = substr(three_digit, 1, 2),
one_digit = substr(three_digit, 1, 1)) %>%
select(parent = one_digit, child = two_digit) %>%
distinct()
root <- icd10 %>% select(three_digit) %>%
mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
select(parent, child) %>%
distinct()
edgelist_final <- edgelist %>%
bind_rows(list(chaps, subchaps, root)) %>%
arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the lengths of the node names in ICD10 fully define the order (with shorter ones being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
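# Load the packages first: dplyr supplies %>%, stringr supplies str_length()
library(dplyr); library(tidyr); library(stringr)
set.seed(42)
fake <- runif(93000, 0, 500) %>%
magrittr::raise_to_power(3) %>%
as.integer() %>%
as.character()
# Step 1 - prep (tibble() replaces the deprecated as_data_frame())
fake_2 <- tibble(value = fake) %>%
mutate(row = row_number()) %>%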
# Step 2 - widen by level and fill in all parent nodes
mutate(level = str_length(value)) %>%
spread(level, value) %>%
fill(everything()) %>%
# Step 3 - Get two highest non-NA nodes
gather(level, code, -row) %>%
arrange(row, level) %>%
filter(!is.na(code)) %>%
group_by(row) %>%
top_n(2, wt = level) %>%
# Step 4 - Spread once more to get pairs
mutate(pos = row_number()) %>%
ungroup() %>%
select(-level) %>%
spread(pos, code)
Output on OP data
# A tibble: 9 x 3
row `1` `2`
<int> <chr> <chr>
1 1 A00 NA
2 2 A00 A000
3 3 A00 A001
4 4 A00 A009
5 5 A01 A009
6 6 A01 A010
7 7 A010 A0100
8 8 A010 A0101
9 9 A010 A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
row `1` `2`
<int> <chr> <chr>
1 1 55174190 NA
2 2 50801321 NA
3 3 46771275 NA
4 4 6480673 46771275
5 5 6480673 20447474
6 6 6480673 20447474
7 7 4365410 20447474
8 8 4365410 11434009
9 9 5002257 11434009
10 10 9200296 11434009
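The prefix relationship described in the question can also be sketched directly in base R. This assumes a code's parent is always the same string with its last character removed, and that a pair only counts when that parent is itself in the vector. That holds for the excerpt, but I have not checked it against the full ICD10 list.

```r
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")

# Candidate parent: the code with its last character removed.
parent <- substr(dat, 1, nchar(dat) - 1)

# Keep only pairs whose candidate parent is itself a code in the vector.
has_parent <- parent %in% dat
edges <- data.frame(parent = parent[has_parent], child = dat[has_parent])
edges
```

On the excerpt this gives the six pairs listed in the question. Everything is vectorised, so it should scale to a 93k-element vector without trouble.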
I'm trying to figure out how I can use optional arguments in an NSE function in my tidyverse workflow. This is a little toy function that I'd like to be able to build upon. I want to be able to operate on a grouped data frame; in this example, I'd like to gather the df, excluding whatever columns the df is grouped by (getting these successfully with groups(df)) and any other optional columns, coming in through .... quos has an argument .ignore_empty, but I'm not sure how to use it exactly. I might be misunderstanding what .ignore_empty does.
I know I can start the function off by checking for missing arguments, then setting up two different sets of piped operations for whether or not there are additional arguments given, but I'd prefer keeping this in a single pipeflow.
Data and the toy function:
library(tidyverse)
df <- structure(list(
town = c("East Haven", "Hamden", "New Haven","West Haven"),
region = c("Inner Ring", "Inner Ring", "New Haven", "Inner Ring"),
Asian = c(1123, 3285, 6042, 2214),
Black = c(693,13209, 42970, 10677),
Latino = c(3820, 6450, 37231, 10977),
Total = c(29015,61476, 130405, 54972),
White = c(22898, 37043, 40164, 28864)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -4L))
test_dots <- function(df, ...) {
grouping_vars <- groups(df)
gather_vars <- quos(..., .ignore_empty = "all")
df %>%
gather(key = variable, value = value, -c(!!!grouping_vars), -c(!!!gather_vars))
}
With a grouped df and a column name received as ...:
df %>%
group_by(town) %>%
test_dots(region) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: town [4]
#> town region variable value
#> <chr> <chr> <chr> <dbl>
#> 1 East Haven Inner Ring Asian 1123
#> 2 Hamden Inner Ring Asian 3285
#> 3 New Haven New Haven Asian 6042
#> 4 West Haven Inner Ring Asian 2214
#> 5 East Haven Inner Ring Black 693
#> 6 Hamden Inner Ring Black 13209
With a grouped df but nothing going into ...:
df %>%
select(-region) %>%
group_by(town) %>%
test_dots()
#> Error in -x: invalid argument to unary operator
I think the problem is that you are trying to negate an empty vector. If you are sure there will always be at least one grouping or gather variable, then you can do
test_dots <- function(df, ...) {
grouping_vars <- groups(df)
gather_vars <- quos(...)
vars <- quos(c(!!!grouping_vars), c(!!!gather_vars))
df %>%
gather(key = variable, value = value, -c(!!!vars))
}
I don't think the .ignore_empty has anything to do with it because that would just appear to control how quos works, not the gather().
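A minimal sketch of where the error comes from, plus one way to sidestep the negation entirely. Note the assumptions: test_dots2 is a hypothetical name, it swaps gather() for tidyr::pivot_longer(), and it expects bare column names in the dots.

```r
library(dplyr)
library(tidyr)
library(rlang)

# With nothing to exclude, the vector being negated is empty (NULL),
# and base R cannot negate NULL:
msg <- tryCatch(-c(), error = function(e) conditionMessage(e))
msg  # "invalid argument to unary operator"

df <- tibble(
  town   = c("East Haven", "Hamden", "New Haven", "West Haven"),
  region = c("Inner Ring", "Inner Ring", "New Haven", "Inner Ring"),
  Asian  = c(1123, 3285, 6042, 2214),
  Black  = c(693, 13209, 42970, 10677),
  Latino = c(3820, 6450, 37231, 10977),
  Total  = c(29015, 61476, 130405, 54972),
  White  = c(22898, 37043, 40164, 28864)
)

# Hypothetical variant: collect the excluded names as strings,
# then pivot every remaining column instead of negating a selection.
test_dots2 <- function(df, ...) {
  extra <- vapply(enquos(...), as_name, character(1))
  keep <- setdiff(names(df), c(group_vars(df), extra))
  pivot_longer(df, cols = all_of(keep),
               names_to = "variable", values_to = "value")
}

no_dots   <- df %>% select(-region) %>% group_by(town) %>% test_dots2()
with_dots <- df %>% group_by(town) %>% test_dots2(region)
```

Both calls return 20 rows (4 towns by 5 value columns). The rows come out ordered by town rather than by variable, which differs from gather().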
So I am taking a column name which is guaranteed to exist in the data frame as the input, and want to calculate the mean of the given column (Let's suppose c is the parameter in the function).
EDIT: Tried recreating a scenario here.
states <- c("Washington", "Washington", "California", "California")
random.data <- c(10, 20, 30, 40)
data <- data.frame(states, random.data)
data.frame <- data %>%
group_by(states) %>%
summarise_(average = mean(paste0("random", ".data")))
When I tried doing this I got the following error:
Warning message:
In mean.default(paste0("random", ".data")) :
argument is not numeric or logical: returning NA
I have an idea about standard evaluation, but for some reason it is not working for this mean function.
We can use sym with !!
col <- paste0("random", ".data")
data %>%
group_by(states) %>%
summarise(average = mean(!! rlang::sym(col)))
##or use directly
#summarise(average = mean(!! rlang::sym(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
Or another option is get
data %>%
group_by(states) %>%
summarise(average = mean(get(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
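dplyr also provides the .data pronoun, which takes a column name as a string. A minimal sketch on the question's data, rebuilt here so it runs on its own:

```r
library(dplyr)

data <- data.frame(
  states = c("Washington", "Washington", "California", "California"),
  random.data = c(10, 20, 30, 40)
)
col <- paste0("random", ".data")

# The original attempt took the mean of the string "random.data" itself,
# which is why mean() warned and returned NA.
# .data[[col]] looks the column up by its name instead.
res <- data %>%
  group_by(states) %>%
  summarise(average = mean(.data[[col]]))

res  # California 35, Washington 15
```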