I'm trying to tighten up a %>% piped workflow where I need to apply the same function to several columns but with one argument changed each time. I feel like purrr's map or invoke functions should help, but I can't wrap my head around it.
My data frame has columns for life expectancy, poverty rate, and median household income. I can pass all these column names to vars in mutate_at, use round as the function to apply to each, and optionally supply a digits argument. But I can't figure out a way, if one exists, to pass different values for digits associated with each column. I'd like life expectancy rounded to 1 digit, poverty rounded to 2, and income rounded to 0.
I can call mutate on each column, but given that I might have more columns all receiving the same function with only an additional argument changed, I'd like something more concise.
library(tidyverse)
df <- tibble::tribble(
~name, ~life_expectancy, ~poverty, ~household_income,
"New Haven", 78.0580437642378, 0.264221051111753, 42588.7592521085
)
In my imagination, I could do something like this:
df %>%
mutate_at(vars(life_expectancy, poverty, household_income),
round, digits = c(1, 2, 0))
But get the error
Error in mutate_impl(.data, dots) :
Column life_expectancy must be length 1 (the number of rows), not 3
Using mutate_at instead of mutate just to have the same syntax as in my ideal case:
df %>%
mutate_at(vars(life_expectancy), round, digits = 1) %>%
mutate_at(vars(poverty), round, digits = 2) %>%
mutate_at(vars(household_income), round, digits = 0)
#> # A tibble: 1 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
Mapping over the digits uses each of the digits options for each column, not by position, giving me 3 rows each rounded to a different number of digits.
df %>%
mutate_at(vars(life_expectancy, poverty, household_income),
function(x) map(x, round, digits = c(1, 2, 0))) %>%
unnest()
#> # A tibble: 3 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.3 42589.
#> 2 New Haven 78.1 0.26 42589.
#> 3 New Haven 78 0 42589
Created on 2018-11-13 by the reprex package (v0.2.1)
2 solutions
mutate with !!!
invoke was a good idea but you need it less now that most tidyverse functions support the !!! operator, here's what you can do :
digits <- c(life_expectancy = 1, poverty = 2, household_income = 0)
df %>% mutate(!!!imap(digits, ~round(..3[[.y]], .x),.))
# # A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
..3 is the initial data frame, passed to the function as a third argument, through the dot at the end of the call.
Written more explicitly :
df %>% mutate(!!!imap(
digits,
function(digit, name, data) round(data[[name]], digit),
data = .))
If you need to start from your old interface (though the one I propose will be more flexible), first do:
digits <- setNames(c(1, 2, 0), c("life_expectancy", "poverty", "household_income"))
mutate_at and <<-
Here we bend a bit the good practice of avoiding <<- whenever possible, but readability matters and this one is really easy to read.
digits <- c(1, 2, 0)
i <- 0
df %>%
mutate_at(vars(life_expectancy, poverty, household_income), ~round(., digits[i<<- i+1]))
# A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
(or just df %>% mutate_at(names(digits), ~round(., digits[i<<- i+1])) if you use a named vector as in my first solution)
Here's a map2 solution along the lines of Henrik's comment. You can then wrap this inside a custom function. I provided an rough first attempt but I have done minimal tests, so it probably breaks under all sorts of situations if evaluation is strange. It also doesn't use tidyselect for .at, but neither does modify_at...
library(tidyverse)
df <- tibble::tribble(
~name, ~life_expectancy, ~poverty, ~household_income,
"New Haven", 78.0580437642378, 0.264221051111753, 42588.7592521085,
"New York", 12.349685329, 0.324067934, 32156.230974623
)
rounded <- df %>%
select(life_expectancy, poverty, household_income) %>%
map2_dfc(
.y = c(1, 2, 0),
.f = ~ round(.x, digits = .y)
)
df %>%
select(-life_expectancy, -poverty, -household_income) %>%
bind_cols(rounded)
#> # A tibble: 2 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
#> 2 New York 12.3 0.32 32156
modify2_at <- function(.x, .y, .at, .f) {
modified <- .x[.at] %>%
map2(.y, .f)
.x[.at] <- modified
return(.x)
}
df %>%
modify2_at(
.y = c(1, 2, 0),
.at = c("life_expectancy", "poverty", "household_income"),
.f = ~ round(.x, digits = .y)
)
#> # A tibble: 2 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
#> 2 New York 12.3 0.32 32156
Created on 2018-11-13 by the reprex package (v0.2.1)
Fun with tidyeval:
prepared_pairs <-
map2(
set_names(syms(list("life_expectancy", "poverty", "household_income"))),
c(1, 2, 0),
~expr(round(!!.x, digits = !!.y))
)
mutate(df, !!! prepared_pairs)
# # A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
Related
I'd like to extract weight values from strings with the unit and the time of measurement using tidyverse.
My dataset is like as below:
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
------
A tibble: 3 × 2
ID Weight
<chr> <chr>
1 A 45kg^20221120
2 B 11.5kg^20221015
3 C 66.05kg^20221020
I use stringr in the tidyverse package with regular expressions.
library(tidyverse)
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))
----------
A tibble: 3 × 3
ID Measurement Weight
<chr> <chr> <dbl>
1 A 45kg^20221120 45
2 B 11.5kg^20221015 11.5
3 C 66.05kg^20221020 66.0
The second decimal place of C (.05) doesn't extracted.
What's wrong with my code?
Any answers or comments are welcome.
Thanks.
Yes, it was extracted, however tibble is rounding it for 66.0 for easy display.
You can see it if you transform it in data.frame or if you View it
Solution
Check here
Check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
as.data.frame()
Output
#> ID Measurement Weight
#> 1 A 45kg^20221120 45.00
#> 2 B 51.5kg^20221015 51.50
#> 3 C 66.05kg^20221020 66.05
Or check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
View()
You could try to pull all the data out of the string at once with extract:
library(tidyverse)
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
df |>
extract(col = Weight,
into = c("weight", "unit", "date"),
regex = "(.*)(kg)\\^(.*$)",
remove = TRUE,
convert = TRUE) |>
mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#> ID weight unit date
#> <chr> <dbl> <chr> <date>
#> 1 A 45 kg 2022-11-20
#> 2 B 51.5 kg 2022-10-15
#> 3 C 66.0 kg 2022-10-20
Note that, as stated in the comments, the .05 is just not printing, but is present in the data.
I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
## name x
## <chr> <dbl>
## 1 1 439.
## 2 2 262350.
## 3 3 198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(.data) in the pipeline:
list(2,4,6) %>%
setNames(.data) %>%
map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tible(name=., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
map_dfr(~ tibble(name=.,
broom::tidy(lm(mpg~cyl, data=mtcars, offset=rep(., nrow(mtcars))))))
OK, I think I found this shortly before posting (so I'll answer). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2,4,6) instead of list(2,4,6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
##ben Bolker answer.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
#another option
seq(2, 6, 2) %>% map2_df(rerun(length(.), mtcars$mpg), ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#> x mean
#> <chr> <chr>
#> 1 2 439
#> 2 4 262350
#> 3 6 198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
This is also possible, however it would not have been my first choice and only a map would suffice:
library(purrr)
list(2, 4, 6) %>%
pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
power x
<dbl> <dbl>
1 2 439.
2 4 262350.
3 6 198039783.
I have a tibble with a number of variables collected over time. A very simplified version of the tibble looks like this.
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
I want to systematically create a new set of variables varC so that varC.t# = varA.t# / varB.t# where # is 1, 2, 3, etc. (similarly to the way column names are setup in the tibble above).
How do I use something along the lines of mutate or across to do this?
You can do something like this with mutate(across..., however, for renaming columns there must be a shortcut.
df %>%
mutate(across(.cols = c(varA.t1, varA.t2),
.fns = ~ .x / get(glue::glue(str_replace(cur_column(), "varA", "varB"))),
.names = "V_{.col}")) %>%
rename_with(~str_replace(., "V_varA", "varC"), starts_with("V_"))
# A tibble: 2 x 7
id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 row_1 5 10 2 4 2.5 2.5
2 row_2 20 50 4 6 5 8.33
If there is a long time series you can also create a vector for .cols beforehand.
I have a package on GitHub called {dplyover} which aims to solve this kind of problem in way similar to dplyr::across.
The function is called across2. It lets you define two sets of columns to which you can apply one or several functions. The .names argument supports two glue specifictions: {pre} and {suf}. They extract the shared pre- and suffix of the variable names. This makes it easy to put nice names on our output variables.
The function has one caveat. It is not performant when applied to highly grouped data (there is a vignette with benchmarks).
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
df %>%
mutate(across2(starts_with("varA"),
starts_with("varB"),
~ .x / .y,
.names = "{pre}C.{suf}"))
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v0.3.0)
For such cases I find using base R easy and efficient.
varAcols <- sort(grep('varA', names(df), value = TRUE))
varBcols <- sort(grep('varB', names(df), value = TRUE))
df[sub('A', 'C', varAcols)] <- df[varAcols]/df[varBcols]
# id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 row_1 5 10 2 4 2.5 2.5
#2 row_2 20 50 4 6 5 8.33
Another way to do this with some customization is
Initial setup
library(dplyr)
library(purrr)
library(stringr)
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
# A function take in a formula then parse it and correct the column name
operation_function <- function(df, formula) {
# Extract the column name from the formula
new_column_name <- str_extract(formula, "^.+=")
new_column_name <- trimws(gsub("=", "", new_column_name))
# Process the df
df %>%
# parse the formula - this reuslt in new column name as value formula
mutate(!!rlang::parse_expr(formula)) %>%
# rename the new created column with the correct column name
rename(!!new_column_name := last_col())
}
Note: I think there should be more efficient way to implement the formula that have proper name. Though I couldn't figure it out right now. Welcome ideas from others
Prepare the formula to be process by the data. In this case it simple
For more complicated formula you may want to do it a little bit differently
# Prepare the formula
base_formula <- c("varC.t# = varA.t# / varB.t#")
replacement_list <- c(1, 2)
list_formula <- map(replacement_list, .f = gsub,
pattern = "#", x = base_formula)
list_formula
#> [[1]]
#> [1] "varC.t1 = varA.t1 / varB.t1"
#>
#> [[2]]
#> [1] "varC.t2 = varA.t2 / varB.t2"
Finally process the data with the list of formulas
# process with the function and then reduce them with left_join
reduce(map(.x = list_formula, .f = operation_function, df = df),
left_join)
#> Joining, by = c("id", "varA.t1", "varA.t2", "varB.t1", "varB.t2")
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v1.0.0)
Using purrr::map and the magrittr pipe, I am trying generate a new column with values equal to a substring of the existing column.
I can illustrate what I'm trying to do with the following toy dataset:
library(tidyverse)
library(purrr)
test <- list(tibble(geoid_1970 = c(123, 456),
name_1970 = c("here", "there"),
pop_1970 = c(1, 2)),
tibble(geoid_1980 = c(234, 567),
name_1980 = c("here", "there"),
pop_1970 = c(3, 4))
)
Within each listed data frame, I want a column equal to the relevant year. Without iterating, the code I have is:
data <- map(test, ~ .x %>% mutate(year = as.integer(str_sub(names(test[[1]][1]), -4))))
Of course, this returns a year of 1970 in both listed data frames, which I don't want. (I want 1970 in the first and 1980 in the second.)
In addition, it's not piped, and my attempt to pipe it throws an error:
data <- test %>% map(~ .x %>% mutate(year = as.integer(str_sub(names(.x[[1]][1]), -4))))
# > Error: Problem with `mutate()` input `year`.
# > x Input `year` can't be recycled to size 2.
# > ℹ Input `year` is `as.integer(str_sub(names(.x[[1]][1]), -4))`.
# > ℹ Input `year` must be size 2 or 1, not 0.
How can I iterate over each listed data frame using the pipe?
Try:
test %>% map(~.x %>% mutate(year = as.integer(str_sub(names(.x[1]), -4))))
[[1]]
# A tibble: 2 x 4
geoid_1970 name_1970 pop_1970 year
<dbl> <chr> <dbl> <int>
1 123 here 1 1970
2 456 there 2 1970
[[2]]
# A tibble: 2 x 4
geoid_1980 name_1980 pop_1970 year
<dbl> <chr> <dbl> <int>
1 234 here 3 1980
2 567 there 4 1980
We can get the 'year' with parse_number
library(dplyr)
library(purrr)
map(test, ~ .x %>%
mutate(year = readr::parse_number(names(.)[1])))
-output
#[[1]]
# A tibble: 2 x 4
# geoid_1970 name_1970 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 123 here 1 1970
#2 456 there 2 1970
#[[2]]
# A tibble: 2 x 4
# geoid_1980 name_1980 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 234 here 3 1980
#2 567 there 4 1980
I'm in the process of creating a generic function in my package. The goal is to find columns that are percent columns, and then to use parse_number on them if they are character columns. I haven't been able to figure out a solution using mutate_at and ifelse. I've pasted a reprex below.
library(tidyverse)
df <- tibble::tribble(
~name, ~pass_percent, ~attendance_percent, ~grade,
"Jon", "90%", 0.85, "B",
"Jim", "100%", 1, "A"
)
percent_names <- df %>% select(ends_with("percent"))%>% names()
# Error due to attendance_percent already being in numeric value
if (percent_names %>% length() > 0) {
df <-
df %>%
dplyr::mutate_at(percent_names, readr::parse_number)
}
#> Error in parse_vector(x, col_number(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE
your attendance_percent variable is numeric, not character and parse_number only wants character variables, see here. So a solution would be:
edited_parse_number <- function(x, ...) {
if (mode(x) == 'numeric') {
x
} else {
parse_number(x, ...)
}
}
df %>%
dplyr::mutate_at(vars(percent_names), edited_parse_number)
# name pass_percent attendance_percent grade
# <chr> <dbl> <dbl> <chr>
#1 Jon 90 0.85 B
#2 Jim 100 1 A
OR
if you didn't want to use that extra function, extract character variables at beginning:
percent_names <- df %>%
select(ends_with("percent")) %>%
select_if(is.character) %>%
names()
percent_names
# [1] "pass_percent"
df %>%
dplyr::mutate_at(vars(percent_names), parse_number)
# name pass_percent attendance_percent grade
# <chr> <dbl> <dbl> <chr>
# 1 Jon 90 0.85 B
# 2 Jim 100 1 A
Alternatively, without having to create a function, you can just add an ifelse statement into mutate_at such as:
if (percent_names %>% length() > 0) {
df <-
df %>% rowwise() %>%
dplyr::mutate_at(vars(percent_names), ~ifelse(is.character(.),
parse_number(.),
.))
}
Source: local data frame [2 x 4]
Groups: <by row>
# A tibble: 2 x 4
name pass_percent attendance_percent grade
<chr> <dbl> <dbl> <chr>
1 Jon 90 0.85 B
2 Jim 100 1 A