Using mutate_at with mutate_if - r

I'm in the process of creating a generic function in my package. The goal is to find columns that are percent columns, and then to use parse_number on them if they are character columns. I haven't been able to figure out a solution using mutate_at and ifelse. I've pasted a reprex below.
library(tidyverse)
df <- tibble::tribble(
~name, ~pass_percent, ~attendance_percent, ~grade,
"Jon", "90%", 0.85, "B",
"Jim", "100%", 1, "A"
)
percent_names <- df %>% select(ends_with("percent"))%>% names()
# Error due to attendance_percent already being in numeric value
if (percent_names %>% length() > 0) {
df <-
df %>%
dplyr::mutate_at(percent_names, readr::parse_number)
}
#> Error in parse_vector(x, col_number(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE

Your attendance_percent variable is numeric, not character, and parse_number only accepts character vectors. So a solution would be:
edited_parse_number <- function(x, ...) {
if (mode(x) == 'numeric') {
x
} else {
parse_number(x, ...)
}
}
df %>%
dplyr::mutate_at(vars(percent_names), edited_parse_number)
# name pass_percent attendance_percent grade
# <chr> <dbl> <dbl> <chr>
#1 Jon 90 0.85 B
#2 Jim 100 1 A
OR
if you don't want to use that extra function, select only the character columns at the beginning:
percent_names <- df %>%
select(ends_with("percent")) %>%
select_if(is.character) %>%
names()
percent_names
# [1] "pass_percent"
df %>%
dplyr::mutate_at(vars(percent_names), parse_number)
# name pass_percent attendance_percent grade
# <chr> <dbl> <dbl> <chr>
# 1 Jon 90 0.85 B
# 2 Jim 100 1 A

Alternatively, without having to create a function, you can just add an ifelse statement into mutate_at such as:
if (percent_names %>% length() > 0) {
df <-
df %>% rowwise() %>%
dplyr::mutate_at(vars(percent_names), ~ifelse(is.character(.),
parse_number(.),
.))
}
Source: local data frame [2 x 4]
Groups: <by row>
# A tibble: 2 x 4
name pass_percent attendance_percent grade
<chr> <dbl> <dbl> <chr>
1 Jon 90 0.85 B
2 Jim 100 1 A
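In dplyr 1.0.0 and later, mutate_at() is superseded by across(), which can combine the name test and the type test in one selection. The following is a minimal sketch of that idea, assuming the df from the question; df_parsed is a name introduced here for illustration only.

```r
library(dplyr)
library(readr)

df <- tibble::tribble(
  ~name, ~pass_percent, ~attendance_percent, ~grade,
  "Jon", "90%", 0.85, "B",
  "Jim", "100%", 1, "A"
)

# Parse only columns whose name ends in "percent" AND that are character.
# Numeric percent columns are not selected, so they are left untouched.
df_parsed <- df %>%
  mutate(across(ends_with("percent") & where(is.character), parse_number))

df_parsed
```

If no column matches the selection, across() selects nothing and mutate() returns the data unchanged, so the explicit length check should not be needed with this form.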

Related

How do I map a function to each row in a tibble in R?

I am trying to map a function to each row in a tibble. Please see code below. My desired workflow is as follows -
Convert a list with sub lists to a tibble
Map each row the tibble to a function
My desired output should be a list with a tibble as output for each row mapped to the function. See full code below -
# Packages
library(tidyverse)
library(purrr)
# Function i want to map
sample_func <- function(tib){
a <- tib$name
b <- tib$qty
c <- tib$price
d <- tib$add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
# Apply the function to each row in the tibble using map()
result <- tib %>%
rowwise() %>%
mutate(temp = map(list(name, qty, price, add), sample_func)) %>%
unnest(temp)
My desired output should be -
[[1]]
# A tibble: 1 × 2
e name
<dbl> <chr>
1 243. CHR1
[[2]]
# A tibble: 1 × 2
e name
<dbl> <chr>
1 139. CHR2
However, at the final rowwise mapping step, I get the following error -
Error in `mutate()`:
! Problem while computing `temp = map(list(name, qty, price, add), sample_func)`.
ℹ The error occurred in row 1.
Caused by error in `map()`:
ℹ In index: 1.
Caused by error in `tib$name`:
! $ operator is invalid for atomic vectors
What am I doing wrong here?
An alternative approach is to change the inputs of the sample_func function to be the names of the columns instead of the tibble, then you can do this with pmap():
# Function i want to map
sample_func <- function(name, qty, price, add){
a <- name
b <- qty
c <- price
d <- add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
# Apply the function to each row in the tibble using pmap()
pmap(tib, sample_func)
Instead of passing a tibble to the function, you may pass the columns of the tibble as vectors.
library(dplyr)
library(purrr)
sample_func <- function(name, qty, price, add){
res <- tibble(e = qty + price + add, stock = name)
return(res)
}
You may then use pmap -
out <- tib %>%
mutate(res = pmap(list(name, qty, price, add), sample_func))
out
# A tibble: 2 × 5
# name qty price add res
# <chr> <dbl> <dbl> <dbl> <list>
#1 CHR1 15 222. 6 <tibble [1 × 2]>
#2 CHR2 10 119. 10 <tibble [1 × 2]>
out$res
#[[1]]
# A tibble: 1 × 2
# e stock
# <dbl> <chr>
#1 243. CHR1
#[[2]]
# A tibble: 1 × 2
# e stock
# <dbl> <chr>
#1 139. CHR2
You may use unnest to get separate columns.
out %>% unnest(res)
# name qty price add e stock
# <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#1 CHR1 15 222. 6 243. CHR1
#2 CHR2 10 119. 10 139. CHR2
We could just apply the sample_func on the picked dataset and unnest
library(dplyr)
library(tidyr)
tib %>%
transmute(temp = sample_func(pick(everything()))) %>%
unnest(where(is_tibble))
-output
# A tibble: 2 × 2
e stock
<dbl> <chr>
1 243. CHR1
2 139. CHR2
If we want it as a list of tibbles
tib %>%
rowwise %>%
reframe(temp = list(sample_func(pick(everything())))) %>%
pull(temp)
-output
[[1]]
# A tibble: 1 × 2
e stock
<dbl> <chr>
1 243. CHR1
[[2]]
# A tibble: 1 × 2
e stock
<dbl> <chr>
1 139. CHR2
To get your desired output and without changing your function or tibble we can use dplyr::rowwise() and dplyr::group_map().
With rowwise we tell 'dplyr' to treat each row as a group. With group_map we apply a function to each group (in our case row) and the function takes the data.frame of each group as input .x which fits your sample_func() perfectly.
library(dplyr)
tib %>%
rowwise() %>%
group_map(~ sample_func(.x))
#> [[1]]
#> # A tibble: 1 × 2
#> e stock
#> <dbl> <chr>
#> 1 243. CHR1
#>
#> [[2]]
#> # A tibble: 1 × 2
#> e stock
#> <dbl> <chr>
#> 1 139. CHR2
Data from OP
library(tidyverse)
# Function i want to map
sample_func <- function(tib){
a <- tib$name
b <- tib$qty
c <- tib$price
d <- tib$add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
Created on 2023-02-12 with reprex v2.0.2
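The error in the question arises because map(list(name, qty, price, add), sample_func) calls sample_func once per list element, so the function receives a single atomic value and tib$name fails. As a rough sketch of another way to hand the function one-row tibbles, the data can be split by row number first. The names rows and result are introduced here for illustration, and sample_func is shortened but keeps the same behaviour.

```r
library(tibble)
library(purrr)

# Same logic as the question's sample_func: takes a (one-row) tibble
sample_func <- function(tib) {
  tibble(e = tib$qty + tib$price + tib$add, stock = tib$name)
}

tib <- tibble(
  name  = c("CHR1", "CHR2"),
  qty   = c(15, 10),
  price = c(222.14, 119.20),
  add   = c(6, 10)
)

# split() returns a list with one single-row tibble per row
rows <- split(tib, seq_len(nrow(tib)))

# Apply the function to each single-row tibble; unname() drops the "1", "2" labels
result <- map(unname(rows), sample_func)
result
```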

Split a list of dataframes into separate dataframes [duplicate]

So I have the list below, and I want to split it into three separate data frames (with names Banana, Strawberry and Apple), as shown in the expected output. I have already seen this (Splitting List into dataframe R), but it's the exact opposite of what I want. I don't want to combine them; I want to split them into three data frames with the same names as the list headers.
list_a <- list(`Banana` = c(8.7), `Strawberry` = c(2.3), `Apple` = c(3.5))
DF1
Banana
8.7
DF2
Strawberry
2.3
DF3
Apple
3.5
Any solution, preferably in the tidyverse, would be greatly appreciated. The actual problem has a lot more columns in the list.
First convert them all to a tibble:
list_a <- list(`Banana` = c(8.7), `Strawberry` = c(2.3), `Apple` = c(3.5))
list_a <- purrr::map(list_a, tibble::as_tibble)
Then send this to the global environment:
list2env(list_a, envir = .GlobalEnv)
We can use imap to get the names and then use set_names
library(purrr)
library(dplyr)
library(stringr)
imap(list_a, ~ set_names(tibble(.x), .y)) %>%
set_names(str_c("DF", 1:3)) %>%
list2env(.GlobalEnv)
DF1
# A tibble: 1 x 1
# Banana
# <dbl>
#1 8.7
DF2
# A tibble: 1 x 1
# Strawberry
# <dbl>
#1 2.3
DF3
# A tibble: 1 x 1
# Apple
# <dbl>
#1 3.5
If we need separate columns
library(tibble)
library(tidyr) # for unnest()
enframe(list_a) %>%
unnest(c(value)) %>%
group_split(rn = row_number(), .keep = FALSE) %>% # 'keep' is deprecated in favour of '.keep'
set_names(str_c("DF", 1:3)) %>%
list2env(.GlobalEnv)
DF1
# A tibble: 1 x 2
# name value
# <chr> <dbl>
#1 Banana 8.7
DF2
# A tibble: 1 x 2
# name value
# <chr> <dbl>
#1 Strawberry 2.3
DF3
# A tibble: 1 x 2
# name value
# <chr> <dbl>
#1 Apple 3.5
A tidyverse way would be
library(tidyverse)
new_list <- set_names(map2(list_a,names(list_a),
~tibble(!!.y := .x)), str_c("df", 1:3))
and it can be done in base R as well
new_list <- setNames(Map(function(x, y) setNames(data.frame(x), y),
list_a,names(list_a)), paste0("df", 1:3))
Now we can write it into global environment.
list2env(new_list, .GlobalEnv)
Less straightforward than previous answers but you can get it using a for loop:
for(i in 1:length(list_a))
{
df <- data.frame(unlist(list_a[[i]]))
colnames(df) <- names(list_a[i])
assign(names(list_a[i]),df, .GlobalEnv)
}
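Since the real list has more than three elements, the hard-coded 1:3 in the answers above can be replaced with seq_along(). The sketch below also writes into a dedicated environment rather than the global one, which is a design choice made here, not something the answers require; dfs and env are hypothetical names.

```r
list_a <- list(Banana = 8.7, Strawberry = 2.3, Apple = 3.5)

# One single-column data frame per element, with the element name as column name
dfs <- Map(function(x, nm) setNames(data.frame(x), nm), list_a, names(list_a))

# Name the data frames DF1, DF2, ... for however many elements there are
names(dfs) <- paste0("DF", seq_along(dfs))

# Assign into a separate environment; use .GlobalEnv instead to mirror the answers
env <- new.env()
list2env(dfs, envir = env)

env$DF1
```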

Iterating name of a field with dplyr::summarise function

This is my first post here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms, stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but the field does not have the same name in every df (either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between the data frames; I only want to use the name, which varies.
Here's a example :
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but this doesn't work: cur_camp holds a character string rather than the column it names, and the lookup still relies on the column's position.
How can I build the current_name_of_erosion_field here ?
We may convert the string to a symbol and evaluate it (!!), or pass the string to across(). Also, as we are using a for loop, make sure to create a list to store the output. To use the name held in an object on the left-hand side of an assignment, use := with !!
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or may use across
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)
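Because the erosion columns share the ERO prefix, the column can also be selected by name pattern, which avoids both the position lookup and the explicit loop. This is a sketch under the assumption that each data frame has exactly one column starting with "ERO"; out is reused from the answers above as the result name.

```r
library(dplyr)
library(purrr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

# For each data frame, average every column whose name starts with "ERO", per ID
out <- map(lst_df, function(d) {
  d %>%
    group_by(ID) %>%
    summarise(across(starts_with("ERO"), mean))
})

out
```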

How to add a column based on values of columns indicated by another column in a tibble in R

In the example below, I would like to add column 'value' based on the values of column 'variable' (i.e., 1 and 20).
toy_data <-
tibble::tribble(
~x, ~y, ~variable,
1, 2, "x",
10, 20, "y"
)
Like this:
x    y    variable  value
1    2    x         1
10   20   y         20
However, none of the below works:
toy_data %>%
dplyr::mutate(
value = get(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable, inherits = TRUE)
)
toy_data %>%
dplyr::mutate(
value = !!variable
)
How can I do this?
If you know in advance which variables are in the data frame, use simple logic like ifelse() or dplyr::case_when() to choose between them.
If not, use functional programming. Below is an example:
library(dplyr)
f <- function(data, variable_col) {
data[[variable_col]] %>%
purrr::imap_dbl(~ data[[.y, .x]])
}
toy_data$value <- f(toy_data, "variable")
Here are a few options that should scale well.
First is a base option that works along both the variable column and its index. (I made a copy of the data frame just so I had the original intact for more programming.)
library(dplyr)
toy2 <- toy_data
toy2$value <- mapply(function(v, i) toy_data[[v]][i], toy_data$variable, seq_along(toy_data$variable))
toy2
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Second uses purrr::imap_dbl to iterate along the variable and its index and return a double.
toy_data %>%
mutate(value = purrr::imap_dbl(variable, function(v, i) toy_data[[v]][i]))
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Third is least straightforward, but what I'd most likely use personally, maybe just because it's a process that fits many of my workflows. Pivotting makes a long version of the data, letting you see both values of variable and corresponding values of x and y, which you can then filter for where those 2 columns match. Then self-join back to the data frame.
inner_join(
toy_data,
toy_data %>%
tidyr::pivot_longer(cols = -variable, values_to = "value") %>%
filter(variable == name),
by = "variable"
) %>%
select(-name)
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Edit: @jpiversen rightly points out that the self-join won't work if variable has duplicates. In that case, add a row number to the data and use that as an additional joining column. Here I first add an additional observation to illustrate.
toy3 <- toy_data %>%
tibble::add_row(x = 5, y = 4, variable = "x") %>%
tibble::rowid_to_column()
inner_join(
toy3,
toy3 %>%
tidyr::pivot_longer(cols = c(-rowid, -variable), values_to = "value") %>%
filter(variable == name),
by = c("rowid", "variable")
) %>%
select(-name, -rowid)
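Another base R option is matrix indexing: a two-column matrix of (row, column) pairs picks one cell per row. This is only a sketch, and it assumes every candidate column (here x and y, listed in the hypothetical cols vector) has the same numeric type so that as.matrix() does not coerce anything.

```r
toy_data <- tibble::tribble(
  ~x, ~y, ~variable,
  1,  2,  "x",
  10, 20, "y"
)

# Candidate columns that 'variable' may refer to (assumed all numeric)
cols <- c("x", "y")
m <- as.matrix(toy_data[cols])

# Row i, column match(variable[i], cols): one (row, column) pair per row
idx <- cbind(seq_len(nrow(toy_data)), match(toy_data$variable, cols))
toy_data$value <- m[idx]

toy_data
```

Because each row carries its own index, duplicated values in variable are not a problem for this approach.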

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble with the columns being the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this github-issue
So i want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Usecase example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, it means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions :
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df() (note that tbl_df() is deprecated in recent dplyr versions; as_tibble() is the replacement)
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
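For completeness, tibble 3.0.0 and later also export as_tibble_row(), which converts a named vector or list into a one-row tibble without touching unexported methods. A minimal sketch, with res as an illustrative name:

```r
library(tibble)

vec <- c("a" = 1, "b" = 2)

# Each named element becomes one column of a single-row tibble
res <- as_tibble_row(vec)
res
```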
