Mutate multiple variables to create multiple new variables - r

Let's say I have a tibble where I need to take multiple variables and mutate them into multiple new variables.
As an example, here is a simple tibble:
tb <- tribble(
~x, ~y1, ~y2, ~y3, ~z,
1,2,4,6,2,
2,1,2,3,3,
3,6,4,2,1
)
I want to subtract variable z from every variable whose name starts with "y", and add the results as new variables of tb. Also, suppose I don't know how many "y" variables I have. I want the solution to fit nicely within a tidyverse / dplyr workflow.
In essence, I don't understand how to mutate multiple variables into multiple new variables. I'm not sure if you can use mutate in this instance? I've tried mutate_if, but I don't think I'm using it right (and I get an error):
tb %>% mutate_if(starts_with("y"), funs(.-z))
#Error: No tidyselect variables were registered
Thanks in advance!

Because you are selecting columns by name, you need to use mutate_at rather than mutate_if, which selects columns by testing the values within them.
tb %>% mutate_at(vars(starts_with("y")), funs(. - z))
#> # A tibble: 3 x 5
#> x y1 y2 y3 z
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 2 4 2
#> 2 2 -2 -1 0 3
#> 3 3 5 3 1 1
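For contrast, here is a minimal sketch of what mutate_if expects: a predicate evaluated on each column's contents (here is.numeric), not a selection helper such as starts_with(). The tibble demo and its columns are made up for illustration.

```r
library(dplyr)

# Hypothetical data: one character column and two numeric columns
demo <- tibble(id = c("a", "b", "c"), y1 = c(2, 1, 6), z = c(2, 3, 1))

# mutate_if() picks columns by testing their contents with a predicate
doubled <- demo %>% mutate_if(is.numeric, ~ . * 2)

# mutate_at() picks columns by name, via vars() and a selection helper
shifted <- demo %>% mutate_at(vars(starts_with("y")), ~ . - z)
```

In this sketch doubled changes both numeric columns and leaves id alone, while shifted changes only y1.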
To create new columns instead of overwriting the existing ones, we can name the function inside funs; the name is appended to each column name as a suffix.
# add suffix
tb %>% mutate_at(vars(starts_with("y")), funs(mod = . - z))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z y1_mod y2_mod y3_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
# remove suffix, add prefix
tb %>%
mutate_at(vars(starts_with("y")), funs(mod = . - z)) %>%
rename_at(vars(ends_with("_mod")), funs(paste("mod", gsub("_mod", "", .), sep = "_")))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
Edit: In dplyr 0.8.0 and higher, funs() is deprecated (source1 & source2); use list() with a formula instead.
tb %>% mutate_at(vars(starts_with("y")), list(~ . - z))
#> # A tibble: 3 x 5
#> x y1 y2 y3 z
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 2 4 2
#> 2 2 -2 -1 0 3
#> 3 3 5 3 1 1
tb %>% mutate_at(vars(starts_with("y")), list(mod = ~ . - z))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z y1_mod y2_mod y3_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
tb %>%
mutate_at(vars(starts_with("y")), list(mod = ~ . - z)) %>%
rename_at(vars(ends_with("_mod")), list(~ paste("mod", gsub("_mod", "", .), sep = "_")))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
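As a possible alternative to rename_at, dplyr 1.0.0+ also provides rename_with(), which takes the renaming function first and the column selection second. The following is a sketch under that assumption, reusing the tb from the question; the sub() pattern is my own choice, not part of the original answer.

```r
library(dplyr)

tb <- tribble(
  ~x, ~y1, ~y2, ~y3, ~z,
  1, 2, 4, 6, 2,
  2, 1, 2, 3, 3,
  3, 6, 4, 2, 1
)

renamed <- tb %>%
  mutate_at(vars(starts_with("y")), list(mod = ~ . - z)) %>%
  # rename_with(.fn, .cols): turn the "_mod" suffix into a "mod_" prefix
  rename_with(~ paste0("mod_", sub("_mod$", "", .x)), ends_with("_mod"))
```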
Edit 2: dplyr 1.0.0+ has the across() function, which simplifies this task even further.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on.
It uses tidy selection (like select()) so you can pick variables by
position, name, and type.
The second argument, .fns, is a function or list of functions to apply to
each column. This can also be a purrr style formula (or list of formulas)
like ~ .x / 2. (This argument is optional, and you can omit it if you just want
to get the underlying data; you'll see that technique used in
vignette("rowwise").)
# Control how the names are created with the `.names` argument which
# takes a [glue](http://glue.tidyverse.org/) spec:
tb %>%
mutate(
across(starts_with("y"), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
tb %>%
mutate(
across(num_range(prefix = "y", range = 1:3), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
### Multiple functions
tb %>%
mutate(
across(c(matches("x"), contains("z")), ~ max(.x, na.rm = TRUE), .names = "max_{col}"),
across(c(y1:y3), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 10
#> x y1 y2 y3 z max_x max_z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 3 3 0 2 4
#> 2 2 1 2 3 3 3 3 -2 -1 0
#> 3 3 6 4 2 1 3 3 5 3 1
Created on 2018-10-29 by the reprex package (v0.2.1)
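The example above applies one function per across() call. As a sketch of the "list of functions" form mentioned in the documentation excerpt, a single across() call can also take a named list, with {.fn} in the .names spec picking up each list name. The function names mod and ratio are made up for illustration.

```r
library(dplyr)

tb <- tribble(
  ~x, ~y1, ~y2, ~y3, ~z,
  1, 2, 4, 6, 2,
  2, 1, 2, 3, 3,
  3, 6, 4, 2, 1
)

out <- tb %>%
  mutate(
    across(
      starts_with("y"),
      # two hypothetical transformations applied to every "y" column
      list(mod = ~ .x - z, ratio = ~ .x / z),
      .names = "{.fn}_{.col}"
    )
  )
```

This should add six columns (mod_y1, ratio_y1, and so on) next to the original five.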

Related

Add a sums column to a dataset subset, conditioned on the values of two other columns

I am working with a huge dataset, but just to simplify what I'd like to do, I will use the following one.
testDF <- data.frame(v1 = rep(c('a', 'b', 'c', 'd', 'e', 'f'), 2),
v2 = rep(c(1,0),6))
Let's assume you could subset it like this.
v1 v2
1 a 1
2 b 0
3 c 1
4 d 0
5 e 1
6 f 0
7 a 1
8 b 0
9 c 1
10 d 0
11 e 1
12 f 0
When rows share the same value of v1, I would like to add a third column reporting the sum of the second column's values. The output will look like this:
testDF
v1 v2 tc
1 a 1 2
2 b 0 0
3 c 1 2
4 d 0 0
5 e 1 2
6 f 0 0
7 a 1 2
8 b 0 0
9 c 1 2
10 d 0 0
11 e 1 2
Which operation could I use to do this while staying within dplyr code?
Thanks
Is this what you need?
testDF %>%
group_by(v1) %>%
mutate(third_col = sum(v2, na.rm = TRUE)) %>%
arrange(v1)
# A tibble: 12 × 3
# Groups: v1 [8]
v1 v2 third_col
<dbl> <dbl> <dbl>
1 1 1 1
2 3 0 0
3 3 0 0
4 4 1 1
5 5 1 2
6 5 1 2
7 5 0 2
8 7 0 1
9 7 1 1
10 8 1 1
11 9 0 0
12 NA 0 0
If you need to split by group, then just replace arrange with group_split:
testDF %>%
group_by(v1) %>%
mutate(third_col = sum(v2, na.rm = TRUE)) %>%
group_split(v1)
<list_of<
tbl_df<
v1 : double
v2 : double
third_col: double
>
>[8]>
[[1]]
# A tibble: 1 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 1 1 1
[[2]]
# A tibble: 2 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 3 0 0
2 3 0 0
[[3]]
# A tibble: 1 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 4 1 1
[[4]]
# A tibble: 3 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 5 1 2
2 5 1 2
3 5 0 2
[[5]]
# A tibble: 2 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 7 0 1
2 7 1 1
[[6]]
# A tibble: 1 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 8 1 1
[[7]]
# A tibble: 1 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 9 0 0
[[8]]
# A tibble: 1 × 3
v1 v2 third_col
<dbl> <dbl> <dbl>
1 NA 0 0
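The printed output above uses numeric v1 values, so here is the same grouped sum as a sketch on the testDF exactly as defined in the question (character v1). The column name tc follows the question's desired output.

```r
library(dplyr)

testDF <- data.frame(
  v1 = rep(c("a", "b", "c", "d", "e", "f"), 2),
  v2 = rep(c(1, 0), 6)
)

result <- testDF %>%
  group_by(v1) %>%
  # every row receives the total of v2 within its v1 group
  mutate(tc = sum(v2, na.rm = TRUE)) %>%
  ungroup()
```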
If I am understanding you correctly, you could try this brute-force for loop, though I am sure there are more elegant solutions.
library(dplyr)
test_list <- testDF %>%
group_split(v1)
# Convert list elements to df
test_list_df <- lapply(test_list, as.data.frame)
# If more than one row, create new variable ("third_column") and sum v2
for(xx in seq_along(test_list_df)){
if(nrow(test_list[[xx]]) > 1){
test_list_df[[xx]]$third_column <- sum(test_list_df[[xx]]["v2"])
}}
(I am a bit unclear as to what "For each subset, when the first value of v1 is the same..." means - I took it to mean that there is more than one row in the subset.)
Output
# [[1]]
# v1 v2
# 1 1 1
#
# [[2]]
# v1 v2 third_column
# 1 3 0 0
# 2 3 0 0
#
# [[3]]
# v1 v2
# 1 4 1
#
# [[4]]
# v1 v2 third_column
# 1 5 1 2
# 2 5 1 2
# 3 5 0 2
#
# [[5]]
# v1 v2 third_column
# 1 7 0 1
# 2 7 1 1
#
# [[6]]
# v1 v2
# 1 8 1
#
# [[7]]
# v1 v2
# 1 9 0
#
# [[8]]
# v1 v2
# 1 NA 0
A similar approach to @jpsmith's, except using map:
testDF %>%
group_split(v1) %>%
map(~if(nrow(.x) > 1) mutate(.x, v3 = sum(v2)) else .x)
#> [[1]]
#> # A tibble: 1 x 2
#> v1 v2
#> <dbl> <dbl>
#> 1 1 1
#>
#> [[2]]
#> # A tibble: 2 x 3
#> v1 v2 v3
#> <dbl> <dbl> <dbl>
#> 1 3 0 0
#> 2 3 0 0
#>
#> [[3]]
#> # A tibble: 1 x 2
#> v1 v2
#> <dbl> <dbl>
#> 1 4 1
#>
#> [[4]]
#> # A tibble: 3 x 3
#> v1 v2 v3
#> <dbl> <dbl> <dbl>
#> 1 5 1 2
#> 2 5 1 2
#> 3 5 0 2
#>
#> [[5]]
#> # A tibble: 2 x 3
#> v1 v2 v3
#> <dbl> <dbl> <dbl>
#> 1 7 0 1
#> 2 7 1 1
#>
#> [[6]]
#> # A tibble: 1 x 2
#> v1 v2
#> <dbl> <dbl>
#> 1 8 1
#>
#> [[7]]
#> # A tibble: 1 x 2
#> v1 v2
#> <dbl> <dbl>
#> 1 9 0
#>
#> [[8]]
#> # A tibble: 1 x 2
#> v1 v2
#> <dbl> <dbl>
#> 1 NA 0
Created on 2022-11-06 with reprex v2.0.2
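If a single data frame is preferred over a list, the "more than one row" condition can also be sketched inside a grouped mutate, using n() for the group size. The input below is reconstructed from the printed output above, which is an assumption about the real data; single-row groups get NA rather than a missing column.

```r
library(dplyr)

# Data reconstructed from the printed output (an assumption about the real input)
testDF <- tibble(
  v1 = c(1, 3, 3, 4, 5, 5, 5, 7, 7, 8, 9, NA),
  v2 = c(1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0)
)

flagged <- testDF %>%
  group_by(v1) %>%
  # n() is the group size; single-row groups get NA instead of a sum
  mutate(v3 = if (n() > 1) sum(v2) else NA_real_) %>%
  ungroup()
```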

Add a column to a `tibble` that gives its list position

I have a list of tibbles and I want to add a column to each tibble that represents its position in the list.
Let's say I have the following:
library(tidyverse)
l <- list(
tibble(x = 1:3, y = rev(x)),
tibble(a = 3:1, b = rev(a))
)
Which produces:
> l
[[1]]
# A tibble: 3 x 2
x y
<int> <int>
1 1 3
2 2 2
3 3 1
[[2]]
# A tibble: 3 x 2
a b
<int> <int>
1 3 1
2 2 2
3 1 3
How can I use tidyverse syntax to get out the following:
> l
[[1]]
# A tibble: 3 x 3
x y list_pos
<int> <int> <int>
1 1 3 1
2 2 2 1
3 3 1 1
[[2]]
# A tibble: 3 x 3
a b list_pos
<int> <int> <int>
1 3 1 2
2 2 2 2
3 1 3 2
A possible solution:
library(tidyverse)
imap(l, ~ bind_cols(.x, pos = .y))
#> [[1]]
#> # A tibble: 3 x 3
#> x y pos
#> <int> <int> <int>
#> 1 1 3 1
#> 2 2 2 1
#> 3 3 1 1
#>
#> [[2]]
#> # A tibble: 3 x 3
#> a b pos
#> <int> <int> <int>
#> 1 3 1 2
#> 2 2 2 2
#> 3 1 3 2
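A variant sketch that produces the list_pos column name from the question: map2() over the list and seq_along(l). Unlike imap(), which passes the element names as .y when the list is named, this always passes the integer position.

```r
library(dplyr)
library(purrr)

l <- list(
  tibble(x = 1:3, y = rev(x)),
  tibble(a = 3:1, b = rev(a))
)

# .x is each tibble, .y is its integer position in the list
by_index <- map2(l, seq_along(l), ~ mutate(.x, list_pos = .y))
```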

Ignore zeros and NAs in cumsum

I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to the first consecutive numbers in each column.
My data looks something like this:
library(dplyr)
y1 = c(1,2,3,8,9,0)
y2 = c(0,0,0,4,5,6)
z = c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
# A tibble: 6 × 3
y1 y2 z
<dbl> <dbl> <dbl>
1 1 0 200
2 2 0 250
3 3 0 200
4 8 4 100
5 9 5 90
6 0 6 80
I tried the following formula:
yabc %>%
mutate_at(vars(starts_with("y")),
list(mod = ~ cumsum(c(FALSE, diff(.x)!=1))+1))
that gave me the following result:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 1
2 2 0 250 1 2
3 3 0 200 1 3
4 8 4 100 2 4
5 9 5 90 2 4
6 0 6 80 3 4
I am only interested in numbers greater than zero. I tried replacing zeros with NA, but it did not work either.
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 200 1 1
2 2 NA 250 1 NA
3 3 NA 200 1 NA
4 8 4 100 2 NA
5 9 5 90 2 NA
6 NA 6 80 NA NA
What I would like the data to look like is:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
Is there any way to exclude zeros and start applying the formula only when .x is greater than 0? Or any other way to make the formula work the way I need? Thank you!
FYI: mutate_at has been superseded by across, I'll demonstrate the new method in my code.
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) + 1L,
NA_integer_) )
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 2
# 5 9 5 90 2 2
# 6 0 6 80 NA 2
If this is sufficient (you don't care if it's 1 or 2 for the first effective group in y2_mod), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
NA_integer_))),
across(ends_with("_mod"),
~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x))))
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 1
# 5 9 5 90 2 1
# 6 0 6 80 NA 1
Notes:
if_else is helpful for handling the NA-including rows specially; it requires the true and false results to have the same class, which can be annoying/confusing. Because of this, we need to pass the specific "class" of NA as the false= (third) argument to if_else. For example, cumsum(.)+1 produces a numeric, so the third arg would need to be NA_real_ (since the default NA is actually logical). Another way to deal with it is either to use cumsum(.)+1L (which produces an integer) with NA_integer_ or (as I show in my second example) to use cumsum(.) by itself (with NA_integer_), since we match things later (and match(.) returns integer).
I demo the shift from your mutate_at to mutate(across(..)). An important change here from mutate is that we run across without assigning its return to anything. In essence, it returns a named-list where each element of the list is an updated column or a new one, depending on the presence of .names; that takes a glue-like string to allow for renaming the calculated columns, thereby adding new columns instead of the default action (no .names) of overwriting the columns in-place. The alternate way of producing new (not in-place) columns is the way you used, with a named list of functions, still a common/supported way to use a list of functions within across(..).
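The type point in the first note can be checked with a small sketch. The vector x is made up; the checks only illustrate which NA constant pairs with which result type.

```r
library(dplyr)

x <- c(0, 0, 4, 5)

grp      <- cumsum(x > 0)      # cumsum of a logical vector is integer
grp_plus <- cumsum(x > 0) + 1  # adding the double 1 promotes to double

# Pair each result with the NA constant of the same type
a <- if_else(x > 0, grp, NA_integer_)
b <- if_else(x > 0, grp_plus, NA_real_)

na_class <- class(NA)  # the plain NA constant is logical
```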
library(data.table)
library(tidyverse)
yabc %>%
mutate(across(starts_with('y'),
~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
.names = '{col}_mod'))
# A tibble: 6 x 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <int> <int>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
The trick lies in knowing that for consecutive numbers, the difference between the number and their row_number() is the same:
ie consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, the consecutive numbers are grouped together. To get the desired group ids, we can use rleid (from data.table):
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
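If data.table is not available, the same run ids can be sketched in base R with rle(); this is an equivalent I am substituting for rleid, not part of the original answer.

```r
x <- c(1, 2, 3, 6, 7, 8, 10, 11, 12)

# Consecutive values share the same (value - position) offset
runs <- rle(x - seq_along(x))

# Expand each run into its 1-based run number
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
```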
Another possible solution:
library(tidyverse)
y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)
yabc %>%
mutate(across(starts_with("y"),
~if_else(.x==0, NA_real_, 1+cumsum(c(1,diff(.x)) != 1)), .names="{.col}_mod"))%>%
mutate(across(ends_with("mod"), ~ factor(.x) %>% as.numeric(.)))
#> # A tibble: 6 × 5
#> y1 y2 z y1_mod y2_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 200 1 NA
#> 2 2 0 250 1 NA
#> 3 3 0 200 1 NA
#> 4 8 4 100 2 1
#> 5 9 5 90 2 1
#> 6 0 6 80 NA 1
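The question's stated end goal is the sum of z over the first consecutive run in each column. A possible last step, sketched on the desired output table (entered by hand here), might look like this; the _zsum suffix is a made-up name.

```r
library(dplyr)

# The desired table from the question, typed in directly
yabc_mod <- tibble(
  y1 = c(1, 2, 3, 8, 9, 0),
  y2 = c(0, 0, 0, 4, 5, 6),
  z = c(200, 250, 200, 100, 90, 80),
  y1_mod = c(1, 1, 1, 2, 2, NA),
  y2_mod = c(NA, NA, NA, 1, 1, 1)
)

first_run_sums <- yabc_mod %>%
  summarise(
    # which() drops the NA rows, so only rows in run 1 contribute
    across(ends_with("_mod"), ~ sum(z[which(.x == 1)]), .names = "{.col}_zsum")
  )
```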

dplyr `slice_max` interpolation not working

Given a data.frame:
library(tidyverse)
set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10,2,0.6))
var <- "B"
I'd like to filter the data frame by the highest values of the variable named in var. Logically, I'd do either:
df %>%
slice_max({{ var }}, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
df %>%
slice_max(!! var, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
But neither interpolation is working... what am I missing here?
Expected output would be the same as:
df %>%
slice_max(B, n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I think you need to use the newer .data version as outlined here:
df %>%
slice_max(.data[[var]] , n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I am puzzled as to why your approach returns only the first row, though!
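To make the distinction concrete, here is a sketch of the two common cases: .data[[ ]] when the column name arrives as a string, and {{ }} when a function argument is supplied as a bare column name. The helper names top_by_name and top_by_col and the scores tibble are hypothetical.

```r
library(dplyr)

# Small deterministic data set, made up for illustration
scores <- tibble(A = 1:5, B = c(0.5, 2, -1, 3, 1))

# Column name supplied as a string: index the .data pronoun
top_by_name <- function(data, col_name, n) {
  slice_max(data, .data[[col_name]], n = n)
}

# Column supplied as a bare name: embrace the function argument
top_by_col <- function(data, col, n) {
  slice_max(data, {{ col }}, n = n)
}

from_string <- top_by_name(scores, "B", 2)
from_bare   <- top_by_col(scores, B, 2)
```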
We may convert to sym and evaluate (!!)
library(dplyr)
df %>%
slice_max(!! rlang::sym(var), n = 5)
-output
# A tibble: 5 × 3
A B C
<int> <dbl> <int>
1 10 2.40 0
2 3 1.33 2
3 4 1.27 1
4 1 1.26 1
5 5 0.415 2

modify_at to remove NA values in each element in a list

I have a big list of small datasets like this:
>> my_list
[[1]]
# A tibble: 6 x 2
Year FIPS
<dbl> <chr>
1 2015 12001
2 2015 51013
3 2015 12081
4 2015 12115
5 2015 12127
6 2015 42003
[[2]]
# A tibble: 9 x 2
Year FIPS
<dbl> <chr>
1 2017 04013
2 2017 10003
3 2017 NA
4 2017 25005
5 2017 25009
6 2017 25013
7 2017 25017
8 2017 25021
9 2017 25027
...
I want to remove the NAs from each tibble using modify_at because it looks like a clean way to do it. This is my attempt:
my_list %>% modify_at(c("FIPS"), drop_na)
I tried also with na.omit, but I get the same error in both cases:
Error: character indexing requires a named object
Can anyone help me here, please? What am I doing wrong?
Creating some data.
library(tidyverse)
mylist <-
list(tibble(a = c(1, 2, NA),
b = c(2, 2, 2)),
tibble(c = rep(1, 5),
d = sample(c(NA, 2), 5, replace = TRUE)))
The .at argument in purrr::modify_at() specifies the list element to modify, not the column within the dataframe nested in the list. purrr::modify() works for your purposes.
modify(mylist, drop_na)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
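To illustrate what .at does select, here is a sketch in which modify_at() targets list elements by name or by position. A character .at only works on a named list, which is consistent with the "character indexing requires a named object" error in the question. The list demo_list and its element names are made up.

```r
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical named list of tibbles
demo_list <- list(
  first  = tibble(a = c(1, 2, NA), b = 2),
  second = tibble(c = 1, d = c(NA, 2, 2))
)

# .at picks list elements: by name on a named list ...
only_second <- modify_at(demo_list, "second", drop_na)

# ... or by integer position, which also works on an unnamed list
by_position <- modify_at(unname(demo_list), 2, drop_na)
```

In both results the first tibble keeps its NA row, because only the second list element was passed to drop_na.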
purrr::map() also works. Since your input and output are both list objects, map() is sufficient here, while modify() would be preferred if your input is of another class than a regular list and you want to conserve that class attribute for the output.
map(mylist, drop_na)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
base R
lapply(mylist, na.omit)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
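If only NAs in one particular column should trigger row removal (FIPS in the question), drop_na() accepts column selections, so a sketch could pass the column inside a lambda. The small my_list below is a shortened stand-in for the question's data, not the real input.

```r
library(purrr)
library(tidyr)
library(tibble)

# Shortened stand-in for the list in the question
my_list <- list(
  tibble(Year = 2015, FIPS = c("12001", "51013")),
  tibble(Year = 2017, FIPS = c("04013", NA, "25005"))
)

# Drop rows only where FIPS is missing, in every tibble of the list
cleaned <- map(my_list, ~ drop_na(.x, FIPS))
```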
