How to mutate multiple columns with dynamic variable names using purrr::map?

I have a data frame as below:
df <- data.frame(
id = c(1:5),
a = c(3,10,4,0,15),
b = c(2,1,1,0,3),
c = c(12,3,0,3,1),
d = c(9,7,8,0,0),
e = c(1,2,0,2,2)
)
I need to add multiple columns whose names combine a prefix (usa, canada, nz in the example below) with an integer from 3:5. The integer from 3:5 is also used inside the sum function:
df %>% mutate(
usa_3 = sum(1+3),
usa_4 = sum(1+4),
usa_5 = sum(1+5),
canada_3 = sum(1+3),
canada_4 = sum(1+4),
canada_5 = sum(1+5),
nz_3 = sum(1+3),
nz_4 = sum(1+4),
nz_5 = sum(1+5)
)
The result is really simple, but I do not want to write nearly identical code over and over.
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5 nz_3 nz_4 nz_5
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6
The column names consist of an alphabetical prefix and an integer from a range as a postfix.
The postfix is also used in the sum function, as 1+postfix.
In this case there are 3 prefixes and 3 postfixes, so the result has 9 additional columns.
I would prefer not to define a function outside the pipeline, and I suppose the map function in purrr may help.
Do you know how to make it work?
In particular, it is difficult to give dynamic column names inside a pipe.
I found some similar questions, but they do not match my need.
Multivariate mutate
How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
===== ADDITIONAL INFO =====
Let me clarify some conditions of this issue.
Actually, the sum(1+3), sum(1+4)... part stands in for as.factor(cutree(X,k=Y)), where X is the result of a cluster analysis and Y is a variable defined as 3:5 in the example. cutree() cuts the dendrogram stored in the result of a cluster analysis, and k determines where it is cut.
As for the column names usa_3, usa_4 ... nz_5, the country name is replaced by the method of cluster analysis, such as Ward, McQuitty, median, etc. (seven methods), and the integers 3, 4, 5 are the parameter that defines where I need to cut the dendrogram, as explained.
As for the X in the function as.factor(cutree(X,k=Y)), the results of the cluster analysis are also stored in several objects, one per method. I realized there is another issue: how to apply the function to each of these objects (the results of the cluster analysis stored in different data frames).
Actual scripts that I am using currently is something like this:
cluste_number <- original_df %>% mutate(
## Ward
ward_3=as.factor(cutree(clst.ward,k=3)),
ward_4=as.factor(cutree(clst.ward,k=4)),
ward_5=as.factor(cutree(clst.ward,k=5)),
ward_6=as.factor(cutree(clst.ward,k=6)),
## Single
sing_3=as.factor(cutree(clst.sing,k=3)),
sing_4=as.factor(cutree(clst.sing,k=4)),
sing_5=as.factor(cutree(clst.sing,k=5)),
sing_6=as.factor(cutree(clst.sing,k=6)))
I am sorry for not clarifying the actual issue earlier; however, for the reason above, the number of countries (usa, canada, nz) and the number of parameters (1:3) do not match.
Also, the suggestions using i + . do not fit the issue, because the function as.factor(cutree(X,k=Y)) is used in the actual operation.
Thank you for your support.
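A minimal sketch of how the repeated cutree() calls in the script above might be generated, assuming the clustering results are collected in a named list. The built-in mtcars data, the list name clusterings, and the vector ks are stand-ins introduced for illustration; they are not from the original script.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for original_df and its clustering results
original_df <- as.data.frame(scale(mtcars[, 1:4]))
d <- dist(original_df)
clusterings <- list(
  ward = hclust(d, method = "ward.D2"),
  sing = hclust(d, method = "single")
)
ks <- 3:6

# Every (method, k) pair; k varies fastest, so the order is ward_3, ward_4, ...
combos <- expand.grid(k = ks, method = names(clusterings),
                      stringsAsFactors = FALSE)

# One factor column per pair, named method_k
new_cols <- map2(combos$method, combos$k,
                 ~ as.factor(cutree(clusterings[[.x]], k = .y)))
names(new_cols) <- paste(combos$method, combos$k, sep = "_")

# Splice the named list into mutate() as new columns
cluster_number <- original_df %>% mutate(!!!new_cols)
```

Adding a method then only means adding one element to the list, and changing the cut points only means changing ks.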

Not sure what you are up to, but maybe this helps to clarify the issue ..
library(tidyverse)
df <- data.frame(
id = c(1:5),
a = c(3,10,4,0,15),
b = c(2,1,1,0,3),
c = c(12,3,0,3,1),
d = c(9,7,8,0,0),
e = c(1,2,0,2,2)
)
ctry <- rep(c("usa", "ca", "nz"), each = 3)
nr <- rep(seq(3,5), times = 3)
df %>%
as_tibble() %>%
bind_cols(map_dfc(seq_along(ctry), ~1+nr[.x] %>%
rep(nrow(df))) %>%
set_names(str_c(ctry, nr, sep = "_")))
# A tibble: 5 x 15
id a b c d e usa_3 usa_4 usa_5 ca_3 ca_4 ca_5 nz_3 nz_4 nz_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6

I'm not sure if I understand the spirit of the problem, but here is one way to generate a data frame with the column names and values you want.
You can change ~ function(i) i + . to be whatever function of i (the column being mutated) you want, and change either of the ns in setNames(n, n) to incorporate a different value into the function you're creating (first n) or change the names of the resulting columns (second n).
countries <- c('usa', 'canada', 'nz')
n <- 3:5
as.data.frame(matrix(1, nrow(df), length(n))) %>%
rename_all(~countries) %>%
mutate_all(map(setNames(n, n), ~ function(i) i + .)) %>%
select(-countries) %>%
bind_cols(df)
# usa_3 canada_3 nz_3 usa_4 canada_4 nz_4 usa_5 canada_5 nz_5 id a b c d e
# 1 4 4 4 5 5 5 6 6 6 1 3 2 12 9 1
# 2 4 4 4 5 5 5 6 6 6 2 10 1 3 7 2
# 3 4 4 4 5 5 5 6 6 6 3 4 1 0 8 0
# 4 4 4 4 5 5 5 6 6 6 4 0 0 3 0 2
# 5 4 4 4 5 5 5 6 6 6 5 15 3 1 0 2
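mutate_all() is superseded in current dplyr, so here is a rough equivalent of the same idea with across(), assuming dplyr 1.0.0 or later. The names adders, base, and new_cols are introduced for this sketch only, and the data frame is a shortened stand-in for df.

```r
library(dplyr)
library(purrr)

df <- data.frame(id = 1:5, a = c(3, 10, 4, 0, 15))
countries <- c("usa", "canada", "nz")
n <- 3:5

# One adder function per value of n; force() pins k when the closure is created
adders <- map(setNames(n, n), function(k) {
  force(k)
  function(x) x + k
})

# A column of ones per country, as in the answer above
base <- as.data.frame(matrix(1, nrow(df), length(countries)))
names(base) <- countries

# across() names each result column_function, e.g. usa_3
new_cols <- base %>%
  mutate(across(all_of(countries), adders)) %>%
  select(-all_of(countries))

result <- bind_cols(df, new_cols)
```

Unlike the mutate_all() version, the columns here come out grouped by country (usa_3, usa_4, usa_5, canada_3, ...).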

Kind of a dirty solution, but it does what you want. It combines two map_dfc calls.
library(dplyr)
library(purrr)
df <- tibble(id = c(1:5),
a = c(3,10,4,0,15),
b = c(2,1,1,0,3),
c = c(12,3,0,3,1),
d = c(9,7,8,0,0),
e = c(1,2,0,2,2))
create_postfix_cols <- function(df, country, n) {
# df = a dataframe
# country = prefix value (e.g. "canada")
# n = vector of postfix values (e.g. 3:5)
map2_dfc(.x = rep(country, length(n)),
.y = n,
~ tibble(col = rep(1 + .y, nrow(df))) %>%
set_names(paste(.x, .y, sep = "_")))
}
countries <- c("usa", "canada", "nz")
n <- 3:5
df %>%
bind_cols(map_dfc(.x = countries, ~create_postfix_cols(df, .x, n)))
# A tibble: 5 x 15
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 2 12 9 1 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6
# ... with 3 more variables: nz_3 <dbl>, nz_4 <dbl>, nz_5 <dbl>

Here is a base R solution. You can rearrange the columns if you would like, but this should get you started:
# Create column names using an index and country names
idx <- 3:5
countries <- c("usa", "canada", "nz")
new_columns <- unlist(lapply(countries, paste0, "_", idx))
# Adding new values using index & taking advantage of recycling
df[new_columns] <- sort(rep(1+idx, nrow(df)))
df
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5 nz_3 nz_4 nz_5
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6
Or, if you prefer:
# All in one long line
df[unlist(lapply(countries, paste0, "_", idx))] <- sort(rep(1+idx, nrow(df)))
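The assignment above depends on sort() and vector recycling lining up with the column order, which is easy to break if idx or countries change. A more explicit sketch of the same idea, with one constant per new column, might look like this (col_values is a name introduced here, and the data frame is a shortened stand-in for df):

```r
df <- data.frame(id = 1:5, a = c(3, 10, 4, 0, 15))
idx <- 3:5
countries <- c("usa", "canada", "nz")

new_columns <- unlist(lapply(countries, paste0, "_", idx))

# One constant per new column, in the same order as new_columns
col_values <- rep(1 + idx, times = length(countries))

# Build each column explicitly instead of relying on recycling
df[new_columns] <- lapply(col_values, rep, times = nrow(df))
```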

Related

How can I create a new column with mutate function in R that is a sequence of values of other columns in R?

I have a data frame that looks like this :
a b c
1 2 10
2 2 10
3 2 10
4 2 10
5 2 10
I want to create a column, with the mutate function or something else from the dplyr family of functions (or base R), that is a sequence from b to c (i.e. from 2 to 10, with length equal to the number of rows of this tibble or data frame).
Ideally, I want my new data frame to look like this:
a b c c
1 2 10 2
2 2 10 4
3 2 10 6
4 2 10 8
5 2 10 10
How can I do this with R using dplyr ?
library(tidyverse)
n=5
a = seq(1,n,length.out=n)
b = rep(2,n)
c = rep(10,n)
data = tibble(a,b,c)
We may do
library(dplyr)
data %>%
rowwise %>%
mutate(new = seq(b, c, length.out = n)[a]) %>%
ungroup
-output
# A tibble: 5 × 4
a b c new
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
If you want this done "by group" for each a value (creating many new rows), we can create the sequence as a list column and then unnest it:
data %>%
mutate(result = map2(b, c, seq, length.out = n)) %>%
unnest(result)
# # A tibble: 25 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 1 2 10 4
# 3 1 2 10 6
# 4 1 2 10 8
# 5 1 2 10 10
# 6 2 2 10 2
# 7 2 2 10 4
# 8 2 2 10 6
# 9 2 2 10 8
# 10 2 2 10 10
# # … with 15 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want to keep the same number of rows and go from the first b value to the last c value, we can use seq directly in mutate:
data %>%
mutate(result = seq(from = first(b), to = last(c), length.out = n()))
# # A tibble: 5 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 2 2 10 4
# 3 3 2 10 6
# 4 4 2 10 8
# 5 5 2 10 10
This one?
library(dplyr)
df %>%
mutate(c1 = a*b)
a b c c1
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10

join columns recursively in R

Hello, I have a data frame with 245 columns. I want to add up sets of columns to generate new columns, and I tried to do it with a loop as follows:
cl1<-sample(1:4,10,replace=TRUE)
cl2<-sample(1:4,10,replace=TRUE)
cl3<-sample(1:4,10,replace=TRUE)
cl4<-sample(1:4,10,replace=TRUE)
cl5<-sample(1:4,10,replace=TRUE)
cl6<-sample(1:4,10,replace=TRUE)
dat<-data.frame(cl1,cl2,cl3,cl4,cl5,cl6)
My intention is to add column 1 to columns 3 and 5, and likewise column 2 to columns 4 and 6, so that in the end I obtain a data frame with two columns.
I have programmed the following code:
revisar<- function(a){
todos = list()
i=1
j=3
l=5
k=1
while(i<=2 ){
cl<-a[,i]
cl2<-a[,j]
cl3<-a[,l]
cl[is.na(cl)] <- 0
cl2[is.na(cl2)] <- 0
cl3[is.na(cl3)] <- 0
colu<-cl+cl2+cl3
col<-cbind(colu,colu)
i<-i+1
j<-j+1
l<-l+1
k<-k+1
}
return(col)
}
It turns out that it only returns column 2 repeated twice, and I must replicate the same thing to join those 245 columns.
I would like to know what is failing in the example.
base R
Literal programming:
with(dat, data.frame(s1 = cl1+cl3+cl5, s2 = cl2+cl4+cl6))
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
Programmatically,
L <- list(s1 = c(1,3,5), s2 = c(2,4,6))
out <- data.frame(lapply(L, function(z) do.call(rowSums, list(as.matrix(dat[,z])))))
out
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
dplyr
library(dplyr)
dat %>%
transmute(
s1 = rowSums(cbind(cl1, cl3, cl5)),
s2 = rowSums(cbind(cl2, cl4, cl6))
)
or programmatically using purrr:
purrr::map_dfc(L, ~ rowSums(dat[, .]))
Data
set.seed(42)
# your `dat` above
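As for why the loop in the question returns the second sum twice: col is overwritten with cbind(colu, colu) on every pass, so only the last iteration survives, and the todos list is never filled. A minimal corrected sketch, keeping the question's function name and storing one sum per pass (the output names s1 and s2 are borrowed from the answer above, and the small dat here is made up for the example):

```r
# Small made-up data with the same layout as the question's dat
dat <- data.frame(
  cl1 = c(1, 2, NA), cl2 = c(4, 4, 1), cl3 = c(2, 2, 2),
  cl4 = c(3, NA, 3), cl5 = c(1, 1, 1), cl6 = c(2, 2, 2)
)

revisar <- function(a) {
  todos <- list()
  for (i in 1:2) {
    # Columns i, i + 2 and i + 4, with NA treated as 0
    cols <- a[, c(i, i + 2, i + 4)]
    cols[is.na(cols)] <- 0
    # Store this pass under its own name instead of overwriting
    todos[[paste0("s", i)]] <- rowSums(cols)
  }
  as.data.frame(todos)
}

res <- revisar(dat)
```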
Here is an alternative general approach:
Here we sum all odd-numbered columns -> s1 and
all even-numbered columns -> s2:
library(dplyr)
dat %>%
rowwise() %>%
mutate(s1 = sum(c_across(seq(1,ncol(dat),2)), na.rm = TRUE),
s2 = sum(c_across(seq(2,ncol(dat),2)), na.rm = TRUE))
cl1 cl2 cl3 cl4 cl5 cl6 s1 s2
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 3 2 3 2 7 5
2 2 4 1 4 2 3 5 11
3 2 2 2 2 1 3 5 7
4 2 4 4 3 1 4 7 11
5 2 4 4 3 2 2 8 9
6 3 3 3 2 2 2 8 7
7 2 1 1 2 1 4 4 7
8 2 4 1 3 2 3 5 10
9 3 1 1 2 3 4 7 7
10 2 4 1 3 4 4 7 11

How to get specific values out of a list of values passed to one argument of a UDF with tidyeval

I used tidyeval to write a short function which takes grouping variables as an input, groups the mtcars dataset and counts the number of occurrences per group:
test_function <- function(grps){
mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Now imagine that, for this example, I want a subtotal row for each group of cyl. So how many cars have 4 (6, 8) cylinders? This is what the result should look like:
test_function(grps = c(cyl, gear), subtotalrows = TRUE) ### example function execution
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 4 total 11
5 6 3 2
6 6 4 4
7 6 5 1
8 6 total 7
9 8 3 12
10 8 5 2
11 8 total 14
In this case, the subtotal rows I am looking for can simply be produced with the same function but with one less grouping variable:
test_function(grps = cyl)
---
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
But since I don't want to call the function inside itself (I am not even sure whether this is possible in R), I would like to go for a different approach. As far as I know, the best (and only) way to create subtotal rows so far is to calculate them independently and then bind them row-wise to the grouped table (i.e. rbind, bind_rows). In my case that means taking only the first grouping variable, creating the subtotal rows, and later binding them to the table. But this is where I have problems with the tidyeval syntax. Here is, in pseudocode, what I would like to do in the function:
test_function <- function(grps, subtotalrows = TRUE){
grouped_result <- mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
if(subtotalrows == FALSE){
return(grouped_result)
} else {
#pseudocode
group_for_subcalculation <- grps[[1]] #I want the first element of the grps argument
subtotal_result <- mtcars %>%
group_by(across({{group_for_subcalculation}})) %>%
summarise(Count = n()) %>%
mutate(grps[[2]] := "total") %>%
arrange(grps[[1]], grps[[2]], Count)
return(rbind(grouped_result, subtotal_result))
}
}
So, two questions. First, I am curious how I can extract the first column name passed by grps and work with it in the following code. Second, this pseudocode example is specific to 2 columns passed by grps. Imagine I want to pass 3 or even more. How would you do that (loops)?
Try this function -
library(dplyr)
test_function <- function(grps, subtotalrows = TRUE){
grouped_data <- mtcars %>% group_by(across({{grps}}))
groups <- group_vars(grouped_data)
col_to_change <- groups[length(groups)] #Last value in grps
grouped_result <- grouped_data %>% summarise(Count = n())
if(!subtotalrows) return(grouped_result)
else {
result <- grouped_result %>%
summarise(Count = sum(Count),
!!col_to_change := 'Total') %>%
bind_rows(grouped_result %>%
mutate(!!col_to_change := as.character(.data[[col_to_change]]))) %>%
select(all_of(groups), Count) %>%
arrange(across(all_of(groups)))
}
return(result)
}
Test the function -
test_function(grps = c(cyl, gear))
# A tibble: 11 x 3
# cyl gear Count
# <dbl> <chr> <int>
# 1 4 3 1
# 2 4 4 8
# 3 4 5 2
# 4 4 Total 11
# 5 6 3 2
# 6 6 4 4
# 7 6 5 1
# 8 6 Total 7
# 9 8 3 12
#10 8 5 2
#11 8 Total 14
test_function(grps = c(cyl, gear), FALSE)
# cyl gear Count
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
For 3 variables -
test_function(grps = c(cyl, gear, carb))
# cyl gear carb Count
# <dbl> <dbl> <chr> <int>
# 1 4 3 1 1
# 2 4 3 Total 1
# 3 4 4 1 4
# 4 4 4 2 4
# 5 4 4 Total 8
# 6 4 5 2 2
# 7 4 5 Total 2
# 8 6 3 1 2
# 9 6 3 Total 2
#10 6 4 4 4
#11 6 4 Total 4
#12 6 5 6 1
#13 6 5 Total 1
#14 8 3 2 4
#15 8 3 3 3
#16 8 3 4 5
#17 8 3 Total 12
#18 8 5 4 1
#19 8 5 8 1
#20 8 5 Total 2
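The first part of the question, extracting the first column name passed through grps, can also be handled without grouping the data first. A minimal sketch, assuming the tidyselect and rlang packages are installed (both come with dplyr); the helper name first_group is made up for this example:

```r
library(dplyr)

# Hypothetical helper: resolve a tidyselect expression to column names
# and return the first one
first_group <- function(data, grps) {
  pos <- tidyselect::eval_select(rlang::enquo(grps), data)
  names(pos)[1]
}

first_name <- first_group(mtcars, c(cyl, gear))

# The name can then be used like any other string column reference
subtotal <- mtcars %>%
  group_by(across(all_of(first_name))) %>%
  summarise(Count = n())
```

Here first_name is "cyl", and subtotal holds the per-cyl counts (11, 7, 14) shown earlier in the question.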

Create variable with conditions on other multiple variables

I'm trying to create variable with conditions on other multiple variables.
For example, I have 5 variables, A, B, C, D, E. They range from 1 to 8.
I want to create a new variable, grade, with the conditions below.
1) If any of the variables (A to E) is under 2, the grade will be 1.
2) If all of the variables are 3 or more and any of the variables is 3 or 4, the grade will be 2.
3) If all of the variables are 5 or more, the grade will be 3.
I create dataset test arbitrarily.
test<-data.frame(A=c(4,7,4,1,4),
B=c(8,8,6,5,8),
C=c(6,5,6,7,5),
D=c(7,8,7,5,8),
E=c(5,7,8,5,5))
test
In this case, the grade will be 2,3,2,1,2.
I tried mutate_at function with vars and one_of function. However, it didn't return what I expected.
test<-test%>%mutate_at(
vars(one_of("A","B","C","D","E")),
funs(grade=case_when(. %in% c(1,2)~1,
min(.) %in% c(3,4)~2,
min(.) %in% c(5,6,7,8)~3)))
test
A B C D E A_grade B_grade C_grade D_grade E_grade
1 4 8 6 7 5 NA 3 3 3 3
2 7 8 5 8 7 NA 3 3 3 3
3 4 6 6 7 8 NA 3 3 3 3
4 1 5 7 5 5 1 3 3 3 3
5 4 8 5 8 5 NA 3 3 3 3
I would appreciate any help.
You can use c_across, which is available in dplyr 1.0.0 and later (when this was written it required the development version, installed via remotes::install_github("tidyverse/dplyr")), to get what you want easily. Note that the result doesn't have a 3 because I interpreted your logic as > 5 rather than >= 5.
library(dplyr)
test<-data.frame(A=c(4,7,4,1,4),
B=c(8,8,6,5,8),
C=c(6,5,6,7,5),
D=c(7,8,7,5,8),
E=c(5,7,8,5,5))
test %>%
rowwise() %>%
mutate(grade = case_when(
sum(c_across(A:E) < 2) > 0 ~ 1,
sum(c_across(A:E) > 5) == 5 ~ 3,
TRUE ~ 2
))
#> # A tibble: 5 x 6
#> # Rowwise:
#> A B C D E grade
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 8 6 7 5 2
#> 2 7 8 5 8 7 2
#> 3 4 6 6 7 8 2
#> 4 1 5 7 5 5 1
#> 5 4 8 5 8 5 2
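If the thresholds are read the way the question's own attempt suggests (a row minimum of 1 or 2 gives grade 1, 3 or 4 gives grade 2, 5 to 8 gives grade 3), the grade depends only on the row minimum. A minimal sketch under that assumption, which is an interpretation and not something the answer above states; row_min and graded are names introduced here:

```r
library(dplyr)

test <- data.frame(A = c(4, 7, 4, 1, 4),
                   B = c(8, 8, 6, 5, 8),
                   C = c(6, 5, 6, 7, 5),
                   D = c(7, 8, 7, 5, 8),
                   E = c(5, 7, 8, 5, 5))

graded <- test %>%
  mutate(
    # Smallest value in each row, computed without rowwise()
    row_min = pmin(A, B, C, D, E),
    grade = case_when(
      row_min <= 2 ~ 1,
      row_min <= 4 ~ 2,
      TRUE         ~ 3
    )
  )
```

With this reading, graded$grade is 2, 3, 2, 1, 2, which matches the grades the question expects.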

Create new column based on condition from other column per group using tidy evaluation

Similar to this question but I want to use tidy evaluation instead.
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
> df
group date speed
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
The task is to create a new column (newValue) whose values equal the value of the date column (per group) where one condition holds: speed == 4. Example: group 1 has a newValue of 2 because date[speed==4] = 2.
group date speed newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
It worked without tidy evaluation:
df %>%
group_by(group) %>%
mutate(newValue=date[speed==4L])
#> # A tibble: 9 x 4
#> # Groups: group [3]
#> group date speed newValue
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 2
#> 2 1 2 4 2
#> 3 1 3 3 2
#> 4 2 4 4 4
#> 5 2 5 5 4
#> 6 2 6 6 4
#> 7 3 7 6 8
#> 8 3 8 4 8
#> 9 3 9 9 8
But I got an error with tidy evaluation:
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df <- df %>%
group_by(group) %>%
mutate(newValue=!!filter_var[speed==4L])
}
my_fu(df, "date")
#> Error in quos(..., .named = TRUE): object 'speed' not found
Thanks in advance.
We can wrap the unquoted symbol in parentheses. Otherwise, !! may be applied to the whole expression (filter_var[speed == 4L]) instead of to filter_var alone.
library(rlang)
library(dplyr)
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df %>%
group_by(group) %>%
mutate(newValue=(!!filter_var)[speed==4L])
}
my_fu(df, "date")
# A tibble: 9 x 4
# Groups: group [3]
# group date speed newValue
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 3 2
#2 1 2 4 2
#3 1 3 3 2
#4 2 4 4 4
#5 2 5 5 4
#6 2 6 6 4
#7 3 7 6 8
#8 3 8 4 8
#9 3 9 9 8
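An alternative that sidesteps the precedence of !! altogether is the .data pronoun, which looks a column up by its name as a string. A minimal sketch, assuming the same df as in the question; the function name my_fu_data is made up to keep it apart from my_fu:

```r
library(dplyr)

df <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 speed = c(3, 4, 3, 4, 5, 6, 6, 4, 9))

my_fu_data <- function(df, filter_var) {
  df %>%
    group_by(group) %>%
    # .data[[filter_var]] is the column named by the string, within each group
    mutate(newValue = .data[[filter_var]][speed == 4L]) %>%
    ungroup()
}

res <- my_fu_data(df, "date")
```

Like the version above, this assumes each group has exactly one row with speed == 4.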
You can also use sqldf: join df with a subquery restricted to the rows where speed is 4:
library(sqldf)
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
sqldf("SELECT df_origin.*, df4.`date` new_value FROM
df df_origin join (SELECT `group`, `date` FROM df WHERE speed = 4) df4
on (df_origin.`group` = df4.`group`)")
