I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like the operations listed in one table (evaluation in the example below) to be performed on another table (dt in the example below). I do not want to hardcode the column names on the RHS within mutate(); I would like to control the evaluations from the evaluation table, so I am trying to make the process dynamic.
Here is a sample dataframe:
dt = data.frame(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
Here is a table of sample operations to be performed:
evaluation = data.frame(
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
What I am trying to do is the following:
for (i in 1:nrow(evaluation)) {
var = evaluation$New_Var[i]
dt = dt %>%
rowwise() %>%
mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])),
evaluation$Result[i],
!!var))
}
My desired result: wherever the condition is TRUE, the new column gets the Result value; everywhere else it keeps its original value (so the AA column would still hold its original numeric values of 1, 1, 1, 1, 1 rather than the string "AA").
UPDATED:
I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?
I know there are other ways to do it using base R, but I would rather do it through dplyr as it is cleaner code to look at. I am leveraging "rowwise()" to do it element by element.
Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.
dt = tibble(
A = c(1:20),
B = c(10:29), ## Note the change
C = c(21:40),
AA = rep("a", 20), ## Note initialization with strings
BB = rep("b", 20) ## Ditto
)
To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.
for( i in 1:nrow(evaluation) )
{
var <- rlang::sym(evaluation$New_Var[i])
op <- rlang::parse_expr(evaluation$Operation[i])
dt = dt %>% rowwise() %>%
mutate(!!var := ifelse(!!op, evaluation$Result[i],!!var))
}
# # A tibble: 20 x 5
# A B C AA BB
# <int> <int> <int> <chr> <chr>
# 1 1 10 21 a False
# 2 2 11 22 a False
# 3 3 12 23 a b
# 4 4 13 24 a b
# 5 5 14 25 a b
# 6 6 15 26 a b
# 7 7 16 27 a b
# 8 8 17 28 a b
# 9 9 18 29 a b
# 10 10 19 30 True b
# 11 11 20 31 True b
# 12 12 21 32 True b
# 13 13 22 33 True b
# 14 14 23 34 True b
# 15 15 24 35 True b
# 16 16 25 36 True b
# 17 17 26 37 True b
# 18 18 27 38 True b
# 19 19 28 39 True b
# 20 20 29 40 True b
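To see what the two conversions produce on their own, here is a minimal sketch. The toy data frame d is made up for illustration, and eval_tidy() is used only to show that the resulting objects can be evaluated against a data frame; inside mutate() the !! operator does the equivalent job.

```r
library(rlang)

var <- sym("AA")                  # a symbol, i.e. the bare name AA
op  <- parse_expr("(A*2) > B")    # a call object: (A * 2) > B

# Hypothetical toy data, only for demonstration
d <- data.frame(A = 1:3, B = c(1, 5, 4), AA = c("a", "a", "a"))

is.symbol(var)                    # TRUE
is.call(op)                       # TRUE
cond <- eval_tidy(op, data = d)   # TRUE FALSE TRUE
vals <- eval_tidy(var, data = d)  # the AA column of d
```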
Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.
Data
library(rlang)
library(dplyr)
library(purrr)
operations <- tibble(
old_var = exprs(A, B),
new_var = exprs(AA, BB),
test = exprs(2*A > B, 2*B <= C),
result = exprs("True", "False")
)
original <- tibble(
A = sample.int(30, 10),
B = sample.int(30, 10),
C = sample.int(30, 10)
)
original
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 4 20 5
2 30 29 11
3 1 27 14
4 2 21 4
5 17 19 24
6 14 25 9
7 5 22 22
8 6 13 7
9 25 4 21
10 12 11 12
Functions
# Here are your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}
generic_ops <- function(dat, ops) {
pmap(ops, generic_mutate, dat = dat) %>%
reduce(full_join)
}
generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.
generic_ops is the "vectorized" version. It takes the original dataframe as the first argument, and a dataframe of operations as the second. It then parallel maps over each column of new variable names, tests, etc, and calls generic_mutate on each one. That results in a list of dataframes, each with one added column. The reduce then combines them back all together with a sequential full_join.
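As a variation (not what generic_ops does), the same operations could be folded over the data one after another, which avoids the join step. This is only a sketch: the helper name apply_op, the plain list ops, and the two-row dat are assumptions made up for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical operations, stored as a plain list of lists
ops <- list(
  list(new_var = expr(AA), test = expr(2 * A > B),  result = "True",  old_var = expr(A)),
  list(new_var = expr(BB), test = expr(2 * B <= C), result = "False", old_var = expr(B))
)

# Hypothetical toy data
dat <- tibble(A = c(1L, 10L), B = c(5L, 4L), C = c(20L, 3L))

# Apply one operation to a data frame and return the modified data frame
apply_op <- function(dat, op) {
  mutate(dat, !!op$new_var := ifelse(!!op$test, !!op$result, !!op$old_var))
}

# reduce() threads the data frame through every operation in turn
out <- reduce(ops, apply_op, .init = dat)
out
```

Because each step receives the output of the previous one, later operations could also refer to columns created by earlier ones, which the join-based version does not allow.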
Results
original %>%
generic_ops(operations)
Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
A B C AA BB
<int> <int> <int> <chr> <chr>
1 4 20 5 4 20
2 30 29 11 True 29
3 1 27 14 1 27
4 2 21 4 2 21
5 17 19 24 True 19
6 14 25 9 True 25
7 5 22 22 5 22
8 6 13 7 6 13
9 25 4 21 True False
10 12 11 12 True 11
The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.
How's this:
evaluation = data.frame(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
for (i in 1:nrow(evaluation)) {
old <- sym(evaluation$Old_Var[i])
new <- sym(evaluation$New_Var[i])
op <- sym(evaluation$Operation[i])
res <- sym(evaluation$Result[i])
dt <- dt %>%
mutate(!!new := ifelse(!!op, !!res, !!old))
}
EDIT: My last answer doesn't work because, with op created by sym(), rlang tries to find a variable literally named (A*2) > B instead of evaluating the expression. I got this to work using a mix of tidyselect and base R. You can of course follow @Brian's advice and use this solution with pmap. I honestly don't know how well this will perform, though: I think it will evaluate the ifelse once per row, and I am not sure it's a vectorized operation...
dt <- tibble(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
evaluation = tibble(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c('(A*2) > B', '(B*2) <= C'),
Result = c("True", "False")
)
for (i in 1:nrow(evaluation)) {
old <- evaluation$Old_Var[i]
new <- evaluation$New_Var[i]
op <- evaluation$Operation[i]
res <- evaluation$Result[i]
dt <- dt %>%
mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))
}
One way is to rework the conditions first, then pass them to mutate :
conds <- parse(text=evaluation$Operation) %>%
as.list() %>%
setNames(evaluation$New_Var) %>%
imap(~expr(ifelse(!!.,"True", !!sym(.y))))
conds
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#>
#> $BB
#> ifelse((B * 2) <= C, "True", BB)
dt %>% mutate(!!!conds)
#> A B C AA BB
#> 1 1 11 21 1 2
#> 2 2 12 22 1 2
#> 3 3 13 23 1 2
#> 4 4 14 24 1 2
#> 5 5 15 25 1 2
#> 6 6 16 26 1 2
#> 7 7 17 27 1 2
#> 8 8 18 28 1 2
#> 9 9 19 29 1 2
#> 10 10 20 30 1 2
#> 11 11 21 31 True 2
#> 12 12 22 32 True 2
#> 13 13 23 33 True 2
#> 14 14 24 34 True 2
#> 15 15 25 35 True 2
#> 16 16 26 36 True 2
#> 17 17 27 37 True 2
#> 18 18 28 38 True 2
#> 19 19 29 39 True 2
#> 20 20 30 40 True 2
Say I have a data.frame:
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
I want to get a sorted data frame of all rows that have the exact same values of sex, age, and num with any other row in the data frame.
The result should look like this (note that the data frame is sorted by the pairs or groups that are duplicated with each other):
result = read.table(text = "sex age num
M 32 5
M 32 5
F 31 2
F 31 2", header = T, sep = "")
I have tried various combinations of distinct in dplyr and duplicated, but they don't quite get at this use case.
We need duplicated twice: once in the normal direction (top to bottom) and once from bottom to top (fromLast = TRUE). Combining the two with | gives TRUE for every row that has a duplicate in either direction, and we use that for subsetting.
out <- file[duplicated(file)|duplicated(file, fromLast = TRUE),]
out$sex <- factor(out$sex, levels = c("M", "F"))
out1 <- out[do.call(order, out),]
row.names(out1) <- NULL
-output
> out1
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
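A small illustration of why both directions are needed, using a made-up character vector x rather than the data frame:

```r
x <- c("a", "b", "a", "c", "a")

fwd <- duplicated(x)                    # FALSE FALSE TRUE FALSE TRUE: misses the first "a"
bwd <- duplicated(x, fromLast = TRUE)   # TRUE FALSE TRUE FALSE FALSE: misses the last "a"

# Either direction flags an element, so every copy of "a" is caught
all_dups <- fwd | bwd                   # TRUE FALSE TRUE FALSE TRUE
x[all_dups]
```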
The above can be written in tidyverse
library(dplyr)
file %>%
arrange(sex == "F", across(everything())) %>%
filter(duplicated(.)|duplicated(., fromLast = TRUE))
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
An alternative approach:
Here all groups with more than one row will be kept:
library(dplyr)
file %>%
group_by(sex, age, num) %>%
filter(n() > 1) %>%
arrange(.by_group = TRUE) %>%
ungroup()
sex age num
<chr> <int> <int>
1 F 31 2
2 F 31 2
3 M 32 5
4 M 32 5
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
library(vctrs)
library(dplyr, warn = F)
#> Warning: package 'dplyr' was built under R version 4.1.2
file %>%
filter(vec_duplicate_detect(.)) %>%
arrange(across(everything()))
#> sex age num
#> 1 F 31 2
#> 2 F 31 2
#> 3 M 32 5
#> 4 M 32 5
Created on 2022-08-19 by the reprex package (v2.0.1.9000)
A base R option using subset + ave
> subset(file, ave(seq_along(num), sex, age, num, FUN = length) > 1)
sex age num
1 M 32 5
2 F 31 2
7 F 31 2
9 M 32 5
or rbind + split
> do.call(rbind, Filter(function(x) nrow(x) > 1, split(file, ~ sex + age + num)))
sex age num
F.31.2.2 F 31 2
F.31.2.7 F 31 2
M.32.5.1 M 32 5
M.32.5.9 M 32 5
Here is an approach, using .SD[.N>1] by group in data.table
library(data.table)
result = setDT(file)[, i:=.I][, .SD[.N>1],.(sex,age,num)][, i:=NULL]
Output:
sex age num
1: M 32 5
2: M 32 5
3: F 31 2
4: F 31 2
I am in search of an elegant solution that produces a column of values picked from other columns, where the column to pick from in each row is given by a column-offset column, 'relative_column_position'. The desired answer is provided (radio).
My actual data consists of thousands of rows with ~300 different column positions denoted in 'relative_column_position,' so a hand-solution such as this is not in the cards.
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
gaga
radio <- tibble( c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
radio
Base R answer using matrix subsetting -
gaga <- data.frame(gaga)
result <- data.frame(value = gaga[cbind(seq_len(nrow(gaga)),
gaga$relative_column_position + 1)])
result
# value
#1 1
#2 2
#3 3
#4 16
#5 17
#6 18
#7 19
#8 20
#9 21
#10 34
#11 35
#12 36
We use gaga$relative_column_position + 1 because the value columns start at the 2nd column of the dataset. So when gaga$relative_column_position is 1, we actually want to take the value from the 2nd column of gaga.
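The mechanism behind this is indexing with a two-column matrix, where each row of the index is a (row, column) pair. A minimal sketch on a made-up matrix m:

```r
m <- matrix(1:6, nrow = 2)          # columns are (1, 2), (3, 4), (5, 6)

# One (row, column) pair per element we want to pull out
idx <- cbind(c(1, 2), c(3, 1))

picked <- m[idx]                    # m[1, 3] and m[2, 1], i.e. 5 and 2
picked
```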
Here is a base R solution in two steps.
library(tibble)
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
radio <- tibble(c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
unlist(mapply(\(x, i) x[i], gaga[-1], rcp))
#> col_11 col_12 col_13 col_21 col_22 col_23 col_24 col_25 col_26 col_31 col_32
#> 1 2 3 16 17 18 19 20 21 34 35
#> col_33
#> 36
Created on 2022-05-21 by the reprex package (v2.0.1)
As a tibble:
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
radio <- tibble(rcp = unlist(mapply(\(x, i) x[i], gaga[-1], rcp)))
rm(rcp)
radio
#> # A tibble: 12 × 1
#> rcp
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 16
#> 5 17
#> 6 18
#> 7 19
#> 8 20
#> 9 21
#> 10 34
#> 11 35
#> 12 36
df |>
mutate(rel = apply(df, 1, \(x) x[colnames(df)[x["relative_col"]]] ))
to apply to your df example:
gaga |>
mutate(rel = apply(gaga, 1, \(x) x[colnames(gaga)[x["relative_column_position"] + 1]] ))
Assuming you have a relative column to map over, you can use apply and mutate.
I have a dataset like here:
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- c(as.Date("2017-01-01","%Y-%m-%d"), as.Date("2017-02-01","%Y-%m-%d"), as.Date("2017-03-01","%Y-%m-%d"),
as.Date("2017-12-01","%Y-%m-%d"), as.Date("2018-01-01","%Y-%m-%d"), as.Date("2018-02-01","%Y-%m-%d"),
as.Date("2018-03-01","%Y-%m-%d"), as.Date("2018-04-01","%Y-%m-%d"), as.Date("2018-05-01","%Y-%m-%d"),
as.Date("2018-06-01","%Y-%m-%d"))
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)
Now, I would like to create new variables "PD_Q1" up to "PD_Q20" that would equal to the value of "variable_x" when "tenor" is equal to 1 up to 20, i.e., PD_Q1 equal to variable_x's value if tenor = 1, PD_Q2 equal to variable_x's value if tenor = 2, etc. and I would like to do that by customer_id, account_id. I have the code for that, however only for PD_Q1 and I would like to make a loop that loops over i = 1:20 in which I change just tenor == i (this one is easy) and refer to columns PD_Qi in this loop, which is a problem for me. The code for one value of i is here:
my_data[tenor == 1, PD_Q1_temp := variable_x, by = c("customer_id", "account_id")]
list_accs <- my_data[tenor == 1, c("customer_id", "account_id", "PD_Q1_temp")]
list_accs <- unique(list_accs, by = c("customer_id", "account_id"))
names(list_accs) = c("customer_id", "account_id", "PD_Q1")
my_data = merge(x = my_data, y = list_accs, by = c("customer_id", "account_id"), all.x = TRUE)
my_data$PD_Q1_temp <- NULL
Now, can you please advise how to make a loop from 1 to 20, in which tenor, PD_Q1_temp and PD_Q1 would change? Specifically, I don't know how to refer to column names or variables using this i index within a loop.
The expected output for i = 1 and i = 2 (creating variables PD_Q1 and PD_Q2) is here:
> my_data
customer_id account_id time tenor variable_x PD_Q1 PD_Q2
1: 1 11 2017-01-01 1 87 87 90
2: 1 11 2017-02-01 2 90 87 90
3: 1 11 2017-03-01 3 100 87 90
4: 2 55 2017-12-01 1 120 120 130
5: 2 55 2018-01-01 2 130 120 130
6: 2 55 2018-02-01 3 150 120 130
7: 2 55 2018-03-01 4 12 120 130
8: 3 38 2018-04-01 1 13 13 15
9: 3 38 2018-05-01 2 15 13 15
10: 3 38 2018-06-01 3 14 13 15
now I want to create PD_Q3, PD_Q4 etc. in a loop using my code above that creates one such variable.
Can you show your expected output?
I think you can do what you want with tidyr::spread():
library(dplyr)
library(tidyr)
my_data %>%
as_tibble() %>%
select(-time) %>%
mutate(tenor = paste0("PD_Q", tenor)) %>%
spread(tenor, variable_x)
# # A tibble: 3 x 6
# customer_id account_id PD_Q1 PD_Q2 PD_Q3 PD_Q4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 11 87 90 100 NA
# 2 2 55 120 130 150 12
# 3 3 38 13 15 14 NA
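If the goal is instead to keep every original row and add the PD_Q columns next to them, as in the expected output, a loop that builds the column name with paste0() may be closer to what was asked. This is a minimal data.table sketch under a few assumptions: the loop variable k and the helper col are illustrative names, the time column is left out for brevity, and the loop stops at 4 only because the sample data has no higher tenor (use 1:20 on the real data).

```r
library(data.table)

my_data <- data.table(
  customer_id = c("1", "1", "1", "2", "2", "2", "2", "3", "3", "3"),
  account_id  = as.character(c(11, 11, 11, 55, 55, 55, 55, 38, 38, 38)),
  tenor       = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3),
  variable_x  = c(87, 90, 100, 120, 130, 150, 12, 13, 15, 14)
)

for (k in 1:4) {
  col <- paste0("PD_Q", k)   # name of the column to create, e.g. "PD_Q1"
  # Within each account, take variable_x where tenor == k;
  # [1] yields NA when the account has no such tenor
  my_data[, (col) := variable_x[tenor == k][1], by = .(customer_id, account_id)]
}

my_data
```

Wrapping the name in parentheses, (col), is what makes := treat it as a variable holding the column name rather than as a literal column called col.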
I have the following data frame as an example
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
> df
score total1 total2
1 a 1 16
2 b 2 17
3 c 3 18
4 d 4 19
5 e 5 20
6 f 6 21
7 g 7 22
8 h 8 23
9 i 9 24
10 j 10 25
11 k 11 26
12 l 12 27
13 m 13 28
14 n 14 29
15 o 15 30
I would like to aggregate my data frame by sum, grouping together rows that have different names, i.e.
groups sum1 sum2
'a-b-c' 6 51
'd-e-f' 15 60
etc
All the given answers to this kind of question assume that the strings repeat in the row.
The usual aggregate function that I use to obtain the summary delivers a different result:
aggregate(df$total1, by=list(sum1=df$score %in% c('a','b','c'), sum2=df$score %in% c('d','e','f')), FUN=sum)
sum1 sum2 x
1 FALSE FALSE 99
2 TRUE FALSE 6
3 FALSE TRUE 15
If you want a tidyverse solution, here is one possibility:
library(dplyr)
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
df %>%
mutate(groups = case_when(
score %in% c("a","b","c") ~ "a-b-c",
score %in% c("d","e","f") ~ "d-e-f"
)) %>%
group_by(groups) %>%
summarise_if(is.numeric, sum)
returns
# A tibble: 3 x 3
groups total1 total2
<chr> <int> <int>
1 a-b-c 6 51
2 d-e-f 15 60
3 <NA> 99 234
Add a "groups" column with the category value.
df$groups = NA
and then define each group like this:
df$groups[df$score=="a" | df$score=="b" | df$score=="c" ] = "a-b-c"
Finally aggregate by that column.
Here's a solution that works for any sized data frame.
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
# I'm adding a row to demonstrate that the grouping pattern works when the
# number of rows is not equally divisible by 3.
df <- rbind(df, data.frame(score = letters[16], total1 = 16, total2 = 31))
# A vector that represents the correct groupings for the data frame.
groups <- c(rep(1:floor(nrow(df) / 3), each = 3),
rep(floor(nrow(df) / 3) + 1, nrow(df) - length(1:(nrow(df) / 3)) * 3))
# Your method of aggregation by `groups`. I'm going to use `data.table`.
require(data.table)
dt <- as.data.table(df)
dt[, group := groups]
aggDT <- dt[, list(score = paste0(score, collapse = "-"),
total1 = sum(total1), total2 = sum(total2)), by = group][
, group := NULL]
aggDT
score total1 total2
1: a-b-c 6 51
2: d-e-f 15 60
3: g-h-i 24 69
4: j-k-l 33 78
5: m-n-o 42 87
6: p 16 31
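The grouping vector can also be built more compactly. As a sketch of an alternative (not the construction used in the code above), dividing the row index by the group size and rounding up gives the same group numbers, including the short final group; n and size are illustrative names.

```r
n    <- 16   # number of rows, as in the extended df above
size <- 3    # rows per group

# Rows 1-3 map to 1, rows 4-6 to 2, ..., row 16 to 6
groups <- ceiling(seq_len(n) / size)
groups
```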
I have a dataframe which looks like:
Student_ID Number Position
VB-123 10 2
VB-456 15 5
VB-789 25 25
VB-889 12 2
VB-965 15 7
VB-758 45 9
VB-245 25 25
I want to add new column and assign a value based on below conditions:
If only Number is duplicate in entire dataframe then Assign A
If only Position is duplicate in entire dataframe then assign B
If both Number and Position are duplicate then assign C
If none of the duplicate then assign D.
Output would looks like:
Student_ID Number Position Assign
VB-123 10 2 B
VB-456 15 5 A
VB-789 25 25 C
VB-889 12 2 B
VB-965 15 7 A
VB-758 45 9 D
VB-245 25 25 C
With dplyr,
library(dplyr)
students <- data.frame(Student_ID = c("VB-123", "VB-456", "VB-789", "VB-889", "VB-965", "VB-758", "VB-245"),
Number = c(10L, 15L, 25L, 12L, 15L, 45L, 25L),
Position = c(2L, 5L, 25L, 2L, 7L, 9L, 25L))
students2 <- students %>%
mutate_at(vars(Number, Position), funs(n = table(.)[as.character(.)])) %>%
mutate(Assign = case_when(Number_n > 1 & Position_n > 1 ~ 'C',
Number_n > 1 ~ 'A',
Position_n > 1 ~ 'B',
TRUE ~ 'D'))
students2
#> Student_ID Number Position Number_n Position_n Assign
#> 1 VB-123 10 2 1 2 B
#> 2 VB-456 15 5 2 1 A
#> 3 VB-789 25 25 2 2 C
#> 4 VB-889 12 2 1 2 B
#> 5 VB-965 15 7 2 1 A
#> 6 VB-758 45 9 1 1 D
#> 7 VB-245 25 25 2 2 C
As an alternative to the mutate_at line, you could use add_count twice, renaming as necessary. To remove the intermediary columns, tack on select(-matches('_n$')).
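A sketch of that add_count variant, assuming a dplyr version in which add_count() accepts the name argument; students3 is just an illustrative name for the result.

```r
library(dplyr)

students <- data.frame(
  Student_ID = c("VB-123", "VB-456", "VB-789", "VB-889", "VB-965", "VB-758", "VB-245"),
  Number     = c(10L, 15L, 25L, 12L, 15L, 45L, 25L),
  Position   = c(2L, 5L, 25L, 2L, 7L, 9L, 25L)
)

students3 <- students %>%
  add_count(Number,   name = "Number_n") %>%     # how often each Number occurs
  add_count(Position, name = "Position_n") %>%   # how often each Position occurs
  mutate(Assign = case_when(Number_n > 1 & Position_n > 1 ~ "C",
                            Number_n > 1                  ~ "A",
                            Position_n > 1                ~ "B",
                            TRUE                          ~ "D")) %>%
  select(-matches("_n$"))                        # drop the helper count columns

students3
```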
You can more or less replicate the logic in base by assigning to subsets:
students2 <- cbind(students, lapply(students[2:3], function(x) table(x)[as.character(x)]))
students2$Assign <- 'D'
students2$Assign[students2$Number.Freq > 1 & students2$Position.Freq > 1] <- 'C'
students2$Assign[students2$Number.Freq > 1 & students2$Position.Freq == 1] <- 'A'
students2$Assign[students2$Number.Freq == 1 & students2$Position.Freq > 1] <- 'B'
students2[4:7] <- NULL
students2
#> Student_ID Number Position Assign
#> 1 VB-123 10 2 B
#> 2 VB-456 15 5 A
#> 3 VB-789 25 25 C
#> 4 VB-889 12 2 B
#> 5 VB-965 15 7 A
#> 6 VB-758 45 9 D
#> 7 VB-245 25 25 C
Here is an option using base R. Create a list of column names in the order of evaluation ('l1') and pre-assign 'D' to create the 'Assign' column in 'dat'. Then loop through the sequence of 'l1': subset the columns of the data based on the column names in 'l1', use duplicated to find the duplicated elements, and reassign the 'Assign' column to the corresponding element of LETTERS.
l1 <- list("Number", "Position", c("Number", "Position"))
dat$Assign <- rep("D", nrow(dat))
for(i in seq_along(l1)){
df <- dat[l1[[i]]]
i1 <- duplicated(df)|duplicated(df, fromLast = TRUE)
dat$Assign <- replace(dat$Assign, i1, LETTERS[i])
}
-output
dat
# Student_ID Number Position Assign
#1 VB-123 10 2 B
#2 VB-456 15 5 A
#3 VB-789 25 25 C
#4 VB-889 12 2 B
#5 VB-965 15 7 A
#6 VB-758 45 9 D
#7 VB-245 25 25 C
A solution using dplyr.
library(dplyr)
dat2 <- dat %>% count(Number)
dat3 <- dat %>% count(Position)
dat4 <- dat %>% count(Number, Position)
dat5 <- dat %>%
left_join(dat2, by = "Number") %>%
left_join(dat3, by = "Position") %>%
left_join(dat4, by = c("Number", "Position")) %>%
mutate(Assign = case_when(
n > 1 ~ "C",
n.x > 1 & n.y == 1 ~ "A",
n.y > 1 & n.x == 1 ~ "B",
TRUE ~ "D"
)) %>%
select(-n.x, -n.y, -n)
dat5
# Student_ID Number Position Assign
# 1 VB-123 10 2 B
# 2 VB-456 15 5 A
# 3 VB-789 25 25 C
# 4 VB-889 12 2 B
# 5 VB-965 15 7 A
# 6 VB-758 45 9 D
# 7 VB-245 25 25 C
DATA
dat <- read.table(text = "Student_ID Number Position
'VB-123' 10 2
'VB-456' 15 5
'VB-789' 25 25
'VB-889' 12 2
'VB-965' 15 7
'VB-758' 45 9
'VB-245' 25 25",
header = TRUE, stringsAsFactors = FALSE)