R: Generate variable in data.frame using minif on other columns

I have a data.frame that contains a variable for an observation date (Portfolio.Date), IDs (Loan.Number) and a status indicator (1-8) (Current.Delinquency.Code). I want to calculate a new variable that is TRUE only the first time that the status indicator goes above 3 for a given ID. In Excel I would write =[Portfolio.Date]=minif([Portfolio.Date], [Loan.Number], [#Loan.Number], [Current.Delinquency.Code], ">3"), but I cannot figure out how to do this in R. Can anybody help me with this?
Thank you very much!
What I am looking for is the formula for the [Delinquent] column in the example data below, which is TRUE only on the first observation where "Current.Delinquency.Code" for that "Loan.Number" is above 3.
Portfolio.Date Loan.Number Current.Delinquency.Code Delinquent
2022/01/01 1 1 FALSE
2022/02/01 1 4 TRUE
2022/03/01 1 4 FALSE
2022/04/01 1 4 FALSE
2022/01/01 2 1 FALSE
2022/02/01 2 1 FALSE
2022/03/01 2 1 FALSE
2022/04/01 2 1 FALSE
2022/01/01 3 1 FALSE
2022/02/01 3 3 FALSE
2022/03/01 3 4 TRUE
2022/04/01 3 4 FALSE

If I understood you correctly, this is one possible solution:
library(dplyr)
# to make it reproducible:
set.seed(1)
# sample data
df <- data.frame(Portfolio.Date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2022-01-30"), by = "days"),
                 Loan.Number = rep(c(1,2), 15),
                 Current.Delinquency.Code = sample(1:8, size = 30, replace = TRUE)) %>%
  # group by Loan.Number (the data arrives through the pipe, so it is not passed again)
  dplyr::group_by(Loan.Number) %>%
  # order by Portfolio.Date
  dplyr::arrange(Portfolio.Date) %>%
  # flag rows above 3 where the running count of such rows is 1 (first occurrence)
  dplyr::mutate(nc = ifelse(Current.Delinquency.Code > 3 & cumsum(Current.Delinquency.Code > 3) == 1, 1, 0)) %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
# print the result
df
# A tibble: 30 x 4
Portfolio.Date Loan.Number Current.Delinquency.Code nc
<date> <dbl> <int> <dbl>
1 2022-01-01 1 1 0
2 2022-01-02 2 4 1
3 2022-01-03 1 7 1
4 2022-01-04 2 1 0
5 2022-01-05 1 2 0
6 2022-01-06 2 5 0
7 2022-01-07 1 7 0
8 2022-01-08 2 3 0
9 2022-01-09 1 6 0
10 2022-01-10 2 2 0
# ... with 20 more rows
# i Use `print(n = ...)` to see more rows
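The same cumsum idea can be checked against the example table from the question. This is a minimal sketch, not the answerer's exact code: the names loans, above and result are made up for illustration, and the flag is returned as a logical to match the requested Delinquent column.

```r
library(dplyr)

# Example data copied from the question (loans is a hypothetical name)
loans <- data.frame(
  Portfolio.Date = rep(as.Date(c("2022-01-01", "2022-02-01",
                                 "2022-03-01", "2022-04-01")), times = 3),
  Loan.Number = rep(1:3, each = 4),
  Current.Delinquency.Code = c(1, 4, 4, 4,  1, 1, 1, 1,  1, 3, 4, 4)
)

result <- loans %>%
  group_by(Loan.Number) %>%
  # sort by date within each loan so "first" means earliest
  arrange(Portfolio.Date, .by_group = TRUE) %>%
  mutate(
    above = Current.Delinquency.Code > 3,
    # TRUE only on the row where the running count of "above" rows hits 1
    Delinquent = above & cumsum(above) == 1
  ) %>%
  ungroup() %>%
  select(-above)

result$Delinquent
```

Loans that never go above 3 (loan 2 here) simply stay FALSE throughout, which avoids the empty-minimum case a direct translation of minif would have to handle.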

Related

Select a row on condition plus the next one

This is my example code:
pdata <- tibble(
id = rep(1:5, each = 5),
time = rep(2016:2020, times = 5),
value = c(c(1,1,1,0,1), c(1,1,0,1,1), c(1,1,1,0,1), c(1,1,1,1,1), c(1,0,1,1,1))
)
How can I select all the rows with a 0 plus just the one next row (regardless of the value)?
I'd like to have an outcome like that:
# A tibble: 25 × 4
id time value condition
<int> <int> <dbl> <lgl>
1 1 2016 1 TRUE
2 1 2017 1 TRUE
3 1 2018 1 TRUE
4 1 2019 0 FALSE
5 1 2020 1 FALSE
6 2 2016 1 TRUE
7 2 2017 1 TRUE
8 2 2018 0 FALSE
9 2 2019 1 FALSE
10 2 2020 1 TRUE
# … with 15 more rows
Importantly, I want to be able to filter both rows (0 and the next one) out of my data set afterwards.
I've tried the mutate() function and a for loop, but nothing has worked, and neither have other posts on Stack Overflow. Thanks for your help!
Using a combination of filter() and lag() from the tidyverse
library(tidyverse)
pdata %>%
filter(value == 0 | lag(value) == 0)
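If the "next row" should never cross from one id into the next (an assumption, the question does not say so explicitly), the lag can be computed per group, and the result stored in the condition column shown in the desired output. A minimal sketch, where flagged and kept are illustrative names:

```r
library(dplyr)

pdata <- tibble(
  id = rep(1:5, each = 5),
  time = rep(2016:2020, times = 5),
  value = c(c(1,1,1,0,1), c(1,1,0,1,1), c(1,1,1,0,1), c(1,1,1,1,1), c(1,0,1,1,1))
)

flagged <- pdata %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%
  # FALSE for a 0 row and for the row right after it within the same id;
  # default = 1 keeps the first row of each id from comparing against NA
  mutate(condition = !(value == 0 | lag(value, default = 1) == 0)) %>%
  ungroup()

# Drop both the 0 rows and their successors
kept <- filter(flagged, condition)
```

Keeping the flag as a column makes it possible to inspect which rows will be dropped before actually filtering them out.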

How can I add missing month values and remove duplicates with dplyr in R?

I have a dataset that looks like this :
var date
A 1/1/2022
A 1/2/2022
A 1/3/2022
B 1/1/2022
B 1/3/2022
C 1/1/2022
C 1/1/2022
C 1/2/2022
C 1/2/2022
C 1/3/2022
I want the data arranged by month and by var. If a month is not recorded (missing) for a var, I want it to appear in a new column named Month, and I want to mutate another column that checks whether there was an entry in that month (a logical condition). Some vars, for example C, have more than one entry in a month; these should count as one (distinct).
Ideally it should look like this:
var Quarter Month Condition
A 1 1 TRUE
A 1 2 TRUE
A 1 3 TRUE
B 1 1 TRUE
B 1 2 FALSE
B 1 3 TRUE
C 1 1 TRUE
C 1 2 TRUE
C 1 3 TRUE
As a start i have tried this one in R :
var = c(rep("A",3),rep("B",2),rep("C",5));var
date = c(as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/02/01"),as.Date("2022/03/01"))
data = tibble(var,date)
quarter = 1
data%>%
dplyr::mutate(month = lubridate::month(date),
Quarter = quarter)
But I don't know how to add the missing months and check the condition.
Any help ?
You can use complete() to fill in the missing months and then check whether they have an associated date, then use distinct() to find the unique combinations.
library(dplyr)
library(tidyr)
var = c(rep("A",3),rep("B",2),rep("C",5))
date = c(as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/02/01"),as.Date("2022/03/01"))
data = tibble(var,date)
quarter = 1
data %>%
mutate(month = lubridate::month(date)) %>%
complete(var, month) %>%
mutate(Quarter = quarter,
Condition = !is.na(date)) %>%
distinct(var, month, Quarter, Condition)
#> # A tibble: 9 × 4
#> var month Quarter Condition
#> <chr> <dbl> <dbl> <lgl>
#> 1 A 1 1 TRUE
#> 2 A 2 1 TRUE
#> 3 A 3 1 TRUE
#> 4 B 1 1 TRUE
#> 5 B 2 1 FALSE
#> 6 B 3 1 TRUE
#> 7 C 1 1 TRUE
#> 8 C 2 1 TRUE
#> 9 C 3 1 TRUE
Created on 2022-06-01 by the reprex package (v2.0.1)
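Note that complete(var, month) can only fill in months that occur for at least one var. If a month might be missing everywhere, the full range can be passed explicitly. A minimal sketch on made-up data (small and out are illustrative names, and the month range 1:3 is an assumption about the quarter of interest):

```r
library(dplyr)
library(tidyr)

# Toy data in which month 2 is missing for every var
small <- tibble(
  var = c("A", "A", "B"),
  date = as.Date(c("2022-01-01", "2022-03-01", "2022-01-01"))
)

out <- small %>%
  mutate(month = as.integer(format(date, "%m"))) %>%
  # one row per var/month that actually occurs
  distinct(var, month) %>%
  mutate(Condition = TRUE) %>%
  # expand to every var x month 1..3; rows added here get Condition = FALSE
  complete(var, month = 1:3, fill = list(Condition = FALSE))
```

Here the fill argument sets the flag directly, so no separate is.na() check is needed afterwards.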
You can approach it this way:
library(lubridate)
library(dplyr)
library(tidyr)
# `data` is the tibble built in the question
df <- data %>% mutate(month=month(date),quarter=quarter(month))
left_join(
expand(df, var,month,quarter),
select(df,var, month) %>% mutate(condition=TRUE) %>% distinct()
) %>% mutate(condition=!is.na(condition))
Output
var month quarter condition
<chr> <dbl> <int> <lgl>
1 A 1 1 TRUE
2 A 2 1 TRUE
3 A 3 1 TRUE
4 B 1 1 TRUE
5 B 2 1 FALSE
6 B 3 1 TRUE
7 C 1 1 TRUE
8 C 2 1 TRUE
9 C 3 1 TRUE

How to change iteratively the values of a column in R without a loop?

I would like to iteratively change the values of a column (value2 in the example). value2 at time i depends on value1 and on the updated value2 at times i and i-1.
Time values are stored in ascending order.
The treatment is done separately for each value of the group column.
But as shown in my example, I cannot manage to update value2 with accumulate2 (purrr package).
Maybe someone could give me some advice on how to do this.
Thank you.
input <- data.frame(group=c(1,1,1,2,2,2,2),
time=c(1,2,3,1,2,3,4),
value1=c(4,2,2,3,3,3,3),
value2=c(4,2,1,3,3,1,1))
input<-arrange(input, group,time)
my_function <- function(df) {
df %>%
as_tibble() %>%
group_by(group) %>%
mutate(value2=purrr::accumulate2(.x = value2, .y = ((value1==lag(value1))
& (lag(value2)==value1)
& (value1!=value2))[-1],
.f = function(.i_1, .i, .y) {
if (.y) {.i_1} else {.i}
}) %>% unlist())
}
> input
group time value1 value2
1 1 1 4 4
2 1 2 2 2
3 1 3 2 1
4 2 1 3 3
5 2 2 3 3
6 2 3 3 1
7 2 4 3 1
output <- my_function(input)
> output
group time value1 value2
1 1 1 4 4
2 1 2 2 2
3 1 3 2 2 -> data change (OK)
4 2 1 3 3
5 2 2 3 3
6 2 3 3 3 -> data change (OK)
7 2 4 3 1 -> no data change / should be replaced by 3
It seems that your problem lies in your algorithm. Unfortunately, as you didn't explain it here, we cannot help you in that matter.
purrr::accumulate2 can be hard to use, so I advise you to split your code as much as possible. This will make your code much more readable, and will make debugging and finding errors much easier.
For instance, consider this:
library(tidyverse)
input <- data.frame(group=c(1,1,1,2,2,2,2),
time=c(1,2,3,1,2,3,4),
value1=c(4,2,2,3,3,3,3),
value2=c(4,2,1,3,3,1,1))
input <- arrange(input, group,time)
# document your functions
#' @param .i_1 this is ...
#' @param .i this is ...
#' @param .y this is ...
my_accu_function = function(.i_1, .i, .y) {
if(.y) {.i_1} else {.i}
}
my_function <- function(df) {
df %>%
as_tibble() %>%
group_by(group) %>%
mutate(
cond = value1==lag(value1) &
lag(value2)==value1 &
value1!=value2,
value2_update=purrr::accumulate2(.x = value2,
.y = cond[-1],
.f = my_accu_function) %>% unlist()
)
}
input
#> group time value1 value2
#> 1 1 1 4 4
#> 2 1 2 2 2
#> 3 1 3 2 1
#> 4 2 1 3 3
#> 5 2 2 3 3
#> 6 2 3 3 1
#> 7 2 4 3 1
output = my_function(input)
output
#> # A tibble: 7 x 6
#> # Groups: group [2]
#> group time value1 value2 cond value2_update
#> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
#> 1 1 1 4 4 FALSE 4
#> 2 1 2 2 2 FALSE 2
#> 3 1 3 2 1 TRUE 2
#> 4 2 1 3 3 FALSE 3
#> 5 2 2 3 3 FALSE 3
#> 6 2 3 3 1 TRUE 3
#> 7 2 4 3 1 FALSE 1
stopifnot(output$value2_update[7]==3)
#> Error: output$value2_update[7] == 3 is not TRUE
Created on 2022-05-11 by the reprex package (v2.0.1)
You can see that cond is FALSE in the end, so accumulate2 did its job putting the current value 1 and not the previous value 3.
If you explain your algorithm to us, maybe we can help you with setting the proper condition cond so that you get the right output.
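One possible reading of the question, and it is only a guess at the intended rule, is that the comparison should use the previously updated value2 rather than the original lag(value2). Under that assumption the condition has to be evaluated inside the accumulation step, because it depends on the running result. A minimal sketch with base R's Reduce(), where carry_forward and step are hypothetical helper names:

```r
library(dplyr)

input <- data.frame(group  = c(1, 1, 1, 2, 2, 2, 2),
                    time   = c(1, 2, 3, 1, 2, 3, 4),
                    value1 = c(4, 2, 2, 3, 3, 3, 3),
                    value2 = c(4, 2, 1, 3, 3, 1, 1))

# Hypothetical helper: keeps the previous *updated* value2 when
# value1 is unchanged, equals that previous value, and differs from value2
carry_forward <- function(value1, value2) {
  same_as_prev <- c(FALSE, value1[-1] == value1[-length(value1)])
  step <- function(prev_updated, i) {
    if (same_as_prev[i] && prev_updated == value1[i] && value1[i] != value2[i]) {
      prev_updated
    } else {
      value2[i]
    }
  }
  # accumulate = TRUE returns every intermediate result, one per row
  Reduce(step, seq_along(value2)[-1], value2[1], accumulate = TRUE)
}

output <- input %>%
  arrange(group, time) %>%
  group_by(group) %>%
  mutate(value2_update = carry_forward(value1, value2)) %>%
  ungroup()

output$value2_update
```

With this rule, row 7 compares against the updated value 3 from row 6 instead of the original 1, which is what the question marks as the expected result.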

Mutating column to dataframe using apply function by group

I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems like it is not applied by group and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using the arrange() and lead() function but since some timeslot values are shared by many stations in my larger dataset I could not get this to work
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
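A likely reason the apply() version in the question misbehaves on grouped data is that, inside mutate(), the magrittr placeholder . still refers to the whole data frame coming through the pipe, while bare column names are evaluated once per group. A small sketch on made-up data (toy and sizes are illustrative names):

```r
library(dplyr)

toy <- data.frame(id = c(1, 1, 2, 2), timeslot = c(14, 20, 15, 16))

sizes <- toy %>%
  group_by(id) %>%
  mutate(
    n_via_dot    = length(.$timeslot),  # whole data frame: 4 in every row
    n_via_column = length(timeslot)     # current group only: 2 in every row
  ) %>%
  ungroup()
```

So apply(., 1, ...) returns one value per row of the full data, and ifelse() then takes only as many of those values as the current group has rows, which mixes results across groups. Using the column name directly, as in (timeslot + 1) %in% timeslot, keeps the check inside each group.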
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE

Filter only rows that are duplicated using dplyr

I have been trying for a while now, without success, to solve a problem close to the one presented in this issue. It consists of filtering for items that are duplicated within a group while also flagging the original item used for the comparison, using dplyr (I prefer dplyr over base or data.table).
The solution I tried is as follows:
> a <- data.frame(name=c("a","b","b","b","a","a"),position=c(1,2,1,2,2,2),achieved=c(1,0,0,0,1,0))
> a %>% group_by(name,achieved) %>% mutate(duplicated=duplicated(position))
# A tibble: 6 x 4
# Groups: name, achieved [3]
name position achieved duplicated
<fct> <dbl> <dbl> <lgl>
1 a 1 1 FALSE
2 b 2 0 FALSE
3 b 1 0 FALSE
4 b 2 0 TRUE
5 a 2 1 FALSE
6 a 2 0 FALSE
I know that this solution is close to the one I desire, but it only brings me the values that are duplicated after the first one, but I would also like a dplyr solution that gives me all duplicate values per group, so probably this could help me improve my dplyr understanding.
The desired output would be as follows:
# A tibble: 6 x 4
# Groups: name, achieved [3]
name position achieved duplicated
<fct> <dbl> <dbl> <lgl>
1 a 1 1 FALSE
2 b 2 0 TRUE
3 b 1 0 FALSE
4 b 2 0 TRUE
5 a 2 1 FALSE
6 a 2 0 FALSE
Thanks in advance.
It seems like you want to group by all of name, position, and achieved and then just see if there is more than one record in that group:
a %>% group_by(name,achieved, position) %>% mutate(duplicated = n()>1)
# name position achieved duplicated
# <fct> <dbl> <dbl> <lgl>
# 1 a 1 1 FALSE
# 2 b 2 0 TRUE
# 3 b 1 0 FALSE
# 4 b 2 0 TRUE
# 5 a 2 1 FALSE
# 6 a 2 0 FALSE
Try this:
a %>%
group_by_all() %>%
mutate(duplicated = n() > 1)
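Another option, staying closer to the attempt in the question, is to keep the original grouping by name and achieved and combine duplicated() with its fromLast = TRUE variant, so that the first occurrence is flagged as well. A minimal sketch, where flagged is an illustrative name:

```r
library(dplyr)

a <- data.frame(name = c("a", "b", "b", "b", "a", "a"),
                position = c(1, 2, 1, 2, 2, 2),
                achieved = c(1, 0, 0, 0, 1, 0))

flagged <- a %>%
  group_by(name, achieved) %>%
  # scanning forwards marks later repeats, scanning backwards marks earlier
  # ones; together they mark every member of a duplicated set
  mutate(duplicated = duplicated(position) | duplicated(position, fromLast = TRUE)) %>%
  ungroup()

flagged$duplicated
```

This gives the same flags as counting rows per name, achieved and position group, and may be convenient when the grouping should stay as it was for later steps.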
