This is my example code:
pdata <- tibble(
id = rep(1:5, each = 5),
time = rep(2016:2020, times = 5),
value = c(c(1,1,1,0,1), c(1,1,0,1,1), c(1,1,1,0,1), c(1,1,1,1,1), c(1,0,1,1,1))
)
How can I select all the rows with a 0 plus just the one next row (regardless of the value)?
I'd like to get an outcome like this:
# A tibble: 25 × 4
id time value condition
<int> <int> <dbl> <lgl>
1 1 2016 1 TRUE
2 1 2017 1 TRUE
3 1 2018 1 TRUE
4 1 2019 0 FALSE
5 1 2020 1 FALSE
6 2 2016 1 TRUE
7 2 2017 1 TRUE
8 2 2018 0 FALSE
9 2 2019 1 FALSE
10 2 2020 1 TRUE
# … with 15 more rows
Importantly, I want to be able to filter both rows (0 and the next one) out of my data set afterwards.
I've tried the mutate() function and a for loop, but nothing has worked, and other posts on Stack Overflow haven't helped either. Thanks for your help!
You can use a combination of filter() and lag() from the tidyverse to select those rows:
library(tidyverse)
pdata %>%
filter(value == 0 | lag(value) == 0)
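The question also asks for a condition column and for the ability to drop the flagged rows afterwards. A minimal sketch of that, assuming the "next row" should be looked up within each id (the grouping by id and the names flagged and kept are assumptions, not part of the original answer):

```r
library(dplyr)

pdata <- tibble(
  id = rep(1:5, each = 5),
  time = rep(2016:2020, times = 5),
  value = c(1,1,1,0,1, 1,1,0,1,1, 1,1,1,0,1, 1,1,1,1,1, 1,0,1,1,1)
)

flagged <- pdata %>%
  group_by(id) %>%
  # FALSE for a 0 row and for the row right after a 0 within the same id;
  # default = 1 keeps the first row of each id from comparing against NA
  mutate(condition = !(value == 0 | lag(value, default = 1) == 0)) %>%
  ungroup()

# drop the 0 rows and their successors
kept <- filter(flagged, condition)
```

Grouping before lag() matters when an id ends in 0: without it, the first row of the next id would be flagged as well.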
I have a data.frame that contains a variable for an observation date (Portfolio.Date), IDs (Loan.Number) and a status indicator (1-8) (Current.Delinquency.Code). I want to calculate a new variable that is true only the first time that the status indicator goes above 3 for any ID. In Excel I would write =[Portfolio.Date]=minif([Portfolio.Date], [Loan.Number], [#Loan.Number], [Current.Delinquency.Code], ">3"), but I cannot figure out how to do this in R. Can anybody help me with this?
Thank you very much!
What I am looking for is the formula for the [Delinquent] column in the example data below, which jumps to "TRUE" the first time that an observation of "Current.Delinquency.Code" for that "Loan.Number" is above 3.
Portfolio.Date Loan.Number Current.Delinquency.Code Delinquent
2022/01/01 1 1 FALSE
2022/02/01 1 4 TRUE
2022/03/01 1 4 FALSE
2022/04/01 1 4 FALSE
2022/01/01 2 1 FALSE
2022/02/01 2 1 FALSE
2022/03/01 2 1 FALSE
2022/04/01 2 1 FALSE
2022/01/01 3 1 FALSE
2022/02/01 3 3 FALSE
2022/03/01 3 4 TRUE
2022/04/01 3 4 FALSE
If I understood you correctly, this is one possible solution:
library(dplyr)
# to make it reproducible:
set.seed(1)
# sample data
df <- data.frame(Portfolio.Date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2022-01-30"), by = "days"),
                 Loan.Number = rep(c(1,2), 15),
                 Current.Delinquency.Code = sample(1:8, size = 30, replace = TRUE))
df %>%
  # group by Loan.Number
  dplyr::group_by(Loan.Number) %>%
  # order by Portfolio.Date
  dplyr::arrange(Portfolio.Date) %>%
  # flag rows above 3 where the running count of such rows is 1 (first occurrence)
  dplyr::mutate(nc = ifelse(Current.Delinquency.Code > 3 & cumsum(Current.Delinquency.Code > 3) == 1, 1, 0)) %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
# A tibble: 30 x 4
Portfolio.Date Loan.Number Current.Delinquency.Code nc
<date> <dbl> <int> <dbl>
1 2022-01-01 1 1 0
2 2022-01-02 2 4 1
3 2022-01-03 1 7 1
4 2022-01-04 2 1 0
5 2022-01-05 1 2 0
6 2022-01-06 2 5 0
7 2022-01-07 1 7 0
8 2022-01-08 2 3 0
9 2022-01-09 1 6 0
10 2022-01-10 2 2 0
# ... with 20 more rows
# i Use `print(n = ...)` to see more rows
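Applied to the example data from the question, the same cumsum() idea can return a logical Delinquent column directly. This is a sketch only; the object names loans and result and the helper column over are made up for illustration:

```r
library(dplyr)

# example data from the question
loans <- data.frame(
  Portfolio.Date = rep(as.Date(c("2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01")), times = 3),
  Loan.Number = rep(1:3, each = 4),
  Current.Delinquency.Code = c(1, 4, 4, 4,  1, 1, 1, 1,  1, 3, 4, 4)
)

result <- loans %>%
  group_by(Loan.Number) %>%
  # sort by date within each loan so "first time" follows calendar order
  arrange(Portfolio.Date, .by_group = TRUE) %>%
  mutate(
    over = Current.Delinquency.Code > 3,
    # TRUE only on the first row per loan where the code exceeds 3
    Delinquent = over & cumsum(over) == 1
  ) %>%
  ungroup() %>%
  select(-over)
```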
I have a dataset that looks like this:
var  date
A    1/1/2022
A    1/2/2022
A    1/3/2022
B    1/1/2022
B    1/3/2022
C    1/1/2022
C    1/1/2022
C    1/2/2022
C    1/2/2022
C    1/3/2022
I want the data arranged by month and by var. If a month is missing for a given var, I want it added in a new column named Month, and I want to mutate() another column that checks whether there was an entry in that month (a logical condition). Some entries, for example C, have more than one record in the same month; these should count only once (distinct).
Ideally it should look like this:
var  Quarter  Month  Condition
A    1        1      TRUE
A    1        2      TRUE
A    1        3      TRUE
B    1        1      TRUE
B    1        2      FALSE
B    1        3      TRUE
C    1        1      TRUE
C    1        2      TRUE
C    1        3      TRUE
As a start I have tried this in R:
var = c(rep("A",3),rep("B",2),rep("C",5));var
date = c(as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/02/01"),as.Date("2022/03/01"))
data = tibble(var,date)
quarter = 1
data%>%
dplyr::mutate(month = lubridate::month(date),
Quarter = quarter)
But I don't know how to add the missing months and check the condition.
Any help?
You can use complete() to fill in the missing months and then check whether they have an associated date, then use distinct() to find the unique combinations.
library(dplyr)
library(tidyr)
var = c(rep("A",3),rep("B",2),rep("C",5))
date = c(as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/03/01"),
as.Date("2022/01/01"),as.Date("2022/01/01"),as.Date("2022/02/01"),as.Date("2022/02/01"),as.Date("2022/03/01"))
data = tibble(var,date)
quarter = 1
data %>%
mutate(month = lubridate::month(date)) %>%
complete(var, month) %>%
mutate(Quarter = quarter,
Condition = !is.na(date)) %>%
distinct(var, month, Quarter, Condition)
#> # A tibble: 9 × 4
#> var month Quarter Condition
#> <chr> <dbl> <dbl> <lgl>
#> 1 A 1 1 TRUE
#> 2 A 2 1 TRUE
#> 3 A 3 1 TRUE
#> 4 B 1 1 TRUE
#> 5 B 2 1 FALSE
#> 6 B 3 1 TRUE
#> 7 C 1 1 TRUE
#> 8 C 2 1 TRUE
#> 9 C 3 1 TRUE
Created on 2022-06-01 by the reprex package (v2.0.1)
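One caveat: complete(var, month) only expands over the months that occur somewhere in the data, so a month missing for every var would not be added. A sketch that passes the expected months explicitly, assuming months 1 to 3 are the ones of interest (the small data set, the 1:3 range and the arithmetic Quarter are illustrative assumptions):

```r
library(dplyr)
library(tidyr)
library(lubridate)

# small illustrative data set: month 2 is missing for every var
data <- tibble(
  var = c("A", "A", "B"),
  date = as.Date(c("2022-01-01", "2022-03-01", "2022-01-01"))
)

result <- data %>%
  mutate(month = as.integer(month(date))) %>%
  # one row per observed var/month pair
  distinct(var, month) %>%
  mutate(Condition = TRUE) %>%
  # expand to every var crossed with months 1 to 3; new rows get FALSE
  complete(var, month = 1:3, fill = list(Condition = FALSE)) %>%
  # quarter derived from the month number
  mutate(Quarter = (month - 1L) %/% 3L + 1L)
```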
You can approach it this way:
library(lubridate)
library(dplyr)
library(tidyr)
df <- df %>% mutate(month=month(date),quarter=quarter(month))
left_join(
expand(df, var,month,quarter),
select(df,var, month) %>% mutate(condition=TRUE) %>% distinct()
) %>% mutate(condition=!is.na(condition))
Output
var month quarter condition
<chr> <dbl> <int> <lgl>
1 A 1 1 TRUE
2 A 2 1 TRUE
3 A 3 1 TRUE
4 B 1 1 TRUE
5 B 2 1 FALSE
6 B 3 1 TRUE
7 C 1 1 TRUE
8 C 2 1 TRUE
9 C 3 1 TRUE
I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems like it is not applied by group and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using the arrange() and lead() function but since some timeslot values are shared by many stations in my larger dataset I could not get this to work
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
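The grouped attempt in the question misbehaves because `.` in a magrittr pipe refers to the whole data frame piped into mutate(), not to the current group, so apply(., 1, ...) checks every row against all ids. For conditions that are harder to vectorise, one option is to loop over the group's own column instead. A sketch, where the helper argument names slot and slots are made up for illustration:

```r
library(dplyr)

id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2

result <- df %>%
  group_by(id) %>%
  arrange(id, timeslot) %>%
  mutate(
    # inside a grouped mutate(), `timeslot` is only the current group's slice,
    # so each row is compared against rows of the same id
    match = station == s &
      vapply(timeslot,
             function(slot, slots) any(slots == slot + 1),
             logical(1),
             slots = timeslot)
  ) %>%
  ungroup()
```

This is slower than the %in% version above, but the body of the function can hold an arbitrary per-row check.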
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE
My apologies if this topic has been discussed somewhere; I was not able to find it.
I was trying to apply a fairly simple conditional mutate() with dplyr when I noticed something that seemed quite weird to me. Let me explain:
Let's say that in a data.frame I want to modify a variable (here VALUE) according to the value of another variable (here COND) in a specific row of each group.
The modification is: "if the last value of COND within the current group is 0, then set VALUE to 99 for the current group, otherwise do nothing".
Here's what I naturally wrote:
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(c(1,0,0), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = ifelse(COND[n()] == 0,
99,
VALUE))
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 4 <
# 6 2 1 4 <
The propagation went well for the first group: VALUE is now 99, which is legitimate (COND == 0 in row 3). However, I was surprised to see that VALUE also changed for the other group, where the first value of VALUE within the group was propagated even though the condition is not fulfilled.
Can someone enlighten me on what I am misunderstanding here?
Expected result was:
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 5 <
# 6 2 1 6 <
[edit] I also tried using case_when() which apparently I do not manage well either:
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(
COND[n()] == 0 ~ 99,
TRUE ~ VALUE
))
# Error: must be a double vector, not an integer vector
One workaround would be to calculate an intermediate variable, but I am quite surprised to have to do that.
Possible solution:
tab %>%
group_by(ID) %>%
mutate(TEST_COND = COND[n()] == 0,
VALUE = ifelse(TEST_COND, 99, VALUE))
# ID COND VALUE TEST_COND
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 99 TRUE
# 2 1 0 99 TRUE
# 3 1 0 99 TRUE
# 4 2 1 4 FALSE
# 5 2 1 5 FALSE
# 6 2 1 6 FALSE
# Yeepee
Try this
library(dplyr)
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(1, rep(0,2), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(last(COND) == 0 ~ 99L,
TRUE ~ VALUE))
#> # A tibble: 6 x 3
#> # Groups: ID [2]
#> ID COND VALUE
#> <dbl> <dbl> <int>
#> 1 1 1 99
#> 2 1 0 99
#> 3 1 0 99
#> 4 2 1 4
#> 5 2 1 5
#> 6 2 1 6
Created on 2020-05-12 by the reprex package (v0.3.0)
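As for why the original ifelse() call propagated the first value: ifelse() returns a result with the length of its test. COND[n()] == 0 has length 1, so for the second group ifelse() returns only the first element of VALUE (4), and mutate() recycles it across the group. A sketch of an alternative that uses a plain if/else, which returns the whole vector (an illustration, not the answerer's code):

```r
library(dplyr)

tab <- data.frame(
  ID = c(rep(1, 3), rep(2, 3)),
  COND = c(1, 0, 0, 1, 1, 1),
  VALUE = 1:6
)

# ifelse() follows the length of its test:
# ifelse(FALSE, 99, 4:6) returns 4, not 4 5 6

result <- tab %>%
  group_by(ID) %>%
  # if/else evaluates one scalar condition per group and
  # returns either a full vector of 99s or VALUE unchanged
  mutate(VALUE = if (last(COND) == 0) rep(99L, n()) else VALUE) %>%
  ungroup()
```

The intermediate TEST_COND column in the question works for the same reason: once the test has one element per row, ifelse() returns one element per row.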
I have been trying for a while now to solve a problem close to the one presented in this issue, with no success. It consists of flagging items that are duplicated within a group, including the original occurrence used for the comparison, with dplyr (I prefer dplyr over base or data.table).
The solution I tried is as follows:
> a <- data.frame(name=c("a","b","b","b","a","a"),position=c(1,2,1,2,2,2),achieved=c(1,0,0,0,1,0))
> a %>% group_by(name,achieved) %>% mutate(duplicated=duplicated(position))
# A tibble: 6 x 4
# Groups: name, achieved [3]
name position achieved duplicated
<fct> <dbl> <dbl> <lgl>
1 a 1 1 FALSE
2 b 2 0 FALSE
3 b 1 0 FALSE
4 b 2 0 TRUE
5 a 2 1 FALSE
6 a 2 0 FALSE
I know that this solution is close to what I want, but it only flags the values that are duplicated after the first one. I would like a dplyr solution that flags all duplicated values per group, including the first occurrence; this would probably also improve my understanding of dplyr.
The desired output would be as follows:
# A tibble: 6 x 4
# Groups: name, achieved [3]
name position achieved duplicated
<fct> <dbl> <dbl> <lgl>
1 a 1 1 FALSE
2 b 2 0 TRUE
3 b 1 0 FALSE
4 b 2 0 TRUE
5 a 2 1 FALSE
6 a 2 0 FALSE
Thanks in advance.
It seems like you want to group by all of name, position, and achieved and then just see if there is more than one record in that group:
a %>% group_by(name,achieved, position) %>% mutate(duplicated = n()>1)
# name position achieved duplicated
# <fct> <dbl> <dbl> <lgl>
# 1 a 1 1 FALSE
# 2 b 2 0 TRUE
# 3 b 1 0 FALSE
# 4 b 2 0 TRUE
# 5 a 2 1 FALSE
# 6 a 2 0 FALSE
Try this:
a %>%
group_by_all() %>%
mutate(duplicated = n() > 1)
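The grouping from the question (name and achieved) can also be kept by checking duplicated() in both directions, which marks the first occurrence as well as the later ones. A sketch, with result as an illustrative name:

```r
library(dplyr)

a <- data.frame(
  name = c("a", "b", "b", "b", "a", "a"),
  position = c(1, 2, 1, 2, 2, 2),
  achieved = c(1, 0, 0, 0, 1, 0)
)

result <- a %>%
  group_by(name, achieved) %>%
  # duplicated() marks repeats after the first occurrence;
  # fromLast = TRUE marks repeats before the last one, so the union marks all of them
  mutate(duplicated = duplicated(position) | duplicated(position, fromLast = TRUE)) %>%
  ungroup()
```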