I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems like it is not applied by group and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using arrange() and lead(), but since some timeslot values are shared by many stations in my larger dataset, I could not get this to work.
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
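A possible explanation for why this works where the apply() version did not, as a minimal sketch: inside a grouped mutate() or summarise(), a bare column name such as timeslot refers only to the current group's rows, while the magrittr placeholder `.` still refers to the whole data frame that was piped in. The column names n_in_group and n_in_dot below are made up for illustration.

```r
library(dplyr)

# Same data as in the question.
df <- data.frame(
  id       = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  station  = c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2),
  timeslot = c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
)

sizes <- df %>%
  group_by(id) %>%
  summarise(
    n_in_group = length(timeslot),  # bare column: only the current group's rows
    n_in_dot   = nrow(.)            # `.`: the whole piped data frame, every time
  )

sizes
```

If this reading is right, n_in_dot is the same for both ids, which would be why the apply(., ...) check in the question also matched timeslots belonging to the other id.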
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE
I have a data.frame that contains a variable for an observation date (Portfolio.Date), IDs (Loan.Number) and a status indicator (1-8) (Current.Delinquency.Code). I want to calculate a new variable that is TRUE only the first time that the status indicator goes above 3 for any ID. In Excel I would write =[Portfolio.Date]=minif([Portfolio.Date], [Loan.Number], [#Loan.Number], [Current.Delinquency.Code], ">3"), but I cannot figure out how to do this in R. Can anybody help me with this?
Thank you very much!
What I am looking for is the formula for the [Delinquent] column in the example data below, which is TRUE the first time that an observation of "Current.Delinquency.Code" for that "Loan.Number" is above 3.
Portfolio.Date Loan.Number Current.Delinquency.Code Delinquent
2022/01/01 1 1 FALSE
2022/02/01 1 4 TRUE
2022/03/01 1 4 FALSE
2022/04/01 1 4 FALSE
2022/01/01 2 1 FALSE
2022/02/01 2 1 FALSE
2022/03/01 2 1 FALSE
2022/04/01 2 1 FALSE
2022/01/01 3 1 FALSE
2022/02/01 3 3 FALSE
2022/03/01 3 4 TRUE
2022/04/01 3 4 FALSE
If I understood you correctly, this is one possible solution:
library(dplyr)
# to make it reproducible:
set.seed(1)
# sample data
df <- data.frame(Portfolio.Date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2022-01-30"), by = "days"),
                 Loan.Number = rep(c(1,2), 15),
                 Current.Delinquency.Code = sample(1:8, size = 30, replace = TRUE)) %>%
  # group by Loan.Number (the data comes in through the pipe, so it is not passed again)
  dplyr::group_by(Loan.Number) %>%
  # order by Portfolio.Date
  dplyr::arrange(Portfolio.Date) %>%
  # check the condition and make a cumsum of it, returning only those where cumsum is 1 (first occurrence)
  dplyr::mutate(nc = ifelse(Current.Delinquency.Code > 3 & cumsum(Current.Delinquency.Code > 3) == 1, 1, 0)) %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
# A tibble: 30 x 4
Portfolio.Date Loan.Number Current.Delinquency.Code nc
<date> <dbl> <int> <dbl>
1 2022-01-01 1 1 0
2 2022-01-02 2 4 1
3 2022-01-03 1 7 1
4 2022-01-04 2 1 0
5 2022-01-05 1 2 0
6 2022-01-06 2 5 0
7 2022-01-07 1 7 0
8 2022-01-08 2 3 0
9 2022-01-09 1 6 0
10 2022-01-10 2 2 0
# ... with 20 more rows
# i Use `print(n = ...)` to see more rows
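For comparison, here is a sketch of the same cumsum() idea applied to the 12-row example table from the question rather than to random data. The object names loans and result are only illustrative, and the flag is kept as a logical to match the Delinquent column the question asks for.

```r
library(dplyr)

# The example rows from the question, typed in by hand.
loans <- data.frame(
  Portfolio.Date = rep(as.Date(c("2022-01-01", "2022-02-01",
                                 "2022-03-01", "2022-04-01")), times = 3),
  Loan.Number = rep(1:3, each = 4),
  Current.Delinquency.Code = c(1, 4, 4, 4,  1, 1, 1, 1,  1, 3, 4, 4)
)

result <- loans %>%
  group_by(Loan.Number) %>%
  arrange(Portfolio.Date, .by_group = TRUE) %>%
  # TRUE only on the first row per loan where the code exceeds 3:
  # the running count of "above 3" rows equals 1 exactly there.
  mutate(Delinquent = Current.Delinquency.Code > 3 &
           cumsum(Current.Delinquency.Code > 3) == 1) %>%
  ungroup()

result
```

Loan 2 never goes above 3, so its running count stays at 0 and every row is FALSE.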
Minimal example: A small dataframe with 4 columns and a variable that holds the name of a new column I want to create. The new column is TRUE if responses to more than a certain number of questions exceed a threshold, and is FALSE otherwise
df1 <- data.frame(ID = LETTERS[1:5],
                  Q1 = sample(0:10, 5, replace = TRUE),
                  Q2 = sample(0:10, 5, replace = TRUE),
                  Q3 = sample(0:10, 5, replace = TRUE),
                  Q4 = sample(0:10, 5, replace = TRUE)
)
This gives me my dataframe with responses to the various questions:
> df1
ID Q1 Q2 Q3 Q4
1 A 2 4 5 0
2 B 9 6 6 3
3 C 5 5 3 2
4 D 0 5 3 10
5 E 7 5 6 7
I also define the following constants:
QUESTIONS <- c("Q1", "Q2", "Q3", "Q4")
MY_NEW_COL <- "New_Col"
THESHOLD1 <- 5
THESHOLD2 <- 2
I want to add a new column named New_Col that is TRUE if more than THESHOLD2 columns have a value in excess of THESHOLD1. I can get this to work in a clumsy, but obvious way:
df1 %>%
  mutate(!!MY_NEW_COL := ( (Q1 > THESHOLD1) + (Q2 > THESHOLD1) +
                           (Q3 > THESHOLD1) + (Q4 > THESHOLD1) ) > THESHOLD2)
This gives the right answer:
ID Q1 Q2 Q3 Q4 New_Col
1 A 2 4 5 0 FALSE
2 B 9 6 6 3 TRUE
3 C 5 5 3 2 FALSE
4 D 0 5 3 10 FALSE
5 E 7 5 6 7 TRUE
But I would like to systematize this, as there are 17 questions in all. My code, which I show below, gives the wrong answer:
df1 %>%
  mutate(!!MY_NEW_COL := (sum(all_of(QUESTIONS) > THESHOLD1)) > THESHOLD2)
ID Q1 Q2 Q3 Q4 New_Col
1 A 2 4 5 0 TRUE
2 B 9 6 6 3 TRUE
3 C 5 5 3 2 TRUE
4 D 0 5 3 10 TRUE
5 E 7 5 6 7 TRUE
What am I doing wrong, and how can I fix this?
Many thanks in advance
Thomas Philips
As you didn't provide a seed, it is not possible to reproduce your results exactly. The solution to your problem is to use across() and rowSums(), wrapping the column-name vector in all_of() so it is treated as a selection:
df1 %>%
  mutate(!!MY_NEW_COL := rowSums(across(all_of(QUESTIONS)) > THESHOLD1) > THESHOLD2)
It gives the output,
ID Q1 Q2 Q3 Q4 New_Col
1 A 7 9 1 1 FALSE
2 B 3 9 9 7 TRUE
3 C 4 0 6 6 FALSE
4 D 5 1 6 10 FALSE
5 E 6 5 5 1 FALSE
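Since the numbers above come from a different random draw, here is a small check of the same across() + rowSums() idea against the values printed in the question. The data frame is typed in by hand from that printout, and the constant names keep the spelling used in the question.

```r
library(dplyr)

# Values copied from the df1 printed in the question.
df1 <- data.frame(
  ID = LETTERS[1:5],
  Q1 = c(2, 9, 5, 0, 7),
  Q2 = c(4, 6, 5, 5, 5),
  Q3 = c(5, 6, 3, 3, 6),
  Q4 = c(0, 3, 2, 10, 7)
)

QUESTIONS  <- c("Q1", "Q2", "Q3", "Q4")
MY_NEW_COL <- "New_Col"
THESHOLD1  <- 5   # a response counts if it is above this value
THESHOLD2  <- 2   # more than this many counted responses gives TRUE

out <- df1 %>%
  # across() returns the selected columns as a data frame; comparing it with
  # a number gives a logical matrix, and rowSums() counts the TRUEs per row.
  mutate(!!MY_NEW_COL := rowSums(across(all_of(QUESTIONS)) > THESHOLD1) > THESHOLD2)

out
```

On these values New_Col should come out TRUE only for B and E, the same as the hand-written version in the question. Extending to 17 questions would only require a longer QUESTIONS vector.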
We can also do
library(dplyr)
library(purrr)
library(magrittr)
df1 %>%
mutate(!! MY_NEW_COL := map(select(cur_data(), starts_with("Q")),
~ .x > THESHOLD1) %>%
reduce(`+`) %>%
is_greater_than(THESHOLD2) )
I don't know if the following output is what you have in mind, but I first checked whether all Qs are greater than threshold 1 and, if so, whether their sum is greater than threshold 2:
library(dplyr)
f1 <- function(x, threshold1 = 2, threshold2 = 5) {
  x %>%
    group_by(ID) %>%
    mutate(threshold_1 = if_all(Q1:Q4, ~ .x > threshold1),
           # c_across() collects Q1..Q4 for the current ID;
           # sum(Q1:Q4) would instead sum the integer sequence from Q1 to Q4
           sum_Qs = sum(c_across(Q1:Q4)),
           threshold_2 = sum_Qs > threshold2 & threshold_1)
}
f1(df1, 2, 5)
# A tibble: 5 x 8
# Groups: ID [5]
  ID       Q1    Q2    Q3    Q4 threshold_1 sum_Qs threshold_2
  <chr> <int> <int> <int> <int> <lgl>        <int> <lgl>
1 A         8     0     1    10 FALSE           19 FALSE
2 B         2     3     2     8 FALSE           15 FALSE
3 C         1     8     4     3 FALSE           16 FALSE
4 D         9     3     3     9 TRUE            24 TRUE
5 E         1     3     0     1 FALSE            5 FALSE
I am struggling with one maybe easy question. I have a dataframe of 1 column with n rows (n is a multiple of 3). I would like to add a second column with integers like: 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,... How can I achieve this with dplyr as a general solution for data frames with different numbers of rows (always a multiple of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works. But I would like to have a solution for n rows, each = 3 . Many thanks!
You can specify the each and length.out parameters in rep().
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
Integer division by 3 (%/% 3) over the sequence 0:(n - 1) gives 0, 0, 0, 1, 1, 1, ...; adding 1 then produces the desired sequence, so this will also do:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
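The integer-division idea can also be wrapped in a small base R helper so that the block size is not hard-coded. This is only a sketch; block_id and its argument k are hypothetical names, not part of dplyr.

```r
# Hypothetical helper: label n rows in consecutive blocks of size k.
block_id <- function(n, k = 3L) {
  # 0-based row index, integer-divided by the block size, shifted back to 1-based
  (seq_len(n) - 1L) %/% k + 1L
}

ids_12 <- block_id(12)   # four full blocks of three
ids_7  <- block_id(7)    # also works when n is not a multiple of k
ids_12
ids_7
```

Inside mutate() this could be used as Col2 = block_id(n()), which returns an integer column rather than the <dbl> shown above.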
I have a data.frame that looks like this:
data=data.frame(position=c(1,2,3,1,1,4,5,6,7,8,2,2),
name=c("A","B","C","A","A","D","E","F","G","H","B","B"))
position name
1 A
2 B
3 C
1 A
1 A
4 D
5 E
6 F
7 G
8 H
2 B
2 B
I would like to be able to identify in the column "position" all the consecutive intervals
and then paste into a new column the length of each interval.
I would like my data to look somehow like this.
position length
1 - 3 3
4 - 8 5
Any help and comment are highly appreciated
Here is a base R solution.
Create a column, sequence, which indicates which rows are contiguous.
data$sequence <- c(NA, head(data$position, -1)) + 1 == data$position
data$sequence[[1]] <- data$sequence[[2]]
data
#> position name sequence
#> 1 1 A TRUE
#> 2 2 B TRUE
#> 3 3 C TRUE
#> 4 1 A FALSE
#> 5 1 A FALSE
#> 6 4 D FALSE
#> 7 5 E TRUE
#> 8 6 F TRUE
#> 9 7 G TRUE
#> 10 8 H TRUE
#> 11 2 B FALSE
#> 12 2 B FALSE
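Use rle to construct the run lengths. Each TRUE run ends on the last row of a sequence; the sequence itself starts on the row just before the run (or on row 1 when the run opens the data).
run_lengths <- rle(data$sequence)
i_ends <- cumsum(run_lengths$lengths)[run_lengths$values]
i_starts <- pmax(i_ends - run_lengths$lengths[run_lengths$values], 1)
data.frame(
  position = paste0(data$position[i_starts], " - ", data$position[i_ends]),
  length = i_ends - i_starts + 1
)
#>   position length
#> 1    1 - 3      3
#> 2    4 - 8      5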
Does this work:
library(dplyr)
library(tidyr)
library(data.table)
data %>%
  mutate(ID = case_when(position == lead(position) - 1 ~ 1, TRUE ~ 0)) %>%
  mutate(ID = case_when(position == lag(position) + 1 ~ 1, TRUE ~ ID)) %>%
  mutate(r = rleid(ID)) %>%
  filter(ID == 1) %>%
  group_by(r) %>%
  mutate(position = paste(min(position), max(position), sep = '-'),
         length = length(unique(name))) %>%
  ungroup() %>%
  select(1, 5) %>%
  distinct()
# A tibble: 2 x 2
position length
<chr> <int>
1 1-3 3
2 4-8 5
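Another way to look at the problem, sketched in base R only: give every row a run id that increases whenever the step from the previous row is not exactly +1, then summarise each run and keep the runs longer than one row. The names run_id, run_start, run_end, run_length and intervals are illustrative.

```r
data <- data.frame(
  position = c(1, 2, 3, 1, 1, 4, 5, 6, 7, 8, 2, 2),
  name = c("A", "B", "C", "A", "A", "D", "E", "F", "G", "H", "B", "B")
)

# A new run starts on the first row and wherever the step is not exactly +1.
run_id <- cumsum(c(TRUE, diff(data$position) != 1))

run_start  <- tapply(data$position, run_id, min)
run_end    <- tapply(data$position, run_id, max)
run_length <- tapply(data$position, run_id, length)

# Single rows are not intervals, so keep only runs of two or more rows.
keep <- run_length > 1
intervals <- data.frame(
  position = paste(run_start[keep], run_end[keep], sep = " - "),
  length   = as.integer(run_length[keep])
)

intervals
```

On the example data this should give the two intervals from the question, 1 - 3 (length 3) and 4 - 8 (length 5).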
I have a time series dataset containing different sensor measurements. The sensor software has some bugs, resulting in missing measurements. I added the missing measurement times, resulting in NAs in the "value" column. The dataset looks as follows:
df <- structure(list(time_id = 1:10, value = c(-1.80603125680195, -0.582075924689333,
NA, NA, -0.162309523556819, NA, NA, NA, 1.6059096288573, NA),
is_missing = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE, TRUE)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))
df
I want to group sequential rows with numeric vs missing values and at the same time count the number of sequential rows in each group. The result should look like this:
df %>% mutate(group = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
seq_NA = c(1:2, 1:2, 1, 1:3, 1, 1))
Help is very much appreciated!
Here is another idea. We use is.na() to flag the NAs and start a new group whenever the missing status changes from one row to the next, i.e.
df %>%
group_by(grp = cumsum(c(1, diff(is.na(value)) != 0))) %>%
mutate(seq_NA = seq(n()))
which gives,
# A tibble: 10 x 5
# Groups: grp [6]
time_id value is_missing grp seq_NA
<int> <dbl> <lgl> <dbl> <int>
1 1 -1.81 FALSE 1 1
2 2 -0.582 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.162 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.61 FALSE 5 1
10 10 NA TRUE 6 1
Here is a base R solution using ave() + rle()
df$group <- with(df, rep(seq_along(z<-rle(is_missing)$lengths),z))
df$seq_NA <- with(df,ave(seq(nrow(df)),group,FUN = seq_along))
such that
> df
time_id value is_missing group seq_NA
1 1 -1.8060313 FALSE 1 1
2 2 -0.5820759 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.1623095 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.6059096 FALSE 5 1
10 10 NA TRUE 6 1
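A slightly more compact base R variation, offered only as a sketch: rle() gives the run lengths, rep() turns them into group ids, and sequence() produces the within-run counter directly, so ave() is not needed. The value vector below is a rounded copy of the question's data.

```r
# Rounded copy of the "value" column from the question.
value <- c(-1.806, -0.582, NA, NA, -0.162, NA, NA, NA, 1.606, NA)

runs <- rle(is.na(value))           # runs of consecutive missing / non-missing rows

group  <- rep(seq_along(runs$lengths), runs$lengths)  # run number for each row
seq_NA <- sequence(runs$lengths)                      # 1, 2, ... within each run

data.frame(value, group, seq_NA)
```

Both vectors should match the group and seq_NA columns requested in the question.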