I have the following data.frame below. I would like to create a new column w (for weight). w should equal 1 / n for the industries that have the n highest returns for each given date and should equal 0 for the rest of the industries. I can group_by(date) and use top_n(3, wt = return) to filter the top industries and then mutate(w = 1/n), but how can I mutate without throwing away the other industries where w = 0?
structure(list(date = structure(c(16556, 16556, 16556, 16556,
16556, 16556, 16556, 16556, 16556, 16556, 16587, 16587, 16587,
16587, 16587, 16587, 16587, 16587, 16587, 16587, 16617, 16617,
16617, 16617, 16617, 16617, 16617, 16617, 16617, 16617), class = "Date"),
industry = c("Hlth", "Txtls", "BusEq", "Fin", "ElcEq", "Food",
"Beer", "Books", "Cnstr", "Carry", "Clths", "Txtls", "Fin",
"Games", "Cnstr", "Meals", "Hlth", "Hshld", "Telcm", "Rtail",
"Smoke", "Games", "Clths", "Rtail", "Servs", "Meals", "Food",
"Hlth", "Beer", "Trans"), return = c(4.89, 4.37, 4.02, 2.99,
2.91, 2.03, 2, 1.95, 1.86, 1.75, 4.17, 4.09, 1.33, 1.26,
0.42, 0.29, 0.08, -0.11, -0.45, -0.48, 9.59, 6, 5.97, 5.78,
5.3, 4.15, 4.04, 3.67, 3.51, 3.27)), row.names = c(NA, -30L
), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 30 x 3
date industry return
<date> <chr> <dbl>
1 2015-05-01 Hlth 4.89
2 2015-05-01 Txtls 4.37
3 2015-05-01 BusEq 4.02
4 2015-05-01 Fin 2.99
5 2015-05-01 ElcEq 2.91
6 2015-05-01 Food 2.03
7 2015-05-01 Beer 2
8 2015-05-01 Books 1.95
9 2015-05-01 Cnstr 1.86
10 2015-05-01 Carry 1.75
# ... with 20 more rows
EDIT: How would you handle ties? Suppose there is a tie for third place. The third place weight should be split between 3rd and 4th place (assuming only 2 are tied) with weights of (1/n)/2. The 1st and 2nd place weights stay at 1/n.
EDIT: Suppose n = 3. The top 3 A2 values for each A1 should get a weight w of 1/3 if there are no ties. If there is a tie for 3rd place (T3), then we have (1st, 2nd, T3, T3) and I would like weights to be 1/3, 1/3, 1/6, 1/6 to maintain a total weight of 1. This is only for 3rd place however. (1st, T2, T2) should have weights of 1/3, 1/3, 1/3. (T1, T1, T2, T2) should have weights of 1/3, 1/3, 1/6, 1/6, etc.
structure(list(A1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), A2 = c(1, 3, 3,
4, 5, 6, 7, 8, 8)), row.names = c(NA, -9L), class = "data.frame")
The output for df should be:
> df
A1 A2 w
1 A 1 0
2 A 3 0.1666
3 A 3 0.1666
4 A 4 0.3333
5 A 5 0.3333
6 B 6 0
7 B 7 0.3333
8 B 8 0.3333
9 B 8 0.3333
We could create a condition with ifelse. After grouping by 'date', arrange the dataset by 'date' and 'return' in descending order, then create 'w' with the condition that if the row_number() is at most 'n', the weight is 1/n, or else 0.
library(dplyr)
n <- 3
df1 %>%
  group_by(date) %>%
  arrange(date, -return) %>%
  mutate(w = ifelse(row_number() <= n, 1/n, 0))
If we are using top_n, then create the column 'w' in the filtered dataset and join it back to the original:
library(dplyr)
library(tidyr) # for replace_na
df1 %>%
  group_by(date) %>%
  top_n(n = 3, wt = return) %>%
  mutate(w = 1/n()) %>%
  right_join(df1) %>%
  mutate(w = replace_na(w, 0))
We can group by date, sort the return variable, take the last 3 entries (the top 3), and return return/n, or else 0.
library(dplyr)
n <- 3
df %>%
group_by(date) %>%
mutate(w = ifelse(return %in% tail(sort(return), n), return/n, 0))
# date industry return w
# <date> <chr> <dbl> <dbl>
# 1 2015-05-01 Hlth 4.89 1.63
# 2 2015-05-01 Txtls 4.37 1.46
# 3 2015-05-01 BusEq 4.02 1.34
# 4 2015-05-01 Fin 2.99 0
# 5 2015-05-01 ElcEq 2.91 0
# 6 2015-05-01 Food 2.03 0
# 7 2015-05-01 Beer 2 0
#....
The base R equivalent of the same logic uses ave:
ave(df$return, df$date, FUN = function(x) ifelse(x %in% tail(sort(x), n), x/n, 0))
EDIT
As mentioned in the comments, in case of ties the OP wants to return (1/n)/2, i.e. divide by the number of ties we have.
For this I have created a new, simpler data frame which makes it easier to see what is going on.
df <- data.frame(A1 = rep(c("A", "B"),c(5, 4)), A2 = 1:9)
df$A2[2] <- 3
If we use the current code it gives
df %>%
group_by(A1) %>%
mutate(w = ifelse(A2 %in% tail(sort(A2), n), A2/n, 0))
# A tibble: 9 x 3
# Groups: A1 [2]
# A1 A2 w
# <fct> <int> <dbl>
#1 A 1 0
#2 A 3 1
#3 A 3 1
#4 A 4 1.33
#5 A 5 1.67
#6 B 6 0
#7 B 7 2.33
#8 B 8 2.67
#9 B 9 3
which is not what we want. To avoid that, we can group by A2 again and, only for those rows where w != 0, divide w by the number of occurrences of A2.
df %>%
group_by(A1) %>%
mutate(w = ifelse(A2 %in% tail(sort(A2), n), A2/n, 0)) %>%
group_by(A2) %>%
mutate(w1 = ifelse(w != 0, w/n(), w)) %>%
ungroup()
# A1 A2 w w1
# <fct> <dbl> <dbl> <dbl>
#1 A 1 0 0
#2 A 3 1 0.5
#3 A 3 1 0.5
#4 A 4 1.33 1.33
#5 A 5 1.67 1.67
#6 B 6 0 0
#7 B 7 2.33 2.33
#8 B 8 2.67 2.67
#9 B 9 3 3
Another EDIT
It turns out we want to divide w only for the last tie group present. Moreover, the w values in each group should sum to 1. For the updated dataset we can do
n <- 3
temp_df <- df %>%
group_by(A1) %>%
top_n(n, A2)
temp_df %>%
arrange(A1, A2) %>%
mutate(w = ifelse(A2 == A2[1],
(1 - (1/n * sum(A2 != A2[1])))/sum(A2 == A2[1]), 1/n)) %>%
bind_rows(anti_join(df, temp_df) %>%
mutate(w = 0)
) %>%
arrange(A1, A2)
# A1 A2 w
# <fct> <dbl> <dbl>
#1 A 1 0
#2 A 3 0.167
#3 A 3 0.167
#4 A 4 0.333
#5 A 5 0.333
#6 B 6 0
#7 B 7 0.333
#8 B 8 0.333
#9 B 8 0.333
Let's try another variation, where all the values in a group are the same.
df1 = df
df1$A2[6:9] <- 10
temp_df <- df1 %>%
group_by(A1) %>%
top_n(n, A2)
temp_df %>%
arrange(A1, A2) %>%
mutate(w = ifelse(A2 == A2[1],
(1 - (1/n * sum(A2 != A2[1])))/sum(A2 == A2[1]), 1/n)) %>%
bind_rows(anti_join(df1, temp_df) %>%
mutate(w = 0)
) %>%
arrange(A1, A2)
# A1 A2 w
# <fct> <dbl> <dbl>
#1 A 1 0
#2 A 3 0.167
#3 A 3 0.167
#4 A 4 0.333
#5 A 5 0.333
#6 B 10 0.25
#7 B 10 0.25
#8 B 10 0.25
#9 B 10 0.25
The logic is: we select the top 3 A2 values along with their groups using top_n. Using anti_join we get all the rows which are not in the top 3 and assign them a fixed weight w of 0. For the rows which are in the top 3, we take the last (lowest-valued) tie group and assign it whatever weight remains after the non-last groups each receive 1/n. A more general single-pass sketch is below.
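For completeness, here is a hedged single-pass sketch (not from the original answers) that reproduces the same weights for all the tie cases above. The idea: under competition ranking, each tie group occupies some number of the top-n "slots", and the weight for those slots is split evenly across the tied rows.
library(dplyr)

n <- 3
df %>%
  group_by(A1) %>%
  mutate(r = min_rank(desc(A2))) %>%   # competition rank within group, best = 1
  group_by(A1, A2) %>%                 # rows in a tie group share one A2 value
  mutate(
    slots = pmax(0, pmin(r + n() - 1, n) - r + 1), # top-n slots this tie group occupies
    w     = slots / n / n()                        # split those slots' weight across the ties
  ) %>%
  ungroup() %>%
  select(-r, -slots)
On the example data this gives the desired 0/0.167/0.333 weights, and the w values in each group sum to 1.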
I would like to make a connection between the x and df2 datasets. Notice that in the dataset x I have a percentage value, which for the day 03-01-2021 is 0.1 and for the days 01-02-2021 and 01-01-2022 is 0.45. From that information I know the percentage value for 03-01-2021 is 0.1, so this value falls into category I of my dataset df2 (since the values range from 0.1 to 0.2). The days 01-02-2021 and 01-01-2022 correspond to category F of df2, since the values range from 0.4 to 0.5. So, I would like to generate an output table as follows:
library(dplyr)
df1<- structure(
list(date2= c("01-01-2022","01-01-2022","03-01-2021","03-01-2021","01-02-2021","01-02-2021"),
Category= c("ABC","CDE","ABC","CDE","ABC","CDE"),
coef= c(5,4,0,2,4,5)),
class = "data.frame", row.names = c(NA, -6L))
x<-df1 %>%
group_by(date2) %>%
summarize(across("coef", sum),.groups = 'drop')%>%
arrange(date2 = as.Date(date2, format = "%d-%m-%Y"))
number<-20
x$Percentage<-x$coef/number
date2 coef Percentage
<chr> <dbl> <dbl>
1 03-01-2021 2 0.1
2 01-02-2021 9 0.45
3 01-01-2022 9 0.45
df2 <- structure(
list(
Category = c("A", "B", "C", "D",
"E", "F", "G", "H", "I", "J"),
From = c(0.9,
0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0),
Until = c(
1,
0.8999,
0.7999,
0.6999,
0.5999,
0.4999,
0.3999,
0.2999,
0.1999,
0.0999
),
`1 Val` = c(
2222,
2017.8,
1793.6,
1621.5,
1522.4,
1457.3,
1325.2,
1229.15,
1223.1,
1177.05
),
`2 Val` = c(3200, 2220, 2560,
2200, 2220, 2080, 1220, 1240, 1720, 1620),
`3 Val` = c(
4665,
4122.5,
3732,
3498.75,
3265.5,
3032.25,
2799,
2682.375,
2565.75,
2449.125
),
`4 Val` = c(
6112,
5222.8,
4889.6,
4224,
4278.4,
3972.8,
3667.2,
3224.4,
3361.6,
3222.8
)
),
row.names = c(NA,-10L),
class = c("tbl_df",
"tbl", "data.frame")
)
Category From Until 1 Val 2 Val 3 Val 4 Val
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.9 1 2222 3200 4665 6112
2 B 0.8 0.900 2018 2220 4122 5223
3 C 0.7 0.800 1794 2560 3732 4890
4 D 0.6 0.700 1622 2200 3499 4224
5 E 0.5 0.600 1522 2220 3266 4278
6 F 0.4 0.500 1457 2080 3032 3973
7 G 0.3 0.400 1325 1220 2799 3667
8 H 0.2 0.300 1229 1240 2682 3224
9 I 0.1 0.200 1223 1720 2566 3362
10 J 0 0.0999 1177 1620 2449 3223
Using tidyverse, we do a rowwise on the 'x' dataset, slice the row of 'df2' where the 'Percentage' falls between the 'From' and 'Until', and unpack the data.frame/tibble column:
library(dplyr)
library(tidyr)
x %>%
rowwise %>%
mutate(out = df2 %>%
slice(which(Percentage>= From &
Percentage <= Until)[1]) %>%
select(-(1:3)) ) %>%
ungroup %>%
unpack(out)
-output
# A tibble: 3 × 7
date2 coef Percentage `1 Val` `2 Val` `3 Val` `4 Val`
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 03-01-2021 2 0.1 1223. 1720 2566. 3362.
2 01-02-2021 9 0.45 1457. 2080 3032. 3973.
3 01-01-2022 9 0.45 1457. 2080 3032. 3973.
Or this could be done with a non-equi join
library(data.table)
nm1 <- names(df2)[endsWith(names(df2), 'Val')]
setDT(x)[setDT(df2), (nm1) := mget(nm1),
on = .(Percentage >= From, Percentage <= Until)]
-output
> x
date2 coef Percentage 1 Val 2 Val 3 Val 4 Val
1: 03-01-2021 2 0.10 1223.1 1720 2565.75 3361.6
2: 01-02-2021 9 0.45 1457.3 2080 3032.25 3972.8
3: 01-01-2022 9 0.45 1457.3 2080 3032.25 3972.8
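The same lookup can also be sketched in plain base R, assuming the x and df2 objects as originally built (before the data.table join modifies x by reference): for each Percentage, find the df2 row whose [From, Until] interval contains it, then bind the Val columns on.
# index of the matching interval for each Percentage
idx <- vapply(
  x$Percentage,
  function(p) which(p >= df2$From & p <= df2$Until)[1],
  integer(1)
)
cbind(as.data.frame(x), as.data.frame(df2)[idx, endsWith(names(df2), "Val")])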
Let's say I have a time series, and in each iteration I take a fixed portion of it and calculate the correlation matrix. Also, assume three elements only, which are denoted by their names in the correlation matrix. I want to give them sequential numbers, meaning the first element is 1, the second is 2 and so forth. Then I want a data frame that expands these matrices. For example:
The first column is the element "from", the second one is "to", the third one is the correlation value and the fourth one is the time. I can give the times as input and repeat each one twice as many times as there are elements. I realize that I will have duplicates for each correlation value, differing only in the "to" and "from" elements, and that is what I am looking for. How can I construct this? Here is my data, where g.list is a list of correlation matrices:
> dput(g.list)
list(structure(c(1, 0.352209944821856, 0.802051885793422, 0.352209944821857,
1, 0.827370298950111, 0.802051885793422, 0.827370298950111, 1
), .Dim = c(3L, 3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm",
"gs", "ms"))), structure(c(1, 0.670163753398499, 0.753168359152204,
0.6701637533985, 1, 0, 0.753168359152202, 0, 1), .Dim = c(3L,
3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm", "gs", "ms"
))), structure(c(1, 0.681190013681026, 0.153608963486821, 0.681190013681026,
1, 0.82058156983829, 0.153608963486822, 0.82058156983829, 1), .Dim = c(3L,
3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm", "gs", "ms"
))))
Are you looking for this?
result <- do.call(rbind, Map(function(x, y)
cbind(which(x < 1, arr.ind = TRUE), value = x[x != 1], year = y),
g.list, 2018:2020))
result
# row col value year
#gs 2 1 0.352 2018
#ms 3 1 0.802 2018
#jpm 1 2 0.352 2018
#ms 3 2 0.827 2018
#jpm 1 3 0.802 2018
#gs 2 3 0.827 2018
#gs 2 1 0.670 2019
#ms 3 1 0.753 2019
#jpm 1 2 0.670 2019
#ms 3 2 0.000 2019
#jpm 1 3 0.753 2019
#gs 2 3 0.000 2019
#gs 2 1 0.681 2020
#ms 3 1 0.154 2020
#jpm 1 2 0.681 2020
#ms 3 2 0.821 2020
#jpm 1 3 0.154 2020
#gs 2 3 0.821 2020
To get only the upper/lower triangle values and avoid duplicates, you may use:
do.call(rbind, Map(function(x, y) {
x[upper.tri(x)] <- 1
cbind(which(x < 1, arr.ind = TRUE), value = x[x != 1], year = y)
}, g.list, 2018:2020))
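If you would rather have the "from"/"to" columns as names instead of row/column indices, here is a hedged variation (same g.list and 2018:2020 years assumed) built on as.table()/as.data.frame():
do.call(rbind, Map(function(x, y) {
  x[upper.tri(x, diag = TRUE)] <- NA   # keep one triangle, drop the diagonal
  d <- as.data.frame(as.table(x), stringsAsFactors = FALSE)
  d <- d[!is.na(d$Freq), ]             # drop the blanked-out cells
  setNames(cbind(d, year = y), c("from", "to", "value", "year"))
}, g.list, 2018:2020))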
I have 2 R data frames that look like this:
DATA FRAME 1:
identifier  ef_posterior  position_no  classification
11111       0.260         1            yes
11111       0.0822        2            yes
11111       0.00797       3            yes
11111       0.04          4            no
11111       0.245         5            yes
11111       0.432         6            yes
11112       0.342         1            maybe
11112       0.453         2            yes
11112       0.0032        3            yes
11112       0.241         5            no
11112       0.0422        6            yes
11112       0.311         4            no
DATAFRAME 2:
study_identifier  %LVEF
11111             62
11112             76
I want to merge and rearrange these two data frames. study_identifier and identifier are the same thing (just different column names). Additionally, I would like to recode the classification so that yes = 0, no = 1, maybe = 2. The desired result looks like this:
identifier  pos_1  pos_1_class  pos_2   pos_2_class  pos_3    pos_3_class  pos_4  pos_4_class  pos_5  pos_5_class  pos_6   pos_6_class  %LVEF
11111       0.260  0            0.0822  0            0.00797  0            0.04   1            0.245  0            0.432   0            62
11112       0.342  2            0.453   0            0.0032   0            0.311  1            0.241  1            0.0422  0            76
df1 %>%
  mutate(position_no = paste0("position_", position_no)) %>%
  pivot_wider(id_cols = identifier, names_from = position_no,
              values_from = ef_posterior) %>%
  left_join(df2 %>%
              mutate(study_identifier = as.numeric(as.character(study_identifier))),
            by = c("identifier" = "study_identifier"))
This is the code I have right now, but I can't figure out where to put in the code for the classification column.
How would I go about doing this?
Any help would be very much appreciated!
You can recode quite easily with dplyr and case_when:
df1 %>% mutate(
classification =
case_when( classification == "yes" ~ 0,
           classification == "no" ~ 1,
classification == "maybe" ~ 2)
)
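A hedged one-line alternative for the same yes = 0 / no = 1 / maybe = 2 mapping is dplyr::recode:
df1 %>% mutate(classification = recode(classification, yes = 0, no = 1, maybe = 2))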
I would solve it the following way:
library(tidyverse)
df1 <- data.frame(
stringsAsFactors = FALSE,
identifier = c(11111L,11111L,11111L,11111L,
11111L,11111L,11112L,11112L,11112L,11112L,11112L,
11112L),
ef_posterior = c(0.26,0.0822,0.00797,0.04,
0.245,0.432,0.342,0.453,0.0032,0.241,0.0422,0.311),
position_no = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 5L, 6L, 4L),
classification = c("yes","yes","yes","no",
"yes","yes","maybe","yes","yes","no","yes","no")
)
df2 <- data.frame(
check.names = FALSE,
study_identifier = c(11111L, 11112L),
`%LVEF` = c(62L, 76L)
)
df1 %>% mutate(
classification =
case_when( classification == "yes" ~ 0,
           classification == "no" ~ 1,
classification == "maybe" ~ 2)
) %>%
pivot_wider(
id_cols = c(identifier), names_from = c(position_no), values_from = c(classification,ef_posterior)) %>%
left_join(df2, by = c("identifier" = "study_identifier"))
#> # A tibble: 2 x 14
#> identifier classification_1 classification_2 classification_3 classification_4
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1      11111                0                0                0                1
#> 2      11112                2                0                0                1
#> # … with 9 more variables: classification_5 <dbl>, classification_6 <dbl>,
#> # ef_posterior_1 <dbl>, ef_posterior_2 <dbl>, ef_posterior_3 <dbl>,
#> # ef_posterior_4 <dbl>, ef_posterior_5 <dbl>, ef_posterior_6 <dbl>,
#> # `%LVEF` <int>
Created on 2021-04-12 by the reprex package (v0.3.0)
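To get column names closer to the requested pos_1/pos_1_class layout, a hedged variation of the pivot_wider call can use names_glue (the glue expression here is an assumption about the wanted suffixing, not something from the original post):
df1 %>%
  mutate(classification = case_when(classification == "yes" ~ 0,
                                    classification == "no" ~ 1,
                                    classification == "maybe" ~ 2)) %>%
  pivot_wider(
    id_cols     = identifier,
    names_from  = position_no,
    values_from = c(ef_posterior, classification),
    names_glue  = "pos_{position_no}{ifelse(.value == 'classification', '_class', '')}"
  ) %>%
  left_join(df2, by = c("identifier" = "study_identifier"))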
I need to determine the percentage of values in each column, for each cluster, that meet a condition. A reproducible example is below. I have a table like this:
> tab
GI RT TR VR Cluster_number
1 1000086986 0.5814 0.5814 0.628 1
10 1000728257 0.5814 0.5814 0.628 1
13 1000074769 0.7879 0.7879 0.443 2
14 1000498642 0.7879 0.7879 0.443 2
22 1000074765 0.7941 0.3600 0.533 3
26 1000597385 0.7941 0.3600 0.533 3
31 1000502373 0.5000 0.5000 0.607 4
32 1000532631 0.6875 0.7059 0.607 4
33 1000597694 0.5000 0.5000 0.607 4
34 1000598724 0.5000 0.5000 0.607 4
And I need a table like this:
> tab1
Cluster_number RT_cond TR_cond VR_cond
1 1 0 0 100
2 2 100 100 0
3 3 100 0 0
4 4 25 25 100
The values in the corresponding column indicate the percentage of GI in the corresponding cluster where RT >= 0.6, TR >= 0.6 and VR >= 0.6, respectively. I.e., in the first cluster all RT < 0.6, therefore the value 0 is written in the first row of the final table, while in the fourth cluster one of the four TR values is >= 0.6, so the corresponding value in the final table is 25. How can I do this?
You can group_by Cluster_number and use across to calculate the percentage:
library(dplyr)
df %>%
group_by(Cluster_number) %>%
summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
#In older version of dplyr use summarise_at
#summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
#1 1 0 0 100
#2 2 100 100 0
#3 3 100 0 0
#4 4 25 25 100
In base R, we can use aggregate :
aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
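aggregate() keeps the original column names; if the _cond suffix is wanted, a small hedged follow-up renames them:
res <- aggregate(cbind(RT, TR, VR) ~ Cluster_number, df, function(x) mean(x >= 0.6) * 100)
names(res)[-1] <- paste0(names(res)[-1], "_cond")  # RT_cond, TR_cond, VR_cond
res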
data
df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L,
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L,
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879,
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814,
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628,
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)),
class = "data.frame", row.names = c("1", "10", "13", "14", "22",
"26", "31", "32", "33", "34"))
With the dplyr package you can use a group_by statement followed by summarise, and then rename the columns of interest with the new rename_with function
library(dplyr)
tab %>%
group_by(Cluster_number) %>%
summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>%
rename_with(~paste0(., "_cond"), c(RT, TR, VR))
# A tibble: 4 x 4
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
# 1 1 0 0 100
# 2 2 100 100 0
# 3 3 100 0 0
# 4 4 25 25 100
I need some help working with consecutive results.
Here is my sample data:
df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909,
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645,
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA,
-16L))
The 'idno' represents individual people who had a test with 'result' on a given date ('samp_date').
For each person, I need to find the earliest run of consecutive 'Negative' results and return the date of the first 'Negative' in that run. To qualify, the consecutive negatives must span more than 30 days with no 'Positive' results in between.
The example answer for idno == 1 would be 2013-10-29, and 2015-11-06 for idno == 2.
I have tried using rle(as.character(df$result)) but have struggled to understand how to apply this to grouped data.
I would prefer an approach that uses dplyr or data.table.
Thanks for any help.
Similar to #MKR's answer, you can make a grouping variable and summarize in data.table:
library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]
# summarize by grouping var g = rleid(idno, result)
runDT = df[, .(
start = first(samp_date),
end = last(samp_date),
dur = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]
# idno result g start end dur
# 1: 1 Negative 1 2013-07-23 2013-07-23 0 days
# 2: 1 Positive 2 2013-08-21 2013-10-01 41 days
# 3: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 4: 2 Positive 4 2015-10-13 2015-10-13 0 days
# 5: 2 Negative 5 2015-11-06 2016-10-31 360 days
# 6: 2 Positive 6 2017-04-25 2017-09-20 148 days
# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30),
on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]
# filter
runDT[w]
# idno result g start end dur
# 1: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 2: 2 Negative 5 2015-11-06 2016-10-31 360 days
A dplyr-based solution can be achieved by creating a grouping variable for each run of consecutive 'result' values and then taking the first occurrence that meets the criteria:
library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>%
group_by(idno) %>%
arrange(samp_date) %>%
mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
group_by(idno, result_grp) %>%
filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
slice(1) %>%
ungroup() %>%
select(-result_grp)
# # A tibble: 2 x 3
# idno result samp_date
# <dbl> <ord> <date>
# 1 1.00 Negative 2013-10-29
# 2 2.00 Negative 2015-11-06
library(dplyr)
df %>% group_by(idno) %>%
mutate(time_diff = ifelse(result=="Negative" & lead(result)=='Negative', samp_date - lead(samp_date),0),
ConsNegDate = min(samp_date[which(abs(time_diff)>30)]))
# A tibble: 16 x 5
# Groups: idno [2]
idno result samp_date time_diff ConsNegDate
<dbl> <ord> <date> <dbl> <date>
1 1 Negative 2013-07-23 0 2013-10-29
2 1 Positive 2013-08-21 0 2013-10-29
3 1 Positive 2013-10-01 0 2013-10-29
4 1 Negative 2013-10-29 -34 2013-10-29
5 1 Negative 2013-12-02 -39 2013-10-29
6 1 Negative 2014-01-10 -102 2013-10-29
7 1 Negative 2014-04-22 -322 2013-10-29
8 1 Negative 2015-03-10 -72 2013-10-29
9 1 Negative 2015-05-21 -69 2013-10-29
10 1 Negative 2015-07-29 NA 2013-10-29
11 2 Positive 2015-10-13 0 2015-11-06
12 2 Negative 2015-11-06 -360 2015-11-06
13 2 Negative 2016-10-31 0 2015-11-06
14 2 Positive 2017-04-25 0 2015-11-06
15 2 Positive 2017-09-07 0 2015-11-06
16 2 Positive 2017-09-20 0 2015-11-06
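Since the question mentions rle(), here is a hedged base R sketch applying it per idno via split(); first_neg is a helper name invented for illustration:
# first date of the earliest Negative run spanning more than 30 days
first_neg <- function(d) {
  d <- d[order(d$samp_date), ]
  r <- rle(as.character(d$result))
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  for (i in seq_along(r$values)) {
    if (r$values[i] == "Negative" &&
        as.numeric(d$samp_date[ends[i]] - d$samp_date[starts[i]]) > 30) {
      return(d$samp_date[starts[i]])
    }
  }
  as.Date(NA)
}
do.call(rbind, lapply(split(df, df$idno), function(d)
  data.frame(idno = d$idno[1], first_neg_date = first_neg(d))))
This returns 2013-10-29 for idno 1 and 2015-11-06 for idno 2, matching the expected answers.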