Calculated Column Based on Rows with Date Range - r
I have a dataframe as follows:
ID  Col1    RespID  Col3  Col4  Year  Month  Day
1   blue    729Ad   3.2   A     2021  April  2
2   orange  295gS   6.5   A     2021  April  1
3   red     729Ad   8.4   B     2021  April  20
4   yellow  592Jd   2.9   A     2021  March  12
5   green   937sa   3.5   B     2021  May    13
I would like to calculate a new column, Col5, such that its value is 1 if the row has a Col4 value of A and there exists another row somewhere in the dataset with the same RespID but a Col4 value of B. Otherwise its value is 0. Then I will drop all rows with a Col4 value of B, to keep just those with A. I'd also like to account for the date fields (Year, Month, Day) so that this is done in groups based on, say, a 30-day timeframe. So only if 'B' appears within 30 days of when 'A' appears in the dataset is there a 1 (if 'B' appears only within 60 days, then there is no 1). Additionally, I'd like to keep everything as data.frames.
Here is what the desired output table would look like prior to dropping rows with Col4 value of B:
ID  Col1    RespID  Col3  Col4  Col5
1   blue    729Ad   3.2   A     1
2   orange  295gS   6.5   A     0
3   red     729Ad   8.4   B     0
4   yellow  592Jd   2.9   A     0
5   green   937sa   3.5   B     0
I have found Ronak's solution in this thread (Calculated Column Based on Rows in Tidymodels Recipe) to be useful; however, I would like to modify it for the date range.
A lot of things to unpack here.
I think you're tripping up over your own feet by trying to do too many things at once. I've broken down the code into four distinct steps to make the thought process easy to follow. Obviously, for use in a production environment it should be rewritten more efficiently.
1. Generate some data
library(tidyverse)
set.seed(42)
df <- tibble(
  id = c(1:10),
  resp_id = c(1701, seq(2286, 2289), 1701, seq(2290, 2293)),
  grouping = sample(c("A", "B"), size = 10, replace = TRUE),
  date = seq.Date(as.Date("2363-10-04"), as.Date("2363-11-17"), length.out = 10)
)
Resulting data:
# A tibble: 10 × 4
      id resp_id grouping date
   <int>   <dbl> <chr>    <date>
 1     1    1701 A        2363-10-04
 2     2    2286 A        2363-10-08
 3     3    2287 A        2363-10-13
 4     4    2288 A        2363-10-18
 5     5    2289 B        2363-10-23
 6     6    1701 B        2363-10-28
 7     7    2290 B        2363-11-02
 8     8    2291 B        2363-11-07
 9     9    2292 A        2363-11-12
10    10    2293 B        2363-11-17
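The generated data already has a single date column, whereas the data in the question stores Year, Month (as a name) and Day separately. A minimal sketch of combining them into a Date, assuming English month names that match R's built-in month.name (the data frame name df_q is made up for this illustration):

```r
# Date fields copied from the question's example table
df_q <- data.frame(
  Year  = c(2021, 2021, 2021, 2021, 2021),
  Month = c("April", "April", "April", "March", "May"),
  Day   = c(2, 1, 20, 12, 13)
)

# Assumption: month names are English and spelled as in month.name.
# match() turns the name into its number, so no locale-dependent parsing is needed.
df_q$date <- as.Date(sprintf(
  "%04d-%02d-%02d",
  as.integer(df_q$Year),
  match(df_q$Month, month.name),
  as.integer(df_q$Day)
))

# Subtracting two Dates gives the gap in days
gap_days <- as.numeric(df_q$date[3] - df_q$date[1])
```

With a real date column in place, the 30-day check becomes a plain subtraction of dates.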
2. Check grouping
df <- df %>%
  mutate(
    is_a = ifelse(grouping == "A", 1, 0),
    is_b = ifelse(grouping == "B", 1, 0)
  )
We have the grouping now as easy-to-use dummy variables:
> df
# A tibble: 10 × 6
      id resp_id grouping date        is_a  is_b
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>
 1     1    1701 A        2363-10-04     1     0
 2     2    2286 A        2363-10-08     1     0
 3     3    2287 A        2363-10-13     1     0
 4     4    2288 A        2363-10-18     1     0
 5     5    2289 B        2363-10-23     0     1
 6     6    1701 B        2363-10-28     0     1
 7     7    2290 B        2363-11-02     0     1
 8     8    2291 B        2363-11-07     0     1
 9     9    2292 A        2363-11-12     1     0
10    10    2293 B        2363-11-17     0     1
3. Check completeness
df <- df %>%
  group_by(
    resp_id
  ) %>%
  mutate(
    # Check if the grouping has both "A" and "B" values
    is_complete = ifelse(
      sum(is_a) > 0 & sum(is_b) > 0,
      1,
      0
    )
  ) %>%
  ungroup()
We see that there is only one resp_id value that is complete — 1701:
> df
# A tibble: 10 × 7
      id resp_id grouping date        is_a  is_b is_complete
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>       <dbl>
 1     1    1701 A        2363-10-04     1     0           1
 2     2    2286 A        2363-10-08     1     0           0
 3     3    2287 A        2363-10-13     1     0           0
 4     4    2288 A        2363-10-18     1     0           0
 5     5    2289 B        2363-10-23     0     1           0
 6     6    1701 B        2363-10-28     0     1           1
 7     7    2290 B        2363-11-02     0     1           0
 8     8    2291 B        2363-11-07     0     1           0
 9     9    2292 A        2363-11-12     1     0           0
10    10    2293 B        2363-11-17     0     1           0
4. Assign target value
df <- df %>%
  group_by(
    resp_id
  ) %>%
  mutate(
    # Check if the "A" part of a complete grouping has another value within 30 days
    is_within_timeframe = ifelse(
      is_complete == 1 & is_a == 1 & max(date) - min(date) <= 30,
      1,
      0
    )
  ) %>%
  ungroup()
We see that our one complete set does in fact have a B value that falls within 30 days of the A observation (Caveat: This only works if there are always exactly one or two observations per grouping!). Column is_within_timeframe corresponds to your Col5:
> df
# A tibble: 10 × 8
      id resp_id grouping date        is_a  is_b is_complete is_within_timeframe
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>       <dbl>               <dbl>
 1     1    1701 A        2363-10-04     1     0           1                   1
 2     2    2286 A        2363-10-08     1     0           0                   0
 3     3    2287 A        2363-10-13     1     0           0                   0
 4     4    2288 A        2363-10-18     1     0           0                   0
 5     5    2289 B        2363-10-23     0     1           0                   0
 6     6    1701 B        2363-10-28     0     1           1                   0
 7     7    2290 B        2363-11-02     0     1           0                   0
 8     8    2291 B        2363-11-07     0     1           0                   0
 9     9    2292 A        2363-11-12     1     0           0                   0
10    10    2293 B        2363-11-17     0     1           0                   0
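The max(date) - min(date) check relies on the caveat above. If a resp_id can have more than two observations, one possible alternative is to compare each "A" row against every "B" row of the same resp_id. The following is only a sketch in base R on plain data.frames: the example rows and the helper has_b_within are made up for illustration, and it assumes "within 30 days" means an absolute gap of at most 30 days in either direction.

```r
# Hypothetical data with more than two observations per resp_id
df <- data.frame(
  id = 1:6,
  resp_id = c(1701, 1701, 1701, 2286, 2286, 2287),
  grouping = c("A", "B", "A", "A", "B", "A"),
  date = as.Date(c("2363-10-04", "2363-10-28", "2363-12-15",
                   "2363-10-08", "2363-12-20", "2363-10-13"))
)

# For row i: is there a "B" row with the same resp_id at most `window` days away?
# Assumption: the gap is measured in either direction (B before or after A).
has_b_within <- function(i, data, window = 30) {
  is_b_match <- data$resp_id == data$resp_id[i] & data$grouping == "B"
  gaps <- abs(as.numeric(data$date[is_b_match] - data$date[i]))
  any(gaps <= window)  # FALSE when there is no matching "B" row at all
}

b_nearby <- vapply(seq_len(nrow(df)), has_b_within, logical(1), data = df)
df$is_within_timeframe <- as.integer(df$grouping == "A" & b_nearby)

# Keep only the "A" rows, as described in the question
df_a <- df[df$grouping == "A", ]
```

Here only the first row is flagged: its "B" partner is 24 days away, while the second "A" row for 1701 is 48 days from that same "B" row. The row-by-row loop is quadratic in the worst case, so for large data a join-based approach would probably scale better.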
Related
Check multiple columns for value in r after group by
In the following dataframe, for each household-individual combination, if all three variables "X_1", "Y_2" and "Z_3" are > 0, then add a new column "criteria" = "C1", else 0. Only household 1001, individual 1 fulfills this condition. I tried the ifelse option first and then select_at, but to no avail. It's throwing an error:
data %>%
  group_by(household,individual) %>%
  mutate(criteria = ifelse(X_1 >0 & Y_2 >0 & X_3 >0,"C1",0))
# option_2
data %>%
  group_by(household,individual) %>%
  select_at(vars(X_1 >0 & Y_2 >0 & Z_3 >0,"C1",0),all_vars(.>0)) %>%
  mutate(criteria = "c1")
I also want to retain all other variables for the household-individual combination, like year, week and duration, intact in the final dataframe, even though they are not present in the group by. Please suggest.
Sample dataset:
data <- data.frame(household=c(1001,1001,1001,1001,1001,1002,1002,1002,1003,1003,1003),
                   individual = c(1,1,1,1,1,2,2,2,1,1,1),
                   year = c(2021,2021,2022,2022,2022,2021,2022,2022,2022,2022,2022),
                   week =c("w51","w52","w1","w2","w4","w51","w1","w3","w1","w2","w3"),
                   duration =c(20,23,24,56,78,12,34,67,87,89,90),
                   X_1 = c(3,3,3,3,3,0,0,0,1,1,1),
                   Y_2 = c(2,2,2,2,2,1,1,1,0,0,0),
                   Z_3 = c(4,4,4,4,4,0,0,0,0,0,0))
You could use if_all(), which is more efficient than rowwise c_across.
data %>%
  mutate(criteria = ifelse(if_all(X_1:Z_3, `>`, 0), "C1", "0"))
#    household individual year week duration X_1 Y_2 Z_3 criteria
# 1       1001          1 2021  w51       20   3   2   4       C1
# 2       1001          1 2021  w52       23   3   2   4       C1
# 3       1001          1 2022   w1       24   3   2   4       C1
# 4       1001          1 2022   w2       56   3   2   4       C1
# 5       1001          1 2022   w4       78   3   2   4       C1
# 6       1002          2 2021  w51       12   0   1   0        0
# 7       1002          2 2022   w1       34   0   1   0        0
# 8       1002          2 2022   w3       67   0   1   0        0
# 9       1003          1 2022   w1       87   1   0   0        0
# 10      1003          1 2022   w2       89   1   0   0        0
# 11      1003          1 2022   w3       90   1   0   0        0
You're doing a rowwise operation, so we can call rowwise and then do the ifelse using the c_across function, calling ungroup afterwards to get out of rowwise:
library(dplyr)
data |>
  rowwise() |>
  mutate(criteria = ifelse(all(c_across(X_1:Z_3) > 0), "C1", "0")) |>
  ungroup()
Or you can just do:
# all(x > 0) tests every value in the row; all(x) > 0 would only test for non-zero values
data$criteria = apply(subset(data, select = X_1:Z_3), 1, \(x) ifelse(all(x > 0), "C1", "0"))
   household individual  year week  duration   X_1   Y_2   Z_3 criteria
       <dbl>      <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <dbl> <chr>
 1      1001          1  2021 w51         20     3     2     4 C1
 2      1001          1  2021 w52         23     3     2     4 C1
 3      1001          1  2022 w1          24     3     2     4 C1
 4      1001          1  2022 w2          56     3     2     4 C1
 5      1001          1  2022 w4          78     3     2     4 C1
 6      1002          2  2021 w51         12     0     1     0 0
 7      1002          2  2022 w1          34     0     1     0 0
 8      1002          2  2022 w3          67     0     1     0 0
 9      1003          1  2022 w1          87     1     0     0 0
10      1003          1  2022 w2          89     1     0     0 0
11      1003          1  2022 w3          90     1     0     0 0
Keep previous value if it is under a certain threshold
I would like to create a variable called treatment_cont that is grouped by group as follows:
ID  day  day_diff  treatment  treatment_cont
1   0    NA        1          1
1   14   14        1          1
1   20   6         2          2
1   73   53        1          1
2   0    NA        1          1
2   33   33        1          1
2   90   57        2          2
2   112  22        3          2
2   152  40        1          1
2   178  26        4          1
Treatment_cont is the same as treatment but we want to keep the same treatment regime only when the day_diff, the difference in days between treatments, is lower than 30. I have tried many ways on dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
Probably, a conditional mutate, using case_when and lag might work:
df %>%
  mutate(treatment_cont = case_when(day_diff < 30 ~ treatment,
                                    TRUE ~ lag(treatment)))
You are probably looking for lag (and perhaps its brother, lead):
df %>%
  replace_na(list(day_diff=0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    treatment_cont = ifelse(day_diff < 30, lag(treatment_cont, default = treatment_cont[1]), treatment_cont)
  ) %>%
  ungroup %>%
  arrange(ID, day)
# A tibble: 10 x 5
      ID   day day_diff treatment treatment_cont
   <int> <int>    <dbl>     <int>          <int>
 1     1     0        0         1              1
 2     1    14       14         1              1
 3     1    20        6         2              1
 4     1    73       53         1              1
 5     2     0        0         1              1
 6     2    33       33         1              1
 7     2    90       57         2              2
 8     2   112       22         3              2
 9     2   152       40         1              1
10     2   178       26         4              1
Create a dummy variable indicating whether a value is observed before
I have a huge dataset and want to create a binary dummy variable indicating whether a value has been observed before. Here is the sample data set.
data.frame(
  id = c(rep("A",3),rep("B",3),rep("C",3)),
  time = rep(seq(1:3),3),
  item = c(11,12,13,11,11,13,22,11,22))
From the dataset, here is the desired column:
observed_b4 = c(NA,0,0,NA,1,0,NA,0,1)
For each group, I want to have information about whether the item has been observed before or not. I can do it with a for-loop, but the data is too big for that.
Using duplicated:
base:
cbind(x, flag = as.integer(duplicated(paste(x$id, x$item))))
#   id time item flag
# 1  A    1   11    0
# 2  A    2   12    0
# 3  A    3   13    0
# 4  B    1   11    0
# 5  B    2   11    1
# 6  B    3   13    0
# 7  C    1   22    0
# 8  C    2   11    0
# 9  C    3   22    1
or dplyr:
library(dplyr)
x %>%
  group_by(id) %>%
  mutate(flag = as.integer(duplicated(item)))
## A tibble: 9 x 4
## Groups:   id [3]
#  id     time  item  flag
#  <chr> <int> <dbl> <int>
#1 A         1    11     0
#2 A         2    12     0
#3 A         3    13     0
#4 B         1    11     0
#5 B         2    11     1
#6 B         3    13     0
#7 C         1    22     0
#8 C         2    11     0
#9 C         3    22     1
A solution with base R that uses ave and duplicated. ave allows you to apply a function over df$item for each group defined by df$id. duplicated checks whether an item has already appeared. ave automatically returns a numeric vector (the same class as the input vector).
df$observed_b4 <- ave(df$item, df$id, FUN = duplicated)
df
#>   id time item observed_b4
#> 1  A    1   11           0
#> 2  A    2   12           0
#> 3  A    3   13           0
#> 4  B    1   11           0
#> 5  B    2   11           1
#> 6  B    3   13           0
#> 7  C    1   22           0
#> 8  C    2   11           0
#> 9  C    3   22           1
However, to get exactly what you're looking for, you can use this:
df$observed_b4 <- ave(df$item, df$id, FUN = function(x) replace(duplicated(x),1,NA))
df
#>   id time item observed_b4
#> 1  A    1   11          NA
#> 2  A    2   12           0
#> 3  A    3   13           0
#> 4  B    1   11          NA
#> 5  B    2   11           1
#> 6  B    3   13           0
#> 7  C    1   22          NA
#> 8  C    2   11           0
#> 9  C    3   22           1
We could group by 'id' and 'item', create a logical vector with row_number() and coerce it to binary (+):
library(dplyr)
df1 %>%
  group_by(id, item) %>%
  mutate(flag = +(row_number() != 1))
Output:
# A tibble: 9 x 4
# Groups:   id, item [7]
#  id     time  item  flag
#  <chr> <int> <dbl> <int>
#1 A         1    11     0
#2 A         2    12     0
#3 A         3    13     0
#4 B         1    11     0
#5 B         2    11     1
#6 B         3    13     0
#7 C         1    22     0
#8 C         2    11     0
#9 C         3    22     1
R Reshape Wide To Long Using Column Stub Strings
data1=data.frame("School"=c(1,1,2,2,3,3,4,4), "Fund"=c(0,1,0,1,0,1,0,1), "Total_A_Grade5"=c(22,20,21,24,24,26,25,22), "Group1_A_Grade5"=c(10,6,6,10,9,9,9,10), "Group2_A_Grade5"=c(5,9,9,8,10,8,8,6), "Total_B_Grade5"=c(23,33,19,21,19,23,20,21), "Group1_B_Grade5"=c(8,7,7,10,9,9,5,5), "Group2_B_Grade5"=c(6,10,7,6,6,5,9,9), "Total_A_Grade6"=c(18,24,16,24,26,25,16,19), "Group1_A_Grade6"=c(7,7,5,9,10,9,5,7), "Group2_A_Grade6"=c(5,8,6,7,10,8,8,9), "Total_B_Grade6"=c(26,23,22,24,21,22,24,19), "Group1_B_Grade6"=c(10,10,6,10,7,8,8,7), "Group2_B_Grade6"=c(9,6,9,6,7,6,9,9), "Total_A_Grade7"=c(20,19,18,25,16,21,19,26), "Group1_A_Grade7"=c(9,7,7,9,7,7,5,8), "Group2_A_Grade7"=c(8,5,7,9,6,5,5,9), "Total_B_Grade7"=c(25,21,24,25,18,18,27,18), "Group1_B_Grade7"=c(10,10,10,7,5,6,8,5), "Group2_B_Grade7"=c(9,6,8,10,8,6,10,6)) data2=data.frame("School"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), "Fund"=c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1), "Type"=c('Total','Total','Group1','Group1','Group2','Group2','Total','Total','Group1','Group1','Group2','Group2','Total','Total','Group1','Group1','Group2','Group2','Total','Total','Group1','Group1','Group2','Group2'), "Class"=c('A','A','A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','B','B','B','B','B','B'), "Grade"=c(5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6), "Score"=c(22,20,10,6,5,9,23,33,8,7,6,10,18,24,7,7,5,8,26,23,10,10,9,6)) I have 'data1' and want to reshape to make 'data2' which just shows example for School 1 grade 5 and 6 but I want all of data1 reshaped. The column names of 'data1' contain rich information. For example, Group2_B_Grade6 indicated 'Type' = Group2, 'Class' = B, 'Grade' = 6. I wish to reshape 'data1' and then use these stubs separated by "_" as colnames to prepare 'data2' data3=data.frame("School"=c(1,1,2,2,3,3,4,4), "Fund"=c(0,1,0,1,0,1,0,1), "Grade_5"=c(22,20,21,24,24,26,25,22), "Grade_6"=c(10,6,6,10,9,9,9,10), "Grade_7"=c(5,9,9,8,10,8,8,6))
You can do this directly with pivot_longer with some regex in names_pattern.
tidyr::pivot_longer(data1, cols = -c(School, Fund),
                    names_to = c('Type', 'Class', 'Grade'),
                    names_pattern = '(.*?)_([A-Z])_Grade(\\d+)',
                    values_to = 'Score')
# A tibble: 144 x 6
#   School  Fund Type   Class Grade Score
#    <dbl> <dbl> <chr>  <chr> <chr> <dbl>
# 1      1     0 Total  A     5        22
# 2      1     0 Group1 A     5        10
# 3      1     0 Group2 A     5         5
# 4      1     0 Total  B     5        23
# 5      1     0 Group1 B     5         8
# 6      1     0 Group2 B     5         6
# 7      1     0 Total  A     6        18
# 8      1     0 Group1 A     6         7
# 9      1     0 Group2 A     6         5
#10      1     0 Total  B     6        26
# … with 134 more rows
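To see what the three capture groups in that names_pattern pick up, the same regular expression can be tried on a couple of column names with base R. This is only an illustration of the pattern: pivot_longer uses its own regex engine internally, and perl = TRUE here is just one way to get equivalent matching for this simple pattern.

```r
pattern <- "(.*?)_([A-Z])_Grade(\\d+)"
col_names <- c("Total_A_Grade5", "Group2_B_Grade6")

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(col_names, regexec(pattern, col_names, perl = TRUE))

# Drop the full match and stack the captured pieces into a character matrix
pieces <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(pieces) <- c("Type", "Class", "Grade")
pieces
```

Each row of pieces holds the Type, Class and Grade extracted from one column name, which is what names_to receives.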
Using dplyr (and tidyr):
library(dplyr)
library(tidyr)
data2 <- data1 %>%
  pivot_longer(-c(School, Fund)) %>%
  separate(name, into = c('Type', 'Class', 'Grade')) %>%
  extract(Grade, 'Grade', "([0-9]+)")
data2
#> # A tibble: 144 x 6
#>    School  Fund Type   Class Grade value
#>     <dbl> <dbl> <chr>  <chr> <chr> <dbl>
#>  1      1     0 Total  A     5        22
#>  2      1     0 Group1 A     5        10
#>  3      1     0 Group2 A     5         5
#>  4      1     0 Total  B     5        23
#>  5      1     0 Group1 B     5         8
#>  6      1     0 Group2 B     5         6
#>  7      1     0 Total  A     6        18
#>  8      1     0 Group1 A     6         7
#>  9      1     0 Group2 A     6         5
#> 10      1     0 Total  B     6        26
#> # … with 134 more rows
Created on 2020-04-06 by the reprex package (v0.3.0)
We can use melt from data.table:
library(data.table)
melt(setDT(data1), id.var = c('School', 'Fund'))[,
  c('Type', 'Class', 'Grade') := tstrsplit(variable, "_")][,
  Grade := sub('Grade', '', Grade)][,
  variable := NULL][]
#     School Fund value   Type Class Grade
#  1:      1    0    22  Total     A     5
#  2:      1    1    20  Total     A     5
#  3:      2    0    21  Total     A     5
#  4:      2    1    24  Total     A     5
#  5:      3    0    24  Total     A     5
# ---
#140:      2    1    10 Group2     B     7
#141:      3    0     8 Group2     B     7
#142:      3    1     6 Group2     B     7
#143:      4    0    10 Group2     B     7
#144:      4    1     6 Group2     B     7
Identifying duplicate within groups by latest date
I currently have a data frame that looks like this:
   ID Value       Date
1   1     A   1/1/2018
2   1     B   2/3/1988
3   1     B   6/3/1994
4   2     A  12/6/1999
5   2     B 24/12/1957
6   3     A   9/8/1968
7   3     B  20/9/2016
8   3     C  15/4/1993
9   3     C   9/8/1994
10  4     A   8/8/1988
11  4     C   6/4/2001
Within each ID I would like to identify a row where there is a duplicate Value. The Value that I would like to identify is the duplicate with the most recent Date. The resulting data frame should look like this:
   ID Value       Date mostRecentDuplicate
1   1     A   1/1/2018                   0
2   1     B   2/3/1988                   0
3   1     B   6/3/1994                   1
4   2     A  12/6/1999                   0
5   2     B 24/12/1957                   0
6   3     A   9/8/1968                   0
7   3     B  20/9/2016                   0
8   3     C  15/4/1993                   0
9   3     C   9/8/1994                   1
10  4     A   8/8/1988                   0
11  4     C   6/4/2001                   0
How do I go about doing this?
Using dplyr, we can first convert Date to an actual date value, then group_by ID and Value and assign the value 1 in groups with more than one row, on the row whose row_number is the same as the row number of the maximum Date.
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
  group_by(ID, Value) %>%
  mutate(mostRecentDuplicate = +(n() > 1 & row_number() == which.max(Date))) %>%
  ungroup()
# A tibble: 11 x 4
#      ID Value Date       mostRecentDuplicate
#   <int> <fct> <date>                   <int>
# 1     1 A     2018-01-01                   0
# 2     1 B     1988-03-02                   0
# 3     1 B     1994-03-06                   1
# 4     2 A     1999-06-12                   0
# 5     2 B     1957-12-24                   0
# 6     3 A     1968-08-09                   0
# 7     3 B     2016-09-20                   0
# 8     3 C     1993-04-15                   0
# 9     3 C     1994-08-09                   1
#10     4 A     1988-08-08                   0
#11     4 C     2001-04-06                   0
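The two building blocks of that expression can be checked in isolation on a single group. The sketch below uses the two "B" rows of ID 1 from the question and plain base R; seq_along stands in for row_number() and length() for n(), which is an assumption about how the grouped expression behaves within one group rather than the answer's exact code.

```r
# Day-first strings, as in the question, parsed with an explicit format
dates <- as.Date(c("2/3/1988", "6/3/1994"), format = "%d/%m/%Y")

# TRUE only for the latest date, and only if the group has more than one row;
# unary + coerces the logical vector to integer 0/1
flag <- +(length(dates) > 1 & seq_along(dates) == which.max(dates))
```

Note that which.max returns the first position of the maximum, so if two rows in a group share the latest date, only the first of them is flagged.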