Create a count-consecutive variable which resets to 1 in R

I have a dataset like the following, where "group" is a grouping variable. I want to count consecutive days within each group, but if a row is not the next calendar day I want the count to reset to one (as shown in the "want" column). Then, I want to return the maximum of the "want" column per group (as in "want2"). Suggestions would be appreciated!
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2),
                 date = c("2000-01-01", "2000-01-03", "2000-01-04", "2000-01-05",
                          "2000-01-09", "2000-01-10", "2000-01-12"),
                 want = c(1, 1, 2, 3, 1, 2, 1),
                 want2 = c(3, 3, 3, 3, 2, 2, 2))
Bonus part 2: Thank you for all the feedback, it was extremely helpful. Is there a way to do the same with an added condition? I have a binary variable, and I also want my count to reset when that variable == 0. Like so:
# group date binary want
#1 1 2000-01-01 1 1
#2 1 2000-01-03 1 1
#3 1 2000-01-04 1 2
#4 1 2000-01-05 0 1
#5 2 2000-01-09 1 1
#6 2 2000-01-10 0 1
#7 2 2000-01-12 1 1
#8 3 2000-01-05 1 1
#9 3 2000-01-06 1 2
#10 3 2000-01-07 1 3
#11 3 2000-01-08 1 4
I have tried akrun's suggestion, which worked very well without the binary variable. I tried to modify it by adding the binary variable as part of the cumsum, but I get errors:
df %>%
  group_by(group) %>%
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1 & binary == 1))))
Thanks!

An option is to group by 'group', then use diff on the Date-converted 'date' to create a logical vector, use cumsum on it to replicate the results in 'want' ('wantn'), and then apply max on 'wantn' to get 'want2n'.
library(dplyr)
library(data.table)
df %>%
  group_by(group) %>%
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1))),
         want2n = max(wantn))
# A tibble: 7 x 6
# Groups: group [2]
# group date want want2 wantn want2n
# <dbl> <fct> <dbl> <dbl> <int> <int>
#1 1 2000-01-01 1 3 1 3
#2 1 2000-01-03 1 3 1 3
#3 1 2000-01-04 2 3 2 3
#4 1 2000-01-05 3 3 3 3
#5 2 2000-01-09 1 2 1 2
#6 2 2000-01-10 2 2 2 2
#7 2 2000-01-12 1 2 1 2
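As a quick illustration of the rowid step (on a toy vector, not the question's data): data.table's rowid returns the running count within each run of identical values, so applying it to the cumsum output yields the within-run day sequence.
library(data.table)
# each change in value starts the count over at 1
rowid(c(1, 1, 2, 2, 2, 3))
#[1] 1 2 1 2 3 1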
Or, if we want to avoid rowid, create the grouping variable with cumsum and get the row sequence within it:
df %>%
  group_by(group) %>%
  group_by(group2 = cumsum(c(TRUE, diff(as.Date(date)) != 1)), add = TRUE) %>%
  mutate(wantn = row_number()) %>%
  group_by(group) %>%
  mutate(want2n = max(wantn)) %>%
  select(-group2)
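For the bonus part, a minimal sketch along the same lines (assuming df now contains the 0/1 'binary' column from the example above; not tested beyond that example): the run should break either when the day is not consecutive or when binary is 0, so the binary condition belongs outside the c(TRUE, diff(...)) part, which is one element shorter than the group. That length mismatch is also why the attempt above errors.
library(dplyr)
library(data.table)
df %>%
  group_by(group) %>%
  # reset when the gap is not one day OR when binary is 0
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1) | binary == 0)),
         want2n = max(wantn)) %>%
  ungroup()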

Related

How to add a column with most recent recurring observation within a group, but within a certain time period, in R

If I had:
person_ID visit_date
1         2/25/2001
1         2/27/2001
1         4/2/2001
2         3/18/2004
3         9/22/2004
3         10/27/2004
3         5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to the Date class, group by 'person_ID', and create a binary column that returns 1 if the difference between the current and the next 'visit_date' is less than 90 days (else 0); using this column, we get the corresponding next 'visit_date' wherever the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(i1 = replace_na(+(difftime(lead(visit_date),
                                    visit_date, units = 'days') < 90), 0),
         date = case_when(as.logical(i1) ~ lead(visit_date)),
         i1 = NULL) %>%
  ungroup
Output:
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
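A side note on the unary + used above: it is simply a compact way to coerce a logical vector to 0/1 while keeping NA, which replace_na then turns into 0.
# unary + coerces TRUE/FALSE/NA to 1/0/NA
+(c(TRUE, FALSE, NA))
#[1]  1  0 NA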

Add a new row in each group (Day)

I am trying to make a function with this data and would really appreciate help with this!
example <- data.frame(
  Day = rep(c(2, 4, 8, 16, 32, 44), times = 9),
  Replicate = rep(rep(1:3, each = 6), times = 3),
  Treament = rep(c("CC", "HP", "LL"), each = 18),
  AFDM = c(94.669342, 94.465752, 84.897023, 81.435993, 86.556221, 75.328294,
           94.262162, 88.791240, 75.735474, 81.232403, 67.050593, 76.346244,
           95.076522, 88.968823, 83.879073, 73.958836, 70.645724, 67.184695,
           99.763156, 92.022673, 92.245362, 74.513934, 50.083136, 36.979418,
           94.872932, 86.353037, 81.843173, 67.795465, 46.622106, 18.323099,
           95.089932, 93.244212, 81.679814, 65.352385, 18.286525, 7.517794,
           99.559972, 86.759404, 84.693433, 79.196504, 67.456961, 54.765706,
           94.074014, 87.543693, 82.492548, 72.333367, 51.304676, 51.304676,
           98.340870, 86.322153, 87.950873, 84.693433, 63.316485, 63.723665))
Example:
I want to insert a new row with an AFDM value (e.g., 0.9823666) that was calculated with another function.
This new row must come before each Day 2 (and be called Day 0), and I want to preserve the Replicate and Treament of each group.
Thus, this new row must be: Day = 0, Replicate = same, Treament = same, AFDM = 0.9823666.
This is so I can later run a regression with the data (from Day 0 to 44, 3 replicates for each Treatment).
I would prefer a solution in dplyr.
Thanks in advance!
We can create a grouping column with cumsum, then expand the dataset with complete and fill the other columns
library(dplyr)
library(tidyr)
example %>%
  group_by(grp = cumsum(Day == 2)) %>%
  complete(Day = c(0, unique(Day)), fill = list(AFDM = 0.9823666)) %>%
  fill(Replicate, Treament, .direction = 'updown')
# A tibble: 63 x 5
# Groups: grp [9]
# grp Day Replicate Treament AFDM
# <int> <dbl> <dbl> <chr> <dbl>
# 1 1 0 1 CC 0.982
# 2 1 2 1 CC 94.7
# 3 1 4 1 CC 94.5
# 4 1 8 1 CC 84.9
# 5 1 16 1 CC 81.4
# 6 1 32 1 CC 86.6
# 7 1 44 1 CC 75.3
# 8 2 0 2 CC 0.982
# 9 2 2 2 CC 94.3
#10 2 4 2 CC 88.8
# … with 53 more rows
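To see how the grp column works (on a shortened toy vector): cumsum(Day == 2) increments at every row where Day is 2, so each 2-through-44 block gets its own group index.
# a new group starts at every Day == 2 row
cumsum(c(2, 4, 44, 2, 4, 44) == 2)
#[1] 1 1 1 2 2 2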
You can also use distinct to get the unique Replicate/Treament combinations, add the Day and AFDM columns with the default values, and bind the rows back to the original dataframe.
library(dplyr)
example %>%
  distinct(Replicate, Treament) %>%
  mutate(Day = 0, AFDM = 0.9823666) %>%
  bind_rows(example) %>%
  arrange(Replicate, Treament)
# Replicate Treament Day AFDM
#1 1 CC 0 0.9823666
#2 1 CC 2 94.6693420
#3 1 CC 4 94.4657520
#4 1 CC 8 84.8970230
#5 1 CC 16 81.4359930
#6 1 CC 32 86.5562210
#7 1 CC 44 75.3282940
#8 1 HP 0 0.9823666
#9 1 HP 2 99.7631560
#10 1 HP 4 92.0226730
#.....

Is there a way in R to group by 'runs'?

Say I have:
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                 date = c("2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05",
                          "2000-01-09", "2000-01-10", "2000-01-11", "2000-01-13"),
                 want_group = c(1, 1, 2, 2, 3, 3, 3, 4))
I want to create a want_group variable that groups by group and date, splitting whenever the dates stop being consecutive ("daily"). So, for example, within group 1 I want one unique id for the 1st and 2nd, then a new unique id for the 4th and 5th, and similarly within group 2 one id for the 9th, 10th, and 11th.
group date want_group
1 1 2000-01-01 1
2 1 2000-01-02 1
3 1 2000-01-04 2
4 1 2000-01-05 2
5 2 2000-01-09 3
6 2 2000-01-10 3
7 2 2000-01-11 3
8 2 2000-01-13 4
Thanks,
We can use diff and cumsum to calculate the runs: the counter increments every time the difference in date is more than 1.
df$new <- cumsum(c(TRUE, diff(as.Date(df$date)) > 1))
df
# group date want_group new
#1 1 2000-01-01 1 1
#2 1 2000-01-02 1 1
#3 1 2000-01-04 2 2
#4 1 2000-01-05 2 2
#5 2 2000-01-09 3 3
#6 2 2000-01-10 3 3
#7 2 2000-01-11 3 3
#8 2 2000-01-13 4 4
We add TRUE at the beginning since diff returns output one element shorter than the original vector.
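For example, with three dates diff returns only two differences, so the prepended TRUE restores the original length (and marks the first row as the start of a run):
d <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-04"))
diff(d)
#Time differences in days
#[1] 1 2
c(TRUE, diff(d) > 1)
#[1]  TRUE FALSE  TRUE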
To handle this by group we can do
library(dplyr)
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(group) %>%
  mutate(new = c(TRUE, diff(date) > 1)) %>%
  ungroup() %>%
  mutate(new = cumsum(new))
With base R, we can also do
df$date <- as.Date(df$date)
df$new <- with(df, cumsum(c(TRUE, date[-1] - date[-length(date)] > 1)))
df$new
#[1] 1 1 2 2 3 3 3 4
Or use the difference with lag in dplyr, adding 1 since the cumsum of the logical otherwise starts the first run at 0 (the default makes the first difference 0, i.e. FALSE):
library(dplyr)
df %>%
  mutate(date = as.Date(date),
         want_group = 1 + cumsum(date - lag(date, default = first(date)) > 1))

Flagging row that meets two conditions

For a given id, I am trying to identify the latest observation (last wave, i.e. the highest wave number) that meets a criterion (var equal to 1 or 2).
My data:
data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   wave = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                   var = c(NA, 1, 2, 1, 2, NA, 3, 1, 3))
Outcome:
outcome <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                      wave = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                      var = c(NA, 1, 2, 1, 2, NA, 3, 1, 3),
                      flag = c(0, 0, 1, 0, 1, 0, 0, 1, 0))
I can't seem to figure out how to flag only the latest/last qualifying row for a given id:
data %>% group_by(id) %>% mutate(flag = if_else(var %in% c(1, 2) & ..., 1, 0))
Subset 'wave' to the rows where 'var' meets the criterion, get the max, compare (==) with the 'wave' column, and convert to integer:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(flag = as.integer(wave == max(wave[var %in% 1:2])))
# A tibble: 9 x 4
# Groups: id [3]
# id wave var flag
# <dbl> <dbl> <dbl> <int>
#1 1 1 NA 0
#2 1 2 1 0
#3 1 3 2 1
#4 2 1 1 0
#5 2 2 2 1
#6 2 3 NA 0
#7 3 1 3 0
#8 3 2 1 1
#9 3 3 3 0
Here, we assume that there are unique 'wave' values for each 'id'
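One more edge case worth hedging: if some 'id' had no row with 'var' in 1:2, max() on an empty vector would return -Inf with a warning. A minimal guard, as a sketch (not needed for the data shown):
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(flag = if (any(var %in% 1:2))
                  as.integer(wave == max(wave[var %in% 1:2]))
                else 0L) %>%   # ids with no qualifying wave get flag 0 everywhere
  ungroup()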

Delete a row where the order is wrong within a group

I have a dataset with about 1000 groups; each group is ordered from 1 to 100 (the values can be any numbers within that range).
As I was looking through the data, I found that some groups had bad orders, i.e., the order would climb toward 100 and then suddenly a 24 would show up.
How can I delete all of these error rows?
In other words (before -> after), I would like to find all rows that don't follow the increasing order within the group and just delete them.
Any help would be great!
order - lag(order) computes the difference between the current value and the previous value; the filter then keeps only non-negative differences, i.e., rows where the current value is at least the previous value. The order == min(order) condition is needed because lag gives the first value of each group an NA difference. I keep the helper column diff so you can check it, but you can drop it with %>% select(-diff).
library(dplyr)
df1 %>%
  group_by(gruop) %>%
  mutate(diff = order - lag(order)) %>%
  filter(diff >= 0 | order == min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by exactly 1 each time, we can use ave, removing by group those rows which do not have a difference of 1 from the previous row.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
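Note that all the diff-based filters above compare each row only to the one immediately before it, so a single bad row can let later small values slip through (e.g. after dropping the 24, a following 30 would still be kept even though it is below the 100 already reached). If the intent is to keep only values that never fall below the running maximum of the group, a cummax-based variant may be closer to what is wanted; a sketch on the gruop/order data from the dplyr answer:
# keep rows whose value equals the running maximum of the group,
# i.e. drop any row that falls below what was already reached
df1[ave(df1$order, df1$gruop, FUN = cummax) == df1$order, ]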
