I am trying to use dplyr::lag to determine the number of days that have passed for each event since the initial event but I am getting unexpected behavior.
Example, very simple data:
df <- data.frame(id = c("1", "1", "1", "1", "2", "2"),
                 date = c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020"))
df$date <- as.Date(df$date, format = "%m/%d/%Y")
id date
1 1 4/1/2020
2 1 4/2/2020
3 1 4/3/2020
4 1 4/4/2020
5 2 4/17/2020
6 2 4/18/2020
What I was hoping to do was create a new column, days_since_first_event, that calculates the number of days between each id's initial event and each subsequent date:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(days_since_first_event = as.numeric(date - lag(date)))
with this expected output:
id date days_since_first_event
1 1 4/1/2020 0
2 1 4/2/2020 1
3 1 4/3/2020 2
4 1 4/4/2020 3
5 2 4/17/2020 0
6 2 4/18/2020 1
But instead I get this output
# A tibble: 6 x 3
# Groups: id [2]
id date days_since_first_event
<chr> <date> <dbl>
1 1 2020-04-01 NA
2 1 2020-04-02 1
3 1 2020-04-03 1
4 1 2020-04-04 1
5 2 2020-04-17 NA
6 2 2020-04-18 1
Any suggestions on what I'm doing wrong?
The first n values of lag() get a default value because there is no 'older' data to shift in; that default is NA, hence the NA in the first row of each group. Furthermore, lag() only yields the difference between consecutive events, not the distance from the first event.
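To measure each event against the group's first event instead, one option is dplyr::first() (a minimal sketch; min(date) would also work if the rows are not sorted by date):
library(dplyr)

df %>%
  group_by(id) %>%
  # subtract each group's first date, giving 0 for the initial event
  mutate(days_since_first_event = as.numeric(date - first(date)))
This yields 0 for the first row of each group, matching your expected output.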
I have a data frame as follows:
id <- c(1, 2, 3, 4, 5)
week1 <- c(234,567456, 134123, 13412421, 2345245)
week2 <- c(4234,5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data <- data.frame(id, week1, week2, week3)
I would like to find the percent change between week1 and week2, then week2 and week3, and so on (my real dataframe is much larger, with about 27 columns).
I tried:
data$change1 <- (data$week2-data$week1)*100/data$week1
However, doing this column by column would be tedious with a larger dataset.
Try the following:
library(tidyverse)
df <- gather(data, key = 'week', value = 'value', -id)
df$week <- as.integer(gsub('week', '', df$week))
df %>%
  group_by(id) %>%
  arrange(week) %>%
  mutate(perc_change = (value - lag(value, 1)) / lag(value, 1) * 100)
# A tibble: 15 x 4
# Groups: id [5]
id week value perc_change
<dbl> <int> <dbl> <dbl>
1 1 1 234 NA
2 2 1 567456 NA
3 3 1 134123 NA
4 4 1 13412421 NA
5 5 1 2345245 NA
6 1 2 4234 1709.
7 2 2 5123456 803.
8 3 2 454123 239.
9 4 2 12342421 -7.98
10 5 2 8394545 258.
11 1 3 1234 -70.9
12 2 3 234124 -95.4
13 3 3 12348 -97.3
14 4 3 9348522 -24.3
15 5 3 134534 -98.4
This works reasonably well, but it assumes there is an observation every week; otherwise the percent change is based on the last available week (so if week 3 is missing, the value for week 4 will be a week-on-week change with week 2 as the basis).
(Edit: replaced substr with gsub)
Sense checking:
For row 6, you see id 1. This is week 2 with a value of 4234. In week 1, id 1 had a value of 234. The percent change is
(4234-234)/234*100
[1] 1709.402
So, that is aligned with the 1709. shown above.
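As an aside, gather() has since been superseded in tidyr; a pivot_longer() version of the same reshape (a sketch, assuming tidyr >= 1.1 for names_transform) would be:
library(dplyr)
library(tidyr)

data %>%
  # reshape week1..week3 into long form, stripping the 'week' prefix
  pivot_longer(-id, names_to = 'week', names_prefix = 'week',
               names_transform = list(week = as.integer),
               values_to = 'value') %>%
  group_by(id) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(perc_change = (value - lag(value)) / lag(value) * 100)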
My data looks like this:
period      id category
     1    1234        1
     1    2345        2
     1 4567345        1
     2    1234        3
     2    2345        3
     2 4567345        1
     3  123467        2
     3  234567        2
     3   45673        1
I need to create a new column "category_pre" containing the category value from the previous period for each id. If an id is not present in the previous period, the value should be NA. The new column should be added to the existing dataframe.
What would be the best way to do this?
Thanks!
We can use the lag() function from dplyr here:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(category_pre = lag(category, order_by = period))
df
period id category category_pre
<dbl> <dbl> <dbl> <dbl>
1 1 1234 1 NA
2 1 2345 2 NA
3 1 4567345 1 NA
4 2 1234 3 1
5 2 2345 3 2
6 2 4567345 1 1
7 3 123467 2 NA
8 3 234567 2 NA
9 3 45673 1 NA
Step one, make a new data frame with period + 1:
df2 <- data.frame(period = df$period + 1, id = df$id, category = df$category)
Step two, merge the two data frames on period and id; all.x = TRUE keeps the rows that have no previous period (they get NA):
merge(df, df2, by = c('period', 'id'), all.x = TRUE)
Step three, rename the columns to category and category_pre.
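Putting the three steps together, and naming the shifted column category_pre up front so that step three's rename happens automatically:
# Step one: shift every row forward one period so it lines up with the following period
df2 <- data.frame(period = df$period + 1, id = df$id, category_pre = df$category)
# Steps two/three: left join; ids with no match in the previous period get NA
merge(df, df2, by = c('period', 'id'), all.x = TRUE)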
If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class, then, grouped by 'person_ID', create a binary column that returns 1 if the difference between the next and the current 'visit_date' is less than 90 days and 0 otherwise. Using this column, we pull the corresponding next 'visit_date' where the value is 1:
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(i1 = replace_na(+(difftime(lead(visit_date), visit_date,
                                    units = 'days') < 90), 0),
         date = case_when(as.logical(i1) ~ lead(visit_date)),
         i1 = NULL) %>%
  ungroup()
Output:
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
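The same logic can be written more compactly with if_else(), skipping the helper column (a sketch under the same assumptions about the input):
library(dplyr)
library(lubridate)

df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  # lead(visit_date) - visit_date is a difftime in days; the NA lead in the
  # last row of each group propagates to NA in 'date'
  mutate(date = if_else(lead(visit_date) - visit_date < 90,
                        lead(visit_date), as.Date(NA))) %>%
  ungroup()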
I have a dataset like the following, where "group" is a group variable. I want to count runs of consecutive days within each group, resetting the count to one whenever a day is skipped (as shown in the "want" column). Then I want to return the maximum of the "want" column per group (as in "want2"). Suggestions would be appreciated!
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2),
                 date = c("2000-01-01", "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-09", "2000-01-10", "2000-01-12"),
                 want = c(1, 1, 2, 3, 1, 2, 1),
                 want2 = c(3, 3, 3, 3, 2, 2, 2))
Bonus part 2: Thank you for all the feedback; it was extremely helpful. Is there a way to do the same with an added condition? I have a binary variable, and I also want the count to reset when that variable == 0. Like so:
# group date binary want
#1 1 2000-01-01 1 1
#2 1 2000-01-03 1 1
#3 1 2000-01-04 1 2
#4 1 2000-01-05 0 1
#5 2 2000-01-09 1 1
#6 2 2000-01-10 0 1
#7 2 2000-01-12 1 1
#8 3 2000-01-05 1 1
#9 3 2000-01-06 1 2
#10 3 2000-01-07 1 3
#11 3 2000-01-08 1 4
I tried akrun's suggestion, which worked very well without the binary variable. I tried to modify it by adding the binary variable to the cumsum condition, but I get errors:
df %>% group_by(group)
%>% mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) !=1 & binary==1)))
Thanks!
An option is to group by 'group', use diff() on the 'date' column converted to Date class to create a logical vector, and wrap it in cumsum() to replicate the results in 'want' ('wantn'); then apply max() on 'wantn' to get 'want2n':
library(dplyr)
library(data.table)
df %>%
group_by(group) %>%
mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) !=1))),
want2n = max(wantn))
# A tibble: 7 x 6
# Groups: group [2]
# group date want want2 wantn want2n
# <dbl> <fct> <dbl> <dbl> <int> <int>
#1 1 2000-01-01 1 3 1 3
#2 1 2000-01-03 1 3 1 3
#3 1 2000-01-04 2 3 2 3
#4 1 2000-01-05 3 3 3 3
#5 2 2000-01-09 1 2 1 2
#6 2 2000-01-10 2 2 2 2
#7 2 2000-01-12 1 2 1 2
Or, if we want to avoid rowid(), create the grouping variable with cumsum() and take the row sequence:
df %>%
group_by(group) %>%
group_by(group2 = cumsum(c(TRUE, diff(as.Date(date)) !=1)), add = TRUE) %>%
mutate(wantn = row_number()) %>%
group_by(group) %>%
mutate(want2n = max(wantn)) %>%
select(-group2)
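For the bonus condition, the same trick extends if the reset flag also fires on rows where binary is 0 (a sketch against the bonus data shown above; note the | must sit outside the c(...) so it applies row-wise, which is likely what broke the attempt in the question):
library(dplyr)
library(data.table) # for rowid()

df %>%
  group_by(group) %>%
  # start a new run when the gap to the previous day is not 1 OR binary is 0
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1) | binary == 0)),
         want2n = max(wantn))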
I have a dataframe that contains id (with duplicates), date (with duplicates), and value. The values are recorded for consecutive days. I want to group the dataframe by id and by runs of n consecutive days, find the mean of the values in each run, and return NA if the last group does not contain n days.
id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2
. . .
. . .
. . .
20 2012-2-6 10
Desired output, with n = 3 consecutive days:
id date value group_n_consecutive_days mean_n_consecutive_days
1 2016-10-5 2 1 2
1 2016-10-6 3 1 2
1 2016-10-7 1 1 2
1 2016-10-8 2 2 NA
1 2016-10-9 5 2 NA
2 2013-10-6 2 1 4
.
.
.
.
20 2012-2-6 10 6 25
The data in the question is sorted and consecutive within id, so we assume that is the case. Also, when the question refers to duplicate dates, we assume that different id values can share a date but that within an id the dates are unique and consecutive. Now, using the data shown reproducibly in Note 2 at the end, group by id and compute the group numbers using gl(). Then, grouping by id and group_no, take the mean of each group of 3, or NA for smaller groups.
library(dplyr)
DF %>%
group_by(id) %>%
mutate(group_no = c(gl(n(), 3, n()))) %>%
group_by(group_no, add = TRUE) %>%
mutate(mean = if (n() == 3) mean(value) else NA) %>%
ungroup
giving:
# A tibble: 6 x 5
id date value group_no mean
<int> <date> <int> <int> <dbl>
1 1 2016-10-05 2 1 2
2 1 2016-10-06 3 1 2
3 1 2016-10-07 1 1 2
4 1 2016-10-08 2 2 NA
5 1 2016-10-09 5 2 NA
6 2 2013-10-06 2 1 NA
Note 1
An alternative to gl(...) could be cumsum(rep(1:3, length = n()) == 1), and an alternative to if (n() == 3) mean(value) else NA could be mean(head(c(value, NA, NA), 3)).
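To parametrize the run length instead of hardcoding 3 (a sketch; n_days is a name introduced here, chosen to avoid clashing with dplyr's n()):
library(dplyr)

n_days <- 3  # length of each run of consecutive days

DF %>%
  group_by(id) %>%
  mutate(group_no = c(gl(n(), n_days, n()))) %>%
  group_by(group_no, add = TRUE) %>%   # .add = TRUE in dplyr >= 1.0
  mutate(mean = if (n() == n_days) mean(value) else NA) %>%
  ungroup()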
Note 2
The input data in reproducible form was assumed to be:
Lines <- "id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)