How to iterate rows between start_date and end_date in R

I have a dataframe that looks like this:
And here is the output I'm hoping for.

This should work. The key is to use uncount() from the tidyr package (loaded with the tidyverse). Then you need to do some date arithmetic; calculating the difference in months has tricky edge cases. What I propose here may not be the best way to do it, but you get the idea.
library(tidyverse)
library(lubridate)
df <- tibble(
  name = c('Alice', 'Bob', 'Caroline'),
  start_date = as.Date(c('2019-01-01', '2018-03-01', '2019-06-01')),
  end_date = as.Date(c('2019-07-01', '2019-05-01', '2019-09-01'))
)
# # A tibble: 3 x 3
# name start_date end_date
# <chr> <date> <date>
# 1 Alice 2019-01-01 2019-07-01
# 2 Bob 2018-03-01 2019-05-01
# 3 Caroline 2019-06-01 2019-09-01
df %>%
  mutate(tenure_in_month = interval(start_date, end_date) %/% months(1) + 1) %>%
  uncount(tenure_in_month) %>%
  group_by(name) %>%
  mutate(iteratedDate = start_date %m+% months(row_number() - 1)) %>%
  select(name, iteratedDate)
# A tibble: 26 x 2
# Groups: name [3]
name iteratedDate
<chr> <date>
1 Alice 2019-01-01
2 Alice 2019-02-01
3 Alice 2019-03-01
4 Alice 2019-04-01
5 Alice 2019-05-01
6 Alice 2019-06-01
7 Alice 2019-07-01
8 Bob 2018-03-01
9 Bob 2018-04-01
10 Bob 2018-05-01
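If the month arithmetic feels fragile, here is an alternative sketch (my own, not from the answer above; assumes the tidyverse, using purrr::map2 and tidyr::unnest) that builds each row's month sequence directly with seq(), so no row count ever has to be computed:

```r
library(tidyverse)  # tibble, dplyr, purrr::map2, tidyr::unnest

df <- tibble(
  name = c('Alice', 'Bob', 'Caroline'),
  start_date = as.Date(c('2019-01-01', '2018-03-01', '2019-06-01')),
  end_date = as.Date(c('2019-07-01', '2019-05-01', '2019-09-01'))
)

# One Date vector per row, then flatten: every month from start to end inclusive
out <- df %>%
  mutate(iteratedDate = map2(start_date, end_date, seq, by = "month")) %>%
  unnest(iteratedDate) %>%
  select(name, iteratedDate)
```

Because seq() stops at end_date, no rows past the end date can appear regardless of month lengths.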

I used the seq function to solve this problem.
library(data.table)
library(lubridate)
library(magrittr)  # provides the %>% pipe used below
# data
original_data <- data.table(
  CustomerName = c('Ben', 'Julie', 'Angelo', 'Carlo'),
  StartDate = c(ymd(20190101), ymd(20180103), ymd(20190106), ymd(20170108)),
  EndDate = c(ymd(20190107), ymd(20190105), ymd(20190109), ymd(20180112))
)
# CustomerName StartDate EndDate
#1: Ben 2019-01-01 2019-01-07
#2: Julie 2018-01-03 2019-01-05
#3: Angelo 2019-01-06 2019-01-09
#4: Carlo 2017-01-08 2018-01-12
finish_data <- original_data %>%
  .[, .(IteratedDate = seq(from = StartDate, to = EndDate, by = 'day')),
    by = .(CustomerName)]
# CustomerName IteratedDate
#1: Ben 2019-01-01
#2: Ben 2019-01-02
#3: Ben 2019-01-03
#4: Ben 2019-01-04
#5: Ben 2019-01-05
#6: Ben 2019-01-06
#7: Ben 2019-01-07
#8: Julie 2018-01-03
#9: Julie 2018-01-04

Related

Select rows based on multiple conditions from two independent databases

I have two independent datasets; one contains an event date. Each ID has only one "Eventdate". As follows:
data1 <- data.frame("ID" = c(1, 2, 3, 4, 5, 6),
                    "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01",
                                    "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the other dataset, one ID can have multiple event names (Eventcode), each with its own event date (Eventdate). As follows:
data2 <- data.frame("ID" = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
                    "Eventcode" = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
                    "Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11",
                                    "2019-02-15", "2019-03-01", "2019-03-15",
                                    "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID, but their IDs do not fully overlap.
I would like to select cases in data2 with conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach:
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = {
              inds <- Eventdate.y >= Eventdate.x
              val <- sum(inds, na.rm = TRUE)
              if (val == 1) Eventcode[inds]
              else if (val > 1) sample(Eventcode[inds], 1)
              else NA_real_
            })
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The complicated logic for Eventcode is there for randomness; if you are OK selecting the first value (as is done for Eventdate), you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
Does this work?
library(dplyr)
data1 %>%
  rename(Eventdate_dat1 = Eventdate) %>%
  left_join(data2, by = 'ID') %>%
  group_by(ID) %>%
  filter(Eventdate >= Eventdate_dat1) %>%
  mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate),
                               TRUE ~ Eventdate),
         Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode),
                               TRUE ~ Eventcode)) %>%
  distinct() %>%
  right_join(data1, by = 'ID') %>%
  select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA
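A third sketch (mine, not from either answer above; assumes dplyr >= 1.0 for slice_min()) that filters to the qualifying rows, keeps the earliest per ID, and joins back. Note that with_ties = FALSE takes the first matching Eventcode rather than a random one:

```r
library(dplyr)

data1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                    Eventdate = c("2019-01-01", "2019-02-01", "2019-03-01",
                                  "2019-04-01", "2019-05-01", "2019-06-01"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
                    Eventcode = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
                    Eventdate = c("2019-01-01", "2019-01-01", "2019-02-11",
                                  "2019-02-15", "2019-03-01", "2019-03-15",
                                  "2019-03-10", "2019-03-20", "2019-06-02"),
                    stringsAsFactors = FALSE)

matched <- inner_join(data1, data2, by = "ID", suffix = c("", ".data2")) %>%
  filter(Eventdate.data2 >= Eventdate) %>%                   # condition 2
  group_by(ID) %>%
  slice_min(Eventdate.data2, n = 1, with_ties = FALSE) %>%   # conditions 3-4
  ungroup() %>%
  select(ID, Eventdate.data2, Eventcode)

result <- left_join(data1, matched, by = "ID")   # IDs 4-6 get NA, as expected
```

ISO-formatted date strings sort correctly as characters, which is why the >= comparison and slice_min() work without converting to Date.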

How to delete rows in a dataframe that correspond to missing rows in another dataframe?

I have two dataframes with two columns each (Date and Data). The lengths of the columns differ. What I want to do is delete the rows in df1 whose Date is not in df2.
An example will clarify. These are my dataframes:
df1 <- cbind(
  data.frame(Date = seq(as.Date("2018-11-01"), as.Date("2020-02-01"), by = "months"),
             stringsAsFactors = FALSE),
  data.frame(Data = rnorm(16, 0, 1), stringsAsFactors = FALSE)
)
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 -0.19712728
4 2019-02-01 0.99852535
5 2019-03-01 -0.50760024
6 2019-04-01 -0.43127396
7 2019-05-01 0.90685965
8 2019-06-01 0.51510503
9 2019-07-01 -0.39070644
10 2019-08-01 1.27976428
11 2019-09-01 -0.63845519
12 2019-10-01 -0.05489751
13 2019-11-01 -0.87745923
14 2019-12-01 0.18082375
15 2020-01-01 0.08852416
16 2020-02-01 1.50827788
df2 <- cbind(
  data.frame(Date = df1$Date[c(1:5, 7:9, 11:13, 15:16)]),
  data.frame(Data = c(1.09433662, -0.27538189, 0.99852535, -0.50760024, -0.43127396,
                      0.90685965, -0.39070644, 1.27976428, -0.63845519, -0.05489751,
                      -0.87745923, 0.18082375, 1.50827788))
)
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 0.99852535
4 2019-02-01 -0.50760024
5 2019-03-01 -0.43127396
6 2019-05-01 0.90685965
7 2019-06-01 -0.39070644
8 2019-07-01 1.27976428
9 2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01 0.18082375
13 2020-02-01 1.50827788
What I want now is that df1 is reduced to the same length as df2 by deleting the rows that are not in df2. The rows to be deleted correspond to the missing months in df2.
The result would be this for df1:
#df1 where the rows corresponding to the missing months in df2 have been deleted
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 -0.19712728
4 2019-02-01 0.99852535
5 2019-03-01 -0.50760024
6 2019-05-01 0.90685965
7 2019-06-01 0.51510503
8 2019-07-01 -0.39070644
9 2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01 0.08852416
13 2020-02-01 1.50827788
Can anyone help me?
Thanks a lot!
semi_join from dplyr does what you are looking for. Note that you copied the data from df2 as the output example.
library(dplyr)
semi_join(df1, df2, by = "Date")
Date Data
1 2018-11-01 0.38376758
2 2018-12-01 -0.28738352
3 2019-01-01 1.79556305
4 2019-02-01 -0.34680836
5 2019-03-01 0.57803280
6 2019-05-01 1.96801082
7 2019-06-01 0.38448708
8 2019-07-01 0.39829417
9 2019-09-01 0.94912096
10 2019-10-01 -0.04469681
11 2019-11-01 0.32008546
12 2020-01-01 1.09054839
13 2020-02-01 -1.45438502
and anti_join shows the records that should be removed.
anti_join(df1, df2, by = "Date")
Date Data
1 2019-04-01 2.1303783
2 2019-08-01 1.6907800
3 2019-12-01 -0.8593388
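For reference, the same keep/drop split can be done in base R with %in% (my own sketch; the example data is rebuilt here since rnorm() values differ per run):

```r
set.seed(1)
df1 <- data.frame(Date = seq(as.Date("2018-11-01"), as.Date("2020-02-01"), by = "months"),
                  Data = rnorm(16))
df2 <- data.frame(Date = df1$Date[c(1:5, 7:9, 11:13, 15:16)],
                  Data = rnorm(13))

df1_kept    <- df1[df1$Date %in% df2$Date, ]    # semi_join(df1, df2, by = "Date")
df1_dropped <- df1[!df1$Date %in% df2$Date, ]   # anti_join(df1, df2, by = "Date")
```

This only matches on the single Date column; the dplyr joins generalize more cleanly to multi-column keys.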

Calculate each overlapping date ranges from two independent databases in r

I have two independent databases; one contains follow-up data (start date and end date). As follows:
> data1 <- data.frame("ID" = c(1,1,1,1,2,2,2), "FUstart" = c("2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01", "2019-04-01", "2019-07-01", "2019-10-01"), "FUend" = c("2019-03-31", "2019-06-30", "2019-09-30", "2019-12-31", "2019-06-30", "2019-09-30", "2019-12-31"))
> data1
ID FUstart FUend
1 1 2019-01-01 2019-03-31
2 1 2019-04-01 2019-06-30
3 1 2019-07-01 2019-09-30
4 1 2019-10-01 2019-12-31
5 2 2019-04-01 2019-06-30
6 2 2019-07-01 2019-09-30
7 2 2019-10-01 2019-12-31
Another contains drug use data (also start date and end date). As follows:
> data2 <- data.frame("ID" = c(1,1,1,2), "Drugstart" = c("2019-01-11", "2019-03-26", "2019-06-26", "2019-03-20"), "Drugend" = c("2019-01-20", "2019-04-05", "2019-10-05", "2019-10-10"))
> data2
ID Drugstart Drugend
1 1 2019-01-11 2019-01-20
2 1 2019-03-26 2019-04-05
3 1 2019-06-26 2019-10-05
4 2 2019-03-20 2019-10-10
The two databases are linked by "ID". The problem is that the number of rows per ID may differ between them. I would like to calculate the overlapping days and add them to data1. I would expect the following results:
> data1
ID FUstart FUend Overlapping.Days
1 1 2019-01-01 2019-03-31 16
2 1 2019-04-01 2019-06-30 10
3 1 2019-07-01 2019-09-30 92
4 1 2019-10-01 2019-12-31 5
5 2 2019-04-01 2019-06-30 91
6 2 2019-07-01 2019-09-30 92
7 2 2019-10-01 2019-12-31 10
Note that data1 is the base database, and data2's overlapping days are added into it. Many thanks for helping!
An option using data.table::foverlaps:
foverlaps(data1, data2)[,
  sum(1L + pmin(Drugend, FUend) - pmax(Drugstart, FUstart)),
  .(ID, FUstart, FUend)]
Output (these match OP's expected numbers):
ID FUstart FUend V1
1: 1 2019-01-01 2019-03-31 16
2: 1 2019-04-01 2019-06-30 10
3: 1 2019-07-01 2019-09-30 92
4: 1 2019-10-01 2019-12-31 5
5: 2 2019-04-01 2019-06-30 91
6: 2 2019-07-01 2019-09-30 92
7: 2 2019-10-01 2019-12-31 10
data:
library(data.table)
setDT(data1)
cols <- paste0("FU", c("start", "end"))
data1[, (cols) := lapply(.SD, as.IDate, format = "%Y-%m-%d"), .SDcols = cols]
setkeyv(data1, c("ID", cols))
# same steps for data2 (copy-pasted rather than generalized)
setDT(data2)
cols <- paste0("Drug", c("start", "end"))
data2[, (cols) := lapply(.SD, as.IDate, format = "%Y-%m-%d"), .SDcols = cols]
setkeyv(data2, c("ID", cols))
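The per-pair arithmetic inside the sum() above is the standard interval-overlap formula. A minimal worked example, using ID 1's second follow-up window and its second drug period from the question:

```r
# Overlap in days, counting both endpoints:
#   pmin(end1, end2) - pmax(start1, start2) + 1
fu_start   <- as.Date("2019-04-01"); fu_end   <- as.Date("2019-06-30")
drug_start <- as.Date("2019-03-26"); drug_end <- as.Date("2019-04-05")

overlap <- as.integer(pmin(drug_end, fu_end) - pmax(drug_start, fu_start)) + 1L
# 2019-04-01 .. 2019-04-05 -> 5 days; the third drug period (2019-06-26 ..
# 2019-10-05) contributes another 5, giving the 10 shown for that row.
```

foverlaps() finds each overlapping (follow-up, drug) pair, and the grouped sum() adds the per-pair values within each follow-up window.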

Running complex functions per row

I would like to use complex functions in a nested data frame.
My data looks like this:
Name Date
John 01.01.
Mark 03.09.
Edith 03.04.
Edith 08.08.
Mark 04.01.
Edith 01.03.
John 01.03.
John 01.04.
Mark 02.03.
Edith 04.05.
Edith 07.05.
Mark 04.02.
Edith 09.01.
John 01.09.
In a new column Day, for each name I would like to know the number of days between a given row's Date and the earliest date for that person.
So that John will look like:
Day
0
..
2
2
..
9
I am experimenting with nest(), then running a function with modify(), but I am very new to R; it doesn't work and I don't really understand what the problem even is.
Thanks for help!
Note that it is not clear from your sample data, whether you are using %d.%m. or %m.%d. format. Please change that in the code if needed.
library(tidyverse)
library(lubridate)
df <- read_table(
'name date
John 01.01.
Mark 03.09.
Edith 03.04.
Edith 08.08.
Mark 04.01.
Edith 01.03.
John 01.03.
John 01.04.
Mark 02.03.
Edith 04.05.
Edith 07.05.
Mark 04.02.
Edith 09.01.
John 01.09.')
df <- df %>%
  mutate(date = as_date(date, "%d.%m.")) %>%
  group_by(name) %>%
  mutate(diff_dates = date - min(date))
Result:
> df
# A tibble: 14 x 3
name date diff_dates
<chr> <date> <drtn>
1 John 2019-01-01 0 days
2 Mark 2019-09-03 245 days
3 Edith 2019-04-03 92 days
4 Edith 2019-08-08 219 days
5 Mark 2019-01-04 3 days
6 Edith 2019-03-01 59 days
7 John 2019-03-01 59 days
8 John 2019-04-01 90 days
9 Mark 2019-03-02 60 days
10 Edith 2019-05-04 123 days
11 Edith 2019-05-07 126 days
12 Mark 2019-02-04 34 days
13 Edith 2019-01-09 8 days
14 John 2019-09-01 243 days
Using the dplyr package we get
library(dplyr)
data <- data %>%
mutate(Date = as.Date(Date, format = "%m.%d")) %>%
group_by(Name) %>%
mutate(early = min(Date)) %>%
mutate(Day = difftime(Date, early, units = "days"))
data
# # A tibble: 14 x 4
# # Groups: Name [3]
# Name Date early Day
# <fct> <date> <date> <time>
# 1 John 2019-01-01 2019-01-01 0 days
# 2 Mark 2019-03-09 2019-02-03 34 days
# 3 Edith 2019-03-04 2019-01-03 60 days
# 4 Edith 2019-08-08 2019-01-03 217 days
# 5 Mark 2019-04-01 2019-02-03 57 days
# 6 Edith 2019-01-03 2019-01-03 0 days
# 7 John 2019-01-03 2019-01-01 2 days
# 8 John 2019-01-04 2019-01-01 3 days
# 9 Mark 2019-02-03 2019-02-03 0 days
# 10 Edith 2019-04-05 2019-01-03 92 days
# 11 Edith 2019-07-05 2019-01-03 183 days
# 12 Mark 2019-04-02 2019-02-03 58 days
# 13 Edith 2019-09-01 2019-01-03 241 days
# 14 John 2019-01-09 2019-01-01 8 days
Edited as per Cole's recommendations.
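For completeness, the same per-group calculation can be done in base R with ave() (my own sketch, shown on a small hand-made subset of the data):

```r
df <- data.frame(name = c("John", "John", "John", "Mark", "Mark"),
                 date = as.Date(c("2019-01-01", "2019-01-03", "2019-01-09",
                                  "2019-02-03", "2019-03-09")))

# ave() replaces each value with its group's minimum, preserving the Date
# class, so the subtraction yields a difftime in days
df$Day <- df$date - ave(df$date, df$name, FUN = min)
```

No packages are needed, at the cost of dplyr's readability.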

How to create a column based on two conditions from other data frame?

I'm trying to create a column that identifies if the row meets two conditions. For example, I have a table similar to this:
> dat <- data.frame(Date = c(rep(c("2019-01-01", "2019-02-01","2019-03-01", "2019-04-01"), 4)),
+ Rep = c(rep("Mike", 4), rep("Tasha", 4), rep("Dane", 4), rep("Trish", 4)),
+ Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4), rep("Brian", 4), rep("Tim", 3), "Trevor"),
+ Sales = floor(runif(16, min = 0, max = 10)))
> dat
Date Rep Manager Sales
1 2019-01-01 Mike Amber 6
2 2019-02-01 Mike Amber 3
3 2019-03-01 Mike Michelle 9
4 2019-04-01 Mike Michelle 2
5 2019-01-01 Tasha Debbie 9
6 2019-02-01 Tasha Debbie 6
7 2019-03-01 Tasha Debbie 0
8 2019-04-01 Tasha Debbie 4
9 2019-01-01 Dane Brian 3
10 2019-02-01 Dane Brian 6
11 2019-03-01 Dane Brian 6
12 2019-04-01 Dane Brian 1
13 2019-01-01 Trish Tim 6
14 2019-02-01 Trish Tim 7
15 2019-03-01 Trish Tim 6
16 2019-04-01 Trish Trevor 1
Out of the Reps that have switched manager, I would like to identify whether this manager is the first or the second manager with respect to the date. The ideal output would look something like:
Date Rep Manager Sales New_Column
1 2019-01-01 Mike Amber 6 1
2 2019-02-01 Mike Amber 3 1
3 2019-03-01 Mike Michelle 9 2
4 2019-04-01 Mike Michelle 2 2
5 2019-01-01 Trish Tim 6 1
6 2019-02-01 Trish Tim 7 1
7 2019-03-01 Trish Tim 6 1
8 2019-04-01 Trish Trevor 1 2
I have tried a few things but they're not quite working out. I have created two separate data frames where one consists of the first instance of that Rep and associated manager (df1) and the other one consists of the last instance of that rep and associated manager (df2). The code that I have tried that has gotten the closest is:
dat$New_Column <- ifelse(dat$Rep %in% df1$Rep & dat$Manager %in% df1$Manager, 1,
ifelse(dat$Rep %in% df2$Rep & dat$Manager %in% df2$Manager, 2, NA))
However, this reads as two separate conditions rather than a condition of a condition (i.e., "if Mike exists in the first instance and Amber exists in the first instance, assign 1" rather than "if Mike exists with the manager Amber in the first instance, assign 1"). Any help would be really appreciated. Thank you!
An option is to first group by 'Rep' and filter the rows where the number of unique 'Manager' values is 2, then add a column by matching 'Manager' against the unique elements of 'Manager' to get the indices.
library(dplyr)
dat %>%
  group_by(Rep) %>%
  filter(n_distinct(Manager) == 2) %>%
  mutate(New_Column = match(Manager, unique(Manager)))
# A tibble: 8 x 5
# Groups: Rep [2]
# Date Rep Manager Sales New_Column
# <chr> <chr> <chr> <int> <int>
#1 2019-01-01 Mike Amber 6 1
#2 2019-02-01 Mike Amber 3 1
#3 2019-03-01 Mike Michelle 9 2
#4 2019-04-01 Mike Michelle 2 2
#5 2019-01-01 Trish Tim 6 1
#6 2019-02-01 Trish Tim 7 1
#7 2019-03-01 Trish Tim 6 1
#8 2019-04-01 Trish Trevor 1 2
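The match(Manager, unique(Manager)) trick also works per group in base R via ave() (a sketch on a minimal subset; ave() coerces the integer result back to character, hence the as.integer()):

```r
dat <- data.frame(Rep = c("Mike", "Mike", "Mike", "Mike", "Trish", "Trish"),
                  Manager = c("Amber", "Amber", "Michelle", "Michelle",
                              "Tim", "Trevor"),
                  stringsAsFactors = FALSE)

# Within each Rep, number managers by order of first appearance
dat$New_Column <- as.integer(ave(dat$Manager, dat$Rep,
                                 FUN = function(m) match(m, unique(m))))
```

This assigns the index to every Rep, including those with a single manager; filter on the number of distinct managers afterwards if you only want the switchers, as in the dplyr answer.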
