How to mutate a complex variable involving dates? - r

I have a tibble in which each row represents an image of an eye and contains the following relevant variables: patientId, laterality (left or right), date, imageId.
I would like to manipulate this to create another tibble showing the number of followUpYears for each eye (patientId, laterality). followUpYears is defined in a somewhat unusual way:
In order to meet the requirements for follow-up in a particular year, there must be two different imaging dates during that year i.e. between days 0-365 for year 1, days 366-730 for year 2 etc. The first image date is always the baseline and followUpYears is always an integer.
Only one image per date is considered.
Follow-up ceases as soon as the requirement for 2 imaging dates in a year is not met i.e. if there is only 1 imaging date in the first year, followUpYears is 0 regardless of how many images are taken subsequently.
There is no requirement for there to be at least n years between the first and last image date for an eye to have n followUpYears.
The following dummy data demonstrates these points:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
expected_output <- tibble(patientId = c('A','B','B'),
laterality = c('L','R','L'),
followUpYears = c(0, 2, 1))
Patient A's left eye has 0 followUpYears because of points 2 and 3. Patient B's right eye has 2 followUpYears because of point 4 (despite the fact that there is only slightly more than 1 year between the first and last image date). Patient B's left eye only has 1 year of follow up since it doesn't meet the requirement for 2 image dates in year 2.
I am familiar with the basic dplyr verbs but I can't think of how to frame this type of variable. Note that patients might have one or both eyes included and some might have 10+ years of follow up. Finally, a solution that considers 1 year to be 365 days regardless of leap years is fine.
Thank you!

Here's a way with ifelse. diff_year is a helper function that computes the difference between two dates in year rounded to the value above.
library(dplyr)
diff_year <- function(date1, date2) ceiling(as.numeric(difftime(date1, date2)) / 365)
data %>%
group_by(patientId) %>%
summarise(followUpYears = ifelse(diff_year(date[date != first(date)][1], first(date)) <= 1,
diff_year(max(date), min(date)), 0))
#A tibble: 2 × 2
# patientId followUpYears
# <chr> <dbl>
#1 A 0
#2 B 2
Update with OP's comment. This should work with all conditions:
diff_year <- function(date1, date2) as.numeric((date1 - date2) / 365)
data %>%
distinct(patientId, laterality, date, .keep_all = TRUE) %>%
group_by(patientId, laterality) %>%
mutate(diffYear = floor(diff_year(date, min(date)))) %>%
add_count(count = diffYear) %>%
filter(!cumany(lag(n == 1, default = 0)) | row_number() == 1) %>%
summarise(followUpYears = ifelse(any(n > 1), ceiling(diff_year(max(date[n != 1]), min(date))), 0))
# patientId laterality followUpYears
#1 A L 0
#2 B L 1
#3 B R 2

Below is my approach which should cover all four conditions, I'm not sure however, how you get:
#> # A tibble: 1 x 3
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 B L 1
since according to your logic it should fall into the two year band from 2000-09-08 to 2001-09-10 are 367 days which equals two years.
The idea is that we first calculate a followup_flag which checks if the date is within 365 days of the former date, and then takes the cummin() so that the series breaks as soon there is no direct follow up year.
Then we can filter all rows which meet the followup_flag == 1.
And for this data set we check how many years are between the first and the last date, and since we want to count 367 as 2 years we have to take the ceiling().
library(dplyr)
library(lubridate)
data %>%
group_by(patientId, laterality) %>%
mutate(followup_flag = cummin(date - dplyr::lag(date, default = first(date)) <= 365)) %>%
filter(followup_flag == 1) %>%
summarise(followUpYears = as.numeric(
difftime(last(date), first(date), units = "days") / 365) %>%
ceiling()
)
#> `summarise()` has grouped output by 'patientId'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: patientId [2]
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 A L 0
#> 2 B L 2
#> 3 B R 2
Data used:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
Created on 2023-02-08 by the reprex package (v2.0.1)

Related

How do I create a new variable (in a covid dataset) that keeps track of number of days passed since a country hit 1000 cases

I have a covid dataset like below
In some countries the virus has been spreading for longer than in others. As a result, some countries have a larger number of cases than others just because they are “further ahead on the curve”. To compare how one country (e.g, the UK) is doing compared to other countries where the disease has been going on for longer (e.g., Italy) it might be helpful to measure time as days since first recording x number of cases per day (where x
could be 1000). In this normalized time scale day 0 is the first day that a country recorded 1000 cases. Create a new variable that keeps track of number of days passed since a country hit 1000 cases.
An idea would be to find the first occurrence of at least 1000 new cases per date and location. Using dplyr, I'd write a function that does exactly that.
First a reprex based on your data.
df <- data.frame(
location = c("Spain", "Spain", "Kyrgyzstan", "Kyrgyzstan"),
date = as.Date(c(
"2020-05-28", "2020-05-29", "2020-08-21", "2020-08-22"
)),
total_cases = c(237906, 238564, 42703, 42889),
new_cases = c(1647,-372, 196, 186)
)
Then a function that finds and counts the time between the previous occurrence of at least 1000 cases, given a date and location.
days_since_1000_cases <- function(loc, dat) {
df_filtered <- df %>%
filter(
location == loc,
date <= dat,
new_cases >= 1000) %>%
# Order on the dates and pick the first as that was the previous occurence
arrange(date) %>%
head(1)
if (nrow(df_filtered) > 0) {
return(dat - df_filtered$date)
}
return(NA)
}
And finally, using the function on the data leads to
df %>%
group_by(location, date) %>%
mutate(days_since = days_since_1000_cases(location, date))
# A tibble: 4 × 5
# Groups: location, date [4]
location date total_cases new_cases days_since
<fct> <date> <dbl> <dbl> <drtn>
1 Spain 2020-05-28 237906 1647 0 days
2 Spain 2020-05-29 238564 -372 1 days
3 Kyrgyzstan 2020-08-21 42703 196 NA days
4 Kyrgyzstan 2020-08-22 42889 186 NA days

R: How to calculate incidence of 2days duration from a cumulative data after extracting the maximum number of the same day?

I have a cumulative data like;
df1 <- data.frame(code=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,5,5),
date=c("2020-01-01", "2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03",
"2020-01-04","2020-01-01","2020-01-01","2020-01-02","2020-01-02","2020-01-03","2020-01-04","2020-01-01",
"2020-01-02","2020-01-04","2020-01-03","2020-01-01","2020-01-02","2020-01-03","2020-01-04"),
cumulative=c(2,3,3,4,4,4,4,6,6,7,8,10,13,14,16,1,2,3,5,1,2,3,5))
From here, I want to extract the maximum cumulative number of each code and each date like;
df2 <- data.frame(code=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
date=c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03",
"2020-01-04","2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01",
"2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03","2020-01-04"),
cumulative=c(3,3,4,4,4,4,6,6,8,13,14,16,1,2,3,5,1,2,3,5))
Now I have cumulative numbers for each code of each day.
From here I want to calculate incidence of 2days duration.
df3 <- data.frame(code=c(1,2,3,4,5),
incidence1=c(1,2,6,2,2),incidence2=c(1,2,3,3,3))
Incidence1 means the difference between 2020-01-01 and 2020-01-03,
Incidence2 means the difference between 2020-01-02 and 2020-01-04
What I want to know is
1) How to extract the maximum number within the same day
2) How to calculate the difference between 2days
Please teach me, thanks.
Here is one way to do this by creating groups of every alternate row and get the difference of the cumulative value between them. To get the expected output in the same format as shown we can use pivot_wider from tidyr.
library(dplyr)
library(tidyr)
df2 %>%
group_by(code) %>%
group_by(gr = rep(seq(1, n()/2), 2), add = TRUE) %>%
summarise(incidence = diff(cumulative)) %>%
pivot_wider(names_from = gr, values_from = incidence, names_prefix = "incidence")
# code incidence1 incidence2
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 6 3
#4 4 2 3
#5 5 2 3

Generate records in an R data frame between two dates

I have a data frame that consists of customers scheduled subscription payments as follows:
CusID <- c(1,2,3)
FromDate <- c(ymd("2019-01-01"), ymd("2019-01-04"), ymd("2019-02-02"))
ToDate <-c(ymd("2019-01-16"), ymd("2019-01-15"), ymd("2019-04-03"))
Amount <- c(5,10,12)
Frequency <- c("Weekly", "Fortnightly", "Monthly")
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
For each row (customer), I wish to loop from the FromDate to the ToDate and output one row of each data for each scheduled payment that falls between those dates, resulting in the following data frame:
CusID <- c(1,1,1,2,3,3,3)
PaymentDate <- c(ymd("2019-01-01"), ymd("2019-01-08"), ymd("2019-01-15"),
ymd("2019-01-04"),ymd("2019-02-02"),ymd("2019-03-02"),ymd("2019-04-02"))
Amount <- c(5,5,5,10,12,12,12)
Output <- data.frame(CusID, PaymentDate, Amount)
What is an efficient way to achieve this using R (and preferably using dplyr / tidyverse functions)?
In SAS my approach would be to use a DO / WHILE LOOP and OUTPUT statement to write a new line for each scheduled payment. e.g.
data Output;
set Input;
PaymentDate = FromDate;
do while (PaymentDate < ToDate);
Payment = Amount;
PaymentDate = PaymentDate + (7 / 14 / 30 ~ logic based on Frequency);
output;
loop;
run;
(The key here in SAS is the output statement - it explicitly writes a new record each time it is invoked, thus can be used in a loop to write multiple output lines per input line).
Is there an equivalent method available in R, or is a different approach recommended?
Another option using tidyverse
Input %>%
mutate(Frequency = case_when(Frequency == "Weekly" ~ 7L,
Frequency == "Fortnightly" ~ 14L,
Frequency == "Monthly" ~ 30L,
TRUE ~ 0L)) %>%
group_by(CusID) %>%
group_modify(~ {PaymentDate <- seq.Date(from = .x$FromDate, to = .x$ToDate, by = .x$Frequency)
crossing(.x[,1], PaymentDate)})
# A tibble: 7 x 3
# Groups: CusID [3]
CusID PaymentDate Amount
<dbl> <date> <dbl>
1 1 2019-01-01 5
2 1 2019-01-08 5
3 1 2019-01-15 5
4 2 2019-01-04 10
5 3 2019-02-02 12
6 3 2019-03-04 12
7 3 2019-04-03 12
Payment dates are a little different from your expected output because seq.Date adds 30 days taking into account the different number of days in those months.
UPDATE:
Here is a more verbatim solution
Input %>%
mutate(PaymentDate = FromDate,
RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
Frequency == "Fortnightly" ~ '2 weeks',
Frequency == "Monthly" ~ '1 month')) %>%
group_by(CusID, Amount) %>%
expand(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency))
# A tibble: 7 x 3
# Groups: CusID, Amount [3]
CusID Amount PaymentDate
<dbl> <dbl> <date>
1 1 5 2019-01-01
2 1 5 2019-01-08
3 1 5 2019-01-15
4 2 10 2019-01-04
5 3 12 2019-02-02
6 3 12 2019-03-02
7 3 12 2019-04-02
I tweaked your Input data.frame so that the Frequency values are strings, not factors.
You could create a helper table freq_mapping to convert from your Frequency to the frequency format R likes. This would avoid the 30 day issue that one of the other answers pointed out.
freq_mapping <- data.frame(Frequency=c('Weekly', 'Fortnightly', 'Monthly'),
RFrequency = c('1 week', '2 weeks', '1 month'),
stringsAsFactors = FALSE)
Then merge Input with this:
Input <- Input %>%
inner_join(freq_mapping, by = 'Frequency')
Now you can create the payment dates:
Input$PaymentDate <- Input$FromDate
Input %>%
group_by(CusID) %>%
complete(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency)) %>%
fill(PaymentDate,Amount) %>%
select(CusID, PaymentDate, Amount)
not so easy problem for me. The solution is not beautiful but it should somehow do the work. You'll see there is a problem for the monthly payment which is not always 30, but otherwise it should work. But nicer solution surely exist.
library(data.table)
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
Input=data.table(Input)
Input[Frequency=="Weekly",freq:=7][Frequency=="Fortnightly",freq:=14][Frequency=="Monthly",freq:=30]
Input[,Ratio:=(ToDate-FromDate)/freq]
#What is the maximum rows ? for a customer ?
NREP=as.integer(max(ceiling(Input$Ratio)))
Input[,Rep:=1][,PaymentDate:=FromDate]
for(i in 1:NREP){
Inputtemp=copy(Input)
Inputtemp[,FromDate:=FromDate+freq]
Input=rbind(Input,Inputtemp)
}
#Remove invalid rows
Input=unique(Input)
Input=Input[!(FromDate>ToDate),]
setorder(Input,CusID)
Input=Input[,c("CusID","FromDate","Amount")]
setnames(Input,"FromDate","PaymentDate")
Input==data.table(Output)
A mashup of Humpelstielzchen and user2474226's answers, to bring all logic into a single dplyr step.
Output <- Input %>%
mutate(PaymentDate = FromDate,
RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
Frequency == "Fortnightly" ~ '2 weeks',
Frequency == "Monthly" ~ '1 month')) %>%
group_by(CusID) %>%
complete(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency)) %>%
fill(PaymentDate,Amount) %>%
select(CusID, PaymentDate, Amount)

How to count the dates that are not in a list

I have a dataframe between two dates (date1 date2) with clients and the date of arrive.
date1<- "2019-07-29"
date2<- "2019-08-08"
clients<-data.frame(id= c(1:10),
arrive=c("2019-07-31", "2019-07-29", "2019-08-01",
"2019-08-03", "2019-08-05", "2019-08-08",
"2019-08-02", "2019-08-06", "2019-07-29",
"2019-08-02"),
hotel= c(rep(900067, 5), rep(9001649,5)))
I want to count between does dates, how many days each hotel did not have a new clients.
Hotel 900067 did not have new clients the following days: 2019-07-30, 2019-08-02, 2019-08-04, 2019-08-06, 2019-08-07, 2019-08-08. This are 6 days between date1 and date2 without any new client.
The dataframe result should be something like:
Result<- data.frame(hotel= c(900067, 9001649),
days_without_new_clients= c(6, 7))
Thank you in advance.
Perhaps you could create a data frame containing all hotel-dates and then see how many that do not exist in the clientsdata.
library(dplyr)
all_hotel_dates <- expand.grid(arrive = seq.Date(as.Date("2019-07-29"), as.Date("2019-08-08"), "day"), hotel = c(900067, 9001649))
clients %>%
mutate(arrive = as.Date(arrive)) %>%
full_join(all_hotel_dates) %>%
group_by(hotel) %>%
summarise(days_without_new_clients = sum(is.na(id)))
# A tibble: 2 x 2
hotel days_without_new_clients
<dbl> <int>
1 900067 6
2 9001649 7
Here is an idea via dplyr using complete to fill in the dates after we filter to the required period, i.e.
library(dplyr)
clients %>%
mutate(arrive = as.Date(arrive)) %>%
group_by(hotel) %>%
filter(arrive >= as.Date(date1) & arrive <= as.Date(date2)) %>%
complete(arrive = seq.Date(as.Date(date1), as.Date(date2), 1)) %>%
summarise(days_no_clients = sum(is.na(id)))
# A tibble: 2 x 2
# hotel days_no_clients
# <dbl> <int>
#1 900067 6
#2 9001649 7
You can create a udf which receives 2 dates, and returns an array of strings.
Something like this (pseudocode):
def getDatesBetween(dateA, dateB):
days = time.difference(dateA, dateB).days
dates = range(days).map(d => time.add(dateA, d).toString)
return dates
Then create a new column which has the intersection between these dates and the arrival dates.

choose the best month for a season

species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
a<- data.frame(species,month,number)
my data frame means I caught two kinds of frogs,frog1 and frog2 with different numbers in different months.
I would like to convert months into 4 seasons. The first season is month 1, month 12, month 2, second is 4,3,5 , third is 7,6,8 ,and forth is 10,9,11. there is order in theses 4 seasons,namely, in the first season I would like to choose month 1 first, month 12 secondly, month 2 finally,in the same way, in the second season I would choose month 4 first, month 3 secondly, month 5 eventually, and so on.For example, in the frog 1 , there are 2 months 1 and 12, I would like to pick up the month 1 instead of month 12 for the first season.
I would like to ask how do I create a column that can choose the most important month in turn for 4 seasons in two kinds of frogs.For instance,in the frog 1 , there are 2 months 1 and 12, I would like to pick up the month 1 instead of month 12 for the first season.
My expected output is :
species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
choosemonth <- c("season1","","","","season4","","","season3","","season2",
"","season2","","season3","","","","season4","","season1")
b<- data.frame(species,month,number,choosemonth)
I'm guessing a little at your final desired result, but here's how to create the season and the importance, and I'm solving for the most important month for each species
Here's a way with dplyr:
library(dplyr)
a %>%
# Season is basically just one-off quarters
mutate(season = trunc((month + 1)%%12 / 3),
# for each month the value mod 3 goes in order 2,3,1
importance = c(2,3,1)[month %% 3 + 1]) %>%
group_by(season, species) %>%
# keep only those with the max importance
filter(importance == max(importance))
EDIT: It looks like you just want to flag the value with the most importance, so here's how to do that,
a %>%
# Season is basically just one-off quarters
mutate(season = trunc((month + 1)%%12 / 3),
# for each month the value mod 3 goes in order 2,3,1
importance = c(2,3,1)[month %% 3 + 1]) %>%
mutate(choosemonth = ifelse(importance == 3, paste0('season',season + 1),''))
EDIT 2: edited one more time, was dividing into 3 seasons rather than 4

Resources