How to calculate duration of time between two dates - r

I'm working with a large data set in RStudio that includes multiple test scores for the same individuals. I've filtered my data set to display the same individual's scores in two consecutive rows with the test date for each test administration in one column. My data appears as follows:
id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline
I would like to calculate the total duration of time between baseline 1 and baseline 2 administration and store that value in a new column. Therefore, my first question is what is the best way to calculate the duration of time between two dates? And two, what is the best way to condense each individual's data into one row to make calculating the difference between test scores easier and to be stored in a new column?
Thank you for any assistance!

This is a solution inside the tidyverse universe. The packages we are going to use are dplyr and tidyr.
First, we create the dataset (you read it from a file instead) and convert strings to date format:
library(dplyr)
library(tidyr)
dataset <- read.table(text = "id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline", header = TRUE, na.strings = "N/A")
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")
# id test_date score baseline_number_1 baseline_number_2
# 1 1 2017-08-15 21.18 Baseline <NA>
# 2 1 2019-08-28 28.55 <NA> Baseline
# 3 2 2017-11-22 33.38 Baseline <NA>
# 4 2 2019-11-06 35.30 <NA> Baseline
# 5 3 2018-07-25 30.77 Baseline <NA>
# 6 3 2019-07-31 33.42 <NA> Baseline
Each individual's data can be condensed into one row, and the time between the two baselines computed, as follows:
dataset %>%
group_by(id) %>%
mutate(number = row_number()) %>%
ungroup() %>%
pivot_wider(
id_cols = id,
names_from = number,
values_from = c(test_date, score),
names_glue = "{.value}_{number}"
) %>%
mutate(
time_between = test_date_2 - test_date_1
)
Brief explanation: first we create the variable number, which indicates the baseline number in each row; then we use pivot_wider to make the dataset "wider", i.e. one row for each id along with its features; finally we create the variable time_between, which contains the difference in days between the two baselines. If you are not familiar with some of these functions, I suggest you break the pipeline after each operation and analyse it step by step.
Final output
# A tibble: 3 x 6
# id test_date_1 test_date_2 score_1 score_2 time_between
# <int> <date> <date> <dbl> <dbl> <drtn>
# 1 1 2017-08-15 2019-08-28 21.2 28.6 743 days
# 2 2 2017-11-22 2019-11-06 33.4 35.3 714 days
# 3 3 2018-07-25 2019-07-31 30.8 33.4 371 days
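The question also asks for the difference between the two scores, which the pipeline above does not compute. Here is a minimal base R sketch of the same idea (one row per id, then subtract). It assumes each id has exactly two rows; the names first, second, wide and score_change are made up for illustration:

```r
dataset <- read.table(text = "id test_date score
1 08/15/2017 21.18
1 08/28/2019 28.55
2 11/22/2017 33.38
2 11/06/2019 35.3
3 07/25/2018 30.77
3 07/31/2019 33.42", header = TRUE)
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")

# Sort so that the earlier administration comes first within each id
dataset <- dataset[order(dataset$id, dataset$test_date), ]

# First and last row per id (assumption: exactly two rows per id)
first  <- dataset[!duplicated(dataset$id), ]
second <- dataset[!duplicated(dataset$id, fromLast = TRUE), ]

# One row per id; the suffixes mark baseline 1 and baseline 2
wide <- merge(first, second, by = "id", suffixes = c("_1", "_2"))

# Naming the units keeps the result in days regardless of the gap size
wide$time_between <- as.numeric(difftime(wide$test_date_2, wide$test_date_1, units = "days"))
wide$score_change <- wide$score_2 - wide$score_1
wide
```

With the sample data, time_between comes out as 743, 714 and 371 days, matching the tibble above.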

Related

Six-month peak-season running average

I'm trying to implement this:
The recommendation is a peak season ozone AQG level of 60 μg/m3
(the average of daily maximum 8-hour mean ozone concentrations).
The peak season is defined as the six consecutive months of the year
with the highest six-month running-average ozone concentration.
In regions away from the equator, this period will typically be in the
warm season within a single calendar year (northern hemisphere)
or spanning two calendar years (southern hemisphere). Close to
the equator, such clear seasonal patterns may not be obvious, but a
running-average six-month peak season will usually be identifiable
from existing monitoring or modelling data.
I have:
# A tibble: 300 × 2
date value
<dttm> <dbl>
1 1997-01-01 00:00:00 NA
2 1997-02-01 00:00:00 NA
3 1997-03-01 00:00:00 NA
4 1997-04-01 00:00:00 30.2
5 1997-05-01 00:00:00 20.9
6 1997-06-01 00:00:00 10.1
7 1997-07-01 00:00:00 9.40
8 1997-08-01 00:00:00 22.4
9 1997-09-01 00:00:00 26.2
10 1997-10-01 00:00:00 32.9
# … with 290 more rows
Every year is complete (with or without NA). I found the peaks by "findpeaks" from pracma package, and get:
peaks = findpeaks(mda8_omit$value, minpeakdistance = 6,
minpeakheight = mean(mda8_omit$value))
How do I optimize this to get the best six months around each peak? For the northern hemisphere it is easier because the peak falls within a single year (summer), but in the southern hemisphere it is split across two years, and the peaks may change depending on latitude. Any ideas on how to continue?
Assuming that
we only use windows with 6 consecutive months of data
the year that a window falls is determined by the last month of the window
we compare all such windows, at most 12, within each calendar year
Calculate the rolling mean and then, grouping by year, take the row with the largest rolling mean within each year. This row is the last month of the 6-month window. The input is shown reproducibly in the Note at the end.
library(dplyr)
library(zoo)
DF %>%
mutate(date = as.yearmon(date),
peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
group_by(year = as.integer(date)) %>%
slice_max(peakmean) %>%
ungroup %>%
select(-year)
## # A tibble: 1 × 3
## date value peakmean
## <yearmon> <dbl> <dbl>
## 1 Oct 1997 32.9 20.3
Note
Lines <- "date value
1 1997-01-01T00:00:00 NA
2 1997-02-01T00:00:00 NA
3 1997-03-01T00:00:00 NA
4 1997-04-01T00:00:00 30.2
5 1997-05-01T00:00:00 20.9
6 1997-06-01T00:00:00 10.1
7 1997-07-01T00:00:00 9.40
8 1997-08-01T00:00:00 22.4
9 1997-09-01T00:00:00 26.2
10 1997-10-01T00:00:00 32.9"
DF <- read.table(text = Lines)
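If you would rather not depend on zoo, the rolling-mean step can be sketched in base R with stats::filter. This is only an illustration on the ten sample values; the names value, month, roll6, best_end and peak_season are made up, and a window containing an NA is left as NA, in line with the first assumption above:

```r
value <- c(NA, NA, NA, 30.2, 20.9, 10.1, 9.40, 22.4, 26.2, 32.9)
month <- seq(as.Date("1997-01-01"), by = "month", length.out = length(value))

# Right-aligned 6-month mean: element i averages months (i - 5) to i
roll6 <- as.numeric(stats::filter(value, rep(1 / 6, 6), sides = 1))

# The window with the largest mean ends here (which.max skips NA)
best_end <- which.max(roll6)

# The six months that make up that window
peak_season <- month[(best_end - 5):best_end]
format(peak_season, "%b %Y")
```

Because the windows slide over the whole series rather than being cut at calendar-year boundaries, a peak season that spans two years (southern hemisphere) should be picked up in the same way.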

summing based on conditions from two dataframes and dealing with dates

I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
> climate
site precip humid date
1 1 0.1 1 6/13/2020
2 1 0.2 1 6/12/2020
3 1 0.1 3 6/11/2020
4 1 0.1 1 6/14/2020
5 2 0.5 2 6/13/2020
6 2 0.2 3 6/12/2020
7 2 0.3 3 6/11/2020
8 2 0.1 1 6/14/2020
9 1 0.2 1 2/13/2019
10 1 0.1 3 2/14/2019
11 1 0.1 1 2/15/2019
12 1 0.5 2 2/16/2019
> trap
site date
1 1 7/1/2020
2 2 7/1/2020
3 3 7/2/2020
4 3 7/4/2020
I want to calculate the mean humid 18-20 days before the date written in the trap dataframe. So essentially what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate data.frame for animals trapped on 7/1/2020. So for site 1 that would be: 1.667 and site 2 that would be 2.67.
I also want to calculate the sum of precipitation 497-500 days before the date written in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. So for site 1 precip would be 0.9.
I know how to create new columns in the trap data frame for mean humid and sum precip, but I'm not sure where to start in terms of coding so that each value is calculated as described above and the data that corresponds to the correct date is used, given that the large dataset contains many different trap dates.
Thank you very much, hopefully I am being clear in my description.
I have a solution using functions from the tidyverse. It is always useful to convert date variables to the class date. With this class, you can make calculations. Note, that I renamed the date column in the trap data to trap_date. See comments for more details:
library(tidyverse)
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), trap_date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
# merge data
data <- merge(climate, trap, by="site")
# parse dates to class 'date'; enables calculations
data <- data %>%
mutate(date = parse_date(date, format="%m/%d/%Y"),
trap_date = parse_date(trap_date, format="%m/%d/%Y"))
> head(data)
site precip humid date trap_date
1 1 0.1 1 2020-06-13 2020-07-01
2 1 0.2 1 2020-06-12 2020-07-01
3 1 0.1 3 2020-06-11 2020-07-01
4 1 0.1 1 2020-06-14 2020-07-01
5 1 0.2 1 2019-02-13 2020-07-01
6 1 0.1 3 2019-02-14 2020-07-01
For means:
# humid means
data %>%
group_by(site) %>%
filter(date >= trap_date-20 & date <= trap_date-18) %>%
summarise(mean = mean(humid))
# A tibble: 2 x 2
site mean
<dbl> <dbl>
1 1 1.67
2 2 2.67
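Grouping by site alone works here because each site in the merged data has a single trap date. If a site can have several trap dates, one option is to compute the window separately for every row of trap. A rough base R sketch, where window_mean is a hypothetical helper and only the humid column is handled:

```r
climate <- data.frame(
  site  = c(1, 1, 1, 1, 2, 2, 2, 2),
  humid = c(1, 1, 3, 1, 2, 3, 3, 1),
  date  = as.Date(c("6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020",
                    "6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020"),
                  format = "%m/%d/%Y")
)
trap <- data.frame(
  site      = c(1, 2, 3, 3),
  trap_date = as.Date(c("7/1/2020", "7/1/2020", "7/2/2020", "7/4/2020"),
                      format = "%m/%d/%Y")
)

# Hypothetical helper: mean humid for trap row i, looking from_days to
# to_days before that row's trap date, at the same site only
window_mean <- function(i, from_days, to_days) {
  d    <- trap$trap_date[i]
  keep <- climate$site == trap$site[i] &
    climate$date >= d - from_days &
    climate$date <= d - to_days
  if (any(keep)) mean(climate$humid[keep]) else NA_real_
}

trap$mean_humid <- sapply(seq_len(nrow(trap)), window_mean,
                          from_days = 20, to_days = 18)
trap
```

Sites without climate records in the window (site 3 in the sample) end up as NA rather than being dropped.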
However, it seems that the range of 497 to 500 days before the trap date contains no observations. When I used your specified dates, I got the same result of 0.9:
# precip sums
data %>%
group_by(site) %>%
filter(date >= trap_date-500 & date <= trap_date-497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
# date <date>, trap_date <date>
# using your provided dates
data %>%
group_by(site) %>%
filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
summarise(sum = sum(precip))
# A tibble: 1 x 2
site sum
<dbl> <dbl>
1 1 0.9
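To see why the 497 to 500 day window comes back empty, it may help to check how many days actually separate the 2019 climate dates from the trap date. A quick sketch (the variable names are made up):

```r
trap_date     <- as.Date("7/1/2020", format = "%m/%d/%Y")
climate_dates <- as.Date(c("2/13/2019", "2/14/2019", "2/15/2019", "2/16/2019"),
                         format = "%m/%d/%Y")

# Subtracting Dates gives a difftime in days
days_before <- as.numeric(trap_date - climate_dates)
days_before
# 504 503 502 501
```

So, for a trap date of 7/1/2020, those four records fall 501 to 504 days earlier, and a filter on trap_date-504 to trap_date-501 would pick them up.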
Hope I can help.

Time difference calculated from wide data with missing rows

There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows
df1<-data.frame("id"=c(1:4),
"adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
"bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
"cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
"ddate"=c("2014-06-15",NA,"2014-04-11",NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse solution can be done by converting the data to long format, removing NA dates, and getting the time difference between last and first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
pivot_longer(-id) %>%
na.omit %>%
group_by(id) %>%
mutate(value = as.Date(value)) %>%
summarise(years = time_length(difftime(last(value), first(value)),"years"),
days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(out = pmap(.[-1], ~ {
dates <- as.Date(na.omit(c(...)))
tibble(years = lubridate::time_length(difftime(last(dates),
first(dates)), "years"),
days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
})) %>%
unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Most of the functions introduced above can be fairly complex, though they are worth learning if you can. Here is a base R approach instead. It assumes the date columns have been converted to class Date:
df <- df1
df[-1] <- lapply(df[-1], as.Date) # Convert the date columns to class Date
grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping:
days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) #Get the difference
cbind(df[1],days, years = round(days/365,2)) # Create your table
id days years
1.1 1 1093 2.99
2.2 2 731 2.00
3.3 3 1098 3.01
4.4 4 369 1.01
If you are comfortable with other higher-level functions, you could do:
dat <- aggregate(adate~id,reshape(df,list(2:ncol(df)), dir="long"),function(x)max(x) - x[1])
transform(dat,year = round(adate/365,2))
id adate year
1 1 1093 2.99
2 2 731 2.00
3 3 1098 3.01
4 4 369 1.01
Using base R apply:
df1[-1] <- lapply(df1[-1], as.Date)
df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
x <- na.omit(x)
x1 <- as.numeric(difftime(as.Date(x[length(x)]), as.Date(x[1]), units = 'days')) # name units; the third positional argument of difftime is tz
c(x1/365, x1)
}))
df1[c('id', 'years', 'days')]
# id years days
#1 1 2.994521 1093
#2 2 2.002740 731
#3 3 3.008219 1098
#4 4 1.010959 369
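Since the dates only increase from adate to ddate, the last observed date is also the row-wise maximum, so a vectorised base R version is possible with pmax. This is just a sketch: last_seen and res are made-up names, and dividing by 365.25 is an assumed convention for approximating years:

```r
df1 <- data.frame(id = 1:4,
                  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  ddate = c("2014-06-15", NA, "2014-04-11", NA))

# Convert every date column to class Date
dates <- lapply(df1[-1], as.Date)

# Row-wise latest date, ignoring missing observation periods
last_seen <- do.call(pmax, c(unname(dates), na.rm = TRUE))

days <- as.numeric(difftime(last_seen, dates$adate, units = "days"))
res  <- data.frame(id = df1$id,
                   years = round(days / 365.25, 2),  # assumed year length
                   days = days)
res
```

This avoids looping over rows, which may matter on a large longitudinal data set.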

time differences for multiple events for same ID in R

I'm new to Stackoverflow and looked at similar posts but couldn't find a solution that can capture time differences from multiple events from the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The helper column is the StoreID and Unit combined because I couldn't figure out how to group by both Store ID and the Unit. I want to sort the data to show when the unit was disabled (value =0) and enabled again (value =1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2017-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Headers column that combines the Value column with a row number that distinguishes duplicates, then spread to wide format.
This doesn't use the helper column; it groups by StoreID and Unit instead.
library(dplyr)
library(tidyr)
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
group_by(StoreID, Unit, Value) %>%
mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
ungroup() %>% select(-Value) %>%
spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA
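The key step is numbering repeated events within each StoreID, Unit and Value combination. As a rough illustration of just that step without dplyr, base R's ave can group by several columns at once, so no helper column is needed (occurrence is a made-up name):

```r
Time <- c('2016-10-04', '2016-10-18', '2016-10-04', '2016-10-18', '2016-10-19',
          '2016-10-28', '2016-10-04', '2016-10-19', '2016-10-21', '2016-10-22',
          '2017-01-02', '2017-03-04')
Value   <- c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0)
StoreID <- c('a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'a', 'a', 'd', 'c')
Unit    <- c(1, 1, 2, 2, 5, 5, 6, 6, 1, 1, 6, 5)

# Running count of rows within each StoreID / Unit / Value combination
occurrence <- ave(seq_along(Time), StoreID, Unit, Value, FUN = seq_along)

# Same header text as in the dplyr pipeline
Headers <- sprintf('Time %s (v=%s)', occurrence, Value)
data.frame(StoreID, Unit, Time, Headers)
```

Store c never has a value of 1 in the sample, which is why its three dates all land in (v=0) columns.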

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = if (n() > 1) test_date[n()] < result_date[1] else NA) # NA when the student took the test only once
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
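The answers above put the flag on the retake row, while the desired output in the question puts it on the earlier attempt (the row whose result had not yet been released). A minimal base R sketch of that variant, comparing each row with the next row of the same id; next_same_id and next_test are made-up names, and the data are assumed to be sortable by id and test_date:

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20"))
)

# Make sure attempts are in chronological order within each student
data1 <- data1[order(data1$id, data1$test_date), ]
n <- nrow(data1)

# Does the next row belong to the same student? (never true for the last row)
next_same_id <- c(data1$id[-1] == data1$id[-n], FALSE)

# Test date of the next row, shifted up by one position
next_test <- c(data1$test_date[-1], as.Date(NA))

# 1 if the student sat the next exam before this result was released
data1$re_test <- as.integer(next_same_id & next_test < data1$result_date)
data1
```

Because the comparison is masked by next_same_id, dates from different students never get mixed, which was the problem with the ungrouped lead() attempt in the question.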
