Time spent in each calendar year - r

I followed some individuals A and B from start to end
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
df
id start end
1 A 2015-01-01 2021-06-12
2 B 2013-01-01 2017-10-10
I would like to calculate the follow-up time for each calendar year. For example, I have 1 year for 2013 (from B), 1 year for 2014 (from B), 2 years for 2015 (from A and B) and so on.
I tried to treat year as an integer and count how many years each individual contributes but due to rounding errors the result is not plausible.
I tried
years<-NULL
for (i in 1:length(df$id)){
years<-c(years, as.character(as.Date(seq.Date(from = df$start[i], to = df$end[i], by = "day"))))
}
library(lubridate)
table(year(years))/365
2013 2014 2015 2016 2017 2018 2019 2020 2021
1.0000000 1.0000000 2.0000000 2.0054795 1.7753425 1.0000000 1.0000000 1.0027397 0.4465753
which is the answer I am trying to get, but it is computationally inefficient and very slow on large data. Is there any way to do this without the loop, or to do it more efficiently?

I'm now guessing that you actually don't want to round or truncate anything, so here's a solution that works and gives output similar to your method (correcting the 2016 value):
func <- function(st, ed) {
stopifnot(length(st) == 1, length(ed) == 1)
stL <- as.POSIXlt(st)
edL <- as.POSIXlt(ed)
start_year <- 1900 + stL$year
end_year <- 1900 + edL$year
start_eoy <- as.POSIXlt(paste0(start_year, "-12-31"))
end_eoy <- as.POSIXlt(paste0(end_year, "-12-31"))
firstyear <- (start_eoy$yday - stL$yday) / start_eoy$yday
lastyear <- edL$yday / end_eoy$yday
data.frame(
year = seq(start_year, end_year),
n = c(firstyear, rep(1, max(0, end_year - start_year - 1)), lastyear)
)
}
base R
aggregate(n ~ year, data = do.call(rbind, Map(func, df$start, df$end)), FUN = sum)
# year n
# 1 2013 1.0000000
# 2 2014 1.0000000
# 3 2015 2.0000000
# 4 2016 2.0000000
# 5 2017 1.7747253
# 6 2018 1.0000000
# 7 2019 1.0000000
# 8 2020 1.0000000
# 9 2021 0.4450549
dplyr
library(dplyr)
df %>%
with(Map(func, start, end)) %>%
bind_rows() %>%
group_by(year) %>%
summarize(n = sum(n))
# # A tibble: 9 x 2
# year n
# <int> <dbl>
# 1 2013 1
# 2 2014 1
# 3 2015 2
# 4 2016 2
# 5 2017 1.77
# 6 2018 1
# 7 2019 1
# 8 2020 1
# 9 2021 0.445
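As a further sketch (not part of the answer above), the loop in the question can also be avoided by computing, for each calendar year, the number of days each follow-up interval overlaps that year. This assumes both start and end days count as followed, and it divides by 365 as the question does, so leap years come out slightly above a whole year. The names years, year_start, year_end and days are illustrative only.

```r
df <- data.frame(
  id = c("A", "B"),
  start = as.Date(c("2015-01-01", "2013-01-01")),
  end = as.Date(c("2021-06-12", "2017-10-10"))
)

# Every calendar year touched by at least one follow-up interval
years <- seq(as.integer(format(min(df$start), "%Y")),
             as.integer(format(max(df$end), "%Y")))
year_start <- as.Date(paste0(years, "-01-01"))
year_end <- as.Date(paste0(years, "-12-31"))

# For each year, clip every interval to that year and count the days,
# inclusive of both endpoints; negative overlaps mean "no overlap"
days <- vapply(seq_along(years), function(i) {
  overlap <- as.numeric(pmin(df$end, year_end[i]) - pmax(df$start, year_start[i])) + 1
  sum(pmax(overlap, 0))
}, numeric(1))
names(days) <- years

# Person-years per calendar year, using the same 365-day divisor as the question
person_years <- days / 365
person_years
```

The work here scales with the number of individuals times the number of calendar years, rather than with the total number of follow-up days.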

Sounds like a job for a great package called lubridate. See example:
By the way, I assumed dates are year-month-day, therefore ymd. If not, lubridate has other parsers, for example mdy (month-day-year) for the American date format.
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tidyverse)
df %>%
mutate(across(start:end, ymd),
follow_up_years = interval(start, end)/years(1),
follow_up_months = interval(start, end)/months(1),
follow_up_days = interval(start, end)/days(1),
)
#> id start end follow_up_years follow_up_months follow_up_days
#> 1 A 2015-01-01 2021-06-12 6.443836 77.36667 2354
#> 2 B 2013-01-01 2017-10-10 4.772603 57.29032 1743
Created on 2021-10-28 by the reprex package (v2.0.1)
Edit
I think I understand. I guess we can also just use lubridate intervals:
df %>%
mutate(follow_up_2015 = interval(start, as_date("2015-01-01"))/years(1)) %>%
pull(follow_up_2015) %>%
sum()
#> [1] 2
Created on 2021-10-28 by the reprex package (v2.0.1)

Related

How to generate a sequence of numbers increasing at a fixed percentage?

I would like to calculate the predicted value at 2% growth rate over 10 years.
My data looks like this
df <- structure(list(fin_year = c(2016, 2017, 2018, 2019, 2020, 2021
), Total = c(136661.9, 142748.25, 146580.77, 155486.07, 171115.58,
69265.01)), class = "data.frame", row.names = c(NA, -6L))
I would like to add a new column (two_percent) with the calculated amounts based on the 2016 Total value.
I expect the answers to look like this:
I've tried this but can't figure out how to code the script properly to do what I want
df1 <- df %>%
mutate(two_percent = rep(Total[1:1] *1.02))
Your help is much appreciated
The formula is 1.02^n where n is the number of periods. One may need to subtract 1 from n depending on whether the interest is at the beginning or end of the period.
basevalue <- df$Total[1]
df1 <- df %>%
mutate(two_percent = basevalue*1.02^(row_number()-1))
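For readers without dplyr, the same formula can be sketched in base R. This is only an illustration of the compounding formula above; the column name two_percent follows the question, and cumprod is used purely as a cross-check that repeated multiplication by 1.02 gives the same series.

```r
df <- data.frame(
  fin_year = 2016:2021,
  Total = c(136661.9, 142748.25, 146580.77, 155486.07, 171115.58, 69265.01)
)

# Period 0 is the base year, so the exponent runs 0, 1, 2, ...
n <- seq_len(nrow(df)) - 1
df$two_percent <- df$Total[1] * 1.02^n

# Cross-check: compounding step by step yields the same values
stepwise <- cumprod(c(df$Total[1], rep(1.02, nrow(df) - 1)))
all.equal(df$two_percent, stepwise)
```

Dropping the "- 1" in the exponent shifts the series by one period, which corresponds to applying the growth already in the first year.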
We can use purrr::accumulate to calculate the 2% growth forecast. First let's calculate this for the existing data.frame. We need to supply a vector of 1.02 whose length is one less than the total number of rows as accumulate's .x argument. Further, we need the base value of Total as the .init argument (this is the value we want to base the forecast on). The function .f that we then use is just .x * .y.
library(dplyr)
library(purrr)
# Calculate the growth rate for the existing data.frame
df %>%
mutate(two_percent = accumulate(rep(1.02, nrow(.)-1),
~ .x * .y,
.init = first(Total)))
#> fin_year Total two_percent
#> 1 2016 136661.90 136661.9
#> 2 2017 142748.25 139395.1
#> 3 2018 146580.77 142183.0
#> 4 2019 155486.07 145026.7
#> 5 2020 171115.58 147927.2
#> 6 2021 69265.01 150885.8
While this works for the existing data.frame we need a new one, if we want to forecast values for years that the current df doesn't contain. Basically, we use the same approach as above and combine it with a right_join:
# Calculate the growth rate for a 10 year period, and then join
new_df <- tibble(Year = 1:10,
two_percent = df$Total[1]) %>%
mutate(two_percent = accumulate(rep(1.02, nrow(.)-1),
~ .x * .y,
.init = first(two_percent)))
df %>%
mutate(Year = row_number()) %>%
right_join(new_df)
#> Joining, by = "Year"
#> fin_year Total Year two_percent
#> 1 2016 136661.90 1 136661.9
#> 2 2017 142748.25 2 139395.1
#> 3 2018 146580.77 3 142183.0
#> 4 2019 155486.07 4 145026.7
#> 5 2020 171115.58 5 147927.2
#> 6 2021 69265.01 6 150885.8
#> 7 NA NA 7 153903.5
#> 8 NA NA 8 156981.6
#> 9 NA NA 9 160121.2
#> 10 NA NA 10 163323.6
Created on 2022-01-13 by the reprex package (v2.0.1)
Here's another simple method that anchors the base of the two_percent calculation to the value of Total in the first fin_year using which.min(fin_year). Note that the exponent starts at 1 here, so the first year already includes one period of growth.
library(tidyverse)
df <- structure(list(fin_year = c(2016, 2017, 2018, 2019, 2020, 2021
), Total = c(136661.9, 142748.25, 146580.77, 155486.07, 171115.58,
69265.01)), class = "data.frame", row.names = c(NA, -6L))
df %>%
mutate(two_percent = Total[which.min(fin_year)] * 1.02^(seq_along(fin_year)))
#> fin_year Total two_percent
#> 1 2016 136661.90 139395.1
#> 2 2017 142748.25 142183.0
#> 3 2018 146580.77 145026.7
#> 4 2019 155486.07 147927.2
#> 5 2020 171115.58 150885.8
#> 6 2021 69265.01 153903.5
Created on 2022-01-13 by the reprex package (v2.0.1)

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that tabulates the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, with four years of data and 52 weeks in each, that is a lot of conditions to spell out. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in the count column based on Week.Number and Flu.Year values in other rows, conditioned on the current row's values of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad form and, in some cases, an error to use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
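The trick in that last pipeline is that cumsum(count) / seq_along(count) is a running mean within each week, and lag() shifts it down one row so each year only sees earlier years. A base R sketch of the same idea, assuming the same toy data; the helper name prev_mean is made up for this illustration:

```r
dat <- data.frame(
  Flu.Year = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, 3),
  count = as.numeric(1:9)
)
# Rows must be in year order within each week for the running mean to make sense
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]

# Element i is the mean of x[1:(i - 1)]; the first element has no history, so NA
prev_mean <- function(x) {
  c(NA, head(cumsum(x) / seq_along(x), -1))
}

# Apply the helper separately within each Week.Number group
dat$year_week.average <- ave(dat$count, dat$Week.Number, FUN = prev_mean)
dat
```

For week 1, for example, the 2018 row gets (1 + 4) / 2 = 2.5, the mean of the 2016 and 2017 counts.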
We can use aggregate from base R
aggregate(count ~ Flu.Year, data = dat, FUN = mean)

Assign day of the year to a month

Sample data
df <- data.frame(ID1 = rep(1:1000, each= 5*365), year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5))
This data contains a column day which is the day of the year. I need to produce two columns:
Month column: a column of month (which month does the day belong)
Biweek column: which biweek a day belongs to. There are 24 biweeks in a year. All days <= 15 in a month are in the first biweek and days > 15 are in the second biweek.
For example:
1-15 Jan is biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3 and
16-28 Feb is biweek 4 and so on.
For sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>% mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
date = as.Date(paste0(year,'-',format(strptime(paste0('1981-',day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script; I am looking for a solution that is considerably faster.
EDIT: To be clear, I have assumed all years must have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (last day of Feb is 60 but I want the last day to be 59) and therefore Dec has only 30 days (Dec start with 336 though it should start with 335). I hope this is clear. My solution addresses this issue but takes lot of time to run.
Here is a solution using lubridate extractors and replacement functions as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. A running time of 8 sec seems pretty acceptable to me; I'm sure some optimising could shave that down, though possibly at a loss of generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year.
EDIT: Here is a significantly faster solution. You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 0.36s running time, since you no longer have to repetitively create the date. We also bypass having to use case_when, since the join will take care of the missing days. See that Day 59 of year 2000 is February and Day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
ID1 = rep(1:1000, each= 5*365),
year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5)
)
tictoc::tic("")
doys <- tibble(
day = rep(1:365),
date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
month = month(date),
biweek = case_when(
mday(date) <= 15 ~ (month * 2) - 1,
mday(date) > 15 ~ month * 2
)
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
month = as.numeric(format(date, "%m")), # extract month
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45
Filtered for one year. I think it solves the leap issue you described, unless I'm not clear on what you're saying. Last day of Feb is 59 in the df in my result below, but only because day is 0 indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
day = day - 1, # dates are 0 indexed
date = as.Date(day, origin = "2000-01-01"),
month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so times reflect that.
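Since the question fixes every year at 365 days, the mapping can also be sketched as a plain lookup vector indexed by day of year, with no date parsing at all. This is an illustration under that non-leap assumption; the names days_in_month, month_of_day, day_of_month and biweek_of_day are made up here.

```r
# Month lengths for a non-leap year (the question's stated assumption)
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# One entry per day of year (length 365)
month_of_day <- rep(1:12, times = days_in_month)
day_of_month <- sequence(days_in_month)  # 1..31, 1..28, 1..31, ...

# Days 1-15 of a month fall in the odd biweek, later days in the even one
biweek_of_day <- month_of_day * 2 - (day_of_month <= 15)

# Look up a few days of the year by plain indexing
day <- c(1, 15, 16, 59, 60, 365)
data.frame(day, month = month_of_day[day], biweek = biweek_of_day[day])
```

On the full data this would amount to df$month <- month_of_day[df$day] and df$twowk <- biweek_of_day[df$day], which is a single vectorised indexing step per column.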

Dropping the rows by checking whether it has multiple values in R

I have a data frame in this form;
Year Department Jan Feb ................... Dec
2017 TF 15.15 225.51 .............. 5562.1
2015 CIF ...................................
2013 TTR ....................................
2011 COR ....................
. .............................
. ......................
As a summary, I want to create an algorithm but first I have to make this filtering:
If a department does not have values for all of the years 2013, 2014, 2015 and 2016, then I want to exclude that department from my data set.
In other words, I want to go through each department's data and keep only the departments that have values in the month columns for all four years.
I tried exists and is.na, but the multi-condition filtering always fails. Another obstacle is that filter seems to work for only a single condition, whereas here I need four: values for all four years must exist before I can use them in the next step.
Thank you.
I can't find a clear duplicate to this question. Seems like a quick fix with group_by:
library(dplyr)
df <- tibble(Year = c(2013:2016, 2015, 2016),
Department = c(rep('TF', 4), 'CIF', 'TTR'))
df
#> # A tibble: 6 x 2
#> Year Department
#> <dbl> <chr>
#> 1 2013 TF
#> 2 2014 TF
#> 3 2015 TF
#> 4 2016 TF
#> 5 2015 CIF
#> 6 2016 TTR
df %>%
group_by(Department) %>%
mutate(x = Year %in% c(2013:2016),
y = sum(x)) %>%
ungroup() %>%
filter(y == 4)
#> # A tibble: 4 x 4
#> Year Department x y
#> <dbl> <chr> <lgl> <int>
#> 1 2013 TF TRUE 4
#> 2 2014 TF TRUE 4
#> 3 2015 TF TRUE 4
#> 4 2016 TF TRUE 4
A solution using R base:
df = read.table(text = "Year, Department
2016,TF
2017,TF
2013,CIF
2014,CIF
2015,CIF
2016,CIF
2013,TTR", header = TRUE, sep = ",", stringsAsFactors = FALSE)
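# keep only rows in the years of interest, then count rows per department
counts <- aggregate(Year ~ Department, data = subset(df, Year %in% 2013:2016), FUN = length)
# keep departments that have all four years
df[df$Department %in% counts$Department[counts$Year == 4], ]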
Output:
Year Department
3 2013 CIF
4 2014 CIF
5 2015 CIF
6 2016 CIF
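Both answers above count rows, which assumes each department has at most one row per year. If duplicate years are possible, a sketch that tests for the presence of every required year may be safer. The names required, has_all and keep are illustrative, and the data mirrors the base R example above.

```r
df <- data.frame(
  Year = c(2016, 2017, 2013, 2014, 2015, 2016, 2013),
  Department = c("TF", "TF", "CIF", "CIF", "CIF", "CIF", "TTR"),
  stringsAsFactors = FALSE
)

required <- 2013:2016

# For each department: are all required years present at least once?
has_all <- tapply(df$Year, df$Department, function(y) all(required %in% y))

# Names of the departments that pass, then filter the original rows
keep <- names(has_all)[has_all]
result <- df[df$Department %in% keep, ]
result
```

Note that this keeps all rows of a qualifying department, including any rows outside 2013 to 2016.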

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups, and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, what I'm doing is calculate both statistics separately; then join the results; then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the median delay per day and per month:
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
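The "create new variables, then summarise" idea can also be sketched in base R with ave(), which returns a group statistic aligned to the original rows. The small flights data frame below is made up for illustration (it is not hflights), and mean is used to match the code in the question.

```r
# Toy data: two months, a few flights per day (values are invented)
flights <- data.frame(
  Month = c(1, 1, 1, 1, 2, 2),
  Day = c(1, 1, 2, 2, 1, 1),
  Delay = c(10, 20, 0, 6, 4, 8)
)

# Group statistic at two levels, each repeated on every row of its group
flights$DailyDelay <- ave(flights$Delay, flights$Month, flights$Day, FUN = mean)
flights$MonthlyDelay <- ave(flights$Delay, flights$Month, FUN = mean)

# One row per day, carrying both the daily and the monthly figure
summary_tbl <- unique(flights[, c("Month", "Day", "DailyDelay", "MonthlyDelay")])
summary_tbl
```

Because each statistic is computed from the raw rows rather than rolled up from the daily values, the same pattern works for median by swapping FUN.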

Resources