I have a data frame with three columns: employee ID, start date (ydm) and end date (ydm). My objective is to create a second data frame with two columns, employee ID and date. The second data frame is built from the first: it takes the IDs from the first data frame, and its date column holds every month between that employee's start date and end date. In simple words, I want to expand the first data frame by month according to each employee's start and end dates.
I successfully wrote the code using a for loop. The problem is that it is very slow, and I have read that loops should be avoided in R. Is there a way to do the same thing much more quickly?
An example of my data frame and code is below:
library(lubridate) # provides ydm()
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-1-3','2018-1-9','2018-1-8'), stringsAsFactors = F)
a$StartDate <- ydm(a$StartDate)
a$EndDate <- ydm(a$EndDate)
#second empty data frame
a1 <-a
a1 <- a1[0,1:2]
#my code starts
r <- 1
r.1 <- 1
for (id in a$employeeid) {
#r.1 <- 1
for ( i in format(seq(a[r,2],a[r,3],by="month"), "%Y-%m-%d") ) {
a1[r.1,1] <- a[r,1]
a1[r.1,2] <- i
r.1 <- r.1 +1
}
r <- r+1
}
This results in this :
I want the same result, but a bit quicker
Almost a one-liner with tidyverse:
> result
# A tibble: 12 x 2
employeeid date
<chr> <date>
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-11-01
10 c 2018-12-01
11 c 2019-01-01
12 c 2019-02-01
Code
result <- df %>%
group_by(employeeid) %>%
summarise(date = list(seq(StartDate,
EndDate,
by = "month"))) %>%
unnest(cols = date)
Data
library(tidyverse)
library(lubridate)
df <- data.frame(employeeid = c('a', 'b', 'c'),
StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
stringsAsFactors = FALSE)
I'd try to solve this using apply and a custom function that works out the months between the start and end dates.
I'm not sure what your desired output looks like, but the function in the following example pastes all months between start and end into a single string.
library(lubridate)
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-2-3','2019-1-9','2020-1-8'), stringsAsFactors = F)
a$StartDate <- ymd(a$StartDate)
a$EndDate <- ymd(a$EndDate)
# map numeric month values (1-12) to abbreviated month names
month_names = month.abb[1:12]
month_dif = function(dates) # lists the months spanned; expects a vector of length 2
{
  start = dates[1] # first element is expected to be the start date
  end = dates[2] # second element is expected to be the end date
  start_month = month(start)
  end_month = month(end)
  start_year = year(start)
  end_year = year(end)
  year_dif = end_year - start_year
  if(year_dif == 0){ # start and end are in the same year: months run from start to end
    return(paste(month_names[start_month:end_month], collapse= ", " ))
  } else { # the range crosses a year boundary: start month to December, any full years in between, then January to end month
    paste(c(month_names[start_month:12],
            rep(month_names, year_dif-1),
            month_names[1:end_month]), collapse = ", ")
  }
}
apply(a[2:3], 1, month_dif)
output:
> apply(a[2:3], 1, month_dif)
[1] "Jan, Feb"
[2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
[3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
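You can use a combination of apply and do.call (this assumes StartDate and EndDate are still the original character strings, hence the "%Y-%d-%m" format):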
out_apply_list <- apply(X=a, MARGIN=1,
FUN=function(x) {
data.frame(id= x[1],
date=seq(from = as.Date(x[2], "%Y-%d-%m"),
to = as.Date(x[3], "%Y-%d-%m"),
by = "month"),
row.names = NULL)
})
df <- do.call(what = rbind, args = out_apply_list)
which gives you the following output:
> df
id date
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-02-11
10 c 2018-03-11
11 c 2018-04-11
12 c 2018-05-11
13 c 2018-06-11
14 c 2018-07-11
For the sake of completeness, here is a concise one-liner with data.table:
library(data.table)
setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]
employeeid StartDate
1: a 2018-01-01
2: a 2018-02-01
3: a 2018-03-01
4: b 2018-05-01
5: b 2018-06-01
6: b 2018-07-01
7: b 2018-08-01
8: b 2018-09-01
9: c 2018-02-11
10: c 2018-03-11
11: c 2018-04-11
12: c 2018-05-11
13: c 2018-06-11
14: c 2018-07-11
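For comparison, the same expansion can be sketched in base R without any packages. This is a minimal sketch rather than code from the answers above; it assumes the date columns are already of class Date (the values below are the ones the question's ydm() calls produce), and the names month_seqs and result are only illustrative.

```r
# Sample data: the question's dates after parsing with ydm()
a <- data.frame(employeeid = c("a", "b", "c"),
                StartDate  = as.Date(c("2018-01-01", "2018-05-01", "2018-02-11")),
                EndDate    = as.Date(c("2018-03-01", "2018-09-01", "2018-08-01")),
                stringsAsFactors = FALSE)

# One monthly sequence per employee (a list of Date vectors)
month_seqs <- Map(seq, a$StartDate, a$EndDate, by = "month")

# Repeat each id once per generated month and flatten the dates
result <- data.frame(employeeid = rep(a$employeeid, lengths(month_seqs)),
                     date       = do.call(c, month_seqs),
                     stringsAsFactors = FALSE)
print(result)
```

The loop over rows still happens inside Map, but the result is assembled in one step instead of growing a data frame row by row, which is usually where the for-loop version loses its time.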
R:
I have a data set with sales values for N products, running from some yyyy-mm-dd to some yyyy-mm-dd. I want to filter the data to the last 12 months for each product in the data set.
E.g.:
Say I have values from 2016-01-01 to 2020-02-01.
I then want to keep the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I cannot simply hard-code "filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01"))" because the end date changes as each month passes, so I need to automate this.
You can use:
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.Date(start.date.lt) # back to class Date so it compares cleanly with Date columns
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1 (the day-of-month component of a POSIXlt object is called mday), and so forth if you need to adjust in different ways.
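The "today.last.year" idea described above could look roughly like the following. This is only a sketch: the function name today_last_year and its feb29 argument are made up for illustration, and it relies on base R rolling a non-existent 29 February forward to 1 March when a POSIXlt value is converted back to Date.

```r
# Hypothetical helper: same calendar day one year earlier, with an explicit
# choice for 29 February ("feb28" or "mar1").
today_last_year <- function(date = Sys.Date(), feb29 = c("feb28", "mar1")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(date)
  # mon is zero-based in POSIXlt, so 1 means February
  is_feb29 <- lt$mon == 1 & lt$mday == 29
  lt$year <- lt$year - 1
  out <- as.Date(lt)            # 29 Feb of a non-leap year becomes 1 Mar
  if (feb29 == "feb28") {
    out[is_feb29] <- out[is_feb29] - 1
  }
  out
}

print(today_last_year(as.Date("2020-02-29")))           # treated as 28 Feb
print(today_last_year(as.Date("2020-02-29"), "mar1"))   # treated as 1 Mar
print(today_last_year(as.Date("2020-03-15")))           # ordinary date
```

The result can then be used directly as the lower bound in a filter on a Date column.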
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch=0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
I have a field of dates that mark the end of weeks. I am trying to create a new field using a function that flags (1 or 0) whether that week included any of 6 holidays specified using the timeDate package. I am getting the following error: "Error during wrapup: comparison (3) is possible only for atomic and list types" - how can I resolve?
The function should take the end-of-week date (x) in the format yyyy-mm-dd (e.g., 2017-01-01) and the year of that date (y) in the format yyyy (e.g., 2017).
library(lubridate)
library(timeDate)
Date = as.Date(c("2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22", "2017-06-04", "2017-07-09", "2017-07-16"))
Year = year(Date)
Holiday.During.Week = as.Date(c("2017-01-01", NA, NA, NA, "2017-05-29", "2017-07-04", NA))
Desired.Output = c(1,0,0,0,1,1,0)
data <- data.frame(Date, Year, Holiday.During.Week, Desired.Output)
data
holiday.function = function(x, y) {
return(
as.numeric(
(USNewYearsDay(y) < x & USNewYearsDay(y) > (x - 7)) +
(USMemorialDay(y) < x & USMemorialDay(y) > (x - 7)) +
(USIndependenceDay(y) < x & USIndependenceDay(y) > (x - 7)) +
(USLaborDay(y) < x & USLaborDay(y) > (x - 7)) +
(USThanksgivingDay(y) < x & USThanksgivingDay(y) > (x - 7)) +
(USChristmasDay(y) < x & USChristmasDay(y) > (x - 7))
)
)
}
data$Holiday.Flag = holiday.function(data$Date, data$Year)
Edit: thanks to Ian Campbell for working on this without provided data. I've updated the code to include a sample data frame and libraries
Sounds like you have a lot of data, so let's use data.table. I generated a random sample of 10,000,000 days at the end of this answer.
First, we'll make a data.table of all of the holidays between 1900 and 2020.
library(timeDate)
library(data.table)
library(lubridate)
HolidayTable <- rbindlist(lapply(1900:2020,function(y){data.frame(Year = y, Holiday = as.Date(c(USNewYearsDay(y),USMemorialDay(y),USIndependenceDay(y),USLaborDay(y),USThanksgivingDay(y),USChristmasDay(y))))}))
We need to make a copy of the holiday date, because data.table rolling joins merge the column you join on.
setDT(test.data)
setDT(HolidayTable)
HolidayTable[,Date := Holiday]
test.data[, Year := year(Date)]
Now we do a rolling join with roll = 6, which matches each week-ending date to a holiday that fell at most 6 days before it. We can then create the desired output with a logical comparison coerced to integer with +.
HolidayTable[test.data, on = c("Year","Date"), roll = 6][
,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]
# Index Year Date Holiday HolidayPresent
# 1: 1 2018 2018-04-21 <NA> 0
# 2: 2 2017 2017-09-30 <NA> 0
# 3: 3 2017 2017-01-07 2017-01-01 1
# 4: 4 2017 2017-08-26 <NA> 0
# 5: 5 2018 2018-09-01 <NA> 0
# ---
# 9999996: 9999996 2017 2017-06-24 <NA> 0
# 9999997: 9999997 2018 2018-03-17 <NA> 0
# 9999998: 9999998 2018 2018-07-07 2018-07-04 1
# 9999999: 9999999 2018 2018-01-13 <NA> 0
#10000000: 10000000 2017 2017-08-12 <NA> 0
10,000,000 rows done in just 2.5 seconds on my laptop.
system.time({HolidayTable[test.data, on = c("Year","Date"), roll = 6][,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]})
user system elapsed
2.045 0.426 2.484
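The logic behind the rolling join (find the most recent holiday on or before each week-ending date and check whether it is at most 6 days back) can also be sketched in base R with findInterval. This is an illustration of the idea, not the code of the answer above; the holiday dates are hard-coded sample values instead of coming from timeDate, and the variable names are assumptions.

```r
# Sample inputs (hard-coded for illustration)
holidays <- sort(as.Date(c("2017-01-01", "2017-05-29", "2017-07-04")))
week_end <- as.Date(c("2017-01-01", "2017-01-08", "2017-06-04",
                      "2017-07-09", "2017-07-16"))

# Index of the latest holiday on or before each week-ending date (0 if none)
idx <- findInterval(as.numeric(week_end), as.numeric(holidays))
idx[idx == 0] <- NA_integer_
last_holiday <- holidays[idx]

# Flag weeks whose latest preceding holiday is at most 6 days back
days_back <- as.numeric(week_end - last_holiday)
holiday_flag <- as.integer(!is.na(days_back) & days_back <= 6)

print(data.frame(week_end, last_holiday, holiday_flag))
```

For the sample dates this reproduces the flags the question asks for on the matching rows.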
Data
library(zoo)
WeekEndingDate2017 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2017-01-07","%Y-%m-%d"), accumulate = TRUE))
WeekEndingDate2018 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2018-01-06","%Y-%m-%d"), accumulate = TRUE))
set.seed(1)
test.data <- data.frame(Index = 1:10000000, Date = sample(c(WeekEndingDate2017,WeekEndingDate2018),size = 10000000, replace = TRUE))
So I have this dataset of about 2,800 observations. The headers look a little something like this:
ItemName ItemNumber PromotedDate
ItemA 14321 12/31/2018
ItemB 14335 11/18/2018
ItemC 14542 10/05/2018
I want to be able to add a new column to this dataset, Number.Times.Promoted.Last.3.Months, that would count how many times each item exists in the dataset over the last three months of the PromotedDate variable.
I've tried creating some code (below) but it returns 0 for every row. When I just try it with the item number, I get the number of observations in the entire dataset.
df$Number.Times.Promoted.Last.Three.Months <- sum(df$ItemNumber == df$ItemNumber &
df$PromotedDate < df$PromotedDate &
df$PromotedDate > (as.Date(df$PromotedDate - 100)),
na.rm=TRUE)))
I'd love for the code to return the actual number of times each item in the dataset was promoted over the last 3 months since the PromotedDate variable, and for that to be attached to each row of the data (df). Would love some help in figuring out what I'm doing wrong. Thanks!
Note: In the file linked to there is a typo, the first ItemB starts with a lower case i. The code below works even if this is not corrected.
I find the following solution a bit too complicated but it does what the question asks for.
library(lubridate)
fun <- function(x){
ifelse(month(x) == 12 & day(x) == 31,
x - days(31 + 30 + 31),
x - months(3)
)
}
df <- readxl::read_xlsx("example_20190519.xlsx")
df$PromotedDate <- as.Date(df$PromotedDate)
sp <- split(df, tolower(df$ItemName))
res <- lapply(sp, function(DF){
tmp <- as.Date(fun(DF$PromotedDate), origin = "1970-01-01")
sapply(seq_len(nrow(DF)), function(i){
sum(DF$PromotedDate[i] > DF$PromotedDate & DF$PromotedDate > tmp[i])
})
})
df$New.3.Months <- NA
for(nm in names(res)) {
df$New.3.Months[tolower(df$ItemName) == nm] <- res[[nm]]
}
Now test to see if the result is the same as in the example .xlsx file.
all.equal(df$Times.Promoted.Last.3.Months, df$New.3.Months)
#[1] TRUE
And final cleanup.
rm(sp)
Here's an arguably simpler solution that relies on dplyr and fuzzyjoin.
First I define a day 90 days earlier**, and then join the list with itself, pulling in each Item match with a promotion date that is both "since 90 days before" and "up to current date." The number of rows for each Item-Date is the number of promotions within 90 days. By subtracting the row representing itself, we get the number of prior promotions.
** "90 days earlier" is simpler than "3mo earlier," which varies in length and is arguable for some dates: what's 3 months before May 30?
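To make the footnote concrete, here is a small base R comparison of the two kinds of arithmetic. It is only an illustration; the date and variable names are chosen for the example.

```r
d <- as.Date("2019-05-30")

# Calendar arithmetic: stepping back 3 months lands on the non-existent
# 30 February, which base R rolls forward into March.
three_months_back <- seq(d, by = "-3 months", length.out = 2)[2]

# Fixed-length arithmetic: always exactly 90 days.
ninety_days_back <- d - 90

print(three_months_back)  # 2019-03-02
print(ninety_days_back)   # 2019-03-01
```

Other tools resolve the missing date differently, which is why a fixed number of days is the less ambiguous window.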
Prep
library(dplyr); library(fuzzyjoin); library(lubridate)
df <- readxl::read_excel(
"~/Downloads/example_20190519.xlsx",
col_types = c("text", "numeric", "date", "numeric"))
df_clean <- df %>% select(-Times.Promoted.Last.3.Months)
Solution
df_clean %>%
mutate(PromotedDate_less90 = PromotedDate - days(90)) %>%
# Pull in all matches (including current row) with matching Item and Promoted Date
# that is between Promoted Date and 90 days prior.
fuzzy_left_join(df_clean,
by = c("ItemName" = "ItemName",
"ItemNumber" = "ItemNumber",
"PromotedDate_less90" = "PromotedDate",
"PromotedDate" = "PromotedDate"),
match_fun = list(`==`, `==`, `<=`, `>=`)
) %>%
group_by(ItemName = ItemName.x,
ItemNumber = ItemNumber.x,
PromotedDate = PromotedDate.x) %>%
summarize(promotions_in_prior_90d = n() - 1) %>%
ungroup()
Output (in different order, but matching goal)
# A tibble: 12 x 4
ItemName ItemNumber PromotedDate promotions_in_prior_90d
<chr> <dbl> <dttm> <dbl>
1 ItemA 10021 2018-09-19 00:00:00 0
2 ItemA 10021 2018-10-15 00:00:00 1
3 ItemA 10021 2018-11-30 00:00:00 2
4 ItemA 10021 2018-12-31 00:00:00 2
5 itemB 10024 2018-12-15 00:00:00 0
6 ItemB 10024 2018-04-02 00:00:00 0
7 ItemB 10024 2018-06-05 00:00:00 1
8 ItemB 10024 2018-12-01 00:00:00 0
9 ItemC 19542 2018-07-20 00:00:00 0
10 ItemC 19542 2018-11-17 00:00:00 0
11 ItemC 19542 2018-12-01 00:00:00 1
12 ItemC 19542 2018-12-14 00:00:00 2
I have a cohort of data with multiple visits per person and want to group visits under a common ID based on the person number and the time of the visit. The condition: if a visit starts within 24 hours of the previous discharge, those visits should share the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1,1,1,2,3,3,3,4,4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00",
                                 "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00",
                                 "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00",
                                 "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00",
                                 "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00",
                                     "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00",
                                     "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00",
                                     "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00",
                                     "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start with:
dat1 <-
dat %>%
arrange(Person_ID, Admit_Date_Time) %>%
group_by(Person_ID) %>%
mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time,
1),Admit_Date_Time, units = "hours")) %>%
mutate(start = c(1,Previous_Visit_Interval[-1] < hours(-24)), run =
cumsum(start))
dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID,dat1$run)))
This is almost right, except that it does not give the correct ID for visit 7 (person #3): that person has three visits, the second lies entirely within the first, and the third starts within 24 hours of the end of the first but not of the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(dplyr)
library(tidyr)
library(lubridate) # for ddays()
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
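The grouping rule itself (a visit joins the current group if it starts within 24 hours of the latest discharge seen so far for that person) can also be sketched in base R with a running maximum. This is a different technique from both answers above, shown only as an illustration; the function name group_visits and its arguments are hypothetical, and the sample timestamps are parsed in UTC to keep the example independent of the local time zone.

```r
# Hypothetical helper: returns one group id per visit, in the original row order.
group_visits <- function(person, admit, discharge, gap_hours = 24) {
  ord <- order(person, admit)
  p <- person[ord]
  a <- as.numeric(admit[ord])
  d <- as.numeric(discharge[ord]) + gap_hours * 3600   # discharge plus grace period
  # Latest "discharge + gap" seen so far within each person
  run_max  <- ave(d, p, FUN = cummax)
  prev_max <- c(NA, run_max[-length(run_max)])
  new_person <- c(TRUE, p[-1] != p[-length(p)])
  # New group on a new person, or when the admission is later than every
  # earlier discharge (plus gap) of the same person
  new_group <- new_person | (!new_person & a > prev_max)
  id <- integer(length(p))
  id[ord] <- cumsum(new_group)
  id
}

to_time <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "UTC")
person <- c(1, 1, 1, 2, 3, 3, 3, 4, 4)
admit <- to_time(c("2017-02-07 15:26", "2017-04-21 10:20", "2017-04-22 12:12",
                   "2017-10-16 01:31", "2017-01-24 02:41", "2017-01-24 05:31",
                   "2017-01-28 04:26", "2017-12-01 01:31", "2017-12-01 01:31"))
discharge <- to_time(c("2017-03-01 11:42", "2017-04-22 05:56", "2017-04-26 21:01",
                       "2017-10-18 20:11", "2017-01-27 22:15", "2017-01-26 15:35",
                       "2017-01-28 09:25", "2017-12-05 18:33", "2017-12-04 16:41"))

hosp_id <- group_visits(person, admit, discharge)
print(hosp_id)
```

Because the comparison uses the running maximum rather than only the previous row, visit 7 is compared against the discharge of visit 5, which is the case the original attempt missed.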
I'm working on an R script that should show, in a data frame, the date range of each week and the week number the dates fall in.
Output I'm trying to display:
--------------------------------------------------
DateRange | Week
--------------------------------------------------
1/7/2018 - 1/13/2018 | 2
--------------------------------------------------
1/14/2018 - 1/20/2018 | 3
--------------------------------------------------
1/21/2018 - 1/26/2018 | 4
--------------------------------------------------
library(data.table)
dd <- seq(as.IDate("2018-01-01"), as.IDate("2018-04-10"), 1)
dt <- data.table(i = 1:length(dd),
day = dd,
weekday = weekdays(dd),
day_rounded = round(dd, "weeks"))
## Now let's add the weekdays for the "rounded" date
dt[ , weekday_rounded := weekdays(day_rounded)]
## This seems to make internal sense with the "week" calculation
dt[ , weeknumber := week(day)]
dt$weekday_rounded <- NULL
dt$day_rounded <- NULL
dt
If I am not mistaken, then lubridate can help:
library(lubridate)
library(data.table)
dd <- seq(as.IDate("2018-01-01"), as.IDate("2018-04-10"), 1)
dt <- data.table(i = 1:length(dd), day = dd)
dt[, week := week(day),]
dt[, week_start := floor_date(day, unit = "week"),]
dt[, week_range := interval(week_start, week_start + days(6))]
dt[, week_start := NULL,]
Output:
i day week week_range
1: 1 2018-01-01 1 2017-12-31 UTC--2018-01-06 UTC
2: 2 2018-01-02 1 2017-12-31 UTC--2018-01-06 UTC
3: 3 2018-01-03 1 2017-12-31 UTC--2018-01-06 UTC
4: 4 2018-01-04 1 2017-12-31 UTC--2018-01-06 UTC
5: 5 2018-01-05 1 2017-12-31 UTC--2018-01-06 UTC
6: 6 2018-01-06 1 2017-12-31 UTC--2018-01-06 UTC
7: 7 2018-01-07 1 2018-01-07 UTC--2018-01-13 UTC
8: 8 2018-01-08 2 2018-01-07 UTC--2018-01-13 UTC
9: 9 2018-01-09 2 2018-01-07 UTC--2018-01-13 UTC
10: 10 2018-01-10 2 2018-01-07 UTC--2018-01-13 UTC
.......
If you have some table dt with a day column and other arbitrary columns, you can add the i, weekday, weeknumber and WeekRange in a single call in data.table:
dt[, ':='(
i = .I,
weekday = weekdays(day),
WeekRange = paste(min(day), max(day), sep = ' - ')
), .(weeknumber = week(day))]
The way this works is to group the days by weeknumber, or week(day), and then to paste the min date and max date in those groups together to create a date range for every week.
Here's a mock table to experiment with:
n <- 100
dt <-
data.table(
day = seq.Date(as.Date('2018-01-01'), by = 'day', length.out = n),
a = runif(n),
b = runif(n)
)
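The group-then-paste idea can also be tried in base R. The sketch below is an assumption-laden illustration: it numbers weeks with format(x, "%U") (weeks starting on Sunday, with days before the first Sunday in week 0), which differs from data.table's week(), and the variable names are made up.

```r
days <- seq(as.Date("2018-01-01"), by = "day", length.out = 21)

# Week of the year with Sunday as the first day of the week
week_id <- as.integer(format(days, "%U"))

# ISO-formatted strings sort chronologically, so min/max give the week's bounds
ranges <- tapply(format(days), week_id,
                 function(d) paste(min(d), max(d), sep = " - "))

# Look up each day's range by its week number
week_range <- unname(ranges[as.character(week_id)])

print(head(data.frame(day = days, week = week_id, week_range), 8))
```

As in the data.table version, the first and last week of the sample show a shorter range because they only contain the days actually present in the data.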