Averaging daily data over many years to replace NA values in R

I have a dataframe called daily which looks like this:
daily[1:10,]
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 0.7000000
3 1 1948-07-03 0.1900000
4 1 1948-07-04 0.1033333
5 1 1948-07-05 0.1982895
6 1 1948-07-06 0.1433333
7 1 1948-07-07 NA
8 1 1948-07-08 NA
9 1 1948-07-09 NA
10 1 1948-07-10 NA
My goal is to replace each NA with the average of that calendar day across all years (1948-1995). For example, since row 7 has an NA for July 7, 1948, I would average all the July 7 values from 1948 to 1995 and use that average for that day.
What I have tried so far is this:
index <- which(is.na(daily$Precipitation)) # find where the NA's occur
daily_avg <- daily # copy dataframe
daily_avg$Date <- strftime(daily_avg$Date, format="2000-%m-%d") # Change the Date format to represent only the day and month and disregard year
daily_avg <- aggregate(Precipitation~Date, FUN = mean, data = daily_avg, na.rm = TRUE) # find the mean precip per day
daily[index,3] <- daily_avg[daily_avg$Date %in% strftime(daily[index,2], format="2000-%m-%d"), 2]
The last line of the code is not working properly, and I'm not sure why yet. That is my thought process so far. However, I was wondering if there is a better way of doing this with a built-in function that I am not aware of. Any help is greatly appreciated. Thank you.

I think the data in your example doesn't illustrate the problem. You should give data for a given day over many years, with some NA values. For example, here I reduce the problem to 2 days over 3 years.
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 NA
3 1 1949-07-01 0.1900000
4 1 1949-07-02 0.1033333
5 1 1950-07-01 NA
6 1 1950-07-02 0.1433333
The idea, if I understand correctly, is to replace NA values with the mean of the values for that day over all years. You can use ave and transform to create a new column containing the mean, then replace the NA values with it.
daily$daymonth <- strftime(daily$Date, format="%m-%d")
daily <- transform(daily, mp =ave(Precipitation,daymonth,
FUN=function(x) mean(x,na.rm=TRUE) ))
transform(daily, Precipitation =ifelse(is.na(Precipitation),mp,Precipitation))
Climate_Division Date Precipitation daymonth mp
1 1 1948-07-01 0.2100000 07-01 0.2000000
2 1 1948-07-02 0.1233333 07-02 0.1233333
3 1 1949-07-01 0.1900000 07-01 0.2000000
4 1 1949-07-02 0.1033333 07-02 0.1233333
5 1 1950-07-01 0.2000000 07-01 0.2000000
6 1 1950-07-02 0.1433333 07-02 0.1233333
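As for why the last line of the original attempt misbehaves: %in% only tests membership, so the right-hand side returns one average per distinct calendar day, in the order of daily_avg, rather than one value per NA row. A minimal sketch of a positional lookup with match(), on toy data (the values below are made up for illustration):

```r
# Toy data: two calendar days over three years, two of them missing
daily <- data.frame(
  Climate_Division = 1,
  Date = as.Date(c("1948-07-01", "1948-07-02", "1949-07-01",
                   "1949-07-02", "1950-07-01", "1950-07-02")),
  Precipitation = c(0.21, NA, 0.19, 0.10, NA, 0.14)
)

index <- which(is.na(daily$Precipitation))   # rows to fill
daily$key <- format(daily$Date, "%m-%d")     # calendar day without the year

# Mean per calendar day; the formula interface drops NA rows by default
daily_avg <- aggregate(Precipitation ~ key, data = daily, FUN = mean)

# match() returns, for each NA row, the position of its day in daily_avg
daily$Precipitation[index] <-
  daily_avg$Precipitation[match(daily$key[index], daily_avg$key)]
```

Because match() yields exactly one position per NA row, the replacement vector always has the same length and order as index.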

Using data.table
Some dummy data
set.seed(1)
library(data.table)
daily <- seq(as.Date('1948-01-01'), as.Date('1995-12-31'), by = "day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# add na values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
## calculate the daily averages
# add day and month
dd[, c('month','day') := list(month(date), mday(date))]
# group by month and day of month (not by date, which would give one group per row)
monthdate <- dd[, list(mprecip = mean(precip, na.rm = TRUE)),
keyby = list(month, day)]
# set key for joining
setkey(dd, month, day)
# replace NA with day-month averages
dd[monthdate, precip := ifelse(is.na(precip), mprecip, precip)]
# set key to reorder to daily
setkey(dd, date)

A slightly neater version of mnel's answer, which I would prefer to the accepted one:
set.seed(1)
library(data.table)
# step 1: form data
daily <- seq(as.Date('1948-01-01'),as.Date('1995-12-31'),by="day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# step 2: add NA values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
# step 3: replace NAs with the across-years average for each calendar day
dd[, c('month','day') := list(month(date), mday(date))]
dd[,precip:= ifelse(is.na(precip), mean(precip, na.rm=TRUE), precip), by=list(month,day)]

Related

Time series - Convert every column of dataframe to time series

I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO =zoo(test[, c("abc1", "def2", "xyz3")], order.by=as.Date(paste0(test$month, "01"), format="%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
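On the dynamic-naming part of the question: rather than creating ~50 separate variables, one option is to keep the series in a named list. A minimal base R sketch, assuming the data start in January 2012 and have no missing months (the _ts suffix is just the naming used in the question):

```r
df <- data.frame(month = c(201201, 201202, 201203, 201204),
                 abc1 = c(1, 2, 4, 6),
                 def2 = c(2, 5, 11, 23),
                 xyz3 = c(4, 7, 4, 40))

# One monthly ts per data column, collected in a named list
ts_list <- lapply(df[-1], ts, start = c(2012, 1), frequency = 12)
names(ts_list) <- paste0(names(ts_list), "_ts")

# Alternatively, a single multivariate ts holding all columns
all_ts <- ts(df[-1], start = c(2012, 1), frequency = 12)
```

Individual series are then available as ts_list$abc1_ts, and decompose can be applied to each element with lapply once the series cover at least two full years.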

How to convert dataframe with YearMonth column to times series in R

I have the dataset below, which is a data frame, but I would like to convert it into a time series so that I can do ARIMA forecasting.
I have searched various topics on SO but could not find anything similar at YEARMONTH grain. Everyone talks about a date field, but here I don't have a date.
I am using the code below, but it gives an error:
dataset <- data.frame(year =c(2017), YearMonth = c(201701,201702,201703,201704), sales = c(100,200,300,400))
library(zoo)
newdataset <- as.ts(read.zoo(dataset, FUN = as.yearmon))
# Error:
#
# In zoo(coredata(x), tt) :
# some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I know it gives the error because the year column comes first and does not have unique values, but I'm not sure how to fix it.
Any help would be really appreciated.
Regards,
Akash
One option is to convert YearMonth to the first date of each month and generate the ts from that.
library(zoo)
dataset$YearMonth = as.Date(as.yearmon(as.character(dataset$YearMonth),"%Y%m"), frac = 0)
dataset
# year YearMonth sales
# 1 2017 2017-01-01 100
# 2 2017 2017-02-01 200
# 3 2017 2017-03-01 300
# 4 2017 2017-04-01 400
If you just need a ts, another option is:
dataset$YearMonth = as.yearmon(as.character(dataset$YearMonth),"%Y%m")
as.ts(dataset[-1])
# Time Series:
# Start = 1
# End = 4
# Frequency = 1
# YearMonth sales
# 1 2017.000 100
# 2 2017.083 200
# 3 2017.167 300
# 4 2017.250 400
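Note that the as.ts output above has Frequency = 1, so it is not yet a monthly series. A base R sketch that builds a monthly ts directly from the integer YearMonth column, assuming the rows are sorted and no months are missing:

```r
dataset <- data.frame(year = 2017,
                      YearMonth = c(201701, 201702, 201703, 201704),
                      sales = c(100, 200, 300, 400))

# Split the first YearMonth value into year and month for the start argument
first <- dataset$YearMonth[1]
sales_ts <- ts(dataset$sales,
               start = c(first %/% 100, first %% 100),
               frequency = 12)
```

With frequency = 12 the series carries the month information that seasonal models rely on.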

How to cut yearly time-based data into 36 parts with R?

I have a df like the following, covering 30 years up to 2015. I want to cut every month into three parts (days 1-10, 11-20, and 21-31) and average the ten (or fewer) values in each part, so that each month has three values. How can I do it?
1993-01-29 28.92189
1993-02-01 29.12760
1993-02-02 29.18927
1993-02-03 29.49786
1993-02-04 29.62128
1993-02-05 29.60068
1993-02-08 29.60068
1993-02-09 29.39498
------
------
2015-08-18 209.92999
2015-08-19 208.28000
2015-08-20 204.01000
2015-08-21 197.63001
2015-08-24 189.55000
2015-08-25 187.23000
2015-08-26 194.67999
2015-08-27 199.16000
2015-08-28 199.24000
The tryCatch calls are there to work around the start-date problem (the first month is incomplete). I will provide more info when I have time.
library(xts)
dates<-seq(as.Date("1993-01-29"),as.Date("2015-08-25"),"days")
sample<-rnorm(length(dates))
tmpxts<-split.xts(xts(x = sample,order.by = dates),f = "months")
mxts<-lapply(tmpxts,function(x) {
tmp<-data.frame(val=tryCatch(c(mean(x[1:10]),mean(x[11:20]),mean(x[21:length(x)])),
error=function(e) matrix(mean(x),1)))
row.names(tmp)<-tryCatch(index(x[c(1,11,21)]),error=function(e) index(x[1]))
tmp
})
do.call(rbind,mxts)
This is a base solution that builds cuts from an increasing sequence that cycles through years, months and your cuts at the 1st, 11th and 21st of the month. The default for the base cut functions is to include the breaks as the "right-side" of intervals, but your specification required cuts at 1, 11, and 21 (to leave 10 and 20 in the lower intervals) so I used right=TRUE:
tapply(dat$V2, cut.Date(dat$V1,
breaks=as.Date(
apply( expand.grid( c(1,11,21), 1:12, 1993:2015), 1,
function( x) paste(rev(x), collapse="-")) ), right=TRUE), FUN=mean)
1993-01-01 1993-01-11 1993-01-21 1993-02-01 1993-02-11 1993-02-21 1993-03-01
NA NA 29.02475 29.48412 NA NA NA
snipped many empty intervals
And the bottom of results included:
2015-07-21 2015-08-01 2015-08-11 2015-08-21 2015-09-01 2015-09-11 2015-09-21
NA NA 204.96250 193.97200 NA NA NA
2015-10-01 2015-10-11 2015-10-21 2015-11-01 2015-11-11 2015-11-21 2015-12-01
NA NA NA NA NA NA NA
2015-12-11
NA
The code below cuts each month separately into thirds, based on the number of days in each month.
library(dplyr)
library(lubridate)
library(ggplot2)
# Fake data
df = data.frame(date=seq.Date(as.Date("2013-01-01"),
as.Date("2013-03-31"), by="day"))
set.seed(394)
df$value = rnorm(nrow(df), sqrt(1:nrow(df)), 2)
# Cut months into thirds
df = df %>%
# Create a new column to group by Year-Month
mutate(yr_mon = paste0(year(date) , "_", month(date, label=TRUE, abbr=TRUE))) %>%
group_by(yr_mon) %>%
# Cut each month into thirds
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
# Add yr_mon to cutMonth so that we have a unique group label for
# each third of each month
cutMonth = paste0(yr_mon, "\n", cutMonth)) %>%
ungroup() %>%
# Turn cutMonth into a factor with correct date ordering
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
And here is the result:
# Show number of observations in each group
as.data.frame(table(df$cutMonth))
Var1 Freq
1 2013_Jan\n1st third 10
2 2013_Jan\n2nd third 11
3 2013_Jan\n3rd third 10
4 2013_Feb\n1st third 9
5 2013_Feb\n2nd third 10
6 2013_Feb\n3rd third 9
7 2013_Mar\n1st third 10
8 2013_Mar\n2nd third 11
9 2013_Mar\n3rd third 10
# Plot means by group (just to visualize the result of the date grouping operations)
ggplot(df, aes(cutMonth, value)) +
stat_summary(fun=mean, geom='point', size=4, colour="red") + # fun= replaces fun.y=, deprecated since ggplot2 3.3.0
coord_cartesian(ylim=c(-0.2,10.2)) +
theme(axis.text.x = element_text(size=14))
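If the fixed windows from the question (days 1-10, 11-20, and 21 to month end) are wanted rather than equal thirds, a base R sketch using integer division on the day of month might look like this. The column names V1 and V2 follow the tapply answer above and are assumptions about the real data:

```r
dat <- data.frame(
  V1 = as.Date(c("1993-01-29", "1993-02-01", "1993-02-02", "1993-02-03",
                 "1993-02-04", "1993-02-05", "1993-02-08", "1993-02-09")),
  V2 = c(28.92189, 29.12760, 29.18927, 29.49786,
         29.62128, 29.60068, 29.60068, 29.39498)
)

day <- as.integer(format(dat$V1, "%d"))
# Days 1-10 -> part 1, 11-20 -> part 2, 21 and later -> part 3
part <- pmin((day - 1L) %/% 10L, 2L) + 1L

# Label each row as "year-month-part" and average within each label
period <- paste(format(dat$V1, "%Y-%m"), part, sep = "-")
means <- tapply(dat$V2, period, mean)
```

Only periods that actually contain observations appear in the result, so there are no empty NA intervals to snip.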

Combination of merge and aggregate in R

I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns and vice versa.
For each of the values in data$id I am trying to find whether there is an ns$date within the 14 days before the data$date where data$id==ns$id, and to report the number of days difference.
The output I need is a vector/column ("received") with the same number of rows as data, with TRUE/FALSE where ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
library(lubridate) # provides ymd()
# convert dates
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
#NOTE THAT this will automatically rename data$date as date.x and ns$date as date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
This link Find if date is within 14 days if id matches between datasets in R provides an idea but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested I added what the desired output should look like (same number of rows of data = 10 - no rows for data in ns not in data). Should have thought this might help earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
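For comparison, a base R sketch of the same logic without a join, on a reduced version of the question's data. It treats a difference of 0 to 13 days as received, which is an assumption chosen to match the desired output (where a difference of 0 counts):

```r
data <- data.frame(
  id = c(8, 8, 50, 87, 141),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18",
                   "2009-04-28", "2010-11-29"))
)
ns <- data.frame(
  id = c(8, 50, 50, 50, 87, 87),
  date = as.Date(c("2011-02-25", "2010-08-17", "2010-08-15",
                   "2009-06-10", "2009-04-28", "2008-01-19"))
)

# Smallest gap (in days) to an earlier ns date within the window, else NA
data$dif <- vapply(seq_len(nrow(data)), function(i) {
  d <- as.numeric(data$date[i] - ns$date[ns$id == data$id[i]])
  d <- d[d >= 0 & d < 14]
  if (length(d) > 0) min(d) else NA_real_
}, numeric(1))

# NA when the id never appears in ns, otherwise TRUE/FALSE
data$received <- ifelse(data$id %in% ns$id, !is.na(data$dif), NA)
```

This loops over the rows of data, so it is slower than the keyed join on large inputs, but it keeps one output row per input row by construction.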

R: How to get the Week number of the month

I am new to R.
I want the week number of the month that a date belongs to.
With the following code:
> CurrentDate <- Sys.Date()
> WeekNumber <- format(CurrentDate, format="%U")
> WeekNumber
[1] "31"
%U returns the week number of the year.
But I want the week number of the month.
If the date is 2014-08-01 then I want to get 1 (the date belongs to the 1st week of the month).
Eg:
2014-09-04 -> 1 (The Date belongs to the 1st week of the month).
2014-09-10 -> 2 (The Date belongs to the 2nd week of the month).
and so on...
How can I get this?
Reference:
http://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html
By analogy of the weekdays function:
monthweeks <- function(x) {
UseMethod("monthweeks")
}
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.POSIXlt <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.character <- function(x) {
ceiling(as.numeric(format(as.Date(x), "%d")) / 7)
}
dates <- sample(seq(as.Date("2000-01-01"), as.Date("2015-01-01"), "days"), 7)
dates
#> [1] "2004-09-24" "2002-11-21" "2011-08-13" "2008-09-23" "2000-08-10" "2007-09-10" "2013-04-16"
monthweeks(dates)
#> [1] 4 3 2 4 2 2 3
Another solution is to use stri_datetime_fields() from the stringi package:
stringi::stri_datetime_fields(dates)$WeekOfMonth
#> [1] 4 4 2 4 2 3 3
You can use day from the lubridate package. I'm not sure if there's a week-of-month type function in the package, but we can do the math.
library(lubridate)
curr <- Sys.Date()
# [1] "2014-08-08"
day(curr) ## 8th day of the current month
# [1] 8
day(curr) / 7 ## Technically, it's the 1.14th week
# [1] 1.142857
ceiling(day(curr) / 7) ## but ceiling() will take it up to the 2nd week.
# [1] 2
Issue Overview
It was difficult to tell which answers worked, so I built my own function nth_week and tested it against the others.
The issue that's leading to most of the answers being incorrect is this:
The first week of a month is often a short-week
Same with the last week of the month
For example, October 1st 2019 is a Tuesday, so 6 days into October (which is a Sunday) is already the second week. Also, contiguous months often share the same week in their respective counts, meaning that the last week of the prior month is commonly also the first week of the current month. So, we should expect a week count higher than 52 per year and some months that contain a span of 6 weeks.
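The short-first-week effect is easy to check in base R. The sketch below counts calendar weeks starting on Sunday by differencing %U week numbers, the same idea used in other answers here; week_of_month is a hypothetical helper name:

```r
week_of_month <- function(x) {
  x <- as.Date(x)
  first <- as.Date(format(x, "%Y-%m-01"))   # first day of the same month
  as.integer(format(x, "%U")) - as.integer(format(first, "%U")) + 1L
}

d <- as.Date(c("2019-10-05", "2019-10-06"))
week_of_month(d)                             # calendar week within October
ceiling(as.integer(format(d, "%d")) / 7)     # plain day-count approach
```

For Sunday 2019-10-06 the calendar-week count is already 2, whereas the day-count approach still reports 1.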
Results Comparison
Here's a table showing examples where some of the above suggested algorithms go awry:
DATE Tori user206 Scri Klev Stringi Grot Frei Vale epi iso coni
Fri-2016-01-01 1 1 1 1 5 1 1 1 1 1 1
Sat-2016-01-02 1 1 1 1 1 1 1 1 1 1 1
Sun-2016-01-03 2 1 1 1 1 2 2 1 -50 1 2
Mon-2016-01-04 2 1 1 1 2 2 2 1 -50 -51 2
----
Sat-2018-12-29 5 5 5 5 5 5 5 4 5 5 5
Sun-2018-12-30 6 5 5 5 5 6 6 4 -46 5 6
Mon-2018-12-31 6 5 5 5 6 6 6 4 -46 -46 6
Tue-2019-01-01 1 1 1 1 6 1 1 1 1 1 1
You can see that only Grothendieck, conighion, Freitas, and Tori are correct due to their treatment of partial week periods. I compared all days from year 100 to year 3000; there are no differences among those 4. (Stringi is probably correct for noting weekends as separate, incremented periods, but I didn't check to be sure; epiweek() and isoweek(), because of their intended uses, show some odd behavior near year-ends when using them for week incrementation.)
Speed Comparison
Below are the tests for efficiency between the implementations of: Tori, Grothendieck, Conighion, and Freitas
# prep
library(lubridate)
library(tictoc)
kepler<- ymd(15711227) # Kepler's birthday since it's a nice day and gives a long vector of dates
some_dates<- seq(kepler, today(), by='day')
# test speed of Tori algorithm
tic(msg = 'Tori')
Tori<- (5 + day(some_dates) + wday(floor_date(some_dates, 'month'))) %/% 7
toc()
Tori: 0.19 sec elapsed
# test speed of Grothendieck algorithm
wk <- function(x) as.numeric(format(x, "%U"))
tic(msg = 'Grothendieck')
Grothendieck<- (wk(some_dates) - wk(as.Date(cut(some_dates, "month"))) + 1)
toc()
Grothendieck: 1.99 sec elapsed
# test speed of conighion algorithm
tic(msg = 'conighion')
weeknum <- as.integer( format(some_dates, format="%U") )
mindatemonth <- as.Date( paste0(format(some_dates, "%Y-%m"), "-01") )
weeknummin <- as.integer( format(mindatemonth, format="%U") ) # the number of the week of the first week within the month
conighion <- weeknum - (weeknummin - 1) # this is as an integer
toc()
conighion: 2.42 sec elapsed
# test speed of Freitas algorithm
first_day_of_month_wday <- function(dx) {
day(dx) <- 1
wday(dx)
}
tic(msg = 'Freitas')
Freitas<- ceiling((day(some_dates) + first_day_of_month_wday(some_dates) - 1) / 7)
toc()
Freitas: 0.97 sec elapsed
Fastest correct algorithm, by a factor of about 5 in the timings above:
require(lubridate)
(5 + day(some_dates) + wday(floor_date(some_dates, 'month'))) %/% 7
# some_dates above is any vector of dates, like:
some_dates<- seq(ymd(20190101), today(), 'day')
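To see why the constant 5 works, here is the same formula in base R without lubridate. With the weekday of the first of the month numbered 1 (Sunday) to 7 (Saturday), (5 + day + wday) %/% 7 equals ceiling((day + wday - 1) / 7), since ceiling(n / 7) is (n + 6) %/% 7 for integer n. tori_week is a hypothetical name:

```r
tori_week <- function(x) {
  x <- as.Date(x)
  first <- as.Date(format(x, "%Y-%m-01"))
  wday_first <- as.POSIXlt(first)$wday + 1L   # 1 = Sunday, ..., 7 = Saturday
  day <- as.integer(format(x, "%d"))
  (5L + day + wday_first) %/% 7L
}

tori_week(c("2019-09-30", "2019-10-01", "2019-10-06", "2019-10-13"))
```

For these four dates the result is 5, 1, 2 and 3, in line with the nth_week output shown in this answer.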
Function Implementation
I also wrote a generalized function for it that performs either month or year week counts, begins on a day you choose (i.e. say you want to start your week on Monday), labels output for easy checking, and is still extremely fast thanks to lubridate.
nth_week<- function(dates = NULL,
count_weeks_in = c("month","year"),
begin_week_on = "Sunday"){
require(lubridate)
count_weeks_in<- tolower(count_weeks_in[1])
# day_names and day_index are for beginning the week on a day other than Sunday
# (this vector ordering matters, so careful about changing it)
day_names<- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
# index integer of first match
day_index<- pmatch(tolower(begin_week_on),
tolower(day_names))[1]
### Calculate week index of each day
if (!is.na(pmatch(count_weeks_in, "year"))) {
# For year:
# sum the day of year, index for day of week at start of year, and constant 5
# then integer divide quantity by 7
# (explicit on package so lubridate and data.table don't fight)
n_week<- (5 +
lubridate::yday(dates) +
lubridate::wday(floor_date(dates, 'year'),
week_start = day_index)
) %/% 7
} else {
# For month:
# same algorithm as above, but for month rather than year
n_week<- (5 +
lubridate::day(dates) +
lubridate::wday(floor_date(dates, 'month'),
week_start = day_index)
) %/% 7
}
# naming very helpful for review
names(n_week)<- paste0(lubridate::wday(dates,T), '-', dates)
n_week
}
Function Output
# Example raw vector output:
some_dates<- seq(ymd(20190930), today(), by='day')
nth_week(some_dates)
Mon-2019-09-30 Tue-2019-10-01 Wed-2019-10-02
5 1 1
Thu-2019-10-03 Fri-2019-10-04 Sat-2019-10-05
1 1 1
Sun-2019-10-06 Mon-2019-10-07 Tue-2019-10-08
2 2 2
Wed-2019-10-09 Thu-2019-10-10 Fri-2019-10-11
2 2 2
Sat-2019-10-12 Sun-2019-10-13
2 3
# Example tabled output:
library(tidyverse)
nth_week(some_dates) %>%
enframe('DATE','nth_week_default') %>%
cbind(some_year_day_options = as.vector(nth_week(some_dates, count_weeks_in = 'year', begin_week_on = 'Mon')))
DATE nth_week_default some_year_day_options
1 Mon-2019-09-30 5 40
2 Tue-2019-10-01 1 40
3 Wed-2019-10-02 1 40
4 Thu-2019-10-03 1 40
5 Fri-2019-10-04 1 40
6 Sat-2019-10-05 1 40
7 Sun-2019-10-06 2 40
8 Mon-2019-10-07 2 41
9 Tue-2019-10-08 2 41
10 Wed-2019-10-09 2 41
11 Thu-2019-10-10 2 41
12 Fri-2019-10-11 2 41
13 Sat-2019-10-12 2 41
14 Sun-2019-10-13 3 41
Hope this work saves people the time of having to weed through all the responses to figure out which are correct.
I don't know R, but if you take the week number of the first day of the month, you can use it to get the week within the month:
2014-09-18
First day of month = 2014-09-01
Week of first day of month = 36
Week of 2014-09-18 = 38
Week in the month = 1 + (38 - 36) = 3
Using lubridate you can do
ceiling((day(date) + first_day_of_month_wday(date) - 1) / 7)
Where the function first_day_of_month_wday returns the weekday of the first day of month.
first_day_of_month_wday <- function(dx) {
day(dx) <- 1
wday(dx)
}
This adjustment is needed to get the correct week number; otherwise, if the 7th day of the month falls on a Monday, for example, you get 1 instead of 2.
It is only a shift in the day of the month.
The minus 1 is necessary because no adjustment is needed when the first day of the month is a Sunday, and the other weekdays follow from that.
I came across the same issue and solved it with mday from the data.table package. I also realized that when using ceiling(), one needs to account for the '5th week' situation. For example, for the 30th day of a month, ceiling(30/7) gives 5! Hence the ifelse statement below.
library(data.table)
# Create a sample data table with days from year 0 until present
DT <- data.table(days = seq(as.Date("0-01-01"), Sys.Date(), "days"))
# compute the week of the month and account for the '5th week' case
DT[, week := ifelse( ceiling(mday(days)/7)==5, 4, ceiling(mday(days)/7) )]
> DT
days week
1: 0000-01-01 1
2: 0000-01-02 1
3: 0000-01-03 1
4: 0000-01-04 1
5: 0000-01-05 1
---
736617: 2016-10-14 2
736618: 2016-10-15 3
736619: 2016-10-16 3
736620: 2016-10-17 3
736621: 2016-10-18 3
To get an idea of the speed, run:
system.time( DT[, week := ifelse( ceiling(mday(days)/7)==5, 4, ceiling(mday(days)/7) )] )
# user system elapsed
# 3.23 0.05 3.27
It took approx. 3 seconds to compute the weeks for more than 700 000 days.
However, the ceiling approach above always makes the last week longer than the others (the four weeks have 7, 7, 7, and 9 or 10 days). Another way would be to use something like
ceiling(1:31/31*4)
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
where you get 7, 8, 8 and 8 days per respective week in a 31-day month.
DT[, week2 := ceiling(mday(days)/31*4)]
There is a simple way to do it with the lubridate package:
isoweek() returns the week as it would appear in the ISO 8601 system, which uses a reoccurring leap week.
epiweek() is the US CDC version of epidemiological week. It follows the same rules as isoweek() but starts on Sunday. In other parts of the world the convention is to start epidemiological weeks on Monday, which is the same as isoweek().
Reference here
I am late to the party and maybe no one is gonna read this answer...
Anyway, why not stay simple and do it like this:
library(lubridate)
x <- ymd(20200311, 20200308)
week(x) - week(floor_date(x, unit = "months")) + 1
[1] 3 2
I don't know of any built-in function, but a workaround would be
CurrentDate <- Sys.Date()
# The number of the week relative to the year
weeknum <- as.integer( format(CurrentDate, format="%U") )
# Find the minimum week of the month relative to the year
mindatemonth <- as.Date( paste0(format(CurrentDate, "%Y-%m"), "-01") )
weeknummin <- as.integer( format(mindatemonth, format="%U") ) # the number of the week of the first week within the month
# Calculate the number of the week relative to the month
weeknum <- weeknum - (weeknummin - 1) # this is as an integer
# With the following you can convert the integer to the same format of
# format(CurrentDate, format="%U")
formatC(weeknum, width = 2, flag = "0")
Simply do this:
library(lubridate)
ds1$Week <- week(ds1$Sale_Date)
This is high performance! It works instantly on my 12 million row dataset.
In the example above, ds1 is the dataset and Sale_Date is a date column (like "2015-11-23").
The other approach, using "as.integer( format..." might work on small datasets, but on 12 million rows it would keep running forever...
