lubridate::floor_date returns NA - r

I am trying to write a function that, given a date start and an integer n, adds n months to the date and then takes the last day of the resulting month.
The following piece of code, however, returns NA for n=8+12k, but seems to work in other cases.
library(lubridate)
start <- ymd("2020-06-29")
n <- 8
floor_date(start + months(n), "month") + months(1) - days(1)
>[1] NA
I guess this is somehow due to the existence of leap years, but I still find it puzzling. Plus, I couldn't find anything in the docs about this.
Can someone please explain what is going on, or suggest a better way to do the job?

The clock package is really good for this: it has an explicit argument specifying what you want done with invalid dates, as well as the date_end function.
I think you want the following (R >= 4.1 is needed for the native pipe):
library(clock)
start <- date_parse("2020-06-29")
n <- 8
start |> add_months(n, invalid = "previous") |> date_end('month')
Which gives:
[1] "2021-02-28"
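As for why the original expression returns NA: start + months(8) lands on 2021-02-29, which does not exist, and lubridate returns NA for such a date rather than guessing. One way to sidestep this is to move to the first of the month before adding months, since day 1 exists in every month. Here is a minimal base R sketch of that idea; last_day_after is a hypothetical helper name (not part of lubridate or clock), and it assumes a single date and n >= 0.

```r
# Hypothetical helper: last day of the month that lies n months after `start`.
# Assumes `start` is a single Date and n is a non-negative whole number.
last_day_after <- function(start, n) {
  # First day of the starting month; this day exists in every month,
  # so adding whole months to it can never produce an invalid date.
  first_of_month <- as.Date(format(start, "%Y-%m-01"))
  # First day of the month n + 1 months ahead.
  first_of_next <- seq(first_of_month, by = paste(n + 1, "months"),
                       length.out = 2)[2]
  # One day earlier is the last day of the target month.
  first_of_next - 1
}

last_day_after(as.Date("2020-06-29"), 8)
# [1] "2021-02-28"
```

The same ordering (floor first, then add months, then subtract one day) should carry over to lubridate, because the intermediate dates always fall on the first of a month.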

Related

round_date() function returns floor_date instead of rounded date

Using an example from a related issue: nearest month end in R
library(lubridate)
library(dplyr)
dt<-data.frame(orig_dt=as.Date(c("1997-04-01","1997-06-29")))
dt %>% mutate(round_dt=round_date(orig_dt, unit="month"),
modified_dt=round_date(orig_dt, unit="month")-days(1))
In one session I correctly get the rounded dates (R 4.0.0, Rcpp_1.0.4.6 loaded via a namespace):
orig_dt round_dt modified_dt
1 1997-04-01 1997-04-01 1997-03-31
2 1997-06-29 1997-07-01 1997-06-30
In another session I get floor instead of round (different machine, R 4.0.2, Rcpp not loaded via a namespace):
orig_dt round_dt modified_dt
1 1997-04-01 1997-04-01 1997-03-31
2 1997-06-29 1997-06-01 1997-05-31
I think it could be related to Rcpp, as earlier I got an error message:
Error in C_valid_tz(tzone) (rscrpt.R#27): function 'Rcpp_precious_remove' not provided by package 'Rcpp'
Although I am not getting the error anymore, the values are different and I wonder why/how to fix it without going through complete reinstallation.
I am able to reproduce your issue in a vanilla R session.
$ R --vanilla
> packageVersion("lubridate")
[1] ‘1.8.0’
> library("lubridate")
> round_date(x = as.Date("1997-06-29"), unit = "month")
[1] "1997-06-01"
It seems to be a bug in round_date, introduced in this commit. Prior to the commit, the body of round_date contained:
above <- unclass(as.POSIXct(ceiling_date(x, unit = unit, week_start = week_start)))
mid <- unclass(as.POSIXct(x))
below <- unclass(as.POSIXct(floor_date(x, unit = unit, week_start = week_start)))
Here, below, mid, and above are defined as the number of seconds from 1970-01-01 00:00:00 UTC to the month-floor of x, x, and the month-ceiling of x, respectively (more precisely, time 00:00:00 on those three Dates, in your system's time zone). Thus, below < mid < above, and round_date would compare mid-below to above-mid to determine which of below and above was closer to mid.
Since the commit, mid has been defined as
mid <- unclass(x)
which is the number of days from 1970-01-01 to x. Now, mid << below < above, making mid-below negative and above-mid positive. As a result, round_date considers below to be "closer" to mid than above, and it incorrectly rounds 1997-06-29 down to 1997-06-01.
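The unit mismatch can be checked with plain arithmetic in base R. This is only an illustration of the numbers involved, not lubridate's actual code; the variable names mirror the ones above, and mid_seconds and mid_days are names made up for the comparison.

```r
x <- as.Date("1997-06-29")

# Month floor and month ceiling of x, as seconds since 1970-01-01 (UTC).
below <- as.numeric(as.POSIXct(as.Date("1997-06-01")))
above <- as.numeric(as.POSIXct(as.Date("1997-07-01")))

# x itself, once in seconds (old behaviour) and once in days (new behaviour).
mid_seconds <- as.numeric(as.POSIXct(x))
mid_days    <- as.numeric(x)

# Consistent units: x is 28 days after the floor and 2 days before the ceiling.
days_from_below <- (mid_seconds - below) / 86400
days_to_above   <- (above - mid_seconds) / 86400

# Mixed units: a day count is tiny next to a second count, so the
# difference to `below` turns negative.
mixed_from_below <- mid_days - below
mixed_to_above   <- above - mid_days
```

With consistent units the ceiling is clearly closer (2 days versus 28), whereas with mixed units mixed_from_below is negative and therefore smaller than mixed_to_above.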
I have reported the regression to the package maintainers here. I imagine that it will be fixed soon...
In the meantime, you can try reverting to an older version of lubridate, from before the commit, or using this temporary workaround:
round_date_patched <- function(x, unit) {
as.Date(round_date(as.POSIXct(x), unit = unit))
}
round_date_patched(x = as.Date("1997-06-29"), unit = "month") # "1997-07-01"

Trying to extract a date from a 5 or 6-digit number

I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately the output of the cloud-based database (REDCap) gets formatted as a number, so that the leading zero is lost for those born on the first nine days of the month, and they end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, i.e. 311230 for 31st December 1930, or 11230 for 1st December 1930. I end up with two problems that I have not been able to solve.
Let's say we use the following numbers:
dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)
I convert these into string, and then apply the as.Date() function:
datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal
The problems I have are:
Five-digit numbers (e.g. 11230) get reported as NA.
Six-digit numbers are recognized, but those born before 1.1.1969 get reported with 100 years added, i.e. 010160 gets converted to 2060.01.01.
I am sure this must be easy for those who are more knowledgeable about R, but, I struggle a bit solving this. Any help is greatly appreciated.
Greetings
Bjorn
If your 5-digit numbers really just need to be zero-padded, then
dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
From there, regarding your question about "dates before 1969", take a look at ?strptime for the '%y' pattern:
'%y' Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2018 POSIX standard, but it does also say
'it is expected that in a future version the default century
inferred from a 2-digit year will change'.
So if you have specific alternate years for those, you need to add the century before sending to as.Date (which uses strptime-patterns).
dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
gsub("([5-9][0-9])$", "19\\1", dato_s)),
format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"
In this case, I'm assuming 50-99 will be 1900, everything else 2000. If you need 40s or 30s, feel free to adjust the pattern: add digits to the second pattern (e.g., [3-9]) and remove from the first pattern (e.g., [0-2]), ensuring that all decades are included in exactly one pattern, not "neither" and not "both".
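If adjusting the regex by hand feels error-prone, the same idea can be written with an explicit cutoff. This is a minimal base R sketch; parse_ddmmyy and its cutoff argument are hypothetical names introduced here for illustration, and it assumes the input really is day-month-year with at most six digits.

```r
# Hypothetical helper: turn ddmmyy numbers into Dates, reading two-digit
# years >= cutoff as 19xx and everything below the cutoff as 20xx.
parse_ddmmyy <- function(x, cutoff = 50) {
  s  <- sprintf("%06d", as.integer(x))       # restore the lost leading zero
  yy <- as.integer(substr(s, 5, 6))          # two-digit year
  century <- ifelse(yy >= cutoff, "19", "20")
  # Rebuild the string as ddmmYYYY so that %Y is unambiguous.
  as.Date(paste0(substr(s, 1, 4), century, substr(s, 5, 6)),
          format = "%d%m%Y")
}

parse_ddmmyy(c(311230, 51269, 51204), cutoff = 30)
# [1] "1930-12-31" "1969-12-05" "2004-12-05"
```

Every decade falls on exactly one side of the cutoff, so no year can match "neither" or "both".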
Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:
dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[ dato_d > Sys.Date() ] <-
as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
You can make this a bit easier using lubridate, and noting that no-one can have a date of birth that is in the future of the current time:
library(lubridate)
dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)
dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"
Of course, without further information, this method will not (nor will any other method) be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.
Created on 2020-06-29 by the reprex package (v0.3.0)
Converting five digit "numbers" to six digits is straightforward: x <- stringr::str_pad(x, 6, pad="0") or similar will do the trick.
Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.
I suspect all dates with a two-digit year of 68 or lower are affected, not just those before 01Jan1960. That's because, when parsing %y, as.Date prefixes the values 00 to 68 with 20 and 69 to 99 with 19. The origin argument of as.Date only applies when converting a number of days, so it will not help here; your solution is to add the appropriate century to the string before converting it. And then start using four digit years in the future! ;)

What does the lubridate note "method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’" mean?

When I run the following code in R I get a strange note (it only appears the first time I run the code in a session):
> library(lubridate)
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
Warning message:
package ‘lubridate’ was built under R version 3.3.2
> data.frame(i = interval(ymd(20140101), ymd(20160101)))$i %/% years(1)
Note: method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’,
target signature ‘Interval#Period’.
"Interval#ANY", "ANY#Period" would also be valid
[1] 2
I am doubly confused:
I am unclear as to what the alternative syntax is that it is recommending. A # is a comment in R, so presumably the hash is meant to mean something other than a hash, but what?
Is it telling me I am doing something wrong? The note seems to suggest it is an FYI, but an FYI is an odd thing to be spat out of a function if there is no problem.
This warning will only occur the first time you run the code. It reminds you that integer division has a problem here: months and years do not necessarily have the same length when expressed in other units such as hours or days.
Suppose that we divide the interval 2014--2018 by 2 years. It would not be completely correct to say that the answer is 2, because 2016 is a leap year and has 366 days. So the answer is correct if your unit of measure is only years, but it is not strictly correct if you present it as an interval (which may be expressed in years, but also in days or hours).
There is really no way around the warning either (at least not for integer division), as the warning is always to the point, even if you are dividing interval %/% interval or period %/% period.
But it will only show the first time you run your division, after that it goes silent.
data.frame(i = interval(ymd(20140101), ymd(20160101)))$i %/% years(1)
Note: method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’,
target signature ‘Interval#Period’.
"Interval#ANY", "ANY#Period" would also be valid
[1] 2
data.frame(i = interval(ymd(20140101), ymd(20160101)))$i %/% years(1)
[1] 2
In theory it should be possible to avoid the warning if both sides of the division are represented as a timespan class. But I have never tried to do that.
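Another way to read the note, which also covers the question about the "#": the wording looks like what R's S4 method dispatch prints when several inherited methods match a call equally well, and in that message the "#" simply separates the classes of the two arguments (so ‘Interval#Period’ would mean "first argument an Interval, second a Period"). The sketch below tries to recreate the situation with toy classes. Everything in it (Timespan0, Interval0, Period0, div0) is made up for illustration and is not lubridate code, and which method R ends up picking is not something I would rely on.

```r
library(methods)

# Toy class hierarchy: two concrete classes sharing one virtual parent.
setClass("Timespan0", representation("VIRTUAL"))
setClass("Interval0", contains = "Timespan0", representation(x = "numeric"))
setClass("Period0",   contains = "Timespan0", representation(x = "numeric"))

setGeneric("div0", function(e1, e2) standardGeneric("div0"))

# Three methods, none of them defined for exactly (Interval0, Period0).
# Each returns a label so we can see which one was dispatched.
setMethod("div0", signature("Timespan0", "Timespan0"),
          function(e1, e2) "Timespan0#Timespan0")
setMethod("div0", signature("Interval0", "ANY"),
          function(e1, e2) "Interval0#ANY")
setMethod("div0", signature("ANY", "Period0"),
          function(e1, e2) "ANY#Period0")

# The first call has to choose between the candidates and may print a note
# of the same shape as the one in the question; the choice is then cached.
res <- div0(new("Interval0", x = 1), new("Period0", x = 1))
res
```

If that reading is right, the note is purely informational: the result is still computed, R just tells you once which of the equally plausible methods it picked.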

R - strtoi strange behavior to get week of year

I use strtoi to determine the week of year in the following function:
to.week <- function(x) strtoi(format(x, "%W"))
It works fine for most dates:
> to.week(as.Date("2015-01-11"))
[1] 1
However, when I'm trying dates between 2015-02-23 and 2015-03-08, I get NA as a result:
> to.week(as.Date("2015-02-25"))
[1] NA
Could you please explain to me what causes the problem?
Here is an implementation that works:
to.week <- function(x) as.integer(format(x, "%W"))
The reason strtoi fails is that, by default, it tries to interpret numbers as octal when they are preceded by a "0". Since "%W" returns "08", and 8 is not a valid octal digit, you get NA. From ?strtoi:
Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.
...
For decimal strings as.integer is equally useful.
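To see the octal effect in isolation, here is a small base R sketch. It passes base = 8L explicitly so that the octal interpretation does not depend on how the default is resolved; to_week is just a renamed copy of the as.integer version above.

```r
# The week string that caused the trouble.
week_string <- format(as.Date("2015-02-25"), "%W")   # "08"

# Parsed as octal, "08" is invalid because 8 is not an octal digit.
octal_08 <- strtoi(week_string, base = 8L)           # NA

# Valid octal strings parse fine: "07" is 7, and "10" is 8 in decimal.
octal_ok <- strtoi(c("07", "10"), base = 8L)         # 7 8

# as.integer always reads the string as decimal.
to_week <- function(x) as.integer(format(x, "%W"))
to_week(as.Date("2015-02-25"))
# [1] 8
```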
Also, you can use:
week(as.Date("2015-02-25"))
Though you may have to offset the result of that by 1 to match your expectations.
You can slightly modify your code like this
to.week <- function(x) strtoi(format(x, "%W"), 10)
and use base 10.

Forcing full weeks with apply.weekly()

I'm trying to figure out what xts (or zoo) uses as the time after doing an apply.period. Consider the following:
> myTs = xts(1:10, as.Date(1:10, origin = '2012-12-1'))
> apply.weekly(myTs, colSums)
[,1]
2012-12-02 1
2012-12-09 35
2012-12-11 19
I think the '2012-12-02' means "for the week ending 2012-12-02, the sum is 1". So basically the time is the end of the week.
But the problem is with that "2012-12-11" - I think what it's doing is saying that the 11th is the last day of the week that was given, so it's giving that as the time.
Is there any way to force it to give the Sunday on which it ends, even if that day was not included in the data set?
Try this:
nextsun <- function(x) 7 * ceiling(as.numeric(x-0+4) / 7) + as.Date(0-4)
aggregate(myTs, nextsun, sum)
where nextsun was derived from nextfri code given in the zoo quick reference by replacing 5 (for Friday) with 0 (for Sunday).
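To see what the formula does without xts or zoo loaded, here is a base R sketch of the same arithmetic. It relies on 1970-01-01 being a Thursday, so day number -4 is a Sunday. next_sunday is a renamed copy of nextsun with an explicit origin (an adjustment made here so that it runs in plain R), and tapply stands in for aggregate on the xts object.

```r
# Same arithmetic as nextsun, written with an explicit origin:
# round the day count up to the next multiple of 7, measured from a Sunday.
next_sunday <- function(x) {
  d <- as.numeric(x)                      # days since 1970-01-01 (a Thursday)
  as.Date(7 * ceiling((d + 4) / 7) - 4, origin = "1970-01-01")
}

# Same data as in the question: values 1..10 on 2012-12-02 .. 2012-12-11.
dates  <- as.Date("2012-12-01") + 1:10
values <- 1:10

# Sum per week, labelled by the Sunday on which each week ends.
weekly <- tapply(values, format(next_sunday(dates)), sum)
weekly
# 2012-12-02 2012-12-09 2012-12-16 
#          1         35         19 
```

The sums match the apply.weekly output in the question, but the last group is now labelled 2012-12-16 (the Sunday ending that week) instead of 2012-12-11.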
Those are full weeks. It's only showing you the date of the very last observation. See ?endpoints (apply.weekly is essentially a thin wrapper for endpoints).
apply.weekly
function (x, FUN, ...)
{
ep <- endpoints(x, "weeks")
period.apply(x, ep, FUN, ...)
}
<environment: namespace:xts>
From ?endpoints
endpoints returns a numeric vector corresponding to the last
observation in each period specified by on, with a zero added to the
beginning of the vector, and the index of the last observation in x at
the end.
Valid values for the argument on include: “us” (microseconds),
“microseconds”, “ms” (milliseconds), “milliseconds”, “secs” (seconds),
“seconds”, “mins” (minutes), “minutes”, “hours”, “days”, “weeks”,
“months”, “quarters”, and “years”.
The answer to your second question is no, there is no option to do so. But you could always edit the last date manually; if you're going to present all the data wrapped up anyway, I don't see any harm in it.
No, you can't force it to give you the Sunday,
because the index of the result of period.apply is given by
ep <- endpoints(myTs,'weeks')
myTs[ep]
[,1]
2012-12-02 2
2012-12-09 9
2012-12-10 10
So you need to shift the last date. Unfortunately xts doesn't offer this option: you can't shift a single value of the index. I don't know why (maybe it is a design choice to keep the index unique).
E.g., you can do the following:
ts.weeks <- apply.weekly(myTs, colSums)
ts.weeks[length(ts.weeks)] <- last(index(myTs)) + 7-last(floor(diff(ep)))

Resources