Converting dates in R is changing to future dates, not past

I use this:
want=as.Date(date, '%d-%b-%y')
to convert dates like this: 1-JAN-52
Instead of returning '1952-01-01' I am getting '2052-01-01'. Any advice?

Welcome to modern computers, all shaped after the early Unix systems of the 1970s. The start of time, so to speak, is the epoch, aka 1 Jan 1970.
Your problem here, in a nutshell, is the inferior input data. You only supply a two-digit year, and by a widely followed convention (see ?strptime), values 00 to 68 are taken to be in the 2000s and 69 to 99 in the 1900s. It's all about the epoch.
So you have two choices. You could prepend '19' to the year part and parse via %Y, or you could just take the year value out of the date and reduce it by 100 if need be.
Some example code for the second (and IMHO better) option, making 1970 the cutoff date:
> datestr <- "1-Jan-52"
> d <- as.Date(datestr, '%d-%b-%y')
>
> d
[1] "2052-01-01"
>
> if (as.integer(strftime(d, "%Y")) >= 1970) {
+ dp <- as.POSIXlt(d)
+ dp$year <- dp$year - 100
+ d <- as.Date(dp)
+}
> d
[1] "1952-01-01"
>
You need to go via POSIXlt to get the components easily.
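The same idea can be sketched for a whole vector of strings. This is only a sketch: fix_century is a hypothetical helper name, and it assumes that no date in the data can lie in the future, so only dates that parse to later than today are moved back by 100 years. It also assumes an English locale for %b.

```r
# Hypothetical helper: parse d-Mon-yy strings and pull future dates back a century.
# Assumption: no real date in the data is later than `today`.
fix_century <- function(x, format = "%d-%b-%y", today = Sys.Date()) {
  d <- as.Date(x, format)
  future <- !is.na(d) & d > today      # dates that %y pushed into the 2000s too far
  lt <- as.POSIXlt(d[future])          # go via POSIXlt to reach the year component
  lt$year <- lt$year - 100
  d[future] <- as.Date(lt)
  d
}

fix_century(c("1-Jan-52", "15-Mar-99", "2-Feb-05"))
```

Passing today explicitly (e.g. today = as.Date("2024-01-01")) makes the result reproducible.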

Related

lubridate::floor_date returns NA

I am trying to write a function that, given a date start and an integer n, adds n months to the date and then takes the last day of the resulting month.
The following piece of code, however, returns NA for n=8+12k, but seems to work in other cases.
library(lubridate)
start <- ymd("2020-06-29")
n <- 8
floor_date(start + months(n), "month") + months(1) - days(1)
>[1] NA
I guess this is somewhat due to the existence of leap years, but I still find it puzzling. Plus, I couldn't find anything in the docs about this.
Can someone please explain what is going on, or suggest a better way to do the job?
The clock package is really good for this, as it has an explicit argument specifying what you want done with invalid dates, and it also has the date_end function.
I think you want (R >= 4.1 for the native pipe):
library(clock)
start <- parse_date("2020-06-29")
n <- 8
start |> add_months(n, invalid = "previous") |> date_end('month')
Which gives:
[1] "2021-02-28"
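As for the NA itself: adding 8 months to 2020-06-29 lands on 2021-02-29, which does not exist, and lubridate returns NA for such invalid dates. If you would rather avoid the invalid intermediate date altogether, here is a minimal base R sketch. end_of_month_after is a hypothetical helper name; it steps from the first day of the month, which always exists.

```r
# Hypothetical helper: last day of the month that lies n months after `start`.
end_of_month_after <- function(start, n) {
  first <- as.Date(format(start, "%Y-%m-01"))   # first day of start's month
  # first day of the month AFTER the target month (n + 1 months ahead)
  next_first <- seq(first, by = "month", length.out = n + 2)[n + 2]
  next_first - 1                                # step back one day
}

end_of_month_after(as.Date("2020-06-29"), 8)
# [1] "2021-02-28"
```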

How do I check if time is positive or negative and compare it with fixed time

Time is not a numeric datatype. So how do I
check if a time is positive or negative, and
compare a time with some fixed time value?
library(lubridate)
library(hms)  # as_hms() comes from the hms package, not lubridate
t = as_hms(Sys.time())
# question 1 - check if t is positive or negative
# question 2 - compare with time 11:23:01
Load the package lubridate. Then make sure your object has a date-time class. If not, use the lubridate::as_datetime(object) function to convert it.
Then, all the "classic" functions will apply:
as_datetime(Sys.time()) > 0
[1] TRUE
And to compare two dates:
t <- as_datetime(Sys.time())
e <- as_datetime("2021-08-01")
> e > t
[1] TRUE
> e - t
Time difference of 30.38764 days
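If only the time of day matters (as in the 11:23:01 example), one possible approach in base R is to reduce each date-time to seconds since midnight and compare those numbers. This is a sketch: secs_since_midnight is a hypothetical helper, and the UTC time zone is an assumption you may want to change.

```r
# Hypothetical helper: seconds since midnight for a date-time, in a given time zone.
secs_since_midnight <- function(x, tz = "UTC") {
  lt <- as.POSIXlt(x, tz = tz)
  lt$hour * 3600 + lt$min * 60 + lt$sec
}

now_secs   <- secs_since_midnight(Sys.time())
# The calendar date is irrelevant here, only the 11:23:01 part is used.
fixed_secs <- secs_since_midnight(as.POSIXct("1970-01-01 11:23:01", tz = "UTC"))

now_secs >= 0          # a time of day is never negative
now_secs > fixed_secs  # TRUE if the current UTC time of day is after 11:23:01
```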

Trying to extract a date from a 5 or 6-digit number

I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately, the output of the cloud-based database (REDCap) is formatted as a number, so the leading zero is dropped for those born in the first nine days of the month, and they end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, i.e. 311230 for 31 December 1930, or 11230 for 1 December 1930. I end up with two problems that I have not been able to solve.
Let's say we use the following numbers:
dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)
I convert these to strings, and then apply the as.Date() function:
datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal
The problems I have are:
Five-digit numbers (e.g. 11230) get reported as NA.
Six-digit numbers are recognized, but those born before 1.1.1969 get reported with 100 years added, i.e. 010160 gets converted to 2060-01-01.
I am sure this must be easy for those who are more knowledgeable about R, but I struggle a bit with solving this. Any help is greatly appreciated.
Greetings
Bjorn
If your 5-digit numbers really just need to be zero-padded, then
dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
From there, regarding your question about "dates before 1969", take a look at ?strptime for the '%y' pattern:
'%y' Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2018 POSIX standard, but it does also say
'it is expected that in a future version the default century
inferred from a 2-digit year will change'.
So if you have specific alternate years for those, you need to add the century before sending to as.Date (which uses strptime-patterns).
dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
gsub("([5-9][0-9])$", "19\\1", dato_s)),
format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"
In this case, I'm assuming 50-99 will be 1900, everything else 2000. If you need 40s or 30s, feel free to adjust the pattern: add digits to the second pattern (e.g., [3-9]) and remove from the first pattern (e.g., [0-2]), ensuring that all decades are included in exactly one pattern, not "neither" and not "both".
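If juggling two regex patterns feels error-prone, the same rule can be sketched with a single numeric pivot, so every two-digit year falls on exactly one side. add_century and its pivot argument are hypothetical names, and the default pivot of 50 just mirrors the 50-99 assumption above.

```r
# Hypothetical helper: insert the century into zero-padded ddmmyy strings.
# Years >= pivot are read as 19xx, everything else as 20xx.
add_century <- function(ddmmyy, pivot = 50) {
  yy <- as.integer(substr(ddmmyy, 5, 6))
  century <- ifelse(yy >= pivot, "19", "20")
  as.Date(paste0(substr(ddmmyy, 1, 4), century, substr(ddmmyy, 5, 6)),
          format = "%d%m%Y")
}

add_century(sprintf("%06d", c(311230, 311267, 51204)))
# [1] "2030-12-31" "1967-12-31" "2004-12-05"
```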
Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:
dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[ dato_d > Sys.Date() ] <-
as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
You can make this a bit easier using lubridate, and noting that no-one can have a date of birth that is in the future of the current time:
library(lubridate)
dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)
dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"
Of course, without further information, this method will not (nor will any other method) be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.
Created on 2020-06-29 by the reprex package (v0.3.0)
Converting five digit "numbers" to six digits is straightforward: x <- stringr::str_pad(x, 6, pad="0") or similar will do the trick.
Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.
The affected dates are those whose two-digit year is 00 to 68, because the %y conversion reads them as 20xx (see ?strptime). Note that the origin argument of as.Date only applies to numeric input and has no effect on how two-digit years in a string are parsed, so you need to add the century yourself before converting, or shift the dates that land in the future back by 100 years. And then start using four-digit years in the future! ;)

Descriptive statistics of time variables

I want to compute simple descriptive statistics (mean, etc.) of the times when people go to bed. I ran into two problems. The original data comes from an Excel file in which just the time that people went to bed was typed in, in 24-hour format. My problem is that R so far doesn't recognize when people went to bed at 1:00 am the next day, meaning that a person who went to bed at 10 pm should be 3 hours apart from one who went to bed at 1:00 am (and not 21 hours).
In my data frame the variable in_bed is in POSIXct format, so I thought I would apply an if statement saying that if the time is before 12:00 then I want to add 24 hours.
My function is:
Patr$in_bed <- if(Patr$in_bed < "1899-12-30 12:00") {
Patr$in_bed + 24*60*60
}
My data frame looks like this
in_bed
1 1899-12-30 22:13:00
2 1899-12-30 23:44:00
3 1899-12-30 00:08:00
If I run my function my variable gets deleted and the following error message gets printed:
Warning message:
In if (Patr$in_bed < "1899-12-30 12:00") { :
the condition has length > 1 and only the first element will be used
What am I doing wrong, or does anyone have a better idea? And can I run commands such as mean on variables in POSIXct format, and if not, how do I do it?
When you compare Patr$in_bed (a vector) with "1899-12-30 12:00" (a single value), you get a logical vector. But the if statement requires a single logical value, so it generates a warning and considers only the first element of the vector (since R 4.2.0 this is an error rather than a warning).
You can try :
Patr$in_bed <- Patr$in_bed + 24*60*60 * (Patr$in_bed < as.POSIXct("1899-12-30 12:00"))
Explanation: the comparison in the parentheses returns a logical vector, which is converted to integer (0 for FALSE and 1 for TRUE). The dates for which the statement is true then have 24*60*60 seconds added, and the other dates have 0 added.
But since the POSIXct format includes the date, I don't see the purpose of adding 24 hrs. For instance,
as.POSIXct("1899-12-31 01:00:00") - as.POSIXct("1899-12-30 22:00:00")
returns a time difference of 3 hours, not 21.
To answer your last question, yes you can compute the mean of a POSIXct vector, simply with :
mean(Patr$in_bed)
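Here is a small self-contained sketch of both steps on the three sample rows from the question. The explicit UTC time zone is an assumption, made so that the result does not depend on the local settings.

```r
# Sample bedtimes from the question, with an explicit time zone (an assumption).
in_bed <- as.POSIXct(c("1899-12-30 22:13:00",
                       "1899-12-30 23:44:00",
                       "1899-12-30 00:08:00"), tz = "UTC")
noon <- as.POSIXct("1899-12-30 12:00:00", tz = "UTC")

# Times before noon are treated as "after midnight" and moved to the next day.
shifted <- in_bed + 24 * 60 * 60 * (in_bed < noon)

format(mean(in_bed),  "%H:%M:%S", tz = "UTC")   # without the shift
# [1] "15:21:40"
format(mean(shifted), "%H:%M:%S", tz = "UTC")   # with the shift
# [1] "23:21:40"
```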
Hope it helps,
Jérémy

Forcing full weeks with apply.weekly()

I'm trying to figure out what xts (or zoo) uses as the time after doing an apply.period. Consider the following:
> myTs = xts(1:10, as.Date(1:10, origin = '2012-12-1'))
> apply.weekly(myTs, colSums)
[,1]
2012-12-02 1
2012-12-09 35
2012-12-11 19
I think the '2012-12-02' means "for the week ending 2012-12-02, the sum is 1". So basically the time is the end of the week.
But the problem is with that "2012-12-11" - I think what it's doing is saying that the 11th is the last day of the week that was given, so it's giving that as the time.
Is there any way to force it to give the Sunday on which the week ends, even if that day was not included in the data set?
Try this:
nextsun <- function(x) 7 * ceiling(as.numeric(x-0+4) / 7) + as.Date(0-4)
aggregate(myTs, nextsun, sum)
where nextsun was derived from nextfri code given in the zoo quick reference by replacing 5 (for Friday) with 0 (for Sunday).
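Note that as.Date(0-4) without an origin relies on zoo being loaded (or, as far as I know, on R >= 4.3). A base R sketch of the same rounding with an explicit reference Sunday might look like this; next_sunday is a hypothetical name, and 1969-12-28 is simply a known Sunday (four days before Thursday 1970-01-01).

```r
# Hypothetical helper: round each date up to the next Sunday (Sundays stay put).
next_sunday <- function(x) {
  sunday0 <- as.Date("1969-12-28")                  # a known Sunday
  sunday0 + 7 * ceiling(as.numeric(x - sunday0) / 7)
}

dates  <- as.Date(1:10, origin = "2012-12-01")      # same index as myTs
weekly <- tapply(1:10, format(next_sunday(dates)), sum)
weekly
# 2012-12-02 2012-12-09 2012-12-16
#          1         35         19
```

The last group is now labelled with the Sunday that closes the week, even though that day is not in the data.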
Those are full weeks. It's only showing you the date of the very last observation. See ?endpoints (apply.weekly is essentially a thin wrapper for endpoints).
apply.weekly
function (x, FUN, ...)
{
ep <- endpoints(x, "weeks")
period.apply(x, ep, FUN, ...)
}
<environment: namespace:xts>
From ?endpoints
endpoints returns a numeric vector corresponding to the last
observation in each period specified by on, with a zero added to the
beginning of the vector, and the index of the last observation in x at
the end.
Valid values for the argument on include: “us” (microseconds),
“microseconds”, “ms” (milliseconds), “milliseconds”, “secs” (seconds),
“seconds”, “mins” (minutes), “minutes”, “hours”, “days”, “weeks”,
“months”, “quarters”, and “years”.
The answer to your second question is no, there is no option to do so. But you could always edit the last date manually; if you're going to present all the data wrapped up anyway, I don't see any harm in it.
No, you can't force it to give you the Sunday,
because the index of the result of period.apply is given by
ep <- endpoints(myTs,'weeks')
myTs[ep]
[,1]
2012-12-02 2
2012-12-09 9
2012-12-10 10
So you need to shift the last date. Unfortunately xts doesn't offer this option; you can't shift a single value of the index. I don't know why (maybe a design choice to keep the index unique).
E.g., you can do the following:
ts.weeks <- apply.weekly(myTs, colSums)
ts.weeks[length(ts.weeks)] <- last(index(myTs)) + 7-last(floor(diff(ep)))
