age_calc in eeptools producing calculation error - r

I am trying to calculate age in months, and one particular date gives me an incorrect result.
See the code below. For some reason, when the end date is March 1, 2019 the age in months is 2.96, which is incorrect. For March 2, 2019 the function does better, although I believe the result should be 4.x.
age_calc(as.Date("10/30/18",format="%m/%d/%y"),as.Date("3/1/19",format="%m/%d/%y"),units="months")
#[1] 2.960829
age_calc(as.Date("10/30/18",format="%m/%d/%y"),as.Date("3/2/19",format="%m/%d/%y"),units="months")
#[1] 3.993088
Is this an error in the function, or am I doing something wrong? Is this an issue because February has fewer days?

This is perhaps more of an extended comment.
eeptools::age_calc seems to calculate the age in months in an unconventional way (you can see the source code when you type age_calc into an R terminal and hit Enter).
Perhaps a more canonical way to calculate the age between two dates in months is to simply divide the interval by the unit duration. A relevant & interesting post is Get the difference between dates in terms of weeks, months, quarters, and years.
In that post, @Gregor defined a convenient function that does something similar to eeptools::age_calc:
library(lubridate)
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
calc.age = interval(dob, age.day) / duration(num = 1, units = units)
if (floor) return(as.integer(floor(calc.age)))
return(calc.age)
}
Using age we then get
age(
as.Date("10/30/18",format="%m/%d/%y"),
as.Date("3/1/19",format="%m/%d/%y"),
units="months", floor = FALSE)
#[1] 4.010959
age(
as.Date("10/30/18",format="%m/%d/%y"),
as.Date("3/2/19",format="%m/%d/%y"),
units="months", floor = FALSE)
#[1] 4.043836
These values are consistent with values that e.g. Wolfram Alpha gives for the same dates.
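As a rough check on where those figures come from, the sketch below counts the days between the dates in base R and divides by an average month length. The divisor 365/12 is an assumption, chosen because it reproduces the values shown above. Other conventions, such as 365.25/12, give slightly different decimals.

```r
# Base R sketch: age in months as elapsed days / average month length.
start <- as.Date("10/30/18", format = "%m/%d/%y")
ends  <- as.Date(c("3/1/19", "3/2/19"), format = "%m/%d/%y")

# Number of days between the start date and each end date
days <- as.numeric(ends - start)

# Assumed average month length: 365 days spread over 12 months
avg_month <- 365 / 12

months <- days / avg_month
print(days)
print(months)
```

Both end dates are more than 120 days after the start, so any answer below 4 months for these inputs is hard to justify under this convention.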

Related

Identify Min & Max Numeric Value within Date/Datetime range repeatedly

I am completely new to R so this is proving too complex to handle for me right now, so any help is much appreciated.
I am analysing price action data for BTC. I have 1 minute candles from 2019-09-08 19:13:00 to 2022-03-15 00:22:00 with the variables of open, high, low, close price as well as volume in BTC & USD and trade count for each of those minutes. Data source is https://www.cryptodatadownload.com/data/binance/ for anyone interested.
I cleaned up & correctly formatted the data and now want to analyse when BTC price made a low & high for various date & time ranges, for example:
At what time of day, in 30-minute increments, did BTC make its low for the week?
Here is what I believe I need to do:
I need to tell R that 30 minutes is a range and identify the lowest and highest values of the "Low" and "High" variables within it. I need the same for a day as a range, and again for a week as a range.
Then I'd need to mark these values, the best method I can think of would be creating a new variable and have it as a TRUE/FALSE column like so:
btcusdt_binance_fut_1min$pa.low.of.week.30min
btcusdt_binance_fut_1min$pa.high.of.week.30min
Every minute row that is within that 30min low and high will be marked TRUE and every other minute within that week will be marked FALSE.
I looked at lubridate's interval() function but as far as I know the problem is I'd need to define each year, month, week, day, 30mins interval individually with start and end time, which is obviously not feasible. I believe I run into the same problem with the subset() function.
Another option seems to be the seq() and seq.POSIXt() functions as well as the range() function, but I haven't found a way for it.
Here is all my code and I am using this data set: https://www.cryptodatadownload.com/cdd/BTCUSDT_Binance_futures_data_minute.csv
library(readr)
library(lubridate)
library(tidyverse)
library(plyr)
library(dplyr)
# IMPORT CSV FILE AS DATA SET
# Name data set & choose import file
# Skip = 1 for skipping first row of CSV
btcusdt_binance_fut_1min <-
read.csv(
file.choose(),
skip = 1,
header = T,
sep = ","
)
# CLEAN UP & REORGANISE DATA
# Remove unix & symbol column
btcusdt_binance_fut_1min$unix = NULL
btcusdt_binance_fut_1min$symbol = NULL
# Rename date column to datetime
colnames(btcusdt_binance_fut_1min)[colnames(btcusdt_binance_fut_1min) == "date"] <-
"datetime"
# Convert datetime column to POSIXct format
btcusdt_binance_fut_1min$datetime <-
as_datetime(btcusdt_binance_fut_1min$datetime, tz = "UTC")
# Create variable column for each time element
btcusdt_binance_fut_1min$year <-
year(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$month <-
month(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$week <-
isoweek(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$weekday <-
wday(btcusdt_binance_fut_1min$datetime,
label = TRUE,
abbr = FALSE)
btcusdt_binance_fut_1min$hour <-
hour(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$minute <-
minute(btcusdt_binance_fut_1min$datetime)
# Reorder columns
btcusdt_binance_fut_1min <-
btcusdt_binance_fut_1min[, c(1, 9, 10, 11, 12, 13, 14, 4, 3, 2, 5, 6, 7, 8)]
Using data.table we can do the following:
library(data.table)
btcusdt_binance_fut_1min <- data.table(datetime = seq.POSIXt(as.POSIXct("2022-01-01 0:00"), as.POSIXct("2022-01-01 2:59"), by = "1 min"))
btcusdt_binance_fut_1min[, group := format(as.POSIXct(cut(datetime, breaks = "30 min")), "%H:%M")]
The cut function "floors" each datetime to the start of its half hour. The format and as.POSIXct calls are only there to remove the date part, which makes it easy to compare the same half hour across dates. If you prefer to keep a full datetime, you can remove them.
After this the next steps are pretty straightforward:
btcusdt_binance_fut_1min[, .(High = max(High), Low = min(Low)), by=.(group)]
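The TRUE/FALSE column described in the question can be built from the same bucketing idea. What follows is a minimal base R sketch on synthetic prices, not the real data set: the data frame `df`, the helper columns `bucket` and `week`, and the random values are all hypothetical, and the ISO week label relies on the `%G`/`%V` output formats, which most platforms support.

```r
# Synthetic one-minute data covering two ISO weeks (hypothetical values)
set.seed(42)
n <- 14 * 24 * 60
df <- data.frame(
  datetime = seq(as.POSIXct("2022-01-03 00:00", tz = "UTC"),
                 by = "1 min", length.out = n)
)
df$Low  <- 40000 + cumsum(rnorm(n))
df$High <- df$Low + runif(n, 0, 10)

# 30-minute bucket: whole number of 1800-second blocks since the epoch
df$bucket <- as.numeric(df$datetime) %/% 1800

# ISO year and week label, e.g. "2022-01"
df$week <- format(df$datetime, "%G-%V")

# Rows whose Low equals the lowest Low of their week
is_week_low <- df$Low == ave(df$Low, df$week, FUN = min)

# Flag every minute in a bucket that contains a weekly low
low_buckets <- unique(df$bucket[is_week_low])
df$pa.low.of.week.30min <- df$bucket %in% low_buckets

# Time of day of each flagged bucket
print(unique(format(df$datetime[df$pa.low.of.week.30min], "%Y-%m-%d %H:%M"))[c(1, 31)])
```

The same pattern with `max` on `High` would give the high-of-week flag, and swapping `week` for a day label would give the daily version.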

Is it possible to use difftime function in R with years ONLY (i.e. no DD/MM)?

For example:
Year A    Year B
1990      2021
1980      2021
Thanks in advance.
It depends on what you expect.
If you only want to use the difftime() function with date objects built from years alone (as below), it will work: as.Date() fills in the missing month and day with today's, so both dates land on the same day of the year.
> a = as.Date("2021", format = "%Y")
> b = as.Date("2010", format = "%Y")
> difftime(a,b)
Time difference of 4018 days
But if you want the difference in years, that is not possible, because the function documentation clearly states that the return value unit must be one of: units = c("auto", "secs", "mins", "hours", "days", "weeks")
You might find a better way to handle date/time data with the lubridate package.
difftime expects date objects. I tried reproducing this using only years but was unable to.
Why not simply use the absolute value (abs()) if you're only interested in the year difference?
Here is an example that adds the difference as a new column:
Year_A <- c(1990, 1980)
Year_B <- c(2021, 2021)
df <- data.frame(Year_A, Year_B)
df$diff <- abs(Year_A - Year_B)
P.S. I noticed the answer above was added while I was writing mine, and I can't comment on it due to low rep. As it says, you can't use "years" as a unit there, the largest being weeks, but you can convert from days or weeks to years if that's what you're after.
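To illustrate that last conversion, here is a small sketch that turns a difftime in days into approximate years. Anchoring each year at January 1 and dividing by 365.25 days per year are both assumptions made for the example, so the result is only approximately a whole number.

```r
# Anchor each year at January 1 (an assumption made for this sketch)
year_a <- as.Date("1990-01-01")
year_b <- as.Date("2021-01-01")

# difftime supports "days" as its largest calendar-like unit besides "weeks"
days <- as.numeric(difftime(year_b, year_a, units = "days"))

# Approximate conversion, assuming an average year of 365.25 days
years_approx <- days / 365.25

print(days)
print(round(years_approx))
```

For plain year numbers, the subtraction shown above is simpler and exact. The conversion is only useful when the inputs are real dates.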

Next week day for a given vector of dates

I'm trying to get the next weekday for a vector of dates in R. My approach was to create a vector of weekdays and then look up each weekend date in it. The problem is that for Saturdays and some holidays (of which there are a lot in my country) I end up getting the previous weekday, which doesn't work.
This is an example of my problem:
vecDates = as.Date(c("2011-01-11","2011-01-12","2011-01-13","2011-01-14","2011-01-17","2011-01-18",
"2011-01-19","2011-01-20","2011-01-21","2011-01-24"))
testDates = as.Date(c("2011-01-22","2011-01-23"))
findInterval(testDates,vecDates)
For both dates the correct answer should be 10, which is "2011-01-24", but I get 9.
I thought of a solution where I remove all dates before the one I'm analyzing and then use findInterval. It works, but it is not vectorized and is therefore too slow for my actual purpose.
Does this do what you want?
vecDates = as.Date(c("2011-01-11","2011-01-12",
"2011-01-13","2011-01-14",
"2011-01-17","2011-01-18",
"2011-01-19","2011-01-20",
"2011-01-21","2011-01-24"))
testDates = as.Date(c("2011-01-20","2011-01-22","2011-01-23"))
get_next_biz_day <- function(testdays, bizdays){
o <- findInterval(testdays, bizdays) + 1
bizdays[o]
}
get_next_biz_day(testDates, vecDates)
#[1] "2011-01-21" "2011-01-24" "2011-01-24"
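Note that the function above moves a date that is already a business day to the following one ("2011-01-20" becomes "2011-01-21"). If such dates should map to themselves instead, one possible variant uses the `left.open` argument of findInterval. The function name `get_biz_day_on_or_after` is hypothetical and this is only a sketch.

```r
vecDates <- as.Date(c("2011-01-11", "2011-01-12", "2011-01-13", "2011-01-14",
                      "2011-01-17", "2011-01-18", "2011-01-19", "2011-01-20",
                      "2011-01-21", "2011-01-24"))
testDates <- as.Date(c("2011-01-20", "2011-01-22", "2011-01-23"))

get_biz_day_on_or_after <- function(testdays, bizdays) {
  # left.open = TRUE makes the intervals (b[i], b[i+1]], so a date equal to a
  # business day is counted in the interval that ends on that day
  o <- findInterval(testdays, bizdays, left.open = TRUE) + 1
  # Indexing past the end yields NA when no later business day is known
  bizdays[o]
}

result <- get_biz_day_on_or_after(testDates, vecDates)
print(result)
```

Both versions assume that `bizdays` is sorted in increasing order, which findInterval requires.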

How to find decimal representation of years in R?

Since I need reasonably accurate representations of years in decimal format (~ 4-5 digits of accuracy would work) I turned to the lubridate package. This is what I have tried:
refDate <- as.Date("2016-01-10")
endDate <- as.Date("2020-12-31")
daysInLeapYear <- 366
daysInRegYear <- 365
leapYearFractStart <- 0
leapYearRegStart <- 0
daysInterval <- as.interval(difftime(endDate, refDate, unit = "d"), start = refDate)
periodObject <- as.period(daysInterval)
if(leap_year(refDate)) {
leapYearFractStart <- (as.numeric(days_in_month(refDate))-as.numeric(format(refDate, "%d")))/daysInLeapYear
}
if(!leap_year(refDate)) {
leapYearRegStart <- (as.numeric(days_in_month(refDate))-as.numeric(format(refDate, "%d")))/daysInRegYear
}
returnData <- periodObject@year+(periodObject@month/12)+leapYearFractStart+leapYearRegStart
It is safe to assume that the end date is always at the end of a month, hence no leap year check at the end. Relying on lubridate for proper year/month counting I am adjusting for leap-years only for the start date.
I reckon this gets me to within 3 digits of accuracy only! In addition, it looks a bit crude.
Is there a more complete and accurate procedure to determine decimal representation of years in an interval?
It's very unclear what you're trying to do exactly here, which makes accuracy difficult to talk about.
lubridate has a function decimal_date which turns dates into decimals. But since 3 decimal places gives you 1000 possible positions within a year, when we only have 365/366 days, there are between 2 and 3 viable values that fall within a day. Accuracy depends on when in the day you want the result to fall.
> decimal_date(as.POSIXlt("2016-01-10 00:00:01"))
[1] 2016.025
> decimal_date(as.POSIXlt("2016-01-10 12:00:00"))
[1] 2016.026
> decimal_date(as.POSIXlt("2016-01-10 23:59:59"))
[1] 2016.027
In other words, going beyond 3 decimal places is only really important if you're interested in the time of day.
This solution uses only base R. We get the beginning of the year using cut(..., "year") and the number of days in the year by differencing it with the beginning of the next year obtained using cut(..., "year") on an arbitrary date in the following year. Finally use those quantities to get the fraction and add it to the year.
d <- as.Date(c("2015-01-31", "2016-01-01", "2016-01-10", "2016-12-31")) # sample input
year_begin <- as.Date(cut(d, "year"))
days_in_year <- as.numeric( as.Date(cut(year_begin + 366, "year")) - year_begin )
as.numeric(format(d, "%Y")) + as.numeric(d - year_begin) / days_in_year
## [1] 2015.082 2016.000 2016.025 2016.997
Alternately, using as.POSIXlt this variation crams it into one line:
with(unclass(as.POSIXlt(d)),1900+year+yday/as.numeric(as.Date(cut(d-yday+366,"y"))-d+yday))
## [1] 2015.082 2016.000 2016.025 2016.997
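To get the length of an interval in decimal years, as the question asks, one option is to convert both endpoints and subtract. The sketch below wraps the same idea in a helper. The name `decimal_year` is hypothetical, and it builds the year boundaries with paste0 instead of cut, which is a substitution of mine and not part of the answer above.

```r
# Convert dates to decimal years: year + (days elapsed in year) / (days in year)
decimal_year <- function(d) {
  d <- as.Date(d)
  year <- as.numeric(format(d, "%Y"))
  year_begin <- as.Date(paste0(year, "-01-01"))
  next_begin <- as.Date(paste0(year + 1, "-01-01"))
  year + as.numeric(d - year_begin) / as.numeric(next_begin - year_begin)
}

refDate <- as.Date("2016-01-10")
endDate <- as.Date("2020-12-31")

# Interval length in decimal years
interval_years <- decimal_year(endDate) - decimal_year(refDate)
print(interval_years, digits = 8)
```

Each endpoint is measured against the length of its own year, so leap years are handled at both ends without special cases.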

Calculate average value over multiple years for each hour and day

I am trying to calculate an average over multiple years for hourly data. I want to retain the days and hours and average over the years. I feel like this should be simple but I have looked around for an answer and not found one.
I am using R version 3.0.3.
library(zoo)
start <- ISOdatetime(1970, 1, 1, hour=0, min=0, sec=0, tz="GMT")
end <- ISOdatetime(1971, 12, 31, hour=18, min=0, sec=0, tz="GMT")
set.seed(1)
z <- zooreg(rnorm(2920), start = start , end = end, frequency = 4, deltat = 21600)
#attempt to aggregate ... doesn't work
z.daily.agg <- aggregate(z, as.POSIXct(cut(time(z), "6 hours", include=T)), mean)
What I would like for the output is the following:
01-01 00:00 average of all January 1st zero hours from 1970-1971
01-01 06:00 average of all January 1st 06:00 hours from 1970-1971
Thanks for your assistance with this!
I believe this will work - using the hour function from the lubridate package.
require(lubridate)
aggregate(z, hour(index(z)), mean)
Edit in response to your comments - sorry, I didn't realise exactly what you wanted. You can average across each hour by day by month across the two years (which I think is what you want) like so:
aggregate(z ~ month(index(z)) + day(index(z)) + hour(index(z)), FUN = 'mean')
Hope that helps
A little crude but you could
#1) Use the substr function to extract the parts of the date string you want:
date = substr(time(z), 6,16)
#2) Then bind this to the data:
temp = data.frame(z, date)
#3) Make sure the date is a factor:
temp$date = as.factor(temp$date)
#4) And now aggregate:
aggregate(temp$z~temp$date, FUN=mean)
Does this give you the results you were after?
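The grouping idea behind both answers can also be sketched in base R without zoo, using a "month-day hour" label as the key. The vectors `times` and `values` below are hypothetical stand-ins for the index and data of `z`, so this shows the technique and does not operate on the original object.

```r
# Six-hourly timestamps for 1970 and 1971, matching the question's range
times <- seq(ISOdatetime(1970, 1, 1, 0, 0, 0, tz = "GMT"),
             ISOdatetime(1971, 12, 31, 18, 0, 0, tz = "GMT"),
             by = "6 hours")
set.seed(1)
values <- rnorm(length(times))

# Key that drops the year, e.g. "01-01 06:00"
key <- format(times, "%m-%d %H:%M")

# Mean over all years for each day-and-hour combination
avg <- tapply(values, key, mean)
print(head(avg, 4))
```

With a leap year in the range, the "02-29" keys would average over fewer years than the others, which may or may not be what is wanted.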

Resources