Date comparison with System Date in R - r

I want to compare the values in one column, the end date (end_date), with the system date (todays_date). Both columns are of type character.
Input:
$ name: chr "Abby" "Abby" "Abby" "Abby" ...
$ std: int 2 3 4 5 6 7 8 9 10 11 ...
$ end_date: chr "25-02-2016" "25-02-2016" "25-03-2018" "25-02-2019" ...
$ todays_date: chr "07-03-2018" "07-03-2018" "07-03-2018" "07-03-2018" ...
Is there any way I can pass a sqldf statement where I can get all the values of the input csv where end_date < todays_date? Any way other than a sqldf statement where I can extract the values of the csv where end_date< todays_date will do.
I tried a few variations of the query below, but I can't seem to get the required output:
sel_a <- sqldf(paste("SELECT * FROM input_a WHERE end_date<",
todays_date, "", sep = ""))
sel_a
PS: I have a huge amount of data and have reduced it to fit this question.
Any help would be appreciated.

To get a more specific answer, make a reproducible example
Convert the date column from character to Date objects. Your dates are day-month-year, so use dmy(), e.g.,
library(lubridate)
your_df$end_date <- dmy(your_df$end_date)
Then you don't even need a column for today's date; just use Sys.Date() in the filter condition
library(dplyr)
filter(your_df, end_date < Sys.Date())
# will return a data frame with those rows that have a date before today.
Or if you prefer:
your_df[your_df$end_date < Sys.Date(),]
# produces the same rows
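A minimal sketch of why the conversion matters, using base R only and a few values taken from the question (the variable names are illustrative, not from the original post). Comparing the raw strings works character by character, so day-first dates do not order chronologically; after parsing with an explicit day-month-year format the comparison behaves as expected.

```r
# Values taken from the question; variable names are illustrative.
end_date_chr <- c("25-02-2016", "25-03-2018", "25-02-2019")
today_chr <- "07-03-2018"

# Character comparison goes left to right, so "25-..." is never below "07-...".
chr_result <- end_date_chr < today_chr

# Parse with an explicit day-month-year format, then compare as dates.
end_date <- as.Date(end_date_chr, format = "%d-%m-%Y")
today <- as.Date(today_chr, format = "%d-%m-%Y")
date_result <- end_date < today

print(chr_result)
print(date_result)
```

Only the first value (25-02-2016) is before 07-03-2018 once the strings are parsed; the string comparison reports none.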

Using the raw input shown in the Note at the end, first convert the dates to "Date" class and then use any of the alternatives shown. The first two use todays_date from the input and the last two use Sys.Date(). We show both sqldf and base solutions.
library(sqldf)
fmt <- "%d-%m-%Y"
Input <- transform(Input_raw, end_date = as.Date(end_date, fmt),
todays_date = as.Date(todays_date, fmt))
# 1
sqldf("select * from Input where end_date <= todays_date")
# 2
subset(Input, end_date <= todays_date)
# 3
fn$sqldf("select * from Input where end_date <= `Sys.Date()`")
# 4
subset(Input, end_date <= Sys.Date())
Note
The Input in reproducible form:
Input_raw <- data.frame(name = "Abby", std = 2:5,
end_date = c("25-02-2016", "25-02-2016", "25-03-2018", "25-02-2019"),
todays_date = "07-03-2018", stringsAsFactors = FALSE)

Related

add_months function in Spark R

I have a variable of the form "2020-09-01". I need to increase and decrease this by 3 months and 5 months and store the results in other variables. I need the syntax in Spark R. Any other method will also work. Thanks again.
In R following code works fine
y <- as.Date(load_date,"%Y-%m-%d") %m+% months(i)
The code below didn't work. The error says:
unable to find an inherited method for function ‘add_months’ for signature ‘"Date", "numeric"’
loaddate = 202009
year <- substr(loaddate,1,4)
month <- substr(loaddate,5,6)
load_date <- paste(year,month,"01",sep = "-")
y <- as.Date(load_date,"%Y%m%d")
y1 <- add_months(y,-3)
Expected Result - 2020-06-01
The lubridate package makes dealing with dates much easier. Here I have shuffled as.Date up a step, then simply subtract 3 months.
library(lubridate)
loaddate = 202009
year <- substr(loaddate,1,4)
month <- substr(loaddate,5,6)
load_date <- as.Date(paste(year,month,"01",sep = "-"))
new_date <- load_date - months(3)
new_date
Output:
Date[1:1], format: "2020-06-01"
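If lubridate is not available, a rough base R sketch of the same month arithmetic is possible with seq(), which accepts a by argument such as "3 months" or "-3 months". The helper name shift_months is hypothetical, and this is plain R rather than SparkR; it also assumes a day of month that exists in every month (such as the 1st), since seq() can roll over into the following month for days like the 31st.

```r
load_date <- as.Date("2020-09-01")

# Hypothetical helper: shift a single date by n calendar months (n may be negative).
shift_months <- function(date, n) {
  if (n == 0) return(date)
  # seq() builds c(date, shifted date); keep the second element.
  seq(date, by = paste(n, "months"), length.out = 2)[2]
}

minus_3 <- shift_months(load_date, -3)
minus_5 <- shift_months(load_date, -5)
plus_3 <- shift_months(load_date, 3)
plus_5 <- shift_months(load_date, 5)

print(c(minus_5, minus_3, plus_3, plus_5))
```

For the question's input, minus_3 is 2020-06-01, matching the expected result.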

Separating time and date into different columns (format=2019-05-26T13:50:56.335288Z)

In my data, Time and date is stored in a column 'cord' (class=factor). I want to separate the date and the time into two separate columns.
The data looks like this:
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z
I have successfully extracted the date using: cord$Date <- as.POSIXct(cord$Time)
I have however not been able to find a way to extract the time in format "H:M:S".
The output of dput(head(cord$Time)) returns a long list of timestamps: "2020-04-02T13:34:07.746777Z", "2020-04-02T13:41:11.095014Z",
"2020-04-02T14:08:05.508818Z", "2020-04-02T14:17:10.337101Z", and so on...
Extract H:M:S
library(lubridate)
format(as_datetime(cord$Time), "%H:%M:%S")
#> [1] "13:50:56" "17:55:45" "18:12:00" "18:26:49"
If you need the fractional seconds too (%OS6 keeps six decimal places, i.e. microseconds):
format(as_datetime(cord$Time), "%H:%M:%OS6")
#> [1] "13:50:56.335288" "17:55:45.348073" "18:12:00.882572" "18:26:49.577310"
where cord is:
cord <- read.table(text = " Time
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z ", header = TRUE)
I typically use lubridate and data.table for my date manipulation work. This works for me after copying in some of your raw dates as strings:
library(lubridate)
library(data.table)
x <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z")
# lubridate to parse to date time
y <- parse_date_time(x, "ymd HMS")
# data.table to split in to dates and time
split_y <- tstrsplit(y, " ")
dt <- as.data.table(split_y)
# the split columns are named V1 and V2, so rename both at once
setnames(dt, c("Date", "Time"))
dt[]
# if you use data.frames instead
df <- as.data.frame(dt)
df
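A base R sketch of the same split, assuming the timestamps are UTC (the trailing Z) and that whole seconds are enough for the time column. The names ts and out are illustrative. The format string spells out the literal T separator, and %OS reads the seconds including their fractional part.

```r
x <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z")

# Parse as UTC; the trailing "Z" is left over after the format and is ignored.
ts <- as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# One column per component: a Date and an "H:M:S" string.
out <- data.frame(
  Date = as.Date(ts, tz = "UTC"),
  Time = format(ts, "%H:%M:%S"),
  stringsAsFactors = FALSE
)
print(out)
```

Passing tz explicitly to as.Date() avoids the date shifting by a day when the session time zone differs from UTC.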

R how to avoid a loop. Counting weekends between two dates in a row for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5], Date2[1,7], "days")) %in% c("Saturday", "Sunday"))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements in turn. They're pretty fast too (e.g. see the previous post)!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with to assign the columns in the dataframe directly.
Essentially, use a reference date and calculate the number of full weeks between it and each date (by floor). Then take the difference between the two counts. The code counts the days after the start date up to and including the end date, so a start date that itself falls on a Saturday or Sunday is not counted.
# as.Date("1970-01-01") is a Thursday, so day 2 is a Saturday and day 3 is a Sunday
library(dplyr)
startDate <- as.Date("1970-01-01") # reference date (day 0), a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
# Saturdays (remainder >= 2) and Sundays (remainder >= 3) from day 0 up to each date
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=3,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
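One way to gain confidence in the arithmetic is to compare it with a brute-force count over a range of dates. The sketch below repeats f and g from above and adds a hypothetical checker, brute(), that enumerates every day; it uses as.POSIXlt()$wday (0 is Sunday, 6 is Saturday) so the check does not depend on locale-specific weekday names. The test range is arbitrary.

```r
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
g <- function(d1, d2) f(d2) - f(d1-1)

# Hypothetical reference implementation: enumerate the days and count weekends.
brute <- function(d1, d2) {
  days <- seq(d1, d2, by = "day")
  sum(as.POSIXlt(days)$wday %in% c(0, 6))
}

# Arbitrary test range: 60 start dates, each paired with an end date 0 to 19 days later.
d1 <- as.Date("2015-07-01") + 0:59
d2 <- d1 + rep(0:19, times = 3)

fast <- g(d1, d2)
slow <- vapply(seq_along(d1), function(i) brute(d1[i], d2[i]), integer(1))
print(all(fast == slow))
```

The vectorized g() touches each row once with simple arithmetic, which is why it suits a frame with millions of rows better than building a day sequence per row.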

How to interpret H2O's time data type?

I have a data frame in R that I am passing to H2O using the as.h2o().
dataset.h2o <- as.h2o(dataset,destination_frame = "dataset.h2o")
Doing an str() on the data frame, we can see that the week_of_date class is of datatype Date
$ primary_account_id : int 31 31 31 31 31 31 31 31 31 31 ...
$ week_of_date : Date, format: "2015-08-31" "2015-09-07" "2015-09-14" "2015-09-21" ...
However, when viewed in H2O Flow, it seems to be converted to a datatype called time - which is of the format
week_of_date time 0 0 0 0 1440943200000.0 1447592400000.0 1444480409625.8884 2013362534.5706
When I bring back the data to R using as.data.frame
returned.dataset <- as.data.frame(dataset.h2o)
it is stored in a format that I am unable to understand and therefore parse back
$ primary_account_id: int 31 31 698 1060 1060 1060 1060 1060 1060 1133 ...
$ week_of_date :Class 'POSIXct' num [1:194] 1442757600000 1446382800000 1446382800000 1442152800000 1442757600000 ...
Could you please point me in the direction of how I can achieve better interoperability with dates between R and H2O?
Thanks!
It is a bug in H2O: H2O returns date-times in milliseconds since the epoch, while R's POSIXct expects seconds. See JIRA issue 3434.
What you can do in the meantime is recode the date column:
as.Date(structure(returned.dataset$week_of_date/1000, class = c("POSIXct", "POSIXt")))
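The unit conversion can be checked without H2O. This sketch takes two of the raw values shown in the question, divides by 1000, and interprets the result as seconds since 1970-01-01 in UTC; the variable names are illustrative.

```r
# Two of the millisecond values from the question's str() output.
ms <- c(1442757600000, 1446382800000)

# POSIXct counts seconds since the epoch, so divide by 1000 first.
ts_utc <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
utc_text <- format(ts_utc, "%Y-%m-%d %H:%M:%S")
date_utc <- as.Date(ts_utc, tz = "UTC")

print(utc_text)
print(date_utc)
```

In UTC these fall in the early afternoon rather than at midnight, which suggests (an inference from the numbers, not something the post states) that the original dates were midnight in a time zone ahead of UTC. If so, as.Date() in UTC lands one day early, and passing the appropriate tz argument to as.Date() would be needed to recover the original dates.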
Refer to the response by phiver for a more detailed answer, but another simple workaround would be to convert the date columns to character before passing to H2O (if you do not need the column in a date format in H2O). Here is a simple example.
# construct a sample df with a date format column
df <- data.frame(week_of_date = as.Date(c('2015-09-29','2015-10-05')))
str(df$week_of_date)
Date[1:2], format: "2015-09-29" "2015-10-05"
# convert the column to character before passing it to H2O
df$week_of_date <- as.character(df$week_of_date)
str(df$week_of_date)
chr [1:2] "2015-09-29" "2015-10-05"
# convert to H2OFRAME and pass back to R data.frame and re-convert to date
df.hex <- as.h2o(df)
df2 <- as.data.frame(df.hex)
df2$week_of_date <- as.Date(df2$week_of_date)
str(df2$week_of_date)
Date[1:2], format: "2015-09-29" "2015-10-05"
Both answers above are great. However, a workaround that I find more efficient is to pass the dataset to H2O without the date column. When you then train a model and make predictions, the predictions have the same number of rows as the original dataset, so you can simply attach the date column to the predictions vector or matrix.
Of course, the predictions in this solution relate to the same period as the data, as in backtesting.
Converting to H2O and back is easy if the date-time columns are in the proper format (although sub-second precision can be lost). As mentioned in the H2O FAQ:
H2O is set to auto-detect two major date/time formats. The first
format is for dates formatted as yyyy-MM-dd. ... The second date
format is for dates formatted as dd-MMM-yy.
Times are specified as HH:mm:ss. HH is a two-digit hour and must be a
value between 0-23 (for 24-hour time) or 1-12 (for a twelve-hour
clock). mm is a two-digit minute value and must be a value between 0-59.
ss is a two-digit second value and must be a value between 0-59.
Example
Example Data
dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
x <- paste(dates, times)
df <- data.frame(datetime = strptime(x, "%m/%d/%y %H:%M:%S"))
# > df
# datetime
# 1 1992-02-27 23:03:20
# 2 1992-02-27 22:29:56
# 3 1992-01-14 01:03:30
# 4 1992-02-28 18:21:03
# 5 1992-02-01 16:56:26
Change the format to one that H2O auto-detects
# Change format
df$datetime <- format(df$datetime, format = "%Y-%m-%d %H:%M:%S")
# Convert to an H2OFrame
h2o_df <- as.h2o(df)
# Convert back
back_df <- as.data.frame(h2o_df)
back_df
# datetime
# 1 1992-02-27 23:03:20
# 2 1992-02-27 22:29:56
# 3 1992-01-14 01:03:30
# 4 1992-02-28 18:21:03
# 5 1992-02-01 16:56:26

R subset data frame where date is less than a variable date

I am trying to select a subset of a data frame where the date needs to be less than a (calculated/variable) date.
The following code throws an error:
loanFrame_excluding_young <- loanFrame[loanFrame$LoanEffective < AddMonths(as.Date("2015-11-11"),-loanFrame$TermMonths),]
Error in seq.Date(X[[i]], ...) : 'by' must be of length 1
Any ideas?
The problem lies with the DescTools::AddMonths function. In AddMonths(x, n, ceiling = TRUE), n can only be a single number, not a vector.
Using the following code does work using the %m-% function of lubridate.
library(lubridate)
loanFrame <- data.frame(TermMonths = c(1,3,5,7),
LoanEffective = as.Date(c("2015-09-15", "2015-08-05", "2015-10-01", "2015-06-25")))
loanFrame_excluding_young <- loanFrame[loanFrame$LoanEffective < as.Date("2015-11-11") %m-% months(loanFrame$TermMonths),]
loanFrame_excluding_young
TermMonths LoanEffective
1 1 2015-09-15
2 3 2015-08-05
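For comparison, a base R sketch that builds one cutoff date per row without lubridate. It steps back from the reference date with seq() inside vapply(); the names ref, cutoff_num, cutoff and kept are illustrative, and the approach assumes the reference day of month exists in every target month (true for the 11th).

```r
loanFrame <- data.frame(
  TermMonths = c(1, 3, 5, 7),
  LoanEffective = as.Date(c("2015-09-15", "2015-08-05", "2015-10-01", "2015-06-25"))
)
ref <- as.Date("2015-11-11")

# One cutoff per row: step back TermMonths months from the reference date.
# vapply() returns plain numbers, so the Date class is restored afterwards.
cutoff_num <- vapply(
  loanFrame$TermMonths,
  function(n) as.numeric(seq(ref, by = paste(-n, "months"), length.out = 2)[2]),
  numeric(1)
)
cutoff <- as.Date(cutoff_num, origin = "1970-01-01")

kept <- loanFrame[loanFrame$LoanEffective < cutoff, ]
print(kept)
```

This keeps the same two rows as the lubridate version; for very large frames the vectorized %m-% is likely the simpler choice.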
