In R, how to calculate if a Date is earlier than date X? - r

I have a data frame with a date column in it. I want to check whether the date in each row is before or after 1 January 2020, and create a new column: if the original date is before 1 January 2020, insert 2020-01-01; otherwise, insert the original date.
Date is in format YYYY-MM-DD
Beginning End
2020-12-31 2021-01-12
2018-01-02 2020-03-10
2019-04-12 2020-12-04
2020-10-15 2021-03-27
I want:
Beginning End Beginning_2
2020-12-31 2021-01-12 2020-12-31
2018-01-02 2020-03-10 2020-01-01
2019-04-12 2020-12-04 2020-01-01
2020-10-15 2021-03-27 2020-10-15
The code I wrote is:
DF$Beginning_2 <- ifelse("2020-01-01" > DF$Beginning,"2020-01-01", DF$Beginning)
I'm getting this
Beginning End Beginning_2
2020-12-31 2021-01-12 18554
2018-01-02 2020-03-10 2020-01-01
2019-04-12 2020-12-04 2020-01-01
2020-10-15 2021-03-27 18453
My code works halfway: it converts the column to character/numeric, but I need it to stay a Date. I tried putting as.Date() all over the code, but nothing much changed; the biggest change was that dates greater than 2020-01-01 became NA instead of "18554".
How to fix my code?
Thank you

You can use pmax:
DF$Beginning_2 <- pmax(DF$Beginning, as.Date("2020-01-01"))
#DF$Beginning_2 <- pmax(DF$Beginning, "2020-01-01") #Works also
DF
# Beginning End Beginning_2
#1 2020-12-31 2021-01-12 2020-12-31
#2 2018-01-02 2020-03-10 2020-01-01
#3 2019-04-12 2020-12-04 2020-01-01
#4 2020-10-15 2021-03-27 2020-10-15
str(DF)
#'data.frame': 4 obs. of 3 variables:
# $ Beginning : Date, format: "2020-12-31" "2018-01-02" ...
# $ End : Date, format: "2021-01-12" "2020-03-10" ...
# $ Beginning_2: Date, format: "2020-12-31" "2020-01-01" ...

Base R's ifelse returns dates as numbers, so you will need to convert them back to dates:
DF$Beginning_2 <- as.Date(ifelse(DF$Beginning > as.Date("2020-01-01"),
DF$Beginning, as.Date("2020-01-01")), origin = '1970-01-01')
You may use dplyr::if_else, which maintains the class of the date columns.
DF$Beginning_2 <- dplyr::if_else(DF$Beginning > as.Date("2020-01-01"),
DF$Beginning, as.Date("2020-01-01"))
DF
# Beginning End Beginning_2
#1 2020-12-31 2021-01-12 2020-12-31
#2 2018-01-02 2020-03-10 2020-01-01
#3 2019-04-12 2020-12-04 2020-01-01
#4 2020-10-15 2021-03-27 2020-10-15
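To see why the original ifelse() call produced numbers, here is a minimal sketch: ifelse() drops the attributes of its result, including the Date class, leaving only the underlying day count since 1970-01-01.

```r
d <- as.Date("2020-10-15")

# ifelse() strips the Date class, returning the raw day count
ifelse(TRUE, d, d)
#> [1] 18550

# Restoring the class afterwards works, as in the base R variant above
as.Date(ifelse(TRUE, d, d), origin = "1970-01-01")
#> [1] "2020-10-15"
```

pmax() and dplyr::if_else() avoid the round trip because they preserve the class of their inputs.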

Related

How to unify Date & Time format when there inconsistency exists across a single column

I am working on a data frame with a column that uses the date-time format "mm/dd/yy HH:MM" for some observations and "yyyy/mm/dd HH:MM:SS" for others. Of course, this inconsistency results in errors or NA returns in my code. How can I unify the whole column so my calculations are not interrupted by this inconsistency?
Update, step by step:
We have a data frame with four columns, each of character type (you can check this with str(df)).
To change the format from character to datetime in all four columns, we use mutate(across(1:4, ...)).
What we want is that in each of columns 1:4 the character type is changed to datetime.
This can be done with the function parse_date_time from the lubridate package.
Here we use ~ to indicate an anonymous function,
and the . indicates each of columns 1-4.
Most important is the argument c("ymd_HMS", "mdy_HM"), which gives the order of the different formats of the date columns!
We could use parse_date_time() from the lubridate package. The important part is the argument c("ymd_HMS", "mdy_HM"), where you define the formats that occur in the data.
Also note to use HMS, because:
hms, hm and ms usage is defunct, please use HMS, HM or MS instead. Deprecated in version '1.5.6'.
library(dplyr)
library(lubridate)
df %>%
mutate(across(1:4, ~parse_date_time(., c("ymd_HMS", "mdy_HM"))))
started_at ended_at started_at_1 ended_at_1
<dttm> <dttm> <dttm> <dttm>
1 2021-10-29 17:42:36 2021-10-29 18:00:23 2021-06-13 11:40:00 2021-06-13 12:02:00
2 2021-10-01 15:06:10 2021-10-01 15:09:23 2021-06-27 16:26:00 2021-06-27 16:39:00
3 2021-10-28 23:02:53 2021-10-28 23:07:11 2021-06-10 20:06:00 2021-06-10 20:28:00
4 2021-10-17 00:58:17 2021-10-17 01:02:08 2021-06-11 15:54:00 2021-06-11 16:11:00
5 2021-10-27 18:29:34 2021-10-27 18:34:48 2021-06-05 14:09:00 2021-06-05 14:42:00
6 2021-10-17 13:30:21 2021-10-17 13:35:26 2021-06-05 14:14:00 2021-06-05 14:37:00
7 2021-10-04 19:59:28 2021-10-04 21:06:24 2021-06-16 19:05:00 2021-06-16 19:16:00
8 2021-10-10 00:27:09 2021-10-10 00:39:58 2021-06-23 20:29:00 2021-06-23 20:43:00
data:
structure(list(started_at = c("2021-10-29 17:42:36", "2021-10-01 15:06:10",
"2021-10-28 23:02:53", "2021-10-17 00:58:17", "2021-10-27 18:29:34",
"2021-10-17 13:30:21", "2021-10-04 19:59:28", "2021-10-10 00:27:09"
), ended_at = c("2021-10-29 18:00:23", "2021-10-01 15:09:23",
"2021-10-28 23:07:11", "2021-10-17 01:02:08", "2021-10-27 18:34:48",
"2021-10-17 13:35:26", "2021-10-04 21:06:24", "2021-10-10 00:39:58"
), started_at_1 = c("6/13/21 11:40", "6/27/21 16:26", "6/10/21 20:06",
"6/11/21 15:54", "6/5/21 14:09", "6/5/21 14:14", "6/16/21 19:05",
"6/23/21 20:29"), ended_at_1 = c("6/13/21 12:02", "6/27/21 16:39",
"6/10/21 20:28", "6/11/21 16:11", "6/5/21 14:42", "6/5/21 14:37",
"6/16/21 19:16", "6/23/21 20:43")), class = "data.frame", row.names = c(NA,
-8L))
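As a minimal sketch (assuming only these two formats occur in the data), parse_date_time() can also handle both formats mixed within a single vector; each element is matched against the orders in turn:

```r
library(lubridate)

# One "ymd HMS" string and one "mdy HM" string in the same vector
x <- c("2021-10-29 17:42:36", "6/13/21 11:40")
parse_date_time(x, orders = c("ymd_HMS", "mdy_HM"))
# "2021-10-29 17:42:36 UTC" "2021-06-13 11:40:00 UTC"
```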

Aggregating daily data to weekly, ending today

I'm currently building some charts of covid-related data. My script goes out, downloads the most recent data, and goes from there. I wind up with data frames that look like
head(NMdata)
Date state positiveIncrease totalTestResultsIncrease
1 2020-05-19 NM 158 4367
2 2020-05-18 NM 81 4669
3 2020-05-17 NM 195 4126
4 2020-05-16 NM 159 4857
5 2020-05-15 NM 139 4590
6 2020-05-14 NM 152 4722
I've been aggregating to weekly data using the tq_transmute function from tidyquant.
NMweeklyPos <- NMdata %>% tq_transmute(select = positiveIncrease, mutate_fun = apply.weekly, FUN=sum)
This works, but it aggregates on week of the year, with weeks starting on Sunday.
head(NMweeklyPos)
Date positiveIncrease
<dttm> <int>
1 2020-03-08 00:00:00 0
2 2020-03-15 00:00:00 13
3 2020-03-22 00:00:00 44
4 2020-03-29 00:00:00 180
5 2020-04-05 00:00:00 306
6 2020-04-12 00:00:00 631
So for instance if I ran it today (which happens to be a Wednesday) my last entry is a partial week with Monday, Tuesday, Wednesday.
tail(NMweeklyPos)
Date positiveIncrease
<dttm> <int>
1 2020-04-19 00:00:00 624
2 2020-04-26 00:00:00 862
3 2020-05-03 00:00:00 1072
4 2020-05-10 00:00:00 1046
5 2020-05-17 00:00:00 1079
6 2020-05-19 00:00:00 239
For purposes of my chart this winds up being a small value, and so I have been discarding the partial weeks at the end, but that means I'm throwing out the most recent data.
I would prefer to throw out a partial week from the start of the dataset and have the aggregation automatically use weeks that end on whatever day the script is being run. So if I ran it today (Wednesday) it would aggregate on weeks ending Wednesday, so that I had the most current data included, and I could drop the partial week from the beginning of the data. But tomorrow it would choose weeks ending Thursday, etc. And I don't want to have to hardcode the week-end day and change it each time.
How can I go about achieving that?
Using lubridate, the code below finds the current day of the week and uses it to set the start of each aggregation week, so that weeks end on the day the script is run.
Hope this helps!
library(lubridate)
library(dplyr)
end = as.Date("2020-04-14")
data = data.frame(
date = seq.Date(as.Date("2020-01-01"), end, by = "day"),
val = 1
)
# get the day of the week
weekday = wday(end)
# using floor_date() we can use today's date to determine which day of the week starts each aggregation week
data%>%
mutate(week = floor_date(date, "week", week_start = weekday))%>%
group_by(week)%>%
summarise(total = sum(val))
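A side note on why this works (an observation, not stated in the answer): wday() numbers days with Sunday = 1 by default, while floor_date()'s week_start counts Monday = 1, so passing wday(end) shifts the week start forward by one day, which makes each aggregation week end exactly on end. A sketch, assuming end is the run date:

```r
library(lubridate)
library(dplyr)

end <- as.Date("2020-04-14")  # a Tuesday; in practice use Sys.Date()
data <- data.frame(
  date = seq.Date(end - 20, end, by = "day"),
  val  = 1
)

weekly <- data %>%
  mutate(week = floor_date(date, "week", week_start = wday(end))) %>%
  group_by(week) %>%
  summarise(total = sum(val))

# The last week starts six days before `end`, i.e. it ends on `end` itself
max(weekly$week) == end - 6
# [1] TRUE
```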

Split a rows into two when a date range spans a change in calendar year

I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#> from to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits into two rows when to and from span a calendar year.
want <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#> from to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I am wanting to do this because for a particular row, I want to know how many days are in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R, tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can identify date intervals of intersections between partial years and a full calendar year. This will keep the partial years starting Jan 1 or ending Dec 31 for a given year.
library(tidyverse)
library(lubridate)
have %>%
mutate(date_int = interval(from, to),
year = map2(year(from), year(to), seq)) %>%
unnest(year) %>%
mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
year_sect = intersect(date_int, year_int),
from_new = as.Date(int_start(year_sect)),
to_new = as.Date(int_end(year_sect))) %>%
select(from_new, to_new)
Output
# A tibble: 5 x 2
from_new to_new
<date> <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
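Since the question also asked for base R, here is a hedged sketch without lubridate (the helper name split_years is mine, not from the answer): for each row, iterate over the calendar years the range covers and clamp the endpoints to each year's boundaries.

```r
# Split one from/to range at calendar-year boundaries (base R only)
split_years <- function(from, to) {
  yrs <- seq(as.integer(format(from, "%Y")), as.integer(format(to, "%Y")))
  do.call(rbind, lapply(yrs, function(y) {
    data.frame(
      from = max(from, as.Date(paste0(y, "-01-01"))),
      to   = min(to, as.Date(paste0(y, "-12-31")))
    )
  }))
}

want <- do.call(rbind, Map(split_years, have$from, have$to))
want$time_diff_by_year <- difftime(want$to, want$from)
```

This also handles ranges spanning more than two calendar years, since it loops over every year in between.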

How to get the ending date from the first observation and use it as the starting date for the second observation for the same ID?

My df has some unique and some duplicated entries, with columns showing the starting and ending date for every observation; date ranges must not overlap for the same id.
df <- data.frame(id = c(22,22,102,102,102),
start_date = as.Date(c("2013-10-29","2014-01-09",
"2016-09-14",
"2016-09-14","2016-09-14")),
end_date = as.Date(c("2017-08-15","2018-10-05",
"2016-10-09",
"2017-12-12","2018-10-17")))
head(df)
id start_date end_date
1 22 2013-10-29 2017-08-15
2 22 2014-01-09 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-09-14 2017-12-12
5 102 2016-09-14 2018-10-17
ids 22 and 102 dates interval overlap, but for 22 with different start_date and for 102 with the same start_date.
The result I need is:
When the dates overlap, to have the final date of the previous observation as the starting date.
When the dates don't overlap, keep the actual values.
Any idea or suggestions?
The result I'd expect is:
head(fixed_df)
id start_date end_date
1 22 2013-10-29 2017-08-15
2 22 2017-08-15 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-10-09 2017-12-12
5 102 2017-12-12 2018-10-17
In R, you can easily compare date objects with the normal ==, > or < operators, so, using a loop and a few tests, here is a working solution:
#Loop over every lines except the last one
for (line in c(1:(length(df$id)-1)))
{
#Do something only if next line have the same ID
if(df$id[line]==df$id[line+1])
{
#Check if end date is after start date of the next line
if(df$end_date[line]>df$start_date[line+1])
{
#If yes, put the start date of next line to end date of current line
df$start_date[line+1]=df$end_date[line]
}
}
}
With dplyr, I would do it as such:
library(dplyr)
df %>% group_by(id) %>%
arrange(start_date) %>%
mutate(
lag(end_date),
overlap = start_date < lag(end_date, default=as.Date('2000-01-01')),
new_start_date = if_else(overlap, lag(end_date), start_date)
)
id start_date end_date `lag(end_date)` overlap new_start_date
<dbl> <date> <date> <date> <lgl> <date>
1 22 2013-10-29 2017-08-15 NA FALSE 2013-10-29
2 22 2014-01-09 2018-10-05 2017-08-15 TRUE 2017-08-15
3 102 2016-09-14 2016-10-09 NA FALSE 2016-09-14
4 102 2016-09-14 2017-12-12 2016-10-09 TRUE 2016-10-09
5 102 2016-09-14 2018-10-17 2017-12-12 TRUE 2017-12-12
This one is quite verbose, but merely to demonstrate what is going on.
Some key points:
Use group_by to keep comparisons within id.
Next, sort things.
lag - compare with previous value. But use a good default value, that is also the same type.
Consider using lag(end_date) + days(1) if you want strict no overlaps.
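The last point can be sketched as follows (a variant of the answer above, not the answer itself, assuming you want consecutive, strictly non-overlapping ranges):

```r
library(dplyr)
library(lubridate)

df %>%
  group_by(id) %>%
  arrange(start_date, .by_group = TRUE) %>%
  mutate(
    new_start_date = if_else(
      start_date <= lag(end_date, default = first(start_date) - days(1)),
      lag(end_date) + days(1),  # day after the previous end: no shared day
      start_date
    )
  ) %>%
  ungroup()
```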

Remove duplicate events within delta time

Given the data frame below
class timestamp
1 A 2019-02-14 15:00:29
2 A 2019-01-27 17:59:53
3 A 2019-01-27 18:00:00
4 B 2019-02-02 18:00:00
5 C 2019-03-08 16:00:37
Observations 2 and 3 point to the same event. How do I remove rows belonging to the same class if another timestamp within 2 minutes already exists?
Desired output:
class timestamp
1 A 2019-02-14 15:00:00
2 A 2019-01-27 18:00:00
3 B 2019-02-02 18:00:00
4 C 2019-03-08 16:00:00
round( ,c("mins")) can be used to get rid of the seconds component, but if the timestamps are too far apart, some samples will be rounded to different minutes, still leaving distinct timestamps.
EDIT
I think I over-complicated the problem in my first attempt. What should work for your case is rounding times to 2-minute intervals, which we can do using round_date from lubridate.
library(lubridate)
library(dplyr)
df %>%
mutate(timestamp = round_date(as.POSIXct(timestamp), unit = "2 minutes")) %>%
group_by(class) %>%
filter(!duplicated(timestamp))
# class timestamp
# <chr> <dttm>
#1 A 2019-02-14 15:00:00
#2 A 2019-01-27 18:00:00
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:00
Original Attempt
We can first convert the timestamp to a POSIXct object, then arrange rows by class and timestamp, use cut to divide them into "2 min" intervals, and then remove duplicates.
library(dplyr)
df %>%
mutate(timestamp = as.POSIXct(timestamp)) %>%
arrange(class, timestamp) %>%
group_by(class) %>%
filter(!duplicated(as.numeric(cut(timestamp, breaks = "2 mins")), fromLast = TRUE))
# class timestamp
# <chr> <dttm>
#1 A 2019-01-27 18:00:00
#2 A 2019-02-14 15:00:29
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:37
Here, I haven't changed or rounded the timestamp column and kept it as it is, but it would be simple to round it if you use cut in mutate. Also, if you want to keep the first entry (like 2019-01-27 17:59:53), remove the fromLast = TRUE argument.
