Subsetting a data frame in R to the last two months - r

Hi, I have a data frame, for example:
Order Number Date
4378345 2020-01-02
4324375 2020-02-03
Now I want to subset this data frame to only the orders whose Date falls within the last two months from today's date, so that when I automate the code it automatically takes the last two months up to the current date.
Any help would be appreciated.
EDIT: My apologies, I think I should have been clearer: by last two months I mean that if today's date is 2020-03-16, I would want my data to run from 2020-02-01 to date.

library(lubridate)
subset(your_data, Date > today() - months(2))
This assumes your date column is of class Date already.
In general, months are not super well-defined... you may want to use a more deterministic criterion. For example, what is 2 months before April 28, April 29, April 30, May 1? Keep in mind that February had 29 days this year. You can see lubridate's opinion with (as.Date("2020-04-28") + 0:3) - months(2), which is NA in the case of 2020-04-30. Using 60 days before (Date > today() - days(60)) or some other better-defined criterion will give you more consistency.
To go from the first day of the previous month, use the following code. This is well-defined, as all months have a first day.
subset(your_data, Date >= floor_date(today(), unit = "month") - months(1))
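As a hedged sketch of the month arithmetic discussed above, the snippet below contrasts plain subtraction of months(2) with lubridate's rollback operator %m-%, and wraps the first-of-previous-month cutoff in a small function. The helper name cutoff_first_of_previous_month and the fixed reference dates are illustrative assumptions, not part of the original answer.

```r
library(lubridate)

# Fixed reference date so the example is reproducible (assumption; use today() in practice)
ref <- as.Date("2020-04-30")

# Plain subtraction yields NA because 2020-02-30 does not exist
plain <- ref - months(2)

# %m-% rolls back to the last valid day of the target month instead
rolled <- ref %m-% months(2)

# Hypothetical helper: first day of the month before `x`
cutoff_first_of_previous_month <- function(x = today()) {
  floor_date(x, unit = "month") - months(1)
}

cutoff <- cutoff_first_of_previous_month(as.Date("2020-03-16"))

print(plain)   # NA
print(rolled)  # "2020-02-29"
print(cutoff)  # "2020-02-01"
```

Which variant is appropriate depends on whether "two months" means a rolling window or whole calendar months; the helper matches the calendar-month reading given in the question's edit.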

You can use the package lubridate with something like:
library(lubridate)
subset(yourDF,
Date >= (today() - months(2)))
(edit: Ouch, someone was faster)

library(dplyr)
library(lubridate)
x <- tibble(Date = as_date(c("2020-01-02", "2020-02-03")))
x %>% filter(Date >= today() - months(2))

time <- read.table(textConnection("
OrderNumber Date
4378345 2020-01-02
4324375 2020-02-03"), header = TRUE)
# read.table() already returns a data frame; convert Date from text to class Date
time$Date <- as.Date(time$Date)
library(lubridate)
library(dplyr)
time2 <- time %>%
filter(Date >= today() - months(2))

In base R you can find the first of last month using seq (credits to @G.Grothendieck), and replacing the day with the first using strftime.
last.1 <-
as.Date(paste0(strftime(seq(Sys.Date(), length.out=2, by="-1 month")[2], format="%Y-%m"), "-01"))
The first of current month can be obtained using gsub and regular expressions.
curr.1 <- as.Date(gsub("\\d{2}$", "01", Sys.Date()))
Then subsetting as usual is straightforward. The whole period:
dat[dat$date >= last.1, ]
# order.num date
# 18 432174 2020-02-01
# 19 432175 2020-02-03
# 20 432176 2020-02-05
# 21 432177 2020-02-07
# 22 432178 2020-02-09
# 23 432179 2020-02-11
# 24 432180 2020-02-13
# 25 432181 2020-02-15
# 26 432182 2020-02-17
# 27 432183 2020-02-19
# 28 432184 2020-02-21
# 29 432185 2020-02-23
# 30 432186 2020-02-25
# 31 432187 2020-02-27
# 32 432188 2020-02-29
# 33 432189 2020-03-02
# 34 432190 2020-03-04
# 35 432191 2020-03-06
# 36 432192 2020-03-08
# 37 432193 2020-03-10
# 38 432194 2020-03-12
# 39 432195 2020-03-14
# 40 432196 2020-03-16
And just the last month:
dat[dat$date >= last.1 & dat$date < curr.1, ]
# order.num date
# 18 432174 2020-02-01
# 19 432175 2020-02-03
# 20 432176 2020-02-05
# 21 432177 2020-02-07
# 22 432178 2020-02-09
# 23 432179 2020-02-11
# 24 432180 2020-02-13
# 25 432181 2020-02-15
# 26 432182 2020-02-17
# 27 432183 2020-02-19
# 28 432184 2020-02-21
# 29 432185 2020-02-23
# 30 432186 2020-02-25
# 31 432187 2020-02-27
# 32 432188 2020-02-29
Toy data
dat <- structure(list(order.num = c(432157, 432158, 432159, 432160,
432161, 432162, 432163, 432164, 432165, 432166, 432167, 432168,
432169, 432170, 432171, 432172, 432173, 432174, 432175, 432176,
432177, 432178, 432179, 432180, 432181, 432182, 432183, 432184,
432185, 432186, 432187, 432188, 432189, 432190, 432191, 432192,
432193, 432194, 432195, 432196), date = structure(c(18259, 18261,
18263, 18265, 18267, 18269, 18271, 18273, 18275, 18277, 18279,
18281, 18283, 18285, 18287, 18289, 18291, 18293, 18295, 18297,
18299, 18301, 18303, 18305, 18307, 18309, 18311, 18313, 18315,
18317, 18319, 18321, 18323, 18325, 18327, 18329, 18331, 18333,
18335, 18337), class = "Date")), class = "data.frame", row.names = c(NA,
-40L))
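One caveat with stepping back a month from an arbitrary day: on the 29th to 31st, seq() with by = "-1 month" can overflow into the following month when the target month is shorter. A minimal sketch that avoids this by first snapping to the first of the current month is shown below. The helper names first_of_month and first_of_prev_month and the small stand-in data set are assumptions for illustration only.

```r
# Hypothetical helpers (names are illustrative, not from the original answer)
first_of_month <- function(x = Sys.Date()) {
  as.Date(format(x, "%Y-%m-01"))
}

first_of_prev_month <- function(x = Sys.Date()) {
  # Stepping back from the 1st never lands on a non-existent day
  seq(first_of_month(x), length.out = 2, by = "-1 month")[2]
}

ref <- as.Date("2020-03-31")  # fixed date for reproducibility
last.1 <- first_of_prev_month(ref)
curr.1 <- first_of_month(ref)

# Small stand-in data set
dat <- data.frame(
  order.num = 1:4,
  date = as.Date(c("2020-01-31", "2020-02-01", "2020-02-29", "2020-03-02"))
)

whole_period <- dat[dat$date >= last.1, ]                      # 2020-02-01 onwards
last_month   <- dat[dat$date >= last.1 & dat$date < curr.1, ]  # February only
```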

Related

adding a column to specify duration of event based on dates

I have a data frame with two columns that represent the start of an event and its planned end.
What is the best way to add a column showing the duration of the event in days?
Another alternative would be to create a new dataset directly with the group_by function, in which I could see the average campaign duration for each day, but that seems too complicated.
structure(list(launched_at = c("03/26/2021", "03/24/2021", "01/05/2021",
"02/17/2021", "02/15/2021", "02/25/2021"), deadline = c("04/25/2021",
"04/08/2021", "01/17/2021", "03/03/2021", "03/01/2021", "04/26/2021"
)), row.names = c(NA, 6L), class = "data.frame")
We could use the mdy function from the lubridate package:
library(lubridate)
library(dplyr)
df %>%
  mutate(across(everything(), mdy), # this line only if your dates are not in date format
         duration_days = as.integer(deadline - launched_at))
launched_at deadline duration_days
1 2021-03-26 2021-04-25 30
2 2021-03-24 2021-04-08 15
3 2021-01-05 2021-01-17 12
4 2021-02-17 2021-03-03 14
5 2021-02-15 2021-03-01 14
6 2021-02-25 2021-04-26 60
One option:
as.POSIXct(df$deadline,tz="UTC",format="%m/%d/%Y")-
as.POSIXct(df$launched_at,tz="UTC",format="%m/%d/%Y")
Time differences in days
[1] 30 15 12 14 14 60
If you're looking for duration between 'launched_at' and 'deadline',
library(dplyr)
df %>%
mutate(launched_at = as.Date(launched_at, "%m/%d/%Y"),
deadline = as.Date(deadline, "%m/%d/%Y"),
duration = deadline - launched_at)
launched_at deadline duration
1 2021-03-26 2021-04-25 30 days
2 2021-03-24 2021-04-08 15 days
3 2021-01-05 2021-01-17 12 days
4 2021-02-17 2021-03-03 14 days
5 2021-02-15 2021-03-01 14 days
6 2021-02-25 2021-04-26 60 days
A more concise way (credit to @Darren Tsai):
df %>%
  mutate(across(c(launched_at, deadline), ~ as.Date(.x, "%m/%d/%Y")),
         duration = deadline - launched_at)
You can use the built-in functions within and as.Date:
df = within(df, {
  launched_at = as.Date(launched_at, "%m/%d/%Y")
  deadline = as.Date(deadline, "%m/%d/%Y")
  duration = deadline-launched_at})
launched_at deadline duration
1 2021-03-26 2021-04-25 30 days
2 2021-03-24 2021-04-08 15 days
3 2021-01-05 2021-01-17 12 days
4 2021-02-17 2021-03-03 14 days
5 2021-02-15 2021-03-01 14 days
6 2021-02-25 2021-04-26 60 days
Another option using difftime:
df <- structure(list(launched_at = c("03/26/2021", "03/24/2021", "01/05/2021",
"02/17/2021", "02/15/2021", "02/25/2021"), deadline = c("04/25/2021",
"04/08/2021", "01/17/2021", "03/03/2021", "03/01/2021", "04/26/2021"
)), row.names = c(NA, 6L), class = "data.frame")
df$duration <- with(df, difftime(as.Date(deadline, "%m/%d/%Y"), as.Date(launched_at, "%m/%d/%Y"), units = c("days")))
df
#> launched_at deadline duration
#> 1 03/26/2021 04/25/2021 30 days
#> 2 03/24/2021 04/08/2021 15 days
#> 3 01/05/2021 01/17/2021 12 days
#> 4 02/17/2021 03/03/2021 14 days
#> 5 02/15/2021 03/01/2021 14 days
#> 6 02/25/2021 04/26/2021 60 days
Created on 2022-07-22 by the reprex package (v2.0.1)
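A detail worth checking when parsing dates such as 02/17/2021 is the year specifier: %Y reads a four-digit year, while %y reads only two digits. The sketch below, using two dates from the question's data, shows how the choice changes the computed duration. The variable names are illustrative assumptions.

```r
x_start <- "02/17/2021"
x_end   <- "03/03/2021"

# %Y reads the full four-digit year
d_Y <- as.Date(x_end, "%m/%d/%Y") - as.Date(x_start, "%m/%d/%Y")

# %y reads only two digits ("20"), so both dates land in 2020, a leap year
d_y <- as.Date(x_end, "%m/%d/%y") - as.Date(x_start, "%m/%d/%y")

as.integer(d_Y)  # 14
as.integer(d_y)  # 15
```

Because 2020 has a February 29, any span crossing the end of February comes out one day longer under %y.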

In R: identify/subset observations where event dates are later than fiscal year dates, with additional condition

I have a very large dataset, which is why I would like to find a simpler way to handle this:
I would like to identify or subset those observations where the event date is later than the fiscal year date. An additional condition would be that, out of the observations identified in the previous sentence, I would only want those event dates that lie between 31st May and the fiscal year. If it is not possible to apply such a condition, maybe one could apply that the event date lies between 31st May and 1st Jan?
For example, we have the following
fiscal year date event date
2010-04-30 2010-05-03
2016-03-31 2016-04-28
2020-01-31 2020-02-10
2019-08-30 2019-06-03
2009-07-31 2009-10-10
2003-03-31 2003-02-18
2012-06-30 2012-03-10
From the data above, only the first three observations would be kept when applying the conditional code. Any help is super appreciated, thank you! :)
Using tidyverse:
library(tidyverse)
d %>%
mutate(MayDate = as.Date(paste0(lubridate::year(fiscal_year_date),"-05-31"))) %>%
filter(event_date > fiscal_year_date & event_date <= MayDate)
# fiscal_year_date event_date MayDate
# 1 2010-04-30 2010-05-03 2010-05-31
# 2 2016-03-31 2016-04-28 2016-05-31
# 3 2020-01-31 2020-02-10 2020-05-31
data
d <- structure(list(fiscal_year_date = structure(c(14729, 16891, 18292,
18138, 14456, 12142, 15521), class = "Date"), event_date = structure(c(14732,
16919, 18302, 18050, 14527, 12101, 15409), class = "Date")),
class = "data.frame", row.names = c(NA, -7L))
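For comparison, the same filter can be sketched in base R without any packages. The data frame below is rebuilt from the table in the question; the intermediate names may31 and keep are assumptions for illustration.

```r
d <- data.frame(
  fiscal_year_date = as.Date(c("2010-04-30", "2016-03-31", "2020-01-31", "2019-08-30",
                               "2009-07-31", "2003-03-31", "2012-06-30")),
  event_date = as.Date(c("2010-05-03", "2016-04-28", "2020-02-10", "2019-06-03",
                         "2009-10-10", "2003-02-18", "2012-03-10"))
)

# 31 May of the same calendar year as each fiscal year date
may31 <- as.Date(paste0(format(d$fiscal_year_date, "%Y"), "-05-31"))

# Event must be after the fiscal year date and no later than 31 May
keep <- d$event_date > d$fiscal_year_date & d$event_date <= may31
result <- d[keep, ]
print(result)  # rows 1 to 3
```

Logical indexing like this is vectorised, so it should remain practical on a very large data set.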

Creating Labels for Dates

I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-8-24 8
Due to the erratic nature of the sampling dates (due to the samplers' ability to reach the site), I am unable to classify years in the usual way, January 1 to December 31, and am instead using the beginning of the sampling period as the start of a year. In this case a year would start June 1 and end May 31, so that I can accurately compare the years to one another. Thus I want four years with the following labels:
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I create two columns, year1 and year2, using two different approaches. The year2 approach requires every period to start on June 1 and end on May 31 (in your question Year_Four ends on 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
mutate(year1= case_when(between(Date, as.Date("2015-06-01") , as.Date("2016-05-31")) ~ "Year_One",
between(Date, as.Date("2016-06-01") , as.Date("2017-05-31")) ~ "Year_Two",
between(Date, as.Date("2017-06-01") , as.Date("2018-05-31")) ~ "Year_Three",
between(Date, as.Date("2018-06-01") , as.Date("2019-08-24")) ~ "Year_Four",
TRUE ~ "0")) %>%
mutate(year2 = paste0(year(Date-months(5)),"/", year(Date-months(5))+1))
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
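Subtracting months(5) can return NA for some end-of-month dates (for example 2017-07-31, since 2017-02-31 does not exist). A minimal base R sketch that derives the June-to-May period directly from the year and month is shown below; the variable names start_year, period, and year_no are assumptions for illustration.

```r
dates <- as.Date(c("2015-06-01", "2016-05-31", "2016-07-01", "2017-07-31", "2019-08-24"))
lt <- as.POSIXlt(dates)

# lt$mon is 0-based, so June is 5; dates before June belong to the
# period that began in the previous calendar year
start_year <- lt$year + 1900 - (lt$mon < 5)

period  <- paste0(start_year, "/", start_year + 1)
year_no <- paste0("Year_", start_year - min(start_year) + 1)

print(data.frame(dates, period, year_no))
```

Numbering the periods relative to the earliest one keeps the labels stable when new sampling years are added.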
Use strftime to get the years, then make a factor with levels from the unique values. Note that this groups by calendar year, not by a June-to-May year. I'd recommend numbers instead of words, because they can be coded automatically. Otherwise, use labels=c("one", "two", ...).
d <- within(d, {
year <- strftime(Date, "%Y")
year <- paste("Year", factor(year, labels=seq(unique(year))), sep="_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))

R How to calculate "task time" for business hours only

Is there a way to calculate a "task time" for working hours only? Working hours 8 to 5, Monday through Friday. Example:
Using datediff():
expected result:
sample task times:
df %>%
select(v_v_initiated,v_v_complete)
v_v_initiated v_v_complete
1 2020-04-23 14:13:52.0000000 2020-04-23 16:04:28.0000000
2 2020-11-10 11:48:53.0000000 2020-11-10 13:12:31.0000000
3 2020-10-20 16:03:39.0000000 2020-10-20 16:25:16.0000000
4 2020-04-02 13:43:54.0000000 2020-04-02 14:14:45.0000000
5 2020-07-09 08:52:54.0000000 2020-07-23 09:18:29.0000000
6 2020-06-09 14:56:33.0000000 2020-06-10 07:44:17.0000000
7 2020-09-17 15:11:39.0000000 2020-09-17 15:13:41.0000000
8 2020-10-28 14:08:20.0000000 2020-10-28 14:07:35.0000000
9 2020-04-21 12:55:36.0000000 2020-04-27 12:56:17.0000000
10 2020-11-06 11:02:03.0000000 2020-11-06 11:02:30.0000000
11 2020-02-17 12:29:21.0000000 2020-02-18 12:52:23.0000000
12 2020-08-25 15:25:46.0000000 2020-08-26 10:18:26.0000000
13 2020-02-19 15:05:28.0000000 2020-02-20 09:43:48.0000000
14 2020-09-23 21:19:41.0000000 2020-09-24 14:52:21.0000000
15 2020-07-01 14:20:11.0000000 2020-07-01 14:20:59.0000000
16 2020-05-01 15:22:58.0000000 2020-05-01 16:32:35.0000000
17 2020-06-29 13:10:58.0000000 2020-06-30 13:53:29.0000000
18 2020-06-16 12:56:54.0000000 2020-06-16 14:27:15.0000000
19 2020-03-27 11:02:29.0000000 2020-03-30 15:18:51.0000000
20 2020-04-08 07:38:01.0000000 2020-04-08 07:52:35.0000000
21 2020-07-30 09:32:42.0000000 2020-07-30 10:32:28.0000000
22 2020-06-17 14:03:31.0000000 2020-07-10 15:38:03.0000000
23 2020-04-24 10:41:27.0000000 2020-04-29 13:07:05.0000000
24 2020-08-26 10:41:10.0000000 2020-08-26 12:55:23.0000000
25 2020-10-26 18:11:16.0000000 2020-10-27 16:10:39.0000000
26 2020-01-08 11:12:49.0000000 2020-01-09 09:18:37.0000000
27 2020-04-17 11:40:10.0000000 2020-04-17 15:51:21.0000000
28 2020-02-11 10:38:21.0000000 2020-02-11 10:33:54.0000000
29 2020-03-23 12:10:21.0000000 2020-03-23 12:33:06.0000000
30 2020-06-02 12:44:00.0000000 2020-06-03 08:28:05.0000000
31 2020-04-13 09:30:31.0000000 2020-04-13 13:16:55.0000000
32 2020-04-07 17:36:02.0000000 2020-04-07 17:36:44.0000000
33 2020-01-15 12:24:42.0000000 2020-01-15 12:25:00.0000000
34 2020-08-18 08:55:58.0000000 2020-08-18 09:02:34.0000000
35 2020-07-06 14:10:23.0000000 2020-07-07 10:28:05.0000000
36 2020-03-25 15:03:20.0000000 2020-03-31 14:17:43.0000000
37 2020-01-29 12:58:33.0000000 2020-02-14 09:53:06.0000000
38 2020-02-07 15:11:21.0000000 2020-02-10 09:13:53.0000000
39 2020-07-27 17:51:13.0000000 2020-07-29 11:52:51.0000000
40 2020-09-02 11:43:02.0000000 2020-09-02 13:10:46.0000000
41 2020-07-22 11:04:50.0000000 2020-07-22 11:12:34.0000000
42 2020-06-29 13:57:17.0000000 2020-06-30 07:34:55.0000000
43 2020-07-21 10:46:58.0000000 2020-07-21 16:15:59.0000000
44 2020-05-27 07:38:46.0000000 2020-05-27 07:51:24.0000000
45 2020-07-14 10:33:49.0000000 2020-07-14 11:38:28.0000000
46 2020-06-04 16:59:09.0000000 2020-06-09 10:49:20.0000000
You could adapt another function that calculates business hours for a time interval.
First, create a sequence of dates from start to end, and filter it to include only weekdays.
Next, create time intervals using the business hours of interest (in this case, "08:00" to "17:00").
Determine how much of each day business hours overlap with your times. This way, if a time starts at "09:05", that time will be used for the start of the day, and not "08:00".
Finally, sum up the time intervals, and determine the number of business days (assuming a 9-hour day), and remainder hours and minutes.
If you want to apply this function to rows in a data frame, you could use mapply as in:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
Hope this is helpful.
library(lubridate)
library(dplyr)
calc_bus_hours <- function(start, end) {
  # All calendar days from start to end, keeping weekdays only
  # (weekdays() returns locale-dependent names; this assumes an English locale)
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  my_dates <- my_dates[!weekdays(my_dates) %in% c("Saturday", "Sunday")]
  # One 08:00-17:00 interval per business day
  my_intervals <- interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"),
                           ymd_hm(paste(my_dates, "17:00"), tz = "UTC"))
  # Clamp the first interval's start and the last interval's end to the actual times
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])),
                                     int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])),
                                                      int_start(my_intervals[length(my_intervals)]))
  # Total business minutes, split into 9-hour days, hours, and minutes
  total_time <- sum(time_length(my_intervals, "minutes"))
  total_days <- total_time %/% (9 * 60)
  total_hours <- total_time %% (9 * 60) %/% 60
  total_minutes <- total_time - (total_days * 9 * 60) - (total_hours * 60)
  paste(total_days, "days,", total_hours, "hours,", total_minutes, "minutes")
}
calc_bus_hours(as.POSIXct("11/4/2020 9:05", format = "%m/%d/%Y %H:%M", tz = "UTC"),
               as.POSIXct("11/9/2020 11:25", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] "3 days, 2 hours, 20 minutes"
Edit: As mentioned by @DPH, this is more complex with holidays and partial holidays.
You could create a data frame of holidays and indicate times open, allowing for partial holidays (e.g., Christmas Eve from 8:00 AM to Noon).
Here is a modified function that should give comparable results.
library(lubridate)
library(dplyr)
holiday_df <- data.frame(
  date = as.Date(c("2020-12-24", "2020-12-25", "2020-12-31", "2020-01-01")),
  start = c("08:00", "08:00", "08:00", "08:00"),
  end = c("12:00", "08:00", "08:00", "08:00")
)
calc_bus_hours <- function(start, end) {
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  my_dates_df <- data.frame(
    date = my_dates[!weekdays(my_dates) %in% c("Saturday", "Sunday")],
    start = "08:00",
    end = "17:00"
  )
  all_dates <- union_all(
    inner_join(my_dates_df["date"], holiday_df),
    anti_join(my_dates_df, holiday_df["date"])
  ) %>%
    arrange(date)
  my_intervals <- interval(ymd_hm(paste(all_dates$date, all_dates$start), tz = "UTC"),
                           ymd_hm(paste(all_dates$date, all_dates$end), tz = "UTC"))
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])),
                                     int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])),
                                                      int_start(my_intervals[length(my_intervals)]))
  total_time <- sum(time_length(my_intervals, "minutes"))
  total_days <- total_time %/% (9 * 60)
  total_hours <- total_time %% (9 * 60) %/% 60
  total_minutes <- total_time - (total_days * 9 * 60) - (total_hours * 60)
  paste(total_days, "days,", total_hours, "hours,", total_minutes, "minutes")
}
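If a numeric result is easier to aggregate than a pasted string, the same overlap idea can be sketched in base R. The function below is a hedged illustration, not the original answer's method: the name business_minutes and its open, close, and tz parameters are assumptions, holidays are ignored, and start and end are expected to be single POSIXct values.

```r
# Hypothetical helper: business minutes (Mon-Fri, open:00 to close:00) between two POSIXct times
business_minutes <- function(start, end, open = 8, close = 17, tz = "UTC") {
  if (end <= start) return(0)
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  # wday: 0 = Sunday ... 6 = Saturday; independent of locale
  days <- days[as.POSIXlt(days)$wday %in% 1:5]
  if (length(days) == 0) return(0)
  # Opening and closing instants of each business day, as seconds since the epoch
  day_open  <- as.numeric(as.POSIXct(paste(days, sprintf("%02d:00:00", open)),  tz = tz))
  day_close <- as.numeric(as.POSIXct(paste(days, sprintf("%02d:00:00", close)), tz = tz))
  # Overlap of [start, end] with each day's opening hours; negative means no overlap
  overlap <- pmin(day_close, as.numeric(end)) - pmax(day_open, as.numeric(start))
  sum(pmax(overlap, 0)) / 60
}

res <- business_minutes(as.POSIXct("2020-11-04 09:05:00", tz = "UTC"),
                        as.POSIXct("2020-11-09 11:25:00", tz = "UTC"))
print(res)  # 1760 minutes, i.e. 3 nine-hour days, 2 hours, 20 minutes
```

Returning 0 when the end precedes the start is a design choice made here so that rows with reversed timestamps do not produce negative durations.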

Is there any way to select rows that fall between two dates?

I have data with two dates, start_date and end_date.
How do I select the rows that fall in, for example, the month of May?
What I have tried:
filter with start_date >= "05" & end_date <= "05"
or
subset(data, format.Date(start_date, "%m") >= "05" & format.Date(end_date, "%m") <= "05")
Thanks in advance, I'm new to R.
I solved the question, silly me.
campaign_on_may <- subset(data, format.Date(start_date, "%m") == "05" & format.Date(start_date, "%Y") == "2017")
campaign_on_may <- rbind(data, subset(campaign_descriptions, format.Date(end_date, "%m") == "05" & format.Date(start_date, "%Y") == "2017"))
In base R, you can use as.numeric with format.
subset(data, subset=as.numeric(format(start_date, "%m"))==5 &
as.numeric(format(end_date, "%m"))==5)
start_date end_date
8 2020-05-06 2020-05-26
Or using lubridate to reduce code:
library(lubridate)
subset(data, month(start_date)==5 & month(end_date)==5)
start_date end_date
8 2020-05-06 2020-05-26
Use of as.numeric is not really needed for that example, but suppose now you want to subset rows in which a date is between two months, say May and June. We can easily modify the above code with the following.
subset(data, month(start_date)>=5 & month(end_date)<=6)
start_date end_date
8 2020-05-06 2020-05-26
9 2020-05-25 2020-06-14
And to incorporate the year, we just use as.Date() instead of month.
subset(data, start_date>=as.Date("2020-04-01") & end_date<=as.Date("2020-05-31"))
start_date end_date
5 2020-04-10 2020-04-30
6 2020-04-10 2020-04-30
7 2020-04-20 2020-05-10
8 2020-05-06 2020-05-26
Data:
data <- structure(list(start_date = structure(c(18311.2161748139, 18312.2842345089,
18326.147890578, 18349.8989499761, 18362.2961949771, 18362.3596080979,
18372.229088068, 18388.4478125423, 18407.5741012516, 18430.0561228655
), class = "Date"), end_date = structure(c(18331.2161748139,
18332.2842345089, 18346.147890578, 18369.8989499761, 18382.2961949771,
18382.3596080979, 18392.229088068, 18408.4478125423, 18427.5741012516,
18450.0561228655), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
start_date end_date
1 2020-02-19 2020-03-10
2 2020-02-20 2020-03-11
3 2020-03-05 2020-03-25
4 2020-03-28 2020-04-17
5 2020-04-10 2020-04-30
6 2020-04-10 2020-04-30
7 2020-04-20 2020-05-10
8 2020-05-06 2020-05-26
9 2020-05-25 2020-06-14
10 2020-06-17 2020-07-07
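If the goal is instead to find rows whose start-to-end range overlaps a given month at all (for example campaigns that were active at some point in May 2020), a standard interval-overlap test can be used. The sketch below is an assumption about that alternative reading of the question; it uses a few rows taken from the data above, and the names month_start, month_end, and active_in_may are illustrative.

```r
data <- data.frame(
  start_date = as.Date(c("2020-04-10", "2020-04-20", "2020-05-06", "2020-05-25", "2020-06-17")),
  end_date   = as.Date(c("2020-04-30", "2020-05-10", "2020-05-26", "2020-06-14", "2020-07-07"))
)

month_start <- as.Date("2020-05-01")
month_end   <- as.Date("2020-05-31")

# Two closed intervals overlap when each starts no later than the other ends
active_in_may <- subset(data, start_date <= month_end & end_date >= month_start)
print(active_in_may)  # rows 2, 3 and 4
```

Unlike comparing month numbers, this comparison on full dates also handles ranges that span a year boundary.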
