Getting months as a numerical value in R

I created this for loop to iterate through a list of student records (SU_students) and compute the difference between the enrollment begin and end dates in a new column called "enroll_months".
I'm using the interval() function from the lubridate library. When I use it outside the loop on a single pair of dates, it returns a numerical value, which is what I'm looking for: the months as a numerical value in a column of the data frame.
for (row in 1:nrow(SU_students)) {
  SU_students$enroll_months[row] <- interval(Enrollment_Begin[row], Enrollment_End[row]) %/% months(1)
}

Assuming your SU_students is the same length as Enrollment_Begin and Enrollment_End, you can do this all within a data.frame. I have found lubridate::time_length() easier to use: it feels more intuitive and is easier to parameterize if I start changing things.
These functions are vectorized so there's no need for the for loop to iterate over the elements.
set.seed(42)
df <- data.frame(
  SU_students = letters[1:10],
  Enrollment_Begin = as.Date("2021-10-04") + runif(10, -1, 1) * 100,
  Enrollment_End = as.Date("2021-10-04") + runif(10, -1, 1) * 100
)
df$enroll_months <- lubridate::time_length(lubridate::interval(df$Enrollment_Begin, df$Enrollment_End), "months")
df
#>    SU_students Enrollment_Begin Enrollment_End enroll_months
#> 1            a       2021-12-25     2021-09-25    -3.0133179
#> 2            b       2021-12-30     2021-11-16    -1.4384720
#> 3            c       2021-08-22     2021-12-29     4.2485981
#> 4            d       2021-12-09     2021-08-16    -3.7743148
#> 5            e       2021-11-01     2021-09-26    -1.1630180
#> 6            f       2021-10-07     2021-12-31     2.7478618
#> 7            g       2021-11-20     2022-01-07     1.5912136
#> 8            h       2021-07-22     2021-07-19    -0.1145282
#> 9            i       2021-11-04     2021-09-28    -1.1799681
#> 10           j       2021-11-14     2021-10-16    -0.9337551
Created on 2021-10-04 by the reprex package (v2.0.1)
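Applied to the asker's own objects, the loop collapses to one vectorized line. A minimal sketch, assuming Enrollment_Begin and Enrollment_End are Date vectors with one element per row of SU_students (the vectors and values below are hypothetical stand-ins):

```r
library(lubridate)

# Hypothetical stand-ins for the asker's vectors
Enrollment_Begin <- as.Date(c("2021-01-15", "2021-03-01"))
Enrollment_End   <- as.Date(c("2021-06-15", "2021-05-01"))
SU_students <- data.frame(id = c("s1", "s2"))

# No loop needed: interval() and %/% are vectorized over their inputs
SU_students$enroll_months <- interval(Enrollment_Begin, Enrollment_End) %/% months(1)
SU_students$enroll_months
#> [1] 5 2
```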


Count the number of timestamps in a given vector that fall within an interval in R

I want to count the number of events that occur within intervals.
I start with a table that has three columns: start dates, end dates, and the interval created by them.
table <-
  tibble(
    start = c("2022-08-02", "2022-10-06", "2023-01-11"),
    end = c("2022-08-04", "2023-02-06", "2023-02-04"),
    interval = start %--% end
  )
I also have a vector of timestamp events:
events <- c(ymd("2022-08-07"), ymd("2022-10-17"), ymd("2023-01-17"), ymd("2023-02-02"))
For each interval in my table, I want to know how many events fell within that interval so that my final table looks something like this (but with the correct counts):
  start      end        interval                        n_events_within_interval
  <chr>      <chr>      <Interval>                                         <int>
1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC                         2
2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC                         2
3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC                         2
I have tried this so far, but I'm not sure how to get mutate to cycle through the events vector for each row:
library(tidyverse)
library(lubridate)
library(purrr)
table <-
  tibble(
    start = c("2022-08-02", "2022-10-06", "2023-01-11"),
    end = c("2022-08-04", "2023-02-06", "2023-02-04"),
    interval = start %--% end
  )
table
#> # A tibble: 3 × 3
#>   start      end        interval
#>   <chr>      <chr>      <Interval>
#> 1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC
#> 2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC
#> 3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC
events <- c(ymd("2022-08-07"), ymd("2022-10-17"), ymd("2023-01-17"), ymd("2023-02-02"))
events
#> [1] "2022-08-07" "2022-10-17" "2023-01-17" "2023-02-02"
table %>%
mutate(
n_events_within_interval = sum(events %within% interval)
)
#> Warning in as.numeric(a) - as.numeric(int#start): longer object length is not a
#> multiple of shorter object length
#> Warning in as.numeric(a) - as.numeric(int#start) <= int#.Data: longer object
#> length is not a multiple of shorter object length
#> Warning in as.numeric(a) - as.numeric(int#start): longer object length is not a
#> multiple of shorter object length
#> # A tibble: 3 × 4
#>   start      end        interval                       n_events_within_interval
#>   <chr>      <chr>      <Interval>                                        <int>
#> 1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC                        2
#> 2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC                        2
#> 3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC                        2
Created on 2023-02-15 with reprex v2.0.2
We could use rowwise():
library(dplyr)
library(lubridate)
table %>%
  rowwise() %>%
  mutate(n_events_within_interval = sum(events %within% interval)) %>%
  ungroup()
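A vectorized alternative that avoids rowwise() is to loop over the interval column explicitly with sapply(). A sketch using the same table and events as the question (the counts come from running the question's data through the fix):

```r
library(tibble)
library(lubridate)

table <- tibble(
  start = c("2022-08-02", "2022-10-06", "2023-01-11"),
  end = c("2022-08-04", "2023-02-06", "2023-02-04"),
  interval = start %--% end
)
events <- ymd(c("2022-08-07", "2022-10-17", "2023-01-17", "2023-02-02"))

# For each row's interval, count the events that fall within it
table$n_events_within_interval <-
  sapply(seq_len(nrow(table)), function(i) sum(events %within% table$interval[i]))
table$n_events_within_interval
#> [1] 0 3 2
```

Indexing the Interval column row by row sidesteps the recycling that produced the warnings in the question.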

Extract data values at a higher frequency than time stamps

I have continuous behavior data with a timestamp for when the subject changed behaviors and what each behavior was. I need to extract the instantaneous behavior at each minute, starting at the second the first behavior began: if the first behavior started at 17:34:06, I'd define the next minute as 17:35:06. I also have the duration of each behavior calculated. This is what my data looks like:
df <- data.frame(
  Behavior = c("GRAZ", "MLTC", "GRAZ", "MLTC", "VIGL"),
  Behavior_Start = c("2022-05-10 17:34:06", "2022-05-10 17:38:04", "2022-05-10 17:38:26", "2022-05-10 17:41:49", "2022-05-10 17:42:27"),
  Behavior_Duration_Minutes = c(0.000000, 3.961683, 4.325933, 7.722067, 8.350017)
)
print(df)
I've used cut() to bin each row into the minute it falls into, but I can't figure out how to get the behavior values for the minutes in which no new behavior occurs (i.e. minutes 2:4 here). This approach also bins on the clock minute, so it doesn't account for the second at which the first behavior began.
time <- data.frame(time = as.POSIXct(df$Behavior_Start, tz = "America/Denver"))
df <- cbind(df, time)
df.cut <- data.frame(df, cuts = cut(df$time, breaks = "1 min", labels = FALSE))
print(df.cut)
So the dataframe I'd like to end up with would look like this:
new.df <- data.frame(
  Minute = 1:10,
  Timestamp = c("2022-05-10 17:34:06", "2022-05-10 17:35:06", "2022-05-10 17:36:06", "2022-05-10 17:37:06", "2022-05-10 17:38:06", "2022-05-10 17:39:06", "2022-05-10 17:40:06", "2022-05-10 17:41:06", "2022-05-10 17:42:06", "2022-05-10 17:43:06"),
  Behavior = c("GRAZ", "GRAZ", "GRAZ", "MLTC", "GRAZ", "GRAZ", "GRAZ", "MLTC", "VIGL", "VIGL")
)
print(new.df)
Your data:
library(dplyr)
library(tidyr)
library(purrr)
your_df <- data.frame(
  Behavior = c("Grazing", "Vigilant", "Grazing", "Other", "Grazing"),
  Behavior_Start = c("2022-05-10 17:34:06", "2022-05-10 17:38:04", "2022-05-10 17:38:26", "2022-05-10 17:41:49", "2022-05-10 17:42:27"),
  Behavior_Duration_Minutes = c(0.000000, 3.961683, 4.325933, 7.722067, 8.350017)
)
Using lead() on the duration column gives you the start and end of each "period" of activity; you then need to fill in a minute for each minute of that duration.
# Make a list column that generates a sequence of minutes "included" in
# the `Behavior_Duration_Minutes` column. You'll need to play with this
# logic in terms of whether or not you want `floor()` or `round()` etc.
# Also update the endpoint, here hardcoded at 10 minutes.
high_res_df <-
  your_df %>%
  mutate(
    minutes_covered = purrr::map2(
      ceiling(Behavior_Duration_Minutes),
      lead(Behavior_Duration_Minutes, default = 10),
      ~ seq(.x, .y)
    )
  )
high_res_df
#>   Behavior      Behavior_Start Behavior_Duration_Minutes minutes_covered
#> 1  Grazing 2022-05-10 17:34:06                  0.000000      0, 1, 2, 3
#> 2 Vigilant 2022-05-10 17:38:04                  3.961683               4
#> 3  Grazing 2022-05-10 17:38:26                  4.325933         5, 6, 7
#> 4    Other 2022-05-10 17:41:49                  7.722067               8
#> 5  Grazing 2022-05-10 17:42:27                  8.350017           9, 10
Now that you've generated the list of minutes included, you can use unnest() to get closer to your desired output.
# And here expand out that list-column into a regular sequence
high_res_long <-
  tidyr::unnest(
    high_res_df,
    "minutes_covered"
  )
high_res_long
#> # A tibble: 11 × 4
#>    Behavior Behavior_Start      Behavior_Duration_Minutes minutes_covered
#>    <chr>    <chr>                                   <dbl>           <int>
#>  1 Grazing  2022-05-10 17:34:06                      0                  0
#>  2 Grazing  2022-05-10 17:34:06                      0                  1
#>  3 Grazing  2022-05-10 17:34:06                      0                  2
#>  4 Grazing  2022-05-10 17:34:06                      0                  3
#>  5 Vigilant 2022-05-10 17:38:04                      3.96               4
#>  6 Grazing  2022-05-10 17:38:26                      4.33               5
#>  7 Grazing  2022-05-10 17:38:26                      4.33               6
#>  8 Grazing  2022-05-10 17:38:26                      4.33               7
#>  9 Other    2022-05-10 17:41:49                      7.72               8
#> 10 Grazing  2022-05-10 17:42:27                      8.35               9
#> 11 Grazing  2022-05-10 17:42:27                      8.35              10
Created on 2023-01-13 with reprex v2.0.2
You'll need to play around with this a bit to match exactly what you want.
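One of the things to fill in is converting the minute index back into the timestamps the desired output shows. A minimal sketch, assuming each minute mark is anchored at the second the first behavior began (17:34:06 in the question's data):

```r
# Anchor at the first Behavior_Start, then add 60 seconds per minute index
start0 <- as.POSIXct("2022-05-10 17:34:06", tz = "America/Denver")
minutes_covered <- 0:10  # the unnested minute indices from the long data
timestamps <- start0 + 60 * minutes_covered
format(timestamps[1:3], "%H:%M:%S")
#> [1] "17:34:06" "17:35:06" "17:36:06"
```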

Is there a quick way of extracting data over several years corresponding to today's date

I have a dataframe with a date column that spans from 2014-01-01 to today, 2021-04-29, plus other columns of associated data.
What I would like to do is filter for data matching the current day and month across all years, so that data for 04-29 is returned for every year from 2014 to 2021.
What would be the most efficient / tidiest way of doing this?
Do something like this:
set.seed(2)
df <- data.frame(d = as.Date('2015-01-01') + sample(1:3000, 1000),
                 o = runif(1000))
head(df)
#>            d         o
#> 1 2017-09-02 0.1481156
#> 2 2016-12-11 0.3957168
#> 3 2022-09-23 0.2654405
#> 4 2016-02-21 0.3482240
#> 5 2016-01-28 0.6943241
#> 6 2015-10-01 0.5069469
library(lubridate)
df[month(df$d) == month(Sys.Date()) & day(df$d) == day(Sys.Date()),]
#>              d         o
#> 24  2016-04-29 0.2431883
#> 131 2017-04-29 0.9359659
#> 383 2022-04-29 0.2703415
Created on 2021-04-29 by the reprex package (v2.0.0)
The dplyr method is similar:
df %>% filter(month(d) == month(Sys.Date()) & day(d) == day(Sys.Date()))
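A base R variant that needs no lubridate at all is to compare month-day strings built with format(). A sketch using a fixed target date in place of Sys.Date() so the result is reproducible:

```r
set.seed(2)
df <- data.frame(d = as.Date('2015-01-01') + sample(1:3000, 1000),
                 o = runif(1000))

# Match on the "mm-dd" portion only; swap in Sys.Date() for today's date
target_md <- format(as.Date("2021-04-29"), "%m-%d")
result <- subset(df, format(d, "%m-%d") == target_md)
```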

Split a rows into two when a date range spans a change in calendar year

I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
  to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#>         from         to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits a row into two when to and from span a calendar-year boundary.
want <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
  to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#>         from         to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R, tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2(), then unnest() to create an extra row for each year.
Then, you can take the intersection of each date interval with the corresponding full calendar-year interval. This trims each row to the partial year starting Jan 1 or ending Dec 31.
library(tidyverse)
library(lubridate)
have %>%
  mutate(date_int = interval(from, to),
         year = map2(year(from), year(to), seq)) %>%
  unnest(year) %>%
  mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
         year_sect = intersect(date_int, year_int),
         from_new = as.Date(int_start(year_sect)),
         to_new = as.Date(int_end(year_sect))) %>%
  select(from_new, to_new)
Output
# A tibble: 5 x 2
  from_new   to_new
  <date>     <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
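Once the rows are split, the per-year day counts the question asks about reduce to a plain subtraction in base R. A sketch using the want data frame from the question (the + 1 is an assumption that both endpoints count as enrolled days):

```r
want <- data.frame(
  from = as.Date(c('2018-12-15', '2019-01-01', '2019-12-20', '2020-01-01', '2019-05-13')),
  to = as.Date(c('2018-12-31', '2019-06-20', '2019-12-31', '2020-01-25', '2019-09-10'))
)

# Inclusive day count per (now single-year) row
want$days_in_year <- as.numeric(want$to - want$from) + 1
want$days_in_year
#> [1]  17 171  12  25 121
```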

Extracting a subset of data based on whether a transaction contains at least part of a time range in R

I have a data frame df that contains different transactions. Each transaction has a start date and an end date. The two variables for this are start_time and end_time. They are of the class POSIXct.
An example of how they look is as follows: "2018-05-23 23:40:00", "2018-06-24 00:10:00".
There are about 13,000 transactions in df, and I want to extract all transactions that contain at least part of the specified time interval, if not all of it. The time interval or range is 20:00:00-08:00:00, so basically 8 P.M. <= interval < 8 A.M.
I am trying to use dplyr and the function filter() to do this; however, I am not sure how to write the boolean expression. What I have written in code so far is this:
df %>% filter(hour(start_time) >= 20 | hour(start_time) < 8 | hour(end_time) >= 20 | hour(end_time) < 8)
I thought maybe this would get all transactions that contain at least a part of that interval, but then I thought about transactions that start and end outside of that interval yet last so long that they span those hours anyway. I thought of adding | duration > 12, because any transaction longer than 12 hours must contain part of that time interval. However, this code feels unnecessarily long, and there must be a simpler way that I'm not seeing.
I'll start with a sample data frame, since a sample df isn't given in the question:
library(lubridate)
library(dplyr)
set.seed(69)
dates <- as.POSIXct("2020-04-01") + days(sample(30, 10, TRUE))
start_time <- dates + seconds(sample(86400, 10, TRUE))
end_time <- start_time + seconds(sample(50000, 10, TRUE))
df <- data.frame(Transaction = LETTERS[1:10], start_time, end_time)
df
#>    Transaction          start_time            end_time
#> 1            A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2            B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3            C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4            D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5            E 2020-04-09 11:36:19 2020-04-09 19:01:14
#> 6            F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 7            G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 8            H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 9            I 2020-04-08 23:11:51 2020-04-09 09:04:26
#> 10           J 2020-04-07 12:28:08 2020-04-07 12:55:42
We can enumerate the possibilities for a match as follows:
Any start time before 08:00 or after 20:00
Any stop time before 08:00 or after 20:00
The stop and start times are on different dates.
Using a little modular math, we can write this as:
df %>% filter((hour(start_time) + 12) %% 20 > 11 |
                (hour(end_time) + 12) %% 20 > 11 |
                date(start_time) != date(end_time))
#>   Transaction          start_time            end_time
#> 1           A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2           B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3           C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4           D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5           F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 6           G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 7           H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 8           I 2020-04-08 23:11:51 2020-04-09 09:04:26
You can check that each retained transaction lies at least partly within the given range, and that the two removed rows do not.
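The same three conditions can also be written without the modular trick, which some may find easier to read. A sketch on the same sample data that keeps the logic equivalent by De Morgan's law: a transaction overlaps the night window unless it starts and ends on the same date, entirely inside 08:00-20:00.

```r
library(dplyr)
library(lubridate)

set.seed(69)
dates <- as.POSIXct("2020-04-01") + days(sample(30, 10, TRUE))
start_time <- dates + seconds(sample(86400, 10, TRUE))
end_time <- start_time + seconds(sample(50000, 10, TRUE))
df <- data.frame(Transaction = LETTERS[1:10], start_time, end_time)

# Keep everything EXCEPT transactions that start and end on the same date,
# wholly within the 08:00-20:00 daytime window
kept <- df %>%
  filter(!(date(start_time) == date(end_time) &
             hour(start_time) >= 8 & hour(start_time) < 20 &
             hour(end_time) >= 8 & hour(end_time) < 20))
```

This filter returns the same rows as the modular version above for any input, hour by hour.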
