Group, take duration and set condition within R (dplyr, r)

I have a dataset, df: (the dataset contains over 4000 rows)
DATEB
9/9/2019 7:51:58 PM
9/9/2019 7:51:59 PM
9/9/2019 7:51:59 PM
9/9/2019 7:52:00 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:02 PM
9/9/2019 7:52:03 PM
9/9/2019 7:54:00 PM
9/9/2019 7:54:02 PM
9/10/2019 8:00:00 PM
I want to place these in separate groups and take the duration of each group, starting a new group whenever the time between consecutive date-times exceeds 120 seconds.
Desired output:
Group Duration
a 5 sec
b 2 sec
c 0 sec
dput:
structure(list(DATEB = structure(c(2L, 3L, 3L, 4L, 5L, 5L, 6L,
7L, 8L, 9L, 1L), .Label = c(" 9/10/2019 8:00:00 PM", " 9/9/2019 7:51:58 PM",
" 9/9/2019 7:51:59 PM", " 9/9/2019 7:52:00 PM", " 9/9/2019 7:52:01 PM",
" 9/9/2019 7:52:02 PM", " 9/9/2019 7:52:03 PM", " 9/9/2019 7:54:00 PM",
" 9/9/2019 7:54:02 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
I have tried the code below, which works well, except that I want 7:51:59 and 7:52:00 to be in the same group. A new group should start only when the time between consecutive datetimes exceeds 120 seconds.
library(dplyr)
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB),
         temp = lubridate::floor_date(DATEB, "120 secs")) %>%
  group_by(temp) %>%
  summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))
Any suggestion is appreciated.

We can use cut here:
library(dplyr)
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB),
         temp = cut(DATEB, breaks = "2 mins")) %>%
  group_by(temp) %>%
  summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))
# A tibble: 3 x 2
# temp duration
# <fct> <drtn>
#1 2019-09-09 19:51:00 5 secs
#2 2019-09-09 19:53:00 2 secs
#3 2019-09-10 19:59:00 0 secs

The OP has asked for:
The only time the duration should break and create a new group, is
when the time in between datetimes exceed 120 secs.
The words "the time in between datetimes" suggest the OP is looking for a gap or pause. (Well, this is what I would look for if I've been given a vector of ordered date-times and been tasked to group the data.)
Unfortunately, the expected result and accepted answer do not correspond to this interpretation.
However, here is what I would do:
library(dplyr)
# Threshold in seconds. With the sample data, 10 reproduces the three expected
# groups; with 120, the 117-second gap before 19:54:00 would not start a new group.
gap_threshold <- 10
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB),
         gap = c(0, diff(DATEB))) %>%
  group_by(grp = cumsum(gap > gap_threshold)) %>%
  summarise(begin = min(DATEB), end = max(DATEB),
            duration = difftime(end, begin, units = "secs"))
# A tibble: 3 x 4
#     grp begin               end                 duration
#   <int> <dttm>              <dttm>              <drtn>
# 1     0 2019-09-09 19:51:58 2019-09-09 19:52:03 5 secs
# 2     1 2019-09-09 19:54:00 2019-09-09 19:54:02 2 secs
# 3     2 2019-09-10 20:00:00 2019-09-10 20:00:00 0 secs
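The gap-based idea can be sketched with the gap computed in explicit seconds, so the comparison does not depend on the units that diff() picks automatically. This is only a sketch: duration_by_gap is a hypothetical helper name, the data are retyped from the question as character strings, and the rows are assumed to be sorted by time.

```r
library(dplyr)
library(lubridate)

# Sample data from the question, as plain character strings
df <- data.frame(DATEB = c(
  "9/9/2019 7:51:58 PM", "9/9/2019 7:51:59 PM", "9/9/2019 7:51:59 PM",
  "9/9/2019 7:52:00 PM", "9/9/2019 7:52:01 PM", "9/9/2019 7:52:01 PM",
  "9/9/2019 7:52:02 PM", "9/9/2019 7:52:03 PM", "9/9/2019 7:54:00 PM",
  "9/9/2019 7:54:02 PM", "9/10/2019 8:00:00 PM"
), stringsAsFactors = FALSE)

# Hypothetical helper: start a new group whenever the gap to the
# previous row exceeds `threshold_secs` seconds (rows assumed sorted)
duration_by_gap <- function(data, threshold_secs) {
  data %>%
    mutate(
      DATEB = mdy_hms(as.character(DATEB)),
      # explicit units, so the gap is always measured in seconds
      gap = as.numeric(difftime(DATEB, lag(DATEB), units = "secs")),
      gap = coalesce(gap, 0)  # the first row has no predecessor
    ) %>%
    group_by(grp = cumsum(gap > threshold_secs)) %>%
    summarise(duration = as.numeric(difftime(max(DATEB), min(DATEB),
                                             units = "secs")))
}

res_10  <- duration_by_gap(df, 10)   # three groups: 5, 2 and 0 seconds
res_120 <- duration_by_gap(df, 120)  # two groups: 124 and 0 seconds
```

With a threshold of 120 the 117-second pause before 19:54:00 no longer splits the data, which is why a strict gap reading of the question does not reproduce the expected three groups.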

Related

Duplicate rows in R based on content of columns

I'm working with a school schedule dataset for a visualization project and had days of classes originally in the form "MW" or "TTh" etc - they are now in the format below:
name start end first second
finance 9:00 10:00 M W
stats 10:30 11:30 T Th
econ 16:30 19:00 T NA
I'm looking to duplicate the first three columns to get a dataframe that looks like:
day name start end
M finance 9:00 10:00
W finance 9:00 10:00
T stats 10:30 11:30
Th stats 10:30 11:30
T econ 16:30 19:00
Any ideas?
We can use pivot_longer:
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = c(first, second), values_to = 'day',
             names_to = 'name1') %>%
  select(day, name, start, end) %>%
  filter(complete.cases(day))
-output
# A tibble: 5 x 4
# day name start end
# <chr> <chr> <chr> <chr>
#1 M finance 9:00 10:00
#2 W finance 9:00 10:00
#3 T stats 10:30 11:30
#4 Th stats 10:30 11:30
#5 T econ 16:30 19:00
data
df1 <- structure(list(name = c("finance", "stats", "econ"), start = c("9:00",
"10:30", "16:30"), end = c("10:00", "11:30", "19:00"), first = c("M",
"T", "T"), second = c("W", "Th", NA)), class = "data.frame", row.names = c(NA,
-3L))
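As a possible variation on the same idea, pivot_longer can drop the missing days itself through its values_drop_na argument, which makes the separate filter step unnecessary. A minimal sketch, where the column name slot is a hypothetical placeholder for the discarded first/second indicator:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  name   = c("finance", "stats", "econ"),
  start  = c("9:00", "10:30", "16:30"),
  end    = c("10:00", "11:30", "19:00"),
  first  = c("M", "T", "T"),
  second = c("W", "Th", NA),
  stringsAsFactors = FALSE
)

long <- df1 %>%
  # "slot" (hypothetical name) records whether the day came from first/second;
  # values_drop_na = TRUE removes the row created by the NA in `second`
  pivot_longer(cols = c(first, second), names_to = "slot",
               values_to = "day", values_drop_na = TRUE) %>%
  select(day, name, start, end)
```

The result has five rows, one per class meeting, in the same order as the output shown above.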

Groupby a column and find its sum and count

Background:
I have a dataset, df,
Date Duration
1/2/2020 5:00:00 PM 20
1/2/2020 5:30:01 PM 30
1/2/2020 6:00:00 PM 10
1/5/2020 7:00:01 AM 5
1/6/2020 8:00:00 AM 2
1/6/2020 9:00:00 AM 8
Desired Output:
Date Total_Duration Count
1/2/2020 60 3
1/5/2020 5 1
1/6/2020 10 2
Dput:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
What I have tried:
library(dplyr)
df %>% group_by(Date) %>% add_tally() %>%
summarize(Duration)
Any guidance will be helpful.
We can extract the date part of 'Date' after converting it to a date-time with dmy_hms (assuming the format is DD/MM/YYYY HH:MM:SS; if the dates are month-first, use mdy_hms instead), use that as the grouping variable, and get the sum of 'Duration' together with the count from n().
library(dplyr)
library(lubridate)
df %>%
  group_by(Date = as.Date(dmy_hms(Date))) %>%
  summarise(Total_Duration = sum(Duration), Count = n())
# A tibble: 3 x 3
# Date Total_Duration Count
# <date> <int> <int>
#1 2020-02-01 60 3
#2 2020-05-01 5 1
#3 2020-06-01 10 2
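Because the input format is ambiguous, it may help to see how the two lubridate parsers read the same string. The sketch below assumes the dates are actually month-first (the question does not say), in which case mdy_hms gives groups that match the desired output of 1/2, 1/5 and 1/6.

```r
library(dplyr)
library(lubridate)

x <- "1/2/2020 5:00:00 PM"
day_first   <- as.Date(dmy_hms(x))  # read as 1 February 2020
month_first <- as.Date(mdy_hms(x))  # read as 2 January 2020

df <- data.frame(
  Date = c("1/2/2020 5:00:00 PM", "1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM",
           "1/5/2020 7:00:01 AM", "1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"),
  Duration = c(20L, 30L, 10L, 5L, 2L, 8L),
  stringsAsFactors = FALSE
)

# Same grouping as above, but assuming month-first dates
res <- df %>%
  group_by(Date = as.Date(mdy_hms(Date))) %>%
  summarise(Total_Duration = sum(Duration), Count = n())
```

The totals and counts are identical under both readings here; only the calendar dates attached to the groups change.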

Check previous row in datetime, if time is greater than a certain value, place in a group and take its duration in seconds (R, dplyr, lubridate)

I have a dataset, df: (the dataset contains over 4000 rows)
DATEB
9/9/2019 7:51:58 PM
9/9/2019 7:51:59 PM
9/9/2019 7:51:59 PM
9/9/2019 7:52:00 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:02 PM
9/9/2019 7:52:03 PM
9/9/2019 7:54:00 PM
9/9/2019 7:54:02 PM
9/10/2019 8:00:00 PM
I wish to start a new group whenever a time is not within 10 seconds of the previous row, and then take the duration of each newly formed group.
Desired output:
Group Duration
a 5 sec
b 2 sec
c 0 sec
dput:
structure(list(DATEB = structure(c(2L, 3L, 3L, 4L, 5L, 5L, 6L,
7L, 8L, 9L, 1L), .Label = c(" 9/10/2019 8:00:00 PM", " 9/9/2019 7:51:58 PM",
" 9/9/2019 7:51:59 PM", " 9/9/2019 7:52:00 PM", " 9/9/2019 7:52:01 PM",
" 9/9/2019 7:52:02 PM", " 9/9/2019 7:52:03 PM", " 9/9/2019 7:54:00 PM",
" 9/9/2019 7:54:02 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
I have tried the code below, which works well, except that I want the durations in seconds only. The code below gives a mix of minutes and seconds.
library(dplyr)
library(lubridate)
df2 <- mutate(df, DATEB = lubridate::mdy_hms(DATEB))
df2$time_since_last_row <- df2$DATEB - lag(df2$DATEB)
df2$time_since_last_row[[1]] <- 0 # replace the first NA
df2$group_10s <- 0
for (i in 2:nrow(df2)) {
  if (df2$time_since_last_row[[i]] > seconds(10)) {
    df2$group_10s[[i]] <- df2$group_10s[[i - 1]] + 1
  } else {
    df2$group_10s[[i]] <- df2$group_10s[[i - 1]]
  }
}
df3 <- group_by(df2, group_10s) %>%
  summarise(volume_in_group = n(),
            min_DATEB = min(DATEB),
            max_DATEB = max(DATEB),
            group_duration = max_DATEB - min_DATEB)
#nirgrahamuk-R community
Any suggestion is appreciated.
This is what I would do:
gap_threshold <- 10
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB),
         gap = c(0, diff(DATEB))) %>%
  group_by(grp = cumsum(gap > gap_threshold)) %>%
  summarise(begin = min(DATEB), end = max(DATEB),
            duration = difftime(end, begin, units = "secs"))
# A tibble: 3 x 4
#     grp begin               end                 duration
#   <int> <dttm>              <dttm>              <drtn>
# 1     0 2019-09-09 19:51:58 2019-09-09 19:52:03 5 secs
# 2     1 2019-09-09 19:54:00 2019-09-09 19:54:02 2 secs
# 3     2 2019-09-10 20:00:00 2019-09-10 20:00:00 0 secs
Note that the output contains more columns than requested, purely for demonstration.
Whenever the gap between two subsequent rows is greater than the given gap_threshold the group count grp is advanced by one. Finally, min() and max() are taken for each group and the duration is computed from these.
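The cumsum step described above can be traced on a small vector. This is an illustrative sketch in base R; the gap values are a shortened, made-up excerpt in seconds rather than the full dataset.

```r
# Seconds since the previous row (illustrative values; the first row gets 0)
gap <- c(0, 1, 0, 1, 117, 2, 86758)
gap_threshold <- 10

new_group <- gap > gap_threshold  # TRUE wherever a new group starts
grp <- cumsum(new_group)          # running count of group starts so far
grp
# [1] 0 0 0 0 1 1 2
```

Each TRUE adds one to the running total, so all rows between two large gaps share the same group number.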
In fact I did something similar before. You can modify your last block with:
df3 <- group_by(df2, group_10s) %>%
summarise(
volume_in_group=n(),
min_DATEB=min(DATEB),
max_DATEB=max(DATEB),
group_duration = as.numeric(max_DATEB - min_DATEB, units = "secs")
)
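To see why the conversion matters, here is a minimal base R sketch. Subtracting two date-times yields a difftime whose units are chosen automatically, and as.numeric with units = "secs" forces seconds. The two timestamps are taken from the sample data purely for illustration.

```r
begin <- as.POSIXct("2019-09-09 19:51:58", tz = "UTC")
end   <- as.POSIXct("2019-09-09 19:54:02", tz = "UTC")

d <- end - begin                        # difftime with automatically chosen units
auto_units <- units(d)                  # "mins" for a difference of this size
secs <- as.numeric(d, units = "secs")   # always seconds: 124
```

Differences under a minute are reported in seconds and longer ones in minutes, hours or days, which explains the mixed units in the original summary.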

Segregating datetime on an hourly basis in R

I'm using the dataframe below in R:
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using this dataframe, I want to bin the datetimes on an hourly basis, for which I'm using the following code.
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 0:24 - 0.01))
levels(DF$bins) <- c("00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59",
"00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59",
"00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59",
"00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59")
newDF <- DF %>%
dplyr::group_by(bins, .drop = FALSE) %>%
dplyr::summarise(Count = length(Value), Total = sum(Value))
Final<-newDF %>%
dplyr::summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
Final[,c(2,3)]<-sapply(Final[,c(2,3)], function(x) scales::comma(x))
At the levels(DF$bins) <- assignment I get this error:
Error in `levels<-.factor`(`*tmp*`, value = c("00:00 to 00:59", "00:01 to 01:59", : number of levels differs
How can I keep the segregation listed below static and aggregate the numbers accordingly?
"00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59", "00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59","00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59","00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59"
Expected Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
We can use floor_date/ceiling_date from lubridate to create hourly breaks, create a grouping column (bins) based on our requirement using sprintf and then use this column to calculate whatever we want for each group.
library(dplyr)
library(lubridate)
df %>%
  mutate(bins = floor_date(Datetime, "hour"),
         hour = hour(bins),
         bins = paste0(sprintf("%02d:00 :", hour), sprintf(" %02d:59", hour))) %>%
  group_by(bins) %>%
  summarise(sum = sum(Value))
# A tibble: 6 x 2
# bins sum
# <chr> <int>
#1 00:00 : 00:59 20
#2 03:00 : 03:59 25
#3 14:00 : 14:59 20
#4 15:00 : 15:59 25
#5 18:00 : 18:59 10
#6 20:00 : 20:59 15
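The error in the question comes from a mismatch: the breaks c(-1, 0:24 - 0.01) contain 26 cut points and so produce 25 intervals, while only 24 labels are supplied. One way to keep all 24 hourly bins static, including empty ones, is to build a factor with fixed levels. A sketch, assuming the "HH:00 to HH:59" label format is what is wanted (hourly_labels is a hypothetical name):

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  Datetime = as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                          "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                          "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                          "2020-01-01 15:45:23"), tz = "UTC"),
  Value = c(10L, 20L, 25L, 20L, 10L, 15L, 15L)
)

# One label per hour of the day: exactly 24 levels
hourly_labels <- sprintf("%02d:00 to %02d:59", 0:23, 0:23)

hourly <- df %>%
  mutate(bins = factor(hour(Datetime), levels = 0:23, labels = hourly_labels)) %>%
  group_by(bins, .drop = FALSE) %>%   # keep hours with no rows
  summarise(Count = n(), Total = sum(Value))
```

Every hour appears in the result, with a Count and Total of 0 where no rows fall into that hour.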
For the updated condition, we can do
df %>%
  mutate(hour = hour(Datetime),
         gr = case_when(hour >= 0 & hour < 6 ~ "12:00 AM to 06:00 AM",
                        hour >= 6 & hour < 12 ~ "06:00 AM to 12:00 PM",
                        hour >= 12 & hour < 15 ~ "12:00 PM to 03:00 PM",
                        hour >= 15 & hour < 20 ~ "03:00 PM to 08:00 PM",
                        TRUE ~ "08:00 PM to 12:00 AM"),
         month_year = format(Datetime, "%Y-%m"),
         bins = factor(gr, levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                                      "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                                      "08:00 PM to 12:00 AM"))) %>%
  group_by(month_year, bins, .drop = FALSE) %>%
  summarise(sum = n())
# month_year bins sum
# <chr> <fct> <int>
#1 2020-01 12:00 AM to 06:00 AM 2
#2 2020-01 06:00 AM to 12:00 PM 0
#3 2020-01 12:00 PM to 03:00 PM 1
#4 2020-01 03:00 PM to 08:00 PM 3
#5 2020-01 08:00 PM to 12:00 AM 1
data
df <- structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(10L,
20L, 25L, 20L, 10L, 15L, 15L)), row.names = c(NA, -7L), class = "data.frame")
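As an alternative to case_when, the same five periods could be produced with cut and left-closed intervals, returning both the count and the sum that the expected output asks for. This is a sketch under the assumption that the break points 0, 6, 12, 15 and 20 are the intended boundaries; period_labels is a hypothetical name.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  Datetime = as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                          "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                          "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                          "2020-01-01 15:45:23"), tz = "UTC"),
  Value = c(10L, 20L, 25L, 20L, 10L, 15L, 15L)
)

period_labels <- c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                   "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                   "08:00 PM to 12:00 AM")

by_period <- df %>%
  # right = FALSE gives [0,6), [6,12), [12,15), [15,20), [20,24)
  mutate(bins = cut(hour(Datetime), breaks = c(0, 6, 12, 15, 20, 24),
                    labels = period_labels, right = FALSE)) %>%
  group_by(bins, .drop = FALSE) %>%   # keep the empty morning period
  summarise(Count = n(), Sum = sum(Value))
```

Six breaks give five intervals, matching the five labels, so the "number of levels differs" problem from the question does not arise.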

Compare Hour in R

Here is my sample dataset
id hour
1 15:10
2 12:10
3 22:10
4 06:30
I need to find the earliest and the latest time. The class of hour is factor, so I need to convert it to an appropriate class before comparing the times. I tried to format the hour using the code below, but it did not work as expected:
format(as.Date(date),"%H:%M")
Use times() from the chron package:
#Data
xx
# id hour
#1 1 15:10
#2 2 12:10
#3 3 22:10
#4 4 06:30
library(chron)
xx$hour = times(paste0(as.character(xx$hour), ":00"))
xx
# id hour
#1 1 15:10:00
#2 2 12:10:00
#3 3 22:10:00
#4 4 06:30:00
#Min and Max
range(xx$hour)
#[1] 06:30:00 22:10:00
xx = structure(list(id = 1:4, hour = structure(c(3L, 2L, 4L, 1L), .Label = c("06:30",
"12:10", "15:10", "22:10"), class = "factor")), .Names = c("id",
"hour"), row.names = c(NA, -4L), class = "data.frame")
If all you need is the earliest (min) and latest (max) times, you can convert the times to character and use min and max directly, since zero-padded 24-hour strings sort in chronological order. For example:
hour <- c("15:10", "12:10", "22:10", "06:30")
max(hour)
# [1] "22:10"
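Since the column in the question is a factor, the conversion step is worth spelling out. The sketch below uses base R only; the minutes-since-midnight variant is an added alternative for data that may not be zero-padded, not part of the answers above.

```r
hour <- factor(c("15:10", "12:10", "22:10", "06:30"))

# min()/max() are not defined for unordered factors, so convert first
h <- as.character(hour)
rng <- range(h)   # earliest and latest as strings
rng
# [1] "06:30" "22:10"

# Alternative: compare minutes since midnight, which does not rely on padding
parts <- strsplit(h, ":", fixed = TRUE)
mins <- vapply(parts,
               function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
               integer(1))
earliest <- h[which.min(mins)]
latest   <- h[which.max(mins)]
```

Both approaches agree here; the numeric version is the safer choice if values such as "6:30" can occur.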
