Segregating datetime values on an hourly basis in R

I'm using the following data frame in R:
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the above data frame, I want to segregate the datetimes on an hourly basis, for which I'm using the following code.
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 0:24 - 0.01))
levels(DF$bins) <- c("00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59",
"00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59",
"00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59",
"00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59")
newDF <- DF %>%
dplyr::group_by(bins, .drop = FALSE) %>%
dplyr::summarise(Count = length(Value), Total = sum(Value))
Final<-newDF %>%
dplyr::summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
Final[,c(2,3)]<-sapply(Final[,c(2,3)], function(x) scales::comma(x))
At levels(DF$bins) <- I'm getting this error:
Error in `levels<-.factor`(`*tmp*`, value = c("00:00 to 00:59", "00:01 to 01:59", :
number of levels differs
How can I keep the segregation below static and aggregate the numbers accordingly?
"00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59", "00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59","00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59","00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59"
Expected Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 06:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15

We can use floor_date/ceiling_date from lubridate to create hourly breaks, create a grouping column (bins) based on our requirement using sprintf and then use this column to calculate whatever we want for each group.
library(dplyr)
library(lubridate)
df %>%
mutate(bins = floor_date(Datetime, "hour"),
hour = hour(bins),
bins = paste0(sprintf("%02d:00 :", hour), sprintf(" %02d:59", hour))) %>%
group_by(bins) %>%
summarise(sum = sum(Value))
# A tibble: 6 x 2
# bins sum
# <chr> <int>
#1 00:00 : 00:59 20
#2 03:00 : 03:59 25
#3 14:00 : 14:59 20
#4 15:00 : 15:59 25
#5 18:00 : 18:59 10
#6 20:00 : 20:59 15
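As for the error in the question: c(-1, 0:24 - 0.01) contains 26 break points, so cut() creates 25 intervals, and assigning 24 labels to them fails with "number of levels differs". A minimal base R sketch for keeping all 24 hourly bins fixed, including the empty ones, is to build the labels with sprintf and use them as factor levels. The inline data frame only repeats the sample data, and the names hour_labels and hourly are made up for illustration.

```r
# Sample data from the question, rebuilt inline so the sketch runs on its own
df <- data.frame(
  ID = paste0("T-", 1:7),
  Datetime = as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                          "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                          "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                          "2020-01-01 15:45:23"), tz = "UTC"),
  Value = c(10L, 20L, 25L, 20L, 10L, 15L, 15L)
)

# All 24 labels are fixed up front ("00:00 to 00:59", "01:00 to 01:59", ...)
hour_labels <- sprintf("%02d:00 to %02d:59", 0:23, 0:23)

# Hour of day (0-23) mapped onto the fixed set of labels
hr <- as.integer(format(df$Datetime, "%H"))
df$bins <- factor(hour_labels[hr + 1], levels = hour_labels)

# table() and xtabs() keep unused factor levels, so empty hours show up as 0
hourly <- data.frame(
  bins  = hour_labels,
  Count = as.vector(table(df$bins)),
  Total = as.vector(xtabs(Value ~ bins, data = df))
)
print(hourly)
```

Because the levels are set explicitly, the result always has 24 rows regardless of which hours occur in the data.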
For the updated condition, we can do
df %>%
mutate(hour = hour(Datetime),
gr = case_when(hour >= 0 & hour < 6 ~ "12:00 AM to 06:00 AM",
hour >= 6 & hour < 12 ~ "06:00 AM to 12:00 PM",
hour >= 12 & hour < 15 ~ "12:00 PM to 03:00 PM",
hour >= 15 & hour < 20 ~ "03:00 PM to 08:00 PM",
TRUE ~ "08:00 PM to 12:00 AM"),
month_year = format(Datetime, "%Y-%m"),
bins = factor(gr, levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
"12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
"08:00 PM to 12:00 AM"))) %>%
group_by(month_year, bins, .drop = FALSE) %>%
summarise(sum = n())
# month_year bins sum
# <chr> <fct> <int>
#1 2020-01 12:00 AM to 06:00 AM 2
#2 2020-01 06:00 AM to 12:00 PM 0
#3 2020-01 12:00 PM to 03:00 PM 1
#4 2020-01 03:00 PM to 08:00 PM 3
#5 2020-01 08:00 PM to 12:00 AM 1
data
df <- structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(10L,
20L, 25L, 20L, 10L, 15L, 15L)), row.names = c(NA, -7L), class = "data.frame")
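To get both the count and the sum per bin, plus a total row on top as in the expected output, a base R sketch along these lines may help. It assumes all rows belong to a single month and labels the total row with "%Y-%m", as the answer above does; the names bin_labels and final are hypothetical.

```r
# Sample data from the question, rebuilt inline so the sketch runs on its own
df <- data.frame(
  Datetime = as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                          "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                          "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                          "2020-01-01 15:45:23"), tz = "UTC"),
  Value = c(10L, 20L, 25L, 20L, 10L, 15L, 15L)
)

bin_labels <- c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                "08:00 PM to 12:00 AM")

# Six break points give five intervals, one per label.
# right = FALSE makes each interval [lower, upper), so hour 6 goes to the second bin.
hr   <- as.integer(format(df$Datetime, "%H"))
bins <- cut(hr, breaks = c(0, 6, 12, 15, 20, 24), right = FALSE, labels = bin_labels)

count <- as.vector(table(bins))             # rows per bin, 0 for empty bins
total <- as.vector(xtabs(df$Value ~ bins))  # sum of Value per bin, 0 for empty bins

# Total row first (assumes a single month), then one row per bin
final <- data.frame(
  Month = c(format(df$Datetime[1], "%Y-%m"), bin_labels),
  Count = c(sum(count), count),
  Sum   = c(sum(total), total)
)
print(final)
```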

Related

Duplicate rows in R based on content of columns

I'm working with a school schedule dataset for a visualization project and had days of classes originally in the form "MW" or "TTh" etc - they are now in the format below:
name start end first second
finance 9:00 10:00 M W
stats 10:30 11:30 T Th
econ 16:30 19:00 T NA
I'm looking to duplicate the first three columns to get a dataframe that looks like:
day name start end
M finance 9:00 10:00
W finance 9:00 10:00
T stats 10:30 11:30
Th stats 10:30 11:30
T econ 16:30 19:00
Any ideas?
We can use pivot_longer
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = c(first, second), values_to = 'day',
names_to = 'name1') %>%
select(day, name, start, end) %>%
filter(complete.cases(day))
Output:
# A tibble: 5 x 4
# day name start end
# <chr> <chr> <chr> <chr>
#1 M finance 9:00 10:00
#2 W finance 9:00 10:00
#3 T stats 10:30 11:30
#4 Th stats 10:30 11:30
#5 T econ 16:30 19:00
data
df1 <- structure(list(name = c("finance", "stats", "econ"), start = c("9:00",
"10:30", "16:30"), end = c("10:00", "11:30", "19:00"), first = c("M",
"T", "T"), second = c("W", "Th", NA)), class = "data.frame", row.names = c(NA,
-3L))
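As a variation on the same idea, pivot_longer can drop the missing days itself through its values_drop_na argument, which makes the separate filter step unnecessary. A short sketch, with the data rebuilt inline and the helper column name slot chosen arbitrarily:

```r
library(dplyr)
library(tidyr)

# Same sample data as above
df1 <- data.frame(
  name   = c("finance", "stats", "econ"),
  start  = c("9:00", "10:30", "16:30"),
  end    = c("10:00", "11:30", "19:00"),
  first  = c("M", "T", "T"),
  second = c("W", "Th", NA)
)

long <- df1 %>%
  pivot_longer(cols = c(first, second),
               names_to = "slot",        # "first"/"second"; dropped again below
               values_to = "day",
               values_drop_na = TRUE) %>% # removes the NA day for econ
  select(day, name, start, end)

print(long)
```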

in R Replace values with following value

I have a data set that I'm cleaning. The 2nd column sometimes contains a -, while the value below it is the one I need. How do I replace the - with the value under it?
I have thousands of rows like this, and the value below differs from row to row, so I can't just do
df$agent[df$agent == "-"] <- "john"
That would have to be done over 1,000 times. I'm looking for a much more efficient way to do this.
1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
You can use case_when and lead from the dplyr package:
library(dplyr)
data %>%
mutate(Name = case_when(Name == "-" ~ lead(Name),
TRUE ~ Name))
Role Name Time Date Value
1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
data <- structure(list(Role = structure(c(1L, 1L, 1L), .Label = "Field Support", class = "factor"),
Name = structure(c(1L, 2L, 2L), .Label = c("-", "John"), class = "factor"),
Time = structure(1:3, .Label = c("6:00 AM - 6:59 AM", "7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM"), class = "factor"), Date = structure(c(1L,
1L, 1L), .Label = "1/1/2020", class = "factor"), Value = c(9L,
4L, 4L)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with base R and package tidyr.
library(tidyr)
col_num <- 3
is.na(df[[col_num]]) <- df[[col_num]] == '-'
fill(df, all_of(col_num), .direction = "up")
# V1 V2 V3 V4 V5 V6
#1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
#2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
#3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
df <- read.table(text = "
1 'Field Support' - '6:00 AM - 6:59 AM' 1/1/2020 9
3 'Field Support' John '7:00 AM - 7:59 AM' 1/1/2020 4
4 'Field Support' John '8:00 AM - 8:59 AM' 1/1/2020 4
")
Here's a solution without using any packages:
> df <- data.frame("ID" = c(1, 3, 4, 5, 8),
"Job" = rep("Field Support", 5),
"Agent" = c("-", rep("John", 2), "-", "Mary"),
"Hours" = c("6:00 AM - 6:59 AM",
"7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM",
"9:00 AM - 9:59 AM",
"10:00 AM - 10:59 AM"),
"Date" = rep("1/1/2020", 5),
"Metric" = c(9, 4, 4, 6, 2))
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support - 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
> df$Agent[which(df$Agent == "-")] <- df$Agent[which(df$Agent == "-") + 1]
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support Mary 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
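One caveat with the `which(...) + 1` indexing: it copies from the row directly below, so two consecutive "-" rows would leave a "-" behind, and a "-" in the last row would become NA. A base R sketch that fills upward from the next real value, however far down it is, could look like this. The function name fill_up and the sample vector are made up for illustration.

```r
# Replace each marker with the next non-marker value below it (base R only)
fill_up <- function(x, marker = "-") {
  x <- as.character(x)
  x[x == marker] <- NA
  r <- rev(x)                              # filling up == filling down on the reversed vector
  idx <- cummax(seq_along(r) * !is.na(r))  # position of the last non-NA value seen so far
  idx[idx == 0] <- NA                      # nothing to copy from yet
  rev(r[idx])
}

agent <- c("-", "John", "John", "-", "-", "Mary", "-")
filled <- fill_up(agent)
print(filled)
```

A trailing "-" has no value below it, so it stays NA rather than being guessed.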

Groupby a column and find its sum and count

Background:
I have a dataset, df,
Date Duration
1/2/2020 5:00:00 PM 20
1/2/2020 5:30:01 PM 30
1/2/2020 6:00:00 PM 10
1/5/2020 7:00:01 AM 5
1/6/2020 8:00:00 AM 2
1/6/2020 9:00:00 AM 8
Desired Output:
Date Total_Duration Count
1/2/2020 60 3
1/5/2020 5 1
1/6/2020 10 2
Dput:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
What I have tried:
library(dplyr)
df %>% group_by(Date) %>% add_tally() %>%
summarize(Duration)
Any guidance will be helpful.
We can extract the date part of 'Date' after converting it to a date-time with dmy_hms (assuming the format is DD/MM/YYYY HH:MM:SS), use that as the grouping variable, and get the sum of 'Duration' along with the row count from n().
library(dplyr)
library(lubridate)
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
# A tibble: 3 x 3
# Date Total_Duration Count
# <date> <int> <int>
#1 2020-02-01 60 3
#2 2020-05-01 5 1
#3 2020-06-01 10 2
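If the strings are month-first instead (1/2/2020 meaning January 2), the dates come out differently, so the format assumption is worth checking against the data. A base R sketch under that month-first assumption, with the data rebuilt inline and out as an illustrative name:

```r
df <- data.frame(
  Date = c("1/2/2020 5:00:00 PM", "1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM",
           "1/5/2020 7:00:01 AM", "1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"),
  Duration = c(20L, 30L, 10L, 5L, 2L, 8L)
)

# Assumption: month/day/year. as.Date() only reads the part matched by the
# format and ignores the trailing time of day.
day <- as.Date(df$Date, format = "%m/%d/%Y")

# tapply() groups by day in chronological order, matching sort(unique(day))
out <- data.frame(
  Date           = sort(unique(day)),
  Total_Duration = as.vector(tapply(df$Duration, day, sum)),
  Count          = as.vector(tapply(df$Duration, day, length))
)
print(out)
```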

Group, take duration and set condition within R (dplyr, r)

I have a dataset, df: (the dataset contains over 4000 rows)
DATEB
9/9/2019 7:51:58 PM
9/9/2019 7:51:59 PM
9/9/2019 7:51:59 PM
9/9/2019 7:52:00 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:02 PM
9/9/2019 7:52:03 PM
9/9/2019 7:54:00 PM
9/9/2019 7:54:02 PM
9/10/2019 8:00:00 PM
I want to place these in separate groups and take the duration of each group, starting a new group whenever the time between consecutive date-times exceeds 120 seconds.
Desired output:
Group Duration
a 5 sec
b 2 sec
c 0 sec
dput:
structure(list(DATEB = structure(c(2L, 3L, 3L, 4L, 5L, 5L, 6L,
7L, 8L, 9L, 1L), .Label = c(" 9/10/2019 8:00:00 PM", " 9/9/2019 7:51:58 PM",
" 9/9/2019 7:51:59 PM", " 9/9/2019 7:52:00 PM", " 9/9/2019 7:52:01 PM",
" 9/9/2019 7:52:02 PM", " 9/9/2019 7:52:03 PM", " 9/9/2019 7:54:00 PM",
" 9/9/2019 7:54:02 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
I have tried the code below, which works well, except that I want 7:51:59 and 7:52:00 to be in the same group. The only time the duration should break and create a new group is when the time between date-times exceeds 120 secs.
df %>%
mutate(DATEB = lubridate::mdy_hms(DATEB),
temp = floor_date(DATEB, "120 secs")) %>%
group_by(temp) %>%
summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))
Any suggestion is appreciated.
We can use cut here :
library(dplyr)
df %>%
mutate(DATEB = lubridate::mdy_hms(DATEB),
temp = cut(DATEB, breaks = "2 mins")) %>%
group_by(temp) %>%
summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))
# A tibble: 3 x 2
# temp duration
# <fct> <drtn>
#1 2019-09-09 19:51:00 5 secs
#2 2019-09-09 19:53:00 2 secs
#3 2019-09-10 19:59:00 0 secs
The OP has asked for:
The only time the duration should break and create a new group, is
when the time in between datetimes exceed 120 secs.
The words "the time in between datetimes" suggest the OP is looking for a gap or pause. (Well, this is what I would look for if I've been given a vector of ordered date-times and been tasked to group the data.)
Unfortunately, the expected result and accepted answer do not correspond to this interpretation.
However, here is what I would do:
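library(dplyr)
# threshold in seconds; 10 reproduces the expected output here, the stated rule would be 120
gap_threshold <- 10
df %>%
mutate(DATEB = lubridate::mdy_hms(DATEB),
# diff() picks its units automatically, so force seconds explicitly
gap = c(0, as.numeric(diff(DATEB), units = "secs"))) %>%
group_by(grp = cumsum(gap > gap_threshold)) %>%
summarise(begin = min(DATEB), end = max(DATEB), duration = difftime(end, begin, units = "secs"))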
# A tibble: 3 x 4
grp begin end duration
<int> <dttm> <dttm> <drtn>
1 0 2019-09-09 19:51:58 2019-09-09 19:52:03 5 secs
2 1 2019-09-09 19:54:00 2019-09-09 19:54:02 2 secs
3 2 2019-09-10 20:00:00 2019-09-10 20:00:00 0 secs
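For comparison, applying the stated 120-second rule literally to the sample data gives only two groups, because the pause between 19:52:03 and 19:54:00 is 117 seconds. A base R sketch, with the timestamps retyped in 24-hour form so that parsing does not depend on AM/PM handling (the names x, grp and result are illustrative):

```r
# Sample timestamps from the question, written in 24-hour form
x <- as.POSIXct(c("2019-09-09 19:51:58", "2019-09-09 19:51:59", "2019-09-09 19:51:59",
                  "2019-09-09 19:52:00", "2019-09-09 19:52:01", "2019-09-09 19:52:01",
                  "2019-09-09 19:52:02", "2019-09-09 19:52:03", "2019-09-09 19:54:00",
                  "2019-09-09 19:54:02", "2019-09-10 20:00:00"), tz = "UTC")

# Gap to the previous timestamp in seconds (0 for the first row)
gap <- c(0, as.numeric(diff(x), units = "secs"))

# A new group starts whenever the gap exceeds 120 seconds
grp <- cumsum(gap > 120)

# Duration of each group in seconds
duration <- tapply(as.numeric(x), grp, function(v) max(v) - min(v))
result <- data.frame(grp = as.integer(names(duration)),
                     duration_secs = as.vector(duration))
print(result)
```

Here the first group spans 124 seconds and the second, containing only the 9/10 timestamp, spans 0 seconds.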

Converting Date/Time in R

I am struggling hard with date time formatting in R. I am sure this is an easy fix... can someone write me a line of code that will convert all values from Year, M, D, Time into a new column "datetime"?
What data looks like:
x year m d time
A 2019 2 23 11:12 PM
B 2019 1 31 2:04 PM
C 2018 12 31 12:01 AM
D 2017 2 1 10:14 AM
What I want:
x datetime
A 2/23/19 11:12 PM
B 1/31/19 2:04 PM
C 12/31/18 12:01 AM
D 2/1/17 10:14 AM
Since it's a datetime value we can convert it into a standard format by pasting the values together.
df$datetime <- with(df, as.POSIXct(paste(year, m, d, time),
format = "%Y %m %d %I:%M %p", tz = "UTC"))
df
# x year m d time datetime
#1 A 2019 2 23 11:12 PM 2019-02-23 23:12:00
#2 B 2019 1 31 2:04 PM 2019-01-31 14:04:00
#3 C 2018 12 31 12:01 AM 2018-12-31 00:01:00
#4 D 2017 2 1 10:14 AM 2017-02-01 10:14:00
Or using lubridate
library(dplyr)
library(lubridate)
df %>% mutate(datetime = ymd_hm(paste(year, m, d, time)))
data
df <- structure(list(x = structure(1:4, .Label = c("A", "B", "C", "D"
), class = "factor"), year = c(2019L, 2019L, 2018L, 2017L), m = c(2L,
1L, 12L, 2L), d = c(23L, 31L, 31L, 1L), time = c("11:12 PM",
"2:04 PM", "12:01 AM", "10:14 AM")), row.names = c(NA, -4L), class = "data.frame")
If you only need a character column in the requested month/day/two-digit-year layout, I think the below should work. sprintf is vectorised, so apply is not needed (apply would also convert the rows to padded strings, giving values such as " 2"):
df <- data.frame(x = df$x, datetime = with(df, sprintf("%d/%d/%02d %s", m, d, year %% 100L, time)))
If you want to append the new column to the existing data.frame df instead, use:
df$datetime <- with(df, sprintf("%d/%d/%02d %s", m, d, year %% 100L, time))
