Group by a column and find its sum and count - r

Background:
I have a dataset, df,
Date Duration
1/2/2020 5:00:00 PM 20
1/2/2020 5:30:01 PM 30
1/2/2020 6:00:00 PM 10
1/5/2020 7:00:01 AM 5
1/6/2020 8:00:00 AM 2
1/6/2020 9:00:00 AM 8
Desired Output:
Date Total_Duration Count
1/2/2020 60 3
1/5/2020 5 1
1/6/2020 10 2
Dput:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
What I have tried:
library(dplyr)
df %>% group_by(Date) %>% add_tally() %>%
summarize(Duration)
Any guidance will be helpful.

We can get the date-only part of 'Date' after converting it to a datetime with dmy_hms (assuming the format is DD/MM/YYYY HH:MM:SS), use that as the grouping variable, and get the sum of 'Duration' and the count with n().
library(dplyr)
library(lubridate)
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
# A tibble: 3 x 3
# Date Total_Duration Count
# <date> <int> <int>
#1 2020-02-01 60 3
#2 2020-05-01 5 1
#3 2020-06-01 10 2
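For comparison, a base R sketch of the same grouping, keeping the DD/MM/YYYY assumption above (the Day helper column is just an illustrative name):
# parse the date part only; the trailing time text is ignored by the format match
df$Day <- as.Date(df$Date, format = "%d/%m/%Y")
sums   <- aggregate(Duration ~ Day, data = df, FUN = sum)
counts <- aggregate(Duration ~ Day, data = df, FUN = length)
merge(setNames(sums,   c("Date", "Total_Duration")),
      setNames(counts, c("Date", "Count")), by = "Date")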

How to create the frequency of a column and then perform an aggregation on that data in R

Objective:
I have a dataset, df, for which I wish to first tally up the number of occurrences of each date and then multiply the result by a certain number.
Sent Duration Length
1/7/2020 8:11:00 PM 34 216
1/22/2020 7:51:05 AM 432 111
1/7/2020 1:35:08 AM 57 90
1/22/2020 3:43:26 AM 22 212
1/22/2020 4:00:00 AM 55 500
Desired Outcome:
Date Count Aggregation(80)
1/7/2020 2 160
1/22/2020 3 240
I wish to count the number of times a particular 'datetime' occurs and then multiply this count by 80. The date 1/7/2020 occurs twice and the date 1/22/2020 occurs three times; each count is then multiplied by 80.
The dput is:
structure(list(Sent = structure(c(5L, 3L, 4L, 1L, 2L), .Label = c("1/22/2020 3:43:26 AM",
"1/22/2020 4:00:00 AM", "1/22/2020 7:51:05 PM", "1/7/2020 1:35:08 AM",
"1/7/2020 8:11:00 PM"), class = "factor"), Duration = c(34L,
432L, 57L, 22L, 55L), length = c(216L, 111L, 90L, 212L, 500L)), class = "data.frame", row.names = c(NA,
-5L))
This is what I have tried:
df1 <- aggregate(df$Sent, by = list(Category = df$Sent),
                 FUN = length)
However, I need to output the frequency with which the dates occur along with the aggregation (multiplied by 80).
Any suggestions are welcome.
We can convert Sent to POSIXct format and extract the date, count the number of rows for each date, and multiply that count by 80. Using dplyr, we can do it as follows:
library(dplyr)
df %>%
group_by(Date = as.Date(lubridate::mdy_hms(Sent))) %>%
summarise(Count = n(), `Aggregation(80)` = Count * 80)
# Date Count `Aggregation(80)`
# <date> <int> <dbl>
#1 2020-01-07 2 160
#2 2020-01-22 3 240
Using table.
as.data.frame(cbind(Count=(r <- table(as.Date(df$Sent, format="%m/%d/%Y %H:%M:%S"))),
Agg=r*80))
# Count Agg
# 2020-01-07 2 160
# 2020-01-22 3 240
or
`rownames<-`(as.data.frame(cbind(Count=(r <- table(as.Date(df$Sent, format="%m/%d/%Y %H:%M:%S"))),
Agg=r*80, Date=names(r)))[c(3, 1:2)], NULL)
# Date Count Agg
# 1 2020-01-07 2 160
# 2 2020-01-22 3 240
Here is the data.table way of doing things.
code
library( data.table )
#set data as data.table
setDT(mydata)
#parse timestamps as POSIXct (the 12-hour clock needs %I with %p)
mydata[, Sent := as.POSIXct( Sent, format = "%m/%d/%Y %I:%M:%S %p" ) ]
#summarise
mydata[, .(Count = .N, Aggregation = .N * 80), by = .(Date = as.Date(Sent) )]
output
# Date Count Aggregation
# 1: 2020-01-07 2 160
# 2: 2020-01-22 3 240
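For completeness, a base R sketch closer to the aggregate() attempt in the question (the derived Date column and the Aggregation(80) name are assumptions chosen to mirror the desired outcome):
# count rows per date with aggregate(), then derive the multiplied column
df$Date <- as.Date(df$Sent, format = "%m/%d/%Y %I:%M:%S %p")
res <- aggregate(Sent ~ Date, data = df, FUN = length)
names(res)[2] <- "Count"
res$`Aggregation(80)` <- res$Count * 80
res
# gives 2020-01-07: Count 2, Aggregation 160; 2020-01-22: Count 3, Aggregation 240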

Filter within a column by date in R

I have a dataset, df. The Date column consists of dates from December and January. I would like to filter it and make a new dataset with dates only from January onward.
Date ID
12/20/2019 1:00:01 AM A
12/30/2019 2:00:02 AM B
01/01/2020 1:00:00 AM C
02/05/2020 2:00:05 AM D
I would like this:
Date ID
01/01/2020 1:00:00 AM C
02/05/2020 2:00:05 AM D
Can I use dplyr for this, or base R?
library(lubridate)
library(tidyverse)
filter(Date) >= 01-01-2020 ?
The dput is:
structure(list(Date = structure(c(2L, 3L, 1L, 4L), .Label = c("1/1/2020 1:00:00 AM",
"12/20/2019 1:00:01 AM", "12/30/2019 2:00:02 AM", "2/5/2020 2:00:05 AM"
), class = "factor"), ID = structure(1:4, .Label = c("A", "B",
"C", "D"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Maybe just filter on the year and select dates from 2020 onward?
library(dplyr)
library(lubridate)
df %>% mutate(Date = mdy_hms(Date)) %>% filter(year(Date) >= 2020)
# Date ID
#1 2020-01-01 01:00:00 C
#2 2020-02-05 02:00:05 D
Or using base R :
subset(transform(df, Date = as.POSIXct(Date, format = "%m/%d/%Y %I:%M:%S %p")),
as.integer(format(Date, "%Y")) >= 2020)
We can use subset with strptime in base R
subset(df1, strptime(Date, "%m/%d/%Y %I:%M:%S %p")$year + 1900 >=2020)
# Date ID
#3 1/1/2020 1:00:00 AM C
#4 2/5/2020 2:00:05 AM D
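If you would rather compare against an explicit cutoff date than the year, a dplyr sketch along the same lines (the "2020-01-01" cutoff is the "January onward" boundary from the question):
library(dplyr)
library(lubridate)
# parse as above, then keep rows on or after the cutoff
df %>%
  mutate(Date = mdy_hms(Date)) %>%
  filter(Date >= as.POSIXct("2020-01-01", tz = "UTC"))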

Segregating datetime on an hourly basis in R

I'm using the below-mentioned dataframe in R:
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the above-mentioned dataframe, I want to segregate the datetimes on an hourly basis, for which I'm using the following code.
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 0:24 - 0.01))
levels(DF$bins) <- c("00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59",
"00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59",
"00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59",
"00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59")
newDF <- DF %>%
dplyr::group_by(bins, .drop = FALSE) %>%
dplyr::summarise(Count = length(Value), Total = sum(Value))
Final<-newDF %>%
dplyr::summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
Final[,c(2,3)]<-sapply(Final[,c(2,3)], function(x) scales::comma(x))
At levels(DF$bins) <- I'm getting the error:
Error in `levels<-.factor`(`*tmp*`, value = c("00:00 to 00:59", "00:01 to 01:59", : number of levels differs
How can I keep the below-mentioned segregation static and aggregate the numbers accordingly?
"00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59", "00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59","00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59","00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59"
Expected Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
We can use floor_date from lubridate to create hourly breaks, create a grouping column (bins) based on our requirement with sprintf, and then use this column to calculate whatever we want for each group.
library(dplyr)
library(lubridate)
df %>%
  mutate(bins = floor_date(Datetime, "hour"),
         hour = hour(bins),
         bins = paste0(sprintf("%02d:00 :", hour), sprintf(" %02d:59", hour))) %>%
  group_by(bins) %>%
  summarise(sum = sum(Value))
# A tibble: 6 x 2
# bins sum
# <chr> <int>
#1 00:00 : 00:59 20
#2 03:00 : 03:59 25
#3 14:00 : 14:59 20
#4 15:00 : 15:59 25
#5 18:00 : 18:59 10
#6 20:00 : 20:59 15
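As an aside, the "number of levels differs" error in the question comes from cut() with breaks c(-1, 0:24 - 0.01) creating 25 intervals while only 24 labels are supplied; a minimal cut()-based sketch that keeps all 24 hourly bins (using the df from the data block below, with corrected hour-range labels):
# breaks 0:24 with right = FALSE give the 24 intervals [0,1), ..., [23,24)
hour_labels <- sprintf("%02d:00 to %02d:59", 0:23, 0:23)
df$bins <- cut(lubridate::hour(df$Datetime), breaks = 0:24,
               right = FALSE, labels = hour_labels)
table(df$bins)   # empty hours are kept as zero counts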
For the updated condition, we can do
df %>%
  mutate(hour = hour(Datetime),
         gr = case_when(hour >= 0 & hour < 6 ~ "12:00 AM to 06:00 AM",
                        hour >= 6 & hour < 12 ~ "06:00 AM to 12:00 PM",
                        hour >= 12 & hour < 15 ~ "12:00 PM to 03:00 PM",
                        hour >= 15 & hour < 20 ~ "03:00 PM to 08:00 PM",
                        TRUE ~ "08:00 PM to 12:00 AM"),
         month_year = format(Datetime, "%Y-%m"),
         bins = factor(gr, levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                                      "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                                      "08:00 PM to 12:00 AM"))) %>%
  group_by(month_year, bins, .drop = FALSE) %>%
  summarise(sum = n())
# month_year bins sum
# <chr> <fct> <int>
#1 2020-01 12:00 AM to 06:00 AM 2
#2 2020-01 06:00 AM to 12:00 PM 0
#3 2020-01 12:00 PM to 03:00 PM 1
#4 2020-01 03:00 PM to 08:00 PM 3
#5 2020-01 08:00 PM to 12:00 AM 1
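The expected output also asks for the Value totals and a month-level summary row on top; a sketch extending the same dplyr/lubridate pipeline (monthly and month_label are illustrative names, and bind_rows prepends the month totals, much like the Final step in the question):
monthly <- df %>%
  mutate(hour = hour(Datetime),
         bins = factor(case_when(hour < 6  ~ "12:00 AM to 06:00 AM",
                                 hour < 12 ~ "06:00 AM to 12:00 PM",
                                 hour < 15 ~ "12:00 PM to 03:00 PM",
                                 hour < 20 ~ "03:00 PM to 08:00 PM",
                                 TRUE      ~ "08:00 PM to 12:00 AM"),
                       levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                                  "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                                  "08:00 PM to 12:00 AM"))) %>%
  group_by(bins, .drop = FALSE) %>%
  summarise(Count = n(), Total = sum(Value))

month_label <- format(min(df$Datetime), "%b-%y")   # "Jan-20" for this data

monthly %>%
  summarise(bins = month_label, Count = sum(Count), Total = sum(Total)) %>%
  bind_rows(mutate(monthly, bins = as.character(bins)))
# first row is the month total (Count 7, Total 115), followed by the five time-of-day bins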
data
df <- structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(10L,
20L, 25L, 20L, 10L, 15L, 15L)), row.names = c(NA, -7L), class = "data.frame")

Converting Date/Time in R

I am struggling hard with date-time formatting in R. I am sure this is an easy fix... Can someone write me a line of code that will convert all values from Year, M, D, Time into a new column "datetime"?
What data looks like:
x year m d time
A 2019 2 23 11:12 PM
B 2019 1 31 2:04 PM
C 2018 12 31 12:01 AM
D 2017 2 1 10:14 AM
What I want:
x datetime
A 2/23/19 11:12 PM
B 1/31/19 2:04 PM
C 12/31/18 12:01 AM
D 2/1/17 10:14 AM
Since it's a datetime value, we can convert it into a standard format by pasting the components together and parsing the result with as.POSIXct.
df$datetime <- with(df, as.POSIXct(paste(year, m, d, time),
format = "%Y %m %d %I:%M %p", tz = "UTC"))
df
# x year m d time datetime
#1 A 2019 2 23 11:12PM 2019-02-23 23:12:00
#2 B 2019 1 31 2:04PM 2019-01-31 14:04:00
#3 C 2018 12 31 12:01AM 2018-12-31 00:01:00
#4 D 2017 2 1 10:14AM 2017-02-01 10:14:00
Or using lubridate
library(dplyr)
library(lubridate)
df %>% mutate(datetime = ymd_hm(paste(year, m, d, time)))
data
df <- structure(list(x = structure(1:4, .Label = c("A", "B", "C", "D"
), class = "factor"), year = c(2019L, 2019L, 2018L, 2017L), m = c(2L,
1L, 12L, 2L), d = c(23L, 31L, 31L, 1L), time = c("11:12 PM",
"2:04 PM", "12:01 AM", "10:14 AM")), row.names = c(NA, -4L), class = "data.frame")
I think the below should work for your goal:
df <- data.frame(datetime = apply(df, 1, function(v)
  sprintf("%s/%s/%s %s", trimws(v["d"]), trimws(v["m"]), trimws(v["year"]), v["time"])))
If you want to append the new column to the existing data.frame df, then use:
df$datetime <- apply(df, 1, function(v)
  sprintf("%s/%s/%s %s", trimws(v["d"]), trimws(v["m"]), trimws(v["year"]), v["time"]))
# trimws() strips the padding that apply() introduces when coercing numeric columns to character
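If the goal is the exact m/d/y display from the desired output rather than a POSIXct column, a hedged option is to parse first (as in the as.POSIXct answer above, starting again from the df in the data block) and then format back to character (datetime_chr is just an illustrative name):
# parse as before, then render in the m/d/y 12-hour style from the question
df$datetime <- with(df, as.POSIXct(paste(year, m, d, time),
                                   format = "%Y %m %d %I:%M %p", tz = "UTC"))
df$datetime_chr <- format(df$datetime, "%m/%d/%y %I:%M %p")
df$datetime_chr
#[1] "02/23/19 11:12 PM" "01/31/19 02:04 PM" "12/31/18 12:01 AM" "02/01/17 10:14 AM"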

Compare Hour in R

Here is my sample dataset
id hour
1 15:10
2 12:10
3 22:10
4 06:30
I need to find out the earliest time and the latest time. The class of hour is factor, so I need to convert it to an appropriate class and then compare the times. I tried to format the hour using the code below, but it did not work out as expected:
format(as.Date(date),"%H:%M")
Use times() from the chron package:
#Data
xx
# id hour
#1 1 15:10
#2 2 12:10
#3 3 22:10
#4 4 06:30
library(chron)
xx$hour = times(paste0(as.character(xx$hour), ":00"))
xx
# id hour
#1 1 15:10:00
#2 2 12:10:00
#3 3 22:10:00
#4 4 06:30:00
#Min and Max
range(xx$hour)
#[1] 06:30:00 22:10:00
xx = structure(list(id = 1:4, hour = structure(c(3L, 2L, 4L, 1L), .Label = c("06:30",
"12:10", "15:10", "22:10"), class = "factor")), .Names = c("id",
"hour"), row.names = c(NA, -4L), class = "data.frame")
If all you need is to find the earliest (min) and latest (max) times, you can just convert the times to character and use min/max, e.g.:
hour <- c("15:10", "12:10", "22:10", "06:30")
hour[which(hour == max(hour))]
> "22:10"
