How to count the number of occurrences in a column using dplyr - r

I'm trying to count the number of occurrences in a single column. Here is a snippet of the df I'm working with:
Here is the code I have so far:
my_df$day <- weekdays(as.Date(my_df$deadline))
most_common_day <- my_df %>%
arrange(day) %>%
filter(day == "Friday") %>%
select(day)
So the main goal is to find which weekday is the most common. Any suggestions?

There are various ways to count the number of occurrences in R. The base R method is table():
table(my_df$day)
# Friday Monday Saturday Sunday Thursday Tuesday Wednesday
# 4 6 8 11 6 5 10
The dplyr approach can be with count():
count(my_df, day)
# day n
#1 Friday 4
#2 Monday 6
#3 Saturday 8
#4 Sunday 11
#5 Thursday 6
#6 Tuesday 5
#7 Wednesday 10
You can also use tally() from dplyr but you will also need group_by():
my_df %>% group_by(day) %>% tally
# day n
#1 Friday 4
#2 Monday 6
#3 Saturday 8
#4 Sunday 11
#5 Thursday 6
#6 Tuesday 5
#7 Wednesday 10
To get the most common day(s), you can do:
# when using table()
names(table(my_df$day))[table(my_df$day) == max(table(my_df$day))]
#[1] "Sunday"
# when using count()
count(my_df, day) %>% slice_max(n)
# day n
#1 Sunday 11
# when using tally()
my_df %>% group_by(day) %>% tally %>% slice_max(n)
## A tibble: 1 x 2
# day n
# <fct> <int>
#1 Sunday 11
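If only the single most common day is needed, the counting and filtering steps above can be condensed. The following is a minimal sketch on a small made-up my_df (the day values are hypothetical sample data, not the asker's). It assumes that one row is enough even when several days share the maximum, which is what with_ties = FALSE requests.

```r
library(dplyr)

# Hypothetical sample data standing in for the asker's my_df
my_df <- data.frame(
  day = c("Friday", "Friday", "Sunday", "Sunday", "Sunday", "Monday")
)

# Counts per day, largest first
day_counts <- count(my_df, day, sort = TRUE)

# Keep exactly one row with the highest count
most_common <- slice_max(day_counts, order_by = n, n = 1, with_ties = FALSE)
most_common
```

Leaving with_ties at its default of TRUE returns every day that shares the highest count, which matches the "most common day(s)" behaviour shown above.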

Related

How to add rows based on a specific column value and append date column accordingly

Currently I have a data frame that looks like this:
Months Total Date
1 2 6 05/01/2021
2 5 10 18/06/2021
I want to transform the data so that the month are added to the "Date" and the "Total" is divided by the "Months" giving a row for each month like the following:
Total Date
1 3 05/01/2021
2 3 05/02/2021
3 2 18/06/2021
4 2 18/07/2021
5 2 18/08/2021
6 2 18/09/2021
7 2 18/10/2021
Here is one way -
Change Date to Date class so that it is easier to perform arithmetic on it.
Use uncount to repeat each row Months times.
For each original row, divide the Total value by the number of times that row is repeated.
Add one more month to the date for each successive repeated row.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date),
row = row_number()) %>%
uncount(Months) %>%
group_by(row) %>%
mutate(Total = Total/n(),
Date = Date %m+% months(row_number() - 1)) %>%
ungroup %>%
select(-row)
# Total Date
# <dbl> <date>
#1 3 2021-01-05
#2 3 2021-02-05
#3 2 2021-06-18
#4 2 2021-07-18
#5 2 2021-08-18
#6 2 2021-09-18
#7 2 2021-10-18
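The answer adds months with %m+% rather than a plain +. A short sketch of why that matters, using a hypothetical month-end start date that is not in the question's data: plain period addition gives NA when the target day does not exist, while %m+% rolls back to the last valid day of the month.

```r
library(lubridate)

# Hypothetical start date at the end of a month
start <- ymd("2021-01-31")

# Plain period addition yields NA because 31 February does not exist
plain <- start + months(1)

# %m+% rolls back to the last valid day of the target month
rolled <- start %m+% months(1)

plain
rolled
```

For the dates in the question (the 5th and the 18th) both forms agree, so the difference only shows up for days late in the month.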

R - exclude weekends from time interval calculations with lubridate

I wish to calculate the intervals between dates. The differences in days should take weekends into account, i.e. exclude them. I have over 200 date stamps.
For example, the currently displayed time difference between 6th (Wednesday) and 11th (Monday) January is 5 days. I would like to obtain 3 days.
I managed to get a solution that does not exclude Saturday and Sunday, using the following code with the lubridate and dplyr packages.
Could you please guide me on how to exclude weekends from the calculation?
Thank you.
library(lubridate)
library(dplyr)
dates <- c("2021-01-01", "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-11", "2021-01-13", "2021-01-14", "2021-01-18", "2021-01-25", "2021-01-29")
d <- do.call(rbind, lapply(dates, as.data.frame))
dateoverview <- rename(d, Dates = 1)
dateoverview$Dates <- lubridate::ymd(dateoverview$Dates)
datecalculation <- dateoverview %>%
mutate(Days = Dates - lag(Dates)) %>%
mutate(Weekday = wday(Dates, label = FALSE))
datecalculation
## Dates Days Weekday
## 1 2021-01-01 NA days 6
## 2 2021-01-04 3 days 2
## 3 2021-01-05 1 days 3
## 4 2021-01-06 1 days 4
## 5 2021-01-11 5 days 2
## 6 2021-01-13 2 days 4
## 7 2021-01-14 1 days 5
## 8 2021-01-18 4 days 2
## 9 2021-01-25 7 days 2
## 10 2021-01-29 4 days 6
Probably, there is a function somewhere already doing this but here is a custom one which can help you calculate date difference excluding weekends.
library(dplyr)
library(purrr)
date_diff_excluding_weekends <- function(x, y) {
  # no previous date for the first row
  if(is.na(x) || is.na(y)) return(NA)
  # count the days from x up to y - 1 that are not Saturday (6) or Sunday (7)
  sum(!format(seq(x, y - 1, by = '1 day'), '%u') %in% 6:7)
}
datecalculation %>%
  mutate(Days = map2_dbl(lag(Dates), Dates, date_diff_excluding_weekends))
# Dates Days Weekday
#1 2021-01-01 NA 6
#2 2021-01-04 1 2
#3 2021-01-05 1 3
#4 2021-01-06 1 4
#5 2021-01-11 3 2
#6 2021-01-13 2 4
#7 2021-01-14 1 5
#8 2021-01-18 2 2
#9 2021-01-25 5 2
#10 2021-01-29 4 6
seq(x, y - 1, by = '1 day') creates a sequence of dates from the previous date up to the day before the current date.
format(..., "%u") returns the day of the week as a number: 1 for Monday through 7 for Sunday.
sum(!format(...) %in% 6:7) then counts how many of those days are not a Saturday or Sunday.
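The three steps above can be sketched as a standalone base R function. The name count_weekdays is hypothetical, and the early return for to <= from is an added assumption: the original helper does not cover that case, and seq() raises an error when the end date precedes the start date.

```r
# Hypothetical helper: number of Monday-Friday days in [from, to)
count_weekdays <- function(from, to) {
  if (is.na(from) || is.na(to)) return(NA_integer_)
  # Assumption: identical or reversed dates count as zero days
  if (to <= from) return(0L)
  days <- seq(from, to - 1, by = "1 day")
  # "%u" gives "1" (Monday) to "7" (Sunday), independent of locale
  sum(!format(days, "%u") %in% c("6", "7"))
}

wed_to_mon <- count_weekdays(as.Date("2021-01-06"), as.Date("2021-01-11"))
fri_to_mon <- count_weekdays(as.Date("2021-01-01"), as.Date("2021-01-04"))
same_day   <- count_weekdays(as.Date("2021-01-06"), as.Date("2021-01-06"))
```

Using "%u" instead of weekdays() avoids depending on English day names.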
Another possible solution:
library(lubridate)
library(dplyr)
# sample data
df = data.frame(Dates = seq(ymd('2021-01-01'), ymd('2021-12-31'), by = 'days'))
# weekdays() returns names in the current locale; English names are assumed here
df_weekdays = df %>% filter(!(weekdays(Dates) %in% c('Saturday','Sunday')))
# Application to your data
datecalculation = datecalculation %>%
  filter(!(weekdays(Dates) %in% c('Saturday','Sunday')))

How to calculate sum on unique values in R

So here's the data:
DF1
ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday
I would like to join the following dictionary.
DF2
ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54
My goal is to get the total count of entries for each ID as well as the hours worked. But if a day appears twice for an ID, its hours must not be counted twice. (That's the hard part.)
Here's my attempt in R:
df3 <- df1 %>%
  left_join(df2, by = c("DOW" ,"ID"))
df3 %>%
  group_by(ID) %>%
  summarize(count = n(),
            sum = sum(Employee_Hrs)) %>%
  mutate(injRate = count/sum)
This does not work: although it correctly counts the total number of entries for each ID, it adds Employee_Hrs once for every row, including rows that are entered multiple times.
End product should be:
ID count sum
1 3 41
2 2 55
3 3 120
Again: count every entry, but sum the hours without double counting.
Here is a base R option using merge + aggregate
u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
merge(aggregate(DOW ~ ID, u, length),
aggregate(Hours ~ ID, unique(u), sum),
by = "ID"
),
c("ID", "Count", "Sum")
)
which gives
> res
ID Count Sum
1 1 3 41
2 2 2 55
3 3 3 120
An option with data.table
library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
.(Sum = sum(Hours)), .(ID, Count)]
# ID Count Sum
#1: 1 3 41
#2: 2 2 55
#3: 3 3 120
Another approach is to summarize the tables prior to joining them.
textFile1 <- "ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday"
textFile2 <- "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54"
df1 <- read.table(text =textFile1,header=TRUE )
df2 <- read.table(text =textFile2,header=TRUE )
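library(dplyr)
# count entries per ID in df1 (duplicates included)
df1 %>% group_by(ID) %>%
  summarise(count = n()) -> counts
# sum hours per ID in df2 (each day appears once), then attach the counts
df2 %>%
  group_by(ID) %>%
  summarize(sum = sum(Hours)) %>%
  left_join(counts, by = "ID") %>%
  mutate(injRate = count/sum)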
...and the output:
# A tibble: 3 x 4
ID sum count injRate
<int> <int> <int> <dbl>
1 1 41 3 0.0732
2 2 55 2 0.0364
3 3 120 3 0.025
Try this solution where you compute the number of counts and then you filter to obtain final summary:
library(tidyverse)
#Data
df3 <- df1 %>%
left_join(df2, by = c("DOW" ,"ID"))
#Code
df3 %>%
group_by(ID) %>%
mutate(count=n()) %>%
filter(!duplicated(DOW)) %>%
summarise(count=unique(count),Sum=sum(Hours))
Output:
# A tibble: 3 x 3
ID count Sum
<int> <int> <int>
1 1 3 41
2 2 2 55
3 3 3 120
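The same idea (count every row, but add each day's hours only once) can also be written in a single summarise() call. This is a minimal sketch that rebuilds the question's DF1 and DF2 by hand; it assumes that a repeated DOW within an ID always carries the same Hours value, as it does in the sample data.

```r
library(dplyr)

df1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday")
)
df2 <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 3),
  DOW   = c("Monday", "Tuesday", "Tuesday", "Wednesday",
            "Friday", "Monday", "Tuesday"),
  Hours = c(20, 21, 30, 25, 24, 42, 54)
)

res <- df1 %>%
  left_join(df2, by = c("ID", "DOW")) %>%
  group_by(ID) %>%
  summarise(count = n(),                         # every entry counts
            sum = sum(Hours[!duplicated(DOW)]))  # each day's hours once
res
```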

In a tidy dataframe, copy values over from one observation to another

I have a dataframe that contains information for various countries, days and variables. I have observations for one of those variables only. A simple working example would look like this:
df <- data.frame(country=c("NL","NL","NL","NL","BE","BE","BE","BE"),
day=c("Monday","Monday","Tuesday","Tuesday","Monday","Monday","Tuesday","Tuesday"),
variable=c("A","B","A","B","A","B","A","B"),
value=c(8,NA,13,NA,12,NA,9,NA))
> df
country day variable value
1 NL Monday A 8
2 NL Monday B NA
3 NL Tuesday A 13
4 NL Tuesday B NA
5 BE Monday A 12
6 BE Monday B NA
7 BE Tuesday A 9
8 BE Tuesday B NA
I want to copy those observations over to the other variable, as long as country and day are identical. The end result would look like this:
> df
country day variable value
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
The actual dataframe is quite large and I would like to avoid having to build loops. A solution using pipes would be preferable.
Perhaps you could just do:
library(dplyr)
df %>%
group_by(country, day) %>%
mutate(value = value[!is.na(value)])
Output:
# A tibble: 8 x 4
# Groups: country, day [4]
country day variable value
<fct> <fct> <fct> <dbl>
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
Another way would be via fill, though this is probably unnecessary (if needed, rather use mutate(value = zoo::na.locf(value)) as last line since fill itself is quite slow):
library(tidyverse)
df %>%
group_by(country, day) %>%
arrange(country, day, value) %>%
fill(value)
With data.table, we can do
library(data.table)
setDT(df)[, value := na.omit(value), .(country, day)]
Or using na.locf
library(zoo)
setDT(df)[, value := na.locf0(value), .(country, day)]
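The grouped mutate() shown earlier relies on every country/day group having exactly one observed value. A hedged variant that also tolerates groups with no observed value is sketched below; the FR rows are hypothetical and were added only to show that case.

```r
library(dplyr)

# Hypothetical data: the FR group has no observed value at all
df <- data.frame(
  country  = c("NL", "NL", "BE", "BE", "FR", "FR"),
  day      = "Monday",
  variable = c("A", "B", "A", "B", "A", "B"),
  value    = c(8, NA, 12, NA, NA, NA)
)

filled <- df %>%
  group_by(country, day) %>%
  # first non-missing value of the group; [1] on an empty vector gives NA
  mutate(value = value[!is.na(value)][1]) %>%
  ungroup()
filled
```

If a group held several different non-missing values, this sketch would silently keep the first one, so it is only appropriate when that is the intended behaviour.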

dplyr: grouping and summarizing/mutating data with rolling time windows

I have irregular timeseries data representing a certain type of transaction for users. Each line of data is timestamped and represents a transaction at that time. By the irregular nature of the data some users might have 100 rows in a day and other users might have 0 or 1 transaction in a day.
The data might look something like this:
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
id date n_widgets
1 1 2015-01-01 1
2 1 2015-01-01 2
3 1 2015-01-05 3
4 1 2015-01-25 4
5 1 2015-02-15 4
6 2 2015-05-05 5
7 2 2015-01-01 2
8 3 2015-08-01 4
9 4 2015-01-01 5
Often I'd like to know some rolling statistics about users. For example: for this user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days etc.
Corresponding to the above example, the data should look like:
id date n_widgets n_trans_30 total_widgets_30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)
Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)
However the challenge I'm having is how to set up a time window for an arbitrary period: 5 days, 10 days, etc.
There's also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem to be set up for regular time series. As far as I can tell, these window functions work on a number of rows rather than a specified time period. The key difference is that a given time period might contain a different number of rows depending on date and user.
For example, user 1 might have 100 transactions in the 5 days before 2015-01-01 and only 5 transactions in the 5 days before 2015-02-01. Thus looking back a set number of rows will simply not work.
Additionally, there is another SO thread discussing rolling dates for irregular time series type data (Create new column based on condition that exists within a rolling date) however the accepted solution was using data.table and I'm specifically looking for a dplyr way of achieving this.
I suppose at the heart of this issue, this problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr. Alternatively, if there's a different dplyr way to achieve above without a complicated group_by, how can I do it?
EDIT: updated example to make nature of the rolling window more clear.
This can be done using SQL:
library(sqldf)
dd <- transform(data, date = as.Date(date))
sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30'
from dd a
left join dd b on b.date between a.date - 30 and a.date
and b.id = a.id
and b.rowid <= a.rowid
group by a.rowid")
giving:
id date n_widgets n_trans30 total_widgets30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 2 2015-05-05 5 1 5
6 2 2015-01-01 2 1 2
7 3 2015-08-01 4 1 4
8 4 2015-01-01 5 1 5
Another approach is to expand your dataset to contain all possible days (using tidyr::complete), then use a rolling function (RcppRoll::roll_sum)
The fact that you have multiple observations per day is probably creating an issue though...
library(dplyr)
library(tidyr)
library(RcppRoll)
df2 <- df %>%
  mutate(date=as.Date(date))
## create full dataset with all possible dates (go even 30 days back for first observation)
df_full<- df2 %>%
  complete(id,
           date=seq(from=min(.$date)-30,to=max(.$date), by=1),
           fill=list(n_widgets=0))
## now use rolling function, and keep only original rows (left join)
df_roll <- df_full %>%
group_by(id) %>%
mutate(n_trans_30=roll_sum(x=n_widgets!=0, n=30, fill=0, align="right"),
total_widgets_30=roll_sum(x=n_widgets, n=30, fill=0, align="right")) %>%
ungroup() %>%
right_join(df2, by = c("date", "id", "n_widgets"))
The result is the same as yours (by chance)
id date n_widgets n_trans_30 total_widgets_30
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
But as said, it will fail for some days because it counts the last 30 observations, not the last 30 days. So you might want to first summarise the information by day, then apply this.
EDITED based on comment below.
You can try something like this for up to 5 days:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(Sys.Date(), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
In this case, no dates fall within five days of the current date, so it won't produce any output.
To get last five days for each ID, you can do something like this:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(max(date), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
Resulting output will be:
Source: local data frame [4 x 2]
id n_total_widgets
(dbl) (dbl)
1 1 4
2 2 5
3 3 4
4 4 5
I found a way to do this while working on this question
library(dplyr)
library(lubridate)  # for ymd()
df <- data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
count_window <- function(df, date2, w, id2){
min_date <- date2 - w
df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
out <- length(df2$date)
return(out)
}
v_count_window <- Vectorize(count_window, vectorize.args = c("date2","id2"))
sum_window <- function(df, date2, w, id2){
min_date <- date2 - w
df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
out <- sum(df2$n_widgets)
return(out)
}
v_sum_window <- Vectorize(sum_window, vectorize.args = c("date2","id2"))
res <- df %>% mutate(date = ymd(date)) %>%
mutate(min_date = date - 30,
n_trans = v_count_window(., date, 30, id),
total_widgets = v_sum_window(., date, 30, id)) %>%
select(id, date, n_widgets, n_trans, total_widgets)
res
id date n_widgets n_trans total_widgets
1 1 2015-01-01 1 2 3
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
This version is fairly case specific but you could probably make a version of the functions that is more general.
For simplicity, I recommend the runner package, which handles sliding window operations. In the OP's request, the window size is k = 30 and the windows depend on the date, so idx = date. You can use the runner function, which applies any R function on a given window, and sum_run, which computes a rolling sum.
library(runner)
library(dplyr)
df %>%
group_by(id) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
n_trans30 = runner(n_widgets, k = 30, idx = date, function(x) length(x)),
n_widgets30 = sum_run(n_widgets, k = 30, idx = date),
)
# id date n_widgets n_trans30 n_widgets30
#<dbl> <date> <dbl> <dbl> <dbl>
# 1 2015-01-01 1 1 1
# 1 2015-01-01 2 2 3
# 1 2015-01-05 3 3 6
# 1 2015-01-25 4 4 10
# 1 2015-02-15 4 2 8
# 2 2015-01-01 2 1 2
# 2 2015-05-05 5 1 5
# 3 2015-08-01 4 1 4
# 4 2015-01-01 5 1 5
Important: idx = date should be in ascending order.
For more, see the package documentation and vignettes.
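For comparison, the date-based window can also be sketched with dplyr alone, without an extra package. The helper in_window is hypothetical; it marks the rows of a group that come at or before position i and whose date lies within the given number of days up to and including that row's date. The approach compares every row with every other row of its group, so it is only a reasonable choice for groups of modest size.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
                   "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
                   "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# Hypothetical helper: TRUE for rows at or before position i whose date
# falls within `days` days up to and including dates[i]
in_window <- function(dates, i, days) {
  dates >= dates[i] - days & dates <= dates[i] & seq_along(dates) <= i
}

res <- df %>%
  group_by(id) %>%
  mutate(
    n_trans_30 = sapply(seq_along(date),
                        function(i) sum(in_window(date, i, 30))),
    total_widgets_30 = sapply(seq_along(date),
                              function(i) sum(n_widgets[in_window(date, i, 30)]))
  ) %>%
  ungroup()
res
```

The seq_along(dates) <= i condition mirrors the b.rowid <= a.rowid clause of the SQL answer, so two transactions on the same day are counted cumulatively in row order.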
