Count number in group, but restart for non-consecutive dates - r

I have data that looks like this:
sample <- data.frame(
group = c("A","A","A","B","B","B"),
date = c(as.Date("2014-12-31"),
as.Date("2015-01-31"),
as.Date("2015-02-28"),
as.Date("2015-01-31"),
as.Date("2015-03-31"),
as.Date("2015-04-30")),
obs = c(100, 200, 300, 50, 100, 150)
)
Note that the date variable always takes the last date of the month. In table format, the data looks like this:
group date obs
1 A 2014-12-31 100
2 A 2015-01-31 200
3 A 2015-02-28 300
4 B 2015-01-31 50
5 B 2015-03-31 100
6 B 2015-04-30 150
I want to create a fourth column that counts the observations within each group. However, I want the count to start over whenever a month does not immediately follow the previous month. This is what I want it to look like:
group date obs num
1 A 2014-12-31 100 1
2 A 2015-01-31 200 2
3 A 2015-02-28 300 3
4 B 2015-01-31 50 1
5 B 2015-03-31 100 1
6 B 2015-04-30 150 2
So far all I can get is the following:
library(tidyverse)
sample <- sample %>%
arrange(date) %>%
group_by(group) %>%
mutate(num = row_number())
group date obs num
1 A 2014-12-31 100 1
2 A 2015-01-31 200 2
3 A 2015-02-28 300 3
4 B 2015-01-31 50 1
5 B 2015-03-31 100 2
6 B 2015-04-30 150 3
Any help would be much appreciated. I also want to be able to do the same thing but with quarterly data (instead of monthly).

We can use lubridate::days_in_month to get the number of days in each month and compare it with the difference between the current and previous date. Whenever the two differ, a new run starts. We can then assign row_number() within each run.
library(dplyr)
sample %>%
group_by(group) %>%
mutate(diff_days = cumsum(as.numeric(date - lag(date, default = first(date))) !=
lubridate::days_in_month(date))) %>%
group_by(diff_days, .add = TRUE) %>% # `add` was renamed to `.add` in dplyr 1.0.0
mutate(num = row_number()) %>%
ungroup() %>%
select(-diff_days)
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2014-12-31 100 1
#2 A 2015-01-31 200 2
#3 A 2015-02-28 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2

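We can create a grouping variable from the difference in month(date) between consecutive rows, starting a new group whenever that difference is not equal to 1, i.e. not exactly one month. Note that this looks only at the month number, so a December-to-January step (a difference of -11) also starts a new group.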
library(dplyr)
library(lubridate)
sample %>%
arrange(group, date) %>%
group_by(group, mth = cumsum(c(TRUE, diff(month(date)) != 1))) %>%
mutate(num = row_number()) %>%
ungroup %>%
select(-mth)
# A tibble: 6 x 4
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2015-01-31 100 1
#2 A 2015-02-28 200 2
#3 A 2015-03-31 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2
If the year also needs to be considered
library(zoo)
sample %>%
arrange(group, date) %>%
mutate(yearmon = as.yearmon(date)) %>%
group_by(group) %>%
group_by(grp = cumsum(c(TRUE, as.integer(diff(yearmon) * 12)> 1)),
.add = TRUE) %>%
mutate(num = row_number()) %>%
ungroup %>%
select(-grp, -yearmon)
# A tibble: 6 x 4
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2015-01-31 100 1
#2 A 2015-02-28 200 2
#3 A 2015-03-31 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2
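The question also asks about quarterly data. One way to cover both cases is to turn each date into an integer month index (year * 12 + month) and restart the count whenever the step between consecutive rows differs from the expected spacing. The sketch below assumes dates fall in the expected month of each period; the helpers month_index() and run_count() and the step argument are hypothetical names introduced here, not part of any package.

```r
library(dplyr)

sample <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2014-12-31", "2015-01-31", "2015-02-28",
                   "2015-01-31", "2015-03-31", "2015-04-30")),
  obs = c(100, 200, 300, 50, 100, 150)
)

# Hypothetical helper: integer month index, so Dec 2014 -> Jan 2015 is a step of 1
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900L) * 12L + lt$mon
}

# Hypothetical helper: count within runs of dates spaced exactly `step` months apart
# (step = 1 for monthly data, step = 3 for quarterly data)
run_count <- function(d, step = 1L) {
  new_run <- c(TRUE, diff(month_index(d)) != step)
  run_id <- cumsum(new_run)
  ave(seq_along(d), run_id, FUN = seq_along)
}

result <- sample %>%
  arrange(group, date) %>%
  group_by(group) %>%
  mutate(num = run_count(date)) %>%
  ungroup()

# Quarterly example: the third date skips a quarter, so the count restarts
quarterly <- as.Date(c("2014-12-31", "2015-03-31", "2015-09-30"))
quarterly_num <- run_count(quarterly, step = 3L)
```

Because the index includes the year, the December-to-January transition is treated as consecutive, and the computation is done inside group_by(group), so a run never continues across groups.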

Related

Find max value for each partition in dataframe in R

I have a data as:
ID Date1 VarA
1 2005-01-02 x
1 2021-01-02 20
1 2021-01-01 y
2 2020-12-20 No
2 2020-12-19 10
3 1998-05-01 0
Here is the R-code to reproduce the data
example = data.frame(ID = c(1,1,1,2,2,3),
Date1 = c('2005-01-02',
'2021-01-02',
'2021-01-01',
'2020-12-20',
'2020-12-19',
'1998-05-01'),
VarA = c('x','20','y','No', '10','0'))
I would prefer the solution to do following:
First, flag the maximum date in data.
ID Date1 VarA Last_visit
1 2005-01-02 x 0
1 2021-01-02 20 1
1 2021-01-01 y 0
2 2020-12-20 No 1
2 2020-12-19 10 0
3 1998-05-01 0 1
Finally, it should retain only the rows where Last_visit = 1:
ID Date1 VarA Last_visit
1 2021-01-02 20 1
2 2020-12-20 No 1
3 1998-05-01 0 1
I am requesting the intermediate steps as well to perform a sanity check. Thanks!
We create a new column after grouping by 'ID'
library(dplyr)
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
ungroup
and then filter/slice based on the column
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
slice_max(n = 1, order_by = Last_visit) %>%
ungroup
-output
# A tibble: 3 × 4
ID Date1 VarA Last_visit
<dbl> <chr> <chr> <int>
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
Another option is to convert the 'Date1' to Date class first, then do an arrange and use distinct
example %>%
mutate(Date1 = as.Date(Date1)) %>%
arrange(ID, desc(Date1)) %>%
distinct(ID, .keep_all = TRUE) %>%
mutate(Last_visit = 1)
ID Date1 VarA Last_visit
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
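Since the question asks for the intermediate step as a sanity check, the two stages can also be kept as separate objects. This is a minimal sketch, assuming that ties on the latest date should all be flagged (unlike which.max, which flags only the first); the names flagged and latest are hypothetical.

```r
library(dplyr)

example <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3),
  Date1 = c("2005-01-02", "2021-01-02", "2021-01-01",
            "2020-12-20", "2020-12-19", "1998-05-01"),
  VarA = c("x", "20", "y", "No", "10", "0")
)

# Step 1: flag the latest date within each ID (kept for inspection)
flagged <- example %>%
  mutate(Date1 = as.Date(Date1)) %>%
  group_by(ID) %>%
  mutate(Last_visit = as.integer(Date1 == max(Date1))) %>%
  ungroup()

# Step 2: retain only the flagged rows
latest <- flagged %>%
  filter(Last_visit == 1)
```

Printing flagged before filtering shows the full table with the 0/1 flag, which matches the first table the question asks for.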

How to filter by multiple range of dates in R?

Thank you, experts for previous answers (How to filter by range of dates in R?)
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate observations once there are already 3 "units" within a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if there are already 3 observations between 12/01/2021 and 12/02/2021, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
p q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
mutate(day = dmy(data))%>%
group_by(p) %>%
arrange(day, .by_group = TRUE) %>%
mutate(diff = day - first(day)) %>%
mutate(row = row_number()) %>%
filter(row <= 3 | !diff < 30)
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
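The question describes 30-day periods counted on from the data rather than from calendar boundaries. A possible variant, sketched here under the assumption that the windows are fixed 30-day blocks starting at each id's first observation, numbers the windows with integer division and keeps at most 3 rows per window. The column name window and the object kept are hypothetical.

```r
library(dplyr)

df <- data.frame(
  id = "a",
  q = 1L,
  date = as.Date(c("01/01/2021", "01/01/2021", "21/01/2021", "21/01/2021",
                   "12/02/2021", "12/02/2021", "12/02/2021", "12/02/2021"),
                 format = "%d/%m/%Y")
)

kept <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  # days since the id's first observation, cut into 30-day blocks (0, 1, 2, ...)
  mutate(window = as.integer(date - min(date)) %/% 30L) %>%
  group_by(id, window) %>%
  # keep at most 3 observations in each 30-day block
  filter(row_number() <= 3) %>%
  ungroup() %>%
  select(-window)
```

If the next window should instead start at the first observation after the previous window ends, the window assignment would need an iterative pass over the dates rather than a single integer division.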

Given different start and end dates, find the daily average of each variable

I have data with varying start and end dates.
mydata <- data.frame(id=c(1,2,3), start=c("2010/01/01","2010/01/01","2010/01/02"), end=c("2010/01/01","2010/01/05","2010/01/03"), a=c(140,750,56),b=c(48,25,36))
mydata
id start end a b
1 1 2010-01-01 2010-01-01 140 48
2 2 2010-01-01 2010-01-05 750 25
3 3 2010-01-02 2010-01-03 56 36
I want to find the average of the variables a and b for each day. Below I execute it by expanding every row with different start and end dates, then collapsing it back to the daily level.
mydata$subt <- as.numeric(as.Date(mydata$end, "%Y/%m/%d") - as.Date(mydata$start, "%Y/%m/%d") + 1)
require(data.table)
mydata <- setDT(mydata)[ , list(idnum = id, date = seq(start, end, by = "day"), a=a/subt, b=b/subt), by = 1:nrow(mydata)]
mydata
nrow idnum date a b
1: 1 1 2010-01-01 140 48
2: 2 2 2010-01-01 150 5
3: 2 2 2010-01-02 150 5
4: 2 2 2010-01-03 150 5
5: 2 2 2010-01-04 150 5
6: 2 2 2010-01-05 150 5
7: 3 3 2010-01-02 28 18
8: 3 3 2010-01-03 28 18
mydata %>%
group_by(date) %>%
summarize(a = sum(a),
b = sum(b))
Desired Outcome:
date a b
<date> <dbl> <dbl>
1 2010-01-01 290 53
2 2010-01-02 178 23
3 2010-01-03 178 23
4 2010-01-04 150 5
5 2010-01-05 150 5
However, I have plenty of rows with different start and end dates, and sometimes the length of difference is very long. I am wondering if there is an easier way (i.e., without expanding every row) to find the daily averages for each variable. It would also be great if there is a way to find the weekly averages without first finding the daily figures. Thank you!
Here is an option with tidyverse. We convert the 'start' and 'end' columns to Date class with ymd (from lubridate), create a sequence of dates from 'start' to 'end' for each row with map2, divide 'a' and 'b' by the lengths of the list column 'date', unnest 'date', and then, grouped by 'date', take the sum of 'a' and 'b'.
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
mydata %>%
mutate(across(c(start, end), ymd)) %>%
transmute(id, date = map2(start, end, seq, by = 'day'), a, b) %>%
mutate(across(c(a, b), ~ ./lengths(date))) %>%
unnest(date) %>%
group_by(date) %>%
summarise(across(c(a, b), sum, na.rm = TRUE))
# A tibble: 5 x 3
# date a b
# <date> <dbl> <dbl>
#1 2010-01-01 290 53
#2 2010-01-02 178 23
#3 2010-01-03 178 23
#4 2010-01-04 150 5
#5 2010-01-05 150 5
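The question asks whether the per-row expansion can be avoided. One possibility is a difference-array approach: add each row's daily rate at its start day, subtract it on the day after its end, and take a cumulative sum over the calendar. This is a base R sketch under the assumption that 'start' and 'end' are already Date columns; daily_total(), i_start and i_stop are hypothetical names.

```r
mydata <- data.frame(
  id = c(1, 2, 3),
  start = as.Date(c("2010-01-01", "2010-01-01", "2010-01-02")),
  end = as.Date(c("2010-01-01", "2010-01-05", "2010-01-03")),
  a = c(140, 750, 56),
  b = c(48, 25, 36)
)

# Number of days each row covers, and the full calendar of days
n_days <- as.integer(mydata$end - mydata$start) + 1L
days <- seq(min(mydata$start), max(mydata$end), by = "day")

# Position of each row's first day, and of the day after its last day
i_start <- as.integer(mydata$start - days[1]) + 1L
i_stop <- as.integer(mydata$end - days[1]) + 2L

# Hypothetical helper: sum of per-day rates over all rows covering each day
daily_total <- function(rate) {
  delta <- numeric(length(days) + 1L)
  for (k in seq_along(rate)) {
    delta[i_start[k]] <- delta[i_start[k]] + rate[k]
    delta[i_stop[k]] <- delta[i_stop[k]] - rate[k]
  }
  cumsum(delta)[seq_along(days)]
}

daily <- data.frame(
  date = days,
  a = daily_total(mydata$a / n_days),
  b = daily_total(mydata$b / n_days)
)
```

The work here grows with the number of rows plus the number of calendar days, rather than with the total number of expanded row-days, which may matter when intervals are long.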

Performing in group operations in R

I have data with 2 fields in a table sf: Customer id and Buy_date. Each Buy_date is unique for a given customer, but a customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_dates for each customer, and its mean value. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean difference between consecutive Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters, just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
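The POSIXct-based means shown earlier (23.319 and 50.479) are slightly below 70/3 and 50.5, which is consistent with a one-hour daylight-saving shift falling inside the intervals when times are interpreted in a local time zone. A minimal sketch that sidesteps this by working with Date values, so differences are whole days, might look like the following; the object name gaps is hypothetical.

```r
library(dplyr)

df <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13")
)

gaps <- df %>%
  # Date class has no time-of-day component, so DST cannot affect the gaps
  mutate(Buy_date = as.Date(Buy_date, format = "%Y/%m/%d")) %>%
  arrange(Customer, Buy_date) %>%
  group_by(Customer) %>%
  summarise(mean_days = mean(as.numeric(diff(Buy_date))))
```

This returns one row per customer in the "Customer mean" shape the question asks for.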

Average Difference between Orders dates of a product in R

I have a data set of this format
Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09
I need to find the average difference between the dates of those orders that have been placed more than once, i.e., frequency > 1.
require(dplyr)
# loading the data
df0 <- read.table(text =
'Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09',
stringsAsFactors = F,
header = T)
# putting the date in the right format
df0$Order_Dt <- as.Date(df0$Order_Dt)
# obtaining the averages
df0 %>% filter(Frequency > 1) %>%
arrange(Order_Name, Order_Dt) %>%
# group before lag() so a difference is never taken across two Order_Names
group_by(Order_Name) %>%
mutate(diff_date = Order_Dt - lag(Order_Dt)) %>%
summarise(avg_days = mean(diff_date, na.rm = T))
# A tibble: 2 × 2
Order_Name avg_days
<chr> <time>
1 A 102 days
2 C 79 days
