Performing in-group operations in R

I have data in a table sf with two fields: Customer id and Buy_date. Buy_date is unique for each customer, but each customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_dates for each Customer, and the mean of those differences. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean

Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean number of days between consecutive Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
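The fractional results above (23.319 rather than 70/3, about 23.333) appear to come from a daylight-saving shift: as.POSIXct keeps clock times, so one gap per customer ends up an hour short in the answerer's time zone. A minimal base-R sketch, assuming whole calendar days are what is wanted, parses the strings with as.Date instead. The name mean_gap is introduced here for illustration only.

```r
# Same data as in the answer above
df <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13"),
  stringsAsFactors = FALSE
)

# Parse as calendar dates (no time of day, so no DST effects)
df$Buy_date <- as.Date(df$Buy_date, format = "%Y/%m/%d")

# Make sure dates are in order within each customer before differencing
df <- df[order(df$Customer, df$Buy_date), ]

# Mean gap in days between consecutive purchases, per customer
mean_gap <- sapply(split(df$Buy_date, df$Customer),
                   function(d) mean(as.numeric(diff(d))))
mean_gap
```

With Date values, the gaps for customer 1 are 18, 15 and 37 days, and for customer 2 they are 39 and 62 days, so the means are whole-day averages.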

I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters, just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
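If pulling in zoo only for the NA padding feels heavy, the same two columns can be sketched in base R with ave(). This is an alternative under the assumption that the rows are already sorted by date within each customer; it is not the answerer's method.

```r
dat <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = as.Date(c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
                       "2018/01/02", "2018/02/10", "2018/04/13"),
                     format = "%Y/%m/%d")
)

# Days since the previous purchase; the first purchase of each customer gets NA
dat$diff_between <- ave(as.numeric(dat$Buy_date), dat$Customer,
                        FUN = function(x) c(NA, diff(x)))

# Per-customer mean of those gaps, repeated on every row of the customer
dat$mean_days <- ave(dat$diff_between, dat$Customer,
                     FUN = function(x) mean(x, na.rm = TRUE))
dat
```

ave() returns a vector as long as its input, which is why it plays the role of group_by() plus mutate() here.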

Related

How to count exact matches across two data frames within IDs in R

I have two datasets similar to the one below (but with 4m observations) and I want to count the number of matching sample days between the two data frames (see example below).
DF1
ID date
1 1992-10-15
1 2010-02-17
2 2019-09-17
2 2015-08-18
3 2020-10-27
3 2020-12-23
DF2
ID date
1 1992-10-15
1 2001-04-25
1 2010-02-17
3 1990-06-22
3 2014-08-18
3 2020-10-27
Expected output
ID Count
1 2
2 0
3 1
I have tried the aggregate function (though I am unsure what to put in "which"):
test <- aggregate(date~ID, rbind(DF1, DF2), length(which(exact?)))
and the table function:
Y<-table(DF1$ID)
X <- table(DF2$ID)
Y2 <- DF1[Y %in% X,]
I am having trouble finding an example to help my situation.
Your help is appreciated!
In base R (R is case-sensitive, so the data frames are referred to as DF1 and DF2, as in the question):
data.frame(table(factor(merge(DF1, DF2)$ID, unique(DF1$ID))))
Var1 Freq
1 1 2
2 2 0
3 3 1
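The one-liner above packs three steps together. A step-by-step sketch of the same idea, using the question's data and the illustrative names matched, ids and counts (none of which appear in the original answer):

```r
DF1 <- data.frame(ID = c(1, 1, 2, 2, 3, 3),
                  date = c("1992-10-15", "2010-02-17", "2019-09-17",
                           "2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- data.frame(ID = c(1, 1, 1, 3, 3, 3),
                  date = c("1992-10-15", "2001-04-25", "2010-02-17",
                           "1990-06-22", "2014-08-18", "2020-10-27"))

# 1. Inner join on all shared columns (ID and date): keeps exact matches only
matched <- merge(DF1, DF2)

# 2. Turn ID into a factor whose levels are every ID in DF1,
#    so IDs without any match are still counted (as zero)
ids <- factor(matched$ID, levels = unique(DF1$ID))

# 3. Tabulate and convert to a data frame with readable column names
counts <- as.data.frame(table(ID = ids), responseName = "Count")
counts
```

Setting the factor levels explicitly is the step that keeps ID 2 in the result even though it has no matching dates.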
Using tidyverse
library(dplyr)
library(tidyr)
inner_join(DF1, DF2) %>%
complete(ID = unique(DF1$ID)) %>%
reframe(Freq = sum(!is.na(date)), .by = "ID")
-output
# A tibble: 3 × 2
ID Freq
<int> <int>
1 1 2
2 2 0
3 3 1
Here is one way to do it with 'dplyr' and 'tidyr':
library(dplyr)
library(tidyr)
DF1 %>%
semi_join(DF2) %>%
count(ID) %>%
complete(ID = DF1$ID,
fill = list(n = 0))
#> Joining with `by = join_by(ID, date)`
#> # A tibble: 3 × 2
#> ID n
#> <dbl> <int>
#> 1 1 2
#> 2 2 0
#> 3 3 1
data
DF1 <- tibble(ID = c(1,1,2,2,3,3),
date = c("1992-10-15", "2010-02-17", "2019-09-17",
"2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- tibble(ID = c(1,1,1,3,3,3),
date = c("1992-10-15", "2001-04-25", "2010-02-17",
"1990-06-22", "2014-08-18", "2020-10-27"))
Created on 2023-02-16 with reprex v2.0.2

Find max value for each partition in dataframe in R

I have data as follows:
ID Date1 VarA
1 2005-01-02 x
1 2021-01-02 20
1 2021-01-01 y
2 2020-12-20 No
2 2020-12-19 10
3 1998-05-01 0
Here is the R-code to reproduce the data
example = data.frame(ID = c(1,1,1,2,2,3),
Date1 = c('2005-01-02',
'2021-01-02',
'2021-01-01',
'2020-12-20',
'2020-12-19',
'1998-05-01'),
VarA = c('x','20','y','No', '10','0'))
I would like the solution to do the following:
First, flag the maximum date for each ID.
ID Date1 VarA Last_visit
1 2005-01-02 x 0
1 2021-01-02 20 1
1 2021-01-01 y 0
2 2020-12-20 No 1
2 2020-12-19 10 0
3 1998-05-01 0 1
Finally, it should retain only the rows where Last_visit = 1.
ID Date1 VarA Last_visit
1 2021-01-02 20 1
2 2020-12-20 No 1
3 1998-05-01 0 1
I am requesting the intermediate steps as well to perform a sanity check. Thanks!
We create a new column after grouping by 'ID'
library(dplyr)
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
ungroup
and then filter/slice based on the column
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
slice_max(n = 1, order_by = Last_visit) %>%
ungroup
-output
# A tibble: 3 × 4
ID Date1 VarA Last_visit
<dbl> <chr> <chr> <int>
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
Another option is to convert the 'Date1' to Date class first, then do an arrange and use distinct
example %>%
mutate(Date1 = as.Date(Date1)) %>%
arrange(ID, desc(Date1)) %>%
distinct(ID, .keep_all = TRUE) %>%
mutate(Last_visit = 1)
ID Date1 VarA Last_visit
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
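For readers who want the intermediate flag without dplyr, here is a hedged base-R sketch of the same two steps using ave(). One difference worth noting: it flags every row tied for the latest date within an ID, whereas which.max() in the answer above picks only the first. The name latest is introduced here for illustration.

```r
example <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                      Date1 = c("2005-01-02", "2021-01-02", "2021-01-01",
                                "2020-12-20", "2020-12-19", "1998-05-01"),
                      VarA = c("x", "20", "y", "No", "10", "0"))

# Work with day counts so the comparison is purely numeric
d <- as.numeric(as.Date(example$Date1))

# Step 1: flag rows whose date equals the maximum date within their ID
example$Last_visit <- as.integer(d == ave(d, example$ID, FUN = max))

# Step 2: keep only the flagged rows
latest <- example[example$Last_visit == 1, ]
latest
```

Inspecting example after step 1 gives the intermediate table requested for the sanity check.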

Rank function in R after group by

How to use R to create a rank column? Below is an example
This is what I have:
Date group
12/5/2020 A
12/5/2020 A
11/7/2020 A
11/7/2020 A
11/9/2020 B
11/9/2020 B
10/8/2020 B
This is what I want:
Date group rank
12/5/2020 A 2
12/5/2020 A 2
11/7/2020 A 1
11/7/2020 A 1
11/9/2020 B 2
11/9/2020 B 2
10/8/2020 B 1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value, so that operations on the factor are faster than on strings. That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) and a very short vector of strings, where the integers are indices into the short vector of strings. This short vector is called the factor's "levels".
library(dplyr)
group_by(dat, group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
This relies on the lexicographic sorting of the Date column, which happens to work for this data sample but will fail in general. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
and
There really are better functions for ranking, such as dplyr::dense_rank (which @akrun put in an answer first ... I was building to it, honestly):
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = dense_rank(Date)) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
We can use dense_rank after converting the 'Date' to Date class
library(dplyr)
library(lubridate)
df1 %>%
group_by(group) %>%
mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to the actual date object, arrange the data by Date and use match with unique to get rank column.
library(dplyr)
df %>%
mutate(Date = lubridate::mdy(Date)) %>%
arrange(group, Date) %>%
group_by(group) %>%
mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))
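The match-with-unique idea also works in base R without sorting the rows first, as long as the unique dates themselves are sorted. A minimal sketch, assuming month/day/year strings as in the question:

```r
dat <- data.frame(
  Date = c("12/5/2020", "12/5/2020", "11/7/2020", "11/7/2020",
           "11/9/2020", "11/9/2020", "10/8/2020"),
  group = c("A", "A", "A", "A", "B", "B", "B")
)

# Parse to Date, then to day counts so the ordering is chronological
d <- as.numeric(as.Date(dat$Date, format = "%m/%d/%Y"))

# Dense rank within each group: position of each date among the group's
# sorted distinct dates
dat$rank <- ave(d, dat$group,
                FUN = function(x) match(x, sort(unique(x))))
dat
```

Because ave() writes results back in the original row order, the output keeps the question's row order rather than the date-sorted order produced by arrange().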

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate)  # for ymd()
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
Date = ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since the first detection (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
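# aggregate() stores the two values per ID as a matrix column named Date
df12 <- aggregate(Date~ID, df1, function(x) c(as.integer(max(x) - min(x)), length(unique(x))))
df12 <- data.frame(ID = df12$ID, Ndays = df12$Date[, 1], Ndifdays = df12$Date[, 2])
df12$Prop <- df12$Ndifdays/df12$Ndays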
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', get the number of unique 'Date' values with n_distinct to create 'Ndifdays', and divide 'Ndifdays' by 'Ndays' to get 'Prop'
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294

summarize multiple dynamic columns and store results in new columns

I have the following situation.
library(lubridate)  # months(1) below is lubridate's one-month period
df <- rbind(
data.frame(thisDate = seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")),
data.frame(thisDate = seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")))
df <- cbind(df, lastMonth = as.Date(format(df$thisDate - months(1), "%Y-%m-01")))
df <- cbind(df, prod1Quantity = 1:4)
I have quantities for different days of a month for an unknown number of products. I want one column for every product containing that product's total quantity for the whole of the previous month. So the output would look like the table below, i.e. grouped by lastMonth with prod1Quantity summed. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across the question "data.table generate multiple columns and summarize them". It appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise production quantity and then joining the month of lastMonth to the month of thisDate.
The lubridate::month function is used to extract the month from a date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
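One caveat with joining on the month number alone: if the data spanned more than one year, January 2018 and January 2019 would share the key 1. A hedged base-R sketch that keys on the first day of each month instead, which carries the year with it; month_start, prev_month_start and monthly are illustrative names, not part of either answer:

```r
df <- data.frame(
  thisDate = as.Date(c("2018-01-01", "2018-01-02", "2018-02-01", "2018-02-02")),
  prod1Quantity = 1:4
)

# First day of the month each row belongs to
month_start <- as.Date(format(df$thisDate, "%Y-%m-01"))

# First day of the previous month: step back one day from the month start
# (landing in the previous month), then floor to the first of that month
prev_month_start <- as.Date(format(month_start - 1, "%Y-%m-01"))

# Total quantity per month, as a vector named by month start
monthly <- sapply(split(df$prod1Quantity, format(month_start)), sum)

# Look up the previous month's total; months with no data give NA
df$prod1prevMonth <- unname(monthly[format(prev_month_start)])
df
```

To cover an unknown number of product columns, the same lookup could be repeated over every numeric column, for example with lapply() over their names.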

Resources