Find max value for each partition in dataframe in R - r

I have data like this:
ID Date1 VarA
1 2005-01-02 x
1 2021-01-02 20
1 2021-01-01 y
2 2020-12-20 No
2 2020-12-19 10
3 1998-05-01 0
Here is the R-code to reproduce the data
example = data.frame(ID = c(1,1,1,2,2,3),
Date1 = c('2005-01-02',
'2021-01-02',
'2021-01-01',
'2020-12-20',
'2020-12-19',
'1998-05-01'),
VarA = c('x','20','y','No', '10','0'))
I would prefer the solution to do the following:
First, flag the maximum date for each ID in the data.
ID Date1 VarA Last_visit
1 2005-01-02 x 0
1 2021-01-02 20 1
1 2021-01-01 y 0
2 2020-12-20 No 1
2 2020-12-19 10 0
3 1998-05-01 0 1
Finally, it should retain only the rows where Last_visit = 1.
ID Date1 VarA Last_visit
1 2021-01-02 20 1
2 2020-12-20 No 1
3 1998-05-01 0 1
I am requesting the intermediate steps as well to perform a sanity check. Thanks!

We create a new column after grouping by 'ID'
library(dplyr)
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
ungroup
and then filter/slice based on the column
example %>%
group_by(ID) %>%
mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
slice_max(n = 1, order_by = Last_visit) %>%
ungroup
-output
# A tibble: 3 × 4
ID Date1 VarA Last_visit
<dbl> <chr> <chr> <int>
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
Another option is to convert the 'Date1' to Date class first, then do an arrange and use distinct
example %>%
mutate(Date1 = as.Date(Date1)) %>%
arrange(ID, desc(Date1)) %>%
distinct(ID, .keep_all = TRUE) %>%
mutate(Last_visit = 1)
ID Date1 VarA Last_visit
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
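For the sanity check mentioned in the question, the same two steps can also be sketched in base R without any packages. This is only an illustration: the helper names d, max_by_id and last_visits are made up here, and unlike which.max() it flags every row that ties for the latest date within an ID.

```r
# Rebuild the example data from the question
example <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                      Date1 = c("2005-01-02", "2021-01-02", "2021-01-01",
                                "2020-12-20", "2020-12-19", "1998-05-01"),
                      VarA = c("x", "20", "y", "No", "10", "0"),
                      stringsAsFactors = FALSE)

# Step 1: flag the latest date within each ID
d <- as.Date(example$Date1)
max_by_id <- ave(as.numeric(d), example$ID, FUN = max)  # group max, repeated per row
example$Last_visit <- as.integer(as.numeric(d) == max_by_id)

# Step 2: keep only the flagged rows
last_visits <- example[example$Last_visit == 1L, ]
```

Printing example after step 1 gives the intermediate table, and last_visits is the final result.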

Related

How to filter by multiple range of dates in R?

Thank you, experts for previous answers (How to filter by range of dates in R?)
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if there are already 3 observations between 12/01/2021 and 12/02/2021, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
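df <- df %>%
mutate(day = dmy(date)) %>%  # column is called 'date' in the example data
group_by(id) %>%             # grouping column is called 'id' in the example data
arrange(day, .by_group = TRUE) %>%
mutate(diff = day - first(day)) %>%
mutate(row = row_number()) %>%
filter(row <= 3 | !diff < 30)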
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
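If the intent is fixed 30-day periods counted from each id's first date, a minimal base R sketch could look like the following. The column names bin and rank_in_bin are made up for illustration, and this uses fixed bins anchored at the first observation, which is an assumption about what the question wants rather than a true rolling window.

```r
# Same data as above, built directly with Date values
df <- data.frame(
  id = "a", q = 1L,
  date = as.Date(c("01/01/2021", "01/01/2021", "21/01/2021", "21/01/2021",
                   "12/02/2021", "12/02/2021", "12/02/2021", "12/02/2021"),
                 format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)
df <- df[order(df$id, df$date), ]

# Days elapsed since the first observation of each id
days_since_first <- as.integer(df$date) -
  ave(as.integer(df$date), df$id, FUN = min)

# Consecutive 30-day bins: days 0-29 -> bin 0, days 30-59 -> bin 1, ...
df$bin <- days_since_first %/% 30L

# Position of each row inside its (id, bin) combination
df$rank_in_bin <- ave(seq_along(df$date), df$id, df$bin, FUN = seq_along)

# Keep at most 3 observations per id and bin
kept <- df[df$rank_in_bin <= 3, c("id", "q", "date")]
```

On the example data this keeps two rows for 2021-01-01, one for 2021-01-21 and three for 2021-02-12.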

Efficient dplyr-style operations

I have these two toy example tables:
Table 1:
attendance_events <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456","RC456","RA123","RB123","RC123","RA456","RB456","RC456"),
dates = c("2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"),
attendance = c(1,1,1,0,1,1,0,0,1,0,0,1),
stringsAsFactors = F)
attendance_events
student_id dates attendance
1 RA123 2020-02-01 1
2 RB123 2020-02-01 1
3 RC123 2020-02-01 1
4 RA456 2020-02-01 0
5 RB456 2020-02-01 1
6 RC456 2020-02-01 1
7 RA123 2020-02-02 0
8 RB123 2020-02-02 0
9 RC123 2020-02-02 1
10 RA456 2020-02-02 0
11 RB456 2020-02-02 0
12 RC456 2020-02-02 1
Table 2:
all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'),
school_id = c(1,1,1,1,1,2),
grade_level = c(10,10,9,9,11,11),
date_of_birth = c("1990-02-02","1990-02-02","1991-01-01","1991-02-01","1989-02-02","1989-02-02"),
hometown = c("farm","farm","farm","farm","farm","city"),
stringsAsFactors = F)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
attendance in attendance_events is 0 if the student was absent that day.
My question is: what is the most efficient way in R to find the grade_level that had the largest drop-off in attendance between "2020-02-01" and "2020-02-02"?
My code is:
#Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[,c("student_id","grade_level")], by = "student_id") %>%
filter(attendance == 0)
#Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"
#Continuously pipe to not have to store and reference(?)
att_drop_alt <- att_ws_alt %>%
filter(dates %in% c(date_from, date_to)) %>%
group_by(grade_level,dates) %>%
summarize(absence_bydate = n()) %>%
dcast(grade_level ~ dates) %>%
sapply(FUN = function(x) { x[is.na(x)] <- 0; x}) %>%
as.data.frame() %>%
mutate("absence_change" = .[,3] - .[,2]) %>%
select(grade_level, absence_change) %>%
arrange(desc(absence_change))
>att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
However, this feels a bit complex for what seems like a reasonably simple question. I want to see other ways R programmers could answer this question, ideally for better performance but even readability would be good to see.
Thanks community!
With data.table
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)),
.(grade_level, dates)], on = .(student_id)][,
.(attendance_change = diff(rev(V1))), .(grade_level)]
# grade_level attendance_change
#1: 10 2
#2: 9 0
#3: 11 1
I guess this is a little more concise:
left_join(attendance_events, all_students, by = "student_id") %>%
group_by(grade_level, dates) %>%
summarise(attendance = sum(attendance)) %>%
group_by(grade_level) %>%
summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#> grade_level attendance_change
#> <dbl> <dbl>
#> 1 9 0
#> 2 10 -2
#> 3 11 -1
Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line.
Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent than they were ;)
library(dplyr)
all_students %>%
left_join(attendance_events) %>%
mutate(dates = as.Date(dates)) %>%
group_by(grade_level, dates) %>%
summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
N = n(),
pctAbs = NAbs / n() * 100) %>%
arrange(dates) %>%
mutate(change = pctAbs - lag(pctAbs)) %>%
ungroup() %>%
arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
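For comparison, the absence counts from the question can also be cross-checked with base R only. This is a sketch under the assumption that every grade has rows on both dates (true for the toy data); the names merged, counts and absence_change are just illustrative.

```r
attendance_events <- data.frame(
  student_id = rep(c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456"), 2),
  dates = rep(c("2020-02-01", "2020-02-02"), each = 6),
  attendance = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)
all_students <- data.frame(
  student_id = c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456"),
  grade_level = c(10, 10, 9, 9, 11, 11),
  stringsAsFactors = FALSE
)

# Attach the grade level to every attendance record
merged <- merge(attendance_events, all_students, by = "student_id")
merged$absent <- as.integer(merged$attendance == 0)

# Absences per grade (rows) and date (columns)
counts <- tapply(merged$absent, list(merged$grade_level, merged$dates), sum)

# Increase in absences from the first date to the second, largest first
absence_change <- sort(counts[, "2020-02-02"] - counts[, "2020-02-01"],
                       decreasing = TRUE)
```

Here absence_change is a named vector whose names are the grade levels, so the first element is the grade with the largest increase in absences.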

Count number in group, but restart for non-consecutive dates

I have data that looks like this:
sample <- data.frame(
group = c("A","A","A","B","B","B"),
date = c(as.Date("2014-12-31"),
as.Date("2015-01-31"),
as.Date("2015-02-28"),
as.Date("2015-01-31"),
as.Date("2015-03-31"),
as.Date("2015-04-30")),
obs = c(100, 200, 300, 50, 100, 150)
)
Note that the date variable always takes the last date of the month. In table format, the data looks like this:
group date obs
1 A 2014-12-31 100
2 A 2015-01-31 200
3 A 2015-02-28 300
4 B 2015-01-31 50
5 B 2015-03-31 100
6 B 2015-04-30 150
I want to create a fourth column that counts the number of observations in the group. HOWEVER, I want the count to start over if a month doesn't immediately follow the month before. This is what I want it to look like:
group date obs num
1 A 2014-12-31 100 1
2 A 2015-01-31 200 2
3 A 2015-02-28 300 3
4 B 2015-01-31 50 1
5 B 2015-03-31 100 1
6 B 2015-04-30 150 2
So far all I can get is the following:
library(tidyverse)
sample <- sample %>%
arrange(date) %>%
group_by(group) %>%
mutate(num = row_number())
group date obs num
1 A 2014-12-31 100 1
2 A 2015-01-31 200 2
3 A 2015-02-28 300 3
4 B 2015-01-31 50 1
5 B 2015-03-31 100 2
6 B 2015-04-30 150 3
Any help would be much appreciated. I also want to be able to do the same thing but with quarterly data (instead of monthly).
We can use lubridate::days_in_month to get the number of days in each month and compare it with the difference between the current and previous date. Whenever the two differ, a new subgroup starts. We can then assign row_number() within each subgroup.
library(dplyr)
sample %>%
group_by(group) %>%
mutate(diff_days = cumsum(as.numeric(date - lag(date, default = first(date))) !=
lubridate::days_in_month(date))) %>%
group_by(diff_days, .add = TRUE) %>%
mutate(num = row_number()) %>%
ungroup() %>%
select(-diff_days)
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2014-12-31 100 1
#2 A 2015-01-31 200 2
#3 A 2015-02-28 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2
We can create a grouping variable from the difference between the months of consecutive 'date' values: a new group starts whenever that difference is not equal to 1, i.e. not exactly one month.
library(dplyr)
library(lubridate)
sample %>%
arrange(group, date) %>%
group_by(group, mth = cumsum(c(TRUE, diff(month(date)) != 1))) %>%
mutate(num = row_number()) %>%
ungroup %>%
select(-mth)
# A tibble: 6 x 4
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2015-01-31 100 1
#2 A 2015-02-28 200 2
#3 A 2015-03-31 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2
If the year also needs to be considered
library(zoo)
sample %>%
arrange(group, date) %>%
mutate(yearmon = as.yearmon(date)) %>%
group_by(group) %>%
group_by(grp = cumsum(c(TRUE, as.integer(diff(yearmon) * 12)> 1)),
.add = TRUE ) %>%
mutate(num = row_number()) %>%
ungroup %>%
select(-grp, -yearmon)
# A tibble: 6 x 4
# group date obs num
# <fct> <date> <dbl> <int>
#1 A 2015-01-31 100 1
#2 A 2015-02-28 200 2
#3 A 2015-03-31 300 3
#4 B 2015-01-31 50 1
#5 B 2015-03-31 100 1
#6 B 2015-04-30 150 2
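Since the question also asks about quarterly data, here is a base R sketch that makes the step size a parameter. It turns each date into a running month index (year * 12 + month), so December to January counts as one step. The function name count_runs and its step argument are hypothetical, introduced only for this illustration.

```r
sample_df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2014-12-31", "2015-01-31", "2015-02-28",
                   "2015-01-31", "2015-03-31", "2015-04-30")),
  obs = c(100, 200, 300, 50, 100, 150),
  stringsAsFactors = FALSE
)

# step = 1 for monthly data, step = 3 for quarterly data
count_runs <- function(df, step = 1L) {
  df <- df[order(df$group, df$date), ]
  lt <- as.POSIXlt(df$date)
  month_index <- (lt$year + 1900L) * 12L + lt$mon  # year-aware month number

  # A new run starts on the first row, when the gap is not exactly `step`
  # months, or when the group changes
  new_run <- c(TRUE,
               diff(month_index) != step |
                 df$group[-1] != df$group[-nrow(df)])
  run_id <- cumsum(new_run)

  # Count rows within each run
  df$num <- ave(seq_along(run_id), run_id, FUN = seq_along)
  df
}

result <- count_runs(sample_df)
```

For quarterly data with quarter-end dates, count_runs(sample_df, step = 3L) applies the same logic with a three-month step.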

Removing rows with dates conditional to specific IDs

Basically, I have a data frame that contains IDs, Dates, VolumeX, and VolumeY.
I want to split the VolumeX data frame into before and after the max date of VolumeY specific to an ID.
Ex.
df looks like (with many different IDs) :
ID Date VolX VolY
1 2018 - 02- 01 5 -
1 2018 - 03- 01 6 -
1 2018 - 08- 01 3 -
1 2018 - 10- 01 1 -
1 2017 - 02- 01 - 1
1 2014 - 10- 01 - 0
1 2014 - 11- 01 - 5
1 2018 - 02- 01 - 0
So for the max date of VolY for every ID, I'd like to split the data frame into two: before and after that date for each ID, so as to sum VolX before and after the VolY max date.
Seems like this needs to be some kind of nested for loop. I am able to extract max dates and total volume... just having a hard time selecting out ID-specific
Is this what you're after?
library(dplyr)
df %>%
replace(., . == "-", NA) %>%
mutate(Date = as.Date(gsub("\\s", "", Date))) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID, Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))) %>%
mutate(
sum_Volx = sum(VolX[Date != max(Date)], na.rm = T),
sum_VolY = sum(VolY[Date != max(Date)], na.rm = T)
) %>% ungroup() %>% select(-Before_After)
Output:
# A tibble: 8 x 6
ID Date VolX VolY sum_Volx sum_VolY
<int> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0
2 1 2018-03-01 6 NA 14 0
3 1 2018-08-01 3 NA 14 0
4 1 2018-10-01 1 NA 14 0
5 1 2017-02-01 NA 1 0 6
6 1 2014-10-01 NA 0 0 6
7 1 2014-11-01 NA 5 0 6
8 1 2018-02-01 NA 0 0 6
You could also make separate columns for before/after, like this:
df %>%
replace(., . == "-", NA) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID) %>%
mutate(
Date = as.Date(gsub("\\s", "", Date)),
Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1])),
sum_Volx_Before = sum(VolX[Date != max(Date) & Before_After == 0], na.rm = T),
sum_VolY_Before = sum(VolY[Date != max(Date) & Before_After == 0], na.rm = T),
sum_Volx_After = sum(VolX[Date != max(Date) & Before_After == 1], na.rm = T),
sum_VolY_After = sum(VolY[Date != max(Date) & Before_After == 1], na.rm = T)
) %>% ungroup() %>% select(-Before_After)
Output:
# A tibble: 8 x 8
ID Date VolX VolY sum_Volx_Before sum_VolY_Before sum_Volx_After sum_VolY_After
<int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0 0 6
2 1 2018-03-01 6 NA 14 0 0 6
3 1 2018-08-01 3 NA 14 0 0 6
4 1 2018-10-01 1 NA 14 0 0 6
5 1 2017-02-01 NA 1 14 0 0 6
6 1 2014-10-01 NA 0 14 0 0 6
7 1 2014-11-01 NA 5 14 0 0 6
8 1 2018-02-01 NA 0 14 0 0 6
On the other hand, you could just create 2 separate new data frames in your environment, named Before and After, that literally exclude the maximum date and summarise the information, like below:
df_list <- df %>%
replace(., . == "-", NA) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID) %>%
mutate(
Date = as.Date(gsub("\\s", "", Date)),
Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))
) %>%
filter(!Date == max(Date)) %>%
group_by(ID, Before_After) %>%
summarise(
sum_VolX = sum(VolX, na.rm = T),
sum_VolY = sum(VolY, na.rm = T)
) %>%
split(., .$Before_After)
names(df_list) <- c("Before", "After")
list2env(df_list, envir = .GlobalEnv)
Let's go through one-by-one:
first we replace the - signs by NA (not strictly needed, just to avoid errors later on);
afterwards we transform VolX and VolY into numeric;
then we group by ID so that everything is applied to each group separately;
afterwards we transform the Date into a proper Date format;
then it is the crucial part: we calculate the flag Before_After column where first we flag with 1 if in the previous row the maximum date was observed; afterwards we calculate a cumulative sum of such column, so that everything before this event is 0 and everything after 1;
then we filter out the maximum Date;
we group again by ID and Before_After indicator;
we shrink the data frame with summarise so that it only contains the sum of the respective columns;
we turn the data frame into 2 different ones by splitting on Before_After column;
as the obtained result is a list of 2 data frames, we need to get them into global environment, so first we assign the names to each one and then we turn them into 'proper' data frames.
Output:
Before
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 0 14 0
After
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 1 0 6
Note that 0 corresponds to Before and 1 to After.
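A different reading of the question is that the cutoff should be the latest date on which VolY was actually recorded for an ID, and that rows should be compared with that cutoff by date rather than by row order. Under that assumption, a base R sketch could look like this. The cutoff and period columns and the "on_cutoff" label are illustrative choices, and the sketch assumes every ID has at least one VolY value.

```r
df <- data.frame(
  ID = 1L,
  Date = as.Date(c("2018-02-01", "2018-03-01", "2018-08-01", "2018-10-01",
                   "2017-02-01", "2014-10-01", "2014-11-01", "2018-02-01")),
  VolX = c(5, 6, 3, 1, NA, NA, NA, NA),
  VolY = c(NA, NA, NA, NA, 1, 0, 5, 0)
)

# Latest date per ID on which VolY is present, repeated on every row of that ID
y_days <- ifelse(!is.na(df$VolY), as.integer(df$Date), NA_integer_)
cutoff_days <- ave(y_days, df$ID, FUN = function(d) max(d, na.rm = TRUE))
df$cutoff <- as.Date(cutoff_days, origin = "1970-01-01")

# Classify every row relative to its ID-specific cutoff
df$period <- ifelse(df$Date < df$cutoff, "before",
                    ifelse(df$Date > df$cutoff, "after", "on_cutoff"))

# Sum VolX per ID and period (rows = ID, columns = period)
volx_sums <- tapply(df$VolX, list(df$ID, df$period), sum, na.rm = TRUE)
```

With the example data the cutoff for ID 1 is 2018-02-01, so the VolX value of 5 recorded on that day lands in its own "on_cutoff" column; whether it belongs to "before" or "after" is a decision the question leaves open.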

Performing in group operations in R

I have data with 2 fields in a table sf: Customer id and Buy_date. Each Buy_date is unique within a customer, but there can be more than 3 different Buy_date values per customer. I want to calculate the difference between consecutive Buy_date values for each Customer, and the mean of those differences. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean number of days between consecutive Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
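The two answers above report slightly different means (23.32 vs 23.3 days for customer 1, 50.48 vs 50.5 for customer 2). The gap is exactly one hour spread over the intervals, which is consistent with a daylight-saving shift being included when the dates are converted with as.POSIXct. Working with Date values avoids that. A minimal base R sketch, where mean_gap is just an illustrative name:

```r
df <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13"),
  stringsAsFactors = FALSE
)

# Parse as calendar dates (no time of day, so no time-zone effects)
df$Buy_date <- as.Date(df$Buy_date, format = "%Y/%m/%d")
df <- df[order(df$Customer, df$Buy_date), ]

# Mean number of days between consecutive purchases, per customer
mean_gap <- tapply(as.numeric(df$Buy_date), df$Customer,
                   function(d) mean(diff(d)))
```

mean_gap is a named vector indexed by Customer; for customer 1 the gaps are 18, 15 and 37 days, and for customer 2 they are 39 and 62 days.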

Resources