Efficient dplyr-style operations in R

I have these two toy example tables:
Table 1:
attendance_events <- data.frame(student_id = c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456", "RA123", "RB123", "RC123", "RA456", "RB456", "RC456"),
                                dates = c("2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02"),
                                attendance = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
                                stringsAsFactors = F)
attendance_events
student_id dates attendance
1 RA123 2020-02-01 1
2 RB123 2020-02-01 1
3 RC123 2020-02-01 1
4 RA456 2020-02-01 0
5 RB456 2020-02-01 1
6 RC456 2020-02-01 1
7 RA123 2020-02-02 0
8 RB123 2020-02-02 0
9 RC123 2020-02-02 1
10 RA456 2020-02-02 0
11 RB456 2020-02-02 0
12 RC456 2020-02-02 1
Table2:
all_students <- data.frame(student_id = c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456"),
                           school_id = c(1, 1, 1, 1, 1, 2),
                           grade_level = c(10, 10, 9, 9, 11, 11),
                           date_of_birth = c("1990-02-02", "1990-02-02", "1991-01-01", "1991-02-01", "1989-02-02", "1989-02-02"),
                           hometown = c("farm", "farm", "farm", "farm", "farm", "city"),
                           stringsAsFactors = F)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
attendance in attendance_events is 0 if the student was absent that day.
My question is: what is the most efficient way in R to find the grade_level that had the largest drop-off in attendance between "2020-02-01" and "2020-02-02"?
My code is:
library(dplyr)
library(reshape2)  # for dcast()

# Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[, c("student_id", "grade_level")], by = "student_id") %>%
  filter(attendance == 0)

# Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"

# Continuously pipe so as not to have to store and reference intermediates(?)
att_drop_alt <- att_ws_alt %>%
  filter(dates %in% c(date_from, date_to)) %>%
  group_by(grade_level, dates) %>%
  summarize(absence_bydate = n()) %>%
  dcast(grade_level ~ dates) %>%
  sapply(FUN = function(x) { x[is.na(x)] <- 0; x }) %>%
  as.data.frame() %>%
  mutate("absence_change" = .[, 3] - .[, 2]) %>%
  select(grade_level, absence_change) %>%
  arrange(desc(absence_change))
> att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
However, this feels a bit complex for what seems like a reasonably simple question. I want to see other ways R programmers could answer it, ideally with better performance, but improved readability would also be good to see.
Thanks, community!

With data.table:
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)), .(grade_level, dates)],
  on = .(student_id)][, .(attendance_change = diff(rev(V1))), .(grade_level)]
#   grade_level attendance_change
#1:          10                 2
#2:           9                 0
#3:          11                 1
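If the terse chain above is hard to parse, a more explicit data.table sketch in the same spirit (joining first, counting absences per grade_level and date, then differencing; att is a hypothetical intermediate name, and this is not benchmarked against the one-liner):
library(data.table)
att <- merge(as.data.table(attendance_events),
             as.data.table(all_students)[, .(student_id, grade_level)],
             by = "student_id")
att[, .(absences = sum(attendance == 0)), by = .(grade_level, dates)][
  order(dates), .(absence_change = diff(absences)), by = grade_level]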

I guess this is a little more concise:
left_join(attendance_events, all_students, by = "student_id") %>%
  group_by(grade_level, dates) %>%
  summarise(attendance = sum(attendance)) %>%
  group_by(grade_level) %>%
  summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#> grade_level attendance_change
#> <dbl> <dbl>
#> 1 9 0
#> 2 10 -2
#> 3 11 -1
Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line.
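For instance, a minimal sketch of that absence-counting variant (same pipeline, dplyr loaded as above, with an extra arrange() so the largest drop comes first):
left_join(attendance_events, all_students, by = "student_id") %>%
  group_by(grade_level, dates) %>%
  summarise(attendance = sum(attendance)) %>%
  group_by(grade_level) %>%
  summarize(absence_change = -diff(attendance)) %>%
  arrange(desc(absence_change))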

Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent than they were ;)
library(dplyr)
all_students %>%
  left_join(attendance_events) %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(grade_level, dates) %>%
  summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
            N = n(),
            pctAbs = NAbs / n() * 100) %>%
  arrange(dates) %>%
  mutate(change = pctAbs - lag(pctAbs)) %>%
  ungroup() %>%
  arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
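If you only want the single grade with the largest jump in percentage absence, a minimal sketch (assuming the result of the pipeline above is stored in a data frame, here hypothetically named abs_by_date, and dplyr >= 1.0 for slice_max()):
abs_by_date %>%
  filter(!is.na(change)) %>%
  slice_max(change, n = 1)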

Related

Find max value for each partition in a dataframe in R

I have data as follows:
ID Date1 VarA
1 2005-01-02 x
1 2021-01-02 20
1 2021-01-01 y
2 2020-12-20 No
2 2020-12-19 10
3 1998-05-01 0
Here is the R code to reproduce the data:
example = data.frame(ID = c(1, 1, 1, 2, 2, 3),
                     Date1 = c('2005-01-02',
                               '2021-01-02',
                               '2021-01-01',
                               '2020-12-20',
                               '2020-12-19',
                               '1998-05-01'),
                     VarA = c('x', '20', 'y', 'No', '10', '0'))
I would prefer the solution to do the following:
First, flag the maximum date in the data.
ID Date1 VarA Last_visit
1 2005-01-02 x 0
1 2021-01-02 20 1
1 2021-01-01 y 0
2 2020-12-20 No 1
2 2020-12-19 10 0
3 1998-05-01 0 1
Finally, it should retain only the rows where Last_visit = 1.
ID Date1 VarA Last_visit
1 2021-01-02 20 1
2 2020-12-20 No 1
3 1998-05-01 0 1
I am requesting the intermediate steps as well to perform a sanity check. Thanks!
We create a new column after grouping by 'ID':
library(dplyr)
example %>%
  group_by(ID) %>%
  mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
  ungroup
and then filter/slice based on the column
example %>%
  group_by(ID) %>%
  mutate(Last_visit = +(row_number() %in% which.max(as.Date(Date1)))) %>%
  slice_max(n = 1, order_by = Last_visit) %>%
  ungroup
Output:
# A tibble: 3 × 4
ID Date1 VarA Last_visit
<dbl> <chr> <chr> <int>
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
Another option is to convert 'Date1' to Date class first, then do an arrange and use distinct:
example %>%
  mutate(Date1 = as.Date(Date1)) %>%
  arrange(ID, desc(Date1)) %>%
  distinct(ID, .keep_all = TRUE) %>%
  mutate(Last_visit = 1)
ID Date1 VarA Last_visit
1 1 2021-01-02 20 1
2 2 2020-12-20 No 1
3 3 1998-05-01 0 1
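Yet another route, sketched here under the assumption of dplyr >= 1.0 (where slice_max() accepts an expression in order_by), is to keep the most recent row per ID directly and then flag it:
example %>%
  group_by(ID) %>%
  slice_max(order_by = as.Date(Date1), n = 1) %>%
  ungroup() %>%
  mutate(Last_visit = 1)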

Logic to validate data and remove noise rows

I have a big dataset of about 4 million rows.
The columns are:
Idx - dog serial number
date - date of event, YYYY-MM-DD (2016 through 2021)
Is_sterilized - 1 if the dog was sterilized and 0 if not sterilized.
Each dog can appear many times in a year; it can also appear in, say, 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year. That is, if a dog changes from Is_sterilized == 0 to Is_sterilized == 1 in a year, I count it as sterilized that year; the first year it appears sterilized counts as its year of sterilization.
The issue is that my database is not clean: some dogs go from sterilized back to not sterilized, which cannot happen, since sterilization is a one-way surgery.
For example, a dog can appear sterilized for 3 consecutive years, then one year by mistake unsterilized, and then sterilized again for 2 years.
What I'm asking is whether there is a logic by which I can estimate/count how many dogs have this wrong direction.
And if so, how can I remove those dogs from my dataset?
In the example data, Idx = A and C make sense, but B and D do not.
df_test <- data.frame(Idx = c('A', 'B', 'B', 'B', 'A', 'A', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'D', 'C', 'C', 'A'),
                      YEAR_date = as.Date(c("2016-01-01", "2016-01-29", "2017-01-01", "2016-05-01", "2016-05-06", "2016-05-01", "2016-03-03", "2016-04-22", "2018-05-05", "2017-02-01", "2021-11-12", "2019-09-13", "2019-11-12", "2019-08-17", "2011-09-01", "2011-07-05", "2021-01-05")),
                      Is_sterilized = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1))
df_test[, c("Idx", "YEAR_date", "Is_sterilized")] %>% arrange(Idx, YEAR_date)
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns; if you think anything else is relevant, please write and I'll check whether I have it.
Any hint or idea will be helpful.
Thank you in advance.
Here's some dplyr code to identify instances where a dog's sterilization went from 1 to zero:
library(dplyr)
df_test %>%
  group_by(Idx) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  ungroup()
# A tibble: 3 x 4
Idx YEAR_date Is_sterilized change
<chr> <date> <dbl> <dbl>
1 B 2017-01-01 0 -1
2 D 2021-11-12 0 -1
3 D 2019-11-12 0 -1
If you want to count the number of dogs in that list, add %>% count(Idx) at the end.
df_test %>%
  group_by(Idx) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  ungroup() %>%
  count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
Idx times_desterilized
<chr> <int>
1 B 1
2 D 2
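If the goal is then to drop those dogs entirely (one reading of "how can I remove those dogs from my dataset?"), a minimal sketch building on the same change logic, with rows ordered by date within each dog first; bad_dogs and df_clean are hypothetical names:
bad_dogs <- df_test %>%
  group_by(Idx) %>%
  arrange(YEAR_date, .by_group = TRUE) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  ungroup() %>%
  distinct(Idx)
# drop every row belonging to a dog that ever went from sterilized back to not sterilized
df_clean <- df_test %>% anti_join(bad_dogs, by = "Idx")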

Computing minimum distance between a row and all previous rows in R

I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
  df %>%
    filter(grp_name == grp.name) %>%
    filter(row_number() <= row) %>%
    mutate(last.lat = last(lat),
           last.long = last(long),
           rowid = 1:n()) %>%
    group_by(rowid) %>%
    purrrlyr::by_row(
      ~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
      .collate = 'rows',
      .to = 'min.distance'
    ) %>%
    filter(row_number() < n()) %>%
    summarise(min = min(min.distance)) %>%
    .$min
}
df_dist <-
  df %>%
  group_by(grp_name) %>%
  mutate(rowid = 1:n()) %>%
  group_by(grp_name, rowid) %>%
  purrrlyr::by_row(
    ~calc_min_distance(df, .$grp_name, .$rowid),
    .collate = 'rows',
    .to = 'min.distance'
  ) %>%
  ungroup %>%
  select(-rowid)
Suppose that distance is defined as (lat + long) of the reference row minus (lat + long) of each pairwise row before the reference row. My expected output for grp 1 is the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pairwise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through my function line by line and looking at what it does; I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
  # all pairwise distances within the group
  d = distm(x = cbind(long, lat), fun = fun)
  # blank out the diagonal and lower triangle so each column only "sees" earlier rows
  d[lower.tri(d, diag = TRUE)] = NA
  # the first row has no earlier rows; define its distance as 0
  d[1, 1] = 0
  # column-wise minimum = minimum distance from each row to any row before it
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
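As a sanity check against the simplified (long + lat) definition used for the expected output in the question, here is a small base R sketch; toy_min_dist_above is a hypothetical helper, not part of geosphere:
toy_min_dist_above <- function(long, lat) {
  s <- long + lat
  # for each row: min over (this row's sum minus every earlier row's sum); 0 for the first row
  sapply(seq_along(s), function(j) if (j == 1) 0 else min(s[j] - s[1:(j - 1)]))
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = toy_min_dist_above(long, lat))
# grp 1 gives 0, -20, -6, 6, matching the expected output in the question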

Removing rows with dates conditional to specific IDs

Basically, I have a data frame that contains IDs, Dates, VolumeX, and VolumeY.
I want to split the data frame into before and after the max VolumeY date, specific to each ID.
Ex.
df looks like (with many different IDs) :
ID Date VolX VolY
1 2018 - 02- 01 5 -
1 2018 - 03- 01 6 -
1 2018 - 08- 01 3 -
1 2018 - 10- 01 1 -
1 2017 - 02- 01 - 1
1 2014 - 10- 01 - 0
1 2014 - 11- 01 - 5
1 2018 - 02- 01 - 0
So, for the max date of VolY for every ID, I'd like to split the data frame into two, before and after that date for each ID, so as to sum VolX before and after the VolY max date.
It seems like this needs to be some kind of nested for loop. I am able to extract max dates and total volume; I'm just having a hard time selecting things out on an ID-specific basis.
Is this what you're after?
library(dplyr)
df %>%
  replace(., . == "-", NA) %>%
  mutate(Date = as.Date(gsub("\\s", "", Date))) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID, Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))) %>%
  mutate(
    sum_Volx = sum(VolX[Date != max(Date)], na.rm = T),
    sum_VolY = sum(VolY[Date != max(Date)], na.rm = T)
  ) %>%
  ungroup() %>%
  select(-Before_After)
Output:
# A tibble: 8 x 6
ID Date VolX VolY sum_Volx sum_VolY
<int> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0
2 1 2018-03-01 6 NA 14 0
3 1 2018-08-01 3 NA 14 0
4 1 2018-10-01 1 NA 14 0
5 1 2017-02-01 NA 1 0 6
6 1 2014-10-01 NA 0 0 6
7 1 2014-11-01 NA 5 0 6
8 1 2018-02-01 NA 0 0 6
You could also make separate columns for before/after, like this:
df %>%
  replace(., . == "-", NA) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID) %>%
  mutate(
    Date = as.Date(gsub("\\s", "", Date)),
    Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1])),
    sum_Volx_Before = sum(VolX[Date != max(Date) & Before_After == 0], na.rm = T),
    sum_VolY_Before = sum(VolY[Date != max(Date) & Before_After == 0], na.rm = T),
    sum_Volx_After = sum(VolX[Date != max(Date) & Before_After == 1], na.rm = T),
    sum_VolY_After = sum(VolY[Date != max(Date) & Before_After == 1], na.rm = T)
  ) %>%
  ungroup() %>%
  select(-Before_After)
Output:
# A tibble: 8 x 8
ID Date VolX VolY sum_Volx_Before sum_VolY_Before sum_Volx_After sum_VolY_After
<int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0 0 6
2 1 2018-03-01 6 NA 14 0 0 6
3 1 2018-08-01 3 NA 14 0 0 6
4 1 2018-10-01 1 NA 14 0 0 6
5 1 2017-02-01 NA 1 14 0 0 6
6 1 2014-10-01 NA 0 14 0 0 6
7 1 2014-11-01 NA 5 14 0 0 6
8 1 2018-02-01 NA 0 14 0 0 6
On the other hand, you could just create 2 separate new data frames in your environment, named Before and After, that exclude the maximum date and summarise the information, like below:
df_list <- df %>%
  replace(., . == "-", NA) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID) %>%
  mutate(
    Date = as.Date(gsub("\\s", "", Date)),
    Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))
  ) %>%
  filter(!Date == max(Date)) %>%
  group_by(ID, Before_After) %>%
  summarise(
    sum_VolX = sum(VolX, na.rm = T),
    sum_VolY = sum(VolY, na.rm = T)
  ) %>%
  split(., .$Before_After)
names(df_list) <- c("Before", "After")
list2env(df_list, envir = .GlobalEnv)
Let's go through it one by one:
first we replace the "-" signs with NA (not strictly needed, just to avoid errors later on);
afterwards we convert VolX and VolY to numeric;
then we group by ID so that everything is applied to each group separately;
afterwards we convert Date to a proper Date format;
then comes the crucial part: we calculate the Before_After flag column; first we flag with 1 every row whose previous row had the maximum date, then we take the cumulative sum of that column, so that everything up to and including the max date is 0 and everything after it is 1 (a small sketch of this flag follows after the output below);
then we filter out the maximum Date;
we group again by ID and the Before_After indicator;
we shrink the data frame with summarise so that it only contains the sums of the respective columns;
we turn the data frame into 2 different ones by splitting on the Before_After column;
as the obtained result is a list of 2 data frames, we need to get them into the global environment, so we first assign names to each one and then turn them into 'proper' data frames.
Output:
Before
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 0 14 0
After
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 1 0 6
Note that 0 corresponds to Before and 1 to After.
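To see what that Before_After flag actually does (the crucial step described above), here is a tiny sketch on a toy date vector in the same spirit as df, using dplyr's lag(); the values are only illustrative:
library(dplyr)
d <- as.Date(c("2018-02-01", "2018-03-01", "2018-10-01", "2017-02-01"))
+(d == max(d))                          # 0 0 1 0 -> marks the max-date row
lag(+(d == max(d)))[-1]                 # 0 0 1   -> shifted down one row, leading NA dropped
cumsum(c(0, lag(+(d == max(d)))[-1]))   # 0 0 0 1 -> rows after the max date are flagged with 1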

Right join with dplyr and make rows into columns

I would like to right join data1 and data2 by ProductCode, and I need to get the desired output table below.
data1=data.frame(ProductCode=c(1,1,1,2,2,3),region=c("A","A","A","B","B","C"))
data1
ProductCode region
1 A
1 A
1 A
2 B
2 B
3 C
data2=data.frame(ProductCode=c(1,1,1,2,2,3),Period=c("promo1","promo2"
,"promo3","promo2","promo3","promo1"),promosales=c(15,12,7,18,20,2))
data2
ProductCode Period promosales
1 promo1 15
1 promo2 12
1 promo3 7
2 promo2 18
2 promo3 20
3 promo1 2
Desired output table
ProductCode region Promo1_sales Promo2_sales Promo3_sales
1 A 15 12 7
2 B 18 20 0
3 C 2 0 0
If I do it with SQL, I have to group by afterwards and take the max of each promo column per ProductCode:
sqldf("select a.*,
case when Period='promo1' then b.promosales else 0 end as
Promo1_sales1,
case when Period='promo2' then b.promosales else 0 end as
Promo1_sales2,
case when Period='promo3' then b.promosales else 0 end as
Promo1_sales3,
case when Period='promo4' then b.promosales else 0 end as
Promo1_sales4
from data1 a
left join data2 b on a.ProductCode=b.ProductCode
")
Can I do it with dplyr or anything else?
Thank you.
Not sure this will work in your general case, but you can do:
data1 <- data.frame(ProductCode = c(1, 1, 1, 2, 2, 3),
                    region = c(rep('A', 3), rep('B', 2), 'C'))
data2 <- data.frame(ProductCode = c(1, 1, 1, 2, 2, 3),
                    Period = c("promo1", "promo2", "promo3", "promo2", "promo3", "promo1"),
                    promosales = c(15, 12, 7, 18, 20, 2))
library(dplyr)
library(tidyr)
data1 %>%
  distinct() %>%
  inner_join(data2, by = 'ProductCode') %>%
  group_by(ProductCode) %>%
  mutate(rownr = paste0('Promo', row_number(), '_sales')) %>%
  select(-Period) %>%
  spread(rownr, promosales, fill = 0)
#> # A tibble: 3 x 5
#> # Groups: ProductCode [3]
#> ProductCode region Promo1_sales Promo2_sales Promo3_sales
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 A 15 12 7
#> 2 2 B 18 20 0
#> 3 3 C 2 0 0
A simpler and arguably better approach:
data1 %>%
  distinct() %>%
  inner_join(data2, by = 'ProductCode') %>%
  group_by(ProductCode) %>%
  spread(Period, promosales, fill = 0)
#> # A tibble: 3 x 5
#> # Groups: ProductCode [3]
#> ProductCode region promo1 promo2 promo3
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 A 15 12 7
#> 2 2 B 0 18 20
#> 3 3 C 2 0 0
Created on 2018-05-23 by the reprex package (v0.2.0).
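As an aside, spread() has since been superseded by pivot_wider() in tidyr; a roughly equivalent sketch (assuming tidyr >= 1.1, where a scalar values_fill is accepted) would be:
library(dplyr)
library(tidyr)
data1 %>%
  distinct() %>%
  inner_join(data2, by = 'ProductCode') %>%
  pivot_wider(names_from = Period, values_from = promosales, values_fill = 0)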
