How to create a CUMULATIVE dropout rate table from raw data in R

I'm trying to modify a solution posted in "Create cohort dropout rate table from raw data" (included below).
I'd like to create a CUMULATIVE dropout rate table using these data.
library(data.table)
DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
So far, I've been able to get to this point:
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)
A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.
# A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                    NA                    NA
3      3                   0                   0                    NA                    NA
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.

Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:
DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]
Assign the year range (shifting within each cohort, so ranges don't bleed across cohorts):
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]
Get dropouts per year:
DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]
Get the cumulative sum of proportion dropped out each year per cohort:
DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]
Output:
> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000
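If you prefer the wide layout from your original attempt, one hedged sketch is to cast the result (the column names come from the year ranges above, and a cohort that never saw a given range gets an NA cell):
dcast(DT2, cohort ~ year, value.var = "cumul")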

Because you spread your data by year early in your pipe and your 2014 column has NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the current definition of that variable,
y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative
to
y2015_2016_cumulative = y2015_2016_dropouts / coalesce(`2014`, `2015`) +
                        coalesce(y2014_2015_cumulative, 0)
you should be good to go. The coalesce function returns its first non-NA argument (e.g. coalesce(NA, 5) returns 5). That being said, this method isn't very scalable: you would have to add an additional coalesce argument for every year you added. If you keep your data in the tidy (long) format instead, you can keep a running tally at the year-cohort level using
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,
         dropout_rate = dropouts / max(n)) %>%
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
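From there, the headline number per cohort is just the last cumulative rate. A small sketch building on the pipe above (the column name final_cumulative_rate is mine; for this data it should come out to 0.8, 0.333, and 0 for cohorts 1, 2, and 3):
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = coalesce(lag(n) - n, 0)) %>%  # first year of a cohort has no dropouts
  summarise(final_cumulative_rate = sum(dropouts) / max(n))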

Related

How can I create a column that cumulatively adds the sum of two previous rows based on conditions?

I tried asking this question before, but it was poorly stated. This is a new attempt because I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser, that show how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
  test_data[i, ]$winner_points_sum_loop <-
    (
      test_data %>%
        dplyr::filter(winner == test_data[i, ]$winner & date < test_data[i, ]$date) %>%
        dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
      +
      test_data %>%
        dplyr::filter(loser == test_data[i, ]$winner & date < test_data[i, ]$date) %>%
        dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
    )
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions on how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the ave function; I can do it for one column, to sum a player's points as winner, but I can't figure out how to add their points as loser.
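For reference, my one-column ave attempt looks something like this (illustrative only; it counts each player's prior points as winner and relies on the rows already being in date order):
test_data$winner_sum_attempt <- ave(test_data$winner_points, test_data$winner,
                                    FUN = function(x) cumsum(x) - x)  # running total as winner, excluding the current game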
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
  unite(winner, winner, winner_points) %>%                 # unite winner columns
  unite(loser, loser, loser_points) %>%                    # unite loser columns
  gather(type, pl_pts, winner, loser, -date) %>%           # reshape
  separate(pl_pts, c("player", "points"), convert = T) %>% # separate columns
  arrange(date) %>%                                        # order dates (in case they're not)
  group_by(player) %>%                                     # for each player
  mutate(sum_points = cumsum(points) - points) %>%         # get points up to that date
  ungroup() %>%                                            # forget the grouping
  unite(pl_pts_sumpts, player, points, sum_points) %>%     # unite columns
  spread(type, pl_pts_sumpts) %>%                          # reshape
  separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
  separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
  select(winner, loser, date, winner_points, loser_points,
         winner_points_sum, loser_points_sum)              # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
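The heart of this approach is cumsum(points) - points, which gives a running total that excludes the current row; a minimal illustration on a toy vector:
cumsum(c(2, 1, 2)) - c(2, 1, 2)
# [1] 0 2 3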
I finally understood what you want, and I took the approach of getting each player's cumulative points at each point in time and then joining that back to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
  gather(end_game_status, player_id, winner, loser) %>%
  gather(which_point, how_many_points, winner_points, loser_points) %>%
  filter(
    (end_game_status == "winner" & which_point == "winner_points") |
    (end_game_status == "loser"  & which_point == "loser_points")) %>%
  arrange(date = as.Date(date)) %>%
  group_by(player_id) %>%
  mutate(cumulative_points = cumsum(how_many_points)) %>%
  mutate(cumulative_points_sofar = lag(cumulative_points, default = 0)) %>%
  select(player_id, date, cumulative_points_sofar)
output <- test_data %>%
  left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
  rename(winner_points_sum = cumulative_points_sofar) %>%
  left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
  rename(loser_points_sum = cumulative_points_sofar)
output
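As a quick check (a sketch, with the expected values taken from the question above), the new columns should reproduce the desired vectors:
all(output$winner_points_sum == c(0, 0, 1, 3, 1, 3, 5, 3, 5))  # should be TRUE
all(output$loser_points_sum  == c(0, 2, 2, 1, 4, 5, 4, 7, 4))  # should be TRUE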
The difference from the OP's previous question is that the OP now asks for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format (two value variables are reshaped simultaneously), computes the cumulative sums for each player id, and finally reshapes from long back to wide format again. In order to sum only points scored before the actual date, the points are lagged by one row within each player.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4
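The key step is shift(points, fill = 0), which lags each player's points so that a game's own points are excluded from the running sum; a minimal illustration:
library(data.table)
shift(c(2, 1, 2), fill = 0)
# [1] 0 2 1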

Create cohort dropout rate table from raw data

I need help creating a cohort dropout table from raw data.
I have a dataset that looks like this:
library(data.table)
DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
I want to calculate the dropout rate by cohort, and get a table like this:
cohortdt <- data.table(
  cohort = c(1, 2, 3),
  drop_rate_Y1 = c(.60, 0.0, 0.0),
  drop_rate_Y2 = c(.50, .33, 0.0))
For cohort 1, the dropout rate at the end of Y1 is 60% (i.e. 60 percent of students who were originally enrolled dropped out at the end of year 1). The value in Y2 means that 50% of those who remained at the end of year 1 dropped out at the end of year 2.
How can I create a table like this from the raw data?
Here is one solution:
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
         year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
  replace_na(list(year_1_drop_rate = 0.0,
                  year_2_drop_rate = 0.0)) %>%
  select(cohort, year_1_drop_rate, year_2_drop_rate)
Which returns:
# A tibble: 3 x 3
  cohort year_1_drop_rate year_2_drop_rate
   <dbl>            <dbl>            <dbl>
1      1              0.6        0.5000000
2      2              0.0        0.3333333
3      3              0.0        0.0000000
- group by year
- count cohort within year groups
- ungroup
- spread year into columns 2014, 2015, and 2016
- mutate twice to get dropout rates for year 1 and year 2
- replace_na with 0
- select cohort, year_1_drop_rate, and year_2_drop_rate
This solution takes a tidy dataset and makes it untidy by spreading the year variable (i.e. 2014, 2015, and 2016 are separate columns).
I have a simple data.table solution:
x <- DT[, .N, by = .(cohort, year)]
count the number of students each year in each cohort and store the result in a new data.table x
x[, drop := (1 - N / c(NA, N[-.N])) * 100, by = cohort]
here I take the ratio between each year's count and the previous year's count (c(NA, N[-.N]) is the lagged vector of N), which gives the percentage of students lost each year
x[, .SD, by = cohort]
cohort year N drop
1: 1 2014 10 NA
2: 1 2015 4 60.00000
3: 1 2016 2 50.00000
4: 2 2015 6 NA
5: 2 2016 4 33.33333
6: 3 2016 9 NA
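To see the lag trick on cohort 1 alone, where the yearly counts are N = c(10, 4, 2), a quick sketch (outside the data.table, length(N) plays the role of .N):
N <- c(10, 4, 2)
(1 - N / c(NA, N[-length(N)])) * 100
# [1] NA 60 50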
Hope it helps

Calculating cumulative mean of recent observations

My dataset has three features: player IDs, weeks, and points.
I want to calculate the mean of points over previous weeks, but not all past weeks: just the last 5, or fewer when the current week number is less than 5.
Example: For player_id = 5, week = 7, the result will be the average of POINTS for player_id = 5 and weeks 2, 3, 4, 5 and 6.
The following code already computes the average over all previous weeks, so I need an adaptation that uses only the 5 most recent weeks.
player_id <- c(rep(1, 30), rep(2, 30), rep(3, 30), rep(4, 30), rep(5, 30))
week <- 1:30
points <- round(runif(150, 1, 10), 0)
mydata <- data.frame(player_id = player_id, week = rep(week, 5), points)
library(dplyr)
mydata <- mydata %>%
  group_by(player_id) %>%                        # the group to perform the stat on
  arrange(week) %>%                              # order the weeks within each group
  mutate(previous_mean = cummean(points)) %>%    # for each week get the cumulative mean
  mutate(previous_mean = lag(previous_mean)) %>% # shift cumulative mean back one week
  arrange(player_id)                             # sort by player_id
HAVB's approach is great, but depending on what you want, here is another. This approach is adapted from this answer to a different question, but changed for your circumstances:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id <- c(rep(1, 30), rep(2, 30), rep(3, 30), rep(4, 30), rep(5, 30))
week <- 1:30
points <- round(runif(150, 1, 10), 0)
mydata <- data.frame(player_id = player_id, week = rep(week, 5), points)
roll_mean <- function(x, k) {
  result <- rollapplyr(x, k, mean, partial = TRUE, na.rm = TRUE)
  result[is.nan(result)] <- NA
  return(result)
}
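As a quick sanity check of the helper on its own, here it is applied to what turns out to be the first player's points in the seeded example (4, 8, 5, 9, 9, 1; see the output below):
roll_mean(c(4, 8, 5, 9, 9, 1), k = 5)
# [1] 4.000000 6.000000 5.666667 6.500000 7.000000 6.400000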
mydata <- mydata %>%
  group_by(player_id) %>%
  arrange(week) %>%
  mutate(rolling_mean = roll_mean(x = lag(points), k = 5)) %>%
  arrange(player_id)
Then we can look at a subset to show it worked:
mydata[mydata$player_id %in% 1:2 & mydata$week %in% 1:6, ]
# A tibble: 12 x 4
# Groups: player_id [2]
player_id week points rolling_mean
<dbl> <int> <dbl> <dbl>
1 1 1 4 NA
2 1 2 8 4.000000
3 1 3 5 6.000000
4 1 4 9 5.666667
5 1 5 9 6.500000
6 1 6 1 7.000000
7 2 1 10 NA
8 2 2 9 10.000000
9 2 3 7 9.500000
10 2 4 8 8.666667
11 2 5 1 8.500000
12 2 6 5 7.000000
So at each time t, rolling_mean for player i is the mean of player i's points observations at times {max(1, t - 5), ..., t - 1}.
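For instance, the week-4 value for player 1 above is just the mean of that player's first three weeks of points:
mean(c(4, 8, 5))
# [1] 5.666667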
You can use slice to select just the last 5 weeks for each group. Try this:
player_id <- c(rep(1, 30), rep(2, 30), rep(3, 30), rep(4, 30), rep(5, 30))
week <- 1:30
points <- round(runif(150, 1, 10), 0)
mydata <- data.frame(player_id = player_id, week = rep(week, 5), points)
library(dplyr)
mydata <- mydata %>%
  group_by(player_id) %>%                        # the group to perform the stat on
  arrange(week) %>%                              # order the weeks within each group
  slice((n() - 4):n()) %>%                       # "slice" the last 5 rows (weeks) of every group
  mutate(previous_mean = cummean(points)) %>%    # for each week get the cumulative mean
  mutate(previous_mean = lag(previous_mean)) %>% # shift cumulative mean back one week
  arrange(player_id)                             # sort by player_id
The line
slice((n() - 4):n())
selects rows within the range [(last row - 4) : last row], for each group
EDIT: To avoid trouble when the current week is less than 5, use an ifelse statement to validate:
mydata %>%
  group_by(player_id) %>%                        # the group to perform the stat on
  arrange(week) %>%                              # order the weeks within each group
  slice(ifelse(n() < 5, 1, n() - 4):n()) %>%     # "slice" at most the last 5 rows (weeks) of every group
  mutate(previous_mean = cummean(points)) %>%    # for each week get the cumulative mean
  mutate(previous_mean = lag(previous_mean)) %>% # shift cumulative mean back one week
  arrange(player_id)                             # sort by player_id

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO, but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and compute the mean of several variables within that group.
Here is what I am trying:
library(dplyr)
head(airquality)
target_vars <- c("Ozone", "Temp", "Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within each group, and then to compute the mean within the group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
    1  1  1  0   2
    1  1  1  1   3
    1  1  0  0   1
Then the result would be:
Month Average
    1     6/3
You can try:
library(tidyverse)
airquality %>%
  select(Month, target_vars) %>%
  gather(key, value, -Month) %>%
  group_by(Month) %>%
  summarise(n = length(unique(key)),
            Sum = sum(value, na.rm = T)) %>%
  mutate(Average = Sum / n)
# A tibble: 5 x 4
  Month     n   Sum  Average
  <int> <int> <int>    <dbl>
1     5     3  7541 2513.667
2     6     3  8343 2781.000
3     7     3 10849 3616.333
4     8     3  8974 2991.333
5     9     3  8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, and it's plain base R. The sapply function keeps the months separated by name. The sum function applied to each dataframe will not keep the column sums separate. (Correction #2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
  group_by(Month) %>%
  nest(Ozone, Temp, Solar.R, .key = newcol) %>%
  mutate(newcol = map_dbl(newcol, ~ mean(rowSums(.x, na.rm = TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. (The reason is the denominator: the first two answers divide each month's total by the number of variables, while this one divides by the number of rows in the month.) Here's some validation (at least I think) for the 5th month:
airquality %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Solar.R) %>%
  mutate(newcol = rowSums(., na.rm = TRUE)) %>%
  summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581

Check for score trend

I have customer scoring data as follows:
cust_id score_date score
1 5/1/2016 80
1 5/2/2016 83
1 5/22/2016 90
2 6/1/2016 92
2 7/2/2016 87
and I want to check each customer's score trend; that is, I'd like to check whether the customer's score increased over time (a positive trend) or not.
I thought of using something like this (with dplyr):
results <- df %>%
group_by(cust_id) %>%
.[order(-.[, 2]), ]
but I'm not so sure how to check the score's difference.
I'd like my answer set to count the number of customers with positive trend; something like:
positive_trend (number of customers)
yes 1,000
no 78
Your help will be appreciated
Using dplyr: for every cust_id we calculate the differences between consecutive rows with diff, then summarise them to count the number of positive and negative values.
library(dplyr)
df %>%
  group_by(cust_id) %>%
  mutate(difference = c(0, diff(score))) %>%
  summarise(yes = sum(difference > 0),
            no = sum(difference < 0))
# cust_id yes no
# <int> <int> <int>
#1 1 2 0
#2 2 0 1
NOTE: With this code, the first row in every group is neglected, as there is no trend at the beginning.
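If you want the single summary table from the question (counts of customers with and without a positive trend), one hedged sketch is to define a positive trend as a net increase from the first to the last score; the date format string below is an assumption based on the sample data:
df %>%
  mutate(score_date = as.Date(score_date, format = "%m/%d/%Y")) %>%
  arrange(cust_id, score_date) %>%
  group_by(cust_id) %>%
  summarise(positive_trend = last(score) > first(score)) %>%
  count(positive_trend)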
We can do this with data.table
library(data.table)
setDT(df)[, as.list(table(factor(diff(score)>0, levels = c(TRUE, FALSE),
labels = c("yes", "no")))), cust_id]
# cust_id yes no
#1: 1 2 0
#2: 2 0 1
Or using base R:
table(transform(stack(with(df, tapply(score, cust_id,
FUN = diff)))[2:1], values = values > 0))
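Broken into steps, the same base R idea reads (a sketch; the intermediate names are mine):
diffs <- with(df, tapply(score, cust_id, FUN = diff))  # per-customer score differences
long  <- stack(diffs)[2:1]                             # long data frame: ind = cust_id, values = diff
table(transform(long, values = values > 0))            # tabulate positive vs non-positive per customer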
