Create cohort dropout rate table from raw data - r

I need help creating a cohort dropout table from raw data.
I have a dataset that looks like this:
library(data.table)

DT <- data.table(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35),
  year = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
           2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))
I want to calculate the dropout rate by cohort, and get a table like this:
cohortdt <- data.table(
  cohort = c(1, 2, 3),
  drop_rate_Y1 = c(.60, 0.0, 0.0),
  drop_rate_Y2 = c(.50, .33, 0.0))
For cohort 1, the dropout rate at the end of Y1 is 60% (i.e. 60 percent of the students originally enrolled dropped out by the end of year 1: enrollment fell from 10 to 4). The value in Y2 means that 50% of those who remained at the end of year 1 dropped out at the end of year 2 (from 4 down to 2).
How can I create a table like this from the raw data?

Here is one solution:
library(tidyverse)

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
         year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
  replace_na(list(year_1_drop_rate = 0.0,
                  year_2_drop_rate = 0.0)) %>%
  select(cohort, year_1_drop_rate, year_2_drop_rate)
Which returns:
# A tibble: 3 x 3
  cohort year_1_drop_rate year_2_drop_rate
   <dbl>            <dbl>            <dbl>
1      1              0.6        0.5000000
2      2              0.0        0.3333333
3      3              0.0        0.0000000
- group by year
- count cohort within each year group
- ungroup
- spread year into the columns 2014, 2015, and 2016
- mutate twice to get the dropout rates for year 1 and year 2
- replace_na with 0
- select cohort, year_1_drop_rate, and year_2_drop_rate
This solution takes a tidy dataset and makes it untidy by spreading the year variable (i.e. 2014, 2015, and 2016 are separate columns).
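As an aside, spread() is superseded in current tidyr; a sketch of the same pipeline with pivot_wider() (assuming tidyr >= 1.0.0) would be:

library(tidyverse)

DT %>%
  count(year, cohort) %>%
  pivot_wider(names_from = year, values_from = n) %>%  # years become columns, as with spread()
  mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
         year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
  replace_na(list(year_1_drop_rate = 0, year_2_drop_rate = 0)) %>%
  select(cohort, year_1_drop_rate, year_2_drop_rate)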

I have a simple data.table solution:
x <- DT[, .N, by = .(cohort, year)]
count the number of students in each cohort each year, and store the result in a new data.table x
x[, drop := (1 - N / c(NA, N[-.N])) * 100, by = cohort]
here I take the ratio between the number of students in a year and the number of students the year before (c(NA, N[-.N]) is the lagged vector of N); that gives you the percentage of students lost each year
x[, .SD, by = cohort]
   cohort year  N     drop
1:      1 2014 10       NA
2:      1 2015  4 60.00000
3:      1 2016  2 50.00000
4:      2 2015  6       NA
5:      2 2016  4 33.33333
6:      3 2016  9       NA
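For reference, the same lag can be written with data.table's shift() (whose default is type = "lag"), avoiding the hand-built shifted vector:

x[, drop := (1 - N / shift(N)) * 100, by = cohort]  # shift(N) lags N within each cohort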
Hope it helps

Related

Cumulative sum of unique values based on multiple criteria

I've got a df with multiple columns containing information on species sightings over the years at different sites, so each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise(), but since I want to keep all columns I was trying to avoid it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame(Country = LETTERS[1:5],
                 Site = LETTERS[6:10],
                 species = 1:5,
                 Year = 1981:2010)
I would like to calculate:
1. The cumulative sum of the records in which a species has been documented within each site, in a new column "Spsum".
2. The number of different years in which each species has been seen at a particular site (this could be done as a cumulative sum as well), in a new column "nYear".
For example, if species 1 was recorded 5 times in 1981 and 2 times in 1982 at Site G, Spsum would show 7 (the cumulative sum of records) whereas nYear would show 2, as it was spotted in two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
  filter(Year > 1980) %>%
  group_by(Country, Site, species, Year) %>%
  mutate(nYear = n_distinct(Year[species %in% Site])) %>%
  ungroup()
Thanks!
This could help, without the need for a join. (As to why your attempt shows 0s: species %in% Site compares species numbers against site letters, which is never TRUE, so n_distinct() is applied to an empty vector and returns 0.)
library(data.table)  # for rowid()

df %>%
  arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = length(unique(Year))) %>%
  mutate(spsum = rowid(species))
# A tibble: 30 x 6
# Groups:   Site, species [5]
   Country Site  species  Year nYear spsum
   <chr>   <chr>   <int> <int> <int> <int>
 1 A       F           1  1981     6     1
 2 A       F           1  1986     6     2
 3 A       F           1  1991     6     3
 4 A       F           1  1996     6     4
 5 A       F           1  2001     6     5
 6 A       F           1  2006     6     6
 7 B       G           2  1982     6     1
 8 B       G           2  1987     6     2
 9 B       G           2  1992     6     3
10 B       G           2  1997     6     4
# ... with 20 more rows
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
  count(Country, Site, species, Year) %>%
  group_by(Country, Site, species) %>%
  arrange(Year) %>%
  mutate(Spsum = cumsum(n), nYear = row_number())

df %>%
  left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))
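On the toy data every Country+Site+species+Year combination occurs exactly once, so n is always 1 and Spsum happens to equal nYear; the two diverge as soon as a year contains several records. The first joined rows look like:

  Country Site species Year n Spsum nYear
1       A    F       1 1981 1     1     1
2       B    G       2 1982 1     1     1
3       C    H       3 1983 1     1     1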

How can I keep only those rows that together contain the longest consecutive run of a variable increasing by one, using dplyr in R?

I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
  subject  year
    <dbl> <dbl>
1       1  2012
2       1  2013
3       1  2015
4       1  2016
5       1  2017
6       1  2019
7       2  2011
8       2  2013
The toy example tibble can be recreated by running this code:
dat <- structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2),
                      year = c(2012, 2013, 2015, 2016, 2017, 2019, 2011, 2013)),
                 row.names = c(NA, -8L),
                 class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
  subject  year
    <dbl> <dbl>
1       1  2015
2       1  2016
3       1  2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%
  group_by(subject) %>%
  filter(rcount > 1, rcount == max(rcount)) %>%
  select(-r, -rcount) %>%
  ungroup()
# # A tibble: 3 x 2
#   subject  year
#     <dbl> <dbl>
# 1       1  2015
# 2       1  2016
# 3       1  2017
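To see how the helper r builds run ids, trace it on subject 1's years:

year <- c(2012, 2013, 2015, 2016, 2017, 2019)
diff(year) != 1                   # FALSE  TRUE FALSE FALSE  TRUE
cumsum(c(TRUE, diff(year) != 1))  # 1 1 2 2 2 3

Years 2015-2017 share run id 2, the longest run with more than one row, so the filter keeps exactly those rows.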

r: select row with highest year followed by highest month

For my data frame I want to filter the row with the highest year followed by the highest month.
Sample data frame:
df <- data.frame(ID = c(1:5),
                 year = c(2018, 2018, 2018, 2018, 2019),
                 month = c(9, 10, 11, 12, 11))
I tried the following but this does not return any record (which makes sense, because the max year and max month are in different rows). Does anybody have the answer? Obviously my desired output would be row 5.
df %>% filter(year == max(year) & month == max(month))
We can do:

df %>%
  filter(year == max(year)) %>%
  filter(month == max(month))

  ID year month
1  5 2019   11

Filtering in two steps matters: the first filter keeps only the rows of the highest year, and the second then finds the highest month within those rows, rather than comparing against the maximum month of the full data.
Another option is to sort so that the highest year, and within it the highest month, comes first, and then take the first row:
df <- df %>%
  arrange(desc(year), desc(month))

  ID year month
1  5 2019   11
2  4 2018   12
3  3 2018   11
4  2 2018   10
5  1 2018    9

df[1, ]  # first row

  ID year month
1  5 2019   11
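For what it's worth, recent dplyr can do this in one short pipe; a sketch assuming dplyr >= 1.0.0, where slice_max() defaults to keeping a single (tied) maximum:

library(dplyr)

df %>%
  slice_max(year) %>%   # rows with the highest year
  slice_max(month)      # among those, the row(s) with the highest month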

How to generate a dummy treatment variable based on values from two different variables

I would like to generate a dummy treatment variable "treatment" based on the country variable "iso" and the earthquake dummy variable "quake" (in dataset "data").
Basically, if quake == 1 at least once for a country over my entire timeframe (let's say 2000-2018), I would like all observations for that "iso" to have treatment == 1, and treatment == 0 for all other countries. So countries that are affected by earthquakes have 1 on all observations, the others 0.
I have tried using dplyr, but since I'm still very green at R, it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and Google.
I suspect the solution is something along these lines, but I can't finish it myself:
data %>%
  filter(quake == 1) %>%
  group_by(iso) %>%
  mutate(treatment)
Welcome to Stack Overflow! Here is a dplyr solution, following what you started:
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
                   iso = rep(LETTERS, times = 3),
                   quake = sample(0:1, 26 * 3, replace = TRUE))

## solution (dplyr option)
library(dplyr)
data2 <- data %>%
  arrange(iso) %>%
  group_by(iso) %>%
  mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups:   iso [26]
    year iso   quake treatment
   <int> <fct> <int>     <dbl>
 1  2000 A         0         1
 2  2001 A         1         1
 3  2002 A         1         1
 4  2000 B         1         1
 5  2001 B         1         1
 6  2002 B         0         1
 7  2000 C         0         1
 8  2001 C         0         1
 9  2002 C         1         1
10  2000 D         1         1
# ... with 68 more rows
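The same flag can also be written with any(), which reads closer to the problem statement ("at least one quake for that iso"); note it returns integer 0/1 rather than double:

data2 <- data %>%
  group_by(iso) %>%
  mutate(treatment = as.integer(any(quake == 1))) %>%  # TRUE if the country ever had a quake
  ungroup()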

How to create a CUMULATIVE dropout rate table from raw data

I'm trying to modify the solution posted above in "Create cohort dropout rate table from raw data".
I'd like to create a CUMULATIVE dropout rate table using these data.
library(data.table)

DT <- data.table(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35),
  year = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
           2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))
So far, I've been able to get to this point:

library(tidyverse)

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = (`2014` - `2015`),
         y2015_2016_dropouts = (`2015` - `2016`)) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts, y2014_2015_cumulative, y2015_2016_cumulative)
A cumulative dropout rate table reflects the proportion of students within a cohort who dropped out of school across years.
# A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA
3      3                   0                   0                  NA                    NA
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.
Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:

DT2 <- DT[, .N, by = list(cohort, year)][order(cohort, year)]

Label each row with its year range (shift() lags within each cohort, so the first year of a cohort pairs with NA):

DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]

Get dropouts per year:

DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]

Get the cumulative sum of the proportion dropped out each year per cohort (max(N) is the cohort's starting size):

DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]

Output:

> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000
Because you spread your data by year early in your pipe and your `2014` column has NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the definition of that variable from the current

y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative

with

y2015_2016_cumulative = y2015_2016_dropouts / coalesce(`2014`, `2015`) +
  coalesce(y2014_2015_cumulative, 0)

you should be good to go. The coalesce() function tries the first argument, but falls back to the second where the first is NA. That said, this method isn't very scalable: you would have to add another coalesce() for every year you added. If you keep your data in the tidy format instead, you can keep a running tally at the year-cohort level using
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,
         dropout_rate = dropouts / max(n)) %>%
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
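Run against the example DT, that pipe should yield something like:

# A tibble: 6 x 7
# Groups:   cohort [3]
   year cohort     n dropouts dropout_rate cumulative_dropouts cumulative_dropout_rate
  <dbl>  <dbl> <int>    <int>        <dbl>               <int>                   <dbl>
1  2014      1    10        0        0                       0                   0
2  2015      1     4        6        0.6                     6                   0.6
3  2016      1     2        2        0.2                     8                   0.8
4  2015      2     6        0        0                       0                   0
5  2016      2     4        2        0.333                   2                   0.333
6  2016      3     9        0        0                       0                   0

which gives the cumulative dropout rates for cohorts 2 and 3 as well (one third and zero, respectively).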
