Measure retention in R

I have a dataset containing information on students enrolled in an after-school program in the following format:
student_id year group number
1 2009-10 1
2 2009-10 1
3 2009-10 2
4 2009-10 3
5 2009-10 3
1 2010-11 1
2 2010-11 2
3 2010-11 3
4 2010-11 2
5 2010-11 2
I want to measure retention for each group on a per-year basis. I need to write some kind of loop that looks back at the previous year, counts the IDs that also appear in each group, and returns that count divided by the total number of students in that group. I have sketched out code (which is probably inefficient and missing some steps) as follows:
for (i in levels(data$year)){
if (i=="2009-10"){
#no previous year to look for
next
}else{
for(g in levels(data$group)){
##perhaps a plyr summarize function?
#look for id in previous year for that group
#compute count of identical ids
#return value/length(group)
}
}
}
edit
After reading some suggestions, perhaps it would be simpler to use ddply with transform. Is there a way to create an associative relationship between the year and group number? The code would look something like this:
tracking=ddply(data,"student_id", transform, enroll.year1=1, enroll.year2=ifelse(criteria goes here,1,0), enroll.year3=ifelse(criteria goes here,1,0))
Some sample output might look like this:
Year Group retention rate
2010-11 1 0.88
2011-12 1 0.8
2010-11 2 0.5
2011-12 2 0.6
2010-11 3 0.5
2011-12 3 0.5
Has anyone solved a similar retention problem before? I'm having difficulty conceptualizing the steps, let alone implementing them in R. Any help would be greatly appreciated.

One solution I can think of involves a self-join shifting one year. I will be using data.table for the simplicity of both joining and the grouping required. I'll also be changing some names and the year format for convenience. I have saved your data in a data.frame called dd:
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
require(data.table)
dd <- data.table(dd)
setkey(dd, group, year)
dd.prev <- data.table(dd.prev)
setkey(dd.prev, group, year)
setnames(dd.prev, 'id', 'id.prev') ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
dd.prev <- dd.prev[!year==extra.year] ## drop extra year as retention not defined
dd.all <- dd[dd.prev] ## join data from previous year to current data
dd.all[, retention:=as.numeric(length(intersect(id, id.prev))
/length(unique(id.prev))),
by=list(year, group)] ## one retention value per group and year
That last bit computes the retention rate as you defined it: number of students who still remain from last year, intersect(id, id.prev), divided by the total number of students last year, unique(id.prev). With this data it only generates the retention for 2010, but with a longer series it would generate it for all years except the first.
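As a sanity check, the same definition can be computed directly in base R on the sample data. This is only a sketch: `sample_dd` and `retention_for` are hypothetical names, not part of the code above, and it assumes the year has already been reduced to its integer start year.

```r
# Sample data from the question, with shortened column names
sample_dd <- data.frame(
  id    = rep(1:5, 2),
  year  = rep(c(2009L, 2010L), each = 5),
  group = c(1, 1, 2, 3, 3, 1, 2, 3, 2, 2)
)

# Hypothetical helper: share of last year's students in group g
# who are still in group g in year y
retention_for <- function(g, y, data) {
  prev <- unique(data$id[data$group == g & data$year == y - 1])
  curr <- unique(data$id[data$group == g & data$year == y])
  if (length(prev) == 0) return(NA_real_)  # no previous cohort: undefined
  length(intersect(curr, prev)) / length(prev)
}

# Every group in every year except the first
combos <- expand.grid(group = sort(unique(sample_dd$group)),
                      year  = sort(unique(sample_dd$year))[-1])
combos$retention <- mapply(retention_for, combos$group, combos$year,
                           MoreArgs = list(data = sample_dd))
combos
```

On the five sample students this gives 0.5 for group 1 (student 1 stays, student 2 leaves) and 0 for groups 2 and 3, which is a handy reference when checking the joined versions.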
UPDATE 1: Using plyr
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
names(dd.prev)[1] <- 'id.prev' ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
dd.prev <- dd.prev[dd.prev$year!=extra.year,] ## drop extra year
dd <- merge(dd, dd.prev, all.y=TRUE) ## join data from previous year to current data
require(plyr)
dd <- ddply(dd, .(group, year), summarize,
retention=length(intersect(id, id.prev))
/length(unique(id.prev)))
I hope that helps.

Related

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with a missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). For example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
df$date <- as.Date(gsub("99","01",as.character(df$date)), format="%d/%m/%Y")
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
# id date dateorig
#1 1 2014-10-14 99/10/2014
#2 2 2011-02-05 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-19 99/04/2009
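One thing to watch: sample(n, 1) draws from 1 to n, so adding that offset to the first day of a month can, as far as I can tell, occasionally land on the first day of the following month. A minimal sketch of a helper that keeps the draw inside the period is below; `random_in_period` is a hypothetical name, and it draws uniformly over the whole period rather than only from its middle.

```r
# Hypothetical helper: one random date within the month or year starting at `start`
random_in_period <- function(start, period = c("month", "year")) {
  period <- match.arg(period)
  end <- seq(start, length.out = 2, by = period)[2]  # first day of the next period
  n_days <- as.integer(end - start)                  # days in this period
  start + sample.int(n_days, 1) - 1                  # offsets 0 .. n_days - 1
}

set.seed(1)
d_oct  <- random_in_period(as.Date("2014-10-01"), "month")  # some day in October 2014
d_2011 <- random_in_period(as.Date("2011-01-01"), "year")   # some day in 2011
```

Because the period length comes from seq(), the helper adapts to 30-day and 31-day months and to leap years without any lookup table.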

Longest consecutive period above threshold using rle and for loop

I have four years of streamflow data for one month and I'm trying to figure out how to extract the longest consecutive period at or above a certain threshold for each of the four years. In the example below, the threshold is 4. I want to try to accomplish this using a for loop or possibly one of the apply functions, but I'm not sure how to go about it.
Here's my example dataframe:
year <- c(rep(2009,31), rep(2010, 31), rep(2011, 31), rep(2012, 31))
day<-c(rep(seq(1:31),4))
discharge <- c(4,4,4,5,6,5,4,8,4,5,3,8,8,8,8,8,8,8,1,2,2,8,8,8,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,10,3,3,3,3,3,3,1,1,3,8,8,8,8,8,8,8,8,8,1,2,2,8,8,3,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,3,3,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,3)
df<-data.frame(cbind(year, day, discharge))
df$threshold<-ifelse(discharge>=4,1,0)
In this example, the threshold column is coded as 1 if the discharge is at or above the threshold and 0 if not. I'm able to partially get my desired output for one year (2009 in the example below), with the following code:
rl2009<-with(subset(df,year==2009),rle(threshold))
cs2009 <- cumsum(rl2009$lengths)
index2009<-cbind(cs2009[rl2009$values == 1] - rl2009$lengths[rl2009$values == 1] + 1,
cs2009[rl2009$values == 1])
df2009<-data.frame(index2009)
df2009 #output all periods when flow is above threshold
df2009$X3<-df2009$X2-df2009$X1+1
max2009<-df2009[which.max(df2009$X3),]
max2009 #output the first and longest period when flow is above threshold
For 2009, there are three time periods when the discharge equals or exceeds 4, but the period from day 1 to day 10 is chosen because it is the first of the longest periods above the threshold. X1 represents the start of the time period, X2 the end of the time period, and X3 the number of days in the time period. If there is more than one period with the same number of days, I want to select the first of such periods.
My desired output for all four years is below:
year X1 X2 X3
2009 1 10 10
2010 9 31 23
2011 10 18 9
2012 12 30 19
The actual data includes many more years and many streams, so it's not feasible to do this for each year individually. If anyone has any thoughts on how to achieve this, it'd be greatly appreciated. Thanks.
Simply generalize your process into a function, such as threshold_find below, and pass it the data frame subset for each year; by handles the splitting.
As the object-oriented wrapper to tapply, by slices a data frame by one or more factors (here, year) and returns a list of whatever object the function outputs, here the one-row data frame for the longest period. At the end, do.call() row-binds all the data frames in the by list into one data frame.
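threshold_find <- function(df) {
  rl <- with(df, rle(threshold))
  cs <- cumsum(rl$lengths)
  # start and end positions of every run at or above the threshold
  index <- cbind(cs[rl$values == 1] - rl$lengths[rl$values == 1] + 1,
                 cs[rl$values == 1])
  runs <- data.frame(index)
  runs$X3 <- runs$X2 - runs$X1 + 1
  # which.max returns the first maximum, so ties go to the earliest run
  runs[which.max(runs$X3), ]
}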
finaldf <- do.call(rbind, by(df, df$year, FUN=threshold_find))
finaldf
# X1 X2 X3
# 2009 1 10 10
# 2010 9 31 23
# 2011 10 18 9
# 2012 12 30 19
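To see how the rle bookkeeping picks the first of several tied runs, here is a small self-contained sketch on a toy 0/1 vector. `longest_run` is a hypothetical helper, not part of the answer above; it returns the start, end and length of the first longest run of 1s.

```r
# Hypothetical helper: first longest run of 1s in a 0/1 vector
longest_run <- function(flag) {
  r <- rle(flag)
  ends   <- cumsum(r$lengths)      # last position of each run
  starts <- ends - r$lengths + 1   # first position of each run
  keep   <- r$values == 1          # runs at or above the threshold
  if (!any(keep)) return(c(start = NA_real_, end = NA_real_, days = NA_real_))
  # which.max returns the first maximum, so the earliest tied run wins
  i <- which(keep)[which.max(r$lengths[keep])]
  c(start = starts[i], end = ends[i], days = r$lengths[i])
}

# Runs of 1s: positions 1-2, 4-6 and 8-10; the last two tie at length 3
res <- longest_run(c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1))
res  # start 4, end 6, days 3
```

The same function could be applied per year with by or tapply, in the same way threshold_find is used above.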

Plotting a variable measured monthly with a variable measured yearly in the same plot (R)

Here are two samples of datasets I would like to plot together on the same plot:
>head(df1)
Date y
1 2015-10-01 6217.734
2 2015-09-01 6242.592
3 2015-08-01 6772.145
4 2015-07-01 6865.719
and
>head(df2)
Year x
1 1980 5760
2 1981 4765
3 1982 2620
4 1983 7484
Given that df2$Year and df1$Date overlap date ranges and df1$y and df2$x are of the same scale, how can I best plot y and x against time on the same plot given that x is measured only yearly and y monthly?
I imagine it will require converting Year to an arbitrary date (1980-01-01, 1981-01-01). But beyond that, other than altering my df2 data.frame to having twelve observations per year with the same x value per observation, then combining the two data.frames, I cannot think of what to do.
I would prefer to use ggplot2 if there is a solution there.
Can you try this out for me?
library(dygraphs)
library(xts)
Rename one of your variables to match the other scaled variable, and rename Year to match the other data frame's date column. Then do:
prep <- cbind(df1, df2)
ts_object <- as.xts(prep[,2:ncol(prep)], prep$Year)
dygraph(ts_object)
Note that you are providing literally NO data for me to work with here. If you can do so, that'd be great. Try using dput(df1) and dput(df2), and post the output of these commands.
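Since the question asks for ggplot2, here is a hedged sketch of the "convert Year to a date and stack the two series" idea, using only the rows shown in the question. Placing each yearly value on 1 January is an assumption, as are the names `monthly`, `yearly` and `combined`; the plotting step only runs if ggplot2 is installed.

```r
# The rows shown in the question's head() output
df1 <- data.frame(
  Date = as.Date(c("2015-10-01", "2015-09-01", "2015-08-01", "2015-07-01")),
  y    = c(6217.734, 6242.592, 6772.145, 6865.719)
)
df2 <- data.frame(Year = 1980:1983, x = c(5760, 4765, 2620, 7484))

# Assumption: anchor each yearly observation on 1 January of its year
yearly  <- data.frame(Date   = as.Date(paste0(df2$Year, "-01-01")),
                      value  = df2$x,
                      series = "x (yearly)")
monthly <- data.frame(Date   = df1$Date,
                      value  = df1$y,
                      series = "y (monthly)")

# Long format: one row per observation, no need to repeat x twelve times
combined <- rbind(monthly, yearly)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(combined, aes(x = Date, y = value, colour = series)) +
    geom_line() +
    geom_point()
  print(p)
}
```

Each series keeps its own sampling frequency on a shared date axis, so the yearly values show up as sparser points along the same timeline.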

grouping by date and treatment in R

I have a time series that looks at how caffeine impacts test scores. On each day, the first test is used to measure a baseline score for the day, and the second score is the effect of a treatment.
Post Caffeine Score Time/Date
yes 10 3/17/2014 17:58:28
no 9 3/17/2014 23:55:47
no 7 3/18/2014 18:50:50
no 10 3/18/2014 23:09:03
Some days have a caffeine treatment, others do not. Here's my question: how do I group the observations by day and create a measure of impact by subtracting each day's first (baseline) score from its second?
I'm going to be using these groupings for later graphs and analysis, so I think it's most efficient if there's a way to create objects that look at the improvement in score each day and groups by whether caffeine (treatment) was used.
Thank you for your help!
First make a column for the day:
df$day = strftime(df$'Time/Date', format="%Y-%m-%d")
then I think what you're after is two aggregates:
1) To find if the day had caffeine
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
2) To calculate the difference in scores
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
Now put the two together
out = merge(dayCaf, dayDiff, by='df$day')
That gives:
df$day df$caff df$score
1 2014-03-17 1 -1
2 2014-03-18 0 3
The whole code is:
df$day = strftime(df$'Time/Date', format="%Y-%m-%d")
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
out = merge(dayCaf, dayDiff, by='df$day')
Just replace "df" with the name of your frame and it should work.
Alternatively:
DF <- data.frame(Post.Caffeine = c("Yes","No","No","No"),Score=c(10,9,7,10),Time.Date=c("3/17/2014 17:58:28","3/17/2014 23:55:47","3/18/2014 18:50:50", "3/18/2014 23:09:03"))
DF$Time.Date <- as.Date(DF$Time.Date,format="%m/%d/%Y")
DF2 <- setNames(aggregate(Score~Time.Date,DF,diff),c("Date","Diff"))
DF2$PC <- DF2$Date %in% DF$Time.Date[DF$Post.Caffeine=="Yes"]
DF2
EDIT: This assumes that your data is in the order that you demonstrate.
A data.table solution. The order part sorts your data first (if it is already sorted, you can remove the order part, just leave the comma in place). The advantage of this approach is that the whole process is done in one statement, and it will be fast too.
library(data.table)
setDT(temp)[order(as.POSIXct(strptime(`Time/Date`, "%m/%d/%Y %H:%M:%S"))),
list(HadCaffeine = if(any(PostCaffeine == "yes")) "yes" else "no",
Score = diff(Score)),
by = as.Date(strptime(`Time/Date`, "%m/%d/%Y"))]
## as.Date HadCaffeine Score
## 1: 2014-03-17 yes -1
## 2: 2014-03-18 no 3
This solution assumes temp is your data set and that the variable is named PostCaffeine instead of Post Caffeine (it is bad practice in R to put spaces or / into variable names, as it limits what you can do with them).

Producing Ordered Columns of Integers in R for odd-numbered ranges

Total newb R question, but here it is: let's say I want to create a data frame with two columns, one with all years in a range, and the other with every month in each year. When I'm done, I should have this:
year month
1990 1
1990 2
1990 3
Et cetera. This seems like a pretty obvious job for cbind, to turn a range into a column, and repeat, to produce 12 instances of each year. This works great, but only for an even number of years in the range. So, for instance:
df <- data.frame(cbind(year=rep(c(1990:2000), 12)))
Works fine. And so does this:
df <- data.frame(cbind(year=rep(c(1990:2000), 12), month=c(1:12)))
But this produces overt nonsense:
df <- data.frame(cbind(year=rep(c(1990:2001), 12), month=c(1:12)))
The first line of code produces 12 instances of each year in the range, just as you'd expect; the second line produces the desired result. The third line produces 12 instances of each year, where each year only gets one month number. Thus:
year month
1990 1
1990 1
1990 1
Is there a way around this that doesn't require always adding a year and trimming it off later?
You are looking for expand.grid
df <- expand.grid(year = 1990:2001, month = 1:12)
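To illustrate why the recycled version breaks with 12 years: both vectors then cycle with period 12, so year and month move in lockstep and each year is only ever paired with one month. With 11 years the two periods are coprime, which is why that case happened to work. The sketch below (with hypothetical names `bad` and `good`) shows the difference, and also that listing month first in expand.grid gives rows sorted by year, then month.

```r
# Recycling: year and month both repeat every 12 rows, so they stay in lockstep
bad <- data.frame(year = rep(1990:2001, 12), month = 1:12)
nrow(unique(bad))    # 12 distinct (year, month) pairs out of 144 rows

# expand.grid builds every combination; its first argument varies fastest,
# so putting month first yields rows ordered by year, then month
good <- expand.grid(month = 1:12, year = 1990:2001)[, c("year", "month")]
nrow(unique(good))   # 144 distinct pairs
head(good, 3)        # 1990 with months 1, 2, 3
```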
