How can I tidy student enrollment data on a per-semester basis?

I have a dataset that currently lists student information by term (e.g., 201610, 201620, 201630, 201640, 201710, etc.), where suffix 10 = fall, 20 = winter, 30 = spring, and 40 = summer. Not all terms are necessarily listed for every student.
What I would like to do is identify the first term in which a student was enrolled, presumably the fall, as T1, and subsequent terms as T2, T3, etc. Since some students may take a winter or summer term, I would like to identify those as T1_Winter, T2_Summer, etc.
I've been able to isolate the individual terms in which a student has enrolled, and to label the first, intermediate, and last terms as 1, 2, 3, etc. However, I can't wrap my head around how to identify fall and spring as 1, 2, 3, 4, and the intermediary terms, winter and summer, as 1.5, 2.5, 3.5, 4.5, etc.
# Create the sample dataset
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010)
)
# Isolate student IDs and terms
stdTerm <- subset(data, select = c("ID","RegTerm"))
# Sort according to ID and RegTerm
stdTerm <- stdTerm[with(stdTerm, order(ID, RegTerm)), ]
# Remove duplicate combinations of ID and term
y <- stdTerm[!duplicated(stdTerm[c(1,2)]),]
# Create an index to identify the term number
# for which a student enrolled
library(dplyr)
z <- y %>%
  arrange(ID, RegTerm) %>%
  group_by(ID) %>%
  mutate(StdTermIndex = seq(n()))
Right now, it's identifying the progression of all terms for a student as 1, 2, 3, etc., but not winter and summer as intermediary terms. That is, if a student enrolled in fall and winter, winter will appear as 2 and spring will appear as 3.
In the sample data provided, I would like Student ID 1 to reflect 201810 as 1, 201820 as 1.5, and 201830 as 2, etc. Any suggestions, or existing code I could reference, to help me work out how to code the intermediary semesters?

So, to do it with your sample, I created a helper variable that tells me whether the term digit in RegTerm (its 5th digit) is even or odd.
The reason is simple: an odd digit means a regular term (fall or spring), whereas an even digit means a winter or summer term.
library(dplyr)
library(stringr)  # needed for str_extract()

data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010)
)

dat <- data %>%
  # extract the 5th digit (the term code) and flag regular terms (odd)
  # vs winter/summer terms (even)
  mutate(term = str_extract(RegTerm, '(?<=\\d{4})\\d{1}(?=0)'),
         term = as.numeric(term) %% 2) %>%
  group_by(ID) %>%
  # count regular terms, then push winter/summer half a step past the
  # preceding regular term
  mutate(numTerm = cumsum(term),
         numTerm = ifelse(term == 0, numTerm + 0.5, numTerm))
The first mutate extracts the 5th digit of RegTerm and takes the remainder of its division by 2. If the result equals 1, it is a regular term; otherwise it is a winter or summer term.
Next I take the cumulative sum of this flag, which tells you which regular term the student has reached. Then, for every row with term == 0, I add 0.5 to numTerm to account for the winter and summer terms.
# A tibble: 7 x 4
# Groups:   ID [2]
     ID RegTerm  term numTerm
  <dbl>   <dbl> <dbl>   <dbl>
1     1  201810     1     1
2     1  201820     0     1.5
3     1  201830     1     2
4     2  201910     1     1
5     2  201930     1     2
6     2  201940     0     2.5
7     2  202010     1     3
This way, if a student starts in a winter term, numTerm is assigned 0.5, and only reaches numTerm = 1 once they hit a regular term (term == 1).
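A possible follow-up, not part of the answer above: if you also want the T1 / T1_Winter style labels from the question, one sketch (assuming suffix 20 always means winter and 40 always means summer) is:
library(dplyr)

dat %>%
  mutate(suffix = as.numeric(substr(RegTerm, 5, 6)),
         # regular terms get T1, T2, ...; winter/summer hang off the
         # preceding regular term as T1_Winter, T2_Summer, ...
         label = ifelse(term == 1,
                        paste0("T", numTerm),
                        paste0("T", floor(numTerm), "_",
                               ifelse(suffix == 20, "Winter", "Summer"))))
Note that a student whose very first record is a winter/summer term would get a "T0_Winter" style label here, so you would need to decide how to handle that case.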

I think a good way to do this would be to separate your RegTerm column into year and suffix, and then apply some conditional logic once you have the values split up.
The code below does that; we then just have to apply it to the whole column and do some rejigging.
paste(strsplit(as.character(201810), "")[[1]][1:4], collapse = "")
# "2018"
paste(strsplit(as.character(201810), "")[[1]][5:6], collapse = "")
# "10"
So to do it on the data frame you want to use something like lapply and then unlist the result and add a new column. After that you can change the values to numeric and then use some conditional statements in a mutate function to set the intermediary values etc.
z$year <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][1:4], collapse = "")))
z$suf <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][5:6], collapse = "")))
It looks a bit ugly, but all it is doing is splitting RegTerm into characters, selecting the first 4 or the last 2 characters for year and suf respectively, then collapsing them (using collapse = "" in paste) into a single string. We lapply this over the whole column and then unlist the result to get a vector.
I would recommend working through the first two lines of code in this answer first; the rest then follows.
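One way to finish this approach (my sketch, mirroring the cumsum idea from the first answer rather than anything spelled out in this one): treat suffixes 10/30 as full terms and 20/40 as half steps.
library(dplyr)

z <- z %>%
  group_by(ID) %>%
  arrange(RegTerm, .by_group = TRUE) %>%
  # suffixes 10/30 are regular terms, 20/40 are winter/summer half-steps
  mutate(isRegular = suf %in% c("10", "30"),
         termIndex = cumsum(isRegular) + ifelse(isRegular, 0, 0.5))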

Related

Retain observations that haven't occurred in the year before in a long time-series dataset in R

I have a df that looks like this:
ID Year
5 2010
5 2011
5 2014
3 2013
3 2014
10 2013
1 2010
1 2012
1 2014
...
The df contains the years 2009-2019 and is filtered to individuals living in one particular town who are 18-64 years old in the given year.
For every year I need to keep only the individuals that moved into this town that particular year. So, for example, I need to keep the difference between the 2010 population and the 2009 population. I also need to do this for every year (note that some people move out of town for a couple of years and then return; ID 5 is an example of this). In the end, I want one df for every year 2010-2019, so ten dfs that each contain only the individuals that moved into town that particular year.
I have played around with group_by() and left_join(), but haven't managed to succeed. There must be a simple solution, but I haven't been able to find one yet.
You can use the setdiff function to perform a set(A) - set(B) operation. Split your data into data frames by year, and then loop through them, finding the new joiners.
Example code:
library(dplyr)
set.seed(123)
df <- tibble(
  id = c(1, 2, 3, 4, 5,          # first year
         1, 2, 3, 5, 6, 7,       # 4 moves out; 6, 7 move in
         2, 3, 4, 6, 7, 8),      # 1, 5 move out; 4, 8 move in
  year = c(rep(2009, 5),
           rep(2010, 6),
           rep(2011, 6)),
  age = sample(18:64, size = 17) # extra column
)

# split into a list of data frames by year
df_by_year <- split(df, df$year)

# create a list to contain the 2 result dfs (3 years - 1)
df_list <- vector("list", 2)

for (i in seq_along(df_list)) {
  # determine incoming new people
  new_joinees <- setdiff(df_by_year[[i + 1]]$id, df_by_year[[i]]$id)
  # keep only rows for those IDs
  df_list[[i]] <- dplyr::filter(df_by_year[[i + 1]], id %in% new_joinees)
}
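Optionally (my addition, not part of the original answer), you can name each result by the year the newcomers arrived and eyeball the incoming IDs:
# label each element by the year the new people moved in
names(df_list) <- names(df_by_year)[-1]

# peek at the incoming IDs per year
lapply(df_list, function(d) d$id)
# with the sample data above, this should show 6 and 7 for 2010, and 4 and 8 for 2011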

Extracting specific rows from long format dataset by conditions / adopting long format data to survival analysis

Background:
I have a dataset that I am preparing for a survival analysis, it's originally a longitudinal dataset in long format. I have an ID variable separating participants, a time variable (months), and my binary 0/1 event variable (whether or not somebody met a "monthly loss limit" when gambling).
Problem/goal:
I am trying to create the necessary variables for the survival analysis and then remove the excess/unnecessary rows. My event (meeting a loss limit) can technically occur multiple times for each participant across the study period, but I am only interested in the first occurrence for a participant. I have made a time duration variable and attempted to modify it with an if-else statement so that participants that meet a loss limit have that specific month as their endpoint.
The problem is that I can't seem to do the filtering in a way that keeps only the rows that I want. I have attempted some code with an if-else statement but I am getting an error. For participants that have met one or more loss limits, I want to extract the row with their first loss limit, because the modified time duration is also contained within this row. For participants that never reach a loss limit it doesn't matter; any row is fine because they all contain the necessary information.
How do I accomplish this?
Example data frame and code:
library(dplyr)
# Example variables and data frame in long form
# Includes id variable, time variable and example event variable
id <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3 )
time <- c(2, 3, 4, 7, 3, 5, 7, 1, 2, 3, 4, 5)
metLimit <- c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
dfLong <- data.frame(id = id, time = time, metLimit = metLimit)
# Make variables: time at start, time at finish, and a duration variable
dfLong <- dfLong %>%
  group_by(id) %>%
  mutate(startTime = first(time),
         lastTime = last(time))

dfLong <- dfLong %>%
  group_by(id) %>%
  mutate(timeDuration = ifelse(metLimit == "1", time - startTime,
                               lastTime - startTime))
# My failed attempt at solving the problem
dfLong <- dfLong %>%
group_by(id) %>%
ifelse(metLimit == "1", filter(first(metLimit)), filter(last(time)
You could sort the id groups:
dfLong %>%
  group_by(id) %>%
  # This step is critical: order by metLimit descending first, so within each
  # id group any metLimit == 1 rows come to the top, then order by time, so
  # the earliest limit-hitting row sits in the first row of the group
  arrange(desc(metLimit), time, .by_group = TRUE) %>%
  slice_head(n = 1)  # keep the first row of each id group
This results in:
# A tibble: 3 x 6
# Groups:   id [3]
     id  time metLimit startTime lastTime timeDuration
  <dbl> <dbl>    <dbl>     <dbl>    <dbl>        <dbl>
1     1     2        0         2        7            5
2     2     5        1         3        7            2
3     3     2        1         1        5            1
As you do not care which row is kept for participants who never reached their limit, this should be sufficient.
How about replacing the last step with:
dfLong <- dfLong %>%
  group_by(id) %>%
  dplyr::filter(metLimit == ifelse(sum(metLimit), 1, 0)) %>%
  dplyr::slice_head(n = 1)
# A tibble: 3 x 6
# Groups:   id [3]
     id  time metLimit startTime lastTime timeDuration
  <dbl> <dbl>    <dbl>     <dbl>    <dbl>        <dbl>
1     1     2        0         2        7            5
2     2     5        1         3        7            2
3     3     2        1         1        5            1
The filter() step keeps the rows where metLimit is 1, unless they are all 0 for that participant (then sum(metLimit) is 0, ifelse() falls back to 0, and the condition becomes metLimit == 0). slice_head() then gives you the first of the remaining rows.
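A tiny illustration of that condition on throwaway data (my sketch, not from the answer):
library(dplyr)

# No limit ever met: sum(metLimit) is 0, ifelse() falls back to 0, so the
# condition becomes metLimit == 0 and every row passes; slice_head() keeps one.
tibble(metLimit = c(0, 0, 0)) %>%
  filter(metLimit == ifelse(sum(metLimit), 1, 0)) %>%
  slice_head(n = 1)

# Limit met at least once: the condition becomes metLimit == 1, so only those
# rows survive, and slice_head() keeps the earliest (given the data are in time order).
tibble(metLimit = c(0, 1, 1)) %>%
  filter(metLimit == ifelse(sum(metLimit), 1, 0)) %>%
  slice_head(n = 1)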

Removing negative values and one positive value from R dataframe

I have a dataframe where one column is the amount spent. That column contains the amounts spent as well as negative values for any returns. For example:
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
I want to remove the negative value and then one of its positive counterparts; the idea is to keep only fully completed spend amounts so I can look at total spend.
Right now I am thinking of something like this, where I have the data frame sorted by spend:
if spend < 0 {
take absolute value of spend
if diff between abs(spend) and spend+1 = 0 then both are NA}
I would like to have something like
df[df$spend < 0] <- NA
where I can also set one positive counterpart to NA as well. Any suggestions?
There should be a simpler solution to this, but here is one way. I also created my own example, since the one shared did not have enough data points to test with.
#Original vector
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
#Count the frequency of negative numbers, keeping all the unique numbers
vals <- table(factor(abs(x[x < 0]), levels = unique(abs(x))))
#Count the frequency of absolute value of original vector
vals1 <- table(abs(x))
#Subtract the frequencies between two vectors
new_val <- vals1 - (vals * 2 )
#Recreate the new vector
as.integer(rep(names(new_val), new_val))
#[1] 1 2 3
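For reference, printing the intermediate objects for this x gives roughly the following (my annotation of the code above):
vals
#  1 2 3 4
#  2 1 0 1   <- how many times each absolute value was refunded
vals1
#  1 2 3 4
#  5 3 1 2   <- how many times each absolute value appears overall
new_val
#  1 2 3 4
#  1 1 1 0   <- appearances left after dropping each refund and one counterpart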
If you add a rowid column you can do this with data.table anti-joins.
Here's an example which takes ID into account, so it does not delete "positive counterparts" unless they belong to the same ID.
First, create more interesting sample data:
library(data.table)  # for fread(), rowid(), and data.table()

df <- fread('
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
123 A -19.50
123 A 19.50
123 A -99.50
124 A -94.50
124 A 99.50
124 A 94.50
124 A 94.50
')
Now remove all the negative values with positive counterparts, and remove those counterparts
negs <- df[Spent < 0][, Spent := -Spent][, rid := rowid(ID, Spent)]
pos <- df[Spent > 0][, rid := rowid(ID, Spent)]
pos[!negs, on = .(ID, Spent, rid), -'rid']
# ID Store Spent rid
# 1: 123 A 18.5 2
# 2: 124 A 99.5 1
# 3: 124 A 94.5 2
And as applied to Ronak's x vector example
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
negs <- data.table(x = -x[x<0])[, rid := rowid(x)]
pos <- data.table(x = x[x>0])[, rid := rowid(x)]
pos[!negs, on = names(pos), -'rid']
# x
# 1: 2
# 2: 3
# 3: 1
I used the following code.
library(dplyr)

store <- rep(LETTERS[1:3], 3)
id <- c(1:4, 1:3, 1:2)
expense <- runif(9, -10, 10)

tibble(store, id, expense) %>%
  group_by(store) %>%
  summarise(net_expenditure = sum(expense))
to get this output:
# A tibble: 3 x 2
  store net_expenditure
  <chr>           <dbl>
1 A               13.3
2 B                8.17
3 C               16.6
Alternatively, if you wanted the net expenditure per store-id pairing, then you could use this code:
tibble(store, id, expense) %>%
  group_by(store, id) %>%
  summarise(net_expenditure = sum(expense))
I've approached your question from a slightly different perspective. I'm not sure that my code answers your question, but it might help.

Subset rows for each group based on a character in a column and order of occurrence in a data frame

I have a data similar to this.
B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
                Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester",
                                "Expenses", "Expenses", "In the Bimester", "In the Bimester"), 3),
                Value = runif(24))
You can see that Account has 4 occurrences of the element "In the Bimester", two "chunks" of two elements for each state, "Expenses" in between them.
The order here matters because the first chunk is not referring to the same thing as the second chunk.
My data is actually more complex: it has a 4th variable indicating what each row of Account means, and the number of rows per Account element can change. For example, in some state the first "chunk" of "In the Bimester" can have 6 rows and the second 7, and I cannot differentiate the chunks using this 4th variable.
Desired: I'd like to subset my data, splitting those two "In the Bimester" chunks within each state, and keeping only the rows of the first chunks (or only the second chunks) for each state.
I have a solution using the data.table package, but I find it kind of poor. Any thoughts?
library(data.table)
B <- as.data.table(B)
B <- B[, .(Account, Value, index = 1:.N), by = .(State)]
x <- B[Account == "Expenses", .(min_ind = min(index)), by = .(State)]
B <- merge(B, x, by = "State")
B <- B[index < min_ind & Account == "In the Bimester", .(Value), by = .(State)]
You can use the dplyr package:
library(dplyr)

B %>%
  mutate(helper = data.table::rleid(Account)) %>%
  filter(Account == "In the Bimester") %>%
  group_by(State) %>%
  filter(helper == min(helper)) %>%
  select(-helper)
# # A tibble: 6 x 3
# # Groups: State [3]
# State Account Value
# <fctr> <fctr> <dbl>
# 1 Arizona In the Bimester 0.17730148
# 2 Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5 Texas In the Bimester 0.54076144
# 6 Texas In the Bimester 0.59168138
If you use max instead of min, you'll get the last occurrence of "In the Bimester" for each State. You can also exclude the Account column by changing the last pipe to select(-helper, -Account).
P.S. If you don't want to use rleid from data.table and would rather stick to dplyr functions, take a look at this thread.
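For completeness, here is a dplyr-only sketch of my own (not necessarily what that thread proposes) that mimics rleid() with cumsum() over changes in Account:
library(dplyr)

B %>%
  mutate(acct = as.character(Account),
         # run id: increases every time Account changes from one row to the next
         helper = cumsum(acct != lag(acct, default = acct[1]))) %>%
  filter(Account == "In the Bimester") %>%
  group_by(State) %>%
  filter(helper == min(helper)) %>%   # keep only the first chunk per state
  select(-helper, -acct)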

R conditional lookup and sum

I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
                           college = rep(rep(paste("College", 1:6), each = 35), 17),
                           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7), 102),
                           intake = rep(sample(x = 150:300, size = 510, replace = TRUE), each = 7),
                           output.year = rep(1:7, 510),
                           output = sample(x = 10:20, size = 3570, replace = TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"],
                           by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course),
                           FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
  # empty vector to fill with output values
  vec <- c()
  # Find relevant output for college + course, from each cohort and exit year
  for(j in 1:7){
    append(x = vec,
           values = dummy.lookup[dummy.lookup$college == dummy.summary[x, "college"] &
                                   dummy.lookup$course == dummy.summary[x, "course"] &
                                   dummy.lookup$cohort == dummy.summary[x, "year"] - j &
                                   dummy.lookup$output.year == j, "output"])
  }
  # Sum and return total output
  sum_vec <- sum(vec)
  return(sum_vec)
})
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to acheive what I'm trying to do?
Thanks in anticipation.
I've just updated your output.year to output.year2, so that instead of a value from 1 to 7 it holds the actual calendar year, based on the cohort.
I realised that the output information you want corresponds to output.year, while the intake information you want corresponds to the cohort. So I calculate them separately and then join the two tables. This automatically creates empty output info for 1998 (NA, which I convert to 0).
# fix your random sampling
set.seed(24)

# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
                           college = rep(rep(paste("College", 1:6), each = 35), 17),
                           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7), 102),
                           intake = rep(sample(x = 150:300, size = 510, replace = TRUE), each = 7),
                           output.year = rep(1:7, 510),
                           output = sample(x = 10:20, size = 3570, replace = TRUE))

# (note: dummy.lookup has no "yr" column, so this line has no effect as written)
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0
library(dplyr)

# create result table for the output info
dt_output <- dummy.lookup %>%
  mutate(output.year2 = output.year + cohort) %>%   # turn output.year into a calendar year
  group_by(output.year2, college, course) %>%       # for each output year, college, course
  summarise(SumOutput = sum(output)) %>%            # calculate the sum of output
  ungroup() %>%
  arrange(college, course, output.year2) %>%        # for visualisation purposes
  rename(cohort = output.year2)                     # rename column

# create result table for the intake info
dt_intake <- dummy.lookup %>%
  select(cohort, college, course, intake) %>%       # select useful columns
  distinct()                                        # keep distinct rows/values

# join the two
dt_intake %>%
  full_join(dt_output, by = c("cohort", "college", "course")) %>%
  mutate(SumOutput = ifelse(is.na(SumOutput), 0, SumOutput)) %>%
  arrange(college, course, cohort) %>%              # for visualisation purposes
  tbl_df()                                          # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
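As a quick sanity check (my addition, not part of the original answer), one SumOutput cell can be recomputed straight from the question's definition, e.g. College 1 / Course A in 2005:
# students finishing in 2005 are the rows whose cohort + output.year equals 2005;
# their summed output should match SumOutput for that college/course/year above
dummy.lookup %>%
  filter(college == "College 1", course == "Course A",
         cohort + output.year == 2005) %>%
  summarise(check = sum(output))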
