Estimating likelihood of survey attrition relative to treatment - r

I have panel survey data in which each row represents an individual, their interview date, and their labor market status during that period. The panel is unbalanced: some individuals appear more often than others because some stopped responding to the survey. Data was collected on individuals before and after some of them were randomly given a cash assistance benefit.
I am interested in knowing whether some individuals stopped responding to our survey specifically after they received the cash benefit (the treatment date is 2019-09-03).
In other words, I want to test the probability of leaving the survey relative to the "date" variable, but I am not sure how to do that.
Here is a data example. Some individuals, like Cartman, who received treatment in Sept 2019, stopped responding to the survey in the following years, so their labor market status is recorded as "N/A".
Other individuals in the control group, like Mackey, who did not receive the treatment, continued responding to the survey in the following years.
individual  date        labor_status  cash_benefit
Kenny       2018-09-02  unemployed    0
Kenny       2019-09-03  unemployed    1
Kenny       2020-09-07  employed      1
Kenny       2021-09-13  employed      1
Cartman     2018-09-03  unemployed    0
Cartman     2019-09-06  unemployed    1
Cartman     2020-09-08  N/A           1
Cartman     2021-09-08  N/A           1
Mackey      2018-09-03  employed      0
Mackey      2019-09-04  unemployed    0
Mackey      2020-09-08  employed      0
Mackey      2021-09-13  employed      0

If you’re looking to test this statistically, you should ask on Cross Validated. But if you just want the probability of dropout after 2019 conditional on receiving benefit:
library(dplyr)
library(lubridate)
# Assumes `date` is a Date and a missing labor_status is a true NA (not the string "N/A")
dat %>%
  group_by(individual) %>%
  summarize(
    benefit = any(cash_benefit == 1),
    # TRUE only if every row from 2020 onward has a missing labor_status
    dropout_after_2019 = all(
      year(date) < 2019 |
        (year(date) == 2019 & !is.na(labor_status)) |
        is.na(labor_status)
    )
  ) %>%
  group_by(benefit) %>%
  summarize(p_dropout_after_2019 = mean(dropout_after_2019))
# A tibble: 2 × 2
  benefit p_dropout_after_2019
  <lgl>                  <dbl>
1 FALSE                    0
2 TRUE                     0.5
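The code above expects a data frame named `dat` with a parsed `date` column and true `NA` values in `labor_status`. Here is a minimal sketch of one way to build it from the question's sample table and run the same pipeline. The `read.table()` call and the `na.strings = "N/A"` setting are assumptions about how the data would be loaded, not part of the original answer.

```r
library(dplyr)
library(lubridate)

# Sample rows from the question; "N/A" is read in as a true NA (assumption)
dat <- read.table(text = "individual date labor_status cash_benefit
Kenny 2018-09-02 unemployed 0
Kenny 2019-09-03 unemployed 1
Kenny 2020-09-07 employed 1
Kenny 2021-09-13 employed 1
Cartman 2018-09-03 unemployed 0
Cartman 2019-09-06 unemployed 1
Cartman 2020-09-08 N/A 1
Cartman 2021-09-08 N/A 1
Mackey 2018-09-03 employed 0
Mackey 2019-09-04 unemployed 0
Mackey 2020-09-08 employed 0
Mackey 2021-09-13 employed 0",
  header = TRUE, na.strings = "N/A", stringsAsFactors = FALSE)

# Parse the interview dates so that year() works on them
dat$date <- as.Date(dat$date)

# Same logic as the answer: share of individuals with no response after 2019,
# split by whether they ever received the benefit
result <- dat %>%
  group_by(individual) %>%
  summarize(
    benefit = any(cash_benefit == 1),
    dropout_after_2019 = all(
      year(date) < 2019 |
        (year(date) == 2019 & !is.na(labor_status)) |
        is.na(labor_status)
    )
  ) %>%
  group_by(benefit) %>%
  summarize(p_dropout_after_2019 = mean(dropout_after_2019))

print(result)
```

With only three individuals this is purely descriptive; as the answer says, a formal test is a question for Cross Validated.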

Related

summarize multiple binary variables in a single column

In a survey I conducted, I asked about the education level of the participants. The results are spread over several columns as binary variables. I would appreciate efficient ways to combine the results into a single variable. The tables below show the current and desired data format.
ID  high school  college  PhD
1   high school  -1       -1
2   -1           college  -1
3   -1           -1       PhD
4   high school  -1       -1

ID  Educational background
1   high school
2   college
3   PhD
4   high school
To answer your specific question using the tidyverse, creating a test dataset with the code at the end of this post:
library(tidyverse)
df %>%
  mutate(
    # Turn the "-1" missing-value flags into true NAs
    across(-ID, function(x) ifelse(x == "-1", NA, x)),
    # Take the first non-missing value across the three survey columns
    EducationalBackground = coalesce(high_school, college, PhD)
  )
  ID high_school college  PhD EducationalBackground
1  1 high_school    <NA> <NA>           high_school
2  2        <NA> college <NA>               college
3  3        <NA>    <NA>  PhD                   PhD
4  4 high_school    <NA> <NA>           high_school
The code works by converting the text values of "-1" in your columns, which I take to be missing value flags, to true missing values. Then I use coalesce to find the first non-missing value in the three columns that contain survey data and place it in the new summary column. This assumes that there will be one and only one non-missing value in each row of the data frame.
That said, my preference would be to avoid the problem altogether by adapting your workflow at an earlier stage. But you haven't given any details of that, so I can't make any suggestions about how to do that.
Test data
df <- read.table(textConnection("ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1"), header=TRUE)
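The coalesce approach relies on each row having exactly one non-missing answer. Here is a small base R sketch for checking that assumption before combining the columns. The names `edu_cols`, `n_answers` and `bad_rows` are hypothetical helpers, not part of the original answer.

```r
# Same test data as in the answer
df <- read.table(text = "ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1", header = TRUE, stringsAsFactors = FALSE)

# Count, per row, how many education columns hold a real value
# (anything other than the "-1" flag)
edu_cols <- c("high_school", "college", "PhD")
n_answers <- rowSums(df[edu_cols] != "-1")

# IDs of rows that break the one-answer-per-row assumption
bad_rows <- df$ID[n_answers != 1]
print(bad_rows)
```

If `bad_rows` is not empty, `coalesce()` would silently keep only the first answer (or return `NA` when there is none), so those rows are worth inspecting by hand.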

Fuzzy matching strings within a single column and documenting possible matches

I have a relatively large dataset of ~ 5k rows containing titles of journal/research papers. Here is a small sample of the dataset:
dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts",
"Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court",
"Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care",
"An ecosystem for improving the quality of personal health records",
"Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders",
"A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
"A model for the assessment of static and dynamic factors in sexual offenders",
"The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression",
"Co-occurring disorders among mentally ill jail detainees. Implications for public policy",
"Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudinal Study",
"Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure",
"Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure",
"Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0",
"Diagnosis of active and latent tuberculosis: summary of NICE guidance",
"Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
You can see that there are some duplicate titles in there, but with formatting/case differences. I want to identify titles that are duplicated and create a new variable that documents which rows are possible matches. To do this, I have attempted to use the agrep function as suggested here:
dt$is.match <- sapply(dt$Title,agrep,dt$Title)
This identifies matches, but saves the results as a list in the new column. Is there a way to do this (preferably using base R or data.table) where the results of agrep are not saved as a list, but only identify which rows are matches (e.g., 6:7)?
Thanks in advance - hope I have provided enough information.
Do you need something like this?
dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)
dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
This is neither base R nor data.table, but here's one way using the tidyverse (plus janitor) to detect duplicates:
library(janitor)
library(tidyverse)
dt %>%
  mutate(row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 2 x 3
Title dupe_count row
<chr> <int> <int>
1 Therapeutic justice: Life inside drug court 2 2
2 Therapeutic justice: Life inside drug court 2 3
If you want to pick out duplicates regardless of case, try this:
dt %>%
  mutate(Title = str_to_lower(Title),
         row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 6 x 3
Title dupe_count row
<chr> <int> <int>
1 a model for the assessment of static and dynamic factors in sexual offend… 2 7
2 a model for the assessment of static and dynamic factors in sexual offend… 2 8
3 behavioral health and adult milestones in young adults with perinatal hiv… 2 12
4 behavioral health and adult milestones in young adults with perinatal hiv… 2 13
5 therapeutic justice: life inside drug court 2 2
6 therapeutic justice: life inside drug court 2 3
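Since the question asked for base R, the case-insensitive duplicate check can also be sketched without extra packages. This is only a sketch under the assumption that case is the sole difference between duplicates; unlike `agrep`, it will not catch other formatting differences. The names `titles`, `key` and `groups` are hypothetical, and only five of the sample titles are used.

```r
# A few titles taken from the question's sample data
titles <- c(
  "Community reinforcement approach in the treatment of opiate addicts",
  "Therapeutic justice: Life inside drug court",
  "Therapeutic justice: Life inside drug court",
  "A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
  "A model for the assessment of static and dynamic factors in sexual offenders"
)

# Normalise case so that titles differing only in capitalisation share a key
key <- tolower(titles)

# For every key, collect the row numbers that carry it
groups <- split(seq_along(key), key)

# One comma-separated string of matching rows per title (not a list column)
is_match <- vapply(key, function(k) toString(groups[[k]]),
                   character(1), USE.NAMES = FALSE)

dt <- data.frame(Title = titles, is.match = is_match)
print(dt$is.match)
```

Rows whose `is.match` contains more than one number are the possible duplicates.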

Calculating conditional summaries of grouped data in dplyr

I have a dataset of population mortality data broken down by year, decile (ranked) of deprivation, gender, cause of death and age. Age is grouped into categories including 0-1, 1-4, 5-9, 10-14 etc.
I am trying to reshape my dataset so that the mortality data for 0-1 and 1-4 is merged, giving age categories 0-4, 5-9, 10-14 and so on up to 90. My data is in long format.
Using dplyr, I am trying to use if_else and summarise() to aggregate mortality data for 0-1 and 1-4 together. However, every variation of the code I try just returns the same dataset I started with, i.e. the code is not merging my data.
head(death_popn_long) #cause_death variable content removed for brevity
  Year deprivation_decile  Sex cause_death ageband deaths popn
1 2017                  1 Male          NA       0      0 2106
2 2017                  1 Male          NA       0      0 2106
3 2017                  1 Male          NA       0      0 2106
4 2017                  1 Male          NA       0      0 2106
5 2017                  1 Male          NA       0      0 2106
6 2017                  1 Male          NA       0      0 2106
# Attempt to merge ageband 0-1 & 1-4 by summarising combined death counts
test <- death_popn_long %>%
  group_by(Year, deprivation_decile, Sex, cause_death, ageband) %>%
  summarise(deaths = if_else(ageband %in% c("0", "1"), sum(deaths), deaths))
I would like the deaths variable to be the combined (i.e. sum of both 0-1 and 1-4) death count for these age bands. However, the above and any alternative code I attempt merely recreates the dataset I already had.
You don't want to use ageband in your group_by statement if you intend on manipulating its groups. You'll need to create your new version of ageband and then group by that:
test <- death_popn_long %>%
  # If ageband is character, use "1" here: if_else() needs both outcomes to share a type
  mutate(new_ageband = if_else(ageband %in% c("0", "1"), 1, ageband)) %>%
  group_by(Year, deprivation_decile, Sex, cause_death, new_ageband) %>%
  summarise(deaths = sum(deaths))
If you'd like a marginally shorter version you can define new_ageband in the group_by clause instead of using a mutate verb beforehand. I just did that to be explicit.
Also, for future SO questions - it's very helpful to provide data in your question (using something like dput). :)
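The shorter variant mentioned above, with the recoded age band defined directly inside `group_by()`, might look like the sketch below. The toy data frame, the assumption that `ageband` is a character column, and the `"0-4"` label are all illustrative choices rather than details from the question.

```r
library(dplyr)

# Toy stand-in for death_popn_long (hypothetical values; ageband assumed character)
death_popn_long <- data.frame(
  Year = 2017,
  deprivation_decile = 1,
  Sex = "Male",
  ageband = c("0", "1", "5", "10"),
  deaths = c(2, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Define the merged age band inside group_by(), then sum deaths per group
test <- death_popn_long %>%
  group_by(
    Year, deprivation_decile, Sex,
    new_ageband = if_else(ageband %in% c("0", "1"), "0-4", ageband)
  ) %>%
  summarise(deaths = sum(deaths), .groups = "drop")

print(test)
```

Here the bands "0" and "1" collapse into a single "0-4" row whose `deaths` is their sum, while the other bands pass through unchanged.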

How to relate two different dataframes to make calculations

I know how to work with and compute statistics on a single dataframe. But what happens when I have to deal with two? For example:
> df1
supervisor salesperson
1 Supervisor1 Matt
2 Supervisor2 Amelia
3 Supervisor2 Philip
> df2
month channel Matt Amelia Philip
1 Jan Internet 10 50 20
2 Jan Cellphone 20 60 30
3 Feb Internet 40 40 30
4 Feb Cellphone 30 120 40
How can I compute the sales by supervisor, grouped by channel, in an efficient and generalizable way? Is there any methodology or criterion for relating two or more dataframes in order to compute the data you need?
PS: The numbers are the sales made by each salesperson.
Here is the idea of converting to long and merging using tidyverse,
library(tidyverse)
df2 %>%
  gather(salesperson, val, -c(1:2)) %>%
  left_join(., df1, by = 'salesperson') %>%
  spread(salesperson, val, fill = 0) %>%
  group_by(channel, supervisor) %>%
  summarise_at(vars(names(.)[4:6]), funs(sum))
which gives,
# A tibble: 4 x 5
# Groups:   channel [?]
  channel   supervisor  Amelia  Matt Philip
  <fct>     <fct>        <dbl> <dbl>  <dbl>
1 Cellphone Supervisor1     0.   50.     0.
2 Cellphone Supervisor2   180.    0.    70.
3 Internet  Supervisor1     0.   50.     0.
4 Internet  Supervisor2    90.    0.    50.
NOTE: You can also add month in the group_by
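In current tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`, and `funs()` is deprecated in dplyr. A sketch of the same long-then-join idea with the newer verbs follows. It returns one total per supervisor and channel instead of one column per salesperson, which is an interpretation of "sales by supervisor" rather than the answer's exact output. The data frames are rebuilt from the question.

```r
library(dplyr)
library(tidyr)

# Data frames rebuilt from the question
df1 <- data.frame(
  supervisor  = c("Supervisor1", "Supervisor2", "Supervisor2"),
  salesperson = c("Matt", "Amelia", "Philip"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb"),
  channel = c("Internet", "Cellphone", "Internet", "Cellphone"),
  Matt    = c(10, 20, 40, 30),
  Amelia  = c(50, 60, 40, 120),
  Philip  = c(20, 30, 30, 40),
  stringsAsFactors = FALSE
)

sales_by_supervisor <- df2 %>%
  # One row per month/channel/salesperson
  pivot_longer(-c(month, channel),
               names_to = "salesperson", values_to = "sales") %>%
  # Attach each salesperson's supervisor
  left_join(df1, by = "salesperson") %>%
  # Total sales per channel and supervisor
  group_by(channel, supervisor) %>%
  summarise(sales = sum(sales), .groups = "drop")

print(sales_by_supervisor)
```

The general pattern is to reshape until both tables share a key column (here `salesperson`), join on it, and only then group and summarise.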

Survival Analysis Data with country-year observations

I'm trying to fit a Cox Proportional Hazard model to analyze the impact of the number of protest events on the survival rates of different political regimes in different countries.
My dataset looks similar to this:
Country    year  sdate       edate       time  evercollapsed  protest  GDPgrowth
Country A  2003  1996-11-24  2012-12-31  5881  0              78       14.78
Country A  2004  NA          NA          NA    0              99       8.56
Country A  2005  NA          NA          NA    0              25       3.56
Country B  2003  2000-10-26  2011-05-21  3859  1              13       2.33
Country B  2004  NA          NA          NA    1              28       5.43
Country B  2005  NA          NA          NA    1              7        1.89
So, basically, my dataset provides yearly information on a number of variables, but the start and end dates of the regime and its survival time (measured in days) are only provided in the first row of each political regime.
My data includes information for 48 different political regimes and 15 of them collapse in the time span I am looking at.
I fitted a Cox PH model with the survival package:
myCPH <- coxph(Surv(time, evercollapsed) ~ protest + GDPgrowth, data = mydata)
This gives me the following result:
Call:
coxph(formula = Surv(time, evercollapsed) ~ protest + GDPgrowth,
    data = mydata)
              coef exp(coef) se(coef)     z     p
protest    0.01630   1.01644  0.00722  2.26 0.024
GDPgrowth -0.03447   0.96612  0.01523 -2.26 0.024
Likelihood ratio test=9.26 on 2 df, p=0.00977
n= 48, number of events= 15
   (556 observations deleted due to missingness)
So, these results imply that I'm losing 556 country-years, because those rows in my data frame do not include the survival time of the regime.
My question now is: how do I include in the analysis the country-years that do not provide information on sdate, edate and time?
I assume that if I just copied this information into every country-year, it would inflate my number of regime collapses?
I assume I have to give a unique ID to every political regime to make sure R can distinguish the different cases. Then, how do I fit a Cox PH model that includes the information from the different country-years?
Many thanks in advance!
