Creating Inverse Probability of Attrition Weights in R - r

Weuve et al. (2012) wrote a great paper about implementing Inverse Probability of Attrition Weighting (IPAW), a weighting method used to account for bias introduced by attrition during the course of a longitudinal study. Here is a link to said article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3237815/#R30
I am working on a project where I am trying to implement this IPAW method and there isn't much out there on how to implement and code this method, so I'm looking for some help just to make sure I'm doing everything correctly.
The data I am working with involves older individuals who may have dementia, so it makes sense to use IPAW because those with dementia are more likely to leave the study. Each individual has at least a baseline visit and then up to 12 follow up visit (the average number of visits for each person is around 3). My understanding is that I should create weights for each round of follow up visits, so I start by subsetting the data to only a certain visit, creating a variable for whether or not somebody drops out immediately following the visit, and then I proceed to creating the models and weights.
Below is the r code I have been using to generate the weights:
Creating weights for the first follow up visit (visit == 1)
for-loop to create a variable for attrition
(For ease, I am just calling the last observation "x")
data$attrition<-c()
data$attrition[x] <- 1
for (i in 1:x){
if(data$visit[i+1] == 0) {
data$attrition[i] = 1
} else {
data$attrition[i] = 0
}
}
subsetting to only get the data for the first follow up visit
data_visit1 <-subset(data, data$visit == 1)
creating stepwise model for the likelihood of attriting
# specifying null model
null_visit1 <- glm(attrition ~ 1, family = binomial, data = data_visit1)
# specifying full model --
full_visit1 <- glm(attrition~
predictor1 +
predictor2 +
...,
family = binomial, data = data_visit1)
# running combined selection
stepmodel_visit1 <- step(null_visit1, scope=list(lower = null_visit1, upper = full_visit1), direction = "forward", k=2)
Creating weights
# re-naming model for denominator
denom.model <- stepmodel_visit1
# creating the predicted categorizations
pd_visit1 <- predict(denom.model, type = "response")
## estimation of numerator of ip weights using stabilizer instead of just 1
numer.model <- glm(attrition ~ 1, family = binomial(), data = data_visit1)
# predicting the numerator values
pn_visit1 <- predict(numer.model, type = "response")
# Putting together the actual weights
data_visit1$weight <- ifelse(data2$attrition == 1, pn_visit1 / pd_visit1, (1- pn_visit1)/(1 - (pd_visit1)))
Following this, I rejoin the weights back to the full dataset and then repeat the process for each round of follow up visits. So my question is, does this all look good? I would love any and all feedback on my approach. Thanks so much!

Related

In Rjags/runjags, what causes the "node inconsistent with parents" error when using dinterval?

I have wracked my brain trying to come up with a solution to this problem and I'm at wits end! First, the necessary context: Aquatic plants in lakes are sampled with rakes. You throw a rake out into the lake, you pull it back into your boat, and you figure out what plants are on its tines. In our case, we measure both presence/absence as well as "abundance," but in an ordinal/interval-censored way --> it's 0 if species X isn't noticed on the rake at all, 1 if it covers < 25% of the rake's tines, 2 if it covers between 25 and 75%, and 3 if it covers > 75%. However, it's fairly easy to miss a species entirely when it's in low abundance, so 0s are sketchy--they may not represent true absences, and that is really the issue our model is trying to explore.
So, there are really three layers here--a true, fully latent abundance that we don't observe directly at all, a partially latent "true presence/absence" in that we know where true presences are but not where true absences are, and then we have our observed presence/absence data. What's more interesting is that we think some environmental variables may affect both true abundance and true occurrence but differently, and then other variables may affect detectability, and it's those processes we're trying to tease apart.
So, anyhow, my actual model is much larger and more complicated than what I've pasted below, but here is a sort of functional (but probably academically meritless) training version of it that replicates the error I am getting.
#data setup
N = 1500 #Number of cases
obs = sample(c(0,1,2,3), N,
replace=T, prob=c(0.7, 0.2, 0.075, 0.025)) #Our observed, interval-censored data.
X1 = rnorm(N) #Some covariate that probably affects both occurrance and abundance but maybe in different ways.
abundances = rep(NA, times = N) #Abundance is a latent variable we don't directly observe. From elsewhere, I know the values here need to be NAs so the model will know to impute them
occur = rep(1, times = N) #Occurance is a degraded form of our abundance data.
#d will be the initials for the abundance data, since this is apparently needed to jumpstart the imputation.
d = vector()
for(o in 1:N) {
if (obs[o]==0) { d[o] = 0.025; occur[o] = 0 }
if (obs[o]==1) { d[o] = 0.15 }
if (obs[o]==2) { d[o] = 0.5 }
if (obs[o]==3) { d[o] = 0.875 }
}
#Data
test.data = list("N" = N,
"obs" = obs,
"X1" = X1,
"abund" = abundances,
"lim" = c(0.05, 0.25, 0.75, 0.9999),
"occur" = occur)
#Inits
inits = list(abund = d)
cat("model
{
for (i in 1:N) {
obs[i] ~ dinterval(abund[i], lim)
abund[i] ~ dbeta(theta[i], rho[i]) T(0.0001, 0.9999)
theta[i] <- mu[i] * epsilon
rho[i] <- epsilon * (1-mu[i])
logit(mu[i]) <- alpha1 + X.beta1 * X1[i]
occur[i] ~ dbern(phi[i])
logit(phi[i]) <- alpha2 + X.beta2 * X1[i]
}
#Priors
epsilon ~ dnorm(5, 0.1) T(0.01, 10)
alpha1 ~ dnorm(0, 0.01)
X.beta1 ~ dnorm(0, 0.01)
alpha2 ~ dnorm(0, 0.01)
X.beta2 ~ dnorm(0, 0.01)
}
", file = "training.txt")
test.run = jags.model(file = "training.txt", inits = inits, data=test.data, n.chains = 3)
params = c("epsilon",
"alpha1",
"alpha2",
"X.beta1",
"X.beta2")
run1 = run.jags("training.txt", data = test.data, n.chains=3, burnin = 1000, sample = 5000, adapt = 4000, thin = 2,
monitor = c(params), method="parallel", modules = 'glm')
At the end, I get this error, and I always get this error any time I try to do something even remotely like this:
Graph information: Observed stochastic nodes: 3000 Unobserved
stochastic nodes: 1505 Total graph size: 19519 . Reading
parameter file inits1.txt. Initializing model Error in node obs1
Node inconsistent with parents
I've read every posting that covers this error I can find, including this one, this one, this one, and this one. I can surmise from my research and testing that the error is probably occurring for one of the following reasons.
My initials for the latent abundance variable are not adequate somehow. It sounds like this requires pretty useful initial values to work.
One or more of my priors is allowing values that are not permissible OR they are too broad and that's causing problems somehow. This might be especially an issue because of the beta distribution I am using which has strong requirements about not having values outside of 0 and 1.
I am using the dinterval() function incorrectly, which seems likely because it is always the line containing it that trips the error.
My model is somehow mis-specified.
But I can't see where I might be going wrong--I have tried a number of different options for 1 and 2, and so far as I can tell from the documentation (see pages 55-56), I am using dinterval correctly. What am I missing??
In case it's relevant, from what I have gathered, the idea of dinterval() is that the variable on the left of the ~ is the interval-censored version of the variable given in the first argument (here, abundance). Then, the second argument (here, lim) is a vector of "breakpoints" that dictate which intervals the abundance data end up in. So, here, you end up with an observed abundance code of 0 if you are lower than the lowest lim (here, 0.05), 1 if you are in between the first two values in lim, etc. It's like the abundance variable is being pushed through a "binning sieve" created by the lim variable to produce a binned output variable, our observed abundances.
Any guidance would be most welcome!!
I have run your example with JAGS 4.3.0 and rjags 4-12. For me, the version with rjags runs correctly. The version with runjags does not work because you have not provided intial values. This is easily fixed by adding the argument
inits=list(inits, inits, inits)
to the call to run.jags().
You have correctly understood the purpose of dinterval. This is an "observable function" which imposes constraints on its parameters via a likelihood. When using dinterval you must always provide initial values that satisfy the constraints from the fist iteration. As far as I can see, your initial values do satisfy the constraints and this is verified by the fact that I can run your example (with initial values).

How to conduct parametric bootstrapping in R?

I am working with the orings data set in the faraway package in R. I have written the following grouped binomial model:
orings_model <- glm(cbind(damage, 6-damage) ~ temp, family = binomial, data = orings)
summary(orings_model)
I then constructed the Chi-Square test statistic and calculated the p-value:
pchisq(orings_model$null.deviance, orings_model$df.null,lower=FALSE)
First, I would like to generate data under the null distribution for this test statistic using rbinom with the average proportion of damaged o-rings (i.e., the variable "damage"). Second, I would like to recompute the above test statistic with this new data. I am not sure how to do this.
And second, I want to the process above 1000 times, saving the test statistic
each time. I am also not sure how to do this. My inclination is to use a for loop, but I am not sure how to set it up. Any help would be really appreciated!
It is not completely clear what you're looking to do here, but we can at least show some quick principles of how we can achieve this, and then hopefully you can get to your goal.
1) Simulating the null model
It is not entirely clear that you would like to simulate the null model here. It seems more like you're interested in simulating the actual model fit. Note that the null model is the model with form cbind(damage, 6-damage) ~ 1, and the null deviance and df are from this model. Either way, we can simulate data from the model using the simulate function in base R.
sims <- simulate(orings_model, 1000)
If you want to go the manual way estimate the mean vector of your model and use this for the probabilities in your call to rbinom
nsim <- 1000 * nrow(orings)
probs <- predict(orings_model, type = 'response')
sims_man <- matrix(rbinom(nsim, 6, probs),
ncol = 1000)
# Check they are equal:
# rowMeans(sims_man) - probs
In the first version we get a data.frame with 1000 columns each with a n times 2 matrix (damage vs not damage). In the latter we just summon the damage outcome.
2) Perform the bootstrapping
You could do this manually with the data above.
# Data from simulate
statfun <- function(x){
data <- orings_model$data
data$damage <- if(length(dim(x)) > 1)
x[, 1]
else
x
newmod <- update(orings_model, data = data)
pchisq(newmod$null.deviance, newmod$df.null, lower=FALSE)
}
sapply(sims, statfun)
# data from manual method
apply(sims_man, 2, statfun)
or alternatively one could take a bit of time with the boot function, allowing for a standardized way to perform the bootstrap:
library(boot)
# See help("boot")
ran_gen <- function(data, mle){
data$damage <- simulate(orings_model)[[1]][,1]
data
}
boot_metric <- function(data, w){
model <- glm(cbind(damage = damage, not_damage = 6 - damage) ~ temp,
family = binomial, data = data)
pchisq(model$null.deviance,
model$df.null,
lower=FALSE)
}
boots <- boot(orings, boot_metric,
R = 1000,
sim = 'parametric',
ran.gen = ran_gen,
mle = pchisq(orings_model$null.deviance,
orings_model$df.null,
lower=FALSE))
At which point we have the statistic in boots$t and the null statistic in boots$t0, so a simple statistic can be estimated using sum(boots$t > boots$t0) / boots$R (R being the number of replication).

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I used revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(by using the revoScaleR package I occasionally painting myself into a corner like this, because not many other seem to use it)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly
Two things to try:
Expand data, get rid of weights statement
use cbind(y,x-y)~1 in either rxLogit or rxGlm without weights and without expanding data
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to each 1 or 0 response and then this expanded data is run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm1b), because that would allow data to stay same size… from the rxLogit page, I'm guessing not for rxLogit, but rxGLM might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR
functions can be a single variable or multiple variables combined
using cbind, the "~" operator, and one or more predictors,typically
separated by the "+" operator. The rxSummary function typically
requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1), # number of successes
trial=c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
trial=c("first","second","third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
family = binomial,
data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y,x-y)~1,
data=df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???

Need help applying regression model to dataset in R (sports data)

Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on their pass yards and rush yards. I was able to get all the way to figuring out the regression equation but from here I do not know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows and their offensive stats listed in columns
Code:
# 1. Import
Offense <- read.csv(file.choose(), header=TRUE)
#2 View
show (Offense)
#3 Attach so headers can be referenced
attach (Offense)
#4 Create Regression Model
mod1 <-lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
#Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#Plug in the Regression Equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I
get it to display in a ranked list
get it to display, say, the team name as well as the projected score (so I don't have to wonder what team "1" or "2" refer to
Since I have the equation, could I also just write a loop function that ran the equation for every row of data I have and print the results?
I'm a beginner so much appreciated!
Update: Came up with this
####Part 2. Interpretation
#1. Examining quality of model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
#2. Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#3. Predicted Points (Descending Order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
#4. Corresponding Name (Descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
#Data Frame
Projections <- data.frame(name, proj)
Projections
While bbrot provided a much simpler version
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
sort(predict(mod1), decreasing = TRUE))
# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
Points = sort(predict(mod1), decreasing = TRUE))

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly produce a multinomial logistic model with mlogit here (Formatting data for mlogit)
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social and Responsive) Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesnt work. I confess although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item ItemK, then we're modeling
log[ Pr(Itemk is 1st) / Pr(ItemK is 1st) ] = αk + xTβk
for k = 1,...,K-1, where Itemk is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with Ranked Data, not just Multinomial Choice Data. The structure for the Ranked data in mlogit is that first set of records for a person are all options, then the second is all options except the one ranked first, and so on. But the index assumes equal number of options each time. So a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

Resources