Survey Weighted Regression Without FPC in R

I'm using svydesign() from the survey package in R to run survey-weighted logit regressions as follows:
sdobj <- svydesign(id = ~0, weights = ~chweight, strata = ~strata, data = svdat)
model1 <- svyglm(formula = formula1, design = sdobj, family = quasibinomial)
However, the documentation states a caveat about regressions without specifying finite population corrections (FPC):
If fpc is not specified then sampling is assumed to be
with replacement at the top level and only the first stage of
cluster is used in computing variances.
Unfortunately, I do not have sufficient information to specify the population at each level (of which I sample very little). Any information on how to specify survey weights without FPC information would be very helpful.

You're doing it right. "With replacement" is survey-statistics jargon for exactly what you want in this case.
If the sampling fraction is low, it is standard to use an approximation that would be exact if the sampling fraction were infinitesimal or if sampling were with replacement. No one actually does surveys with replacement, but the approximation is almost universal. With this approximation you don't need to supply fpc, and conversely, if you don't supply fpc, svydesign() assumes you want this approximation.
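As a minimal sketch of the contrast (assuming a hypothetical column fpc_stratum holding the population size of each stratum, which the question does not have), the design without fpc is exactly the one from the question:
library(survey)
# Without fpc: variances use the with-replacement approximation described above
sdobj_wr <- svydesign(id = ~0, weights = ~chweight, strata = ~strata, data = svdat)
# With fpc: only possible if the stratum population sizes were actually known
sdobj_fpc <- svydesign(id = ~0, weights = ~chweight, strata = ~strata,
                       fpc = ~fpc_stratum, data = svdat)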

Related

Survdiff (package survival) with frequency weights (obtained after Coarsened Exact Matching - package matchit)

I am doing a counterfactual impact evaluation on survival data. More precisely, I am trying to evaluate the impact of vocational training on time spent in unemployment. I use the Kaplan-Meier estimator of the survival curve (package survival).
Before running Kaplan-Meier, I use coarsened exact matching (the aim is the ATT) to get the control and treatment groups close in terms of pretreatment covariates (package MatchIt).
For the Kaplan-Meier estimator, I have to use the weights from the matching, which works well using the weights option and robust standard errors of survfit:
library(survival)
library(survminer)
kp_cem <- survfit(Surv(time = time_cem, event = status_cem) ~ treatment_cem, data = data_impact_cem, robust = TRUE, weights = weights)
However, when I try to use a log-rank test to test for the difference in survival curves between the treatment and control groups, I cannot take the frequency weights from the matching into account, so the test statistics are not correct.
log_rank <- survdiff(Surv(time = time_cem, event = status_cem) ~ treatment_cem, data = data_impact_cem, rho = 0)
I tried the option pval = TRUE of ggsurvplot (package survminer), but the problem is the same: the frequency weights are not taken into account.
How can I include frequency weights in survdiff? Are there other packages that compute a log-rank test taking frequency weights (obtained after matching) into account?
There are at least two ways to do this:
First, you can use the survey::svylogrank function, as @IRTFM suggests. This will treat the weights as sampling weights, but I think that's OK with the robust standard errors that svylogrank uses.
Second, you can use survival::coxph. The log-rank test is the score test in a Cox model, and coxph takes frequency weights. Use robust=TRUE if you want a robust score test: it will be at the bottom of the output of summary(your_cox_model), and you can extract it as summary(your_cox_model)$robscore.
Thank you very much @Thomas Lumley and @IRTFM for your answers.
Here is how I applied your two suggestions (I added some comments and references).
1. Using survey::svylogrank
I don't feel very comfortable using sampling weights when what I really have are frequency weights.
How should I specify the survey design? The weights come from Coarsened Exact Matching (matchit with method = "cem"), which is a form of stratum matching.
Should I specify the strata as well as the weights in the survey design? In the MatchIt vignette Estimating Effects After Matching, it is suggested to use only the weights and robust standard errors in the survival analysis, not the strata (p. 27).
Here is how I specify the design and obtain the log-rank test with the survey package, taking the weights from matching into account:
library(survey)
design_weights <- svydesign(id=~ibis, strata=~subclass, weights=~weights, data=data_impact_cem)
log_rank <- svylogrank(Surv(time = time_cem, event = status_cem) ~ treatment_cem, design = design_weights, rho = 0)
2. Using survival::coxph
Thank you for this piece of information. Being quite new to survival analysis, I had overlooked this nice property, the equivalence between the score test of a Cox model and the log-rank test. For readers wanting more on this subject, I found this book very instructive: Moore, D. (2016). Applied Survival Analysis Using R. New York, NY: Springer (p. 58).
I find this second option more attractive than the first one involving survey. Here is how I apply it:
library(survival)
cox_cem <- coxph(Surv(time = time_cem, event = status_cem) ~ treatment_cem, data = data_impact_cem, robust = TRUE, weights = weights)
sum_cox_cem <- summary(cox_cem)
# The robust score test from the Cox model plays the role of the weighted log-rank test
score_test <- round(sum_cox_cem$robscore[["test"]], 3)
pvalue <- sum_cox_cem$robscore[["pvalue"]]
pvalue <- if (pvalue < 0.001) "<0.001" else round(pvalue, 3)
Here is the difference between the two test statistics (quite close in the end): [screenshot of the two test results]
That said, I still wonder why a weights option does not exist in survdiff.

Propensity score matching with individual weights

I'm trying to perform propensity score matching on survey data. I'm aware that the MatchIt package can perform the matching procedure, but can I include the individual weights in some way? If I don't take them into account, a less relevant observation can be matched with a more relevant one. Thank you!
Update 2020-11-25 below this answer.
Survey weights cannot be used with matching in this way. You might consider using weighting instead, which can accommodate survey weights. With weighting, you estimate the propensity score weights using a model that accounts for the survey weights, and then multiply the estimated weights by the survey weights to arrive at your final set of weights.
This can be done using the weighting companion to the MatchIt package, WeightIt (of which I am the author). With your treatment A, outcome Y (I assume continuous for this demonstration), covariates X1 and X2, and sampling weights S, you could run the following:
library(WeightIt)
# Estimate the propensity score weights
w.out <- weightit(A ~ X1 + X2, data = data, s.weights = "S",
                  method = "ps", estimand = "ATT")
# Combine the estimated weights with the survey weights
att.weights <- w.out$weights * data$S
# Fit the outcome model with the weights
fit <- lm(Y ~ A, data = data, weights = att.weights)
# Estimate the effect of treatment and its robust standard error
lmtest::coeftest(fit, vcov. = sandwich::vcovHC)
It's critical that you assess balance after estimating the weights; you can do that using the cobalt package, which works with WeightIt objects and automatically incorporates the sampling weights into the balance statistics. Prior to estimating the effect, you would run the following:
cobalt::bal.tab(w.out, un = TRUE)
Only if balance was achieved would you continue on to estimating the treatment effect.
There are other ways to estimate weights besides using logistic regression propensity scores. WeightIt provides support for many methods, and almost all of them support sampling weights. The documentation for each method explains whether sampling weights are supported.
MatchIt 4.0.0 now supports survey weights through the s.weights argument, just like WeightIt. This supplies the survey weights to the model used to estimate the propensity scores but otherwise does not affect the matching. If you want units to be paired with other units that have similar survey weights, you should enter the survey weights as a variable to match on or to place a caliper on; a sketch of that idea follows.
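As a minimal, hypothetical sketch (reusing the A, X1, X2 and S names from above, with a caliper width chosen purely for illustration), matching on the survey weights with a caliper might look like this:
library(MatchIt)
# S feeds the propensity score model via s.weights and also enters the matching,
# with a caliper (in standard-deviation units by default) so that paired units
# have similar survey weights.
m.out <- matchit(A ~ X1 + X2 + S, data = data, method = "nearest",
                 s.weights = "S", caliper = c(S = 0.2), estimand = "ATT")
# Check balance before estimating the treatment effect
cobalt::bal.tab(m.out, un = TRUE)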

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to the global GLM (1149.2), I get a delta AIC > 2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM.
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable, but the results are the same.
After reading up on options, I tried running generalized estimating equations (GEEs) in geepack, thinking this would give me more flexibility in explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with a GLMM). However, I realized that my data still demanded compound symmetry ("exchangeable" in geepack), since I don't have a temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling, I don't know (a) whether this is a valid approach for these data or (b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative of some constructive feedback on 1) which direction to go once I realized that the GLMM (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000 m), 2) whether GEE models make sense at this point, and 3) whether I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the package nlme and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this approach the spatial autocorrelation is treated as continuous and approximated by a global function; a sketch is given below.
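Because nlme::lme itself only handles Gaussian responses, one common way to combine a binomial response, a random effect and an nlme correlation structure is MASS::glmmPQL. This is only a hedged sketch reusing the variable names from the question, and corExp is just one of the structures you could try:
library(MASS)   # glmmPQL, which reuses nlme's correlation classes
library(nlme)
# Penalised quasi-likelihood fit with an exponential spatial correlation
# structure on the residuals, within the predator-group random effect
pql_global <- glmmPQL(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                        Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                      random = ~ 1 | Group,
                      correlation = corExp(form = ~ X + Y),
                      family = binomial, data = Data_scaled)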
Second, you could go with the package mgcv and add a bivariate spline over the spatial coordinates to your model. This way you can capture a spatial pattern and even map it. Strictly speaking, this method doesn't model the spatial autocorrelation itself, but it may solve the problem. If space is discrete in your case, you could go with a Markov random field smooth instead. This website is very helpful for examples: https://www.fromthebottomoftheheap.net. A sketch follows.
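A hedged sketch of the bivariate-spline idea, again reusing the question's variable names and keeping the predator group as a random-effect smooth (this assumes Group is a factor):
library(mgcv)
gam_global <- gam(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    s(X, Y) +             # bivariate spline over the coordinates
                    s(Group, bs = "re"),  # predator group as a random effect
                  data = Data_scaled, family = binomial, method = "REML")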
Third, you could go with the package brms. It allows you to specify very complex models with other correlation structures in the residuals (CAR and SAR). The package uses a Bayesian approach; a rough sketch is given below.
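A very rough, hypothetical sketch of a CAR term in brms: it assumes you have discretised the study area into spatial units, built an adjacency matrix W for those units, and stored each observation's unit in a column called cell (none of which appears in the question):
library(brms)
fit_car <- brm(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                 Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                 car(W, gr = cell, type = "icar"),  # intrinsic CAR over the units
               data = Data_scaled, data2 = list(W = W),
               family = bernoulli())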
I hope this helps. Good luck!

Feature Selection for Regression Models in R

I'm trying to find a feature selection package in R that can be used for regression; most packages implement their methods for classification, using a factor or class as the response variable. In particular, I'm interested in whether there is a method using random forests for that purpose. A good paper in this field would also be helpful.
IIRC the randomForest package also does regression trees. You could start with the Breiman paper and go from there.
There are many ways you can use random forests to calculate variable importance.
I. Mean Decrease in Impurity (MDI) / Gini Importance:
This makes use of a random forest model or a decision tree. When training a tree, each feature's importance is measured by how much it decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure. Here is an example in R (note that for a regression target the importance column is called IncNodePurity rather than MeanDecreaseGini).
library(randomForest)
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
# type = 2: mean decrease in node impurity
var.imp1 <- data.frame(importance(fit, type = 2))
var.imp1$Variables <- row.names(var.imp1)
# Order by the importance column (first column), whatever its name
varimp1 <- var.imp1[order(var.imp1[, 1], decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
giniplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Gini Impurity Index Plot")
And the output will look like this: Gini Importance Plot
II. Permutation Importance or Mean Decrease in Accuracy (MDA): this is assessed for each feature by breaking the association between that feature and the target. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. The influence of the correlated features is also removed. (For a regression forest the corresponding measure is %IncMSE.) Example in R:
library(randomForest)
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
# type = 1: mean decrease in accuracy (permutation importance)
var.imp1 <- data.frame(importance(fit, type = 1))
var.imp1$Variables <- row.names(var.imp1)
# Order by the importance column (first column); there is no MeanDecreaseGini here
varimp1 <- var.imp1[order(var.imp1[, 1], decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
permplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Permutation Importance Plot")
These two are the ones that use random forests directly. There are some other easy-to-use metrics for variable importance as well: the Boruta method, and weight of evidence (WOE) with information value (IV), might also be helpful. A small Boruta sketch follows.
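As a minimal sketch (assuming the same Target and training_data as above), Boruta wraps random forest permutation importance and works for regression responses as well as classification:
library(Boruta)
bor <- Boruta(Target ~ ., data = training_data, doTrace = 0)
print(bor)                  # features labelled confirmed / tentative / rejected
getSelectedAttributes(bor)  # names of the confirmed features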

specifying probability weights in R *without* using Lumley survey package

I would really appreciate any help with specifying probability weights in R without using the Lumley survey package. I am conducting a mediation analysis in R using the Imai et al. mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)
However, I'm unsure whether this is weighting the data correctly, because this code yields standard errors that differ from those I am getting in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering whether the weights option in R might not be for inverse probability weights, but I was unable to determine this from the documentation, this forum, or elsewhere. If I am missing something, I apologize; I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the values given to the weights parameter of lm should be inversely proportional to the variances of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct a previous answer: I looked up the manual entry on weights and found the following description for weights in lm:
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata). They multiply out each observation n times, as defined by the weight vector. Probability weights, on the other hand, refer to the probability that an observation is included in the sample. Using them adjusts the impact of each observation on the coefficients but not on the standard errors, as they don't change the number of observations represented in the sample. A hedged sketch of one way to handle this without the survey package is given below.
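For what it's worth, a sketch of one way to approximate Stata's [pweight=] results without the survey package: the point estimates match weighted least squares, and Stata's pweight standard errors are heteroskedasticity-robust, so pairing lm(weights = ...) with a sandwich (HC1) variance estimator should come very close (this reuses the model from the question):
library(sandwich)
library(lmtest)
fit <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
          data = unifiedanalysis, weights = designweight)
# HC1 mirrors Stata's default small-sample correction for robust standard errors
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))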
