Survey weights and boostrap wieghts to get counts and CI's - r

I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescrp is a binary variable indicating whether or not the subjects is on prescription meds and BSW1 - BSW500 are the bootstrap weights and SURV_WGT is the survery weight per subject. There are roughly 20000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't just simply add the number of people in each category to get the final count N for each question/category (i.e., cannot just add all the males 17 to 24 who are using prescription meds). Instead, I have to take into account the survery weights and bootstrap weights when constructing my final count N and confidence intervals.
I was then told in STATA this is a one line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20000 indiviudals is a sample of a population of roughly 35 million.

You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
mydesign <- svrepdesign(data = mydata,
weights = ~SURV_WGT,
repweights = "BSW[0-9]+",
type = "bootstrap",
combined.weights = TRUE)
You should read the documentation, but briefly: data will be your data frame, weights takes your single survey weight vector, usually as a formula, repweights is great in that it accepts a regex string that identifies all the replicate weight columns in your data by column name, type tells the design what your replicate weights are (how they were derived), combined.weights is logical for whether the replicate weights contain sampling weights - I assume this is true but it may not be.
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr, # variable to pass to function
by = ~SEX, # grouping
design = mydesign, # design object
vartype = "ci", # report variation as confidence interval
FUN = svymean # specify function from survey package, mean here
)
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that is coded for each age group and use that new variable in your svyby call.

Related

How to graph multiply imputation survey data

I'm working with Federal Reserve Survey of Consumer Finances (SCF) data, which expands the ~6500 actual observed responses into ~29,000 entries through multiple imputation. I'm able to generate summary statistics (counts, means, quantiles, etc.) using scf_MIcombine in the lodown package, but I'm having a lot of trouble representing it visually. The functions that account for multiple imputation tend to spit out svyimputationlist objects, which are challenging to cast into objects that ggplot can understand.
For example:
scf_design <-
svrepdesign(
weights = ~wgt ,
repweights = scf_rw[ , -1 ] ,
data = imputationList( scf_imp ) ,
scale = 1 ,
rscales = rep( 1 / 998 , 999 ) ,
mse = FALSE ,
type = "other" ,
combined.weights = TRUE
)
scf_design_work <- subset(scf_design, age>24 & age<65)
tab_knolLIT <- scf_MIcombine(with(svytable(~finlit+knowlcat, design = subset(scf_design_work, finlit!=0))))
#Error in UseMethod("svytable", design) : no applicable method for 'svytable' applied to an object of class "svyimputationList"
Any suggestions?
To graph multiply imputed survey data, you can use the mi or mice packages in R. Both packages offer functions for creating graphs based on multiply imputed data. Here is a step by step process:
Combine the multiple imputed data using pool() or complete()
functions in the mi or mice package respectively.
Use ggplot2 to create visualizations. The ggplot2 library is
powerful and flexible, allowing you to create a wide range of
visualizations.
Use mi_boxplot() or mi_histogram() to create box plots or histograms for each variable. These functions take multiply imputed data as input and automatically handle the uncertainty introduced by the multiple imputations.
Use mi_qqplot() to create a quantile-quantile plot, which can help
you to assess whether the distribution of the data is normal.
Use mi_scatterplot() to create a scatterplot matrix, which can help
you visualize the relationship between multiple variables.
Note that it is important to account for the uncertainty introduced by multiple imputation when interpreting results, as the estimates and standard errors may be different than if a single imputed dataset was used.

How to use the "how" function for an unbalanced repeated design

I have a set of control and treated plots which had been sampled during years. I run the prc function in the vegan package and want to perform a permutation test to check whether control vs treated plots significantly differ during years. As my data is unbalanced, I can not use strata function. my code look like:
library(vegan)
year=as.factor(c(rep(1995,8),rep(1999,8),rep(2001,8),rep(2013,4),rep(1995,4),
rep(1999,4),rep(2001,4),rep(2013,4)))
treatment=as.factor(c(rep("control",28),rep("treated",16)))
I've written this, but I'm sure that it is wrong because the treatment is missing here:
h1 <- how(within = Within(type = "series", mirror = F),
blocks = year, nperm = 999
)
Any suggestions is greatly appreciated.
Under the null hypothesis, samples from the control or treated groups are exchangeable and hence you don't want them in the permutation design; you really want to permute them to generate the permutation-based null distribution for the test statistic.
The permutation design is there to indicate what isn't exchangeable.
You haven't explained why you want samples within the blocks to be permuted in series; why are samples within years also time series? If they're not, you don't want this.
You only need to worry about imbalance if you want to permute the strata. Whilst using blocks is similar in some respects to strata, blocks are never permuted so if you can use blocks you can use strata as you won't be permuting them.
If you want to permute the years as groups of samples, then you'll need strata and you'll need balance at the year level, which you don't have.
What you have defined with your call to how() is:
groups samples by year and as such samples will never be swapped between years, and
samples within the levels of year will be permuted in series, keeping their temporal order intact after applying cyclic shift permutations.
If that's not what you want to do, you need to explain in words what you want to do. By "do" I mean what is it you want to test? What is your model in vegan?

Reproduction of Stata design of a national multi-staged sample in R survey package

I have a 3-stage stratified sampling design of a national survey. I have code in Stata to do the weighting, but I struggle to reproduce it with the R survey package.
Sampling design is the following. The sampling universe is stratified into 49 strata. Within each stratum, sampling is done in three stages. (1) PPS selection of a precinct. (2) Random systematic selection of a household using random route technique, basically. (3) Random systematic selection of a respondent within a household using some Kish technique modification.
There is a Stata weighting code that is assumed to perform well:
svyset precinct [pweight=indwt], strata(strt) fpc(npsu) singleunit(certainty) || qnum, fpc(nhh) || _n, fpc(nhhm)
Here, precinct are precinct numbers, strt are strata codes. Population sizes used for FPC are npsu - number of PSUs (i.e. precincts) per stratum, nhh - numbers of households in PSUs, nhhm - number of eligible members in a household. qnum - questionnaire unique numbers, which are the same for both selected households and respondents.
I try to reproduce it with the following R code.
library(survey)
options("survey.lonely.psu" = "certainty")
svy_data <- svydesign(ids = ~precinct + qnum,
strata = ~strt,
weights = ~indwt,
fpc = ~npsu + nhh,
data = data)
I can't do fpc = ~npsu + nhh + nhhm, because then I get an error:
Error in popsize < sampsize : non-conformable arrays.
Resulting confidence intervals through confint(svymean(...)) in R don't match Stata confidence intervals through tabout ado package. They are close, but shifted a bit in R.
My assumption is that I should do something that Stata's _n term does, and get a 3-stage design instead of a 2-stage one. How could I do that?
Or is there anything else I can try to improve in my R code to match Stata?

Regressing out or Removing age as confounding factor from experimental result

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't get how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data=data.frame(data_val))
modcancer = model.matrix(~Age, data=data.frame(data_val))
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking p-val of correlation of residuals and confounding factors. I don't get why to test correlation of residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
There are many methods to solve this:-
Basic statistical approach
A very basic method to incorporate the effect of Age parameter in the data and make the final dataset age agnostic is:
Do centring and scaling of your data based on Age. By this I mean group your data by age and then take out the mean of each group and then standardise your data based on these groups using this mean.
For standardising you can use two methods:
1) z-score normalisation : In this you can change each data point to as (x-mean(x))/standard-dev(x)); by using group-mean and group-standard deviation.
2) mean normalization: In this you simply subtract groupmean from every observation.
3) min-max normalisation: This is a modification to z-score normalisation, in this in place of standard deviation you can use min or max of the group, ie (x-mean(x))/min(x)) or (x-mean(x))/max(x)).
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using some algorithms like PCA(principle component analysis) (https://en.wikipedia.org/wiki/Principal_component_analysis), though it is generally used as a dimensionality reduction algorithm, still it can be used to get the variance in the whole data set and also get the importance of features.
Below is a simple example explaining it:
I have plotted the importance using the biplot and graph, using the decathlon dataset from factoextra package:
library("factoextra")
data(decathlon2)
colnames(data)
data<-decathlon2[,1:10] # taking only 10 variables/columns for easyness
res.pca <- prcomp(data, scale = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
output
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
On these similar lines you can use PCA on your data to get the importance of the age parameter in your data.
I hope this helps, if I find more such methods I will share.

SVD with missing values in R

I am performing a SVD analysis with R, but I have a matrix with structural NA values. Is it possible to obtain a SVD decomposition in this case? Are there alternative solutions? Thanks in advance
You might want to try out the SVDmiss function in SpatioTemporal package which does missing value imputation as well as computes the SVD on the imputed matrix. Check this link SVDmiss Function
However, you might want to be wary of the nature of your data and whether missing value imputation makes sense in your case.
I have tried using the SVM in R with NA values without succes.
Sometimes they are important in analysis so I usually transform my data as follows:
If you have lots of variables try to reduce their number (clustering, lasso, etc...)
Transform the remaining predictors like this:
- for quantitative variables:
- calculate deciles per predictor (leaving missing obs out)
- calculate frequency of Y per decile (assuming Y is qualitative)
- regroup deciles on their Y freq similarity into 2/3/4 groups
(you can do this by looking at their plot too)
- create for each group a new binary variable
(X11 = 1 if X1 takes values in the interval ...)
- calculate Y frequency for missing obs of that predictor
- join the missing obs category to the variable that has the closest Y freq
- for qualitative variables:
- if you have variables with lots of levels you should do clustering by Y
variable
- for variables with lesser levels, you can calculate Y freq per class
- regroup the classes like above
- calculate the same thing for missing obs and attach it to the most similar
group of non-missing
- recode the variable as for numeric case*
There, now you have a complete database of dummy variables and the chance to perform SVM, neural networks, LASSO, etc...

Resources