How to apply weights associated with the NIS (National Inpatient Sample) in R

I am trying to apply weights given with NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and survey commands.
This is what I have tried:
# Create the unweighted dataset
d <- read.dta13(path)
# The sum of the discharge weights gives the correct weighted number of cases.
sum(d$DISCWT)
library(survey)
# Create survey object
dsvy <- svydesign(id = ~ d$HOSP_NIS, strata = ~ d$NIS_STRATUM, weights = ~ d$DISCWT, nest = TRUE, data = d)
d$count <- 1
svytotal(~d$count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!

The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions).
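As a rough illustration of the option above, here is a minimal sketch on made-up data (the data frame `toy` and its columns are hypothetical, not NIS variables). Stratum "B" contains a single PSU, which is the situation that triggers the error under the default setting; with the option set to "adjust", the total and a standard error can be computed.

```r
library(survey)

# Hypothetical toy data: stratum "B" contains only one PSU (hospital 3)
toy <- data.frame(
  hosp    = c(1, 1, 2, 2, 3, 3),
  stratum = c("A", "A", "A", "A", "B", "B"),
  wt      = c(5, 5, 5, 5, 10, 10),
  count   = 1
)

# Centre the lonely PSU at the overall mean instead of stopping with an error
options(survey.lonely.psu = "adjust")

# Variables are referenced by bare name inside the formulas
des <- svydesign(ids = ~hosp, strata = ~stratum, weights = ~wt,
                 nest = TRUE, data = toy)

# Weighted number of cases, with a standard error
est <- svytotal(~count, des)
print(est)
```

The point estimate is just the sum of the weights; the option only changes how the single-PSU stratum contributes to the variance.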

Related

R and multiple time series and Error in model.frame.default: variable lengths differ

I am new to R and I am using it to analyse time series data (I am also new to this).
I have quarterly data for 15 years and I am interested in exploring the interplay between drinking and smoking rates in young people - treating smoking as the outcome variable. I was advised to use the gls command in the nlme package as this would allow me to include AR and MA terms. I know I could use more complex approaches like ARIMAX but as a first step, I would like to use simpler models.
After loading the data, specify the time series
data.ts = ts(data=data$smoke, frequency=4, start=c(data[1, "Year"], data[1, "Quarter"]))
data.ts.dec = decompose(data.ts)
After decomposing the data and some tests (KPSS and ADF test), it is clear that the data are not stationary so I differenced the data:
diff_dv<-diff(data$smoke, difference=1)
plot.ts(diff_dv, main="differenced")
data.diff.ts = ts(diff_dv, frequency=4, start=c(data[1, "Year"], data[1, "Quarter"]))
The ACF and PACF plots suggest AR(2) should also be included so I set up the model as:
mod.gls = gls(diff_dv ~ drink+time , data = data,
correlation=corARMA(p=2), method="ML")
However, when I run this command I get the following:
"Error in model.frame.default: variable lengths differ".
I understand from previous posts that this is due to the differencing and the fact that the diff_dv is now shorter. I have attempted fixing this by modifying the code but neither approach works:
mod.gls = gls(diff_dv ~ drink+time , data = data[1:(length(data)-1), ],
correlation=corARMA(p=2), method="ML")
mod.gls = gls(I(c(diff(smoke), NA)) ~ drink+time+as.factor(quarterly) , data = data,
correlation=corARMA(p=2), method="ML")
Can anyone help with this? Is there a workaround which would allow me to run the -gls- command or is there an alternative approach which would be equivalent to the -gls- command?
As a side question, is it OK to include time as I do - a variable with values 1 to 60? A similar question is for the quarters which I included as dummies to adjust for possible seasonality - is this OK?
Your help is greatly appreciated!
Specify na.action = na.omit or na.action = na.exclude to omit the rows with NA's. Here is an example using the built-in Ovary data set. See ?na.fail for info on the differences between these two.
library(nlme)  # provides gls() and the Ovary data set
Ovary2 <- transform(Ovary, dfoll = c(NA, diff(follicles)))
gls(dfoll ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary2,
correlation = corAR1(form = ~ 1 | Mare), na.action = na.exclude)
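The same idea can be sketched for the setup in the question. This is only an illustration on simulated data: the data frame `dat` and the column `d_smoke` are hypothetical stand-ins for the real quarterly series. The differenced outcome is padded with a leading NA so that it has as many rows as the predictors, and na.action = na.omit drops that row before fitting.

```r
library(nlme)

set.seed(1)
n <- 60  # 15 years of quarterly observations

# Simulated stand-in for the real data (assumption, not the asker's data)
dat <- data.frame(
  time  = 1:n,
  drink = rnorm(n),
  smoke = cumsum(rnorm(n))  # non-stationary series
)

# Pad the first difference with NA so it lines up with the other columns
dat$d_smoke <- c(NA, diff(dat$smoke))

# The row containing the NA is dropped by na.omit before the model is fitted
fit <- gls(d_smoke ~ drink + time, data = dat,
           correlation = corARMA(p = 2), method = "ML",
           na.action = na.omit)
print(fit)
```

Keeping the differenced variable inside the data frame avoids the length mismatch, because every variable in the formula then has the same number of rows.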

Problems in calculating svymean for factor variables in R

I'm trying to calculate simple weighted means for my master's thesis using the survey package in RStudio. Most of my variables are factor variables. When I try to calculate the svymean, I always get "NA" as the result.
I did the following step for all of my variables:
data_subset$fTox <- factor(data_subset$PosNeg)
data_subset$fTox <- plyr::mapvalues(data_subset$fTox,from =c("1","2"), to = c("Negativ", "Positiv"))
Then created the survey design:
QSLab <- svydesign(ids = ~ppoint, data = data_subset1, weights = ~wQSLAB)
When I use this code:
svymean(design=QSLab, x=~fTox)
confint(svymean(design = QSLab, x = ~fTox))
I always get this as a result, no matter which variable I use:
[image: screenshot of the svymean output]
Hope someone can help me.

Can I get unwtd.count included when running the svymean from the R Survey package?

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs etc.
I would like it to also output the unweighted observations count.
I know it's a bit of a lazy question because I can calculate unweighted counts on my own and join them back in. I'm just trying to replicate a Stata script that would return 'obs':
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
FUN = svymean # specify function from survey package
)
By using unwtd.count instead of FUN, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
unwtd.count
)
Honestly in writing this question I made it 98% through a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and if I use merge(myresult_year_calc, myresult_year_obs, by = "year") I get a combined table. However, this gives me only one count per year in this example, instead of one count for 'Yes' responses and one count for 'No'.
Is there any way to get both means and unweighted counts with a single command?
I figured this out by creating a second design object with weights = ~0. When I ran svyby with the svytotal function on this unweighted design, the counts followed the formula.
dsgn2 <- svydesign(ids = ~0,
weights = ~0,
data = data,
na.rm = T)
unweighted_n <- svyby(~interaction(group1,group2), ~as.factor(mean_rating), design = dsgn2, FUN = svytotal, na.rm = T)
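For reference, the two-step approach described in the question (one svyby call for the means, one for unwtd.count, then a merge) can be sketched as follows. This is only an illustration using the api example data that ships with the survey package, not the asker's survey; the column name `n_unweighted` is made up here.

```r
library(survey)
data(api)

# Example one-stage cluster design from the survey package documentation
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Weighted means with confidence intervals and CV in percent, by school type
means <- svyby(~api00, ~stype, dclus1, svymean, vartype = c("ci", "cvpct"))

# Unweighted number of observations in each group
counts <- svyby(~api00, ~stype, dclus1, unwtd.count)

# Keep only the grouping variable and the count, then join to the means
n_by_group <- data.frame(stype = counts$stype, n_unweighted = counts$counts)
result <- merge(as.data.frame(means), n_by_group, by = "stype")
print(result)
```

This still takes two calls, but both use the same design object and grouping formula, so the rows line up on the grouping variable.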

How to obtain Tukey compact letter display from a GLM with interactions

I have a set of data that I've analyzed with a generalized linear model that has three categorical factors in a 3-way interaction (factorA, factorB, factorC) and a fourth continuous factor (factorD) that is simply added in the model. I am trying to obtain a set of Tukey letter groups (i.e., a compact letter display) from the model but haven't found a way to include the interaction successfully. I'm not interested in including factorD, just the three in the interaction.
I have gotten the Tukey-adjusted pairwise comparisons with this:
lsmeans(my.glm, factorA*factorB*factorC)
But I was not able to figure out how to produce a compact letters display from that. It can be done with multcomp package but I could only find ways to do it with main effects with that package, not interactions.
So then I tried the agricolae package, as this post (https://stats.stackexchange.com/questions/31547/how-to-obtain-the-results-of-a-tukey-hsd-post-hoc-test-in-a-table-showing-groupe) discusses that that should work. However, following the instructions in that answer led to a non-functional response from HSD.test. Specifically, I could get the main effects tests to work fine, e.g. HSD.test(my.glm,"factorA") but I could not get the interactions to work. I tried this:
intxns<-with(my.data, interaction(factorA,factorB,factorC))
HSD.test(my.glm,"intxns",group=TRUE)
But I get an error indicating that the HSD.test function didn't recognize "intxns" as a valid object. It looks like this (I also checked the intxns object: it looks good, and its number of rows matches the number of residuals of my glm):
Name: inxtns
factorA factorB factorC factorD
I get that same error if I just put nonsense into the factor field in the HSD.test function call.
The agricolae notes don't actually cover the use of interactions in HSD.test, but I assume it can work.
Does anyone know how to get HSD.test to work with interactions? Or is there any other function you've gotten to work to produce compact letter displays for a glm with interactions?
I've been working on this for a number of days now and haven't been able to find a solution, hopefully I'm not missing something obvious.
Thanks!
I don't know how you've specified your glm model, but for HSD.test, it's looking to match the particular treatment name with the same name specified in the glm formula as well as the data frame. This is why your main effect, factorA will work, but not the 3-way interaction. For multiple comparison tests on interactions, I find it easiest to generate the interactions separately and add them to the data frame as additional columns. The glm model can then be specified using the new variables which code for the interaction.
For example,
set.seed(42)
glm.dat <- data.frame(y = rnorm(1000), factorA = sample(letters[1:2],
size = 1000, replace = TRUE),
factorB = sample(letters[1:2], size = 1000, replace = TRUE),
factorC = sample(letters[1:2], size = 1000, replace = TRUE))
# Generate interactions explicitly and add them to the data.frame
glm.dat$factorAB <- with(glm.dat, interaction(factorA, factorB))
glm.dat$factorAC <- with(glm.dat, interaction(factorA, factorC))
glm.dat$factorBC <- with(glm.dat, interaction(factorB, factorC))
glm.dat$factorABC <- with(glm.dat, interaction(factorA, factorB, factorC))
# General linear model
glm.mod <- glm(y ~ factorA + factorB + factorC + factorAB + factorAC +
factorBC + factorABC, family = 'gaussian', data = glm.dat)
# Multiple comparison test
library(agricolae)
comp <- HSD.test(glm.mod, trt = "factorABC", group = TRUE)
comp$groups
giving
trt means M
1 a.a.a 0.070052189 a
2 a.b.b 0.035684571 a
3 b.a.a 0.020517535 a
4 b.b.b -0.008153257 a
5 a.b.a -0.036136140 a
6 a.a.b -0.078891136 a
7 b.a.b -0.080845419 a
8 b.b.a -0.115808772 a

How are BRR weights used in the survey package for R?

Does anyone know how to use BRR weights in Lumley's survey package for estimating variance if your dataset already has BRR weights in it?
I am working with PISA data, and they already include 80 BRR replicates in their dataset. How can I get as.svrepdesign to use these, instead of trying to create its own? I tried the following and got the subsequent error:
dstrat <- svydesign(id=~uniqueID,strata=~strataVar, weights=~studentWeight,
data=data, nest=TRUE)
dstrat <- as.svrepdesign(dstrat, type="BRR")
Error in brrweights(design$strata[, 1], design$cluster[, 1], ...,
fay.rho = fay.rho, : Can't split with odd numbers of PSUs in a stratum
Any help would be greatly appreciated, thanks.
no need to use as.svrepdesign() if you have a data frame with the replicate weights already :) you can create the replicate weighted design directly from your data frame.
say you have data with a main weight column called mainwgt and 80 replicate weight columns called repwgt1 through repwgt80 you could use this --
yoursurvey <-
svrepdesign(
weights = ~mainwgt ,
repweights = "repwgt[0-9]+" ,
type = "BRR",
data = yourdata ,
combined.weights = TRUE
)
-- this way, you don't have to identify the exact column numbers. then you can run normal survey commands like --
svymean( ~variable , design = yoursurvey )
if you'd like another example, here's some example code and an explanatory blog post using the current population survey.
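To make the call above concrete, here is a small self-contained sketch. Everything in it is simulated: `yourdata`, `mainwgt`, and the `repwgt` columns are made-up stand-ins, and the replicate weights are random half-sample style weights rather than real BRR replicates, so only the mechanics (not the standard error) are meaningful.

```r
library(survey)

set.seed(42)
n <- 200

# Simulated stand-in for a data set that ships with replicate weights
yourdata <- data.frame(
  variable = rnorm(n, mean = 50, sd = 10),
  mainwgt  = runif(n, min = 1, max = 3)
)

# Hypothetical replicate weights: each replicate zeroes or doubles the main weight
for (i in 1:80) {
  yourdata[[paste0("repwgt", i)]] <-
    yourdata$mainwgt * sample(c(0, 2), n, replace = TRUE)
}

# The regular expression picks up all 80 replicate-weight columns by name
yoursurvey <- svrepdesign(
  weights = ~mainwgt,
  repweights = "repwgt[0-9]+",
  type = "BRR",
  data = yourdata,
  combined.weights = TRUE
)

est <- svymean(~variable, design = yoursurvey)
print(est)
```

The point estimate uses only the main weight; the replicate-weight columns enter only through the variance calculation.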
I haven't used the PISA data, but last year I used the svrepdesign method with the Public Use Microdata Sample (PUMS) from the American Community Survey (US Census Bureau), which also ships with 80 replicate weights. The documentation says to use the Fay method for that specific survey, so here is how one can construct the replicate-weights design object from that data:
pums_p.rep<-svrepdesign(variables=pums_p[,2:7],
repweights=pums_p[8:87],
weights=pums_p[,1],combined.weights=TRUE,
type="Fay",rho=(1-1/sqrt(4)),scale=1,rscales=1)
attach(pums_p.rep)
#CROSS - TABS
#unweighted
xtabs(~ is5to17youth + withinAMILimit)
table(is5to17youth, withinAMILimit)
#weighted, mean income by sex by race for select age groups
svyby(~PINCP, ~RAC1P + SEX,
subset(pums_p.rep, AGEP > 25 & AGEP < 35),
svymean, na.rm = TRUE, vartype = c("se", "cv"))
In getting this to work, I found the article from A. Damico helpful: Damico, A. (2009). Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data. The R Journal, 1(2), 37–44.
