I am using the svyglm() function to run a regression model. I have an interaction term involving a categorical variable with levels Yes and No. The model is showing the estimates for No, but I would like to analyse Yes. How can I do that? Thanks for your help!
(An image of the dataset covid.surveydesgin was attached here; not reproduced.)
The data for this column is just Yes/No for 3,000 rows; I am unable to share the whole dataset.
YES = 692
NO = 2996
I tried to re-level using fct_relevel() and lapply(), but that didn't work. Any suggestions would be really appreciated.
Regression1 <- svyglm(ca_scghq1_dv ~ (None + Average + Above_Average + Astronomical) * ca_furlough,
                      design = covid.surveydesgin)
summary(Regression1)
Call: svyglm(formula = ca_scghq1_dv ~ (None + Average + Above_Average + Astronomical) * ca_furlough, design = covid.surveydesgin)
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       12.2457     0.5321  23.012  < 2e-16 ***
ca_furloughNo                      0.5957     0.5731   1.039  0.29906
NoneTRUE:ca_furloughNo            -1.4100     0.9212  -1.531  0.12644
AverageTRUE:ca_furloughNo          1.7233     1.3013   1.324  0.18596
Above_AverageTRUE:ca_furloughNo    0.2059     1.6990   0.121  0.90357
AstronomicalTRUE:ca_furloughNo     1.8439     2.2777   0.810  0.41855
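A sketch of one way to get the estimates for Yes: since the output shows ca_furloughNo terms, Yes is currently the reference level, so re-level ca_furlough inside the design object to make No the reference. This is untested without the data and assumes ca_furlough is among the variables of the svydesign object:

# Untested sketch: make "No" the reference level so the model reports
# the coefficients for "Yes" instead.
covid.surveydesgin <- update(covid.surveydesgin,
                             ca_furlough = relevel(factor(ca_furlough), ref = "No"))
Regression1 <- svyglm(ca_scghq1_dv ~ (None + Average + Above_Average + Astronomical) * ca_furlough,
                      design = covid.surveydesgin)
summary(Regression1)  # interaction terms should now end in :ca_furloughYes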
I'm trying to recreate survey statistics from Stata code in R, but I can't get the confidence intervals to come out the same. I'm subsetting the data by the county of interest, and then looking at what percent of student respondents don't wear bike helmets, split by grade, and what the confidence intervals are for those percents.
In the Stata code, we set the survey weights and create a frequency table with confidence intervals:
svyset psu [pweight = fwt_cnty], strata(strata) singleunit(centered)
svy, subpop(if c_valencia==1): tab grade helmet, per ci format(%8.1f) nomarginals vertical
The results look like this:
------------------------------
|Bike helmet: Rarely
| or never
Grade | SmeTime+ NvrRrly
----------+-------------------
9th | 17.09 82.91
| 5.66 58.55
| 41.45 94.34
|
10th | 10.19 89.81
| 5.78 82.64
| 17.36 94.22
|
11th | 7.834 92.17
| 3.498 83.38
| 16.62 96.5
|
12th | 11.28 88.72
| 4.559 74.71
| 25.29 95.44
------------------------------
Key: Row percentage
Lower 95% confidence bound for row percentage
Upper 95% confidence bound for row percentage
Pearson:
Uncorrected chi2(3) = 4.8590
Design-based F(1.74, 36.57) = 0.5336 P = 0.5666
The following R code should do the same thing as far as I can tell:
yrrs_cnty <- svydesign(id=~psu, strata=~strata, weights=~fwt_cnty, data=yrbss_hs_NM)
options(survey.lonely.psu = "adjust")
CITable <- svyby(~helmet,~{cntytxt=={params$counties}}+grade, yrrs_cnty, svyciprop, vartype ="ci", na.rm.all=T)
The results look like this:
{ cntytxt == { params$counties } } grade helmet ci_l ci_u
FALSE.1 FALSE 1 0.7884086 0.7634557 0.8113797
TRUE.1 TRUE 1 0.8291318 0.4233797 0.9697601
FALSE.2 FALSE 2 0.7946815 0.7652980 0.8212456
TRUE.2 TRUE 2 0.8980533 0.7114234 0.9692088
FALSE.3 FALSE 3 0.8127924 0.7889255 0.8345267
TRUE.3 TRUE 3 0.9216650 0.8040353 0.9712142
FALSE.4 FALSE 4 0.8057442 0.7737261 0.8342026
TRUE.4 TRUE 4 0.8871825 0.7264505 0.9588247
After applying the weights from the survey design, R produces the same percentages for each grade (82.91, 89.81, 92.17, and 88.72), but the confidence intervals are much larger. Both should be calculating a confidence interval for a proportion, and I've tried casting the variables as factors, but that didn't seem to help. Does anyone have an idea what is causing the difference in results?
Ok, it's hard to check this without the data. According to the Stata documentation, it's using the same basic confidence interval calculation that is the default for svyciprop: a Wald interval on the logit scale.
I used an example from the Stata manual to check
webuse nhanes2f
svyset psuid [pweight=finalwgt], strata(stratid)
svy: tab rural diabetes, per ci nomarginals vertical row
which gives
(running tabulate on estimation sample)
Number of strata = 31 Number of obs = 10,335
Number of PSUs = 62 Population size = 116,997,257
Design df = 31
------------------------
| diabetes,
1=rural, | 1=yes, 0=no
0=urban | 0 1
----------+-------------
0 | 96.63 3.374
| 96.2 2.991
| 97.01 3.804
|
1 | 96.45 3.547
| 95.62 2.867
| 97.13 4.38
------------------------
Key: row percentage
lower 95% confidence bound for row percentage
upper 95% confidence bound for row percentage
Pearson:
Uncorrected chi2(1) = 0.2027
Design-based F(1, 31) = 0.1794 P = 0.6748
In R I get
dhanes <- svydesign(id = ~psuid, strata = ~stratid, weights = ~finalwgt,
                    data = nhanes, nest = TRUE)
svyby(~diabetes, ~rural, design = dhanes, svyciprop, vartype = "ci",
      method = "logit", na.rm = TRUE)
giving
rural diabetes ci_l ci_u
0 0 0.03373569 0.02990043 0.03804360
1 1 0.03546713 0.02858347 0.04393358
That's close. Stata and R are using different defaults for the denominator degrees of freedom: Stata uses the design df for the whole sample, while R uses the design df for the subpopulation. Making R use the whole-sample df gives almost perfect agreement:
> svyby(~diabetes,~rural, design=dhanes, svyciprop, vartype="ci",method="logit",na.rm=TRUE,df=degf(dhanes))
rural diabetes ci_l ci_u
0 0 0.03373569 0.02990532 0.03803744
1 1 0.03546713 0.02867079 0.04380188
So what's happening in your example? I don't know. It might be single-PSU strata, different missing-value handling, or something else; it probably needs more information to work out.
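If it does come down to the df default, a sketch of the same fix applied to the code from the question (untested without the data; subset() on the design object replaces the inline logical expression for the subpopulation):

# Untested sketch: subpopulation via subset() on the design object,
# passing the whole-sample design df explicitly, as in the NHANES example.
in_county <- subset(yrrs_cnty, cntytxt == params$counties)
CITable <- svyby(~helmet, ~grade, design = in_county, svyciprop,
                 vartype = "ci", method = "logit", na.rm = TRUE,
                 df = degf(yrrs_cnty))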
I'm running a meta-analysis where I'm interested in the effect of X on the effect of age on habitat use (raw mean values and variances) using the metafor package.
An example of one of my models is:
mod6 <-
rma.mv(
yi = Used_value,
V = Used_variance,
slab = Citation,
    mods = ~ Age + poly(Slope, degree = 2),
random = ~ 1 | Region,
data = vel.focal,
method = "ML"
)
My justification for not using Citation as a random effect is that using only Region accounts for more of the heterogeneity than when random = list( ~ 1 | Citation/ID, ~ 1 | Region) or when Citation/ID is used by itself.
What I need as output is the prediction for each age by region, but the predict() function for the model and the associated forest plot spit out a prediction for each row, as predict() assumes each row in the data is a unique study. In my case it is not, as my input values are separated by age and season.
predict(mod6)
pred se ci.lb ci.ub pi.lb pi.ub
Riehle and Griffith 1993.1 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.2 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.3 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Spina 2000.1 8.7706 2.7386 3.4030 14.1382 -0.7364 18.2776
Spina 2000.2 8.5407 2.7339 3.1824 13.8991 -0.9611 18.0426
Spina 2000.3 8.5584 2.7406 3.1868 13.9299 -0.9509 18.0676
Vondracek and Longanecker 1993.1 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.2 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.3 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Vondracek and Longanecker 1993.4 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Does anybody know a way to modify the arguments inside predict() to tell it how you want your predictions output or to tell it that there are multiple rows per slab?
You need to use the newmods argument to specify the values for Age for which you want predicted values. You will have to plug in something for the linear and quadratic terms for the Slope variable as well (e.g., holding Slope constant at its mean and hence the quadratic term will just be the mean squared). Region is not a fixed effect, so it is not relevant if you want to compute predicted values based on the fixed effects. If you want to compute BLUPs for those random effects, you can do so with ranef(). One can then combine the predictions based on the fixed effects with the BLUPs. That would be the general idea, but implementing this will require a bit of programming.
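For example, a minimal sketch of that idea (illustrative only: it assumes the model is refit with poly(Slope, 2, raw = TRUE) so the quadratic column is simply Slope squared, and that Age enters as a single numeric moderator):

# Illustrative sketch, not tested against these data.
slope_bar <- mean(vel.focal$Slope, na.rm = TRUE)
ages <- sort(unique(vel.focal$Age))  # the ages to predict at

# Fixed-effects predictions at each age, holding Slope at its mean;
# newmods columns must match the model matrix without the intercept.
preds <- predict(mod6, newmods = cbind(Age = ages,
                                       Slope = slope_bar,
                                       Slope2 = slope_bar^2))

# BLUPs for the Region random effect, to be combined with the
# fixed-effects predictions for region-specific values.
region_blups <- ranef(mod6)$Region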
I have this assignment:
Participants were asked to estimate how many minutes they spend
per day on Instagram. The results of this question are in the 'usage_duration'
column. Let's take a look at the distribution of their answers. Think of an
appropriate visualization for this and describe what you see in it. Include
a vertical line that shows the mean usage for the entire data set.
Here is the data set (usage_duration column):
| usage_duration |
| -------------- |
| 54.0 |
| 6576.0 |
| 6.5 |
| 3.5 |
| 346 |
| 456 |
Here is the code that I am using to do what's being asked in the question:
mean_of_duration <- mean(as.numeric(SurveyInsta$Usage_duration), na.rm = TRUE)
ggplot(SurveyInsta, mapping = aes(x = Usage_duration)) +
  geom_histogram(color = "blue", fill = "red", color = "Average") +
  geom_vline(aes(xintercept = mean_of_duration, color = "Average"),
             show.legend = TRUE)
However, I am getting this error:
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
But I am not even sure if I am doing the thing that's actually being asked, so if you think that I misunderstood the assignment please let me know.
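For what it's worth, that error usually means the x variable is stored as character or factor rather than numeric, and geom_histogram() also has color= supplied twice. A sketch of a likely fix, assuming Usage_duration should be numeric:

library(ggplot2)

# Convert via as.character() first, in case the column is a factor
# (as.numeric() on a factor would return the level codes).
SurveyInsta$Usage_duration <- as.numeric(as.character(SurveyInsta$Usage_duration))
mean_of_duration <- mean(SurveyInsta$Usage_duration, na.rm = TRUE)

ggplot(SurveyInsta, aes(x = Usage_duration)) +
  geom_histogram(color = "blue", fill = "red") +
  geom_vline(aes(xintercept = mean_of_duration, color = "Average"),
             show.legend = TRUE)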
When computing profile confidence intervals using confint(m1), where m1 is a glmer() model, there is a term (or a few) at the top labelled .sig01, .sig02, and so on, and I can't find any documentation explaining what these mean.
You probably didn't find it because the documentation is in the 'merMod'-class method for confint. In ?lme4::confint.merMod, the oldNames parameter is described as follows:
oldNames: (logical) use old-style names for variance-covariance parameters,
e.g. ".sig01", rather than newer (more informative) names such as
"sd_(Intercept)|Subject"? (See signames argument to profile).
The default for oldNames is TRUE. Setting it to FALSE gives clearer output: the .sigNN labels are replaced by the standard deviations (and correlations) of the random effects, e.g. sd_(Intercept)|herd in the example below.
Example
library(lme4)
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial))
confint(gm1, oldNames=FALSE)
# 2.5 % 97.5 %
# sd_(Intercept)|herd 0.3460732 1.0999743
# (Intercept) -1.9012119 -0.9477540
# period2 -1.6167830 -0.4077632
# period3 -1.8010241 -0.5115362
# period4 -2.5007502 -0.8007554
What I have done so far:
I have a data.frame results with response Fail, and three factors PREP, CLEAN & ADHES.
ADHES has 3 levels: Crest, Cryst, Poly
I calculated the variances:
sigma..k <- tapply(Fail, ADHES, var)
print(sqrt(sigma..k))
which gives:
Crest Cryst Poly
17.56668 41.64679 39.42669
then used leveneTest() to test for constancy of variance:
print(leveneTest(Fail~ADHES))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  2   3.929 0.02588 *
      51
The Question:
Now I want to use Levene's test between only the Cryst and Poly levels of the factor ADHES, but I can't work out the syntax to do this in R.
Thanks to the hint @PauloCardoso gave me, I worked it out:
leveneTest(subset(results, ADHES == 'Cryst' | ADHES == 'Poly')[, 5],
           subset(results, ADHES == 'Cryst' | ADHES == 'Poly')[, 3])
('Fail' & 'ADHES' are columns 5 & 3 respectively in my data.frame 'results')
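An equivalent call using the column names instead of positions, with the now-empty 'Crest' level dropped (a sketch against the same 'results' data.frame):

# droplevels() removes the unused 'Crest' level so leveneTest() sees
# only the two groups being compared.
crystpoly <- droplevels(subset(results, ADHES %in% c("Cryst", "Poly")))
leveneTest(Fail ~ ADHES, data = crystpoly)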
Thanks a lot!