I am using t-tests in R to test the significance of the difference in means that arises when adding weights, stratification, and clustering (respectively) to the survey design when utilizing the FGT measure of poverty, which I calculate using the svyfgt function in the convey package. I am running the t-tests by creating vectors for each survey design which include the mean, standard deviation, and sample size. Hence, I need to obtain the standard deviation for the svyfgt mean.
In the survey package, there is a svysd function, which is used to calculate the standard deviation when complex survey designs are applied. This value is quite different from the value obtained by simply multiplying the SE by sqrt(n), as shown below:
library(survey)
wel <- c(68008.19, 128504.61, 21347.69,
33272.95, 61828.96, 32764.44,
92545.62, 58431.89, 95596.82,
117734.27)
rmul <- c(16, 16, 16, 16, 16, 16, 16,
20, 20, 20)
splin <- c(23149.64, 23149.64, 23149.64, 23149.64, 23149.64,
21322.23, 21322.23, 21322.23, 21322.23, 21322.23)
survey.data <- data.frame(wel, rmul, splin)
survey_weighted <- svydesign(data = survey.data,
ids = ~wel,
weights = ~rmul,
nest = TRUE)
svymean(~wel, survey_weighted)
svysd(~wel, survey_weighted)
11498*sqrt(10)
In the convey package, there is no equivalent "svyfgtsd" function, and simply multiplying the SE by sqrt(n) would seem to yield the wrong answer (based on the previously shown difference in results between svysd and that expression). Therefore, I am not sure how I might obtain the standard deviation for FGT_0_weighted. Is there a function I am not aware of, or a stats concept that might aid me here?
library(convey)
fgtsurvey_weighted <- convey_prep(survey_weighted)
FGT_0_weighted <- svyfgt(~wel,
fgtsurvey_weighted,
g=0,
abs_thresh = survey.data$splin)
FGT_0_weighted
For reference, I will be using the sd values in t-tests like so (disregard sd values):
FGT_0_unweighted_vector <- c(rnorm(9710, mean = 0.28919, sd = sd_FGT_0))
FGT_0_cluster_vector <- c(rnorm(9710, mean = 0.33259, sd = sd_FGT_0_cluster))
t.test(FGT_0_cluster_vector, FGT_0_unweighted_vector, var.equal = FALSE)
When the poverty threshold is absolute, the FGT with g = 0 (the headcount ratio) is the mean of a binary variable (poor/non-poor), i.e., a proportion p. The standard deviation of a binary variable is sqrt( p*(1-p) ).
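As a minimal base-R sketch of that formula, using the data from the question: the indicator `poor` and the names `p_hat` and `sd_poor` are illustrative, and treating a unit as poor when `wel <= splin` is an assumption about how the headcount indicator is defined.

```r
# Data from the question
wel <- c(68008.19, 128504.61, 21347.69, 33272.95, 61828.96,
         32764.44, 92545.62, 58431.89, 95596.82, 117734.27)
rmul <- c(16, 16, 16, 16, 16, 16, 16, 20, 20, 20)
splin <- c(rep(23149.64, 5), rep(21322.23, 5))

# Assumption: a unit is poor when its welfare is at or below its threshold
poor <- as.numeric(wel <= splin)

# Weighted proportion of poor units (the headcount ratio)
p_hat <- sum(rmul * poor) / sum(rmul)

# Standard deviation of the binary poor/non-poor variable
sd_poor <- sqrt(p_hat * (1 - p_hat))

print(c(p_hat = p_hat, sd_poor = sd_poor))
```

Note that this describes the spread of the 0/1 indicator in the population, not the uncertainty of the estimate.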
However, you are probably looking for the standard error (a measure of the sampling error of the FGT estimate). To get it, just do SE( FGT_0_weighted ). That's what is used in t-tests.
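To illustrate how standard errors enter a test directly (instead of simulating vectors with rnorm), here is a rough sketch. The two estimates are the ones quoted in the question; both standard errors are hypothetical placeholders, and the calculation treats the two estimates as independent, which is questionable when both come from the same sample.

```r
# Estimates quoted in the question
est_cluster <- 0.33259
est_unweighted <- 0.28919

# HYPOTHETICAL standard errors, for illustration only;
# in practice take them from SE() on each svyfgt result
se_cluster <- 0.012
se_unweighted <- 0.009

# Large-sample z statistic for the difference of two estimates,
# ASSUMING the two estimates are independent
z_stat <- (est_cluster - est_unweighted) /
  sqrt(se_cluster^2 + se_unweighted^2)
p_value <- 2 * pnorm(-abs(z_stat))

print(c(z = z_stat, p = p_value))
```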
Taking stratification and clustering into account will alter standard error estimates, while weighting will affect the mean (and all point estimates, like FGT) as well. Using t-tests to test whether mean estimates change makes sense for comparing weighted and unweighted estimates.
Working with sqrt(n) is misleading under complex sampling. The usual n is what is called the nominal sample size, but the effective sample size is usually smaller than that (because of cluster sampling).
A concept related to what you are trying to do is the design effect, but that is not yet implemented for svyfgt (although, for absolute thresholds, you can still get it using svymean).
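A sketch of the svymean route for an absolute threshold, under some assumptions: the design below uses weights only (`ids = ~1`), which is a simplification of the design in the question, and the indicator `poor` (with `wel <= splin`) is an illustrative definition, not something taken from convey.

```r
library(survey)

# Data from the question
wel <- c(68008.19, 128504.61, 21347.69, 33272.95, 61828.96,
         32764.44, 92545.62, 58431.89, 95596.82, 117734.27)
rmul <- c(16, 16, 16, 16, 16, 16, 16, 20, 20, 20)
splin <- c(rep(23149.64, 5), rep(21322.23, 5))
survey.data <- data.frame(wel, rmul, splin)

# Assumption: poor means welfare at or below the threshold
survey.data$poor <- as.numeric(survey.data$wel <= survey.data$splin)

# Assumption: weights only, no clusters or strata
des <- svydesign(ids = ~1, weights = ~rmul, data = survey.data)

# Headcount ratio with its design effect
headcount <- svymean(~poor, des, deff = TRUE)
design_effect <- as.numeric(deff(headcount))

# Effective sample size implied by the design effect
n_effective <- nrow(survey.data) / design_effect

print(headcount)
print(n_effective)
```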
I'm trying to use gee to model counts of an outcome with a population offset. I have models with interaction terms and am trying to use the effects package (allEffects) to summarize parameter estimates and odds ratios (ORs).
When I compute ORs by hand, I'm not sure why they don't match the output I get from the effects::allEffects() function. The data can't be shared, but the model is
mdl <- geeglm(count~age+gender+age:gender+offset(log(totalpop)),
family="poisson", corstr="exchangeable", id=geo,
waves=year, data=df)
I use the below code to compute stuff manually. log_OR sums the interaction terms without intercepts added to parameter. log_odds sums the parameters with intercept. The code is taken from here.
tibble(
variables = names(coef(mdl)),
log_OR = c(...),
log_odds = c(...),
OR = exp(log_OR),
odds = exp(log_odds),
probability = odds / (1 + odds)
) %>%
mutate_if(is.numeric, ~round(., 5)) %>%
knitr::kable()
I then compare my manual calculations to the output of allEffects below. They don't match. Can someone help me see what I am doing wrong?
result <- allEffects(mdl)
allEffects(mdl) %>% summary()
variable <- result[["age:gender"]][["x"]]
Prob <- result$`age:gender`$fit
Prob_upper <- result$`age:gender`$upper
Prob_lower <- result$`age:gender`$lower
model_Est <- data.frame("Est"=Prob, "CI Lower"= Prob_lower,
"CI Upper"= Prob_upper)
model_OR <- exp(model_Est)
model_Est <- data.frame("Variable"=variable, model_Est)
model_OR <- data.frame("Variable"=variable, model_OR)
You haven't given us very much to go on, but the cause is almost certainly that the offset isn't being dealt with properly. (The first thing I would try is running the model without the offset to see if the results from effects and your by-hand calculations match: that's not the model you want, but it will confirm that the problem is with the offsets.)
?effects says:
offset a function to be applied to the offset values (if
there is an offset) in a linear or generalized linear
model, or a mixed-effects model fit by ‘lmer’ or ‘glmer’;
or a numeric value, to which the offset will be set. The
default is the ‘mean’ function, and thus the offset will
be set to its mean; in the case of ‘"svyglm"’ objects,
the default is to use the survey-design weighted mean.
Note: Only offsets defined by the ‘offset’ argument to
‘lm’, ‘glm’, ‘svyglm’, ‘lmer’, or ‘glmer’ will be handled
correctly; use of the ‘offset’ function in the model
formula is not supported.
(emphasis added)
methods("effects") lists only effects.glm and effects.lm, which suggests that the model is being treated as a glm (i.e., there is no specialized method for GEE models). So, this suggests:
(1) you need to include offset= as a separate argument in your model.
(2) when doing your hand calculation, make sure the value of the offset is set to the mean value across all observations (unless you choose to use the offset= argument to effects/allEffects to change the default summary function).
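Both points can be sketched with a plain Poisson glm on simulated data. This is a stand-in, not the GEE model from the question: the data frame `dat`, its values, and all object names are made up for illustration. Also note that with a log link the exponentiated coefficients are rate ratios rather than odds ratios.

```r
set.seed(42)
n <- 200

# Simulated stand-in data (hypothetical, not the asker's data)
dat <- data.frame(
  age = factor(sample(c("old", "young"), n, replace = TRUE)),
  gender = factor(sample(c("f", "m"), n, replace = TRUE)),
  totalpop = sample(500:5000, n, replace = TRUE)
)
dat$count <- rpois(n, lambda = 0.01 * dat$totalpop)

# Offset inside the formula (the form used in the question)
fit_formula <- glm(count ~ age * gender + offset(log(totalpop)),
                   family = poisson, data = dat)

# (1) Offset passed as a separate argument
fit_arg <- glm(count ~ age * gender, offset = log(totalpop),
               family = poisson, data = dat)

# Both specifications give the same coefficients
same_coef <- all.equal(coef(fit_formula), coef(fit_arg))

# (2) Hand calculation for the reference cell (old, f),
# with the offset set to its mean over all observations
mean_offset <- mean(log(dat$totalpop))
expected_count_ref <- exp(unname(coef(fit_arg)["(Intercept)"]) + mean_offset)

print(same_coef)
print(expected_count_ref)
```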
I have data with 10000 instances, which resemble a negative binomial distribution. I am sampling out of this data, but I need a subsample which is normally distributed and has a pre-specified mean. How can I achieve this?
library(MASS)
my_trees <- rnegbin(10000, mu = 15, theta = 3)
hist(my_trees)
mean(my_trees)
my_sample <- sample(my_trees, size = 500)
hist(my_sample)
mean(my_sample)
How can I sample data which will be normally distributed with a mean of, e.g. 25? I am aware of prob argument, and also read this related question, but anyhow I can not get what I want.
The normal distribution has two parameters: a location parameter (the mean) and a scale parameter (the standard deviation).
You mentioned only the mean (25), so you can generate n random values from a normal distribution with rnorm(n, mean = 25). For a different sd (the scale parameter), use rnorm(n, mean, sd). rnorm does the same job for the normal distribution that rnegbin did when you generated my_trees.
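For example, a small sketch (the sample size of 500 mirrors the question; the sd of 5 and the seed are arbitrary choices for illustration):

```r
set.seed(1)

# 500 draws from a normal distribution with mean 25;
# sd = 5 is an arbitrary illustrative choice
my_normal_sample <- rnorm(500, mean = 25, sd = 5)

hist(my_normal_sample)
mean(my_normal_sample)
```

Note that these values are freshly simulated; they are not drawn from my_trees.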
https://www.stat.umn.edu/geyer/old/5101/rlook.html provides information about several other distributions.
Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean is the mean of your scores, and looking at the distribution above it seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
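If the normal approximation is accepted, a 95% interval for the distribution can be read off with qnorm. This is only a sketch: it reproduces the estimates above in base R (fitdistr's sd is the maximum-likelihood version, dividing by n), and it describes the central 95% range of the fitted distribution, not a confidence interval for the mean. The names `scores`, `interval_x1`, and `interval_scores` are illustrative.

```r
scores <- c(3, 3, 2.5, 2.5, 4.5, 3, 2, 4, 3, 3.5, 3.5, 2.5, 3,
            3, 3.5, 3, 3, 4, 3.5, 3.5, 4, 3.5, 3.5, 4, 3.5)
x1 <- scores * 2  # same rescaling as above

# Maximum-likelihood estimates of the normal parameters
mu <- mean(x1)
sigma <- sqrt(mean((x1 - mu)^2))

# Central 95% range of the fitted normal, on the doubled scale
interval_x1 <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)

# Back on the original 0-5 body condition scale
interval_scores <- interval_x1 / 2

print(interval_scores)
```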
So I am wanting to create a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel solver to maximize the log-likelihood of a logistic regression, but I want to implement a similar function in R.
What I am trying to create in the end is an injury risk function. These take an S-shape function.
As we see, the risk curves are calculated from the following equation
Let's take some dummy data to begin with
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Let's assume A indicates whether a bone was broken, B is the bone density, and C is the force applied.
So we can form a logistic regression model in which B and C explain the outcome variable A. A simple example of the regression may be:
Or
glm(A ~ B + C, data=df1, family=binomial())
Now we want to make the first assumption that at zero force, we should have zero risk. This is further explained as A1. on pg.124 here
Here we set our A1=0.05 and solve the equation
A1 = 1 - (1-P(0))^n
where P(0) is the probability of injury when the injury related parameter is zero and n is the sample size.
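Rearranging the equation above gives P(0) = 1 - (1 - A1)^(1/n), which can be evaluated directly (the variable names are illustrative):

```r
A1 <- 0.05  # value chosen above
n <- 153    # sample size of the dummy data

# Solve A1 = 1 - (1 - P0)^n for P0
P0 <- 1 - (1 - A1)^(1 / n)
print(P0)
```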
We have our sample size and can solve for P(0). We get 3.4E-4. Such that:
The second assumption is that we should maximize the log-likelihood function of the regression
We want to maximize the following equation
Where pi is estimated from the above equation and yi is the observed value for non-break for each interval
From what I understand, I have to use one of two functions in R to define a function for maximizing the LL: mle from the stats4 package (shipped with R) or mle2 from the bbmle package.
I guess I need to write a function along these lines
log.likelihood.sum <- function(sequence, p) {
  # Bernoulli log-likelihood: log(p) for events (1), log(1 - p) for non-events (0)
  sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
}
But I am not sure where I should account for the first assumption, i.e., am I best to build it into the above code, and if so, how? Or would it be more efficient to write a second function to combine the two assumptions? Any advice would be great, as I have very limited experience in writing and understanding functions.
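One possible way to combine the two assumptions, offered as a sketch rather than a definitive answer: fix the intercept at the logit of P(0), so the curve passes through P(0) at zero force, and maximize the log-likelihood over the slope only. The sketch assumes a single predictor C (force) and leaves B out; the names `a_fixed`, `neg_loglik`, and `b_hat` are illustrative.

```r
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)   # generated to keep the same draws; unused here
C <- rnorm(153, mean = 100, sd = 15)

# First assumption: risk at zero force equals P0
P0 <- 1 - (1 - 0.05)^(1 / length(A))
a_fixed <- qlogis(P0)  # intercept implied by P0

# Negative log-likelihood as a function of the slope only,
# written with log.p = TRUE for numerical stability
neg_loglik <- function(b) {
  eta <- a_fixed + b * C
  -sum(A * plogis(eta, log.p = TRUE) +
       (1 - A) * plogis(-eta, log.p = TRUE))
}

# Second assumption: maximize the log-likelihood (minimize its negative)
opt <- optimize(neg_loglik, interval = c(0, 1), tol = 1e-8)
b_hat <- opt$minimum

# Cross-check: the same constraint via glm with a fixed offset and no intercept
fit <- glm(A ~ 0 + C, offset = rep(a_fixed, length(A)), family = binomial())

print(c(optimize = b_hat, glm = unname(coef(fit))))
```

The glm call is often the more convenient form, since it also returns standard errors for the slope.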
I am trying to find a structural break in the mean of a time-series that is skewed, fat-tailed, and heteroskedastic. I apply the Andrews(1993) supF-test via the strucchange package. My understanding is that this is valid even with my nonspherical disturbances. But I would like to confirm this via bootstrapping. I would like to estimate the max t-stat from a difference in mean test at each possible breakpoint (just like the Andrews F-stat) and then bootstrap the critical value. In other words, I want to find my max t-stat in the time-ordered data. Then scramble the data and find the max t-stat in the scrambled data, 10,000 times. Then compare the max t-stat from the time-ordered data to a critical value given by the rank 9,500 max t-stat from the unordered data. Below I generate example data and apply the Andrews supF-test. Is there any way to "correct" the Andrews test for nonspherical disturbances? Is there any way to do the bootstrap I am trying to do?
library(strucchange)
Thames <- ts(matrix(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)), ncol = 1), frequency = 12, start = c(1985, 1))
fs.thames <- Fstats(Thames ~ 1)
sctest(fs.thames)
I'm adding a second answer to analyze the simulated Thames data provided.
Regarding the points from my first general methodological answer: (1) In this case, a log() transformation is clearly appropriate to deal with the extreme skewness of the observations. (2) As the data are heteroscedastic, the inference should be based on HC or HAC covariances. Below I employ the Newey-West HAC estimator, although the data are just heteroscedastic but not autocorrelated. The HAC-corrected inference affects the supF test and the confidence intervals for the breakpoint estimates. The breakpoints themselves and the corresponding segment-specific intercepts are estimated by OLS, i.e., treating the heteroscedasticity as a nuisance term. (3) I did not add any bootstrap or permutation inference as the asymptotic inference appears to be convincing enough in this case.
First, we simulate the data using a particular seed. (Note that other seeds may not lead to such clear-cut breakpoint estimates when analyzing the series in levels.)
library("strucchange")
set.seed(12)
Thames <- ts(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)),
frequency = 12, start = c(1985, 1))
Then we compute the sequence of HAC-corrected Wald/F statistics and estimate the optimal breakpoints (for m = 1, 2, 3, ... breaks) via OLS. To illustrate how much better this works for the series in logs rather than in levels, both versions are shown.
fs_lev <- Fstats(Thames ~ 1, vcov = NeweyWest)
fs_log <- Fstats(log(Thames) ~ 1, vcov = NeweyWest)
bp_lev <- breakpoints(Thames ~ 1)
bp_log <- breakpoints(log(Thames) ~ 1)
The visualization below shows the time series with the fitted intercepts in the first row, the sequence of Wald/F statistics with the 5% critical value of the supF test in the second row, and the residual sum of squares and BIC for the selection of the number of breakpoints in the last row. The code to replicate the graphic is at the end of this answer.
Both supF tests are clearly significant but in levels (sctest(fs_lev)) the test statistic is "only" 82.79 while in logs (sctest(fs_log)) it is 282.46. Also, the two peaks pertaining to the two breakpoints can be seen much better when analyzing the data in logs.
Similarly, the breakpoint estimates are somewhat better and the confidence intervals much narrower for the log-transformed data. In levels, we get:
confint(bp_lev, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_lev, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 NA 125 NA
## 2 202 242 263
plus an error message and warnings which all reflect that the asymptotic inference is not a useful approximation here. In contrast, the confidence intervals are quite reasonable for the analysis in logs. Due to the increased variance in the middle segment, its start and end are somewhat more uncertain than for the first and last segment:
confint(bp_log, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_log, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 107 119 121
## 2 238 240 250
##
## Corresponding to breakdates:
## 2.5 % breakpoints 97.5 %
## 1 1993(11) 1994(11) 1995(1)
## 2 2004(10) 2004(12) 2005(10)
Finally, the replication code for the figure above is included here. The confidence intervals for the breakpoints in levels cannot be added to the graphic due to the error mentioned above. Hence, only the log-transformed series also has the confidence intervals.
par(mfrow = c(3, 2))
plot(Thames, main = "Thames")
lines(fitted(bp_lev, breaks = 2), col = 4, lwd = 2)
plot(log(Thames), main = "log(Thames)")
lines(fitted(bp_log, breaks = 2), col = 4, lwd = 2)
lines(confint(bp_log, breaks = 2, vcov = NeweyWest))
plot(fs_lev, main = "supF test")
plot(fs_log, main = "supF test")
plot(bp_lev)
plot(bp_log)
(1) Skewness and heavy tails. As usual in linear regression models, the asymptotic justification for the inference does not depend on normality and also holds for any other error distribution given zero expectation, homoscedasticity, and lack of correlation (the usual Gauss-Markov assumptions). However, if you have a well-fitting skewed distribution for your data of interest, then you might be able to increase efficiency by basing your inference on the corresponding model. For example, the glogis package provides some functions for structural change testing and dating based on a generalized logistic distribution that allows for heavy tails and skewness. Windberger & Zeileis (2014, Eastern European Economics, 52, 66–88, doi:10.2753/EEE0012-8775520304) used this to track changes in skewness of inflation dynamics over time. (See ?breakpoints.glogisfit for a worked example.) Furthermore, if the skewness itself is not really of interest then a log or sqrt transformation might also be good enough to make the data more "normal".
(2) Heteroscedasticity and autocorrelation. As usual in linear regression models, the standard errors (or more broadly the covariance matrix) are not consistent in the presence of heteroscedasticity and/or autocorrelation. One can either try to include this explicitly in the model (e.g., an AR model) or treat it as a nuisance term and employ heteroscedasticity and autocorrelation consistent (HAC) covariance matrices (e.g., Newey-West or Andrews' quadratic spectral kernel HAC). The function Fstats() in strucchange allows plugging in such estimators, e.g., from the sandwich package. See ?durab for an example using vcovHC().
(3) Bootstrap and permutation p-values. The "scrambling" of the time series you describe above sounds more like applying permutations (i.e., sampling without replacement) rather than bootstrap (i.e., sampling with replacement). The former is feasible if the errors are uncorrelated or exchangeable. If you are regressing just on a constant, then you can employ the function maxstat_test() from the coin package to carry out the supF test. The test statistic is computed in a somewhat different way, however, this can be shown to be equivalent to the supF test in the constant-only case (see Zeileis & Hothorn, 2013, Statistical Papers, 54, 931–954, doi:10.1007/s00362-013-0503-4). If you want to perform the permutation test in a more general model, then you would have to do the permutations "by hand" and simply store the test statistic from each permutation. Alternatively, the bootstrap can be applied, e.g., via the boot package (where you would still need to write your own small function that computes the test statistic from a given bootstrap sample). There are also some R packages (e.g., tseries) that implement bootstrap schemes for dependent series.
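A minimal sketch of the "by hand" permutation approach for the constant-only case, in base R. Several choices here are assumptions made for illustration: the data are simulated as in the question and analyzed in logs, the statistic is the maximum absolute pooled-variance t statistic over candidate breakpoints, 15% of the observations are trimmed at each end, and 999 permutations are used instead of 10,000. It is only valid if the observations are exchangeable under the null hypothesis.

```r
# Maximum absolute two-sample t statistic over all candidate breakpoints
max_abs_t <- function(y, trim = 0.15) {
  n <- length(y)
  k <- seq(floor(trim * n), ceiling((1 - trim) * n))  # candidate break positions
  s1 <- cumsum(y)[k]        # sums of the first segment
  q1 <- cumsum(y^2)[k]      # sums of squares of the first segment
  s2 <- sum(y) - s1
  q2 <- sum(y^2) - q1
  m1 <- s1 / k
  m2 <- s2 / (n - k)
  # Pooled residual variance for each candidate split
  sp2 <- ((q1 - k * m1^2) + (q2 - (n - k) * m2^2)) / (n - 2)
  max(abs(m1 - m2) / sqrt(sp2 * (1 / k + 1 / (n - k))))
}

set.seed(1)
y <- log(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)))

# Statistic in the time-ordered data
obs_stat <- max_abs_t(y)

# Statistic in permuted ("scrambled") data
perm_stats <- replicate(999, max_abs_t(sample(y)))

# 5% critical value and permutation p-value
crit_value <- quantile(perm_stats, 0.95)
p_value <- (1 + sum(perm_stats >= obs_stat)) / (1 + length(perm_stats))

print(c(observed = obs_stat, critical = unname(crit_value), p = p_value))
```

Replacing sample(y) with sample(y, replace = TRUE) would turn this into a simple bootstrap-style resampling scheme instead of a permutation.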