I have zero values for some replicate and sampling weights. Therefore, when I use svycoxph from the "survey" package, I get the error message "Invalid weights, must be >0". I think one way around this might be to exclude these observations, but I wonder if there is a way to keep those observations in the Cox proportional hazards model?
Thanks!
Julia
Zero replicate weights are perfectly normal -- with jackknife replicates they will be observations from the PSU being left out; with bootstrap replicates they will be from PSUs that appeared zero times in a particular resample. Zero sampling weights, on the other hand, don't make a lot of sense and probably indicate observations that aren't in the sample or aren't in the sampling frame (personally, I'd code these as NA).
The coxph() function in the survival package can't handle negative or zero weights, and that's what svycoxph() calls. For replicate weights, svycoxph() adds a small value (1e-10) to weights to stop problems with zeros.
However, svycoxph() can't handle zero sampling weights. You probably want to remove these before constructing the design object.
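For example, a minimal sketch of dropping zero sampling weights before building the design object; the data frame mydata, its weight and design variables, and the model formula are all hypothetical:

library(survey)
library(survival)

# keep only observations with strictly positive sampling weights
dat_pos <- subset(mydata, sampwt > 0)   # or set mydata$sampwt[mydata$sampwt == 0] <- NA

des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~sampwt,
                 data = dat_pos, nest = TRUE)

fit <- svycoxph(Surv(time, status) ~ age + treatment, design = des)
summary(fit)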
I have matched two groups using the MatchIt package, matching 2:1 nearest neighbor with replacement. After matching, I want to compare the difference in a test score (range: 0-100) between the two groups; however, these scores are not normally distributed. I don't think I can use a weighted t-test (using the weights created by the matching program) since the data aren't normal. What should I use instead to analyze this continuous variable after matching?
If you have a large sample and the groups have similar variances, it doesn't matter whether the variable is normally distributed; the t-test p-value will be approximate but very close to accurate. Otherwise, you can try using a robust standard error to correct for heteroscedasticity; this is recommended with weights (including matching weights) in general. You might also think about how scores are generated. If you see floor or ceiling effects, maybe a tobit model would make sense. If it's very hard to get values close to 0 or 100, maybe a fractional logit model would make sense.
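For instance, a hedged sketch of the robust-standard-error route, assuming a matched data set produced by MatchIt::match.data() with its weights column; the variable names test_score and group are illustrative:

library(MatchIt)
library(sandwich)
library(lmtest)

m.data <- match.data(m.out)   # m.out is the fitted matchit() object

# weighted comparison of groups with heteroscedasticity-robust standard errors
fit <- lm(test_score ~ group, data = m.data, weights = weights)
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))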
My question is on how to correctly interpret (and use) the 'weights' input variable in the nls function of R for non-linear weighted least squares regression.
The solution for the unknown parameters in weighted least squares theory is

\hat{\beta} = (X^T P X)^{-1} X^T P y

where the variable P is the square weight matrix of size (N x N) and N is the number of data observations.
However, when I look at the nls documentation in R found here, it says that the 'weights' argument to be supplied is a vector.
This has me puzzled, since based on my understanding the weights should form a square matrix. Any insights from those with a better understanding would be appreciated.
A weight variable in regression is a measure of how important an observation is to your model, for one reason or another (e.g. the reliability of the measurement, or the inverse of a variance estimate). Some observations may therefore be more important, i.e. weigh more heavily, than others.
A weight vector converts, in matrix notation, to a diagonal matrix whose i-th diagonal entry is the weight of the i-th observation, for i in {1, 2, ..., n}; both represent the same thing. For nls() in R you need to supply the weights in vector form.
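A minimal sketch, with a made-up exponential model and weights taken as inverse variances, of how the vector is passed to nls():

set.seed(1)
x <- 1:50
y <- 2 * exp(0.05 * x) + rnorm(50, sd = 0.5 * sqrt(x))   # noise variance grows with x
w <- 1 / (0.25 * x)                                       # weights proportional to 1/variance

fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1), weights = w)
summary(fit)

# the textbook weight matrix P is simply diag(w); nls() only needs its diagonal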
Also, it should be noted that weighted least squares is a special case of generalized least squares, in which the weights are used to counter heteroskedasticity. If the residuals are correlated across observations, a more general model might be suitable.
PS: Cross Validated would be the right place to get a more detailed answer. Also, it is more memory-efficient to store a vector rather than a full matrix as the number of observations grows.
I am trying to model count data on the number of absence days per worker in a year (dependent variable). I have a set of predictors, including information about the workers, their jobs, etc., and most of them are categorical variables. Consequently, there is a large number of coefficients to estimate (83), but as I have more than 600,000 rows, I think it should not be problematic.
In addition, I have no missing values in my dataset.
My dependent variable contains a lot of zero values, so I would like to estimate a zero-inflated model (Poisson or negative binomial) with the zeroinfl function of the pscl package, using the code:
zpoisson <- zeroinfl(formule, data = train, dist = "poisson", link = "logit")
but I get the following error after a long running time:
Error in solve.default(as.matrix(fit$hessian)) : system is computationally singular: reciprocal condition number = 1.67826e-41
I think this error means some of my covariates are correlated, but that does not seem to be the case when checking pairwise correlations and variance inflation factors (VIF). Moreover, I have also estimated other models, such as logit and Poisson or negative binomial count models, without problems, even though these types of models are also sensitive to correlated predictors.
Do you have an idea why the zeroinfl function does not work? Could it be linked to the fact that I have too many predictors, even if they are not correlated? I have already tried to remove some predictors with the Boruta algorithm, but it kept all of them.
Thanks in advance for your help.
Collinearity among regressors is one potential cause of this error. However, there are also others.
The problem may actually be computational, in the sense that the scaling of the regressors is bad. Some regressor might take values in the thousands or millions and then have a tiny coefficient, while other regressors take small values and have huge coefficients. This leads to numerically unstable Hessian matrices and the error above upon inversion. Typical causes include squared regressors x^2 when x itself is already large. Simply taking x/1000 or so might solve the problem.
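A hedged sketch of this kind of rescaling before refitting; the variable names (days_absent, salary, job_type) are hypothetical:

library(pscl)

train$salary_k <- train$salary / 1000   # work in thousands rather than raw units

zpoisson <- zeroinfl(days_absent ~ salary_k + job_type,
                     data = train, dist = "poisson", link = "logit")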
The problem may also be separation or a lack of variation in the response. For example, if for certain groups or factor levels there are only zeros, the corresponding coefficient estimates might diverge and have huge standard errors, much like in (quasi-)complete separation in binary regression.
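A quick diagnostic sketch (again with hypothetical variable names): for each level of a categorical predictor, check what proportion of responses is non-zero; levels where this is 0 (or 1) can make the coefficient estimates diverge.

tapply(train$days_absent > 0, train$job_type, mean)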
I would really appreciate any help with specifying probability weights in R without using the Lumley survey package. I am conducting mediation analysis in R using the Imai et al. mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)
However, I'm unsure if this is weighting the data correctly. The reason is that this code yields standard errors that differ from those I am getting in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering whether the weights option in R might not be for inverse probability weights, but I was unable to determine this from the documentation, this forum, or elsewhere. If I am missing something, I really apologize; I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the weights parameter of the lm function is inversely proportional to the variance of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct a previous answer: I looked up the manual and found the following description of the weights argument in lm:
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata). They multiply the observation out n times, as defined by the weight vector. Probability weights, on the other hand, refer to the probability that an observation's group is sampled from the population. Using them adjusts the impact of the observation on the coefficients, but not on the standard errors, as they don't change the number of observations represented in the sample.
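If the goal is to reproduce Stata's [pweight=] results without the survey package, a commonly used sketch (using the question's own variable names) is to keep the weighted lm() point estimates, which should match Stata's, and compute heteroscedasticity-robust standard errors; Stata's pweight standard errors correspond to HC1-type robust errors:

library(sandwich)
library(lmtest)

olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)

# robust (HC1) standard errors, analogous to Stata's [pweight=] / vce(robust)
coeftest(olsmediator_basic, vcov. = vcovHC(olsmediator_basic, type = "HC1"))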
I am carrying out a zero-inflated negative binomial GLM on some insect count data in R. My problem is how to get R to read my species data as one stacked column so as to preserve the zero inflation. If I subtotal and import it into R as a single row titled Abundance, I lose the zeros and the model doesn't work. So far, I have tried to:
stack the data myself (there are 80 columns * 47 rows, so 3760 rows after stacking manually); you can imagine how slow R gets when using the pscl zeroinfl() command (it takes 20 minutes on my computer, but it still worked).
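For reference, a hedged sketch of doing the stacking programmatically; the wide data frame insects_wide and its identifier columns are hypothetical:

library(reshape2)
library(pscl)

# wide (sites in rows, species in columns) to long, with one Abundance column
long <- melt(insects_wide,
             id.vars = c("Site", "Medium"),
             variable.name = "Species",
             value.name = "Abundance")

fit <- zeroinfl(Abundance ~ Species + Medium, data = long, dist = "negbin")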
The next problem concerns spatial correlation. Certain samplers sampled from the same medium, which violates independence. Can I just put medium in as a factor in the model?
3760 rows take 20 minutes with pscl? My god, I have battled 30,000 rows :) that's why my pscl calculation did not finish...
However, I then worked with a GLMM including nested random effects (lme/gamm) and a negative binomial distribution, setting theta to a low value so that the distribution handles the zero inflation. I think this depends on the degree of zero inflation; in my case it was 44% and the residuals looked rather good.
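A minimal sketch of that kind of model, not the poster's exact code: a negative binomial GLMM fitted with glmmPQL() (which uses lme under the hood), with a random intercept for the sampling medium and theta fixed at a low value, assuming the long-format data from the earlier sketch:

library(MASS)   # glmmPQL(), negative.binomial()
library(nlme)   # lme machinery used by glmmPQL()

fit_glmm <- glmmPQL(Abundance ~ Species,
                    random = ~ 1 | Medium,
                    family = negative.binomial(theta = 1),
                    data = long)
summary(fit_glmm)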