Can't generate correlated random numbers from binomial distributions in R using rmvbin

I am trying to get a sample of correlated random numbers from binomial distributions in R. I tried to use rmvbin and it worked fine with some probabilities:
> rmvbin(100, margprob = c(0.1,0.1), bincorr=0.5*diag(2)+0.5)
while the next call, which is quite similar, raises an error:
> rmvbin(100, margprob = c(0.01,0.01), bincorr=0.5*diag(2)+0.5)
Error in commonprob2sigma(commonprob, simulvals) :
Extrapolation occurred ... margprob and commonprob not compatible?
I can't find any justification for this.

This is a math/stats "problem" and not an R problem (in the sense that it's not a bug but a consequence of the model).
Short version: for bivariate binary data there is a link between the marginal probabilities and the correlations that can be observed. You can see it by doing a bit of boring juggling with the marginal probabilities $p_A$ and $p_B$ and the joint probability $p_{AB}$: the correlation works out to $\rho = \frac{p_{AB} - p_A p_B}{\sqrt{p_A(1-p_A)\,p_B(1-p_B)}}$, while $p_{AB}$ itself is confined to the Fréchet bounds $\max(0, p_A + p_B - 1) \le p_{AB} \le \min(p_A, p_B)$. In other words: the marginal probabilities put restrictions on the range of allowed correlations (and vice versa), and you are violating this in your call.
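As a minimal sketch in base R (no packages, values taken from the failing call), you can compute the joint probability implied by the requested correlation and compare it with the Fréchet bounds. Note that, as far as I understand, rmvbin's thresholded-Gaussian construction (the commonprob2sigma step in the error message) can impose even tighter limits than these bounds:
pA <- 0.01; pB <- 0.01; rho <- 0.5   # values from the failing call
# joint probability implied by the requested correlation
pAB <- rho * sqrt(pA * (1 - pA) * pB * (1 - pB)) + pA * pB
# Fréchet bounds that any valid joint probability must satisfy
c(lower = max(0, pA + pB - 1), implied_pAB = pAB, upper = min(pA, pB))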
For bivariate Gaussian random variables the marginals and the correlations are separate and can be specified independently of each other.
The question should probably be moved to stats exchange.

Related

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform GLM (Poisson) estimations with fixed-effects (say, merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) of the observations have missing values (NA) for some, but not all, regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one missing regressor before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of the available information in the data.
Therefore, I would like to demean my variables by hand, to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this produces negative dependent-variable values, which are not compatible with Poisson. If I run feglm with the "poisson" family on my manually demeaned data, I (logically) get: "Negative values of the dependent variable are not allowed for the 'poisson' family." The same error is returned with data demeaned using the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois with a fixed-effect in the formula using fepois on demeaned data and a formula without fixed-effects?
To use the example from the fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write the model down as a single equation in which the fixed-effects are estimated on all the data while the other variables are estimated only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing observations at random should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that the observations with missing values are relevant to estimating the fixed-effects coefficients (or, stated differently, that they should be used to demean some variables), you are implying that the missing values are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias in the estimated coefficients caused by selection into non-missingness. That's a deeper problem which cannot be solved by technical tricks but only by a profound understanding of the data.
GLM
There is a misunderstanding about GLM here. GLM is a very clever trick to estimate maximum likelihood models with OLS, known as iteratively reweighted least squares (IRLS; there's a nice description here). It was developed and used at a time when general-purpose optimization was computationally very expensive, and it made it possible to use well developed, fast OLS routines to perform equivalent estimations.
GLM is an iterative process in which a standard OLS estimation is performed at each step; the only thing that changes across iterations is the weights attached to each observation. And since each step is a regular OLS problem, techniques for fast OLS estimation with multiple fixed-effects can be leveraged (as is done in the fixest package).
So actually, you could do what you want, but only within the OLS step of the GLM algorithm. By no means should you demean the data before running the GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).
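To make the point concrete, here is a minimal IRLS sketch for a Poisson GLM with log link (illustrative base R only; the function irls_pois and its internals are mine, not fixest's). The working weights and working response are recomputed at every iteration, which is why any demeaning would have to happen inside each weighted OLS step rather than once up front:
irls_pois <- function(y, X, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)      # linear predictor
    mu  <- exp(eta)              # inverse link
    w   <- mu                    # Poisson working weights (change every iteration)
    z   <- eta + (y - mu) / mu   # working response (changes every iteration)
    # one weighted OLS step; with fixed-effects, the demeaning belongs here
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}
# Sanity check against glm():
# irls_pois(y, model.matrix(~ x1 + x2, df))
# coef(glm(y ~ x1 + x2, data = df, family = poisson))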

AUC in Weka vs R

I have received AUCs and predictions from a collaborator, generated in Weka. The statistical model behind them was cross-validated, so my dataset with the predictions includes columns for fold, predicted probability and true class. Using this data I was unable to replicate the AUCs from the predicted probabilities in R. The values always differ slightly.
Additional details:
Weka was used via GUI, not command line
I checked the AUC in R with packages pROC and ROCR
I first tried calculating the AUC over the pooled predictions (without regard to fold) and got different AUCs.
Then I tried calculating the AUCs per fold and averaging them. This did not match either.
The model was ridge logistic regression and there is a single tie in the predictions.
The first fold has one sample more than the others. I have tried taking a weighted average, but this did not work out either.
I have even tested averaging the AUCs after a logit transformation (for normality).
Taking the median instead of the mean did not help either.
I am familiar with how the AUC is calculated in R, but I don't see what Weka could do differently.
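For reference, a sketch of the two computations described above using pROC (the data frame df and its columns fold, prob and truth are assumed names, not from the original post):
library(pROC)
# pooled AUC over all folds
auc(roc(df$truth, df$prob, quiet = TRUE))
# per-fold AUCs, then the (unweighted) mean
fold_aucs <- sapply(split(df, df$fold),
                    function(d) as.numeric(auc(roc(d$truth, d$prob, quiet = TRUE))))
mean(fold_aucs)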

R package multgee -- initial values

I'm working on estimating a generalized estimating equation in R. I have a multinomial (ordinal) outcome, and so I have been attempting to use the multgee package since, as far as I know, packages like geepack or gee don't allow for the estimation of multinomial outcomes.
However, I'm running into some issues. The documentation seems good, but in particular, it seems to require initial (starting) values for the model. If I try to run the model without them, I get a message requesting starting values. Here's the model:
formula <- PAINAD_recode ~ Age + Gender + group_2part + Cornell + stim_intensity_scale
fitmod <- ordLORgee(formula, data = data, bstart = c(1, 0, 1, 0, 1),
                    id = data$subject, repeated = data$trial)
I just threw in some ones and zeroes as starting values there. However, when I enter starting values (even plausible ones), it claims that:
Starting values and parameters vector differ in length
I thought that with five predictors I would need five starting values. I can't find more information about this particular parameter vector. Does anyone have any thoughts on this? The outcome here has five levels (ordinal) and the repeated component has 20 levels. Any suggestions would be appreciated.
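A hedged guess, not from the original thread: for an ordinal outcome with five levels, a cumulative-logit model typically estimates four intercepts (cutpoints) in addition to one slope per model-matrix column, so bstart likely needs more than five entries. Something like the following should produce a vector of a plausible length (the object names are the question's own):
# slope parameters = model-matrix columns minus the intercept column
p <- ncol(model.matrix(formula, data = data)) - 1
# 4 cutpoints (5 outcome levels) + p slopes; zeros as neutral starting values
bstart <- rep(0, 4 + p)
fitmod <- ordLORgee(formula, data = data, bstart = bstart,
                    id = data$subject, repeated = data$trial)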

gbm multinomial distribution

I'm trying to use gbm for the first time (actually, any kind of regression tree for the first time) on my data, which consists of 14 continuous predictor variables and a factor with 13 levels as the response variable. I came to gbm via a very good description by Elith et al., who however used a modification of the basic gbm package that can't handle multinomial distributions. The gbm help claims that it can handle this:
"distribution: either a character string specifying the name of the distribution to use or a list
with a component name specifying the distribution and any additional param-eters needed. If not specified, gbm will try to guess: if the response has only
2 unique values, bernoulli is assumed; otherwise, if the response is a factor,
multinomial is assumed; otherwise, if the response has class "Surv", coxph is
assumed; otherwise, gaussian is assumed.
Currently available options are "gaussian" (squared error), "laplace" (absolute
loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 out-comes), "huberized" (huberized hinge loss for 0-1 outcomes), "multinomial"
(classification when there are more than 2 classes), "adaboost" (the AdaBoost
exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right
censored observations), "quantile", or "pairwise" (ranking measure using the
LambdaMart algorithm)."
Nevertheless, it doesn't work, no matter whether I specify "multinomial" or let it guess. Does anyone have any idea what I am doing wrong? Or am I misunderstanding something completely: does a multinomial distribution of my data not mean that my loss function should also be based on the multinomial distribution? It runs if I choose "gaussian", but I guess in that case something completely different is calculated?
I'd appreciate any help!
Are you using the newest version of gbm? I had a similar issue which was resolved after re-installing the gbm package.
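If it is a version issue, a minimal multinomial call looks roughly like this (the data frame mydata and the response column class are assumed names, not from the original post):
library(gbm)
# the response must be a factor for the multinomial loss
fit <- gbm(class ~ ., data = mydata, distribution = "multinomial",
           n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
# class probabilities for new data
pred <- predict(fit, newdata = mydata, n.trees = 1000, type = "response")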

specifying probability weights in R *without* using Lumley survey package

I would really appreciate any help with specifying probability weights in R without using Lumley's survey package. I am conducting mediation analysis in R using the Imai et al. mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)
However, I'm unsure if this is weighting the data correctly. The reason is that this code yields standard errors that differ from those I am getting in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering whether the weights option in R might not be for inverse probability weights, but I was unable to determine this from the documentation, this forum, or elsewhere. If I am missing something, I really apologize; I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the weights parameter of the lm function is inversely proportional to the variance of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct the previous answer: I looked up the manual on weights and found the following description for weights in lm:
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata): they count each observation as many times as the weight vector specifies. Probability weights, on the other hand, are the inverse of the probability that an observation is included in the sample under the sampling design. They adjust the impact of each observation on the coefficients, but not on the standard errors in the fweight sense, because they don't change the number of observations represented by the sample.
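A hedged sketch of one way to reproduce the Stata output: Stata's [pweight=] gives the same point estimates as lm() with weights but pairs them with robust (sandwich) standard errors, so combining the weighted fit with a heteroskedasticity-robust covariance matrix should get close (HC1 corresponds to Stata's small-sample correction):
library(sandwich)
library(lmtest)
fit <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
          data = unifiedanalysis, weights = designweight)
# robust (sandwich) standard errors on the weighted fit
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))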
