My question is quite simple, but I've been unable to find a clear answer in the R manuals or through online searching. Is there a good way to verify which level of the response variable is treated as the reference when doing a logistic regression with glmer?
I am getting results that consistently run the exact opposite of theory, and I think my response variable must be reversed from my intention, but I have been unable to verify this.
My response variable is coded in 0's and 1's.
Thanks!
You could simulate some data where you know the true effects ... ?simulate.merMod makes this relatively easy. In any case,
the effects are interpreted in terms of their effect on the log-odds of a response of 1
e.g., a slope of 0.5 implies that a 1-unit increase in the predictor variable increases the log-odds of observing a 1 rather than a 0 by 0.5.
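As a quick numeric sketch of that interpretation (plain base R, nothing model-specific):

plogis(0)      # probability 0.50 when the log-odds are 0
plogis(0.5)    # probability ~0.62 when the log-odds are 0.5

So starting from log-odds of 0, a slope of 0.5 moves the predicted probability of observing a 1 from 0.50 to about 0.62 for a 1-unit increase in the predictor.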
For questions of this sort, glmer inherits its framework from glm. In particular, ?family states:
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
1. As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
2. As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
3. As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
Your data are a (common) special case of #2: the "proportion of successes" is either 0% or 100% for each case, because there is only one case per observation (the weights vector is a vector of all ones by default).
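Here is a minimal sketch of the simulation idea, using plain glm for brevity (the same conventions carry over to glmer); the data, true slope, and variable names are made up for illustration:

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 * x))   # true slope of 0.5 on the log-odds of a 1

coef(glm(y ~ x, family = binomial))                  # slope estimate should be near +0.5

## With a factor response, "success" is the second level:
yf <- factor(y, levels = c(0, 1), labels = c("no", "yes"))
levels(yf)                                           # "no" "yes" -> "yes" is treated as success
coef(glm(yf ~ x, family = binomial))                 # identical fit

If the sign of the recovered slope matches the sign you simulated, the coding works the way you intended; if your real results still run opposite to theory, the reversal is probably not coming from the 0/1 coding itself.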
I am currently trying to impute a three-level dataset with 87 columns and 71,756 rows. The variables comprise 4 identifier columns, 15 continuous outcome variables without missing entries, and 68 predictors and covariates with missing entries:
On level 1 (lowest, represents an individual) there are 16 ordinal and 20 dichotomous variables,
on level 2 there are 28 continuous variables, and
on level 3 (top) there are 4 ordinal variables.
I've been following Simon Grund's example for modeling three-level data using mice with the mice.impute.ml.lmer function. Naturally, I had to make some adaptations to the example model to fit my data:
I tried setting model to "binary" to run a logistic mixed effects model for the dichotomous variables ("pmm" for the ordinal, "continuous" for the continuous).
I tried adding random slopes and interaction effects.
mice.impute.2lonly.pmm was used instead of mice.impute.2lonly.norm for the top level imputation.
I added post-processing to a level 2 variable to set upper and lower boundaries.
However, when running mice with some variables modeled as "binary" (without random slopes or interactions), I get the following warning:
Warning message in commonArgs(par, fn, control, environment()):
“maxfun < 10 * length(par)^2 is not recommended.”
Execution of mice hangs at this point.
I ran a test with mice (1 iteration), this time with all dichotomous variables as "pmm", and this time the function completed the run. However, after adding variables to random_slopes it seemingly gets stuck (running indefinitely) on the imputation of the first three variables. My assumption is that this is due to the relatively large dataset, which makes the process computationally very demanding.
I am wondering what exactly causes this warning, and if there are ways to avoid it. I would also like to know if there are ways to improve the computational efficiency of such a large model.
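From what I can tell, the warning text itself comes from lme4 and refers to the optimizer's evaluation budget (maxfun) being smaller than 10 times the squared number of parameters being optimized. In a direct (g)lmer call one could raise that budget roughly like this (hypothetical formula and data); what I don't know is whether mice.impute.ml.lmer lets me forward such a control object to the underlying model fit:

library(lme4)

ctrl <- glmerControl(optimizer = "bobyqa",
                     optCtrl   = list(maxfun = 2e5))   # raise the evaluation budget

fit <- glmer(y ~ x1 + x2 + (1 + x1 | group),
             data = dat, family = binomial, control = ctrl)

If control settings cannot be forwarded, simplifying the random-effects structure would seem to be the other obvious lever.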
I am not very familiar with mice, but I have some thoughts regarding how the data is imputed:
I am planning to use the imputed data for a structural equation model I've built, where all the variables are grouped into indicators of latent constructs. It therefore seems natural that the indicator variables that belong to the same construct are imputed together.
In mice there is an argument called blocks which allows for multivariate imputation of variables grouped together as list elements. However, creating blocks containing variables from different levels produced the error that no top level was defined in the predictorMatrix (i.e., no block set to -2). As an alternative, it seems the formulas argument can be used in place of a predictor matrix. This option seems ideal, as it allows user-defined formulas for each block. Also, if I understand the whole process correctly, the predictorMatrix is only passed on to mice.impute.2lonly.pmm and not to mice.impute.ml.lmer. The question then is whether the formulas argument can be used to define three-level models using lme4 syntax, and whether these user-defined models in formulas are passed on to mice.impute.ml.lmer (a sketch of what I mean follows below). As a more general question, why can't mice.impute.ml.lmer be used for imputation at the top level? (At least, it didn't work when I tried.)
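Roughly what I have in mind, with made-up variable names (y1 and y2 belonging to one construct, x1 a covariate); whether lme4-style random-effect terms on the right-hand side would actually reach mice.impute.ml.lmer is exactly what I am asking:

library(mice)

form <- list(
  constructA = y1 + y2 ~ x1,      # y1 and y2 form one block, imputed from x1
  x1block    = x1 ~ y1 + y2
)

imp <- mice(dat, formulas = form,
            method = c("pmm", "pmm"),   # one method per block
            m = 1, maxit = 1)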
Then there's also an argument group_index in mice.impute.ml.lmer used to pass group identifiers to mice.impute.bygroup. From reading the documentation I am still unsure what this function actually does, as I can find little information on it. It seems to be designed for grouping variables together by level, but not for grouping variables across different levels, correct? What would distinguish mice.impute.bygroup from creating blocks? And what would be the difference between doing this and calling models in mice.impute.ml.lmer?
As for computational efficiency, I have no idea whether grouping variables together would help. I could really use some advice on this part.
I am new to the gamlss() function in R, and when I run my code I always get this error: Response Variable out of range.
After looking into the data frame, I realized the issue was that one of the response values was 0.0000.
I was wondering if someone could explain to me why 0 is out of range, and what possible solutions there are to work around it (e.g., replacing the values)?
The LOGNO family corresponds to the log-normal distribution, which is defined for strictly positive values only.
Possible solutions (which depend highly on the context) include:
use another distribution that better models the response variable and allows zero values;
sometimes zero values are reported because they are below the limit of detection (LOD).
In that case you have a censored data set, and you may want to look for a review of methods for handling it. A pragmatic approach is to substitute the zeros with values such as LOD/2 (reviewed, for example, here), but it may result in very biased estimates. A sketch of both options follows below.
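Both directions in rough R, assuming a data frame dat with response y, a predictor x, and a known limit of detection lod (all names made up for illustration):

library(gamlss)

## Option 1: a zero-adjusted family such as ZAGA (zero-adjusted gamma), which puts
## a point mass at zero and a continuous distribution on the positive values.
fit_zaga <- gamlss(y ~ x, family = ZAGA, data = dat)

## Option 2: the pragmatic LOD/2 substitution, then the original LOGNO fit;
## keep the potential bias in mind.
dat$y_sub <- ifelse(dat$y == 0, lod / 2, dat$y)
fit_logno <- gamlss(y_sub ~ x, family = LOGNO, data = dat)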
I am using R (RStudio) to construct an index/synthetic indicator to evaluate, say, commercial efficiency. I am using the PCA() command from the FactoMineR package on 7 distinct variables. I have previously created similar indexes by calculating the weight of each particular variable on the first component (which can be obtained through PCA()$var$coord[,1]), with no problems, since each variable had a positive weight. However, there is one particular variable whose weight has an undesired sign: negative. The variable is ‘delivery speed’, and this sign would imply that the greater the speed, the less efficient the process. So, what is going on? How would you amend this issue, preferably still using PCA?
The sign of variable weights shouldn't matter in PCA. Since, taken together, all of the components perfectly represent the original data (when p < n), it is natural that some components will have both positive and negative weights. That doesn't mean that particular variable has an undesired weight; rather, for that particular extracted signal (say, the first principal component) the variable's weight happens to be negative.
For a better understanding, let's take the classic 2-dimensional example, which I took from this very useful discussion:
Can you see from the graph that one of the weights will necessarily be negative for the 2nd principal component?
Finally, if that variable does actually disturb your analysis, one possible solution would be to apply Sparse PCA. Under cross-validated regularization that method is able to make some of the weights equal to zero. If in your case that negative weight is not significant enough, it might get reduced to zero under SPCA.
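To make the sign-indeterminacy point concrete, a small sketch with FactoMineR (dat stands in for your 7-variable data frame):

library(FactoMineR)

res <- PCA(dat, graph = FALSE)
res$var$coord[, 1]                 # variable coordinates on the first component

## Flipping the sign of a whole component (variable coordinates and individual
## scores together) describes exactly the same axis, just with its orientation
## reversed, so nothing substantive about the data changes.
flipped_var <- -res$var$coord[, 1]
flipped_ind <- -res$ind$coord[, 1]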
I am using phylopars() in Rphylopars package in R to generate the missing values in a large dataset about animal body traits (eg body size). (https://www.rdocumentation.org/packages/Rphylopars/versions/0.2.9/topics/phylopars)
This method is called imputation, and what it does is phylogenetically estimate the missing data.
However, the output of the imputation contains some negative values, which make no sense because all of the estimated traits have to be greater than zero.
I wonder how I can fix this issue or how to set up a minimum limit for the estimated values.
I'm not new to R, but I am new to Rphylopars, so maybe this question is pretty naive, but I couldn't find a solution.
For correlated traits, values from one trait will influence another, which can produce negative estimates. phylopars allows you to specify whether traits are correlated, so you can try setting phylo_correlated = FALSE or imputing the traits individually.
Alternatively, transforming the data can ensure the trait stays within range, depending on its nature: log-transforming (and back-transforming afterwards) and assessing the distribution can help.
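A rough sketch of the log-transform route (assuming a trait_data data frame with a species column plus strictly positive trait columns and a tree object, as in the Rphylopars examples; check the package vignette for where the reconstructed values live, here assumed to be anc_recon):

library(Rphylopars)

td_log <- trait_data
td_log[, -1] <- log(td_log[, -1])                 # log-transform the trait columns (not 'species')

fit <- phylopars(trait_data = td_log, tree = tree)

imputed <- exp(fit$anc_recon)                     # back-transform; values are now strictly positive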
The post
Classification functions in linear discriminant analysis in R
from user Tyler provides a function to produce the classification functions (not discriminant functions!) from an LDA model generated with lda().
I used these classification functions to calculate all classification scores for my data. I want to use the additional information, e.g., to find out which was the second most probable class and to understand the development across different time slices.
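In case it helps, this is roughly how I extract that ranking from the score matrix (scores here is a made-up name for my observations-by-classes matrix produced by those classification functions; ranking the posterior probabilities from predict(lda_fit)$posterior should give the same ordering):

## scores: one row per observation, one column per class
ranked <- t(apply(scores, 1, function(s) colnames(scores)[order(s, decreasing = TRUE)]))
second <- ranked[, 2]          # second most probable class for each observation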
Now I would like to ask you for your help to interpret the following scenarios:
scores close to or exactly zero (is it possible to claim that this particular class was effectively not recognized?)
a single negative score with a higher absolute value than the highest positive score (does it mean anything at all?)
results with all negative scores (in the original interpretation, the highest score determines the classification; is this intended by the LDA, or does it mean that none of the classes is a good fit, so one could say that no known pattern could be identified?)
a single very low positive value while all the others are negative with large absolute values (can I argue that the "signal strength" is low in this case?)
I know this is more of a statistical than a programming problem. I thought of it as a follow-up to the post referenced at the beginning of this question.
Thank you very much for your help!