Having issues transforming my data for further analysis in R

I have a dataset (not reproduced here).
I want to perform simple and multiple linear regression. MoralRelationship and SkeletalP are both dependent variables, while the others are independent. I tried every transformation method I know, but none of them gave a meaningful improvement in my diagnostic plots.
I did this:
lm1 <- lm(MoralRelationship ~ RThumb + RTindex + RTmid + RTFourth + RTFifth + Lthumb + Lindex +
          LTMid + LTFourth + LTfifth + BldGRP1 + BlDGR2, data = data)
I did the same for SkeletalP.
I made diagnostic plots for both, then tried to normalize the variables because there is neither correlation nor linearity. I tried the square, log, square root, and 1/x of all the independent variables, but the output was no better.
I also tried
`lm(SkeletalP ~ RThumb + I(RThumb^2), data=data)`
to see whether I would get a better result with a single variable.
The independent variables are right skewed except for ANB which is normally distributed.
Is there a method I can use to transform my data? Most importantly, I want it to be uniformly distributed so that I can perform other statistical tests.

Your dataset is rather small. You could try dimensionality reduction such as PCA, but I don't think it's appropriate here, and the results are also harder to interpret.
Have you tried other models? Regularization might help the fit of your regression models (e.g. Lasso/Ridge, i.e. L1/L2 regularization).
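To make the ridge suggestion concrete, here is a minimal sketch using MASS::lm.ridge() on simulated data. The data frame d, the columns x1 to x3 and y, and the lambda grid are all made up for illustration; they are not the poster's dataset, and the sketch cannot show whether ridge would actually help with the real data.

```r
library(MASS)  # ships with R; provides lm.ridge()

set.seed(1)
n <- 60
# Simulated, right-skewed predictors (assumption: stand-ins for the real ones)
d <- data.frame(x1 = rexp(n), x2 = rexp(n), x3 = rexp(n))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(n)

# Fit ridge regression over a grid of penalties
lambdas <- seq(0, 20, by = 0.5)
ridge_fit <- lm.ridge(y ~ x1 + x2 + x3, data = d, lambda = lambdas)

# Pick the penalty with the smallest generalized cross-validation score
best <- which.min(ridge_fit$GCV)
best_lambda <- lambdas[best]
best_coef <- coef(ridge_fit)[best, ]  # intercept first, then x1..x3
print(best_lambda)
print(best_coef)
```

Note that the predictors are left skewed on purpose: linear regression makes assumptions about the residuals, not about the distribution of the independent variables.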

Related

Trouble Converging Bifactor model using lavaan

Title basically explains it but I'm trying to build a bifactor model with psychopathy as one factor and subtypes as the other. I believe that I have everything constrained properly but that might be the issue.
Current code:
BifactorModel <- 'psychopathyBi =~ YPIS_1 + YPIS_2 + YPIS_3 + YPIS_4 + YPIS_5 + YPIS_6 + YPIS_7 + YPIS_8 + YPIS_9 +YPIS_10 + YPIS_11 + YPIS_12 + YPIS_13 + YPIS_14 + YPIS_15 + YPIS_16 + YPIS_17 + YPIS_18
GMbi =~ YPIS_4 + YPIS_5 + YPIS_8 + YPIS_9 + YPIS_14 + YPIS_16
CUbi =~ YPIS_3 + YPIS_6 + YPIS_10 + YPIS_15 + YPIS_17 + YPIS_18
DIbi =~ YPIS_1 + YPIS_2 + YPIS_7 + YPIS_11 + YPIS_12 + YPIS_13
psychopathyBi ~~ 0*GMbi
psychopathyBi ~~ 0*CUbi
psychopathyBi ~~ 0*DIbi
GMbi ~~ 0*CUbi
GMbi ~~ 0*DIbi
CUbi ~~ 0*DIbi
'
#fit bifactor model
bifactorFit <- cfa(BifactorModel, data = YPIS_Data)
#get summary of bifactor model
summary(bifactorFit, fit.measures = TRUE, standardized = TRUE)
This produces the following:
lavaan 0.6-9 did NOT end normally after 862 iterations
this is what the model should ultimately look like once converged
Any suggestions or comments would be greatly appreciated. Thanks in advance.
The variances of several of your latent variables are very small. For example, the variance of DIbi appears to be effectively zero. That's the source of the issue here.
There are two things you can try to remedy this.
First, it may work better to identify the model by fixing the latent variable variances to 1, rather than fixing the first indicator factor loadings to 1. Do this by specifying std.lv = TRUE.
Even then, it will likely be the case that one or more of the group factors will have very small loadings. This indicates that there really isn't much of a group factor in your data for these items that is distinct from the general factor. You should consider estimating a model that drops that group factor (as well as comparing with models dropping the other group factors one at a time). We discuss this issue some here: https://psyarxiv.com/q356f/
Additionally, you should constrain item loadings so that they are in the theoretically expected direction (e.g., all positive with a lower bound of 0). It is common for bifactor models to overextract variance in items and produce uninterpretable group factors that have a mix of positive and negative loadings. This can also cause convergence issues.
In general, this sort of unconstrained bifactor model tends to be overly flexible and tends to overfit to a similar degree as exploratory factor analysis. You should be sure to evaluate the bifactor model based not only on global model fit statistics, but also on whether the factor loadings actually resemble a true bifactor model--do the items each show substantial loadings on both the general factor and their group factor in the expected directions, or do items tend to load on only one or the other? See some examples in the paper linked above about this issue.
Another option would be to switch to exploratory bifactor modeling. This is implemented in R in the fungible package in the fungible::BiFAD() function. This approach is discussed here:
https://www.sciencedirect.com/science/article/pii/S0001879120300555
Exploratory bifactor models are useful because they rely on targeted EFA rotation to estimate loadings. This makes convergence much more likely and can help to diagnose when a group factor is too weak to identify in the data.
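As a rough illustration of the std.lv = TRUE suggestion, the sketch below simulates data from a small bifactor model and fits it with latent variances fixed to 1. Everything here is an assumption made for the example: the item names y1 to y9, the three group factors s1 to s3, the loading values, and the sample size. It does not use the YPIS data and says nothing about whether the original model will converge.

```r
library(lavaan)

set.seed(123)
# Population bifactor model with made-up loadings (assumption, illustration only)
pop_model <- '
  g  =~ 0.7*y1 + 0.6*y2 + 0.5*y3 + 0.7*y4 + 0.6*y5 + 0.5*y6 + 0.7*y7 + 0.6*y8 + 0.5*y9
  s1 =~ 0.5*y1 + 0.4*y2 + 0.6*y3
  s2 =~ 0.6*y4 + 0.5*y5 + 0.4*y6
  s3 =~ 0.4*y7 + 0.6*y8 + 0.5*y9
  g  ~~ 0*s1 + 0*s2 + 0*s3
  s1 ~~ 0*s2 + 0*s3
  s2 ~~ 0*s3
'
sim_data <- simulateData(pop_model, sample.nobs = 2000)

# Model to estimate: same structure, loadings left free
fit_model <- '
  g  =~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9
  s1 =~ y1 + y2 + y3
  s2 =~ y4 + y5 + y6
  s3 =~ y7 + y8 + y9
'
# std.lv = TRUE fixes latent variances to 1 instead of the first loadings;
# orthogonal = TRUE fixes all latent covariances to 0 in one step
fit <- cfa(fit_model, data = sim_data, std.lv = TRUE, orthogonal = TRUE)

print(lavInspect(fit, "converged"))
loadings <- subset(parameterEstimates(fit), op == "=~")
print(loadings[, c("lhs", "rhs", "est")])
```

With orthogonal = TRUE the six explicit `~~ 0*` lines from the question are no longer needed. Inspecting the loadings table is where weak or sign-mixed group factors would show up.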

Scale LDA decision boundary

I have a rather unconventional problem and having a hard time finding a solution to this. Would really appreciate your help.
I have 4 genes (features) and my classification here is binary (0 and 1). After a lot of back and forth, I have settled on using LDA for the classification. I have several studies, each comparing the same two classes, and I trained a separate model on each study using these 4 genes.
I want to visualize the LDA scores in the form of points plot. Something like below, where each section represents a different study/dataset. Samples of that dataset on the X axis and the LD1 value I get using -
lda_model = lda(formula = class ~ ., data = train)
predict(lda_model,train) on the Y axis.
Since I trained a different model on each dataset, we can clearly see that the decision boundary (which I assume is the black line) is different for each dataset and on a different scale. However, I want to scale the values on the Y axis in such a way that all my datasets are on the same scale and I can represent this plot with a single decision boundary (again, something I can clearly draw on the plot, like the red line).
The LD1 values here are - a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) - mean(a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD)). This is done for each dataset individually. However, this is not exactly equal to (a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) + intercept) which we can get using logistic regression. I am trying to find that value or some method which can scale my Y axis across all the datasets using LDA.
Thanks for your help!
I did a min-max scaling and that seemed to work. It scaled all my data points across all datasets with decision boundary at zero.
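A possible alternative to min-max scaling, sketched under assumptions: for a two-class MASS::lda() fit, the LD1 axis is scaled so that the within-class variance is 1, and the point where the two posteriors are equal can be computed from the projected class means and the priors. Subtracting that point from LD1 puts the boundary at 0 for every study. The helper names boundary_centered_ld1() and make_study(), and the simulated studies, are hypothetical and only for illustration.

```r
library(MASS)  # ships with R; provides lda()

# Hypothetical helper: LD1 scores shifted so the decision boundary is 0
# and positive values point towards the second class level.
boundary_centered_ld1 <- function(train) {
  fit <- lda(class ~ ., data = train)
  z <- predict(fit, train)$x[, 1]
  # Project the class means the same way predict() projects the samples
  center <- colSums(fit$prior * fit$means)
  m <- drop(scale(fit$means, center = center, scale = FALSE) %*% fit$scaling)
  p <- fit$prior
  # Point on LD1 where both posteriors are equal (within-class variance is 1)
  cut <- unname((m[1] + m[2]) / 2 + log(p[1] / p[2]) / (m[2] - m[1]))
  unname((z - cut) * sign(m[2] - m[1]))
}

# Simulated studies on very different scales (assumption: stand-ins for real data)
make_study <- function(n0, n1, shift, spread) {
  n <- n0 + n1
  class <- factor(rep(c(0, 1), times = c(n0, n1)))
  genes <- matrix(rnorm(n * 4, mean = shift, sd = spread), ncol = 4)
  genes[class == 1, ] <- genes[class == 1, ] + spread
  data.frame(class = class, GeneA = genes[, 1], GeneB = genes[, 2],
             GeneC = genes[, 3], GeneD = genes[, 4])
}

set.seed(7)
study1 <- make_study(30, 50, shift = 5, spread = 1)
study2 <- make_study(40, 20, shift = 100, spread = 20)
scores1 <- boundary_centered_ld1(study1)
scores2 <- boundary_centered_ld1(study2)
```

Plotting scores1 and scores2 side by side then allows a single horizontal line at 0 as the shared boundary, with the Y axis in within-class standard deviation units for each study.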

Incorporating time series into a mixed effects model in R (using lme4)

I've had a search for similar questions and come up short so apologies if there are related questions that I've missed.
I'm looking at the amount of time spent on feeders (dependent variable) across various conditions with each subject visiting feeders 30 times.
Subjects are exposed to feeders of one type which will have a different combination of being scented/unscented, having visual patterns/being blank, and having these visual or scented patterns presented in one of two spatial arrangements.
So far my model is:
mod<-lmer(timeonfeeder ~ scent_yes_no + visual_yes_no +
pattern_one_or_two + (1|subject), data=data)
How can I incorporate the visit numbers into the model to see if these factors have an effect on the time spent on the feeders over time?
You have a variety of choices (this question might be marginally better for CrossValidated).
As @Dominix suggests, you can allow for a linear increase or decrease in time on feeder over time. It probably makes sense to allow this change to vary across birds:
timeonfeeder ~ time + ... + (time|subject)
You could allow for an arbitrary pattern of change over time (i.e. not just linear):
timeonfeeder ~ factor(time) + ... + (1|subject)
This probably doesn't make sense in your case, because you have a large number of time points (30 visits per subject), so it would require many parameters (it would be more sensible if you had, say, 3 time points per individual).
You could allow for a more complex pattern of change over time via an additive model, i.e. modeling change over time with a cubic spline. For example:
library(mgcv)
gamm(timeonfeeder ~ s(time) + ... , random = list(subject = ~1), ...)
(1) this assumes the temporal pattern is the same across subjects; (2) because gamm() uses lme rather than lmer under the hood, you have to specify the random effect as a separate argument, and gamm() only accepts the list form shown here. (You could also use the gamm4 package, which uses lmer under the hood.)
You might want to allow for temporal autocorrelation. For example,
lme(timeonfeeder ~ time + ... ,
random = ~ time|subject,
correlation = corAR1(form= ~time|subject) , ...)
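A runnable sketch of the autocorrelation option, on simulated data. The data frame sim, the visit and scent columns, and all effect sizes are assumptions made for illustration. The visit-by-scent interaction is one way to ask whether the effect of a factor changes over visits; it is an addition here, not something the answer spells out. The random part is kept to a random intercept to keep the example simple; a random slope could be added as in the formula above.

```r
library(nlme)  # ships with R; provides lme() and corAR1()

set.seed(2024)
n_subj <- 20
n_visit <- 30
# Simulated data (assumption): one feeder type per subject, 30 visits each
sim <- expand.grid(visit = 1:n_visit, subject = factor(1:n_subj))
sim$scent <- factor(ifelse(as.integer(sim$subject) %% 2 == 0, "yes", "no"))

# Subject-level random intercepts, repeated for each visit
subj_effect <- rnorm(n_subj, sd = 2)[as.integer(sim$subject)]
# AR(1) noise generated separately for each subject, in visit order
noise <- unlist(lapply(seq_len(n_subj), function(i) {
  as.numeric(arima.sim(list(ar = 0.5), n = n_visit))
}))
sim$timeonfeeder <- 20 + subj_effect + 0.1 * sim$visit +
  0.2 * sim$visit * (sim$scent == "yes") + noise

# Random intercept per subject plus AR(1) errors within subject
fit <- lme(timeonfeeder ~ visit * scent,
           random = ~ 1 | subject,
           correlation = corAR1(form = ~ visit | subject),
           data = sim)
print(fixef(fit))
```

The coefficient named visit:scentyes estimates how much the slope over visits differs between scented and unscented feeders in this simulated setup.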

R - plotting the predictions from a mixed model with more than two predictors (continuous and factor)

I found this answer by Ben Bolker to a post and it is really helpful (How to plot random intercept and slope in a mixed model with multiple predictors?). However, suppose my model looks more like this:
mod <- lmer(resp ~ pred1 + pred2 + factor(pred3) + (1|RF1), data=d)
I also want to plot the factor's influence on the response while keeping the other two predictors constant. How would I create the nd data frame in that case? Also, how would I go about plotting random slopes? Thank you very much in advance!
EDIT: Ben, thank you very much for the answer and I apologize, of course it makes sense to give a reproducible example.
So, the first question: how can I plot the influence of a predictor keeping the others constant (as described in your answer to the above linked question) if I have a factor variable in my model?
Here is my example data: https://www.dropbox.com/s/ytlocw868fsnpu7/realdatasample.csv?dl=0, please treat confidentially :).
So the model would be:
moddata <- lmer(meanQUALNEW ~ meanDBH + meanCRRATIO + richn_tar + (1|region),data=realdatasample)
From what I understand, the example given in the link above is about constructing a plot for one predictor while keeping the other constant and then vice versa and taking into account the random effect. But how do I expand that code to account for three variables and especially if it is a factor?
The second question:
How can I visualize the random slopes in a model like this?
moddata1 <- lmer(meanQUALNEW ~ meanDBH + meanCRRATIO + richn_tar + (richn_tar-1|region),data=realdatasample)
As far as I understand, the packages visreg and effects provide ways to visualize the fixed part of such models in the accepted way (change in one predictor keeping others constant). But they don't work (as far as I know) for nice visualizations of the random effects variance components.
I realize that there is probably a lot of information about this out there, but I like the clear code example from above very much and would like to understand how to do these things "by hand".
Thanks so much for any help!

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
A couple comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
Like I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, this leads me to expect that really the problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
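A small check of the nesting syntax on simulated data, using only base R. The tickers AAA, BBB, CCC and the columns x1, x2, y are made up for illustration and stand in for the real variables. The sketch compares the nested lm() coefficients with separate per-company fits.

```r
set.seed(42)
# Simulated data (assumption): three companies, 20 rows each
reg.df <- data.frame(
  pp = factor(rep(c("AAA", "BBB", "CCC"), each = 20)),
  x1 = rnorm(60),
  x2 = rnorm(60)
)
# Slope on x1 differs by company (1, 2, 3)
reg.df$y <- 1 + as.integer(reg.df$pp) * reg.df$x1 - 0.5 * reg.df$x2 + rnorm(60)

# One model with a separate intercept and separate slopes per company
nested <- lm(y ~ 0 + pp / (x1 + x2), data = reg.df)

# Reference: fit each company on its own
by_group <- sapply(split(reg.df, reg.df$pp), function(g) {
  coef(lm(y ~ x1 + x2, data = g))
})

# coef(nested) is ordered as intercepts, then x1 slopes, then x2 slopes,
# each across companies, so filling by row matches the layout of by_group
nested_mat <- matrix(unname(coef(nested)), nrow = 3, byrow = TRUE)
print(all.equal(unname(by_group), nested_mat))
```

The point estimates agree, but as noted above the standard errors will generally differ, because the nested model pools the residual variance across companies.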
