I am a total novice in R. I have an assignment using linear regression where we have to produce two different models to see which one is a better predictor of pain. The first model is to contain just age and gender. The second model is to include extra variables: the State Trait Anxiety Inventory (STAI) score, the Pain Catastrophizing Scale, the Mindful Attention Awareness Scale, and measures of cortisol levels in both saliva and serum (blood).
The research question asks for a hierarchical regression: build a model containing age and sex as predictors of pain (model 1), then build a new model with the predictors age, sex, STAI, pain catastrophizing, mindfulness, and the cortisol measures (model 2). Hence the predictors used in model 1 are a subset of the predictors used in model 2. After fitting both models, they need to be compared to assess whether substantial new information about pain was gained in model 2 compared to model 1.
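For reference, this kind of nested-model comparison can be done in base R with lm() and anova(). The sketch below uses simulated data and assumed column names (pain, age, sex, STAI), so the real dataset's names and extra predictors will differ:

```r
# Simulated stand-in data; real column names and values will differ
set.seed(1)
n <- 100
dat <- data.frame(
  age  = rnorm(n, mean = 40, sd = 10),
  sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  STAI = rnorm(n, mean = 40, sd = 5),
  pain = rnorm(n, mean = 5, sd = 2)
)

# Model 1: age and sex only
model1 <- lm(pain ~ age + sex, data = dat)

# Model 2: model 1's predictors plus an extra measure (the real model
# would also add pain catastrophizing, mindfulness, and cortisol)
model2 <- lm(pain ~ age + sex + STAI, data = dat)

# Incremental F-test: does model 2 add information over model 1?
anova(model1, model2)
```

Because model 1's predictors are a subset of model 2's, anova() performs exactly the incremental F-test that hierarchical regression calls for.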
I am having a lot of problems with "sex" as a variable: one observation was coded "3" instead of male or female, and although I have excluded that score, "3" still comes up as a level in the data set. Is there a way to remove it?
Furthermore, how can I convert "sex" into a "factor" type vector instead of a "character" vector? Can categorical variables be predictors in a model? I have attempted this with the following commands, but they keep returning errors.
sex.vector <- c("female", "male")  # etc.
factor.sex.vector <- factor(sex.vector)
Below is an excerpt of the data set:
'data.frame': 156 obs. of 10 variables:
$ sex : Factor w/ 3 levels "3","female","male": 2 2 3 3 3 3 3 2 2 2 ...
Eliminate the unwanted value and then, as suggested by mt1022, apply factor() again:
factor.sex.vector <- subset(factor.sex.vector, factor.sex.vector != 3)
factor.sex.vector <- factor(factor.sex.vector)
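A slightly more direct alternative (same idea, base R) is droplevels(), which removes any levels that no longer occur in the data; a minimal sketch:

```r
# A factor that still carries the bad "3" level after filtering
sex <- factor(c("female", "male", "3", "female"))
sex <- sex[sex != "3"]   # the row is gone, but "3" survives as a level
levels(sex)              # "3" "female" "male"

sex <- droplevels(sex)   # drop the now-empty level
levels(sex)              # "female" "male"
```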
I want to conduct a two-way ANOVA to investigate whether self-rated health is differently related to life satisfaction in different age groups. First I checked the assumption of homogeneity of variance with the Levene test in R. The outcome was that the assumption is violated (p < 2.2e-16). I then decided to calculate a Welch ANOVA with the oneway.test() function, since it does not require homogeneous variances. But after that you have to control the alpha error with pairwise t-tests, and this is not possible for a two-way ANOVA.
What can I do now? And what is the detrimental outcome if I calculate an ANOVA even though the assumption of homogeneity of variance is violated?
I am new here and to stats; I hope you can still understand my question.
my variables:
lz_20 (life satisfaction): numeric ;
Fjp40 (self-rated health): I converted it from numeric to factor w/ 5 levels ;
Falter (age): I converted it from numeric to factor w/ 3 levels
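For context, the Welch test mentioned above is available in base R via oneway.test() for a single factor; a minimal sketch with simulated stand-ins for lz_20 and Falter (this does not solve the two-way case, which is the open question here):

```r
set.seed(1)
dat <- data.frame(
  lz_20  = rnorm(90),  # life satisfaction (numeric stand-in)
  Falter = factor(rep(c("young", "middle", "old"), each = 30))
)

# Welch's ANOVA: does not assume equal variances across groups
res <- oneway.test(lz_20 ~ Falter, data = dat, var.equal = FALSE)
res
```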
I have training data train with distance and dest_zip_code as predictor variables to predict delivery_days. I am using the ranger RF package to create a quantile RF regression model object. Please note that the dest_zip_code levels in the training data are based on 6 months.
Now, I have two identical test sets, test_A and test_B:
test_A has dest_zip_code from last 2 months and levels are also based on last 2 months.
test_B has dest_zip_code from last 2 months but levels are refactored to last 6 months (same levels as train data)
When I use the predict function on both test sets with the same trained model object, at least half of the predictions are different.
Can someone help me understand how different factor levels in test data with the same observations affect the predictions?
Which one is theoretically correct?
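The likely mechanism can be seen in base R: the same values coded against different level sets get different underlying integer codes, and a model that works with those codes will treat them as different inputs. A minimal illustration (the zip codes here are made up):

```r
zips <- c("10001", "60601")   # made-up test-set observations

# Levels based on the last 2 months only (test_A style)
f_short <- factor(zips, levels = c("10001", "60601"))

# Levels based on the full 6 training months (test_B style)
f_long <- factor(zips, levels = c("02134", "10001", "30301", "60601"))

as.integer(f_short)  # 1 2
as.integer(f_long)   # 2 4  -- same labels, different internal codes
```

In general, the test factor should be built with the same levels as the training data (the test_B setup), so the model sees categories coded the way it was trained on.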
I'm new to linear mixed effects models and I'm trying to use them for hypothesis testing.
In my data (DF) I have two categorical/factor variables: color (red/blue/green) and direction (up/down). I want to see if there are significant differences in scores (numeric values) across these factors and if there is an interaction effect, while accounting for random intercepts and random slopes for each participant.
What is the appropriate lmer formula for doing this?
Here's what I have...
My data is structured like so:
> str(DF)
'data.frame': 4761 obs. of 4 variables:
$ participant : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ direction : Factor w/ 2 levels "down","up": 2 2 2 2 2 2 2 2 2 2 ...
$ color : Factor w/ 3 levels "red","blue",..: 3 3 3 3 3 3 3 3 3 3 ...
$ scores : num 15 -4 5 25 0 3 16 0 5 0 ...
After some reading, I figured that I could write a model with random slopes and intercepts for participants and one fixed effect like so:
model_1 <- lmer(scores ~ direction + (direction|participant), data = DF)
This gives me a fixed effect estimate and p-value for direction, which I understand to be a meaningful assessment of the effect of direction on scores while individual differences across participants are accounted for as a random effect.
But how do I add in my second fixed factor, color, and an interaction term whilst still affording each participant a random intercept and slope?
I thought maybe I could do this:
model_2 <- lmer(scores ~ direction * color + (direction|participant) + (color|participant), data = DF)
But ultimately I really don't know what exactly this formula means. Any guidance would be appreciated.
You can include several random slopes in at least two ways:
What you proposed: Estimate random slopes for both predictors, but don't estimate the correlation between them (i.e. assume the random slopes of different predictors don't correlate):
scores ~ direction * color + (direction|participant) + (color|participant)
The same but also estimate the correlation between random slopes of different predictors:
scores ~ direction * color + (direction + color|participant)
Please note two things:
First, in both cases, the random intercepts for "participant" are included, as are correlations between each random slope and the random intercept. This probably makes sense unless you have theoretical reasons to the contrary. See this useful summary if you want to avoid the correlation between random intercepts and slopes.
Second, in both cases you don't include a random slope for the interaction term! If the interaction effect is actually what you are interested in, you should at least try to fit a model with random slopes for it, so as to avoid potential bias in the fixed interaction effect. Here, again, you can choose to allow or avoid correlations between the interaction term's random slopes and the other random slopes:
Without correlation:
scores ~ direction * color + (direction|participant) + (color|participant) + (direction:color|participant)
With correlation:
scores ~ direction * color + (direction * color|participant)
If you have no theoretical basis to decide between models with or without correlations between the random slopes, I suggest you do both, compare them with anova() and choose the one that fits your data better.
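Concretely, that comparison can be sketched as follows (assuming lme4 is installed; DF is simulated here just to make the snippet self-contained, so expect singular-fit messages on this pure-noise data):

```r
library(lme4)

# Simulated stand-in for DF (noise only, to make this runnable)
set.seed(42)
DF <- expand.grid(
  participant = factor(1:30),
  direction   = factor(c("down", "up")),
  color       = factor(c("red", "blue", "green")),
  rep         = 1:4
)
DF$scores <- rnorm(nrow(DF), mean = 10, sd = 5)

# Random slopes for both predictors, no correlation across predictors
m_uncorr <- lmer(scores ~ direction * color +
                   (direction | participant) + (color | participant),
                 data = DF)

# Random slopes with correlations estimated between them
m_corr <- lmer(scores ~ direction * color +
                 (direction + color | participant),
               data = DF)

# Likelihood-ratio comparison (anova() refits with ML as needed)
anova(m_uncorr, m_corr)
```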
When using the importance() function on R's randomForest you can get a list of the most important predictors.
I was wondering how to tell which predictors are associated with each of the two binary outcomes (i.e. which predictors are associated with disease outcomes and which are associated with disease-free outcomes).
Here is my code to get the list of important predictors:
# Make a data frame with predictor names and their importance
imp_RF_model <- importance(RF_model)
imp_RF_model <- data.frame(predictors = rownames(imp_RF_model), imp_RF_model)
# Order the predictor levels by importance
imp_sort_RF_model <- arrange(imp_RF_model, desc(MeanDecreaseGini))
imp_sort_RF_model$predictors <- factor(imp_sort_RF_model$predictors, levels = imp_sort_RF_model$predictors)
# Select the top 20 predictors
imp_20_RF_model <- imp_sort_RF_model[1:20, ]
For example, if protein A is a strong predictor, I want to know if high levels of protein A are associated with the disease, or if high levels of protein A are associated with disease-free samples. So I want to know if the predictor is inversely associated with the disease or directly associated with the disease.
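Variable importance itself is direction-agnostic, so a separate step is needed. One simple model-free check (a sketch on simulated data; protein_A and the outcome labels are made up) is to compare a top predictor's distribution across the two classes; randomForest's partialPlot() gives a model-based view of the same question:

```r
set.seed(1)
# Simulated stand-ins: protein_A is higher in diseased samples
dat <- data.frame(
  outcome   = factor(rep(c("disease", "disease_free"), each = 50)),
  protein_A = c(rnorm(50, mean = 2), rnorm(50, mean = 0))
)

# Compare the predictor's mean in each outcome class
tapply(dat$protein_A, dat$outcome, mean)

# A formal two-sample comparison of the same contrast
t.test(protein_A ~ outcome, data = dat)
```

If the "disease" group's mean is higher, the predictor is directly associated with the disease; if lower, inversely associated.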
I am doing statistical analysis for a dataset using a GLM in R. The predictor variables are: "Probe" (type of probe used in the experiment; factor with 4 levels), "Extraction" (type of extraction used; factor with 2 levels), "Tank" (the tank the sample was collected from; integers 1 to 9), and "Dilution" (the dilution of each sample: 3.125, 6.25, 12.5, 25, 50, 100). The response is the number of positive responses ("Positive") obtained from a number of repetitions of the experiment ("Rep"). I want to assess the effects of all predictor variables (and their interactions) on the number of positive responses, so I tried to fit a GLM like this:
y <- cbind(mydata$Positive, mydata$Rep - mydata$Positive)
model1 <- glm(y ~ Probe * Extraction * Dilution * Tank, family = quasibinomial, data = mydata)
But I was later advised by my supervisor that the "Tank" predictor should not be treated as an ordinary numeric variable: it takes values 1 to 9, but these are just tank labels, so the difference between 1 and, say, 7 is not meaningful. Treating it as a fixed factor would only produce a large model with bad results. So how do I treat the "Tank" variable as a random factor and include it in the GLM?
Thanks
This is called a "mixed-effects model". Check out the lme4 package.
library(lme4)
# Random intercept for Tank; y is the cbind(successes, failures)
# response from above. Note glmer() does not support quasibinomial,
# so the family here is binomial.
glmer(y ~ Probe + Extraction + Dilution + (1 | Tank),
      family = binomial, data = mydata)
Also, you should probably use + instead of * when combining predictors: * includes all main effects plus every interaction between the factors, which here would lead to a huge, overfitted model. Unless you have a specific reason to believe an interaction exists, in which case you should code that interaction explicitly.