Plotting a mixed effects model with three fixed effects - r

I am trying to find a way to visualize a mixed effects model for a project I am working on, but am unsure how to do this when using multiple fixed and random effects.
The project I am working on is an attempt to estimate the helpfulness of online reviews, based on several different factors. A sample of the data looks like this:
Participant  Product Type  Star Rating  Anonymous  Product  Helpfulness
1            Exp           Extr         Yes        12       8
1            Search        Extr         Yes        6        6
1            Search        Mid          Yes        13       7
...
30           Exp           Mid          No         11       2
30           Exp           Mid          No         14       4
30           Search        Extr         No         9        5
The data is significantly longer than this (30 participants, who each saw roughly two dozen reviews, resulting in approx. 700 entries). Each participant sees a mix of products, product types, and star ratings, but all of the reviews they see will either be anonymous or not anonymous (no mix).
As a result, I tried to fit a maximal mixed model, with the following:
mixed(helpfulness ~ product_type * star_rating * anonymity
+ (product_type * star_rating | participant)
+ (star_rating * anonymity | product))
What I would like to do now is find a way of visually representing the data, likely by color-coding the 8 different "groups" (the unique combinations of the 3 binary independent variables: 2 product types x 2 star-rating types x 2 anonymity conditions) to show how they relate to the helpfulness rating.

Try something like this:
library(ggplot2)

# make notional data
df <- data.frame(participant  = seq(1, 30, 1),
                 product_type = sample(x = c('Exp', 'Search'), size = 30, replace = TRUE),
                 star_rating  = sample(x = c('Extr', 'Mid'), size = 30, replace = TRUE),
                 anonymous    = sample(x = c('Yes', 'No'), size = 30, replace = TRUE),
                 product      = rnorm(n = 30, mean = 10, sd = 5),
                 helpfulness  = rnorm(n = 30, mean = 5, sd = 3))

ggplot(df) +
  geom_col(aes(x = participant, y = helpfulness, fill = product, color = anonymous)) +
  facet_grid(product_type ~ star_rating)
This captures all six variables in your data. You can also map additional aesthetics such as alpha to bring in more variables. Using alpha might be better if you do not want to use a facet_grid; if you do include alpha as an aesthetic, I would recommend facet_wrap instead of facet_grid.
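If you specifically want the 8 unique condition combinations as a single colour-coded grouping, one option is to collapse the three binary factors into one 8-level factor with interaction(). A minimal sketch, reusing the notional df from above (the boxplot is just one way to show the distribution per group):

# Collapse the three binary factors into a single 8-level factor
df$group <- interaction(df$product_type, df$star_rating, df$anonymous, sep = ":")

ggplot(df, aes(x = group, y = helpfulness, fill = group)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))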


Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
library(dplyr)  # needed for %>% and arrange()

data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (the id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
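A quick way to check whether the grouping factors actually cross is to cross-tabulate them. A hedged sketch, assuming your full data frame detect has bird and obsname columns:

# Cross-tabulate birds against observers: if many birds are seen by
# several observers (and vice versa), the random effects are crossed
crossing_tab <- table(detect$bird, detect$obsname)
summary(rowSums(crossing_tab > 0))  # number of observers per bird
summary(colSums(crossing_tab > 0))  # number of birds per observer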
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraints that fold() does with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context: e.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
          4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
             0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)

# Combine 'bird' groups into folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in each of the others (the count is normally only unique within a group, but here we know each bird is in exactly one group, since that is how collapse_groups() works with its group_cols argument).
The outcome counts (here # outc_0 and # outc_1) are reasonably well balanced across the folds.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
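A minimal sketch of that tuning workflow, assuming the generated columns are suffixed .folds_1 through .folds_10 (the naming is an assumption about how collapse_groups() extends col_name when num_new_group_cols > 1):

# Create 10 candidate fold columns with auto-tuned balancing
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,
  auto_tune = TRUE
)

# Compare the balance of all candidate columns and pick the best one
# (the .folds_1 ... .folds_10 naming is assumed)
groupdata2::summarize_balances(
  data = data,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)$Groups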
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups() hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

How to conduct LSD test with interactions in R?

I have field data:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide   <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen    <- rep(rep(c("No", "Yes"), each = 3), 4)
Block       <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield       <- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
                 41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)
and I conducted a 3-way ANOVA:
anova3way <- aov(Yield ~ sowing_date + herbicide + nitrogen +
                   sowing_date:herbicide + sowing_date:nitrogen +
                   herbicide:nitrogen + sowing_date:herbicide:nitrogen +
                   factor(Block), data = DataA)
summary(anova3way)
There is a 3-way interaction among 3 factors. So, I'd like to see which combination shows the greatest yield.
I know how to compare mean difference with single factor like below, but in case of interactions, how can I do that?
library(agricolae)
LSD_Test <- LSD.test(anova3way, "sowing_date")
LSD_Test
I'd like to check the mean differences under the 3-way interaction, and also under each 2-way interaction.
For example, I'd like to get this LSD result in R.
Could you tell me how I can do that?
Many thanks,
One way, which does take some manual work, is to encode the experimental parameters as -1 and 1 in order to properly separate the 2- and 3-parameter interactions.
Once you have the values encoded, you can pull the residual degrees of freedom and the mean square error from the ANOVA model and pass them to the LSD.test function.
See Example below:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide   <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen    <- rep(rep(c("No", "Yes"), each = 3), 4)
Block       <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield       <- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
                 41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)

anova3way <- aov(Yield ~ sowing_date * herbicide * nitrogen +
                   factor(Block), data = DataA)
summary(anova3way)
summary(anova3way)
# Encode the experiment's parameters as -1 and 1
DataA$codeSD   <- ifelse(DataA$sowing_date == "Early", -1, 1)
DataA$codeherb <- ifelse(DataA$herbicide == "No", -1, 1)
DataA$codeN2   <- ifelse(DataA$nitrogen == "No", -1, 1)

library(agricolae)

LSD_Test <- LSD.test(anova3way, c("sowing_date"))
LSD_Test

# Manually defining the treatment and specifying the degrees of freedom
# and mean square error (from the residuals of the ANOVA)
print(LSD.test(y = DataA$Yield, trt = DataA$sowing_date, DFerror = 14, MSerror = 34.3))

# Example for a two-parameter interaction
print(LSD.test(y = DataA$Yield, trt = (DataA$codeSD * DataA$codeherb), DFerror = 14, MSerror = 34.3))
print(LSD.test(y = DataA$Yield, trt = (DataA$codeSD * DataA$codeherb * DataA$codeN2), DFerror = 14, MSerror = 34.3))

# Calculate the means and SDs as a check (requires dplyr)
# DataA %>% group_by(sowing_date) %>% summarize(mean = mean(Yield), sd = sd(Yield))
# DataA %>% group_by(codeSD*codeherb*codeN2) %>% summarize(mean = mean(Yield), sd = sd(Yield))
You will need to manually track which run/condition goes with the -1 and 1 in the final report.
Edit:
My answer above shows the overall effect based on the interactions, for example how the interaction of herbicide and nitrogen affects yield.
Based on your comment, where you want to determine which combination provides the greatest yield, you can use the LSD.test() function again, this time passing a vector of parameter names.
LSD_Test <- LSD.test(anova3way, c("sowing_date", "herbicide", "nitrogen"))
LSD_Test
From the groups part of the output you can see that Normal, Yes and Yes gives the optimal yield. The "groups" column identifies the unique clusters of results; for example, the last 2 rows provide a similar yield.
...
$groups
Yield groups
Normal:Yes:Yes 58.66667 a
Early:Yes:Yes 47.00000 b
Normal:No:Yes 41.33333 c
Early:No:Yes 41.00000 cd
Normal:Yes:No 39.66667 cd
Early:Yes:No 38.33333 d
Early:No:No 27.33333 e
Normal:No:No 26.00000 e
...
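If you want to sanity-check those group means without agricolae, a quick base-R cross-check using the DataA frame from above:

# Mean yield for every sowing_date x herbicide x nitrogen combination;
# these should match the means in the $groups table above
aggregate(Yield ~ sowing_date + herbicide + nitrogen, data = DataA, FUN = mean)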

linear mixed model on longitudinal data: time as continuous vs unordered factor vs ordered factor

Here I made some longitudinal data. Variable result is the biomarker level from patients. Variable visit holds the visit labels. Variable time is days from baseline t1. There are two response statuses, 'yes' and 'no'. What I want to find out is whether there is any difference in biomarker level between responders and non-responders over the time course.
I use a linear mixed model for the analysis.
# generate data
df = data.frame(result = rnorm(200) + 2,
                visit = rep(c('t1', 't2', 't3', 't4', 't5'), 40),
                time = rep(c(0, 8, 14, 30, 60), 40),
                response = rep(c('yes', 'no'), each = 100),
                id = rep(1:40, each = 5))

# run lme models
library(lme4)
library(lmerTest)
lmer(result ~ time * response + (1 | id), data = df)
lmer(result ~ factor(visit) * response + (1 | id), data = df)
lmer(result ~ factor(visit, ordered = TRUE) * response + (1 | id), data = df)
My questions are:
1. In this analysis, should I use time (continuous) or visit (factor)?
2. If I use visit (factor), should it be ordered or unordered?
Is there any guideline on choosing which type of time variable to use (factor vs continuous)?
Thanks a lot.
... should I treat visit time as continuous variable or as factor?
A factor, since I figure you (probably?) want observations at the same time period to have the same random effect term. However, that would just be the same as having a random effect for visit if all visit levels and times fall at the same points. They do not in your example, though I figure there is something wrong with it, since e.g. id = 2 has its first visit at time t = 60:
> df[df$id == 2, ]
result visit time response id
5 1.84451763 t1 60 no 2
6 1.30286252 t2 0 no 2
7 0.40109211 t3 8 no 2
8 -0.01516773 t4 14 no 2
I guess the alternative would be something like (time | id), in which case each individual has a random slope over time, or (time | response), in which case each of the two response categories has a random slope. That might make sense in your application?
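A hedged sketch of the first of those alternatives, reusing the simulated df from the question:

library(lme4)
# Random intercept and random slope over time for each individual
lmer(result ~ time * response + (time | id), data = df)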

IDing levels in RPART output

I have 4 groups: sensitivity, category, extent, duration. Each of these 4 has 4 levels.
Sensitivity - A, B, C, D
Category - 1, 2, 3, 4
Extent - site, local, regional, pan-regional
Duration - short-term, medium-term, long-term, irreversible
I use a combination of these (there are 256 total) to get a severity classification.
There are 7 severity classes: S1, S2, S3, S4, S5, S6, N.
Using RPART, I get a tree that has Duration=cd as the first binary break. Then the next one on the left side of the tree is Sensitivity=ab.
How do I know which duration levels c and d are?
And how do I know which sensitivity levels a and b are?
Thanks!
Does using print() give you what you're after? This prints a list of nodes in the model and then the variables at each.
library(rpart)
rpartmodel <- rpart(Price ~ Country + Mileage + Type, data = car.test.frame)
print(rpartmodel)
Alternatively try plotting your tree using the rpart.plot package?
library(rpart.plot)
prp(rpartmodel, faclen = 9)
I find it allows more customization of the plots, and it labels splits with the names of the categories; faclen above controls the number of characters shown for factor levels.
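As for the letters themselves: in rpart's printed splits, the letters index the factor levels in their stored order ('a' = first level, 'b' = second, and so on). A small sketch under that assumption, with dat standing in for your data frame (the name is hypothetical) and Duration/Sensitivity stored as factors:

# Duration=cd refers to the 3rd and 4th levels of Duration
levels(dat$Duration)[3:4]
# Sensitivity=ab refers to the 1st and 2nd levels of Sensitivity
levels(dat$Sensitivity)[1:2]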

How to organize data for a multivariate probit model?

I've conducted a psychometric test on some subjects, and I'm trying to create a multivariate probit model.
The test was conducted as follows:
Subject 1 was given a certain stimulus under 11 different conditions, 10 times per condition. Answers (correct = 1, incorrect = 0) were recorded.
So for subject 1, I have the following results table:
# Subj 1
     correct
cnt    1  0
  1    0 10
  2    0 10
  3    1  9
  4    5  5
  5    7  3
  6   10  0
  7   10  0
  8   10  0
  9    9  1
  10  10  0
  11  10  0
This means that Subj 1 answered incorrectly 10 times under conditions 1 and 2, and answered correctly 10 times under conditions 10 and 11. For the other conditions, the number of correct responses increases from condition 3 to condition 9.
I hope I was clear.
I usually analyze the data using the following code:
prob.glm <- glm(resp.mat1 ~ cnt, family = binomial(link = "probit"))
Here resp.mat1 is the responses table, while cnt is the condition index (1 to 11). So I'm able to draw the sigmoid curve using the predict() function. The graph for subject 1 is the following.
Now suppose I've conducted the same test on 20 subjects. I have now 20 tables, organized like the first one.
What I want to do is to compare subgroups, for example: male vs. female, young vs. older, and so on. But I want to keep the inter-individual variability, so simply "adding" the 20 tables together would be wrong.
How can I organize the data in order to use the glm() function?
I want to be able to write a command like:
prob.glm <- glm(resp.matTOT ~ cnt + sex, family = binomial(link = "probit"))
And then graphing the curve for sex=M, and sex=F.
I tried using the rbind() function to create a single table, then adding columns for Subj (1 to 20), Sex, and Age. But it seems like a bad solution to me, so any alternative suggestions would be really appreciated.
Looks like you are using the wrong function for the job. Check the first example of glmer in package lme4; it comes quite close to what you want. herd should be replaced by the subject number, but make sure that you do something like
mydata$subject = as.factor(mydata$subject)
when you have numerical subject numbers.
# Stolen from lme4
library(lattice)
library(lme4)

xyplot(incidence/size ~ period | herd, cbpp, type = c('g', 'p', 'l'),
       layout = c(3, 5), index.cond = function(x, y) max(y))

(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
              data = cbpp, family = binomial))
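To translate that example to your setting, a hedged sketch: it assumes you stack the 20 per-subject tables into one data frame (here called alldata, with hypothetical columns correct, incorrect, cnt, sex, and subject):

library(lme4)
# One row per subject x condition, with counts of correct and incorrect
# answers; the per-subject random intercept keeps the inter-individual
# variability (alldata and its column names are assumed, not from the post)
prob.glmm <- glmer(cbind(correct, incorrect) ~ cnt + sex + (1 | subject),
                   data = alldata, family = binomial(link = "probit"))
summary(prob.glmm)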
There's a multivariate probit command in the mlogit library of all things. You can see an example of the data structure required here:
https://stats.stackexchange.com/questions/28776/multinomial-probit-for-varying-choice-set
