Visualizing clusters extracted from Mclust using ggplot2 in R

I am analysing the distribution of my data using mclust (follow-up to Clustering with Mclust results in an empty cluster)
My data is available for download here: https://www.file-upload.net/download-14320392/example.csv.html
First, I evaluate the clusters present in my data:
library(reshape2)
library(mclust)
library(ggplot2)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust E (univariate, equal variance) model with 4 components:

 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
   1    2    3    4
   0 2271  896   91

Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511

Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667

Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Now that the clusters are identified, I would like to overlay the total distribution with the distributions of the individual components. To do this, I tried to extract the assignment of each value to its respective cluster using:
df <- as.data.frame(data)
df$classification <- as.factor(df$value[fit$classification])
ggplot(df, aes(value, fill= classification)) +
geom_density(aes(col=classification, fill = NULL), size = 1)
As a result, I get the following:
It seems to have worked; however, I wonder:
a) where the labels (1602, 1639 and 1823) of the individual classifications come from,
b) how I can scale the individual densities as fractions of the total (for example, the 1823 component contributes only 91 of the 3258 observations; see above), and
c) whether it makes sense to instead use predicted normal distributions based on the obtained means and SDs?
Any help or suggestions are highly appreciated!

I think you could get what you want in the following way:
library(dplyr) # mutate() comes from dplyr, which also provides the %>% pipe
data_melt <- data_melt %>% mutate(class = as.factor(fit$classification))
ggplot(data_melt, aes(x = value, colour = class, fill = class)) +
geom_density(aes(y = after_stat(count)), alpha = .25) # use ..count.. in ggplot2 < 3.3
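Regarding (b) and (c): the fitted parameters are all stored in fit$parameters, so you can overlay the empirical density with the component normal densities, each weighted by its mixing proportion. Here is a sketch, assuming the equal-variance ("E") fit from above:
# component parameters from the mclust fit; with modelNames = "E" the
# shared variance is the scalar fit$parameters$variance$sigmasq
pro   <- fit$parameters$pro                     # mixing proportions
mu    <- fit$parameters$mean                    # component means
sigma <- sqrt(fit$parameters$variance$sigmasq)  # shared SD
xx <- seq(min(data_melt$value), max(data_melt$value), length.out = 500)
comp <- do.call(rbind, lapply(seq_along(mu), function(k) {
  data.frame(value = xx,
             density = pro[k] * dnorm(xx, mu[k], sigma), # scaled by proportion
             component = factor(k))
}))
ggplot(data_melt, aes(value)) +
  geom_density() + # total empirical density
  geom_line(data = comp, aes(y = density, colour = component), size = 1)
Because each curve is multiplied by its mixing proportion, the components sum to the overall mixture density, so a component with only 91 of 3258 observations appears correspondingly small.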

Related

Plotting categorical interaction from GLMM - getting correct mean and confidence intervals

I'm trying to plot a categorical interaction from a GLMM (i.e. mean and confidence intervals) in R.
I tried building it with ggplot and CIs from bootMer. When I compare it to a plot made with plot_model(), however, both the means and the CIs look very different, and I would like to understand why, and foremost how I can use ggplot to generate the most meaningful representation of the model output.
Here is some example data (not my actual, but similar structure) and the code I'm using:
DATA and MODEL:
library(lme4)
library(dplyr)
library(ggplot2)
library(sjPlot) # for plot_model()
set.seed(5)
n <- 24
dat <- data.frame(IndID=c(rep(paste("x",1:n, sep="")), rep(paste("x",1:n, sep=""))),
sex=sample(c("female","male"), 48, replace=T),
treatment= rep( sample(letters[1:3],24, replace=T) ,each=2) ,
cohort=c(sample(c("small","large"),6, replace=T), sample(c("large","medium"),12, replace=T), sample(c("small","large"),6, replace=T),
sample(c("small","medium"),6,replace=T), sample(c("small","large","medium"),12, replace=T), sample(c("medium","large"),6, replace=T) ),
litter= c(rep( sample(2:6,12, replace=T) ,each=2),rep( sample(1:8,12, replace=T) ,each=2)),
morph=rnorm(mean=5,sd=2,n*2),
trial=rep(1:2, each=n))
head(dat)
  IndID    sex treatment cohort litter    morph trial
1    x1   male         b  large      4 2.416531     1
2    x2   male         b  small      4 5.862501     1
3    x3 female         c  large      2 5.403233     1
4    x4 female         c  large      2 4.811382     1
5    x5 female         b  large      5 6.681123     1
6    x6 female         b  small      5 7.421683     1
m <- lmer(litter ~ treatment*sex + morph + as.factor(trial) +
(1|cohort), data=dat)
summary(m)
1. Using ggplot:
## get confidence intervals
boot <- bootMer(m, predict, nsim = 100, re.form = NA)
ci.lwr <- apply(boot$t, 2, function(x) as.numeric(quantile(x, probs = .05, na.rm = TRUE)))
ci.upr <- apply(boot$t, 2, function(x) as.numeric(quantile(x, probs = .95, na.rm = TRUE)))
fit <- predict(m)
df <- cbind(dat, data.frame(fit), ci.lwr, ci.upr)
df %>%
group_by( sex,treatment) %>%
summarise(across(c("fit", "ci.lwr", "ci.upr"), mean)) %>%
ggplot(aes(x = interaction(treatment, sex), y = fit, color = treatment)) +
geom_pointrange(aes(ymin = ci.lwr, ymax = ci.upr)) +
ylim(1.5, 7.5)
2. Using plot_model:
plot_model(m, type = "pred", terms = c("sex","treatment"), ci.lvl=0.95,ylim=c(1.5,7.5))
And here is what it looks like:
(colors of b and c are swapped and the axes are slightly off - sorry)
I know there are different plot types in plot_model, but I'm not sure which to use. I'm also not sure whether my ggplot code is correct and shows what I think it shows (means and 95% CIs from the model predictions).
Ultimately I would like to make a custom plot with ggplot, not using a wrapper, to have more flexibility.
I've read through many similar questions here, but few of them plot CIs from the model for categorical data, and some I could not really understand or replicate.
Any advice would be appreciated - thanks!
(The interaction in this example is not significant, so I might not need to plot this, but it's just an example for practice)
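One thing to check first: quantiles at probs = .05 and .95 span a 90% interval, while plot_model was asked for 95%, so the two plots do not show the same interval width. For a pure-ggplot route to model-based means and CIs for a categorical interaction, the emmeans package is a common choice; a sketch (emmeans is an assumption here, not part of the question):
library(emmeans)
# estimated marginal means for the interaction, averaging over the other
# covariates, with 95% confidence limits
emm <- as.data.frame(emmeans(m, ~ treatment * sex, level = 0.95))
ggplot(emm, aes(x = treatment, y = emmean, colour = treatment)) +
  geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL)) +
  facet_wrap(~ sex)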

How to simulate a strong correlation of data with R

Sometimes I try to simulate data by using the rnorm function, which I have done below:
mom.iq <- rnorm(n=1000,
mean=120,
sd=15)
kid.score <- rnorm(n=1000,
mean=45,
sd=20)
df <- data.frame(mom.iq,
kid.score)
But when I plot something like this, it usually ends up with data that's highly uncorrelated:
library(ggpubr)
ggscatter(df,
x="mom.iq",
y="kid.score")+
geom_smooth(method = "lm")
However, I would like to simulate something with a stronger correlation if possible. Is there an easy way to do this within R? I'm aware that I could just produce my own values manually, but that's not very practical for recreating large samples.
What you are doing is generating two independent variables, so it is expected that they are uncorrelated. What you can do instead is this:
# In order to make the values reproducible
set.seed(12345)
# Generate independent variable
x <- rnorm(n=1000, mean=120, sd=15)
# Generate the dependent variable
y <- 3*x + 6 + rnorm(n=1000, mean = 0, sd = 5)
I used 3 and 6, but you can define them as you want (a and b) in order to get a linear dependence defined as y = a*x + b.
Adding rnorm(n=1000, mean = 0, sd = 5) introduces some variability and avoids a perfect correlation between x and y. If you want more strongly correlated data, reduce the standard deviation (sd); to get a lower correlation, increase it.
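As a quick check, you can compute the sample correlation directly; with sd = 5 noise on a slope-3 signal it will be very close to 1:
# sample correlation between the simulated variables
cor(x, y)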
You can create your second variable by taking the first variable into account, adding some error with rnorm in order to avoid making the relationship completely deterministic:
library(ggplot2)
dat <- data.frame(father_age = rnorm(1000, 35, 5)) |>
dplyr::mutate(child_score = -father_age * 0.5 + rnorm(1000, 0, 4))
dat |>
ggplot(aes(father_age, child_score)) +
geom_point() +
geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
Created on 2022-07-07 by the reprex package (v2.0.1)
It seems to me that you don't just want to simulate arbitrary x and y with a linear relationship (which the other two answers show). You give your variables the meaningful names mom.iq and kid.score, so it appears that you want them to have a certain mean and variance. In this case, you can use MASS::mvrnorm to simulate samples from a multivariate normal distribution, where you can specify the correlation. This allows you to preserve the marginal mean and marginal variance you specified.
## your current specification of marginal mean and marginal standard deviation
mean_mom.iq <- 120
mean_kid.score <- 45
sd_mom.iq <- 15
sd_kid.score <- 20
## introduce a correlation coefficient between the two variables;
## the coefficient must be between -1 and 1
corcoef <- 0.8
## the resulting covariance between the two variables
covariance <- corcoef * sd_mom.iq * sd_kid.score
## the variance-covariance matrix
Sigma <- matrix(c(sd_mom.iq^2, covariance, covariance, sd_kid.score^2), nrow = 2)
#     [,1] [,2]
#[1,]  225  240
#[2,]  240  400
Now you can use MASS::mvrnorm.
xy <- MASS::mvrnorm(n = 500, mu = c(mean_mom.iq, mean_kid.score), Sigma = Sigma)
colnames(xy) <- c("mom.iq", "kid.score")
xydf <- data.frame(xy)
head(xydf)
#     mom.iq kid.score
#1  111.6211  33.26241
#2  114.4765  42.49280
#3  115.8160  47.57242
#4  121.8656  53.16578
#5  152.1459  89.60617
#6  107.4360  39.00345
plot(xydf)
You can verify the marginal means and marginal standard deviations of the simulated samples.
sapply(xydf, mean) ## means; you specified 120 and 45
#   mom.iq kid.score
# 119.9499   44.4193
sapply(xydf, sd) ## standard deviations; you specified 15 and 20
#   mom.iq kid.score
# 15.35214  20.16483
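You can likewise check that the sample correlation is close to the corcoef you specified:
cor(xydf$mom.iq, xydf$kid.score) ## should be near the specified 0.8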

Troubles predicting fixed effects from a hierarchical GAM in mgcv

I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:
gam_G1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(Locality, bs="re"), # "re" is short for "random effect"
data = ss, method = 'REML',
family="quasipoisson"
)
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
type="response",
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that "factor levels 0 not in original fit", but it executes the task without issues:
Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.
Troublesome models
The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="tp", m=1)
# "tp" is short for "thin plate [regression spline]"
+ s(Locality, bs="re"),
family="quasipoisson",
data = ss, method = 'REML'
)
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
type="response",
# exclude="c(Locality)",
# # this should turn random effect off
# # (doesn't work for me)
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
#        wl    Taxon Locality clr fit se
# 1  298.34 SpeciesB        0 red  NA NA
# 2  305.82 SpeciesB        0 red  NA NA
# 3  313.27 SpeciesB        0 red  NA NA
# 4  320.72 SpeciesB        0 red  NA NA
# 5  328.15 SpeciesB        0 red  NA NA
# 6  335.57 SpeciesB        0 red  NA NA
Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="fs", m=1),
# "fs" is short for "factor-smoother [interaction]"
family="quasipoisson",
data = ss, method = 'REML'
)
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.
I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect wouldn't be a random intercept), and setting it to 0 creates a new level (or possibly an NA, as 0 isn't among the original levels).
If what you want is to turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. The reason it's not working is that you are creating a character vector with a single element, "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows it as s(Locality), so you would use exclude = "s(Locality)".
In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want included in the prediction. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC"), or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", TaxonSpeciesC",
"s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
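Putting this together, a sketch of the corrected prediction call (the exact labels are assumptions; use whatever smooths(gam_GI1) actually reports):
library(gratia)
sm <- smooths(gam_GI1)                        # all smooth labels as {mgcv} knows them
loc_sm <- grep("Locality", sm, value = TRUE)  # s(Locality) plus the by-Locality smooths
nn$Locality <- ss$Locality[1]                 # any existing level; its effects are excluded anyway
pred <- predict(gam_GI1, newdata = nn, type = "response",
                se.fit = TRUE, exclude = loc_sm)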
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to a numeric value of 1 when fitting and to 0 when you are predicting. As this is a numeric by variable, the smooth is multiplied by dummy, which is why setting it to 0 excludes the term. The reason your code isn't working is that you really want separate smooths of wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable created just to exclude a term from the model.
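For completeness, the full workflow of the dummy trick looks roughly like this (d and newd are hypothetical data frames, not objects from the question):
d$dummy <- 1 # random effect switched on while fitting
m_dummy <- gam(y ~ x + s(z) + s(id, bs = "re", by = dummy), data = d)
newd$dummy <- 0 # multiplies the smooth by zero at prediction time
predict(m_dummy, newdata = newd)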
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It means thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.
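In code, the distinction is simply (illustrative only, with hypothetical y, x, z and d):
m_tprs <- gam(y ~ s(x, bs = "tp"), data = d) # thin plate regression spline
m_tensor <- gam(y ~ te(x, z), data = d)      # tensor product smooth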

Having issues using the lme4 predict function on my mixed models

I'm having a bit of a struggle trying to use the lme4 predict function on my mixed models. When making predictions, I want to be able to set some of my explanatory variables to a specified level but average across others.
Here’s some made up data that is a simplified, nonsense version of my original dataset:
a <- data.frame(
TLR4=factor(rep(1:3, each=4, times=4)),
repro.state=factor(rep(c("a","j"),each=6,times=8)),
month=factor(rep(1:2,each=8,times=6)),
sex=factor(rep(1:2, each=4, times=12)),
year=factor(rep(1:3, each =32)),
mwalkeri=(sample(0:15, 96, replace=TRUE)),
AvM=(seq(1:96))
)
The AvM number is the water vole identification number. The response variable (mwalkeri) is a count of the number of fleas on each vole. The main explanatory variable I am interested in is TLR4, which is a gene with 3 different genotypes (coded 1, 2 and 3). The other explanatory variables included are reproductive state (adult or juvenile), month (1 or 2), sex (1 or 2) and year (1, 2 or 3). My model looks like this (of course this model is now inappropriate for the made-up data, but that shouldn't matter):
install.packages("lme4")
library(lme4)
mm <- glmer(mwalkeri ~ TLR4 + repro.state + month + sex + year + (1|AvM), data=a,
            family=poisson, control=glmerControl(optimizer="bobyqa"))
summary(mm)
I want to make predictions about parasite burden for each TLR4 genotype while accounting for all the other covariates. To do this, I created a new dataset that specifies the level to set each explanatory variable to, and used the predict function:
b <- data.frame(
TLR4=factor(1:3),
repro.state=factor(c("a","a","a")),
month=factor(rep(1, times=3)),
sex=factor(rep(1, times=3)),
year=factor(rep(1, times=3))
)
predict(mm, newdata=b, re.form=NA, type="response")
This did work, but I would really prefer to average across years instead of setting year to one particular level. However, whenever I attempt to average across year, I get this error message:
Error in model.frame.default(delete.response(Terms), newdata, na.action = na.action, : factor year has new level
Is it possible for me to average across years instead of selecting a specified level? Also, I've not worked out how to get the standard errors associated with these predictions. The only way I've been able to get standard errors for predictions was using the lsmeans() function (from the lsmeans package):
c <- lsmeans(mm, "TLR4", type="response")
summary(c, type="response")
Which automatically generates the standard error. However, this is generated by averaging across all the other explanatory variables. I'm sure it's possible to change that, but I would rather use the predict() function if I can. My goal is to create a graph with TLR4 genotype on the x-axis and predicted parasite burden on the y-axis, to demonstrate the predicted differences in parasite burden for each genotype while all other significant covariates are accounted for.
You might be interested in the merTools package which includes a couple of functions for creating datasets of counterfactuals and then making predictions on that new data to explore the substantive impact of variables on the outcome. A good example of this comes from the README and the package vignette:
Let's take the case where we want to explore the impact of a model with an interaction term between a category and a continuous predictor. First, we fit a model with interactions:
library(lme4) # provides glmer() and the VerbAgg data
library(merTools)
library(ggplot2)
data(VerbAgg)
fmVA <- glmer(r2 ~ (Anger + Gender + btype + situ)^2 +
(1|id) + (1|item), family = binomial,
data = VerbAgg)
Now we prep the data using the draw function in merTools. Here we draw the average observation from the model frame. We then wiggle the data by expanding the dataframe to include the same observation repeated but with different values of the variable specified by the var parameter. Here, we expand the dataset to all values of btype, situ, and Anger.
# Select the average case
newData <- draw(fmVA, type = "average")
newData <- wiggle(newData, var = "btype", values = unique(VerbAgg$btype))
newData <- wiggle(newData, var = "situ", values = unique(VerbAgg$situ))
newData <- wiggle(newData, var = "Anger", values = unique(VerbAgg$Anger))
head(newData, 10)
#>    r2 Anger Gender btype  situ id        item
#> 1   N    20      F curse other  5 S3WantCurse
#> 2   N    20      F scold other  5 S3WantCurse
#> 3   N    20      F shout other  5 S3WantCurse
#> 4   N    20      F curse  self  5 S3WantCurse
#> 5   N    20      F scold  self  5 S3WantCurse
#> 6   N    20      F shout  self  5 S3WantCurse
#> 7   N    11      F curse other  5 S3WantCurse
#> 8   N    11      F scold other  5 S3WantCurse
#> 9   N    11      F shout other  5 S3WantCurse
#> 10  N    11      F curse  self  5 S3WantCurse
Now we simply pass this new dataset to predictInterval in order to generate predictions for these counterfactuals. Then we plot the predicted values against the continuous variable, Anger, and facet and group on the two categorical variables situ and btype respectively.
plotdf <- predictInterval(fmVA, newdata = newData, type = "probability",
stat = "median", n.sims = 1000)
plotdf <- cbind(plotdf, newData)
ggplot(plotdf, aes(y = fit, x = Anger, color = btype, group = btype)) +
geom_point() + geom_smooth(aes(color = btype), method = "lm") +
facet_wrap(~situ) + theme_bw() +
labs(y = "Predicted Probability")

ggplot2: Logistic Regression - plot probabilities and regression line

I have a data.frame containing a continuous predictor and a dichotomous response variable.
> head(df)
  position response
1        0        1
2        3        1
3       -4        0
4       -1        0
5       -2        1
6        0        0
I can easily compute a logistic regression by means of the glm()-function, no problems up to this point.
Next, I want to create a plot with ggplot, that contains both the empiric probabilities for each of the overall 11 predictor values, and the fitted regression line.
I went ahead and computed the probabilities with cast() and saved them in another data.frame
> probs
   position   prob
1        -5 0.0500
2        -4 0.0000
3        -3 0.0000
4        -2 0.2000
5        -1 0.1500
6         0 0.3684
7         1 0.4500
8         2 0.6500
9         3 0.7500
10        4 0.8500
11        5 1.0000
I plotted the probabilities:
p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()
But when I try to add the fitted regression line
p <- p + stat_smooth(method="glm", family="binomial", se=F)
it returns the warning "non-integer #successes in a binomial glm!".
I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?
There are basically three solutions:
Merging the data.frames
The easiest, once you have your data in two separate data.frames, is to merge them by position:
mydf <- merge( mydf, probs, by="position")
Then you can call ggplot on this data.frame without warnings:
ggplot( mydf, aes(x=position, y=prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE)
Avoiding the creation of two data.frames
In future you could directly avoid the creation of two separate data.frames which you have to merge later. Personally, I like to use the plyr package for that:
library(plyr)
mydf <- ddply( mydf, "position", mutate, prob = mean(response) )
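For reference, the same per-position aggregation in dplyr (a sketch; the original answer used plyr):
library(dplyr)
mydf <- mydf %>%
  group_by(position) %>%
  mutate(prob = mean(response)) %>%
  ungroup()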
Edit: Use different data for each layer
I forgot to mention, that you can use for each layer another data.frame which is a strong advantage of ggplot2:
ggplot( probs, aes(x=position, y=prob)) +
geom_point() +
geom_smooth(data = mydf, aes(x = position, y = response),
method = "glm", method.args = list(family = "binomial"),
se = FALSE)
As an additional hint: avoid using the variable name df, since assigning to it masks the built-in function stats::df.
