Interaction effect plot with CIs and emmeans contrast - r

I'm having trouble creating an interaction effect plot. There is probably something fairly simple I don't yet know how to do. I'm pretty new to R and ggplot. My reprex is below. Your insight is greatly appreciated!
The data is from UCLA and I'm also adapting their example for my purposes here.
library(here)
library(emmeans)
library(tidyverse)
dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2019/03/exercise.csv")
Convert prog into factor variable
dat$prog <- factor(dat$prog, labels = c("jog","swim","read"))
The model
contcat <- lm(loss ~ hours * prog, data=dat)
summary(contcat)
I create mylist with certain points on hours and the two categories in prog that I want to contrast.
(mylist <- list(hours = seq(0, 4, .5), prog=c("jog","read")))
I then pass contcat to emmeans, requesting predicted values for every combination of hours and prog specified in at=mylist, and store the output in an object called emcontcat.
emcontcat <- emmeans(contcat, ~ hours * prog, at=mylist)
I use emmip to output a set of values using plotit=FALSE.
contcatdat <- emmip(contcat, prog ~ hours, at = mylist, CIs=TRUE, plotit=FALSE)
The output object is fed to ggplot. The interaction effect is plotted along with CI bands.
ggplot(data=contcatdat, aes(x=hours, y=yvar, color=prog)) +
geom_line() +
geom_ribbon(aes(ymax=UCL, ymin=LCL, fill=prog), alpha=0.4)
The plot looks like this:
But overlapping CIs do not always correspond to the portions of the lines where there is no significant difference in predicted values. I want to draw dashed lines for the portions where the difference in predicted values is not significant. The figure below shows the kind of plot I'm trying to create. (The figure is from a paper by Trenton Mize (2019) found here at Fig. 14.)
To get the simple effect (i.e., difference of two predicted values), I pass emcontcat into a function called contrast where we can request "pairwise" differences (or simple effects). P-values are given for jog - read at each level of hours that was specified in mylist.
contrast(emcontcat, "pairwise", by="hours")
The output:
Where I am having trouble is how to incorporate the simple effect (i.e., the parts of hours where jog - read are significantly different or not) into ggplot as hashed or solid portions of the lines like the Mize 2019 figure.
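One way to get a significance flag at each value of hours is sketched below in base R. It runs on simulated data (the data, the seed, and the 0.05 cutoff are assumptions, not the UCLA values). The jog vs. read difference at a given hours is a linear combination of the model coefficients, so its standard error can be taken from vcov(). The unadjusted p-values from contrast(emcontcat, "pairwise", by="hours") should correspond to the same test.

```r
set.seed(1)
n <- 200
## simulated stand-in for the exercise data (hypothetical values)
sim <- data.frame(
  hours = runif(2 * n, 0, 4),
  prog  = factor(rep(c("jog", "read"), each = n))
)
sim$loss <- ifelse(sim$prog == "jog", -7 + 7.5 * sim$hours, 2 - 3 * sim$hours) +
  rnorm(2 * n, sd = 5)

fit <- lm(loss ~ hours * prog, data = sim)
## coefficient order: (Intercept), hours, progread, hours:progread

simple_effect <- function(h) {
  L   <- c(0, 0, -1, -h)                       # jog - read at hours = h
  est <- sum(L * coef(fit))                    # estimated difference
  se  <- sqrt(drop(t(L) %*% vcov(fit) %*% L))  # its standard error
  p   <- 2 * pt(-abs(est / se), df.residual(fit))
  c(hours = h, estimate = est, SE = se, p.value = p)
}

res <- as.data.frame(t(sapply(seq(0, 4, 0.5), simple_effect)))
res$sig <- res$p.value < 0.05  # TRUE = solid line, FALSE = dashed line
res
```

The sig column could then be joined to the plotting data by hours and mapped to a line type.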

We want to know whether the intervals overlap, and if so, we want dashed lines. That is easy with a small helper function, itvl_is_l(). However, on the left-hand side of the plot there is just one point, and drawing a line needs at least two. So we have to interpolate with approx(), which the plot functions also do internally. Since we want to do everything for both progs, we use by().
Preprocessing
## merge interpolations by prog
aux <- by(contcatdat, contcatdat$prog, \(x) {
x <- merge(x, data.frame(hours=with(x, seq.int(min(hours), max(hours),
length.out=1e3))), all=TRUE)
x$prog <- unique(na.omit(x$prog))
u <- c('yvar', 'LCL', 'UCL')
x[u] <- lapply(x[u], \(x) approx(x, xout=seq_along(x))$y)
x
})
## logical helper: TRUE where intervals a and b do NOT overlap
itvl_is_l <- \(a, b) {unname(as.vector(b[, 1] > a[, 2] | a[, 1] > b[, 2]))}
## flag the hours where the two CIs are disjoint
its <- itvl_is_l(aux$jog[c('LCL', 'UCL')], aux$read[c('LCL', 'UCL')])
aux <- lapply(aux, `[<-`, 'its', val=its) ## add as variable
aux <- lapply(aux, \(x) transform(x, itsn=cumsum(c(0, diff(x$its)) != 0) + 1)) ## making a sequence out of it
contcatdat <- do.call(rbind, aux) ## combine back as contcatdat
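The run-numbering step used for itsn can be checked on a toy vector. This is a minimal sketch with made-up values; the rle() variant is an equivalent alternative, not part of the code above.

```r
## toy flag vector: TRUE, TRUE, then three FALSE, then TRUE again
its <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

## a new run starts wherever the flag changes; cumsum turns changes into ids
itsn <- cumsum(c(0, diff(its)) != 0) + 1
itsn
# [1] 1 1 2 2 2 3

## same ids via run-length encoding
itsn_rle <- with(rle(its), rep(seq_along(lengths), lengths))
```

Each run id marks one contiguous stretch that can be drawn with a single line type.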
Plot
clr <- c('#FF0000', '#0000FF', '#0000001A') ## some colors
png('foo.png', 600, 400) ## open .png device
plot(yvar ~ hours, contcatdat, type='n')
grid()
## lines right (itsn > 2)
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn > 2, lwd=2, col=clr[1])
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn > 2, lwd=2, col=clr[2])
## lines middle, dashed
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn == 2, lwd=2, col=clr[1], lty=2)
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn == 2, lwd=2, col=clr[2], lty=2)
## lines left (itsn < 2)
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn < 2, lwd=2, col=clr[1])
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn < 2, lwd=2, col=clr[2])
## CIs
with(subset(contcatdat, prog == 'jog'),
polygon(c(hours, rev(hours)), c(UCL, rev(LCL)), border=NA, col=clr[3]))
with(subset(contcatdat, prog == 'read'),
polygon(c(hours, rev(hours)), c(UCL, rev(LCL)), border=NA, col=clr[3]))
## legend
legend('topleft', legend=unique(contcatdat$prog), title='Group', col=clr[1:2], lty=1, lwd=2)
dev.off() ## close .png device
You could also try plotting the polygons first, opaque and with a border, if that looks better.
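Since the question asked for ggplot, the same idea can be sketched there as well. This is only a sketch on toy data: the straight lines and the fixed band width of 2 are assumptions standing in for the interpolated contcatdat. The key step is grouping by prog and run id so that each stretch gets its own line type.

```r
library(ggplot2)

## toy stand-in for the interpolated data (hypothetical values)
hours <- seq(0, 4, length.out = 401)
toy <- rbind(
  data.frame(prog = "jog",  hours = hours, yvar = -7 + 7.5 * hours),
  data.frame(prog = "read", hours = hours, yvar =  2 - 3.0 * hours)
)
toy$LCL <- toy$yvar - 2
toy$UCL <- toy$yvar + 2

## TRUE where the two bands overlap at a given value of hours
jog  <- toy[toy$prog == "jog", ]
read <- toy[toy$prog == "read", ]
overlap <- !(read$LCL > jog$UCL | jog$LCL > read$UCL)
toy$overlap <- rep(overlap, 2)
toy$run <- rep(cumsum(c(0, diff(overlap)) != 0) + 1, 2)

p <- ggplot(toy, aes(hours, yvar, colour = prog)) +
  geom_ribbon(aes(ymin = LCL, ymax = UCL, fill = prog, group = prog),
              alpha = 0.2, colour = NA) +
  ## one line piece per prog and run, dashed where the bands overlap
  geom_line(aes(linetype = overlap, group = interaction(prog, run))) +
  scale_linetype_manual(values = c(`FALSE` = "solid", `TRUE` = "dashed"),
                        name = "CIs overlap")
print(p)
```

With a fine grid the small gaps between neighbouring line pieces are not visible; on a coarse grid they would be.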
Data:
contcatdat <- structure(list(prog = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("jog",
"read"), class = "factor"), hours = c(0, 0, 0.5, 0.5, 1, 1, 1.5,
1.5, 2, 2, 2.5, 2.5, 3, 3, 3.5, 3.5, 4, 4), yvar = c(-6.78065983345649,
2.21637209230689, -3.05428518360714, 0.738291278604121, 0.672089466242214,
-0.739789535098646, 4.39846411609157, -2.21787034880141, 8.12483876594092,
-3.69595116250418, 11.8512134157903, -5.17403197620695, 15.5775880656396,
-6.65211278990971, 19.303962715489, -8.13019360361248, 23.0303373653383,
-9.60827441731525), SE = c(1.64384530410457, 1.48612021916972,
1.25520349531108, 1.14711211184156, 0.87926401607137, 0.820840725755632,
0.543079708493216, 0.531312719216624, 0.375535476484592, 0.376041650300328,
0.558013604603198, 0.501120592808483, 0.89777081499028, 0.781944232621328,
1.27470257475094, 1.1056003463909, 1.66373129934114, 1.44356083265185
), df = c(894, 894, 894, 894, 894, 894, 894, 894, 894, 894, 894,
894, 894, 894, 894, 894, 894, 894), LCL = c(-10.0069052579393,
-0.700318757711651, -5.51777400669205, -1.51305511813823, -1.05357261502514,
-2.35078883599747, 3.33260443922245, -3.26063588462286, 7.38780492844162,
-4.43397842739773, 10.7560441598055, -6.15754180868669, 13.815604150934,
-8.18677301395645, 16.8022045883112, -10.3000681349591, 19.7650632676689,
-12.4414373187615), UCL = c(-3.55441440897366, 5.13306294232543,
-0.590796360522233, 2.98963767534648, 2.39775154750957, 0.871209765800175,
5.46432379296068, -1.17510481297997, 8.86187260344022, -2.95792389761063,
12.946382671775, -4.19052214372721, 17.3395719803452, -5.11745256586298,
21.8057208426668, -5.96031907226584, 26.2956114630078, -6.77511151586902
), tvar = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("jog", "read"), class = "factor"),
xvar = c(0, 0, 0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2, 2.5, 2.5,
3, 3, 3.5, 3.5, 4, 4)), estName = "yvar", clNames = c("lower.CL",
"upper.CL"), pri.vars = c("prog", "hours"), adjust = "none", side = 0, delta = 0, type = "link", mesg = "Confidence level used: 0.95", row.names = c(NA,
18L), class = c("summary_emm", "data.frame"), labs = list(xlab = "hours",
ylab = "Linear prediction", tlab = "prog"), vars = list(byvars = character(0),
tvars = "prog"))

Related

How to plot many probability density functions (pdfs) without sharp edges?

I have an issue plotting continuous distributions without sharp edges in ggplot2. I need to show two of them on one plot. It does not have to be ggplot2 to achieve this result.
I am aware that the number of data points directly influences the smoothness of the line, but that does not seem to be the cause here. Below you can see some sample data (from dput):
sample.data<-list(beta.data = structure(list(cluster = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), beta.density = c(0, 3.42273368363934e-43, 8.42987148403021e-29,
2.04764468657484e-20, 1.69485562831516e-14, 6.07999638837842e-10, 2.88180370232676e-06, 0.00314064636750876, 0.954118897015866, 0, 0, 3.80101893822358e-36, 6.43342582657081e-22, 6.82956252277493e-14, 1.75291058931833e-08, 0.000131874335695378, 0.0750918340641428, 3.72532418759802, 5.05242078519544, 0), pr = c(0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1, 0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1)), row.names = c(NA, -20L), class = "data.frame"), beta.params = structure(list(cluster = 1:2, a = c(49, 50), b = c(2, 10), ni.beta = c(0.961,0.833), sd.beta = c(0.00072, 0.00228)), row.names = c(NA,-2L), class = "data.frame"))
Before I was using geom_col, but it discretizes values. I went with geom_area:
ggplot(sample.data$beta.data, aes(x = pr, y = beta.density)) +
geom_area(stat = "function",
fun = dbeta,
args = list(shape1 = sample.data$beta.params[1,"a"], shape2 = sample.data$beta.params[1,"b"]),
fill = "#F1C40F",
colour = "black",
alpha = 0.7) +
geom_area(stat = "function",
fun=dbeta,
args = list(shape1 = sample.data$beta.params[2,"a"], shape2 = sample.data$beta.params[2,"b"]),
fill = "#3498DB",
colour = "black",
alpha = 0.7)
I have shown the data with 10 points, but 1000 points look almost the same. That is not the case in base graphics, where even 100 points look fine:
p = seq(0,1, length=100)
plot(p, dbeta(p, 50, 10), ylab="Density", type ="l", col=4, lwd = 2)
Here I am attaching code to simulate the data. Oh, and these troublesome beta parameters were a = 49 and b = 2.
len <- 100
p <- seq(0,1, length.out = len)
df <- data.frame(rbind(cbind("cl" = rep(1, times = length(p)), "beta" = dbeta(p, 50, 10),"p"= p),
cbind("cl" = rep(2, times = length(p)), "beta" = dbeta(p, 40, 2),"p"= p)))
Do you have any ideas?
EDIT: "pdfs" here stands for probability density functions. That is why I have not used "pdf" as a tag. My apologies for the confusion!
Anyway, when I tried to print the graphic to a PDF file, the result was poor as well (sharp edges). But in the end, that shouldn't matter. I want to see smooth lines whatever I do (within reason).
EDIT 2: A smooth curve must be achievable, because this works:
library(mosaic)
theme_set(theme_bw())
xpbeta(c(0.7, 0.90), shape1 = 49, shape2 = 2)
It produces a nice, smooth beta distribution with parameters (49, 2). But then again, I need to show two distributions in one chart.
I have found the answer. It still needs some editing (like transparency/alpha which I couldn't figure out), but in general, this is what I meant. Code:
library(mosaic)
plotDist('beta', params=list(49,2), kind='density', type = "h", col = "#3498DB", xlim = c(0,1))
plotDist('beta', params=list(50, 10), kind='density', type = "h", col = "#F1C40F", add = TRUE)
plotDist('beta', params=list(49,2), kind='density', add = TRUE, col = "black")
plotDist('beta', params=list(50, 10), kind='density', add = TRUE, col = "black")
Result:
We can add as many distributions as we want using the "add" parameter.
The parameter type = "h" is used to draw a filled distribution. Without it, only the line is visible. In my answer, I draw the two lines and the two filled distributions. I would be really happy if someone could show a better answer, though.
EDIT:
I think I found my perfect answer!
Here is the code:
library(ggformula)
theme_set(theme_bw())
gf_dist("beta", shape1 = 49, shape2 = 2, geom = "area", alpha = 0.5, fill = "#F1C40F") %>%
gf_dist("beta", shape1 = 49, shape2 = 2) %>%
gf_dist("beta", shape1 = 50, shape2 = 10, geom = "area", alpha = 0.5, fill = "#3498DB") %>%
gf_dist("beta", shape1 = 50, shape2 = 10)
It is much faster than the previous code, the alpha parameter works as expected, and it is relatively easy to combine many plots! Because of the transparency, you can nicely see the overlap of both distributions.
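For a route without extra packages, here is a base-graphics sketch that evaluates the densities on a fine grid and fills them with semi-transparent polygons. The grid size of 2001 points is an assumption; anything fine enough to resolve the narrow peak of Beta(49, 2) should do. In ggplot2, stat_function evaluates the function at n points (101 by default), so a larger n may be the setting to try in the original geom_area call.

```r
x  <- seq(0, 1, length.out = 2001)  # fine grid (assumed size)
d1 <- dbeta(x, 49, 2)
d2 <- dbeta(x, 50, 10)

## empty canvas, then one filled, semi-transparent polygon per density
plot(NA, xlim = c(0, 1), ylim = c(0, max(d1, d2)), xlab = "p", ylab = "Density")
polygon(c(x, rev(x)), c(d1, rep(0, length(x))),
        col = adjustcolor("#F1C40F", alpha.f = 0.5))
polygon(c(x, rev(x)), c(d2, rep(0, length(x))),
        col = adjustcolor("#3498DB", alpha.f = 0.5))

## how much of the Beta(49, 2) peak a coarse grid misses
mode_x      <- (49 - 1) / (49 + 2 - 2)  # mode of Beta(a, b) for a, b > 1
peak_true   <- dbeta(mode_x, 49, 2)
peak_coarse <- max(dbeta(seq(0, 1, length.out = 11), 49, 2))
peak_fine   <- max(d1)
```

Comparing peak_coarse with peak_true shows why a handful of points gives a clipped, angular curve for this distribution.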

Group results under data names with sapply function and plot in R

Let's say I have a list that contains three separate data frames. Here is my data:
data<-structure(list(x = structure(list(value = c(2L, 4L, 5L, 6L, 9L,
4L, 3L, 2L, 10L, 6L)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L)), y = structure(list(value = c(2, 2.1, 4, 3, 0, 1.2, 4.2,
3, 4, 9)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L)), z = structure(list(value = c(1, 2, 7, 6, 0.3, 5.4, 4,
3, 6, 7)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L))), .Names = c("x", "y", "z"))
And here is my sample function;
sam<-function(x) {
ex<-c(3,2,4,5,2)
z<-data.frame(x)
y<-as.matrix(sapply(z, as.numeric))
h<-lapply(c(2,5,10), function(xx) tapply(y, as.integer(gl(nrow(x), xx, nrow(x)) ), FUN = sum))
names(h)<-c("min2", "min5", "min10")
min2<-h[[1]]
pdf("plots.pdf")
plot(min2, ex, main="min. compare",
xlab="Historical Values ", ylab="Disaggregated Values", pch=19, col = "blue")
dev.off()
return(h)
}
In the function, I aggregate the values as shown and then plot min2 against ex.
With the code below, I tried to apply the function to all three data frames:
v1<-sapply(data, sam)
But I cannot see the calculation names (min2, min5, min10) in the resulting list, and the results come back tangled instead of grouped under x, y and z.
I would like two things:
1) Results grouped under each data name, like this:
[x] [y] [z]
min2 min2 min2
min5 min5 min5
min10 min10 min10
2) Plots of the desired values for each of x, y and z, as described above, exported either to one PDF or to separate files.
To get the output, as @JonnyPhelps suggested, use lapply instead of sapply. To make the plots and get the correlation, you need to alter the function:
sam<-function(x) {
ex<-c(3,2,4,5,2)
z<-data.frame(x)
y<-as.matrix(sapply(z, as.numeric))
h<-lapply(c(2,5,10), function(xx) tapply(y, as.integer(gl(nrow(x), xx, nrow(x)) ), FUN = sum))
names(h)<-c("min2", "min5", "min10")
min2<-h[[1]]
plot(min2, ex, main="min. compare",
xlab="Historical Values ", ylab="Disaggregated Values",
pch=19, col = "blue")
COR = cor.test(min2,ex)
LABEL = paste("cor=",signif(COR$estimate,3),"\np=",signif(COR$p.value,3))
mtext(LABEL,side=3,padj=2)
return(h)
}
The correlation is calculated, and mtext places it at the top of the plot. You can play around with padj and adj to get the text where you need it.
In your previous function, you called pdf() inside the function, so the file is overwritten on every iteration. To put all plots in one PDF, open the device once, outside the function:
pdf("plots.pdf")
v1<-lapply(data, sam)
dev.off()
Or if you want them on the same page:
pdf("plots.pdf",width=8,height=4)
par(mfrow=c(1,3))
v1<-lapply(data, sam)
dev.off()
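To see the grouping that lapply gives, here is a stripped-down sketch without the plotting. The helper name agg is hypothetical, and the data are plain vectors holding the same values as x, y and z above.

```r
dat <- list(
  x = c(2, 4, 5, 6, 9, 4, 3, 2, 10, 6),
  y = c(2, 2.1, 4, 3, 0, 1.2, 4.2, 3, 4, 9),
  z = c(1, 2, 7, 6, 0.3, 5.4, 4, 3, 6, 7)
)

## sum consecutive blocks of 2, 5 and 10 values (hypothetical helper)
agg <- function(v) {
  h <- lapply(c(2, 5, 10), function(k) {
    tapply(v, as.integer(gl(length(v), k, length(v))), FUN = sum)
  })
  names(h) <- c("min2", "min5", "min10")
  h
}

res <- lapply(dat, agg)  # one named sub-list per data set
names(res)               # "x" "y" "z"
res$x$min2               # block sums for x
```

Individual results are then addressed as res$y$min5 and so on.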

The leaf's labels aren't showing completely when I use as.dendrogram

When I plot the dendrogram using "as.dendrogram", all the leaf labels are cut off.
See the linked images below as examples:
When I plot using "hclust" (all labels show correctly)
Dendrogram 1
When I plot using "as.dendrogram"
Dendrogram 2
Code:
The data I created just to show the problem
test <- matrix(c(34,34,32,27,12,1,2,1,1,1), ncol=2)
colnames(test) <- c('Variable1', 'Variable2')
rownames(test) <- c('African Forest Elephant', 'Asian Domestic Elephant','African White Rhino','West African Giraffe','African Mountain Gorilla')
test_table <- as.table(test)
To plot the first linked image
hctest = hclust(dist(test_table))
plot(hctest, axes = FALSE)
The second one
hctest = as.dendrogram(hctest)
plot(hctest, edge.root = TRUE, horiz = TRUE)
PS: If I use "horiz = FALSE" the same problem persists. I tried exporting the image and also tried using R markdown, but nothing changes.
You can fix this with par by adjusting the plot margins mar=c(bottom, left, top, right). (Defaults are c(5, 4, 4, 2) + 0.1).
par(mar=c(5, 4, 4, 10) + 0.1)
plot(hctest, edge.root=TRUE, horiz=TRUE)
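Instead of hard-coding the right margin, it could be derived from the longest label. This is a rough sketch: the factor of 0.5 margin lines per character is a guess that depends on font and device, not a documented rule.

```r
test <- matrix(c(34, 34, 32, 27, 12, 1, 2, 1, 1, 1), ncol = 2,
               dimnames = list(c("African Forest Elephant", "Asian Domestic Elephant",
                                 "African White Rhino", "West African Giraffe",
                                 "African Mountain Gorilla"),
                               c("Variable1", "Variable2")))
dend <- as.dendrogram(hclust(dist(test)))

lab <- labels(dend)                     # leaf labels in plotting order
right_mar <- max(nchar(lab)) * 0.5 + 1  # heuristic: about 0.5 lines per character

old <- par(mar = c(5, 4, 4, right_mar) + 0.1)  # widen the right margin only
plot(dend, edge.root = TRUE, horiz = TRUE)
par(old)                                       # restore the previous margins
```

For horiz = FALSE the same idea would apply to the bottom margin, the first element of mar.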
Data
hctest <- structure(list(structure(5L, members = 1L, height = 0, label = "African Mountain Gorilla", leaf = TRUE),
structure(list(structure(4L, members = 1L, height = 0, label = "West African Giraffe", leaf = TRUE),
structure(list(structure(3L, members = 1L, height = 0, label = "African White Rhino", leaf = TRUE),
structure(list(structure(1L, label = "African Forest Elephant", members = 1L, height = 0, leaf = TRUE),
structure(2L, label = "Asian Domestic Elephant", members = 1L, height = 0, leaf = TRUE)), members = 2L, midpoint = 0.5, height = 1)), members = 3L, midpoint = 0.75, height = 2.23606797749979)), members = 4L, midpoint = 0.875, height = 7.07106781186548)), members = 5L, midpoint = 0.9375, height = 22.0227155455452, class = "dendrogram")

How to plot the decision boundary for a Gaussian Naive Bayes classifier?

I use the toy dataset (class membership variable & 2 features) below to apply a Gaussian Naive Bayes model and plot the contours of the class-specific bivariate normal distributions.
How to add a line for the decision boundary to the plot below?
Like here:
(Image source: https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/uploads/Lectures/2class_gauss_NB.jpg)
# Packages
library(klaR)
library(MASS)
# Data
d <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"), x1 = c(2, 2.8, 1.5, 2.1, 5.5, 8, 6.9, 8.5, 2.5, 7.7), x2 = c(1.5, 1.2, 1, 1, 4, 4.8, 4.5, 5.5, 2, 3.5)), .Names = c("y", "x1", "x2"), row.names = c(NA, -10L), class = "data.frame")
# Naive Bayes Model
mN <- NaiveBayes(y ~ x1+x2, data = d)
# Data
# Class 1
m1 <- mean(d[which(d$y==1),]$x1)
m2 <- mean(d[which(d$y==1),]$x2)
mu1_2 <- c(m1,m2) # Mean
sd1 <- sd(d[which(d$y==1),]$x1)
sd2 <- sd(d[which(d$y==1),]$x2)
Sigma1_2 <- matrix(c(sd1^2, 0, 0, sd2^2), 2) # Covariance matrix (variances on the diagonal)
bivn1_2 <- mvrnorm(5000, mu = mu1_2, Sigma = Sigma1_2 ) # from MASS package: Simulate bivariate normal PDF
bivn1_2.kde <- kde2d(bivn1_2[,1], bivn1_2[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Class 0
m3 <- mean(d[which(d$y==0),]$x1)
m4 <- mean(d[which(d$y==0),]$x2)
mu3_4 <- c(m3,m4) # Mean
sd3 <- sd(d[which(d$y==0),]$x1)
sd4 <- sd(d[which(d$y==0),]$x2)
Sigma3_4 <- matrix(c(sd3^2, 0, 0, sd4^2), 2) # Covariance matrix (variances on the diagonal)
bivn3_4 <- mvrnorm(5000, mu = mu3_4, Sigma = Sigma3_4 ) # from MASS package: Simulate bivariate normal PDF
bivn3_4.kde <- kde2d(bivn3_4[,1], bivn3_4[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Plot
plot(x= d$x1, y=d$x2, xlim=c(-1,10), ylim=c(-1,10), col=d$y, pch=19, cex=2, ylab="x2", xlab="x1")
contour(bivn1_2.kde, add = TRUE, col="darkgrey") # from base graphics package
contour(bivn3_4.kde, add = TRUE, col="darkgrey") # from base graphics package
text(labels = "Class 1",x = 8, y=7, col="grey")
text(labels = "Class 0",x = 0, y=4, col="grey")
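One possible way to draw the boundary is to evaluate the posterior probability of class 1 on a grid and draw its 0.5 contour. The sketch below computes the Gaussian Naive Bayes posterior by hand in base R (class priors from the class frequencies, independent normals per feature) rather than through the klaR model object. The grid range and resolution are assumptions.

```r
d <- data.frame(
  y  = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)),
  x1 = c(2, 2.8, 1.5, 2.1, 5.5, 8, 6.9, 8.5, 2.5, 7.7),
  x2 = c(1.5, 1.2, 1, 1, 4, 4.8, 4.5, 5.5, 2, 3.5)
)

prior <- prop.table(table(d$y))               # class priors
mu <- aggregate(cbind(x1, x2) ~ y, d, mean)   # row 1 = class 0, row 2 = class 1
s  <- aggregate(cbind(x1, x2) ~ y, d, sd)

## posterior probability of class 1 under the naive (independence) assumption
post1 <- function(x1, x2) {
  lik <- function(k) {
    prior[[k]] * dnorm(x1, mu$x1[k], s$x1[k]) * dnorm(x2, mu$x2[k], s$x2[k])
  }
  lik(2) / (lik(1) + lik(2))
}

gx <- seq(-1, 10, length.out = 200)
gy <- seq(-1, 10, length.out = 200)
z  <- outer(gx, gy, post1)

plot(d$x1, d$x2, col = d$y, pch = 19, cex = 2,
     xlim = c(-1, 10), ylim = c(-1, 10), xlab = "x1", ylab = "x2")
## decision boundary: points where both classes are equally likely
contour(gx, gy, z, levels = 0.5, add = TRUE, drawlabels = FALSE, lwd = 2)
```

Because the two classes have different variances, the boundary is a curve rather than a straight line.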

How to change the color of dendrogram for each group in a cluster

Here is my data
df<- structure(list(name = structure(c(2L, 12L, 1L, 16L, 14L, 10L,
9L, 5L, 15L, 4L, 8L, 13L, 7L, 6L, 3L, 11L), .Label = c("All",
"Bab", "boro", "bra", "charli", "delta", "few", "hora", "Howe",
"ist", "kind", "Kiss", "myr", "No", "TT", "where"), class = "factor"),
value = c(1.251, -1.018, -1.074, -1.137, 1.018, 1.293, 1.022,
-1.008, 1.022, 1.252, -1.005, 1.694, -1.068, 1.396, 1.646,
1.016)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-16L))
Here is what I do:
d <- dist(as.matrix(df$value),method = "euclidean")
#compute cluster membership
hcn <- hclust(d,method = "ward.D2")
plot(hcn)
and it gives me the following:
Here all groups are drawn in black and the dendrogram is not very clear. I want to give each group its own color, show the names (vertically) instead of the row numbers, remove the hclust(*, "ward.D2") subtitle, and set the x and y labels myself.
You could use the dendextend package, which is designed for tasks such as this:
# install the package:
if (!require('dendextend')) install.packages('dendextend'); library('dendextend')
## Example:
dend <- as.dendrogram(hclust(dist(USArrests), "ave"))
d1=color_branches(dend,k=5, col = c(3,1,1,4,1))
plot(d1) # selective coloring of branches :)
d2=color_branches(d1,k=5) # auto-coloring 5 clusters of branches.
plot(d2)
# More examples are in ?color_branches
You can see many examples in the presentations and vignettes of the package, in the "usage" section in the following URL: https://github.com/talgalili/dendextend
Alternatively, you can use dendrapply.
For instance:
# Generate data
set.seed(12345)
desc.1 <- c(rnorm(10, 0, 1), rnorm(20, 10, 4))
desc.2 <- c(rnorm(5, 20, .5), rnorm(5, 5, 1.5), rnorm(20, 10, 2))
desc.3 <- c(rnorm(10, 3, .1), rnorm(15, 6, .2), rnorm(5, 5, .3))
data <- cbind(desc.1, desc.2, desc.3)
# Create dendrogram
d <- dist(data)
hc <- as.dendrogram(hclust(d))
# Function to color branches
colbranches <- function(n, col)
{
a <- attributes(n) # Find the attributes of current node
# Color edges with requested color
attr(n, "edgePar") <- c(a$edgePar, list(col=col, lwd=2))
n # Don't forget to return the node!
}
# Color the first sub-branch of the first branch in red,
# the second sub-branch in orange and the second branch in blue
hc[[1]][[1]] = dendrapply(hc[[1]][[1]], colbranches, "red")
hc[[1]][[2]] = dendrapply(hc[[1]][[2]], colbranches, "orange")
hc[[2]] = dendrapply(hc[[2]], colbranches, "blue")
# Plot
plot(hc)
I got this information from: How to create a dendrogram with colored branches?
We could instead draw rectangles around the groups. Let's say there are 5 groups (k = 5):
# plot dendrogram
plot(hcn)
# then draw red borders around the 5 clusters
rect.hclust(hcn, k = 5, border = "red")
EDIT:
Remove x axis label, and add names instead of numbers:
plot(hcn, xlab = NA, sub = NA, labels = df$name)
rect.hclust(hcn, k = 5, border = "red")
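To see which names end up in which rectangle, cutree() returns the group membership for the same k. The sketch below rebuilds the question's data as a plain data frame and also sets the axis labels and drops the subtitle; the label texts are placeholders.

```r
df <- data.frame(
  name  = c("Bab", "Kiss", "All", "where", "No", "ist", "Howe", "charli",
            "TT", "bra", "hora", "myr", "few", "delta", "boro", "kind"),
  value = c(1.251, -1.018, -1.074, -1.137, 1.018, 1.293, 1.022, -1.008,
            1.022, 1.252, -1.005, 1.694, -1.068, 1.396, 1.646, 1.016)
)

hcn <- hclust(dist(df$value), method = "ward.D2")
grp <- cutree(hcn, k = 5)  # cluster id (1 to 5) for every row
split(df$name, grp)        # names listed per cluster

## names as labels, custom axis labels, no hclust(*, "ward.D2") subtitle
plot(hcn, labels = df$name, main = "Clusters",
     xlab = "Name", ylab = "Distance", sub = "")
rect.hclust(hcn, k = 5, border = "red")
```

The grp vector could also be used to pick one color per group when coloring labels or branches.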
