Scaling plots in the terminal nodes of ctree graph - r

I am trying to scale the plots that appear in the terminal nodes of a ctree. I have tried using the yscale parameter, but this just results in plots that extend beyond the plotting window.
For example: Here is a ctree for two exponential distributions
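library("partykit")
set.seed(1)
classA <- data.frame(class = "a", val = round(rexp(500, rate = 0.2), 0))
classB <- data.frame(class = "b", val = round(rexp(500, rate = 0.05), 0))
df <- rbind(classA, classB)
# data.frame() no longer converts strings to factors (R >= 4.0), so do it explicitly
df$class <- factor(df$class)
ct <- ctree(val ~ ., data = df)
plot(ct)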
Now if I try to scale the y axis of the plots from 0 to 70, to zoom in on the box plots and cut off the outliers, I can use:
plot(ct,terminal_panel = node_boxplot(ct,yscale =c(0,70)))
This works to scale the y axis, but now the plot extends beyond the plotting box.
Sorry I would show images, but don't have enough privileges on stackoverflow yet.
Thanks for any suggestions

First of all: In an example like this it would be better to log-transform the response because then the association tests employed in ctree() will have more power to detect differences for splitting in the tree. Possibly some small continuity correction might help if there are exact zeros.
But, of course, the problem of the proper scaling in the terminal nodes is separate from this. The reason was that the viewports for the terminal nodes were not set to clip = TRUE and hence didn't clip graphical elements outside the viewport region.
I've just fixed this problem in the partykit package on R-Forge. A new CRAN release is not scheduled yet but you can either check out the partykit-SVN from R-Forge or just download the current partykit/R/plot.R source code.
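The log-transformation suggested above could look roughly like the following. This is only a sketch: the continuity correction of 0.5 and the column name log_val are assumptions chosen for illustration, not values recommended by partykit.

```r
library("partykit")

# Same kind of data as in the question: two exponential samples, rounded
set.seed(1)
df <- data.frame(
  class = factor(rep(c("a", "b"), each = 500)),
  val   = round(c(rexp(500, rate = 0.2), rexp(500, rate = 0.05)), 0)
)

# Rounding produces exact zeros, so add a small constant before taking logs.
# The value 0.5 is an assumed continuity correction, not a fixed rule.
df$log_val <- log(df$val + 0.5)

# Fit the tree on the transformed response
ct_log <- ctree(log_val ~ class, data = df)
plot(ct_log)
```

On the log scale the box plots in the terminal nodes are far less dominated by outliers, so zooming with yscale may not be needed at all.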

Related

Displaying hierarchical clusters at cluster level (without cases)

I am interested in visualizing the results of a hierarchical cluster analysis. Is it possible to use a dendrogram to display the names or labels of clusters (and subclusters) without displaying the original cases that went into the cluster analysis?
For example, this code applies a hierarchical cluster analysis to the mtcars dataset.
library(factoextra)  # provides get_dist()
data("mtcars")
clust <- hclust(get_dist(mtcars, method = "pearson"), method = "complete")
plot(clust)
Let's say I cut the tree at 4 clusters and rename the clusters "sedan", "truck", "sportscar", and "van" (totally arbitrary labels).
clust1 <- cutree(clust,4)
clust1 <- dplyr::recode(clust1,
'1'='sedan',
'2'='truck',
'3'='sportscar',
'4'='van')
Is it possible to display a dendrogram which shows these four labels as the nodes on the bottom of the tree, suppressing the names of the original car names?
I am also interested in displaying subclusters within clusters in a similar way, but that may be outside the scope of this question. Bonus points if you can also give a suggestion for how to display subclusters within clusters in a dendrogram while suppressing the names of the original cases! :)
Thank you in advance!
Yes, you can do this. I do not understand your get_dist so I will illustrate using the ordinary distance dist.
data("mtcars")
clust <- hclust(dist(mtcars), method = "complete")
To cut off and display just the top of the tree, change it to a dendrogram and use upper. But you need to know what height to cut it at. That is in the structure clust.
tail(clust$height)
[1] 113.3023 134.8119 141.7044 214.9367 261.8499 425.3447
Since you want four branches, you can cut at any height between the third and fourth heights (from the end). I will use 213.
MTC_Dend = as.dendrogram(clust)
TreeTop = cut(MTC_Dend, h = 213)$upper
You can get the basic plot now with plot(TreeTop), but it won't have the labels that you want. To change the labels, use the package dendextend which offers a tool specifically to change the labels.
library("dendextend")
labels(TreeTop) = c('sedan','truck', 'sportscar', 'van')
plot(TreeTop)
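One thing to watch: the leaves of the cut tree are ordered left to right, which need not match the 1 to 4 numbering that cutree() assigns. The sketch below, using base R only, picks the cut height from the merge heights and maps each branch to its cutree() id before labelling. The variable names are illustrative, and the cluster names are the arbitrary ones from the question.

```r
data("mtcars")
clust <- hclust(dist(mtcars), method = "complete")

k <- 4
# Cut halfway between the (k-1)-th and k-th largest merge heights
h_desc <- sort(clust$height, decreasing = TRUE)
cut_height <- mean(h_desc[c(k - 1, k)])
pieces <- cut(as.dendrogram(clust), h = cut_height)

# cutree() id of every car, then the single id carried by each branch
ids <- cutree(clust, k = k)
branch_ids <- sapply(pieces$lower, function(b) unique(ids[labels(b)]))

# Arbitrary names from the question, indexed by cutree() id
cluster_names <- c("sedan", "truck", "sportscar", "van")
new_labels <- cluster_names[branch_ids]

# cut() labels the leaves of the upper tree "Branch 1", "Branch 2", ...
tree_top <- dendrapply(pieces$upper, function(node) {
  if (is.leaf(node)) {
    branch_no <- as.integer(sub("Branch ", "", attr(node, "label")))
    attr(node, "label") <- new_labels[branch_no]
  }
  node
})
plot(tree_top)
```

For subclusters, each element of pieces$lower is itself a dendrogram, so the same cut-and-relabel step can be applied to any one of them at a lower height.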

How would I split a histogram or plot that show the number of main Principal Components?

I have performed PCA using the prcomp function (which is in base R's stats package rather than FactoMineR) on quite a substantial dataset of 3000 x 500.
I have tried plotting the main Principal Components that cover up to 100% of cumulative variance proportion with an fviz_eig plot. However, this is a very large plot due to the large dimensions of the dataset. Is there any way in R to split a plot into multiple plots using a for loop or any other way?
Here is a visual of my plot, which only covers 80% of the variance because it is so large. Could I split this plot into 2 plots?
[Image: Large Dataset Visualisation]
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have done what was suggested by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I now have a graph as follows...
[Image: Graph]
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result,
but I am 100% sure that you need to take more control over
what is being plotted, i.e. do more of it yourself.
So let me show an example of doing that. The frets data
that you used has only 4 dimensions so it is hard to illustrate
what to do with more dimensions, so I will instead use the
nuclear data - also available in the boot package. I am going
to start by reproducing the type of graph that you displayed
and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig
plot that you displayed but has three main differences. First,
it is showing the actual variances - not the percent of variance
explained. Second, it does not contain the line that connects
the tops of the bars. Third, it does not have the text labels
that tell the heights of the boxes.
Percent of Variance Explained. The return from prcomp contains
the raw information. str(N_PCA) shows that it has the standard
deviations, not the variances - and we want the proportion of total
variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add that if you feel you need it,
but I recommend against it. What does that line tell you that you
can't already see from the barplot? If you are concerned about too
much clutter obscuring the information, get rid of the line. But
just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change
scales (as I am about to do), it could be helpful for making comparisons.
OK, now that we have the substance of your original graph, it is easy
to separate it into several parts. For my data, the first two bars are
big so the rest are hard to see. In fact, PCs 5-11 show up as zero.
Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4,
it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you
can find an appropriate way to group your components, you can
zoom in on whichever PCs you want.
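For many components, the grouping can be automated, and the labels can be placed without hard-coding the x positions, because barplot() returns the midpoints of the bars it draws. The following is a sketch using the built-in mtcars data (11 components) and an assumed chunk size of 4; both are stand-ins for your own data and grouping.

```r
# PCA on a built-in dataset; replace with your own prcomp() result
pca  <- prcomp(mtcars, scale. = TRUE)
poev <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of explained variance

# Split the component indices into consecutive groups (chunk size is an assumption)
chunk_size <- 4
chunks <- split(seq_along(poev), ceiling(seq_along(poev) / chunk_size))

for (idx in chunks) {
  # barplot() returns the x coordinates of the bar midpoints
  mids <- barplot(poev[idx],
                  names.arg = idx,
                  ylim = c(0, 1.15 * max(poev[idx])),  # headroom for the labels
                  xlab = "Principal component",
                  ylab = "Proportion of variance",
                  main = sprintf("PCs %d - %d", min(idx), max(idx)))
  # One label per bar, placed just above its top
  text(mids, poev[idx], labels = round(100 * poev[idx], 1), pos = 3)
}
```

Each pass through the loop draws one plot with its own y scale, so the small components stay readable. Passing the same index vector to both barplot() and text() keeps the number of labels equal to the number of bars.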

Is there a way to rescale the axes of a plot produced by plot.clusterlm (R)?

I have run cluster analysis on some time series data using permuco in R. (Permutes the labels of control/treatment conditions and calculates the F statistic as to how likely it is that these time clusters of significant differences occurred by chance.)
So far so good.
I have produced a number of plots using the inbuilt function plot.clusterlm that comes with this package. However, the data come from different groups, and the F values on the y axis get rescaled in each plot, i.e. the values and ticks are reset depending on how strong the effects are.
This is problematic, because the different plots based on different cluster analyses are not visually comparable.
I would like to rescale the y axis, so that all clusters are visualised along the same F values (0-10 for example).
I haven't been able to do that, and I was wondering if there is a way to pass any additional functions into the plot.clusterlm to do this.
This is the usage of the function, but I don't see a way to rescale the y axis. (Although rescaling the x axis is possible by manipulating the nbbaselinepts & nbptsperunit, but that's not what I want...)
plot(x, effect = "all", type = "statistic",
multcomp = "clustermass", alternative = "two.sided",
enhanced_stat = FALSE, nbbaselinepts = 0, nbptsperunit = 1, ...)
If you have any ideas on this, please let me know.
Thank you!
Thanks for using permuco! I opened an issue on GitHub to track a solution for implementing these features. You can expect changes in future releases of permuco.
However, the plot() method shows the F statistic, which is not a good measure of effect size. A better measure of effect size is the partial eta squared, which is implemented in the afex package.
In base R graphics, axis limits and scales are set like this:
x <- 1:10; y <- x * x
# Simple graph
plot(x, y)
# Enlarge the scale
plot(x, y, xlim=c(1,15), ylim=c(1,150))
# Log scale
plot(x, y, log="y")
This is an example from STHDA where you can find many helpful tutorials.
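To make several plots comparable, the same ylim can be computed once and reused for every panel. The sketch below does not use permuco: the F statistics are simulated, and whether plot.clusterlm forwards ylim through its ... argument is not confirmed here. It only illustrates the shared-axis idea with base graphics.

```r
set.seed(42)

# Hypothetical F statistics over 100 time points for two groups (simulated)
f_group1 <- rf(100, df1 = 1, df2 = 30)
f_group2 <- 3 * rf(100, df1 = 1, df2 = 30)

# One y range that covers both series
shared_ylim <- c(0, max(f_group1, f_group2))

op <- par(mfrow = c(1, 2))  # two panels side by side
plot(f_group1, type = "l", ylim = shared_ylim,
     xlab = "Time point", ylab = "F", main = "Group 1")
plot(f_group2, type = "l", ylim = shared_ylim,
     xlab = "Time point", ylab = "F", main = "Group 2")
par(op)                     # restore the previous layout
```

A fixed range such as c(0, 10) works the same way if the values of interest are known in advance.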

Is there an R function for plotting the 3rd dimension of a correspondence analysis using FactoMineR (or any other package)?

I am performing a correspondence analysis on categorical, frequency data pulled from archaeological site reports. I chose CA because, as I understand, it can handle presence/absence, which is often the nature of archaeological data. I used the FactoMineR and factoextra packages to create a nice biplot of the first and second dimensions. However, looking at the eigenvalue percentages, I'd really like to plot the 3rd dimension against the first two to visualize some associations/relationships that appear in the results (archaeologists often struggle with multivariate stats, myself included, and having a visual would help overall). However, I can't find any documentation on how to plot a third dimension, either using FactoMineR or factoextra, or any other package. Has anyone ever done this, or any workaround suggestions?
I've looked through the FactoMineR and factoextra documentation. I've also asked around, and have received suggestions to try ggbiplot and ggfortify; however those only seem to work with PCA, FA, etc. data.
library(FactoMineR)
library(factoextra)
lodgestotal4.ca <- CA(lodgestotal4) #run analysis
fviz_ca_biplot(lodgestotal4.ca, repel = TRUE) #biplot of dim 1 & 2
print(lodgestotal4.ca$eig) #eigenvalues
fviz_pca_ind(rnaseq_X.pca, axes = c(1, 3), #chose dimensions to plot
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Like this?
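The same axes argument is available in the CA plotting functions of factoextra, so the biplot from the question can be redrawn for dimensions 1 and 3. Here is a sketch using the housetasks example data shipped with factoextra as a stand-in for lodgestotal4, which is not available here.

```r
library(FactoMineR)
library(factoextra)

# Example contingency table from factoextra (stand-in for the question's data)
data("housetasks", package = "factoextra")
res_ca <- CA(housetasks, graph = FALSE)

# Biplot of dimension 1 against dimension 3
p13 <- fviz_ca_biplot(res_ca, axes = c(1, 3), repel = TRUE)
print(p13)

# Dimension 2 against dimension 3 works the same way
p23 <- fviz_ca_biplot(res_ca, axes = c(2, 3), repel = TRUE)
print(p23)
```

This gives a set of 2D views rather than a true 3D plot, which is usually easier to read in print.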

Changing labels size while plotting conditional inference trees in R

I need to insert conditional inference trees (plotted with the party library in R) into the text of a PhD thesis, which is why I have to tune all the graphical parameters.
I know that the optimal width is 700 (just because it fits the format of the thesis best). The problem is that in this case one can't see the list of factors which leads to one or both nodes in a lower level of the tree.
I tried to specify the cex parameter while plotting, but it had no effect. I need to reduce the label size in the plot.
I'll appreciate any help.
The code looks as follows:
library("party")
blgrcit <- ctree(Suffix ~ cluster + quality + declination, blgr)
jpeg("bulgarian_tree.jpeg", width = 700)
plot(blgrcit, cex = 0.4)
dev.off()
The graphics in party (and the more recent reimplementation in partykit) are implemented in grid and hence many standard base graphics parameters are not supported.
If you want to change the font size for all elements of a ctree plot, then the easiest thing to do is to use the partykit implementation and set the gp graphical parameters. For example:
library("partykit")
ct <- ctree(Species ~ ., data = iris)
plot(ct)
plot(ct, gp = gpar(fontsize = 8))
Instead (or additionally) you might also consider using a vector PDF graphic instead of a raster JPG graphic for your thesis. In that case I usually recommend making the height/width of the pdf() large enough so that all elements of the plot look "good". The result can then be scaled to the text width when including it in the document, because scaling is not an issue for vector graphics.
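Combining both suggestions might look like the sketch below. The file name, the 12 x 8 inch device size, and the font size are assumptions to be adjusted until the tree looks right.

```r
library("partykit")

ct <- ctree(Species ~ ., data = iris)

# Hypothetical output path; use a file in your thesis directory instead
out_file <- file.path(tempdir(), "iris_tree.pdf")

# A generously sized vector device (inches), so that no labels overlap
pdf(out_file, width = 12, height = 8)
plot(ct, gp = gpar(fontsize = 10))
dev.off()
```

In the thesis the PDF can then be included at the text width, and the fonts shrink along with the rest of the figure.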
