Changing label size when plotting conditional inference trees in R

I need to insert conditional inference trees (plotted with the party library in R) into the text of my PhD thesis, so I have to tune all the graphical parameters.
I know that the optimal width is 700 (simply because it fits the thesis format best). The problem is that at this width one can't read the list of factors that leads to one or both nodes at a lower level of the tree.
I tried specifying the cex parameter when plotting, but it had no effect. I need to reduce the label size in the plot.
I'd appreciate any help.
The code looks as follows:
blgrcit <- ctree(Suffix ~ cluster + quality + declination, blgr)
jpeg("bulgarian_tree.jpeg", width = 700)
plot(blgrcit, cex = 0.4)
dev.off()

The graphics in party (and the more recent reimplementation in partykit) are implemented in grid and hence many standard base graphics parameters are not supported.
If you want to change the font size for all elements of a ctree plot, then the easiest thing to do is to use the partykit implementation and set the gp graphical parameters. For example:
library("partykit")
ct <- ctree(Species ~ ., data = iris)
plot(ct)
plot(ct, gp = gpar(fontsize = 8))
Instead (or additionally) you might consider using a vector PDF graphic rather than a raster JPG graphic for your thesis. In that case I usually recommend making the height/width of the pdf() device large enough that all elements of the plot look "good". The figure can then be scaled to the text width when it is included in the document, because scaling is not an issue for vector graphics.
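As a minimal sketch of the vector-graphics route, the iris tree from the example can be written to a PDF device that is generous in size and scaled later in the document. The output path and the 12 x 7 inch dimensions are illustrative assumptions, not recommended values.

```r
library("partykit")
library("grid")  # gpar() comes from grid

ct <- ctree(Species ~ ., data = iris)

# Hypothetical output path and device size; enlarge the device until
# every node label is readable, then scale the PDF in the thesis.
out_file <- file.path(tempdir(), "iris_tree.pdf")
pdf(out_file, width = 12, height = 7)  # width and height are in inches
plot(ct, gp = gpar(fontsize = 8))
dev.off()
```

Because the result is a vector file, it can be included at text width without the labels becoming blurry.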

Related

How would I split a histogram or plot that shows the number of main Principal Components?

I have performed PCA using the prcomp function, together with the FactoMineR package, on quite a substantial dataset of 3000 x 500.
I have tried plotting the main Principal Components that cover up to 100% of the cumulative variance proportion with an fviz_eig plot. However, this is a very large plot because of the large dimensions of the dataset. Is there any way in R to split a plot into multiple plots, using a for loop or any other way?
Here is a visual of my plot, which only covers 80% of the variance because the plot is so large. Could I split this plot into 2 plots?
[Image: Large Dataset Visualisation]
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra) # provides fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have done what was suggested by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows...
[Image: Graph]
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result,
but I am 100% sure that you need to take more control over
what is being plotted, i.e. do more of it yourself.
So let me show an example of doing that. The frets data
that you used has only 4 dimensions so it is hard to illustrate
what to do with more dimensions, so I will instead use the
nuclear data - also available in the boot package. I am going
to start by reproducing the type of graph that you displayed
and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig
plot that you displayed but has three main differences. First,
it is showing the actual variances - not the percent of variance
explained. Second, it does not contain the line that connects
the tops of the bars. Third, it does not have the text labels
that tell the heights of the boxes.
Percent of Variance Explained. The return from prcomp contains
the raw information. str(N_PCA) shows that it has the standard
deviations, not the variances - and we want the proportion of total
variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add that if you feel you need it,
but I recommend against it. What does that line tell you that you
can't already see from the barplot? If you are concerned about too
much clutter obscuring the information, get rid of the line. But
just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change
scales (as I am about to do), it could be helpful for making comparisons.
OK, now that we have the substance of your original graph, it is easy
to separate it into several parts. For my data, the first two bars are
big so the rest are hard to see. In fact, PC's 5-11 show up as zero.
Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4,
it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you
can find an appropriate way to group your components, you can
zoom in on whichever PCs you want.
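One way to avoid hand-computing label positions such as 0.7+(0:n)*1.2 is to use the value returned by barplot(), which gives the x coordinates of the bar midpoints. The sketch below combines that with a loop over groups of components. The group size of 4 and the 20% headroom above the tallest bar are arbitrary assumptions for illustration.

```r
library(boot)
data(nuclear)

pca <- prcomp(nuclear)
poev <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of explained variance

# Assumed group size; pick whatever grouping suits your data.
chunk_size <- 4
chunks <- split(seq_along(poev), ceiling(seq_along(poev) / chunk_size))

for (idx in chunks) {
  vals <- poev[idx]
  # barplot() returns the bar midpoints, so the labels always line up
  # with the bars regardless of how many bars are drawn.
  mids <- barplot(vals,
                  names.arg = idx,
                  ylim = c(0, max(vals) * 1.2),  # headroom for the labels
                  main = sprintf("PCs %d - %d", min(idx), max(idx)))
  text(mids, vals, labels = signif(100 * vals, 2), pos = 3, cex = 0.8)
}
```

Each plot gets its own y-axis range here, so heights are only comparable within a plot; the printed percentages are what allow comparison across plots.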

How to set the height of grid rows in line graphs with ggplot2 (R)?

I'm trying to plot a line graph using the ggplot2 library in R. The plot itself looks good, but I need to reduce the vertical spacing between the rows of grid lines, because I get a big separation between the lines.
This is my R script:
library(ggplot2)
library(reshape2)
data <- read.csv('/Users/keepo/Desktop/G.Con/Int18/input-int18.csv')
chart_data <- melt(data, id='NRO')
names(chart_data) <- c('NRO', 'leyenda', 'DTF')
ggplot() +
geom_line(data = chart_data, aes(x = NRO, y = DTF, color = leyenda), size = 1)+
xlab("iteraciones") +
ylab("valores")
and this is my current graph:
The first line is very far from the second. How can I reduce the height?
Regards.
The lines are far apart because the values of the variable plotted on the y-axis are far apart. If you need them closer together, you fundamentally have 3 options:
1) Change the scale (e.g. convert the plot to a log scale), although this can make it harder for people to interpret the numbers. This can also change the behavior of each line, not just change the space between the lines. I'm guessing this isn't what you will want, ultimately.
2) Normalize the data. If the actual value of the variable on the y-axis isn't important, just standardize the data (separately for each value of leyenda).
3) As stated above, you can graph each line separately. The main drawback here is that you need 3 graphs where 1 might do.
Not recommended:
I know that some graphs use a "squiggle" to change scales or skip space. Generally, this is considered poor practice (and I doubt it's an option in ggplot2) because it masks the true separation between the data points. If you really do want a gap, I would look at this post: "axis.break and ggplot2 or gap.plot? plot may be too complexe".
In a nutshell, the answer here depends on what your numbers mean. What is the story you are trying to tell? Is the important feature of your plots the change between them (in which case, normalizing might be your best option), or the actual numbers themselves (in which case, the space is relevant).
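The normalizing option can be sketched as follows. The data frame is synthetic stand-in data (the original CSV is not available), with column names NRO, leyenda and DTF taken from the question; DTF_std is a hypothetical new column holding the standardized values.

```r
library(ggplot2)

# Synthetic stand-in for the melted data: three series on very
# different levels, mimicking lines that sit far apart.
set.seed(42)
chart_data <- data.frame(
  NRO = rep(1:20, times = 3),
  leyenda = rep(c("a", "b", "c"), each = 20),
  DTF = c(rnorm(20, 800, 30), rnorm(20, 2500, 60), rnorm(20, 3100, 40))
)

# Standardize separately within each value of leyenda:
# subtract the group mean and divide by the group standard deviation.
chart_data$DTF_std <- ave(chart_data$DTF, chart_data$leyenda,
                          FUN = function(v) (v - mean(v)) / sd(v))

p <- ggplot(chart_data, aes(x = NRO, y = DTF_std, color = leyenda)) +
  geom_line() +
  xlab("iteraciones") +
  ylab("valores (standardized)")
print(p)
```

After standardizing, all lines share a common scale around zero, so the plot shows how each series changes rather than its absolute level.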
You could use an axis transformation that maps your data to the screen in a non-linear fashion:
fun_trans <- function(x){
d <- data.frame(x=c(800, 2500, 3100), y=c(800,1950, 3100))
model1 <- lm(y~poly(x,2), data=d)
model2 <- lm(x~poly(y,2), data=d)
scales::trans_new("fun",
function(x) as.vector(predict(model1,data.frame(x=x))),
function(x) as.vector(predict(model2,data.frame(y=x))))
}
last_plot() + scale_y_continuous(trans = "fun")

Scaling plots in the terminal nodes of ctree graph

I am trying to scale the plots that appear in the terminal nodes of a ctree. I have tried using the yscale parameter, but this just results in plots that extend beyond the plotting window.
For example: Here is a ctree for two exponential distributions
set.seed(1)
classA <-data.frame(class = "a", val = round(rexp(500, rate = 0.2),0))
classB <-data.frame(class = "b", val = round(rexp(500, rate = 0.05),0))
df <- as.data.frame(rbind(classA,classB))
ct = ctree(val~., data = df)
plot(ct)
Now if I try to scale the y axis of the plots from 0 to 70 to zoom in on the box plots and cut-off the outliers, I can use:
plot(ct,terminal_panel = node_boxplot(ct,yscale =c(0,70)))
This works to scale the y axis, but now the plot extends beyond the plotting box.
Sorry I would show images, but don't have enough privileges on stackoverflow yet.
Thanks for any suggestions
First of all: In an example like this it would be better to log-transform the response because then the association tests employed in ctree() will have more power to detect differences for splitting in the tree. Possibly some small continuity correction might help if there are exact zeros.
But, of course, the problem of the proper scaling in the terminal nodes is separate from this. The reason was that the viewports for the terminal nodes were not set to clip = TRUE and hence didn't clip graphical elements outside the viewport region.
I've just fixed this problem in the partykit package on R-Forge. A new CRAN release is not scheduled yet but you can either check out the partykit-SVN from R-Forge or just download the current partykit/R/plot.R source code.
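The log-transform suggestion can be sketched on the data from the question. The offset of 0.5 is an assumed continuity correction (the answer does not name a value), and class is converted to a factor here so that it can be used as a split variable.

```r
library("partykit")

set.seed(1)
classA <- data.frame(class = "a", val = round(rexp(500, rate = 0.2), 0))
classB <- data.frame(class = "b", val = round(rexp(500, rate = 0.05), 0))
df <- rbind(classA, classB)
df$class <- factor(df$class)

# Rounding produces exact zeros, so add a small offset before taking logs.
# 0.5 is an assumption; other small positive values are possible.
ct_log <- ctree(log(val + 0.5) ~ class, data = df)
plot(ct_log)
```

On the log scale the boxplots in the terminal nodes are far less dominated by the long right tail, which reduces the need to cut off outliers with yscale.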

Align text to a plot with variable size in R

I am very new to using the power of R to create graphical output.
I use the forest()-function in the metafor-package to create Forest plots of my meta-analyses. I generate several plots using a loop and then save them via png().
for (i in seq_len(ncol(df) - 2)) {
dat <- escalc(measure="COR", ri=ri, ni=ni, data=df) # Calculates effect sizes
res_re <- rma.uni(yi, vi, data=dat, method="DL", slab=paste(author)) # Output of meta-analysis
png(filename=path, width=8.27, height=11.69, units ="in", res = 210)
forest(res_re, showweight = T, addfit= T, cex = .9)
text(-1.6, 18, "Author(s) (Year)", pos=4)
text( 1.6, 18, "Correlation [95% CI]", pos=2)
dev.off()
}
This works great if all plots are the same size. However, each iteration of the loop puts a different number of studies into the forest plot. As a result, the text elements are not in the right place, and the forest plots with many studies look a bit strange. I have two questions:
1) How can I align the "Author(s) (Year)" and "Correlation [95% CI]" headings automatically to the changing size of the forest plot, so that they sit above the upper line of the forest table?
2) How can I scale the forest plot so that the width and the size of the text elements are the same for all plots, and each additional study just adds a new line (changing the height)?
Each forest-plot should look like this:
Here is what you will have to do to get this to work:
1) I would fix xlim across plots, so that there is a fixed place to put the "Author(s) (Year)" and "Correlation [95% CI]" headings. After you have generated a forest plot, take a look at par()$usr[1:2]. Use these values as a starting point to adjust xlim so that it is appropriate for all your plots. Then use those two values for the two calls to text().
2) There are k rows in each plot. The headings should go two rows above that. So, use text(<first xlim value>, res_re$k+2, "Author(s) (Year)", pos=4) and text(<second xlim value>, res_re$k+2, "Correlation [95% CI]", pos=2)
3) Set cex in text() to the same value you specified in your call to forest().
4) The last part is tricky. You have fixed cex, so the size of the text elements should be the same across plots. But if there are more studies, then the k rows get crammed into less space, so they become less separated. If I understand you correctly, you want to keep the spacing between rows equal across plots by adjusting the actual height of the plot. Essentially, this will require making height in the call to png() a function of k. For each extra study, an additional amount needs to be added to height so that the row spacing stays constant, so something along the lines of height=<some factor> + res_re$k * <some factor>. But the increase in height as a function of k may also be non-linear. Getting this right would take a lot of trial and error. There may be a clever way of determining this programmatically (digging into ?par and maybe ?strheight).
To make it easier for others to chime in, the last part of your question comes down to this: How do I have to adjust the height value of a plotting device, so that the absolute spacing between the rows in plot(1:10) and plot(1:20) stays equal? This is an interesting question in itself, so I am going to post this as a separate question.
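The linear form height=<some factor> + res_re$k * <some factor> can be sketched as a small helper. The function name forest_height_in and both default values are hypothetical placeholders that would need tuning by eye; the answer itself notes that the true relationship may not be linear.

```r
# Hypothetical helper: device height in inches as a linear function of
# the number of studies k. base_in covers headings, summary row and
# margins; per_row_in is the space given to each study row.
# Both defaults are placeholders, not values from metafor.
forest_height_in <- function(k, base_in = 1.5, per_row_in = 0.25) {
  base_in + k * per_row_in
}

# Heights for plots with 5, 10 and 20 studies.
heights <- sapply(c(5, 10, 20), forest_height_in)
print(heights)
```

Inside the loop the result would replace the fixed height, for example png(filename=path, width=8.27, height=forest_height_in(res_re$k), units="in", res=210).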
Ad 4: In Wolfgang's question (Constant Absolute Spacing of Row in R Plots) you will find how to make the plot height depend on the number of rows in it.
For forest() it works a little differently, since this function internally modifies the par("mar") values.
However, if you set the margins to zero, you only need to include the argument yaxs="i" in your forest() call, so that the y-axis covers the range of the data and nothing else. The device then needs to be configured to have the height (length(ma$yi)+4.5)*fact*res, with fact as inches/line (see below) and res as pixels/inch (resolution).
The 4.5 depends on whether you have left addfit=T and intercept=T in your meta-analysis model (in that case forest() internally sets ylim <- c(-1.5, k + 3)). Otherwise you'd have to use 2.5 (then it would be ylim <- c(0.5, k + 3)).
If you feel like using margins, you would do the following (I edited this part after I noticed a mistake):
res <- 'your desired resolution' # pixels per inch
fact <- par("mai")[1]/par("mar")[1] # calculate inches per line
### this following part is copied from inside the forest()-function.
# forest() modifies the margin internally in the same way.
par.mar <- par("mar")
par.mar.adj <- par.mar - c(0, 3, 1, 1)
par.mar.adj[par.mar.adj < 0] <- 0
###
ylim <- c(-1.5, length(ma$yi)+3) # see above
ylim.abs <- abs(ylim[1])+abs(ylim[2])-length(ma$yi) # calculate absolute distance of ylim-argument
pixel.bottom <- (par.mar.adj[1])*fact*res # calculate pixels to add to bottom and top based on the margin that is internally used by forest().
pixel.top <- (par.mar.adj[3])*fact*res
png(filename='path',
width='something meaningful',
height=((length(ma$yi)+ylim.abs)*fact*res) + pixel.bottom + pixel.top,
res=res)
par(mar=par.mar) # make sure that inside the new device the margins you want to define are actually used.
forest(res_re, showweight = T, addfit= T, cex = .9, yaxs="i")
...
dev.off()

Use wordlayout results for ggplot geom_text

The R package wordcloud has a very useful function called wordlayout. It takes the initial positions of words and their respective sizes and rearranges them so that they do not overlap. I would like to use the results of this function to do a geom_text plot in ggplot.
I came up with the following example but soon realized that there seems to be a big difference between cex (wordlayout) and size (geom_text), since words in the graphics package appear way larger.
Here is my sample code. Plot 1 is the original wordcloud plot, which has no overlaps:
library(wordcloud)
library(tm)
library(ggplot2)
samplesize=100
textdf <- data.frame(label=sample(stopwords("en"),samplesize,replace=TRUE),x=sample(c(1:1000),samplesize,replace=TRUE),y=sample(c(1:1000),samplesize,replace=TRUE),size=sample(c(1:5),samplesize,replace=TRUE))
#plot1
plot.new()
pdf(file="plot1.pdf")
textplot(textdf$x,textdf$y,textdf$label,textdf$size)
dev.off()
#plot2
ggplot(textdf,aes(x,y))+geom_text(aes(label = label, size = size))
ggsave("plot2.pdf")
#plot3
new_pos <- wordlayout(x=textdf$x,y=textdf$y,words=textdf$label,cex=textdf$size)
textdf$x <- new_pos[,1]
textdf$y <- new_pos[,2]
ggplot(textdf,aes(x,y))+geom_text(aes(label = label, size = size))
ggsave("plot3.pdf")
#plot4
textdf$x <- new_pos[,1]+0.5*new_pos[,3]#this is the way the wordcloud package rearranges the positions. I took this out of the textplot function
textdf$y <- new_pos[,2]+0.5*new_pos[,4]
ggplot(textdf,aes(x,y))+geom_text(aes(label = label, size = size))
ggsave("plot4.pdf")
Is there a way to overcome this cex/size difference and reuse wordlayout for ggplots?
cex stands for character expansion and is the factor by which text is magnified relative to the default, specified by cin - set on my installation to 0.15 in by 0.2 in: see ?par for more details.
@hadley explains that ggplot2 sizes are measured in mm. Therefore cex=1 would correspond to size=3.81 or size=5.08, depending on whether it is being scaled by the width or the height. Of course, font selection may cause differences.
In addition, to use absolute sizes, you need to have the size specification outside the aes otherwise it considers it a variable to map to and choose the scale itself, eg:
ggplot(textdf,aes(x,y))+geom_text(aes(label = label),size = textdf$size*3.81)
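The 3.81 and 5.08 figures are just the cin values converted from inches to millimetres (0.15 in and 0.2 in times 25.4). A small sketch of that arithmetic, where cex_to_size_mm is a hypothetical helper and the cin values are the ones reported for the answerer's installation, so they may differ on other devices:

```r
mm_per_inch <- 25.4

# Hypothetical helper: convert a base-graphics cex factor to a size in mm,
# given the default character dimension cin_in in inches.
# The default of 0.2 in is the character height quoted in the answer.
cex_to_size_mm <- function(cex, cin_in = 0.2) {
  cex * cin_in * mm_per_inch
}

by_width <- cex_to_size_mm(1, cin_in = 0.15)   # scaled by character width
by_height <- cex_to_size_mm(1, cin_in = 0.20)  # scaled by character height
sizes <- cex_to_size_mm(c(1, 2, 5))            # vectorised over cex
```

The resulting vector can be passed as size outside aes(), as in the geom_text call from the answer; an exact match with base graphics is still not guaranteed because fonts differ.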
Sadly I think you're going to find the short answer is no! I think the package handles the text vector mapping differently from ggplot2, so you can tinker with size and font face/family, etc. but will struggle to replicate exactly what the package is doing.
I tried a few things:
1) Try to plot the grobs from textdata using annotation_custom
require(plyr)
require(grid)
# FIRST TRY PLOT INDIVIDUAL TEXT GROBS
qplot(0:1000,0:1000,geom="blank") +
alply(textdf,1,function(x){
annotation_custom(textGrob(label=x$label,0,0,c("center","center"),gp=gpar(cex=x$size)),x$x,x$x,x$y,x$y)
})
2) Run the wordlayout() function which should readjust the text, but difficult to see for what font (similarly doesn't work)
# THEN USE wordlayout() TO GET CO-ORDS
plot.new()
# store the layout so it can be used below as w
w <- wordlayout(textdf$x,textdf$y,words=textdf$label,cex=textdf$size,xlim=c(min(textdf$x),max(textdf$x)),ylim=c(min(textdf$y),max(textdf$y)))
plotdata<-cbind(data.frame(rownames(w)),w)
colnames(plotdata)=c("word","x","y","w","h")
# PLOT WORDCLOUD DATA
qplot(0:1000,0:1000,geom="blank") +
alply(plotdata,1,function(x){
annotation_custom(textGrob(label=x$word,0,0,c("center","center"),gp=gpar(cex=x$h*40)),x$x,x$x,x$y,x$y)
})
Here's a cheat if you just want to overplot other ggplot functions on top of it (although the co-ords don't seem to match up exactly between the data and the plot). It basically images the wordcloud, removes the margins, and under-plots it at the same scale:
# make a png file of just the panel
plot.new()
png(filename="bgplot.png")
par(mar=c(0.01,0.01,0.01,0.01))
textplot(textdf$x,textdf$y,textdf$label,textdf$size,xaxt="n",yaxt="n",xlab="",ylab="",asp=1)
dev.off()
# library to get PNG file
require(png)
# then plot it behind the panel
qplot(0:1000,0:1000,geom="blank") +
annotation_custom(rasterGrob(readPNG("bgplot.png"),0,0,1,1,just=c("left","bottom")),0,1000,0,1000) +
coord_fixed(1,c(0,1000),c(0,1000))
