How to set different kinds of bar plots in terminal nodes? - r

I am running a MOB tree on a dataset and I want to modify the plots in the terminal nodes. Specifically, I want to use a bar chart of the coefficients of the model that MOB fits in each node as the terminal node display.
For example, I run the MOB tree on the "PimaIndiansDiabetes" dataset from the "mlbench" package. Here is the code:
library("partykit")
library("mlbench")
data("PimaIndiansDiabetes", package = "mlbench")

pid_formula <- diabetes ~ glucose | pregnant + pressure + triceps +
  insulin + mass + pedigree + age
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}
pid_tree <- mob(pid_formula, data = PimaIndiansDiabetes, fit = logit)
Then I have a model for each node; for example, node 2 has the fitted coefficients -9.95 (intercept) and 0.058 (slope for glucose). I want to make bar charts from these coefficients (e.g., -9.95 and 0.058 for node 2) and use these bar charts as the terminal nodes in the final tree plot. Any idea how to do that? Thanks in advance.

To implement such a graphic in partykit you would have to write a new panel function for the plot() method (or rather a panel-generating function). The starting point could be partykit::node_barplot which first extracts the fitted probabilities of a classification tree and then draws them using the grid package. Instead, you could extract the estimated parameters with coef() and then draw these using grid. It's a bit technical but not extremely complicated.
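For illustration, a hypothetical minimal sketch of such a panel-generating function could look as follows (the bar geometry, scales, and margins are simplifying assumptions, not partykit defaults; see the source of partykit::node_barplot for the full machinery):
library("grid")

## Sketch: one bar per coefficient of the node's fitted model
node_coefplot <- function(obj, ...) {
  ## the returned panel function is called once for each terminal node
  function(node) {
    cf <- coef(info_node(node)$object)
    k <- length(cf)
    pushViewport(viewport(width = 0.9, height = 0.8,
                          xscale = c(0.5, k + 0.5),
                          yscale = extendrange(c(0, cf))))
    grid.rect()  # panel border
    for (i in seq_len(k)) {
      ## bars grow from zero, upwards or downwards
      grid.rect(x = unit(i, "native"),
                y = unit(min(0, cf[i]), "native"),
                width = unit(0.6, "native"),
                height = unit(abs(cf[i]), "native"),
                just = c("centre", "bottom"),
                gp = gpar(fill = "lightgray"))
    }
    grid.xaxis(at = seq_len(k), label = names(cf))
    grid.yaxis()
    popViewport()
  }
}
class(node_coefplot) <- "grapcon_generator"

## usage:
## plot(pid_tree, terminal_panel = node_coefplot)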
However, I wouldn't recommend implementing such a function. The reason is that this display would be best suited to comparing the different coefficients within the same node. But as slope and intercept are on completely different scales, this is not easy to interpret. Instead, one should give more emphasis to differences in the same coefficient across nodes. The basis for this would also be:
coef(pid_tree)
## x(Intercept) xglucose
## 2 -9.951510 0.05870786
## 4 -6.705586 0.04683748
## 5 -2.770954 0.02353582
Additionally, one could consider the corresponding standard errors for confidence intervals. (Keep in mind that these have to be taken with a grain of salt, though: they do not adjust for estimating the tree but pretend the terminal groups were given exogenously. Still, they are useful as rough yardsticks.) Below is a small convenience function for this:
confintplot <- function(object, ylim = NULL,
                        xlab = "Parameter per node", ylab = "Estimate",
                        main = "", index = NULL, ...)
{
  ## point estimates and confidence intervals per terminal node
  cf <- coef(object)
  node <- nodeids(object, terminal = TRUE)
  ci <- nodeapply(object, ids = node, FUN = function(n)
    confint(info_node(n)$object, ...))
  if (!is.null(index)) {
    cf <- cf[, index, drop = FALSE]
    ci <- lapply(ci, "[", index, , drop = FALSE)
  }
  cfnm <- rownames(ci[[1L]])
  nodenm <- rownames(cf)

  ## set up dimensions
  n <- length(ci)
  k <- nrow(ci[[1L]])
  at <- t(outer(1:k, seq(-0.15, 0.15, length.out = n), "+"))

  ## empty plot
  if (is.null(ylim)) ylim <- range(unlist(ci))
  plot(0, 0, type = "n", xlim = range(at), ylim = ylim,
       xlab = xlab, ylab = ylab, main = main, axes = FALSE)

  ## draw interval and point estimate for every parameter
  for (i in 1L:k) {
    arrows(at[, i], sapply(ci, "[", i, 1L), at[, i], sapply(ci, "[", i, 2L),
           code = 3, angle = 90, length = 0.05)
    points(at[, i], cf[, cfnm[i]], pch = 19, col = "white", cex = 1.15)
    points(at[, i], cf[, cfnm[i]], pch = nodenm, cex = 0.65)
  }
  axis(1, at = 1:k, labels = cfnm)
  axis(2)
  box()
}
Using this we can create one plot for each parameter (intercept vs. slope) separately. This shows that the intercept is increasing across nodes while the slope is decreasing.
par(mfrow = c(1, 2))
confintplot(pid_tree, index = 1)
confintplot(pid_tree, index = 2)
It is also possible to show these on a common y-axis. However, this completely obscures the changes in the slope because of the different scales:
confintplot(pid_tree)
Final comment: I would recommend using glmtree() for this particular kind of model rather than mob() "by hand". The former is faster and provides some extra features, in particular easy prediction.
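For example, a minimal sketch of the corresponding glmtree() call (the object name pid_tree2 is arbitrary):
## same model as above, via the convenience interface
pid_tree2 <- glmtree(diabetes ~ glucose | pregnant + pressure + triceps +
  insulin + mass + pedigree + age,
  data = PimaIndiansDiabetes, family = binomial)

coef(pid_tree2)  # coefficients per terminal node
## predicted probabilities for (here) the first few observations
predict(pid_tree2, newdata = head(PimaIndiansDiabetes), type = "response")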

Related

R - Meta-Analysis - Plotting forest plot from multi-level random-effects model with subgroups

I am having trouble plotting a forest plot based on a multi-level model, in which I'd also like to display the pooled effects of subgroups, as well as the results for subgroup differences.
So far, I have managed to produce a plot of the data where clusters are grouped together. I would like to extend this plot by adding the pooled effects of the subgroups at the right positions, without losing the grouping of the clusters. (As explained here, but also while keeping what is shown in the last example of this.)
This is the code I have used so far to produce the "normal" forest plot for my model (sorry, it's pretty long):
# ma_data  => my data
# main_3L  => my multi-level model

# Prepare row argument for separation by study
dd <- c(0, diff(ma_data$ID))
dd[dd > 0] <- 1
rows <- (1:main_3L$k) + cumsum(dd)

par(tck = -.01, mgp = c(1.6, .2, 0), cex = 1)

# refactor ID var
ma_data$ID_plot <- substr(ma_data$short_cite, 1, nchar(ma_data$short_cite))
ma_data$ID_plot <- paste(sub(" ||) ", "", substr(ma_data$ID_plot, 0, 2)),
                         substr(ma_data$ID_plot, 3, nchar(ma_data$ID_plot)),
                         sep = "")

tiff("./figures/forestFull_ext1.tiff", width = 3200, height = 4500, res = 300)

# Plot the forest!
metafor::forest(main_3L,
  addpred = TRUE,  # adds prediction interval
  cex = 0.5,
  header = "Author(s) and Year",
  rows = rows,     # uses the vector created above
  order = order(ma_data$ID, ma_data$es_adj),
  ylim = c(0.5, max(rows) + 3),
  xlim = c(-5, 3),
  xlab = "Hedges' G",
  ilab = cbind(as.character(ma_data$setup), as.character(ma_data$target_1),
               as.character(ma_data$measure_type), ma_data$task,
               as.character(ma_data$cogdom_pooled), ma_data$sample_size_exp),
  ilab.xpos = c(-3.9, -3.6, -3.3, -2.8, -2.2, -1.7),
  slab = ma_data$ID_plot,
  mlab = mlabfun("Overall RE Model", main_3L, main_3L.I2))  # adds Q, Qp, I², and sigma² values

abline(h = rows[c(1, diff(rows)) == 2] - 1, lty = "dotted")

# adds a second polygon with robust estimates for standard error
addpoly(coeftest.main_3L$beta, sei = coeftest.main_3L$SE,
        rows = -2.5,
        cex = 0.5,
        mlab = "Robust RE Model estimate",
        col = "darkred")

par(cex = 0.5, font = 2)
# text(c(-4, -3.7, -3.2, -2.5, -2), 150.5, pos = 3,
#      c("Target", "Measure", "Task", "Cognitive Domain", "N"))
text(c(-3.9, -3.6, -3.3, -2.8, -2.2, -1.7), 150.5, pos = 3,
     c("Setup", "Target", "Measure", "Task", "Cognitive Domain", "N"))

dev.off()
Specifically, I need to know how to "make space" for the additional rows and polygons.
Also, is there an option in the forest() function to display only the pooled effects of the subgroups and the main effect, but not the individual effect sizes? I know that this is possible in the meta package, but I have not found anything similar in metafor.
Any help is greatly appreciated!
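For what it's worth, the usual metafor pattern for making space is to leave gaps in the rows vector and widen ylim, then place the subgroup polygons into those gaps with addpoly(). A minimal sketch on the package's built-in dat.bcg data (not the data above; the row positions are illustrative):
library("metafor")
dat <- dat.bcg

res <- rma(ai = tpos, bi = tneg, ci = cpos, di = cneg,
           data = dat, measure = "RR",
           slab = paste(author, year, sep = ", "))

## 13 studies in 3 subgroups: the gaps between the row blocks
## leave space for subgroup headings and polygons
forest(res, ylim = c(-1, 27), order = order(dat$alloc),
       rows = c(3:4, 9:15, 20:23), atransf = exp)

## pooled effect of one subgroup, placed into the gap below its block
res.r <- rma(ai = tpos, bi = tneg, ci = cpos, di = cneg,
             data = dat, measure = "RR", subset = (alloc == "random"))
addpoly(res.r, row = 7.5, atransf = exp, mlab = "RE Model for Subgroup")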

Drawing random numbers from a power law distribution in R

I am using the R package "poweRlaw" to estimate and subsequently draw from discrete power-law distributions; however, the distribution drawn from the fit does not seem to match the data. To illustrate, consider this example from a guide for this package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf. Here we first load an example dataset from the package and then fit a discrete power law.
library("poweRlaw")
data("moby", package = "poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
The fit looks like a good one, as we cannot reject the hypothesis that this data is drawn from a power-law distribution (p-value > 0.05):
bs = bootstrap_p(m_pl, threads = 8)
bs$p
However, when we draw from this distribution using the built-in function dist_rand(), the resulting distribution is shifted to the right of the original distribution:
set.seed(1)
randNum = dist_rand(m_pl, n = length(moby))
plot(density(moby), xlim = c(0, 1000), ylim = c(0, 1), xlab = "", ylab = "", main = "")
par(new=TRUE)
plot(density(randNum), xlim = c(0, 1000), ylim = c(0, 1), col = "red", xlab = "x", ylab = "Density", main = "")
I am probably misunderstanding what it means to draw from a power-law distribution, but does this happen because we only fit the tail of the empirical distribution (i.e., we only draw values above the parameter xmin)? If something like this is happening, is there any way I can compensate for this, so that the fitted distribution resembles the empirical distribution?
There are a few things going on here.

1. As you hinted at in your question, if you want to compare distributions, you need to truncate moby, so moby = moby[moby >= m_pl$getXmin()].
2. Using density() is a bit fraught. This is a kernel density smoother that draws normal distributions over discrete points. As the power law has a very long tail, this is suspect.
3. Comparing the tails of two power-law distributions is tricky (simulate some data and see).
Anyway, if you run
set.seed(1)
x = dist_rand(m_pl, n = length(moby))

# truncate at xmin and cut off the extreme tail for visualisation
moby = moby[moby >= m_pl$getXmin() & moby < 100]
plot(density(moby), log = "xy")
x = x[x < 100]
lines(density(x), col = 2)  # red
Gives something fairly similar.

Log-histogram bins in R: hist(log(x)) vs hist(x, log = "x")

My question is how to properly implement logarithmic bins for histograms in R. The following test case highlights my issue. Given some lognormally distributed data that has been left-truncated, we try to plot the histogram. Depending on whether you plot the histogram of the log-transformed data directly, or instead use a log-axis, you get two drastically different impressions of the data. Method 1 seems to be wrong, as you actually have a linear axis from ~3 to ~8, but Method 2 seems to erase the lognormal shape that is obviously there. Without truncation, Method 1 also seems to be preferable.
sln <- rlnorm(5e3, 10.12, 1.93)
sln <- sln[sln > 5e3]
brks <- hist(log10(sln), breaks = "FD", prob = TRUE, plot = TRUE,
             main = "Method 1")$breaks
hist(sln, breaks = 10^brks, prob = TRUE, log = "x", main = "Method 2")
EDIT
I'll add that stepping away from histograms and using kernel density estimates has a similar problem:
plot(density(log(sln)), main = 'Method 1')
plot(density(sln), log = "x", main = 'Method 2')

QQ plot in R from TASSEL pipeline

For my GWAS analysis I am using the TASSEL pipeline. In my GWAS I am studying two correlated traits.
I want to plot a Q-Q plot for the two traits in one plot, like the one that can be obtained from the TASSEL program.
Does anyone have a suggestion for which R package I can use to do that?
With the qq() command from the qqman package I can plot each Q-Q plot separately, but I want a single plot that involves both of my traits, as in TASSEL.
Any suggestion?
A Q-Q plot in your case compares the quantiles of the empirical distribution of your results to the quantiles of the distribution that you'd expect theoretically if the null hypothesis were true.
If you have n data points, it makes sense to compare the n-quantiles, because then the actual quantiles of your empirical distribution are just your data points, ordered.
The theoretical distribution of p-values is the uniform distribution. Think about it: that is exactly the reason p-values exist. If a measurement is assigned, say, a p-value of 0.05, you'd expect this or a more extreme measurement by pure chance (under the null hypothesis) in only 5% of your experiments, if you repeated the experiment very often. A measurement with p = 0.5 is expected in 50% of the cases. So, generalizing to any value p, the cumulative distribution function is

CDF(p) = P[p-value ≤ p] = p,

which is the CDF of the uniform distribution between 0 and 1.
Therefore, the expected n-quantiles for your Q-Q plot are {1/n, 2/n, ..., n/n}. (They represent the case where the null hypothesis is true.)
So, now we have the theoretical quantiles (x-axis) and the actual quantiles (y-axis). In R code, this is something like:
expected_quantiles <- function(pvalues) {
  n <- length(pvalues)
  actual_quantiles <- sort(pvalues)
  expected_quantiles <- seq_along(pvalues) / n
  data.frame(expected = expected_quantiles, actual = actual_quantiles)
}
You can take the -log10 of these values and plot them, for example like so:
testdata1 <- c(runif(98, 0, 1), 1e-4, 2e-5)
testdata2 <- c(runif(96, 0, 1), 1e-3, 2e-3, 2e-4)

qq <- lapply(list(d1 = testdata1, d2 = testdata2), expected_quantiles)

xlim <- rev(-log10(range(rbind(qq$d1, qq$d2)$expected))) * c(1, 1.1)
ylim <- rev(-log10(range(rbind(qq$d1, qq$d2)$actual))) * c(1, 1.1)

plot(NULL, xlim = xlim, ylim = ylim)
points(x = -log10(qq$d1$expected), y = -log10(qq$d1$actual), col = "red")
points(x = -log10(qq$d2$expected), y = -log10(qq$d2$actual), col = "blue")
abline(a = 0, b = 1)

R Statistics Distributions Plotting

I am having some trouble with a homework assignment in statistics.
I am required to graphically represent the density and the distribution function, in two inline plots, for a set of parameters of my choice (there must be a minimum of 4), for the Student's t, Fisher's F, and Chi-squared distributions.
Let's take only the example of the Student's t distribution.
From what I have searched on the internet, I have come up with this:
First, I need to generate some random values:
x <- rnorm( 20, 0, 1 )
Question 1: Do I need to generate 4 of these?
Then I have to plot these values with:
plot(dt( x, df = 1))
plot(pt( x, df = 1))
But how do I do this for four sets of parameters? They should be represented in the same plot.
Is this a good approach, given what I've found so far?
Please tell me if I'm wrong.
To plot several densities of a certain distribution, you first have to set up a support vector, in this case x below. Then compute the values of the densities with the parameters of your choice, and plot them. In the code that follows, I plot 4 Student-t pdfs, with degrees of freedom 1 to 4.
x <- seq(-5, 5, by = 0.01)                   # the support vector
y <- sapply(1:4, function(d) dt(x, df = d))  # densities for df = 1, ..., 4

# Open an empty plot first, then add one line per density
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5))
for (i in 1:4) {
  lines(x, y[, i], col = i)
}
Then you can make the graph prettier by adding a main title, changing the axis titles, etc.
If you want other distributions, such as the F or Chi-squared, you will need a strictly positive support, for instance x <- seq(0.0001, 10, by = 0.01).
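For instance, the same pattern for the Chi-squared case might look like this (the degrees of freedom 1:4 and the axis limits are assumptions):
x <- seq(0.0001, 10, by = 0.01)                  # strictly positive support
y <- sapply(1:4, function(d) dchisq(x, df = d))  # densities for df = 1, ..., 4

plot(1, type = "n", xlim = c(0, 10), ylim = c(0, 1),
     xlab = "x", ylab = "density")
for (i in 1:4) {
  lines(x, y[, i], col = i)
}
legend("topright", legend = paste("df =", 1:4), col = 1:4, lty = 1)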
