Drawing random numbers from a power law distribution in R

I am using the R package "poweRlaw" to estimate and subsequently draw from discrete power law distributions. However, the distribution drawn from the fit does not seem to match the data. To illustrate, consider this example from a guide for this package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf. Here we first load an example dataset from the package and then fit a discrete power law.
library("poweRlaw")
data("moby", package = "poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
The fit looks like a good one, as we cannot reject the hypothesis that this data is drawn from a power law distribution (p-value > 0.05):
bs = bootstrap_p(m_pl, threads = 8)
bs$p
However, when we draw from this distribution using the built-in function dist_rand(), the resulting distribution is shifted to the right of the original distribution:
set.seed(1)
randNum = dist_rand(m_pl, n = length(moby))
plot(density(moby), xlim = c(0, 1000), ylim = c(0, 1), xlab = "", ylab = "", main = "")
par(new=TRUE)
plot(density(randNum), xlim = c(0, 1000), ylim = c(0, 1), col = "red", xlab = "x", ylab = "Density", main = "")
I am probably misunderstanding what it means to draw from a power law distribution, but does this happen because we only fit the tail of the experimental distribution (so we only draw values at or above the parameter Xmin)? If something like this is happening, is there any way I can compensate for it so that the fitted distribution resembles the experimental distribution?

There are a few things going on here.
As you hinted at in your question, to compare the distributions you need to truncate moby at the fitted xmin: moby = moby[moby >= m_pl$getXmin()]
Using density() is a bit fraught. It is a kernel density smoother that places Normal kernels over the discrete points. As the power law has a very long tail, the result is suspect.
Comparing the tails of two power law distributions is tricky (simulate some data and see).
Anyway, running
set.seed(1)
x = dist_rand(m_pl, n = length(moby))
# Cut off the tail for visualisation
moby = moby[moby >= m_pl$getXmin() & moby < 100]
plot(density(moby), log = "xy")
x = x[ x < 100]
lines(density(x), col = 2)
gives something fairly similar.
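Since density() is suspect for long-tailed discrete data, another way to make the comparison is to plot empirical complementary CDFs on log-log axes. The following is only a sketch under the same setup as above; the helper ccdf() is a hypothetical name introduced here and is not part of poweRlaw.

```r
library(poweRlaw)

data("moby", package = "poweRlaw")
m_pl <- displ$new(moby)
m_pl$setXmin(estimate_xmin(m_pl))
xmin <- m_pl$getXmin()

# Keep only the observations the power law was actually fitted to
tail_obs <- moby[moby >= xmin]

set.seed(1)
sim <- dist_rand(m_pl, n = length(tail_obs))

# Hypothetical helper: empirical P(X >= x) at each distinct value of v
ccdf <- function(v) {
  tab <- table(v)
  data.frame(x = as.numeric(names(tab)),
             p = rev(cumsum(rev(as.vector(tab)))) / length(v))
}

obs_ccdf <- ccdf(tail_obs)
sim_ccdf <- ccdf(sim)

# Both point clouds should follow a roughly straight line on log-log axes
plot(obs_ccdf$x, obs_ccdf$p, log = "xy", xlab = "x", ylab = "P(X >= x)")
points(sim_ccdf$x, sim_ccdf$p, col = "red")
```

No smoothing bandwidth is involved here, and the simulated values all lie at or above xmin, so the comparison is restricted to the fitted tail.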


plot create cutoff line at particular point

consider the following plot:
library(pwr)
pwrt<-pwr.t.test(d=.8,n=c(10,20,30,40,50,60,70,80,90,100),sig.level=.05,type="two.sample",alternative="two.sided")
plot(pwrt$n,pwrt$power,type="b",xlab="sample size",ylab="power", main = "Power curve for t-test d = .8")
which creates a plot of power against sample size.
I would like to add a vertical line as a 'cutoff' point at, for example, power = .9, and also to compute the exact x-value (sample size) for this cutoff point.
How do I do this? Any help is much appreciated.
You can calculate the sample size for a given power with the same pwr.t.test function.
From help(pwr.t.test):
Exactly one of the parameters 'd','n','power' and 'sig.level' must be passed as NULL, and that parameter is determined from the others.
library(pwr)
N90 <- pwr.t.test(d=.8,power = 0.9,sig.level=.05,type="two.sample",alternative="two.sided")$n
N90
[1] 33.82555
From there, it's simple to add a line and text label.
plot(pwrt$n,pwrt$power,type="b",xlab="sample size",ylab="power", main = "Power curve for t-test d = .8")
abline(v = N90)
text(x = N90 + 7, y = 0.8, labels = paste0("N = ",round(N90,2)))
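Because a sample size has to be a whole number, it may be useful to round the solved value up and check the power actually achieved. This is a small sketch along those lines; the names n_exact, n_needed and achieved are introduced here for illustration.

```r
library(pwr)

# Solve for n at the target power (same call as above)
n_exact <- pwr.t.test(d = 0.8, power = 0.9, sig.level = 0.05,
                      type = "two.sample", alternative = "two.sided")$n

# Round up to the next whole observation per group
n_needed <- ceiling(n_exact)

# Power actually achieved with the rounded sample size
achieved <- pwr.t.test(d = 0.8, n = n_needed, sig.level = 0.05,
                       type = "two.sample", alternative = "two.sided")$power

# Redraw the curve with both cutoff lines
ns <- seq(10, 100, by = 10)
pw <- pwr.t.test(d = 0.8, n = ns, sig.level = 0.05,
                 type = "two.sample", alternative = "two.sided")$power
plot(ns, pw, type = "b", xlab = "sample size", ylab = "power")
abline(v = n_needed, h = 0.9, lty = 2)
```

The horizontal line marks the target power and the vertical line the smallest whole sample size per group that reaches it.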

How to smooth a curve in R?

location_difference <- c(0, 0.5, 1, 1.5, 2)
Power <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
# note: 5 x-values but 6 y-values as posted; plot() needs equal lengths
plot(location_difference, Power)
The author of the paper says he smoothed the curve using a weighted moving average with weight vector w = (0.25, 0.5, 0.25), but he did not explain how he did this or which function he used. I am really confused.
Up front, as @MartinWettstein cautions, be careful about when you smooth data and what you do with it (infer from it). Having said that, a simple weighted moving average might look like this.
# replacement data
x <- seq(0, 2, len=5)
y <- c(0, 0.02, 0.65, 1, 1)
# smoothed
ysm <-
zoo::rollapply(c(NA, y, NA), 3,
function(a) Hmisc::wtd.mean(a, c(0.25, 0.5, 0.25), na.rm = TRUE),
partial = FALSE)
# plot
plot(x, y, type = "b", pch = 16)
lines(x, ysm, col = "red")
Notes:
the zoo:: package provides a rolling window (3-wide here), calling the function once for indices 1-3, then again for indices 2-4, then 3-5, 4-6, etc.
with rolling-window operations, realize that they can be center-aligned (default of zoo::rollapply) or left/right aligned. There are some good explanations here: How to calculate 7-day moving average in R?
I surround the y data with NAs so that I can mimic a partial window. Normally with rolling-window ops, if k=3, then the resulting vector is length(y) - (k-1) long. I'm inferring that you want to include data on the ends, so the first smoothed data point would be effectively (0.5*0 + 0.25*0.02)/0.75, the second smoothed data point (0.25*0 + 0.5*0.02 + 0.25*0.65)/1, and the last smoothed data point (0.25*1 + 0.5*1)/0.75. That is, omitting the 0.25 times a missing data point. That's a guess and can easily be adjusted based on your real needs.
I'm using Hmisc::wtd.mean, though it is trivial to write this weighted-mean function yourself.
This is suggestive only, and not meant to be authoritative. Just to help you begin exploring your smoothing processes.
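As noted above, the weighted mean is easy to write by hand. Here is a base-R sketch of the same smoothing without zoo or Hmisc, keeping the same guess about the end points (weights of missing neighbours are dropped and the rest renormalised). The function name wma3 is hypothetical.

```r
# Hypothetical helper: 3-point centred weighted moving average.
# At the ends the missing neighbour is dropped and the remaining
# weights are renormalised, mirroring the NA-padding approach above.
wma3 <- function(y, w = c(0.25, 0.5, 0.25)) {
  padded <- c(NA, y, NA)
  vapply(seq_along(y), function(i) {
    win <- padded[i:(i + 2)]
    ok <- !is.na(win)
    sum(win[ok] * w[ok]) / sum(w[ok])
  }, numeric(1))
}

x <- seq(0, 2, len = 5)
y <- c(0, 0.02, 0.65, 1, 1)
ysm <- wma3(y)

plot(x, y, type = "b", pch = 16)
lines(x, ysm, col = "red")
```

For the interior points this agrees with stats::filter(y, c(0.25, 0.5, 0.25)), which returns NA at both ends instead of a partial-window value.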

Log-Histogram bins in R: hist(log(x)) vs hist(x, log = "x")

My question is how to properly implement logarithmic bins for histograms in R. The following test case highlights my issue. Given some lognormally distributed data that has been left-truncated, we try to plot the histogram. Depending on whether you plot the histogram of the log-transformed data directly or use a log axis, you can arrive at two drastically different interpretations of the data. Method 1 seems to be wrong, as you actually have a linear axis from ~3 to ~8, but Method 2 seems to erase the lognormal that is obviously there. Without truncation, Method 1 also seems to be preferable.
sln <- rlnorm(5e3, 10.12, 1.93)
sln <- sln[sln>5e3]
brks<- hist(log10(sln), breaks = "FD", prob = T, plot = T, main = "method 1")$breaks
hist(sln, breaks = 10^brks, prob = T, log = "x", main = 'Method 2')
EDIT
I'll add that stepping away from histograms and using the kernel density estimates has a similar problem:
plot(density(log(sln)), main = 'Method 1')
plot(density(sln), log = "x", main = 'Method 2')
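One way to see where the two pictures diverge, offered as a sketch rather than a definitive answer: both methods place the same observations in the same bins, but prob = TRUE divides each count by the bin width, which is measured in log10 units in Method 1 and in original units in Method 2. The names h1, h2, w1 and w2 below are introduced here for illustration.

```r
set.seed(42)
sln <- rlnorm(5e3, 10.12, 1.93)
sln <- sln[sln > 5e3]

# Method 1: bins chosen on the log10 scale
h1 <- hist(log10(sln), breaks = "FD", plot = FALSE)
# Method 2: the same bin edges mapped back to the original scale
h2 <- hist(sln, breaks = 10^h1$breaks, plot = FALSE)

w1 <- diff(h1$breaks)      # equal widths in log10 units
w2 <- diff(10^h1$breaks)   # widths grow with x in original units

# Counts per bin agree; the density heights differ only by the bin widths
same_counts <- all(h1$counts == h2$counts)
same_mass <- isTRUE(all.equal(h1$density * w1, h2$density * w2))
```

So the heights in Method 2 are those of Method 1 rescaled by w1 / w2, which shrinks the bars for large x and is why the lognormal shape is no longer visible on the log axis.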

QQ plot in r from tassel pipeline

For my GWAS analysis I am using the tassel pipeline. In my GWAS I am studying two correlated traits.
I want to draw a QQ plot for the two traits in one plot, like the one that can be obtained from the tassel program.
Does anyone have a suggestion as to which R package I can use for that?
With the qq() command from the qqman package I can draw the QQ plots in separate plots, but I want a single plot that includes both traits, as I did in tassel.
Any suggestion?
A QQ-Plot in your case compares quantiles of the empirical distribution of your result to quantiles of the distribution that you'd expect theoretically if the null hypothesis is true.
If you have n data points, it makes sense to compare the n-quantiles, because then the actual quantiles of your empirical distribution are just your data points, ordered.
The theoretical distribution of p-values under the null hypothesis is the uniform distribution. Think about it: that is exactly the reason why they exist. If a measurement is assigned, for example, a p-value of 0.05, you'd expect this or a more extreme measurement by pure chance (null hypothesis) in only 5% of your experiments, if you repeat that experiment very often. A measurement with p=0.5 is expected in 50% of the cases. So, generalizing to any value p, your cumulative distribution function is
CDF(p) = P[measurement with p-value of ≤ p] = p.
Look in Wikipedia: that's the CDF for the uniform distribution between 0 and 1.
Therefore, the expected n-quantiles for your QQ-Plot are {1/n, 2/n, ... n/n}. (They represent the case that the null hypothesis is true)
So, now we have the theoretical quantiles (x-axis) and the actual quantiles. In R code, this is something like
expected_quantiles <- function(pvalues){
n = length(pvalues)
actual_quantiles = sort(pvalues)
expected_quantiles = seq_along(pvalues)/n
data.frame(expected = expected_quantiles, actual = actual_quantiles)
}
You can take the -log10 of these values and plot them, for example like so
testdata1 <- c(runif(98,0,1), 1e-4, 2e-5)
testdata2 <- c(runif(96,0,1), 1e-3, 2e-3, 2e-4)
qq <- lapply(list(d1 = testdata1, d2 = testdata2), expected_quantiles)
xlim <- rev(-log10(range(rbind(qq$d1, qq$d2)$expected))) * c(1, 1.1)
ylim <- rev(-log10(range(rbind(qq$d1, qq$d2)$actual))) * c(1, 1.1)
plot(NULL, xlim = xlim, ylim = ylim)
points(x = -log10(qq$d1$expected) ,y = -log10(qq$d1$actual), col = "red")
points(x = -log10(qq$d2$expected) ,y = -log10(qq$d2$actual), col = "blue")
abline(a = 0, b = 1)
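The claim that null p-values are uniform can be checked by simulation. This is a rough sketch: it assumes two-sample t-tests on data with no true difference, and the names pvals, expected and max_gap are introduced here for illustration.

```r
set.seed(1)

# p-values from 2000 t-tests where the null hypothesis is true
pvals <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

n <- length(pvals)
expected <- seq_len(n) / n   # theoretical n-quantiles of Uniform(0, 1)
actual <- sort(pvals)        # empirical n-quantiles

# Largest vertical distance between empirical and expected quantiles
max_gap <- max(abs(actual - expected))

# On the -log10 scale the points should hug the diagonal
plot(-log10(expected), -log10(actual),
     xlab = "expected -log10(p)", ylab = "observed -log10(p)")
abline(a = 0, b = 1)
```

With real GWAS results, systematic departures from the diagonal across the whole range suggest inflation, while departures only at the far end point to a few strong signals.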

How to set different kind of bar plot in terminal nodes?

I am running a MOB tree on a dataset and I want to modify the plots in the terminal nodes. I would like to use a bar chart of the coefficients of the model fitted by MOB in each node as the terminal-node plot.
For example, I run the MOB tree on the "PimaIndiansDiabetes" dataset from the "mlbench" package. Here is the code:
library("partykit")
data("PimaIndiansDiabetes", package = "mlbench")
pid_formula <- diabetes ~ glucose | pregnant + pressure + triceps +
insulin + mass + pedigree + age
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
glm(y ~ 0 + x, family = binomial, start = start, ...)
}
pid_tree <- mob(pid_formula, data = PimaIndiansDiabetes, fit = logit)
Then I have a model for each node. For example, node 2 has an intercept of -9.95 and a glucose slope of 0.058. I want to make bar charts from these coefficients (e.g. -9.95 and 0.058 for node 2) and use these bar charts as the terminal nodes in the final tree plot. Any idea how to do that? Thanks in advance.
To implement such a graphic in partykit you would have to write a new panel function for the plot() method (or rather a panel-generating function). The starting point could be partykit::node_barplot which first extracts the fitted probabilities of a classification tree and then draws them using the grid package. Instead, you could extract the estimated parameters with coef() and then draw these using grid. It's a bit technical but not extremely complicated.
However, I wouldn't recommend implementing such a function. The reason is that it would be best suited to comparing the different coefficients within the same node. But as slope and intercept are on completely different scales, this is not easy to interpret. Instead, one should give more emphasis to differences in the same coefficient across nodes. The basis for this would also be:
coef(pid_tree)
## x(Intercept) xglucose
## 2 -9.951510 0.05870786
## 4 -6.705586 0.04683748
## 5 -2.770954 0.02353582
Additionally one could consider the corresponding standard errors for confidence intervals. (Keep in mind that these have to be taken with a grain of salt, though: They do not adjust for estimating the tree but pretend the terminal groups were given exogenously. Still they are useful as rough yardsticks.) I include a small convenience function to do this:
confintplot <- function(object, ylim = NULL,
xlab = "Parameter per node", ylab = "Estimate",
main = "", index = NULL, ...)
{
## point estimates and interval
cf <- coef(object)
node <- nodeids(object, terminal = TRUE)
ci <- nodeapply(object, ids = node, FUN = function(n)
confint(info_node(n)$object, ...))
if (!is.null(index)) {
cf <- cf[, index, drop = FALSE]
ci <- lapply(ci, "[", index, , drop = FALSE)
}
cfnm <- rownames(ci[[1L]])
nodenm <- rownames(cf)
## set up dimensions
n <- length(ci)
k <- nrow(ci[[1L]])
at <- t(outer(1:k, seq(-0.15, 0.15, length.out = n), "+"))
## empty plot
if(is.null(ylim)) ylim <- range(unlist(ci))
plot(0, 0, type = "n", xlim = range(at), ylim = ylim,
xlab = xlab, ylab = ylab, main = main, axes = FALSE)
## draw every parameter
for(i in 1L:k) {
arrows(at[,i], sapply(ci, "[", i, 1L), at[,i], sapply(ci, "[", i, 2L),
code = 3, angle = 90, length = 0.05)
points(at[, i], cf[, cfnm[i]], pch = 19, col = "white", cex=1.15)
points(at[, i], cf[, cfnm[i]], pch = nodenm, cex = 0.65)
}
axis(1, at = 1:k, labels = cfnm)
axis(2)
box()
}
Using this we can create one plot for each parameter (intercept vs. slope) separately. This shows that the intercept is increasing across nodes while the slope is decreasing.
par(mfrow = c(1, 2))
confintplot(pid_tree, index = 1)
confintplot(pid_tree, index = 2)
It is also possible to show these on a common y-axis. However, this completely obscures the changes in the slope because of the different scales:
confintplot(pid_tree)
Final comment: I would recommend using glmtree() for this particular kind of model instead of mob() "by hand". The former is faster and provides some extra features, especially easy forecasting.
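As a sketch of that recommendation, the same model can be fitted with glmtree() and used for prediction. This assumes the partykit and mlbench packages are installed; pid_tree2 and p_hat are names introduced here for illustration.

```r
library(partykit)
data("PimaIndiansDiabetes", package = "mlbench")

# Same model-based tree, with the logistic regression specified via family
pid_tree2 <- glmtree(diabetes ~ glucose | pregnant + pressure + triceps +
                       insulin + mass + pedigree + age,
                     data = PimaIndiansDiabetes, family = binomial)

# Per-node coefficients, one row per terminal node
cf <- coef(pid_tree2)
print(cf)

# Predicted probabilities for the first few observations
p_hat <- predict(pid_tree2, newdata = head(PimaIndiansDiabetes),
                 type = "response")
print(p_hat)
```

The coefficient columns are named without the "x" prefix that the hand-written logit() wrapper produces, which makes plots labelled from the coefficient names a little tidier.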
