abline() is not working with weighted.hist() in R

I used the plotrix library to plot a histogram with weights. The histogram shows up as expected, but when I try to plot the mean as a vertical line, it doesn't show up at all.
Here's a snippet of my code:
library("plotrix")
library("zoom")
vals = seq.int(from = 52.5 , to = 97.5 , by = 5)
weights <- c(18.01,18.26,16.42,14.07,11.67,9.19,6.46,3.85,1.71,0.34)/100
mean <- sum(vals*weights)
wh <- weighted.hist(x = vals , w = weights , freq = FALSE)
abline(v = mean)
abline() seems to work only with the normal hist() function.
I'm sorry if the question sounds stupid; I am an R newbie. I did my research but could not find any helpful info.
Thanks in advance.

You should provide a sample of your data. Your calculation of the weighted mean is only correct if your weights sum to 1. If they do not, you should use weighted.mean(vals, weights) or sum(vals * weights) / sum(weights). The following example is slightly modified from the one on the weighted.hist manual page (help(weighted.hist)):
vals <- sample(1:10, 300, TRUE)
weights <- (101:400)/100
weighted.hist(vals, weights, breaks=1:10, main="Test weighted histogram")
(mean <- weighted.mean(vals, weights))
# [1] 5.246374
The histogram starts at 1, but this corresponds to 0 in the plot's x-axis coordinates, so we need to subtract 1 to get the line in the right place:
abline(v=mean-1, col="red")
Using your data, we need to identify the first break point and adjust the mean so the line plots in the correct location:
wh$breaks[1]
# [1] 52.5
abline(v=mean - wh$breaks[1], col="red")
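Putting the pieces together for the original data, here is a sketch (weighted.mean() also protects against weights that do not sum exactly to 1):
library(plotrix)
vals <- seq(from = 52.5, to = 97.5, by = 5)
weights <- c(18.01, 18.26, 16.42, 14.07, 11.67, 9.19, 6.46, 3.85, 1.71, 0.34) / 100
wm <- weighted.mean(vals, weights)
wh <- weighted.hist(x = vals, w = weights, freq = FALSE)
# the plot's x-axis starts at 0 rather than at the first break,
# so shift the mean before drawing the line
abline(v = wm - wh$breaks[1], col = "red")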

Related

Understanding Sample Sizes in qqplots with R

I'm trying to plot Q-Q graphs with the MASS Boston data set to compare how the plots change with more random data points. I'm looking at the R documentation on qqnorm(), but it doesn't seem to let me select an n value as a parameter. I'd like to plot the Q-Q plots of “random” samples of sizes 10, 100, and 1000, all from a normal distribution, for the same variable, in a 3x1 matrix.
An example would be if I wanted to look at the Q-Q plot for Boston crime; how would I get:
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also, if someone could elaborate on when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot(), as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then passing that variable as the second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
The question is not entirely clear, but to plot samples of a vector, define a vector N of sample sizes and loop through it. The lapply loop below samples from the vector, plots each sample with its Q-Q line, and returns the qqnorm plot; par(mfrow = c(1, 3)) arranges the three plots side by side.
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
par(mfrow = c(1, 3))  # arrange the three plots side by side, as requested
qq_list <- lapply(N, function(n) {
  subtitle <- paste("Sample size:", n)
  i <- sample(nrow(Boston), n, replace = FALSE)
  qq <- qqnorm(Boston$crim[i], sub = subtitle)
  qqline(Boston$crim[i])
  qq
})
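On the qqplot() versus qqnorm() question: qqnorm(x) plots the sample quantiles of x against the theoretical quantiles of a normal distribution, while qqplot(x, y) plots the quantiles of two samples against each other. Passing rnorm(n) as one argument therefore compares your data to one particular random normal sample, which adds sampling noise of its own; it is okay to do, but qqnorm() plus qqline() is the more direct normality check. A minimal sketch of the difference (assuming the MASS package is installed):
x <- MASS::Boston$crim
par(mfrow = c(1, 2))
qqnorm(x, main = "vs theoretical normal quantiles")
qqline(x)
qqplot(rnorm(100), x, main = "vs one random normal sample",
       xlab = "quantiles of rnorm(100)", ylab = "sample quantiles")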

Why is a histogram for normal samples rougher near the mode than near the tail?

I am trying to understand a particular behavior of the histogram of samples generated from rnorm.
set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)
plot.hist <- function(vec, title, brks) {
  h <- hist(vec, breaks = brks, density = 10,
            col = "lightgray", main = title)
  xfit <- seq(min(vec), max(vec), length = 40)
  yfit <- dnorm(xfit, mean = mean(vec), sd = sd(vec))
  yfit <- yfit * diff(h$mids[1:2]) * length(vec)  # scale the density to the counts
  lines(xfit, yfit, col = "black", lwd = 2)
}
par(mfrow = c(2, 2))
plot.hist(x1, title = 'Sample = 1E3', brks = 100)
plot.hist(x2, title = 'Sample = 1E4', brks = 500)
plot.hist(x3, title = 'Sample = 1E5', brks = 1000)
plot.hist(x4, title = 'Sample = 1E6', brks = 1000)
You will notice that in each case (I am not making a cross comparison; I know that as the sample size gets larger, the match between the histogram and the curve improves), the histogram approximates the standard normal better towards the tails but worse towards the mode. Simply put, I'm trying to understand why each histogram is rougher in the middle than in the tails. Is this expected behavior, or have I missed something basic?
Our eyes are fooling us. The density near the mode is high, so we can see the variation more clearly. The density near the tails is so low that we cannot really spot anything. The following code performs a sort of "standardization", allowing us to visualize the variation on a relative scale.
set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)
foo <- function(vec, title, brks) {
  ## bin estimation
  h <- hist(vec, breaks = brks, plot = FALSE)
  ## compute the true probability between adjacent break points
  p2 <- pnorm(h$breaks[-1])
  p1 <- pnorm(h$breaks[-length(h$breaks)])
  p <- p2 - p1
  ## compute the estimated probability between adjacent break points
  phat <- h$counts / length(vec)
  ## compute and plot their absolute relative difference
  v <- abs(phat - p) / p
  ## plot(h$mids, v, main = title)
  ## plotting on the log scale is much better!!
  v.log <- log(1 + v)
  plot(h$mids, v.log, main = title)
  ## invisible return
  invisible(list(v = v, v.log = v.log))
}
par(mfrow = c(2, 2))
v1 <- foo(x1, title = 'Sample = 1E3', brks = 100)
v2 <- foo(x2, title = 'Sample = 1E4', brks = 500)
v3 <- foo(x3, title = 'Sample = 1E5', brks = 1000)
v4 <- foo(x4, title = 'Sample = 1E6', brks = 1000)
The relative variation is lowest near the middle (toward 0) but very high near the two edges. This is well explained in statistics:
We have many samples near the middle, so the ratio (sample sd) : (sample mean) is low there;
We have few samples near the edges, maybe 1 or 2, so the ratio (sample sd) : (sample mean) is large there.
A little explanation of the log-transform I take: v.log = log(1 + v). Its Taylor expansion ensures that v.log is close to v for very small v around 0. As v gets larger, log(1 + v) gets closer to log(v), so the usual log-transform is recovered.
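A quick numeric check of that behaviour (nothing here is specific to the histogram code):
v <- c(0.001, 0.01, 0.1, 1, 10, 100)
cbind(v, log1p = log(1 + v), logv = log(v))  # log(1 + v) tracks v when small, log(v) when large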
This is not just true for normal samples. If we had fixed bins (rather than data-determined ones as we normally do) and we condition on the total number of observations, then the counts would be multinomial.
The expected value of the count in bin i is then n·p(i), where p(i) is the proportion of the population density that falls in bin i.
The variance of the count in bin i would then be n·p(i)·(1−p(i)). With many bins and a smooth, non-peaky density like the normal, (1−p(i)) will be very close to 1, since p(i) is typically small (much smaller than 1/2).
The variance of the count (and hence its standard deviation) is therefore an increasing function of the expected count:
with a fixed bin width, the height is proportional to the expected count, so the standard deviation of the bin height is an increasing function of the height.
This is exactly what you see.
In practice it's not the case that the bin boundaries are fixed; as you add observations or generate a new sample they will change, but the number of bins changes fairly slowly as a function of the sample size (typically as the cube root, or sometimes as the log) and a more sophisticated analysis than the one here is required to get the exact form. However, the outcome is the same -- under commonly observed conditions the variance of the height of a bin typically increases monotonically with the height of the bin.
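A quick simulation illustrates the point; this is a sketch under the fixed-bins assumption, binning many normal samples with the same break points and plotting each bin's spread against its average count:
set.seed(1)
breaks <- seq(-6, 6, by = 0.25)  # fixed bin boundaries, wide enough for all draws
counts <- replicate(500, hist(rnorm(1000), breaks = breaks, plot = FALSE)$counts)
plot(rowMeans(counts), apply(counts, 1, sd),
     xlab = "mean bin count", ylab = "sd of bin count")
The spread grows with the expected count, roughly like its square root, which is the n·p(i)·(1−p(i)) ≈ n·p(i) behaviour described above.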
rnorm() draws a random sample from a normal distribution. The size of the sample is the first argument to rnorm(). So if you do hist(rnorm(10)) you will of course get something that doesn't look much like the normal bell curve because your sample size is so small. If you do hist(rnorm(1000)) it will be better and if you do hist(rnorm(1e8)) your sample should approximate the curve quite well.

R Statistics Distributions Plotting

I am having some trouble with a statistics homework assignment.
I am required to graphically represent the density and the distribution function in two inline plots, for a set of parameters of my choice (there must be a minimum of 4), for the Student, Fisher, and Chi-squared distributions.
Let's take only the example of the Student distribution.
From what I have searched on the internet, I have come up with this:
First, I need to generate some random values.
x <- rnorm(20, 0, 1)
Question 1: Do I need to generate 4 of these?
Then I have to plot these values with:
plot(dt(x, df = 1))
plot(pt(x, df = 1))
But how do I do this for four sets of parameters? They should be represented in the same plot.
Is what I have so far a good approach?
Please tell me if I'm wrong.
To plot several densities of a certain distribution, you first need a support vector, in this case x below.
Then compute the values of the densities with the parameters of your choice.
Then plot them.
In the code that follows, I plot 4 Student-t pdfs, with degrees of freedom 1 to 4.
x <- seq(-5, 5, by = 0.01)  # the support vector
y <- sapply(1:4, function(d) dt(x, df = d))
# open an empty plot first
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5))
for (i in 1:4) {
  lines(x, y[, i], col = i)
}
Then you can make the graph prettier, by adding a main title, changing the axis titles, etc.
If you want other distributions, such as the F or Chi-squared, you will need a strictly positive support, for instance x <- seq(0.0001, 10, by = 0.01).
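Since the assignment also asks for the distribution function, the same pattern works with pt() in a second panel; here is a sketch (the titles and legend are only suggestions):
x <- seq(-5, 5, by = 0.01)
par(mfrow = c(1, 2))
# densities
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5),
     xlab = "x", ylab = "density", main = "Student-t pdf")
for (d in 1:4) lines(x, dt(x, df = d), col = d)
# distribution functions
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 1),
     xlab = "x", ylab = "probability", main = "Student-t cdf")
for (d in 1:4) lines(x, pt(x, df = d), col = d)
legend("bottomright", legend = paste("df =", 1:4), col = 1:4, lty = 1)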

Drawing an interval on the graph using vertical lines in R?

install.packages("devtools")
library(devtools)
devtools::install_github("google/CausalImpact")
library(CausalImpact)
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
plot(impact, "cumulative")
Say I want the graph to show an interval from 71 to 100, with the x scale starting at 1 from the first dotted line. Any ideas on how to do this?
Does anyone have any idea how to add a second vertical dotted line depicting an interval on the graph? Thanks.
You can use abline() to add lines to a graph, with the argument v = 70 setting a vertical line at x = 70. I'm not sure how to restart the x-scale from that point however - it doesn't seem like something that would be possible but perhaps someone else knows how.
You can reset the axes as follows.
In your initial plot command, set xaxt = "n". This ensures that the plot function does not draw the x-axis.
You can then draw the abline(v = 70) as mentioned above.
Then use axis(1, at = seq(60, 80, by = 1), las = 1). The 1 stands for the x-axis, and the at argument takes the tick positions you want; I've put in 60 to 80 as an example.
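Putting those steps together on a plain base-graphics plot, a sketch with a placeholder series (note that plot(impact, ...) in CausalImpact is ggplot2-based, so this applies when you draw the series yourself):
set.seed(1)
y <- cumsum(rnorm(100))  # placeholder series
plot(y, type = "l", xaxt = "n", xlab = "time")  # xaxt = "n" suppresses the default x-axis
abline(v = 71, lty = 2)   # dotted lines marking the interval
abline(v = 100, lty = 2)
axis(1, at = seq(0, 70, by = 10))  # original scale up to the first line
axis(1, at = seq(71, 100, by = 10), labels = seq(1, 30, by = 10))  # restart at 1 after it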

Visualize data using histogram in R

I am trying to visualize some data and in order to do it I am using R's hist.
Below are my data:
jancoefabs <- as.numeric(as.vector(abs(Janmodelnorm$coef)))
jancoefabs
[1] 1.165610e+00 1.277929e-01 4.349831e-01 3.602961e-01 7.189458e+00
[6] 1.856908e-04 1.352052e-05 4.811291e-05 1.055744e-02 2.756525e-04
[11] 2.202706e-01 4.199914e-02 4.684091e-02 8.634340e-01 2.479175e-02
[16] 2.409628e-01 5.459076e-03 9.892580e-03 5.378456e-02
Now, as the more cunning of you might have guessed, these are the absolute values of some model's coefficients.
What I need is a histogram with the following axes:
x will be the coefficients (19 in total), along with their names.
y will show the value of each coefficient (as breaks?), with ylim set according to the min and max of those values (or something similar).
Note that Janmodelnorm$coef simply produces the following
(Intercept) LON LAT ME RAT
1.165610e+00 -1.277929e-01 -4.349831e-01 -3.602961e-01 -7.189458e+00
DS DSA DSI DRNS DREW
-1.856908e-04 1.352052e-05 4.811291e-05 -1.055744e-02 -2.756525e-04
ASPNS ASPEW SI CUR W_180_270
-2.202706e-01 -4.199914e-02 4.684091e-02 -8.634340e-01 -2.479175e-02
W_0_360 W_90_180 W_0_180 NDVI
2.409628e-01 5.459076e-03 -9.892580e-03 -5.378456e-02
So far, consulting ?hist, I have been playing with the code below without success, so I am taking it from scratch.
# hist(jancoefabs, col="lightblue", border="pink",
# breaks=8,
# xlim=c(0,10), ylim=c(20,-20), plot=TRUE)
When plot=FALSE is set, I get a bunch of somewhat useful info about the set. I also find it hard to use the breaks argument efficiently.
Any suggestion will be appreciated. Thanks.
Rather than using hist, why not use a barplot or a standard plot? For example,
## Generate some data
set.seed(1)
y = rnorm(19, sd=5)
names(y) = c("Inter", LETTERS[1:18])
Then plot the coefficients:
barplot(y)
Alternatively, you could use a scatter plot
plot(1:19, y, axes=FALSE, ylim=c(-10, 10))
axis(2)
axis(1, 1:19, names(y))
and add error bars to indicate the standard errors (see for example Add error bars to show standard deviation on a plot in R)
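For completeness, a common base-R way to draw those error bars is with arrows(); the se values below are made-up placeholders:
se <- rep(1, 19)  # placeholder standard errors
mid <- barplot(y, ylim = range(y - 2 * se, y + 2 * se))  # barplot() returns the bar midpoints
arrows(mid, y - se, mid, y + se, angle = 90, code = 3, length = 0.05)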
Are you sure you want a histogram for this? A lattice barchart might be pretty nice. An example with the mtcars built-in data set.
coef <- lm(mpg ~ ., data = mtcars)$coef
library(lattice)
barchart(coef, col = 'lightblue', horizontal = FALSE,
         ylim = range(coef), xlab = '',
         scales = list(y = list(labels = coef),
                       x = list(labels = names(coef))))
A base R dotchart might be good too:
dotchart(coef, pch = 19, xlab = 'value')
text(coef, seq(coef), labels = round(coef, 3), pos = 2)
