How to draw the sum of two CDFs in R

I am trying to plot a linear combination of standard normal CDFs, 1-Phi(2-x)+Phi(-2-x), where Phi is the CDF of the standard normal distribution. I know how to draw a single term, such as Phi(2-x), and I have included that code below. But how do I draw the sum of two CDFs in R or Python?
My code is
x <- seq(-5, 5, length.out = 500)  # x was not defined in the snippet; this grid is an assumption
m <- 0
s <- 1
z <- pnorm(x,mean=m,sd=s)
plot(2-x, z,type="l",col="blue",lwd=2,las=1, xlab="X")

curve(1 - pnorm(2 - x) + pnorm(-2 - x), from = -10, to = 10)
pnorm is the cumulative distribution function. You can use + and - to combine its values, just as with ordinary numbers. On the ?pnorm help page you can see that the defaults are mean = 0 and sd = 1, so you don't need to specify them.
curve is a handy shortcut, but with a little more work you can use plot as you did in the question:
n <- 500
x <- seq(-5, 5, length.out = n)
plot(x, 1 - pnorm(2 - x) + pnorm(-2 - x),
     type="l",col="blue",lwd=2,las=1, xlab="X")
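As a quick numerical check of the expression from the question, the sum can be wrapped in a small function. This is only a sketch; the name two_sided is a hypothetical label introduced here and does not come from the question or the answer.

```r
# Hypothetical helper: 1 - Phi(2 - x) + Phi(-2 - x), with Phi = pnorm
two_sided <- function(x) 1 - pnorm(2 - x) + pnorm(-2 - x)

# At x = 0 the two tail terms are equal by symmetry of the normal distribution
two_sided(0)
2 * pnorm(-2)

# The function is symmetric in x and approaches 1 for large |x|
two_sided(c(-10, -3, 3, 10))

# curve() also accepts a function directly
curve(two_sided, from = -10, to = 10, ylab = "1 - Phi(2 - x) + Phi(-2 - x)")
```

Wrapping the expression in a function makes it easy to evaluate at single points before plotting.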

Related

abline() is not working with weighted.hist()

I used the plotrix library to plot a histogram using some weights. The histogram shows up as expected, but when I tried to plot the mean as a vertical line, it did not show up at all.
Here's a snippet of my code:
library("plotrix")
library("zoom")
vals = seq.int(from = 52.5 , to = 97.5 , by = 5)
weights <- c(18.01,18.26,16.42,14.07,11.67,9.19,6.46,3.85,1.71,0.34)/100
mean <- sum(vals*weights)
wh <- weighted.hist(x = vals , w = weights , freq = FALSE)
abline(v = mean)
abline() seems to work only with the normal hist() function.
I am sorry if the question sounds stupid. I am an R newbie, but I did my research and could not find any helpful info.
Thanks in advance.
You should provide a sample of your data. Your calculation of the weighted mean is only correct if your weights sum to 1. If they do not, you should use weighted.mean(vals, weights) or sum(vals * weights/sum(weights)). The following example is slightly modified from the one on the weighted.hist manual page (help(weighted.hist)):
vals <- sample(1:10, 300, TRUE)
weights <- (101:400)/100
weighted.hist(vals, weights, breaks=1:10, main="Test weighted histogram")
(mean <- weighted.mean(vals, weights))
# [1] 5.246374
The histogram starts at 1, but this is 0 on the x-axis coordinates so we need to subtract 1 to get the line in the right place:
abline(v=mean-1, col="red")
Using your data, we need to identify the first boundary to adjust the mean so it plots in the correct location:
wh$breaks[1]
# [1] 52.5
abline(v=mean - wh$breaks[1], col="red")
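The remark about weights that do not sum to 1 can be checked with the numbers from the question. The sketch below only compares the two ways of computing the mean; the names m_raw and m_w are introduced here for illustration.

```r
vals <- seq(52.5, 97.5, by = 5)
weights <- c(18.01, 18.26, 16.42, 14.07, 11.67, 9.19, 6.46, 3.85, 1.71, 0.34) / 100

sum(weights)                           # slightly below 1, presumably due to rounding
m_raw <- sum(vals * weights)           # correct only if the weights sum to 1
m_w   <- weighted.mean(vals, weights)  # divides by sum(weights)

c(raw = m_raw, weighted = m_w)
all.equal(m_w, m_raw / sum(weights))
```

Here the difference is small, but using weighted.mean avoids depending on the weights being normalized.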

Why is a histogram for normal samples rougher near the mode than near the tail?

I am trying to understand a particular behavior of the histogram of samples generated from rnorm.
set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)
plot.hist <- function(vec, title, brks) {
  h <- hist(vec, breaks = brks, density = 10,
            col = "lightgray", main = title)
  xfit <- seq(min(vec), max(vec), length = 40)
  yfit <- dnorm(xfit, mean = mean(vec), sd = sd(vec))
  yfit <- yfit * diff(h$mids[1:2]) * length(vec)
  return(lines(xfit, yfit, col = "black", lwd = 2))
}
par(mfrow = c(2, 2))
plot.hist(x1, title = 'Sample = 1E3', brks = 100)
plot.hist(x2, title = 'Sample = 1E4', brks = 500)
plot.hist(x3, title = 'Sample = 1E5', brks = 1000)
plot.hist(x4, title = 'Sample = 1E6', brks = 1000)
You will notice that in each case (I am not making cross comparison; I know that as sample size gets larger the match between histogram and the curve is better), the histogram approximates the standard normal better towards the tails, but poorer towards the mode. Simply put, I'm trying to understand why each histogram is rougher in the middle compared to the tails. Is this an expected behavior or have I missed something basic?
Our eyes are fooling us. The density near the mode is high, so the variation there is easy to see. The density near the tails is so low that we cannot really spot anything. The following code performs a sort of "standardization", allowing us to visualize the variation on a relative scale.
set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)
foo <- function(vec, title, brks) {
  ## bin estimation
  h <- hist(vec, breaks = brks, plot = FALSE)
  ## compute true probability between adjacent break points
  p2 <- pnorm(h$breaks[-1])
  p1 <- pnorm(h$breaks[-length(h$breaks)])
  p <- p2 - p1
  ## compute estimated probability between adjacent break points
  phat <- h$counts / length(vec)
  ## compute and plot their absolute relative difference
  v <- abs(phat - p) / p
  ##plot(h$mids, v, main = title)
  ## plotting on log scale is much better!!
  v.log <- log(1 + v)
  plot(h$mids, v.log, main = title)
  ## invisible return
  invisible(list(v = v, v.log = v.log))
}
par(mfrow = c(2, 2))
v1 <- foo(x1, title = 'Sample = 1E3', brks = 100)
v2 <- foo(x2, title = 'Sample = 1E4', brks = 500)
v3 <- foo(x3, title = 'Sample = 1E5', brks = 1000)
v4 <- foo(x4, title = 'Sample = 1E6', brks = 1000)
The relative variation is the lowest near the middle (toward 0), but very high near the two edges. This is well explained in statistics:
We have more samples near the middle, so (sample sd) : (sample mean) there is lower;
We have few samples near the edge, maybe 1 or 2, so (sample sd) : (sample mean) there is big.
A short note on the log-transform I use:
v.log = log(1 + v). Its Taylor expansion ensures that v.log is close to v for very small v around 0. As v gets larger, log(1 + v) gets closer to log(v), so the usual log-transform is recovered.
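The two regimes of log(1 + v) can be seen on a few values. This sketch uses base R's log1p(), which computes log(1 + v) accurately for small v; the grid of v values is an arbitrary choice for illustration.

```r
v <- c(1e-6, 1e-3, 1, 1e3, 1e6)

# Compare v, log(1 + v) and log(v) side by side
tab <- data.frame(v = v, log1p_v = log1p(v), log_v = log(v))
tab

# For tiny v, log(1 + v) is essentially v; for huge v, it is essentially log(v)
log1p(1e-6) - 1e-6
log1p(1e6) - log(1e6)
```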
This is not just true for normal samples. If we had fixed bins (rather than data-determined ones as we normally do) and we condition on the total number of observations, then the counts would be multinomial.
The expected value of the count in bin i is then n·p(i) where p(i) is the proportion of the population density that falls in bin (i).
The variance of the count in bin i would then be n·p(i)·(1-p(i)). With many bins and a smooth non-peaky density like a normal, (1-p(i)) will be very close to 1; p(i) will typically be smallish (much smaller than 1/2).
The variance of the count (and hence the standard deviation of it) is an increasing function of the expected height:
With a fixed bin width the height is proportional to the expected count and the standard deviation of the bin-height is an increasing function of the height.
So this motivates exactly what you see.
In practice it's not the case that the bin boundaries are fixed; as you add observations or generate a new sample they will change, but the number of bins changes fairly slowly as a function of the sample size (typically as the cube root, or sometimes as the log) and a more sophisticated analysis than the one here is required to get the exact form. However, the outcome is the same -- under commonly observed conditions the variance of the height of a bin typically increases monotonically with the height of the bin.
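The fixed-bin argument above can be made concrete without simulation. The sketch below assumes a standard normal population, a sample size of 1e5 and bins of width 0.1 on [-4, 4]; all three are illustrative choices, not values from the question.

```r
n <- 1e5                             # assumed sample size
breaks <- seq(-4, 4, by = 0.1)       # assumed fixed bin boundaries
p <- diff(pnorm(breaks))             # true probability of each bin
mids <- head(breaks, -1) + 0.05      # bin midpoints

expected <- n * p                    # expected count per bin
sd_count <- sqrt(n * p * (1 - p))    # sd of each count under the multinomial model
rel_sd   <- sd_count / expected      # roughness relative to bar height

centre   <- which.min(abs(mids))     # a bin next to the mode
tail_bin <- 1                        # leftmost bin, far out in the tail

rbind(centre = c(expected = expected[centre], sd = sd_count[centre], rel = rel_sd[centre]),
      tail   = c(expected = expected[tail_bin], sd = sd_count[tail_bin], rel = rel_sd[tail_bin]))
```

Under these assumptions the absolute standard deviation is much larger in the central bin, while the relative standard deviation is much larger in the tail bin, which matches both answers above.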
rnorm() draws a random sample from a normal distribution. The size of the sample is the first argument to rnorm(). So if you do hist(rnorm(10)) you will of course get something that doesn't look much like the normal bell curve because your sample size is so small. If you do hist(rnorm(1000)) it will be better and if you do hist(rnorm(1e8)) your sample should approximate the curve quite well.

R Statistics Distributions Plotting

I am having some trouble with a Statistics homework assignment.
I am required to graphically represent the density and the distribution function in two inline plots, for a set of parameters of my choice (at least 4), for the Student, Fisher and chi-squared distributions.
Let's take only the example of the Student distribution.
From what I have found on the internet, I have come up with this:
First, I need to generate some random values.
x <- rnorm( 20, 0, 1 )
Question 1: Do I need to generate 4 of these?
Then I have to plot these values with:
plot(dt( x, df = 1))
plot(pt( x, df = 1))
But how do I do this for four sets of parameters? They should be represented in the same plot.
Is what I have so far a good approach?
Please, tell me if I'm wrong.
To plot several densities of a certain distribution, you have to first have a support vector, in this case x below.
Then compute the values of the densities with the parameters of your choice.
Then plot them.
In the code that follows, I will plot 4 Student-t pdf's, with degrees of freedom 1 to 4.
x <- seq(-5, 5, by = 0.01) # The support vector
y <- sapply(1:4, function(d) dt(x, df = d))
# Open an empty plot first
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5))
for(i in 1:4){
lines(x, y[, i], col = i)
}
Then you can make the graph prettier, by adding a main title, changing the axis titles, etc.
If you want other distributions, such as the F or Chi-squared, you will use x strictly positive, for instance x <- seq(0.0001, 10, by = 0.01).
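For the chi-squared case, a minimal sketch of the same idea with the density and the distribution function side by side might look like this. The degrees of freedom 1, 2, 4 and 8 are an arbitrary choice, and matplot() is used here instead of the loop over lines() only to keep the example short.

```r
x  <- seq(0.0001, 10, by = 0.01)     # strictly positive support
df <- c(1, 2, 4, 8)                  # assumed degrees of freedom

dens <- sapply(df, function(d) dchisq(x, df = d))  # one column per df
cdf  <- sapply(df, function(d) pchisq(x, df = d))

op <- par(mfrow = c(1, 2))           # two plots side by side
matplot(x, dens, type = "l", lty = 1, col = 1:4, ylim = c(0, 0.5),
        xlab = "x", ylab = "density", main = "Chi-squared density")
legend("topright", legend = paste("df =", df), col = 1:4, lty = 1)
matplot(x, cdf, type = "l", lty = 1, col = 1:4,
        xlab = "x", ylab = "probability", main = "Chi-squared CDF")
par(op)                              # restore the previous layout
```

The same pattern should carry over to the other distributions by swapping in dt()/pt() for Student or df()/pf() for Fisher, with their respective parameters.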

How do I plot a power function 1-\phi(4.65-x/2) in R?

I want to plot a power function in R, namely 1-\Phi(4.65-z/2), where in LaTeX notation \Phi(t) = \int_{-\infty}^{t}\frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2})\,dx.
Can someone explain how to plot this? Is there a specific command for the \Phi function?
This function, \Phi, is the cumulative distribution function of a standard normal random variable, and yes, there is a function for that in R: pnorm (dnorm is the density, not the CDF). Hence,
z <- seq(-2, 20, length = 1000)
plot(z, 1 - pnorm(4.65 - z / 2), type = 'l')
# or also just curve(1 - pnorm(4.65 - x / 2), -2, 20)
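To confirm which R function corresponds to \Phi, one can compare pnorm with a numerical integral of the density dnorm. This is a small sanity-check sketch; the evaluation point z = 12 and the name t0 are arbitrary choices made here.

```r
t0 <- 4.65 - 12 / 2                  # argument of Phi at z = 12, i.e. -1.35

pnorm(t0)                            # CDF value Phi(t0)
integrate(dnorm, -Inf, t0)$value     # integral of the density up to t0
dnorm(t0)                            # density at t0, a different quantity

# Power curve 1 - Phi(4.65 - z/2), built from the CDF
curve(1 - pnorm(4.65 - x / 2), from = -2, to = 20, ylab = "power")
```

The first two numbers should agree up to integration error, while the third does not, so the power curve should be built from pnorm.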

levelplot - how to use it, any simple examples?

I would like to understand how levelplot works. I have almost no experience with plots and R.
What confuses me is how I should interpret, for example, x~y*z.
Let's assume I have a function, and I would like to show how often a certain value occurs by using a 3D plot. I would hence have x = x, y = f(x) and z = count. How can I obtain such a simple plot using levelplot (or something else if it is not appropriate)?
In addition, should I group "count" myself, with 3 columns in my data frame, or just have 2 columns, x and f(x), with duplicates?
Hope my question is clear, I tried to read levelplot documentation, however I could not find any tutorial that teaches basics.
The following example is from the ?levelplot documentation.
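In levelplot, the formula z~x*y means that z is displayed as a function of x and y, which are evaluated on a rectangular grid: x goes on the horizontal axis, y on the vertical axis, and z determines the color of each cell. In a model formula (for example in lm), x*y would instead mean x, y and their interaction, and x+y would mean x and y without the interaction; levelplot only uses the formula to identify the variables.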
You can read more about the formula interface in the help for ?formula.
library(lattice)  # levelplot() is provided by the lattice package
x <- seq(pi/4, 5 * pi, length.out = 100)
y <- seq(pi/4, 5 * pi, length.out = 100)
r <- as.vector(sqrt(outer(x^2, y^2, "+")))
grid <- expand.grid(x=x, y=y)
grid$z <- cos(r^2) * exp(-r/(pi^3))
levelplot(z~x*y, grid, cuts = 50, scales=list(log="e"), xlab="",
ylab="", main="Weird Function", sub="with log scales",
colorkey = FALSE, region = TRUE)
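For the count question, one possible approach is to keep the raw two-column data with duplicates and let table() do the grouping. The sketch below uses made-up data: the vectors x and fx, and the rule used to generate fx, are purely hypothetical stand-ins for the asker's x and f(x).

```r
library(lattice)

set.seed(42)
x  <- sample(1:5, 200, replace = TRUE)
fx <- (x + sample(0:2, 200, replace = TRUE)) %% 4   # hypothetical noisy "f(x)"

# Count how often each (x, f(x)) pair occurs; zero counts are kept,
# so the result covers the full rectangular grid
counts <- as.data.frame(table(x = x, fx = fx))
counts$x  <- as.numeric(as.character(counts$x))     # factor labels back to numbers
counts$fx <- as.numeric(as.character(counts$fx))

# Freq plays the role of z in z ~ x * y
p <- levelplot(Freq ~ x * fx, data = counts, xlab = "x", ylab = "f(x)")
print(p)                                            # needed when run from a script
```

Here the third column is produced by table(), so the grouping does not have to be done by hand.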
