Undertanding Sample Sizes in qqplots with R - r

I'm trying to plot QQ graphs with the MASS Boston data set and comparing how the plots will change with increased random data points. I'm looking at the R documentation on qqnorm() but it doesn't seem to let me select an n value as a parameter? I'd like to plot the QQplots of “random” samples of size 10, 100, and 1000 samples all from a normal distribution for the same variable all in a 3x1 matrix.
Example would be if I wanted to look at the QQplot for Boston Crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)

The question is not clear but to plot samples of a vector, define a vector N of sample sizes and loop through it. The lapply loop will sample from the vector, plot with the Q-Q line and return the qqnorm plot.
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
qq_list <- lapply(N, function(n){
subtitle <- paste("Sample size:", n)
i <- sample(nrow(Boston), n, replace = FALSE)
qq <- qqnorm(Boston$crim[i], sub = subtitle)
qqline(Boston$crim[i])
qq
})

Related

abline() is not working with weighted.hist()

So I used the plotrix library to plot a histogram using some weights , the histogram shows up as expected but when I tried a plot the mean as a vertical line it won't show up at all
Here's a snippet of my code:
library("plotrix")
library("zoom")
vals = seq.int(from = 52.5 , to = 97.5 , by = 5)
weights <- c(18.01,18.26,16.42,14.07,11.67,9.19,6.46,3.85,1.71,0.34)/100
mean <- sum(vals*weights)
wh <- weighted.hist(x = vals , w = weights , freq = FALSE)
abline(v = mean)
the abline() seems to work only with the normal hist() function
I am sorry if the question sounds stupid , I am R newbie however I did my research and could not find any helpful info.
Thanks in advance.
You should provide a sample of your data. Your calculation of the weighted mean is only correct if your weights sum to 1. If they do not, you should use weighted.mean(vals, weights) or sum(vals * weights/sum(weights)). The following example is slightly modified from the one on the weighted.hist manual page (help(weighted.hist)):
vals <- sample(1:10, 300, TRUE)
weights <- (101:400)/100
weighted.hist(vals, weights, breaks=1:10, main="Test weighted histogram")
(mean <- weighted.mean(vals, weights))
# [1] 5.246374
The histogram starts at 1, but this is 0 on the x-axis coordinates so we need to subtract 1 to get the line in the right place:
abline(v=mean-1, col="red")
Using your data we need to identify the first boundary to adjust the mean so it plots in the correct location"
wh$breaks[1]
# [1] 52.5
abline(v=mean - wh$breaks[1], col="red")

Can someone explain what these lines of code mean?

I have been trying to find a way to make a scatter plot with colour intensity that is indicative of the density of points plotted in the area (it's a big data set with lots of overlap). I found these lines of code which allow me to do this but I want to make sure I actually understand what each line is actually doing.
Thanks in advance :)
get_density <- function(x, y, ...){
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
Below is the function with some explanatory comments, let me know if anything is still confusing:
# The function "get_density" takes two arguments, called x and y
# The "..." allows you to pass other arguments
get_density <- function(x, y, ...){
# The "MASS::" means it comes from the MASS package, but makes it so you don't have to load the whole MASS package and can just pull out this one function to use.
# This is where the arguments passed as "..." (above) would get passed along to the kde2d function
dens <- MASS::kde2d(x, y, ...)
# These lines use the base R function "findInterval" to get the density values of x and y
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
# This command "cbind" pastes the two sets of values together, each as one column
ii <- cbind(ix, iy)
# This line takes a subset of the "density" output, subsetted by the intervals above
return(dens$z[ii])
}
# The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!
It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.

R Statistics Distributions Plotting

I am having some trouble with a homework I have at Statistics.
I am required to graphical represent the density and the distribution function in two inline plots for a set of parameters at my choice ( there must be minimum 4 ) for Student, Fisher and ChiS repartitions.
Let's take only the example of Student Repartition.
From what I have searched on the internet, I have come with this:
First, I need to generate some random values.
x <- rnorm( 20, 0, 1 )
Question 1: I need to generate 4 of this?
Then I have to plot these values with:
plot(dt( x, df = 1))
plot(pt( x, df = 1))
But, how to do this for four set of parameters? They should be represented in the same plot.
Is this the good approach to what I came so far?
Please, tell me if I'm wrong.
To plot several densities of a certain distribution, you have to first have a support vector, in this case x below.
Then compute the values of the densities with the parameters of your choice.
Then plot them.
In the code that follows, I will plot 4 Sudent-t pdf's, with degrees of freedom 1 to 4.
x <- seq(-5, 5, by = 0.01) # The support vector
y <- sapply(1:4, function(d) dt(x, df = d))
# Open an empty plot first
plot(1, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5))
for(i in 1:4){
lines(x, y[, i], col = i)
}
Then you can make the graph prettier, by adding a main title, changing the axis titles, etc.
If you want other distributions, such as the F or Chi-squared, you will use x strictly positive, for instance x <- seq(0.0001, 10, by = 0.01).

Plot a step function (cadlag) in R (two dots: continuity and discontinuity point)

Essentially I want to plot a compound Poisson process. Everything works fine except that I don't know how to edit the plot parameters correctly.
I want to have the continuity points with a full dot and the discontinuity points with an empty dot. Right now I only am able to manage the full dot.
Minimal working example (plots an compound Poisson path with 10 jumps)
n <- 10
n.t <- cumsum(rexp(n))
x <- c(0,cumsum(rnorm(n)))
plot(stepfun(n.t, x), xlab="t", ylab="X",do.points = TRUE,pch = 16,col.points = "blue",verticals = FALSE)
So how can I add the discontinuity points to the right? Any idea?
You can use points to add the points after the original plot.
set.seed(2017) ## For reproducibility
## Your code
n <- 10
n.t <- cumsum(rexp(n))
x <- c(0,cumsum(rnorm(n)))
plot(stepfun(n.t, x), xlab="t", ylab="X",
do.points = TRUE,pch = 16,col.points = "blue",verticals = FALSE)
## Add the endpoints
points(n.t, x[-length(x)], pch = 1)

New outliers appear after I remove existing ones using QQ Plot Results

I'm working on the PCA section from Michael Faraway's Linear Models with R (chapter 11, page 164).
PCA analysis is sensitive to outliers and the Mahalanobis distance helps us identify them.
The author checks for outliers by plotting the Mahalanobis distance against the quantiles of a chi-squared distribution.
if require(faraway)==F install.packages("faraway"); require(faraway)
data(fat, package='faraway')
cfat <- fat[,9:18]
n <- nrow(cfat); p <- ncol(cfat)
plot(qchisq(1:n/(n+1),p), sort(md), xlab=expression(paste(chi^2,
"quantiles")),
ylab = "Sorted Mahalanobis distances")
abline(0,1)
I identify the points:
identify(qchisq(1:n/(n+1),p), sort(md))
It appears that the outliers are in rows 242:252. I remove these outliers and re-create the QQ Plot:
cfat.mod <- cfat[-c(242:252),] #remove outliers
robfat <- cov.rob(cfat.mod)
md <- mahalanobis(cfat.mod, center=robfat$center, cov=robfat$cov)
n <- nrow(cfat.mod); p <- ncol(cfat.mod)
plot(qchisq(1:n/(n+1),p), sort(md), xlab=expression(paste(chi^2,
"quantiles")),
ylab = "Sorted Mahalanobis distances")
abline(0,1)
identify(qchisq(1:n/(n+1),p), sort(md))
Alas, it appears now that a new set of points (rows 234:241) are now outliers. This keeps happening every time I remove additional outliers.
Look forward to understanding what I'm doing wrong.
To identify the points correctly, make sure the labels correspond to the positions of the points in the data. The functions order or sort with index.return=TRUE will give the sorted indices. Here is an example, arbitrarily removing the points with md greater than a threshold.
## Your data
data(fat, package='faraway')
cfat <- fat[, 9:18]
n <- nrow(cfat)
p <- ncol(cfat)
md <- sort(mahalanobis(cfat, colMeans(cfat), cov(cfat)), index.return=TRUE)
xs <- qchisq(1:n/(n+1), p)
plot(xs, md$x, xlab=expression(paste(chi^2, 'quantiles')))
## Use indices in data as labels for interactive identify
identify(xs, md$x, labels=md$ix)
## remove those with md>25, for example
inds <- md$x > 25
cfat.mod <- cfat[-md$ix[inds], ]
nn <- nrow(cfat.mod)
md1 <- mahalanobis(cfat.mod, colMeans(cfat.mod), cov(cfat.mod))
## Plot the new data
par(mfrow=c(1, 2))
plot(qchisq(1:nn/(nn+1), p), sort(md1), xlab='chisq quantiles', ylab='')
abline(0, 1, col='red')
car::qqPlot(md1, distribution='chisq', df=p, line='robust', main='With car::qqPlot')

Resources