plotting two samples against the uniform distribution in R - r

I am very much an R novice so I am guessing this question is rather stupid/simple...
I have two vectors that represent two samples.
I would like to plot each of them (different colors) against the uniform CDF (something like a Q-Q plot).
To be precise, I would like something very similar to plot #7 here (could not find what was used to draw that plot...). Figure 7 is displayed below:
only with multiple samples and some flexibility with changing the axis labels, colors and such.
Could you please point me in a good direction?

For example:
set.seed(10)
N <- 1000
B = rt(N,df=10)
C = rchisq(N,df=10)
op <- layout(matrix(c(1,2),ncol=2,nrow=1))
qqnorm(B,col='green',ylab='student')
qqline(B, col = 2)
qqnorm(C,col='blue',ylab=expression( chi^2 ))
qqline(C, col = 2)

The basics for drawing a QQ-plot are:
qqnorm(variable, main = "QQ Plot", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(variable, col = "red")
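For a uniform reference instead of the normal one, one possible approach (a minimal sketch, not taken from the answers above) is to sort each sample and plot it against the theoretical uniform quantiles from qunif(ppoints(N)). The samples p1 and p2 are hypothetical stand-ins on [0, 1] (for example p-values); their names, the colors and the labels are assumptions for illustration.

```r
set.seed(10)
N <- 1000
# Hypothetical samples on [0, 1]; replace with your own vectors.
p1 <- runif(N)            # roughly uniform
p2 <- rbeta(N, 0.5, 1)    # skewed towards 0

# Theoretical quantiles of the standard uniform distribution.
theo <- qunif(ppoints(N))

plot(theo, sort(p1), col = "green", pch = 20,
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Uniform quantiles", ylab = "Sample quantiles",
     main = "Q-Q plot against U(0, 1)")
points(theo, sort(p2), col = "blue", pch = 20)
abline(0, 1, col = "red")   # reference line y = x
legend("topleft", legend = c("sample 1", "sample 2"),
       col = c("green", "blue"), pch = 20)
```

Both samples must have the same length for a shared theo vector; otherwise compute ppoints() separately for each sample.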

Related

inverse transform cauchy dist r

I'm trying to use the inverse cumulative distribution method to plot a histogram from the standard Cauchy distribution, and I'm getting a strange plot that doesn't look like the textbook standard Cauchy. I think I have my inverse function correct (x = tan(pi*(u - 1/2))), so I would appreciate some help. Here is the R code that I have used:
n <- 10000
u <- runif(n)
c.samp <- sapply(u, function(u) tan(pi*(u - 1/2)))
hist(c.samp, breaks = 90, col = "blue",
main = "Hist of Cauchy")
The resulting plot just doesn't look correct:
Any help is appreciated, thank you.
The histogram and the sampling technique are both correct.
Compare the results with the following (which uses the R Cauchy sampling function).
c.samp2 <- rcauchy(n)
hist(c.samp2, breaks = 90, col = "blue",
main = "Hist of Cauchy 2")
The output here also looks incorrect, but it is not.
First, note that the x-axis range is chosen by default from the most extreme values you happen to draw. As you probably know, the Cauchy distribution is extremely fat-tailed, so very large but rare values are expected. With 10000 samples from the Cauchy distribution, those few extreme observations stretch the axis and squeeze the bulk of the data into a few bars, while the extremes themselves are invisible because each of their bins holds only one or two observations.
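To get a feel for how rare those extreme values are, one can compare the theoretical tail mass with what a sample shows. This is only a small sketch; the cutoff of 50 and the variable names are arbitrary choices for illustration.

```r
set.seed(1)
n <- 10000
x <- rcauchy(n)

cutoff <- 50                          # arbitrary illustrative cutoff
tail_prob <- 2 * pcauchy(-cutoff)     # theoretical P(|X| > cutoff)
tail_frac <- mean(abs(x) > cutoff)    # observed fraction in the sample

range(x)                              # extremes that set the default x-axis
quantile(x, c(0.025, 0.975))          # where the bulk of the data lies
c(theoretical = tail_prob, observed = tail_frac)
```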
The defaults hist uses to choose the bins are also poorly suited to distributions like the Cauchy. Try e.g.
hist(c.samp2, breaks = "FD", col = "blue",
main = "Hist of Cauchy 2",
xlim = c(-500, 500))
I suggest reading the help("hist") page carefully and playing around with the parameters to get a good and useful histogram.
By narrowing the x-axis range, using a probability (density) scale on the y-axis, and adding the theoretical density and a "rug", you get something more useful.
hist(c.samp, breaks = "FD", col = "blue",
main = "Hist of Cauchy distribution",
xlim = c(-50, 50),
freq = FALSE)
curve(dcauchy, add = TRUE, col = "red")
rug(c.samp)
Note that using c.samp or c.samp2 now hardly changes the plot.
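As a numerical check that the inverse transform is right, the sketch below compares it with R's built-in Cauchy quantile function and runs a Kolmogorov-Smirnov test against the standard Cauchy CDF. The seed is an arbitrary assumption; the rest reuses the names from the question.

```r
set.seed(1)
n <- 10000
u <- runif(n)

# tan() is vectorised, so sapply() is not needed.
c.samp <- tan(pi * (u - 1/2))

# The same transform is available as the built-in quantile function.
all.equal(c.samp, qcauchy(u))

# Kolmogorov-Smirnov test against the standard Cauchy CDF.
ks.test(c.samp, "pcauchy")
```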

How to line (cut) a dendrogram at the best K

How do I draw a line in a dendrogram that corresponds to the best K for a given criterion?
Like this:
Let's suppose that this is my dendrogram, and the best K is 4.
data("mtcars")
myDend <- as.dendrogram(hclust(dist(mtcars)))
plot(myDend)
I know that the abline function can draw lines in graphs like the one shown above. However, I don't know how to calculate the height so that the function can be called as abline(h = myHeight)
The information you need to get the heights comes with hclust: its result has a height component containing the merge heights in increasing order. To get 4 clusters, you want to draw your line between the 3rd biggest and 4th biggest height.
HC = hclust(dist(mtcars))
myDend <- as.dendrogram(HC)
par(mar=c(7.5,4,2,2))
plot(myDend)
k = 4
n = nrow(mtcars)
MidPoint = (HC$height[n-k] + HC$height[n-k+1]) / 2
abline(h = MidPoint, lty=2)
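One way to double-check the chosen height is to cut the tree there with cutree() and count the groups. This is a sketch under the same assumptions as the code above (default complete linkage on mtcars, k = 4); rect.hclust() is an optional extra that boxes the clusters.

```r
HC <- hclust(dist(mtcars))
k <- 4
n <- nrow(mtcars)
MidPoint <- (HC$height[n - k] + HC$height[n - k + 1]) / 2

# Cutting the tree at that height should yield k groups.
groups <- cutree(HC, h = MidPoint)
length(unique(groups))

# Same grouping requested directly by number of clusters.
all(groups == cutree(HC, k = k))

plot(HC)
abline(h = MidPoint, lty = 2)
rect.hclust(HC, k = k, border = "red")   # boxes around the k clusters
```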

plot theoretic distribution against the real data histogram on one figure

I want to plot a histogram of real data and compare it with a theoretical normal distribution in one plot, but the two plots end up with different scales.
# ystar holds the real data; you can generate some random data for it.
x<-seq(-4,4,length=200)
y<-dnorm(x,mean=0, sd=1)
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5),ylim=c(0,0.7))
par(new = TRUE)
hist(ystar,xlim = c(-10,10),freq = FALSE,ylim=c(0,0.7),breaks = 50)
Desired output
Assuming that ystar is a vector, you should change this:
y<-dnorm(x,mean=0, sd=1)
To:
y<-dnorm(x,mean=mean(ystar), sd=sd(ystar))
This will produce a distribution function that approximately matches the histogram.
You should then be able to use the same x-limits for both the histogram and the theoretical distribution, which will eliminate the strange overlapping axis labels you have in your current version.
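A minimal sketch of that idea, drawing the curve directly on the histogram's axes with curve(..., add = TRUE) rather than par(new = TRUE). The ystar generated here is a hypothetical stand-in for the real data.

```r
set.seed(42)
# Hypothetical stand-in for the real data.
ystar <- rnorm(500, mean = 2, sd = 3)

# Density-scale histogram, so the bars integrate to 1 like a pdf.
h <- hist(ystar, freq = FALSE, breaks = 50,
          main = "Data vs fitted normal", xlab = "ystar")

# Overlay on the same axes, so both share one x and y scale.
curve(dnorm(x, mean = mean(ystar), sd = sd(ystar)),
      add = TRUE, lwd = 2, col = "red")
```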

How to change values on y-axis for lattice xyplot

I have an xyplot in lattice on which I'm showing four different things. The plot looks like this right now. The values for the pink line range from 1 to 15000, whereas the values for the other lines range from 20 to 300. This is why all lines other than the pink one seem static. There are fluctuations in them, but I feel the graph isn't showing them properly because of the y-axis. Is there a way I can shorten the y-axis so that the graph better represents the other lines as well?
This is how it looks when I don't plot the pink line altogether. This shows there are fluctuations which I'd like to show.
If you can use the base package instead of lattice it is quite simple. The code below is vastly simplified from one of my own plots. You will have to fiddle a little with it to add two more lines.
line  description
1,2   plot from a data frame; ylab will be on side 2 (left); the scale is determined automatically from the data
3     start a second plot
4     plot from a data frame, using axes=FALSE, xlab=NA, ylab=NA
5     create the axis for side 4 (right); the scale is determined automatically from the data
6     make the ylab for side 4
1 plot(df[c(4,5)], type = "s", col = "blue", main = "Battery Life",
2 xlab="minutes", ylab="percent")
3 par(new=TRUE)
4 plot(df[c(4,6)], type = "s", col = "red", axes = FALSE, xlab = NA, ylab = NA)
5 axis(side = 4)
6 mtext(side = 4, line = 3, "Slope ( minutes)")
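Since df is not shown, here is a self-contained sketch of the same two-axis recipe. The battery data frame and its columns are made up for illustration: one series on a 0 to 100 scale and one that is far larger.

```r
# Hypothetical data: one small-scale series and one much larger one.
battery <- data.frame(minutes = 0:60)
battery$percent <- 100 - battery$minutes * 1.2
battery$big     <- 15000 * exp(-battery$minutes / 20)

par(mar = c(5, 4, 4, 5))                       # room for the right-hand axis
plot(battery$minutes, battery$percent, type = "s", col = "blue",
     main = "Battery Life", xlab = "minutes", ylab = "percent")
par(new = TRUE)                                # draw over the same region
plot(battery$minutes, battery$big, type = "s", col = "red",
     axes = FALSE, xlab = NA, ylab = NA)
axis(side = 4)                                 # scale for the second series
mtext("big series", side = 4, line = 3)
```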
You can use the latticeExtra package to create a graph with two separate y-axes.
As the comments suggest, I would rather create two separate plots. It's a cleaner solution.
As an alternative, you could add a conditioning variable to your data ("magnitude" or so) that groups your data into suitable chunks. Then you could present your data as shown below.
library("lattice")
library("latticeExtra")
dat1 <- data.frame(x=1:100, y1=rep(1:10,10), y2=rep(100:91,10))
dat2 <- data.frame(x=1:200, y=c(rep(1:10,10), rep(100:91,10)),
z=c(rep("small",100), rep("huge",100)))
p1 <- xyplot(y1~x, data=dat1, type="l")
p2 <- xyplot(y2~x, data=dat1, type="l")
doubleYScale(p1, p2) # 2 y-axis: bad
xyplot(y ~ x | z, data=dat2, type="l", scales="free") # 2 plots: good

histogram and pdf in the same graph [duplicate]

I'd like to plot a histogram and various pdfs on the same graph. I've tried it for just one pdf with the following code (adapted from code I found on the web):
hist(data, freq = FALSE, col = "grey", breaks = "FD")
.x <- seq(0, 0.1, length.out=100)
curve(dnorm(.x, mean=a, sd=b), col = 2, add = TRUE)
It gives me an error. Can you advise me?
What's the trick for multiple pdfs?
I've also noticed that the histogram seems to plot the density (on the y-axis) instead of the number of observations. How can I change this?
Many thanks!
It plots the density instead of the frequency because you specified freq=FALSE. It is not very fair to complain about it doing exactly what you told it to do.
The curve function expects an expression involving x (not .x) and it does not require you to precompute the x values. You probably want something like:
a <- 5
b <- 2
hist( rnorm(100, a, b), freq=FALSE )
curve( dnorm(x,a,b), add=TRUE )
To head off your next question: if you specify freq=TRUE (or just leave it out, since it is the default) and add the curve, then the curve just runs along the bottom (that is the whole purpose of plotting the histogram as a density rather than as frequencies). You can work around this by scaling the expression given to curve by the width of the bins and the total number of points:
out <- hist( rnorm(100, a, b) )
curve( dnorm(x,a,b)*100*diff(out$breaks[1:2]), add=TRUE )
Though personally the first option (density scale) without tickmark labels on the y-axis makes more sense to me.
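The scaling works because, for equal-width bins, each bar's count equals its density times the number of points times the bin width. A short sketch that checks this identity on the object returned by hist (the names samp, n and binwidth are assumptions for illustration):

```r
set.seed(1)
a <- 5; b <- 2; n <- 100
samp <- rnorm(n, a, b)

out <- hist(samp)                   # frequency scale (the default)
binwidth <- diff(out$breaks[1:2])   # equal-width bins assumed

# counts = density * n * binwidth, hence the scaling of the curve
all.equal(out$counts, out$density * n * binwidth)

curve(dnorm(x, a, b) * n * binwidth, add = TRUE, col = "red")
```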
# Frequency-scale histogram; keep the result to get the bin width.
h <- hist(data, breaks="FD", col="red", xlab="xTitle", main="Normal pdf and histogram")
xfit <- seq(min(data), max(data), length=100)
x.norm <- rnorm(n=100000, mean=a, sd=b)
yfit <- dnorm(xfit, mean=mean(x.norm), sd=sd(x.norm))
# Scale the density to counts: bin width times number of observations.
yfit <- yfit*diff(h$mids[1:2])*length(data)
lines(xfit, yfit, col="blue", lwd=2)
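For several pdfs on one histogram, one approach is simply to call curve() once per density, each with add = TRUE. The sketch below uses made-up gamma data and three candidate densities chosen purely for illustration; none of these choices come from the question.

```r
set.seed(1)
# Hypothetical data; replace with your own vector.
obs <- rgamma(200, shape = 2, rate = 1)

hist(obs, freq = FALSE, col = "grey", breaks = "FD",
     main = "Histogram with candidate densities")

# One curve() call per candidate density, each with add = TRUE.
curve(dnorm(x, mean = mean(obs), sd = sd(obs)), add = TRUE, col = 2)
curve(dgamma(x, shape = 2, rate = 1),           add = TRUE, col = 4)
curve(dexp(x, rate = 1 / mean(obs)),            add = TRUE, col = 3)
legend("topright", legend = c("normal", "gamma", "exponential"),
       col = c(2, 4, 3), lty = 1)
```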
