I have a data set and I'm looking to put a curve through the histogram for some values. Here is my code:
g = na.omit(peeronly$wellbeing1yr)
hist(g)
h <- hist(g, breaks = 10, density = 10, ylim=c(0,70),
col = "red", xlab = "Well-being score", main = " ")
xfit <- seq(min(g), max(g), length = 100)
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g))
yfit <- yfit * diff(h$mids[1:2]) * length(g)
lines(xfit, yfit, col = "blue", lwd = 2)
And here is the output:
Why does the curve not go through the highest bar and why does it go weird towards the end?
There are a couple of things which might stop a histogram fitting a normal distribution. The most obvious one is that your data aren't normally distributed. In your case, it looks as though the distribution is leptokurtic (i.e. too peaked to be normal). You can test for departures from normality with stats::shapiro.test().
The other reason it may not appear to fit well is that the shape of the histogram is sensitive to cuts on the x axis, so playing with these can sometimes give the same data an apparently better fit.
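To illustrate the point about cuts, here is a minimal sketch on simulated data. The vector g below is a made-up, heavier-tailed stand-in for the asker's data, and overlay_normal is a hypothetical helper name. It draws the same sample with two different bin counts and rescales the normal curve by the actual bin width each time, so you can compare how well the curve appears to fit.

```r
set.seed(42)
g <- rt(300, df = 3) * 5 + 50   # simulated, heavier-tailed than normal (assumption)

# Hypothetical helper: histogram of x with a count-scaled normal curve on top
overlay_normal <- function(x, breaks) {
  h <- hist(x, breaks = breaks, col = "red", density = 10,
            xlab = "Well-being score", main = "")
  binwidth <- diff(h$breaks[1:2])            # width of one bar
  xfit <- seq(min(x), max(x), length.out = 100)
  # expected count per bin = density * bin width * sample size
  yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * binwidth * length(x)
  lines(xfit, yfit, col = "blue", lwd = 2)
  invisible(list(hist = h, binwidth = binwidth, peak = max(yfit)))
}

op <- par(mfrow = c(1, 2))
coarse <- overlay_normal(g, breaks = 10)   # few, wide bins
fine   <- overlay_normal(g, breaks = 30)   # more, narrower bins
par(op)

shapiro.test(g)   # a small p-value suggests the data are not normal
```

Note that breaks is only a suggestion to hist(), which is why the bin width is read back from the returned object rather than assumed.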
I would like to shorten the height of my normal dist curve so that the full curve can be seen on the graph.
histCvferr <- hist(cvf_ref_err, breaks = 10, density = 60,
col = "lightgray", xlab = "Residuals", main = "")
xfit <- seq(min(cvf_ref_err), max(cvf_ref_err), length = 40)
yfit <- dnorm(xfit, mean = mean(cvf_ref_err), sd = sd(cvf_ref_err))
yfit <- yfit * diff(histCvferr$mids[1:2]) * length(cvf_ref_err)
lines(xfit, yfit, col = "black", lwd = 2)
As you can see the top part of the curve cuts off
And also how can I change the bins so that they are black outlines with no fill?
Try passing a ylim argument to hist() to widen the range of the y axis so that the whole curve fits, for example:
ylim = c(0, 70)
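Rather than hard-coding the upper limit, one option is to compute it from both the bar heights and the curve. The sketch below uses simulated residuals in place of cvf_ref_err (an assumption, since the real data are not shown). It also touches on the second question: col = NA together with border = "black" should give unfilled bars with black outlines.

```r
set.seed(1)
cvf_ref_err <- rnorm(200, mean = 0, sd = 2)   # simulated stand-in for the real residuals

# Compute the histogram without drawing it, to learn the bar heights first
h <- hist(cvf_ref_err, breaks = 10, plot = FALSE)

xfit <- seq(min(cvf_ref_err), max(cvf_ref_err), length.out = 40)
yfit <- dnorm(xfit, mean = mean(cvf_ref_err), sd = sd(cvf_ref_err))
yfit <- yfit * diff(h$mids[1:2]) * length(cvf_ref_err)

# Leave room for whichever is taller: the tallest bar or the curve's peak
ymax <- 1.05 * max(h$counts, yfit)

# col = NA removes the fill; border sets the outline colour
plot(h, col = NA, border = "black", ylim = c(0, ymax),
     xlab = "Residuals", main = "")
lines(xfit, yfit, col = "black", lwd = 2)
```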
I'm trying to graph two normal distributions over two histograms in the same plot in R. Here is an example of what I would like it to look like:
Here is my current code but I'm not getting the second Normal distribution to properly overlay:
g = R_Hist$`AvgFeret,20-60`
m<-mean(g)
std<-sqrt(var(g))
h <- hist(g, breaks = 20, xlab="Average Feret Diameter", main = "Histogram of 60-100um beads", col=adjustcolor("red", alpha.f =0.2))
xfit <- seq(min(g), max(g), length = 680)
yfit <- dnorm(xfit, mean=mean(g), sd=sd(g))
yfit <- yfit*diff(h$mids[1:2]) * length(g)
lines(xfit, yfit, col = "red", lwd=2)
k = R_Hist$`AvgFeret,60-100`
ms <-mean(k)
stds <-sqrt(var(k))
j <- hist(k, breaks=20, add=TRUE, col = adjustcolor("blue", alpha.f = 0.3))
xfit <- seq(min(j), max(j), length = 314)
yfit <- dnorm(xfit, mean=mean(j), sd=sd(j))
yfit <- yfit*diff(j$mids[1:2]) * length(j)
lines(xfit, yfit, col="blue", lwd=2)
and here is the graph this code is generating:
I haven't yet worked on figuring out how to rescale the axis so any help on that would also be appreciated, but I'm sure I can just look that up! Should I be using ggplot2 for this application? If so how do you overlay a normal curve in that library?
Also as a side note, here are the errors generated from graphing the second (blue) line:
To have them on the same scale, the easiest might be to run hist() first to get the values.
h <- hist(g, breaks = 20, plot = FALSE)
j <- hist(k, breaks = 20, plot = FALSE)
ymax <- max(c(h$counts, j$counts))
xmin <- 0.9 * min(c(g, k))
xmax <- 1.1 * max(c(g,k))
Then you can simply use parameters xlim and ylim in your first call to hist():
h <- hist(g, breaks = 20,
xlab="Average Feret Diameter",
main = "Histogram of 60-100um beads",
col=adjustcolor("red", alpha.f =0.2),
xlim=c(xmin, xmax),
ylim=c(0, ymax))
The errors for the second (blue) line are because you didn't replace j (the histogram object) with k (the raw values):
xfit <- seq(min(k), max(k), length = 314)
yfit <- dnorm(xfit, mean=mean(k), sd=sd(k))
yfit <- yfit*diff(j$mids[1:2]) * length(k)
As for the ggplot2 approach, you can find a good answer here and in the posts linked therein.
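For the base R route, the pieces of this answer can be put together roughly as follows. This is only a sketch: g and k are simulated stand-ins for the two bead columns, and brks and scaled_normal are hypothetical names introduced here. It also uses one shared set of breaks so that both histograms have the same bar width, which is a design choice and not something the original code did.

```r
set.seed(2)
g <- rnorm(680, mean = 40, sd = 8)    # simulated stand-in for the 20-60 column
k <- rnorm(314, mean = 80, sd = 10)   # simulated stand-in for the 60-100 column

# One set of breaks for both samples, so the bars have the same width
brks <- pretty(range(c(g, k)), n = 40)
h <- hist(g, breaks = brks, plot = FALSE)
j <- hist(k, breaks = brks, plot = FALSE)

# Normal curve on the count scale for a sample x and its histogram object
scaled_normal <- function(x, hobj, n = 200) {
  xfit <- seq(min(x), max(x), length.out = n)
  yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) *
    diff(hobj$mids[1:2]) * length(x)
  list(x = xfit, y = yfit)
}
cg <- scaled_normal(g, h)
ck <- scaled_normal(k, j)

# Axis limits that fit both sets of bars and both curves
ymax <- max(h$counts, j$counts, cg$y, ck$y)
plot(h, col = adjustcolor("red", alpha.f = 0.2),
     xlim = range(brks), ylim = c(0, ymax),
     xlab = "Average Feret Diameter", main = "")
plot(j, col = adjustcolor("blue", alpha.f = 0.3), add = TRUE)
lines(cg, col = "red", lwd = 2)
lines(ck, col = "blue", lwd = 2)
```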
I am trying to generate 100 random values from a normal distribution, create a histogram of them, and overlay the density function on the histogram.
So far I have created:
set.seed(123)
rs <- rnorm(100, mean = weighted.mean(femals$Salary), sd = sd(femals$Salary))
h <- hist(rs, col = "lightgray" , density = 50 )
xfit <- seq(min(femals$Salary), max(femals$Salary), length = 40)
yfit <- dnorm(xfit, mean = mean(femals$Salary), sd = sd(femals$Salary))
yfit <- yfit * diff(h$mids[1:2]) * length(femals$Salary)
lines(xfit, yfit, col = "red", lwd = 2)
The result of this is
However, I am unsure whether this is correct. Isn't the density curve way too low for that histogram? Shouldn't the density follow the edges of the histogram? Is this correct, or did I make a mistake in my code?
The mean and standard deviation are:
weighted.mean(femals$Salary) = 5138.852
sd(femals$Salary) = 539.8707
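One possible explanation, offered as a sketch and not as a confirmed diagnosis: the histogram counts the 100 simulated values in rs, so the curve would need to be scaled by length(rs) and drawn over the range of rs, not over femals$Salary. In the example below the salary vector is simulated from the reported mean and standard deviation, and its length of 40 is purely an assumption.

```r
set.seed(123)
# Hypothetical stand-in for femals$Salary (length 40 is an assumption)
salary <- rnorm(40, mean = 5138.852, sd = 539.8707)

rs <- rnorm(100, mean = mean(salary), sd = sd(salary))
h <- hist(rs, col = "lightgray", density = 50)

xfit <- seq(min(rs), max(rs), length.out = 40)
yfit <- dnorm(xfit, mean = mean(salary), sd = sd(salary))
binwidth <- diff(h$mids[1:2])

# Expected count per bin = density * bin width * number of values in the histogram
y_rs     <- yfit * binwidth * length(rs)       # scaled to the 100 plotted values
y_salary <- yfit * binwidth * length(salary)   # too low when length(salary) < length(rs)

lines(xfit, y_rs, col = "red", lwd = 2)
lines(xfit, y_salary, col = "red", lwd = 2, lty = 2)
```

Even with the scaling matched to rs, a sample of only 100 values will not follow the curve exactly, so some bars above and below the line are to be expected.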
I've explored similar questions asked about this topic but I am having some trouble producing a nice curve on my histogram. I understand that some people may see this as a duplicate but I haven't found anything currently to help solve my problem.
Although the data isn't visible here, here are some of the variables I am using, so you can see what they represent in the code below.
Differences <- subset(Score_Differences, select = Difference, drop = T)
m = mean(Differences)
std = sqrt(var(Differences))
Here is the very first curve I produce (the code seems most common and easy to produce but the curve itself doesn't fit that well).
hist(Differences, density = 15, breaks = 15, probability = TRUE, xlab = "Score Differences", ylim = c(0,.1), main = "Normal Curve for Score Differences")
curve(dnorm(x,m,std),col = "Red", lwd = 2, add = TRUE)
I really like this but don't like the curve going into the negative region.
hist(Differences, probability = TRUE)
lines(density(Differences), col = "Red", lwd = 2)
lines(density(Differences, adjust = 2), lwd = 2, col = "Blue")
This is the same histogram as the first, but with frequencies. Still doesn't look that nice.
h = hist(Differences, density = 15, breaks = 15, xlab = "Score Differences", main = "Normal Curve for Score Differences")
xfit = seq(min(Differences),max(Differences))
yfit = dnorm(xfit,m,std)
yfit = yfit*diff(h$mids[1:2])*length(Differences)
lines(xfit, yfit, col = "Red", lwd = 2)
Another attempt but no luck. Maybe because I am using qnorm, when the data obviously isn't normal. The curve goes into the negative direction again.
sample_x = seq(qnorm(.001, m, std), qnorm(.999, m, std), length.out = l)
binwidth = 3
breaks = seq(floor(min(Differences)), ceiling(max(Differences)), binwidth)
hist(Differences, breaks)
lines(sample_x, l*dnorm(sample_x, m, std)*binwidth, col = "Red")
The only curve that visually looks nice is the 2nd, but the curve falls into the negative direction.
My question is: is there a "standard way" to place a curve on a histogram? This data certainly isn't normal. Three of the procedures I presented here are from similar posts, but I am obviously having some trouble. I feel like every method of fitting a curve will depend on the data you're working with.
Update with solution
Thanks to Zheyuan Li and others! I will leave this up for my own reference and hopefully others as well.
hist(Differences, probability = TRUE)
lines(density(Differences, cut = 0), col = "Red", lwd = 2)
lines(density(Differences, adjust = 2, cut = 0), lwd = 2, col = "Blue")
OK, so you are just struggling with the fact that density goes beyond the "natural range". Well, just set cut = 0. You may want to read the question 'plot.density extends "xlim" beyond the range of my data. Why and how to fix it?' for the reason. In that answer I used from and to; here I am using cut.
## consider a mixture, that does not follow any parametric distribution family
## note, by construction, this is a strictly positive random variable
set.seed(0)
x <- rbeta(1000, 3, 5) + rexp(1000, 0.5)
## (kernel) density estimation offers a flexible nonparametric approach
d <- density(x, cut = 0)
## you can plot histogram and density on the density scale
hist(x, prob = TRUE, breaks = 50)
lines(d, col = 2)
Note that with cut = 0 the density is evaluated strictly within range(x), so the curve is not drawn outside that range.
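As a quick check of what cut does, the sketch below compares the evaluation grid of the default estimate with the cut = 0 one, reusing the simulated mixture from above. The names d_default and d_cut0 are introduced here only for illustration.

```r
set.seed(0)
x <- rbeta(1000, 3, 5) + rexp(1000, 0.5)

d_default <- density(x)           # default cut = 3: grid extends 3 bandwidths past the data
d_cut0    <- density(x, cut = 0)  # grid stops at the data extremes

range(x)             # range of the raw data
range(d_default$x)   # wider than the data, can dip below zero
range(d_cut0$x)      # same as range(x)
```

Keep in mind that cut only truncates where the curve is evaluated. The kernel estimate itself is unchanged, so the truncated curve need not integrate to exactly 1.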
I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
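If the distribution is not standard normal, the same idea can be generalised by computing the tick positions from the mean and standard deviation. A minimal sketch, where mu = 100 and sigma = 15 are made-up values:

```r
mu <- 100     # hypothetical mean
sigma <- 15   # hypothetical standard deviation

x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
y <- dnorm(x, mean = mu, sd = sigma)

# Tick positions: the mean and 1 to 3 standard deviations on either side
at <- mu + (-3:3) * sigma

plot(x, y, type = "l", lwd = 2, xaxt = "n", xlab = "", ylab = "Density")
axis(1, at = at, labels = at)
```

Here xaxt = "n" suppresses only the default x axis, so the y axis and the box are still drawn.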
If you prefer the hard way, without using R's built-in function, or you want to do this outside R, you can use the formula for the normal density directly.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
An extremely inefficient and unusual, but beautiful solution, which works based on the ideas of Monte Carlo simulation, is this:
Simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes as arguments (A,B,C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
Plot the density of these draws using density.
Thus to take a sample of size 50,000 from a standard normal (i.e, a normal with mean 0 and standard deviation 1), and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, the estimated density converges to the normal density. To illustrate this, see the image below, which shows, from left to right and top to bottom, 5,000, 50,000, 500,000, and 5 million samples.
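The convergence can also be checked numerically. The sketch below is an illustration only: max_gap is a hypothetical helper and the sample sizes are arbitrary. It measures the largest gap between the kernel density estimate and dnorm for a small and a large sample.

```r
set.seed(1)

# Largest gap between the kernel density estimate and the true N(0, 1) density
max_gap <- function(n) {
  d <- density(rnorm(n, 0, 1))
  max(abs(d$y - dnorm(d$x)))
}

sizes <- c(100, 100000)
gaps <- sapply(sizes, max_gap)
names(gaps) <- sizes
gaps   # the gap should shrink as n grows

# Visual check for the larger sample
x <- rnorm(100000, 0, 1)
plot(density(x), main = "KDE of 100,000 draws vs. dnorm")
curve(dnorm(x), add = TRUE, col = "red", lty = 2)
```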
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly love lattice for this goal. It makes it easy to add graphical information such as specific areas under a curve, which you usually need when dealing with probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters to suit your problem.
Axis labels are automatic.
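The shaded area itself can be computed with pnorm. A short sketch, where a and b are just the two z values used in the plot above:

```r
a <- 1
b <- 1.5

# P(a < Z < b) for a standard normal Z: difference of two cumulative probabilities
p <- pnorm(b) - pnorm(a)
round(p, 4)
```

This gives roughly 0.092, i.e. about 9% of the total area lies between z = 1 and z = 1.5.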
This is how to write it as a function:
normalCriticalTest <- function(mu, s) {
  x <- seq(-4, 4, length = 200)  # x extends from -4 to 4
  # y follows the formula of the normal distribution: f(Y)
  y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
  plot(x, y, type = "l", lwd = 2, xlim = c(-3.5, 3.5))
  # draw vertical lines at the critical values, leaving 2.5% of the area in each tail
  abline(v = c(-1.96, 1.96), col = "red")
}
normalCriticalTest(0, 1)  # draw a standard normal distribution with vertical lines
Final result:
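The function above hard-codes the x range and the critical values, so the lines only mark 2.5% tails for mu = 0 and s = 1. A possible generalisation is sketched below. It is not part of the original answer: normalCriticalPlot is a hypothetical name, and the alpha argument is an addition.

```r
# Hypothetical variant: x range and critical values follow mu, s and alpha
normalCriticalPlot <- function(mu = 0, s = 1, alpha = 0.05) {
  x <- seq(mu - 4 * s, mu + 4 * s, length.out = 200)
  y <- dnorm(x, mean = mu, sd = s)
  # two-sided critical values, leaving alpha / 2 of the area in each tail
  crit <- qnorm(c(alpha / 2, 1 - alpha / 2), mean = mu, sd = s)
  plot(x, y, type = "l", lwd = 2)
  abline(v = crit, col = "red")
  invisible(crit)   # return the critical values without printing them
}

crit <- normalCriticalPlot(10, 2)
crit
```

With the defaults, qnorm(0.975) is about 1.96, which matches the values hard-coded in the original function.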