Histogram Normalization in R

I'm new to R and my question might be a little silly, but any help is appreciated. I want to graphically explore a sample to find an appropriate distribution from which the sample could have been drawn. But when I plot a histogram of the sample, the density of the sample exceeds the theoretical maximum value of 1.
How do I fix this? Do I need to transform the data, or do I have to adjust the bins of the histogram?
My R code:
dataSample = read.table("sample6.txt", fill = TRUE)
sampleMatrix = as.matrix(dataSample)
sampleVector = as.vector(sampleMatrix)
# Compute the histogram without plotting to get the bar densities
h = hist(sampleVector, plot = FALSE)
x = c(min(sampleVector, na.rm = TRUE), max(sampleVector, na.rm = TRUE))
# No variable named density is defined here (density is a base R function),
# so max(density) fails; use the densities stored in h instead
ylim = range(0, max(h$density))
hist(sampleVector, prob = TRUE, col = "lightgreen", xlim = x,
     ylim = ylim, main = "Histogram of data sample", xlab = "sample", ylab = "density")
This is my data sample:
0.5604785 0.0231508 0.2715692 0.2464922 0.2743465
0.434444 0.1779845 1.163666 0.5195378 0.08565649
0.2003622 0.3372351 0.02383633 0.2765776 0.1596984
0.3688299 0.2727399 0.3578011 0.4405475 0.07207568
0.424764 1.406219 1.12157 2.170512 0.6944183
2.429551 0.889546 0.1930762 0.579666 0.06834702
0.03690897 0.391838 1.019549 0.272865 0.1993042
0.02951076 0.3739699 0.2612313 1.988982 1.100386
0.9509101 1.978394 0.2469858 0.1256963 1.645895
0.1024105 0.336701 0.1322722 0.3881196 1.152153
0.6207026 1.506684 0.2826296
Thanks in advance!
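A note on the premise, since it comes up often: a histogram drawn with prob = TRUE (or freq = FALSE) is scaled so that the bar areas sum to 1, not the bar heights. Individual bars can therefore be taller than 1 whenever the bins are narrower than 1, and no transformation of the data is needed. The sketch below uses the sample values posted above, typed in directly rather than read from sample6.txt, and checks the total area.

```r
# Sample values copied from the post (no NA padding, unlike read.table with fill = TRUE)
sampleVector <- c(
  0.5604785, 0.0231508, 0.2715692, 0.2464922, 0.2743465,
  0.434444, 0.1779845, 1.163666, 0.5195378, 0.08565649,
  0.2003622, 0.3372351, 0.02383633, 0.2765776, 0.1596984,
  0.3688299, 0.2727399, 0.3578011, 0.4405475, 0.07207568,
  0.424764, 1.406219, 1.12157, 2.170512, 0.6944183,
  2.429551, 0.889546, 0.1930762, 0.579666, 0.06834702,
  0.03690897, 0.391838, 1.019549, 0.272865, 0.1993042,
  0.02951076, 0.3739699, 0.2612313, 1.988982, 1.100386,
  0.9509101, 1.978394, 0.2469858, 0.1256963, 1.645895,
  0.1024105, 0.336701, 0.1322722, 0.3881196, 1.152153,
  0.6207026, 1.506684, 0.2826296
)

h <- hist(sampleVector, plot = FALSE)

# Bar height times bin width is the bar's area; the areas sum to 1
binWidths <- diff(h$breaks)
totalArea <- sum(h$density * binWidths)

print(max(h$density))  # can be greater than 1 when bins are narrow
print(totalArea)

hist(sampleVector, freq = FALSE, col = "lightgreen",
     ylim = c(0, max(h$density)),
     main = "Histogram of data sample", xlab = "sample", ylab = "density")
```

If the tallest bar is cut off, the thing to adjust is ylim, as done here with max(h$density).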

Related

superimpose normal density curve to histogram malfunctioning (base r)

I am using base R and have some code for teaching about the normal distribution, which I have run successfully many times.
Now, however, when I superimpose the normal density curve, it doesn't seem to work properly.
Here is an example code:
set.seed(100)
data <- rnorm(1000, mean = 0, sd = 1)
hist(data, main = "Normal Distribution", xlab = "X", ylab = "Frequency", col = "444", xlim=c(-4,4))
Now I try to superimpose a density curve over the plot, using the density() command:
lines(density(data), col = "red", lwd = 2)
The line comes out flat, and I am perplexed as to why. So I tried another method:
x <- seq(-4, 4, length.out = 100)
lines(x, dnorm(x, mean = 0, sd = 1), col = "red", lwd = 2)
But I get the same result.
Any thoughts why it's not working properly?
The answer came to me thanks to one of the users' comments.
In base R, the hist() function plots counts by default, not a probability density, and a density is what is needed here. If I set freq = FALSE, the code works.
Here is the corrected code:
set.seed(100)
data <- rnorm(1000, mean = 0, sd = 1)
hist(data, main = "Normal Distribution", xlab = "X", ylab = "Density", col = "444", xlim=c(-4,4), freq = FALSE)
lines(density(data), col ='777', lwd = 2)
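Why the curve looked flat: with the default freq = TRUE the y-axis is in counts (up to a couple of hundred for 1000 points), while density() and dnorm() return values that peak near 0.4, so the curve hugs the x-axis. If a count axis is preferred, one alternative is to scale the density by n times the bin width. This is only a sketch and assumes equal-width bins, which hist() produces by default; the names h, binWidth and scaleFactor are introduced here for illustration.

```r
set.seed(100)
data <- rnorm(1000, mean = 0, sd = 1)

# Keep counts on the y-axis and capture the histogram object
h <- hist(data, main = "Normal Distribution", xlab = "X", ylab = "Frequency",
          col = "grey", xlim = c(-4, 4))

binWidth <- diff(h$breaks)[1]          # assumes equal-width bins
scaleFactor <- length(data) * binWidth

# Rescale the theoretical density so it is in the same units as the counts
x <- seq(-4, 4, length.out = 100)
lines(x, dnorm(x, mean = 0, sd = 1) * scaleFactor, col = "red", lwd = 2)
```

The relation used is counts = density * n * bin width, which is how hist() itself converts between the two scales.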

Histogram overlay not visible

I need to overlay a normal distribution curve based on a dataset on a histogram of the same dataset.
I get the histogram and the normal curve right individually, but the curve stays a flat line when added to the histogram using the add = TRUE argument of the curve function.
I tried adjusting xlim and ylim, but did not get the intended result, and I am confused about how to set the x and y limits to suit both the histogram and the curve.
Any suggestions? My dataset is a set of daily walk distances for 100 individuals, ranging from min = 0.4 km to max = 10 km.
bd.m <- read_excel('walking.xlsx')
hist(bd.m, ylim = c(0,10))
curve(dnorm(x, mean = mean(bd.m), sd = sd(bd.m)), add = TRUE, col = 'red')
You need to set freq = FALSE in the call to hist. For example:
dt <- rnorm(1000, 2)
hist(dt, freq = FALSE)
curve(dnorm(x, mean = mean(dt), sd = sd(dt)), add = TRUE, col = 'red')
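One more thing worth checking in the question's code: read_excel() returns a data frame (a tibble), whereas hist(), mean() and sd() expect a numeric vector, so a column has to be extracted first. A minimal sketch follows, using simulated distances in place of walking.xlsx; the column name distance is an assumption, since the real column name is not given in the question.

```r
set.seed(1)
# Stand-in for read_excel('walking.xlsx'); the column name is hypothetical
bd.m <- data.frame(distance = runif(100, min = 0.4, max = 10))

d <- bd.m$distance                     # numeric vector, not the whole data frame

# Compute the histogram first so ylim can fit both the bars and the curve's peak
h <- hist(d, plot = FALSE)
peak <- dnorm(mean(d), mean = mean(d), sd = sd(d))

hist(d, freq = FALSE, ylim = c(0, max(h$density, peak)),
     main = "Daily walk distance", xlab = "km")
curve(dnorm(x, mean = mean(d), sd = sd(d)), add = TRUE, col = "red")
```

Taking ylim from the larger of the tallest bar and the curve's peak avoids guessing a fixed value such as c(0, 10).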

R - How do I plot a histogram with a specific y-axis (rather than just frequency)?

I'm looking to create a histogram which plots a ranking (1-400ish values) along the x-axis and a frequency per 1000 people on the y-axis. Is there a way to do this with the hist() function? Currently I am using plot.default() like so:
plot.default(frequencyData$Deprivation.rank, frequencyData$Prescription.per.1000.people, xlab =
"Deprivation Rank", ylab = "per 1000", main=graph_title, xlim=NULL, ylim=NULL, type="h")
Once I have this done, I would like to calculate the mean and standard deviation and plot them as lines on the graph I am currently getting with plot.default.
Is anyone able to help?
You didn't provide any data to test with, so the following is untested. A natural starting point is the hist() function.
Here's the documentation, which you can reach by typing ?hist in the R console:
Histograms

Description
The generic function hist computes a histogram of the given data values. If plot = TRUE, the resulting object of class "histogram" is plotted by plot.histogram, before it is returned.

Usage
hist(x, ...)

## Default S3 method:
hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of" , xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, warn.unused = TRUE, ...)
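One caveat: hist() bins a single numeric vector and computes the bar heights itself, and its second positional argument is breaks, so the per-1000 values cannot be passed as a second argument. Since your heights are already computed, you can draw them directly, for example with barplot():
# assumes the rows of frequencyData are ordered by rank
barplot(
  height = frequencyData$Prescription.per.1000.people,
  names.arg = frequencyData$Deprivation.rank,
  xlab = "Deprivation Rank",
  ylab = "per 1000",
  main = graph_title
)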
ggplot alternative
You can obtain a better-looking graph with the ggplot2 package. Because the y values are already computed, use geom_col() rather than geom_histogram(), which bins a single variable and does not accept both x and y:
library(ggplot2)
ggplot(frequencyData, aes(Deprivation.rank, Prescription.per.1000.people)) +
  geom_col() +
  xlab("Deprivation Rank") +
  ylab("per 1000") +
  ggtitle(graph_title)
Hope this helps.
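The question also asked for the mean and standard deviation drawn on the plot. Here is a sketch with made-up data: the values in frequencyData below are simulated, and only the column names come from the question. It keeps the asker's type = "h" plot and adds horizontal reference lines with abline().

```r
set.seed(42)
# Simulated stand-in for the asker's data; only the column names are real
frequencyData <- data.frame(
  Deprivation.rank = 1:400,
  Prescription.per.1000.people = rgamma(400, shape = 5, rate = 0.1)
)

y <- frequencyData$Prescription.per.1000.people
m <- mean(y)
s <- sd(y)

plot(frequencyData$Deprivation.rank, y, type = "h",
     xlab = "Deprivation Rank", ylab = "per 1000")
abline(h = m, col = "red", lwd = 2)                 # mean
abline(h = c(m - s, m + s), col = "red", lty = 2)   # mean plus/minus one sd
```

Here the mean and standard deviation are taken over the per-1000 values. If a rank-weighted summary is what is wanted, the calculation would differ.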

Calculate intersection point of two density curves in R

I have two vectors of 1000 values (a and b), from which I created density plots and histograms. I would like to retrieve the coordinates (or just the y value) where the two plots cross (it does not matter if it detects several crossings, I can discriminate them afterwards). Please find the data in the following link. Sample Data
xlim = c(min(c(a,b)), max(c(a,b)))
hist(a, breaks = 100,
     freq = FALSE,
     xlim = xlim,
     xlab = 'Test Subject',
     main = 'Difference plots',
     col = rgb(0.443137, 0.776471, 0.443137, 0.5),
     border = rgb(0.443137, 0.776471, 0.443137, 0.5))
lines(density(a))
hist(b, breaks = 100,
     freq = FALSE,
     col = rgb(0.529412, 0.807843, 0.921569, 0.5),
     border = rgb(0.529412, 0.807843, 0.921569, 0.5),
     add = TRUE)
lines(density(b))
Using locator() is not optimal, since I need to retrieve this from several plots (but I will use that approach if nothing else is viable). Thanks for your help.
We calculate the density curves for both series, taking care to use the same range. Then, we compare whether the y-value for a is greater than b at each x-value. When the outcome of this comparison flips, we know the lines have crossed.
df <- merge(
as.data.frame(density(a, from = xlim[1], to = xlim[2])[c("x", "y")]),
as.data.frame(density(b, from = xlim[1], to = xlim[2])[c("x", "y")]),
by = "x", suffixes = c(".a", ".b")
)
df$comp <- as.numeric(df$y.a > df$y.b)
df$cross <- c(NA, diff(df$comp))
points(df[which(df$cross != 0), c("x", "y.a")])
which marks the crossing points on the existing plot.
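The same idea in a self-contained form, with simulated a and b standing in for the linked sample data. Evaluating both densities with the same from, to and n puts them on one shared grid, so no merge is needed. The crossing is then refined by linear interpolation between the two grid points where the sign of the difference flips. The interpolation is an addition for illustration, not part of the answer above.

```r
set.seed(123)
# Simulated stand-ins for the asker's vectors
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 2, sd = 1)

lims <- range(c(a, b))
da <- density(a, from = lims[1], to = lims[2], n = 512)
db <- density(b, from = lims[1], to = lims[2], n = 512)

# Same from/to/n gives identical x grids, so the y values line up
d <- da$y - db$y
idx <- which(diff(sign(d)) != 0)       # sign changes between idx and idx + 1

# Linear interpolation of the zero of d within each flagged interval
frac <- d[idx] / (d[idx] - d[idx + 1])
crossX <- da$x[idx] + frac * (da$x[idx + 1] - da$x[idx])
crossY <- da$y[idx] + frac * (da$y[idx + 1] - da$y[idx])

plot(da, main = "Difference plots")
lines(db, lty = 2)
points(crossX, crossY, col = "red", pch = 19)
```

Spurious crossings can appear in the tails where both densities are close to zero; as the question notes, those can be filtered out afterwards, for example by requiring crossY to be above a small threshold.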

Limits and Breaks

For this question, I have been able to split the data into two histograms with one being income above the median and the other being income below median. The following code is what I've done so far:
library(openintro)
data("countyComplete")
attach(countyComplete)
median(median_household_income, na.rm = FALSE)
x<-subset(countyComplete,median_household_income > 42445)
y<-subset(countyComplete,median_household_income < 42445)
par(mfrow=c(1,2))
hist(x$median_household_income, main="Income Above Median" )
hist(y$median_household_income,main = "Income Below Median")
However, I am a bit confused about how to force the histograms to use the same y-axis limits, as well as the same breaks. Could someone point me in the right direction? I tried this:
par(mfrow=c(1,2))
hist(x$median_household_income,
breaks=seq(0,100,by=5),
freq = FALSE,
ylim=c(0,.15),
xlim = range(breaks),
main="Income Above Median")
hist(y$median_household_income, main = "Income Below Median")
But I only get one histogram showing up on my plot screen and the console says
"Error in hist.default(x$median_household_income, breaks = seq(0, 100, :
some 'x' not counted; maybe 'breaks' do not span range of 'x'".
What do I do?
I would forget the breaks argument. It doesn't make much sense here: you are plotting values below and above the median, so the two ranges do not overlap.
As for the y-axis limits, I precompute the median and the maximum density across both histograms, then pass that maximum as ylim to each plot.
library(openintro)
data("countyComplete")
med <- median(countyComplete$median_household_income, na.rm = FALSE)
x <- subset(countyComplete, median_household_income > med)
y <- subset(countyComplete, median_household_income < med)
hx <- hist(x$median_household_income, plot = FALSE)
hy <- hist(y$median_household_income, plot = FALSE)
MaxY <- max(c(hx$density, hy$density))
op <- par(mfrow = c(1, 2))
hist(x$median_household_income, main = "Income Above Median",
freq = FALSE, ylim = c(0, MaxY))
hist(y$median_household_income, main = "Income Below Median",
freq = FALSE, ylim = c(0, MaxY))
par(op)
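If common breaks are still wanted, note that the error in the question comes from breaks = seq(0, 100, by = 5) not covering the income values, which are in dollars (the question uses 42445 as the median). Below is a sketch of one way to build breaks that span both subsets. It uses simulated incomes rather than countyComplete so that it runs without openintro; the names income, above, below and brk are introduced here for illustration.

```r
set.seed(7)
# Simulated incomes standing in for countyComplete$median_household_income
income <- round(rlnorm(3000, meanlog = log(42000), sdlog = 0.25))
med <- median(income)
above <- income[income > med]
below <- income[income < med]

# One set of breaks covering the full range, shared by both histograms
brk <- pretty(range(income), n = 30)

ha <- hist(above, breaks = brk, plot = FALSE)
hb <- hist(below, breaks = brk, plot = FALSE)
maxY <- max(ha$density, hb$density)

op <- par(mfrow = c(1, 2))
hist(above, breaks = brk, freq = FALSE, ylim = c(0, maxY),
     main = "Income Above Median")
hist(below, breaks = brk, freq = FALSE, ylim = c(0, maxY),
     main = "Income Below Median")
par(op)
```

With shared breaks each panel shows empty bins on the other side of the median, which is the trade-off against letting hist() choose breaks per panel as in the answer above.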
