R histogram - standard deviation of multiple density lines - r

I plotted a histogram below that is the mean density of multiple vectors. The frequency distribution of each vector is shown by the grey lines overlaid on the histogram. Rather than plotting each of these lines, is there a way to plot the standard deviation above and below the mean for the frequency distribution across the vectors? That is, the standard deviation of the grey lines.
I tried getting the density of each vector and calculating the standard deviation of the y variable, but the line from that did not seem to correspond to the mean.
ln <- length(names(data))
hist(data_mean, breaks=100, prob=TRUE)
for( i in 1:ln ) {
lines(density(data[,i], na.rm = TRUE), col="grey", lwd=1)
}
dev.off()

I think the code below will work. In short, I determine the density of each vector, interpolate each one with approx() onto a common vector of x values, combine everything into a matrix, and then calculate the summary statistics and plot. Is this what you were looking to do?
#Make up some fake data (each column is a sample)
mat=matrix(rnorm(5000,2,0.5),ncol=50)
#Determine density of each column
dens=apply(mat, 2, density)
#Interpolate the densities so they all have same x coords
#(yleft/yright = 0 avoids NA where xout falls outside a column's density range)
approxDens=lapply(dens, approx, xout=seq(0.1,3.5,by=0.1), yleft=0, yright=0)
#Create your output matrix, and fill it with values
approxDens2=matrix(0, ncol=ncol(mat), nrow=length(approxDens[[1]]$y))
for(i in seq_along(approxDens)){
  approxDens2[,i]=approxDens[[i]]$y
}
#Determine the mean and sd of density values given an x value
mn = rowMeans(approxDens2)
stdv = apply(approxDens2,1,sd)
#pull out those x values you approx-ed things by for plotting
xx = approxDens[[1]]$x
#plot it out
plot(xx, mn, las=1, ylim=c(0,1), type='l', ylab='Density', xlab='X')
lines(xx, mn+stdv, lty=2);lines(xx, mn-stdv, lty=2)
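A variation on the same idea, as a rough sketch: density() accepts from, to and n, so every column can be evaluated on one shared grid directly, without the approx() step. The grid limits (0 to 4), the grid size (256) and the shaded polygon() band are my own assumptions for the fake data, not part of the answer above.

```r
# Sketch: evaluate every column's density on one shared grid, then summarise.
set.seed(42)
mat <- matrix(rnorm(5000, 2, 0.5), ncol = 50)   # fake data, one sample per column

grid_from <- 0      # assumed grid limits; choose them to cover your data
grid_to   <- 4
n_grid    <- 256    # assumed number of grid points

# Each column of dens_y holds one sample's density evaluated on the shared grid
dens_y <- apply(mat, 2, function(v) {
  density(v, from = grid_from, to = grid_to, n = n_grid)$y
})
xx <- seq(grid_from, grid_to, length.out = n_grid)

mn   <- rowMeans(dens_y)       # mean density at each grid point
stdv <- apply(dens_y, 1, sd)   # sd across samples at each grid point

plot(xx, mn, type = "l", las = 1, ylim = c(0, max(mn + stdv)),
     xlab = "X", ylab = "Density")
# Shaded band for mean +/- 1 sd (lower edge clipped at 0)
polygon(c(xx, rev(xx)), c(mn + stdv, rev(pmax(mn - stdv, 0))),
        col = "grey85", border = NA)
lines(xx, mn)
```

Each column still gets its own default bandwidth here; pass a fixed bw to density() if you would rather smooth them all identically.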

I'm not completely sure what you want, but you can save the values of the density. Try:
x <- rnorm(100)
dens <- density(x)
dens$y

Related

Plot Frequency Distribution of One-Column Data in R

I have a single series of values (i.e. one column of data), and I would like to create a plot with the range of data values on the x-axis and the frequency that each value appears in the data set on the y-axis.
What I would like is very close to a Kernel Density Plot:
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
and to the question "Frequency distribution in R" on Stack Overflow.
However, I would like frequency (as opposed to density) on the y-axis.
Specifically, I'm working with network degree distributions, and would like a double-log scale with open, circular points, i.e. this image.
I've done research into related resources and questions, but haven't found what I wanted:
Cookbook for R's Plotting distributions is close to what I want, but not precisely. I'd like to replace the y-axis in its density curve example with "count" as it is defined in the histogram examples.
The ecdf() function in R (i.e. this question) may be what I want, but I'd like the observed frequency, and not a normalized value between 0 and 1, on the y-axis.
This question is related to frequency distributions, but I'd like points, not bars.
EDIT:
The data is a standard power-law distribution, i.e.
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
The integral of a density is approximately 1, so multiplying the density$y estimate by the number of values gives you something on the scale of a frequency (expected counts per unit of x). If you want a "true" frequency, then you should use a histogram:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d)
This is a histogram with breaks that are 1 unit each:
hist(mtcars$mpg,
breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
So this is the superposed comparison:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d, ylim=c(0,4) )
hist(mtcars$mpg, breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
You'll want to look at the density help page, where the default density bandwidth choice is criticized and alternatives are offered. If you use the adjust parameter, you might see a closer (smoothed) correspondence to the histogram.
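To make the scaling explicit, here is a small sketch. When the histogram bins are not one unit wide, the density has to be multiplied by both the number of values and the bin width to land on the count scale. The bin width of 2 and adjust = 0.5 are arbitrary choices of mine for illustration.

```r
# Sketch: put a kernel density on the same scale as histogram counts.
x <- mtcars$mpg
n <- length(x)
bin_width <- 2                      # assumed bin width, purely illustrative

d <- density(x, adjust = 0.5)       # adjust < 1 smooths less than the default

# Numerical check that the density integrates to roughly 1
dx   <- diff(d$x[1:2])
area <- sum(d$y) * dx

brks <- seq(floor(min(x)), ceiling(max(x)) + bin_width, by = bin_width)
h <- hist(x, breaks = brks, plot = FALSE)

plot(h, main = "Counts with rescaled density", xlab = "mpg")
# density * n * bin_width = expected count per bin
lines(d$x, d$y * n * bin_width, col = "red", lwd = 2)
```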
If you have discrete values for observations and want to make a plot with points on the log scale, then
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
dd<-aggregate(rep.int(1, length(dat))~dat, FUN=sum)
names(dd)<-c("val","freq")
plot(freq~val, dd, log="xy")
might be what you are after.
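An equivalent sketch using table() instead of aggregate(), with open circles (pch = 1) as the question asked for. The variable names val and freq simply mirror the ones above.

```r
# Sketch: observed frequency of each distinct value, plotted on log-log axes.
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)

tab  <- table(dat)              # counts per distinct value
val  <- as.numeric(names(tab))  # the distinct values
freq <- as.vector(tab)          # how often each one occurs

plot(val, freq, log = "xy", pch = 1,
     xlab = "Value", ylab = "Frequency")
```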

Calibration (inverse prediction) from LOESS object in R

I have fit a LOESS local regression to some data and I want to be able to find the X value associated with a given Y value.
plot(cars, main = "Stopping Distance versus Speed")
car_loess <- loess(cars$dist~cars$speed,span=.5)
lines(1:50, predict(car_loess,data.frame(speed=1:50)))
I was hoping that I could use the inverse.predict function from the chemCal package, but that does not work for LOESS objects.
Does anyone have any idea how I might be able to do this calibration in a better way than predicting Y values from a long vector of X values, looking through the resulting fitted Y for the Y value of interest, and taking its corresponding X value?
Practically speaking in the above example, let's say I wanted to find the speed at which the stopping distance is 15.
Thanks!
The predicted line that you added to the plot is not quite right. Use code like this instead:
# plot the loess line
lines(cars$speed, car_loess$fitted, col="red")
You can use the approx() function to get a linear approximation from the loess line at a given y value. It works just fine for the example that you give:
# define a given y value at which you wish to approximate x from the loess line
givenY <- 15
estX <- approx(x=car_loess$fitted, y=car_loess$x, xout=givenY)$y
# add corresponding lines to the plot
abline(h=givenY, lty=2)
abline(v=estX, lty=2)
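Another option, sketched here under the assumption that the fitted curve crosses the target exactly once inside the search interval, is to solve for x numerically with uniroot(). Note that the model is refit with a data argument (dist ~ speed, data = cars) so that predict() can take new speed values; target and f are names I made up for this example.

```r
# Sketch: invert a loess fit numerically with uniroot().
fit <- loess(dist ~ speed, data = cars, span = 0.5)

target <- 15   # stopping distance we want to find the speed for

# Difference between the loess prediction and the target; zero at the solution
f <- function(s) predict(fit, newdata = data.frame(speed = s)) - target

# Search only within the observed speeds (loess does not extrapolate by default)
root <- uniroot(f, interval = range(cars$speed), tol = 1e-8)$root
root
```

uniroot() needs the function to change sign between the interval endpoints and returns a single root, so it has the same limitation for non-monotonic fits.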
But, with a loess fit, there may be more than one x for a given y. The approach I am suggesting does not provide you with ALL of the x values for the given y. For example ...
# example with non-monotonic x-y relation
y <- c(1:20, 19:1, 2:20)
x <- seq(y)
plot(x, y)
fit <- loess(y ~ x)
# plot the loess line
lines(x, fit$fitted, col="red")
# define a given y value at which you wish to approximate x from the loess line
givenY <- 15
estX <- approx(x=fit$fitted, y=fit$x, xout=givenY)$y
# add corresponding lines to the plot
abline(h=givenY, lty=2)
abline(v=estX, lty=2)
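If you do need every x at which the fitted line reaches the given y, one possible sketch is to look for sign changes of fitted - givenY between neighbouring points and interpolate linearly within each such pair. find_crossings is a hypothetical helper written for this example, not a function from any package.

```r
# Sketch: find ALL x positions where a fitted curve crosses a given y value.
# find_crossings() is a hypothetical helper; x must be sorted in increasing order.
find_crossings <- function(x, fitted, y0) {
  d <- fitted - y0
  exact <- x[d == 0]                          # points that hit y0 exactly
  idx <- which(d[-length(d)] * d[-1] < 0)     # sign change between neighbours
  # linear interpolation inside each bracketing pair
  interp <- x[idx] + (x[idx + 1] - x[idx]) *
    (y0 - fitted[idx]) / (fitted[idx + 1] - fitted[idx])
  sort(c(exact, interp))
}

# Applied to the non-monotonic example
y <- c(1:20, 19:1, 2:20)
x <- seq(y)
fit <- loess(y ~ x)
allX <- find_crossings(x, fit$fitted, 15)

plot(x, y)
lines(x, fit$fitted, col = "red")
abline(h = 15, lty = 2)
abline(v = allX, lty = 2)
```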

How to plot exponential function on barplot R?

So I have a barplot in which the y axis is the log of the frequencies. From eyeballing it, the bars appear to decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on the same graph. Thus, if my bars fall below the exponential, I would know that my bars do decrease either exponentially or faster than exponentially, and if the bars lie on top of the exponential, I would know that they don't decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit the density of an exponential function, you should probably plot a density histogram (not a frequency histogram). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
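Since the question plots log frequencies, another way to look at it: if the counts decay like exp(-rate * x), then log(count) is a straight line in x with slope -rate. The sketch below fits that line to binned counts from simulated data. Dropping bins with fewer than 5 observations is an arbitrary choice of mine to keep the noisy tail from dominating.

```r
# Sketch: exponential decay shows up as a straight line on a log-count scale.
set.seed(1)
x.gen <- rexp(10000, rate = 3)

h <- hist(x.gen, plot = FALSE)
keep <- h$counts >= 5              # assumed cutoff for sparse tail bins

log_count <- log(h$counts[keep])
mids      <- h$mids[keep]

log_fit <- lm(log_count ~ mids)
slope   <- unname(coef(log_fit)[2])   # should be close to -rate

plot(mids, log_count, xlab = "x", ylab = "log(count)")
abline(log_fit, col = "red", lwd = 2)
```

Systematic curvature of the points away from the red line would suggest the decay is not exponential.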
One way of visually inspecting if two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when inspecting if a distribution follows standard normal.
The basic idea is to plot your data, against some theoretical quantiles, and if it matches that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same plotting your data against the exponential function you think it will follow.
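For the exponential case, the same comparison might look like the sketch below. It uses qexp() for the theoretical quantiles and estimates the rate as 1 / mean(y), which is my assumption for how you would pick the rate. The data are simulated.

```r
# Sketch: Q-Q plot of data against an exponential distribution.
set.seed(1)
y <- rexp(1000, rate = 3)           # stand-in for your data

rate_hat <- 1 / mean(y)             # simple rate estimate
theo <- qexp(ppoints(length(y)), rate = rate_hat)   # theoretical quantiles

plot(theo, sort(y),
     xlab = "Theoretical Quantiles", ylab = "Actual Quantiles")
abline(0, 1)                        # points near this line suggest a good match

qq_cor <- cor(theo, sort(y))        # rough numeric summary of straightness
```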

plotrix: How to plotCI without the y-values. y-values are outside the confidence interval

How to plotCI without the y-values. I just want the interval to be plotted. This is because my y-values are outside the confidence intervals.
I tried: plotCI(x, y=NULL, ui=U, li=L) where all are numeric vectors; and it did not work
Now, if for one entry, y=2, U=4, and L=3, the interval will go all the way down to 2 (rather than down to 3=L)
What I need is y (where y could be below or above the vertical confidence interval)
Thanks for your help.
Rather than dig into the details of what plotCI does and why it doesn't work with missing y values, I would do the old-fashioned version with arrows():
x <- 1:10
L <- -(1:10)
U <- 1:10
ylim <- range(c(L,U))
plot(x,y=rep(NA,length(x)),type="n",ylim=ylim)
arrows(x,L,x,U,code=3,angle=90,length=0.1)
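If you also want the y values drawn even when they fall outside the interval, a small sketch along the same lines is to plot the points first and add the intervals with arrows() afterwards. The numbers below are made up, with the first entry taken from the question (y=2, L=3, U=4).

```r
# Sketch: intervals drawn independently of the points, so y may lie outside them.
x <- 1:3
L <- c(3, 1, 5)       # lower limits (made-up values)
U <- c(4, 2, 6)       # upper limits
y <- c(2, 2.5, 5.5)   # point estimates; the first two lie outside [L, U]

ylim <- range(c(L, U, y))   # make room for both points and intervals
plot(x, y, pch = 19, ylim = ylim, xlab = "x", ylab = "value")
arrows(x, L, x, U, code = 3, angle = 90, length = 0.1)
```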
See also http://rwiki.sciviews.org/doku.php?id=tips:graphics-base:errbars
Here's an example to add confidence interval using a dot chart:
http://www.r-bloggers.com/r-tutorial-add-confidence-intervals-to-dotchart-2/
example code:
### Create data frame with mean and std dev
x <- data.frame(mean=tapply(mtcars$mpg, list(mtcars$cyl), mean), sd=tapply(mtcars$mpg, list(mtcars$cyl), sd) )
### Add lower and upper levels of confidence intervals
x$LL <- x$mean-2*x$sd
x$UL <- x$mean+2*x$sd
### plot dotchart with confidence intervals
title <- "MPG by Num. of Cylinders with 95% Confidence Intervals"
dotchart(x$mean, col="blue", xlim=c(floor(min(x$LL)/10)*10, ceiling(max(x$UL)/10)*10), main=title )
for (i in 1:nrow(x)){
lines(x=c(x$LL[i],x$UL[i]), y=c(i,i))
}
grid()
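Note that mean ± 2 sd, as used above, describes the spread of the data rather than the uncertainty of the mean. If a 95% confidence interval for each group mean is what you are after, one common choice (an assumption on my part about the intent) is the t-based interval mean ± t * sd / sqrt(n), sketched here:

```r
# Sketch: t-based 95% confidence intervals for group means on a dotchart.
grp <- split(mtcars$mpg, mtcars$cyl)

ci <- data.frame(mean = sapply(grp, mean),
                 sd   = sapply(grp, sd),
                 n    = sapply(grp, length))
ci$se <- ci$sd / sqrt(ci$n)               # standard error of the mean
tcrit <- qt(0.975, df = ci$n - 1)         # two-sided 95% critical values
ci$LL <- ci$mean - tcrit * ci$se
ci$UL <- ci$mean + tcrit * ci$se

dotchart(ci$mean, labels = rownames(ci), col = "blue",
         xlim = range(c(ci$LL, ci$UL)),
         main = "MPG by Num. of Cylinders, 95% CI of the mean")
# One horizontal segment per group, at the same height as its dot
segments(ci$LL, seq_len(nrow(ci)), ci$UL, seq_len(nrow(ci)))
```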

How do I scale the y-axis on a histogram by the x values in R?

I have some data which represents the sizes of particles. I want to plot the frequency of each binned size of particles as a histogram, but scale the frequency by the size of the particle (so it represents the total mass at that size).
I can plot a histogram fine, but I am unsure how to scale the Y-axis by the X-value of each bin.
e.g. if I have 10 particles in the 40-60 bin, I want the Y-axis value to be 10*50=500.
You would do better to use barplot in order to represent the total mass by the area of the bars (i.e. height gives the count, width gives the mass per particle):
sizes <- 3:10 #your sizes
part.type <- sample(sizes, 1000, replace = TRUE) #your particle sizes
count <- table(part.type)
barplot(count, width = sizes)
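If your particle sizes are all different, you should first cut the range into an appropriate number of intervals in order to create the part.type factor:
part <- rchisq(1000, 10)
brks <- seq(min(part), max(part), length.out = 5) # 4 equal-width intervals
part.type <- cut(part, brks, include.lowest = TRUE)
count <- table(part.type)
size <- (head(brks, -1) + tail(brks, -1)) / 2 # bin midpoints, assumed here as the representative size per bin
barplot(count, width = size)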
If the quantity of interest is only the total mass, then the appropriate plot is the dotchart. It is also much clearer than the bar plot for a large number of sizes:
part <- rchisq(1000, 10)
part.type <- cut(part, 20)
count <- table(part.type)
dotchart(count)
Representing the total mass with bins would be misleading because the area of the bins is meaningless.
If you really want to use the midpoint of each bin as a scaling factor:
d<-rgamma(100,5,1.5) # sample
z<-hist(d,plot=FALSE) # make histogram, i.e., divide into bins and count up
co<-z$counts # original counts of each bin
z$counts<-z$counts*z$mids # scaled by mids of the bin
plot(z, xlim=c(0,10),ylim=c(0,max(z$counts))) # plot scaled histogram
par(new=T)
plot(z$mids,co,col=2, xlim=c(0,10),ylim=c(0,max(z$counts))) # overplot original counts
If instead you want to use the actual value of each sample point as a scaling factor:
d<-rgamma(100,5,1.5)
z<-hist(d,plot=FALSE)
co<-z$counts # original counts of each bin
z$counts<-aggregate(d,list(cut(d,z$breaks)),sum)$x # sum up the value of data in each bin
plot(z, xlim=c(0,10),ylim=c(0,max(z$counts))) # plot scaled histogram
par(new=T)
plot(z$mids,co,col=2, xlim=c(0,10),ylim=c(0,max(z$counts))) # overplot original counts
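One caveat with the second version: as far as I know, aggregate() returns rows only for bins that actually contain data, so the assignment to z$counts can end up with the wrong length when a bin is empty. A sketch that keeps one entry per bin uses tapply() on the cut() factor and fills empty bins with 0. The name mass is mine.

```r
# Sketch: per-bin sum of the values, robust to empty bins.
set.seed(1)
d <- rgamma(100, 5, 1.5)
z <- hist(d, plot = FALSE)

# Same binning convention as hist(): right-closed, lowest value included
bins <- cut(d, z$breaks, include.lowest = TRUE)

mass <- as.vector(tapply(d, bins, sum))   # NA for bins with no data
mass[is.na(mass)] <- 0

z$counts <- mass                          # replace counts by summed values
plot(z, ylab = "Sum of values per bin", main = "Histogram scaled by value")
```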
Just hide the axes and replot them as needed.
# Generate some dummy data
datapoints <- runif(10000, 0, 100)
par (mfrow = c(2,2))
# We will plot 4 histograms, with different bin size
binsize <- c(1, 5, 10, 20)
for (bs in binsize)
{
# Plot the histogram. Hide the axes by setting axes=FALSE
h <- hist(datapoints, seq(0, 100, bs), col="black", axes=FALSE,
xlab="", ylab="", main=paste("Bin size: ", bs))
# Plot the x axis without modifying it
axis(1)
# This will NOT plot the axis (lty=0, labels=FALSE), but it will return the tick values
yax <- axis(2, lty=0, labels=FALSE)
# Plot the axis by appropriately scaling the tick values
axis(2, at=yax, labels=yax/bs)
}
