Convert a vector to density vector in R - r

I have a vector
v = [..., -10, -10, -10, ..., 1, 2, 5, 6, 7, 9, ...]
The geom_density plots the histogram of this vector in a smooth fashion, like a density function!
How can I use the auc (area under the curve) function from the MESS library to compute the area under the density curve of such a vector over a given interval, let's say (-1, 3)?

"The geom_density plots the histogram of this vector in a smooth fashion, like a density function!"
Well, that's because geom_density performs a kernel density estimation! So it's not "like a density function", it is a density function.
Under the hood of geom_density it is actually stats::density that performs the density estimation. The kernel density estimates are given such that they define a proper probability density function with unit area under the curve.
We can confirm that by
x <- rnorm(100)
dens <- density(x)
df <- data.frame(x = dens$x, y = dens$y)
sum(df$y) * diff(df$x)[1]
#[1] 1.000952
Close enough.
It's straightforward to integrate the density function over a specific range by summing the corresponding y values in df and multiplying by the grid spacing, just as in the check above; since you don't provide sample data I leave that up to you.
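As a rough sketch of that integration over (-1, 3): the data below are made up (the question gives none) and all variable names are illustrative. It compares a plain Riemann sum on the density grid with a trapezoid rule on the exact interval, and calls MESS::auc only if the MESS package happens to be installed.

```r
set.seed(1)
# Hypothetical stand-in for the question's vector
v <- c(rep(-10, 30), rnorm(200, mean = 4, sd = 3))

dens <- density(v)
step <- diff(dens$x)[1]              # density() returns an equally spaced grid

# Riemann sum over the grid points that fall inside (-1, 3)
inside <- dens$x > -1 & dens$x < 3
area_sum <- sum(dens$y[inside]) * step

# Trapezoid rule on the exact interval, interpolating the density linearly
f <- approxfun(dens$x, dens$y)
xs <- seq(-1, 3, length.out = 401)
ys <- f(xs)
area_trap <- sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)

# Optional: the function the question asks about, if MESS is installed
if (requireNamespace("MESS", quietly = TRUE)) {
  area_auc <- MESS::auc(dens$x, dens$y, from = -1, to = 3)
}
```

The two base R estimates should agree to roughly the grid spacing times the height of the curve, which is usually good enough for a quick check.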

Related

R histogram - standard deviation of multiple density lines

I plotted a histogram below that is the mean density of multiple vectors. The frequency distribution of each vector is shown by the grey lines overlaid on the histogram. Rather than plotting each of these lines, is there a way to plot the standard deviation above and below the mean for the frequency distribution across the vectors? That is, the standard deviation of the grey lines.
I tried getting the density of each vector and calculating the standard deviation of the y variable, but the line from that did not seem to correspond to the mean.
ln <- length(names(data))
hist(data_mean, breaks=100, prob=TRUE)
for( i in 1:ln ) {
lines(density(data[,i], na.rm = TRUE), col="grey", lwd=1)
}
dev.off()
I think the code below will work. In short, I determine the densities of each vector, approx to some known vector of x values, jam it all together in a matrix, and then calculate the summary stats and plot. Is this what you were looking to do?
#Make up some fake data (each column is a sample)
mat=matrix(rnorm(5000,2,0.5),ncol=50)
#Determine density of each column
dens=apply(mat, 2, density)
#Interpolate the densities so they all have same x coords
approxDens=lapply(dens, approx, xout=seq(0.1,3.5,by=0.1))
#Create your output matrix, and fill it with values
approxDens2=matrix(0, ncol=ncol(mat), nrow=length(approxDens[[1]]$y))
for(i in 1:length(approxDens)){
approxDens2[,i]=approxDens[[i]]$y}
#Determine the mean and sd of density values given an x value
mn = rowMeans(approxDens2)
stdv = apply(approxDens2,1,sd)
#pull out those x values you approx-ed things by for plotting
xx = approxDens[[1]]$x
#plot it out
plot(xx, mn, las=1, ylim=c(0,1), type='l', ylab='Density', xlab='X')
lines(xx, mn+stdv, lty=2);lines(xx, mn-stdv, lty=2)
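A possible alternative to the interpolation step, sketched under the assumption that a fixed grid from 0.1 to 3.5 is acceptable: density() accepts from, to and n, so every column can be evaluated on the same x values directly. The data and names below are the same made-up ones as above.

```r
set.seed(1)
# Fake data, one sample per column (same shape as in the answer above)
mat <- matrix(rnorm(5000, 2, 0.5), ncol = 50)

# Common grid: 35 points from 0.1 to 3.5, i.e. steps of 0.1
grid_from <- 0.1
grid_to   <- 3.5
grid_n    <- 35
xx <- seq(grid_from, grid_to, length.out = grid_n)

# Evaluate each column's density on that grid; result is a grid_n x 50 matrix
dens_y <- apply(mat, 2, function(col) {
  density(col, from = grid_from, to = grid_to, n = grid_n)$y
})

# Pointwise mean and standard deviation across the columns
mn   <- rowMeans(dens_y)
stdv <- apply(dens_y, 1, sd)
```

Because all curves share xx, mn and stdv line up point by point, which is what the original attempt in the question was missing.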
I'm not completely sure what you want, but you can save the values of the density. Try
x <- rnorm(100)
dens <- density(x)
dens$y

I am plotting vectors in R in a 2-D Cartesian system. My X and Y arrays are unequal in size, so how do I plot my X and Y vectors?

I am attempting to plot discrete functions in R for a flow model equation. I have to plot the original function u(x) = tanh(x - 0.1), with u(x) on the Y-axis and x on the X-axis. I then must plot a discrete function that describes the slope.
u <- array(0,dim=c(21))
#Plot the original function u(x)=tanh(ax-x0)
curve(tanh(x-0.1), from=0, to=5, n=100, col="red", xlab="x", ylab = "u(x)")
grid (NULL,NULL, col = "lightgray", lty="dotted")
x = seq(0, 5, by=0.25)
for (i in 1:21){
u[i] = tanh(x[i]-0.1)
}
x1 = seq(0, 4.75, by=0.25)
du1 <- array(0,dim=c(20))
for (i in 1:20){
du1[i] = (u[i+1]-u[i])/0.25
}
plot(x1, du1, xlab = "x", ylab = "du/dx")
So per the definition of my derivative function, my du/dx vector will only have 20 vector points, but my x vector still has 21 points. I must then repeat giving defined du/dx vectors that have 19 and 18 vector points. Is there any way I can plot the du/dx vs. x functions all on the same graph without having to redefine x every time?
I'm not sure I'm totally clear on what you're asking, but here's code that prevents you from writing out 18 individual code blocks (using the "diff" function in base).
derivs <- matrix(NA, nrow=21, ncol=18)
x <- seq(0, 5, by=0.25)
orig <- tanh(x-0.1)
derivs[,1] <- c(diff(orig)/.25, NA)
for(col in 2:18) {
print(col)
derivs[,col] <- c((diff(derivs[,col-1])/.25), NA)
}
The resulting matrix (here called "derivs") has a column for each derivative (first column is the first derivative, second is the second derivative, etc.).
One reason I'm a bit confused about what you're trying for is that, if you were to plot all these on one graph, it would be a really weird graph, because the orders of magnitude are really different between the first few and the last few derivatives.
The dimensions aren't really different for each derivative; I've simply padded each one with NAs, which won't appear on a graph.
Also note that you can use the differences argument of the diff function to get second-order differences and so forth.
PS. The graph will probably look more reasonable if, rather than taking the differences as you did (and as I did, to emulate you), so that each difference is assigned to the first x value, you center them. E.g. the first derivative would actually be plotted at .125, .375, etc.
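A small sketch of the last two points, using the question's function and step size; the names d1, d2, x1 and x2 are just illustrative. diff() with differences = 2 gives the second-order differences in one call, and each set of differences gets its own centered x positions, so nothing has to be padded or redefined by hand.

```r
h <- 0.25
x <- seq(0, 5, by = h)
u <- tanh(x - 0.1)

d1 <- diff(u) / h                      # first differences, 20 values
d2 <- diff(u, differences = 2) / h^2   # second differences, 19 values

# x positions matched to each set of differences
x1 <- head(x, -1) + h / 2              # interval midpoints: 0.125, 0.375, ...
x2 <- x[2:(length(x) - 1)]             # interior points: 0.25, 0.5, ...

# Both on one graph
plot(x1, d1, type = "b", xlim = range(x), ylim = range(c(d1, d2)),
     xlab = "x", ylab = "finite difference")
lines(x2, d2, type = "b", lty = 2)
```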

Plot Frequency Distribution of One-Column Data in R

I have a single series of values (i.e. one column of data), and I would like to create a plot with the range of data values on the x-axis and the frequency that each value appears in the data set on the y-axis.
What I would like is very close to a Kernel Density Plot:
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
and Frequency distribution in R on stackoverflow.
However, I would like frequency (as opposed to density) on the y-axis.
Specifically, I'm working with network degree distributions, and would like a double-log scale with open, circular points, i.e. this image.
I've done research into related resources and questions, but haven't found what I wanted:
Cookbook for R's Plotting distributions is close to what I want, but not precisely. I'd like to replace the y-axis in its density curve example with "count" as it is defined in the histogram examples.
The ecdf() function in R (i.e. this question) may be what I want, but I'd like the observed frequency, and not a normalized value between 0 and 1, on the y-axis.
This question is related to frequency distributions, but I'd like points, not bars.
EDIT:
The data is a standard power-law distribution, i.e.
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
The integral of a density is approximately 1 so multiplying the density$y estimate by the number of values should give you something on the scale of a frequency. If you want a "true" frequency then you should use a histogram:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d)
This is a histogram with breaks that are 1 unit each:
hist(mtcars$mpg,
breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
So this is the superposed comparison:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d, ylim=c(0,4) )
hist(mtcars$mpg, breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
You'll want to look at the density help page, where the default bandwidth choice is criticized and alternatives are offered. If you use the adjust parameter you might see a closer (smoothed) correspondence to the histogram.
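As a quick illustration of adjust (the value 0.5 is an arbitrary choice for this sketch, not a recommendation): it multiplies the bandwidth, so values below 1 give a less smooth curve that follows the histogram more closely.

```r
n <- length(mtcars$mpg)

d_default <- density(mtcars$mpg)                # default bandwidth
d_narrow  <- density(mtcars$mpg, adjust = 0.5)  # half the default bandwidth

# Scale both to the frequency scale and overlay them
y_max <- max(d_default$y, d_narrow$y) * n
plot(d_default$x, d_default$y * n, type = "l", ylim = c(0, y_max),
     xlab = "mpg", ylab = "density * n")
lines(d_narrow$x, d_narrow$y * n, lty = 2)
```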
If you have discrete values for observations and want to make a plot with points on the log scale, then
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
dd<-aggregate(rep.int(1, length(dat))~dat, FUN=sum)
names(dd)<-c("val","freq")
plot(freq~val, dd, log="xy")
might be what you are after.
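An equivalent sketch with table(), which counts each distinct value directly; the column names val and freq are kept from the code above.

```r
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)

# Count how often each distinct value occurs
tab <- table(dat)
dd <- data.frame(val  = as.numeric(names(tab)),
                 freq = as.vector(tab))

# Open circles on a double-log scale
plot(freq ~ val, data = dd, log = "xy", pch = 1)
```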

How to plot exponential function on barplot R?

So I have a barplot in which the y axis is the log (frequencies). From just eyeing it, it appears that the bars decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on this same graph. Thus, if my bars fall below the exponential, I would know that my bars decrease either exponentially or faster than exponentially, and if the bars lie on top of the exponential, I would know that they don't decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit the density of an exponential distribution, you should probably plot a density histogram (not a frequency histogram). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
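Since the bars in the question are on a log scale, another rough check is possible: exponential decay shows up as a straight line in log(frequency). The sketch below uses simulated data and illustrative names, and fits a line to the log counts of the non-empty histogram bins. For exponential data the slope should be close to minus the rate, though the sparse tail bins make it noisy.

```r
set.seed(1)
x.gen <- rexp(1000, rate = 3)

# Bin the data without plotting, and drop empty bins (log(0) is -Inf)
h <- hist(x.gen, plot = FALSE)
ok <- h$counts > 0
log_counts <- log(h$counts[ok])
mids <- h$mids[ok]

# Straight-line fit on the log scale
fit <- lm(log_counts ~ mids)
slope <- unname(coef(fit)[2])

plot(mids, log_counts, xlab = "x", ylab = "log(frequency)")
abline(fit, col = "red")
```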
One way of visually inspecting whether two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when checking whether data follow a normal distribution.
The basic idea is to plot your data, against some theoretical quantiles, and if it matches that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same plotting your data against the exponential function you think it will follow.
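A minimal sketch of that idea for the exponential case, with simulated data standing in for yours. The rate is estimated as 1 / mean(y), which is the maximum likelihood estimate for an exponential distribution, and ppoints() supplies probabilities strictly between 0 and 1 so no infinite quantiles need to be dropped.

```r
set.seed(42)
y <- rexp(1000, rate = 3)          # stand-in for the observed data

rate_hat <- 1 / mean(y)            # estimated rate
theo <- qexp(ppoints(length(y)), rate = rate_hat)  # theoretical quantiles

# Points near the line y = x suggest an exponential distribution
plot(theo, sort(y), xlab = "Theoretical Quantiles", ylab = "Actual Quantiles")
abline(0, 1)
```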

Calculating an area under a continuous density plot

I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1,11) + ggtitle("September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot that to start with. Then you can use basic arithmetic to get the estimate. An integral is approximated by adding together the areas of a set of thin strips. I use the mean method for that: the width of each strip is the difference between two x-values, and the height is the mean of the y-values at the beginning and at the end of the interval. I use the rollmean function in the zoo package, but this can be done in base R too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x,Y$y, type="l") #can use ggplot for this too
# set an Avg.position value
Avg.pos <- 1
# construct lengths and heights
xt <- diff(Y$x[Y$x<Avg.pos])
yt <- rollmean(Y$y[Y$x<Avg.pos],2)
# This gives you the area
sum(xt*yt)
This gives you a good approximation, to about three decimal places. If you know the density function, take a look at ?integrate
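For reference, a sketch of the same trapezoid calculation in base R only, with head() and tail() standing in for rollmean; the names xs, ys and area are illustrative.

```r
set.seed(1)
X <- rnorm(100)
Y <- density(X)

Avg.pos <- 1

# Keep the part of the curve to the left of Avg.pos
keep <- Y$x < Avg.pos
xs <- Y$x[keep]
ys <- Y$y[keep]

# Strip widths times the mean of the two neighbouring heights
area <- sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
```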
Three possibilities:
The logspline package provides a different method of estimating density curves, but it does include pnorm style functions for the result.
You could also approximate the area by feeding the x and y variables returned by the density function to the approxfun function and using the result with the integrate function. Unless you are interested in precise estimates of small tail areas (or very small intervals) then this will probably give a reasonable approximation.
Kernel density estimates are just averages of kernels centered at the data points, and the default kernel in density is the normal distribution. You could average the areas from pnorm (or the corresponding function for other kernels) with the sd defined by the bandwidth and centered at your data.
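The second and third possibilities can be sketched as follows, assuming simulated data and the default Gaussian kernel; kde_cdf is a hypothetical helper name introduced only for this example. Both estimate the area below 1, so the two results can be compared.

```r
set.seed(1)
X <- rnorm(100)
d <- density(X)                     # Gaussian kernel by default

# Possibility 2: interpolate the density and integrate it numerically
f <- approxfun(d$x, d$y, yleft = 0, yright = 0)
area_int <- integrate(f, lower = min(d$x), upper = 1,
                      subdivisions = 1000)$value

# Possibility 3: average normal CDFs centered at the data, sd = bandwidth
kde_cdf <- function(q) mean(pnorm(q, mean = X, sd = d$bw))
area_pnorm <- kde_cdf(1)

# An interval such as (-1, 3) is then a difference of two CDF values
area_interval <- kde_cdf(3) - kde_cdf(-1)
```

The pnorm version needs no grid at all, which is why it tends to be the better choice for small tail areas.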
