How to calculate x-values of the convolution of two distributions? - r

(This question may be better suited for https://stats.stackexchange.com/, but I think my question really comes down to how to calculate what I want in R.)
I'm trying to add multiple distributions together, and then look at the resulting distribution. I'll illustrate my problem with a simple example using normally distributed random variables, p1 and p2.
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
Which we can plot:
library(dplyr)
library(tidyr)
library(ggplot2)

data.frame(p1, p2) %>%
  gather(key = "dist", value = "value") %>%
  ggplot(aes(value, color = dist)) + geom_density()
I can add these distributions together using convolve. Okay, so that's fine. But what I can't figure out is how to plot the sum of the distributions with the appropriate x-values. In the examples I've seen, it looks like the x-values are manually added in a way that doesn't seem "accurate", for lack of a better word. See this Example.
I can "add" them together and plot:
pdf.c <- convolve(pdf1.y, pdf2.y, type = "open")
plot(pdf.c, type="l")
My question is how to get the corresponding x-values of the new distribution. I'm sure I'm missing something from a foundational statistics point of view.
Appendix for pdf1 and pdf2:
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
d1 <- density(p1)
d2 <- density(p2)
pdf1.x <- d1$x
pdf2.x <- d2$x
pdf1.y <- d1$y / sum(d1$y)
pdf2.y <- d2$y / sum(d2$y)
df1 <- data.frame(pdf.x = pdf1.x, pdf.y = pdf1.y, dist = "1", stringsAsFactors = FALSE)
df2 <- data.frame(pdf.x = pdf2.x, pdf.y = pdf2.y, dist = "2", stringsAsFactors = FALSE)
df <- bind_rows(df1, df2)

Assume that p1 and p2 are discretized uniformly, with the same interval dx between successive x values. (I see that you have discretized p1 and p2 at random points -- that's not the same, and, without thinking about it some more, I don't have an answer for that.) Let x1_k = x1_0 + (k - 1)*dx, k = 1, 2, 3, ..., n1 be the points at which p1 is discretized, and x2_k = x2_0 + (k - 1)*dx, k = 1, 2, 3, ..., n2 be the points at which p2 is discretized.
Each point xi_k = xi_0 + (k - 1)*dx represents the center of a bar of width dx and height pi(xi_k), i = 1, 2. The mass of that bar is dx*pi(xi_k), and the total mass over all bars approaches 1 as dx approaches 0. These masses are the values which are convolved. If the discretized masses are normalized to sum to 1, then their convolution is also normalized to 1.
To be very careful, the range over which distribution i is discretized runs from xi_0 - dx/2 to xi_0 + (ni - 1)*dx + dx/2. After computing the convolution, the range of the result likewise extends dx/2 below its first point and dx/2 above its last point.
The convolution has n = n1 + n2 - 1 points, namely x1_0 + x2_0 + (k - 1)*dx, k = 1, 2, 3, ..., n1 + n2 - 1. The first point is x1_0 + x2_0 (i.e. the first point for p1 plus the first point for p2) and the last point is x1_0 + x2_0 + (n1 + n2 - 2)*dx = (x1_0 + (n1 - 1)*dx) + (x2_0 + (n2 - 1)*dx) (i.e. the last point for p1 plus the last point for p2). From this you can construct the x values corresponding to the convolution with seq or something like that.
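For illustration, here is a minimal sketch of that construction applied to the example in the question. It assumes both densities are evaluated on grids with the same spacing dx (forced here by giving density() the same n and equally wide from/to ranges); note that R's convolve() computes the usual convolution of x and y as convolve(x, rev(y), type = "open"):
set.seed(21)
p1 <- rnorm(1000, mean = 0,  sd = 1)
p2 <- rnorm(1000, mean = 10, sd = 1)

# Evaluate both densities on grids with identical spacing dx
n  <- 512
d1 <- density(p1, n = n, from = -5, to = 5)
d2 <- density(p2, n = n, from =  5, to = 15)
dx <- d1$x[2] - d1$x[1]                    # same spacing for d2 by construction

# Convert the densities to (approximate) probability masses and convolve
m1 <- d1$y * dx
m2 <- d2$y * dx
mc <- convolve(m1, rev(m2), type = "open") # n1 + n2 - 1 masses

# x values: the first point is d1$x[1] + d2$x[1], spaced by dx
xc <- d1$x[1] + d2$x[1] + (seq_along(mc) - 1) * dx

plot(xc, mc / dx, type = "l")              # divide by dx to plot on a density scale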

Related

Smooth change of day length

I want to model what it might look like to have the day length change smoothly over time (but remain sinusoidal). The formula for a "chirp", to change the instantaneous frequency is given at https://en.wikipedia.org/wiki/Chirp but it doesn't look right when coded for a 24h period over 5 days and then a transition to 12h over another 5 days:
period = list( c(24,24,5), c(24,12,5) )
alpha = list( c(0,5), c(0,5) )
s_samples = 100
A=50
O=50
simulatedData = data.frame(t=numeric(), v=numeric()) #initialise the output
daySteps = c(0, cumsum(unlist(period)[seq(3, length(unlist(period)), by=3)])) # set up the period starts and ends to loop over, starting at 0
## Cycle over each of the items in the list
for (set in seq(period)) {
  t_points = s_samples*period[[set]][3]
  t = seq(daySteps[set], daySteps[set+1], length.out=t_points) # make the time vector
  slope = (24/period[[set]][2] - 24/period[[set]][1])/(max(t)-min(t)) # get the slope
  f0 = 24/period[[set]][1] - slope*min(t) # find the freq at t0
  c = (24/period[[set]][2] - f0)/max(t) # calculate the chirp, see https://en.wikipedia.org/wiki/Chirp and https://dsp.stackexchange.com/questions/57904/chirp-after-t-seconds
  wt = ((c*(t^2))/2) + f0*t # accumulated phase in cycles
  a = alpha[[set]][1]
  v = A * cos(2*pi*wt - a) + O
  simulatedData = rbind(simulatedData, data.frame(t, v))
}
plot(simulatedData, type="l", lwd=2)
t = seq(0,sum(unlist(period)[seq(3,length(unlist(period)), by=3)]), by=1/24)
points(t, A*cos(2*pi*t)+O, col=3, type="l", lty=2)
points(t, A*cos(2*(24/12)*pi*t)+O, col=4, type="l", lty=2)
The first 24 h section is perfect, as expected, and the last part of the second 5 days matches a 12 h cycle, but the first part of that period looks 180 degrees out of phase. What's wrong?
I think you're making this a lot more complex than it needs to be. Remember that many R functions are already vectorized. The following function will produce a linear chirp between frequencies f0 and f1 between t0 and t1, with an optional phi parameter to specify at what point on the cycle you want your sequence to begin:
chirp <- function(f0, f1, t0, t1, phi = 0, n_steps = 1000)
{
  C <- (f1 - f0) / (t1 - t0)
  x <- seq(t0, t1, length.out = n_steps)
  y <- sin(2 * pi * (C / 2 * (x - t0)^2 + f0 * (x - t0)) + phi) # Ref Wikipedia
  data.frame(x, y)
}
Of course, it can also produce the static first half of your plot by "chirping" between two identical frequencies, so we can get a data frame of x, y points on the plot by doing
df <- rbind(chirp(1, 1, 0, 5), chirp(1, 2, 5, 10))
Which results in:
plot(df$x, df$y, type = "l")
Note that between 5 and 10 days there are 7.5 cycles, so if you wanted to smoothly continue frequency 2, you would need to set the phi parameter to a half cycle (i.e. to pi):
df <- rbind(df, chirp(2, 2, 10, 15, phi = pi))
plot(df$x, df$y, type = "l")
Note that the phases of the chirped signal and a 2 Hz signal will only match after n seconds if the chirp occurs over an even number of periods of the original signal. For an odd number, the phase will be out by 180 degrees. This is a mathematical consequence of a linear chirp. To see this, let's use our function to chirp over 6 seconds so the phases match at 10 seconds:
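One way to set this up (the original definitions of df2 and df3 are not shown here, so the following are illustrative guesses: the chirped signal is rebuilt with the chirp spread over the 6 seconds from t = 4 to t = 10, and df2 and df3 are pure 1 Hz and 2 Hz references):
df  <- rbind(chirp(1, 1, 0, 4), chirp(1, 2, 4, 10), chirp(2, 2, 10, 15)) # chirp now spans 6 s (9 cycles)
df2 <- chirp(1, 1, 0, 15)  # 1 Hz reference
df3 <- chirp(2, 2, 0, 15)  # 2 Hz reference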
plot(df$x, df$y, type = "l")
lines(df2$x, df2$y, lty = 2, col = "green")
lines(df3$x, df3$y, lty = 2, col = "blue")
lines(df$x, df$y)
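For reference, the arithmetic behind that statement: a linear chirp from f0 to f1 over T seconds accumulates (f0 + f1)/2 * T cycles, which, for a whole number of days T, is itself a whole number only when T is even. A tiny helper to check (illustrative only):
cycles <- function(f0, f1, T) (f0 + f1) / 2 * T
cycles(1, 2, 5)  # 7.5: half a cycle off, hence phi = pi above
cycles(1, 2, 6)  # 9: a whole number, so the phases match at the end of the chirp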

Mean distance of the points from a fixed target point

I have some points in my data and I am trying to find the mean (arithmetic mean) distance to a target point.
I am taking two routes here:
One, using the 'Distance Between Two Points' formula to compute the distance between each point to the target and then getting a mean distance of those distance values.
Other, find the mean point of all points and then find the distance between this mean point and the target point.
I am not sure which approach is correct (they give different answers) if what I want is the average distance of all points to a target point.
My target point is in green and mean point is in red.
Following is my R code:
library(ggplot2)

# three data points
a <- c(1.6, 2.3, 3.4)
b <- c(3.1, 4.1, 0.5)
# target point
t_x <- 1.1
t_y <- 0.1
df <- data.frame("x" = a, "y" = b)
# mean of the distances
df$distance <- sqrt(((df$x - t_x)^2) + ((df$y - t_y)^2))
print(mean(df$distance))
# distance from the mean point to the target
mean_x <- mean(df$x)
mean_y <- mean(df$y)
print(sqrt((mean_x - t_x)^2 + (mean_y - t_y)^2))
# plotting all
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(-5, 5), ylim = c(-5, 5)) +
  geom_point(aes(x = mean_x, y = mean_y), color = "red") +
  geom_point(aes(x = t_x, y = t_y), color = "green")
Just consider the case with one target point and two other points
Target: (0,0)
Point1: (-1,0)
Point2: (1, 0)
If you take the average of Points 1 and 2, you get (0,0), so the distance from the average point to the target is 0. But the distance from each of the points to the target is 1, so the mean distance is 1. In general these two calculations are quite different.
It really just comes down to how you define the problem. In this example, do you expect the answer to be 0 or 1?
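A quick numeric check of this two-point case (a minimal sketch in base R):
target <- c(0, 0)
pts <- rbind(c(-1, 0), c(1, 0))

# Route 1: mean of the distances from each point to the target
d <- sqrt(rowSums(sweep(pts, 2, target)^2))
mean(d)                                   # 1

# Route 2: distance from the mean point to the target
sqrt(sum((colMeans(pts) - target)^2))     # 0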
Here's an intuitive explanation of why your first method is the correct one. Imagine your target is at (0, 0):
t_x <- 0
t_y <- 0
Now suppose we draw some points around it that are all the same distance from it - in fact, they all lie on the unit circle and by definition are a distance of 1 from the target:
library(ggplot2)
t_x <- 0
t_y <- 0
rads <- seq(0, 2 * pi, length.out = 17)[-17]
df <- data.frame(x = cos(rads), y = sin(rads), xend = 0, yend = 0)
ggplot(df, aes(x, y)) +
  geom_point(col = "red") +
  geom_segment(aes(xend = xend, yend = yend), linetype = 2) +
  coord_equal() +
  geom_point(x = t_x, y = t_y, size = 5, colour = "red")
Now, not surprisingly, since all points are distance 1 from the target, the average distance by Pythagoras will also be 1:
# mean of the distances
df$distance <- sqrt(((df$x - t_x)^2) + ((df$y - t_y)^2))
print(mean(df$distance))
#> [1] 1
But now consider what happens if we take the average of all our x values - they cancel to 0. So do the y values, so the average of all points is (0, 0). When you measure the distance from (0, 0) to (0, 0), the answer, of course, is 0:
mean_x <- mean(df$x)
mean_y <- mean(df$y)
print(sqrt((mean_x - t_x)^2 + (mean_y - t_y)^2))
#> [1] 0
Created on 2020-08-22 by the reprex package (v0.3.0)

Simulate an AR(1) process with uniform innovations

I need to plot an AR(1) graph for the process
y[k] = 0.75 * y[k-1] + e[k] for y0 = 1.
Assume that e[k] is uniformly distributed on the interval [-0.5, 0.5].
I am trying to use arima.sim:
library(tseries)
y.0 <- arima.sim(model=list(ar=.75), n=100)
plot(y.0)
It does not seem correct. Also, what parameters do I change if y[0] = 10?
You can use the base R function arima.sim for this task; no extra libraries are required.
By default, arima.sim generates an ARIMA process with innovations ~ N(0,1). If we want to change this, we need to control the rand.gen or innov argument. For example, if you want innovations from the uniform distribution U[-0.5, 0.5], we can do either of the following:
arima.sim(model=list(ar=.75), n=100, rand.gen = runif, min = -0.5, max = 0.5)
arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
Example
set.seed(0)
y <- arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
ts.plot(y)
In case we want explicit control over y[0], we can just shift the above time series so that it starts from y[0]. Suppose y0 is our desired starting value; then we can do
y <- y - y[1] + y0
For example, starting from y0 = 1:
y <- y - y[1] + 1
ts.plot(y)
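If you want y[0] to actually enter the recursion, rather than shifting the series afterwards, a minimal base-R sketch is to write the loop yourself (the AR coefficient 0.75 and the uniform innovations come from the question):
set.seed(0)
n  <- 100
e  <- runif(n, -0.5, 0.5)       # uniform innovations on [-0.5, 0.5]
y0 <- 1                         # starting value; use 10 for the second case

y    <- numeric(n)
y[1] <- 0.75 * y0 + e[1]
for (k in 2:n) y[k] <- 0.75 * y[k - 1] + e[k]

ts.plot(ts(c(y0, y)))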

How to automatically fit data with several normal cumulative distribution functions in R

I have several data sets (hundreds of them actually), that I know can be fitted with the sum of several normal cumulative distributions (see here).
Here is one example of such data set, here with two cumulative distribution functions:
library(pracma)      # for erf()
library(minpack.lm)  # for nlsLM()
library(ggplot2)

x <- seq(1, 1000, length.out = 50)
k1 <- 0.5
mu1 <- 500
sigma1 <- 100
y1 <- k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
k2 <- 0.5
mu2 <- 300
sigma2 <- 50
y2 <- k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2)))
my.df <- data.frame(x, y = y1 + y2, type = "data")
ggplot(my.df, aes(x, y)) + geom_line()
Now I want to fit those curves, so I use nls to do so:
model <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
                 + k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2))),
               start = c(mu1 = 500, sigma1 = 50, k1 = 0.5,
                         mu2 = 300, sigma2 = 50, k2 = 0.5),
               data = my.df,
               control = nls.lm.control(maxiter = 500))
tmp <- data.frame(x, y = predict(model), type = "fit")
combined <- rbind(my.df, tmp)
ggplot(combined, aes(x, y, colour = type, shape = type)) + geom_line() + geom_point()
Here is what I get:
The fit is great. However, I helped nls a lot:
I gave it a perfect fitting curve as input, not raw data
I told it my curve was the sum of two functions (not one or three)
And I almost gave the solution by providing very close parameter values
To fix the first point, I compute three models, with one, two and three component functions, and choose the one with the minimum deviance (see the sketch below).
For the second point, unfortunately the parameters change quite a bit across my hundreds of data sets, and I get disappointing results when I give the same starting parameters to all of them.
Is there a better way to select those starting values?
I heard of the mixtools library, but I'm not sure it works for CDF (cumulative distribution functions).
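A minimal sketch of the deviance-comparison step mentioned above (fit one-, two- and three-component models and keep the one with the lowest deviance); the fit_n_components helper and its evenly spread starting values are illustrative assumptions only, not a recommendation for real data sets:
library(pracma)      # erf
library(minpack.lm)  # nlsLM

fit_n_components <- function(df, n_comp) {
  # Build "k1*(1 + erf((x - mu1)/(sqrt(2)*sigma1))) + ..." with n_comp terms
  terms <- sprintf("k%d * (1 + erf((x - mu%d) / (sqrt(2) * sigma%d)))",
                   seq_len(n_comp), seq_len(n_comp), seq_len(n_comp))
  f <- as.formula(paste("y ~", paste(terms, collapse = " + ")))

  # Crude, evenly spread starting values (an illustrative assumption)
  start <- numeric(0)
  for (i in seq_len(n_comp)) {
    start[paste0("mu",    i)] <- unname(quantile(df$x, i / (n_comp + 1)))
    start[paste0("sigma", i)] <- diff(range(df$x)) / (4 * n_comp)
    start[paste0("k",     i)] <- max(df$y) / (2 * n_comp)
  }
  nlsLM(f, data = df, start = start, control = nls.lm.control(maxiter = 500))
}

# Fit 1-, 2- and 3-component models and keep the one with the smallest deviance
fits <- lapply(1:3, function(n) try(fit_n_components(my.df, n), silent = TRUE))
ok   <- !vapply(fits, inherits, logical(1), what = "try-error")
best <- fits[ok][[which.min(vapply(fits[ok], deviance, numeric(1)))]]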

Visual Comparison of Regression & PCA

I'm trying to perfect a method for comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'm going to use this in a course next semester. Sorry this is long!
UPDATE: I found a different approach which almost works (please fix it if you can!). I posted it at the bottom. A much smarter and shorter approach than I was able to come up with!
I basically followed the previous schemes up to a point: generate random data, figure out the line of best fit, draw the residuals. This is shown in the Second Code Chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a random point (the data points in this case). I think these work fine, and they are shown in the First Code Chunk along with proof that they work.
Now, the Second Code Chunk shows the whole thing in action using the same flow as #JDLong, and I'm adding an image of the resulting plot. Data in black, red is the regression with residuals in pink, blue is the 1st PC, and the light blue should be the normals, but obviously they are not. The functions in the First Code Chunk that draw these normals seem fine, but something is not right with the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come out horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?
Thanks, this has been vexing me for a while...
First Code Chunk (Functions to Draw Normals and Proof They Work):
##### The functions below are based very loosely on the citation at the end
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test, can be a vector.
  # slope, intercept is the line to check distance.
  Ax <- Px - 10 * diff(range(Px))
  Bx <- Px + 10 * diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
  AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
    if (k_raw[n] < 0) {
      k[n] <- 0
    } else if (k_raw[n] > 1) {
      k[n] <- 1
    } else {
      k[n] <- k_raw[n]
    }
  }
  x <- k * Ax + (1 - k) * Bx
  y <- k * Ay + (1 - k) * By
  ans <- data.frame(x, y)
  ans
}
# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles
m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )
Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)
res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")
##========================================================
##
## Credits:
## Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
## Based in part on C code by Damian Coventry Tuesday, 16 July 2002
## Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
## With grateful thanks for answering our needs!
## This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================
Second Code Chunk (Plots the Demonstration):
set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors
# Add the first PC by denormalizing back to original coords:
new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)
# Now add the normals
yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
############ UPDATE
Over at Vincent Zoonekynd's Page I found almost exactly what I wanted. But it doesn't quite work (it obviously used to). Here is a code excerpt from that site which plots normals to the first PC reflected through a vertical axis:
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')
r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')
u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")
And here is the result:
Alright, I'll have to answer my own question! After further reading and comparison of methods that people have put on the internet, I have solved the problem. I'm not sure I can clearly state what I "fixed" because I went through quite a few iterations. Anyway, here is the plot and the code (MWE). The helper functions are at the end for clarity.
# Comparison of Linear Regression & PCA
# Generate sample data
set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
# Plot the main data & residuals
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# Now the PCA using built-in functions
# rotation = loadings = eigenvectors
r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC
# Plot normals to 1st PC
X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")
###### Needed Functions
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test, can be a vector.
  # slope, intercept is the line to check distance.
  Ax <- Px - 10 * diff(range(Px))
  Bx <- Px + 10 * diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
  AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
    if (k_raw[n] < 0) {
      k[n] <- 0
    } else if (k_raw[n] > 1) {
      k[n] <- 1
    } else {
      k[n] <- k_raw[n]
    }
  }
  x <- k * Ax + (1 - k) * Bx
  y <- k * Ay + (1 - k) * By
  ans <- data.frame(x, y)
  ans
}
Try changing this line of your code:
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
to
res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
So you're calling the correct y values.
In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve(), the predicted component scores along the first principal axis (i.e., the matrix of predicted scores with the second predicted component scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying data by the loadings gives predicted scores; dividing predicted scores by the loadings gives the data back. Hope that helps.
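Concretely, with that one change the projection part of the excerpt becomes:
u <- solve(r$loadings)   # the suggested change: the inverse of the loadings
# Projection onto the first axis
p <- matrix(c(1, 0, 0, 0), nrow = 2)
X <- rbind(x, y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments(x, y, X[1, ], X[2, ], col = "lightblue1")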
