I would like to calculate the density function of a distribution whose characteristic function is known. As a simple example, take the normal distribution:
norm.char<-function(t,mu,sigma) exp((0+1i)*t*mu-0.5*sigma^2*t^2)
and then I would like to use R's fft function, but I don't get the multiplicative constants right, and I have to reorder the result (take the second half and then the first half of the values). I tried something like
xmax = 5
xmin = -5
deltat = 2*pi/(xmax-xmin)
N=2^8
deltax = (xmax-xmin)/(N-1)
x = xmin + deltax*seq(0,N-1)
t = deltat*seq(0,N-1)
density = Re(fft(norm.char(t*2*pi,mu,sigma)))
density = c(density[(N/2+1):N],density[1:(N/2)])
But this is still not correct. Does anybody know a good reference on the FFT in R in the context of density calculations? Obviously the problem is the mismatch between the continuous Fourier transform and the discrete one. Can anybody recommend a procedure?
Thanks
It is just cumbersome: take a pen and paper, write the integral you want to compute (the Fourier transform of the characteristic function), discretize it, and rewrite the terms so that they look like a discrete Fourier transform (the FFT assumes that the interval starts at zero).
Note that fft is an unnormalized transform: there is no 1/N factor.
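Concretely, the integral and the discretization that the code below implements are

f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\,\varphi(t)\,dt,

and, with t_j = c + j\,\Delta t, x_k = a + k\,\Delta x and \Delta t\,\Delta x = 2\pi/n,

f(x_k) \approx \frac{\Delta t}{2\pi}\, e^{-i c x_k} \sum_{j=0}^{n-1} \left[ e^{-i\, j \Delta t\, a}\, \varphi(t_j) \right] e^{-2\pi i j k/n},

where the last sum is exactly the (unnormalized) sum that R's fft computes.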
characteristic_function_to_density <- function(
  phi, # characteristic function; should be vectorized
  n,   # Number of points, ideally a power of 2
  a, b # Evaluate the density on [a,b[
) {
  i <- 0:(n-1)            # Indices
  dx <- (b-a)/n           # Step size, for the density
  x <- a + i * dx         # Grid, for the density
  dt <- 2*pi / ( n * dx ) # Step size, frequency space
  c <- -n/2 * dt          # Evaluate the characteristic function on [c,d]
  d <-  n/2 * dt          # (center the interval on zero)
  t <- c + i * dt         # Grid, frequency space
  phi_t <- phi(t)
  X <- exp( -(0+1i) * i * dt * a ) * phi_t
  Y <- fft(X)
  density <- dt / (2*pi) * exp( - (0+1i) * c * x ) * Y
  data.frame(
    i = i,
    t = t,
    characteristic_function = phi_t,
    x = x,
    density = Re(density)
  )
}
d <- characteristic_function_to_density(
  function(t, mu = 1, sigma = .5)
    exp( (0+1i)*t*mu - sigma^2/2*t^2 ),
  2^8,
  -3, 3
)
plot(d$x, d$density, las=1)
curve(dnorm(x,1,.5), add=TRUE)
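As a quick numerical check (my addition, using the same mu = 1 and sigma = 0.5 as above), the recovered density can be compared pointwise with dnorm; the discrepancy should be very small:

# Largest pointwise discrepancy between the FFT-based density and the exact one
max(abs(d$density - dnorm(d$x, mean = 1, sd = 0.5)))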
I am attempting to sample along multiple lines (roads) at regular intervals and am struggling to obtain exact perpendicular angles for each road segment. I have split each road into points marking the nodes at which the line changes orientation, and what I have so far creates a point within a straight segment of each road and appears to be working fine.
This is the code I am using to produce perpendicular angles for each node segment.
# X and Y for 3 points along a line
road_node <- matrix(
c(
381103, 381112, 381117,
370373, 370301, 370290
),
ncol = 2
)
road_node <- as.data.frame(road_node)
angle_inv <- c()
for (i in 2:nrow(road_node) - 1) {
n1 <- road_node[i, ]
n2 <- road_node[i + 1, ]
x <- as.numeric(n1[1] - n2[1])
y <- as.numeric(n1[2] - n2[2])
ang <- atan2(y, x) + 1 / 2 * pi
if (!is.na(ang) && ang < 0) {
ang <- 2 + ang
}
angle_inv <- rbind(angle_inv, ang)
}
Where road_node gives the coordinates of each node.
From this I take the mid points and the inverse angles to create two points on either side of each mid point, producing a line segment.
# X, Y and angles (angles for one segment are the same)
mids <- matrix(
c(
381374.5, 381351.0, 381320.5,
371590.5,371560.0, 371533.590,
2.3, 2.3, 2.3
),
nrow = 3
)
mids <- as.data.frame(mids)
pts <- c()
for (i in 1:nrow(mids)) {
x1 <- mids[i, 1] + 10 * cos(mids[i, 3])
y1 <- mids[i, 2] + 10 * sin(mids[i, 3])
x2 <- mids[i, 1] - 10 * cos(mids[i, 3])
y2 <- mids[i, 2] - 10 * sin(mids[i, 3])
p1 <- cbind(x1, y1)
p2 <- cbind(x2, y2)
pair <- rbind(p1, p2)
pts <- rbind(pts, pair)
}
Some line segments appear to be correctly perpendicular to the node they are associated with; however, some are not. They all do appear to have the same length, as expected.
I believe the problem lies with either how I am selecting my angles using atan2, or with how I am selecting my points either side of the node segment.
Firstly, there's no need to use trigonometry to solve this. Instead you can use the negative reciprocal of the slope from the slope-intercept form of the line segment's equation, then calculate points on the perpendicular line passing through a given point.
See Equation from 2 points using Slope Intercept Form
Also, your mid points appear incorrect, and there should only be 2 of them, since 3 points define 2 line segments.
This code appears to work fine:
# Function to calculate mid points
mid_point <- function(p1, p2) {
  return(c(p1[1] + (p2[1] - p1[1]) / 2, p1[2] + (p2[2] - p1[2]) / 2))
}
# Function to calculate slope of line between 2 points
slope <- function(p1, p2) {
  return((p2[2] - p1[2]) / (p2[1] - p1[1]))
}
# Function to calculate intercept of line passing through a given point with slope m
calc_intercept <- function(p, m) {
  return(p[2] - m * p[1])
}
# Function to calculate y for a given x, slope m and intercept b
calc_y <- function(x, m, b) {
  return(c(x, m * x + b))
}
# X and Y for 3 points along a line
road_node <- matrix(
c(
381103, 381112, 381117,
370373, 370301, 370290
),
ncol = 2
)
road_node <- as.data.frame(road_node)
perp_segments <- c()
for (i in 1:(nrow(road_node) - 1)) {
  n1 <- road_node[i, ]
  n2 <- road_node[i + 1, ]
  # Calculate mid point
  mp <- mid_point(n1, n2)
  # Calculate slope
  m <- slope(n1, n2)
  # Calculate intercept by substituting n1
  b <- calc_intercept(n1, m)
  # Calculate negative reciprocal of slope
  new_m <- -1.0 / m
  # Calculate intercept of perpendicular line through mid point
  new_b <- calc_intercept(mp, new_m)
  # Calculate points 10 units away in the x direction from the mid point
  p1 <- rbind(calc_y(as.numeric(mp[1]) - 10, new_m, new_b))
  p2 <- rbind(calc_y(as.numeric(mp[1]) + 10, new_m, new_b))
  # Add point pair to output vector
  pair <- rbind(p1, p2)
  perp_segments <- rbind(perp_segments, pair)
}
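If you want to check the result visually, a rough plotting sketch (my addition; it assumes the road_node and perp_segments objects created above, and uses a 1:1 aspect ratio so right angles look like right angles) could be:

plot(road_node$V1, road_node$V2, asp = 1, type = "o", pch = 16)
# perp_segments holds two rows (the two end points) per perpendicular segment
for (i in seq(1, nrow(perp_segments), by = 2)) {
  lines(unlist(perp_segments[i:(i + 1), 1]),
        unlist(perp_segments[i:(i + 1), 2]), col = "red")
}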
This is how it looks geometrically (image)
I hope this helps.
Edit 1:
I thought about this more and came up with this simplified function. If you think of the problem as a right isosceles triangle (45, 45, 90), then all you need to do is find the reference point at the required position along the line segment, compute the x and y offsets from it to the point a distance n further along the segment, then swap those offsets (flipping one sign) and add and subtract them from the reference point.
Function calc_perp
Arguments:
p1, p2 - two point vectors defining the end points of the line segment
n - the distance from the line segment
interval - the interval along the line segment of the reference point from the start (default 0.5)
proportion - Boolean defining whether the interval is a proportion of the length or a constant (default TRUE)
# Function to calculate Euclidean distance between 2 points
euclidean_distance <- function(p1, p2) {
  return(sqrt((p2[1] - p1[1])**2 + (p2[2] - p1[2])**2))
}
# Function to calculate 2 points on a line perpendicular to another defined by 2 points p1,p2
# For point at interval, which can be a proportion of the segment length, or a constant
# At distance n from the source line
calc_perp <- function(p1, p2, n, interval = 0.5, proportion = TRUE) {
  # Calculate x and y distances
  x_len <- p2[1] - p1[1]
  y_len <- p2[2] - p1[2]
  # If proportion, place the reference point at that fraction along the segment
  if (proportion) {
    point <- c(p1[1] + x_len * interval, p1[2] + y_len * interval)
  }
  # Else use the constant value
  else {
    tot_len <- euclidean_distance(p1, p2)
    point <- c(p1[1] + x_len / tot_len * interval, p1[2] + y_len / tot_len * interval)
  }
  # Calculate the x and y distances from the reference point to the point on the line n distance away
  ref_len <- euclidean_distance(point, p2)
  xn_len <- (n / ref_len) * (p2[1] - point[1])
  yn_len <- (n / ref_len) * (p2[2] - point[2])
  # Invert the x and y lengths and add/subtract from the reference point
  ref_points <- rbind(point, c(point[1] + yn_len, point[2] - xn_len), c(point[1] - yn_len, point[2] + xn_len))
  # Return the reference points
  return(ref_points)
}
Examples
> calc_perp(c(0,0),c(1,1),1)
[,1] [,2]
point 0.5000000 0.5000000
1.2071068 -0.2071068
-0.2071068 1.2071068
> calc_perp(c(0,0),c(1,1),sqrt(2)/2,0,proportion=FALSE)
[,1] [,2]
point 0.0 0.0
0.5 -0.5
-0.5 0.5
This is how the revised function looks geometrically with your example and n = 10 for distance from line:
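The picture for your data can be reproduced along these lines (a sketch of mine; it assumes the road_node data frame and the calc_perp function defined above, and perp_pts is just an illustrative name):

perp_pts <- c()
for (i in 1:(nrow(road_node) - 1)) {
  # Each call returns 3 rows: the reference point plus the two points n = 10 units either side
  perp_pts <- rbind(perp_pts,
                    calc_perp(unlist(road_node[i, ]), unlist(road_node[i + 1, ]), n = 10))
}
plot(road_node$V1, road_node$V2, asp = 1, type = "o", pch = 16)
points(perp_pts[, 1], perp_pts[, 2], col = "red", pch = 4)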
The task:
Eric the fly has a friend, Ernie. Assume that the two flies sit at independent locations, uniformly distributed on the globe’s surface. Let D denote the Euclidean distance between Eric and Ernie (i.e., on a straight line through the interior of the globe).
Make a conjecture about the probability density function of D and give an
estimate of its expected value, E(D).
So far I have made a function to generate two points on the globe's surface, but I am unsure what to do next:
sample3d <- function(n)
{
  df <- data.frame()
  while (n > 0) {
    x <- runif(1, -1, 1)
    y <- runif(1, -1, 1)
    z <- runif(1, -1, 1)
    r <- x^2 + y^2 + z^2
    if (r < 1) {
      u <- sqrt(x^2 + y^2 + z^2)
      vector = data.frame(x = x/u, y = y/u, z = z/u)
      df <- rbind(vector, df)
      n = n - 1
    }
  }
  df
}
E <- sample3d(2)
This is an interesting problem. I'll outline a computational approach; I'll leave the math up to you.
First we fix a random seed for reproducibility.
set.seed(2018);
We sample 10^4 points from the unit sphere surface.
sample3d <- function(n = 100) {
  df <- data.frame();
  while (n > 0) {
    # Propose a point uniformly in the cube [-1,1]^3
    x <- runif(1, -1, 1)
    y <- runif(1, -1, 1)
    z <- runif(1, -1, 1)
    r <- x^2 + y^2 + z^2
    # Keep it only if it falls inside the unit ball, then project onto the sphere
    if (r < 1) {
      u <- sqrt(x^2 + y^2 + z^2)
      vector = data.frame(x = x/u, y = y/u, z = z/u)
      df <- rbind(vector, df)
      n = n - 1
    }
  }
  df
}
df <- sample3d(10^4);
Note that sample3d is not very efficient, but that's a different issue.
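As an aside (not part of the original answer), a vectorized sampler that normalizes independent N(0,1) coordinates also yields points uniform on the sphere and avoids the rejection loop; a minimal sketch:

# Rows of normalized standard-normal triples are uniformly distributed on the unit sphere
sample3d_fast <- function(n) {
  m <- matrix(rnorm(3 * n), ncol = 3)
  setNames(as.data.frame(m / sqrt(rowSums(m^2))), c("x", "y", "z"))
}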
We now randomly sample 2 points from df, calculate the Euclidean distance between those two points (using dist), and repeat this procedure N = 10^4 times.
# Sample 2 points randomly from df, repeat N times
N <- 10^4;
dist <- replicate(N, dist(df[sample(1:nrow(df), 2), ]));
As pointed out by @JosephWood, the number N = 10^4 is somewhat arbitrary. We are using a bootstrap to derive the empirical distribution. As N -> infinity, one can show that the empirical bootstrap distribution is the same as the (unknown) population distribution (bootstrap theorem). The error between the empirical and population distributions is of the order 1/sqrt(N), so N = 10^4 should lead to an error of around 1%.
We can plot the resulting probability distribution as a histogram:
# Let's plot the distribution
library(ggplot2);
ggplot(data.frame(x = dist), aes(x)) + geom_histogram(bins = 50);
Finally, we can get empirical estimates for the mean and median.
# Mean
mean(dist);
#[1] 1.333021
# Median
median(dist);
#[1] 1.41602
These values are close to the theoretical values:
mean.th = 4/3
median.th = sqrt(2)
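For a unit-radius sphere these values follow from the fact that the cosine of the central angle gamma between two uniform points is itself uniform on [-1, 1], so the squared chord length D^2 = 2 - 2 cos(gamma) is uniform on [0, 4]:

F_D(d) = \frac{d^2}{4}, \qquad f_D(d) = \frac{d}{2}, \qquad 0 \le d \le 2,

E(D) = \int_0^2 d \cdot \frac{d}{2}\,\mathrm{d}d = \frac{4}{3}, \qquad \mathrm{median}(D) = \sqrt{2}.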
Background:
I have a curve whose Y-values are produced by my small R function below (neatly annotated). If you run my entire R code, you see my curve (but remember, it's a function so if I changed the argument values, I could get a different curve):
Question:
Obviously, one can determine/assume many intervals that would cover/take 95% of the total area under this curve. But using optimize(), how can I find the SHORTEST (in x-value units) of these many possible 95% intervals? What then would be the corresponding x-values for the two ends of this shortest 95% interval?
Note: The idea of a shortest interval makes sense for a uni-modal curve like mine. In practice, the shortest such interval sits toward the middle, where the height (y-value) is largest, so the interval doesn't need to be wide to cover/take 95% of the total area under the curve.
Here is my R code (please run the entire code):
ppp <- function(f, N, df1, df2, petasq, alpha, beta) {
  pp <- function(petasq) dbeta(petasq, alpha, beta)
  ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq))
  marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]
  po <- function(x) pp(x)*ll(x) / marg
  return(po(petasq))
}
## ### END OF MY R FUNCTION.
# Now I use my function above to get the y-values for my plot:
petasq <- seq(0, 1, by = .0001) ## These are X-values for my plot
f <- 30 # a function needed argument
df1 <- 3 # a function needed argument
df2 <- 108 # a function needed argument
N <- 120 # a function needed argument
alpha = 5 # a function needed argument
beta = 4 # a function needed argument
## Now use the ppp() function to get the Y-values for the X-value range above:
y.values <- ppp(f, N, df1, df2, petasq, alpha, beta)
## Finally plot petasq (as X-values) against the Y.values:
plot(petasq, y.values, ty="l", lwd = 3 )
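(As a quick check that talking about "95% of the area" makes sense: the curve is a proper density and integrates to 1 over [0, 1]. With the arguments above:)

# Should be essentially 1
integrate(function(p) ppp(f, N, df1, df2, p, alpha, beta), 0, 1)$value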
Based on your revised question, here is an optimization that finds the SHORTEST distance (in x-value units) between the LEFT and RIGHT boundaries:
ppp <- function(petasq, f, N, df1, df2, alpha, beta) {
  pp <- function(petasq) dbeta(petasq, alpha, beta)
  ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq))
  marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]
  po <- function(x) pp(x)*ll(x) / marg
  return(po(petasq))
}
petasq <- seq(0, 1, by = .0001) ## These are X-values for my plot
f <- 30 # a function needed argument
df1 <- 3 # a function needed argument
df2 <- 108 # a function needed argument
N <- 120 # a function needed argument
alpha = 5 # a function needed argument
beta = 4 # a function needed argument
optim_func <- function(x_left) {
  int_function <- function(petasq) {
    ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
  }
  # For every LEFT value, find the corresponding RIGHT value that gives 95% area.
  find_95_right <- function(x_right) {
    (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
  }
  x_right_obj <- optimize(f=find_95_right, interval=c(0.5,1))
  if (x_right_obj$objective > .Machine$double.eps^0.25) return(100)
  # Return the DISTANCE BETWEEN LEFT AND RIGHT
  return(x_right_obj$minimum - x_left)
}
#MINIMIZE THE DISTANCE BETWEEN LEFT AND RIGHT
x_left <- optimize(f=optim_func, interval=c(0.30,0.40))$minimum
int_function <- function(petasq) {
  ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
}
find_95_right <- function(x_right) {
  (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
}
x_right <- optimize(f=find_95_right, interval=c(0.5,1))$minimum
See the comments in the code. Hopefully this finally satisfies your question :) Results:
> x_right
[1] 0.5409488
> x_left
[1] 0.3201584
Also, you can plot the distance between LEFT and RIGHT as a function of the left boundary:
left_x_values <- seq(0.30, 0.335, 0.0001)
DISTANCE <- sapply(left_x_values, optim_func)
plot(left_x_values, DISTANCE, type="l")
If we think of this as building the interval out of the regions with the largest areas, we can start by calculating the area of each small region under the plotted curve. We can then find the largest area (which presumably will be near the center) and keep walking outwards until we have accumulated the area we are looking for.
Since you've already calculated the x and y values for the plot, I'll reuse those to save some calculations. Here's an implementation of that algorithm:
pseudoarea <- function(x, y, target = .95) {
  dx <- diff(x)
  # Trapezoidal area of each small slice under the curve
  areas <- dx * .5 * (head(y, -1) + tail(y, -1))
  peak <- which.max(areas)
  range <- c(peak, peak)
  found <- areas[peak]
  # Grow the interval towards whichever neighbouring slice is larger
  while (found < target) {
    if (areas[range[1] - 1] > areas[range[2] + 1]) {
      range[1] <- range[1] - 1
      found <- found + areas[range[1]]
    } else {
      range[2] <- range[2] + 1
      found <- found + areas[range[2]]
    }
  }
  val <- x[range]
  attr(val, "indexes") <- range
  attr(val, "area") <- found
  return(val)
}
And we call it with
pseudoarea(petasq, y.values)
# [1] 0.3194 0.5413
This does assume that all the values in petasq are equally spaced.
I don't think you need to use optimize (unless this were part of an unadmitted homework assignment). Instead just normalize a cumulative sum and figure out at which points your criteria are met:
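(The construction of cusm.y is not shown; presumably it is the normalized cumulative sum of the y-values from the question, something along these lines, which relies on the equal x-spacing of petasq:)

cusm.y <- cumsum(y.values) / sum(y.values)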
> which(cusm.y >= 0.025)[1]
[1] 3163
> which(cusm.y >= 0.975)[1]
[1] 5375
You can check that these are reasonable indices to use for pulling values from the petasq vector with:
abline( v= c( petasq[ c( which(cusm.y >= 0.025)[1], which(cusm.y >= 0.975)[1])]),
col="red")
This is admittedly equivalent to constructing an integration function with a normalization constant across the domain of the "density" function. The fact that the intervals are all of equal width allows omitting the differencing of the x-vector from the height-times-base calculation.
I suppose there is another interpretation possible. That would require that we discover how many of the largest y.values (a descending-sorted version) are needed to sum to 95% of the total sum. This gives a different strategy, and the plot shows where a horizontal line would intersect the curve:
which( cumsum( sort( y.values, decreasing=TRUE) ) > 0.95* sum(y.values, na.rm=TRUE) )[1]
#[1] 2208
sort( y.values, decreasing=TRUE)[2208]
#[1] 1.059978
png()
plot(petasq, y.values, ty="l", lwd = 3 )
abline( h=sort( y.values, decreasing=TRUE)[2208], col="blue")
dev.off()
To get the petasq values you would need to determine the first y.values that exceeded that value and then the next y.values that dropped below that level. These can be obtained via:
order(y.values, decreasing=TRUE)[2208]
#[1] 3202
order(y.values, decreasing=TRUE)[2209]
#[1] 5410
And then the plot would look like:
png(); plot(petasq, y.values, ty="l", lwd = 3 )
abline( v= petasq[ c(3202, 5410)], col="blue", lty=3, lwd=2)
dev.off()
The area between the two dotted blue lines is 95% of the total area above the zero line:
I am doing cluster analysis of several time series in R (the sales of a product in different stores).
I am using the first order temporal correlation coefficient CORT(S1,S2), in package TSclust, where S1 and S2 are two time series.
The literature (https://cran.r-project.org/web/packages/TSclust/TSclust.pdf) explains that CORT belongs to the interval [-1,1]: when CORT(S1,S2)=1 both series show similar dynamic behavior, and when CORT(S1,S2)=-1 they have opposite behavior.
I would like to know how to see the results of CORT, in order to observe the values of CORT for each pair of time series.
We can see the following example in the TSclust package:
library(TSclust)
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))
## Compute the distance and check for coherent results
diss.CORT(x, y, 2)
diss.CORT(x, z, 2)
diss.CORT(y, z, 2)
So with the above code we can calculate the dissimilarity index that uses the coefficient CORT(S1,S2), but we cannot inspect the values of the CORT coefficient itself.
So, does anyone know how to see the values of the CORT coefficient in R?
Thanks in advance.
I am not sure if this is what you want, but anyhow, this is what I did:
View(diss.CORT)
where R shows:
function (x, y, k = 2, deltamethod = "Euclid")
{
    .ts.sanity.check(x, y)
    .check.equal.length.ts(x, y)
    corrt <- corrtemporder1(x, y)
    type <- (pmatch(deltamethod, c("Euclid", "Frechet", "DTW")))
    typedist <- 0
    if (is.na(type)) {
        stop(paste("Unknown method", deltamethod))
    }
    else if (type == 1) {
        typedist <- as.numeric(dist(rbind(x, y)))
    }
    else if (type == 2) {
        typedist <- diss.FRECHET(x, y)
    }
    else if (type == 3) {
        typedist <- dtw(x, y, dist.method = "Manhattan", distance.only = T)$distance
    }
    (2/(1 + exp(k * corrt))) * typedist
}
Now if you go through that and start reading the script, it seems that you are looking for the line where corrt <- corrtemporder1(x, y) is computed. Google it and you get to: https://github.com/cran/TSclust/blob/master/R/diss.R
#############################################################################
################# Temporal Correlation Distance #########################
#############################################################################
##CHOUAKRIA-DOUZAL
corrtemporder1 <- function (x, y) {
p <- length(x)
sum((x[2:p] - x[1:(p-1)]) * (y[2:p] - y[1:(p-1)])) / ( sqrt( sum((x[2:p] - x[1:(p-1)])^2) ) * sqrt( sum((y[2:p] - y[1:(p-1)])^2) ))
}
Now, I think this is what you are looking for.
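For example (a sketch I've added; it assumes corrtemporder1 is available, either by pasting the definition above or, if the package keeps it as an internal function, via TSclust:::corrtemporder1), you can tabulate the CORT coefficient for every pair of series:

# CORT coefficient for every pair of series in a named list
cort_values <- function(series_list) {
  n <- length(series_list)
  out <- matrix(NA_real_, n, n,
                dimnames = list(names(series_list), names(series_list)))
  for (i in 1:n) for (j in 1:n)
    out[i, j] <- corrtemporder1(series_list[[i]], series_list[[j]])
  out
}
cort_values(list(x = x, y = y, z = z)) # x, y, z from the example above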
This is my first post to the R-community, so pardon me if it is silly. I would like to use the functions geom_density2d and stat_density2d in ggplot2 to plot kernel density estimates, but the problem is that they can't handle weighted data. From what I understand, these two functions call the function kde2d from package MASS to make the kernel density estimate. And the kde2d doesn't take data weights as a parameter.
Now, I have found this altered version of kde2d http://www.inside-r.org/node/226757, which takes weights as a parameter and is based on the source code of kde2d. The code of this function:
kde2d.weighted <- function (x, y, w, h, n = 25, lims = c(range(x), range(y))) {
  nx <- length(x)
  if (length(y) != nx)
    stop("data vectors must be the same length")
  if (length(w) != nx & length(w) != 1)
    stop("weight vectors must be 1 or length of data")
  gx <- seq(lims[1], lims[2], length = n) # gridpoints x
  gy <- seq(lims[3], lims[4], length = n) # gridpoints y
  if (missing(h))
    h <- c(bandwidth.nrd(x), bandwidth.nrd(y));
  if (missing(w))
    w <- numeric(nx)+1;
  h <- h/4
  ax <- outer(gx, x, "-")/h[1] # distance of each point to each grid point in x-direction
  ay <- outer(gy, y, "-")/h[2] # distance of each point to each grid point in y-direction
  z <- (matrix(rep(w,n), nrow=n, ncol=nx, byrow=TRUE)*matrix(dnorm(ax), n, nx)) %*% t(matrix(dnorm(ay), n, nx))/(sum(w) * h[1] * h[2]) # z is the density
  return(list(x = gx, y = gy, z = z))
}
I would like to make the functions geom_density2d and stat_density2d call kde2d.weighted instead of kde2d, thereby making them accept weighted data.
I have never changed any functions in existing R packages so my question is what is the easiest way doing this?
You can actually pass your own density data to geom_contour, which would probably be the easiest approach. Let's start by making a sample dataset, adding weights to the geyser data.
library("MASS")
data(geyser, package = "MASS")
geyserw <- transform(geyser,
  weight = sample(1:5, nrow(geyser), replace = TRUE)
)
Now we use your weighted function to calculate the density and turn it into a data.frame
dens <- kde2d.weighted(geyserw$duration, geyserw$waiting, geyserw$weight)
dfdens <- data.frame(expand.grid(x=dens$x, y=dens$y), z=as.vector(dens$z))
Now we plot the data
library(ggplot2)
ggplot(geyserw, aes(x = duration, y = waiting)) +
  geom_point() + xlim(0.5, 6) + ylim(40, 110) +
  geom_contour(aes(x = x, y = y, z = z), data = dfdens)
And that should do it.