How to calculate the volume under a surface defined by discrete data in R?

I need to determine the volume beneath a series of surfaces represented by discrete data points. In my data, each sample is stored as a separate data frame within a list of data frames. Here is some (small) example data:
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
df2 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(1,1,2,3,5,6,2,1,3,3,8,9,8,3,1))
DF <- list(df1, df2)
Answers to similar questions are either in other languages (MATLAB, Python), or do not contain usable code to address the problem (as here). I can think of two acceptable ways to estimate the volume beneath each surface: 1) write a discretized version of Simpson's rule as an R function that is applied across the list of data frames (DF); 2) fit an arbitrary relationship between x, y, and z and use multivariate numerical integration to find the volume under the surface (with functions like simpson2d / quad2d in the pracma package or adaptIntegrate in cubature).
Regarding the first approach, the formula for the composite Simpson's rule (that I would like to use) is here, but due to its complexity I have been unsuccessful in writing a working double-summation function. In this expression, I(lambda_em, lambda_ex) equals z in the above datasets at each (x, y) grid point, and Delta_em and Delta_ex represent the spacing between the x and y points.
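For reference, the kind of double summation I am after might look roughly like this sketch (illustrative only; I have not verified it against the linked formula, and it assumes an evenly spaced grid with an odd number of points in each direction, as in the example data):
simpson2d.discrete <- function(df) {
  xs <- sort(unique(df$x)); ys <- sort(unique(df$y))
  nx <- length(xs); ny <- length(ys)
  stopifnot(nx %% 2 == 1, ny %% 2 == 1)             # composite rule needs odd counts
  hx <- diff(xs)[1]; hy <- diff(ys)[1]              # grid spacing (Delta_em, Delta_ex)
  wx <- c(1, rep(c(4, 2), length.out = nx - 2), 1)  # Simpson weights 1,4,2,...,4,1
  wy <- c(1, rep(c(4, 2), length.out = ny - 2), 1)
  Z  <- matrix(df$z[order(df$x, df$y)], nrow = nx, byrow = TRUE)  # z on the grid
  hx * hy / 9 * sum(outer(wx, wy) * Z)
}
sapply(DF, simpson2d.discrete)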
The second approach would essentially extend the approach found here to multivariate spline fits and then pass the predicted z values as a function for integration. Here's what I have tried so far for this approach:
require(pracma)
df1.loess <- loess(z ~ x + y, data=DF[[1]])
mod.fun <- function(x,y) predict(df1.loess, newdata=x,y)
simpson2d(mod.fun, x=c(2,6), y=c(1,3))
But this does not yield useful results.
In reality, I have a list of almost 100 data frames for individual samples, so I really need to be able to express the solution as a series of lapply functions that automate these calculations across all data frames in the list. An example looks something like this:
require(akima)
DF.splines <- lapply(DF, function(x,y,z) interp(x = "x", y = "y", z = "z",
                                                linear=F, nx=4, ny=2))
Unfortunately, this produces an exception for missing values and Infs. I'm extremely open to any suggestions for how to successfully implement one of these strategies, or to utilize a different (simpler?) approach. Could a kriging function (like km in the DiceKriging package) produce a better fit that could be passed on for numerical integration?

I am assuming that the surface mesh is defined by connecting the points with straight lines. Then you can find the volume beneath that surface via:
1. triangular tessellation of the (x,y) grid into triangles T_i with areas A_i
2. finding the corresponding z values Z_i for each triangle T_i
3. calculating the volume V_i of the truncated prism defined by T_i and Z_i via V_i = A_i * sum(Z_i) / 3 (see https://en.wikipedia.org/wiki/Prism_(geometry) and https://math.stackexchange.com/questions/2371139/volume-of-truncated-prism)
4. summing up all truncated prism volumes V_i
Keep in mind, however, that the volume does depend on your tessellation and that the tessellation is not unique. But your problem is not fully defined in the sense that it does not describe how one should interpolate between points. So any approach to calculate a volume will have to make additional assumptions.
Going back to my solution approach, points 1 and 2 can be achieved via the geometry package.
Here is some code:
library(geometry)
getVolume = function(df) {
  # find the triangular tessellation of the (x,y) grid
  res = delaunayn(as.matrix(df[, -3]), full = TRUE, options = "Qz")
  # calculate the sum of the truncated prism volumes
  sum(mapply(function(triPoints, A) A / 3 * sum(df[triPoints, "z"]),
             split.data.frame(res$tri, seq_along(res$areas)),
             res$areas))
}
sapply(DF,getVolume)
#[1] 32.50000 30.33333
Since it's hard to check whether the results are consistent, here is a simple example where we know the right answer. It's a cube with side length 2 from which we have cut out a wedge along the x axis. The cut-out region is 1/4 of the total volume.
cutOutCube=expand.grid(c(0,1,2),c(0,1,2))
colnames(cutOutCube)=c("x","y")
cutOutCube$z=ifelse(cutOutCube$x==1,1,2)
sapply(list(cutOutCube),getVolume)
#[1] 6
That's correct since 2^3*(1-1/4)=6.
Another sanity check can be performed by calculating the "complement" of the volume w.r.t. a simple cuboid where all z values are set to the maximum z value (in your case max(z)=9 for both data sets). The simple cuboid volume is 72 in both cases. Now let's define the complement surfaces and sum up volume and complement volume:
df1c=df1
df1c$z=max(df1c$z)-df1c$z
df2c=df2
df2c$z=max(df2c$z)-df2c$z
DFc=list(df1c,df2c)
sapply(DFc,getVolume)+sapply(DF,getVolume)
#[1] 72 72
So volume and complement volume give the right simple cuboid volume in both cases.

You could approximate the surface through a "barycentric Lagrangian" approach as implemented in function barylag2d in the pracma package. Then, to avoid any vectorization problems, apply the Gaussian quadrature rules explicitly.
library(pracma)
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
# Define the nodes in x- and y-direction
xn <- df1$x[c(1,4,7,10,13)]
yn <- df1$y[1:3]
# Define the matrix representing the function
m1 <- matrix(df1$z, nrow=5, byrow=TRUE)
f <- function(x, y)
    c(pracma::barylag2d(m1, xn, yn, x, y))
# 32 nodes in integration intervals
n <- 32
xa <- 2; xb <- 6; ya <- 1; yb <- 3
# Apply the quadrature rules explicitly
cx <- gaussLegendre(n, xa, xb)
x <- cx$x; wx <- cx$w
cy <- gaussLegendre(n, ya, yb)
y <- cy$x; wy <- cy$w
# Sum weights * values over all nodes
I <- 0
for (i in 1:n) {
    for (j in 1:n) {
        I <- I + wx[i] * wy[j] * f(x[i], y[j])
    }
}
I # 40.37037
The integral value of 40 seems reasonable given the data. simpson2d or quad2d will not work in this setting.
You may try whether adaptIntegrate (from the cubature package) works with the function f defined this way.
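A minimal sketch of that idea (assuming the cubature package is installed; adaptIntegrate expects a function of a single vector argument, so f is wrapped accordingly):
library(cubature)
g <- function(p) f(p[1], p[2])   # wrap f(x, y) as a function of one vector
adaptIntegrate(g, lowerLimit = c(2, 1), upperLimit = c(6, 3))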

Related

Using FFT in R to Determine Density Function for IID Sum

The goal is to compute the density function of a sum of n IID random variables from the density function of one of these random variables by:
1. transforming the density function into the characteristic function via fft
2. raising the characteristic function to the power n
3. transforming the resulting characteristic function back into the density function of interest via fft(inverse=TRUE)
The below is my naive attempt at this:
sum_of_n <- function(density, n, xstart, xend, power_of_2)
{
  x <- seq(from = xstart, to = xend, by = (xend - xstart) / (2^power_of_2 - 1))
  y <- density(x)
  fft_y <- fft(y)
  fft_sum_of_y <- (fft_y ^ n)
  sum_of_y <- Re(fft(fft_sum_of_y, inverse = TRUE))
  return(sum_of_y)
}
In the above, density is an arbitrary density function: for example
density <- function(x){return(dgamma(x = x, shape = 2, rate = 1))}
n indicates the number of IID random variables being summed. xstart and xend are the start and end of the approximate support of the random variable. power_of_2 gives the power-of-2 length of the numeric vectors used. As I understand things, vector lengths that are powers of two increase the efficiency of the fft algorithm.
I understand at least partially why the above does not work as intended in general. Firstly, the values themselves will not be scaled correctly, as fft(inverse=TRUE) does not normalize by default. However, I find that the values are still not correct when I divide by the length of the vector i.e.
sum_of_y <- sum_of_y / length(sum_of_y)
which based on my admittedly limited understanding of fft is the normalizing calculation. Secondly, the resulting vector will be out of phase due to (someone correct me on this if I am wrong) the shifting of the zero frequency that occurs when fft is performed. I have tried to use, for example, pracma's fftshift and ifftshift, but they do not appear to address this problem correctly. For symmetric distributions e.g. normal, this is not difficult to address since the phase shift is typically exactly half, so that an operation like
sum_of_y <- c(sum_of_y[(length(y)/2+1):length(y)], sum_of_y[1:(length(y)/2)])
works as a correction. However, for asymmetric distributions like the gamma distribution above this fails.
In conclusion, are there adjustments to the code above that will result in an appropriately scaled and appropriately shifted final density function for the IID sum?
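For what it is worth, here is a sketch of one way the scaling and shifting can be handled (my own adaptation, not a verified answer): treat the sampled density as probability masses p = density(x) * dx, zero-pad so the n-fold circular convolution cannot wrap around, and remember that the support of the sum starts at n * xstart:
sum_of_n_fft <- function(density, n, xstart, xend, power_of_2) {
  N  <- 2^power_of_2
  dx <- (xend - xstart) / (N - 1)
  x  <- seq(xstart, xend, length.out = N)
  p  <- density(x) * dx                         # approximate probability masses
  M  <- stats::nextn(n * N)                     # padded length avoids wrap-around
  p  <- c(p, rep(0, M - N))
  q  <- Re(fft(fft(p)^n, inverse = TRUE)) / M   # pmf of the n-fold sum
  list(x = n * xstart + (seq_len(M) - 1) * dx,  # grid on which the sum lives
       density = q / dx)                        # back to density scale
}
# quick check: the sum of 3 Gamma(2, 1) variables is Gamma(6, 1)
out <- sum_of_n_fft(function(x) dgamma(x, shape = 2, rate = 1), 3, 0, 30, 12)
plot(out$x, out$density, type = "l", xlim = c(0, 30))
curve(dgamma(x, shape = 6, rate = 1), add = TRUE, lty = 2, col = 2)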

Syntax: using Vectorize to vectorize a multidimensional function in this R example

I am looking at the last example in this short documentation page:
https://www.rdocumentation.org/packages/chebpol/versions/1.3-952/topics/mlappx
Reproduced here,
require(chebpol)
## evenly spaced grid-points
su <- seq(0,1,length.out=10)
## irregularly spaced grid-points
s <- su^3
## create approximation on the irregularly spaced grid
ml1 <- Vectorize(mlappx(exp,list(s)))
## test it, since exp is convex, the linear approximation lies above
## the exp between the grid points
ml1(su) - exp(su)
## multi linear approx
f <- function(x) exp(sum(x^2))
grid <- list(s,su)
ml2 <- mlappx(evalongrid(f,grid=grid),grid)
# an equivalent would be ml2 <- mlappx(f,grid)
a <- runif(2); ml2(a); f(a)
# we also get an approximation outside of the domain, of disputable quality
ml2(c(1,2)); f(c(1,2))
The last line evaluates the approximation (via linear interpolation) of f at the point (1,2). What is the darned syntax to get it to evaluate multiple pairs of points if we vectorize with:
ml2 <- Vectorize(mlappx(evalongrid(f,grid=grid),grid))
Just a short note. I'm the author of chebpol. Recently I've upgraded it with several more interpolation methods. They have a uniform interface through the ipol() function, and are all vectorized (they can take a matrix of column vectors), and are parallelized.
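For the example above, that could look something like the following sketch (hedged: the ipol() arguments are taken from my reading of its documentation, and apply() is a package-agnostic fallback for evaluating the original approximation at many points):
library(chebpol)
# the original (un-Vectorized) approximation from the question
ml2 <- mlappx(evalongrid(f, grid = grid), grid)
pts <- matrix(runif(10), nrow = 2)   # 5 points, one per column
apply(pts, 2, ml2)                   # plain-R way to evaluate it at many points
# the uniform interface mentioned above; arguments assumed from ?ipol
ml3 <- ipol(f, grid = grid, method = "multilinear")
ml3(pts)                             # ipol interpolants take a matrix of column vectors
apply(pts, 2, f)                     # exact values for comparison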

Points uniformly distributed on unit disk (2D)

I am trying to generate 10,000 points from the uniform distribution on the unit disk and plot these points.
The method I am using has three steps. The first step is generating the magnitude of the point, x. This magnitude has cdf F(x) = x^2, with min(x) = 0 and max(x) = 1. The second step involves generating a 2-dimensional vector (which I will call y) from the multivariate normal distribution with mu being the zero vector and sigma being the 2x2 identity matrix, i.e. MVN(0, I). Last, I normalize the vector y to have length x. I have tried to code the solution in R but I do not think my answer is correct. I would really appreciate it if I could be pointed in the right direction.
library(MASS)   # for mvrnorm
u = runif(10000)
x = u^2
y = mvrnorm(10000, mu=rep(0,2), Sigma=diag(2))
y_norm = (x*y)/sqrt(sum(y^2))
plot(y_norm, asp = 1)
I used the MASS package for mvrnorm. Also I have included the plot that I ended up with:
You need to compute the length of each of the rows in your y matrix; as written, you are taking the square root of the sum of all the squared entries of y, which just scales your multivariate normal sample by a constant. Also, you need x to be sqrt(u) rather than u^2. This code normalises each row by its length, uses sqrt(u) scaling, and looks nice and uniform:
plot(sqrt(u)*y/sqrt(y[,1]^2+y[,2]^2))
There are better ways of making uniform points on a disc, unless this is just an exercise to do it this way...
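For completeness, one such direct construction (my addition, not part of the original answer) draws the radius as the square root of a uniform and the angle uniformly on [0, 2*pi):
n <- 10000
r <- sqrt(runif(n))                  # P(R <= r) = r^2 gives uniform area density
theta <- runif(n, 0, 2 * pi)
plot(r * cos(theta), r * sin(theta), asp = 1, pch = ".")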

Time taken to krige in gstat package in R

The following R program creates an interpolated surface from 470 data points of the Walker Lake data in the gstat package.
source("D:/kriging/allfunctions.r") # Reads in all functions.
source("D:/kriging/panel.gamma0.r") # Reads in panel function for xyplot.
library(lattice) # Needed for "xyplot" function.
library(geoR) # Needed for "polygrid" function.
library(akima)
library(gstat);
library(sp);
walk470 <- read.table("D:/kriging/walk470.txt",header=T)
attach(walk470)
coordinates(walk470) = ~x+y
walk.var1 <- variogram(v ~ x+y,data=walk470,width=10) #the width has to be tuned resulting different point pairs
plot(walk.var1,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 5")
model1.out <- fit.variogram(walk.var1,vgm(70000,"Sph",40,20000))
plot(walk.var1, model=model1.out,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
poly <- chull(coordinates(walk470))
plot(coordinates(walk470),type="n",xlab="X",ylab="Y",cex.lab=1.6,main="Plot of Sample and Prediction Sites",cex.axis=1.5,cex.main=1.6)
lines(coordinates(walk470)[poly,])
poly.in <- polygrid(seq(2.5,247.5,5),seq(2.5,297.5,5),coordinates(walk470)[poly,])
points(poly.in)
points(coordinates(walk470),pch=16)
coordinates(poly.in) <- ~ x+y
krige.out <- krige(v ~ 1, walk470,poly.in, model=model1.out)
print(krige.out)
For each of the 2688 prediction points, this program performs:
1. a (470 x 470) matrix inversion
2. a (470 x 470) by (470 x 1) matrix multiplication
Is the gstat package using some smart way of doing this calculation? I knew from a previous Stack Overflow query that it uses Cholesky decomposition for matrix inversion. Is it normal for one machine to calculate this so quickly?
It uses LDL' decomposition, which is similar to Choleski. As you are using global kriging, the covariance matrix needs to be decomposed only once; then, for each prediction point, a system is solved by forward/back substitution, which is far cheaper than a fresh decomposition. No 470x470 matrix ever gets inverted, nor are solutions obtained by multiplying with an inverse. Inverses are notational devices, but are avoided as a computational strategy whenever possible. In R, for instance, compare the runtime of solve(A, b) with solve(A) %*% b.
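A quick, illustrative timing sketch of that last point (sizes and seed are arbitrary):
set.seed(1)
n <- 1000
A <- crossprod(matrix(rnorm(n * n), n))   # symmetric positive-definite matrix
b <- rnorm(n)
system.time(x1 <- solve(A, b))            # solve the linear system directly
system.time(x2 <- solve(A) %*% b)         # form the inverse, then multiply (slower)
all.equal(x1, drop(x2))                   # same answer, very different cost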
Use the source, Luke!

Summing 2 distance matrices to get a third 'overall' distance matrix (ecological context)

I am an ecologist, using mainly the vegan R package.
I have 2 matrices (samples x species abundances) (see data below):
matrix 1: nrow = 6 replicates x 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates x 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: averaging fish and macro-invertebrate abundances over replicates (at the site scale), cbind-ing the two mean-abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances), and calculating Bray-Curtis;
option 2: for each assemblage, computing the Bray-Curtis dissimilarity among pairs of sites, computing distances among site centroids, and then summing the 2 distance matrices.
In case I am not clear, I did these 2 operations in the R code below.
Please, could you tell me if option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Here are the R code examples:
generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage the distance among site centroids and summing the 2 distance matrices
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                            ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances. That is, they satisfy some metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel type analysis of dissimilarities vs. analysis of rectangular data that is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. With rectangular data I mean the normal sampling units times species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp, meandis.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally, about the geometry of combining dissimilarities. Dissimilarity indices on a 0..1 scale are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, the result will not equal the same fraction computed from the averaged data. This is what I mean by destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences; their squares are additive, but the distances themselves are not.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather go with the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), using standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x), dist(y), dist(z))

# create new distance matrix with first element of array
dst <- dst_array[[1]]

# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)) {
  dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
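Putting the scaling idea together, a small self-contained version might look like this (my_scale is a made-up weight vector, chosen here to sum to 1):
my_scale <- c(0.5, 0.3, 0.2)              # one weight per distance matrix
dst <- my_scale[[1]] * dst_array[[1]]
for (jj in 2:length(dst_array)) {
  dst <- dst + my_scale[[jj]] * dst_array[[jj]]
}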
