Computing the Tukey median - r

I am trying to compute the data depth of two variables with the following function:
library(depth)
x <- data.frame(data$`math score`, data$`reading score`)
depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-8, ndir = 1000)
The first argument to depth is u, which the documentation describes as: "Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations."
I have 1000 observations; however, I get the following error message:
Error in depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-08, :
Dimension mismatch between the data and the point u.
Does someone know how to solve this issue?
Thank you in advance!

If you look at the documentation for the function depth, it says:
u    Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations.
So u has to be a point in multidimensional space represented by a vector with n components, whereas x has to be a matrix or data frame of m by n components, (m rows for m points). You are comparing u to all the other multidimensional points in the set x to find the minimum number of points that could share a half-space with u.
Let's create a very simple example in two-dimensional space:
library(depth)
set.seed(100)
x <- data.frame(x = c(rnorm(10, -5, 2), rnorm(10, 5, 2)), y = rnorm(20, 0, 2))
plot(x)
The depth function calculates the depth of a particular point relative to the data. So let's use the origin:
u <- data.frame(x = 0, y = 0)
points(u, col = "red", pch = 16)
Naively we might think that the origin here has a depth of 10/20 points (i.e. the most obvious way to partition this dataset is a vertical line through the origin with 10 points on each side), but instead we find:
depth(u, x)
#> [1] 0.35
This indicates that there is a half-space including the origin that only contains 0.35 of the points, i.e. 7 points out of 20:
depth(u, x) * nrow(x)
#> [1] 7
And we can see that visually like this:
abline(0, -0.07)
points(x[x$y < (-0.07 * x$x),], col = "blue", pch = 16)
Where we have coloured these 7 points blue.
So it's not clear what result you expect from the depth function, but you will need to give it a value of c(math_score, reading_score) where math_score and reading_score are test values for which you want to know the depth.
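To make the half-space definition concrete, here is a minimal base-R sketch of a brute-force Tukey depth for the two-dimensional case. The helper name tukey_depth_2d is hypothetical and is not part of the depth package; it assumes exactly two columns and counts points in closed half-planes through u, so treat it as an illustration of the idea rather than a replacement for depth().

```r
# Hypothetical helper (not from the depth package): brute-force halfspace
# depth of a point u relative to the rows of a two-column data set x.
tukey_depth_2d <- function(u, x) {
  x <- as.matrix(x)
  u <- as.numeric(unlist(u))
  stopifnot(ncol(x) == 2, length(u) == 2)  # u must match the data dimension
  d <- sweep(x, 2, u)                      # vectors from u to every point
  theta <- atan2(d[, 2], d[, 1])
  # The count only changes when the boundary line passes through a point,
  # so one test direction between consecutive critical angles is enough.
  crit <- sort(unique(c(theta + pi / 2, theta - pi / 2) %% (2 * pi)))
  mids <- (crit + c(crit[-1], crit[1] + 2 * pi)) / 2
  counts <- vapply(mids, function(p) sum(d %*% c(cos(p), sin(p)) >= 0),
                   numeric(1))
  min(counts) / nrow(x)
}

# Toy data: the four corners of a square
sq <- data.frame(x = c(1, -1, -1, 1), y = c(1, 1, -1, -1))
tukey_depth_2d(c(0, 0), sq)   # centre of the square
tukey_depth_2d(c(5, 5), sq)   # a point outside the data cloud

# Depth of every observation; the deepest row is only a crude stand-in for
# the Tukey median, restricted to the observed points.
set.seed(100)
x <- data.frame(x = rnorm(20), y = rnorm(20))
depths <- apply(as.matrix(x), 1, function(p) tukey_depth_2d(p, x))
x[which.max(depths), ]
```

The key point for the original error is the stopifnot line: u has one component per column of the data, not one per observation.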

Related

How to generate an elliptical cylinder, populate it with randomly distributed points, and measure instances of overlap between those points in R

I would like to determine the probability that a randomly distributed object of Type A occupies or touches (overlaps) the same space as any randomly distributed object of Type B when populated inside an elliptical cylinder. I would then like to loop this simulation many times to generate a more reliable probability value.
I am able to draw the elliptical cylinder using the shape package:
library(shape)
emptyplot(c(-5, 5), c(-15, 15), main = "filled elliptic cylinder")
filledcylinder(rx = 9, ry = 5, len= 2, angle = 00, col = "white",
lcol = "black", lcolint = "grey")
I do not know how to add points (i.e. objects A and B) to this graph. However, I suspect graphical expression is not the way to go with this task (though I find visualising helpful). I suspect a better approach will be to create a function to describe the elliptical cylinder, similar to the cone in the following example, and run the simulation without graphical output:
# Create a function to describe a cone
cone <- function(x, y){
sqrt(x ^ 2 + y ^ 2)
}
# prepare variables.
x <- y <- seq(-1, 1, length = 30)
z <- outer(x, y, cone)
# plot as a 3D surface for visual reference (even though I actually want a volume)
persp(x, y, z,
main="Perspective Plot of a Cone",
zlab = "Height",
theta = 30, phi = 15,
col = "orange", shade = 0.4)
Sadly I do not know how to do this for my elliptical cylinder. I am aware of the parameters for describing an elliptical cylinder from the following source:
https://mathworld.wolfram.com/EllipticCylinder.html
Unfortunately, I do not understand much of it. I hope the dimensions given in my filledcylinder can act as a guide. Ultimately the dimension values do not matter, what matters is the code structure into which values can be entered.
As for the objects:
Let there be 50 Type A objects and 50 Type B objects of size x=0.4, y=0.4, z=0.4 (same units as in my graphical elliptical cylinder example).
All objects are to be distributed at random within the volume of the elliptical cylinder, with the exception that objects of Type A cannot overlap with another object of Type A, and objects of Type B cannot overlap with other objects of Type B. Type A objects may overlap with Type B objects.
I would like to output the number of Type A objects that overlap with any Type B object in the given volume, this number as a percentage of total Type A objects, and as a percentage of total all objects for each run of the simulation.
I do not know how to even start to do this.
If you can help, I'm afraid statistics, geometry and non-basic R expressions will need to be explained as if to a (not particularly bright) child.
Thank you very very much for your time!
An implementation with heavily commented code for explanations. This assumes the A- and B-type objects must be entirely within the elliptical cylinder.
library(data.table)
rObj <- function(rx, ry, h, n, dims, eps = 2) {
# Function to create a random sample (by rejection) of non-overlapping
# rectangular prism objects inside an elliptical cylinder whose ellipse is
# centered at x = 0, y = 0 and whose height ranges from -dims[3]/2 to h -
# dims[3]/2. The objects have dimensions (x, y, z) = dims, and all edges are
# parallel or orthogonal to each of the x, y, or z axes.
# INPUTS:
# rx: length of the ellipse
# ry: width of the ellipse
# h: height of the elliptical cylinder
# n: number of non-overlapping objects to return
# dims: dimensions of the rectangular prism objects (vector of length 3)
# eps: oversampling factor
# OUTPUT: a data.table with 3 columns and n rows. Each row gives the
# coordinates of the centroid of a sampled object
dt <- data.table()
while(nrow(dt) < n) {
# increase oversampling if it is not the first pass
if (nrow(dt)) eps <- eps*2
rho <- sqrt(runif(eps*n))
phi <- runif(eps*n, 0, 2*pi)
dt <- data.table(
# sample object centroids
# see https://stackoverflow.com/questions/5529148/algorithm-calculate-pseudo-random-point-inside-an-ellipse
# First, uniformly sample on an ellipse centered on x = 0, y = 0,
# with xlength = rx - dims[1] and ylength = ry - dims[2]
# (any object with a centroid outside of this ellipse will stick out of
# the elliptical cylinder, although some with a centroid within the
# smaller ellipse will still stick out of the elliptical cylinder).
x = (rx - dims[1])/2*rho*cos(phi),
y = (ry - dims[2])/2*rho*sin(phi),
# uniformly sample centroid heights
z = runif(eps*n, 0, h - dims[3])
)[
# remove objects that stick out of bounds
# The ellipse satisfies (x/(rx/2))^2 + (y/(ry/2))^2 = 1, which is the
# same as (x/rx)^2 + (y/ry)^2 = 0.25. Taking advantage of symmetry, add
# half of the x and y dimensions of the objects to the absolute value of
# x and y (the object corner furthest from the foci of the ellipse) and
# check if the result satisfies the standard equation.
((abs(x) + dims[1]/2)/rx)^2 + ((abs(y) + dims[2]/2)/ry)^2 < 0.25
][
# remove objects that overlap a previously placed object
# Since each rectangular prism object is oriented with the x, y, z axes,
# two objects overlap if they are closer than their lengths in each
# dimension.
tabulate(
sequence((.N - 1L):1, 2:.N)[ # row numbers (always keep the first row)
(dist(x) < dims[1]) & (dist(y) < dims[2]) & (dist(z) < dims[3])
],
.N
) == 0L
]
}
dt[1:n] # keep the first n objects
}
# function to get pairwise distances between two vectors
dist2 <- function(x, y) abs(outer(x, y, "-"))
fsim <- function(rx, ry, h, nA, nB, dimA, dimB, nreps, eps = 2) {
# function to simulate placement of A and B rectangular prism objects inside
# an elliptical cylinder and count the number of A-type objects that
# intersect at least one B-type object. All object edges are parallel or
# orthogonal to each of the x, y, or z axes.
# INPUTS:
# rx: length of the ellipse
# ry: width of the ellipse
# h: height of the elliptical cylinder
# nA: number of non-overlapping A-type objects to return
# nB: number of non-overlapping B-type objects to return
# dimX: dimensions of the rectangular prism objects (vector of length 3)
# nreps: the number of replications to simulate
# eps: oversampling factor when randomly sampling non-overlapping objects
# by rejection
# OUTPUT: vector of length "nreps" giving the number of A-type objects that
# intersect at least one B-type object for each replication
dims <- rowMeans(cbind(dimA, dimB)) # average dimensions of the A and B objects
out <- integer(nreps) # initialize the output vector
# repeat the simulation "nreps" times
for (i in 1:nreps) {
# get the coordinates of the A- and B-type objects' centroids
A <- rObj(rx, ry, h, nA, dimA, eps)
B <- rObj(rx, ry, h, nB, dimB, eps)
# count the number of A-type objects that intersect at least one B-type
# object
out[i] <- sum(rowSums((dist2(A$x, B$x) < dims[1])*(dist2(A$y, B$y) < dims[2])*(dist2(A$z, B$z) < dims[3])) != 0L)
}
out
}
Time 10K simulation replications:
system.time(overlaps <- fsim(9, 5, 2, 50L, 50L, rep(0.4, 3), rep(0.4, 3), 1e4L))
#> user system elapsed
#> 27.19 0.25 27.67
mean(overlaps)
#> [1] 18.7408
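The question also asks for the overlap count as a percentage of the Type A objects and of all objects. A small sketch of that post-processing step follows; summarise_overlaps is a hypothetical helper name, and the demo counts are made-up stand-ins for the vector returned by fsim().

```r
# Hypothetical helper: turn per-replication overlap counts into the
# summaries asked for in the question.
summarise_overlaps <- function(overlaps, nA, nB) {
  data.frame(
    rep        = seq_along(overlaps),
    n_overlap  = overlaps,                      # A objects touching any B
    pct_of_A   = 100 * overlaps / nA,           # share of Type A objects
    pct_of_all = 100 * overlaps / (nA + nB)     # share of all objects
  )
}

# Stand-in counts; in practice pass the vector returned by fsim().
demo <- summarise_overlaps(c(18L, 20L, 17L), nA = 50, nB = 50)
demo
colMeans(demo[, -1])   # averages across replications
```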
One approach to get an approximate answer to this problem is to discretize things. Set up a volume as a 3 dimensional array of zeros, then randomly generate the parameters of your shapes one at a time.
For each generated shape, find all the elements of the array that would be inside the shape. If any locations would be outside the cylinder or overlap a shape of the same type, try again. Once you have a legal shape, mark those array entries (e.g. 1 for type A, 2 for type B). Do all type A first, then all type B, and keep count of the times when shape B occupies a space that was previously marked for shape A.
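A rough base-R sketch of this discretized approach is given below. All names (simulate_voxels, cell, max_tries) are hypothetical, and it makes simplifying assumptions: objects are axis-aligned cubes snapped to the grid, rx and ry are the full length and width of the ellipse, and instead of marking cells 1 or 2 it stores the id of the Type A object in each cell so that distinct A objects can be counted.

```r
# Hypothetical sketch of the voxel approach; cubes are snapped to the grid,
# so the result only approximates the continuous problem.
simulate_voxels <- function(rx = 9, ry = 5, h = 2, nA = 50, nB = 50,
                            size = 0.4, cell = 0.1, max_tries = 1e5) {
  xs <- seq(-rx / 2 + cell / 2, rx / 2 - cell / 2, by = cell)  # cell centres
  ys <- seq(-ry / 2 + cell / 2, ry / 2 - cell / 2, by = cell)
  nz <- round(h / cell)
  k  <- round(size / cell)                 # cells per cube edge
  # TRUE where a cell centre lies inside the ellipse
  inside <- outer(xs, ys, function(x, y) (2 * x / rx)^2 + (2 * y / ry)^2 <= 1)
  a_id  <- array(0L, dim = c(length(xs), length(ys), nz))  # id of A object
  b_occ <- array(FALSE, dim = dim(a_id))                   # occupied by B?
  hit_a <- logical(nA)

  # Rejection sampling: draw a cube position until it is legal
  place <- function(is_free) {
    for (attempt in seq_len(max_tries)) {
      ix <- sample(length(xs) - k + 1, 1) + 0:(k - 1)
      iy <- sample(length(ys) - k + 1, 1) + 0:(k - 1)
      iz <- sample(nz - k + 1, 1) + 0:(k - 1)
      if (all(inside[ix, iy]) && is_free(ix, iy, iz)) return(list(ix, iy, iz))
    }
    stop("could not place object; try fewer objects or a finer grid")
  }

  for (i in seq_len(nA)) {                 # all Type A first
    p <- place(function(ix, iy, iz) all(a_id[ix, iy, iz] == 0L))
    a_id[p[[1]], p[[2]], p[[3]]] <- i
  }
  for (j in seq_len(nB)) {                 # then all Type B
    p <- place(function(ix, iy, iz) !any(b_occ[ix, iy, iz]))
    b_occ[p[[1]], p[[2]], p[[3]]] <- TRUE
    hit <- unique(a_id[p[[1]], p[[2]], p[[3]]])
    hit_a[hit[hit > 0L]] <- TRUE           # record which A objects were hit
  }
  sum(hit_a)                               # A objects overlapped by any B
}

set.seed(42)
simulate_voxels()
```

A finer grid (smaller cell) brings the result closer to the continuous case at the cost of memory and time. Objects that merely touch along a face share no cells, so they are not counted as overlapping here.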

How to create 3D mesh using extracted LiDAR points in as.mesh3d function from rgl package in R

I am trying to create a 3D mesh of a specific building from points that I extracted from a lidar point cloud. I created a matrix from the x, y and z values to feed into the as.mesh3d function from the rgl package. Since it's from a lidar survey, I have 27,000+ points for this one building. I run into an error when I try to create the mesh. I've copied in a sample of 20 points from the point cloud:
X <- c(1566328,1566328,1566328,1566328,1566328,1566327,1566327,1566327,
1566327,1566327,1566327,1566327,1566327,1566327,1566327,1566327,
1566326,1566326,1566326,1566326)
Y <- c(5180937,5180937,5180936,5180935,5180936,5180937,5180937,5180936,
5180936,5180935,5180935,5180935,5180936,5180936,5180937,5180938,
5180938,5180937,5180936,5180936)
Z <- c(19.92300028,19.98300046,19.93700046,19.88099962,19.93500046,19.99500046,
20.00400046,20.00600046,19.97199962,19.92499962,19.95400046,
19.99099991,20.01199991,19.97600020,19.95800008,19.93200008,
19.95300008,19.94800008,19.94300020,19.98399991)
#created a matrix
xyz <- matrix(c(X, Y, Z), byrow = TRUE, ncol = 3)
The problem arises when I try to create the mesh using as.mesh3d():
mesh <- as.mesh3d(xyz, y = NULL, Z = NULL, type = "triangle", col = "red")
This is what I get: Error in as.mesh3d.default(xyz, y = NULL, Z = NULL, type = "triangle", : Wrong number of vertices
The same error happens for the original dataset of 27000+ points despite all being of the same length.
I'm really not advanced in R and was hoping I could get some advice or solutions on how to get past this.
Thank you
The as.mesh3d function assumes the points are already organized as triangles. Since you're giving it 20 points, that's not possible: it needs a multiple of 3 points.
There's a problem with your calculation of xyz: you say byrow = TRUE, but you're specifying values by column. Using
xyz <- cbind(X, Y, Z)
would work.
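To see why byrow = TRUE scrambles the coordinates, here is a small sketch with made-up values (not the lidar sample) where each vector is easy to recognise:

```r
# Made-up coordinates chosen so that X, Y and Z are easy to tell apart
X <- c(1, 2, 3, 4)
Y <- c(10, 20, 30, 40)
Z <- c(100, 200, 300, 400)

# byrow = TRUE fills row by row from the concatenated vector, so the first
# row is the first three X values, not one (X, Y, Z) point.
wrong <- matrix(c(X, Y, Z), byrow = TRUE, ncol = 3)
wrong[1, ]

# cbind() puts each vector in its own column: one point per row.
right <- cbind(X, Y, Z)
right[1, ]

# Filling by column (the default) gives the same values as cbind()
identical(unname(right), matrix(c(X, Y, Z), ncol = 3))
```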
If I plot all of your points using text3d(xyz, text=1:20), it looks as though there are a lot of repeats.
There are several ways to triangulate those points, but they depend on assumptions about the surface. For example, if you know there is only one Z value for each (X, Y) pair, you could use as.mesh3d.deldir (see the help page) to triangulate. Here's the code and output for your sample:
dxyz <- deldir::deldir(X - mean(X), Y - mean(Y), z = Z)
# Warning message:
# In deldir::deldir(X - mean(X), Y - mean(Y), z = Z) :
# There were different z "weights" corresponding to
# duplicated points.
persp3d(dxyz, col = "red")
I had to subtract the means from X and Y because rounding errors caused it to look very bad without that: rgl does a lot of things in single precision, which only gives 7 or 8 decimal place accuracy.

Find the common area between two graphs with multiple intersection points

I have the following simulated data for two variables, and I created the density plots as follows:
set.seed(1)
x1=density(rnorm(100,0.5,3))
x2=density(rnorm(100,1,3))
plot(x1)
lines(x2)
Is there a function in R that I can use to find the common area under these two curves?
Do I need to integrate between the intersection points?
Thank you
If you set the sequence both densities use for x values to be identical, you can use pmin on the y values. (Call str(x1) to see how they're stored.) For instance, to see how it works:
set.seed(1)
x1 <- density(rnorm(100,0.5,3), from = -10, to = 10, n = 501)
x2 <- density(rnorm(100,1,3), from = -10, to = 10, n = 501)
plot(x2, main = 'Density intersection')
lines(x1)
polygon(x1$x, pmin(x1$y, x2$y), density = 20, col = 'dodgerblue')
Taking the integral means just multiplying each pmin times the increment in the x sequence and summing the lot:
sum(pmin(x1$y, x2$y) * diff(x1$x[1:2]))
#> [1] 0.896468
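If this is needed more than once, the same idea can be wrapped in a function. The sketch below is a variation rather than the exact calculation above: overlap_area is a hypothetical name, and it uses the trapezoid rule instead of a plain rectangle sum. Both samples must be evaluated on the same grid (from, to, n), and the range should be wide enough to cover both densities.

```r
# Hypothetical helper: overlap area of two kernel density estimates,
# integrated with the trapezoid rule on a shared grid.
overlap_area <- function(a, b, from, to, n = 501) {
  da <- density(a, from = from, to = to, n = n)
  db <- density(b, from = from, to = to, n = n)
  y  <- pmin(da$y, db$y)            # lower envelope of the two curves
  dx <- diff(da$x)                  # grid spacing
  sum(dx * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(1)
a <- rnorm(100, 0.5, 3)
b <- rnorm(100, 1, 3)
overlap_area(a, b, from = -10, to = 10)
```

As a sanity check, the overlap of a sample with itself should be close to 1, and two samples that are far apart should give a value close to 0.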

Maximum at any point of two lines in R

Suppose you have two lines, L1 and L2, which have known points at each x value (x1 and x2, for example): L1={(x1,L1_y1), (x2,L1_y2)} and L2={(x1,L2_y1), (x2,L2_y2)}. When these points are joined, the lines may or may not intersect at some x3 where x1 < x3 < x2.
Now suppose you want to know the maximum at any x value (not restricted to just x1, x2 etc, but anywhere along the axis) of both of these lines. Obviously it is often trivial to calculate for just a few lines, and a few different x value, but in my case I have several tens of thousand x values and a few lines to check it against, so it can't be done manually.
In R, is there some code which will calculate the maximum at any given point x3?
An example of this can be seen here with L1={(1,1), (2,4)}, and L2={(1,4),(2,1)}, illustrated by:
Here the intersection of these lines is at (1.5, 2.5). L2 is the maximum before this, and L1 after. This maximum line is shown in red below.
As you can see, it isn't enough just to take the max at every point and join these up, and so it will need to consider the lines as some form of function, and then take the maximum of this.
Also, as mentioned before, there are several thousand x values, so it will need to generalise to larger data.
To test the code further if you wish you can randomly generate y values for some x values, and it will be clear to see from a plot if it works correctly or not.
Thanks in advance!
First, define the points that make up your lines from the example:
L1 <- list(x = c(1, 2), y = c(1, 4))
L2 <- list(x = c(1, 2), y = c(4, 1))
Then define a function that takes the pointwise maximum of the two piecewise-linear functions corresponding to the lines:
myMax <- function(x)
pmax(approxfun(L1$x, L1$y)(x), approxfun(L2$x, L2$y)(x))
This gives
plot(L1$x, L1$y, type = 'l')
lines(L2$x, L2$y, col = 'red')
curve(myMax(x), from = 1, to = 2, col = 'blue', add = TRUE)
Clearly this extends to more complex L1 and L2 as approxfun is just a piecewise-linear approximation. Also, you may add L3, L4, and so on.
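As a sketch of that extension, the lines can be kept in a list and the pointwise maximum taken over all of them. The name upper_envelope and the third line are hypothetical additions for illustration; each line is assumed to be given as x and y vectors covering the x range being queried.

```r
# Lines from the example plus a hypothetical third, horizontal line
lines_list <- list(
  list(x = c(1, 2), y = c(1, 4)),
  list(x = c(1, 2), y = c(4, 1)),
  list(x = c(1, 2), y = c(3, 3))
)

# Hypothetical helper: returns a function giving the maximum over all lines
upper_envelope <- function(lines_list) {
  # one piecewise-linear interpolator per line
  funs <- lapply(lines_list, function(l) approxfun(l$x, l$y))
  # evaluate every line at x, then take the pointwise maximum
  function(x) do.call(pmax, lapply(funs, function(f) f(x)))
}

env <- upper_envelope(lines_list)
env(c(1, 1.5, 2))   # maximum at the ends and at the crossing of L1 and L2
```

Because env is vectorised, it can be evaluated at tens of thousands of x values in one call. Outside the x range of a line, approxfun returns NA by default.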

Adding bias in Taylor diagram in R

I am using the taylor.diagram function in the plotrix package e.g.
obs = runif(100,1,100)
mod1 = runif(100,1,100)
mod2 = runif(100,1,100)
mod3 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.diagram(obs,mod2,add=TRUE)
taylor.diagram(obs,mod3,add=TRUE)
In the conventional Taylor diagram there is no bias, but in his paper (Taylor, K.E., 2001. Summarizing multiple aspects of model performance in a single diagram. JGR, 106, 7183-7192) Taylor says that
"Although the diagram has been designed to convey information about centered pattern differences it is also possible to indicate differences in overall means (i.e., the bias). This can be done on the diagram by attaching to each plotted point a line segment drawn at a right angle to the straight line defined by the point and the reference point. If the length of the attached line segment is equal to the bias, then the distance from the reference point to the end of the line segment will be equal to the total (uncentered) RMS error"
I admit that I don't know where to start to try and do this. Has anyone succeeded at adding this information on the plot?
If I understand correctly the bias is the difference in means between the model vector and the observation vector. Then, the problem is to, (a) find the line between the observation and model points, (b) find a line perpendicular to this line, (c) find a point along the perpendicular line, at a distance from the model point equal to the bias.
One possible solution is:
taylor.bias <- function(ref, model, normalize = FALSE){
R <- cor(model, ref, use = "pairwise")
sd.f <- sd(model)
sd.r <- sd(ref)
m.f <- mean(model)
m.r <- mean(ref)
## normalize if requested
if (normalize) {
m.f <- m.f/sd.r
m.r <- m.r/sd.r
sd.f <- sd.f/sd.r
sd.r <- 1
}
## calculate bias
bias <- m.f - m.r
## coordinates for model and observations
dd <- rbind(mp = c(sd.f * R, sd.f * sin(acos(R))), rp = c(sd.r, 0))
## find equation of line passing through pts
v1 <- solve(cbind(1, dd[,1])) %*% dd[,2]
## find perpendicular line
v2 <- c(dd[1,2] + dd[1,1]/v1[2], -1/v1[2])
## find point defined by bias
nm <- dd[1,] - c(0, v2[1])
nm <- nm / sqrt(sum(nm^2))
bp <- dd[1,] + bias*nm
## plot lines
arrows(x0 = dd[1,1], x1 = bp[1], y0 = dd[1,2], y1 = bp[2], col = "red", length = 0.05, lwd = 1.5)
lines(rbind(dd[2,], bp), col = "red", lty = 3)
lines(dd, col = "red", lty = 3)
}
Then,
library(plotrix)
obs = runif(100,1,100)
mod1 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.bias(obs,mod1)
Here the length of the red vector indicates the bias, and the length of the dotted line joining the vector's tip to the reference point is the RMS error. The direction of the red vector indicates the sign of the bias -- in the picture below, negative bias.
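The geometric relationship quoted from Taylor's paper can be checked numerically. The sketch below uses made-up data with an artificial offset and a hypothetical helper pop_sd, which divides by n rather than n - 1 so that the identities hold exactly (sd() in R divides by n - 1, so values from the function above differ slightly).

```r
set.seed(1)
obs <- runif(100, 1, 100)
mod <- runif(100, 1, 100) + 10     # made-up model with an artificial offset

# Hypothetical helper: population standard deviation (divide by n)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

bias <- mean(mod) - mean(obs)                                  # mean difference
crms <- sqrt(mean(((mod - mean(mod)) - (obs - mean(obs)))^2))  # centred RMS
rms  <- sqrt(mean((mod - obs)^2))                              # total RMS

# Right angle between the bias segment and the centred RMS segment
all.equal(rms^2, crms^2 + bias^2)

# Law of cosines behind the diagram itself
R <- cor(mod, obs)
all.equal(crms^2,
          pop_sd(mod)^2 + pop_sd(obs)^2 - 2 * pop_sd(mod) * pop_sd(obs) * R)
```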
