Find the common area between two graphs with multiple intersection points - r

I have simulated data for the following two variables and created the density plots as follows:
set.seed(1)
x1=density(rnorm(100,0.5,3))
x2=density(rnorm(100,1,3))
plot(x1)
lines(x2)
Is there any function I can use to find the common area under these two curves in R?
Do I need to perform an integration between the intersection points?
Thank you

If you make both densities use the same sequence of x values, you can take pmin of the y values. (Call str(x1) to see how they're stored.) For instance, to see how this works:
set.seed(1)
x1 <- density(rnorm(100,0.5,3), from = -10, to = 10, n = 501)
x2 <- density(rnorm(100,1,3), from = -10, to = 10, n = 501)
plot(x2, main = 'Density intersection')
lines(x1)
polygon(x1$x, pmin(x1$y, x2$y), 20, col = 'dodgerblue')
Taking the integral then just means multiplying each pointwise minimum by the increment of the x sequence and summing the lot:
sum(pmin(x1$y, x2$y) * diff(x1$x[1:2]))
#> [1] 0.896468
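As a cross-check (a sketch, not part of the original answer; it assumes the same from/to range used above), the same overlap can be approximated by integrating the pointwise minimum of the two interpolated densities:
# Interpolate each density and integrate their pointwise minimum over the
# shared range; the result should be close to the Riemann sum above.
f_min <- function(t) pmin(approxfun(x1$x, x1$y)(t), approxfun(x2$x, x2$y)(t))
integrate(f_min, lower = -10, upper = 10)$value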

Related

Computing the Tukey median

I am trying to compute the data depth of two variables with the following function:
library(depth)
x <- data.frame(data$`math score`, data$`reading score`)
depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-8, ndir = 1000)
The first argument after depth is u, which the documentation describes as the "Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations."
I have 1000 observations however I get the following error message:
Error in depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-08, :
Dimension mismatch between the data and the point u.
Does someone know how to solve this issue?
Thank you in advance!
If you look at the documentation for the function depth, it says:
u    Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations.
So u has to be a point in multidimensional space, represented by a vector with n components, whereas x has to be an m-by-n matrix or data frame (m rows for m points). You are comparing u to all the other multidimensional points in the set x to find the minimum number of points that could share a half-space with u.
Let's create a simple example in two-dimensional space:
library(depth)
set.seed(100)
x <- data.frame(x = c(rnorm(10, -5, 2), rnorm(10, 5, 2)), y = rnorm(20, 0, 2))
plot(x)
The depth function calculates the depth of a particular point relative to the data. So let's use the origin:
u <- data.frame(x = 0, y = 0)
points(u, col = "red", pch = 16)
Naively we might think that the origin here has a depth of 10/20 points (i.e. the most obvious way to partition this dataset is a vertical line through the origin, with 10 points on each side), but instead we find:
depth(u, x)
#> [1] 0.35
This indicates that there is a half-space including the origin that only contains 0.35 of the points, i.e. 7 points out of 20:
depth(u, x) * nrow(x)
#> [1] 7
And we can see that visually like this:
abline(0, -0.07)
points(x[x$y < (-0.07 * x$x),], col = "blue", pch = 16)
Where we have coloured these 7 points blue.
So it's not clear what result you expect from the depth function, but you will need to give it a point of the form c(math_score, reading_score), where math_score and reading_score are the test values for which you want to know the depth.
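For example, a minimal sketch (the scores here are hypothetical, and it assumes x is the two-column data frame built from your data as in the question):
u <- c(70, 75)                  # a hypothetical (math score, reading score) point
depth(u, x, method = "Tukey")   # Tukey depth of that point within your data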

Calculate raster with each cell equals the mean of all adjacent cells

I am working on an ecological problem, involving species distribution models. I have a raster which is essentially a landscape of probabilities of presence per cell, so to speak. I want to calculate a new raster, based on the old one, where each cell is equal to the mean of itself and all 8 adjacent cells. This is not the same as aggregating the cells by mean, which results in the border between the newly aggregated cells being calculated incorrectly.
I can do this with the bit of code provided, but the raster I am working with is way, way too big to run this calculation, as it uses too much memory. If I subdivide the raster, it will still take days to do. Does anyone have a more efficient way of calculating this? I have created a small version of the raster as an example, albeit somewhat clumsily:
require(raster)
## create raster called "ras" rather clumsily
# (UTM coordinates and a probability value for each cell, not really
# important)
s.x = seq(249990, by = 30, length.out = 20)
s.y = seq(6189390, by = 30, length.out = 20)
x.l = lapply(1:20, function(x){
  rep(s.x[x], 20)
})
x.l2 = as.vector(c(x.l[[1]], x.l[[2]], x.l[[3]], x.l[[4]], x.l[[5]],
                   x.l[[6]], x.l[[7]], x.l[[8]], x.l[[9]], x.l[[10]],
                   x.l[[11]], x.l[[12]], x.l[[13]], x.l[[14]], x.l[[15]],
                   x.l[[16]], x.l[[17]], x.l[[18]], x.l[[19]], x.l[[20]]))
df = as.data.frame(cbind(x.l2, rep(s.y, 20), rnorm(20*20, 0.5, 0.2)))
colnames(df) = c("x", "y", "P")
coordinates(df) <- ~ x + y
gridded(df) <- TRUE
ras = raster(df)
# for each cell, make a vector of the values at
# the cell and all <=8 adjacent cells:
vl = lapply(1:length(ras), function(x){
  extract(ras,
          c(x, adjacent(ras, x, directions = 8, pairs = FALSE, sorted = FALSE)))
})
# find the mean for each cell
vm = sapply(1:length(ras), function(x){
  as.vector(mean(vl[[x]], na.rm = TRUE))
})
# create raster template
templ = ras/ras
# multiply into template for new raster
ras = vm*templ
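One possibly more memory-friendly direction (a sketch, not part of the original question) is raster's built-in moving-window function focal(), which avoids building a per-cell list of neighbour values and can write its result to disk via its filename argument:
# 3x3 window mean, including the focal cell; na.rm = TRUE mimics the
# "mean of available neighbours" behaviour of the adjacent() approach above
ras_mean = focal(ras, w = matrix(1, 3, 3), fun = mean,
                 na.rm = TRUE, pad = TRUE, padValue = NA)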

Calculate all distances between two set of points using st_distance

I have two sets of points stored in R as sf objects. Point object x contains 204,467 points and point object y contains 5,297 points.
In theory, I would want to calculate the distance from all points in x to all points in y. I understand that this would create a beast of a matrix, but it is doable using st_distance(x, y, by_element=FALSE) in the sf package in about 40 minutes on my i7 desktop.
What I want to do is to calculate the distance from all of the points in x to all of the points in y, then convert this into a data.frame that contains all variables for the respective x and y pairs of points. This is because I want flexibility in terms of aggregation using dplyr; for instance, I want to find the number of points in y that are within 10, 50, or 100 km of each point in x, and where x$year < y$year.
I successfully created the distance matrix, which has around 1,083,061,699 cells. I know this is a very inefficient way of doing this, but it gives flexibility in terms of aggregation. Other suggestions are welcome.
Below is code to create two sf point objects, and measure the distance between them. Next, I would want to convert this into a data.frame with all variables from x and y, but this is where I fail to proceed.
If my suggested workflow is unfeasible, can someone provide an alternative solution to measure distance to all points within a predefined radius, and create a data.frame of the result with all variables from x and y?
# Create two sf point objects
set.seed(123)
library(sf)
pts1 <- st_as_sf(x = data.frame(
                   id = seq(1, 204467, 1),
                   year = sample(seq(from = 1990, to = 2018, by = 1), size = 204467, replace = TRUE),
                   xcoord = sample(seq(from = -180, to = 180, by = 1), size = 204467, replace = TRUE),
                   ycoord = sample(seq(from = -90, to = 90, by = 1), size = 204467, replace = TRUE)),
                 coords = c("xcoord", "ycoord"), crs = 4326)
pts2 <- st_as_sf(x = data.frame(
                   id = seq(1, 5297, 1),
                   year = sample(seq(from = 1990, to = 2018, by = 1), size = 5297, replace = TRUE),
                   xcoord = sample(seq(from = -180, to = 180, by = 1), size = 5297, replace = TRUE),
                   ycoord = sample(seq(from = -90, to = 90, by = 1), size = 5297, replace = TRUE)),
                 coords = c("xcoord", "ycoord"), crs = 4326)
distmat <- st_distance(pts1, pts2, by_element = FALSE)
I would consider approaching this differently. Once you have your distmat matrix, you can do the types of calculation you describe without needing a data.frame. You can use standard subsetting to find which points meet your specified criteria.
For example, to find the combinations of points where pts1$year is greater than pts2$year we can do:
subset_points = outer(pts1$year, pts2$year, `>`)
Then, to find how many of these pairs are separated by more than 100 km, we can do
library(units)
sum(distmat[subset_points] > (100 * as_units('km', 1)))
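If you do want a data.frame of qualifying pairs with the attributes from both layers, one possible sketch (not from the original answer) only materialises the pairs that meet the criteria, here pairs within 100 km that also satisfy subset_points, so it stays much smaller than the full ~1e9-cell matrix:
# indices of (pts1, pts2) pairs that satisfy both conditions
idx <- which(subset_points & distmat < 100 * as_units('km', 1), arr.ind = TRUE)
# bind the attributes of the matching rows; duplicate column names get suffixed
pairs_df <- data.frame(st_drop_geometry(pts1)[idx[, 1], ],
                       st_drop_geometry(pts2)[idx[, 2], ])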
A note on memory usage
However you approach this with sf or data.frame objects, the chances are that you will start to bump up against RAM limits with ~1e9 floating-point values in each matrix or column of a data.table. You might think about instead converting your distance matrix to a raster. Then the raster can be stored on disk rather than in memory, and you can utilise the memory-safe functions in the raster package to crunch your way through.
How we might use rasters to work from disk and save RAM
We can use memory-safe raster operations on these very large matrices, for example:
library(raster)
# convert our matrices to rasters, so we can work on them from disk
r = raster(matrix(as.numeric(distmat), length(pts1$id), length(pts2$id)))
s = raster(subset_points)
remove('distmat', 'subset_points')
# now create a raster equal to r, but with zeroes in the cells we wish to exclude from calculation
rs = overlay(r,s,fun=function(x,y){x*y}, filename='out1.tif')
# find which cells have value greater than x (1e6 in the example)
Big_cells = reclassify(rs, matrix(c(-Inf, 1e6, 0, 1e6, Inf, 1), ncol=3, byrow=TRUE), 'out.tiff', overwrite=T)
# and finally count the cells
N = cellStats(Big_cells, sum)

Maximum at any point of two lines in R

Suppose you have two lines, L1 and L2, with known points at each x value (x1 and x2, for example): L1 = {(x1, L1_y1), (x2, L1_y2)} and L2 = {(x1, L2_y1), (x2, L2_y2)}. Joining these points, the lines may or may not intersect at some x3, where x1 < x3 < x2.
Now suppose you want to know the maximum of these two lines at any x value (not restricted to just x1, x2, etc., but anywhere along the axis). Obviously this is often trivial to calculate for just a few lines and a few different x values, but in my case I have several tens of thousands of x values and a few lines to check against, so it can't be done manually.
In R, is there some code which will calculate the maximum at any given point x3?
An example of this can be seen with L1 = {(1, 1), (2, 4)} and L2 = {(1, 4), (2, 1)}.
Here the intersection of the two lines is at (1.5, 2.5); L2 is the maximum before this point, and L1 after it.
As you can see, it isn't enough just to take the max at each known x value and join these up; the lines need to be treated as some form of function, and the maximum taken of that.
Also, as mentioned before, since there are several thousand x values, the solution will need to generalise to larger data.
To test the code further, you can randomly generate y values for some x values; it will be clear from a plot whether it works correctly or not.
Thanks in advance!
Defining the points constituting your lines from the example:
L1 <- list(x = c(1, 2), y = c(1, 4))
L2 <- list(x = c(1, 2), y = c(4, 1))
and then defining a function that takes the pointwise maximum of the two interpolating functions corresponding to the lines:
myMax <- function(x)
  pmax(approxfun(L1$x, L1$y)(x), approxfun(L2$x, L2$y)(x))
This gives
plot(L1$x, L1$y, type = 'l')
lines(L2$x, L2$y, col = 'red')
curve(myMax(x), from = 1, to = 2, col = 'blue', add = TRUE)
Clearly this extends to more complex L1 and L2 as approxfun is just a piecewise-linear approximation. Also, you may add L3, L4, and so on.
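For more than two lines, one possible generalisation (a sketch, not part of the original answer) is to keep the interpolating functions in a list and reduce over them with pmax:
# hypothetical third line, just for illustration
L3 <- list(x = c(1, 2), y = c(2, 2))
fns <- lapply(list(L1, L2, L3), function(L) approxfun(L$x, L$y))
myMaxAll <- function(x) Reduce(pmax, lapply(fns, function(f) f(x)))
curve(myMaxAll(x), from = 1, to = 2, col = 'darkgreen', add = TRUE)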

Drawing a sample that changes the shape of the mother sample

Background:
I'm trying to modify the shape of a histogram resulting from an "Initial" large sample obtained using Initial = rbeta(1e5, 2, 3). Specifically, I want the modified version of the Initial large sample to have 2 additional smaller (in height) "humps" (i.e., another 2 smaller-height peaks in addition to the one that exists in the Initial large sample).
Coding Question:
I'm wondering how to manipulate sample() (maybe using its prob argument) in base R so that this command samples in a manner that puts the two additional humps around .5 and .6 on the x-axis?
Here is my current R code:
Initial = rbeta(1e5, 2, 3)               ## My initial large sample
hist(Initial)                            ## As seen, there is only one "hump", near
                                         ## (less than) .4 on the x-axis
Modified.Initial = sample(Initial, 1e4)  ## This is meant to be the modified version of
                                         ## the Initial with two additional "humps"
hist(Modified.Initial)                   ## Here, I need to see two additional "humps"
                                         ## near .5 and .6 on the x-axis
You can adjust the density distribution by combining it with beta densities that have the desired modes, giving a smoothed adjustment.
set.seed(47)
Initial = rbeta(1e5, 2, 3)
d <- density(Initial)
# Generate densities of beta distribution. Parameters determine center (0.5) and spread.
b.5 <- dbeta(seq(0, 1, length.out = length(d$y)), 50, 50)
b.5 <- b.5 / (max(b.5) / max(d$y)) # Scale down to max of original density
# Repeat centered at 0.6
b.6 <- dbeta(seq(0, 1, length.out = length(d$y)), 60, 40)
b.6 <- b.6 / (max(b.6) / max(d$y))
# Collect maximum densities at each x to use as sample probability weights
p <- pmax(d$y, b.5, b.6)
plot(p, type = 'l')
# Sample from density breakpoints with new probability weights
Final <- sample(d$x, 1e4, replace = TRUE, prob = p)
The effects on the histogram are subtle...
hist(Final)
...but are more obvious in the density plot.
plot(density(Final))
Obviously all adjustments are arbitrary. Please don't do terrible things with your power.
