Levelplot with incomplete data in R

Using the lattice package in R:
I have let myself get deep into a rabbit hole and now I need some help to get out.
I have some (expensive) data points that naturally live on a 32x32 grid, but I don't have all of the possible data points:
> str(data)
'data.frame': 53 obs. of 3 variables:
$ X: num 16 16 16 16 13 13 13 13 23 23 ...
$ Y: num 20 16 23 10 16 23 20 10 16 23 ...
$ Z: num 1558 1561 1555 1540 1538 ...
When I try to use levelplot like this,
> levelplot(data$Z ~ rbind(data$X, data$X) * rbind(data$Y, data$Y),
xlim=c(0.5, 32.5), ylim=c(0.5, 32.5))
the plot has the colored patches clustered in a way that is confusing to me (see the attached levelplot output).
What I would like to achieve is that I have one colored patch per 1-by-1 index pair corresponding to my data. Absent grid points can be left white.
I tried to understand the R documentation but have given up.
Further, I have tried building a full grid of dummy NA values and then filling in the relevant data points. Something like
> x <- seq(1, 32, length.out=32)
> y <- seq(1, 32, length.out=32)
> data <- expand.grid(X=x, Y=y)
> data$Z <- NA
> tmp <- res[selected_data, ]
> data[(data$X == tmp$X) & (data$Y == tmp$Y), 'Z'] <- tmp$Z
Error in `[<-.data.frame`(`*tmp*`, (data$X == tmp$Input_Channel) & (data$Y == :
replacement has 53 rows, data has 1024
Where res is the source of data points and selected_data is a vector of logicals used to select data from res. Anyway, this doesn't work.
Regardless, trying to make this latter approach work has been a wrong turn. I'd rather have a proper solution with levelplot than my failed workaround.

I found a workable solution which I share to help others:
> dataX <- c(seq(1, 32), rep(1, 32), tmp$X)
> dataY <- c(rep(1, 32), seq(1, 32), tmp$Y)
> dataZ <- c(rep(NA, 64), tmp$Z)
> levelplot(dataZ ~ dataX * dataY)
Adding the NAs in this manner gives the desired output (see the attached plot).
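For reference, an alternative route (my own suggestion, not part of the original post) is to build the full 32x32 grid with expand.grid() and merge(); cells without data keep Z = NA and are left blank by levelplot:
library(lattice)
full_grid <- expand.grid(X = 1:32, Y = 1:32)
full <- merge(full_grid, data, by = c("X", "Y"), all.x = TRUE)  # Z is NA wherever no point exists
levelplot(Z ~ X * Y, data = full, xlim = c(0.5, 32.5), ylim = c(0.5, 32.5))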

Related

fill NA raster cells using focal defined by boundary

I have a raster and a shapefile. The raster contains NA and I am filling the NAs using the focal function
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NA still left. So I do this:
na_count <- terra::freq(r_fill, value = NA)
while (na_count$count != 0) {
  r_fill <- terra::focal(r_fill, 5, mean, na.policy = "only", na.rm = TRUE)
  na_count <- terra::freq(r_fill, value = NA)
}
Once all NA's are filled, I clip the raster again using the shapefile
r_fill <- terra::crop(r_fill, v, mask = T, touches = T)
This is what my before and after looks like:
I wondered whether the while loop is an efficient way to fill the NAs, or, put differently, whether there is a way to determine up front how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while() loop altogether by making a better estimate of focal()'s w= argument in a world where r, the ground truth, isn't available. Were it available, we could readily derive a direct value for w:
r <- rast(system.file("ex/elev.tif", package="terra"))
# and its variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
layer value count
1 0 NA 256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=. Now, introducing another variant:
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate an edge length for w=? Turning to boundaries() with its default values (inner boundaries):
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, the sums and freqs suggest that while the total area of the NA regions (let's call them holes) is the same, they differ in number, and the holes in r2 are generally larger than those in r3. This is also clear from the first pair of freqs.
Now we drift into some voodoo, hocus pocus in pursuit of a better edge estimate
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
sqrt(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
Taking the larger of the two, i.e. the freq(r2_bi, ...) based ratio of 7.652087:
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, this is maybe still a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and perhaps to obviate the need for the while() loop.
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # ~4 squares of width 8 (32 runs), plus 3 extra runs of length 8
Bringing in v finally, and filling:
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1 , mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1 , mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while() approach, as your r2 looks better, more naturally transitioned, though the r3 looks fine. In my few, brief experiments with a window smaller than the hole, e.g. focal(r2, 9), I got the sense it would take 2 passes to fill, which suggests focal(r2, 5) would take 4.
I guess it would be worthwhile to further determine the fill:hole:raster proportions at which deploying a while() loop is the better choice.
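As a rough sketch of how the run-length idea above might be folded into a single focal() call (my own heuristic, not from the thread; the raster's own NA border also contributes runs, so treat the estimate as approximate):
estimate_w <- function(x) {
  runs    <- rle(as.vector(is.na(values(x))))       # cell values come in row-major order
  na_runs <- runs$lengths[runs$values]              # lengths of consecutive-NA runs
  w <- as.integer(names(which.max(table(na_runs)))) + 1
  if (w %% 2 == 0) w + 1 else w                     # focal() expects an odd window width
}
w <- estimate_w(r2)
r2_fill <- focal(r2, w, mean, na.policy = "only", na.rm = TRUE)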

Find coordinates within radius around many starting points in grid

I have a grid of 10x10m coordinates that I extracted from a raster. I have a set of 'starting points'. For each starting point, I want to find the location (coordinates) of cells within a 10-50m radius around it.
I am aware of functions to do this with a raster starting point, but additional analyses that I have not included here require that I perform the search from a grid of coordinates in the format shown below.
The code below achieves my aim, however the outer function produces vectors that are far too large (> 10 Gb) on my actual dataset (which is a grid of 9 million 10x10m cells, with 3000 starting points).
I am looking for alternatives that achieve the same result as the following (simplified) code, but do not require large vector storage or looping over each starting point separately.
library(raster)
library(tidyverse)
#Set up the mock raster
orig=raster(nrows=100, ncols=100)
res(orig)=10
vals <- rep(c(1, 2, 3, 1, 2, 3, 1, 3, 2), times = c(72, 72, 72, 72, 72, 72, 72, 72, 72))
values(orig) <- vals  # note: setValues(orig, vals) on its own would not modify orig
xygrid <- as.data.frame(orig, xy = TRUE) %>% .[,1:2]
head(xygrid)
x y
1 -175 85
2 -165 85
3 -155 85
4 -145 85
5 -135 85
6 -125 85
#the initial starting points
init_locs <- c(5, 10, 15, 20)
#calculate the distance to every surrounding cell from starting point
Rx <- outer(xygrid[init_locs, 1], xygrid[, 1], "-")
Ry <- outer(xygrid[init_locs, 2], xygrid[, 2], "-")
R <- sqrt(Rx^2+Ry^2) #overall distance
for (i in 1:length(R[,1])) {
expr2 <- (R[i,] > 10 & R[i,] <= 50) #extract the location of cells within 10-50m
inv <- xygrid[expr2,] #extract the coordinates of these cells
}
head(inv)
x y
15 -35 85
16 -25 85
17 -15 85
18 -5 85
22 35 85
23 45 85
(Raster and spatial data are not my specialty, but this made me think of a naive approach that might work acceptably. I don't know anything about the methods @Robert Hijmans mentioned, which are likely much more performant; I just thought this sounded like an interesting question to explore with basic methods.)
Approach
The main challenge here is that you have 9 million cells, but only around 80 of those will be within 50m of any given point. If you calculate all those cells' distances to 3,000 starting points and then filter for those under 50m, that's 9M x 3k = 27 billion calculations, and a gigantic data structure, almost all of which is unnecessary.
We can quickly get ~1,000x more efficient by splitting this into two problems -- first, what general region of potentially-within-50m-points should we look at, and second, what is the actual distance to the points in those regions?
We can precalculate a modestly sized <2MB hash table for step 1. Then, by joining it to our locations (a very fast operation), we can focus our calculations on the 1/1000th of points that have a chance of being within 50m. I arbitrarily split the original cells into 100 x 100 = 10k sectors, each sector holding 30x30 cells.
1. Creating hash table
For the hash table, I'll assign each point to a sector, somewhat arbitrarily as 30x30 cells, so we have 100x100 = 10k sectors. This could be tuned based on speed vs. memory tradeoffs.
max_dist = 30 # sector width, in cells
xygrid2 <- expand_grid(
x = seq(0, 2999, by = 1), # 3000x3000 location grid
y = seq(0, 2999, by = 1))
xygrid2$sector_x = xygrid2$x %/% max_dist # 100 x 100 sectors
xygrid2$sector_y = xygrid2$y %/% max_dist
y_range = max(xygrid2$sector_y) + 1
xygrid2$sector_num = xygrid2$sector_x*y_range + xygrid2$sector_y
We now have 10,000 sectors assigned. Now which sectors are adjacent to which others? In every case, the adjacent sectors follow the same pattern. In this case, I have 100 sectors across x, so the sectors adjacent to sector S will have sector numbers that vary from S by -101 -100 -99 -1 0 1 99 100 101. We can use this pattern to assign all the adjacencies instantaneously. For simplicity, I leave in sectors outside our range; they will be ignored later anyway.
sector_num_deltas <- rep(-1:1, times = 3) + rep(-1:1, each = 3) * y_range
distinct(xygrid2, sector_num) %>%
uncount(9) %>% # copy each row 9 times, one for each adjacency
mutate(sector_num_adj = sector_num + sector_num_deltas) -> adjacencies
2. Join and calculate
Now that we have that, the rest goes much faster, since we can do the calculations only on the 1/1000th of sectors that are nearby. With that, we can now identify the 240,000 points that are within 50m of the 3,000 starting positions in under 4 seconds:
# Here are 3,000 random starting locations
set.seed(42)
sample_starts <- xygrid2 %>%
slice_sample(n = 3000) %>%
mutate(sample_num = row_number())
# Join each location to all the adjacent sectors, and then add all the
# locations within those sectors, and then calculate distances.
sample_starts %>% # 3,000 starting points...
# join each position to the nine adjacent sectors = ~27,000 rows
left_join(adjacencies, by = "sector_num") %>%
# join each sector to the (30x30 = 900) cells in those sectors --> 24 million rows
# That's a lot, but it's only 1/1000th of the starting problem with
# 3k x 9M = 27 billion comparisons!
left_join(xygrid2, by = c("sector_num_adj" = "sector_num")) %>%
select(-contains("sector")) %>%
mutate(dist = sqrt((x.x-x.y)^2 + (y.x-y.y)^2)) %>%
filter(dist <= 5) -> result
The result tells us that our 3,000 sample starting points are within 5 grid cells (50m, at 10m per cell) of 242,575 cells, about 80 for each starting point.
result
# A tibble: 242,575 x 6
x.x y.x sample_num x.y y.y dist
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1069 140 1 1064 140 5
2 1069 140 1 1065 137 5
3 1069 140 1 1065 138 4.47
4 1069 140 1 1065 139 4.12
5 1069 140 1 1065 140 4
6 1069 140 1 1065 141 4.12
7 1069 140 1 1065 142 4.47
8 1069 140 1 1065 143 5
9 1069 140 1 1066 136 5
10 1069 140 1 1066 137 4.24
# … with 242,565 more rows
Here's a sample to see how that's working in a small corner of our data (a below is assumed to be a subset of result restricted to one sector's neighborhood):
ggplot(a %>% mutate(sample_grp = sector_num_adj %% 8 %>% as.factor),
aes(x.y, y.y, color = sample_grp)) +
geom_point(data = adjacencies %>% filter(sector_num_adj == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray80", shape = 21,
aes(x, y)) +
geom_point(data = adjacencies %>% filter(sector_num == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray70", shape = 21,
aes(x, y)) +
annotate("text", alpha = 0.5,
x = c(1725, 1750),
y = c(1960, 1940),
label = c("Lookup area", "sector of\nstarting location")) +
geom_point(size = 1) +
scale_color_discrete(guide = FALSE) +
coord_equal() -> my_plot
library(gganimate)
animate(
my_plot +
gganimate::view_zoom_manual(pan_zoom = -1, ease = "quadratic-in-out",
xmin = c(0, 1700),
xmax = c(3000, 1800),
ymin = c(0, 1880),
ymax = c(3000, 1980)),
duration = 3, fps = 20, width = 300)
Example data --- you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.
library(raster)
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)
dn <- 2.5 # min dist
dx <- 5.5 # max dist
The simplest approach would be to use pointDistance
p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]
plot(r)
points(pts)
But you will probably run out of memory with that, and it is inefficient to compute all distances. Instead, you can intersect the points with a buffer around the points of interest:
b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)
plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")
With terra it goes like this -- and it works well for large datasets with version 1.1-11, which should be on CRAN this week:
library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)
You can do similar things with sf.
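For completeness, a rough sketch of what the sf equivalent might look like (my own assumption, not code from this answer; it reuses p, xygrid, dn and dx from above):
library(sf)
pts_all  <- st_as_sf(xygrid, coords = c("x", "y"))
pts_locs <- st_as_sf(as.data.frame(p), coords = c("x", "y"))
# ring-shaped search area: outer buffers minus inner buffers
ring <- st_difference(st_union(st_buffer(pts_locs, dx)),
                      st_union(st_buffer(pts_locs, dn)))
x_sf <- pts_all[lengths(st_intersects(pts_all, ring)) > 0, ]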
Given that you have so many data points, you might want to start with removing all points that are clearly not of interest
xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})
xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2
And now you could run pointDistance as above on all data (or else inside the lapply function)
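A rough sketch of that in-lapply variant (my own illustration, reusing locs, xygrid, dn and dx from above):
xySel <- lapply(locs, function(i) {
  xy <- xygrid[i, ]
  s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx &
       xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
  cand <- xygrid[s, ]                                     # coarse box pre-filter
  d <- pointDistance(as.matrix(cand), as.matrix(xy), lonlat = FALSE)
  cand[d > dn & d < dx, ]                                 # keep the ring dn < d < dx
})
pts <- do.call(rbind, xySel)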
You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times it is wrong. Maybe it is true in your case. For others who stumble upon this question, here are two raster-based approaches.
With the raster package you could use extract(..., cellnumbers=TRUE) or adjacent. With adjacent, you would first make a weights matrix using one of the buffers made above:
buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0
And then use the weight matrix like this
a <- adjacent(r, locs, w, pairs=FALSE)
pts <- xyFromCell(r, a)
plot(r)
points(pts)
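The extract(..., cellnumbers=TRUE) route mentioned above might look roughly like this (a sketch under my own assumptions, reusing the ring polygons b made earlier):
e <- extract(r, b, cellnumbers = TRUE)          # one matrix per polygon
cells <- unlist(lapply(e, function(m) m[, 1]))  # first column holds the cell numbers
pts <- xyFromCell(r, cells)
plot(r)
points(pts)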
With terra you could use the cells method
d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)

Elbow/knee in a curve in R

I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer for my exact situation. Above you see the perplexity calculation for topic numbers 3 to 25 of a Latent Dirichlet Allocation model. I want to pick the most suitable of those values, i.e. find the elbow or knee of the curve. The values can be treated as a simple numeric vector, which looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like (image attached).
I would say that the elbow would be 13 or 16, but I'm not completely sure, and I want the exact number as an outcome. I saw in this paper that the knee formula is f''(x) / (1 + f'(x)^2)^1.5, which I tried like this, and it says it's 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this out. Would someone like to share how I could get the exact ideal topic number according to perplexity?
Found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)", so this code does the job:
d1 <- diff(perplex)
k <- which.max(abs(diff(d1) / diff(perplex[-1])))
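For reference, a minimal sketch of the plain "maximum absolute second derivative" rule applied directly to the vector, with the leading NAs dropped so the index maps back to a topic count (my own illustration, assuming unit spacing between topic numbers):
vals   <- perplex[!is.na(perplex)]        # perplexities for 3..25 topics
topics <- which(!is.na(perplex))          # the corresponding topic numbers
d2     <- diff(vals, differences = 2)     # discrete second derivative
topics[which.max(abs(d2)) + 1]            # +1 because d2[k] is centred on vals[k + 1]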

Reshape dataframe to multidimensional array

I have a data frame that has a xyz and another variable, A.
data.frame(xx,yy,zz,Amp)
xx yy zz Amp
1 63021.71 403205.0 1.181028516 1170
2 63021.71 403105.0 0.977028516 1381
3 63021.71 403105.0 0.861028516 807
4 63021.71 403105.0 0.784028516 668
5 53021.71 403105.0 0.620028516 19919
6 53021.71 403305.0 0.455028516 32500
7 53021.71 403105.0 0.446028516 32500
8 43021.71 403105.0 0.436028516 32500
9 43021.71 404105.0 0.426028516 32500
10 43021.71 403105.0 0.281028516 17464
First I want to create a regular grid for xyz.
Next I want to fill this grid with Amp values.
I would like to do this using arrays.
Any help would be much appreciated.
I would like the final result to look like this:
dim(Amp)
10 10 10
You do not have enough data in your MWE to create a 10x10x10 array without interpolation. Currently you have 3 unique xx values, 4 unique yy values, and 10 unique zz values. So you could create a 3x4x10 array, but you don't have enough values in Amp to assign to each point in a 3x4x10 3D regular grid. You only have 10 Amp values, describing 10 unique points in 3D space. A 3x4x10 regular grid array would have 120 Amp values, one for each point in the grid. Furthermore, values in a regular grid are equally spaced in each dimension and your yy and zz values are not equally spaced.
Check the spacing in each dimension:
> diff(sort(unique(xx)))
[1] 10000 10000
> diff(sort(unique(yy)))
[1] 100 100 800
> diff(sort(unique(zz)))
[1] 0.145 0.010 0.010 0.009 0.165 0.164 0.077 0.116 0.204
The current MWE looks like this in 3D:
library(rgl)
plot3d(xx,yy,zz, col="red")
To form a 10x10x10 regular grid, you need to convert your dataset into one that has 1000 coordinate points and Amp values. I'm not exactly sure how you'd like to do this given your MWE, but here's an example given the current data:
# MWE data
xx = c(63021.71,63021.71,63021.71,63021.71,53021.71,53021.71,53021.71,43021.71,43021.71,43021.71)
yy = c(403205,403105,403105,403105,403105,403305,403105,403105,404105,403105)
zz = c(1.181028516,0.977028516,0.861028516,0.784028516,0.620028516,0.455028516,0.446028516,0.436028516,0.426028516,0.281028516)
Amp = c(1170,1381,807,668,19919,32500,32500,32500,32500,17464)
# create equally-spaced vectors of 10 values in each dimension
xx <- seq(min(xx), max(xx), length.out = 10)
yy <- seq(min(yy), max(yy), length.out = 10)
zz <- seq(min(zz), max(zz), length.out = 10)
# fake up some Amp data points
set.seed(123)
Amp <- runif(1000, min = min(Amp), max=max(Amp))
# directly create a 10x10x10 regular grid of Amp values as an array
dfa <- array(data = Amp,
dim = c(10,10,10),
dimnames = list(xx,yy,zz)
)
> dim(dfa)
[1] 10 10 10
# Alternatively, make a data.frame first
df <- data.frame(expand.grid(xx,yy,zz))
names(df) <- c("xx","yy","zz")
df$Amp <- Amp
dfa <- array(data = df$Amp,
dim=c(length(unique(df$xx)),
length(unique(df$yy)),
length(unique(df$zz))),
dimnames=list(unique(df$xx), unique(df$yy), unique(df$zz))
)
# you'll want to verify that the Amp values were assigned to the correct xyz coordinates.
# Here's a little function to help:
get_arr_loc = function(x, y, z) {
  x + (y - 1) * 10 + (z - 1) * 100
}
# and some arbitrary coordinates checked. This could be done in a more systematic way...
> df[get_arr_loc(1,1,1), "Amp"] == dfa[1,1,1]
[1] TRUE
> df[get_arr_loc(10,2,1), "Amp"] == dfa[10,2,1]
[1] TRUE
> df[get_arr_loc(3,6,9), "Amp"] == dfa[3,6,9]
[1] TRUE
> df[get_arr_loc(10,10,10), "Amp"] == dfa[10,10,10]
[1] TRUE
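If, instead of faking a full set of Amp values, you only want the original ten observations placed into the largest regular grid the data supports (3x4x10, as noted above), here is a minimal sketch (my own addition; re-run the four MWE data lines first, since the seq() calls above overwrote xx, yy and zz):
xs <- sort(unique(xx)); ys <- sort(unique(yy)); zs <- sort(unique(zz))
amp_arr <- array(NA_real_, dim = c(length(xs), length(ys), length(zs)),
                 dimnames = list(xs, ys, zs))
# matrix indexing places each observed Amp at its (x, y, z) grid position
amp_arr[cbind(match(xx, xs), match(yy, ys), match(zz, zs))] <- Amp
dim(amp_arr)
# [1] 3 4 10  -- NA everywhere except the ten observed points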

Find two densities' point of intersection in R

I have two densities that overlap as seen in the attached picture. I want to find out where the two lines meet. How would I go about doing that?
This is the code that produced the image:
... #reading in files etc.
pdf("test-plot.pdf")
d1 <- density(somedata)
d2 <- density(someotherdata)
plot(d1)
par(col="red")
lines(d2)
dev.off()
The original data is just two monodimensional vectors, so what I'm interested in is the intersection point of their densities.
I tried to use the solution shown here, but unfortunately it neither gives me a number nor even draws the lines correctly.
Edit: I have found what I was looking for:
# create and plot example data
set.seed(1)
plotrange <- c(-1,8)
d1 <- density(rchisq(1000, df=2), from=plotrange[1], to=plotrange[2])
d2 <- density(rchisq(1000, df=3)-1, from=plotrange[1], to=plotrange[2])
plot(d1)
lines(d2)
# look for points of intersection
poi <- which(diff(d1$y > d2$y) != 0)
# Mark those points with a circle:
points(x=d1$x[poi], y=d1$y[poi], col="red")
# or with lines:
abline(v=d1$x[poi], col="orange", lty=2)
abline(h=d1$y[poi], col="orange", lty=2)
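poi above is the index of the grid point just before each sign change; a small refinement (my own addition) is to linearly interpolate between that point and the next to sharpen the crossing estimate. This assumes d1 and d2 were evaluated on the same x grid, as they are here thanks to the shared from/to range:
dy <- d1$y - d2$y
crossings <- d1$x[poi] + (d1$x[poi + 1] - d1$x[poi]) * dy[poi] / (dy[poi] - dy[poi + 1])
crossings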
intersect(x,y)
see this help file
For example: If your data are in the same data.frame df
intersect(df$col1, df$col2)
Here is a small example extending John's answer.
require(ggplot2)
require(reshape2)
set.seed(12)
df <- data.frame(x = round(rnorm(100, 20, 10),1), y = round((100/log(100:199)),1))
# Melt and plot
mdf <- melt(df)
str(mdf)
# 'data.frame': 200 obs. of 2 variables:
# $ variable: Factor w/ 2 levels "x","y": 1 1 1 1 1 1 1 1 1 1 ...
# $ value   : num 16.8 25.7 20.5 22 19 ...
ggplot(mdf) +
geom_density(aes(x = value, color = variable))
# Find points that intersect
intersect(df$x, df$y)
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
# To make the answer more complete, here is the source code of intersect.
function (x, y)
{
    y <- as.vector(y)
    unique(y[match(as.vector(x), y, 0L)])
}
<bytecode: 0x10285d400>
<environment: namespace:base>
# It's actually posible to use unique and match to produce the same output
unique(as.vector(df$y)[match(as.vector(df$x), df$y, 0L)])
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
I'm sure your answers are correct, but here's what finally worked for me:
d1$x[abs(d1$y - d2$y) < 0.00001 & d1$x < 1000 & d1$x > 500]  # note: & (vectorised), not &&
(I really only needed to find one value, and I am a total R newbie, which made it difficult to understand your answers, since I don't yet understand most basic R concepts. Thank you for your help, and sorry.)
