How does geom_raster handle duplicated fill? - r

I would like to present the probabilities of some joint events as a raster using the ggplot2 package, and I wonder how geom_raster decides which value to show when a cell has more than one value. For various reasons, some of these events can have more than one probability. In the code below, I illustrate the question at coordinate (10, 10). Does geom_raster use the last value? Does it sample?
library(ggplot2)
# Normal raster
r <- data.frame(x = 1:10, y = rep(10, 10), value = 1:10)
p1 <- ggplot(r, aes(x, y, fill=value))+
geom_raster()+
coord_equal()+
theme(legend.position = 'bottom')+
labs(title = 'Normal raster: every cell has one value')
p1
# Assuming that coordinate (10, 10) have values 10 and 0
r <- rbind(r, c(10, 10, 0))
p2 <- ggplot(r, aes(x, y, fill=value))+
geom_raster()+
coord_equal()+
theme(legend.position = 'bottom')+
labs(title = 'Raster having 2 different values (10 then 0) at coordinates (10, 10)')
p2

It appears that just the last value for the cell is used. The logic can be found in the source code in the draw_panel function of GeomRaster. We see this code
x_pos <- as.integer((data$x - min(data$x))/resolution(data$x,
FALSE))
y_pos <- as.integer((data$y - min(data$y))/resolution(data$y,
FALSE))
nrow <- max(y_pos) + 1
ncol <- max(x_pos) + 1
raster <- matrix(NA_character_, nrow = nrow, ncol = ncol)
raster[cbind(nrow - y_pos, x_pos + 1)] <- alpha(data$fill,
data$alpha)
So it creates a matrix with a row and a column for every position, then fills it with a single assignment using matrix indexing. When the same cell is indexed more than once, only the last assignment survives. For example:
(m <- matrix(1:9, nrow=3))
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
(rowcols <- cbind(c(2,3,2), c(3,1,3)))
#      [,1] [,2]
# [1,]    2    3
# [2,]    3    1
# [3,]    2    3
m[rowcols] <- 10:12
m
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5   12
# [3,]   11    6    9
What we are doing is creating a matrix then changing the value of cell (2,3), (3,1) then (2,3) again. Only the last assignment to (2,3) is preserved (the 10 value is overwritten). So the value kept depends on the order your data is passed to the ggplot object.
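Since the result depends on row order, it may be safer to resolve duplicates yourself before plotting. A minimal sketch in base R, using the data from the question; the names r_last and r_mean are just illustrative, and averaging is only one possible way to combine duplicates:

```r
# Data from the question: cell (10, 10) appears twice, with values 10 and 0
r <- data.frame(x = 1:10, y = rep(10, 10), value = 1:10)
r <- rbind(r, c(10, 10, 0))

# Option 1: keep only the last row for each (x, y) cell,
# which makes the "last value wins" behaviour explicit
r_last <- r[!duplicated(r[, c("x", "y")], fromLast = TRUE), ]

# Option 2: combine duplicates explicitly, here by averaging (an assumption;
# pick whatever summary makes sense for your probabilities)
r_mean <- aggregate(value ~ x + y, data = r, FUN = mean)
```

Either data frame has exactly one row per cell, so the plot no longer depends on the order of the rows.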

Related

How to find the minimum in a distance matrix in R? My approach doesn't work

I want to find the minimum value in my distance matrix in order to program the single linkage algorithm for cluster analysis with R. But the output doesn't show the coordinates (row number and column number) that identify the minimum.
I tried the "which" command to solve this.
This seems to be the right approach:
> x <- matrix(c(1, 2, 0, 4), nrow=2, ncol=2)
> which(x == min(x), arr.ind=TRUE)
     row col
[1,]   1   2
I tried it with my case, but there is no right output:
> which(distance.matrix.euc==min(distance.matrix.euc), arr.ind=TRUE)
row col
I expect R to show me the coordinates of the minimum value in the distance matrix, but it shows nothing.
Do you have any idea what's wrong?
If you create the distance.matrix.euc with the dist function in R, then its class will be dist, not a matrix.
set.seed(2)
x <- matrix(sample(1:10, 6, replace = FALSE), nrow=3)
x
#      [,1] [,2]
# [1,]    5    1
# [2,]    6   10
# [3,]    9    7
distance_matrix <- dist(x)
distance_matrix
#          1        2
# 2 9.055385
# 3 7.211103 4.242641
class(distance_matrix)
# [1] "dist"
As @akrun suggested, you can convert the dist object to class matrix with as.matrix. The which command should then return the closest points.
min_dist <- min(distance_matrix)
distance_matrix <- as.matrix(distance_matrix)
which(distance_matrix==min_dist, arr.ind=TRUE)
#   row col
# 3   3   2
# 2   2   3
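Each pair shows up twice because the full matrix is symmetric. If you only want each pair once, one option is to blank out the upper triangle and the diagonal before searching. A small sketch, with the values of x typed in directly so it does not depend on the random number generator; the names m and closest are illustrative:

```r
# Same points as in the example above, entered by hand
x <- matrix(c(5, 6, 9, 1, 10, 7), nrow = 3)
m <- as.matrix(dist(x))

# Keep only the lower triangle so each pair of points appears once
m[upper.tri(m, diag = TRUE)] <- NA

# Row/column of the smallest remaining distance
closest <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)
closest
```

Here closest has a single row, pointing at points 3 and 2.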

Plotting variable number of series in ggplot for use in Shiny app

I need to make a graph with ggplot where the number of series included in the plot is variable. The data frame has dates (x variable) in the first column, and then anywhere from 1 to 15 additional columns containing variables to be plotted. I saw a previous post suggesting melt from the reshape package; however, I could not get that to work. I would like this to work regardless of the ncol= value used when dimensioning the test matrix. I would really appreciate any help on this!
Simulated data:
#rm(list = ls())
library(ggplot2)
library(reshape2)
test <- matrix(0, nrow = 10, ncol = 5)
test[,1] <- seq(from = 2000, by = 1, length = 10)
for(i in 2:5){
test[1, i] <- 100
for(j in 2:10){
test[j, i] <- test[j-1, i] + rnorm(1, 25, 5)
}
}
colnames(test)[1] <- "date"
melt_test <- melt(test, id = "date")
ggplot(melt_test, aes(x=date, y=value, colour = variable, group = variable)) +
geom_line()
Fixing the data:
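Your code produces an error, because test has no column names yet, so colnames(test)[1] <- "date" tries to install a single name on a five-column matrix: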
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
so I'll regenerate the data and put it into a data.frame rather than a matrix:
library(ggplot2)
library(reshape2)
test <- matrix(0, nrow = 10, ncol = 4)
set.seed(2) # for reproducibility, always include this in SO questions when using random funcs
for(i in 1:4){
test[1,i] <- 100
for(j in 2:10){
test[j, i] <- test[j-1, i] + rnorm(1, 25, 5)
}
}
test[1:3,]
#          [,1]     [,2]     [,3]     [,4]
# [1,] 100.0000 100.0000 100.0000 100.0000
# [2,] 120.5154 124.3061 130.0641 122.0172
# [3,] 146.4397 151.3943 157.2255 150.9782
dat <- cbind.data.frame(date = seq(2000, by=1, length=nrow(test)), test)
dat[1:3,]
#   date        1        2        3        4
# 1 2000 100.0000 100.0000 100.0000 100.0000
# 2 2001 120.5154 124.3061 130.0641 122.0172
# 3 2002 146.4397 151.3943 157.2255 150.9782
Now, when we reshape, we can see something better:
melt_dat <- reshape2::melt(dat, id="date")
melt_dat[1:3,]
#   date variable    value
# 1 2000        1 100.0000
# 2 2001        1 120.5154
# 3 2002        1 146.4397
Now things work:
ggplot(melt_dat, aes(x=date, y=value, colour = variable, group = variable)) +
geom_line()
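If you would rather not depend on reshape2 inside the Shiny app, the same wide-to-long step can be sketched in base R. The helper to_long below is hypothetical (it is not part of any package), and it assumes there is a single id column, called "date" by default, with every other column treated as a series:

```r
# Hypothetical helper: wide -> long using base R only.
# Assumes one id column (default "date"); every other column is a series.
to_long <- function(dat, id = "date") {
  series <- setdiff(names(dat), id)
  data.frame(
    date     = rep(dat[[id]], times = length(series)),
    variable = factor(rep(series, each = nrow(dat)), levels = series),
    value    = unlist(dat[series], use.names = FALSE)
  )
}

# Works for any number of series columns
wide <- data.frame(date = 2000:2003, a = 1:4, b = 5:8, c = 9:12)
long <- to_long(wide)

# The plotting call stays the same however many series there are:
# ggplot(long, aes(x = date, y = value, colour = variable, group = variable)) +
#   geom_line()
```

Because the series are discovered from the column names, nothing in the plotting code needs to know how many there are.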

Matrix Multiplication along specified array dimension with R/Rcpp

Given an n-dimensional array X, a d by d-1 matrix V, and two specified dimensions (p1, p2) <= (n, n), I would like a function that performs matrix multiplication by V along the dimensions (p1, p2) of X.
That is, given X (called a in the code below):
library(abind)
a <- matrix(1:4, 2, 2)  # values chosen to match the output shown below
a <- abind(a, a+5, along = 3)
> a
, , 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
, , 2
     [,1] [,2]
[1,]    6    8
[2,]    7    9
and given a matrix V
V <- matrix(c(1, 2))
     [,1]
[1,]    1
[2,]    2
For example, if p1=1 and p2=2 I would like to replace the following for loop
p1 <- 1
p2 <- 2
a.out <- array(0, c(2, 1, 2))
for (i in 1:dim(a)[3]){ # loop over the third dimension, which is the one indexed below
a.out[,,i] <- a[,,i]%*%V # note indexed along other dimension
}
> a.out
, , 1
     [,1]
[1,]    7
[2,]   10
, , 2
     [,1]
[1,]   22
[2,]   25
The hard part here is that I want to allow for arbitrary dimensional arrays (i.e., n could be greater than 3).
1st Edit:
This problem is not the same as Indexing slice from 3D Rcpp NumericVector as I am discussing arbitrary number of dimensions >=2 and the question is not only about indexing.
2nd Edit:
Just to be a little clearer, here is another example of what I am trying to do. Here X has 4 dimensions, p1 = 2, p2 = 3, and the extent of X along the p2 dimension is 12. The following code computes the desired result as X.out for random X and V.
X <- array(rnorm(672), c(4, 7, 12, 2))
V <- matrix(rnorm(132), 12, 11) # p1 = 2, p2 = 3, V is of dimension D x D-1
d <- dim(X)
X.out <- array(0, dim=c(d[1:2], d[3]-1, d[4]))
for(i in 1:d[1]){
for (j in 1:d[4]){
X.out[i,,,j] <- X[i,,,j]%*%V # p1 = 2, p2 = 3
}
}
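One possible way to drop the loops in plain R is to move the contracted dimension to the end with aperm, flatten everything else into rows, multiply once, and permute back. This is only a sketch under one reading of the question: the function name mult_along is made up, and it takes only p2 (the dimension contracted with V), because under this reading every other index, p1 included, is simply carried along.

```r
# Hypothetical helper: multiply X by V along dimension p2.
# result[..., k, ...] = sum_j X[..., j, ...] * V[j, k], with j and k in position p2
mult_along <- function(X, V, p2) {
  d <- dim(X)
  n <- length(d)
  stopifnot(d[p2] == nrow(V))
  perm <- c(setdiff(seq_len(n), p2), p2)   # move dimension p2 to the end
  Xp <- aperm(X, perm)
  M <- matrix(Xp, ncol = d[p2])            # one row per combination of the other indices
  out <- M %*% V                           # contract dimension p2 with V
  out <- array(out, dim = c(d[perm[-n]], ncol(V)))
  aperm(out, order(perm))                  # restore the original dimension order
}

# First example from the question
a <- array(c(1:4, 6:9), c(2, 2, 2))
V <- matrix(c(1, 2))
a.out <- mult_along(a, V, p2 = 2)
dim(a.out)   # 2 1 2
```

For the 4-dimensional example, mult_along(X, V, p2 = 3) should agree with the X.out computed by the double loop; comparing the two with all.equal is a quick way to check that before relying on it.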

Find most distant point from all other points in R

I'm having trouble finding a solution to this simple problem. I have been searching the forums, and although I have gotten closer to an answer, this is not exactly what I need.
I'm trying to find from a set of x,y points which point is the furthest away from any other points i.e. not the maximum distance between points, but the one furthest from the rest.
I've tried
x <-c(x1,x2,x3....)
y <-c(y1,y2,y3...)
dist(cbind(x,y))
Which gives me a matrix of the distance between each point to each point. I can interrogate the data in MS Excel and find the answer. Find the minimum values in each column, then the maximum number across them.
If I were to plot the data, I would like to have as output the distance of either the red or blue line (depending on which is longer).
Starting from this example data set:
set.seed(100)
x <- rnorm(150)
y <- rnorm(150)
coord <- cbind(x,y)
dobj <- dist(coord)
Now dobj is a distance object, but you can't examine that directly. You'll have to convert that to a matrix first, and make sure you don't take zero distances between a point and itself into account:
dmat <- as.matrix(dobj)
diag(dmat) <- NA
The latter line replaces the diagonal values in the distance matrix with NA.
Now you can use the solution of amonk:
dmax <- max(apply(dmat,2,min,na.rm=TRUE))
This gives you the maximum distance to the nearest point. If you want to know which points these are, you can take an extra step:
which(dmat == dmax, arr.ind = TRUE)
#     row col
# 130 130  59
# 59   59 130
So point 130 and 59 are the two points fulfilling your conditions. Plotting this gives you:
id <- which(dmat == dmax, arr.ind = TRUE)
plot(coord)
lines(coord[id[1,],], col = 'red')
Note how you get this info twice, as euclidean distances between two points are symmetric (A -> B is as long as B -> A ).
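To convince yourself that this picks the right point, it can help to run the same steps on a tiny, hand-checkable example. The four points below are made up for illustration: three sit close together near the origin and one is far away at (10, 10).

```r
# Three clustered points and one obvious outlier (illustrative data)
coord <- cbind(x = c(0, 1, 0, 10), y = c(0, 0, 1, 10))

dmat <- as.matrix(dist(coord))
diag(dmat) <- NA                          # ignore each point's distance to itself

# Distance from every point to its nearest neighbour
nn_dist <- apply(dmat, 2, min, na.rm = TRUE)

# The most isolated point is the one whose nearest neighbour is farthest away
most_isolated <- which.max(nn_dist)
```

Here the nearest-neighbour distance is 1 for the three clustered points and sqrt(181) for the fourth, so most_isolated points at row 4.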
It looks to me like you have spatial points in some projection. One could argue that the point furthest away from the rest is the one that lies furthest from the center (the mean coordinates):
library(raster)
set.seed(21)
# create fake points
coords <- data.frame(x=sample(438000:443000,10),y=sample(6695000:6700000,10))
# calculate center
center <- matrix(colMeans(coords),ncol=2)
# red = center, magenta = furthest point (Nr.2)
plot(coords)
# furthest point #2
ix <- which.max(pointDistance(coords,center,lonlat = F))
points(center,col='red',pch='*',cex=3)
points(coords[ix,],col='magenta',pch='*',cex=3)
segments(coords[ix,1],coords[ix,2],center[1,1],center[1,2],col='magenta')
To find the points farthest from the rest of the points you could do something like this. I opted for the median distance as you said the point(s) farthest from the rest of the data. If you have a group of points very close to each other the median should remain robust to this.
There is probably also a way to do this with hierarchical clustering but it is escaping me at the moment.
set.seed(1234)
mat <- rbind(matrix(rnorm(100), ncol=2), c(-5,5), c(-5.25,4.75))
d <- dist(mat)
sort(apply(as.matrix(d), 1, median), decreasing = T)[1:5]
#       51       52       20       12        4
# 6.828322 6.797696 3.264315 2.806263 2.470919
I wrote up a handy little function you can use for picking from the largest of line distances. You can specify if you want the largest, second largest, and so forth with the n argument.
getBigSegment <- function(x, y, n = 1){
a <- cbind(x,y)
d <- as.matrix(dist(a, method = "euclidean"))
sorted <- order(d, decreasing = T)
sub <- (1:length(d))[as.logical(1:length(sorted) %% 2)]
s <- which(d == d[sorted[sub][n]], arr.ind = T)
t(cbind(a[s[1],], a[s[2],]))
}
With some example data similar to your own you can see:
set.seed(100)
mydata <- data.frame(x = runif(10, 438000, 445000) + rpois(10, 440000),
y = runif(10, 6695000, 6699000) + rpois(10, 6996000))
# The function
getBigSegment(mydata$x, mydata$y)
# x y
#[1,] 883552.8 13699108
#[2,] 881338.8 13688458
Below you can visualize how I would use such a function
# easy plotting function
pointsegments <- function(z, ...) {
segments(z[1,1], z[1,2], z[2,1], z[2,2], ...)
points(z, pch = 16, col = c("blue", "red"))
}
plot(mydata$x, mydata$y) # points
top3 <- lapply(1:3, getBigSegment, x = mydata$x, y = mydata$y) # top3 longest lines
mycolors <- c("black","blue","green") # 3 colors
for(i in 1:3) pointsegments(top3[[i]], col = mycolors[i]) # plot lines
legend("topleft", legend = round(unlist(lapply(top3, dist))), lty = 1,
col = mycolors, text.col = mycolors, cex = .8) # legend
This approach first uses chull to identify extreme_points, the points that lie on the convex hull of the given points. Then, for each extreme point, it calculates the centroid of the remaining extreme_points (excluding that point). Finally, it selects the extreme point that is furthest from its corresponding centroid.
foo = function(X = all_points){
plot(X)
chull_inds = chull(X)
extreme_points = X[chull_inds,]
points(extreme_points, pch = 19, col = "red")
centroid = t(sapply(1:NROW(extreme_points), function(i)
c(mean(extreme_points[-i,1]), mean(extreme_points[-i,2]))))
distances = sapply(1:NROW(extreme_points), function(i)
dist(rbind(extreme_points[i,], centroid[i,])))
points(extreme_points[which.max(distances),], pch = 18, cex = 2)
points(X[chull_inds[which.max(distances)],], cex = 5)
return(X[chull_inds[which.max(distances)],])
}
set.seed(42)
all_points = data.frame(x = rnorm(25), y = rnorm(25))
foo(X = all_points)
# x y
#18 -2.656455 0.7581632
So for df as your initial data frame you can perform the following:
df<-NULL#initialize object
for(i in 1:10)#create 10 vectors with 10 pseudorandom numbers each
df<-cbind(df,runif(10))#fill the dataframe
cordf<-cor(df);diag(cordf)<-NA #create correlation matrix and set diagonal values to NA
Hence:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA -0.03540916 -0.29183703 0.49358124 0.79846794 0.29490246 0.47661166 -0.51181482 -0.04116772 -0.10797632
[2,] -0.03540916 NA 0.47550478 -0.24284088 -0.01898357 -0.67102287 -0.46488410 0.01125144 0.13355919 0.08738474
[3,] -0.29183703 0.47550478 NA -0.05203104 -0.26311149 0.01120055 -0.16521411 0.49215496 0.40571893 0.30595246
[4,] 0.49358124 -0.24284088 -0.05203104 NA 0.60558581 0.53848638 0.80623397 -0.49950396 -0.01080598 0.41798727
[5,] 0.79846794 -0.01898357 -0.26311149 0.60558581 NA 0.33295170 0.53675545 -0.54756131 0.09225002 -0.01925587
[6,] 0.29490246 -0.67102287 0.01120055 0.53848638 0.33295170 NA 0.72936185 0.09463988 0.14607018 0.19487579
[7,] 0.47661166 -0.46488410 -0.16521411 0.80623397 0.53675545 0.72936185 NA -0.46348644 -0.05275132 0.47619940
[8,] -0.51181482 0.01125144 0.49215496 -0.49950396 -0.54756131 0.09463988 -0.46348644 NA 0.64924510 0.06783324
[9,] -0.04116772 0.13355919 0.40571893 -0.01080598 0.09225002 0.14607018 -0.05275132 0.64924510 NA 0.44698207
[10,] -0.10797632 0.08738474 0.30595246 0.41798727 -0.01925587 0.19487579 0.47619940 0.06783324 0.44698207 NA
Finally by executing:
max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE)#avoiding NA's
one can get:
[1] -0.05275132
the maximum value of the local minima.
Edit:
In order to get the index of matrix
> which(cordf==max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE))
[1] 68 77
or in order to get the coordinates:
> which(cordf==max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE), arr.ind = TRUE)
     row col
[1,]   8   7
[2,]   7   8

Efficient way to compute standard deviation of nearest neighbours of each element in matrix

I would like to compute the standard deviation of the nearest neighbors (3*3 moving window) of each element in a matrix. I wrote some code in R to implement it:
library(FNN)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
df_ <- reshape2::melt(df)
df_index <- df_[, c(1,2)]
df_query <- df_index
neighbor_index <- knnx.index(df_index, df_query, k = 9, algorithm = 'kd_tree')
neighbor_coor<- apply(neighbor_index, 1, function(x) df_query[x, ])
neighbor_sd <- lapply(neighbor_coor, function(x) sd(df[x[, 1], x[, 2]]))
sd <- do.call(rbind, neighbor_sd)
But it is too slow. Can you suggest how to speed it up? Are there other ways to implement it?
As @romanlustrik proposed in his comment, we can use raster::focal() for this problem.
library(raster)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
dfR <- raster(df)
dfSD <- as.matrix(focal(dfR, w = matrix(1,3,3), fun = sd))
where w is a matrix representing the nearest neighbors and their weighting within fun (in this case 3x3, which is the cell itself and its 8 neighbors). Thus, any neighborhood pattern is possible as long as it can be represented by a matrix.
matrix(1,3,3)
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    1    1    1
# [3,]    1    1    1
An example with only the 4 neighbors (excluding diagonals and the cell itself):
matrix(c(0,1,0,1,0,1,0,1,0), 3, 3)
#      [,1] [,2] [,3]
# [1,]    0    1    0
# [2,]    1    0    1
# [3,]    0    1    0
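If you want to sanity-check the result on a small matrix without any packages, the full 3x3 window can be sketched with plain loops. The function window_sd is a hypothetical helper written for illustration; it leaves the border cells as NA (an assumption made here, since they have no complete 3x3 window) and it will be much slower than a dedicated routine on large matrices.

```r
# Hypothetical helper: SD of the full 3x3 window around each interior cell.
# Border cells are left as NA because their window is incomplete.
window_sd <- function(m) {
  stopifnot(nrow(m) >= 3, ncol(m) >= 3)
  out <- matrix(NA_real_, nrow(m), ncol(m))
  for (i in 2:(nrow(m) - 1)) {
    for (j in 2:(ncol(m) - 1)) {
      out[i, j] <- sd(m[(i - 1):(i + 1), (j - 1):(j + 1)])
    }
  }
  out
}

small <- matrix(1:25, nrow = 5, ncol = 5, byrow = TRUE)
res <- window_sd(small)
res[2, 2]   # SD of 1, 2, 3, 6, 7, 8, 11, 12, 13
```

Comparing a few interior cells of this output against the package result on the same small matrix is a cheap way to confirm that the window and the function are set up the way you intended.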
