3D aggregation of RasterBrick or RasterStack - r

I have some data in a 3D grid identified by simple i,j,k locations (no real-world spatial information). These data are in a RasterStack right now.
b <- stack(system.file("external/rlogo.grd", package="raster"))
# add more layers
b <- stack(b,b)
# dimensions
dim(b)
[1] 77 101 6
yields 77 rows, 101 columns, 6 layers.
# upscale by 2
up <- aggregate(b,fact=2)
dim(up)
[1] 39 51 6
yields 39 rows, 51 columns, 6 layers.
Hoped-for behavior: 3 layers.
I'm looking for a method to aggregate across layers in addition to the present behavior, which is to aggregate within each layer. I'm open to other data structures, but would prefer an existing upscaling/resampling/aggregation algorithm to one I write myself.
Potentially related are http://quantitative-advice.gg.mq.edu.au/t/fast-way-to-grid-and-sum-coordinates/110/5 or the spacetime package, which assumes the layers are temporal rather than spatial, adding more complexity.

Suppose you define an agg.fact variable to hold the value 2:
agg.fact <- 2
up <- aggregate(b, fact = agg.fact)
dim(up)
[1] 39 51 6
Now we generate a table indicating which layers will be aggregated together, based on agg.fact:
positions <- matrix(1:nlayers(b), nrow = nlayers(b)/agg.fact, byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
And apply a function (in this case mean, but it could be max, sum, or another) to each pair of layers:
up2 <- stack(apply(positions, 1, function(x){
  mean(b[[x[1]]], b[[x[2]]])
}))
dim(up2)
[1] 77 101 3
Or, if you want to aggregate in all three dimensions (choose whether to aggregate in 1D-2D first and then 3D, or vice versa):
up3 <- stack(apply(positions, 1, function(x){
  aggregate(mean(b[[x[1]]], b[[x[2]]]), fact = agg.fact)  # first 3d (layers), then 1d-2d
  # mean(aggregate(b[[x[1]]], fact = agg.fact), aggregate(b[[x[2]]], fact = agg.fact))  # first 1d-2d, then 3d
}))
dim(up3)
[1] 39 51 3
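An alternative for the across-layer step is stackApply() from the raster package, which applies a function over groups of layers defined by an index vector. This is my sketch rather than part of the original answer, so check the arguments against your raster version:
idx <- rep(1:(nlayers(b) / agg.fact), each = agg.fact)  # 1 1 2 2 3 3
up2b <- stackApply(b, indices = idx, fun = mean)        # mean within each group of layers
dim(up2b)                                               # should be 77 101 3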

I did not read the documentation carefully enough. To aggregate across layers, from the aggregate documentation:
For example, fact=2 will result in a new Raster* object with 2*2=4 times fewer cells. If two numbers are supplied, e.g., fact=c(2,3), the first will be used for aggregating in the horizontal direction, and the second for aggregating in the vertical direction, and the returned object will have 2*3=6 times fewer cells. Likewise, fact=c(2,3,4) aggregates cells in groups of 2 (rows) by 3 (columns) and 4 (layers).
It may be necessary to play with expand=TRUE vs expand=FALSE to get it to work, but this seems inconsistent (I have reported it as a bug).
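Under that reading of the documentation, a minimal sketch of the call would look like this (the exact result may depend on the expand behaviour mentioned above):
# aggregate only across layers, leaving rows and columns untouched
up.layers <- aggregate(b, fact = c(1, 1, 2))
dim(up.layers)   # expected: 77 101 3
# aggregate rows, columns and layers at once
up.all <- aggregate(b, fact = c(2, 2, 2))
dim(up.all)      # expected: 39 51 3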

Related

Retrieve 100 samples closest to the centroids of each cluster after K means clustering using R

I'm trying to reduce the input data size by first performing K-means clustering in R, then sampling 50-100 observations per representative cluster for downstream classification and feature selection.
The original dataset was split 80/20, and the 80% went into K-means training. The input data has 2 columns of labels and 110 columns of numeric variables. From the label column, I know there are 7 different drug treatments. In parallel, I used the elbow method to find the optimal K for the number of clusters; it is around 8, so I picked 10 to have more clusters to sample from downstream.
Now that I have finished running model <- kmeans(), the output list has me a little confused about what to do next. Since I had to scale only the numeric variables before passing them to kmeans(), the resulting cluster memberships no longer carry the treatment labels. I can overcome this by appending the cluster membership to the original training data table.
Then for the 10 centroids, how do I find out what the labels are? I can't just do
training_set$centroids <- model$centroids
And the most important question: how do I find the 100 samples per cluster that are closest to their respective centroids? I have seen one post on this in Python but no R resources yet.
Output 50 samples closest to each cluster center using scikit-learn.k-means library
Any pointers?
First we need a reproducible example of your data:
set.seed(42)
x <- matrix(runif(150), 50, 3)
kmeans.x <- kmeans(x, 10)
Now you want to find the observations in the original data x that are closest to the centroids computed and stored in kmeans.x. We use the get.knnx() function from the FNN package. We will just get the 5 closest observations for each of the 10 clusters.
library(FNN)
y <- get.knnx(x, kmeans.x$centers, 5)
str(y)
# List of 2
# $ nn.index: int [1:10, 1:5] 42 40 50 22 39 47 11 7 8 16 ...
# $ nn.dist : num [1:10, 1:5] 0.1237 0.0669 0.1316 0.1194 0.1253 ...
y$nn.index[1, ]
# [1] 42 38 3 22 43
idx1 <- sort(y$nn.index[1, ])
cbind(idx1, x[idx1, ])
#      idx1
# [1,]    3 0.28614 0.3984854 0.21657
# [2,]   22 0.13871 0.1404791 0.41064
# [3,]   38 0.20766 0.0899805 0.11372
# [4,]   42 0.43577 0.0002389 0.08026
# [5,]   43 0.03743 0.2085700 0.46407
The row indices of the nearest neighbors are stored in nn.index so for the first cluster, the 5 closest observations are 42, 38, 3, 22, 43.
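To extend this to every cluster and keep track of which rows were selected (so you can join the treatment labels back from your original training table), a sketch along these lines should work; closest is simply a name I made up here:
# for each of the 10 clusters, collect the indices of its 5 nearest observations
# (use 100 instead of 5 on the real data) together with the data rows themselves
closest <- do.call(rbind, lapply(seq_len(nrow(kmeans.x$centers)), function(i) {
  idx <- sort(y$nn.index[i, ])
  data.frame(cluster = i, row = idx, x[idx, , drop = FALSE])
}))
head(closest)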

compare clusters' objects in R

I have two clustering results for the same variables but with different values each time. Let us create them with the following code:
set.seed(11)
a<-matrix(rnorm(10000),ncol=100)
colnames(a)<-(c(1:100))
set.seed(31)
b<-matrix(rnorm(10000),ncol=100)
colnames(b)<-colnames(a)
c.a<-hclust(dist(t(a)))
c.b<-hclust(dist(t(b)))
# clusters
groups.a<-cutree(c.a, k=15)
# take groups names
clus.a=list()
for (i in 1:15) clus.a[[i]] <- colnames(a)[groups.a==i]
# see the clusters
clus.a
groups.b<-cutree(c.b, k=15)
clus.b=list()
for (i in 1:15) clus.b[[i]] <- colnames(b)[groups.b==i]
# see the clusters
clus.b
What I get from that is two lists, clus.a and clus.b, with the names (here just the numbers from 1 to 100) of each cluster's variables.
Is there any way to examine whether, and which of, the variables are clustered together in both clusterings? That is, how can I see if I have variables (in groups of 2, 3, 4, etc.) that fall into the same cluster in both clus.a and clus.b (it doesn't have to be the same cluster number)?
If I understand your question correctly, you want to know if there are any clusters in a which have exactly the same membership as any of the clusters in b. Here's one way to do that.
Note: AFAICT in your example there are no matching clusters in a and b, so we create a few artificially to demo the solution.
# create artificial matches
clus.b[[3]] <- clus.a[[2]]
clus.b[[10]] <- clus.a[[8]]
clus.b[[15]] <- clus.a[[11]]
f <- function(a,b) (length(a)==length(b) & length(intersect(a,b))==length(a))
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
which(result, arr.ind=TRUE)
# row col
# [1,] 2 3
# [2,] 8 10
# [3,] 11 15
So this loops through all the clusters in b (sapply(clus.b,...)) and, for each, loops through all the clusters in a looking for an exact match (in arbitrary order). For there to be a match, both clusters must have the same length, and the intersection of the two must contain all the elements of either, hence have the same length. This process produces a logical matrix with rows representing a and columns representing b.
Edit: to reflect the OP's updated question.
To detect clusters with two or more common elements, use:
f <- function(a,b) length(intersect(a,b))>1
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
matched <- which(result, arr.ind=TRUE)
matched
# row col
# [1,] 4 1
# [2,] 8 1
# [3,] 11 1
# [4,] 3 2
# ...
To identify which elements were present in both:
apply(matched,1,function(r) intersect(clus.a[[r[1]]],clus.b[[r[2]]]))
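If it helps readability, you can also label each overlap by the pair of clusters it came from (a small sketch of my own, using the objects defined above):
common <- lapply(seq_len(nrow(matched)), function(i)
  intersect(clus.a[[matched[i, "row"]]], clus.b[[matched[i, "col"]]]))
names(common) <- paste0("a", matched[, "row"], "_b", matched[, "col"])
common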

R - detect and summarize changes in matrices

I have two sets of matrices. Each matrix is 100x100 in dimension and I have 240 of them (imagine each matrix was collected in a month and I have a dataset composed of 240 months of 100x100 matrices).
The values in the matrices range from 1 to 15, representing vegetation types (grass, tropical forest, tundra etc).
My first set of matrices, m1, is my control experiment. My second set of matrices, m2, is a climate change experiment where changes in climate induce changes in the values of the matrices.
Therefore, the data is represented like this:
m1: set of 240 100x100 matrices, each matrix corresponding to a month (therefore 240 months of data). This is my control data
m2: same as m1, but the values are different because of some changes in climate. This is my experimental data.
Here is some data:
# generate dataset 1
set.seed(4)
someData1 <- round(runif(100 * 100 * 240, min=1, max=15),digits=0)
# generate dataset2
set.seed(5)
someData2 <- round(runif(100 * 100 * 240, min=1, max=15),digits=0)
# create matrices
k = 240; n=100; m = 100
m1 <- array(someData1, c(n,m,k))
m2 <- array(someData2, c(n,m,k))
What I would like to do is compare each cell of m2 relative to m1 in this way:
is the value different? yes/no
if yes, what was the change? for example 1 to 10, or 2 to 7 and so on.
and do the same for all 240 matrices in m2 relative to all 240 matrices in m1.
By the end, I would like to be able to:
have a binary matrix showing whether or not there has been changes in the values;
have a table with the frequency of changes in each class (i.e. 1 to 10, 2 to 7 etc).
Conceptually, what I need to achieve would be something like the sketch from my original post (figure not reproduced here), where for simplicity's sake I drew 5x5 matrices instead of 100x100 matrices.
How to achieve this in R?
To compare two matrices, use == or !=.
what.changed <- m1 != m2 # T if changed F if not
changes <- ifelse(what.changed, paste(m1, 'to', m2), NA)
changes # for your little matrices not the 100x100
     [,1]     [,2]      [,3]
[1,] NA       "7 to 10" "6 to 7"
[2,] NA       NA        NA
[3,] "3 to 4" "6 to 8"  NA
Your matrices seem rather large, so I'm not sure whether some sort of sparse-matrix approach might be better. As for storing the changes as strings ("3 to 4"), perhaps you could store only the cells where there actually is a change, rather than creating such a large matrix where most of the elements are NA. For example, you could create a CSV/data frame summarising your changes (using your 100x100x240 matrices to demonstrate the 3 coordinates):
# find coordinates of changes
change.coords <- which(m1 != m2, arr.ind=T)
colnames(change.coords) <- c('x', 'y', 'time') # whatever makes sense to your application
changes <- data.frame(change.coords, old=m1[change.coords], new=m2[change.coords])
head(changes)
  x y time old new
1 1 1    1   9   4
2 2 1    1   1  11
3 3 1    1   5  14
4 5 1    1  12   2
5 6 1    1   5  11
6 7 1    1  11   8
Then you can print it out as you wish without having to store heaps of strings ("X to Y") and NAs, e.g. (don't do this with your big example matrices; there are far too many changes and it will print them all):
with(changes, message(sprintf("Coords (%i, %i, %i): %i to %i\n",
                              x, y, time, old, new)))
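For the second requirement, a frequency table of each kind of change, a short follow-up on the same changes data frame could be (my own sketch, not part of the original answer):
# cross-tabulate old vs new class over all cells and months
with(changes, table(old, new))
# or as "X to Y" labels, sorted by how often each transition occurs
sort(table(paste(changes$old, "to", changes$new)), decreasing = TRUE)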

Understanding Dynamic Time Warping

We want to use the dtw library for R in order to shrink and expand certain time series data to a standard length.
Consider three time series with equivalent columns. moref has length (rows) 105, mobig has 130, and mosmall has 100. We want to project mobig and mosmall to a length of 105.
moref <- good_list[[2]]
mobig <- good_list[[1]]
mosmall <- good_list[[3]]
Therefore, we compute two alignments.
ali1 <- dtw(mobig, moref)
ali2 <- dtw(mosmall, moref)
If we print out the alignments the result is:
DTW alignment object
Alignment size (query x reference): 130 x 105
Call: dtw(x = mobig, y = moref)
DTW alignment object
Alignment size (query x reference): 100 x 105
Call: dtw(x = mosmall, y = moref)
So this is exactly what we want? From my understanding, we need to use the warping functions ali1$index1 or ali1$index2 in order to shrink or expand the time series. However, if we invoke the following commands
length(ali1$index1)
length(ali2$index1)
length(ali1$index2)
length(ali2$index2)
the result is
[1] 198
[1] 162
[1] 198
[1] 162
These are vectors of indices (probably referring to other vectors). Which one of these can we use for the mapping? Aren't they all too long?
First of all, we need to agree that index1 and index2 are two vectors of the same length that map the query/input data to the reference/stored data and vice versa.
Since you did not provide any data, here is some dummy data to give people an idea.
# Reference data is the template that we use as reference.
# say perfect pronunciation from CNN
data_reference <- 1:10
# Query data is the input data that we want to map to our reference
# say random youtube audio
data_query <- seq(1,10,0.5) + rnorm(19)
library(dtw)
alignment <- dtw(x=data_query, y=data_reference, keep=TRUE)
alignment$index1
alignment$index2
lcm <- alignment$costMatrix
image(x=1:nrow(lcm), y=1:ncol(lcm), lcm)
plot(alignment, type="threeway")
Here are the outputs:
> alignment$index1
[1] 1 2 3 4 5 6 7 7 8 9 10 11 12 13 13 14 14 15 16 17 18 19
> alignment$index2
[1] 1 1 1 2 2 3 3 4 5 6 6 6 6 6 7 8 9 9 9 9 10 10
So basically, the mapping from index1 to index2 is how to map input data to the reference data.
i.e. the 10th data point at the input data has been matched to the 6th data point from the template.
index1: Warping function φx(k) for the query
index2: Warping function φy(k) for the reference
-- Toni Giorgino
Per your question about the length of the index vectors: since they are basically the coordinates of the optimal path, the path can be roughly as long as m+n (a very shallow path) or as short as max(m, n) (close to a perfect diagonal). Clearly, it is not a one-to-one mapping, which might bother people a little bit; from here you can research how to pick the mapping you want.
I don't know if there is some built-in functionality to pick the best one-to-one mapping, but here is one way.
library(plyr)
mapping <- data.frame(index1=alignment$index1, index2=alignment$index2)
mapping <- ddply(mapping, .(index1), summarize, index2_new = max(index2))
Now mapping contains a one-to-one mapping from query to reference. Then you can map the query to the reference and scale the mapped input in whatever way you want.
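For instance, one simple way (my own sketch, not necessarily the canonical approach) to shrink or expand the query to the reference's length is to average all query points that the alignment matches to each reference index; the dtw package's warp() function may also be worth looking at for this.
# one value per reference index: average the query points aligned to it
warped <- tapply(data_query[alignment$index1], alignment$index2, mean)
length(warped)   # 10 here; 105 in the moref/mobig/mosmall case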
I am not exactly sure about the mapping part above, and anyone is more than welcome to improve how the mapping and scaling should work.
References: 1, 2

Traverse matrix (grid of points) in blocks of 4

This question is for a project and has nothing to do with homework/academics. I am a working statistician.
So my question is: how would you write an R function that, given a matrix with 400 rows and two columns, where every 20 rows starting from the first form one row of coordinates in a grid of points, returns the four corners of each individual square/rectangle within the grid?
Hence the output would have four columns and each row would indicate a rectangle. I am only looking at adjacent rectangles of the same size. For example, if the numbers below denote the row indices of the example matrix (which has two columns):
Example of row indices:
1 2 3
4 5 6
7 8 9
Would have to be traversed in the following order:
[1,2,4,5], [2,3,5,6], [4,5,7,8], [5,6,8,9], and then
return the corresponding 2D points from the example input data set,
which would have 9 rows and 2 columns. The difference is that here the grid is 3 by 3, while in my case the grid is 20 by 20 and my input dataset is 400 rows by 2 columns. If you look at the traversed result, there is a pattern where the row indices in each 4-point block are incremented by 1. I just want to generalize this to a 400 by 2 matrix, or any setting where there is a 2-column matrix of points and the grid dimension is given.
Here is a solution if I have understood you correctly. It was a very interesting problem to be honest. :D
The idea is to make a box of a given edge length and then move this box around the grid and record its vertices. Please see the following:
# Assuming the grid is always a square grid.
grid.size <- 20
# The matrix of row indices.
rindex.grid <- matrix(1:(grid.size * grid.size),
nrow=grid.size, ncol=grid.size, byrow=TRUE)
# We can traverse the grid by moving any given square either right or down in any
# single move. We choose to go right.
move.square.right <- function (this.square, steps=1) {
new.square <- this.square + steps
}
# Going right, capture co-ordinates of all squares in this row.
collect.sq.of.edge.length.in.row.number <- function (grid.size, elength,
rownum=1) {
first.square.in.row <- (rownum - 1) * grid.size + c(1, elength)
first.square.in.row <- c(first.square.in.row,
first.square.in.row + grid.size * (elength - 1))
squares.in.row <- t(sapply(X=seq_len(grid.size - (elength - 1)) - 1,
FUN=move.square.right,
this.square=first.square.in.row))
squares.in.row
}
# Now we start going down the columns and using the function above to collect
# squares in each row. The we will rbind the list of squares in each row into a
# dataframe. So what we get is a (grid.size - (elength - 1) ^ 2) x 4 matrix where
# each row is the co-ordinates of a square of edge length elength.
collect.sq.of.edge.length.in.grid <- function (grid.size, elength) {
all.squares=lapply(X=seq_len(grid.size - (elength - 1)),
FUN=collect.sq.of.edge.length.in.row.number,
grid.size=grid.size, elength=elength)
all.squares <- do.call(rbind, all.squares)
all.squares
}
This seems to show that we are getting the right number of boxes for all edge lengths:
tmp <- sapply(1:20, collect.sq.of.edge.length.in.grid, grid.size=grid.size)
sapply(tmp, nrow)
[1] 400 361 324 289 256 225 196 169 144 121 100  81  64  49  36  25  16   9   4   1
Plus, it works well in your 3x3 example:
collect.sq.of.edge.length.in.grid(grid.size=3, elength=2)
     [,1] [,2] [,3] [,4]
[1,]    1    2    4    5
[2,]    2    3    5    6
[3,]    4    5    7    8
[4,]    5    6    8    9
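Since the question asks for the 2D points rather than row indices, you can then index the original two-column coordinate matrix with each row of this result. A minimal sketch, where pts stands in for your actual 400 x 2 matrix:
# stand-in for the real 400 x 2 matrix of grid coordinates
pts <- cbind(x = rep(1:20, times = 20), y = rep(1:20, each = 20))
squares <- collect.sq.of.edge.length.in.grid(grid.size = 20, elength = 2)
pts[squares[1, ], ]   # the four corner points of the first 2x2 block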
If you want to create a movable 20 x 20 "window" that can scroll down and/or across a 400x400 space, then use:
mcorners <- function(xidx, yidx) mat[xidx:(xidx+19),
                                     yidx:(yidx+19)]
mcorners(1, 1)  # should return mat[1:20, 1:20]
Then supply mcorners() with arguments to fit your somewhat vaguely described needs. The traversal down the first column might involve:
sapply(1:381, function(ix) yourfunc( mcorners(ix, 1) ) )
