R - detect and summarize changes in matrices

I have two sets of matrices. Each matrix is 100x100 in dimension and I have 240 of them (imagine each matrix was collected in a month and I have a dataset composed of 240 months of 100x100 matrices).
The values in the matrices range from 1 to 15, representing vegetation types (grass, tropical forest, tundra etc).
My first set of matrices, m1, is my control experiment. My second set of matrices, m2, is a climate change experiment where changes in climate induce changes in the values of the matrices.
Therefore, the data is represented like this:
m1: set of 240 100x100 matrices, each matrix corresponding to a month (therefore 240 months of data). This is my control data
m2: same as m1, but the values are different because of some changes in climate. This is my experimental data.
Here is some data:
# generate dataset 1
set.seed(4)
someData1 <- round(runif(100 * 100 * 240, min = 1, max = 15), digits = 0)
# generate dataset 2
set.seed(5)
someData2 <- round(runif(100 * 100 * 240, min = 1, max = 15), digits = 0)
# create matrices
k <- 240; n <- 100; m <- 100
m1 <- array(someData1, c(n, m, k))
m2 <- array(someData2, c(n, m, k))
What I would like to do is compare each cell of m2 relative to m1 in this way:
is the value different? yes/no
if yes, what was the change? for example 1 to 10, or 2 to 7 and so on.
and do the same for all 240 matrices in m2 relative to all 240 matrices in m1.
By the end, I would like to be able to:
have a binary matrix showing whether or not there have been changes in the values;
have a table with the frequency of changes in each class (i.e. 1 to 10, 2 to 7 etc).
Conceptually, what I need to achieve is the pair of outputs above (in my sketch I drew 5x5 matrices instead of 100x100, for simplicity's sake).
How to achieve this in R?

To compare two matrices, use == or !=.
what.changed <- m1 != m2 # T if changed F if not
changes <- ifelse(what.changed, paste(m1, 'to', m2), NA)
changes # for your little matrices not the 100x100
     [,1]     [,2]      [,3]
[1,] NA       "7 to 10" "6 to 7"
[2,] NA       NA        NA
[3,] "3 to 4" "6 to 8"  NA
Your matrices seem rather large, so I'm not sure if some sort of sparse-matrix approach might be better. As for storing the changes as strings ("3 to 4"), perhaps you could store only the cells where there is in fact a change, rather than creating such a large matrix where most of the elements are NA, e.g. (a minimal sketch using the what.changed mask from above):
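# keep just the changed cells; logical indexing flattens the arrays
# in column-major order
changed.only <- paste(m1[what.changed], 'to', m2[what.changed])
head(changed.only)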
Or perhaps you could create a CSV/dataframe summarising your changes e.g. (using your 100x100x240 matrices to demonstrate the 3 coordinates):
# find coordinates of changes
change.coords <- which(m1 != m2, arr.ind=T)
colnames(change.coords) <- c('x', 'y', 'time') # whatever makes sense to your application
changes <- data.frame(change.coords, old=m1[change.coords], new=m2[change.coords])
head(changes)
  x y time old new
1 1 1    1   9   4
2 2 1    1   1  11
3 3 1    1   5  14
4 5 1    1  12   2
5 6 1    1   5  11
6 7 1    1  11   8
Then you can print it out as you wish without having to store heaps of strings ("X to Y") and NAs, e.g. (don't do this with your big example matrices, there are waaay too many changes and it will print them /all/; note %g is used for old and new because the rounded runif values are doubles, not integers):
with(changes, message(sprintf("Coords (%i, %i, %i): %g to %g\n",
                              x, y, time, old, new)))
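To also get the frequency table of change classes the question asks for ("1 to 10", "2 to 7", etc.), one option, sketched on top of the changes data frame above:
# tabulate how often each old -> new transition occurs
change.freq <- as.data.frame(table(change = paste(changes$old, 'to', changes$new)))
head(change.freq)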

Related

Creating upper/lower triangular correlation matrix based on values from a group of text files?

Will try not to complicate things too much with my explanations, but I'm confused about how best to go about filling a triangular correlation matrix (with no repeated values) with existing correlation values derived from another package. This involves extracting specific values from a list of text files. This is what I have done so far:
# read in list of file names (they are named '1_1', '1_2' .. so on until
# '47_48' with no repeat values generated)
library(readr) # for read_table
filenames <- read_table('/home/filenames.txt', col_names = 'file_id')
# create symmetrical matrix
M <- diag(48)
ct <- 1
for (sub in filenames$file_id) {
  subj <- read.table(paste0(dat_dir, '/ht_', sub, '.HEreg'), sep = "", fill = TRUE)
  ht <- as.character(subj$V2[grep("rG", subj$V1)]) # extract the specific value in that column for each text file
  M[ct, ] <- as.numeric(ht) # input this value into the appropriate location
  ct <- ct + 1
}
This obviously does not give me the triangulated output I envision. I know there is an error with inputting the variable 'ht' into the matrix, but I am not sure how to solve it moving forward. Ideally, the correlation value from file 1_1 should be inserted in row 1, col 1; file 1_2 in row 2, col 1; and so on and so forth, avoiding repeats (which should be 0's).
Should I turn to nested loops?
Much help would be appreciated from this R newbie here, I hope I didn't complicate things unnecessarily!
I think the easiest way would be to read in all your values into a vector. You can do this using a variation of your existing loop.
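For instance, a sketch of that reading loop, reusing the file-reading code from the question (dat_dir and the "rG" row are assumptions carried over from there, and each file is assumed to contain exactly one rG value):
x <- sapply(filenames$file_id, function(sub) {
  subj <- read.table(paste0(dat_dir, '/ht_', sub, '.HEreg'), sep = "", fill = TRUE)
  as.numeric(as.character(subj$V2[grep("rG", subj$V1)]))
})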
Let us assume that your desired size correlation matrix is 5x5 (I know you have 48x48 judging by your code, but to keep the example simple I will work with a smaller matrix).
Let us assume that you have read all of your correlation values into the vector x in column-major order (the same order R uses), i.e. the first element of x is row 2 column 1, the second element is row 3 column 1, etc. I am further assuming that you are creating a symmetric correlation matrix with ones on the diagonal (hence your use of the diag() function), which is why the indexing starts below the diagonal. Let's assume your vector x contains the following values:
x <- 1:10
I know that these are not correlations, but they will make it easy to see how we fill the matrix, i.e. which vector element goes into which position in the resulting matrix.
Now, let us create the identity matrix and zero matrices for the upper and lower triangular correlations (off diagonal).
# Assuming 5x5 matrix
n_elements <- 5
m <- diag(n_elements)
m_upper <- m_lower <- matrix(0, n_elements, n_elements)
To quickly fill the lower triangular matrix, we can use lower.tri().
m_lower[lower.tri(m_lower, diag = FALSE)] <- x
This will yield the following output:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 2 5 0 0 0
[4,] 3 6 8 0 0
[5,] 4 7 9 10 0
As you can see, we have successfully filled the lower triangular. Also note the order in which the elements of the vector are filled into the matrix. This is crucial for your results to be correct. The upper triangular is simply the transpose of the lower triangular, and then we can add our three matrices together to form your symmetric correlation matrix.
m_upper <- t(m_lower)
M <- m_lower + m + m_upper
Which yields the desired output:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 2 3 4
[2,] 1 1 5 6 7
[3,] 2 5 1 8 9
[4,] 3 6 8 1 10
[5,] 4 7 9 10 1
As you see, there is no need to work with nested loops to fill these matrices. The only loop you need is to read in the results from files (which it appears you have a handle on). If you only want the triangulated output, you can simply stop at the lower triangular matrix above. If your vector of estimated correlations (in my example x) includes the diagonal elements, simply set diag = TRUE in lower.tri() and you are good to go.
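For example, a one-line sketch of that last variant, where x_full is a hypothetical vector that also contains the diagonal entries (length 15 rather than 10 in the 5x5 case):
# x_full: hypothetical vector of correlations including the diagonal elements
m_lower[lower.tri(m_lower, diag = TRUE)] <- x_full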

Find all combinations of numbers that sum to a target

I wish to find the speediest way to find up to 1000 possible combinations of 'n' integers that sum to a target integer.
For example, say I wanted to sum to the number 20. I want to find up to 1000 combinations of four integers that sum to this number. The integers can repeat themselves. I also have a condition that each integer must not be smaller than a particular number, in this case 4.
target<-20 #the number I wish to sum to
lowest<-4 #the smallest integer I allow
size<-4 #the number of integers I wish to use to sum
maxposs <- target - ((size-1) * lowest) #given the lowest, this is the max possible integer. In my example it is 8.
This is how I have started to work this out: using combn to find all combinations of the four chosen integers and then filtering to those that sum to my target.
m <- combn(rep(lowest:maxposs,size), size)
m1<- m[,colSums(m)==target]
Here, 'm1' has 245 columns. There are only this many solutions. The last few columns:
# [,238] [,239] [,240] [,241] [,242] [,243] [,244] [,245]
#[1,] 4 4 4 4 4 4 5 5
#[2,] 5 5 5 6 7 4 6 4
#[3,] 7 4 5 4 4 5 4 5
#[4,] 4 7 6 6 5 7 5 6
However, in my real application, I can be dealing with very high integers (summing up to 1000) and want to limit myself to a random sample of 1000 possible combinations. As this is for a randomization statistical test, speed is of the essence. I wonder if anyone knows of a faster way of doing this. My way doesn't feel intuitively quick.
The idea: draw the first three integers at random, force the fourth to be whatever remains, and keep the draw only if that remainder is valid. Note the replace = TRUE, since the integers can repeat themselves:
my_matrix <- matrix(nrow = 1000, ncol = 4)
i <- 1
nn <- 1000
while (i <= 1000) {
  x <- sample(x = 4:nn, size = 3, replace = TRUE) # first three integers, repeats allowed
  y <- nn - sum(x)                                # the fourth is forced by the target
  if (y >= 4) {                                   # keep the draw only if it is feasible
    my_matrix[i, ] <- c(x, y)
    i <- i + 1
  }
}
Per Gavin's suggestion, redone with a preallocated matrix. Now this runs in 0.158 seconds, twice as fast, and probably scales better.
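If helpful, the same rejection-sampling idea can be wrapped up with the question's target, lowest and size as parameters (the function name and signature here are mine, a sketch rather than tested code):
sample.combos <- function(n.combos, target, lowest, size) {
  maxposs <- target - (size - 1) * lowest # as defined in the question
  out <- matrix(nrow = n.combos, ncol = size)
  i <- 1
  while (i <= n.combos) {
    x <- sample(lowest:maxposs, size = size - 1, replace = TRUE)
    y <- target - sum(x)  # the last integer is forced by the target
    if (y >= lowest) {    # y can never exceed maxposs, so this is the only check needed
      out[i, ] <- c(x, y)
      i <- i + 1
    }
  }
  out
}
# e.g. 1000 combinations of four integers >= 4 summing to 20
m <- sample.combos(1000, target = 20, lowest = 4, size = 4)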

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something there for this with melt())
Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Anyone have any ideas how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6) # column indices of the three groups
get.col <- function(x, j) x[, j]
# TRUE for rows whose elements increase across the given columns
# (note: this masks base::is.ordered)
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
# a row is valid if every group's columns are in increasing order, which keeps
# exactly one representative of each within-group permutation (720 / 2^3 = 90)
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90
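As a sanity check, the first few retained rows match the ordering sketched in the question:
head(dat, 5)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    2    3    4    5    6
# [2,]    1    2    3    5    4    6
# [3,]    1    2    3    6    4    5
# [4,]    1    2    4    5    3    6
# [5,]    1    2    4    6    3    5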

Traverse matrix (grid of points) in blocks of 4

This question is for a project and has nothing to do with homework/academics. I am a working statistician.
So my question is: how would you write an R function that, given a matrix with 400 rows and two columns, where every 20 rows starting from the first form a row of coordinates in a grid of points, returns the four corners of each individual square/rectangle within the grid?
Hence the output would have four columns, and each row would indicate a rectangle. I am only looking at adjacent rectangles of the same size. For example, if the numbers below denote the row indices of the example matrix (which has two columns):
Example of row indices:
1 2 3
4 5 6
7 8 9
Would have to be traversed in the following order:
[1,2,4,5], [2,3,5,6], [4,5,7,8], [5,6,8,9]
and return the corresponding 2-D points from the example input data set, which would have 9 rows and 2 columns. Here the grid is specified to be 3 by 3, while in my example the grid is 20 by 20 and my input dataset is 400 rows by 2 columns. If you look at the traversed result, there is a pattern where the row indices in each 4-point block are incremented by 1. I just want to generalize this to the 400 by 2 case, or any setting where there is a 2-column matrix of points and the grid dimension is given.
Here is a solution if I have understood you correctly. It was a very interesting problem to be honest. :D
The idea is to make a box of a given edge length and then move this box around the grid and record its vertices. Please see the following:
# Assuming the grid is always a square grid.
grid.size <- 20

# The matrix of row indices.
rindex.grid <- matrix(1:(grid.size * grid.size),
                      nrow=grid.size, ncol=grid.size, byrow=TRUE)

# We can traverse the grid by moving any given square either right or down in any
# single move. We choose to go right.
move.square.right <- function (this.square, steps=1) {
  this.square + steps
}

# Going right, capture co-ordinates of all squares in this row.
collect.sq.of.edge.length.in.row.number <- function (grid.size, elength,
                                                     rownum=1) {
  first.square.in.row <- (rownum - 1) * grid.size + c(1, elength)
  first.square.in.row <- c(first.square.in.row,
                           first.square.in.row + grid.size * (elength - 1))
  squares.in.row <- t(sapply(X=seq_len(grid.size - (elength - 1)) - 1,
                             FUN=move.square.right,
                             this.square=first.square.in.row))
  squares.in.row
}

# Now we start going down the columns and using the function above to collect
# squares in each row. Then we will rbind the list of squares in each row into a
# dataframe. So what we get is a ((grid.size - (elength - 1)) ^ 2) x 4 matrix where
# each row is the co-ordinates of a square of edge length elength.
collect.sq.of.edge.length.in.grid <- function (grid.size, elength) {
  all.squares <- lapply(X=seq_len(grid.size - (elength - 1)),
                        FUN=collect.sq.of.edge.length.in.row.number,
                        grid.size=grid.size, elength=elength)
  all.squares <- do.call(rbind, all.squares)
  all.squares
}
This seems to show that we are getting the right number of boxes for all edge lengths:
tmp <- sapply(1:20, collect.sq.of.edge.length.in.grid, grid.size=grid.size)
sapply(tmp, nrow)
[1] 400 361 324 289 256 225 196 169 144 121 100 81 64 49 36 25 16 9 4 1
Plus, it works well in your 3x3 example:
collect.sq.of.edge.length.in.grid(grid.size=3, elength=2)
[,1] [,2] [,3] [,4]
[1,] 1 2 4 5
[2,] 2 3 5 6
[3,] 4 5 7 8
[4,] 5 6 8 9
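To recover the actual 2-D points, each row of square indices can be used to index the original coordinate matrix, e.g. (a sketch, where pts stands for the question's 400 x 2 matrix of grid coordinates):
# pts: the 400 x 2 matrix of point coordinates from the question (hypothetical name)
squares <- collect.sq.of.edge.length.in.grid(grid.size = 20, elength = 2)
pts[squares[1, ], ] # the four corner points of the first square, one per row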
If you want to create a movable 20 x 20 "window" that can scroll down and/or across a 400x400 space, then use:
# mat is assumed here to be a 400 x 400 matrix
mcorners <- function(xidx, yidx) mat[xidx:(xidx+19),
                                     yidx:(yidx+19)]
mcorners(1,1) # should return mat[1:20, 1:20]
Then supply mcorners() with arguments to fit your somewhat vaguely described needs. The traversal down the first column might involve:
sapply(1:381, function(ix) yourfunc( mcorners(ix, 1) ) )

Find minimum deviation from vector from a matrix of possible vectors

I have the following data
set.seed(11)
Data <- rbind(c(1:5), c(2:6))
Candidates <- matrix(1:25 + rnorm(25), ncol = 5,
                     dimnames = list(NULL, paste0("x", 1:5)))
colnames(Data) <- colnames(Candidates)
I want to subtract each row of my Data from each row of the Candidates matrix and return the minimal absolute difference. So for row one, I want to find the smallest amount of error possible:
sum(abs(Data[1,]-Candidates[1,]))
sum(abs(Data[1,]-Candidates[2,]))
sum(abs(Data[1,]-Candidates[3,]))
sum(abs(Data[1,]-Candidates[4,]))
sum(abs(Data[1,]-Candidates[5,]))
In this case it's 38.15826. At the moment I'm not actually interested in finding out which Candidate row results in the smallest absolute deviation, I just want to know the smallest absolute deviation for each Data row.
I would then like to end up with a new dataset which has my original Data and the smallest deviation; e.g. row one would look like this:
x1 x2 x3 x4 x5 MinDev
1 2 3 4 5 38.15826
My real Candidates matrix is relatively small, but my real Data is quite large, so at the moment I'm just building a loop along the lines of
Err[i,] <- min(rbind(
  sum(abs(Data[i,]-Candidates[1,])),
  sum(abs(Data[i,]-Candidates[2,]))...))
but I'm sure there's a better, more automated way to do this, so that it can accommodate large Data matrices and Candidates matrices of different sizes.
Any ideas?
You can use sweep, rowSums, and apply to automate this. For reference, the target value for the first row:
sum(abs(Data[1,]-Candidates[1,])) ## 38.15826
Testing the sweep approach on the first row of Data:
min(
  rowSums(abs(
    ## subtract row 1 of Data from each row of Candidates
    sweep(Candidates, 2, Data[1,], "-"))))
## 38.15826
For convenience/readability, encapsulate this in a function:
getMinDev <- function(x) {
  min(rowSums(abs(sweep(Candidates, 2, x, "-"))))
}
Now apply to each row of Data:
cbind(Data,MinDev=apply(Data,1,getMinDev))
There may be methods that are marginally faster than sweep (e.g. the matrix computations given in #e4e5f4's answer), but this should be a good baseline. I like sweep because it is descriptive and doesn't depend on knowing that R uses column-major matrix ordering.
You can use apply with some matrix operations:
CalcMinDev <- function(x)
{
  m <- t(matrix(rep(x, nrow(Candidates)), nrow = nrow(Candidates)))
  min(rowSums(abs(m - Candidates)))
}
cbind(Data, MinDev=apply(Data, 1, CalcMinDev))
Following #BenBolker's suggestion to turn my comment (using the dist function with method="manhattan") into an answer:
The trick is that if you supply a matrix to dist, it returns the distances between all pairs of rows as a lower triangular matrix.
dist(rbind(Candidates, Data), method="manhattan")
#           1         2         3         4         5         6
# 2  8.786827
# 3 11.039044  3.718396
# 4 16.120267  7.333440  6.041076
# 5 21.465682 12.678855 10.426638  5.345415
# 6 38.158256 45.763021 48.015238 53.096461 58.441876
# 7 35.158256 40.763021 44.048344 48.096461 53.441876  5.000000
Here, the 6th and the 7th rows (at column indices 1 to 5) are the distances you're interested in. So, basically, you'll just have to calculate the indices to extract the elements you're interested in.
The final code would look like:
idx1 <- seq_len(nrow(Data)) + nrow(Candidates)
idx2 <- seq_len(ncol(Candidates))
tt <- dist(rbind(Candidates, Data), method="manhattan")
transform(Data, minDev = apply(as.matrix(tt)[idx1, idx2], 1, min))
# x1 x2 x3 x4 x5 minDev
# 6 1 2 3 4 5 38.15826
# 7 2 3 4 5 6 35.15826
