How to create adjacency matrix from grid coordinates in R?

I'm new to this site. I was wondering if anyone had experience with turning a list of grid coordinates (shown in the example code below as df) into an adjacency matrix. I've written a function that handles very small data sets, but the run time grows quickly as the data set gets larger (I think 800 pixels would take about 25 hours). The cause is the nested for loops, but I don't know how to get around them.
## Dummy Data
x <- c(1,1,2,2,2,3,3)
y <- c(3,4,2,3,4,1,2)
df <- as.data.frame(cbind(x,y))
df
## Here's what it looks like as an image
a <- c(NA,NA,1,1)
b <- c(NA,1,1,1)
c <- c(1,1,NA,NA)
image <- cbind(a,b,c)
f <- function(m) t(m)[,nrow(m):1]
image(f(image))
## Here's my adjacency matrix function that's slowwwwww
adjacency.coordinates <- function(x,y) {
df <- as.data.frame(cbind(x,y))
colnames(df) = c("V1","V2")
df <- df[with(df,order(V1,V2)),]
adj.mat <- diag(1,dim(df)[1])
for (i in 1:dim(df)[1]) {
for (j in 1:dim(df)[1]) {
if((df[i,1]-df[j,1]==0)&(abs(df[i,2]-df[j,2])==1) | (df[i,2]-df[j,2]==0)&(abs(df[i,1]-df[j,1])==1)) {
adj.mat[i,j] = 1
}
}
}
return(adj.mat)
}
## Here's the adjacency matrix
adjacency.coordinates(x,y)
Does anyone know of a way to do this that will work well on a set of coordinates a couple thousand pixels long? I've tried conversion to SpatialGridDataFrame and went from there but it won't get the adjacency matrix correct. Thank you so much for your time.

While I thought igraph might be the way to go here, I think you can do it more simply like:
result <- apply(df, 1, function(pt)
(pt["x"] == df$x & abs(pt["y"] - df$y) == 1) |
(abs(pt["x"] - df$x) == 1 & pt["y"] == df$y)
)
diag(result) <- 1
And avoid the loopiness and get the same result:
> identical(adjacency.coordinates(x,y),result)
[1] TRUE
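Another loop-free option, sketched here on the assumption that the coordinates are integer grid positions: two pixels share an edge exactly when their Manhattan distance is 1, so the adjacency matrix can be read off the output of dist(). This is an alternative to the apply() version above, not a replacement for it, and the name adj is only illustrative.

```r
# Dummy data from the question
x <- c(1, 1, 2, 2, 2, 3, 3)
y <- c(3, 4, 2, 3, 4, 1, 2)

# Manhattan distance between every pair of points; on an integer grid,
# edge-sharing neighbours are exactly the pairs at distance 1.
d <- as.matrix(dist(cbind(x, y), method = "manhattan"))

adj <- (d == 1) * 1      # logical -> numeric 0/1 matrix
diag(adj) <- 1           # match the question's convention of 1s on the diagonal
dimnames(adj) <- NULL    # drop the "1", "2", ... labels added by dist()
adj
```

Note that dist() still computes all pairwise distances, so memory grows with the square of the number of points; for a couple of thousand pixels that should be manageable.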

Related

How do I save a single column of data produced from a while loop in R to a dataframe?

I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
If you want to keep the while loop, you can try:
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
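As a minimal sketch of that for-loop form (the names n and dat follow the example above, and the result column is pre-allocated so it is not grown inside the loop):

```r
n <- 5
# Pre-allocate the result column with numeric NAs
dat <- data.frame(x = seq_len(n), y = NA_real_)

# seq_len(n) is safer than 1:n because it yields an empty sequence when n is 0
for (i in seq_len(n)) {
  dat$y[i] <- 10 * i
}
dat
```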
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.

How to Efficiently work with Sparse / "Long format" data matrix in R

EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the related functions are
Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])
or simply
Data <- Matrix(data=Data,sparse=T)
Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
======================================================
I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.
I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.
I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!
Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1),
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)
cosine.sparse <- function(data) {
a <- Sys.time()
colnames(data) <- c('V1', 'V2', 'V3')
nvars <- length(unique(data[,2]))
nrows <- length(unique(data[,1]))
sim <- matrix(nrow=nrows, ncol=nrows)
for (i in 1:nrows) {
data.i <- data[data$V1==i,]
length.i.sq <- sum(data.i$V3^2)
for (j in i:nrows) {
data.j <- data[data$V1==j,]
length.j.sq <- sum(data.j$V3^2)
common.vars <- intersect(data.i$V2, data.j$V2)
row1 <- data.i[data.i$V2 %in% common.vars,3]
row2 <- data.j[data.j$V2 %in% common.vars,3]
cos.sim <- sum(row1*row2)/sqrt(length.i.sq*length.j.sq)
sim[i,j] <- sim[j,i] <- cos.sim
}
if (i %% 500 == 0) {cat(i, " rows have been calculated.")}
}
b <- Sys.time()
time.elapsed <- b - a
print(time.elapsed)
return(sim)
}
cosine.sparse(Data)
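For comparison, here is a rough sketch of how the Matrix-based approach mentioned in the edit at the top might look for the cosine similarity itself. It assumes the Matrix package is installed; the names long, m, dots and cos.sim are illustrative and not from the original post.

```r
library(Matrix)

# Same toy data as above, in long format: row index, column index, value
long <- data.frame(
  row = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  col = c(1, 2, 3, 1, 2, 4, 1, 4, 5),
  val = c(1, 2, 2, 1, 1, 1, 1, 3, 1)
)

# Build the sparse matrix directly from the triplets
m <- sparseMatrix(i = long$row, j = long$col, x = long$val)

# All pairwise row dot products in one sparse operation
dots <- tcrossprod(m)

# Row lengths are the square roots of the diagonal of the dot-product matrix
norms <- sqrt(diag(dots))
cos.sim <- as.matrix(dots) / outer(norms, norms)
cos.sim
```

The final division produces a dense matrix with one row and one column per data row, so with 19000 rows it needs roughly 19000 x 19000 doubles of memory. A row consisting only of zeros would give NaN, because its length is 0.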

Trouble coding a number of matrix models to run simultaneously

I made a matrix-based population model, but I would like to run more than one simultaneously to represent different groups of animals, so that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix, but then I realised that because I run the model in a for loop and call break() under certain conditions (when that specific matrix should stop running, i.e. that group has died out), it is, understandably, stopping the whole model rather than just that single matrix.
I was wondering if anyone had suggestions on the best way to code the model so that, instead of breaking and stopping the whole for loop, it just stops running that specific matrix. I'm a little stumped. I have included a single run of one matrix below.
Also, if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times, advice would be much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-sample(c(replicate(1000,
sample(c(1,0), prob=c(0.985, 1-0.985), size = 1))),1)
if (death == 0) {
break()}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be to simply turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears? In which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says " a local population dies completely with probability 1.5% ". In which case, I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
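# 1 = group survives this step (probability 0.985), 0 = group dies out
death<-rbinom(1, size = 1, prob = 0.985)
if (death == 0) {
# zero every entry but keep the 85 x 85 shape
mat[] <- 0
}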
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
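To run several groups side by side without writing everything out nine times, one option is to keep the population arrays in a list and track which groups are still alive, using next to skip a dead group instead of break. The following is only a sketch under simplifying assumptions: build.mat() is a hypothetical stand-in for the projection matrix built in the loop above (with made-up survival values), and dispersal between groups is not modelled.

```r
set.seed(1)
n.groups <- 9
n.steps  <- 100
n.stages <- 85

# One population array per group, stored in a list
ns.list <- replicate(n.groups, {
  ns <- matrix(0, n.stages, n.steps)
  ns[12, 1] <- rpois(1, 3)
  ns[24, 1] <- rpois(1, 3)
  ns[85, 1] <- 1
  ns
}, simplify = FALSE)
alive <- rep(TRUE, n.groups)

# Hypothetical, simplified projection matrix (NOT the original model)
build.mat <- function(i) {
  mat <- matrix(0, n.stages, n.stages)
  x <- 2:84
  mat[cbind(x, x - 1)] <- 0.95            # survival on the sub-diagonal
  if (i %% 12 == 0) mat[1, 85] <- 4       # births every 12th step
  mat[85, 85] <- 1
  mat
}

for (i in 2:n.steps) {
  mat <- build.mat(i)
  for (g in which(alive)) {
    if (rbinom(1, size = 1, prob = 0.985) == 0) {
      alive[g] <- FALSE   # group g dies out; its remaining columns stay 0
      next                # skip this group only, the other groups carry on
    }
    ns.list[[g]][, i] <- mat %*% ns.list[[g]][, i - 1]
  }
}

# One column of group sizes per group
group.size <- sapply(ns.list, colSums)
```

Because every group lives in the same loop iteration, this layout also gives a natural place to move dispersing individuals between ns.list[[g]] entries at each step.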

Increase performance of R for-loops after pre-allocation of data structures

I have read a bit about increasing the performance of for loops in R, but I am still stuck with one that takes ~140 seconds.
I will start with the code:
matrix <- matrix(NA, length(register[,1]), length(AK), dimnames = list(register[,1], AK))
data.cleaned <- data[data$FO %in% register[,1],]
rownames(data.cleaned) <- paste(1:nrow(data.cleaned))
for (i in 1 : nrow(data.cleaned)) {
for (j in 1 : nrow(matrix)) {
if (data.cleaned$FO[i] == rownames(matrix)[j]) {
for (k in 1 : ncol(matrix)) {
if (data.cleaned$AK[i] == colnames(matrix)[k])
{matrix[j,k] <- 1}
}
}
}
}
Unfortunately I can't provide a reproducible example. The data.cleaned data frame has around 11000 rows. Each row holds one observation of FO (main category) and one of AK (sub-category of FO), which are two different variables.
The goal is to fill matrix[i,j] with 1 if some row contains the corresponding FO and AK combination.
Does this make sense? Please also comment if I need to be more specific or could write the post more clearly.
First step:
You can set
cnames.m <- colnames(matrix)
before you go into the loops. Then, in the innermost loop, you can do
if (data.cleaned$AK[i] == cnames.m[k]) matrix[j,k] <- 1
Second step:
The inner loop is identical with
matrix[j, data.cleaned$AK[i] == cnames.m] <- 1
So there is no need to loop with k.
matrix <- matrix(NA, length(register[,1]), length(AK), dimnames = list(register[,1], AK))
data.cleaned <- data[data$FO %in% register[,1],]
rownames(data.cleaned) <- paste(1:nrow(data.cleaned))
cnames.m <- colnames(matrix)
for (i in 1 : nrow(data.cleaned)) for (j in 1 : nrow(matrix))
if (data.cleaned$FO[i] == rownames(matrix)[j]) matrix[j, data.cleaned$AK[i] == cnames.m] <- 1
One remark about object names:
it is not a good idea to name a matrix matrix (would you name a dog Dog?)
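Going one step further than the answer above, the remaining two loops can probably be dropped as well by indexing the matrix with a two-column character matrix of (row name, column name) pairs. This is a sketch on made-up stand-in data, since the question has no reproducible example; register, AK and data below are hypothetical, and the result is called m rather than matrix.

```r
# Hypothetical stand-ins for the objects in the question
register <- data.frame(FO = c("a", "b", "c"), stringsAsFactors = FALSE)
AK <- c("k1", "k2", "k3", "k4")
data <- data.frame(FO = c("a", "b", "b", "z"),
                   AK = c("k1", "k2", "k4", "k1"),
                   stringsAsFactors = FALSE)

m <- matrix(NA, nrow(register), length(AK),
            dimnames = list(register[, 1], AK))

# Keep only rows whose FO and AK both occur in the dimnames; an unknown
# name in the index matrix would raise a "subscript out of bounds" error
data.cleaned <- data[data$FO %in% register[, 1] & data$AK %in% AK, ]

# Each row of the index matrix names one cell (row name, column name)
m[cbind(data.cleaned$FO, data.cleaned$AK)] <- 1
m
```

If FO or AK are stored as factors, they would need as.character() before cbind().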

choose n most distant points in R

Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?
An inefficient method that probably wouldn't do too well with a big dataset would be the following (identify 20 points out of 1000 that are most distant):
xy <- cbind(rnorm(1000),rnorm(1000))
n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000){
subset <- xy[sample(1:nrow(xy),n),]
avg <- mean(dist(subset))
if (avg > bestavg) {
bestavg <- avg
bestSet <- subset
}
}
This code, based on Pascal's code, repeatedly drops the point that has the smallest row sum in the distance matrix, i.e. the point that is closest to all the others in total.
m2 <- function(xy, n){
subset <- xy
alldist <- as.matrix(dist(subset))
while (nrow(subset) > n) {
cdists = rowSums(alldist)
closest <- which(cdists == min(cdists))[1]
subset <- subset[-closest,]
alldist <- alldist[-closest,-closest]
}
return(subset)
}
Run on a Gaussian cloud, where m1 is @Pascal's function:
> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)
See who did best by looking at the sum of the interpoint distances:
> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975
Method 2 wins! And compare with a random sample of 20 points:
> sum(dist(xy[sample(1000,20),]))
[1] 349.3905
which does pretty poorly as expected.
So what's going on? Let's plot:
> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)
Method 1 generates the red points, which are evenly spaced out over the space. Method 2 creates the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
Using a bimodal pattern of initial points also illustrates this:
and again method 2 produces much larger total sum distance than method 1, but both do better than random sampling:
> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34
Following @Spacedman's suggestion, I have written a function that drops a point from the closest pair, until the desired number of points remains. It seems to work well, however, it slows down pretty quickly as you add points.
xy <- cbind(rnorm(1000),rnorm(1000))
n <- 20
subset <- xy
alldist <- as.matrix(dist(subset))
diag(alldist) <- NA
alldist[upper.tri(alldist)] <- NA
while (nrow(subset) > n) {
closest <- which(alldist == min(alldist,na.rm=T),arr.ind=T)
subset <- subset[-closest[1,1],]
alldist <- alldist[-closest[1,1],-closest[1,1]]
}
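A different technique that may be worth trying, and that is not one of the methods above, is greedy farthest-point selection: start from the two points that are farthest apart, then repeatedly add the point whose distance to its nearest already-chosen point is largest. It needs only n passes over the data instead of one pass per dropped point. It aims for an even spread (a large minimum distance) rather than a large sum of distances, so its result should look more like the red points than the blue ones. The names chosen and nearest are illustrative.

```r
set.seed(1)
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20

d <- as.matrix(dist(xy))

# Start with the pair of points that are farthest apart
chosen <- as.vector(which(d == max(d), arr.ind = TRUE)[1, ])

# For every point, the distance to its nearest chosen point
nearest <- pmin(d[, chosen[1]], d[, chosen[2]])

while (length(chosen) < n) {
  nxt <- unname(which.max(nearest))     # point farthest from the chosen set
  chosen <- c(chosen, nxt)
  nearest <- pmin(nearest, d[, nxt])    # update nearest-chosen distances
}

subset <- xy[chosen, ]
```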
