My problem is very simple: I need to create an adjacency list/matrix from a list of edges.
I have an edge list stored in a CSV file (column 1 = node1, column 2 = node2), and I would like to convert it to a weighted adjacency list or a weighted adjacency matrix.
To be more precise, here's what the data looks like, where the numbers are simply node IDs:
node1,node2
551,548
510,512
548,553
505,504
510,512
552,543
512,510
512,510
551,548
548,543
543,547
543,548
548,543
548,542
Any tips on how to achieve the conversion from this to a weighted adjacency list/matrix?
This is how I tried to do it previously, without success (code courtesy of Dai Shizuka):
dat=read.csv(file.choose(),header=TRUE) # choose an edgelist in .csv file format
el=as.matrix(dat) # coerces the data into a two-column matrix format that igraph likes
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
g=graph.edgelist(el,directed=FALSE) # turns the edgelist into a 'graph object'
Thank you!
This response uses base R only. The result is a standard matrix used to represent the adjacency matrix.
el <- cbind(a=1:5, b=5:1) #edgelist (a=origin, b=destination)
mat <- matrix(0, 5, 5)
mat[el] <- 1
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 1
#[2,] 0 0 0 1 0
#[3,] 0 0 1 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
Here mat is your adjacency matrix defined from edgelist el, which is a simple cbind of the vectors 1:5 and 5:1.
If your edgelist includes weights, then you need a slightly different solution.
el <- cbind(a=1:5, b=5:1, c=c(3,1,2,1,1)) # edgelist (a=origin, b=destination, c=weight)
mat<-matrix(0, 5, 5)
for(i in 1:NROW(el)) mat[ el[i,1], el[i,2] ] <- el[i,3] # SEE UPDATE
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 3
#[2,] 0 0 0 1 0
#[3,] 0 0 2 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
UPDATE
Some time later I realized that the for loop (3rd line) in the previous weighted edgelist example is unnecessary. You can replace it with the following vectorized operation:
mat[el[,1:2]] <- el[,3]
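Putting it together, the weighted example becomes (a short sketch reusing the el and mat definitions above):
el <- cbind(a=1:5, b=5:1, c=c(3,1,2,1,1))
mat <- matrix(0, 5, 5)
mat[el[, 1:2]] <- el[, 3] # each row of el[, 1:2] is treated as a (row, column) index pair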
The post on my website you mention in the question (https://sites.google.com/site/daishizuka/toolkits/sna/sna_data) uses the igraph package, so make sure that is loaded.
Moreover, I recently realized that igraph provides a much easier way to create a weighted adjacency matrix from edgelists, using graph.data.frame(). I've updated this on my site, but here is a simple example:
library(igraph)
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
g=graph.data.frame(el)
get.adjacency(g,sparse=FALSE)
That should do it. The sparse=FALSE argument tells it to show the 0s in the adjacency matrix.
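If your edgelist carries an explicit weight column rather than repeated rows, graph.data.frame() stores the extra column as an edge attribute, and get.adjacency() can, if I remember correctly, pull it into the matrix via its attr argument. A sketch, where the data and the column name weight are illustrative assumptions:
el.w <- data.frame(from=c('a','c','a'), to=c('b','d','d'), weight=c(2,1,3)) # hypothetical weighted edgelist
g <- graph.data.frame(el.w)
get.adjacency(g, sparse=FALSE, attr='weight') # fill cells with the weight attribute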
If you really don't want to use igraph, here is a (somewhat clunky) way to do it:
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
lab=names(table(el)) #extract the existing node IDs
mat=matrix(0,nrow=length(lab),ncol=length(lab),dimnames=list(lab,lab)) #create a matrix of 0s with the node IDs as rows and columns
for (i in 1:nrow(el)) mat[el[i,1],el[i,2]]=mat[el[i,1],el[i,2]]+1 #for each row in the edgelist, find the appropriate cell in the empty matrix and add 1.
Start with your data frame, edges, and use igraph to obtain the adjacency matrix:
head(edges)
node1 node2
1 551 548
2 510 512
3 548 553
4 505 504
5 510 512
6 552 543
library(igraph)
as.matrix(get.adjacency(graph.data.frame(edges)))
551 510 548 505 552 512 543 553 504 547 542
551 0 0 2 0 0 0 0 0 0 0 0
510 0 0 0 0 0 2 0 0 0 0 0
548 0 0 0 0 0 0 2 1 0 0 1
505 0 0 0 0 0 0 0 0 1 0 0
552 0 0 0 0 0 0 1 0 0 0 0
512 0 2 0 0 0 0 0 0 0 0 0
543 0 0 1 0 0 0 0 0 0 1 0
553 0 0 0 0 0 0 0 0 0 0 0
504 0 0 0 0 0 0 0 0 0 0 0
547 0 0 0 0 0 0 0 0 0 0 0
542 0 0 0 0 0 0 0 0 0 0 0
Another possibility with the qdapTools package:
library(qdapTools)
library(magrittr) # for the %>% pipe
el[rep(seq_len(nrow(el)), el[,'c']), c('a', 'b')] %>%
  {split(.[,'b'], .[,'a'])} %>%
  mtabulate()
## 1 2 3 4 5
## 1 0 0 0 0 3
## 2 0 0 0 1 0
## 3 0 0 2 0 0
## 4 0 1 0 0 0
## 5 1 0 0 0 0
I have an interaction data frame in R, like this:
> interaction
x y z
[1,] 4 1 112
[2,] 3 1 104
[3,] 2 4 19
[4,] 1 3 154
[5,] 3 5 332
[6,] 4 1 187
[7,] 5 5 489
[8,] 2 2 149
I want to convert it into a matrix, taking x as the rownames, y as the colnames, and z as the cell value; x and y can take the same value.
Does anybody know how to do this conversion, ideally in just one step in R?
Thank you very much!
-------------------2017/3/31---------------------------------------
Or, here is another version of my question:
interactions <-data.frame(x=c(40,30,20,10,30,40,50,80),y=c(50,10,40,30,50,10,50,90),z=c(112,104,19,154,332,187,489,149))
m <- matrix(0,10,10)
colnames(m)<-c(10,20,30,40,50,60,70,80,90,100)
rownames(m)<-c(10,20,30,40,50,60,70,80,90,100)
How do I convert the interactions data into the matrix m?
Thank you!
Is this the sort of thing...? It assumes you want to add duplicates - such as rows 1 and 6 (both (4,1)). (See much better solution in comment below!)
intn <- data.frame(x=c(4,3,2,1,3,4,5,2),y=c(1,1,4,3,5,1,5,2),z=c(112,104,19,154,332,187,489,149))
m <- matrix(0,nrow=max(intn$x),ncol=max(intn$y))
for(i in seq_len(nrow(intn))) {
  m[intn$x[i],intn$y[i]] <- m[intn$x[i],intn$y[i]] + intn$z[i]
}
m
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 154 0 0
[2,] 0 149 0 19 0
[3,] 104 0 0 0 332
[4,] 299 0 0 0 0
[5,] 0 0 0 0 489
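For reference, the vectorized solution alluded to above is presumably along these lines, using base R's xtabs() to sum z over each (x, y) pair (consistent with the xtabs() usage in the update below):
xtabs(z ~ x + y, intn) # contingency table summing z for each x/y combination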
In response to the follow-up question: if there are more possible values of x and y, you can still use xtabs, but add in some dummy data with the valid x and y values. The row and column names will be the combined dummy and actual values (as characters rather than numerics). Something like this...
xvals <- c(-2,0,1,2,3,4,5,2.5,7) #possible x values
yvals <- c(-1,1,2,2.5,3,4,5,6,7) #possible y values
dum <- data.frame(x=xvals,y=yvals) #xvals and yvals need to be same length
dum$z <- 0
m2 <- xtabs(z~x+y,rbind(dum,intn))
m2
y
x -1 1 2 2.5 3 4 5 6 7
-2 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 154 0 0 0 0
2 0 0 149 0 0 19 0 0 0
2.5 0 0 0 0 0 0 0 0 0
3 0 104 0 0 0 0 332 0 0
4 0 299 0 0 0 0 0 0 0
5 0 0 0 0 0 0 489 0 0
7 0 0 0 0 0 0 0 0 0
This is my R script with three nested for loops. It takes more than 2 minutes to finish 1 of 2000 rounds of the outer loop. How can I speed this up?
col <- NULL
row <- NULL
rep <- ncol(dat) - 2
dist <- NULL
c1 = 3
for (i in 1:rep) {
  c2 = 3
  for (j in 1:rep) {
    r = 1
    for (k in 1:nrow(dat)) {
      p <- (dat[r, c1] - dat[r, c2])^2
      row <- rbind(row, p)
      r = r + 1
    }
    row <- sqrt(sum(row))
    row1 <- (1/(2*length(unique(dat[, 1]))))*row
    col <- cbind(col, row1)
    c2 = c2 + 1
    row <- NULL
  }
  dist <- rbind(dist, col)
  col <- NULL
  c1 = c1 + 1
}
EDIT:
> head(dat)
mark alle G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 G21 G22 G23 G24
1 M1 228 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.0 0.5 0 0
2 M1 234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.5 0.5 1 1
3 M1 232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0
4 M1 240 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0
5 M1 230 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.5 0.0 0 0
6 M1 238 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0
I don't know the Modified Rogers genetic distance, but it looks like the Euclidean distance multiplied by 1/(2*length(unique(dat$mark))):
f <- 1/(2*length(unique(dat$mark)))
d <- f*dist(t(dat[, -c(1, 2)]), method="euclidean")
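If you need the result as a full symmetric matrix rather than a "dist" object, one extra step does it (a sketch using d from above; mat is just an illustrative name):
mat <- as.matrix(d) # full symmetric matrix with a zero diagonal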
The biggest thing you can do to speed up the loop is to preallocate the vectors and matrices before the loops. Then, instead of using cbind() and rbind(), add the results to the vectors/matrices like so:
# Was: row<-rbind(row, p)
row[k] <- p
# Was: col<-cbind(col, row1)
col[j] <- row1
# Was: dist<-rbind(dist,col)
dist[i, ] <- col
After that, you can explore ways to vectorize the operation or, better yet, see if there already exists a function to perform this task (or if the task is based on something for which there exists a function). Further, anything that does not depend on the loop (such as the factor 1/(2*length(unique(dat[,1])))) should be moved out of the loop; otherwise you are just recalculating the same value over and over, which negatively affects performance.
The key with loops is to avoid rbind() and cbind(): preallocating the vectors and matrices before the loop provides a large performance boost.
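For concreteness, a minimal sketch of the preallocation, using the dimensions from the question's code (rep and dat assumed to be defined as there):
# preallocate once, before the loops
row <- numeric(nrow(dat)) # inner accumulator, one slot per row of dat
col <- numeric(rep) # one slot per inner iteration
dist <- matrix(0, nrow=rep, ncol=rep) # full result matrix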
Although a similar function already exists, I tried my own way.
I removed one complete for loop, as well as rbind() and cbind().
Now it takes only 124 seconds to fill the whole 1014 x 1014 matrix, versus 2 minutes for just one round (i.e., 1 x 1014) before.
dat <- read.table("alreq1.txt", sep="\t", header=T)
col <- NULL
row <- NULL
rep <- ncol(dat) - 2
dist <- data.frame(matrix(NA, nrow = rep, ncol = rep))
m <- 1/sqrt(2*length(unique(dat[,1])))
c1 = 3
for (i in 1:rep) {
  c2 = 3
  for (j in 1:rep) {
    p <- na.omit(dat[, c1] - dat[, c2])^2
    row <- sum(p)
    row <- sqrt(row)*m
    col[j] <- row
    c2 = c2 + 1
    row <- NULL
    p <- NULL
  }
  dist[i, ] <- col
  c1 = c1 + 1
  col <- NULL
}
Hopefully this code can still be improved.
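As a further step, if the genotype columns contain no missing values, both remaining loops can be collapsed into a single dist() call (a sketch matching the 1/sqrt(2*n) scaling above; note it treats NAs differently from the per-pair na.omit() used in the loop):
m <- 1/sqrt(2*length(unique(dat[,1])))
dist <- m * as.matrix(dist(t(dat[, -c(1, 2)]), method="euclidean"))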
I want to create a matrix for around 200 zip codes and the neighboring zip codes that touch them. The matrix would be 200 x 200, with 1s in the cells where the two zip codes touch and 0s where they are not neighbors.
How could I create or obtain such a matrix? Thank you very much.
If you have access to a shapefile, this is relatively straightforward with the help of the spdep package.
Here's a standalone example using Californian zip code data (~3.5MB download):
# load libraries
library(rgdal)
library(spdep)
# download, unzip and import shapefile
download.file('http://geocommons.com/overlays/305142.zip', {f<-tempfile()})
unzip(f, exdir=tempdir())
shp <- readOGR(tempdir(), 'tigerline_shapefile_2010_2010_state_california_2010_census_5-digit_zip_code_tabulation_area_zcta5_state-based')
# identify neighbours for each poly
nbs <- setNames(poly2nb(shp), shp$ZCTA5CE10)
# convert to a binary neighbour matrix
nbs.mat <- nb2mat(nbs, zero.policy=TRUE, style='B')
# see ?rgeos::gTouches for an alternative to the above steps
# assign zip codes as dimension names
dimnames(nbs.mat) <- list(shp$ZCTA5CE10, shp$ZCTA5CE10)
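For completeness, a sketch of the rgeos alternative mentioned in the comment above (using the same shp object; gTouches() returns a logical matrix, coerced here to 0/1):
library(rgeos)
touch.mat <- gTouches(shp, byid=TRUE) * 1L # TRUE/FALSE -> 1/0
dimnames(touch.mat) <- list(shp$ZCTA5CE10, shp$ZCTA5CE10)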
For our dataset, this returns a 1769 x 1769 matrix indicating which zip codes are neighbours. The first 10 rows and 10 columns look like this:
nbs.mat[1:10, 1:10]
## 94601 94501 94560 94587 94580 94514 94703 95601 95669 95901
## 94601 0 1 0 0 0 0 0 0 0 0
## 94501 1 0 0 0 0 0 0 0 0 0
## 94560 0 0 0 0 0 0 0 0 0 0
## 94587 0 0 0 0 0 0 0 0 0 0
## 94580 0 0 0 0 0 0 0 0 0 0
## 94514 0 0 0 0 0 0 0 0 0 0
## 94703 0 0 0 0 0 0 0 0 0 0
## 95601 0 0 0 0 0 0 0 0 0 0
## 95669 0 0 0 0 0 0 0 0 0 0
## 95901 0 0 0 0 0 0 0 0 0 0
Optionally, if you want a two-column matrix giving neighbouring pairs of zip codes (i.e., zip code in col 1, and neighbouring zip code in col 2), you can use the following.
nbs.list <- sapply(row.names(nbs.mat), function(x) names(which(nbs.mat[x, ] == 1)))
nbs.pairs <- data.frame(zipcode=rep(names(nbs.list), sapply(nbs.list, length)),
neighbour=unlist(nbs.list))
head(nbs.pairs)
## zipcode neighbour
## 946011 94601 94501
## 946012 94601 94602
## 946013 94601 94605
## 946014 94601 94606
## 946015 94601 94621
## 946016 94601 94619
I am trying to create a loop in R that replaces the 0s with 1s in every 2 rows of my matrix,
but all I manage to create is a matrix full of 1s. I don't know how to implement this quickly and correctly!
R code
i<-1
r<-1
l<-1
repeat{
  while(i < 3){
    if(l < 7) {
      r <- rbind(r, 1)
      l <- l + 1
    } else {
      r <- cbind(r, 1)
      i <- i + 1
      l <- 1
    }
  }
}
Starting example matrix:
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
Final matrix:
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
Is it a requirement to use a for loop? R works best using vectorized statements, e.g.:
diag(3)[rep(1:3, each = 2), ]
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 0 1 0
# [4,] 0 1 0
# [5,] 0 0 1
# [6,] 0 0 1
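The same idea generalizes to other sizes; a sketch for n columns with k identical consecutive rows per column (n and k are just illustrative names):
n <- 3 # number of columns
k <- 2 # identical consecutive rows per column
diag(n)[rep(seq_len(n), each = k), ]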
I am new to R, so my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data looks like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, several times (bootstrap), in order to calculate diversity indices from the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which implies some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a=2, b=3, c=4)
# Sampling from first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways, but here is a fairly simple one using a counting loop:
# Count the number of each entry and store in a list.
site_sample <- vector("list", ncol(data)) # initialize before the loop
for (i in 1:ncol(data)){
  site_sample[[i]] <- sum(samples==i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
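Alternatively, the counting loop and the unlist() step can be replaced with base R's tabulate(), which counts occurrences of each integer directly (a sketch using the samples vector from above):
site_sample <- tabulate(samples, nbins=ncol(data)) # counts per column index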
Just stumbled upon this thread; the vegan package has a function called 'rrarefy' that does precisely what you're looking to do (and in the same ecological context, too).
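A minimal sketch of its use (assuming the site-by-species counts are in mydata and every row sums to at least 50):
library(vegan)
rarefied <- rrarefy(mydata, sample=50) # one random subsample of 50 individuals per site, without replacement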
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers between 1 and 15 signify species 1, between 16 and 35 signify species 2, and between 36 and 55 signify species 3.
## Initially takes in a row of the data and the number of samples to take
sample_species <- function(counts, n) {
  num_species <- length(counts)
  total_count <- sum(counts)
  samples <- sample(1:total_count, n, replace=FALSE)
  samples <- samples[order(samples)]
  result <- array(0, num_species)
  total <- 0
  for (i in 1:num_species) {
    result[i] <- length(which(samples > total & samples <= total + counts[i]))
    total <- total + counts[i]
  }
  return(result)
}
A <- matrix(sample(0:100,10*40,replace=T), ncol=10) ## mock data
B <- t(apply(A,1,sample_species,50)) ## results