How do I find other zipcodes that touch a particular zipcode? - r

I want to create a matrix for around 200 zipcodes and neighboring zipcodes that touch those zipcodes. The matrix would be 200*200 with 1's for the cells in which the two zipcodes touch and 0 when they are not neighboring zipcodes.
How could I create or get such a matrix? Thank you very much.
Best,

If you have access to a shapefile, this is relatively straightforward with the help of the spdep package.
Here's a standalone example using Californian zip code data (~3.5MB download):
# load libraries
library(rgdal)
library(spdep)
# download, unzip and import shapefile
download.file('http://geocommons.com/overlays/305142.zip', {f<-tempfile()})
unzip(f, exdir=tempdir())
shp <- readOGR(tempdir(), 'tigerline_shapefile_2010_2010_state_california_2010_census_5-digit_zip_code_tabulation_area_zcta5_state-based')
# identify neighbours for each poly
nbs <- setNames(poly2nb(shp), shp$ZCTA5CE10)
# convert to a binary neighbour matrix
nbs.mat <- nb2mat(nbs, zero.policy=TRUE, style='B')
# see?rgeos::gTouches for an alternative to the above steps
# assign zip codes as dimension names
dimnames(nbs.mat) <- list(shp$ZCTA5CE10, shp$ZCTA5CE10)
For our dataset, this returns a 1769 x 1769 matrix indicating which zip codes are neighbours. The first 10 rows and 10 columns look like this:
nbs.mat[1:10, 1:10]
## 94601 94501 94560 94587 94580 94514 94703 95601 95669 95901
## 94601 0 1 0 0 0 0 0 0 0 0
## 94501 1 0 0 0 0 0 0 0 0 0
## 94560 0 0 0 0 0 0 0 0 0 0
## 94587 0 0 0 0 0 0 0 0 0 0
## 94580 0 0 0 0 0 0 0 0 0 0
## 94514 0 0 0 0 0 0 0 0 0 0
## 94703 0 0 0 0 0 0 0 0 0 0
## 95601 0 0 0 0 0 0 0 0 0 0
## 95669 0 0 0 0 0 0 0 0 0 0
## 95901 0 0 0 0 0 0 0 0 0 0
Optionally, if you want a two-column matrix giving neighbouring pairs of zip codes (i.e., zip code in col 1, and neighbouring zip code in col 2), you can use the following.
nbs.list <- sapply(row.names(nbs.mat), function(x) names(which(nbs.mat[x, ] == 1)))
nbs.pairs <- data.frame(zipcode=rep(names(nbs.list), sapply(nbs.list, length)),
neighbour=unlist(nbs.list))
head(nbs.pairs)
## zipcode neighbour
## 946011 94601 94501
## 946012 94601 94602
## 946013 94601 94605
## 946014 94601 94606
## 946015 94601 94621
## 946016 94601 94619

Related

Counting repeated 5-mers gene from 100 DNA sequence samples

I am beginner in R and trying to solve this but have been struggling for few days already. Please help a newbie out.
I extracted 100 samples each of length 1000 from a 100,000 DNA sequence. Then, I want to count "AATAA" appeared how many times in the each of the sample.
dog_100
# [1] "GGGTCCTTGAAAGAAGCACAGGGTGGGGGTGGGGGTGGGGGTGGGGGAAGGCAGAGAGGAGGAAACAGGTTTTTGTCCTCAGGGCGTTGCCAGTCTGAAGGAGGTGATGGGATAATTATTTATGAGAGTTCAGGAATGCCAGGCATGGATTAAATGCAAACTAATGGAAATGACACAGAACAATACATTACAC......................................"
#[2] "CCAGGCCAGAACTGAGGCCCTCAGGGCCCCCCAGAATTCCTCATTTGCAGGATAAAAATATACTCAGCTCTTCAATCTTGGTTCTTGCTACTGCACCATGTGCTTCCTGGACTCTGGGAGGCCAGGGGTTAAGTGGGAGTGTTTGAATAAGGGAAAGGATGAGCCCTTTCCCCACACTTTGCCCCAAATAAC......................................"
#[3]
#........
# [4]
#........
# [100]
#........
I wrote a function to identify and count the "AATAA".
R
library(stringr)
cal_AATAA <- function(DNA){
sam_pro <- numeric(length(DNA))
k <- 5
sam_code <- "AATAA"
for(i in 1:(length(DNA))){
Num <- str_length(DNA[i])
for(j in 1:(Num - k +1)){
if ((str_sub(DNA[i], j, j+k-1)) == sam_code){
sam_pro[i] <- sam_pro[i] + 1
}
else {
sam_pro[i] <- sam_pro[i]
}
}
return (sam_pro)
}
}
sample_100 <- cal_AATAA(dog_100)
What I got after running the function is
> sample_100
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[91] 0 0 0 0 0 0 0 0 0 0
Tried to debug my code but don't know where went wrong. Appreciate any tips or guidance.
R has a built in function called gregexpr which can be used for counting patterns in a string. It outputs a list, so we have to use sapply to loop through the elements of the output. For each element, we count the number of values that are greater than zero because a value of -1 indicates that any match was not found. Look at the output of gregexpr("ap", c("appleap", "orange")) as an example.
dna = c("AGTACGTGCATAGC", "GTAGCTAGCTAGCAT")
sam = "AGC"
sapply(gregexpr(sam, dna), function(x) sum(x > 0))
#[1] 1 3

Compute percentage weights on rows when one column is not numeric

I have this data called out:
Dates Consumer Staples Energy Financials Health Care
1 12/31/99 0 0 0 0 0
2 03/31/00 0 0 0 0 0
3 06/30/00 0 0 0 0 0
4 09/30/00 0 0 0 0 0
5 12/31/00 0 0 0 0 0
6 03/31/01 1000 0 0 50 0
7 06/30/01 0 0 0 0 0
I would like to compute the weights for each category on each row
but need to avoid summing the first column which is a date
Weights <- round(out[2:6]/rowSums(out[2:6])*100, 2)
1/ Is there a way to keep the dates in the first column, and compute
the weights of the next 5 columns in the same data set
2/ When a date has only 0 data, how to avoid the NAs?
Thank you for you help
outN <- out[,-1]
rownames(outN) <- out[,1]
Cap_Weights <- round(outN/rowSums(outN)*100, 2)
Cap_Weights[is.na(Cap_Weights)] <- 0

Random subsampling in R

I am new in R, therefore my question might be really simple.
I have a 40 sites with abundances of zooplankton.
My data looks like this (columns are species abundances and rows are sites)
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to is take a random sub-sample (e.g. 50 individuals) from each site without replacement several times (bootstrap) in order to calculate diversity indexes to the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which provides some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a=2, b=3, c=4)
# Sampling from first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways, but here is a fairly simple one using a simple counting loop:
# Count the number of each entry and store in a list.
for (i in 1:ncol(data)){
site_sample[[i]] <- sum(samples==i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
Just stumbled upon this thread, and the vegan package has a function called 'rrarify' that does precisely what you're looking to do (and in the same ecological context, too)
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers generated between 1 and 15 signify species 1, 16 and 35 signify species 2, and 36-55 signify species 3.
## Initially takes in a row of the data and the number of samples to take
sample_species <- function(counts,n) {
num_species <- length(counts)
total_count <- sum(counts)
samples <- sample(1:total_count,n,replace=FALSE)
samples <- samples[order(samples)]
result <- array(0,num_species)
total <- 0
for (i in 1:num_species) {
result[i] <- length(which(samples > total & samples <= total+counts[i]))
total <- total+counts[i]
}
return(result)
}
A <- matrix(sample(0:100,10*40,replace=T), ncol=10) ## mock data
B <- t(apply(A,1,sample_species,50)) ## results

Vertex names by creating a network object via an edgelist (R package: network)

I want to create a network object, representing a directed network on basis of an edgelist. The first column contains some unique ID of project leaders, the second project partners, let's say:
library("network")
x <- cbind(rbind(1,1,2,2,3), rbind(3,7,10,9,6))
y.nw <- network(x, matrix="edgelist", directed=TRUE, loops=FALSE)
Now my problem is: I need all vertexes to have the right ID, since after creating the network object I have to transfer it back to a adjacency matrix with the right corresponding firm IDs. However, I am not sure in which order I should assign them, since I sorted the dataframe by column 1 (project leaders), which, however, not always show up as project partners as well.
If your ids are sequential integers as in your example, you can produce the adjacency matrix corresponding to the edgelist in your example with:
>as.sociomatrix(y.nw))
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
But maybe you have a different type of id system in your real input?

How to create weighted adjacency list/matrix from edge list?

My problem is very simple: I need to create an adjacency list/matrix from a list of edges.
I have an edge list stored in a csv document with column1 = node1 and column2 = node2 and I would like to convert this to a weighted adjacency list or a weighted adjacency matrix.
To be more precise, here's how the data looks like -where the numbers are simply node ids:
node1,node2
551,548
510,512
548,553
505,504
510,512
552,543
512,510
512,510
551,548
548,543
543,547
543,548
548,543
548,542
Any tips on how to achieve the conversion from this to a weighted adjacency list/matrix?
This is how I resolved to do it previously, without success (courtesy of Dai Shizuka):
dat=read.csv(file.choose(),header=TRUE) # choose an edgelist in .csv file format
el=as.matrix(dat) # coerces the data into a two-column matrix format that igraph likes
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
g=graph.edgelist(el,directed=FALSE) # turns the edgelist into a 'graph object'
Thank you!
This response uses base R only. The result is a standard matrix used to represent the adjacency matrix.
el <- cbind(a=1:5, b=5:1) #edgelist (a=origin, b=destination)
mat <- matrix(0, 5, 5)
mat[el] <- 1
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 1
#[2,] 0 0 0 1 0
#[3,] 0 0 1 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
Here mat is your adjacency matrix defined from edgelist el, which is a simple cbind of the vectors 1:5 and 5:1.
If your edgelist includes weights, then you need a slightly different solution.
el <- cbind(a=1:5, b=5:1, c=c(3,1,2,1,1)) # edgelist (a=origin, b=destination, c=weight)
mat<-matrix(0, 5, 5)
for(i in 1:NROW(el)) mat[ el[i,1], el[i,2] ] <- el[i,3] # SEE UPDATE
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 3
#[2,] 0 0 0 1 0
#[3,] 0 0 2 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
UPDATE
Some time later I realized that the for loop (3rd line) in the previous weighted edgelist example is unnecessary. You can replace it with the following vectorized operation:
mat[el[,1:2]] <- el[,3]
The post on my website you mention in the question (https://sites.google.com/site/daishizuka/toolkits/sna/sna_data) uses the igraph package, so make sure that is loaded.
Moreover, I recently realized that igraph provides a much easier way to create a weighted adjacency matrix from edgelists, using graph.data.frame(). I've updated this on my site, but here is a simple example:
library(igraph)
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
g=graph.data.frame(el)
get.adjacency(g,sparse=FALSE)
That should do it. The sparse=FALSE argument tells it to show the 0s in the adjacency matrix.
If you really don't want to use igraph, I think this is a clunky way to do it:
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
lab=names(table(el)) #extract the existing node IDs
mat=matrix(0,nrow=length(lab),ncol=length(lab),dimnames=list(lab,lab)) #create a matrix of 0s with the node IDs as rows and columns
for (i in 1:nrow(el)) mat[el[i,1],el[i,2]]=mat[el[i,1],el[i,2]]+1 #for each row in the edgelist, find the appropriate cell in the empty matrix and add 1.
Start with your data frame edges and use igraph to obtain adjacency matrix:
head(edges)
node1 node2
1 551 548
2 510 512
3 548 553
4 505 504
5 510 512
6 552 543
library(igraph)
as.matrix(get.adjacency(graph.data.frame(edges)))
551 510 548 505 552 512 543 553 504 547 542
551 0 0 2 0 0 0 0 0 0 0 0
510 0 0 0 0 0 2 0 0 0 0 0
548 0 0 0 0 0 0 2 1 0 0 1
505 0 0 0 0 0 0 0 0 1 0 0
552 0 0 0 0 0 0 1 0 0 0 0
512 0 2 0 0 0 0 0 0 0 0 0
543 0 0 1 0 0 0 0 0 0 1 0
553 0 0 0 0 0 0 0 0 0 0 0
504 0 0 0 0 0 0 0 0 0 0 0
547 0 0 0 0 0 0 0 0 0 0 0
542 0 0 0 0 0 0 0 0 0 0 0
Another possibility with the qdapTools package:
library(qdapTools)
el[rep(seq_len(nrow(el)), el[,'c']), c('a', 'b')] %>%
{split(.[,'b'], .[,'a'])} %>%
mtabulate()
## 1 2 3 4 5
## 1 0 0 0 0 3
## 2 0 0 0 1 0
## 3 0 0 2 0 0
## 4 0 1 0 0 0
## 5 1 0 0 0 0

Resources