Populate Binary Matrix with Double For Loop in R - r

I'm working on populating a binary matrix based on values from a different table. I can create the matrix but am struggling with the looping needed to populate it. I think this is a pretty simple issue so I hope I can get some easy help.
Here's an example of my data:
start <- c(291, 291, 291, 702, 630, 768)
sequence <- c("chr9:103869456:103870456", "chr5:30823103:30824103", "chr11:49801703:49802703", "chr4:133865601:133866601", "chr12:55738034:55739034", "chr8:96569493:96570493")
motif <- c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1")
df <- data.frame(start, sequence, motif)
I have created a character vector for each unique motif+start values like so:
x <- sprintf("%s_%d", df$motif, df$start)
x <- unique(x)
Next I create a binary matrix with the sequences as rows and the values from x as columns:
binmat <- matrix(0, nrow = length(df$sequence), ncol = length(x))
rownames(binmat) <- df$sequence
colnames(binmat) <- x
And now I'm stuck. I want to iterate through columns and rows and put a 1 in each position that has a match. For example, the first sequence is "chr9:103869456:103870456" and it has motif "ARI5B" at starting position 291, so it should get a 1 while the rest of the values in that row remain at 0. The output of this example should look like this:
                         ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
chr9:103869456:103870456         1         0        0        0
chr5:30823103:30824103           1         0        0        0
chr11:49801703:49802703          1         0        0        0
chr4:133865601:133866601         0         1        0        0
chr12:55738034:55739034          0         0        1        0
chr8:96569493:96570493           0         0        0        1
But so far I am unsuccessful. I think I need a double for loop somewhere along these lines:
for (row in binmat){
for (col in binmat){
if (row && col %in% x){
1
} else { 0
}
}
}
But all I get are 0s.
Thanks in advance!

Aren't you just looking for table here? You can get the result as a vectorized one-liner, without loops, by doing:
table(factor(df$sequence, df$sequence), sprintf("%s_%d", df$motif, df$start))
                         ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
chr9:103869456:103870456         1         0        0        0
chr5:30823103:30824103           1         0        0        0
chr11:49801703:49802703          1         0        0        0
chr4:133865601:133866601         0         1        0        0
chr12:55738034:55739034          0         0        1        0
chr8:96569493:96570493           0         0        0        1
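If you do want to keep the pre-built binmat from the question, here is a minimal sketch of two ways to fill it. The original loop got nowhere because `for (row in binmat)` walks over the cell values (all 0) rather than row indices, and a bare `1` inside `if` never assigns anything. The names key, binmat_loop and binmat_vec are introduced here only for illustration.

```r
# Data copied from the question
df <- data.frame(
  start = c(291, 291, 291, 702, 630, 768),
  sequence = c("chr9:103869456:103870456", "chr5:30823103:30824103",
               "chr11:49801703:49802703", "chr4:133865601:133866601",
               "chr12:55738034:55739034", "chr8:96569493:96570493"),
  motif = c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1")
)

key <- sprintf("%s_%d", df$motif, df$start)  # one motif_start label per row of df
x <- unique(key)

binmat <- matrix(0, nrow = nrow(df), ncol = length(x),
                 dimnames = list(df$sequence, x))

# 1) Explicit double loop over row and column *indices*
binmat_loop <- binmat
for (i in seq_len(nrow(df))) {
  for (j in seq_along(x)) {
    if (key[i] == x[j]) binmat_loop[i, j] <- 1
  }
}

# 2) Same result without loops: index with a two-column (row, col) matrix
binmat_vec <- binmat
binmat_vec[cbind(seq_len(nrow(df)), match(key, x))] <- 1
```

Both versions assume one row of df per sequence, as in the example; with repeated sequences you would build the row index with match() against the unique sequence names instead.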

Related

Creating a loop in R which also changes the column name

I am attempting to loop a command based upon a list (fish_species). And while I’ve found plenty of examples, I haven’t found one that also includes changing the column name as part of the loop. I have figured out how to get the desired result for an individual species (lines 10-13), but in the actual dataset I have ~500 species, and I’d prefer not to repeat this command 500+ times. Is there a way to substitute the values from a list where it says variable?
fishdata$variable <- ifelse(fishdata$Species == "variable", fishdata$Number, 0)
I know how to do this in ArcGIS, but I am trying to expand my horizons and learn R. This is also my first post, so please excuse any screw ups.
Thank you for any help you can provide.
fishdata <-c()
fishdata$Site <-c(1,1,1,2,2,2)
fishdata$Species <- c("one_fish", "two_fish", "two_fish", "red_fish", "blue_fish", "blue_fish")
fishdata$Number <- c(1,1,1,1,1,1)
fishdata$one_fish <-0
fishdata$two_fish <-0
fishdata$red_fish <-0
fishdata$blue_fish <-0
fish_list <- c("one_fish","two_fish", "red_fish", "blue_fish")
fishdata$one_fish <- ifelse(fishdata$Species=="one_fish",fishdata$Number,0)
fishdata$two_fish <- ifelse(fishdata$Species=="two_fish",fishdata$Number,0)
fishdata$red_fish <- ifelse(fishdata$Species=="red_fish",fishdata$Number,0)
fishdata$blue_fish <- ifelse(fishdata$Species=="blue_fish",fishdata$Number,0)
You can use sapply to iterate over species,
sapply(fishdata$Species, function(i)ifelse(fishdata$Species== i, fishdata$Number,0))
# one_fish two_fish two_fish red_fish blue_fish blue_fish
#[1,] 1 0 0 0 0 0
#[2,] 0 1 1 0 0 0
#[3,] 0 1 1 0 0 0
#[4,] 0 0 0 1 0 0
#[5,] 0 0 0 0 1 1
#[6,] 0 0 0 0 1 1
$ is just shorthand for the [[ ]] operator with a literal name:
a$x
a[["x"]]
So you can loop over the names in fish_list and assign each column by name:
for (species in fish_list) {
  fishdata[[species]] <- ifelse(fishdata$Species == species, fishdata$Number, 0)
}
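As a self-contained sketch of the loop approach: the version below assumes fishdata is built as a data.frame (the question builds it as a plain list via c()) and derives fish_list from the data rather than typing it out. Both choices are assumptions made for illustration, not part of the original post.

```r
# Assumption: fishdata as a data.frame instead of the list built in the question
fishdata <- data.frame(
  Site = c(1, 1, 1, 2, 2, 2),
  Species = c("one_fish", "two_fish", "two_fish",
              "red_fish", "blue_fish", "blue_fish"),
  Number = c(1, 1, 1, 1, 1, 1)
)

# Derive the species names from the data so ~500 species need no manual list
fish_list <- unique(fishdata$Species)

# [[ ]] accepts the column name as a string, so the loop variable can name the column
for (species in fish_list) {
  fishdata[[species]] <- ifelse(fishdata$Species == species, fishdata$Number, 0)
}
```

Using `[[species]]` rather than `$species` matters here: `$` would create a column literally called "species".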

R Populate matrix with a list of indices

I try to create an adjacency matrix M from a list pList containing the indices that have to be equal to 1 in the matrix M.
For example, M is a 10x5 matrix
The variable pList contains 5 elements, each one is a vector of indices
Example :
s <- list("1210", c("254", "534"), "254", "534", "364")
M <- matrix(c(rep(0)),nrow = 5, ncol = length(unique(unlist(s))), dimnames=list(1:5,unique(unlist(s))))
Actually, my too simple solution is the brutal one with a for loop over rows of the matrix :
for (i in 1:nrow(M)){
M[i, as.character(s[[i]])] <- 1
}
So that the expected result is :
M
  1210 254 534 364
1    1   0   0   0
2    0   1   1   0
3    0   1   0   0
4    0   0   1   0
5    0   0   0   1
The problem is that I have to manipulate matrices with several thousands of lines and it takes too much time.
I am not a "apply" expert but I wonder if there is a quicker solution
Thanks
Regards
We can convert the list to a matrix of row/column index, use that index to assign the elements in 'M' to 1.
M[as.matrix(stack(setNames(s, seq_along(s)))[,2:1])] <- 1
M
# 1210 254 534 364
#1 1 0 0 0
#2 0 1 1 0
#3 0 1 0 0
#4 0 0 1 0
#5 0 0 0 1
Or instead of using stack to convert to a data.frame, we can unlist the 's' to get the column index, cbind with row index created by replicating the sequence of list with length of each list element (using lengths) and assign the elements in 'M' to 1.
M[cbind(rep(seq_along(s), lengths(s)), unlist(s))] <- 1
Or yet another option would be to create a sparseMatrix
library(Matrix)
Un1 <- unlist(s)
sparseMatrix(i = rep(seq_along(s), lengths(s)),
j=as.integer(factor(Un1, levels = unique(Un1))),
x=1)
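To see that the matrix-index assignment really is a drop-in replacement for the row loop, here is a small check on the example data. This is only a sketch: cols, idx, M_loop and M_vec are names made up for the comparison, and match() is used to get integer column positions instead of relying on character dimnames.

```r
s <- list("1210", c("254", "534"), "254", "534", "364")
cols <- unique(unlist(s))

# Empty 5 x 4 matrix with the same dimnames as in the question
M_loop <- matrix(0, nrow = length(s), ncol = length(cols),
                 dimnames = list(seq_along(s), cols))
M_vec <- M_loop

# Original approach: one assignment per row
for (i in seq_along(s)) {
  M_loop[i, s[[i]]] <- 1
}

# Vectorized approach: a two-column (row, column) index matrix, one assignment
idx <- cbind(rep(seq_along(s), lengths(s)),  # row index, repeated per element
             match(unlist(s), cols))         # column position of each value
M_vec[idx] <- 1

identical(M_loop, M_vec)
```

The single indexed assignment avoids the per-row overhead of the loop, which is usually where the time goes when there are thousands of rows.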

Working with matrices in r

I'm working on code to construct an option pricing matrix. What I have at the moment is the values along the diagonal part of the matrix. Currently I'm working in a matrix with 4 rows and 4 columns. What I'm attempting to do is to use the values in the diagonal part of the matrix to give values in the lower triangle of the matrix. So for my matrix Omat, Omat[1,1]+Omat[2,2] will give a value for [2,1], Omat[2,2]+Omat[3,3] will give a value for [3,2]. Then using these created values, Omat[2,1]+Omat[3,2] will give a value for [3,1].
My attempt:
Omat = diag(2, 4, 4)
Omat[j+i,j] <- Omat[i-1,j]+Omat[i,j+1]
Any ideas on how one could go about this?
What I currently have, a 4 row by 4 col matrix:
Omat
# 2 0 0 0
# 0 2 0 0
# 0 0 2 0
# 0 0 0 2
What I've been attempting to create, a 4 row by 4 col matrix:
0 0 0 0
4 0 0 0
8 4 0 0
16 8 4 0
You could try calculating successive diagonals underneath the main diagonal. Code could look like:
Omat = diag(2,4)
for(i in 1:(nrow(Omat)-1)) {
for( j in (i+1):nrow(Omat)) {
Omat[j,j-i] <- Omat[j,j-i+1] + Omat[j-1,j-i]
}
}
diag(Omat) <- 0
I'm probably missing something, but why not do this:
# j runs right to left so that Omat[i, j+1] is already filled when it is used
for (i in 2:nrow(Omat)) {
  for (j in (i-1):1) {
    Omat[i,j] <- Omat[i-1,j] + Omat[i,j+1]
  }
}
diag(Omat) <- 0
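For reuse on matrices of other sizes, the diagonal-by-diagonal idea can be wrapped in a function and checked against the target matrix from the question. This is a sketch only; fill_below_diagonal is a hypothetical helper name, and it assumes a square matrix whose main diagonal already holds the starting values.

```r
# Hypothetical helper: fill the lower triangle one sub-diagonal at a time
fill_below_diagonal <- function(Omat) {
  n <- nrow(Omat)
  for (k in seq_len(n - 1)) {        # k = distance below the main diagonal
    for (j in (k + 1):n) {           # j = row of the cell being filled
      # each cell is the sum of its right-hand and upper neighbours
      Omat[j, j - k] <- Omat[j, j - k + 1] + Omat[j - 1, j - k]
    }
  }
  diag(Omat) <- 0                    # clear the seed values, as in the question
  Omat
}

res <- fill_below_diagonal(diag(2, 4))

expected <- rbind(c( 0, 0, 0, 0),
                  c( 4, 0, 0, 0),
                  c( 8, 4, 0, 0),
                  c(16, 8, 4, 0))
all(res == expected)
```

Working outward from the main diagonal guarantees that both neighbours of a cell exist before the cell itself is computed.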

Find # of rows between events in R

I have a series of data in the format (true/false). eg it looks like it can be generated from rbinom(n, 1, .1). I want a column that represents the # of rows since the last true. So the resulting data will look like
true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1
What is an efficient way to go from true/false to gap? (In practice this will be done on a large dataset with many different ids.)
DF <- read.table(text="true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1", header=TRUE)
DF$gap2 <- sequence(rle(DF$true.false)$lengths) * #create a sequence for each run length
(1 - DF$true.false) * #multiply with 0 for all 1s
(cumsum(DF$true.false) != 0L) #multiply with zero for the leading zeros
# true.false gap gap2
#1 0 0 0
#2 0 0 0
#3 1 0 0
#4 0 1 1
#5 0 2 2
#6 1 0 0
#7 1 0 0
#8 0 1 1
The cumsum part might not be the most efficient for large vectors. Something like
if (DF$true.false[1] == 0) DF$gap2[seq_len(rle(DF$true.false)$lengths[1])] <- 0
might be an alternative (and of course the rle result could be stored temporarily to avoid calculating it twice).
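Since the question mentions many different ids, here is a sketch of how the rle idea could be applied per group with base R's ave(). The function name gap_since_true and the columns id and flag are assumptions for illustration; the function expects a non-empty numeric 0/1 vector.

```r
# Rows since the last 1, with 0 for any run of zeros before the first 1
gap_since_true <- function(x) {
  r <- rle(x)                                  # computed once and reused
  gap <- sequence(r$lengths) * (1 - x)         # count within runs, zero out the 1s
  if (x[1] == 0) gap[seq_len(r$lengths[1])] <- 0  # leading zeros have no previous 1
  gap
}

# Hypothetical data with two ids
DF <- data.frame(
  id   = rep(c("a", "b"), each = 5),
  flag = c(0, 0, 1, 0, 0,
           1, 0, 0, 1, 0)
)

# ave() applies the function within each id and keeps the original row order
DF$gap <- ave(DF$flag, DF$id, FUN = gap_since_true)
DF$gap
```

Resetting per id matters because a run of zeros at the start of one id would otherwise continue the count from the previous id.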
OK, let me put this in an answer.
1) No-brainer method
data['gap'] = 0
for (i in 2:nrow(data)) {
  if (data[i, 'true/false'] == 0) {
    data[i, 'gap'] = data[i-1, 'gap'] + 1
  }
}
2) No if check
data['gap'] = 0
for (i in 2:nrow(data)) {
  # (1 - true/false) is 0 on a true row, which resets the gap
  data[i, 'gap'] = (data[i-1, 'gap'] + 1) * (1 - data[i, 'true/false'])
}
I really don't know which is faster: both do the same number of reads from data, but (1) has an if statement, and I don't know how fast that is compared to a single multiplication. Note that both loops start counting from the first row, so zeros before the first 1 get a nonzero gap, unlike the expected output in the question.

referencing indices in boot function

I have a large dataset (DF), a subset of which looks like this:
Site Event HardwareID Species Day1 Day2 Day3 Day4 Day5 Day6
   1     1      16_11       x    0    0    0    0    0    0
   1     1      29_11       y    0    0    6    2    0    1
   1     1      36_11       d    0    0    0    0    0    1
   1     1      41_11       y    0    0    2    4    1    1
   1     1      41_11       x    0    0    0    0    0    1
   1     1      58_11       a    0    0    1    0    0    0
   1     1      62_11       y    0    0    0    1    0    0
   1     1      62_11       z    0    0    0    0    0    0
   1     1      62_11       x    0    0    0    0    0    1
   2     1      40_AR       b    0    0    0    0    0    0
   2     1      12_11       z    0    0    1    0    0    0
I'd like to examine the minimum number of HardwareIDs needed to produce the most Species over the shortest amount of time, by calculating species accumulation curves (which intrinsically incorporate the Days columns) for each HardwareID at each site, and bootstrapping the HardwareID selection (so, look at accumulation curves using two HardwareIDs, then 3, then 4, etc., at each site).
I have written a function to create species accumulation curves (using specaccum) for a subset of these, such as:
Sites<-subset(DF,DF$Site==1)
samples<-function (x) {
specurve_sample<-(ddply(Sites[,4:length(colnames(Sites))],"Species",numcolwise(sum)))
specurve_sample<-specurve_sample[-1,]
n<-specurve_sample$Species
n<-drop.levels(n,reorder=FALSE)
specurve_sample<-specurve_sample[,-1]
specurve_sample <-t(specurve_sample)
colnames(specurve_sample)<-n
specurve_sample<-as.data.frame(specurve_sample)
sample_k<-specaccum(specurve_sample)
out<-rbind(sample_k$richness,sample_k$sd)
outnames<-c("Richness","SD")
st<-rep(Sites$Site[1],2)
out<-as.data.frame(cbind(outnames,st,out))
colnames(out)<-c("label","site","Days")
out
}
The function works fine if I subset my data beforehand, but the bootstrapping part does not work. I know I need to create a function(x, j) but cannot figure out where to place the j in my function. Here is the rest of my code. Many thanks for any assistance. James
all_data<-c()
for (i in 1:length(unique(DF$Site))) {
Sites<-subset(DF,DF$Site==i)
boots<-boot(Sites,samples, strata=Sites$HardwareID,R=1000)
all_data<-rbind(all_data,boots)
all_data
}
One straightforward way to do this is to create a function of x and j (as you have started to do), and have the first line of that function identify the relevant bootstrap subset from the whole collection, bootsub <- x[j, ]. Then, you can refer to this subset, bootsub throughout the rest of the function, and you need not refer to j again.
In your case, you don't want your function to refer back to your original data frame, Sites. So everywhere that you have Sites in your function, change it to bootsub. For example:
samples <- function(x, j) {
bootsub <- x[j, ]
specurve_sample <- (ddply(bootsub[, 4:length(colnames(bootsub))], "Species", numcolwise(sum)))
specurve_sample <- specurve_sample[-1, ]
n <- specurve_sample$Species
n <- drop.levels(n, reorder=FALSE)
specurve_sample <- specurve_sample[, -1]
specurve_sample <- t(specurve_sample)
colnames(specurve_sample) <- n
specurve_sample <- as.data.frame(specurve_sample)
sample_k <- specaccum(specurve_sample)
out <- rbind(sample_k$richness, sample_k$sd)
outnames <- c("Richness", "SD")
st <- rep(bootsub$Site[1], 2)
out <- as.data.frame(cbind(outnames, st, out))
colnames(out) <- c("label", "site", "Days")
out
}
...
A follow up to the first two comments below. It's a little hard to troubleshoot without data, but this is my best guess. It may be that you have an issue with your subset() function, because you use i as an index of unique sites in the for() loop, but then refer to i as the value of the site in the call to subset(). Also, it is likely more efficient to run one call to do.call() after the for() loop, rather than multiple calls to rbind() inside the loop. Give this untested code a try.
# vector of unique sites
usite <- unique(DF$Site)
# empty list in which to put the bootstrap results
alldatlist <- vector("list", length(usite))
# loop through every site separately, save the bootstrap replicates ($t)
for(i in 1:length(usite)) {
Sites <- subset(DF, DF$Site==usite[i])
alldatlist[[i]] <- boot(Sites, samples, strata=Sites$HardwareID, R=1000)$t
}
# combine the list of results into a single matrix
all_data <- do.call(rbind, alldatlist)
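Because the code above can't be run without the original data, here is a stripped-down sketch of the same (x, j) pattern on made-up data. The data frame d, its columns g and v, and the statistic (a simple mean) are all assumptions for illustration; only the shape of the statistic function and the use of $t mirror the answer.

```r
library(boot)

set.seed(1)

# Made-up data: a grouping factor used as strata and a numeric value
d <- data.frame(
  g = factor(rep(c("a", "b"), each = 10)),
  v = rnorm(20)
)

# boot() calls the statistic with the full data (x) and resampled row indices (j)
stat <- function(x, j) {
  bootsub <- x[j, ]   # the bootstrap sample; use only this from here on
  mean(bootsub$v)
}

# Resampling happens within each level of g
b <- boot(d, stat, R = 200, strata = d$g)

dim(b$t)  # one row per replicate, one column per returned statistic
b$t0      # the statistic evaluated on the original data
```

The statistic here returns a single number; boot() expects a numeric vector, so a function that returns a data frame (as samples does) may need its output flattened to a numeric vector first.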
