Counting occurrences of a specific number between specific numbers - R

I have a matrix where each element is either 0 or 1.
I would like to obtain the frequencies of consecutive occurrences of 0's in each row, given that the last 0 in the sequence is followed by a 1.
For example:
A row with: 0, 1, 0, 1, 0, 0
The expected result should be:
Consecutive 0's of length: 1
Frequency : 2
Another row with: 0, 1, 0, 0, 1, 0, 0, 0, 1
The expected result:
Consecutive 0's of length: 1 2 3
Frequency: 1 1 1
A further objective is then to sum the frequencies of the same length, in order to know how many times a single 0 was followed by a 1, how many times two consecutive 0's were followed by a 1, etc.
Here is an exemplary matrix on which I would like to apply the routine:
m = matrix(c(1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
             1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
             1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
             0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
             1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
             1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
           ncol = 10, nrow = 6, byrow = TRUE)
The expected result should then be like the matrix below:
result = matrix( c(3, 0, 1, 0, 3, 0, 0, 0, 0, 0), ncol=10, nrow=1)
colnames(result) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
Where the column names are the lengths of consecutive 0's (followed by a 1) and the matrix entries the corresponding frequencies.
Note that I have a very large data matrix, and if possible I'd like to avoid loops. Thanks for any hints, comments and propositions.

Using base functions. The complication is that you are discarding trailing zeros that are not followed by a 1.
Explanation inline.
set.seed(13L)
numRows <- 10e4
numCols <- 10
m <- matrix(sample(c(0L, 1L), numRows*numCols, replace=TRUE),
            byrow=TRUE, ncol = numCols, nrow = numRows)
#add boundary conditions of all zeros and all ones
m <- rbind(rep(0L, numCols), rep(1L, numCols), m)
#head(m)
rStart <- Sys.time()
lens <- unlist(apply(m, 1, function(x) {
    #find the positions of the 1s while handling the boundary condition of all zeros
    idx <- which(x == 1)
    endidx <- if (length(idx) == 0) length(x) else max(idx)
    beginidx <- if (length(idx) == 0) 1 else min(idx)
    #tabulate the frequencies of runs of 0s
    runlen <- rle(x[beginidx:endidx])
    list(table(runlen$lengths[runlen$values == 0]))
}))
#tabulating results
res <- aggregate(lens, list(names(lens)), FUN=sum)
ans <- setNames(res$x[match(1:ncol(m), res$Group.1)], 1:ncol(m))
ans[is.na(ans)] <- 0
ans
#     1     2     3    4    5    6   7   8 9  10
#100108 43559 18593 7834 3177 1175 387 103 0 106
rEnd <- Sys.time()
print(paste0(round(rEnd - rStart, 2), attr(rEnd - rStart, "units")))
#[1] "27.67secs"
Do let me know the performance after running it on the large matrix.
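For comparison, here is a minimal `rle()`-based sketch under a stricter reading of the question: only zero runs actually followed by a 1 count, so everything after the last 1 is trimmed and all-zero rows contribute nothing (`zero_runs` is just an illustrative name, not from the answer above). On the question's example matrix it reproduces the expected result:

```r
# Trim everything past the last 1, then collect the lengths of zero runs.
zero_runs <- function(x) {
  last1 <- max(which(x == 1), 0)      # position of the last 1 (0 if none)
  if (last1 == 0) return(integer(0))  # all-zero row: no run is followed by a 1
  r <- rle(x[seq_len(last1)])
  r$lengths[r$values == 0]
}

m <- matrix(c(1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
              1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
              1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
              0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
              1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
              1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
            ncol = 10, nrow = 6, byrow = TRUE)

# Pool the run lengths from all rows and tabulate them by length 1..ncol(m)
tabulate(unlist(apply(m, 1, zero_runs)), nbins = ncol(m))
#[1] 3 0 1 0 3 0 0 0 0 0
```

Note this differs from the answer above on boundary rows: an all-zero row is ignored here, while the code above counts it as one run of length 10.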


Generate vector that groups (by name) columns in a new vector in R

I have a dataset like original below, with a numeric variable (NP) and binary variables (all the rest); my real dataset is much larger and includes many more numeric and dummy variables:
NP <- c(4,6,18,1,3,12,8)
iso_mode_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_mode_CHN <- c(0, 1, 1, 0, 0, 0, 0)
iso_mode_COL <- c(0, 0, 0, 1, 1, 0, 0)
iso_mode_mod_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_mode_mod_CHN <- c(0, 1, 1, 0, 0, 0, 0)
iso_mode_mod_COL <- c(0, 0, 0, 1, 1, 0, 0)
exp_sector_4 <- c(0, 1, 0, 0, 1, 0, 0)
exp_sector_5 <- c(1, 0, 1, 0, 0, 0, 0)
exp_sector_7 <- c(0, 0, 0, 1, 0, 1, 1)
original <- data.frame(NP, iso_mode_USA, iso_mode_CHN, iso_mode_COL, iso_mode_mod_USA, iso_mode_mod_CHN, iso_mode_mod_COL, exp_sector_4, exp_sector_5, exp_sector_7)
I want to have a vector that records the group of each column by the start of its name (e.g. NP forms one group, iso_mode_ forms another group, exp_sector_ forms another group, and so on...). The vector therefore looks like:
vector <- c("1", "2", "2", "2", "3", "3", "3", "4", "4", "4")
Any idea on how to do it in dplyr (for many more variables)?
Thank you.
You can use grepl to match the column names and which inside apply to get the group positions.
tt <- paste0("^", unique(sub("_[^_]+$", "_", names(original))), "([^_]+$|$)")
apply(sapply(tt, grepl, names(original)), 1, which)
# [1] 1 2 2 2 3 3 3 4 4 4
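If the suffix is always the last underscore-delimited token, a shorter alternative sketch is to strip it off and number the unique prefixes in order of appearance (shown here on the question's column names as a plain vector `nms`; with the real data you would use `names(original)`):

```r
nms <- c("NP",
         "iso_mode_USA", "iso_mode_CHN", "iso_mode_COL",
         "iso_mode_mod_USA", "iso_mode_mod_CHN", "iso_mode_mod_COL",
         "exp_sector_4", "exp_sector_5", "exp_sector_7")
# Strip the trailing "_<suffix>"; names without an underscore (like NP)
# are returned unchanged by sub().
prefix <- sub("_[^_]+$", "", nms)
# Number the unique prefixes in order of appearance
as.integer(factor(prefix, levels = unique(prefix)))
#[1] 1 2 2 2 3 3 3 4 4 4
```

Wrap the result in as.character() if you need the character vector shown in the question.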

How to create a confusion matrix using a function in R

I created the following data set:
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
The following code works, but I want to use a function to create a confusion matrix instead:
#create new data frame
new_data <- data.frame(actual, predicted)
new_data["class"] <- ifelse(new_data["actual"] == 0 & new_data["predicted"] == 0, "TN",
                     ifelse(new_data["actual"] == 0 & new_data["predicted"] == 1, "FP",
                     ifelse(new_data["actual"] == 1 & new_data["predicted"] == 0, "FN", "TP")))
(conf.val <- table(new_data["class"]))
What might be the code to do that?
If you want the same output format as the one you posted, consider this function:
confusion <- function(pred, real) {
    stopifnot(all(c(pred, real) %in% 0:1))
    table(matrix(c("TN", "FP", "FN", "TP"), 2L)[cbind(pred, real) + 1L])
}
Output
> confusion(predicted, actual)
FN FP TN TP 
 1  2  5  4 
The caret package offers a great collection of methods for machine learning:
library(caret)
actual <- as.factor(c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0))
predicted <- as.factor(c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0))
caret::confusionMatrix(data = predicted, actual, positive="1")
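For completeness, the same counts can be sketched with a plain base-R cross-tabulation (no packages needed); forcing the 0/1 factor levels keeps empty cells visible even when a class never occurs. The cells map to the labels used above: (0,0) = TN, (1,0) = FP, (0,1) = FN, (1,1) = TP.

```r
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
# Cross-tabulate predicted vs actual with both levels forced
table(predicted = factor(predicted, levels = 0:1),
      actual    = factor(actual,    levels = 0:1))
```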

Pair wise binary comparison - optimizing code in R

I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed-length binary string indicating which genes are present (1 for present, 0 for absent). My task is to compare the gene sequences for each pair of models, get a score of how similar they are, and compute a dissimilarity matrix.
In total there are 450 models (rows) in one file, and there are 250 files. I have working code, but it takes roughly 1.6 hours to do the whole thing for just one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
- Reads the file
- Converts the binary strings into a data frame with columns Gene, Model_1, Model_2, ..., Model_450
- Runs a nested for loop for the pair-wise comparison (only the top half of the matrix): it takes the two corresponding columns and adds them, then counts the positions where the sum is 2 (meaning the gene is present in both models)
- Writes the data to a file
- Creates the matrix later
Comparison code:
library(stringr)  # needed for str_count()

generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()
for (a in 1:length(generationFiles)) {
    fname = generationFiles[a]
    geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
    geneCount = str_count(geneData[1, 1], "[1|0]")
    geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
    #convert the string into a data frame
    for (i in 1:nrow(geneData)) {
        #remove the square brackets
        dataRow = substring(geneData[i, 1], 2, nchar(geneData[i, 1]) - 1)
        #remove white space
        dataRow = gsub(" ", "", dataRow, fixed = T)
        #split the string
        dataRow = strsplit(dataRow, ",")
        #convert to numeric
        dataRow = as.numeric(unlist(dataRow))
        colName = paste("M_", i, sep = "")
        geneDF <- cbind(geneDF, dataRow)
        colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
        dataRow <- NULL
    }
    summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
                            Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
    modelNames = paste0("M_", c(1:450))
    secondaryLevel = modelNames
    fileName = paste0("D://BellosData//GC_3//Summary//", substr(fname, 1, nchar(fname) - 4), "_Summary.txt")
    for (x in 1:449) {
        secondaryLevel = secondaryLevel[-1]
        for (y in 1:length(secondaryLevel)) {
            result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
            summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
                                                     Model2 = secondaryLevel[y],
                                                     Common = sum(result == 2),
                                                     Uncommon = sum(result == 1),
                                                     Absent = sum(result == 0)))
        }
    }
    write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
    geneDF <- NULL
    summaryDF <- NULL
    geneData <- NULL
}
Converting to matrix:
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are nightmarishly slow in R. Operations like rbind-ing one row at a time are also among the worst and slowest ideas in R programming.
Generate 450 rows with 20 elements of 0 or 1 in each row:
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate a list of the choose(450, 2) row-index pairs:
L = split(v <- t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want — in this case, the number of 1's at the same position for each pair of rows. To calculate a different metric, just write another function(x), where M[x[1], ] is the first row and M[x[2], ] is the second row.
O = lapply(L, function(x) sum(M[x[1],] & M[x[2],]))
The code takes ~4 seconds on a fairly slow 2.6 GHz Sandy Bridge.
Get a clean data.frame with your results, with three columns: row 1, row 2, and the metric between the two rows.
data.frame(row1 = sapply(L, `[`, 1),
           row2 = sapply(L, `[`, 2),
           similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
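Since the rows are 0/1 vectors, another sketch worth considering replaces the pair loop entirely with matrix products: tcrossprod(M)[i, j] counts the positions where rows i and j are both 1, which is exactly the question's "Common" count, and the other two counts follow from it. This is demonstrated on random data shaped like the question's, not on the real files:

```r
set.seed(42)
M <- do.call(rbind, replicate(450, sample(0:1, 20, replace = TRUE), simplify = FALSE))

common   <- tcrossprod(M)              # gene present in both models
absent   <- tcrossprod(1 - M)          # gene absent from both models
uncommon <- ncol(M) - common - absent  # gene present in exactly one model

# Keep only the upper triangle, i.e. each unordered pair once
idx <- which(upper.tri(common), arr.ind = TRUE)
summaryDF <- data.frame(Model1   = paste0("M_", idx[, "row"]),
                        Model2   = paste0("M_", idx[, "col"]),
                        Common   = common[idx],
                        Uncommon = uncommon[idx],
                        Absent   = absent[idx])
```

The three tcrossprod calls are BLAS-backed, so this should scale comfortably to 450 rows per file.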

Creating a vector with certain values at specific positions based on another vector

If I start with vector1, and test to see which items equal 1:
vector1 <- c(0, 1, 1, 1, 0, 1, 1, 1, 0, 1)
test <- which(vector1 == 1)
test now equals: 2, 3, 4, 6, 7, 8, 10
then, I want to randomly choose two of the items in test:
sample_vector <- sample(test, 2, replace = FALSE)
the above code generated a sample_vector: 6, 3
My question is how do I take sample_vector and turn it into:
vector2 <- c(0, 0, 1, 0, 0, 1, 0, 0, 0, 0)
I'm essentially looking to assign only the items in sample_vector to equal 1, with the remaining items from vector1 assigned 0 (i.e. so it looks like vector2). vector2 needs to have the same length as vector1 (10 items).
Thanks!
vector2 <- rep(0, length(vector1))
vector2[sample_vector] <- 1
set.seed(44)
vector1 <- c(0, 1, 1, 1, 0, 1, 1, 1, 0, 1)
test <- which(vector1 == 1)
sample_vector <- sample(test, 2, replace = FALSE)
sample_vector
#[1] 8 3
replace(tabulate(seq_along(vector1)) - 1, sample_vector, 1)
#[1] 0 0 1 0 0 0 0 1 0 0
Use this code:
vector2 <- rep(0, length(vector1))
vector2[sample_vector] <- 1

Searching maximum length and IDs of consecutive TRUEs in R

I am looking for a method (without a for-loop) to find the IDs and the length of the longest run of consecutive TRUEs in a vector. For example,
foo <- as.numeric(runif(100) > 0.5)
gives you 100 shuffled 0's and 1's. Now I am searching for the longest consecutive run of 1's and the corresponding IDs in this vector. For example,
foo2 <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1)
should lead to
max.dur = 8
max.ids = c(6, 13)
I try combinations of table, cumsum and which, but cannot find an appropriate way to do this.
Here's one way to do it:
foo2 <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1)
tmp <- rle(foo2) # calculates the lengths of runs of equal values
idx <- which.max(replace(tmp$lengths, !tmp$values, -1))
# index of maximum number of consecutive 1s
max.dur <- tmp$lengths[idx]
# [1] 8
max.ids <- c(0, cumsum(tmp$lengths))[c(idx, idx + 1)] + c(1, 0)
# [1] 6 13
You can use rle
foo <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1)
XX <- rle(foo)
max.dur <- max(XX$lengths[XX$values == 1]) # consider only runs of 1's
max.dur
## [1] 8
max.ids <- cumsum(XX$lengths)[XX$lengths == max.dur & XX$values == 1] - c(max.dur - 1, 0)
max.ids
## [1] 6 13
