Counting repeated 5-mers gene from 100 DNA sequence samples - r

I am beginner in R and trying to solve this but have been struggling for few days already. Please help a newbie out.
I extracted 100 samples each of length 1000 from a 100,000 DNA sequence. Then, I want to count "AATAA" appeared how many times in the each of the sample.
dog_100
# [1] "GGGTCCTTGAAAGAAGCACAGGGTGGGGGTGGGGGTGGGGGTGGGGGAAGGCAGAGAGGAGGAAACAGGTTTTTGTCCTCAGGGCGTTGCCAGTCTGAAGGAGGTGATGGGATAATTATTTATGAGAGTTCAGGAATGCCAGGCATGGATTAAATGCAAACTAATGGAAATGACACAGAACAATACATTACAC......................................"
#[2] "CCAGGCCAGAACTGAGGCCCTCAGGGCCCCCCAGAATTCCTCATTTGCAGGATAAAAATATACTCAGCTCTTCAATCTTGGTTCTTGCTACTGCACCATGTGCTTCCTGGACTCTGGGAGGCCAGGGGTTAAGTGGGAGTGTTTGAATAAGGGAAAGGATGAGCCCTTTCCCCACACTTTGCCCCAAATAAC......................................"
#[3]
#........
# [4]
#........
# [100]
#........
I wrote a function to identify and count the "AATAA".
R
library(stringr)
cal_AATAA <- function(DNA){
sam_pro <- numeric(length(DNA))
k <- 5
sam_code <- "AATAA"
for(i in 1:(length(DNA))){
Num <- str_length(DNA[i])
for(j in 1:(Num - k +1)){
if ((str_sub(DNA[i], j, j+k-1)) == sam_code){
sam_pro[i] <- sam_pro[i] + 1
}
else {
sam_pro[i] <- sam_pro[i]
}
}
return (sam_pro)
}
}
sample_100 <- cal_AATAA(dog_100)
What I got after running the function is
> sample_100
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[91] 0 0 0 0 0 0 0 0 0 0
Tried to debug my code but don't know where went wrong. Appreciate any tips or guidance.

R has a built in function called gregexpr which can be used for counting patterns in a string. It outputs a list, so we have to use sapply to loop through the elements of the output. For each element, we count the number of values that are greater than zero because a value of -1 indicates that any match was not found. Look at the output of gregexpr("ap", c("appleap", "orange")) as an example.
dna = c("AGTACGTGCATAGC", "GTAGCTAGCTAGCAT")
sam = "AGC"
sapply(gregexpr(sam, dna), function(x) sum(x > 0))
#[1] 1 3

Related

Genetic Algorithm in R: Specify number of 1s in binary chromosomes

I am using the rbga function, but my question still stands for other genetic algorithm implementations in R. Is there a way to specify the number of 1s in binary chromosomes?
I have the following example provided by the library documentation.
data(iris)
library(MASS)
X <- as.data.frame(cbind(scale(iris[,1:4]), matrix(rnorm(36*150), 150, 36)))
Y <- iris[,5]
iris.evaluate <- function(indices) {
print("Chromosome")
print(indices)
print("================================")
result = 1
if (sum(indices) > 2) {
huhn <- lda(X[,indices==1], Y, CV=TRUE)$posterior
result = sum(Y != dimnames(huhn)[[2]][apply(huhn, 1,
function(x)
which(x == max(x)))]) / length(Y)
}
result
}
monitor <- function(obj) {
minEval = min(obj$evaluations);
plot(obj, type="hist");
}
woppa <- rbga.bin(size=40, mutationChance=0.05, zeroToOneRatio=10,
evalFunc=iris.evaluate, showSettings=TRUE, verbose=TRUE)
Here are some of the chromosomes.
"Chromosome"
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
"================================"
The 1s (i.e., the chosen characteristics) are 5, 8, 5 and 4 respectively.
I am trying to follow the technique specified in a paper and they claim that they apply a genetic algorithm and in the end they pick a specific number of characteristics.
Is it possible to specify in a genetic algorithm the number of characteristics that I want my solution(s)/chromosome(s) to have?
Could this be done on the final solution/chromosome and if yes how?

Identify a vector within a list with at least n ocurrences of given value in R

I would like to write a function which would allow to filter the input data.
My input data is a list object containing named numeric vectors (minimal reproducible example below - dummy list).
vec1 <- c(rep(0, 10), rep(1, 4), rep(0,5), rep(-1,5))
vec2 <- c(rep(-1, 7), rep(0,99), rep(1, 6))
vec3 <- c(rep(1,2), rep(-1,2), rep(0,10), rep(-1,4), rep(0,8))
vec4 <- rep(0, 100)
dummy_list <- list(vec1, vec2, vec3, vec4)
names(dummy_list) <- c("first", "second", "third", "fourth")
I want my function to test whether in this vector any non-zero value occurs at least 5 times in a row.
My desired output should be a list containing only first two vectors of the initial dummy_list.
Below is one of my multiple attempts - I would like it to be as much similar to this as possible (except that the solution should work).
dummy_list <- Filter(function(x) which(rle(x$values !=0) x$lengths>5, dummy_list)
Note that we check whether any of the the rle length is greater or equal to 5.
Filter(function(x)any(with(rle(x), lengths[values!=0]>=5)), dummy_list)
$first
[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 -1 -1 -1 -1 -1
$second
[1] -1 -1 -1 -1 -1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[63] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[94] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

Find the vertical distance between the top most '1' and the bottom most '1' in a matrix in R

I have successfully imported a csv file into R. It is a 6 by 6 matrix.
0 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
0 0 0 0 0 0
'1' exists in the second row and also exists in the second last row. So the distance between them vertically is 4.
Would I use the dist function to calculate this? And if so how would I implement it to give me the value of 4?
diff(range(which(rowSums(mat) > 0)))
# [1] 3
Explanation: since the data is binary, we can look at the distance between rows where the row sum is >0.
Adapting Sathish's nicely share data, this works:
mat <- matrix(as.integer(unlist(strsplit('0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0', " "))),
nrow = 6, ncol = 6, byrow = TRUE)

Determining if each vector element not exceeds all previous elements

I need to compare element i with all previous elements i-1,i-2,..., and if i < i-1, i-2, ... return 1, otherwise return 0.
data <- c(10.3,14.3,7.7,15.8,14.4,16.7,15.3,20.2,17.1,7.7,15.3,16.3,19.9,14.4,18.7,20.7)
The result of comparing should be the following.
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I tried to make it with
as.integer(cummin(data)==data)
and i get
1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
The first elements easy to fix. But what to do with another 1 on 10 position.
A possible approach:
v <- rank(data,ties='first')
out <- as.integer(cummin(v)==v)
# [1] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Taking care of the first element:
out[1] <- 0
try this:
sapply(1 : length(data), FUN = function(i) all(data[i] < data[1 : (i - 1)]) * 1)
#[1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

changing values in vector given a location and condition with R

i'm having trouble manipulating vectors in R. i have a vector that looks like this:
stack <- append(append(rep(0,8),c(1,0,0,0,0,1)),rep(0,6))
[1] 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
my overall goal is to the manipulate the vector as such:
*when there is a 1, make the next three values in the vector 1.
*change the original 1 to 0.
so ultimately the vector would look like:
[1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
the second part I can do by:
replace(stack,which(stack == 1),0)
but I can't figure out how to do the first one efficiently. any help would be greatly appreciated.
You can use filter here :
c(filter(sx,c(0,0,0,0,1,1,1),circular=TRUE))
## [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
Here's a possible base R option
temp <- which(stack == 1)
stack[as.vector(mapply(`:`, temp, temp + 3))] <- c(0, rep(1, 3))
stack
# [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
I would go with regular expressions
stack <- paste0(stack, collapse="")
stack <- gsub("1.{3}", "0111", stack)
stack <- strsplit(stack, "+")

Resources